unpaper 1.0 - post-processing scanned and photocopied book pages

Licensed under the GNU General Public License (GPL). This software comes with no warranty.

Overview
Usage
Options
Download
Examples
Sources
Related Links

Overview

unpaper is a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies. Its main purpose is to make scanned book pages more readable on screen after conversion to PDF. unpaper may also be useful for improving the quality of scanned pages before optical character recognition (OCR).

unpaper tries to clean scanned images by removing dark edges that appeared through scanning or copying in areas outside the actual page content (e.g. dark areas between the left-hand side and the right-hand side of a double-sided book-page scan). The program also tries to detect misaligned centering and rotation of pages, and automatically straightens each page by rotating it to the correct angle. This is called "deskewing".

Note that the automatic processing will sometimes fail. It is always a good idea to check the results of unpaper manually and adjust the parameter settings to suit the input. Each processing step can also be disabled individually for each sheet.

Input and output files can be in either .pbm or .pgm format, as also used by the Linux scanning tools scanimage and scanadf. Conversion to PDF can be achieved e.g. with the Linux tools ppm2tiff, tiffcp and tiff2pdf.

Usage

Usage: unpaper [options] <input-file> <output-file>

Filenames may contain a formatting placeholder starting with '%' to insert a
page counter for multi-page processing. E.g.: 'scan%03d.pbm' to process files
scan001.pbm, scan002.pbm, scan003.pbm etc.
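
The placeholder in the example above looks like a printf-style integer format. As a minimal sketch, assuming the placeholder behaves like the shell's printf '%03d' (list_pages is a hypothetical helper for illustration, not part of unpaper), the file names a pattern stands for can be previewed like this:

```shell
# Print the file names that a pattern such as 'scan%03d.pbm' stands for.
# list_pages is a hypothetical helper for illustration, not part of unpaper.
list_pages() {
    pattern=$1
    first=$2
    last=$3
    i=$first
    while [ "$i" -le "$last" ]; do
        # The pattern is used as a printf format; %03d pads to three digits.
        printf "$pattern\n" "$i"
        i=$((i + 1))
    done
}

list_pages 'scan%03d.pbm' 1 3
```

Running the last line prints scan001.pbm, scan002.pbm and scan003.pbm, one per line, which matches the file names listed above.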

Options

-l --layout single|double            Set default layout options for a sheet:
                                     'single' - one page per sheet, oriented
                                                vertically without rotation
                                     'double' - two pages per sheet, rotated
                                                anti-clockwise (i.e. the top-
                                                sides of the pages are heading
                                                leftwards, and the pages are
                                                placed right-page above left-
                                                page on the unrotated sheet)
                                     Using this option automatically adjusts the
                                     --mask-point and --pre/post-rotation
                                     options.
-s --start-sheet <sheet>             Number of first sheet to process in multi-
                                     sheet mode. (default: 1)
-e --end-sheet <sheet>               Number of last sheet to process in multi-
                                     sheet mode. -1 indicates processing until
                                     no more input file with the corresponding
                                     page number is available. (default: -1)
-# --sheet                           Optionally specifies which sheets to
     <sheet>{,<sheet>[-<sheet>]}     process in the range between start-sheet
                                     and end-sheet.
-x --exclude                         Excludes sheets from processing in the
     <sheet>{,<sheet>[-<sheet>]}     range between start-sheet and end-sheet.
--pre-rotate -90|90                  Rotates the whole image clockwise (90) or
                                     anti-clockwise (-90) before any other
                                     processing.
--post-rotate -90|90                 Rotates the whole image clockwise (90) or
                                     anti-clockwise (-90) after any other
                                     processing.
-M --pre-mirror                      Mirror the image, after possible pre-
     [v[ertical]][,][h[orizontal]]   rotation. Either 'v' (for vertical
                                     mirroring), 'h' (for horizontal mirroring)
                                     or 'v,h' (for both) can be specified.
--post-mirror                        Mirror the image, after any other
  [v[ertical]][,][h[orizontal]]      processing except possible post-
                                     rotation.
--pre-wipe                           Manually wipe out an area before further
  <left>,<top>,<right>,<bottom>      processing. Any pixel in a wiped area
                                     will be set to white. Multiple areas to
                                     be wiped may be specified.
--post-wipe                          Manually wipe out an area after
  <left>,<top>,<right>,<bottom>      processing. Any pixel in a wiped area
                                     will be set to white. Multiple areas to
                                     be wiped may be specified.
--pre-border                         Clear the border-area of the sheet before
  <left>,<top>,<right>,<bottom>      further processing. Any pixel inside the
                                     border will be set to white.
--post-border                        Clear the border-area after processing.
  <left>,<top>,<right>,<bottom>      Any pixel inside the border will be set
                                     to white.
--pre-mask <x1>,<y1>,<x2>,<y2>       Specify masks to apply before any other
                                     processing. Any pixel outside a mask
                                     will be considered a blank (white) pixel,
                                     unless another mask includes this pixel.
                                     Only pixels inside a mask will remain.
                                     Multiple masks may be specified. No
                                     deskewing will be applied to the masks
                                     specified by --pre-mask.
-bn --blackfilter-scan-direction     Directions in which to search for solidly
     [v[ertical]][,][h[orizontal]]   black areas. Either 'v' (for vertical
                                     scanning), 'h' (for horizontal scanning)
                                     or 'v,h' (for both) can be specified.
                                     (default: 'v,h')
-bs --blackfilter-scan-size          Width of virtual bar used for black area
      <size>|<h-size>,<v-size>       detection. Two values may be specified
                                     to individually set horizontal and vertical
                                     size. (default: 20,20)
-bd --blackfilter-scan-depth         Size of virtual bar used for black area
      <depth>|<h-depth>,<v-depth>    detection. (default: 500,500)
-bp --blackfilter-scan-step          Steps to move virtual bar for black area
      <step>|<h-step>,<v-step>       detection. (default: 5,5)
-bt --blackfilter-scan-threshold <t> Ratio of dark pixels above which a black
                                     area gets detected. (default: 0.95).
-bi --blackfilter-intensity <i>      Intensity with which to delete black areas.
                                     Larger values will leave less noise-pixels
                                     around former black areas, but may delete
                                     page content. (default: 20)
-ni --noisefilter-intensity <n>      Intensity with which to delete individual
                                     pixels or tiny clusters of pixels. Any
                                     cluster which only contains n dark pixels
                                     together will be deleted. (default: 4)
-ls --blurfilter-size                Size of blurfilter area to search for
      <size>|<h-size>,<v-size>       'lonely' clusters of pixels.
                                     (default: 100,100)
-lt --blurfilter-step                Size of 'blurring' steps in each
      <step>|<h-step>,<v-step>       direction. (default: 50,50)
-li --blurfilter-intensity <ratio>   Relative intensity with which to delete
                                     tiny clusters of pixels. Any blurred area
                                     which contains at most the ratio of dark
                                     pixels will be cleared. (default: 0.01)
-gs --grayfilter-size                Size of grayfilter mask to search for
      <size>|<h-size>,<v-size>       'gray-only' areas of pixels.
                                     (default: 50,50)
-gp --grayfilter-step                Size of steps moving the grayfilter mask
      <step>|<h-step>,<v-step>       in each direction. (default: 20,20)
-gt --grayfilter-threshold <ratio>   Relative intensity of grayness which is
                                     accepted before clearing the grayfilter
                                     mask in cases where no black pixel is
                                     found in the mask. (default: 0.5)
-p --mask-point <x>,<y>              Manually set starting point for masking.
                                     Multiple --mask-point parameters may be
                                     specified to process multiple pages on one
                                     sheet. Cannot be used in conjunction with
                                     --pages. (default: middle of image)
-m --mask <x1>,<y1>,<x2>,<y2>        Manually add a mask, in addition to masks
                                     automatically searched around the
                                     --mask-point coordinates (unless
                                     --no-mask-scan is specified). Any pixel
                                     outside a mask will be considered a blank
                                     (white) pixel, unless another mask covers
                                     this pixel.
-mn --mask-scan-direction            Directions in which to search for inner mask
     [v[ertical]][,][h[orizontal]]   border. Either 'v' (for vertical
                                     scanning), 'h' (for horizontal scanning)
                                     or 'v,h' (for both) can be specified.
                                     (default: 'h' ('v' may cut paragraphs on
                                     single-page sheets))
-ms --mask-scan-size <size>|<h,v>    Width of virtual bar used for mask
                                     detection. Two values may be specified
                                     to individually set horizontal and vertical
                                     size. (default: 50,50)
-md --mask-scan-depth <dep>|<h,v>    Height of virtual bar used for mask
                                     detection. (default: -1,-1, using the whole
                                     width or height of the sheet)
-mp --mask-scan-step <step>|<h,v>    Steps to move virtual bar for mask
                                     detection. (default: 10,10)
-mt --mask-scan-threshold <t>|<h,v>  Ratio of dark pixels below which an edge
                                     gets detected, relative to max. blackness
                                     when counting from the starting coordinate
                                     heading towards one edge. (default: 0.1)
-mm --mask-scan-minimum <w>,<h>      Set minimum allowed size of an auto-
                                     detected mask. Masks detected below this
                                     size will be ignored and set to the size
                                     specified by mask-scan-maximum. (default:
                                     100,100)
-mM --mask-scan-maximum <w>,<h>      Set maximum allowed size of an auto-
                                     detected mask. Masks detected above this
                                     size will be shrunk to the maximum value,
                                     each direction individually. (default:
                                     sheet size, or page size derived from
                                     --layout option)
-mc --mask-color <color>             Set color / gray-scale value to overwrite
                                     pixels which are not covered by any
                                     detected mask. This may be useful for
                                     testing in order to visualize the effect
                                     of masking. (value: 0..255, default: 255)
-dn --deskew-scan-direction          Directions in which to scan for rotation.
     [v[ertical]][,][h[orizontal]]   Either 'h' (for horizontal scanning,
                                     starting at the left and right edges of a
                                     mask) or 'v' (for vertical scanning,
                                     starting at the top and bottom), or 'v,h'
                                     (for both) can be specified.
                                     (default: 'h' ('v' may be confused by
                                     headlines or footnotes))
-ds --deskew-scan-size <pixels>      Size of virtual line for rotation
                                     detection. (default: 1500)
-dd --deskew-scan-depth <ratio>      Amount of dark pixels to accumulate until
                                     scan is finished, relative to scan-bar
                                     size. (default: 0.66)
-dr --deskew-scan-range <degrees>    Range in which to search for rotation,
                                     from -degrees to +degrees rotation.
                                     (default: 2.0)
-dp --deskew-scan-step <degrees>     Steps between single rotation-range
                                     detections.
                                     Lower numbers lead to better results but
                                     slow down processing. (default: 0.1)
-dv --deskew-scan-deviation <dev>    Maximum deviation allowed between results
                                     from all detected edges to perform auto-
                                     rotating, else ignore. (default: 1.0)
-W --wipe                            Manually wipe out an area. Any pixel in
     <left>,<top>,<right>,<bottom>   a wiped area will be set to white.
                                     Multiple --wipe areas may be specified.
                                     This is applied after deskewing and
                                     before automatic border-scan.
-mw --middle-wipe                    If --layout is set to 'double', this
      <size>|<left>,<right>          may specify the size of a middle area to
                                     wipe out between the two pages on the
                                     sheet. This may be useful if the
                                     blackfilter fails to remove some black
                                     areas (which e.g. occur by photo-copying
                                     in the middle between two pages).
-B --border                          Manually add a border. Any pixel in the
     <left>,<top>,<right>,<bottom>   border area will be set to white. This is
                                     applied after deskewing and before
                                     automatic border-scan.
-Bn --border-scan-direction          Directions in which to search for outer
     [v[ertical]][,][h[orizontal]]   border. Either 'v' (for vertical
                                     scanning), 'h' (for horizontal scanning)
                                     or 'v,h' (for both) can be specified.
                                     (default: 'v')
-Bs --border-scan-size <size>|<h,v>  Width of virtual bar used for border
                                     detection. Two values may be specified
                                     to individually set horizontal and vertical
                                     size. (default: 5,5)
-Bp --border-scan-step <step>|<h,v>  Steps to move virtual bar for border
                                     detection. (default: 5,5)
-Bt --border-scan-threshold <t>      Absolute number of dark pixels covered by
                                     the border-scan mask above which a border
                                     is detected. (default: 5)
-w --white-threshold <threshold>     Brightness ratio above which a pixel is
                                     considered white. This is used when
                                     converting to black-and-white mode
                                     (default: 0.9)
-b --black-threshold <threshold>     Brightness ratio below which a pixel is
                                     considered black (non-gray). This is used
                                     by the gray-filter. (default: 0.5)
--no-blackfilter                     Disables black area scan. Individual sheet
  <sheet>{,<sheet>[-<sheet>]}        indices can be specified.
--no-noisefilter                     Disables noisefilter. Individual sheet
  <sheet>{,<sheet>[-<sheet>]}        indices can be specified.
--no-blurfilter                      Disables blurfilter. Individual sheet
  <sheet>{,<sheet>[-<sheet>]}        indices can be specified.
--no-mask-scan                       Disables auto-masking around the areas
  <sheet>{,<sheet>[-<sheet>]}        searched beginning from points specified
                                     by --mask-point or auto-specified by
                                     --layout. Masks explicitly set by --mask
                                     will still have effect.
--no-mask-center                     Disables auto-centering of each mask.
  <sheet>{,<sheet>[-<sheet>]}        Auto-centering is performed by default
                                     if the --layout option has been set.
--no-deskew                          Disables auto-rotation to a straight
  <sheet>{,<sheet>[-<sheet>]}        alignment for individual sheets.
--no-wipe                            Disables explicitly set wipe-areas.
  <sheet>{,<sheet>[-<sheet>]}        This means the effect of parameter
                                     --wipe is disabled individually per
                                     sheet.
--no-border                          Disables explicitly set borders.
  <sheet>{,<sheet>[-<sheet>]}        This means the effect of parameter
                                     --border is disabled individually per
                                     sheet.
--no-border-scan                     Disables automatic border-scanning at the
  <sheet>{,<sheet>[-<sheet>]}        edges of the sheet after most other
                                     processing has been done.
-n --no-processing                   Do not perform any processing on a sheet
     <sheet>{,<sheet>[-<sheet>]}     except pre/post rotating and mirroring,
                                     and file-type conversions on saving.
                                     This option has the same effect as setting
                                     --no-blackfilter, --no-noisefilter,
                                     --no-blurfilter, --no-grayfilter,
                                     --no-mask-scan, --no-deskew, --no-wipe,
                                     --no-mask-center, --no-border-scan and
                                     --no-border simultaneously.
--no-qpixels                         Disable qpixel-mode for deskewing
                                     (internally rotate a 4x bigger image and
                                     reshrink afterwards).
--no-multi-pages                     Disable multi-page processing even if the
                                     input filename contains a '%' (usually
                                     indicating the start of a placeholder for
                                     the page counter).
-t --type pbm|pgm                    Output file type. (default: as input file)
-T --test-only                       Do not write any output. May be useful in
                                     combination with --verbose to get informa-
                                     tion about the input.
-q --quiet                           Quiet mode, no output at all.
-v --verbose                         Verbose output, more informational messages.
-vv                                  Even more verbose output, show parameter
                                     settings before processing.
-V --version                         Output version and build information.
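
Several options above (--sheet, --exclude and the --no-... switches) take a sheet list of the form <sheet>{,<sheet>[-<sheet>]}, i.e. comma-separated sheet numbers where an entry may also be a range written as first-last. The following sketch shows one way to read such a list; expand_sheets is a hypothetical helper written for illustration and is not unpaper's own parser.

```shell
# Hypothetical helper (not part of unpaper): expand a sheet list such as
# "1,5-7,20" into individual sheet numbers, one per line.
expand_sheets() {
    local spec=$1 part first last i
    local IFS=,
    # With IFS set to a comma, the unquoted $spec splits into its entries.
    for part in $spec; do
        case $part in
            *-*)
                # A range entry: everything before and after the dash.
                first=${part%-*}
                last=${part#*-}
                i=$first
                while [ "$i" -le "$last" ]; do
                    echo "$i"
                    i=$((i + 1))
                done
                ;;
            *)
                # A single sheet number.
                echo "$part"
                ;;
        esac
    done
}

expand_sheets 1,5-7,20
```

The call on the last line prints the sheet numbers 1, 5, 6, 7 and 20, one per line. Read this way, a list such as 100-110,200 selects twelve sheets.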

Download

unpaper is available for download at http://download.berlios.de/unpaper/unpaper-1_0.tgz.

You may also want to browse the source-code online in the CVS archive of the project development site.

Examples

A typical processing sequence would be:

# Scan multiple sheets of paper to .ppm-files (for scanners without automatic
# document feeder, use any scan software to manually scan sheets):
scanadf -o sheet%03d.ppm

# Convert .ppm-files to gray-scale .pgm-files (sheet001.ppm -> sheet001.pgm):
for i in *.ppm; do ppmtopgm "$i" > "${i%.ppm}.pgm"; done

# Run unpaper, performing all auto-corrections on all sheets except the
# title sheet 1, and without auto-detection of masks (including deskewing and
# centering) on sheets 100-110 and 200:
unpaper -v --layout double --exclude 1 --no-mask-scan 100-110,200 sheet%03d.pgm unpaper%03d.pgm

# Convert generated .pgm-files to individual .tiff-files:
for i in unpaper*.pgm; do ppm2tiff "$i" "${i%.pgm}.tiff"; done

# Combine individual .tiff-files to one multi-page-tiff:
tiffcp unpaper*.tiff all.tiff

# Create PDF-document from multi-page-tiff:
tiff2pdf -z -o Document.pdf all.tiff

The source sheets need not be scanned directly from paper; they can also originate from a previously created PDF document or other files. This way, unpaper can be used to 'clean' existing documents. There are several tools for converting other file formats to .pgm/.pbm files for processing with unpaper.
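
When files come from other converters, it can help to check what they actually contain before handing them to unpaper. Netpbm files start with a two-character magic number: P1 or P4 for .pbm, P2 or P5 for .pgm, and P3 or P6 for .ppm. The sketch below reads that magic number; pnm_type is a hypothetical helper for illustration, not part of unpaper, and it makes no claim about which of these variants unpaper itself accepts.

```shell
# Hypothetical helper (not part of unpaper): report the Netpbm type of a file
# based on the two-character magic number at its start.
pnm_type() {
    case $(head -c 2 "$1") in
        P1|P4) echo pbm ;;      # bitmap (plain or raw)
        P2|P5) echo pgm ;;      # graymap (plain or raw)
        P3|P6) echo ppm ;;      # pixmap (plain or raw), needs ppmtopgm first
        *)     echo unknown ;;
    esac
}
```

For example, pnm_type sheet001.pgm would print pgm for a gray-scale file, while a result of ppm indicates that the file still needs to be converted with ppmtopgm.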

Related Links

The SANE project http://www.sane-project.org/.


Written by Jens Gulden 2005.
Modifications under the GPL are welcome.
