Process Files

The process files step converts various document formats into machine-readable text. Most flows begin with this step, but you can place the step anywhere in a flow. If the preceding step outputs .ibdoc files, the OCR operation reverts to the original source files.

You can define processing configuration options using an existing digitization profile, connected as a module in the process files step, or you can specify processing options directly in the step.

Tip

If you’re developing a solution that includes digitizing documents with non-default options, reuse the digitization profile from the Documents stage of your Solution Builder project. Reusing this profile ensures that production documents processed by your flow are optimally digitized.

Parameters and settings

Use these parameters and settings to control how the process files step operates.

Use Reader module

If you want to use a digitization profile that you developed in Solution Builder or Reader, select True and then select the digitization profile in the Reader Module field.

To use default digitization options, or to specify configuration within the process files step, select False.

Processing functions

Processing function indicates the document type for input files. To automatically process documents based on file extension, select Auto-extract. Otherwise, select the option that matches your input files.

  • PDF documents (.pdf) with machine-readable text

  • Images/Scanned documents (.pdf, .png, .jpeg, .tiff)

  • Microsoft Excel documents (.xlsx)

  • Microsoft Word documents (.docx)

  • Microsoft Rich Text Format documents (.rtf)

  • Microsoft PowerPoint documents (.pptx)

  • Email messages (.eml, .msg)

  • Web Pages (.html, .mht)

  • Text-based files (.csv, .txt)

For more details, see Supported file types and Supported settings by file type.

OCR model type

OCR model type specifies the processing engine to use for digitization. The Default (Microsoft OCR) model is suitable for most use cases, but you can select a different model depending on your circumstances.

  • Abbyy – Deprecated and scheduled for removal in 24.04. Use Microsoft Read OCR instead.

  • Google Vision (Cloud) – Appropriate for lower-quality scanned input documents, including third- and fourth-generation documents with fuzzy images, shadows, and folds.

  • Microsoft Read OCR – Appropriate for high-accuracy extraction, and handwritten or foreign language documents. This engine returns a character-based OCR confidence score that can be used in Refiner.

  • Tesseract – Appropriate for high-quality scanned input documents.

Scripts directory

Scripts directories contain custom Python scripts intended to be run as part of the process files step. Most commonly, scripts are used to register custom image filters. Each .py file must contain a custom registration function. For detailed information, see Custom image filters.

Layout scope

Layout scope defines how layout analysis is performed on each document.

  • Per page – Performs analysis once per page. This option is appropriate for documents with relatively self-contained pages, and it works well with a variety of layouts, including mixed vertical and horizontal orientation, different page sizes, and formatting that varies per page.

  • Per document – Performs analysis on the entire document. This option can be used to extract information from tables that span multiple pages of text because the output alignment preserves layout across multiple pages.

Layout algorithm

Layout algorithm defines how text is arranged in the output document.

  • Spatial – Arranges output text from top to bottom, left to right. This option is appropriate for documents such as letters, forms, news articles, and invoices. The default, V2.0, is suitable for most standard documents. For documents with a mix of vertical and horizontal text, use V3.0.

  • By Paragraph – Arranges output text according to inferred paragraph flow. This option is best for multi-column documents.

Page range

Page range defines which pages in a multipage document to digitize, for example, 1 or 2-10. Wildcards aren’t supported. If left blank, all pages are digitized.
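The page range format can be illustrated with a small hypothetical parser (not part of the product; it only demonstrates the format described above):

```python
def parse_page_range(page_range, total_pages):
    # Blank means digitize all pages.
    if not page_range:
        return list(range(1, total_pages + 1))
    # "2-10" means an inclusive range of pages.
    if '-' in page_range:
        start, end = page_range.split('-')
        return list(range(int(start), int(end) + 1))
    # "1" means a single page.
    return [int(page_range)]
```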

Encryption config

If your input documents are password-protected PDFs, specify their passwords under the passwords key of the runtime_config JSON input object.

The runtime_config is a JSON object of type Dict[Text, Any].

To provide passwords in the runtime configuration:

  1. Create a dictionary with keys that specify filenames of encrypted PDFs and values that specify the corresponding passwords.

  2. Provide the dictionary as a value to the key passwords in the runtime_config JSON file.

  3. Define the instabase and pdf namespaces.

For example:

{
  "instabase": {
    "pdf": {
      "passwords": {
        "input_file1.pdf": "password1",
        "input_file2.pdf": "password2"
      }
    }
  }
}
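The three steps above can be sketched in Python, building the same object as the JSON example (the filenames and passwords are placeholders):

```python
import json

# Step 1: map filenames of encrypted PDFs to their passwords.
passwords = {
    'input_file1.pdf': 'password1',
    'input_file2.pdf': 'password2',
}

# Steps 2 and 3: provide the dictionary as the value of the passwords key,
# nested inside the instabase and pdf namespaces.
runtime_config = {'instabase': {'pdf': {'passwords': passwords}}}

print(json.dumps(runtime_config, indent=2))
```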

OCR config

OCR configuration specifies general processing, pre-processing, and OCR processing options for digitization.

General processing

Tip

To configure options in JSON format, select Show advanced view.

  • Force Image OCR (force_image_ocr) – Specifies whether to treat the document as an image. Must be true when using visual extraction functions. If you set force_image_ocr to true, you can’t set extract_all_pdf_layers to true. Values: false (default), true.

  • Write Converted Image (write_converted_image) – Specifies whether to save per-page image files to disk. Values: true (default), false.

  • Write Thumbnail (write_thumbnail) – Specifies whether to generate thumbnail images, which can speed page loading while annotating documents. Values: true (default), false.

  • Write Model Training Image (write_model_training_image) – Specifies whether to save a grayscale, resized version of the original page images to disk in a new model_training_images directory. Images are scaled down to fit in a 1024 x 1024 box, making them more suitable for training machine learning models. Values: false (default), true.

  • Extract All PDF Layers (extract_all_pdf_layers) – Specifies whether to extract text elements from the machine-readable PDF page and any text within the image layer of the same PDF page. Not supported with the By Paragraph layout algorithm. Extracting all layers guarantees the highest fidelity text results, but it’s resource intensive because it runs OCR on all pages. If you set force_image_ocr to true, you can’t set extract_all_pdf_layers to true. Values: false (default), true.

  • Produce Metadata List (produce_metadata_list) – Specifies whether page layouts and metadata are set within .ibdoc files. Values: true (default), false.

  • Produce Word Metadata (produce_word_metadata) – Specifies whether to include word confidence and position information in the .ibmsg file. Word metadata is required for confidence calculations and for overlaying words onto the document in the Review OCR app. Values: true (default), false.

  • Remove Space Wordpolys from IBDOC (remove_space_wordpolys) – Specifies whether to remove empty spaces and words with no text from .ibdoc files. Values: true (default), false.

  • Cache PDF Results (cache_pdf_results) – Specifies whether to cache results for PDF files. Values: false (default), true.

  • Output Formats (output_formats) – Specifies the file format for output OCR text files. Values: .csv, .txt.

  • Repair PDFs (repair_pdfs) – Specifies whether to rewrite a PDF before processing to remove possible PDF corruption. Values: false (default), true.

  • Skip Text Extraction (skip_text_extraction) – Specifies whether to skip text extraction, which reduces runtime. Useful for flows that focus on entity detection. Values: false (default), true.

  • Force Column Width CSV to PDF (force_column_width_csv_to_pdf) – Specifies whether to dynamically size columns to prevent truncation in PDFs generated from CSV files. Values: true (default), false.
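In the advanced view, each setting corresponds to a JSON key. A minimal fragment combining a few of the keys above might look like this (the flat key layout is a sketch; consult the advanced view for the exact surrounding structure):

```json
{
  "force_image_ocr": false,
  "write_thumbnail": true,
  "repair_pdfs": true
}
```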

Preprocessing

  • Remove Boxes (remove_boxes) – Specifies whether to attempt to remove boxes from the document before processing text, which can sometimes improve OCR. Values: false (default), true.

  • Remove Boxes over Height % (remove_boxes_over_height_percent) – Specifies the minimum height, as a percentage of the page, of vertical lines to remove when performing box removal. Value: float; default: 0.2 (20 percent).

  • Remove Boxes over Width % (remove_boxes_over_width_percent) – Specifies the minimum width, as a percentage of the page, of horizontal lines to remove when performing box removal. Value: float; default: 0.2 (20 percent).

  • Correct Color Inversion (correct_inversion) – Specifies whether to correct color-inverted images. Values: false (default), true.

  • Detect Blurry Files (detect_blurry_files) – Specifies whether to detect blurry input files and return their blur factors. Values: false (default), true.

  • Image Filters (image_filters) – Specifies any built-in or custom image filters to run and the arguments to pass in. Value: a JSON dictionary of filters to run.

OCR processing

Note

Available OCR options are displayed based on the selected model type.

  • OCR Timeout (secs) (ocr_timeout) – Specifies the duration in seconds to wait for a response from the OCR service before timing out. Value: integer; default: 300.

  • Correct Resolution (correct_resolution) – Specifies whether to attempt to resize the image for OCR processing. Usually, this method is inferior to correct_resolution_auto. Values: false (default), true.

  • Auto-Correct Resolution (correct_resolution_auto) – Specifies whether to attempt to automatically change the image resolution for optimal OCR processing. Usually, this method is preferable to correct_resolution. Values: false (default), true.

  • Correct Orientation (correct_orientation) – Specifies whether to attempt to correct page rotations of 90, 180, and 270 degrees. Values: false (default), true.

  • Page Dewarp (dewarp_page) – Specifies whether to attempt to correct skew and warp in the image. Values: false (default), true.

  • Reorient Words (reorient_words) – Specifies whether to attempt to transform the coordinates of words so that the formatted text output is correct under rotation. Values: false (default), true.

  • Languages (languages) – Specifies which language models the OCR uses, which helps OCR efficiency and accuracy. Value: a list of language models to use when running OCR. See Supported language codes. Default: en, the English language model.

  • Fonts (fonts) – For the deprecated Abbyy OCR model only, specifies which text types the OCR reads. By default, text types are automatically selected. Available fonts are Normal, Typewriter, Matrix, Index, OCR-A, OCR-B, MICR-E13B, MICR-CMC7, Gothic, and Receipt. Value: a comma-separated list of strings specifying the text types.

  • Detect Barcodes (detect_barcodes) – Specifies whether to attempt to extract barcode information from the document. Values: false (default), true.

  • Find Lines (find_lines) – Specifies whether to detect lines in the documents. Values: false (default), true.

  • Model-Specific Settings (model_specific_settings) – Specifies what model-specific settings to use, if any. See Model-specific settings for details. Values: none (default); hq_v1, settings for the Abbyy OCR model (deprecated); lq_v1, settings for the Google Vision (Cloud) OCR model; marx_v1, settings for the Microsoft OCR model.

Model-specific settings

Options to configure Model-Specific Settings depend on OCR model type.

To configure OCR settings for Abbyy, use the hq_v1 JSON object with this key:

  • Profile Name – A named profile that corresponds to a set of flags to optimize the OCR model for a specific purpose. Supported profile names are:

    • doc_style_accurate – Includes the structure, style, and text in the document.

    • doc_style_fast – Includes the structure, style, and text in the document, optimized for speed above accuracy.

    • doc_text_accurate – Includes text embedded in logos and standard text detection. Excludes style and document structure information.

    • doc_text_fast – Includes text embedded in logos and standard text detection. Excludes style and document structure information, optimized for speed above accuracy.

    • text_only_accurate – Includes maximal text detection, including small text areas of low quality. Tables, photos, style, and document structure aren’t analyzed.

    • text_only_fast – Tables, photos, style, and document structure aren’t analyzed. Optimized for speed above accuracy.

To configure OCR settings for Google OCR, use the lq_v1 JSON object with this key:

  • Feature Type – Controls the types of images the low-quality OCR model is optimized for.

    • general – Detects and extracts text from any image.

    • document – Optimized for dense text and documents.

To configure OCR settings for Microsoft, use the marx_v1 JSON object with this key:

  • Version – Controls the Microsoft OCR version.

    • v2 – Uses the lite deployment.

    • v3 – Uses the max deployment.

For example, entering the following parameters in Model-Specific Settings configures the model to use Microsoft OCR Max:

{
  "marx_v1": {
    "version": "v3"
  }
}

Native PDF settings

Native PDF settings let you control the generation of native PDFs for image, PDF, or .tiff documents.

Specify PDF settings in JSON only, in the OCR configuration advanced view.

  • native_pdf - A JSON object that specifies options for generating native PDFs.

    • write_native_pdf - Specifies whether to generate native PDFs for the given input documents. Setting to false generates no PDF.

    • resolution_dpi - An integer between 72 and 300 that sets the resolution DPI of the output PDFs. Applicable only if write_native_pdf is true.

For example, in the following code sample, native PDFs are generated with a 300 DPI resolution.

"native_pdf": {
    "write_native_pdf": true,
    "resolution_dpi": 300
}

Filter settings

Filter settings enable you to reprocess records with a specified class using different processing settings. You can specify filter settings only if your flow includes a previous step that outputs .ibdoc files, because page range settings take precedence over filter settings.

Specify filter settings in JSON only, in the OCR configuration advanced view.

  • filter_settings - A JSON object that specifies whether to reprocess records with a specified class using different processing settings.

    • class_type - Specifies a list of class labels you want to reprocess.

    • skip_other_classes - Specifies whether to include records from other class labels in the output .ibdoc.

    • merge_pages - Specifies whether a reprocessed record has a single record in the output .ibdoc or multiple records based on the number of pages in that record.

For example, in the following code sample, records belonging to class 1040 and W-4 are reprocessed. Other records aren’t reprocessed, but because skip_other_classes is false, these records get an entry in the resulting .ibdoc without any change. The reprocessed pages from a record are merged to create a single labeled record in the resulting .ibdoc because merge_pages is true.

"filter_settings": {
    "class_type": [
        "1040", "W-4"
    ],
    "skip_other_classes": false,
    "merge_pages": true
}

Custom image filters

Use a scripts directory to register custom image filters in a Python file, for example scripts.py.

Implementing a custom image filter

An image filter is a Python class that implements the following interface:

  from typing import Dict, Text, Tuple

  from PIL import Image
  from typing_extensions import TypedDict

  FilterConfigDict = TypedDict('FilterConfigDict',
    params=Dict[Text, Text]
  )

  class ImageFilter(object):
    """
    Interface for an image filter.
    """

    def __init__(self, filter_config):
      # type: (FilterConfigDict) -> None
      raise NotImplementedError(u'To be implemented.')

    def execute(self, img):
      # type: (Image) -> Tuple[Image, Text]
      raise NotImplementedError(u'To be implemented.')

Your custom image filter is responsible for taking a Pillow image, running any processing on it, and returning a tuple containing the processed image and any errors.

Passing parameters

The image filter constructor takes in a filter_config of type FilterConfigDict, which is built using this logic:

  • Parameters are read from the image filters config in the OCR config.

  • Any overlapping key in the runtime_config overwrites the value passed in from the image filters config. The resulting parameters are stored in the params part of FilterConfigDict.
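The merge order described above can be sketched as a plain Python helper (hypothetical; the platform performs this merge internally):

```python
def build_filter_config(ocr_config_params, runtime_config_params):
    # Start from the parameters given in the image filters config,
    # then let any overlapping runtime_config keys overwrite them.
    params = dict(ocr_config_params)
    params.update(runtime_config_params)
    return {'params': params}
```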

Image filters are enabled in a flow by specifying them in the OCR config of a process files step. An image_filters flag specifies the filters and passes additional parameters.

{
  "image_filters": [
    {
      "filter_name": "example_filter_1",
      "parameters": {
        "additional_parameter_1": 1,
        "additional_parameter_2": true,
        "additional_parameter_3": "test"
      }
    },
    {
      "filter_name": "example_filter_2"
    }
  ]
}

Image filters that are passed in the image_filters list (example_filter_1 and example_filter_2 in this example) are applied in the process files step. The parameters are passed to their respective image filter objects in the params key of the filter_config dictionary. For example, you can access additional_parameter_1 in the example_filter_1 object through self.filter_config['params']['additional_parameter_1'].

Registering a custom image filter

To make your custom image filter available, register it in the same Python module that defines it by creating a special register_image_filter function.

This function returns a Python dictionary of the following form:

  def register_image_filter():
    # type: () -> Dict
    return {
      '<FILTER-NAME>': {
        'class': <IMAGE-FILTER-CLASS>
      }
    }

Example: Background removal

This example image filter is configured to filter the background from an image.

  from PIL import Image
  import cv2
  import numpy as np

  class BackgroundRemoval(object):
    def __init__(self, filter_config):
      self.filter_config = filter_config

      # Intensities of interest
      self.ioi = self.filter_config['params']['ioi']

    def _execute_bg_removal(self, img_rgb):
      if len(img_rgb.shape) > 2:
        gray = cv2.cvtColor(img_rgb, cv2.COLOR_BGR2GRAY)
      else:
        gray = img_rgb

      # Find the first intensity of interest whose threshold captures
      # more than 2 percent of the page
      for intensity in self.ioi:
        ret, thresh = cv2.threshold(gray, intensity, 255, cv2.THRESH_BINARY_INV)
        bw = thresh > 0
        w, h = bw.shape[::-1]
        if (bw.sum() * 100) / (w * h) > 2:
          break

      # Noise removal
      kernel = np.ones((1, 1), np.uint8)
      opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=2)
      # Sure foreground area
      sure_fg = cv2.dilate(opening, kernel, iterations=3)
      # Marker labelling
      ret, markers = cv2.connectedComponents(sure_fg)
      # Add one to all labels so that sure foreground is not 0, but 1
      markers = markers + 1
      # Mark the background region with zero
      markers[sure_fg == 0] = 0
      markers[markers > 0] = 255

      # Whiten the background pixels, then smooth the result
      img_rgb[markers == 0] = [255, 255, 255]
      img_rgb = cv2.dilate(img_rgb, kernel, borderValue=255, iterations=2)
      img_rgb = cv2.morphologyEx(img_rgb, cv2.MORPH_CLOSE, kernel, iterations=2)
      return img_rgb

    def execute(self, img):
      rgb_img = img.convert('RGB')
      cv_img = np.array(rgb_img)
      cv_img = cv2.cvtColor(cv_img, cv2.COLOR_RGB2BGR)

      proc_img = self._execute_bg_removal(cv_img)
      ret_img = Image.fromarray(cv2.cvtColor(proc_img, cv2.COLOR_BGR2RGB))

      return ret_img, None

  def register_image_filter():
    # type: () -> Dict
    return {'background-removal': {'class': BackgroundRemoval}}