Process Files

The process files step converts various document formats into machine-readable text. Most flows begin with this step, but you can place the step anywhere in a flow. If the preceding step outputs .ibdoc files, the OCR operation reverts to the original source files.

You can define processing configuration options using an existing digitization profile, connected as a module in the process files step, or you can specify processing options directly in the step.

Tip

If you’re developing a solution that includes digitizing documents with non-default options, reuse the digitization profile from the Documents stage of your Solution Builder project. Reusing this profile ensures that production documents processed by your flow are optimally digitized.

Parameters and settings

Use these parameters and settings to control how the process files step operates.

Use Reader module

If you want to use a digitization profile that you developed in Solution Builder or Reader, select True and then select the digitization profile in the Reader Module field.

To use default digitization options, or to specify configuration within the process files step, select False.

Processing functions

Processing function indicates the document type for input files. To automatically process documents based on file extension, select Auto-extract. Otherwise, select the option that matches your input files.

  • PDF documents (.pdf) with machine-readable text

  • Images/Scanned documents (.pdf, .png, .jpeg, .tiff)

  • Microsoft Excel documents (.xlsx)

  • Microsoft Word documents (.docx)

  • Microsoft Rich Text Format documents (.rtf)

  • Microsoft PowerPoint documents (.pptx)

  • Email messages (.eml, .msg)

  • Web Pages (.html, .mht)

  • Text-based files (.csv, .txt)

For more details, see Supported file types and Supported settings by file type.

OCR model type

OCR model type specifies the processing engine to use for digitization. The Default (Microsoft OCR) model is suitable for most use cases, but you can select a different model depending on your circumstances.

  • Abbyy – Deprecated and scheduled for removal in 24.04. Use Microsoft Read OCR instead.

  • Google Vision (Cloud) – Appropriate for lower-quality scanned input documents, including third- and fourth-generation documents with fuzzy images, shadows, and folds.

  • Microsoft Read OCR – Appropriate for high-accuracy extraction, and handwritten or foreign language documents. This engine returns a character-based OCR confidence score that can be used in Refiner.

  • Tesseract – Appropriate for high-quality scanned input documents.

Scripts directory

Scripts directories contain custom Python scripts intended to be run as part of the process files step. Most commonly, scripts are used to register custom image filters. Each .py file must contain a custom registration function. For detailed information, see Custom image filters.

Layout scope

Layout scope defines how layout analysis is performed on each document.

  • Per page – Performs analysis once per page. This option is appropriate for documents with relatively self-contained pages, and it works well with a variety of layouts, including mixed vertical and horizontal orientation, different page sizes, and formatting that varies per page.

  • Per document – Performs analysis on the entire document. This option can be used to extract information from tables that span multiple pages of text because the output alignment preserves layout across multiple pages.

Layout algorithm

Layout algorithm defines how text is arranged in the output document.

  • Spatial – Arranges output text from top to bottom, left to right. This option is appropriate for documents such as letters, forms, news articles, and invoices. The default, V2.0, is suitable for most standard documents. For documents with a mix of vertical and horizontal text, use V3.0.

  • By Paragraph – Arranges output text according to inferred paragraph flow. This option is best for multi-column documents.

Page range

Page range defines which pages in a multipage document to digitize, for example, 1 or 2-10. Wildcards aren’t supported. If left blank, all pages are digitized.
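The page range format can be illustrated with a small hypothetical parser (not part of the product; it only demonstrates the format described above):

```python
def parse_page_range(page_range, total_pages):
    # Blank means digitize all pages.
    if not page_range:
        return list(range(1, total_pages + 1))
    # "2-10" means an inclusive range of pages.
    if '-' in page_range:
        start, end = page_range.split('-')
        return list(range(int(start), int(end) + 1))
    # "1" means a single page.
    return [int(page_range)]
```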

Encryption config

If your input documents are password-protected PDFs, specify their passwords under the passwords key of the runtime_config JSON input object.

The runtime_config is a JSON object of type Dict[Text, Any].

To provide passwords in the runtime configuration:

  1. Create a dictionary with keys that specify filenames of encrypted PDFs and values that specify the corresponding passwords.

  2. Provide the dictionary as a value to the key passwords in the runtime_config JSON file.

  3. Define the instabase and pdf namespaces.

For example:

{
  "instabase": {
    "pdf": {
      "passwords": {
        "input_file1.pdf": "password1",
        "input_file2.pdf": "password2"
      }
    }
  }
}
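The three steps above can be sketched in Python, building the same object as the JSON example (the filenames and passwords are placeholders):

```python
import json

# Step 1: map filenames of encrypted PDFs to their passwords.
passwords = {
    'input_file1.pdf': 'password1',
    'input_file2.pdf': 'password2',
}

# Steps 2 and 3: provide the dictionary as the value of the passwords key,
# nested inside the instabase and pdf namespaces.
runtime_config = {'instabase': {'pdf': {'passwords': passwords}}}

print(json.dumps(runtime_config, indent=2))
```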

OCR config

OCR configuration specifies general processing, pre-processing, and OCR processing options for digitization.

General processing

Tip

To configure options in JSON format, select Show advanced view.

  • Force Image OCR (force_image_ocr) – Specifies whether to treat the document as an image. Must be true when using visual extraction functions. If you set force_image_ocr to true, you can’t set extract_all_pdf_layers to true. Values: false (default), true.

  • Write Converted Image (write_converted_image) – Specifies whether to save per-page image files to disk. Values: true (default), false.

  • Write Thumbnail (write_thumbnail) – Specifies whether to generate thumbnail images, which can speed page loading while annotating documents. Values: true (default), false.

  • Write Model Training Image (write_model_training_image) – Specifies whether to save a grayscale, resized version of the original page images to disk in a new model_training_images directory. Images are scaled down to fit in a 1024 x 1024 box, making them more suitable for training machine learning models. Values: false (default), true.

  • Extract All PDF Layers (extract_all_pdf_layers) – Specifies whether to extract text elements from the machine-readable PDF page and any text within the image layer of the same PDF page. Not supported with the By Paragraph layout algorithm. Extracting all layers guarantees the highest fidelity text results, but it’s resource intensive because it runs OCR on all pages. If you set force_image_ocr to true, you can’t set extract_all_pdf_layers to true. Values: false (default), true.

  • Produce Metadata List (produce_metadata_list) – Specifies whether page layouts and metadata are set within .ibdoc files. Values: true (default), false.

  • Produce Word Metadata (produce_word_metadata) – Specifies whether to include word confidence and position information in the .ibmsg file. Word metadata is required for confidence calculations and for overlaying words onto the document in the Review OCR app. Values: true (default), false.

  • Remove Space Wordpolys from IBDOC (remove_space_wordpolys) – Specifies whether to remove empty spaces and words with no text from .ibdoc files. Values: true (default), false.

  • Cache PDF Results (cache_pdf_results) – Specifies whether to cache results for PDF files. Values: false (default), true.

  • Output Formats (output_formats) – Specifies the file format for output OCR text files. Values: .csv, .txt.

  • Repair PDFs (repair_pdfs) – Specifies whether to rewrite a PDF before processing to remove possible PDF corruption. Values: false (default), true.

  • Skip Text Extraction (skip_text_extraction) – Specifies whether to skip text extraction, which reduces runtime. Useful for flows that focus on entity detection. Values: false (default), true.

  • Force Column Width CSV to PDF (force_column_width_csv_to_pdf) – Specifies whether to dynamically size columns to prevent truncation in PDFs generated from CSV files. Values: true (default), false.
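In the advanced view, each setting corresponds to a JSON key. A minimal fragment combining a few of the keys above might look like this (the flat key layout is a sketch; consult the advanced view for the exact surrounding structure):

```json
{
  "force_image_ocr": false,
  "write_thumbnail": true,
  "repair_pdfs": true
}
```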

Preprocessing

  • Remove Boxes (remove_boxes) – Specifies whether to attempt to remove boxes from the document before processing text, which can sometimes improve OCR. Values: false (default), true.

  • Remove Boxes over Height % (remove_boxes_over_height_percent) – Specifies the minimum height, as a percentage of the page, of vertical lines to remove when performing box removal. Value: float; default: 0.2 (20 percent).

  • Remove Boxes over Width % (remove_boxes_over_width_percent) – Specifies the minimum width, as a percentage of the page, of horizontal lines to remove when performing box removal. Value: float; default: 0.2 (20 percent).

  • Correct Color Inversion (correct_inversion) – Specifies whether to correct color-inverted images. Values: false (default), true.

  • Detect Blurry Files (detect_blurry_files) – Specifies whether to detect blurry input files and return their blur factors. Values: false (default), true.

  • Image Filters (image_filters) – Specifies any built-in or custom image filters to run and the arguments to pass in. Value: a JSON dictionary of filters to run.

OCR processing

Note

Available OCR options are displayed based on the selected model type.

  • OCR Timeout (secs) (ocr_timeout) – Specifies the duration in seconds to wait for a response from the OCR service before timing out. Value: integer; default: 300.

  • Correct Resolution (correct_resolution) – Specifies whether to attempt to resize the image for OCR processing. Usually, this method is inferior to correct_resolution_auto. Values: false (default), true.

  • Auto-Correct Resolution (correct_resolution_auto) – Specifies whether to attempt to automatically change the image resolution for optimal OCR processing. Usually, this method is preferable to correct_resolution. Values: false (default), true.

  • Correct Orientation (correct_orientation) – Specifies whether to attempt to correct page rotations of 90, 180, and 270 degrees. Values: false (default), true.

  • Page Dewarp (dewarp_page) – Specifies whether to attempt to correct skew and warp in the image. Values: false (default), true.

  • Reorient Words (reorient_words) – Specifies whether to attempt to transform the coordinates of words so that the formatted text output is correct under rotation. Values: false (default), true.

  • Languages (languages) – Specifies which language models the OCR uses, which helps OCR efficiency and accuracy. Value: a list of language models to use when running OCR. See Supported language codes. Default: en, the English language model.

  • Fonts (fonts) – For the deprecated Abbyy OCR model only, specifies which text types the OCR reads. By default, text types are automatically selected. Available fonts are Normal, Typewriter, Matrix, Index, OCR-A, OCR-B, MICR-E13B, MICR-CMC7, Gothic, and Receipt. Value: a comma-separated list of strings specifying the text types.

  • Detect Barcodes (detect_barcodes) – Specifies whether to attempt to extract barcode information from the document. Values: false (default), true.

  • Find Lines (find_lines) – Specifies whether to detect lines in the documents. Values: false (default), true.

  • Model-Specific Settings (model_specific_settings) – Specifies what model-specific settings to use, if any. See Model-specific settings for details. Values: none (default); hq_v1, settings for the Abbyy OCR model (deprecated); lq_v1, settings for the Google Vision (Cloud) OCR model; marx_v1, settings for the Microsoft OCR model.

Model-specific settings

Options to configure Model-Specific Settings depend on OCR model type.

To configure OCR settings for Abbyy, use the hq_v1 JSON object with this key:

  • Profile Name – A named profile that corresponds to a set of flags to optimize the OCR model for a specific purpose. Supported profile names are:

    • doc_style_accurate – Includes the structure, style, and text in the document.

    • doc_style_fast – Includes the structure, style, and text in the document, optimized for speed above accuracy.

    • doc_text_accurate – Includes text embedded in logos and standard text detection. Excludes style and document structure information.

    • doc_text_fast – Includes text embedded in logos and standard text detection. Excludes style and document structure information, optimized for speed above accuracy.

    • text_only_accurate – Includes maximal text detection, including small text areas of low quality. Tables, photos, style, and document structure aren’t analyzed.

    • text_only_fast – Tables, photos, style, and document structure aren’t analyzed. Optimized for speed above accuracy.

To configure OCR settings for Google OCR, use the lq_v1 JSON object with this key:

  • Feature Type – Controls the types of images the low-quality OCR model is optimized for.

    • general – Detects and extracts text from any image.

    • document – Optimized for dense text and documents.

To configure OCR settings for Microsoft, use the marx_v1 JSON object with this key:

  • Version – Controls the Microsoft OCR version.

    • v2 – Uses the lite deployment.

    • v3 – Uses the max deployment.

For example, entering the following parameters in Model-Specific Settings configures the model to use Microsoft OCR Max:

{
  "marx_v1": {
    "version": "v3"
  }
}

Native PDF settings

Native PDF settings let you control the generation of native PDFs for image, PDF, or .tiff documents.

Specify PDF settings in JSON only, in the OCR configuration advanced view.

  • native_pdf - A JSON object that specifies options for generating native PDFs.

    • write_native_pdf - Specifies whether to generate native PDFs for the given input documents. Setting to false generates no PDF.

    • resolution_dpi - An integer between 72 and 300 that sets the resolution DPI of the output PDFs. Applicable only if write_native_pdf is true.

For example, in the following code sample, native PDFs are generated with a 300 DPI resolution.

"native_pdf": {
    "write_native_pdf": true,
    "resolution_dpi": 300
}

Filter settings

Filter settings enable you to reprocess records with a specified class using different processing settings. You can specify filter settings only if your flow includes a previous step that outputs .ibdoc files, because page range settings take precedence over filter settings.

Specify filter settings in JSON only, in the OCR configuration advanced view.

  • filter_settings - A JSON object that specifies whether to reprocess records with a specified class using different processing settings.

    • class_type - Specifies a list of class labels you want to reprocess.

    • skip_other_classes - Specifies whether to include records from other class labels in the output .ibdoc.

    • merge_pages - Specifies whether a reprocessed record has a single record in the output .ibdoc or multiple records based on the number of pages in that record.

For example, in the following code sample, records belonging to class 1040 and W-4 are reprocessed. Other records aren’t reprocessed, but because skip_other_classes is false, these records get an entry in the resulting .ibdoc without any change. The reprocessed pages from a record are merged to create a single labeled record in the resulting .ibdoc because merge_pages is true.

"filter_settings": {
    "class_type": [
        "1040", "W-4"
    ],
    "skip_other_classes": false,
    "merge_pages": true
}

Custom image filters

Use a scripts directory to register custom image filters in a Python file, for example scripts.py.

Implementing a custom image filter

An image filter is a Python class that implements the following interface:

  from typing import Dict, Text, Tuple

  from PIL import Image
  from typing_extensions import TypedDict

  FilterConfigDict = TypedDict('FilterConfigDict',
    params=Dict[Text, Text]
  )

  class ImageFilter(object):
    """
    Interface for an image filter.
    """

    def __init__(self, filter_config):
      # type: (FilterConfigDict) -> None
      raise NotImplementedError(u'To be implemented.')

    def execute(self, img):
      # type: (Image) -> Tuple[Image, Text]
      raise NotImplementedError(u'To be implemented.')

Your custom image filter is responsible for taking a Pillow image, running any processing on it, and returning a tuple containing the processed image and any errors.

Passing parameters

The image filter constructor takes in a filter_config of type FilterConfigDict, which is built using this logic:

  • Parameters are read from the image filters config in the OCR config.

  • Any overlapping key in the runtime_config overwrites the value passed in from the image filters config. The resulting parameters are stored in the params part of FilterConfigDict.
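The merge order described above can be sketched as a plain Python helper (hypothetical; the platform performs this merge internally):

```python
def build_filter_config(ocr_config_params, runtime_config_params):
    # Start from the parameters given in the image filters config,
    # then let any overlapping runtime_config keys overwrite them.
    params = dict(ocr_config_params)
    params.update(runtime_config_params)
    return {'params': params}
```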

Image filters are enabled in a flow by specifying them in the OCR config of a process files step. An image_filters flag specifies the filters and passes additional parameters.

{
  "image_filters": [
    {
      "filter_name": "example_filter_1",
      "parameters": {
        "additional_parameter_1": 1,
        "additional_parameter_2": true,
        "additional_parameter_3": "test"
      }
    },
    {
      "filter_name": "example_filter_2"
    }
  ]
}

Image filters that are passed in the image_filters list (example_filter_1 and example_filter_2 in this example) are applied in the process files step. The parameters are passed to their respective image filter objects in the params key of the filter_config dictionary. For example, you can access additional_parameter_1 in the example_filter_1 object through self.filter_config['params']['additional_parameter_1'].

Registering a custom image filter

To make your custom image filter available, register it in the same Python module that defines it by creating a special register_image_filter function.

This function returns a Python dictionary of the following form:

  def register_image_filter():
    # type: () -> Dict
    return {
      '<FILTER-NAME>': {
        'class': <IMAGE-FILTER-CLASS>
      }
    }

Example: Background removal

This example image filter is configured to filter the background from an image.

  from PIL import Image
  import cv2
  import numpy as np

  class BackgroundRemoval(object):
    def __init__(self, filter_config):
      self.filter_config = filter_config

      # Intensities of interest
      self.ioi = self.filter_config['params']['ioi']

    def _execute_bg_removal(self, img_rgb):
      if len(img_rgb.shape) > 2:
        gray = cv2.cvtColor(img_rgb, cv2.COLOR_BGR2GRAY)
      else:
        gray = img_rgb

      # Find the first intensity of interest whose threshold captures
      # more than 2 percent of the page
      for intensity in self.ioi:
        ret, thresh = cv2.threshold(gray, intensity, 255, cv2.THRESH_BINARY_INV)
        bw = thresh > 0
        w, h = bw.shape[::-1]
        if (bw.sum() * 100) / (w * h) > 2:
          break

      # Noise removal
      kernel = np.ones((1, 1), np.uint8)
      opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=2)
      # Sure foreground area
      sure_fg = cv2.dilate(opening, kernel, iterations=3)
      # Marker labelling
      ret, markers = cv2.connectedComponents(sure_fg)
      # Add one to all labels so that sure foreground is not 0, but 1
      markers = markers + 1
      # Mark the background region with zero
      markers[sure_fg == 0] = 0
      markers[markers > 0] = 255

      # Whiten the background pixels, then smooth the result
      img_rgb[markers == 0] = [255, 255, 255]
      img_rgb = cv2.dilate(img_rgb, kernel, borderValue=255, iterations=2)
      img_rgb = cv2.morphologyEx(img_rgb, cv2.MORPH_CLOSE, kernel, iterations=2)
      return img_rgb

    def execute(self, img):
      rgb_img = img.convert('RGB')
      cv_img = np.array(rgb_img)
      cv_img = cv2.cvtColor(cv_img, cv2.COLOR_RGB2BGR)

      proc_img = self._execute_bg_removal(cv_img)
      ret_img = Image.fromarray(cv2.cvtColor(proc_img, cv2.COLOR_BGR2RGB))

      return ret_img, None

  def register_image_filter():
    # type: () -> Dict
    return {'background-removal': {'class': BackgroundRemoval}}