About Refiner

Note: this product was called “Refiner 5” until the 21.4.0 release.

Refiner is a multi-modal document extraction platform that lets you work with documents directly to extract text and visual information in a unified, intuitive, and extensible interface.

Extraction is a diverse activity. Documents can be long or short, structured or unstructured, text-heavy, table-heavy, or covered in barcodes, and everything in between.

Most extraction tools don’t manage diversity well. The Refiner platform manages the full diversity of extraction with these capabilities:

  • Uses open file formats for describing images, documents, training data, and extraction.

  • Includes a plugin system for extraction models to interact with these file formats. Instabase ships with about 40 pre-written extraction models.

  • Establishes rules for how these models behave to preserve audit trails, explainability, and composition.

This platform approach provides a stable base for all extraction and is extensible. Many extraction plugins are provided, including Refiner functions and selected visual functions.

Supported extraction forms

Refiner supports these main forms of extraction:

  • Spatial Extraction
    Extract the phrase to the right of the text “First Name:”

  • Regular Expressions
    Extract the phrase captured by ^Amount: \$([0-9.]+)$

  • Semantic
    Extract the ADDRESS below the text “Shipped To”

  • Visual
    Is the CHECKBOX associated with “Married - Filing Jointly” checked?

  • Metadata
    What is the FILENAME associated with this data?

  • Structure
    Generate a LIST of every bullet point under the heading Risks

  • Plugins (UDFs)
    Run my custom model and return its result.
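
As a rough illustration of the regular expression form above, here is a minimal Python sketch that captures the number following a literal "Amount: $". It uses Python's standard re module, not a Refiner function. The pattern, the function name extract_amount, and the sample text are assumptions made for this example.

```python
import re

# Assumed pattern: a whole line of the form "Amount: $<number>".
# The dollar sign is escaped so it is matched literally.
AMOUNT_PATTERN = re.compile(r"^Amount: \$([0-9.]+)$", re.MULTILINE)


def extract_amount(text):
    """Return the captured amount as a string, or None if no line matches."""
    match = AMOUNT_PATTERN.search(text)
    return match.group(1) if match else None


sample = "Invoice 1001\nAmount: $1234.50\nThank you"
print(extract_amount(sample))
```

The capture group returns only the number, so the label and currency symbol are left out of the extracted value.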

Tip

For hands-on guidance on working with Refiner, see the text extraction guide and visual extraction guide.

When to use Refiner

Use Refiner to do additional refinement on data extracted by deep learning models, from post-processing model output to adding custom business logic.

Refiner features:

  • In-app visualization of the development dataset

  • Detailed in-app documentation of Refiner functions

  • Support for visual processing (checkboxes, signatures, image crops, and so on)

  • Full UDF authoring, executing, and debugging support

Getting started

Creating Refiner programs

For Solution Builder projects, a Refiner program is created for you through the Solution Builder UI.

To create a Refiner program outside of a Solution Builder project:

  1. Open the Refiner app.

  2. Click Create Classic Refiner.

  3. Fill out the fields in the creation dialog.

Layouts provide different ways to view documents, records, fields, and output in Refiner.

  • Document: Shows the document viewer and field list.

  • Split: Shows the output table, document viewer, and field list.

  • Table: Shows the output table.

  • Custom: You can drag to show or hide panels and save the layout.

When the document viewer is shown, you can toggle between the document image and the extracted OCR text.

Select records to run

Use the records drawer to control which records appear in the output table and the run process.

  1. To open the records drawer, select the Select Records button. The Available Records drawer opens with all the available records.

  2. Click a record to preview the first 3 pages of the document.

  3. Select All or select one or more record checkboxes to include the selected records in the output table and the run process.

Selecting only a subset of records is useful to speed up your development because unselected records are excluded from the run process.

Refiner functions

When you start typing a Refiner function, in-product documentation is shown. To view an in-app Formula list, use Help > Formula list.

Extracting text fields

To extract a text field:

  1. In the right panel, click + New Field.

    • Replace the provided field_ name with a self-describing unique field name, then press Enter or click outside of the field to apply the name change. Duplicate field names are not supported. For helper fields, prefix the field name with a double underscore (__). The double underscore is a naming convention that prevents helper fields from being generated in the output and downstream applications.
  2. In the bottom panel, define your field.

    • Leave the Field type with the default Text Field.

    • Optional: Select an Output type. The supported types are: No type, Text, Float, Integer, List, Image, Table, Dict. Defining a type lets you filter output display by the selected output type.

    • To use the Target Comparison feature, select a Target name to map the new field to a field in the targets file.

    • Optional: Add a Field description.

  3. Optional: To enable Target Comparison, move the Run with targets slider to the right.

  4. In the bottom panel, enter the Refiner formula, and then click Run Field.

  5. The results show for each field. If the Target Comparison feature is enabled for the run, the mapped fields are indicated with a purple bar in the records list and in the fields pane.

  6. Click Save to save the new field to your program.

Note: Unsaved Changes displays below the Save button if Refiner program changes are not saved.
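
The double underscore convention for helper fields can be pictured with a short Python sketch. This is not how Refiner implements the filtering. It only shows the effect of the convention on a plain dictionary of field names and values, and the names visible_fields and record are hypothetical.

```python
HELPER_PREFIX = "__"


def visible_fields(fields):
    """Drop helper fields, meaning names that start with a double underscore."""
    return {
        name: value
        for name, value in fields.items()
        if not name.startswith(HELPER_PREFIX)
    }


# Hypothetical record: one helper field and two regular fields.
record = {"__anchor_total": "Total:", "total": "42.00", "first_name": "Ada"}
print(visible_fields(record))
```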

Extracting visual fields

Extraction is supported for these visual field types:

  • Image Crop: Returns a base64-encoded image section of the document. Set the field Type to Image and Run Field to render the image in the Output tab.

  • Checkbox: Returns a boolean indicating whether the checkbox or radio button is ticked.

  • Signature: Returns a boolean indicating whether a signature is present.
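
If you pass an Image Crop value to your own code, you typically need to decode the base64 text before you can treat it as an image. The sketch below assumes the value is a plain base64 string of the image bytes with no prefix. Check what Refiner actually returns in your environment. The function names are hypothetical.

```python
import base64


def decode_image_crop(encoded):
    """Decode a base64 string into raw image bytes.

    validate=True makes the call raise an error on characters outside
    the base64 alphabet instead of silently skipping them.
    """
    return base64.b64decode(encoded, validate=True)


def looks_like_png(data):
    """Return True if the bytes start with the 8-byte PNG file signature."""
    return data.startswith(b"\x89PNG\r\n\x1a\n")


# Stand-in value built for this example, not real Refiner output.
fake_crop = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"payload").decode("ascii")
image_bytes = decode_image_crop(fake_crop)
print(looks_like_png(image_bytes))
```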

To add a visual field:

Visual fields require one or more anchor fields.

  • Anchor fields are Refiner functions with a consistent location in the documents.

  • Visual extraction uses anchor fields to figure out where in a record to locate the image, checkbox, or signature you want to extract.

  • Anchor fields must exist on the same page as the visual field.

  • You must run the anchor fields before you create the corresponding visual fields.

  1. In the right panel, click + New Field.

  2. In the fields panel, select + New Field to add an anchor field.

  3. In the bottom panel, enter the field name using a double underscore (__) prefix. For example, __anchor_signedcheckbox

    • The double underscore is a naming convention that prevents helper fields from being generated in the output and downstream applications.
  4. Click Run Field to run the anchor field.

  5. In the right panel, click + New Field and change the field type to Visual Field.

  6. In Visual Function Type, select the type of visual field to extract: Image Crop, Checkbox, or Signature.

  7. Click in Anchor Fields and select the previously defined anchor field.

  8. Click Run Field to run the field on all documents.

  9. Click Save to save the new field to your program.

Using text fields to process visual fields

After you extract a visual field, you can use visual Refiner functions in text fields to process the visual field. Start typing a visual Refiner function to view the in-product documentation.

  • image_crop_relative

  • image_decode_checkbox

  • image_decode_signature

Field execution

Fields in the field panel are processed in order with Run All.

Right-click a field to perform these actions:

  • Move field to the top

  • Move field up

  • Move field down

  • Move field to the bottom

  • Duplicate the field

  • Create a field above

  • Create a field below

You can run all fields or just the selected field:

  • Click Run All to run the Refiner functions on all fields.

  • Click Run Field to run only the selected field.

View options

Use the View menu to filter views.

  • Show hidden fields toggles the display in the output table. Hidden fields follow the field name prefix convention of a double underscore (__).

  • Show annotations for selected fields only toggles the display for annotations.

Integrating a completed Refiner program

To integrate a completed program into a Flow, add the Apply Refiner step to your Flow and select the .ibrefiner file.

Keyboard shortcuts

Refiner supports the following keyboard shortcuts to turbocharge your development process.

To view in-app keyboard shortcuts, use Help > Keyboard shortcuts.

Name | Shortcut | Description
Save | Command+S / Control+S | Save the program
Run Current Field | Command+Enter / Control+Enter | Run the current field only
Run (all fields) | Command+Shift+Enter / Control+Shift+Enter | Run all fields
Formula List | Command+/ / Control+/ | Display searchable formula list
Next Record | Down Arrow Key | Go to the next record row
Previous Record | Up Arrow Key | Go to the previous record row
Next Field | Right Arrow Key | Go to the next field
Previous Field | Left Arrow Key | Go to the previous field

Fixed structure documents

The structure of fixed structure documents is known in advance and doesn’t change. Fixed structure documents have labels associated with fields to extract.

Extraction from fixed structure documents can reach high accuracy quickly, tolerates OCR errors, and has built-in provenance tracking support.

Examples: ADP Paystub, Bank of America Bank Statement, CA Driver License. Pre-built solutions handle fixed structure documents well and are available in the Marketplace.

Action | Example | Method | Limitations
Extract text | Extract the phrase to the right of the text First Name: | Spatial functions, such as scan_right, scan_below, or scan_box. Or use regular expressions | Doesn’t support scan_above, scan_left
Detect visual feature | Is the CHECKBOX associated with Married - Filing Jointly checked? | Visual functions | Doesn’t support faces or logos.
Crop image | Extract the image corresponding to signature above Authorizer | Visual functions | Doesn’t support faces or logos.
Extract structured information | Generate a list of every bullet point underneath the heading Risks | Use spatial functions to extract region, then split by delimiter, such as new line or comma | Doesn’t support special view to handle structured information
Specify different output format | Ensure dates are in mm-dd-yy format | UDF |
Validate | Ensure expiry_date is after today’s date | UDF |
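
The last two rows of the table rely on UDFs. The following is a minimal sketch of the logic such UDFs might contain, written as plain Python functions. It does not show how a UDF is registered with Refiner, and the function names and the list of accepted input formats are assumptions for this example.

```python
from datetime import date, datetime

# Input formats this sketch tries, in order. This list is an assumption;
# adjust it to the date styles that appear in your documents.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Parse a date string using the first matching known format."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def to_mm_dd_yy(text):
    """Reformat a recognized date string as mm-dd-yy."""
    return parse_date(text).strftime("%m-%d-%y")


def is_after_today(text, today=None):
    """Return True if the date is strictly later than today.

    The today parameter exists so the check can be tested with a fixed date.
    """
    reference = today if today is not None else date.today()
    return parse_date(text) > reference


print(to_mm_dd_yy("2024-03-09"))
print(is_after_today("2030-01-01", today=date(2024, 1, 1)))
```

Raising an error for unrecognized input, rather than returning the text unchanged, makes a formatting problem visible in the output cell instead of letting a badly formatted date pass through.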

Variable structure documents

The structure of variable structure documents can change between documents. These documents have labels associated with the fields to extract.

Extraction from variable structure documents can reach decent accuracy, tolerates OCR errors, and has built-in provenance tracking support.

Examples: US W2, Bill of Lading, Invoices, and so on.

Action | Example | Method | Limitations
Extract text around label | Extract the phrase around the text Invoice Number: or Invoice No. | Use regular expressions or more flexible spatial functions, such as scan_near or scan_box | Is text-based; does not support spatial scanning in the document image domain
Extract text semantically | Extract the first address on the document as shipper_address | Use nlp_token_find methods with Instabase’s default entities, such as name, global-address, and date, with entities created from uploaded datasets, or with entities you define yourself | Existing Token Matchers are continually being improved, and new ones are being added
Crop image | Extract the image corresponding to signature above Authorizer | Visual functions | Can’t be done in a long-tail manner; the relative direction from the anchor is fixed
Extract structured information | Generate a list of every bullet point underneath the heading Risks | First use the text-based high-variability technique above to find the right labels, then scan_below | More to come on table extraction
Specify different output format | Ensure dates are in mm-dd-yy format | UDF | More native output format selection coming
Validate | Ensure expiry_date is after today’s date | UDF | More native validation formulas coming
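
For the first row, a single regular expression can tolerate several spellings of a label. The sketch below uses Python's re module to accept "Invoice Number:", "Invoice No.", and similar variants. The pattern and the function name extract_invoice_number are assumptions for this example, and regular expression syntax in Refiner may differ in details from Python's.

```python
import re

# Accepts "Invoice Number" or "Invoice No" with an optional period and colon.
# The negative lookahead stops words such as "Notes" from matching as "No".
INVOICE_PATTERN = re.compile(
    r"Invoice\s+(?:Number|No\.?)(?![A-Za-z])\s*:?\s*([A-Za-z0-9-]+)",
    re.IGNORECASE,
)


def extract_invoice_number(text):
    """Return the identifier that follows an invoice label, or None."""
    match = INVOICE_PATTERN.search(text)
    return match.group(1) if match else None


for line in ("Invoice Number: INV-1001", "Invoice No. 7788", "Invoice Notes: none"):
    print(line, "->", extract_invoice_number(line))
```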

Advanced extraction

Advanced extraction includes provenance tracking and UDFs.

Provenance tracking

Provenance tracking determines where in the input a given output came from. In Refiner, it maps output text coordinates to INPUT_COL text coordinates.

Provenance tracking is set up automatically when your Refiner programs are created from the Recipe Book or Training Projects.
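
To make the idea of mapping output back to input concrete, here is a small Python sketch that returns an extracted value together with its character offsets in the input text. It illustrates the concept only. It is not Refiner's internal representation, and the TracedValue class and extract_with_provenance function are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TracedValue:
    text: str   # the extracted value
    start: int  # inclusive character offset into the input text
    end: int    # exclusive character offset into the input text


def extract_with_provenance(pattern, input_text):
    """Return the first capture group and its span in the input, or None."""
    match = re.search(pattern, input_text)
    if match is None:
        return None
    start, end = match.span(1)
    return TracedValue(match.group(1), start, end)


source = "First Name: Ada\nLast Name: Lovelace"
value = extract_with_provenance(r"Last Name: (\w+)", source)
# Slicing the input with the recorded offsets recovers the extracted text.
print(value, source[value.start:value.end] == value.text)
```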

Adding a UDF

Although Refiner is powerful, it might not support all of your extraction requirements, particularly if you want to integrate specific business logic. You can extend its capabilities by writing UDFs and referencing the script directory in the Settings panel. See UDFs to provide custom code in Flow and Refiner.

Troubleshooting

Tips on isolating and resolving problems with Refiner.

What to do when files don’t load?

  1. Refresh the page.

  2. Reselect the IBOCR/IBDOC folder with File > Open Folder to make sure the path is still valid.

  3. When selecting the input folder, make sure that the folder contains valid .ibdoc (IBDOC) files that do not contain refined_phrases. The input folder is typically in the project out/s2_map_records folder.

  4. Try creating a new Refiner program by right-clicking the IBOCR/IBDOC folder that you want included. Create a new Refiner program to verify the file system and the .ibdoc files. Make sure that the upstream resources are in the expected location.

Note: You might get an “An unexpected error occurred” warning if files aren’t in the designated location.

What to do when the page goes blank?

  1. Open the JavaScript console, take a screenshot, and attach it to a bug report. Provide details about what you were doing and enough information to help us reproduce the problem.

  2. Refresh the page.

Debugging formulas and UDFs when a cell shows an error

The error message appears directly in the cell and often points to the cause.

Common causes to check:

  1. Make sure your Refiner formula does not contain double quotation marks ("). Use single quotation marks (') instead.

  2. Make sure your parentheses are well-matched.

  3. Make sure your regular expressions are valid and do the right thing. You can use an external website such as regexr.com to test your expressions.

  4. Make sure you’re providing the correct values for the parameters that the Refiner functions accept. Use the in-app documentation for function usage information.

  5. If any UDFs are involved, make sure there are no errors.
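
The first three checks can be approximated offline before you paste a formula into Refiner. The sketch below is a rough pre-check, not Refiner's parser. It assumes formulas use single-quoted strings without escape sequences, and it validates regular expressions with Python's re module, whose syntax may differ from Refiner's. The function names and the sample formula strings are illustrative only.

```python
import re


def check_formula(formula):
    """Return a list of readable problems found in a formula string."""
    problems = []
    if '"' in formula:
        problems.append("contains double quotation marks; use single quotes")
    depth = 0
    in_quote = False
    for char in formula:
        if char == "'":
            in_quote = not in_quote
        elif not in_quote:
            # Parentheses inside quoted strings are ignored.
            if char == "(":
                depth += 1
            elif char == ")":
                depth -= 1
                if depth < 0:
                    break  # a closing parenthesis with no opener
    if depth != 0:
        problems.append("unbalanced parentheses")
    if in_quote:
        problems.append("unterminated single-quoted string")
    return problems


def check_regex(pattern):
    """Return None if the pattern compiles in Python, else the error text."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


print(check_formula("scan_right(INPUT_COL, 'First Name:')"))
print(check_formula('scan_right(INPUT_COL, "First Name:"'))
print(check_regex("([0-9.]+"))
```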

Log messages in the UDF log

To isolate problems in UDFs, you can log messages using the logging module:

import logging
logging.info('my message here')
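
A slightly fuller sketch of the same idea logs the input, the output, and any failure inside a function body. The function parse_amount is a hypothetical example, and how it would be registered as a UDF is not shown. Whether INFO-level messages appear depends on the logging configuration of the environment that runs the code.

```python
import logging


def parse_amount(value):
    """Convert extracted text such as '$1,234.50' to a float, or None."""
    logging.info("parse_amount input: %r", value)
    try:
        result = float(value.replace("$", "").replace(",", "").strip())
    except ValueError:
        # logging.exception records the message together with the traceback.
        logging.exception("parse_amount could not parse %r", value)
        return None
    logging.info("parse_amount output: %r", result)
    return result
```

Logging the raw input with %r makes stray whitespace and unexpected characters from OCR visible in the log.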

Known problems

  • Error messages are not precise or readable. For example, a quotation issue produces “not all text consumed” errors, which can be confusing.