Map Records

The map records step specifies how multipage documents are parsed. If required in a flow, this step always follows the process files step.

Parameters and settings

Use these parameters and settings to control how the map records step operates.

Input file type

The input file type is always IBOCR, the default file format for scanned documents. The IBOCR JSON-formatted files are the output of the process files step.

Mapping type

Specify how documents are split.

  • Single Record: Entire document (ignore blank pages) — All pages of the document are joined into one record, with blank pages excluded.

  • Single Record: Entire document — Default. All pages of the document are joined into one record.

  • Multiple Records: Split by page — Each page of the document represents one record. Page breaks are provided by OCR output.

  • Multiple Records: Split by custom formula — Maps document pages into a custom list of records using a custom formula.

  • Multiple Records: Split by page matcher — Splits pages into records using the first-page and end-page expressions defined in Page matcher config.
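To make the difference between the simpler mapping types concrete, here is a minimal sketch that treats a document as a list of page strings. The function names, the newline separator used to join pages, and the whitespace-only test for a blank page are all assumptions for illustration; they are not taken from the product.

```python
def single_record(pages, ignore_blank=False):
    """Join all pages into one record, optionally skipping blank pages.

    Assumption: a page is "blank" if it holds only whitespace, and pages
    are joined with a newline. The real step may differ on both points.
    """
    kept = [p for p in pages if p.strip()] if ignore_blank else list(pages)
    return ['\n'.join(kept)]


def split_by_page(pages):
    """Emit one record per page, blank pages included."""
    return list(pages)


pages = ['Page one text', '', 'Page three text']
print(single_record(pages))                     # one record, blank page kept
print(single_record(pages, ignore_blank=True))  # one record, blank page dropped
print(split_by_page(pages))                     # three records
```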

Custom Formula

If you select the mapping type Multiple Records: Split by custom formula, use this field to specify a MATCH or SPLIT Python formula.

MATCH

MATCH(INPUT_COL, '<expression to match>', multiline=false)

Example value:

MATCH(INPUT_COL, '.*2017', multiline=false)

For each page, MATCH performs the following:

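import re

# Compile the expression, in multiline mode if requested.
if multiline:
    regexp = re.compile('<expression to match>', re.MULTILINE)
else:
    regexp = re.compile('<expression to match>')

# Collect every match found on this page.
matched_text_list = regexp.findall(page_txt)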
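The per-page logic above can be tried outside a flow with a small standalone sketch. The match_page helper and the sample page text are hypothetical; only the use of re.compile, re.MULTILINE, and findall comes from the description above.

```python
import re


def match_page(page_txt, expression, multiline=False):
    """Return every match of `expression` found in one page of text."""
    flags = re.MULTILINE if multiline else 0
    return re.compile(expression, flags).findall(page_txt)


# Hypothetical page text for illustration.
page_txt = "Invoice 2016\nStatement for 2017\nTotal due"

print(match_page(page_txt, r'.*2017'))                     # ['Statement for 2017']
print(match_page(page_txt, r'^Total.*$', multiline=True))  # ['Total due']
print(match_page(page_txt, r'^Total.*$'))                  # []
```

The last two calls show what the multiline flag changes: without it, ^ and $ anchor only at the start and end of the whole page text, so an expression anchored to a later line finds nothing.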

SPLIT

SPLIT(INPUT_COL, '<regex pattern to split on>', multiline=false)

Example value:

SPLIT(INPUT_COL, '\n', multiline=false)

For MATCH, each match is an element of matched_text_list and corresponds to one record. Matches are supported only within a page.
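As a rough illustration of SPLIT, the sketch below assumes it behaves like Python's re.split applied to the text, with each resulting piece becoming a record. That behavior, the split_text helper, and the choice to drop empty pieces are assumptions; the source does not show the SPLIT implementation.

```python
import re


def split_text(text, pattern, multiline=False):
    """Split `text` on a regex pattern and return the non-empty pieces.

    Assumption: empty pieces (for example, between two consecutive
    separators) are dropped. The real SPLIT formula may keep them.
    """
    flags = re.MULTILINE if multiline else 0
    pieces = re.compile(pattern, flags).split(text)
    return [piece for piece in pieces if piece]


# Hypothetical text for illustration.
text = "line one\nline two\n\nline three"
print(split_text(text, r'\n'))  # ['line one', 'line two', 'line three']
```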

Scripts directory

Select a Python script that contains a custom registration function.

For example, the fns.py file registers the custom_contains function:

def custom_contains_fn(val, search_for, **kwargs):
    return search_for in val

def register(name_to_fn):
    more_fns = {
        'custom_contains': {
            'fn': custom_contains_fn,
            'ex': '',
            'desc': ''
        }
    }
    name_to_fn.update(more_fns)

Then, you can refer to the function CUSTOM_CONTAINS() in custom formulas or page matcher configs.
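To see what registration produces, the sketch below calls register on an empty dictionary and looks the function up by its registered name. How the platform itself calls register and resolves CUSTOM_CONTAINS() is not shown in this document, so the empty starting dictionary and the direct lookup are assumptions made only to exercise the script locally.

```python
def custom_contains_fn(val, search_for, **kwargs):
    return search_for in val


def register(name_to_fn):
    more_fns = {
        'custom_contains': {
            'fn': custom_contains_fn,
            'ex': '',
            'desc': ''
        }
    }
    name_to_fn.update(more_fns)


# Assumption: the registry is a plain dict keyed by function name.
name_to_fn = {}
register(name_to_fn)

contains = name_to_fn['custom_contains']['fn']
print(contains('Social security tips', 'tips'))    # True
print(contains('Social security tips', 'credit'))  # False
```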

For example, the custom_map_fns.py file contains a custom_map function that dynamically looks up a search term and splits pages into records:

def custom_map(clients, parsed_ibocr, **kwargs):
    # A custom map function must return a list of IBOCRRecords.
    # INPUT_COL is passed in as parsed_ibocr, which the function can
    # parse to implement custom mapping.

    # Read the search term from a config file.
    content, err = clients.ibfile.read_file('user/repo/fs/drive/flow_path/search_config.txt')
    if err:
        raise Exception(err)

    search_term = content.decode()

    # Keep each page that contains the search term as a single-page range.
    record_ranges = []
    num_pages = parsed_ibocr.get_num_records()

    for i in range(num_pages):
        page_text, err = parsed_ibocr.get_text_at_record(i)
        if err:
            raise Exception(err)

        if search_term in page_text:
            record_ranges.append((i, i))

    # Join the selected ranges into the final list of records.
    final_records, err = parsed_ibocr.get_joined_records_by_ranges(
        record_ranges)
    if err:
        raise Exception(err)

    return final_records

def register(name_to_fn):
    more_fns = {
        'custom_map': {
            'fn': custom_map,
            'ex': '',
            'desc': ''
        }
    }
    name_to_fn.update(more_fns)

The above function can be invoked with the custom formula custom_map(CLIENTS, INPUT_COL).
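One way to check the page-selection logic of a custom map function before running it in a flow is to call it with stand-in objects. The sketch below is only that: FakeIBFile, FakeClients, FakeParsedIBOCR, and CONFIG_PATH are hypothetical, they mirror just the methods used in the example above, and joined records are represented as plain strings rather than IBOCRRecords.

```python
CONFIG_PATH = 'search_config.txt'  # hypothetical path, for local testing only


class FakeIBFile:
    """Hypothetical stand-in for clients.ibfile; returns (content, err)."""
    def __init__(self, files):
        self._files = files

    def read_file(self, path):
        if path not in self._files:
            return None, 'file not found: ' + path
        return self._files[path], None


class FakeClients:
    """Hypothetical stand-in for the clients argument."""
    def __init__(self, files):
        self.ibfile = FakeIBFile(files)


class FakeParsedIBOCR:
    """Hypothetical stand-in that mirrors the three methods used below."""
    def __init__(self, pages):
        self._pages = pages

    def get_num_records(self):
        return len(self._pages)

    def get_text_at_record(self, i):
        return self._pages[i], None

    def get_joined_records_by_ranges(self, ranges):
        # Ranges are inclusive (start, end) page indexes.
        return ['\n'.join(self._pages[s:e + 1]) for s, e in ranges], None


def custom_map(clients, parsed_ibocr, **kwargs):
    content, err = clients.ibfile.read_file(CONFIG_PATH)
    if err:
        raise Exception(err)
    search_term = content.decode()

    record_ranges = []
    for i in range(parsed_ibocr.get_num_records()):
        page_text, err = parsed_ibocr.get_text_at_record(i)
        if err:
            raise Exception(err)
        if search_term in page_text:
            record_ranges.append((i, i))

    final_records, err = parsed_ibocr.get_joined_records_by_ranges(record_ranges)
    if err:
        raise Exception(err)
    return final_records


clients = FakeClients({CONFIG_PATH: b'Invoice'})
doc = FakeParsedIBOCR(['Cover sheet', 'Invoice 001', 'Notes', 'Invoice 002'])
records = custom_map(clients, doc)
print(records)  # ['Invoice 001', 'Invoice 002']
```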

Clear metadata

Metadata is required for some features in review OCR. If this metadata causes increased storage consumption on your system, you can clear the metadata by specifying True.

Page matcher config

If you select the mapping type Multiple Records: Split by page matcher, use this field to specify JSON that identifies text strings on the first and last page of the record.

For example, consider this page matcher config. Each expression takes INPUT_COL as a reference to one page of text and returns a boolean that indicates whether the page matches:

{
  "first_page_expr": "CONTAINS(INPUT_COL, 'Social security tips')",
  "end_page_expr": "CONTAINS(INPUT_COL, 'Earned income credit')"
}

And given a document that contains these pages:

  • Page 1: “unmatched text”

  • Page 2: “Social security tips …”

  • Page 3: “Earned income credit …”

  • Page 4: ""

  • Page 5: ""

  • Page 6: “Social security tips …”

  • Page 7: ""

  • Page 8: “Earned income credit …”

The output is two records: pages 2-3 and pages 6-8.
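The grouping in this example can be reproduced with a short sketch. It assumes a record opens on the first page that matches first_page_expr, closes on the next page that matches end_page_expr, and that pages outside any open record are skipped. The split_by_page_matcher helper and the handling of a record that is never closed (dropped here) are assumptions, not documented behavior.

```python
def split_by_page_matcher(pages, is_first_page, is_end_page):
    """Return (first_page, last_page) ranges, using 1-based page numbers."""
    records = []
    start = None
    for number, text in enumerate(pages, start=1):
        if start is None and is_first_page(text):
            start = number
        if start is not None and is_end_page(text):
            records.append((start, number))
            start = None
    # Assumption: a record that never reaches an end page is dropped.
    return records


pages = [
    'unmatched text',
    'Social security tips ...',
    'Earned income credit ...',
    '',
    '',
    'Social security tips ...',
    '',
    'Earned income credit ...',
]

records = split_by_page_matcher(
    pages,
    is_first_page=lambda text: 'Social security tips' in text,
    is_end_page=lambda text: 'Earned income credit' in text,
)
print(records)  # [(2, 3), (6, 8)]
```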

Input variables

These input variables are available in map records formulas:

  • INPUT_COL — Text string. For custom formulas, contains the text of all pages concatenated together. For page matcher configs, contains the text of one page.