Scan Box

The scan_box Refiner function extracts text from a rectangle (box) in the image domain based on a label within that rectangle.

How to use Scan Box

Sample W2

In the image above, you can use scan_box to extract the employer’s name and address from the rectangle by using the label 'c Employer\'s name'. Make sure that the label specified is unique, or restrict the input text to the area surrounding your label. If the label is not unique, scan_box finds the box based on the first occurrence of the label, similar to other scan functions.

A best practice is to use the label argument with scan_box. The specified label can span one line at most.
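The two label constraints above (the label should be unique in the input text, and it can span one line at most) can be checked before calling scan_box. The following Python sketch is illustrative only and is not part of Refiner: the helper name check_label and its messages are hypothetical, and it assumes the input text is available as a plain string.

```python
from typing import List

def check_label(text: str, label: str) -> List[str]:
    """Return a list of problems with a candidate label (hypothetical helper)."""
    problems = []
    # The label can span one line at most.
    if "\n" in label:
        problems.append("label spans more than one line")
    # The label should occur exactly once in the input text.
    count = text.count(label)
    if count == 0:
        problems.append("label not found")
    elif count > 1:
        problems.append(
            f"label occurs {count} times; the first occurrence would be used"
        )
    return problems

sample_text = (
    "b Employer identification number (EIN)\n"
    "c Employer's name, address, and ZIP code\n"
    "The Big Company\n"
)
print(check_label(sample_text, "c Employer's name"))  # no problems reported
print(check_label(sample_text, "Employer"))           # ambiguous label
```

An empty list means the label is safe to pass as the label argument. A non-empty list suggests choosing a longer label or restricting the input text to the area around the label.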

The resulting text and the found box are provenance-tracked. In Refiner, you can view the box that was found for your label.

Example of basic usage

scan_box(INPUT_COL, label='c Employer\'s name')

Expected output:

  c Employer's name, address, and ZIP code
                 The    Big   Company
                 123   Main     Street
                 Anywhere,        PA    12345

Accepted arguments

The scan_box function accepts the same arguments as other scan functions that allow you to craft how you scan the input text.

The following arguments are unique to scan_box:

pixel_tolerance

The pixel_tolerance argument accepts an integer that specifies how many pixels a word can extend past the borders of the region box, or how many pixels the borders of the region box can lie past or before the found label.

By default, this value is set to 2.
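To make the idea of a pixel tolerance concrete, here is a minimal Python sketch of a tolerance-aware containment test. It is a conceptual illustration under stated assumptions, not the scan_box implementation: the Rect type, the word_in_box function, and the coordinates are all hypothetical, and coordinates are assumed to grow to the right and downward.

```python
from typing import NamedTuple

class Rect(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int

def word_in_box(word: Rect, box: Rect, pixel_tolerance: int = 2) -> bool:
    """Return True if the word lies inside the box, allowing it to
    overshoot each border by up to pixel_tolerance pixels."""
    return (
        word.left >= box.left - pixel_tolerance
        and word.top >= box.top - pixel_tolerance
        and word.right <= box.right + pixel_tolerance
        and word.bottom <= box.bottom + pixel_tolerance
    )

box = Rect(100, 200, 500, 320)
word = Rect(98, 210, 180, 230)  # starts 2 px left of the box border

print(word_in_box(word, box))                     # True with a tolerance of 2
print(word_in_box(word, box, pixel_tolerance=0))  # False with no tolerance
```

In this sketch, a larger tolerance admits words that slightly cross a border, which is the kind of small misalignment the argument is meant to absorb.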

exclude_label_line

The exclude_label_line argument accepts a boolean value. If set to true, the line that contains your label is excluded from the output.

By default, this value is false.

An example:

scan_box(INPUT_COL, label='c Employer\'s name', exclude_label_line=true)

The line that contains the label "c Employer's name" is removed from the returned output. The resulting output is:

                 The    Big   Company
                 123   Main     Street
                 Anywhere,        PA    12345
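The effect of exclude_label_line can be pictured as dropping the label's line from the extracted box text. The Python sketch below only mimics that behavior on a plain string and is not how Refiner implements it: drop_label_line is a hypothetical helper, and the sample text is simplified to single spaces.

```python
def drop_label_line(box_text: str, label: str) -> str:
    """Remove the first line that contains the label; keep the rest unchanged."""
    lines = box_text.split("\n")
    for i, line in enumerate(lines):
        if label in line:
            return "\n".join(lines[:i] + lines[i + 1:])
    # Label not present: return the text untouched.
    return box_text

box_text = (
    "c Employer's name, address, and ZIP code\n"
    "The Big Company\n"
    "123 Main Street\n"
    "Anywhere, PA 12345"
)

print(drop_label_line(box_text, "c Employer's name"))
```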

Enable line detection and OCR Config settings

For scan_box to work, you must enable line detection ("find_lines": true) in the Process Files settings. If you create a new Refiner project with the project creation wizard, enable line detection in the Process Files step of your post_install_script.flow, and then run the post-install script on your input folder. After that, your Refiner program is ready to use scan_box.

Enable the following OCR Config settings together with "find_lines", so that the resulting OCR Config looks like:

{
  "produce_word_metadata": true,
  "produce_metadata_list": true,
  "force_image_ocr": true,
  "write_converted_image": true,
  "find_lines": true
}
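A quick way to confirm that a saved OCR Config has all five settings enabled is to parse it and list any flag that is not true. The Python sketch below assumes the config is available as a JSON string. The missing_flags helper is hypothetical and not part of Refiner.

```python
import json
from typing import List

# The five settings shown in the OCR Config above.
REQUIRED_FLAGS = [
    "produce_word_metadata",
    "produce_metadata_list",
    "force_image_ocr",
    "write_converted_image",
    "find_lines",
]

def missing_flags(ocr_config_json: str) -> List[str]:
    """Return the required flags that are absent or not set to true."""
    config = json.loads(ocr_config_json)
    return [flag for flag in REQUIRED_FLAGS if config.get(flag) is not True]

ocr_config = """
{
  "produce_word_metadata": true,
  "produce_metadata_list": true,
  "force_image_ocr": true,
  "write_converted_image": true,
  "find_lines": true
}
"""

print(missing_flags(ocr_config))  # [] when every required flag is enabled
```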