Libraries for UDFs

Use Instabase libraries and objects to read files and manipulate data in UDFs.

Reading the Flow output

Use the load_from_str function from the ParsedIBOCRBuilder class to read the intermediary Flow output:

from instabase.ocr.client.libs import ibocr

def custom_function_fn(content, input_filepath, clients, 
                       root_output_folder, *args, **kwargs):
  builder, err = ibocr.ParsedIBOCRBuilder.load_from_str(input_filepath, content)
  if err:
    raise IOError(u'Could not load file: {}'.format(input_filepath))

The ParsedIBOCRBuilder provides access to the underlying records using these interfaces:

def custom_function_fn(content, input_filepath, clients, 
                       root_output_folder, *args, **kwargs):
  builder, err = ibocr.ParsedIBOCRBuilder.load_from_str(input_filepath, content)
  if err:
    raise IOError(u'Could not load file: {}'.format(input_filepath))

  for ibocr_record in builder.get_ibocr_records():
    text = ibocr_record.get_text()
    lines = ibocr_record.get_lines()
    # get_refined_phrases returns a (phrases, error) tuple
    refined_phrases, err = ibocr_record.get_refined_phrases()
    # ExternalFunction is a placeholder for your own processing logic
    ExternalFunction(text, lines, refined_phrases)

Write and modify ibocr_records

Use the ParsedIBOCRBuilder library to mutate the ibocr_records with the IBOCRRecordBuilder class. After the records are mutated, use the serialize_to_string function to get the serialized string that can be returned in the UDF response.

The general use pattern is:

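# load_from_str returns a (builder, error) tuple
parsed_builder, err = ParsedIBOCRBuilder.load_from_str(input_filepath, content)
if err:
  raise IOError(u'Could not load file: {}'.format(input_filepath))

for i, record in enumerate(parsed_builder.get_ibocr_records()):
  ibocr_record_builder = record.as_builder()
  # modify_text is a placeholder for your own mutation logic
  text = modify_text(record.get_text())
  ibocr_record_builder.set_text(text)
  # Convert the builder back to a record and store it at the same index
  new_record, err = ibocr_record_builder.as_record()
  if err:
    raise ValueError(u'Could not build record {}: {}'.format(i, err))
  parsed_builder.set_ibocr_record(i, new_record)

return parsed_builder.serialize_to_string()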

Classes and objects

The library, classes, objects, and methods are:

  • ParsedIBOCRBuilder

  • IBOCRRecord

  • IBOCRPageMetadata

  • IBOCRRecordLayout

  • IBOCRRecordBuilder

  • WordPolyDict

  • RefinedPhrases

  • ParsedIBOCR

  • Runtime Config

  • IBFile

ParsedIBOCRBuilder

Use the ParsedIBOCRBuilder library to read and modify the contents of an IBOCR file. The library also provides a convenience function, serialize_to_string, that serializes the parsed IBOCR to a string.

The ParsedIBOCRBuilder reads the contents of the file and gives access to all the underlying records:

  • Use the IBOCRRecord interface to read the records.
  • Use the IBOCRRecordBuilder interface to modify the records.

class ParsedIBOCRBuilder(object):

  @staticmethod
  def load_from_str(filepath, content):
    # type: (Text, bytes) -> Tuple[ParsedIBOCRBuilder, Text]
    """Given a filepath and its corresponding contents, construct a new
    `ParsedIBOCRBuilder`.

    Returns an error in case the construction fails.
    """
    pass

  def __len__(self):
    # type: () -> int
    """
    Returns the number of IBRecords present in this builder.
    """
    pass

  def get_ibocr_records(self):
    # type: () -> List[IBOCRRecord]
    """
    Returns the list of IBRecords present in this builder.
    """
    pass

  def get_ibocr_record(self, index):
    # type: (int) -> IBOCRRecord
    """
    Returns the IBRecord present at `index`.  If the index is out of bounds
    this function returns None.
    """
    pass

  def set_ibocr_record(self, index, record):
    # type: (int, IBOCRRecord) -> ParsedIBOCRBuilder
    """
    Sets the IBRecord at `index`.  If the index is out of bounds, this function
    is a no-op.
    """
    pass

  def add_ibocr_records(self, ibocr_records):
    # type: (List[IBOCRRecord]) -> ParsedIBOCRBuilder
    """
    Adds the list of IBRecords to the builder.
    """
    pass    

  def serialize_to_string(self):
    # type: () -> bytes
    """Serializes the content of the ParsedIBOCRBuilder."""
    pass
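
As a minimal sketch of reading through a builder, the helper below groups record text by classification label. The helper name `group_text_by_label` and the `'UNCLASSIFIED'` fallback key are hypothetical; the sketch relies only on get_ibocr_records from the interface above and on get_text and get_class_label from the IBOCRRecord interface described in the next section.

```python
from collections import defaultdict


def group_text_by_label(builder, unlabeled_key='UNCLASSIFIED'):
  """Groups the text of every record in a builder by its class label.

  `builder` is any object exposing get_ibocr_records(), such as a
  ParsedIBOCRBuilder. `unlabeled_key` is a hypothetical fallback used for
  records whose get_class_label() returns None.
  """
  grouped = defaultdict(list)
  for record in builder.get_ibocr_records():
    label = record.get_class_label()
    if label is None:
      label = unlabeled_key
    grouped[label].append(record.get_text())
  return dict(grouped)
```

Because the helper only reads from the builder, it can run before or after any mutation step in the same UDF.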

IBOCRRecord

Provides accessors to the underlying records in the Flow output.

class IBOCRRecord(object):
  def get_text(self):
    # type: () -> Text
    """
    Returns the text stored in the IBOCRRecord.
    """

  def get_lines(self):
    # type: () -> List[List[WordPolyDict]]
    """
    Returns the lines stored in the IBOCRRecord.
    """

  def get_metadata_list(self):
    # type: () -> List[IBOCRPageMetadata]
    """
    Returns the metadata list associated with the IBOCRRecord.
    """

  def get_refined_phrases(self):
    # type: () -> Tuple[List[RefinedPhrase], Text]
    """
    Returns a tuple of the refined phrases stored inside the IBOCRRecord
    and an error, if one occurred.
    """

  def get_class_label(self):
    # type: () -> Text
    """
    Returns the classification label this record was classified with.
    If no classification took place, this function returns None.
    """

IBOCRPageMetadata

Provides access to the metadata in a page of IBOCR output. This object is one element of the list returned from a call to .get_metadata_list() on an IBOCRRecord object.

class IBOCRPageMetadata(object):
  def get_layout(self):
    # type: () -> IBOCRRecordLayout
    return IBOCRRecordLayout(self._d['layout'])

IBOCRRecordLayout

Provides access to the record layout of a page of IBOCR output, including image dimensions and path to the processed image. This object is returned from a call to .get_layout() on an IBOCRPageMetadata object.

class IBOCRRecordLayout(object):

  def __init__(self, d):
    # type: (_LayoutDict) -> None
    self._d = d

  def get_width(self):
    # type: () -> float
    return self._d.get('width')

  def get_height(self):
    # type: () -> float
    return self._d.get('height')

  def get_processed_image_path(self):
    # type: () -> Text
    return self._d.get('processed_image_path')

  def get_is_image_page(self):
    # type: () -> bool
    return self._d.get('is_image_page')

  def as_dict(self):
    # type: () -> _LayoutDict
    return self._d
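
To show how IBOCRPageMetadata and IBOCRRecordLayout fit together, here is a small sketch that collects the layout of every page in a record. The helper name `summarize_pages` and the dictionary keys are hypothetical, and the sketch assumes the metadata list is ordered by page.

```python
def summarize_pages(record):
  """Returns one dict per page with its dimensions and image path.

  `record` is any object exposing get_metadata_list(), where each item
  exposes get_layout() as described above. The 'page_index' value assumes
  the metadata list is ordered by page.
  """
  pages = []
  for index, page_metadata in enumerate(record.get_metadata_list()):
    layout = page_metadata.get_layout()
    pages.append({
      'page_index': index,
      'width': layout.get_width(),
      'height': layout.get_height(),
      'is_image_page': layout.get_is_image_page(),
      'image_path': layout.get_processed_image_path(),
    })
  return pages
```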

IBOCRRecordBuilder

Use the IBOCRRecordBuilder to modify the Flow records. A common usage pattern is:

# load_from_str returns a (builder, error) tuple
parsed_builder, err = ParsedIBOCRBuilder.load_from_str(input_filepath, content)
for record in parsed_builder.get_ibocr_records():
  ibocr_record_builder = record.as_builder()
  # Mutate the record here using the setters listed below

You can set these attributes using IBOCRRecordBuilder:

class IBOCRRecordBuilder(object):
  def set_refined_phrases(self, refined_phrases):
    # type: (List[RefinedPhrase]) -> IBOCRRecordBuilder
    """
    Sets the refined phrases inside the IBOCRRecordBuilder.  Overrides any
    pre-existing refined phrases.
    """

  def add_refined_phrases(self, refined_phrases):
    # type: (List[RefinedPhrase]) -> IBOCRRecordBuilder
    """
    Appends the refined phrases to the pre-existing refined phrases
    inside the IBOCRRecordBuilder.
    """

  def set_text(self, text):
    # type: (Text) -> IBOCRRecordBuilder
    """
    Sets the text for this record.
    """

  def set_lines(self, lines):
    # type: (List[List[WordPolyDict]]) -> IBOCRRecordBuilder
    """
    Sets the lines for this record.
    """

  def set_from_deepcopy_of_record(self, record):
    # type: (IBOCRRecord) -> IBOCRRecordBuilder
    """
    Creates the builder with the copy of the various attributes present in the
    record.

    This is a deepcopy which means that the user of this function can safely
    modify the builder without affecting the original record.
    """

  def as_record(self):
    # type: () -> Tuple[IBOCRRecord, Text]
    """
    Returns the IBOCRRecord with the attributes set inside builder.

    If an error happens, reports it to the user.
    """

WordPolyDict

The WordPolyDict dictionary describes the metadata for extracted words with the following keys:

WordPolyDict = TypedDict('WordPolyDict', {
    'raw_word': Text,
    'word': Text,
    'line_height': float,
    'word_width': float,
    'char_width': float,
    'start_x': float,
    'end_x': float,
    'start_y': float,
    'end_y': float,
    'page': int,
    'confidence': IBWordConfidenceDict,
    'style': StyleData
})
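
Because get_lines returns plain lists of these dictionaries, ordinary Python is enough to work with them. The sketch below rebuilds the text of one line and computes a box enclosing its words. Both helper names are hypothetical, and the sketch assumes each word's start coordinate is no greater than its end coordinate.

```python
def line_text(line):
  """Joins the 'word' values of one line (a list of WordPolyDict)."""
  return ' '.join(word['word'] for word in line)


def line_bounding_box(line):
  """Returns (start_x, start_y, end_x, end_y) enclosing every word in a line.

  Returns None for an empty line.
  """
  if not line:
    return None
  return (
    min(word['start_x'] for word in line),
    min(word['start_y'] for word in line),
    max(word['end_x'] for word in line),
    max(word['end_y'] for word in line),
  )
```

Joining with a single space discards the original horizontal spacing, so treat the result as an approximation of the line rather than an exact reconstruction.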

RefinedPhrases

RefinedPhrases are phrases that are set by Refiner. You can use IBOCRRecordBuilder to create your own phrases and append them to the record.

class RefinedPhrase(object):
  def __init__(self, json_dict):
    # type: (Dict) -> None
    """
    Initialize the refined phrase with a dictionary.  Usually the clients
    should use an empty dictionary.
    """

  def get_column_name(self):
    # type: () -> Text
    """
    Get the column name for the refined phrase.
    """

  def set_column_name(self, val):
    # type: (Text) -> None
    """
    Set the column name for the refined phrase.
    """

  def get_column_value(self):
    # type: () -> Text
    """
    Get the column value for the refined phrase.
    """

  def set_column_value(self, new_val):
    # type: (Text) -> None
    """
    Set the column value for the refined phrase.
    """

  def get_is_edited(self):
    # type: () -> bool
    """
    Whether this phrase was manually edited.
    """

  def get_formula_text(self):
    # type: () -> Text
    """
    The formula used to generate the refined phrase.
    """

  def get_registered_return_type(self):
    # type: () -> Text
    """
    The registered return type for the refined phrase.
    """

  def get_page_index(self):
    # type: () -> int
    """
    The page from which the refined phrase was extracted.
    """

  def get_extracted_pos(self):
    # type: () -> Dict[Text, List[List[int]]]
    """
    A dictionary with one key, 'pixels', which is a list of regions in
    the image that contains the extracted value, represented as
    [top left X coordinate, top left Y coordinate, bottom right X coordinate,
     bottom right Y coordinate, page number]
    """

  def get_information_pos(self):
    # type: () -> Dict[Text, List[List[int]]]
    """
    A dictionary with one key, 'pixels', which is a list of regions in
    the image that contains information used to get the extracted value,
    represented as [top left X coordinate, top left Y coordinate, bottom right X
    coordinate, bottom right Y coordinate, page number]
    """

  def get_char_confidence(self):
    # type: () -> float
    """
    The confidence score for the refined phrase.
    """

  def get_has_unsure_ex(self):
    # type: () -> bool
    pass

  def get_has_unsure_info(self):
    # type: () -> bool
    pass

  def get_was_frozen(self):
    # type: () -> bool
    pass

  def get_was_best_effort_tracked(self):
    # type: () -> bool
    pass
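
The getters and setters above can be combined as in the following sketch, which reads phrases into a dictionary and appends a new phrase to a record builder. The helper names are hypothetical. The RefinedPhrase class is passed in as the `phrase_cls` parameter so that the sketch does not assume an import path.

```python
def phrases_to_dict(refined_phrases):
  """Maps each phrase's column name to its column value."""
  return {p.get_column_name(): p.get_column_value() for p in refined_phrases}


def append_phrase(record_builder, phrase_cls, column_name, column_value):
  """Creates a phrase and appends it to the record builder.

  `phrase_cls` is expected to be the RefinedPhrase class; it is a
  parameter here so this sketch does not assume where it is imported from.
  """
  phrase = phrase_cls({})  # Clients usually start from an empty dictionary
  phrase.set_column_name(column_name)
  phrase.set_column_value(column_value)
  record_builder.add_refined_phrases([phrase])
  return phrase
```

If two phrases share a column name, phrases_to_dict keeps only the last one, which may or may not be what you want.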

ParsedIBOCR object

To reference the ParsedIBOCR object in UDFs, use:

from instabase.ocr.client.libs.ibocr import ParsedIBOCR

The interface for the object looks like:

class ParsedIBOCR(object):

  @staticmethod
  def load_from_str(full_path, txt):
    # type: (Text, Union[bytes, str]) -> Tuple[ParsedIBOCR, Text]
    pass

  def get_num_records(self):
    # type: () -> int
    pass

  def get_ibocr_records(self):
    # type: () -> List[IBOCRRecord]
    pass

Each record within the IBOCR has the following interface:

IBOCRRecord

class IBOCRRecord(object):

  def get_text(self):
    # type: () -> Text
    pass

  def get_lines(self):
    # type: () -> List[List[WordPolyDict]]
    pass

Each word per line contains the following information:

WordPolyDict = TypedDict('WordPolyDict', {
  'raw_word': Text,
  'word': Text,
  'line_height': float,
  'word_width': float,
  'char_width': float,
  'start_x': float,
  'end_x': float,
  'start_y': float,
  'end_y': float,
  'page': int
})

Runtime Config

Runtime Config is a Dict[Text, Text] set of key-value pairs that are passed at runtime into a flow binary. These variables can then be used to dynamically change behavior in your Flow. An example runtime config is:

{"key1": "val1", "key2": "val2"}
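
Since every runtime config value is text, a UDF usually needs to interpret it before use. The sketch below reads a boolean-like flag from such a dictionary. The helper name `get_config_flag` and the accepted spellings of "true" are assumptions of this sketch, and it takes the dictionary as a plain argument because how the runtime config reaches a UDF is not covered here.

```python
def get_config_flag(runtime_config, key, default=False):
  """Reads a boolean-like flag from a runtime config dictionary.

  Runtime config values are text, so 'true', '1', and 'yes' (in any case)
  are treated as True here. That convention is an assumption of this
  sketch, not something the platform defines.
  """
  if not runtime_config or key not in runtime_config:
    return default
  return runtime_config[key].strip().lower() in ('true', '1', 'yes')
```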

IBFile

The ibfile object is an Instabase FileHandle reference that provides pre-authenticated access to file operations. All operations done with the ibfile object have the same permissions as the user that invoked the operation.

The methods of the ibfile object are:

is_file(complete_path)
Input type: Text
Output type: bool

is_dir(complete_path)
Input type: Text
Output type: bool

exists(complete_path)
Input type: Text
Output type: bool

list_dir(path, start_page_token)
Input type: Text, Text
Output type: Tuple[ListDirInfo, str]

mkdir(complete_path)
Input type: Text
Output type: Tuple[MkdirResp, str]

copy(complete_path, new_complete_path, force=False)
Input type: Text, Text, bool
Output type: Tuple[CopyResp, str]

rm(complete_path, recursive=True, force=True)
Input type: Text, bool, bool
Output type: Tuple[RmResp, str]

open(path, mode='r')
Input type: Text, str
Output type: IBFileBase

read_file(file_path)
Input type: Text
Output type: Tuple[str, Text]

write_file(file_path, content)
Input type: Text, str
Output type: Tuple[bool, Text]

Here are the object definitions for the returned objects:

enum StatusCode {
  OK = 1

  # connect exceptions
  MISSING_PARAM = 2

  # file errors
  READ_ERROR = 4
  WRITE_ERROR = 5

  # missing file or directory
  NONEXISTENT = 7

  # general exceptions
  FAILURE = 3
  NO_MOUNT_DETAILS = 6
  ACCESS_DENIED = 8
}

class Status(object):
  def __init__(self, code, msg):
    # type: (StatusCode, str) -> None
    self.code = code
    self.msg = msg

class MkdirResp(object):
  def __init__(self, status):
    # type: (Status) -> None
    self.status = status

class RmResp(object):
  def __init__(self, status):
    # type: (Status) -> None
    self.status = status

class CopyResp(object):
  def __init__(self, status):
    # type: (Status) -> None
    self.status = status

class NodeInfo(object):
  def __init__(self, name, path, full_path, node_type):
    # type: (Text, Text, Text, Text) -> None
    self.name = name # Name of the file or folder resource
    self.path = path # Path relative to the mounted repo
    self.full_path = full_path # Path including the location of the mounted repo
    self._type = node_type # Type of node, such as 'file' or 'folder'

class ListDirInfo(object):
  def __init__(self, nodes, start_page_token, next_page_token, has_more):
    # type: (List[NodeInfo], Text, Text, bool) -> None
    self.nodes = nodes # List of nodes in the directory
    self.start_page_token = start_page_token # Number representing the start page
    self.next_page_token = next_page_token # Number representing the start of the next page
    self.has_more = has_more # Is true if not all directory contents have been listed

VALID_MODES = frozenset([
    # Read only modes
    'r',
    'rU',
    'rb',
    'rbU',

    # Writeable
    'r+',
    'rb+',
    'r+c',
    'rb+c',
    'w',
    'w+',
    'wb',
    'wb+',
    'wc',
    'w+c',
    'wbc',
    'wb+c',
    'a',
    'a+',
    'ab',
    'ab+',
    'ac',
    'a+c',
    'abc',
    'ab+c'
])

class IBFileBase(object):
  def __init__(self, path, mode):
    # type: (Text, Text) -> None
    self.path = path # Relative path to the file
    self._mode = mode # One of VALID_MODES
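
The has_more and next_page_token fields of ListDirInfo suggest that large directories are listed one page at a time. The following sketch collects every node by following those tokens. The helper name `list_all_nodes` is hypothetical, and the sketch assumes that an empty string requests the first page and that next_page_token can be passed back as start_page_token.

```python
def list_all_nodes(ibfile, path):
  """Collects every node under `path`, following pagination tokens.

  Assumes an empty string requests the first page and that
  next_page_token can be passed back as start_page_token.
  """
  nodes = []
  token = ''
  while True:
    info, err = ibfile.list_dir(path, token)
    if err:
      raise IOError(u'Could not list {}: {}'.format(path, err))
    nodes.extend(info.nodes)
    if not info.has_more:
      return nodes
    token = info.next_page_token
```

Inside a UDF, this would typically be called as list_all_nodes(clients.ibfile, some_folder).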

Sample UDF

This sample UDF uses ibfile as a member variable of a clients variable:

import logging

from instabase.ocr.client.libs import ibocr

def do_nothing_udf(content, input_filepath, **kwargs):
  builder, err = ibocr.ParsedIBOCRBuilder.load_from_str(input_filepath, content)
  if err:
    raise IOError(u'Could not load file: {}'.format(input_filepath))

  out = builder.serialize_to_string()
  return out

def do_nothing_out_dir_udf(content, input_filepath, clients, out_dir, **kwargs):
  builder, err = ibocr.ParsedIBOCRBuilder.load_from_str(input_filepath, content)
  if err:
    raise IOError(u'Could not load file: {}'.format(input_filepath))
  logging.info('Writing outdir file...')
  outdir_path = out_dir + '/outdir.txt'
  resp, err = clients.ibfile.write_file(outdir_path, out_dir)
  if err:
    raise IOError(u'Could not write file: {}'.format(outdir_path))

  out = builder.serialize_to_string()
  return out

def test_is_file(clients, complete_path):
  is_file = clients.ibfile.is_file(complete_path)
  logging.info('Is file at path {}: {}'.format(complete_path, is_file))

def test_is_dir(clients, complete_path):
  is_dir = clients.ibfile.is_dir(complete_path)
  logging.info('Is dir at path {}: {}'.format(complete_path, is_dir))

def test_exists(clients, complete_path):
  exists = clients.ibfile.exists(complete_path)
  logging.info('Exists at path {}: {}'.format(complete_path, exists))

def test_list_dir(clients, path, start_page_token):
  list_dir_info, err = clients.ibfile.list_dir(path, start_page_token)
  if err:
    logging.info('ERROR list_dir at path {}: {}'.format(path, err))
  logging.info('List dir at path {}: {}'.format(path, list_dir_info))
  for node in list_dir_info.nodes:
    logging.info('Node {}'.format(node.as_dict()))
  return list_dir_info.nodes

def test_mkdir(clients, complete_path):
  mkdir, err = clients.ibfile.mkdir(complete_path)
  logging.info('Mkdir at path {}: {}'.format(complete_path, mkdir.status))

def test_copy(clients, complete_path, new_complete_path):
  copy, err = clients.ibfile.copy(complete_path, new_complete_path)
  logging.info('Copy at path {}: {}'.format(complete_path, copy.status))

def test_rm(clients, complete_path):
  rm, err = clients.ibfile.rm(complete_path)
  logging.info('Rm at path {}: {}'.format(complete_path, rm.status))

def test_read_file(clients, complete_path):
  # read_file returns a (content, error) tuple
  content, err = clients.ibfile.read_file(complete_path)
  if err:
    raise IOError(u'Could not read file: {}'.format(complete_path))
  logging.info('Read file at path {}: {}'.format(complete_path, len(content)))
  return content

def test_write_file(clients, complete_path, data):
  # write_file returns a (success, error) tuple
  success, err = clients.ibfile.write_file(complete_path, data)
  logging.info('Write file at path {}: {}'.format(complete_path, success))

def test_file_ops_fn(val, root_out_folder, clients, **kwargs):
  test_is_file(clients, root_out_folder)
  test_is_dir(clients, root_out_folder)
  test_exists(clients, root_out_folder)

  nodes = test_list_dir(clients, root_out_folder, '')
  example_file_path = nodes[0].full_path
  example_file_name = nodes[0].name
  test_mkdir(clients, root_out_folder + '/test-folder/')
  test_copy(clients, example_file_path, root_out_folder + '/{}.copy.1'.format(example_file_name))
  test_copy(clients, example_file_path, root_out_folder + '/{}.copy.2'.format(example_file_name))
  test_rm(clients, root_out_folder + '/{}.copy.2'.format(example_file_name))
  example_text = test_read_file(clients, example_file_path)
  test_write_file(clients, example_file_path, example_text)

  return val

def register(name_to_fn):
  more_fns = {
    'do_nothing_udf': {
      'fn': do_nothing_udf,
      'ex': '',
      'desc': ''
    },
    'do_nothing_out_dir_udf': {
      'fn': do_nothing_out_dir_udf,
      'ex': '',
      'desc': ''
    },
    'test_file_ops': {
      'fn': test_file_ops_fn,
      'ex': '',
      'desc': ''
    }
  }
  name_to_fn.update(more_fns)
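
The mkdir, copy, and rm helpers in the sample log the returned status but never check it. One possible way to fail fast is sketched below. The helper name `ensure_ok` is hypothetical, and the sketch assumes that status.code compares equal to the integer value of StatusCode.OK shown in the definitions above.

```python
STATUS_OK = 1  # Mirrors StatusCode.OK in the definitions above


def ensure_ok(resp, err, action):
  """Raises IOError unless a file operation succeeded.

  `resp` is a MkdirResp, CopyResp, or RmResp; `err` is the error text
  returned alongside it; `action` is a label used in the message.
  Assumes resp.status.code compares equal to the integer 1 on success.
  """
  if err:
    raise IOError(u'{} failed: {}'.format(action, err))
  if resp is None or resp.status.code != STATUS_OK:
    msg = resp.status.msg if resp is not None else u'no response'
    raise IOError(u'{} failed: {}'.format(action, msg))
  return resp
```

For example, after resp, err = clients.ibfile.mkdir(path), calling ensure_ok(resp, err, 'mkdir') either returns the response or raises with the reported message.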

REFINER_FNS object

The REFINER_FNS object provides an API for executing Refiner functions within a UDF. The object supports the following methods:

call(fn_name, *args, **kwargs)
Input type: Text, *Any, **Any
Output type: Tuple[Any, Text]

call_v(fn_name, *args, **kwargs)
Input type: Text, *Value[Any], **Any
Output type: Tuple[Value, Text]

The call method takes the case-sensitive name of a Refiner function, followed by the positional and keyword arguments for that function. The result is a tuple: the first item is the result, and the second item is an error (if one occurred).

The call_v method is similar to call, but provides provenance tracking. The first argument is the case-sensitive name of a Refiner function, and the remaining arguments are the positional and keyword arguments for the provenance-tracked function (such as Value objects). The result is a tuple: the first item is the provenance-tracked result, and the second item is an error (if one occurred).

Here is a sample UDF that uses this object:

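import json

# Assumed import path for the provenance-tracking Value wrapper
from instabase.provenance.tracking import Value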

# This function applies SCAN_RIGHT and CLEAN on the input IBOCR record
def demo_refiner_fns(val, out_dir, refiner, **kwargs):
  # type: (Text, Text, RefinerFns, **Any) -> Text

  # Example: Load text of the first record
  text = json.loads(val)[0]['text']

  # Use refiner functions to extract subtotal (scan then clean)
  scan_result, err = refiner.call('scan_right', text, 'Subtotal', ignorecase=True)
  if err:
    return err

  result, err = refiner.call('clean', scan_result)
  if err:
    return err

  return result

def demo_refiner_fns_v(text, refiner_fns, **kwargs):
  # type: (Value[Text], Value[RefinerFns], **Any) -> Value[Text]

  # Use refiner functions to extract subtotal (scan then clean)
  scan_result, err = refiner_fns.value().call_v('scan_right', text, Value('Subtotal'), ignorecase=Value(True))
  if err:
    return Value(err)

  result, err = refiner_fns.value().call_v('clean', scan_result)
  if err:
    return Value(err)

  return result
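
Both sample functions repeat the same call-then-check-error step. A small wrapper can express that chain once, as in the sketch below. The helper name `call_chain` and the shape of the `steps` argument are hypothetical; the only API it relies on is the call method described above.

```python
def call_chain(refiner, value, steps):
  """Feeds `value` through several Refiner functions in order.

  Each step is a (fn_name, extra_args, kwargs) tuple. The running value is
  passed as the first positional argument of every call. Stops at the
  first error and returns (None, err).
  """
  for fn_name, extra_args, kwargs in steps:
    value, err = refiner.call(fn_name, value, *extra_args, **kwargs)
    if err:
      return None, err
  return value, None


# Equivalent to the scan-then-clean sequence in demo_refiner_fns:
#   result, err = call_chain(refiner, text, [
#     ('scan_right', ('Subtotal',), {'ignorecase': True}),
#     ('clean', (), {}),
#   ])
```

Returning the same (result, error) tuple shape as call keeps the wrapper interchangeable with a direct call.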

TokenFrameworkRegistry object

The TokenFrameworkRegistry object provides the following methods for looking up a registered TokenMatcher or Tokenizer by name:

class TokenFrameworkRegistry(object):

  def lookup_matcher(self, name):
    # type: (Text) -> TokenMatcher
    return self.token_matcher_registry.lookup_matcher(name)

  def lookup_tokenizer(self, name):
    # type: (Text) -> Tokenizer
    return self.tokenizer_registry.lookup_tokenizer(name)

Logger object

The Logger object can be used to log within the UDF. These logs can be accessed via the Flow Dashboard.

Note: Starting with version 21.9.0, you can use Python's logging library directly to log within a UDF.

class Logger(object):

  def debug(self, msg, *args, **kwargs):
    # type: (Text, *Any, **Any) -> None
    pass

  def info(self, msg, *args, **kwargs):
    # type: (Text, *Any, **Any) -> None
    pass

  def warning(self, msg, *args, **kwargs):
    # type: (Text, *Any, **Any) -> None
    pass

  def error(self, msg, *args, **kwargs):
    # type: (Text, *Any, **Any) -> None
    pass

  def critical(self, msg, *args, **kwargs):
    # type: (Text, *Any, **Any) -> None
    pass

FnContext object

The FnContext object contains references to input variables and certain clients which can be accessed during UDF execution time.

To access the FnContext from your UDF, use kwargs.get('_FN_CONTEXT_KEY'):

def custom_fn(val, out_dir, clients, **kwargs):
  fn_context = kwargs.get('_FN_CONTEXT_KEY')
  logger, err = fn_context.get_by_col_name('LOGGER')
  if logger:
    logger.info('Out dir is {}'.format(out_dir))
  return u'final val'

In addition to the LOGGER keyword, you can also access the REFINER_FNS object and the CLIENTS object from the FnContext object.

The get_cached_object and set_cached_object functions belong to the FnContext, which is shared across all columns of one record within Refiner.

Use these functions to pass information between UDFs that run on the same record:

def get_cached_object(self, obj_id):
  # type: (Text) -> Tuple[Any, Text]
  if not self._object_cache:
    return None, u'No object cache was provided to this function context'
  return self._object_cache.get(obj_id), None

def set_cached_object(self, obj_id, obj):
  # type: (Text, Any) -> Tuple[bool, Text]
  if not self._object_cache:
    return False, u'No object cache was provided to this function context'
  self._object_cache.set(obj_id, obj)
  return True, None
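
A common use of these two functions is to compute an expensive object once per record and reuse it in later columns. The sketch below shows one way to do that. The helper name `get_or_compute` is hypothetical, and the sketch assumes that a cache miss is reported as a None value with no error.

```python
def get_or_compute(fn_context, obj_id, compute):
  """Returns the cached object for `obj_id`, computing it on a miss.

  `compute` is a zero-argument callable. If no cache is available, the
  value is still computed and returned, just not stored. A cached value of
  None is treated as a miss.
  """
  cached, err = fn_context.get_cached_object(obj_id)
  if not err and cached is not None:
    return cached
  value = compute()
  if not err:
    # Only try to store the value when the cache lookup itself worked
    fn_context.set_cached_object(obj_id, value)
  return value
```

Falling back to a plain computation when no cache is provided keeps the UDF working in contexts where the FnContext has no object cache.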

The FnContext object looks like:

class FnContext(object):

  def get_ibfile(self):
    # type: () -> IBFile
    # Returns a reference to the IBFile object for performing file operations.
    pass

  def get_by_col_name(self, col_name):
    # type: (Text) -> Tuple[Any, Text]
    # Returns a tuple. If the column name exists as part of the 
    # function call, then it returns col_val, None
    # If the column name does not exist, it returns None, 
    # u'Could not find column names "LOGGER"'
    #
    # This function is useful when you are writing UDFs which need to 
    # work across many versions of Instabase.
    # As Instabase adds more features, we sometimes add input columns 
    # which are available in a new version which are not available in 
    # older versions. To get around this you can use:
    #
    #  fn_context = kwargs.get('_FN_CONTEXT_KEY')
    #  logger, err = fn_context.get_by_col_name('LOGGER')
    #  if logger:
    #    # perform some operation now that the logger is available.
    pass
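
The version-tolerant lookup described in the comments above can be wrapped in a helper that falls back to Python's logging library when no LOGGER column is available. This is a minimal sketch: the helper name `get_logger` and the fallback logger name `'udf'` are hypothetical.

```python
import logging


def get_logger(kwargs):
  """Returns the Flow logger if available, else a standard Python logger.

  `kwargs` is the keyword-argument dictionary received by the UDF. The
  fallback logger name 'udf' is a hypothetical choice for this sketch.
  """
  fn_context = kwargs.get('_FN_CONTEXT_KEY')
  if fn_context is not None:
    logger, err = fn_context.get_by_col_name('LOGGER')
    if logger and not err:
      return logger
  return logging.getLogger('udf')
```

A UDF would then call get_logger(kwargs).info(...) without needing to know which kind of logger it received, since both expose the same debug, info, warning, error, and critical methods.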