Provenance Tracking

What is provenance tracking

Provenance tracking in Refiner functions and Refiner UDFs allows us to map the output of a function to where it came from in the original input document.

In Refiner, you can pass the output of one function as input to another function, such as clean(scan_right(INPUT_COL)). Provenance tracking allows us to map the final output of these functions to the corresponding text in the input document.
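To make the idea concrete, here is a minimal sketch in plain Python. It is not the Instabase implementation. It assumes a toy representation in which every character carries its index in the original document. TrackedText, from_document, text_after, and strip_spaces are hypothetical names standing in for a tracked value and for chained functions such as scan_right and clean.

```python
from typing import List, NamedTuple


class TrackedText(NamedTuple):
    """Toy tracked value: text plus, for each character, its index in the
    original document string (an assumed representation)."""
    text: str
    sources: List[int]


def from_document(doc: str) -> TrackedText:
    return TrackedText(doc, list(range(len(doc))))


def text_after(value: TrackedText, anchor: str) -> TrackedText:
    """Hypothetical scan-style step: keep everything after the anchor."""
    start = value.text.index(anchor) + len(anchor)
    return TrackedText(value.text[start:], value.sources[start:])


def strip_spaces(value: TrackedText) -> TrackedText:
    """Hypothetical cleaning step: trim surrounding spaces."""
    left = len(value.text) - len(value.text.lstrip(" "))
    right = len(value.text.rstrip(" "))
    return TrackedText(value.text[left:right], value.sources[left:right])


doc = "Net Pay:  1,250.00  "
result = strip_spaces(text_after(from_document(doc), "Net Pay:"))

# The surviving source indices still point into the original document.
first, last = result.sources[0], result.sources[-1]
located = doc[first:last + 1]
```

Because each step slices the text and its source indices together, the final output can still be located in the input after two chained calls.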

You can execute Refiner programs with or without provenance tracking (File > Settings > Enable provenance tracking). Enabling provenance tracking is recommended and is the default configuration, but it might add processing time and memory usage, depending on the complexity of your UDFs.

Writing a provenance-tracked UDF

Provenance-tracked UDFs have a different function signature from normal UDFs. All input and output arguments must be Value objects, which is our special class for tracking provenance information alongside the actual variable value.

Typically, this means there are two versions of a UDF: the provenance-tracked version, which expects input and output to be Value objects, and the non-provenance-tracked version, which expects input and output to be normal Python data types such as string, list, or dict. If your UDFs will only be used with provenance tracking, you don’t have to implement a non-provenance-tracked version.

Register your provenance-tracked UDF by adding the register_fn decorator above your function. In the example below, adding the decorator makes the function provenance-tracked and registers it under the function name, custom_function_fn_v.

from instabase.provenance.registration import register_fn

@register_fn
def custom_function_fn_v(content_value_obj, *args, **kwargs):
  pass

For more information on UDF registration, see Registering UDFs.

Compatibility with previously written UDFs

Provenance tracked functions from Instabase versions earlier than July 2020 must be updated to use Value object methods instead of directly manipulating provenance information, because we no longer guarantee that those underlying APIs won’t change. See Update earlier provenance tracked functions for details and examples of how to convert your functions.

instabase.provenance.tracking: Provenance APIs

These classes are used to access data and provenance information in provenance-tracked Refiner UDFs.

instabase.provenance.tracking.Value class

The Value class is a wrapper around a variable used in a provenance-tracked Refiner UDF that tracks provenance information so the final value can be mapped to text or regions in the original input document. The Value class consists of two parts:

  • the actual value the class holds, and

  • the tracker information, represented as a ProvenanceTracker class for text values or as an ImageProvenanceTracker class for image regions.

Basic usage in Refiner UDF:

from instabase.provenance.tracking import Value

def my_udf(name, **kwargs):
    name_string = name.value()
    greeting = 'Welcome ' + name_string
    greeting_value = Value(greeting)
    greeting_value.set_tracker(name.tracker().deepcopy())
    return greeting_value

The Value class has the following methods:

tracker

def tracker(self) -> ProvenanceTracker:
    """ Access provenance tracker.

    Returns a ProvenanceTracker object.
    """

set_tracker

def set_tracker(self, tracker: ProvenanceTracker) -> None:
    """ Set provenance tracker.

    Args:
        tracker: provenance tracker to set on this Value object
    """

image_tracker

def image_tracker(self) -> ImageProvenanceTracker:
    """ Access image provenance tracker.

    Returns an ImageProvenanceTracker object.
    """

set_image_tracker

def set_image_tracker(self, tracker: ImageProvenanceTracker) -> None:
    """ Set image provenance tracker.

    Args:
        tracker: image provenance tracker to set on this Value object
    """

value

def value(self) -> Any:
    """ Returns the raw value that this Value class holds.

    Return type depends on what the Value object holds.
    """

get_copy

def get_copy(self) -> Value:
    """ Returns a deep copy of this Value object and its provenance information.

    Returns a Value object.
    """

Call this method if you plan to make any modifications to the Value object. Modifying the input arguments of a Refiner UDF can cause unexpected side effects to subsequent fields in your Refiner program.
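The risk described above is ordinary Python aliasing. The sketch below illustrates it with a plain list rather than a Value object, so it is an analogy only. The function names are hypothetical, and copy.deepcopy stands in for what get_copy is described as doing.

```python
import copy
from typing import List


def add_total_in_place(amounts: List[int]) -> List[int]:
    """Anti-pattern: mutates the list the caller passed in."""
    amounts.append(sum(amounts))
    return amounts


def add_total_on_copy(amounts: List[int]) -> List[int]:
    """Safer: deep-copy the input first, then modify only the copy."""
    result = copy.deepcopy(amounts)
    result.append(sum(result))
    return result


shared = [10, 20]  # imagine a later field also reads this argument

safe = add_total_on_copy(shared)
shared_after_safe = list(shared)      # the input is untouched

unsafe = add_total_in_place(shared)
shared_after_unsafe = list(shared)    # the input now has an extra element
```

In a provenance-tracked UDF, the equivalent of the safer variant is to call get_copy() on the argument before changing it.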

freeze_tracker

def freeze_tracker(self) -> None:
    """ Freezes provenance tracking, so that any additional operations have no
    effect on the tracked regions.
    """

instabase.provenance.tracking.ProvenanceTracker class

The ProvenanceTracker class stores provenance information that connects any text values to text in the original document. Keep in mind that the Instabase document format (IBOCRRecord) represents text as both a string containing all the text in the document and as a list of words with bounding box information.

The ProvenanceTracker class preserves the relationship between text in the current Value object and the IBOCRRecord text. There is a separate ImageProvenanceTracker for tracking provenance for non-text values such as checkboxes.

When displaying provenance information in Refiner and Flow Review, we distinguish between “information” provenance and “extracted” provenance:

  • Information: regions of the document used to determine what to extract. For example, this might be the anchor word used for a scan_right function.

  • Extracted: regions of the document that directly reflect the extracted information. For example, if the Refiner field returns a date ‘November 2020’, extracted provenance would point to the piece of text in the document containing the date.

Basic usage in Refiner UDF:

from instabase.provenance.tracking import Value

def pay_period(start_date, end_date, **kwargs):
    # Compute pay period.
    pay_period = end_date.value() - start_date.value()
    pay_period_value = Value(pay_period)

    # Store provenance from start_date and end_date as information provenance on
    # pay_period.
    pay_period_tracker = start_date.tracker().deepcopy()
    pay_period_tracker.convert_to_informational()
    pay_period_tracker.insert_information_from(end_date.tracker())
    pay_period_value.set_tracker(pay_period_tracker)
    return pay_period_value

The ProvenanceTracker class has the following methods:

convert_to_informational

def convert_to_informational(self) -> None:
    """ Converts all extracted provenance to information provenance.
    """

insert_information_from

def insert_information_from(self, other_tracker: ProvenanceTracker) -> None:
    """ Adds the information and extracted provenance from other_tracker to this
    tracker as information provenance.

    Args:
        other_tracker: provenance tracker whose provenance information you want
          to add to this tracker
    """

deepcopy

def deepcopy(self) -> ProvenanceTracker:
    """ Makes a deep copy of this tracker.

    Returns copied ProvenanceTracker object.
    """

You should always make a copy of a ProvenanceTracker before modifying it.
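The interplay of these three methods can be sketched with a toy tracker. This is an illustration under stated assumptions, not the real ProvenanceTracker: ToyTracker is a hypothetical class, and regions are represented as simple (start, end) character offsets, which is almost certainly simpler than the real internal format.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Tuple

Region = Tuple[int, int]  # assumed toy representation: (start, end) offsets


@dataclass
class ToyTracker:
    extracted: List[Region] = field(default_factory=list)
    information: List[Region] = field(default_factory=list)

    def deepcopy(self) -> "ToyTracker":
        return copy.deepcopy(self)

    def convert_to_informational(self) -> None:
        # Everything that was "extracted" becomes "information".
        self.information.extend(self.extracted)
        self.extracted = []

    def insert_information_from(self, other: "ToyTracker") -> None:
        # Both kinds of provenance on `other` arrive here as information.
        self.information.extend(other.information)
        self.information.extend(other.extracted)


start_tracker = ToyTracker(extracted=[(14, 24)])
end_tracker = ToyTracker(extracted=[(40, 50)], information=[(30, 39)])

combined = start_tracker.deepcopy()       # copy before modifying
combined.convert_to_informational()
combined.insert_information_from(end_tracker)
```

After these calls, combined holds only information provenance, and start_tracker is unchanged because the work was done on a copy.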

instabase.provenance.tracking.ImageProvenanceTracker class

The ImageProvenanceTracker class stores provenance information for any rectangular regions in the original document image. The information is not tied to IBOCRRecord text. This is useful for tracking provenance for non-text values like checkboxes or signatures.

ImageProvenanceTracker has the same methods as ProvenanceTracker, except that input and output arguments use ImageProvenanceTracker instead.

string Value objects

If a Value object contains a string, there are special string methods available on Value that automatically update the provenance information to reflect the string operation performed. All string operations that modify the string return a copy, similar to Python string methods.

substring

def substring(self, start: int, end: int) -> Value:
   """ Returns a copy of the string Value containing the string from
   start (inclusive) to end (exclusive)

   Args:
       start: start of string
       end: end of string
   """

substring also works with Python slicing syntax, so the following code snippet is equivalent to using substring:

string_value[1:3]  # Equivalent to string_value.substring(1, 3)

Examples:

my_value = my_value.substring(2, 10)
my_value = my_value[2:10]
my_value = my_value[3]

length

def length(self) -> int:
    """Get the length of the value if it's a string.
    """

You can also pass the Value object as an argument to len() to get the string length. For example: len(my_value) -> 3 for text-based Values. length and len are also implemented for Value objects holding other data types.

Examples:

var = my_value.length()
var = len(my_value)

delete

def delete(self, start: int, end: int) -> Value[Text]:
	"""Delete slice from Value from start index to end index.

	Args:
		start (int): start index of substring to remove.
		end (int): end index of substring to remove.
	"""

delete raises a ValueError if either the start or end index is negative, or if the start index is greater than the end index.

Examples:

my_value = my_value.delete(2, 10)
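The index rules for delete can be sketched on a plain string. delete_span is a hypothetical helper, not part of the Value API, and it assumes that the end index is exclusive, as it is for substring. The documentation above does not state this for delete.

```python
def delete_span(text: str, start: int, end: int) -> str:
    """Remove text[start:end], applying the validation rules described
    for Value.delete (end is assumed to be exclusive)."""
    if start < 0 or end < 0:
        raise ValueError("start and end must be non-negative")
    if start > end:
        raise ValueError("start must not be greater than end")
    return text[:start] + text[end:]


shortened = delete_span("Net Pay: 100", 3, 7)


def raises_value_error(start: int, end: int) -> bool:
    """Small helper for demonstrating the two error cases."""
    try:
        delete_span("Net Pay: 100", start, end)
    except ValueError:
        return True
    return False
```

Validating the indices up front avoids the silent wrap-around that negative indices would otherwise cause in Python slicing.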

concatenate

def concatenate(self, string: Value[Text]) -> Value[Text]:
    """Appends `string` to the string in Value.

    Args:
    	string (Value[Text]): string to concatenate to Value-string.
    """

Use concatenate to add one text-based Value to another. You can also concatenate a regular string to a provenance-tracked Value. __add__ and __radd__ are implemented so you can use + syntax to concatenate Values just like regular strings.

Examples:

my_new_value = first_value.concatenate(second_value)
my_new_value = first_value.concatenate('second value')
my_new_value = first_value + second_value
my_new_value = 'hello ' + second_value
my_new_value = first_value + ' world'
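For readers unfamiliar with __add__ and __radd__, the following sketch shows how the two hooks let + work with a plain string on either side. ToyText is a hypothetical stand-in, not the Value class. It assumes a per-character source index, with None for characters that did not come from the document.

```python
from typing import List, Optional, Union


class ToyText:
    """Toy text value: a string plus one source index per character."""

    def __init__(self, text: str,
                 sources: Optional[List[Optional[int]]] = None) -> None:
        self.text = text
        # Untracked text (a plain string) gets None for every character.
        self.sources = list(sources) if sources is not None else [None] * len(text)

    @staticmethod
    def _wrap(other: Union["ToyText", str]) -> "ToyText":
        return other if isinstance(other, ToyText) else ToyText(other)

    def concatenate(self, other: Union["ToyText", str]) -> "ToyText":
        other = self._wrap(other)
        return ToyText(self.text + other.text, self.sources + other.sources)

    def __add__(self, other: Union["ToyText", str]) -> "ToyText":
        # Handles ToyText + ToyText and ToyText + str.
        return self.concatenate(other)

    def __radd__(self, other: str) -> "ToyText":
        # Handles str + ToyText: Python calls this after str.__add__ declines.
        return self._wrap(other).concatenate(self)


name = ToyText("Ada", [5, 6, 7])
greeting = "Hi " + name + "!"
```

The tracked characters keep their source indices in the result, while the literal text on both sides stays untracked.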

replace

def replace(self, 
            old: Union[Value[Text], Text], 
            new: Union[Value[Text], Text], 
            count: int = -1) -> Value[Text]:
	"""Replace `old` in Value with `new`.

	Args:
		old (Value[Text] or Text): string in Value to be replaced.
		new (Value[Text] or Text): string to replace occurrences of `old` in Value with.
		count (int): -1 by default. Number of occurrences of `old` to replace with `new`.
	"""

This function is used to replace specific substrings with a new string. This function is also implemented with regex as regex_sub.

Examples:

my_value = my_value.replace(old_value, new_value)
my_value = my_value.replace('old value', new_value)
my_value = my_value.replace(old_value, 'new value')
my_value = my_value.replace('old value', 'new value')

lstrip, rstrip, strip

def lstrip(self, chars_to_strip: Text = None) -> Value:
	"""Strip characters from start of Value.

	Args:
		chars_to_strip (Text): None by default, strips leading whitespace.
			If set to specific characters, strips these from start of string.
	"""

def rstrip(self, chars_to_strip: Text = None) -> Value:
	"""Strip characters from end of Value.

	Args:
		chars_to_strip (Text): None by default, strips trailing whitespace.
			If set to specific characters, strips these from end of string.
	"""

def strip(self, chars_to_strip: Text = None) -> Value:
	"""Strip characters from start and end of Value.

	Args:
		chars_to_strip (Text): None by default, strips leading and trailing
			whitespace. If set to specific characters, strips these from
			start and end of string.
	"""

These functions are used to strip specific leading or trailing characters from Value.

Examples:

my_value = my_value.strip()
my_value = my_value.rstrip()
my_value = my_value.lstrip()
my_value = my_value.strip('\n')

insert

def insert(self, position: int, insert_str: Union[Value[Text], Text]) -> Value[Text]:
	"""Insert given string `insert_str` in specified `position`.

	Args:
		position (int): index at which to insert given string into Value.
		insert_str (Value[Text] or Text): string to insert at given index into Value.
	"""

Use insert to insert a string at a place other than the start or end.

Examples:

my_value = my_value.insert(3, 'my string')
my_value = my_value.insert(3, my_value_wrapped_string)

split

def split(self, split_string: Text = None, maxsplit: int = -1) -> Value[Text]:
	"""Split string based on given separator or whitespace by default. `maxsplit`
	by default is -1, which means that there is no maximum on the number of splits.
	If `maxsplit` is set to 0, no split is performed.

	Args:
		split_string (Text): None by default. String to split on. If None,
			splits on whitespace.
		maxsplit (int): -1 by default. Number of splits to perform.
	"""

This function is used for splitting a string. This function is also implemented with regex as regex_split.

Examples:

my_value = my_value.split()
my_value = my_value.split('\n', 10)
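The separator and maxsplit rules in the docstring match those of Python's built-in str.split. Assuming Value.split follows the same rules, which the documentation above suggests but does not state outright, the behavior on a plain string looks like this:

```python
text = "Net  Pay 1,250.00"  # note the two spaces after "Net"

# No separator: split on runs of whitespace.
on_whitespace = text.split()

# Explicit separator: every single occurrence splits, so empty strings appear.
on_single_space = text.split(" ")

# maxsplit=1: at most one split, the remainder stays intact.
one_split = text.split(" ", 1)

# maxsplit=0: no split is performed.
no_split = text.split(" ", 0)
```

With Value.split, each resulting piece would additionally carry its own provenance.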

join

@staticmethod
def join(join_char: Union[Value[Text], Text], vals: Sequence[Union[Value[Text], Text]]) -> Value[Text]:
	"""Joins the given Value objects with the given joining character.

	Args:
		join_char (Value[Text] or Text): character to insert between joined values.
		vals (Sequence of Value[Text] or Text): Value objects or strings to join.
	"""

Use the join function to join a list of text-based Values and strings with a specified join character.

Examples:

my_value = Value.join('\n', [my_value_1, my_value_2, my_value_3])
my_value = Value.join('\n', ['my value 1', 'my value 2', 'my value 3'])

Regex-based helper functions

The Value class includes static methods that mimic the re or regex library. Each function follows the documentation and interface provided by the regex Python library.

Note that where re and regex return Match objects, provenance tracking uses a TrackedMatchProxy class to hold both the Match and provenance information. See TrackedMatchProxy.

regex_search

@staticmethod
def regex_search(pattern: Union[Value[Text], Text, Pattern],
                 string: Union[Value[Text], Text],
                 flags: int = 0,
                 pos: int = None,
                 endpos: int = None,
                 partial: bool = False,
                 **kwargs: Any) -> TrackedMatchProxy:
	"""Searches for the given pattern in the given string using `regex.search` internally.

	Args:
		pattern (Value[Text], Text, or Pattern): string or regex pattern to search for.
		string (Value[Text] or Text): string to search in.
		flags (int): 0 by default. Regex flags for the given pattern.
		pos (int): None by default. Start index of substring to search in if not at the start of `string`.
		endpos (int): None by default. End index of substring to search in if not at the end of `string`.
		partial (bool): Allow partial matches.

	Returns a TrackedMatchProxy object.
	"""

Use regex_search if you need the functionality of regex.search but want provenance tracked Match objects. regex_search actually returns a TrackedMatchProxy object instead, which has a similar interface to a Match object. See TrackedMatchProxy.

Use this function to find a match in some provenance-tracked text for the given pattern.

Examples:

my_match = Value.regex_search(r'\w+', my_input_str)
my_match.start()  # -> 2
my_match.end()    # -> 3
my_match.group()  # -> Value[...], a tracked Value object containing the given group's match text

regex_findall

@staticmethod
def regex_findall(pattern: Union[Value[Text], Text, Pattern],
                  string: Union[Value[Text], Text],
                  flags: int = 0) -> Iterator[Union[Value[Text], Tuple[Value[Text], ...]]]:
	"""Returns all matches of the given pattern in string as an iterator of Value objects and
	tuples containing Value objects using `regex.finditer` internally.

	Args:
		pattern (Value[Text], Text, or Pattern): string or regex pattern to search for.
		string (Value[Text] or Text): string to search in.
		flags (int): 0 by default. Regex flags for the given pattern.
	"""

Use the regex_findall function if you need the functionality of regex.finditer. regex_findall returns an iterator of Value objects or tuples of Value objects. Unlike regex_finditer, it does not return TrackedMatchProxy objects.

Examples:

my_matches = Value.regex_findall(r'\w+', 'my string')
my_matches = Value.regex_findall(r'\w+', my_value_obj)
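Whether each result is a single item or a tuple depends on the number of capture groups in the pattern. The sketch below shows the rule with the standard library's re.findall on plain strings. It assumes regex_findall groups its results the same way, which matches the to_group_tuple comments in TrackedMatchProxy, with Value objects in place of strings.

```python
import re

text = "Pay 2020-11-05, due 2020-12-01"

# No capture groups: each result is the whole match.
no_groups = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# One capture group: each result is that group's text.
one_group = re.findall(r"(\d{4})-\d{2}-\d{2}", text)

# Two or more capture groups: each result is a tuple of the groups.
two_groups = re.findall(r"(\d{4})-(\d{2})-\d{2}", text)
```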

regex_finditer

@staticmethod
def regex_finditer(pattern: Union[Value[Text], Text, Pattern],
                 string: Union[Value[Text], Text],
                 flags: int = 0) -> Iterator[TrackedMatchProxy]:
    """Returns all matches of the given pattern in string as an iterator of
    TrackedMatchProxy objects using regex.finditer internally.

    Args:
	   pattern (Value[Text], Text, or Pattern): string or regex pattern to search for.
	   string (Value[Text] or Text): string to search in.
	   flags (int): 0 by default. Regex flags for the given pattern.
    """

regex_finditer is similar to regex_findall, except it returns TrackedMatchProxy objects and does not group matches in tuples.

Examples:

my_matches = Value.regex_finditer(r'\w+', 'my string')
my_matches = Value.regex_finditer(r'\w+', my_value_obj)

regex_sub

@staticmethod
def regex_sub(pattern: Union[Value[Text], Text, Pattern],
              repl: Union[Value[Text], Text],
              string: Union[Value[Text], Text],
              count: int = 0,
              flags: int = 0,
              pos: int = None,
              endpos: int = None,
              **kwargs: Any) -> 'Value[Text]':
    """Replaces matches of the given pattern in given string with `repl`. Returns
    a tracked Value object.

    Args:
        pattern (Value[Text], Text, or Pattern): string or regex pattern to search for.
        repl (Value[Text], Text): string to replace matches with.
        string (Value[Text] or Text): string to search in.
        count (int): 0 by default. Number of replacements to make. If 0, every match
        	  is replaced.
        flags (int): 0 by default. Regex flags for the given pattern.
        pos (int): None by default. Start index of substring to search in if not at the start of `string`.
        endpos (int): None by default. End index of substring to search in if not at the end of `string`.
    """

Use regex_sub if you need the functionality of regex.sub. regex_sub returns a Value object that contains the new string created by replacing occurrences of pattern in the string with the replacement repl.

Examples:

my_new_value = Value.regex_sub(my_pattern_value, 'replacement', my_string)
my_new_value = Value.regex_sub(r'\s+', ' ', my_input_text_val)
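The count parameter follows the convention of the standard library's re.sub, where 0 means replace every match. The sketch below uses re.sub on a plain string to stay self-contained. It assumes regex_sub treats count the same way, as its docstring states, and returns a tracked Value instead of a str.

```python
import re

text = "Net   Pay:\t 1,250.00"

# count=0 (the default): every run of whitespace becomes a single space.
collapsed = re.sub(r"\s+", " ", text)

# count=1: only the first run of whitespace is replaced.
first_only = re.sub(r"\s+", " ", text, count=1)
```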

regex_split

@staticmethod
def regex_split(pattern: Union[Value[Text], Text],
                string: Union[Value[Text], Text],
                maxsplit: int = 0,
                flags: int = 0,
                **kwargs: Any) -> 'List[Value[Text]]':
    """Splits given string on given pattern. Returns a list of Value objects.

    Args:
        pattern (Value[Text] or Text): string or pattern to search for.
        string (Value[Text] or Text): string to search in.
        maxsplit (int): 0 by default. Number of splits to make. If 0, string
           is split for every occurrence of pattern. If -1, no splits
           are made.
        flags (int): 0 by default. Regex flags for the given pattern.
    """

Use regex_split if you need the functionality of regex.split, or if you want to split based on a regex which split cannot do.

Examples:

my_list = Value.regex_split(r'\s+', my_input_text, maxsplit=3)
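As a point of comparison, here is how maxsplit behaves in the standard library's re.split on a plain string. This is a sketch that assumes regex_split follows the same convention for 0 and positive values, as its docstring describes, and returns Value objects instead of strings.

```python
import re

text = "a  b\tc\nd e"

# maxsplit=0 (the default): split at every run of whitespace.
all_parts = re.split(r"\s+", text)

# maxsplit=3: stop after three splits, the remainder stays in the last item.
three_splits = re.split(r"\s+", text, maxsplit=3)
```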

TrackedMatchProxy

class TrackedMatchProxy(object):
  """An object that mimics the interface of the re.Match object. Used in cases where we want to
  use a Match object which we have control over during creation. Includes provenance tracking.
  """

  def __init__(self, original_text: Value[Text], match: Any) -> None:
    self._original_text = original_text
    self.match = match

  def start(self, group: int = 0) -> int:
    return self.match.start(group)

  def end(self, group: int = 0) -> int:
    return self.match.end(group)

  def group(self, group: int = 0) -> Value[Text]:
    return self._original_text[self.start(group):self.end(group)]

  def to_group_tuple(self) -> Union[Value[Text], Tuple[Value[Text], ...]]:
    # If there are no groups, return the list of matches as a list
    # If there is one group, return a list of first group matches as a list
    # If there is more than one group, return a list of tuples with groups
    ...

  def __str__(self) -> Text:
    return u'<ib.TrackedMatchProxy object; span=({}, {}), match=\'{}\'>'.format(
        self.start(), self.end(), Value.unwrap_value(self._original_text))
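The group method above returns a slice of the original text rather than the match's own string. That is what lets the result keep its provenance, because slicing a Value is a tracked operation. The equivalence it relies on can be checked with the standard library's re module on a plain string. The sample text is made up for illustration.

```python
import re

text = "Invoice INV-0042 dated 2020-11-05"
match = re.search(r"INV-(\d+)", text)
assert match is not None

# Slicing the original text by the match span reproduces group().
whole = text[match.start():match.end()]

# The same holds for a numbered capture group.
digits = text[match.start(1):match.end(1)]
```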

Collection Value objects

Value objects can also contain collections such as lists and dictionaries. However, the elements of the collection must also be Value objects if you want to track provenance correctly for them (see usage example below). Built-in Refiner functions that return lists (such as regex_get_all) value-wrap each item of the list, recursively for nested lists. This allows the UI to properly display provenance information for all the nested elements.

If the result of a Refiner field is a collection, the extracted provenance of the collection elements is marked as information provenance. If a subsequent field accesses a string item in the collection, the extracted provenance of the item shows as extracted provenance.

Basic usage:

# list Value object
list_example: List[Value[Text]] = [a, b, c]
prov_tracked_list = Value(list_example) # Extracted provenance of a, b, and c
                                        # show up as informational in the UI.

# dictionary Value object
dict_example = {
  'a': Value(...),
  'b': Value([Value(...), ...]),
  Value('c'): 'hello'
}
prov_tracked_dict = Value(dict_example) # Extracted provenance of all the
    # nested Value objects show up as informational in the UI.
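The recursive value-wrapping of nested lists described above can be sketched as follows. ToyValue and wrap_nested are hypothetical stand-ins. The real Value class also carries tracker information, which is omitted here.

```python
from typing import Any


class ToyValue:
    """Toy wrapper that only holds a raw value (no provenance)."""

    def __init__(self, raw: Any) -> None:
        self._raw = raw

    def value(self) -> Any:
        return self._raw


def wrap_nested(obj: Any) -> ToyValue:
    """Wrap obj so that every list element, at every depth, is a ToyValue."""
    if isinstance(obj, ToyValue):
        return obj  # already wrapped, leave as-is
    if isinstance(obj, list):
        return ToyValue([wrap_nested(item) for item in obj])
    return ToyValue(obj)


wrapped = wrap_nested(["a", ["b", "c"]])
first_item = wrapped.value()[0]
nested_item = wrapped.value()[1].value()[1]
```

Wrapping at every level is what makes it possible to look up provenance for an individual nested element and not only for the collection as a whole.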

Modifying provenance-tracked values

If you are writing a UDF that takes in tracked values and modifies those tracked values, be careful to not modify the underlying attributes of those Value objects. Otherwise, you might accidentally modify the results of other Refiner fields.

Both the Value and provenance tracker classes have methods to create deep-copies of themselves. The Value class has the method get_copy(), which will return a new Value object with a deep-copy of the original Value object’s underlying value and tracker information. If you want just a deep-copy of the Value object’s tracker information, you can call value_object.tracker().deepcopy(). See below for some examples.

def modify_input_value(input_val: Value[Text], **kwargs) -> Value[Text]:
    # new_val has a deep copy of input_val's underlying value and tracker
    # information, so it can be modified safely.
    new_val = input_val.get_copy()
    return new_val

def update_input_value(input_val: Value[Text], **kwargs) -> Value[Text]:
    updated_value = ... # ex. an updated/cleaned value based on input_val
    new_val = Value(updated_value)
    # new_val has a deep copy of input_val's tracker information, so it can
    # be modified safely.
    new_val.set_tracker(input_val.tracker().deepcopy())
    return new_val

Advanced provenance tracking

Advanced provenance tracking concepts.

Freezing

When provenance tracking, it is sometimes helpful to save and terminate the tracking process as values propagate through the function being evaluated. For instance, imagine the following scenarios:

  1. You extract an unsanitized value, but the cleaning process for your final output is unrelated to the original value in form (for example, going from 10/02/2018 to October 2, 2018). In these cases, you would like to freeze the tracking at the point before sanitization is completed.

  2. During redaction and fraud detection, you can track the information and extracted region for multiple “checkpoints” throughout the formula. For instance, if you have SCAN_RIGHT(INPUT_COL, ‘Net Pay’), you can tag the Net Pay tracking information so that you can reference it later.

You can use the following Refiner function to freeze provenance:

  • freeze(val) - Causes the tracker on the given Value object to be frozen (that is, no changes take effect). NOTE: This method essentially causes val.tracker() to become None, and the final tracker used during output generation is the frozen instance.

Note: A frozen value is frozen forever, including in other columns that use that value. Therefore, if you want to use a field that was frozen, you must wrap that entire column in a freeze() call to replace the originally frozen tracker.
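The date scenario above can be sketched with a toy model of freezing. This is an illustration, not the real mechanism. ToyTracked and prettify_date are hypothetical, regions are assumed to be (start, end) offsets, and a simple frozen flag stands in for the frozen tracker instance described above.

```python
from datetime import datetime
from typing import List, Tuple

Region = Tuple[int, int]  # assumed toy representation: (start, end) offsets


class ToyTracked:
    def __init__(self, text: str, regions: List[Region],
                 frozen: bool = False) -> None:
        self.text = text
        self.regions = list(regions)
        self.frozen = frozen

    def freeze(self) -> None:
        self.frozen = True

    def rewrite(self, new_text: str) -> "ToyTracked":
        """Replace the text with something that cannot be mapped back
        character by character. Unfrozen values lose their regions;
        frozen values keep them (and stay frozen)."""
        kept = self.regions if self.frozen else []
        return ToyTracked(new_text, kept, self.frozen)


def prettify_date(value: ToyTracked) -> ToyTracked:
    parsed = datetime.strptime(value.text, "%m/%d/%Y")
    return value.rewrite(f"{parsed.strftime('%B')} {parsed.day}, {parsed.year}")


raw_date = ToyTracked("10/02/2018", [(120, 130)])
unfrozen_result = prettify_date(raw_date)   # provenance is lost

raw_date.freeze()
frozen_result = prettify_date(raw_date)     # provenance is preserved
```

Freezing before the reformatting step keeps the output pointing at the original date text, even though the final string looks nothing like it.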

Accessing provenance information

You might want to access the provenance information of a Value object; for example, to find the line numbers of the words in a phrase so that you can write a custom function to group words by line.

You can use the following Refiner function:

  • provenance_get(val) - Returns the provenance information tracked for this field as a list of dictionaries.

Each dictionary in the list describes one word and its bounding boxes:

[{
  "original_characters": string,
  "image_start": {
    "x": float,
    "y": float
  },
  "image_end": {
    "x": float,
    "y": float
  },
  "original_start_1d": integer,
  "original_end_1d": integer,
  "original_start_2d": {
    "x": float,
    "y": float
  },
  "original_end_2d": {
    "x": float,
    "y": float
  },
  "unsures": [
    boolean,
    ...
  ],
  "confidences": [
    float,
    ...
  ]
},
...
]

Parameter            Description
original_characters  Original word text
image_start          Top-left corner of text bounding box in image space
image_end            Bottom-right corner of text bounding box in image pixel space
original_start_1d    Start index (inclusive) of word in 1D text space
original_end_1d      End index (exclusive) of word in 1D text space
original_start_2d    Top-left corner of bounding box in 2D text space
original_end_2d      Bottom-right corner of bounding box in 2D text space
destinations_1d      List of index of each character in the word in the current string in 1D text space
destinations_2d      List of index of each character in the word in the current string in 2D text space
unsures              List of OCR engine unsureness for each character of word
confidences          List of OCR confidence for each character of word
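As an example of consuming this format, the sketch below groups words by line, the use case mentioned at the start of this section. group_words_by_line is a hypothetical helper. It assumes that words on the same line share the same original_start_2d y value, which may not hold for every document. The sample records are made up and contain only the keys the helper reads.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_words_by_line(words: List[Dict[str, Any]]) -> List[str]:
    """Group provenance_get-style word records into lines of text.

    Assumes original_start_2d["y"] identifies the line a word sits on.
    """
    lines = defaultdict(list)
    for word in words:
        lines[word["original_start_2d"]["y"]].append(word)

    result = []
    for y in sorted(lines):  # top to bottom
        # Left to right within a line.
        ordered = sorted(lines[y], key=lambda w: w["original_start_2d"]["x"])
        result.append(" ".join(w["original_characters"] for w in ordered))
    return result


# Made-up sample data in the documented shape (other keys omitted).
sample_words = [
    {"original_characters": "Pay", "original_start_2d": {"x": 4.0, "y": 0.0}},
    {"original_characters": "Net", "original_start_2d": {"x": 0.0, "y": 0.0}},
    {"original_characters": "1,250.00", "original_start_2d": {"x": 0.0, "y": 1.0}},
]
grouped = group_words_by_line(sample_words)
```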

Auto provenance tracking

The Refiner program can perform automatic provenance tracking so that even if a function does not support provenance tracking and the extracted results have no tracked information, Refiner can do a best-effort guess of where this value came from. Automatic provenance tracking works in the following ways:

  1. If the output appears in the input record, mark the first occurrence of that output as the extracted region.

  2. If the output, ignoring extra whitespace, appears in the input record, mark the first occurrence as the extracted region.
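The two fallback steps above can be sketched as a small function. This is an illustration of the described strategy, not Refiner's actual implementation. guess_extracted_region is a hypothetical name, and "ignoring extra whitespace" is interpreted here as treating any run of whitespace as interchangeable.

```python
import re
from typing import Optional, Tuple


def guess_extracted_region(output: str,
                           record_text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the best-effort match, or None."""
    # Step 1: the output appears verbatim; use its first occurrence.
    start = record_text.find(output)
    if output and start != -1:
        return start, start + len(output)

    # Step 2: match while ignoring differences in whitespace.
    tokens = output.split()
    if not tokens:
        return None
    pattern = r"\s+".join(re.escape(token) for token in tokens)
    match = re.search(pattern, record_text)
    return (match.start(), match.end()) if match else None


record = "Name:  Ada\n Lovelace\nNet Pay: 100"
exact = guess_extracted_region("Net Pay", record)
loose = guess_extracted_region("Ada Lovelace", record)
missing = guess_extracted_region("Grace", record)
```

Because only the first occurrence is used, a value that appears more than once in the record may be attributed to the wrong location, which is why this is a best-effort guess.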

Switching between OCR and INPUT_COL domain

It is sometimes useful to work in OCR space (for example, WordPolys), and then take those OCR results and display them as text within Refiner. This helper class provides functionality for getting provenance-tracked Values from WordPolys that came from an IBOCRRecord:

# Import via...
from instabase.ocr.client.libs.algorithms import WordPolyInputColMapper

class WordPolyInputColMapper:

  def __init__(self, record: IBOCRRecord):
    ...

  def get_index(self, word_poly: WordPolyDict) -> int:
    """Returns the string index of the given word_poly within INPUT_COL for this record.
    """
    ...

  def get_value(self, word_poly: WordPolyDict) -> Value[Text]:
    """Returns a provenance-tracked Value object for the given word_poly.
    """
    ...

  def get_cluster(self, word_polys: List[List[WordPolyDict]]) -> Value[Text]:
    """Returns a provenance-tracked Value object for the given word_polys. The input
    is a list of list of WordPolys, where each internal list represents the word
    polys to be included in one line, and the collection of lists is the collection
    of lines.
    """
    ...

Troubleshooting provenance tracking

The return value of my UDF shows different provenance information than I would expect.

Check whether any of the input arguments of your UDF have frozen provenance information.

Frozen provenance information from input arguments takes precedence over any existing provenance information on the output value.

I used freeze() on a list or dictionary, but it’s not showing any provenance information.

freeze() does not work on collection objects (such as list or dictionary), because freeze() freezes only the top-level provenance tracker, which is None for collection objects. When displaying provenance for collection objects, we aggregate the provenance information of the elements.