Parsing Functions

left_pos

left_pos(text, label=None, label_any=None, e=0, ignorecase=false, default=None)

Finds the leftmost character position of a label in a given text.



Args:

    text (str): original text

    label (str, optional): string whose leftmost character will be used for determining left position

    label_any (List<str>, optional): will search for each label in order and

        return the position of the first matching label.

    e (int, optional): number of errors allowed in the match.

    ignorecase (bool, optional): Whether casing should be ignored. By default false.

    default (int, optional): value to return if no match is found.



Returns:

    Position of the label's leftmost character in the text, or default if no match is found.



Examples:

    left_pos('hello world', 'world') -> 6

    left_pos('hello! whole wide world', 'wide') -> 13
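
As a rough illustration of the documented behavior, the Python sketch below reproduces the two examples above for single-line text. `left_pos_sketch` is a hypothetical stand-in, not the actual implementation. It assumes exact substring matching (the `e` error tolerance is not modeled) and 0-based positions, which is what the examples imply. How multi-line text is handled is not specified here and is not modeled.

```python
def left_pos_sketch(text, label=None, label_any=None, ignorecase=False, default=None):
    """Hypothetical stand-in for left_pos; exact matching only (no `e` support)."""
    # Use `label` if given, otherwise try each entry of `label_any` in order.
    candidates = [label] if label is not None else list(label_any or [])
    haystack = text.lower() if ignorecase else text
    for candidate in candidates:
        needle = candidate.lower() if ignorecase else candidate
        index = haystack.find(needle)
        if index != -1:
            return index  # 0-based position of the label's first character
    return default  # no candidate matched


print(left_pos_sketch('hello world', 'world'))                       # 6
print(left_pos_sketch('hello! whole wide world', 'wide'))            # 13
print(left_pos_sketch('hello world', label_any=['moon', 'world']))   # 6
print(left_pos_sketch('hello world', 'moon', default=-1))            # -1
```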

right_pos

right_pos(text, label=None, label_any=None, e=0, ignorecase=false, default=None)

Finds the rightmost character position of a label in a given text.



Args:

    text (str): original text

    label (str, optional): string whose rightmost character will be used for determining right position

    label_any (List<str>, optional): will search for each label in order

        and return the position of the first matching label.

    e (int, optional): number of errors allowed in the match.

    ignorecase (bool, optional): Whether casing should be ignored. By default false.

    default (int, optional): value to return if no match is found.



Returns:

    Position of the label's rightmost character in the text, or default if no match is found.



Examples:

    right_pos('hello world', 'hello') -> 4

    right_pos('hello! whole wide world', 'whole') -> 11
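
The examples above suggest that the returned position is the 0-based index of the label's last character (inclusive). The sketch below reproduces them under that reading. `right_pos_sketch` is a hypothetical name, it handles single-line text only, and it does not model the `e` error tolerance.

```python
def right_pos_sketch(text, label=None, label_any=None, ignorecase=False, default=None):
    """Hypothetical stand-in for right_pos; exact matching only (no `e` support)."""
    # Use `label` if given, otherwise try each entry of `label_any` in order.
    candidates = [label] if label is not None else list(label_any or [])
    haystack = text.lower() if ignorecase else text
    for candidate in candidates:
        needle = candidate.lower() if ignorecase else candidate
        index = haystack.find(needle)
        if needle and index != -1:
            # Inclusive index of the label's last character.
            return index + len(needle) - 1
    return default  # no candidate matched


print(right_pos_sketch('hello world', 'hello'))               # 4
print(right_pos_sketch('hello! whole wide world', 'whole'))   # 11
```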

scan

scan(text, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, num_lines=None, e=0, ignorecase=false)

Returns a region of text that matches the bounding criteria.



Args:

    text (str): original text

    starts_after (str, optional): narrows search space for finding the label.

        Only text following starts_after will be used for searching the label.

        Defaults to beginning of the original text.

    starts_after_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    ends_before (str, optional): narrows original text, only text before

        ends_before string will be used for searching the label. Defaults

        to end of the original text.

    ends_before_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    left_pos (int, optional): left position index.

    right_pos (int, optional): right position index.

    num_lines (int, optional): number of lines to consider from starts_after;

        defaults to all the lines.

    e (int, optional): number of errors allowed in the match.

    ignorecase (bool, optional): Whether casing should be ignored. By default false.



Returns:

    Region of text that matches the bounding criteria.



Examples:

    scan(INPUT_COL, starts_after='Net Pay')
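
To make the bounding criteria more concrete, here is a minimal Python sketch of how `starts_after`, `ends_before`, and `num_lines` could narrow a text. `scan_sketch` is a hypothetical helper. It assumes exact matching, ignores `left_pos`, `right_pos`, `e`, and `ignorecase`, and the handling of a missing marker is a guess rather than documented behavior.

```python
def scan_sketch(text, starts_after=None, ends_before=None, num_lines=None):
    """Hypothetical illustration of scan()'s bounding parameters."""
    start = 0  # defaults to the beginning of the original text
    if starts_after is not None:
        found = text.find(starts_after)
        if found == -1:
            return None  # assumption: no region if the start marker is missing
        start = found + len(starts_after)

    end = len(text)  # defaults to the end of the original text
    if ends_before is not None:
        found = text.find(ends_before, start)
        if found != -1:  # assumption: a missing end marker leaves the end open
            end = found

    region = text[start:end]
    if num_lines is not None:
        # Keep only the first num_lines lines counted from starts_after.
        region = '\n'.join(region.split('\n')[:num_lines])
    return region


sample = 'Gross Pay 1200.00\nNet Pay 950.00\nTax 250.00\nYTD 5000.00'
print(repr(scan_sketch(sample, starts_after='Net Pay', ends_before='YTD')))
# ' 950.00\nTax 250.00\n'
print(repr(scan_sketch(sample, starts_after='Net Pay', num_lines=1)))
# ' 950.00'
```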

scan_below

scan_below(text, label=None, label_any=None, left_pos=None, right_pos=None, left_pad=None, right_pad=None, ends_before=None, ends_before_any=None, num_lines=None, e=0, ignorecase=false)

Returns the value below the label, with the provided padding on each side.



Args:

    text (str): original text.

    label (str, optional): string used for determining position.

    label_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    left_pos (int, optional): left position index.

    right_pos (int, optional): right position index.

    left_pad (int, optional): extends the left position index towards left.

    right_pad (int, optional): extends the right position index towards right.

    ends_before (str, optional): narrows original text, only text before

        ends_before string will be used for searching the label. Defaults to

        end of the original text.

    ends_before_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    num_lines (int, optional): number of lines to consider below the label;

        defaults to all the lines.

    e (int, optional): number of errors allowed in the match.

    ignorecase (bool, optional): Whether casing should be ignored. By default false.



Returns:

    returns value below the label.



Examples:

    scan_below(INPUT_COL, 'NET PAY', num_lines=1, left_pad=3, right_pad=5)
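
The interaction between the label's column span and the padding is easier to see on a small fixed-width table. The sketch below is only an illustration: `scan_below_sketch` is a hypothetical helper that assumes the label's own column range, widened by `left_pad` and `right_pad`, is applied as a slice to the lines below the label. Fuzzy matching and the other parameters are not modeled, and the sample table is made up.

```python
def scan_below_sketch(text, label, num_lines=None, left_pad=0, right_pad=0):
    """Hypothetical illustration of scan_below(); exact matching only."""
    lines = text.split('\n')
    for row, line in enumerate(lines):
        col = line.find(label)
        if col == -1:
            continue
        left = max(0, col - left_pad)            # widen the window to the left
        right = col + len(label) + right_pad     # exclusive end, widened to the right
        below = lines[row + 1:]
        if num_lines is not None:
            below = below[:num_lines]
        return '\n'.join(candidate[left:right] for candidate in below)
    return None  # label not found


# Made-up fixed-width table; columns are built with str.format for clarity.
header = '{:<14}{:<13}{}'.format('GROSS PAY', 'NET PAY', 'TAX')
row_1 = '{:>9}{:>12}{:>9}'.format('1200.00', '950.00', '250.00')
row_2 = '{:>9}{:>12}{:>9}'.format('1300.00', '990.00', '310.00')
sample = '\n'.join([header, row_1, row_2])

print(scan_below_sketch(sample, 'NET PAY', num_lines=1, left_pad=3, right_pad=2).strip())
# 950.00
```

Without padding, the window here would cover only the seven columns under `NET PAY`, which happens to be enough for this table; the padding matters when the value is wider than, or offset from, its label.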

scan_below_repeated

scan_below_repeated(text, label=None, label_any=None, left_pos=None, right_pos=None, left_pad=None, right_pad=None, ends_before=None, ends_before_any=None, num_lines=None, e=0, ignorecase=false, max_scans=10000)

Finds a list of matches by running scan_below() repeatedly on the remaining text after each match.



Args:

    text (str): original text.

    label (str, optional): string used for determining position.

    label_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    left_pad (int, optional): extends the left position index towards left.

    right_pad (int, optional): extends the right position index towards right.

    ends_before (str, optional): narrows original text, only text before

        ends_before string will be used for searching the label. Defaults to

        end of the original text.

    ends_before_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    left_pos (int, optional): left position index.

    right_pos (int, optional): right position index.

    num_lines (int, optional): number of lines to consider below the label;

        defaults to all the lines.

    e (int, optional): number of errors allowed in the match.

    ignorecase (bool, optional): Whether casing should be ignored. By default false.

    max_scans (int, optional): maximum number of results to return. If the

        value is not set, or is less than 0, it defaults to 10000.



Returns:

    a list of matches by running scan_below() repeatedly on the remaining text after each match.



Examples:

    scan_below_repeated(INPUT_COL, 'NET PAY', num_lines=1, left_pad=3, right_pad=5)

scan_box

scan_box(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, pixel_tolerance=2, exclude_label_line=false)

Returns the contents of the box containing the search term.



Requires provenance tracking and line detection used in Process Files.



Args:

  text (str): The text to search in.

  label (str, optional): Search term contained in a visual box with the content to extract.

  label_any (List<str>, optional): will search for each label in order and

      return the position of the first matching label.

  starts_after (str, optional): Narrows search space for finding the label.

      Only text following starts_after will be used for searching the label.

      Defaults to beginning of the original text.

  starts_after_any (List<str>, optional): Will search for each label in order and return

      the position of the first matching label.

  ends_before (str, optional): Narrows original text, only text before

      ends_before string will be used for searching the label. Defaults

      to end of the original text.

  ends_before_any (List<str>, optional): Will search for each label in order and return

      the position of the first matching label.

  left_pos (int, optional): Left position index.

  right_pos (int, optional): Right position index.

  e (int, optional): Number of errors allowed in the match. By default 0.

  ignorecase (bool, optional): Whether casing should be ignored. By default false.

  pixel_tolerance (int, optional): Sets a number of pixels that a word can

      be past the side of a rectangle it is contained in. By default 2.

  exclude_label_line (bool, optional): If set to true, will remove the label

      used to find the rectangle from the output.



Returns:

    Returns content of the visual box containing the search term.

scan_line

scan_line(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false)

Returns the line that contains the label, bounded by the left_pos and right_pos parameters.



Args:

    text (str): original text

    label (str, optional): string used for determining position

    label_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    starts_after (str, optional): narrows search space for finding the label.

        Only text following starts_after will be used for searching the label.

        Defaults to beginning of the original text.

    starts_after_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    ends_before (str, optional): narrows original text, only text before

        ends_before string will be used for searching the label. Defaults to

        end of the original text.

    ends_before_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    left_pos (int, optional): left position index.

    right_pos (int, optional): right position index.

    e (int, optional): number of errors allowed in the match.

    ignorecase (bool, optional): Whether casing should be ignored. By default false.



Returns:

    The line that contains the label, bounded by left_pos and right_pos.



Examples:

    scan_line(INPUT_COL, 'Tax Year', right_pos=left_pos(INPUT_COL, 'Tax Year'))
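
A minimal sketch of the idea, assuming `left_pos` and `right_pos` act like Python slice bounds on the matching line (whether the real `right_pos` is inclusive is not stated here). `scan_line_sketch` is a hypothetical helper with exact matching only, and the sample text is made up.

```python
def scan_line_sketch(text, label, left_pos=None, right_pos=None):
    """Hypothetical illustration of scan_line(); exact matching only."""
    for line in text.split('\n'):
        if label in line:
            # None bounds leave that side of the line untrimmed.
            return line[left_pos:right_pos]
    return None  # no line contains the label


sample = 'Form 1040\n2019 Tax Year Summary\nFiling status: single'
print(repr(scan_line_sketch(sample, 'Tax Year')))               # '2019 Tax Year Summary'
print(repr(scan_line_sketch(sample, 'Tax Year', right_pos=5)))  # '2019 '
```

In the second call, 5 is the column where 'Tax Year' starts on its line, so the result is the text to the left of the label, much like the documented example that passes the label's left position as right_pos.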

scan_line_repeated

scan_line_repeated(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, max_scans=10000)

Finds a list of matches by running scan_line() repeatedly on the remaining text after each match.



Args:

    text (str): original text

    label (str, optional): string used for determining position

    label_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    starts_after (str, optional): narrows search space for finding the label.

        Only text following starts_after will be used for searching the label.

        Defaults to beginning of the original text.

    starts_after_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    ends_before (str, optional): narrows original text, only text before

        ends_before string will be used for searching the label. Defaults to

        end of the original text.

    ends_before_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    left_pos (int, optional): left position index.

    right_pos (int, optional): right position index.

    e (int, optional): number of errors allowed in the match.

    ignorecase (bool, optional): Whether casing should be ignored. By default false.

    max_scans (int, optional): maximum number of times the scan_line() function

        is repeated. If the value is not set, or is less than 0, it defaults to 10000.



Returns:

    a list of matches by running scan_line() repeatedly on the remaining text after each match.



Examples:

    scan_line_repeated(INPUT_COL, 'Tax Year')

scan_near

scan_near(text, label, target, max_distance=10, include_info=false, direction=None, max_distance_x=None, max_distance_y=None)

Finds a label within a piece of text, and returns desired targets around found labels.



  See the `regex()` and `token_matcher()` functions to use regexes and special tokens

  as labels and targets. Otherwise, pass a provenance-tracked value as a target or

  label to scan from a specific extraction. If not provenance-tracked, the string literal

  will be interpreted as a regex to search from.



  Args:

    text (str): The text to search in

    label (Union<str, List<str>, Regex, List<Regex>, Matcher, List<Matcher>>): A previously computed value, regex, list of regexes, or a token matcher to search for within the given text

    target (Union<str, List<str>, Regex, List<Regex>, Matcher, List<Matcher>>): A previously computed value, regex, list of regexes, or a token matcher to search for within the found region

    max_distance (float, optional): The maximum distance allowed between the label and target. Distance is the minimum distance from one piece of text to another, with columns and lines each being a unit distance of 1.

                  For instance, [label][target] would be a distance of 0, while [label] [target] is a distance of 1,

                  and the following example is a distance of 1 (indicated by the pipe):



                  [label]

                      |

                      [target]



    include_info (bool, optional): If false, only returns the list of found targets. Otherwise, returns a dictionary including the label, target, locations, distance, and angle from label to target. Defaults to false.

    direction (str, optional): Can be 'left', 'right', 'above', or 'below', indicating which direction from the label to the target is prioritized. This weighted distance is included as 'heuristic_distance' when info is included. Defaults to None.

    max_distance_x (float, optional): The maximum distance allowed between the label and target in the x direction (see above for info on how this is computed). If not None, `max_distance` is ignored. Defaults to None.

                                      max_distance_y must also be set, or this will be ignored.

    max_distance_y (float, optional): The maximum distance allowed between the label and target in the y direction (see above for info on how this is computed). If not None, `max_distance` is ignored. Defaults to None.

                                      max_distance_x must also be set, or this will be ignored.



  Returns:

    Desired targets around found labels
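
  The precedence between `max_distance`, `max_distance_x`, and `max_distance_y` described above can be summarized in a few lines of Python. This is only a sketch of the documented rule: `resolve_distance_limits` and the dictionary keys it returns are hypothetical and are not part of the function's interface.

```python
def resolve_distance_limits(max_distance=10, max_distance_x=None, max_distance_y=None):
    """Hypothetical helper mirroring the documented precedence rules."""
    if max_distance_x is not None and max_distance_y is not None:
        # Both per-axis limits are set, so max_distance is ignored.
        return {'x': max_distance_x, 'y': max_distance_y}
    # A per-axis limit given on its own is ignored; max_distance applies.
    return {'any': max_distance}


print(resolve_distance_limits())                                    # {'any': 10}
print(resolve_distance_limits(max_distance_x=3))                    # {'any': 10}
print(resolve_distance_limits(max_distance_x=3, max_distance_y=1))  # {'x': 3, 'y': 1}
```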

scan_repeated

scan_repeated(text, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, num_lines=None, e=0, ignorecase=false, max_scans=10000)

Finds a list of matches by running scan() repeatedly on the remaining text after each match.



Args:

    text (str): original text

    starts_after (str, optional): narrows search space for finding the label.

        Only text following starts_after will be used for searching the label.

        Defaults to beginning of the original text.

    starts_after_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    ends_before (str, optional): narrows original text, only text before

        ends_before string will be used for searching the label. Defaults

        to end of the original text.

    ends_before_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    left_pos (int, optional): left position index.

    right_pos (int, optional): right position index.

    num_lines (int, optional): number of lines to consider from starts_after;

        defaults to all the lines.

    e (int, optional): number of errors allowed in the match.

    ignorecase (bool, optional): Whether casing should be ignored. By default false.

    max_scans (int, optional): maximum number of repeated scans to perform. By default 10000.



Returns:

    a list of matches by running scan() repeatedly on the remaining text after each match.



Examples:

    scan_repeated(INPUT_COL, starts_after='Net Pay')

scan_right

scan_right(text, label=None, label_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false)

Finds the value for a label by scanning to the right of the label on the same line.



Args:

    text (str): original text

    label (str, optional): string used for determining position

    label_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    ends_before (str, optional): narrows original text, only text before

        ends_before string will be used for searching the label. Defaults to

        end of the original text.

    ends_before_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    left_pos (int, optional): left position index.

    right_pos (int, optional): right position index.

    e (int, optional): number of errors allowed in the match.

    ignorecase (bool, optional): Whether casing should be ignored. By default false.



Returns:

    Value found to the right of the label on the same line.



Examples:

    scan_right(INPUT_COL, 'NET PAY')
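
As a rough picture of "scanning right of the label in the same line", the sketch below returns whatever follows the first occurrence of the label up to the end of that line. `scan_right_sketch` is a hypothetical helper: it uses exact matching, ignores the other parameters, and leaves whitespace untrimmed because the trimming behavior is not documented here.

```python
def scan_right_sketch(text, label):
    """Hypothetical illustration of scan_right(); exact matching only."""
    for line in text.split('\n'):
        col = line.find(label)
        if col != -1:
            # Everything on this line after the label's last character.
            return line[col + len(label):]
    return None  # label not found on any line


sample = 'GROSS PAY 1200.00\nNET PAY 950.00\nTAX 250.00'
print(repr(scan_right_sketch(sample, 'NET PAY')))  # ' 950.00'
```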

scan_right_repeated

scan_right_repeated(text, label=None, label_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, max_scans=10000)

Finds a list of matches by running scan_right() repeatedly on the remaining text after each match.



Args:

    text (str): original text

    label (str, optional): string used for determining position

    label_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    ends_before (str, optional): narrows original text, only text before

        ends_before string will be used for searching the label. Defaults to

        end of the original text.

    ends_before_any (List<str>, optional): will search for each label in order and return

        the position of the first matching label.

    left_pos (int, optional): left position index.

    right_pos (int, optional): right position index.

    e (int, optional): number of errors allowed in the match.

    ignorecase (bool, optional): Whether casing should be ignored. By default false.

    max_scans (int, optional): maximum number of results to return. If the

        value is not set, or is less than 0, it defaults to 10000.



Returns:

    a list of matches by running scan_right() repeatedly on the remaining text after each match.



Examples:

    scan_right_repeated(INPUT_COL, 'NET PAY')
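
The "repeat on the remaining text" pattern shared by the *_repeated functions can be sketched as a simple loop. The code below is an assumption-laden illustration, not the actual implementation: `scan_right_repeated_sketch` is a hypothetical helper that uses exact matching, treats "remaining text" as everything after the matched line, and applies the documented fallback of 10000 when `max_scans` is unset or negative.

```python
def scan_right_repeated_sketch(text, label, max_scans=10000):
    """Hypothetical illustration of the repeated-scan loop; exact matching only."""
    if max_scans is None or max_scans < 0:
        max_scans = 10000  # documented fallback for unset or negative values

    results = []
    remaining = text
    while len(results) < max_scans:
        col = remaining.find(label)
        if col == -1:
            break  # no further matches
        value_start = col + len(label)
        line_end = remaining.find('\n', value_start)
        if line_end == -1:
            line_end = len(remaining)
        results.append(remaining[value_start:line_end])
        # Continue on the text that follows the line just matched.
        remaining = remaining[line_end:]
    return results


sample = 'NET PAY 950.00\nTAX 250.00\nNET PAY 990.00'
print(scan_right_repeated_sketch(sample, 'NET PAY'))               # [' 950.00', ' 990.00']
print(scan_right_repeated_sketch(sample, 'NET PAY', max_scans=1))  # [' 950.00']
```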