TokenMatchers and Tokenizers (Legacy)

You can use Token Matchers and Tokenizers with some Refiner functions.

  • Tokenizers provide a way to break text into multiple pieces, while

  • TokenMatchers provide a way to score and clean each piece of text, using knowledge of the semantic category it belongs to.

For example, without context, it is difficult for a computer to interpret the value 1o Novembr 200B, but knowing that this value is supposed to be a date changes everything: it is clearly 10 November 2008.

TokenMatcher and Tokenizer usage

Token Matchers and Tokenizers can be used with the following Refiner functions (documented in the Refiner/Sheet Functions::NLP Functions section):

  • nlp_token_clean cleans a piece of text

  • nlp_token_score provides a semantic validation score, from 0 to 1.0

  • nlp_token_find finds the best scoring token in a string

  • nlp_token_find_all finds all the tokens in a string with a score above a threshold (0.8 by default)

  • nlp_token_select finds the best scoring string among its argument list

Each of these Refiner functions takes a model_config argument, which is a dictionary of parameters that can be used by your chosen TokenMatcher. The nlp_token_find and nlp_token_find_all functions also take tokenizer and tokenizer_config arguments that define how text is broken up before each token is scored (the default is the unigram tokenizer, which splits text at whitespace).

At minimum, these nlp_token functions require the model argument. Most TokenMatchers are automatically paired with a default Tokenizer, which is usually the recommended Tokenizer for that TokenMatcher.

Here are a few examples of how these functions can be used:

nlp_token_find('Due on 20IB-0A-1B', 
               model='only-digits', 
               separator=' ') # -> '20IB-0A-1B'
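
For comparison, here is a sketch of the scoring and cleaning functions applied to the date example from the introduction. The calls assume only the model argument described above; the outputs in the comments are illustrative and the exact cleaned format depends on the matcher.

nlp_token_score('1o Novembr 200B', model='date')  # -> a score between 0 and 1.0
nlp_token_clean('1o Novembr 200B', model='date')  # -> a cleaned, normalized date string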

Here is an example of how to pass a file path to the Dataset TokenMatcher:

  nlp_token_find('Your input string', 
                 model='dataset', 
                 tokenizer_config=map_create([['dataset', 'your_dataset.csv']]))
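
As a more concrete sketch, suppose the referenced file were a hypothetical regions.csv containing the terms UK, US, and EU (the same list used in the Dataset row of the Available TokenMatchers table below). The expected result in the comment follows that row's example.

nlp_token_find('An EU regulation is a legal act which applies directly at the national level.',
               model='dataset',
               tokenizer_config=map_create([['dataset', 'regions.csv']]))  # -> 'EU'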

Available TokenMatchers

The following TokenMatchers can be used within Refiner formulas and UDFs.

| Category | Identifier | Matcher Name | Functionality | Description | Find Example | Notes |
|---|---|---|---|---|---|---|
| Geography | iso-3-country-code | ISO 3 Country Code | score, clean | Matches tokens that conform to the ISO 3 country code standard. | "sdf usa foo" → "usa" | |
| Geography | us-state | US State | score, clean, generate, generate_obvious | Matches any state name or abbreviation. | "Heading home to Virginia" → "Virginia" | Prioritizes state names over abbreviations, and upper case over lower case. |
| Geography | us-address | Address (USA) | score | Scores tokens with respect to how likely they are to be a US-based address. | (Score example) "123 Something Street\nTownsville, CA 111111" → 0.9; "Hello, world" → 0.1 | It is recommended that this is used with the us-address tokenizer. |
| Geography | global-address-word | Geography - Address Word (Global) | score | Matches words that are commonly used within local addresses. | "I live on Main Street" → "Street" | Must be used with the global-address-word tokenizer. |
| Geography | global-city | Geography - City (Global) | score | Matches various cities around the world, in English spelling. | "I live in San Francisco, CA" → "San Francisco" | Must be used with the global-city tokenizer. |
| Geography | global-state | Geography - State (Global) | score | Matches various state and province names around the world, in English spelling. | "I live in Bengaluru, Karnataka" → "Karnataka" | Must be used with the global-state tokenizer. |
| Geography | global-zipcode | Geography - Zipcode (Global) | score | Matches various zipcodes used by different countries and territories. | "I went to Douglas, Isle of Man IM1 1EG to watch the race." → "IM1 1EG" | Must be used with the global-zipcode tokenizer. |
| Geography | country | Geography - Country | score | Matches country names and some country name abbreviations. | "I grew up in the United States." → "United States" | Must be used with the country tokenizer. |
| Geography | global-address | Geography - Address (Global) | score | Matches addresses for various regions. | "My address:\n123 Some Street, London UK" → "123 Some Street, London UK" | Must be used with the global-address tokenizer. |
| ID Numbers | passport-number | Passport Number | score, clean, generate, generate_obvious | Matches passport numbers. | "sdf 12 122309df fd" → "12" | Mostly just matches a sequence of digits. |
| ID Numbers | social-security-number | Social Security Number | score, clean, generate, generate_obvious | Matches US social security numbers. | "sdf 444-12-2812" → "444-12-2812" | Can handle when digits get OCR'd as letters. |
| Numbers | credit-card-number | Credit Card Number | score, clean, generate, generate_obvious | Matches the major credit card types. | "sdf 4444125562813212" → "4444125562813212" | Missing numbers and format mismatches will cause a score of 0. |
| Core | floating-point-us | Floating Point (US) | score, clean | Matches numbers of the form ###,###,###.##. | "Ted 1.0 2.0 foo" → "1.0" | Picks the first in the case of ties. Score is lower but not 0 when letters are found (for example, B may be an 8). |
| Core | positive-integer | Positive Integer | score, clean | Matches numbers of the form ##### (no decimal points). | "Ted 1.0 10 foo" → "10" | |
| Core | only-az | Only AZ | score, clean | Matches only the letters A-Za-z. | "Ted 1.0 10 foo" → "Ted" | |
| Core | only-digits | Only Digits | score, clean | Matches only digits. | "Ted 1.0 10 foo" → "10" | Can include decimal points and letters that look like numbers, but prioritizes pure digits. |
| Core | passport-field-sex | Gender (M/F) | score, clean | Matches only the letters M and F. | "Z M F" → "M" | Picks the first item when there are ties. |
| Core | mrz | MRZ | score, clean | Matches a string containing a machine-readable zone (Type 1 and Type 3). | | |
| Names | person-name | Person - Name | score, clean, generate, generate_obvious | Matches person names. | "Barack Obama was born on August 4, 1961" → "Barack Obama" | |
| Time | dd-month-yyyy | DD-Month-YYYY | score, clean | Matches DD-Month-YYYY. | "The date is 13-Jan-2019" → "13-Jan-2019" | Finds the month using the month-name TokenMatcher. |
| Time | month-name | Month Name | score, clean | Matches English-language month names and their abbreviations. | "I'll see you in March" → "March" | |
| Time | passport-field-date | US Passport Date | score, clean | Matches the date format that appears in passport fields: ## MonthName #### | (Clean example) "fxx12 March 2018" → "12 MARCH 2018" | |
| Time | passport-mrz-date | US Passport Date (MRZ) | score, clean | Matches the date format that appears in passport MRZ: YYMMDD | "sdf 122309 fd" → "122309" | |
| Time | date | Date | score, clean, generate, generate_obvious | Matches dates of a variety of different formats. | "Some text on 11-JUN-2018" → "11-JUN-2018" | By default uses the date-regex tokenizer. |
| Currency | currency | Currency | score | Matches currency and amount pairs, both currency-first and amount-first. | "Total transaction amount USD 1,447,329.11" → "USD 1,447,329.11" | By default uses the currency tokenizer. Must be used with this tokenizer. |
| Dataset | dataset | Dataset | score, clean | Returns matches from a list of terms provided as a filepath in tokenizer_config. | Dataset: ['UK', 'US', 'EU']; "An EU regulation is a legal act which applies directly at the national level." → "EU" | Must be used with the dataset tokenizer. |
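
Many of these matchers must be paired with a specific tokenizer, which you can pass explicitly. Here is a sketch using the global-city matcher; the expected result in the comment is taken from its Find Example above.

nlp_token_find('I live in San Francisco, CA',
               model='global-city',
               tokenizer='global-city')  # -> 'San Francisco'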

Available Tokenizers

The TokenMatcher framework allows for different ways to break up text before searching for tokens. The following tokenizers can be used. (The default for most operations is Unigram).

| Identifier | Name | Description | Example | Notes |
|---|---|---|---|---|
| unigram | Unigram | Splits input text into single-token pieces based on a split regex. | "Tokenize this text!" → ["Tokenize", "this", "text!"] | Equivalent to Python's split method. |
| bigram | Bigram | Creates bigrams by splitting input text into two-token pieces based on a split regex. | "Tokenize this text!" → ["Tokenize this", "this text!"] | |
| n-gram | n-gram | Breaks text into single tokens, then returns a list generated by a sliding window of size n across the entire list of single tokens. n defaults to 1. | "a b c d e f g", ngram_size=3 → ["a b c", "b c d", "c d e", "d e f", "e f g"] | |
| n-gram-range | n-gram range | Equivalent to the n-gram tokenizer, but for a range of n. The range defaults to 1:T+1, where T is the total number of single tokens created from the text. The upper bound is exclusive. | "a b c d", ngram_range="1:3" → ["a", "b", "c", "d", "a b", "b c", "c d"] | |
| us-address | Address (USA) | Returns a list of tokens from the input text that are deemed to be likely US addresses. | "Welcome to Instabase\n123 Street Ave.\nCity, ST 123456" → ["123 Street Ave.\nCity, ST 123456"] | Is not guaranteed to return a list that covers the entire input text. |
| global-address-word | Geography - Address Word (Global) | Returns a list of tokens from the input text that are deemed to be likely local address clarifiers, such as "street" or "road". | "I live at Main Street near Instabase Square" → ["Street", "Square"] | Current state is experimental. |
| global-city | Geography - City (Global) | Returns a list of tokens from the input text that are deemed to be likely city names. | "I grew up in Bristol, UK but moved to Boston, MA" → ["Bristol", "Boston"] | Current state is experimental. |
| global-state | Geography - State (Global) | Returns a list of tokens from the input text that are deemed to be likely state names. | "I've been to New York, CT, and ME, and now I am on my way to Karnataka." → ["New York", "CT", "ME", "Karnataka"] | Current state is experimental. |
| global-zipcode | Geography - Zipcode (Global) | Returns a list of tokens from the input text that are deemed to be likely zipcodes. | "I visited Boston, MA with a zip code of 02116 and Douglas, Isle of Man IM1 1EG." → ["02116", "IM1 1EG"] | Current state is experimental. |
| country | Geography - Country | Returns a list of tokens from the input text that are deemed to be likely country names. | "I grew up in the U.K., but moved to the United States and then Canada." → ["U.K.", "United States", "Canada"] | Current state is experimental. |
| global-address | Geography - Address (Global) | Returns a list of tokens from the input text that are deemed to be likely addresses from various regions of the world. | "My address:\n123 Some Street, London UK" → ["123 Some Street, London UK"] | Current state is experimental. |
| regex | Basic - Regex | Returns matches from the provided regex. | With regex r'\b[A-Za-z]+\b': "Jun 1992 2019-01-01 1st january 2018" → ["Jun", "january"] | If no regex is provided, returns an empty list. Is not guaranteed to return output if there are no matches in the input text. |
| date-regex | Regex - Date | Returns matches from a preset date regex. | "Jun 1992 2019-01-01 1st january 2018" → ["Jun 1992", "2019-01-01", "1st january 2018"] | Is not guaranteed to return output if there are no matches in the input text. |
| currency-regex | Basic - Currency | Returns matches from a preset currency regex. Only matches amounts in US style. | "USD 199.99 USD 1,447,329.11 EUR 1.447,98" → ["USD 199.99", "USD 1,447,329.11"] | Is not guaranteed to return output if there are no matches in the input text. |
| dataset | Dataset | Returns matches from a list of terms provided as a filepath in tokenizer_config. | Dataset: ['UK', 'US', 'EU']; "An EU regulation is a legal act which applies directly at the national level." → ["EU"] | Is not guaranteed to return output if there are no matches in the input text, or if no filepath has been passed to the Tokenizer. |
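
Tokenizer parameters such as ngram_size appear in the examples above; one way to supply them is through tokenizer_config, as with the dataset filepath earlier. The following is a hedged sketch under that assumption (the exact parameter names each tokenizer accepts, and whether they are passed through tokenizer_config, may differ).

nlp_token_find_all('Barack Obama was born on August 4, 1961',
                   model='person-name',
                   tokenizer='n-gram',
                   tokenizer_config=map_create([['ngram_size', 2]]))  # -> tokens such as 'Barack Obama'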

Creating custom TokenMatchers and Tokenizers

Creating custom TokenMatchers and Tokenizers is similar to the process of registering custom Refiner formulas. Import the TokenMatcher and Tokenizer interfaces, and create subclasses that implement the required functionality. The custom classes are then registered by defining register_token_matchers and register_tokenizers functions. See the example below:

from instabase.model_utils.tokenizer import Tokenizer
from instabase.model_utils.token_matcher import TokenMatcher
from instabase.provenance.tracking import Value

class MyCustomTokenizer(Tokenizer):
  TOKEN_TYPE = u'my-custom-tokenizer'
  HUMAN_NAME = u'Custom - From a UDF'

  def tokenize(self, text, **kwargs):
    # type: (Value[Text], **Any) -> List[Value[Text]]
    # Just return the first word
    return [Value(text.value().split(' ')[0])]


class MyCustomTokenMatcher(TokenMatcher):
  MATCHER_TYPE = u'my-custom-token-matcher'
  HUMAN_NAME = u'Custom - From a UDF'
  DEFAULT_TOKENIZER = u'my-custom-tokenizer'

  def score(self, token, **kwargs):
    # type: (Value[Text], **Any) -> float
    # For this example, give all tokens the same score
    return 0.8


def register_token_matchers(name_to_fn):
  name_to_fn.update({
    'MyCustomTokenMatcher': {
      'class': MyCustomTokenMatcher
    }
  })


def register_tokenizers(name_to_fn):
  name_to_fn.update({
    'MyCustomTokenizer': {
      'class': MyCustomTokenizer
    }
  })

The tokenizer takes in a Value object of type Text and returns a list of Value objects.

The custom TokenMatcher and Tokenizer can then be used in a Refiner formula:

  nlp_token_find('Some example text', 
                 model='my-custom-token-matcher', 
                 tokenizer='my-custom-tokenizer')

or within a UDF using REFINER_FNS:

scan_result, err = REFINER_FNS.call('nlp_token_find_all', 
                                    val, 
                                    model='my-custom-token-matcher', 
                                    tokenizer='my-custom-tokenizer', 
                                    threshold=0.7, 
                                    **kwargs)

Make sure to include the **kwargs parameter that comes from the calling UDF function. The **kwargs parameter includes important context for running TokenMatchers.
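
Putting this together, a minimal sketch of a UDF that forwards **kwargs might look like the following. The function name find_cities and the (result, error) return convention are illustrative; how REFINER_FNS is made available to the UDF follows the snippet above and your existing UDF setup.

def find_cities(val, **kwargs):
  # Forward **kwargs so the TokenMatcher receives the Refiner execution context.
  scan_result, err = REFINER_FNS.call('nlp_token_find_all',
                                      val,
                                      model='global-city',
                                      tokenizer='global-city',
                                      threshold=0.7,
                                      **kwargs)
  if err:
    return None, err
  return scan_result, None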