NLP Functions


nlp_get_entities(text, label=None)

Extracts entities from natural language text.


    text (str): the text of interest

    label (str): restricts the result to one kind of entity, such as PERSON or ORG. Defaults to None, which returns all entity types.


    Returns a dictionary containing entities extracted from the text


    nlp_get_entities('The Massachusetts Institute of Technology is a private research university in Cambridge, Massachusetts, United States.') ->

      {
        'entities': [
          {'char_pos': {'end': 41, 'start': 0},
           'entity': u'The Massachusetts Institute of Technology',
           'label': u'ORG',
           'word_pos': {'end': 5, 'start': 0}},
          {'char_pos': {'end': 87, 'start': 78},
           'entity': u'Cambridge',
           'label': u'GPE',
           'word_pos': {'end': 12, 'start': 11}},
          {'char_pos': {'end': 102, 'start': 89},
           'entity': u'Massachusetts',
           'label': u'GPE',
           'word_pos': {'end': 14, 'start': 13}},
          {'char_pos': {'end': 117, 'start': 104},
           'entity': u'United States',
           'label': u'GPE',
           'word_pos': {'end': 17, 'start': 15}}
        ],
        'status': 'OK'
      }
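
The returned dictionary can be post-processed with ordinary Python. The sketch below works on a hand-copied version of the example result rather than calling nlp_get_entities itself. The helper name entities_with_label is hypothetical and not part of the documented API. It filters entities by label and checks that each char_pos is a half-open character span into the input text, which is what the example values suggest.

```python
# Input text and result copied by hand from the documented example.
text = ('The Massachusetts Institute of Technology is a private research '
        'university in Cambridge, Massachusetts, United States.')

result = {
    'entities': [
        {'char_pos': {'start': 0, 'end': 41},
         'entity': 'The Massachusetts Institute of Technology',
         'label': 'ORG',
         'word_pos': {'start': 0, 'end': 5}},
        {'char_pos': {'start': 78, 'end': 87},
         'entity': 'Cambridge',
         'label': 'GPE',
         'word_pos': {'start': 11, 'end': 12}},
        {'char_pos': {'start': 89, 'end': 102},
         'entity': 'Massachusetts',
         'label': 'GPE',
         'word_pos': {'start': 13, 'end': 14}},
        {'char_pos': {'start': 104, 'end': 117},
         'entity': 'United States',
         'label': 'GPE',
         'word_pos': {'start': 15, 'end': 17}},
    ],
    'status': 'OK',
}


def entities_with_label(result, label):
    """Hypothetical helper: names of all entities carrying the given label."""
    return [e['entity'] for e in result['entities'] if e['label'] == label]


# Slicing the text with char_pos recovers each entity string.
spans_match = all(
    text[e['char_pos']['start']:e['char_pos']['end']] == e['entity']
    for e in result['entities']
)

gpe_names = entities_with_label(result, 'GPE')
print(spans_match)  # True
print(gpe_names)    # ['Cambridge', 'Massachusetts', 'United States']
```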



nlp_token_clean(text, model=None, model_config=None)

Cleans a token according to the provided model.


    text (str): the token of interest

    model (str): the name of a valid token model

    model_config (map): A map of options to configure the model


    The input token, cleaned according to the logic of the token model.


    nlp_token_clean('20IB0A1B', model='matcher:only-digits') -> '20180418'
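
The internal logic of 'matcher:only-digits' is not described here. As a rough illustration only, the example above is consistent with a model that replaces letters resembling digits (as in OCR misreads) with the digits they look like. The mapping and the function name clean_only_digits below are assumptions made for this sketch.

```python
# Assumed look-alike map; the real model's substitutions are not documented.
LOOKALIKES = {'I': '1', 'B': '8', 'A': '4', 'O': '0', 'S': '5'}


def clean_only_digits(token):
    """Hypothetical stand-in for a digits-only cleaning model."""
    # Replace each look-alike letter; leave every other character as is.
    return ''.join(LOOKALIKES.get(ch, ch) for ch in token)


cleaned = clean_only_digits('20IB0A1B')
print(cleaned)  # 20180418
```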


nlp_token_find(text, model=None, separator=None, tokenizer=None, model_config=None, tokenizer_config=None)

Tokenizes the input string and returns the best scoring token according to the provided matcher.

If a tokenizer is given, it is used. Otherwise, the default tokenizer specified in the tokenmatcher class is used. If neither is set, the unigram tokenizer is used.


    text (str): the text assumed to contain the token of interest

    model (str): the name of a valid token matcher

    separator (str): the string on which to split text into tokens

    tokenizer (str): the name of a valid tokenizer

    model_config (map): A map of options to configure the model

    tokenizer_config (map): A map of options to configure the tokenizer


    The best scoring token according to the provided matcher logic.


    nlp_token_find('Due on 20IB-0A-1B', model='matcher:only-digits', separator=' ') -> '20IB-0A-1B'
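
The tokenize-then-pick-best behavior can be sketched in a few lines. The scorer below is an assumption, not the documented matcher: it gives 1.0 per digit, 0.5 per letter that resembles a digit, and 0 otherwise, then averages over the token length. The names score_only_digits and token_find are hypothetical.

```python
LOOKALIKES = set('IBAOS')  # assumption: letters that resemble digits


def score_only_digits(token):
    """Hypothetical scorer: digit = 1.0, look-alike = 0.5, other = 0.0."""
    if not token:
        return 0.0
    points = sum(1.0 if ch.isdigit() else 0.5 if ch in LOOKALIKES else 0.0
                 for ch in token)
    return points / len(token)


def token_find(text, separator=' '):
    """Split on the separator and return the highest-scoring token."""
    tokens = [t for t in text.split(separator) if t]
    return max(tokens, key=score_only_digits) if tokens else None


best = token_find('Due on 20IB-0A-1B')
print(best)  # 20IB-0A-1B
```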


nlp_token_find_all(text, model=None, separator=None, threshold=None, tokenizer=None, model_config=None, tokenizer_config=None)

Tokenizes the input string and returns all tokens that score above the threshold, according to the provided matcher.

If a tokenizer is given, it is used. Otherwise, the default tokenizer specified in the tokenmatcher class is used. If neither is set, the unigram tokenizer is used.


    text (str): the text assumed to contain the token of interest

    model (str): the name of a valid token matcher

    separator (str): the string on which to split text into tokens

    threshold (float): the score a token must exceed to be considered a fit for the model. Default: 0.8.

    tokenizer (str): the name of a valid tokenizer

    model_config (map): A map of options to configure the model

    tokenizer_config (map): A map of options to configure the tokenizer


    A list of all tokens that score above the threshold, according to the provided matcher logic.


    nlp_token_find_all('ID: 20I80A1B', model='matcher:only-digits', separator=' ') -> ['20I80A1B']
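
To make the role of the threshold concrete, here is a minimal sketch that keeps every token scoring above it. The scoring rule (1.0 per digit, 0.5 per digit look-alike letter, averaged over the token) is an assumption chosen because it agrees with the example above. The names score_only_digits and token_find_all are hypothetical.

```python
LOOKALIKES = set('IBAOS')  # assumption: letters that resemble digits


def score_only_digits(token):
    """Hypothetical scorer: digit = 1.0, look-alike = 0.5, other = 0.0."""
    if not token:
        return 0.0
    points = sum(1.0 if ch.isdigit() else 0.5 if ch in LOOKALIKES else 0.0
                 for ch in token)
    return points / len(token)


def token_find_all(text, separator=' ', threshold=0.8):
    """Return every token whose score is strictly above the threshold."""
    tokens = [t for t in text.split(separator) if t]
    return [t for t in tokens if score_only_digits(t) > threshold]


matches = token_find_all('ID: 20I80A1B')
print(matches)  # ['20I80A1B']
```

Under this assumed scorer, '20I80A1B' scores 0.8125 and passes, while 'ID:' scores well below 0.8 and is dropped. Lowering the threshold admits more tokens.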


nlp_token_score(text, model=None, model_config=None)

Scores a token from 0 to 1.0 according to the provided matcher.


    text (str): the token of interest

    model (str): the name of a valid token matcher

    model_config (map): A map of options to configure the model


    A score for the input token, from 0 to 1.0, according to the logic

    of the token matcher.


    nlp_token_score('20IB0A1B', model='matcher:only-digits') -> 0.75
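
The matcher's scoring formula is not documented here. One rule that happens to reproduce the 0.75 above is to award 1.0 per digit and 0.5 per letter that resembles a digit, then divide by the token length. The sketch below implements that assumed rule. The look-alike set and the name score_only_digits are illustrative only.

```python
LOOKALIKES = set('IBAOS')  # assumption: letters that resemble digits


def score_only_digits(token):
    """Hypothetical scorer returning a value between 0 and 1.0."""
    if not token:
        return 0.0
    points = 0.0
    for ch in token:
        if ch.isdigit():
            points += 1.0   # a real digit counts fully
        elif ch in LOOKALIKES:
            points += 0.5   # a plausible misread counts half
    return points / len(token)


score = score_only_digits('20IB0A1B')
print(score)  # 0.75
```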


nlp_token_select(*args: Any)

Returns the best scoring token, among provided inputs, according to the provided matcher.


    args: dict containing:

        text1 .. textN (str): the tokens of interest.

        model (str): the name of a valid token matcher

        model_config (map): A map of options to configure the model


    The best scoring token according to the provided matcher logic.


    nlp_token_select('20IB-0A-1B', '2018-01-20', model='matcher:only-digits') -> '2018-01-20'
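
Selection among already-separated candidates amounts to scoring each one and keeping the maximum. The sketch below assumes a scorer of 1.0 per digit and 0.5 per digit look-alike letter, averaged over the token length. The names token_select and score_only_digits are hypothetical and stand in for the real matcher.

```python
LOOKALIKES = set('IBAOS')  # assumption: letters that resemble digits


def score_only_digits(token):
    """Hypothetical scorer: digit = 1.0, look-alike = 0.5, other = 0.0."""
    if not token:
        return 0.0
    return sum(1.0 if c.isdigit() else 0.5 if c in LOOKALIKES else 0.0
               for c in token) / len(token)


def token_select(*tokens, scorer=score_only_digits):
    """Return the highest-scoring candidate; the first one wins ties."""
    if not tokens:
        return None
    return max(tokens, key=scorer)


chosen = token_select('20IB-0A-1B', '2018-01-20')
print(chosen)  # 2018-01-20
```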