Custom Classifier guide

Most document processing projects require the incoming documents to be sorted, or classified. Instabase provides an app called Classifier to perform this sorting operation.

The automatic Instabase Classifier works with word frequency. Sometimes, your documents are complex or similar enough that you need more specific logic.

In this guide, we will learn to extend Classifier with custom Python, so that we can label input documents as small, medium, or large.

Prerequisites

Before this exercise, complete the Flow guide, the Refiner guide, and the Classifier guide. All of these are in Core concepts.

For this exercise, we’ll be working with ADP and Gusto paystubs. You can use your Metaflow paystub project created from previous guides if you want to.

Or, you can start from our Metaflow paystub solution:

  • Create a new workspace and remove any initialized directories. Unzip the custom-classifier-starter.zip on your computer and drag your unzipped paystub-extraction folder into your blank workspace.

1. Why Use Custom Classifiers

Sometimes, Instabase’s default Classifier is not robust enough to catch all the types of documents that we want. In this case, we can extend the Classifier by building a custom Classifier. Building a custom Classifier makes the most sense in two cases:

  • Case 1: The documents are extremely simple to classify by hand. For example, maybe every document you’ll sort begins with some standard phrase, like: DEPARTMENT: 4. In this situation, rather than use a machine learning model, it might make sense to build a Classifier that extracts the department number and returns something like department_4 as the result.

  • Case 2: You have a sophisticated classifier from your engineering team that you would like to use. Maybe your engineering department has just developed the latest deep learning technology that knocks the socks off everything out there, and you’d like to use it with Instabase.

Custom Classifiers can implement heuristic models to label documents.

Custom Classifiers are implemented in Python, placed inside a special scripts directory in Instabase, and then registered for use with our platform. This guide will walk you through the required steps to create one and attach it to your Flow.

2. Implementing a Custom Classifier

A Classifier is a Python class that implements the following interface:

class Classifier(object):
  """
  Interface for an Instabase Classifier.
  """

  def get_type(self):
    # type: () -> Text
    raise NotImplementedError(u'To be implemented.')

  def get_version(self):
    # type: () -> Text
    raise NotImplementedError(u'To be implemented.')

  def train(self, training_context, reporting_context):
    # type: (ClassifierTrainingContext, ClassifierReportingContext) -> Tuple[Dict, Text]
    """
    Deprecated method for training a classifier
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')

  def predict(self, datapoint):
    # type: (ClassifierInput) -> Tuple[ClassifierPrediction, Text]
    raise NotImplementedError(u'To be implemented.')

  def export_parameters_to_string(self):
    # type: () -> Tuple[Text, Text]
    """
    Deprecated method for returning a representation of the trained model as a string.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')

  def load_parameters_from_string(self, model_string, model_metadata=None):
    # type: (Text, Dict) -> Tuple[bool, Text]
    """
    Deprecated method for loading a trained model from a string.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')

Your custom Classifier is responsible for:

  • Reporting its own type and version. Here, type means the string returned by get_type(). You can give your Classifier any type you want, but it is good practice to make it specific and descriptive, like "ib:heuristic-classifier".

  • Serializing and de-serializing its model to a string.

  • Predicting a label, given a datapoint.

  • Any custom progress, status, or error logging.

Serializing and de-serializing your model

When asked to save, your Classifier must encode its model as a UTF-8 string.

This will be stored by Instabase inside a JSON wrapper containing more metadata about your Classifier. If your model contains data that cannot be stored inside a JSON UTF-8 string, we recommend using Base64 encoding.
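One way to satisfy the UTF-8 requirement is to JSON-encode the model state and Base64-wrap the bytes. A minimal sketch, assuming the model state fits in a plain Python dict (the helper names here are illustrative, not part of the Classifier interface):

```python
import base64
import json

def model_to_string(model_dict):
  # JSON-encode the model state, then Base64 the bytes so the result is
  # a plain ASCII (and therefore valid UTF-8) string.
  raw = json.dumps(model_dict).encode('utf-8')
  return base64.b64encode(raw).decode('ascii')

def model_from_string(model_string):
  # Reverse the steps: Base64-decode, then parse the JSON payload.
  raw = base64.b64decode(model_string.encode('ascii'))
  return json.loads(raw.decode('utf-8'))
```

Round-tripping a model through these helpers returns an equal dict, and the intermediate string is safe to embed in Instabase's JSON wrapper.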

Predicting

The predict() method returns a ClassifierPrediction object, which implements the following interface:

class ClassifierPrediction(object):
  """
  Wrapper for the result of a classification.
  """

  def __init__(self, best_match, debugging_data=None):
    # type: (Text, Dict) -> None
    self.best_match = best_match # type: Text

    # For anything else.
    self.debugging_data = debugging_data # type: Dict

The only required field is best_match, which is the string label your Classifier has predicted. The debugging_data field can be used to store any additional information (such as a distribution over labels).
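For example, a classifier that scores every label could surface the full score distribution through debugging_data. A self-contained sketch (the class below restates the interface above so the snippet runs on its own; the scores are made up):

```python
class ClassifierPrediction(object):
  """Restated from the interface above, so this sketch is self-contained."""
  def __init__(self, best_match, debugging_data=None):
    self.best_match = best_match
    self.debugging_data = debugging_data or dict()

# Hypothetical scores: the classifier is 82% confident this is a paystub.
prediction = ClassifierPrediction(
    u'paystub',
    debugging_data={u'label_scores': {u'paystub': 0.82, u'other': 0.18}})
```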

Activity

  1. Create a new folder in your classifiers folder. Call the new folder scripts.

  2. Create a new file. Name it new_custom_classifier.py.

  3. Open new_custom_classifier.py in a text editor.

  4. We are going to make a custom Classifier that sorts documents based on their size. First, copy and paste these helper functions into new_custom_classifier.py:

from typing import Text, Dict, List, Union, Tuple, Any, Callable, Set
_PAGE_IN_CHARS = 80 * 48
_DOC_MEDIUM = 2 * _PAGE_IN_CHARS
_DOC_LARGE = 10 * _PAGE_IN_CHARS

def _ibocr_to_text(ibocr): # extract text from an IBOCR
  # type: (Any) -> Tuple[Text, Text]
  """
  Transform an IBOCR file in Python representation into a string containing
  the concatenation of each page.
  """
  page_texts = []
  for i in range(ibocr.get_num_records()):
    cur_txt, err = ibocr.get_text_at_record(i)
    if err:
      return None, err
    page_texts.append(cur_txt)

  return u'\n'.join(page_texts), None

class DocsizePrediction(object): # You don't need to subclass Prediction
  def __init__(self, best_match):
    self.best_match = best_match
    self.debugging_data = dict()

  5. Copy and paste this example code for a custom Classifier:

class DocsizeDemoClassifier(object): # You don't need to subclass Classifier
  """
  This is a demo heuristic Classifier.
  """

  def get_type(self):
    return u'ib:heuristic-demo'

  def get_version(self):
    return u'1.0.0'

  def train(self, training_context, reporting_context):
    """
    No training is necessary; this is a heuristic model.
    """
    return dict(), None

  def export_parameters_to_string(self):
    """
    Returns an empty string; this is a heuristic model.
    """
    return u'', None

  def load_parameters_from_string(self, model_string, model_metadata=None):
    """
    No-op; this is a heuristic model.
    """
    return True, None

This DocsizeDemoClassifier will be used to classify documents by size, but at this point it doesn’t do anything, because it doesn’t yet have a predict() method.

  6. We want to predict what type a document is based on whether it is small, medium, or large. Add a predict() method to your DocsizeDemoClassifier. It could look like this example:

def predict(self, datapoint):
  """
  Classifies a document into categories EMPTY, SMALL, MEDIUM, and LARGE
  based on arbitrary thresholds defined in the constants at the top of the
  file.
  """

  if datapoint.get_ibocr() and not datapoint.get_text():
    text_content, ibocr_error = _ibocr_to_text(datapoint.get_ibocr())
    if ibocr_error:
      return None, u'Could not transform IBOCR file to text'
    datapoint.set_text(text_content)

  best_match = u'EMPTY'
  the_text = datapoint.get_text()

  if the_text:
    if len(the_text) > _DOC_LARGE:
      best_match = u'LARGE'
    elif len(the_text) > _DOC_MEDIUM:
      best_match = u'MEDIUM'
    else:
      best_match = u'SMALL'

  return DocsizePrediction(best_match), None

This example model uses a simple size heuristic: it sorts incoming documents by text length into small, medium, and large.
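To see the thresholds in action, the decision tree inside predict() can be restated as a standalone function and checked against a few synthetic lengths (a simplification for illustration, not part of the class):

```python
_PAGE_IN_CHARS = 80 * 48          # 3,840 characters per page
_DOC_MEDIUM = 2 * _PAGE_IN_CHARS  # 7,680: at or below this is SMALL
_DOC_LARGE = 10 * _PAGE_IN_CHARS  # 38,400: above this is LARGE

def size_label(text):
  # Same decision tree as predict(), operating on a plain string.
  if not text:
    return u'EMPTY'
  if len(text) > _DOC_LARGE:
    return u'LARGE'
  if len(text) > _DOC_MEDIUM:
    return u'MEDIUM'
  return u'SMALL'

# One page of text is SMALL; fifteen pages is LARGE.
# size_label('x' * _PAGE_IN_CHARS)      -> u'SMALL'
# size_label('x' * 15 * _PAGE_IN_CHARS) -> u'LARGE'
```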

  7. Save the changes to new_custom_classifier.py.

  8. Go to the Classifier app. Create a new Classifier and give it the Classifier type Custom Classifier. Select the scripts folder inside your classifiers folder.

Uh oh! Expect an error here. We haven’t registered our new Classifier, so Instabase doesn’t know yet that it exists. We’ll fix this in the next section.

3. Registering your Classifier

To make your custom Classifier available to Instabase, you must register it inside the same Python module that defines it. This is done by creating a special register_classifiers function that Instabase will look for.

This function returns a Python dictionary of the following form:

def register_classifiers():
  return {
    'company:classifier-name': {
      'class': MyClassifierClass
    }
  }

For example:

def register_classifiers():
  return {
    'instabase:size-heuristic': {
      'class': SizeHeuristicClassifier
    }
  }

This function tells Instabase to use the SizeHeuristicClassifier object when a user selects the 'instabase:size-heuristic' custom Classifier option.

  1. Open new_custom_classifier.py with the Text Editor.

  2. Copy this code to the end of your file:

def register_classifiers():
  return {
    'my-example-classifier': {
      'class': DocsizeDemoClassifier
    }
  }

  3. Save the changes.

  4. Go to the Classifier app. Create a new Classifier and give it the Classifier type Custom Classifier. Select the scripts folder inside your classifiers folder. Now, you can see my-example-classifier as an option. Select that one.

Awesome! Now, you’ve created a Classifier that has functionality beyond the normal bigram classification. This one sorts by size, but you can create a Classifier that classifies by any number of characteristics. If a document has the word “PAYSTUB” in it, for example, you could catch that in a custom Classifier.

When you get the hang of using heuristics to classify documents with your specific structure, you’ll be able to solve any classification project that comes your way.
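As a quick sketch of that keyword idea, here is a hypothetical Classifier that labels a document u'paystub' whenever the word PAYSTUB appears in its text. The names and labels are illustrative; it follows the same shape as DocsizeDemoClassifier, with a minimal prediction wrapper and a stub datapoint so the snippet is self-contained:

```python
class _Prediction(object):
  """Minimal stand-in for ClassifierPrediction."""
  def __init__(self, best_match):
    self.best_match = best_match
    self.debugging_data = dict()

class FakeDatapoint(object):
  """Stub standing in for ClassifierInput; only get_text() is needed here."""
  def __init__(self, text):
    self._text = text

  def get_text(self):
    return self._text

class KeywordDemoClassifier(object):
  """
  Hypothetical sketch: labels a document u'paystub' when the word
  PAYSTUB appears in its text, and u'other' otherwise.
  """

  def get_type(self):
    return u'ib:keyword-demo'

  def get_version(self):
    return u'1.0.0'

  def predict(self, datapoint):
    text = datapoint.get_text() or u''
    label = u'paystub' if u'PAYSTUB' in text else u'other'
    return _Prediction(label), None
```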

4. Why Use Ensemble Classifiers

Let’s say we have a Classifier that correctly predicts 9 out of 10 document types, but it continually gets the 10th document type mixed up with the 8th document type. Instead of starting over and trying to create a Classifier that predicts all types correctly, we can chain our existing Classifier with another Classifier that would provide specific logic to handle that last case.

When we combine multiple Classifiers into one classification process, we call it an Ensemble Classifier.

On Instabase, you can create an Ensemble Classifier that:

  • Applies multiple Classifiers, including custom ones, to a piece of data

  • Applies custom logic to decide which result to select for each datapoint

  • Applies custom pre-processing on Classifier inputs before prediction

5. Implementing an Ensemble Classifier

An Ensemble Classifier is a Python class that implements the same interface that our singular custom Classifiers implement.

An Ensemble Classifier has an extra method called get_ensemble_types():

def get_ensemble_types(self):
  # type: () -> List[Text]
  """
  Return the types of the classifiers to form an ensemble from. Classifier types
  must come from the Custom Classifiers occupying the same scripts/
  folder. An Ensemble Classifier can't request to contain another copy
  of itself, such as the type returned by self.get_type()
  """
  return ['ib:custom-classifier']

Your Ensemble Classifier is responsible for:

  • Reporting its type and version

  • Reporting the Classifier types it wraps

  • Making a decision about the Ensemble’s final output

  • Optional: Manipulating incoming datapoints for prediction

Classifier Types

The get_ensemble_types function in an Ensemble Classifier can request the type of any custom Classifier in the same scripts folder as the Ensemble Classifier.

Modifying the input

Ensemble Classifiers allow us to preprocess a datapoint before classification is performed. We put the preprocessing inside a function called classifiers_will_predict(), which takes in one datapoint.

def classifiers_will_predict(self, datapoint):
  # type: (ClassifierInput) -> ClassifierInput
  """
  Called before prediction. Provides an opportunity to modify the
  input.
  """
  return datapoint

For example, we might want to transform all text to lowercase and take out common words before we look for regex matches.

In that case, our classifiers_will_predict function would look something like:

def classifiers_will_predict(self, datapoint):
  # type: (ClassifierInput) -> ClassifierInput
  """
  Called before prediction. Provides an opportunity to modify the
  input.
  """
  text = datapoint.get_text()
  text = text.lower()

  # Remove some common words:
  common_words_to_remove = ["the", "and", "a", "an", "of"]
  text_words = text.split()
  text_list = [word for word in text_words if word not in common_words_to_remove]
  text = ' '.join(text_list)

  datapoint.set_text(text)
  return datapoint
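The effect of that preprocessing can be checked on a plain string by restating the method body as a standalone function (for illustration only):

```python
def preprocess_text(text):
  # Lowercase the text and strip a few common words, mirroring the
  # classifiers_will_predict example above.
  text = text.lower()
  common_words_to_remove = ["the", "and", "a", "an", "of"]
  words = [word for word in text.split() if word not in common_words_to_remove]
  return ' '.join(words)

# preprocess_text("The Statement of Earnings") -> "statement earnings"
```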

Classify with Additional Logic

After all of the Classifiers have run, we can add conventional logic to sort the documents. We can put this logic into the function classifiers_did_predict(), which runs after the classification occurs.

def classifiers_did_predict(self, original_datapoint, modified_datapoint, predictions):
  # type: (ClassifierInput, ClassifierInput, List[ClassifierPrediction]) -> Tuple[ClassifierPrediction, Text]
  """
  Called after prediction. Requires the implementor to decide which output is correct.
  User can additionally add their own heuristic output as an override for special cases.
  """

  return predictions[0], None

For example, you could use a keyword or regex search to check whether a document that was classified as "other" is actually a paystub.

To do this, you would check in classifiers_did_predict() if predictions[0] contains the strings "paystub", "pay check", "pay cheque", or any other strings you have specifically identified that paystub documents have. If it has any of these words, you could return the type "paystub", instead of the original "other".

In this case, the classifiers_did_predict would look something like:

def classifiers_did_predict(self, original_datapoint, modified_datapoint, predictions):
  # type: (ClassifierInput, ClassifierInput, List[ClassifierPrediction]) -> Tuple[ClassifierPrediction, Text]
  """
  Called after prediction. Requires the implementor to decide which output is correct.
  User can additionally add their own heuristic output as an override for special cases.
  """

  prediction = predictions[0]
  text = modified_datapoint.get_text()
  keywords = ["paystub", "pay check", "pay cheque"]

  # If the Classifier called the document 'other', check whether it's actually a paystub
  if prediction.best_match == "other":
    for keyword in keywords:
      if keyword in text:
        return ClassifierPrediction("paystub"), None

  return prediction, None
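The override logic itself can be exercised in isolation by restating it as a standalone function (an illustration of the decision, not the full method):

```python
def resolve_label(best_match, text):
  # Mirrors the override in classifiers_did_predict above: if the first
  # Classifier said "other" but the text contains a paystub keyword,
  # override the label.
  keywords = ["paystub", "pay check", "pay cheque"]
  if best_match == "other":
    for keyword in keywords:
      if keyword in text:
        return "paystub"
  return best_match

# resolve_label("other", "monthly paystub for jane doe") -> "paystub"
# resolve_label("invoice", "paystub attached")           -> "invoice"
```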

Conclusion

Great! Now you have seen how to create and use a custom Classifier to sort documents that you know the structure of.

Together, we saw how to:

  • Create a custom Classifier

  • Create an Ensemble Classifier

  • Register your new Classifiers with the Instabase platform

If you have any questions, comments, or concerns, reach out to us at training@instabase.com. We’d love to chat about anything that came up while you completed this guide.