Deep learning

Instabase document understanding capabilities are powered by deep learning models.

Supported models

You have multiple options for models depending on your specific use case.

Natively trained models

Instabase provides the following base transformer models natively on the platform. In ML Studio, you can fine-tune these base models on use-case-specific documents to solve document understanding use cases for both classification and extraction.

  • Layout aware transformer models: These transformer models are trained on a large dataset of documents. They take into account layout information by utilizing spatial information of the words in the document. These models are useful for solving use cases such as application forms, paystubs, invoices, or bank statements, where layout is an important consideration.

  • Natural language transformer models: These transformer models are trained on a large dataset of language text. They take into account the sequence of words without any layout consideration. These models are useful for extraction and classification on natural language documents, including articles, letters, websites, and emails.

Supported pre-trained, third-party models

Instabase natively provides several pre-trained models that allow you to do various language tasks such as translation, summarization, and question answering, among others. These models are available through Model Catalog and can be used in Flow for building applications.

Bring your own pre-trained, third-party models

Instabase enables you to run a small set of approved, pre-trained, third-party models on the platform. The supported models include any model trained using one of the following base transformers:

  • bert-base-uncased - BERT is a transformers model pre-trained on a large corpus of English data in a self-supervised fashion. This model is primarily aimed at being fine-tuned on tasks that use the whole sentence to make decisions, such as sequence classification, token classification, or question answering.

  • distilbert-base-uncased - DistilBERT is a transformers model, smaller and faster than BERT, pre-trained on the same corpus in a self-supervised fashion using the BERT base model as a teacher. It is a distilled version of the BERT base model. Like BERT, it is aimed at being fine-tuned on tasks such as sequence classification, token classification, or question answering.

  • legal-bert-base-uncased - Trained on diverse English legal text from several fields scraped from publicly available resources, including legislation, court cases, and contracts. Sub-domain variants (CONTRACTS-, EURLEX-, ECHR-) and general LEGAL-BERT perform better than BERT out of the box on domain-specific tasks. This model is helpful for handling contracts and agreements.

  • Bio_ClinicalBERT - Pre-trained on the PubMed dataset and the MIMIC dataset (health records). It should be more accurate than general-domain models on natural language medical use cases.
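
As an illustrative sketch (not an Instabase API), you could check locally whether a fine-tuned model derives from one of the supported base architectures by inspecting the Hugging Face-style config.json that ships with most transformer checkpoints. The directory layout and the check below are assumptions for illustration:

```python
import json
from pathlib import Path

# Base architectures corresponding to the supported checkpoints above.
# legal-bert-base-uncased and Bio_ClinicalBERT are both BERT-based,
# so they report "bert" as their model_type in config.json.
SUPPORTED_MODEL_TYPES = {"bert", "distilbert"}

def is_supported_base(model_dir: str) -> bool:
    """Return True if the model's config.json declares a supported base architecture."""
    config_path = Path(model_dir) / "config.json"
    with open(config_path) as f:
        config = json.load(f)
    return config.get("model_type") in SUPPORTED_MODEL_TYPES
```

A model trained from a RoBERTa or other unsupported base would fail this check and would need the case-by-case evaluation described below.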

Our catalog of supported third-party models continues to expand. If there’s specific interest in running a third-party, pre-trained model in production that falls outside of these guidelines, we can evaluate it on a case-by-case basis.

Migrate legacy models

Models created on the Instabase platform prior to Release 21.10.0 are not supported and must be migrated. Before 21.10.0, models were identified by model name only, which caused a number of issues when a model was updated or renamed. Models are now uniquely referenced by a model name and a version, which is enforced when a model is published. All model usage in UDFs, Refiners, and Flows must be updated to specify a version.

  1. Find the folder that contains your saved model. If you're coming from Annotator, this is often a folder called saved_model that includes the package.json file. Right-click that folder, click Package, and then click Model.

  2. Open the Marketplace Admin app, go to Tools > Publish, and select the .ibsolution file that was generated in the previous step. Make sure to note down the name of your model and the version number (found in the package.json). This step can take a few minutes.

    Note

    If needed, change the model name and version inside package.json, for example if these values clash with an existing published model.
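
    As an illustration, the name and version fields in package.json might look like this (the values are placeholders matching the example in step 3; other fields in the file are omitted):

```json
{
  "name": "UberEats",
  "version": "0.0.1"
}
```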

  3. Within your Refiner program script, look for the run_model function. Inside that function, add a model_version argument set to the version of the model from your package.json, and remove the path to the model. Before migration, your function looks something like this:

    def run_model(INPUT_COL: Value[str], **kwargs) -> ModelResultType:
        ip_sdk = IntelligencePlatform(kwargs, ['path_to_model'])
        ibocr, err = kwargs['_FN_CONTEXT_KEY'].get_by_col_name('INPUT_IBOCR_RECORD')
        if err:
            raise KeyError(err)
        record = ParsedIBOCRBuilder()
        record.add_ibocr_records([ibocr])
    results = ip_sdk.run_model('UberEats',
                               input_record=record,
                               force_reload=False,
                               refresh_registry=False,
                               **kwargs)
        return ner_result_to_values(results, INPUT_COL)
    

    It should be converted to the following:

    def run_model(INPUT_COL: Value[str], **kwargs) -> ModelResultType:
        ip_sdk = IntelligencePlatform(kwargs)
        ibocr, err = kwargs['_FN_CONTEXT_KEY'].get_by_col_name('INPUT_IBOCR_RECORD')
        if err:
        raise KeyError(err)
        record = ParsedIBOCRBuilder()
        record.add_ibocr_records([ibocr])
        results = ip_sdk.run_model('UberEats', 
                                   model_version='0.0.1',  # This line is newly added.
                                   input_record=record, 
                                   force_reload=False,
                                   refresh_registry=False,
                                   **kwargs)
        return ner_result_to_values(results, INPUT_COL)
    
  4. Go back to your Refiner program, click Run, and verify that it now works.