ibformers 2.2.0

What’s new in v2.0

  • Supports metrics for extraction.

  • Ability to write a confusion matrix file for classification or split-classification task.

  • Ability to write dataset counts to a JSON file.

  • Long-doc retriever and multilabel features are now available as separate checkboxes for all base models, instead of only for instalm.

  • Public preview | Support for list fields. Extraction models are able to predict multiple text items instead of a single text value.

  • Public preview | Added ONNX support for layoutlm-base-uncased for the classification and extraction task types, and for instalm-base-draft for the extraction task type.

Supported models & pipelines

Extraction

instalm_base

Internal Instabase model. It shares the architecture of layoutlm-base-uncased, but it is heavily pretrained on additional tasks to enhance document understanding capabilities. On internal benchmarks, it outperforms layoutlm on all datasets. The model task is formulated as labeling each word with a class for each named field and a class named “O” for words that do not belong to any entity.

layout_base

LayoutLM-base model. One of the first models that can leverage the layout of a document to understand it better. The model task is formulated as labeling each word with a class for each named field and a class named “O” for words that do not belong to any entity.

layout_large

LayoutLM-large model. One of the first models that can leverage the layout of a document to understand it better. The model task is formulated as labeling each word with a class for each named field and a class named “O” for words that do not belong to any entity.

layout_multilingual

Base size of the multi-language layoutxlm model. It adds image input and relative positional embeddings. The model task is formulated as labeling each word with a class for each named field and a class named “O” for words that do not belong to any entity.

This model is good for datasets in languages other than English or mixed-language datasets. English-only documents are also fine to try. The LayoutXLM model supports 53 languages (en, de, ja, es, fr, it, pt, zh-tw, zh-cn, nl, pl, sv, cs, ru, da, ro, no, id, hu, tr, hr, sk, fi, vi, th, sl, el, et, ar, ko, it, fa, bg, he, uk, tl, mk, so, lv, sq, af, sw, bn, hi, ur, mr, gu, ta, ne, ml, te, kn, pa).

layoutlm_v2

Base size of the layoutlmv2 model. It adds image input and relative positional embeddings. The model task is formulated as labeling each word with a class for each named field and a class named “O” for words that do not belong to any entity.

Even though the authors claim that v2 is significantly better than v1, these models require more documents to get satisfying results. Try it if you have more than 100 examples.

layoutlm_v3

Base size of the layoutlmv3 model. It adds image input and relative positional embeddings. The model task is formulated as labeling each word with a class for each named field and a class named “O” for words that do not belong to any entity.

Might be useful for datasets where both text and image modalities are important to understand a document.

instalmv3

Combines some of the ideas of instalm_base and layoutlm_v3. The starting weights and architecture come from layoutlm_v3, with the instalm pretraining regime added on top. Unlike layoutlm_v3, it does not use the image modality, which makes it faster in training and inference.

It outperforms both its parent models in benchmark tests.

SingelQA_base

By default, this pipeline uses the bert-base-uncased model. The difference lies in the task formulation. This pipeline uses Extractive QA, which points to the start and end of the answer, which is a part of the document content. It leverages the name of the field when answering the question, so a meaningful field name is important. This model is significantly slower, because the model queries each field separately. If the given dataset has 10 different fields, it runs on average 10x slower than word classification models.

natural_language

Bert model. One of the first transformer models, widely popular and often used as a baseline when comparing different models. It uses only plain text input and therefore has limited capabilities for understanding complex documents. The model task is formulated as labeling each word with a class for each named field and a class named “O” for words that do not belong to any entity.

Legal Bert model. A bert-base model additionally trained on legal text. It uses only plain text input and therefore has limited capabilities for understanding complex documents. The model task is formulated as labeling each word with a class for each named field and a class named “O” for words that do not belong to any entity.

Classification

layout_base

LayoutLM-base model. One of the first models that can leverage the layout of a document to understand it better. The model task is formulated as labeling each page of the document. Scores are aggregated on the document level to obtain a final class prediction.

traditional_ml_classifier

Uses classical ML algorithms from the scikit-learn library. By default, it uses TF-IDF vectors and an AdaBoost classifier. It uses only plain text input and therefore has lower capabilities for understanding layout. It is much faster in training and inference than the LayoutLM deep learning model.

Split classification

bert_small_sequence_split_classifier

Based on a smaller version of the bert model. It processes a sequence of pages instead of processing them one by one, providing the same accuracy with much faster inference.

sentence_bert_split_classifier

Sentence Bert model. A smaller bert model which is fine-tuned to provide embeddings for sentences. It provides great results for the split classification problem.

sentence_bert_split_classifier_aug

Same as for sentence_bert_split_classifier. In addition, training data is augmented by creating documents out of randomly concatenated records.

layout_split_classifier

LayoutLM-base model. One of the first models that can leverage the layout of a document to understand it better. The model performs two tasks. The first is formulated as labeling each contiguous page pair of the document to detect whether there is a split which divides the document into parts. The second is to classify the obtained parts of the document (records) into the defined classes.

layout_split_classifier_aug

Same as for layout_split_classifier. In addition, training data is augmented by creating documents out of randomly concatenated records.

Table extraction

table_detr_base

This model uses the DETR architecture to detect objects. The model was pretrained internally on the PubTables-1M dataset.

Additional model characteristics

Extraction models

Name Size Training time factor Max training GPU memory utilization* Max prediction CPU memory utilization** Modalities
layoutlm-base-uncased 432 MB 1x 5.1 GB 3.7 GB text, spatial
instalm-base-draft 432 MB 1x 5.1 GB 3.7 GB text, spatial
instalmv3 628 MB 1.3x 5.1 GB 3.8 GB text, spatial
layoutlm-large-uncased 1.27 GB 2.4x 11.0 GB 4.96 GB text, spatial
layoutlmv2-base 765 MB 3.1x 9.6 GB 4.7 GB text, spatial, image
layoutxlm-base 1.38 GB 3.1x 9.4 GB 4.98 GB text, spatial, image
bert-base-uncased 420 MB 0.9x 5.1 GB 3.8 GB text
legal-bert-base-uncased 420 MB 0.9x 5.1 GB 3.8 GB text

Metrics

Extraction metrics

Individual field metrics

The metrics below are computed at the field level for all documents in the test dataset. For each field we identify:

  • POSITIVES_COUNT - the number of predicted fields, also known as TP+FP (true positives + false positives)

  • GT_COUNT - the number of ground-truth fields, also known as TP+FN (true positives + false negatives); another popular name is “support”

  • TP_COUNT - the number of correctly predicted fields, also known as TP (true positives)

Based on that, the below metrics are computed (a short sketch follows this list):

  • Precision - This metric measures how many fields retrieved by the model are relevant. It’s defined by the formula TP / POSITIVES_COUNT.

  • Recall - This metric measures how many relevant fields are retrieved by the model. It’s defined by the formula TP / GT_COUNT.

  • F1 - Harmonic mean of precision and recall. Defined by the formula 2 / ((1 / Recall) + (1 / Precision)).
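As a minimal sketch (not the exact ibformers implementation), these per-field metrics can be computed from the three counts like this:

    def field_metrics(tp_count: int, positives_count: int, gt_count: int) -> dict:
        # Precision: share of predicted fields that are correct (TP / (TP + FP)).
        precision = tp_count / positives_count if positives_count else 0.0
        # Recall: share of ground-truth fields that were found (TP / (TP + FN)).
        recall = tp_count / gt_count if gt_count else 0.0
        # F1: harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}

    # Example: 8 predicted fields, 10 ground-truth fields, 7 correct -> P=0.875, R=0.7, F1≈0.78
    print(field_metrics(tp_count=7, positives_count=8, gt_count=10))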

Individual token level metrics

The metrics below are computed at the token level (a field is made up of a collection of tokens) for all documents in the test dataset. For each field we identify:

  • POSITIVES_COUNT - the number of predicted tokens, also known as TP+FP (true positives + false positives)

  • GT_COUNT - the number of ground-truth tokens, also known as TP+FN (true positives + false negatives); another popular name is “support”

  • TP_COUNT - the number of correctly predicted tokens, also known as TP (true positives)

Based on that, the below metrics are computed:

  • Precision - This metric measures how many tokens retrieved by the model are relevant. It’s defined by the formula TP / POSITIVES_COUNT.

  • Recall - This metric measures how many relevant tokens are retrieved by the model. It’s defined by the formula TP / GT_COUNT.

  • F1 - Harmonic mean of precision and recall. Defined by the formula 2 / ((1 / Recall) + (1 / Precision)).

Averaged model metrics

Based on the above metrics we compute model-level metrics by averaging them. There are two popular ways of doing so (a short sketch follows the list):

  • micro-averaged - this approach first computes POSITIVES_COUNT, GT_COUNT, and TP_COUNT across all fields and then computes Precision, Recall, and F1 from those totals in the same way. This implies that fields that are most frequent in the dataset and fields that are longer (occupy more words) are weighted more heavily in this metric.

  • macro-averaged - arithmetic mean (aka unweighted mean) over all individual classes (fields). This implies that a field that is very rare in the dataset gets the same weight as a very frequent one.
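A minimal sketch of the difference between the two averaging modes, using hypothetical per-field counts:

    # Hypothetical per-field counts: (TP_COUNT, POSITIVES_COUNT, GT_COUNT)
    counts = {
        "invoice_number": (90, 100, 100),  # frequent field
        "po_box": (1, 2, 5),               # rare field
    }

    def f1(tp, pos, gt):
        p = tp / pos if pos else 0.0
        r = tp / gt if gt else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    # micro: sum the counts first, then compute a single F1 (frequent fields dominate)
    tp, pos, gt = (sum(c[i] for c in counts.values()) for i in range(3))
    micro_f1 = f1(tp, pos, gt)

    # macro: compute F1 per field, then take the unweighted mean (rare fields count equally)
    macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

    print(micro_f1, macro_f1)  # the rare, poorly extracted field drags the macro average down much more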

ECE

Confidence scores are an important part of the output from our models since they inform customers how reliable predictions are. While we have some well-known metrics (F1/Accuracy/Recall/etc.) to measure model performance, we don’t have such a metric for measuring how close the confidence scores are to the real underlying probabilities.

ECE, an abbreviation for Expected Calibration Error, is the metric used in the literature to quantify a model’s performance in terms of calibration.

ECE can take values from 0.0 to 1.0, with 0.0 indicating a perfectly calibrated model. Intuitively, what we are trying to capture when we calculate this metric is the ratio of correct predictions over all the predictions that fall into each confidence bucket, compared to the average confidence score for that bucket. For example, for the bucket [0.5, 0.6), ideally we would like only 55% of the predictions that fall into the bucket to be correct and their average confidence score to be 0.55.

Another intuitive explanation of ECE (taken from an external post) is:

Larger values of ECE error indicate a larger difference between output confidence (pseudo-probability) and actual model accuracy of the prediction — larger miscalibration. Smaller values of ECE indicate less miscalibration. If ECE is 0, all output confidence values equal the actual accuracies — the network model is perfectly calibrated.
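A simplified sketch of how ECE can be computed with equal-width confidence buckets (the exact binning scheme used internally may differ):

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        # confidences: model confidence per prediction; correct: 1.0 if the prediction was right, else 0.0
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        # assign each prediction to one of n_bins equal-width buckets over [0, 1]
        bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
        ece = 0.0
        for b in range(n_bins):
            mask = bin_ids == b
            if not mask.any():
                continue
            avg_confidence = confidences[mask].mean()  # average confidence inside the bucket
            accuracy = correct[mask].mean()            # fraction of correct predictions inside the bucket
            weight = mask.sum() / len(confidences)     # share of all predictions in this bucket
            ece += weight * abs(avg_confidence - accuracy)
        return ece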

Table extraction metrics

Prerequisite - matching predictions with annotations

For each page of the document, we generate a set of detected tables and compare it to the set of gold (labeled) tables. We create predicted-gold pairs by matching each detected table to the gold table that yields the highest IOU. This is achieved by solving a linear sum assignment problem, where the cost matrix is the IOU between the predicted and gold tables.

Because we match a table if the IOU is non-zero, this process can leave us with tables (either gold or predicted) that don’t have a match assigned.
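A rough sketch of this matching step; iou_fn is an assumed helper that computes the intersection over union of two boxes, and since scipy’s linear_sum_assignment minimizes cost, the IOU matrix is negated:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_tables(pred_boxes, gold_boxes, iou_fn):
        # Build the IOU matrix between every predicted and every gold table on the page.
        iou = np.array([[iou_fn(p, g) for g in gold_boxes] for p in pred_boxes])
        if iou.size == 0:
            return [], list(range(len(pred_boxes))), list(range(len(gold_boxes)))
        # Maximize total IOU by minimizing its negation.
        pred_idx, gold_idx = linear_sum_assignment(-iou)
        # Keep only pairs with non-zero IOU; everything else stays unmatched.
        matched = [(p, g) for p, g in zip(pred_idx, gold_idx) if iou[p, g] > 0]
        matched_pred = {p for p, _ in matched}
        matched_gold = {g for _, g in matched}
        unmatched_pred = [p for p in range(len(pred_boxes)) if p not in matched_pred]
        unmatched_gold = [g for g in range(len(gold_boxes)) if g not in matched_gold]
        return matched, unmatched_pred, unmatched_gold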

In the next parts we will use the following terms:

  • matched pair - a pair of detected-gold tables that was matched with non-zero IOU

  • unmatched prediction table - a prediction table that didn’t have any match in the annotations

  • unmatched gold table - an annotated table that didn’t have any match in the predictions

Prerequisite - GRiTS

GRiTS is a table similarity metric proposed by Microsoft in the paper https://arxiv.org/pdf/2203.12555.pdf. It works by finding the most similar substructure between two tables and then averaging the cell similarity over the table’s cells. This general schema allows for plugging in different similarity functions that measure various aspects of table extraction (e.g. cell content or location). For our purposes, we use the content-based metrics, as in the end, it’s the content of the table that you want to have extracted.

Metric definitions

Dataset-level metrics
  • Cell Similarity - average GRiTS metric for content similarity calculated between the predicted and the gold tables. We calculate it as follows:

    • For each matched pair we calculate the GRITS cell content similarity

    • For each unmatched gold table we return 0 (the table wasn’t detected)

    • For each unmatched predicted table labeled as one of the labels we return 0 (model found the wrong table of a specified type)

    • We ignore extra tables returned by the model but labeled as “Other detected tables” - since the base model is a general table extractor, we allow it to detect other tables as long as they are not labeled.

The final score is the average of all the above.

This metric measures how well the predicted tables match the gold ones. This metric is influenced by both the detection and structure extraction models.

  • Table Label Accuracy - average table label accuracy over a whole dataset. Calculated only on the pairs of tables that were matched, just to remove any influence of the detection model on classification accuracy. This metric is affected only by the classification model.

  • Joint Extraction Score - a score that combines both cell similarity and classification:

    • For each matched pair that is correctly classified, we return its GRiTS cell content similarity

    • For all other pairs and unmatched tables we return 0

The final score is the average of all the above. This metric measures the combined performance of all three models.

  • Pure Cell Similarity - average GRiTS metric for content similarity calculated between all the matched table pairs:

    • For matched pairs, we return GRiTS cell content similarity

    • Unmatched tables are ignored

The final score is the average of all the above. This metric is affected only by the structure model.

  • Exact Table Match - a percentage of tables that were 100% correctly extracted:

    • For each matched pair, we return 1 if its GRiTS cell content similarity is equal to 1 and the label is properly assigned.

    • For all other matched pairs we return 0

    • We ignore extra tables returned by the model but labeled as “Other detected tables”

    • For all other unmatched tables we return 0

This metric is influenced by all three models.

  • Detection Precision - the percentage of tables returned by the model that are present in the annotations, regardless of the labeling:

    • For matched pairs, we return 1

    • For unmatched predicted tables, we return 0

    • We ignore unmatched gold tables

This metric is affected only by the detection model.

  • Detection Recall - the percentage of gold tables that were detected by the model, regardless of the labeling:

    • For matched pairs, we return 1

    • For unmatched gold tables we return 0

    • We ignore unmatched predicted tables

This metric is affected only by the detection model.

Field-level metrics

For the field-level metrics, we have classification F1, Precision, Recall and two metrics from the list above (Cell Similarity and Detection Recall).

Classification metrics are similar to Table Label Accuracy, but they are less forgiving since they also account for unmatched tables.

Cell Similarity allows you to assess which types of tables are handled better or worse by the structure model. It is a direct counterpart of the dataset-level Cell Similarity.

Detection Recall allows you to assess which types of tables are easier to detect (regardless of the predicted label). Again, it is a direct counterpart of the dataset-level Detection Recall.

Table label metrics

With the row and column labeling feature, there are new metrics that measure the model’s performance on this task. The metrics are designed to evaluate the extraction quality of the labeled part of the tables. That way we can properly describe cases where, for example, the overall performance of the model is not perfect, but the labeled fields that are important to you are extracted better.

Content Similarity is defined similarly to the general case of this metric, but here it’s calculated only on the labeled part of the gold and predicted tables.

Content Similarity Recall - we measure how much of the gold table’s labeled content was covered by the predicted table’s labeled content. This metric does not punish the model for predicting additional table content.

Content Similarity Precision - we measure how much of the predicted table’s labeled content was actually in the gold table’s labeled content. This does not punish the model for missing some of the actual table content.

Classification & Split Classification metrics

Record-wise splitter metrics (per class)

The metrics below are computed at the record level for all documents in the test dataset.

  • Support - number of records present for a given class

  • Correctly Split Records - number of records that were correctly classified for a given class

  • Accuracy - measured as correctly_splitted_records / total_records

Record-wise classifier metrics (per class)

In order to produce the statistics below, the following helper statistics are computed. For split classification, we assume that the splitter is 100% correct, so that the classification metrics are independent of the splitter.

  • POSITIVES_COUNT - the number of predicted records for a given class, also known as TP+FP (true positives + false positives)

  • GT_COUNT - the number of ground-truth records for a given class, also known as TP+FN (true positives + false negatives); another popular name is “support”

  • TP_COUNT - the number of correctly predicted records for a given class, also known as TP (true positives)

The metrics below are computed at the record level for all documents in the test dataset.

  • Support - number of records present for a given class

  • Precision - measures how many records retrieved by the model for a given class are relevant. It’s defined by the formula TP / POSITIVES_COUNT.

  • Recall - measures how many relevant records for a given class are retrieved by the model. It’s defined by the formula TP / GT_COUNT.

  • F1 - Harmonic mean of precision and recall. Defined by the formula 2 / ((1 / Recall) + (1 / Precision)).

Page-wise splitter metrics (per split type) - (to be deprecated)

Page-wise metrics are heavily affected by the number of pages within records, so treat them as a secondary metric.

These metrics measure the quality of the splitter at a more granular level. Since the splitter operates on page pairs, the metrics below show the performance of individual model predictions. Computations are similar to the record-wise classifier metrics, except that we use the number of page pairs instead of the number of records. Also, we operate on only two classes: Split and No Split.

Page-wise classifier metrics (per class) - (to be deprecated)

Page-wise metrics are heavily affected by the number of pages within records, so treat them as a secondary metric. Computations are similar to the record-wise classifier metrics, except that we use the number of pages instead of the number of records.

Calibration

Calibration is needed if the ECE of a trained model is high (>0.08). To calibrate the model, we need to set aside a validation set (which is needed to train the calibration model). An appropriate validation set is 0.1-0.2 times the size of the training set. To create a validation split, use the following hyperparameter: "validation_set_size": 0.2

You also need to select a calibration model. We offer two calibration models:

Platt Scaling Calibration

Platt Scaling is one of the simplest calibration options and works for most use cases. The major pros of this model are that it is lightweight and fast to train and that it doesn’t alter the prediction order (e.g. if we sort predictions based on confidence score, we get the same sequence before and after calibration).

To use Platt Scaling Calibration, set the following hyperparameter: "calibration_model": "PlattScalingCalibrationModel"

Gaussian Basis Calibration

Gaussian Basis Calibration is a non-linear calibration model, which means that the prediction order might be altered after calibration. The advantage of this model is that it introduces nonlinearities, making it more powerful and able to provide better results.

To use Gaussian Basis Calibration, set the following hyperparameter: "calibration_model": "GaussianBasisCalibrationModel"
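For example, a minimal hyperparameter configuration that combines the validation split with a calibration model might look like this (values are illustrative):

    hyperparameters = {
        "validation_set_size": 0.2,  # hold out 20% of the training set for the calibration model
        "calibration_model": "GaussianBasisCalibrationModel",  # or "PlattScalingCalibrationModel"
    }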

Pruning

The goal of model pruning is to take a trained model and create a smaller, faster version of it.

  • The model inference is faster and more efficient in terms of CPU usage

  • Less memory consumption

  • Smaller model artifact

  • Pruned models often generalize better, slightly improving on the original model accuracy.

How it works

The general idea is based on the Structured Pruning paper (https://arxiv.org/abs/2204.00408), which was then extended to layout models at Instabase and further refined to achieve faster training times and stable results.

Sparsity

The main parameter you need to set: it defines what fraction of the model’s weights (the encoder part, to be specific) is removed. We recommend setting it to 0.7 or 0.8, which was shown to work well in our experiments. Higher values like 0.9 or 0.95 are possible and lead to even smaller models, but with a higher risk of degraded accuracy.

Supported models

Pruning works for the instalmv1-base and instalmv3-base models on the extraction task.

Pruning job time

In contrast to regular training, the dataset size doesn’t have much effect on the time to prune. Pruning performs a constant number of 4000 training steps, which takes 30-90 minutes depending on the GPU type.

Visual Entity

Allows you to train a model on the object detection task. The goal is to find bounding boxes and classes of visual objects in an image.

Supported models

The only base model we support is the YOLOv5s: https://github.com/ultralytics/yolov5

Key parameters for visual entities

  • image_size - The resolution of images the model is trained at. With a higher size, the model might be more accurate but takes longer to train.

  • augmentation_strength - The multiplier on the amount of image augmentation to do during training. The default is 1.0; if your dataset is large, you can try lower values, such as 0.5 or 0.7. If you want more image augmentation, you can try a value such as 1.5, though the upper limit for a useful multiplier value is 2.0.

  • conf_thresh - The model confidence at which to consider the matches. Lower this value if you want more recall, increase it if you want more precision.

  • iou_thresh - The Intersection over Union threshold at which to consider the matches. A lower value will give you more matches, but possibly with less precise bounding boxes.
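An illustrative configuration putting these parameters together (the values are examples, not tuned recommendations):

    visual_entity_hyperparameters = {
        "image_size": 640,             # training resolution; higher can be more accurate but slower
        "augmentation_strength": 1.0,  # try 0.5-0.7 for large datasets, up to 2.0 for more augmentation
        "conf_thresh": 0.25,           # lower for more recall, higher for more precision
        "iou_thresh": 0.45,            # lower gives more matches, possibly with less precise boxes
    }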

GPU allocation modes

With Ray deployed on an Instabase instance, you can choose the desired GPU allocation. There are three supported options:

  • single - The training task will allocate a single GPU. No other task can be run unless the instance has more than one GPU. This is the default setting.

  • partial - The training task will allocate half of GPU memory. With this setting it is possible to run two training tasks on a single GPU.

  • multi_gpu - The training task will allocate all available GPUs and run the training in distributed mode.

Here are some guidelines on how to use the non-default modes:

partial

While this parameter allows you to schedule two jobs at the same time, it might cause out-of-memory errors if the jobs utilize too much GPU memory. In such a case, try reducing batch_size, or run the training jobs in single mode.

multi_gpu

This parameter is most effective for larger training jobs, where most of the training task is the actual model training.

The results you are getting compared to single-GPU training jobs might be a bit different - using X GPUs is equivalent to multiplying gradient_accumulation_steps by X. This means that the effective batch size is batch_size * gradient_accumulation_steps * <number of gpus>, and the total number of training steps is decreased by a factor of <number of gpus>. You can try to compensate for this by increasing the learning_rate parameter, but this is not obligatory. Assuming that the dataset is large enough to justify distributed training, you get enough training steps to still achieve similar results.
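As a quick sanity check of the arithmetic above, with hypothetical values:

    batch_size = 2
    gradient_accumulation_steps = 4
    num_gpus = 2
    num_training_chunks = 800  # hypothetical number of chunks produced from the training set

    effective_batch_size = batch_size * gradient_accumulation_steps * num_gpus  # 16
    steps_per_epoch = num_training_chunks // effective_batch_size               # 50, i.e. half the single-GPU count
    print(effective_batch_size, steps_per_epoch)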

Multi-GPU training is not available for some of the tasks:

  • Visual entity detection

  • List support with clustering (list_model_type=Clustering)

  • Hyperparameter tuning (do_hyperparam_optimization=True)

  • Model pruning

  • Long document pipeline (enable_long_doc_pipeline=True)

Hyperparameters description

num_train_epochs

The number of epochs for which the model will be trained, that is, how many times the model will iterate over the training dataset during learning. Keep in mind that too large a value of this parameter can cause overfitting (the model remembers exactly what the predictions for the training docs are, but performs poorly on unseen documents). You can track such behavior by looking at the model metrics after each epoch and monitoring whether they deteriorate after some point.

What to try?

It’s worth increasing it especially in cases when:

  • the dataset is small - the model might need more updates to learn

  • gradient_accumulation_steps is set to a higher value - it means that the model makes a smaller number of updates per epoch

  • early_stopping_patience is set to > 0 - then the early stopping mechanism will take control over the number of epochs, and we might want to increase num_train_epochs to make sure that we reach the optimal training length

learning_rate

One of the most important parameters for a DL model. It defines how much we adjust the weights during training. It can be either too small (the DL model is not learning effectively) or too large (the weights change in an unstable way and jump out of the local minimum).

What to try?

  • try a few options, e.g. 1e-5, 1e-4

  • for larger effective batches (higher gradient_accumulation_steps) try to increase learning_rate

lr_scheduler_type

Defines the way that the learning rate changes during the training. There are two variants available:

  • constant_with_warmup - linearly increases the learning rate from 0.0 to the value set in learning_rate over the percentage of the training controlled by warmup_ratio parameter. Then it is set to a constant value equal to learning_rate.

  • linear - it also linearly increases the learning rate over the warmup_ratio training part. The difference is that after reaching this point, it decreases back to 0.0 over the rest of the training.

What to try?

  • worth checking during the hyperparameter optimization phase

warmup_ratio

Defines the percentage of the training that will be used for warming up the learning rate.

What to try?

It usually doesn’t have much impact; it can be checked during the hyperparameter optimization phase.

weight_decay

A penalty on the model’s weights that is added to the loss function. It keeps the model weights from growing too far from 0, which increases the stability of the training and reduces overfitting.

batch_size

This parameter defines how many chunks of documents are processed simultaneously by the model. Together with gradient_accumulation_steps, it defines the number of examples needed to compute a single update of the model weights. A larger batch size will usually make the update more accurate; however, the dataset will generate fewer batches and thus the model will perform fewer updates overall.

What to try?

We don’t suggest modifying it! You might get into out-of-memory issues. If you want to have a larger effective batch size, use gradient_accumulation_steps instead.

The formula is: effective_batch_size = batch_size * gradient_accumulation_steps

gradient_accumulation_steps

Has a similar effect to batch_size. We update the parameters after this number of steps. It doesn’t affect memory utilization, therefore it’s much safer to use.

What to try?

  • try a few options (2, 4)

  • usually, it’s worth increasing it if we have a large dataset

max_length

The number of words (in fact, tokens) that are processed in a single document chunk (part). For many DL models, 512 is the maximum the model is able to handle. Processing long documents in chunks means that later chunks cannot access information (context) from the first pages.

What to try? We don’t recommend modifying this. It’s set to the maximum for supported models. It also affects memory utilization.

chunk_overlap

When processing a document in chunks, the tokens near a chunk boundary have less context. To counteract this, chunks are created with overlap so that each word in the document has a minimum amount of context on both sides. Large overlap values produce more chunks and increase processing time.

What to try?

You could try increasing it for long documents, but usually the default is fine.
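A minimal sketch of how overlapping chunks can be produced (illustrative, not the exact ibformers chunking logic):

    def make_chunks(tokens, max_length=512, chunk_overlap=64):
        # Each new chunk starts max_length - chunk_overlap tokens after the previous one,
        # so tokens near chunk boundaries still get context on both sides.
        stride = max_length - chunk_overlap
        chunks = []
        for start in range(0, len(tokens), stride):
            chunks.append(tokens[start:start + max_length])
            if start + max_length >= len(tokens):
                break
        return chunks

    # 1000 tokens, max_length=512, overlap=64 -> chunks of 512, 512 and 104 tokens
    print([len(c) for c in make_chunks(list(range(1000)))])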

loss_type

Defines which type of loss to use.

  • ce_fixed_class_weights - uses the class_weights value to change the weight of all classes other than O

  • ce_ins - uses the inverse-number-of-samples mechanism to set the weight of each class based on the inverse of its number of samples. The final value is adjusted by the class_weights_ins_power parameter

What to try?

You could try both values, but ce_ins is slightly better according to our internal tests.

class_weights

Under the hood, the extraction task is performed by classifying each word in the document. Words which are not part of entities are classified as the O class. For documents where a large fraction of words doesn’t fall under any entity, this can cause a class imbalance problem. For example, you could have a few pages of text and care only about a single word that indicates your entity; in that case you would end up with a dataset for which over 99 percent of the words are assigned to the O class. Models trained on such data tend to predict the O class for all words, because they don’t get much punishment for doing so (the model is still right in >99% of cases). Increasing the class_weights parameter increases the weight of all classes other than O, which forces the model to focus more on words belonging to entity classes.

Applicable only to loss_type: "ce_fixed_class_weights"

What to try?

Try a few options (3, 5, 10). It’s especially worth increasing for datasets where the number of words corresponding to labels is only a small percentage of the overall number of words.

class_weights_ins_power

The parameter needed to adjust class weights computed by ce_ins loss.

These values are computed by applying the formula: class_frequency ** -class_weights_ins_power

In an example where the O class occupies 100 words and the name class occupies 4 words, we could compute the class weights as follows:

for class_weights_ins_power: 1.0 - O class: 1/100, name class: 1/4 – difference 25x

for class_weights_ins_power: 0.5 - O class: 1/10, name class: 1/2 – difference 5x

Higher parameter values imply larger differences between class weights under class imbalance.

Applicable only to loss_type: "ce_ins"

What to try? Try a few options (0.3, 0.5, 0.66).
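The weight formula above, as a short sketch using the class counts from the earlier example:

    def ins_class_weights(class_counts, class_weights_ins_power):
        # weight = class_frequency ** -class_weights_ins_power
        return {name: count ** -class_weights_ins_power for name, count in class_counts.items()}

    counts = {"O": 100, "name": 4}
    print(ins_class_weights(counts, 1.0))  # {'O': 0.01, 'name': 0.25} -> 25x difference
    print(ins_class_weights(counts, 0.5))  # {'O': 0.1, 'name': 0.5}   -> 5x difference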

max_no_annotation_examples_share

This parameter, similarly to class_weights, also helps with class imbalance, but in a slightly different way. For documents with many pages, we often end up with many pages where there are no labels at all. Therefore the model gets a lot of signal to predict only the O class. This parameter limits the number of pages (in fact, chunks) for which there are no annotations. We still want to keep some pages with no annotations so the model learns to predict nothing for pages with no labels during inference, as at inference time we process all pages. This feature is turned on by default - 0.7 means that at most 70% of the pages (chunks) can be without annotations.

What to try? Try a few options (0.5, 0.7, 0.85). It is especially worth applying for long documents where there are lots of pages with no annotations.

npages_to_filter

This parameter is used for classification models (not the split classifier) to limit the number of pages considered when classifying documents. For documents with many pages, more often than not the information needed to distinguish different types of documents is present in the first few pages. A value of 20 means the classifier will only consider the first 20 pages to make the prediction. To disable this and consider all the pages of the document, set the value to -1. A lower value here also makes the inference faster. Use it for long documents spanning 50-100+ pages, where most of the distinguishing info for classification is in the first few pages. Try values like (5, 10, 15, 20).

pipeline_name

The internal parameter that distinguishes data preprocessing and model architectures. Do not modify.

What to try? Do not modify.

early_stopping_patience

This parameter controls the early stopping mechanism. It automatically stops the training when model performance on a validation set has not improved over the last early_stopping_patience epochs. It requires validation_set_size to be set.

metric_for_best_model

The default objective metric to look for when evaluating the early stopping condition. Depending on the task, there are different valid values for this parameter.

Extraction:

  • eval_average_field_level_macro_f1 (default)

  • eval_average_field_level_macro_recall

  • eval_average_field_level_macro_precision

  • eval_average_field_level_micro_f1

  • eval_average_field_level_micro_recall

  • eval_average_field_level_micro_precision

  • eval_average_token_level_macro_f1

  • eval_average_token_level_macro_recall

  • eval_average_token_level_macro_precision

  • eval_average_token_level_micro_f1

  • eval_average_token_level_micro_recall

  • eval_average_token_level_micro_precision

Classification:

  • eval_recordwise_macro_f1 (default)

  • eval_recordwise_macro_recall

  • eval_recordwise_macro_precision

  • eval_recordwise_micro_f1

  • eval_recordwise_micro_recall

  • eval_recordwise_micro_precision

  • eval_macro_f1

  • eval_macro_recall

  • eval_macro_precision

  • eval_micro_f1

  • eval_micro_recall

  • eval_micro_precision

Split classification:

  • eval_recordwise_split_classifier_accuracy (default)

  • eval_recordwise_split_accuracy

Table extraction:

  • eval_joint_extraction_score (default)

  • eval_grits_cont

  • eval_grits_cont_gold

  • eval_grits_cont_matched

  • eval_label_accuracy

  • eval_exact_content_match

  • eval_exact_table_match

  • eval_exact_doc_match

  • eval_detection_score

  • eval_detection_precision

  • eval_detection_recall

What to try? You can try different metrics depending on what is most important for your solution.

validation_set_size

The percentage of the train set that will be used as a validation set. The validation set is used to monitor the model’s performance during the training, and to make training-related decisions (e.g. early stopping the training).

What to try? For larger datasets, the value can be reduced. With more than 100 documents in the train set, you can keep the raw validation set size at about 15-20 documents.
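Putting a few of these together, an illustrative extraction training configuration with early stopping might look like this (example values only):

    training_hyperparameters = {
        "num_train_epochs": 30,        # upper bound; early stopping may end the training sooner
        "validation_set_size": 0.2,    # required for early stopping
        "early_stopping_patience": 3,  # stop after 3 epochs without improvement
        "metric_for_best_model": "eval_average_field_level_macro_f1",  # extraction default
    }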

do_hyperparam_optimization

If true, the model will perform hyperparameter optimization. This means that before the actual training, different parameter combinations are tried to measure the model’s performance on the validation set. The best ones are then applied to the final training. WARNING: this feature is intended for Instabase releases 22.02 and higher. On older releases, enabling this option will make the training job fail. Turning this on will make the training longer, as there are more training jobs to perform. You can try this if you have some extra time.

hp_search_num_trials

The maximum number of parameter combinations to try during the hyperparameter optimization phase. The actual number can be lower if the provided hyperparameter space is smaller than the requested number of trials.

What to try? 20 is a reasonable default, though you can increase or decrease it depending on your dataset size and the time you want to spend.

hp_search_objective_name

The name of the objective that will be used to select the best model from the performed runs. See the metric_for_best_model section for a list of valid metric names for each task.

is_retraining

Set this to True if you need to run a retraining job. If you start a retraining job, also make sure to change model_name to the marketplace model name, and model_version to the model version you want to retrain on.

model_version

Set model_version if you want to use a particular model version of the model_name. If this is unset, we will always use the latest model version if the model exists in the Instabase marketplace.

enable_long_doc_pipeline

Extracting key information from a 200+ page document is a challenging task both in terms of model inference time and model prediction accuracy. Most transformer models which are used for extraction, work on small sequences of text (like 512 tokens). We could in theory increase the sequence length, but the model size / model FLOPS increase quadratically with sequence length, which makes the deployment of these models all the more difficult. We explore a smarter approach to handling documents in a much more modular step-by-step fashion.

Filter and extract

Open-domain Question-Answering (QA) involves answering an open-ended question like “What’s the angle in an equilateral triangle?”, where you are allowed to reference a large corpus, e.g. Wikipedia, to find the answer. Running a Question-Answering model on the entire Wikipedia would not be practical. It would be much simpler if we could first filter out the Wikipedia pages which are relevant to the given question, and then run the QA model only on those filtered pages. This is the filter-and-extract approach. We use a retriever module to filter the corpus for relevant pages, and then run the actual extraction model on the filtered pages.

To use an analogy, we can draw a parallel between Open-domain QA and long document extraction. In long documents, the information we are looking for might be in just 1 or 2 pages out of a 200+ page document. If we use the retriever to filter the pages where the required information might be present, we can run our transformer model only on those filtered pages. Whereas Wikipedia is the corpus for Open-domain QA, the 200+ pages are the corpus for each long document. In summary, we can consider long document extraction a microcosm of Open-domain QA.

Training

During training, we filter the chunks/pages where labeled data is present, and run the retriever model over those samples. After this, we aggregate the embeddings over these samples per field. We can either use a simple statistic like mean pooling for aggregation or use some sort of clustering algorithm to find the cluster centroid (in this case we can have more than 1 embedding representation per field). For simplicity, we have first considered mean pooling.

We also normalize the vectors to unit norm, since that’s how sentence transformers are usually trained to learn embeddings (on the unit hypersphere) with triplet loss.

At the end of this, we have a mapping from the field label to its corresponding embedding representation from the retriever.

The training loop still trains the main model (layoutlm, layoutxlm, instalm, etc.) on the labeled chunks/pages and also on a fraction of unlabelled chunks/pages (which can be controlled) so that we give the model enough negative samples to train on.

Evaluation and prediction

Here we first compute the retriever embeddings for all samples and keep only those samples which are similar to the field embeddings we stored during the training phase. We use Maximum Inner Product to score the embeddings with respect to the field embeddings.

For really large documents, ANN approaches like faiss can be used. Since our documents are rarely over 500 pages, we just use an inner product (which is equivalent to cosine similarity, since our embeddings are normalized to be unit vectors).

We then run the trained main model on the filtered samples to get predictions. All the other samples are assumed to output the “O” (other) class for all their tokens.
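A simplified sketch of the retrieval logic described above (mean-pooled, L2-normalized field embeddings scored against chunk embeddings with an inner product; top_k is an assumed illustrative cutoff):

    import numpy as np

    def build_field_embedding(labeled_chunk_embeddings):
        # Mean-pool the retriever embeddings of the chunks that contain the field, then L2-normalize.
        field_emb = np.mean(labeled_chunk_embeddings, axis=0)
        return field_emb / np.linalg.norm(field_emb)

    def filter_chunks(chunk_embeddings, field_embeddings, top_k=2):
        # Normalize chunk embeddings so the inner product equals cosine similarity.
        chunks = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
        selected = set()
        for field_emb in field_embeddings.values():
            scores = chunks @ field_emb                   # maximum inner product scoring
            selected.update(np.argsort(scores)[-top_k:])  # keep the top-k chunks per field
        return sorted(selected)                           # only these chunks go to the main model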

Time performance

We see significant improvement in time performance for long documents, with up to a 50% improvement for 100+ page documents.

enable_multilabel_extraction

Set this to True if you want to use the multilabel token classification model. This model is trained to predict multiple labels per token. This is useful for cases where there are words in the doc that you want to assign to multiple labels, e.g. address and street (which is a part of the address).

enable_onnx_runtime_optimization

Set this to True if you want to use the model in ONNX format for inference, which has runtime optimizations giving faster prediction speeds.

permute_records_for_split

Set this to True if you want to augment document sequences for split classification training. If true, we permute the records inside a document to create artificial training data for the model. The max_augmentation_factor (default 5) controls how many additional samples we create. Use this hyperparameter if your training set is small!

enable_split_classification_augmentation

Similar to permute_records_for_split, but also combines records across different documents. This option allows you to use ML Studio datasets of single-class documents. During training, these documents can be randomly joined to train the split classifier. A typical usage pattern is to have datasets with a single class allocated for training and a separate test dataset used for model evaluation.

max_augmentation_factor

The maximum number of artificial samples we add per document. Used in split classification when permute_records_for_split or enable_split_classification_augmentation is set.

max_sequence_length

Use this flag with the bert_small_sequence_split_classifier model. It controls how many pages the model processes in each pass. It can affect the size of the model, so modify it in conjunction with the batch size to fit your GPU. The default size is 10 with an associated batch size of 2.

sequence_overlap

Use this flag with the bert_small_sequence_split_classifier model. It controls how many pages to use as context when the total number of pages in the document is larger than max_sequence_length. The default value is 2, and good values range from 2 to max_sequence_length/2 (rounded down). Larger numbers might give better accuracy but will decrease the inference speed of the model.