Annotation guide

Annotation sets demonstrate how you want documents to be classified and what data you want extracted. Annotation sets are used to train deep learning models.

Annotation occurs early in the process of developing a solution. A solution developer might handle basic annotation, while very large or complex annotation sets might be handed off to dedicated annotators.

Generally, the more records you annotate, the more accurate your model becomes. You can do all your annotating before training a model, or you can annotate incrementally, alternating between annotating and training until you reach acceptable model performance. If you use the incremental approach, you can potentially improve annotation speed and accuracy using annotation assist.

Before you begin

You must have an annotation set that includes documents. For instructions on creating annotation sets, see the Solution Builder guide.

Annotation set settings

Annotation set configuration includes details about the files, classes, and schemas in the annotation set, as well as settings that control how ML Studio uses your annotations in model training and testing.

Use annotation set settings to perform these tasks:

  • Add or modify classes, schemas, or table labels.

  • Import redigitized documents.

  • Enable or disable automatic test record selection.

  • Export an annotation set.

Switch between your annotation set and its settings using the gear icon in ML Studio.

You can switch views without losing work you’ve annotated, but some actions, such as importing redigitized documents, might prompt you to review existing classification and annotation data.

Model training requirements for annotation sets

Model training requires a minimum of five training records and two test records, so your annotation set must include at least seven records that are marked as annotated. In sets with manual test record selection enabled, you must select at least two of your annotated records as test records.

You must meet the minimum annotation requirement for each type of model you’re training. For example, if your solution requires a classification model, extraction model, and table extraction model, you must annotate at least seven records with classification, seven records with text extraction, and seven records with table extraction.

Marking records as annotated helps Instabase determine when training and testing requirements are met, but during training, all records with annotations are used, regardless of whether they’re marked as annotated.

Although you can train a model with a small annotation set, model performance generally improves as you add annotations. Providing more annotations is usually the most effective way to improve model performance, especially with unstructured or irregular documents.
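
The minimums described above can be expressed as a quick readiness check. The following is a minimal sketch, not part of ML Studio: the Record class, its attribute names, and the meets_training_minimum function are hypothetical, and the rule that manually selected test records must still leave five training records is an assumption drawn from the stated minimums.

```python
from dataclasses import dataclass

MIN_TRAINING_RECORDS = 5  # minimum stated in this guide
MIN_TEST_RECORDS = 2      # minimum stated in this guide


@dataclass
class Record:
    """Hypothetical stand-in for one record in an annotation set."""
    name: str
    marked_annotated: bool = False
    marked_for_testing: bool = False  # only relevant with manual selection


def meets_training_minimum(records, manual_test_selection=False):
    """Return True if the records satisfy the 5 training + 2 test minimum."""
    annotated = [r for r in records if r.marked_annotated]
    if len(annotated) < MIN_TRAINING_RECORDS + MIN_TEST_RECORDS:
        return False
    if manual_test_selection:
        # Assumption: test records are taken out of the annotated pool,
        # so the remainder must still cover the training minimum.
        test_count = sum(1 for r in annotated if r.marked_for_testing)
        train_count = len(annotated) - test_count
        return (test_count >= MIN_TEST_RECORDS
                and train_count >= MIN_TRAINING_RECORDS)
    return True
```

Because the minimum applies to each type of model, a check like this would be run once per model type, for example once over the records annotated for classification and once over those annotated for table extraction.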

Creating classes and schemas

If the annotation set you’re working on doesn’t specify classes and schemas, you must create them before you begin annotating.

Classes usually indicate record type, like paystub or driver’s license. Annotation sets must include at least one class, even if there’s only one type of document in the set. Schemas indicate the fields you want to extract for each class. For example, the schema for a paystub class might include employee name, gross pay, and net pay.

You can create classes and schemas, or you can import classes from an existing annotation set or Marketplace model. Imported classes include a class schema. If you want to use a Marketplace model without further model training, don’t modify the class schema, because doing so might jeopardize model performance. If you plan to do incremental training on a Marketplace model, you can add or edit fields in the schema as needed after you import a class, because incremental training accommodates your changes.

  1. With your annotation set open in ML Studio, click Create or Import class to access class settings.

  2. Choose whether you want to create or import a class.

    To create a class:

    1. Click Create new class.

    2. Specify a class name, optional description, and any fields you want to include in the class schema, then click Save.

    To import a class:

    1. Click Import classes.

    2. Choose whether you want to import classes from an existing annotation set or from a Marketplace model.

      • If you import from the Marketplace, you can search for the model you want to use. Find the model from which you want to import, and click Import.

      • If you import from another annotation set, you can select from any Solution Builder project you have access to. Select the annotation set from which you want to import the class, and click Open.

  3. Repeat the previous step for any additional classes you want to create or import.

Modifying schemas

If an existing class doesn’t include all the fields you need to annotate, modify the class schema.

Info

Schemas can include these field types:

  • Text - Used to annotate text where there’s a 1:1 relationship between field label and value.

  • Text (multiple instances) - Used to annotate text where the same value appears multiple times in a record.

  • List - Used to annotate text where there’s a one-to-many relationship between field label and values, such as deposits in a banking summary.

  • Table - Used to annotate tables. If you want to extract individual fields within a table, you can further annotate tables with header labels.
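
To make the relationship between classes, schemas, and field types concrete, here is a minimal sketch that models them as plain data. It is illustrative only: FieldType, SchemaField, ClassSchema, and fields_of_type are hypothetical names and don't reflect how ML Studio stores schemas. The paystub fields come from the example earlier in this guide.

```python
from dataclasses import dataclass, field
from enum import Enum


class FieldType(Enum):
    TEXT = "text"                    # one value per field label
    TEXT_MULTIPLE = "text_multiple"  # same value repeated in a record
    LIST = "list"                    # many values per field label
    TABLE = "table"                  # table, optionally with header labels


@dataclass
class SchemaField:
    name: str
    type: FieldType
    # Header labels apply only to TABLE fields (hypothetical attribute).
    table_labels: list = field(default_factory=list)


@dataclass
class ClassSchema:
    class_name: str
    fields: list


def fields_of_type(schema, field_type):
    """Return the names of all fields in the schema with the given type."""
    return [f.name for f in schema.fields if f.type is field_type]


paystub = ClassSchema(
    class_name="paystub",
    fields=[
        SchemaField("employee name", FieldType.TEXT),
        SchemaField("gross pay", FieldType.TEXT),
        SchemaField("net pay", FieldType.TEXT),
    ],
)
```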

  1. In the settings for your annotation set, on the Classes tab, select a class you want to modify and click the edit icon.

  2. Refine fields as needed:

    • To add a field, click Add field and specify field name and type.

    • To add labels to a table field, click , then enter a label name.

    Note

    Clicking Import fields lets you import an entire class schema, which overrides any existing fields defined in the class.

  3. Click Save.

Assigning classes

If the documents in your annotation set aren’t assigned classes—or aren’t assigned accurate classes—you must classify records. It’s important to classify records accurately, because the data available to extract from each record is determined by the schema for the class you select.

Tip

You can annotate fields as you assign classes, so you don’t have to touch the same record twice. If you use this approach, be sure you verify the class before annotating any fields, because changing an assigned class clears all annotations from a record.

Often, classes are auto-assigned when documents are uploaded. In simple annotation sets with more structured documents, auto-assignment works well, but there are several scenarios where you might need to verify or add classes to documents in your annotation set:

  • If documents were uploaded with class auto-assignment disabled, you must assign classes to all records.

  • If documents are tricky to classify, you must verify that classes were auto-assigned appropriately.

  • If documents include more than one record, you must split and classify records appropriately.

The steps you take to assign classes depend on whether your documents include a single record or multiple records.

  • To classify documents that include one record, in the class panel, select the correct class.

  • To classify documents that include multiple records, select the records grid icon (a two-by-two grid of empty blue boxes).

    If necessary, modify how the document is divided by selecting pages and dragging them to the correct record, or use the button controls to create additional records or delete irrelevant pages. Then, in the class panel, assign the correct class for each record.

Annotating fields

Annotating indicates the data points associated with each field in a document’s assigned class schema. For example, if your paystub class includes a gross pay text field, you mark—or annotate—the value in each paystub that corresponds with gross pay.

Annotation methods vary depending on field type. To annotate fields of a given type, you must first assign a class that includes that field type.

Tip

See Help > Keyboard Shortcuts for a list of hotkeys that can help you annotate more quickly.

Annotating text

Text fields are used to annotate text where there’s a 1:1 relationship between field label and value.

  1. With your annotation set open in ML Studio, select a document from the document list to display the document in the center panel.

  2. Select a text field from the class panel.

  3. Highlight the document area that contains the data point for the selected field. You can click on words or numbers, or you can use your mouse to drag and draw a box around the information.

  4. Continue selecting fields and highlighting corresponding data points until you have annotated the document with each of the fields in the class.

  5. Click Mark as annotated on the annotated document.

    Marking records as annotated indicates to ML Studio that the record can be used for model training and testing. Records marked as annotated display a green filled circle in the Files pane.

Annotating multiple instances of the same value

If the documents you’re annotating repeat the same value multiple times, such as an invoice total, annotating all instances of the value using text (multiple instances) can improve model accuracy. The trained model still extracts a single instance of the value: the instance with the highest confidence score.

  1. With your annotation set open in ML Studio, select a document from the document list to display the document in the center panel.

  2. Select a text (multiple instances) field from the class panel.

  3. Click Add item and, in the document, highlight the document area that contains the data point for the selected field. You can click on words or numbers, or you can use your mouse to drag and draw a box around the information.

  4. Repeat the previous step for any additional values corresponding to the text (multiple instances) field.

  5. Continue selecting fields and highlighting corresponding data points until you’ve annotated the document with each of the fields in the class.

  6. Click Mark as annotated on the annotated document.
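
The selection rule described at the start of this section, where the trained model returns the single instance with the highest confidence score, can be sketched as follows. This is an illustration under assumptions, not the model's actual implementation: the Candidate class, the pick_extracted_value function, and the sample values and scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate extraction for one field."""
    text: str
    confidence: float  # assumed to fall between 0.0 and 1.0


def pick_extracted_value(candidates):
    """Return the candidate with the highest confidence, or None if empty."""
    return max(candidates, key=lambda c: c.confidence, default=None)


# Made-up example: the same invoice total found in three places.
candidates = [
    Candidate("$1,250.00", 0.71),
    Candidate("$1,250.00", 0.94),
    Candidate("$1,250.00", 0.58),
]
best = pick_extracted_value(candidates)
```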

Annotating a list of different values

If the documents you’re annotating include multiple values for a given field, such as all the deposits in a bank statement, you can annotate a list of values for the field.

  1. With your annotation set open in ML Studio, select a document from the document list to display the document in the center panel.

  2. Select a list field from the class panel.

  3. Click Add item and, in the document, highlight the document area that contains the data point for the selected field. You can click on words or numbers, or you can use your mouse to drag and draw a box around the information.

  4. Repeat the previous step for any additional values corresponding to the list field.

  5. Continue selecting fields and highlighting corresponding data points until you’ve annotated the document with each of the fields in the class.

  6. Click Mark as annotated on the annotated document.

Annotating tables

You can extract tables as a standalone field or, for well-structured tables, you can extract individual fields within a table.

Annotating table fields involves specifying the table area and drawing borders to indicate where cells are divided. If documents were scanned with table entity detection enabled, you can simply select a detected table and refine it. If needed, you can redigitize documents with table entity detection enabled.

Tip

If you want to extract individual fields within a table, you can further annotate tables with header labels. For instructions on adding labels to a table field, see Modifying schemas.

  1. With your annotation set open in ML Studio, select a document from the document list to display the document in the center panel.

  2. Select a table field from the class panel.

  3. Select a table.

    • If tables were detected, click the table entity icon (two overlapping purple boxes, the front box containing a two-by-three grid).

    • If no tables were detected, click Add item in the class panel and, in the document, draw a box around the table you want to annotate. If a table is split across pages, you can add a table segment by repeating this step.

  4. Refine your table as needed using the table toolbar:

    • To add column or row dividers, click Add a column or Add a row to add the appropriate number of column and row dividers. Drag the dividers so they’re positioned over the document’s table dividers.

    • To delete column or row dividers, select the divider line (not the column or row itself) and click Delete the selected line.

    • To merge cells, select adjacent cells and click Merge cells.

    • To extract individual fields within the table, apply header labels: select a column or row, click Add column label or Add row label, then select the appropriate label from the table schema.

      Tip

      Use the Display all table labels and View as plaintext controls to verify your table annotations.

  5. Continue selecting fields and highlighting corresponding data points until you’ve annotated the document with each of the fields in the class.

  6. Click Mark as annotated on the annotated document.

    Marking records as annotated indicates to ML Studio that the record can be used for model training and testing. Records marked as annotated display a green filled circle in the Files pane.
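
Conceptually, a table annotation is a table area plus divider positions, which together define the cells. The sketch below shows that idea under simplifying assumptions: coordinates are plain numbers, dividers run the full width or height of the table, and merged cells are ignored. The cells_from_dividers function is hypothetical and doesn't represent how ML Studio stores table annotations.

```python
def cells_from_dividers(table_box, column_dividers, row_dividers):
    """Split a table box into cell boxes using divider positions.

    table_box is (left, top, right, bottom). Dividers are x positions
    (columns) or y positions (rows) inside the box. Returns a list of
    rows, each a list of (left, top, right, bottom) cell boxes.
    """
    left, top, right, bottom = table_box
    xs = [left] + sorted(column_dividers) + [right]
    ys = [top] + sorted(row_dividers) + [bottom]
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]


# Two column dividers and two row dividers produce a 3 x 3 grid.
grid = cells_from_dividers(
    (0, 0, 300, 90), column_dividers=[100, 200], row_dividers=[30, 60]
)
```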

Redigitizing documents

Redigitize documents with modified OCR settings if your scanned documents don’t have particular data points available for annotation.

Tip

For details about digitization settings, see the parameter reference for the process files step in Flow, which provides similar settings.

After redigitizing, annotations for some documents might be updated according to the new digitization settings. You have the option to accept or reject any updated annotations.

Redigitizing must be done in Solution Builder.

Before you begin

If you want to redigitize only certain files, note the names of the files.

  1. With your annotation set open in ML Studio, at the top of the screen, select the solution that your annotation set is part of.

    Header bar in ML Studio showing Machine Learning Studio logo, then Help and Solution Builder menus, then a space and My solution and My annotation set separated by a backslash. A cursor hovers on My solution.

    A new browser tab opens with the solution open in Solution Builder.

  2. Select Documents.

  3. Select the documents you want to redigitize, then in the documents header, click Redigitize. Modify the digitization settings as needed, then click Redigitize.

  4. Return to the browser tab with ML Studio.

  5. In the settings for your annotation set, on the Files tab, click Import outdated documents, or click the download button next to a redigitized document to import a specific document.

  6. Close the settings pane for your annotation set, and check for warnings on redigitized documents, which indicate that existing annotations were updated.

    Select any documents that display a warning about redigitized files, then in the field list, accept or reject updated annotations.

    For text and list annotations, the previous annotation is displayed for comparison. Previous annotations aren’t displayed for table annotations.

    If you reject an updated annotation, annotate the field again.

Using annotation assist

If the annotation set you’re working with has already been used for some model training, you might see annotation suggestions. Suggestions are the model’s best guess about annotation based on a previous round of training.

For unannotated fields, you can speed up annotation by accepting the model’s accurate suggestions. For previously annotated fields, you can potentially improve annotation accuracy by comparing the human annotation with the model’s suggestion and correcting any discrepancies. If a model was trained with annotation analysis, enable Annotation assist in the class panel to view suggestions with error probabilities.

Suggestions aren’t provided for records or fields where the model had low confidence, or for documents that were added to the annotation set since the last training or evaluation run.
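
Comparing human annotations with model suggestions amounts to listing the fields where the two disagree. The following minimal sketch assumes annotations and suggestions are available as simple field-to-value mappings. The find_discrepancies function and the sample values are hypothetical, and ML Studio performs this comparison for you in the class panel.

```python
def find_discrepancies(human, suggested):
    """Return {field: (human_value, suggested_value)} where the two differ.

    Fields with no suggestion are skipped. Fields with a suggestion but
    no human annotation are reported with a human value of None.
    """
    return {
        name: (human.get(name), suggestion)
        for name, suggestion in suggested.items()
        if human.get(name) != suggestion
    }


# Made-up paystub values for illustration.
human = {
    "employee name": "Ana Ruiz",
    "gross pay": "4,200.00",
    "net pay": "3,150.00",
}
suggested = {
    "employee name": "Ana Ruiz",
    "gross pay": "4,200.00",
    "net pay": "3,105.00",
}
to_review = find_discrepancies(human, suggested)
```

A discrepancy doesn't indicate which side is correct. Each flagged field still needs a person to check the document and decide.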

Manually selecting test records

By default, newly created annotation sets use automatic test record selection. You can switch to manual test record selection if you want to specify which records are used for testing. In annotation sets with high variance, manually selecting test records that are especially representative of a class can help improve model performance.

  1. In the settings for your annotation set, on the Info tab, click Edit.

  2. Disable Automatically assign test records and click Save.

  3. In your annotation set, select records that you want to use for testing and, in the class panel, click Mark for testing (the flag icon).

Exporting annotation sets

If you want to save the current state of your annotation set, or move it to a different location, you can export the annotation set.

  1. In the settings for your annotation set, on the Export tab, select an action:

    • Copy annotation set within the same environment - Copies the annotation set to a selected parent folder in the current environment.

    • Move annotation set to an external environment (version 22.05 and above) - Generates a ZIP file of the annotation set for use in an Instabase environment running release version 22.05 or later.

    • Move annotation set to an external environment (before version 22.05) - Generates a ZIP file of the annotation set for use in an Instabase environment running a release version prior to 22.05.

      Note

      You must provide the exact absolute path to the target parent directory. You can find this information in the file system by navigating to the target directory and clicking Actions > Copy to Clipboard > Copy Path.

  2. (Optional) Upload the annotation set to a new environment.

    1. Unzip the file.

    2. In the Instabase Explorer, select the parent folder where you want to upload your annotation set.

    3. Click New > Upload files, select the files to upload, and click Start Upload.
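
If you prefer to script the unzip step, the following minimal sketch uses Python's standard zipfile module. The unzip_annotation_set function name and any paths you pass to it are placeholders. The upload itself still happens in the Instabase Explorer as described above.

```python
import zipfile
from pathlib import Path


def unzip_annotation_set(zip_path, dest_dir):
    """Extract an exported ZIP and return the extracted file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # List the extracted files so you can confirm what to upload.
    return sorted(p for p in dest.rglob("*") if p.is_file())
```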

Creating annotation sets based on human review

You can use the results from human reviews to create annotation sets, which you can then use to refine a model.

Info

Annotations based on human review are generated only for fields that have provenance, meaning the data is tied to a specific location on the page. Data that a reviewer keyed in manually is excluded.

  1. In Flow Review, in the review that you want to export as an annotation set, select Export > Annotation Set.

  2. Review and, if necessary, correct the class schema, then click Next Step.

    You can modify class names, field names, and field types, or select fields to remove from export.

  3. Specify a name for your annotation set, then click Export.

    A popup notification alerts you when export completes.

What’s next? You can copy the exported annotation set to a different environment, if needed. Then, add the annotation set to an existing ML Studio project, or add it to a new project and use it to iteratively train a previously trained model.

Annotation statuses

A record can have these annotation statuses:

  • Not Started - The record hasn’t been assigned a class.

  • In Progress - The record has been classified and might include annotations, but it hasn’t been marked as annotated.

  • Annotated - The record has been marked as annotated.

Records must be marked as annotated to count toward the minimum requirements for training and testing a model.
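
The three statuses follow from two facts about a record: whether it has a class, and whether it has been marked as annotated. The minimal sketch below shows that mapping. The AnnotationStatus enum and annotation_status function are hypothetical names used for illustration, not an ML Studio API.

```python
from enum import Enum


class AnnotationStatus(Enum):
    NOT_STARTED = "Not Started"
    IN_PROGRESS = "In Progress"
    ANNOTATED = "Annotated"


def annotation_status(assigned_class, marked_annotated):
    """Derive a record's status from its class and its annotated flag."""
    if assigned_class is None:
        return AnnotationStatus.NOT_STARTED
    if marked_annotated:
        return AnnotationStatus.ANNOTATED
    # Classified, possibly with annotations, but not yet marked.
    return AnnotationStatus.IN_PROGRESS
```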