Flow binary performance

Performance of Flow binaries depends on several factors:

  • The number of documents processed in the Flow

  • The type of documents processed (machine-readable PDFs, documents that require OCR, documents that contain attachments, and so on)

  • The number of pages in the document

  • The complexity of Refiner Programs and UDFs

  • The number of steps in the Flow

These factors lead to a high degree of variance in the throughput and latency.

Performance on data extraction is usually governed by two major factors:

  • Image to text conversion

  • Extraction

Image to text conversion services

The image to text conversion operation involves multiple services:

  • celery-app-tasks: Celery App Tasks is responsible for preprocessing images. Preprocessing includes deboxing, deblurring, image rotation, custom image filtering, and so on.

  • pdf-tservice: PDF Service performs various operations on PDF documents. Some of these operations involve splitting PDFs into smaller, manageable chunks, extracting text from machine-readable PDFs, and converting pages to images to support extraction in the image domain.

  • ocr-tservice: OCR Service converts the image into text.

All of these operations are performed on a per-page basis. The unit of work is a single page, which is what the performance statistics of the system are measured against.

Extraction (text, checkboxes, signatures)

The extraction operation usually requires running complex CPU-intensive work in the Refiner and UDF programs. In most cases, these operations complete in under a second. Cases that involve complex regular expression trees or extraction in the image domain are more expensive.

Services that are usually involved in the extraction path are:

  • celery-app-tasks: Where all the Refiner programs and UDFs are executed.

The cost of these extraction operations is usually dictated by the number of pages.

A page consistently stands out as the unit of work. As a result, resources are computed on pages-per-hour metrics.

Memory performance

At a high level, 48 cores and 192 GB of memory provide at least 8k pages per hour, assuming every page requires image processing and OCR. For native PDFs (PDFs with text that don't require OCR), the number of pages that can be processed is much higher.

Pages-per-hour throughput scales linearly with more cores and memory. The computation can change depending on the type of documents being processed. Rerun the model after a few weeks of consistent system usage to evaluate your performance.
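
As a rough illustration, the following sketch extrapolates from the baseline above, assuming purely linear scaling. The 48-core and 8k pages-per-hour figures come from the baseline; the function itself is a simplification for planning, not a sizing tool.

# Rough capacity estimate, extrapolated linearly from the baseline:
# 48 cores / 192 GB memory -> ~8,000 pages per hour with OCR.
BASELINE_CORES = 48
BASELINE_PAGES_PER_HOUR = 8000

def estimated_pages_per_hour(cores):
  # Assumes every page requires image processing and OCR, so this
  # is a lower bound; native PDFs process much faster.
  return BASELINE_PAGES_PER_HOUR * (cores / BASELINE_CORES)

print(estimated_pages_per_hour(96))  # ~16,000 pages per hour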

Runtime engine

A Flow runtime engine is usually affected by the following:

  • Document processing

    • The quantity of documents being processed as part of the Flow, and whether each document is machine readable or requires OCR to extract its content

  • Number of pages in each document

  • Refiner functions runtime

  • UDF functions runtime

A Flow consists of steps that run sequentially. To make efficient use of processes, the steps for a single document are executed in a single worker. This execution ensures that the data from each step can be passed to the next through in-memory operations, providing extremely efficient processing. The contents are also asynchronously written to mounted storage to enable use during debugging.
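
Conceptually, the per-document execution resembles the following sketch. This is illustrative only; the step and storage interfaces shown here are hypothetical, not the actual Instabase APIs.

import threading

def run_steps_for_document(doc, steps, storage):
  # All steps for one document run sequentially in a single worker,
  # passing results from step to step in memory.
  data = doc
  for step in steps:
    data = step.run(data)
    # Asynchronously persist the intermediate output for debugging;
    # processing does not wait on the write.
    threading.Thread(target=storage.write, args=(step.name, data)).start()
  return data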

Document processing

The input to a Flow is typically a folder that contains input documents. In general, the documents inside the Flow are processed in parallel across Celery app tasks, up to the Flow batch size (default 10). The pages inside the documents are also processed in parallel within a worker.

The key points about document processing performance are:

  • The number of documents that can be processed in parallel per Flow is governed by the Flow batch size. This restriction ensures fairness across various Flows that are executed in parallel. Because a Flow runs batches of 10 documents at a time, a Flow with 1000s of input documents does not starve a Flow with only a few documents.

  • The number of pages that are processed in parallel is restricted. For PDF processing, 10 pages are processed in parallel. For OCR processing, 2 pages are processed in parallel. This parallel processing ensures that a document with 1000s of pages does not starve the processing of other documents with fewer pages. (See the sketch after this list.)
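
This fairness scheme can be pictured as a set of bounded worker pools, as in the following sketch. The limits match the defaults described above; the structure and the ocr_page function are illustrative, not Instabase's actual implementation.

from concurrent.futures import ThreadPoolExecutor

FLOW_BATCH_SIZE = 10   # documents processed in parallel per Flow
OCR_PAGE_WORKERS = 2   # pages OCRed in parallel per document (PDF: 10)

def ocr_page(page):
  pass  # hypothetical per-page OCR call

def process_document(doc):
  # A single document cannot OCR more than OCR_PAGE_WORKERS pages
  # at once, so a 1000-page document cannot starve smaller ones.
  with ThreadPoolExecutor(max_workers=OCR_PAGE_WORKERS) as pool:
    return list(pool.map(ocr_page, doc.pages))

def process_flow(documents):
  # A single Flow cannot occupy more than FLOW_BATCH_SIZE workers
  # at once, so a 1000-document Flow cannot starve smaller ones.
  with ThreadPoolExecutor(max_workers=FLOW_BATCH_SIZE) as pool:
    return list(pool.map(process_document, documents))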

Refiner and UDF processing

The Refiner and UDF runtimes run operations sequentially within a record, because not all of the dependencies between the functions are known. Records are executed in parallel in the runtime framework.

  • The Refiner program or Refiner UDFs are evaluated top to bottom.

  • A UDF in a step executes after all previous steps have executed, because the UDF assumes that the output of the previous steps is available.

The UDF developer can employ strategies that run various functions in parallel by using Python's concurrent.futures module with a pool of thread workers.

For example, suppose an analytics pipeline performs three operations, each of which can run in parallel. In this case, instead of performing each operation in a separate Apply UDF step of a Flow, you can add code to run the operations in parallel:

from concurrent.futures import ThreadPoolExecutor

def do_entity_analysis(record):
  pass

def do_sentiment_analysis(record):
  pass

def do_classification(record):
  pass

def do_analytics(record):
  # This is the main entry point of the UDF
  ops = [do_entity_analysis, do_sentiment_analysis, do_classification]
  # Run the three independent operations in parallel and wait for
  # all of them to finish.
  with ThreadPoolExecutor(max_workers=len(ops)) as e:
    futures = [e.submit(op, record) for op in ops]
    results = [f.result() for f in futures]

register_fn() …

UDF and Refiner function runaway processing

Refiner functions and UDFs have the potential to stop responding to the system. A runaway process can occur if there is an infinite loop in the code or a complex regular expression takes too long to complete. To remediate these cases, a timeout is associated with the UDF and Refiner steps.

If a step exceeds the timeout value, the entire process that is running the step is terminated and replaced with a new one. A side effect of this terminate-and-replace operation is that the entire document is marked as failed.
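
The timeout behavior can be approximated with a process-based sketch like the following. This is illustrative; the timeout value and the step function are hypothetical, not the actual mechanism.

import multiprocessing

STEP_TIMEOUT_SECONDS = 300  # hypothetical timeout value

def run_step_with_timeout(step_fn, doc):
  proc = multiprocessing.Process(target=step_fn, args=(doc,))
  proc.start()
  proc.join(STEP_TIMEOUT_SECONDS)
  if proc.is_alive():
    # Runaway step: end the worker process so it can be replaced.
    # The side effect is that the whole document is marked as failed.
    proc.terminate()
    proc.join()
    raise TimeoutError('step exceeded %ds; document failed' % STEP_TIMEOUT_SECONDS)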

UDF environment extensibility

You can extend the UDF environment by bringing in your own libraries. There are two ways to extend the UDF environment:

  • Extending Instabase worker containers

  • Event handlers

Extend the Instabase worker containers

You can install custom Python libraries to extend the celery-app-task containers. This approach is usually very straightforward and allows instant use of the libraries in a UDF.

However, this approach is not recommended because of these limitations:

  • Instabase does not use virtual environments, so it is difficult to introduce multiple versions of the same library.

  • Maintenance is problematic: you must ensure that the extension is applied to all subsequent app-tasks.

  • The side effects of the libraries on regular processing in app-tasks are unknown.

Event handlers

Extending the UDF environment with event handlers requires that you provide your own REST service in the Kubernetes cluster. In theory, event handlers can run anywhere, but running them inside the cluster namespace has additional benefits, such as isolation and better performance over the network.

In each UDF, you can write a REST API request that sends the appropriate data to the server and receives the response back.
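
For example, a UDF might delegate the heavy lifting to an in-cluster event handler like this. The service URL and payload shape are hypothetical.

import requests

# Hypothetical in-cluster service endpoint.
EVENT_HANDLER_URL = 'http://custom-nlp-service.my-namespace.svc:8080/analyze'

def analyze_record(record):
  # Send the record to the custom REST service and return its
  # response as the UDF result.
  resp = requests.post(EVENT_HANDLER_URL, json={'text': record}, timeout=30)
  resp.raise_for_status()
  return resp.json()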

Using event handlers, you own the UDF logic and manage the extensibility while avoiding maintenance issues and version incompatibility. The best production gain is realized when you distribute the custom libraries as developer packages of encrypted Python files.

Although this extensibility approach is recommended, it reduces the ease of writing UDFs, because the libraries need to be wrapped with REST endpoints that the UDF can call over HTTP or HTTPS.