UDFs on Instabase

UDFs let you add custom functionality in Flow and Refiner.

  • Flow has an “Apply UDF” step that allows you to run a UDF as a step in a Flow
  • Runtime input variables contain information for your UDFs
  • UDFs are registered in a Python file (for example, scripts.py) with a special registration function
  • Use custom Python modules to call helper functions that are common across several script directories

Providing access

Site admins can use Settings > Site > Access Controls to grant users Execute UDFs access across the entire system. See Allowing users to run UDFs.

Script folders and registration functions

Script folders are directories on Instabase that contain your script files. Using a UDF in a Flow or Refiner requires configuring the app to point to your script folder:

  • Click Choose scripts directory from an Apply UDF step.

  • Click the Scripts button in Refiner Settings.

Registering UDFs

A UDF must be registered before it can be used in Flow or Refiner. You can register a UDF by importing the register_fn decorator and adding @register_fn above the UDF you wish to register. Decorator parameters include:

  • name (string): Registered name of the function. This is the name you use to call the function in Flow or Refiner. If no name is specified, the registered name defaults to the name of the function.

  • provenance (boolean): Whether or not the UDF is provenance-tracked. The default value is True, which sets the UDF to be provenance-tracked.

You can register both a provenance-tracked and untracked version of a function with the same name parameter.

Here are some examples of how to use the decorator:

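# Import the decorator and the Value type used in the provenance-tracked examples
from instabase.provenance.registration import register_fn
from instabase.provenance.tracking import Value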

# This function is registered under the name "custom_greeting" 
# and uses provenance-tracking.
@register_fn(name='custom_greeting')
def custom_greeting_v(name: Value[str], **kwargs) -> Value[str]:
  return Value('Hi ') + name

# This function is registered under the name "custom_greeting" 
# and does not use provenance-tracking.
@register_fn(name='custom_greeting', provenance=False)
def custom_greeting(name: str, **kwargs) -> str:
  return 'Hi ' + name

# This function is registered under the name "greeting"
# and uses provenance-tracking.
# Note that this decorator has no parameters specified, so the default
# behavior is to register the function name and set provenance to True.
@register_fn
def greeting(name: Value[str], **kwargs) -> Value[str]:
  return Value('Hi ') + name

Alternatively, you can register a UDF using the legacy approach of defining a register function in any of the Python files in your scripts folder. Here’s an example:

def custom_function_fn(content, *args, **kwargs):
  pass

def register(name_to_fn):
  more_fns = {
    'custom_function': {
      'fn': custom_function_fn,
      'ex': '',  # Example usage of the function.
      'desc': ''  # Description of the function.
    }
  }
  name_to_fn.update(more_fns)

To invoke the custom function, run as a custom Refiner or Apply UDF function:

custom_function(INPUT_COL)  # INPUT_COL is a special variable that is defined in the execution environment
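
As a minimal sketch of the legacy registration contract, the following fills in the placeholder function and calls register by hand on an empty dictionary. The function body, the 'ex' and 'desc' strings, and the direct call to register are illustrative assumptions for a local check; on Instabase, the platform calls register for you.

```python
def custom_function_fn(content, *args, **kwargs):
  # Hypothetical behavior: strip surrounding whitespace and uppercase the input.
  return str(content).strip().upper()

def register(name_to_fn):
  more_fns = {
    'custom_function': {
      'fn': custom_function_fn,
      'ex': "custom_function(' abc ') -> 'ABC'",  # Example usage of the function.
      'desc': 'Strips whitespace and uppercases the input.'  # Description.
    }
  }
  name_to_fn.update(more_fns)

# Local check only: simulate the registry dictionary that is passed to register.
registry = {}
register(registry)
result = registry['custom_function']['fn'](' abc ')
```

Keeping the registered name ('custom_function') separate from the Python function name (custom_function_fn) lets you rename the implementation without changing the formulas that call it.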

Subfolders in a scripts folder

You can have multiple subfolders within a scripts folder. Script files in the subfolders can be imported and used in other script files. Relative paths are supported.

Restriction: Only script files at the root level of the scripts folder can register custom functions to Flow.

Importing files in a scripts folder

Within the scripts folder, use relative paths to import variables from one file to another file. Import is supported using this syntax: from <file> import <variables>.

For example, if a script folder contains the following files:

my-user/my-repo/fs/Instabase Drive/samples/scripts/
|   
+---python_file1.py
|       function_1()
|
+---python_file2.py
|       function_2()
|
+---folder1/
    |
    +---python_file3.py
    |       function_3()
    |
    +---python_file4.py
            function_4()    

From python_file1.py, use the following statements to import function_2 and function_3:

from .python_file2 import function_2
from .folder1.python_file3 import function_3

Files in subfolders in the scripts folder can also import variables from other files. For example, from the python_file3.py file, use the following statements to import function_2 and function_4:

from ..python_file2 import function_2
from .python_file4 import function_4

Python files outside of the scripts folder cannot be imported.
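
The relative import statements above follow standard Python package semantics, so you can try them locally before uploading scripts. The sketch below recreates the example layout in a temporary folder and imports python_file1 through it. The package name demo_scripts and the function bodies are hypothetical, and the local import mechanism is an assumption that only approximates how Instabase loads a scripts folder.

```python
import importlib
import os
import sys
import tempfile
import textwrap

# Hypothetical local stand-in for the scripts folder shown above.
FILES = {
  'python_file1.py': '''
    from .python_file2 import function_2
    from .folder1.python_file3 import function_3

    def function_1():
      return ['function_1', function_2(), function_3()]
  ''',
  'python_file2.py': '''
    def function_2():
      return 'function_2'
  ''',
  'folder1/python_file3.py': '''
    from ..python_file2 import function_2
    from .python_file4 import function_4

    def function_3():
      return 'function_3 uses ' + function_2() + ' and ' + function_4()
  ''',
  'folder1/python_file4.py': '''
    def function_4():
      return 'function_4'
  ''',
}

# Write the files into <tmp>/demo_scripts, creating subfolders as needed.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, 'demo_scripts')
for rel_path, source in FILES.items():
  path = os.path.join(pkg_dir, *rel_path.split('/'))
  os.makedirs(os.path.dirname(path), exist_ok=True)
  with open(path, 'w') as f:
    f.write(textwrap.dedent(source))

# Import the folder as a package so that relative imports can resolve.
sys.path.insert(0, root)
importlib.invalidate_caches()
python_file1 = importlib.import_module('demo_scripts.python_file1')
names = python_file1.function_1()
```

A single leading dot refers to the folder of the importing file, and each additional dot moves up one folder, which is why python_file3.py reaches python_file2.py with two dots.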

Invoking a UDF

UDFs are invoked in these general categories:

  • Scripts attached to a Flow step

    • Fetcher

    • Process Files

    • Map Records

    • Apply Classifier

    • Apply UDF

  • Scripts inside of extraction programs

    • Refiner programs (.ibprog)

    • Sheet programs (.ibsheet)

  • Scripts run at specific times within a Flow

    • Custom Classifier

    • Pre and Post Run Custom Hooks

Runtime input variables

Runtime input variables contain information for your UDFs:

  • Context about your Flow (root_output_folder, input_filepath). This information is about the entire Flow itself, containing the input file being processed and the root of the final output folder.

  • Filesystem access using the ibfile object.

  • Context for your step (parsed_ibocr, input_ibocr_record). These details are about a particular file or record you are processing.

UDFs also have access to the runtime_config, which is a user-defined dictionary of strings passed into the binary at runtime. You can access this dictionary through the CONFIG column in the kwargs that are passed into the UDF:

runtime_configs, err = kwargs['_FN_CONTEXT_KEY'].get_by_col_name('CONFIG')
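
A minimal sketch of reading one value from the runtime config with a fallback follows. The FakeFnContext class is a hypothetical stand-in that exists only so the example runs locally; it assumes that get_by_col_name returns a (value, error) pair, as in the line above. The helper name read_config_value and the 'region' key are also illustrative.

```python
class FakeFnContext:
  """Hypothetical local stand-in for the object stored under '_FN_CONTEXT_KEY'."""

  def __init__(self, columns):
    self._columns = columns

  def get_by_col_name(self, col_name):
    # Assumed return shape: (value, error), matching the snippet above.
    if col_name not in self._columns:
      return None, 'Column not found: {}'.format(col_name)
    return self._columns[col_name], None

def read_config_value(key, default, **kwargs):
  """Returns runtime_config[key], or default if the config or key is missing."""
  runtime_configs, err = kwargs['_FN_CONTEXT_KEY'].get_by_col_name('CONFIG')
  if err or not runtime_configs:
    return default
  return runtime_configs.get(key, default)

fn_context = FakeFnContext({'CONFIG': {'region': 'us-east'}})
region = read_config_value('region', 'unknown', _FN_CONTEXT_KEY=fn_context)
tier = read_config_value('tier', 'standard', _FN_CONTEXT_KEY=fn_context)
```

Checking err before using the dictionary keeps the UDF from failing when a Flow is run without a runtime config.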

Because UDFs run in a variety of contexts, including inside objects and functions, you must explicitly define the dictionary, object, or input variable that contains the input for your particular step or UDF hook point.

Use the following input variables to provide the runtime arguments to your UDF:

Step or UDF hook point                 Access with
-------------------------------------  -----------------------
Custom Classifier                      ModelMetadataDict
Custom Classifier in Metaflow          Special input variables
Custom Fetcher                         FetcherContext object
Custom formulas in Refiner programs    Special input variables
Custom formulas in Sheet programs      Special input variables
Custom image filters                   FilterConfigDict
Map records                            Special input variables
Pre and Post Run Custom Hooks          FlowInfoDict
UDF formula                            Special input variables

See Configuring Flow steps for details on these objects, dictionaries, and special input variables.

Importing other files in scripts directory

You can import modules from other files in the same scripts directory using absolute or relative paths.

Package payload

The package payload is the src folder with Python files. This src folder contains resources for the Refiner programs and UDFs for the Flow.

The src folder can contain subfolders for package management.

For example, this src folder hierarchy contains an __init__.py file and a Python file in the root src folder, plus two nested subpackages:

src/
|   
|
+---__init__.py
|
|
+---python_file1.py
|       function_1()
|
+---subpkg1/
    |
    +---python_file2.py
    |       function_2()
    |
    +---subpkg2/
        |
        +---python_file3.py
                function_3()

Absolute module paths

If you are using absolute module paths, the root directory of the path is the location of the scripts folder. For example, if your scripts are located at user/repo/fs/Instabase Drive/files/samples/src/, the root package name is src. With this directory structure, the following examples are valid import statements:

import src
import src.python_file1
import src.subpkg1
from src.subpkg1 import python_file2
from src.subpkg1.python_file2 import function_2

Relative module paths

Alternatively, you can use relative paths to import other files in the src folder.

For example, to use function_1() in python_file2.py:

from ..python_file1 import function_1

Custom Python modules

After your UDFs get beyond a certain complexity, you’ll likely have helper functions which are common across several script folders. How do you manage that complexity? Our recommendation is to use custom Python modules.

The supported Custom Python module roots are ib.market and ib.custom. A module root is like a namespace where you insert custom logic.

ib.market modules

ib.market modules must be published to Marketplace as a pypkg solution.

Marketplace is an installation-wide distribution store. Modules accessed under ib.market are common across the entire installation.

To create ib.market modules, see Publishing developer packages.

Using ib.market modules in UDFs

To use the Python packages in UDFs, import them by prepending ib.market:

import ib.market.<solution-name>
import ib.market.<solution-name>.<submodules>
from ib.market.<solution-name>.<submodules> import <obj-name>

where:

  • <solution-name> is the name in the package.json of the solution published on Marketplace

  • <submodules> is the module to import the desired object from, as contained in the src folder hierarchy

  • <obj-name> is the name of the variable, function, class, and so on

For this example src folder and a <solution-name> of my_pkg, these are valid import statements:

import ib.market.my_pkg
import ib.market.my_pkg.python_file1
from ib.market.my_pkg.python_file1 import function_1
from ib.market.my_pkg.subpkg1.python_file2 import function_2
from ib.market.my_pkg.subpkg1.subpkg2.python_file3 import function_3

Using a specific version of a package

A Python package can have multiple versions available in the Marketplace. However, a flow binary uses only one package version. By default, the import statement imports the latest version of a package.

  • To define a fixed version of a package, create an ib_requirements.txt file in the same folder as the UDF scripts. The format of the ib_requirements.txt file follows the Python requirements.txt file format.

  • To import packages of specific versions, specify a list of <pkg_name>==<version_num>.

Note: Using multiple versions of the same package in a Flow Binary is not supported.
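
As a rough sketch of the expected file contents, the snippet below parses pinned lines in the <pkg_name>==<version_num> form and rejects a file that pins two different versions of the same package. The package names, version numbers, and the parse_pins helper are hypothetical; Instabase performs its own parsing, so treat this only as a way to sanity-check a file before uploading it.

```python
SAMPLE_REQUIREMENTS = """\
# Hypothetical contents of ib_requirements.txt
my_pkg==0.0.2
other_pkg==1.4.0
"""

def parse_pins(text):
  """Returns {pkg_name: version}; raises ValueError on malformed or conflicting pins."""
  pins = {}
  for raw_line in text.splitlines():
    # Drop comments and surrounding whitespace, then skip empty lines.
    line = raw_line.split('#', 1)[0].strip()
    if not line:
      continue
    name, sep, version = line.partition('==')
    name, version = name.strip(), version.strip()
    if not sep or not name or not version:
      raise ValueError('Expected <pkg_name>==<version_num>, got: ' + raw_line)
    if name in pins and pins[name] != version:
      raise ValueError('Conflicting versions pinned for ' + name)
    pins[name] = version
  return pins

pins = parse_pins(SAMPLE_REQUIREMENTS)
```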

Restrictions for ib.market

  • All ib.market code must first be published to Marketplace.

ib.custom modules

All features with custom Python modules are fully supported across all places where UDFs are used.

Custom Python modules under ib.custom must reside in the same filesystem drive as the UDFs that use them.

Modules under ib.custom are imported from Python files stored in the .flow folder of the filesystem drive where the ib.custom import occurs. For example, if a UDF sits nested somewhere within my-user/my-repo/fs/Instabase Drive, then files stored within my-user/my-repo/fs/Instabase Drive/.flow/ib/custom will be available for import under the ib.custom root module in that UDF.

To create the .flow/ib/custom folder for a UDF nested somewhere within my-user/my-repo/fs/Instabase Drive:

  1. Navigate to my-user/my-repo/fs/Instabase Drive in the UI

  2. Create a folder named .flow, and navigate to my-user/my-repo/fs/Instabase Drive/.flow

  3. Create a folder named ib, and navigate to my-user/my-repo/fs/Instabase Drive/.flow/ib

  4. Create a folder named custom, and navigate to my-user/my-repo/fs/Instabase Drive/.flow/ib/custom

Within that my-user/my-repo/fs/Instabase Drive/.flow/ib/custom folder, you can add any Python files. For example, the hierarchy of a valid folder structure is:

my-user/my-repo/fs/Instabase Drive/.flow/ib/custom/
|   
+---python_file1.py
|       function_1()
|
+---subpkg1/
    |
    +---python_file2.py
    |       function_2()
    |
    +---subpkg2/
        |
        +---python_file3.py
                function_3()

You can use the Python files in the filesystem drive .flow folder by importing them in UDFs, prepending them with ib.custom. For example:

import ib.custom.<submodules>
from ib.custom.<submodules> import <obj-name>

where:

  • <submodules> is the module from which to import the desired object, as contained in the .flow/ib/custom folder hierarchy

  • <obj-name> is the name of the variable, function, or class

Suppose you keep string helpers in my-user/my-repo/fs/Instabase Drive/.flow/ib/custom/common/strutils.py, and strutils.py defines a function called decode. Then any UDF in that drive can use the strutils.py logic through a custom import, like:

  from ib.custom.common.strutils import decode

  def custom_use_libs_fn(val, **kwargs):
    return decode(val)

  def register(name_to_fn):
    name_to_fn.update({
      'custom_use_libs': {
        'fn': custom_use_libs_fn
      }
    })
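
For illustration, strutils.py might contain something like the following. The body of decode is entirely hypothetical, since the helper is only named above; this sketch assumes it converts bytes to text.

```python
# Hypothetical contents of .flow/ib/custom/common/strutils.py
def decode(val, encoding='utf-8'):
  """Returns val as text, decoding bytes and replacing undecodable sequences."""
  if isinstance(val, bytes):
    return val.decode(encoding, errors='replace')
  return str(val)

# Local check only: the helper accepts both bytes and text.
from_bytes = decode(b'caf\xc3\xa9')
from_text = decode('plain text')
```

Because the helper lives under .flow/ib/custom, every scripts folder in the same drive can share one copy instead of duplicating it.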

ib.custom module restrictions

  • For a UDF in a particular workspace called my-user/my-repo/fs/Instabase Drive/, the root directory where all of the custom code resides is my-user/my-repo/fs/Instabase Drive/.flow/ib/custom.

Reading raw files

To enable a custom Python module to read raw files from the filesystem, use the load_file function with ib.custom modules.

from ib.custom.utils import load_file

def my_classifier_fn():
  content, err = load_file('ib/custom/data/myfile.txt')
  if err:
    print('Reading was unsuccessful')
  else:
    print('Binary content for file {}'.format(content))

For example, you can store ML models or small files like scikit-learn models, and then load them with the load_file function.

Store these files under the .flow/ib/custom folder at the root of the filesystem drive.

For this example, store the myfile.txt file in:

my-user/my-repo/fs/Instabase Drive/.flow/ib/custom/data/myfile.txt
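
A minimal sketch of loading a small pickled object this way is shown below. The load_file function here is a hypothetical in-memory stand-in so the example runs locally; it assumes the real function returns a (content, err) pair with content as bytes, as in the example above. The file name model.pkl and the load_model helper are also illustrative.

```python
import pickle

# Hypothetical in-memory stand-in for files stored under .flow/ib/custom.
_FAKE_FILES = {'ib/custom/data/model.pkl': pickle.dumps({'threshold': 0.5})}

def load_file(path):
  # Stand-in for ib.custom.utils.load_file; assumed to return (content, err).
  if path not in _FAKE_FILES:
    return None, 'File not found: {}'.format(path)
  return _FAKE_FILES[path], None

def load_model(path):
  """Returns (model, err), where model is the unpickled object or None."""
  content, err = load_file(path)
  if err:
    return None, err
  # Only unpickle files you trust: pickle can execute arbitrary code.
  return pickle.loads(content), None

model, err = load_model('ib/custom/data/model.pkl')
missing, missing_err = load_model('ib/custom/data/absent.pkl')
```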

Accessing resource files in UDFs

Use the Resource Reader to access an external file from within a UDF and dynamically locate your external files at runtime. Any file can be accessed using the Resource Reader; common file types are CSV, JSON, and XML. The Resource Reader is useful for loading external static information into a Flow.

Resource folder name

For the resource folder to be correctly loaded, the folder must be named _resources. This reserved folder name defines that files in this folder can be used in UDFs.

Resource folder location

Resource folders are supported in Refiner, Flow, and Solution.

Resource folders are not supported for Flow Binaries.

  • For Flows, including Metaflows and Multiflows, the _resources folder must be in the same directory as the .ibflow files.

  • For Solutions, the _resources folder must be in the same packaging directory as the .ibflowbin and package.json files.

  • For Refiner, the _resources folder location is specified in the UDF function when calling get_resource_reader(resources_path).

    • The path to the _resources folder is a relative path starting from the location of the .ibprog Refiner program. When a Refiner program is used in a Flow, the _resources folder in the Flow is used (the same directory as .ibflow files).

    For the following example folder structure, use get_resource_reader('../../Workflows/_resources').

resources_folder_test/
|
+---Samples/
|   |
|   +---prog/
|       |
|       +---resource_test.ibprog
|
+---Workflows/
    |
    +---_resources/
    |
    +---resources_test.ibflow
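
To check a relative resources path before using it, you can resolve it by hand. This sketch assumes the path is resolved against the folder that contains the .ibprog file, which is consistent with the example above; resolve_resources_path is a hypothetical helper, not an Instabase function.

```python
import posixpath

def resolve_resources_path(ibprog_path, resources_path):
  """Resolves resources_path relative to the folder that contains the .ibprog file."""
  prog_dir = posixpath.dirname(ibprog_path)
  return posixpath.normpath(posixpath.join(prog_dir, resources_path))

resolved = resolve_resources_path(
    'resources_folder_test/Samples/prog/resource_test.ibprog',
    '../../Workflows/_resources')
```

Each '..' segment moves up one folder: the first leaves prog, the second leaves Samples, and the remainder descends into Workflows/_resources.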

Using the resource reader client

To read files from the resource folder, you must use the resource reader client. To retrieve this client from within a UDF, call get_resource_folder() on the FNContext object.

When calling this resource reader with Refiner, specify the path to the resources folder relative to the .ibprog file.

Loading files from the resource folder

To read files using the resource reader client, call load_file(filepath) on the resource reader client. The path given to the function is relative to the resource folder. This function returns the bytes of the file and, potentially, an IBError object if an error occurs.

Writing files to the resource folder

To write files using the resource reader client, call write_file(filepath, contents) on the resource reader client. Again, the path given to the function is relative to the resource folder. This function writes the bytes given in contents to the file at filepath and returns a tuple that specifies if the write was successful and, potentially, an IBError if an error occurs.

Resource reader example in a UDF

First, get the clients object, which exposes the resource_reader. You can either pass it in the formula or retrieve it from _FN_CONTEXT_KEY:

def custom_resource_fn(**kwargs):
    clients, err = kwargs['_FN_CONTEXT_KEY'].get_by_col_name('CLIENTS')

    resource_reader = clients.resource_reader

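After you get the client, use it to read and write resource files:

  def custom_resource_fn(**kwargs):
    # Retrieve the clients object and its resource reader.
    clients, err = kwargs['_FN_CONTEXT_KEY'].get_by_col_name('CLIENTS')
    resource_reader = clients.resource_reader
    # Paths are relative to the _resources folder.
    contents, ib_err = resource_reader.load_file('resource.txt')
    success, ib_err = resource_reader.write_file('other_resource.txt', b'contents of this file')
    return contents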

  def register(name_to_fn):
      more_fns = {
          'custom_fn_name': {
              'fn': custom_resource_fn,
              'ex': '',
              'desc': ''
          }
      }
      name_to_fn.update(more_fns)
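
Because load_file returns bytes, a UDF usually needs to decode and parse the content before using it. The sketch below parses a two-column CSV resource into a dictionary. FakeResourceReader is a hypothetical in-memory stand-in for clients.resource_reader so the example runs locally; the file name codes.csv and the load_lookup helper are also illustrative.

```python
import csv
import io

class FakeResourceReader:
  """Hypothetical in-memory stand-in for clients.resource_reader."""

  def __init__(self, files):
    self._files = dict(files)

  def load_file(self, filepath):
    # Assumed return shape: (bytes, error), as described above.
    if filepath not in self._files:
      return None, 'File not found: {}'.format(filepath)
    return self._files[filepath], None

def load_lookup(resource_reader, filepath):
  """Parses a two-column CSV resource into a dict; returns {} on a read error."""
  contents, ib_err = resource_reader.load_file(filepath)
  if ib_err:
    return {}
  rows = csv.reader(io.StringIO(contents.decode('utf-8')))
  return {row[0]: row[1] for row in rows if len(row) >= 2}

reader = FakeResourceReader({'codes.csv': b'US,United States\nDE,Germany\n'})
lookup = load_lookup(reader, 'codes.csv')
```

Returning an empty dictionary on error is one possible design choice; a UDF that cannot work without the resource may prefer to surface the error instead.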