Third-party libraries for UDFs

Instabase supports several third-party libraries that you can reference in custom, user-defined functions (UDFs). If you have an on-premises environment, you can add more third-party libraries by extending Instabase images.

Default third-party libraries

Instabase includes commonly used libraries with your deployment. These libraries can be imported like any other Python library.

The following third-party libraries are included by default with your Instabase deployment:

  • beautifulsoup v. 4.6.3, import bs4 to use. Supports HTML and XML parsing.

  • dateparser v. 0.7.0, import dateparser to use. dateparser provides modules to parse localized dates in almost any string format commonly found on web pages.

  • nltk v. 3.4.5, import nltk to use. nltk is a library for handling human language.

  • numpy v. 1.16.5, import numpy to use. numpy is the fundamental package for scientific computing with Python.

  • opencv-contrib-python v. 3.4.2.17, import cv2 to use. This library also adds opencv-python. Open Source Computer Vision Library (OpenCV) is an open-source computer vision and machine learning software library used for various image processing operations.

  • pandas v. 1.1.0, a data analysis and manipulation tool.

  • pdfkit v. 0.6.1, for converting HTML to PDFs.

  • pillow v. 8.1.2, import PIL to use. Pillow is a library for various image processing operations.

  • pypdf2 v. 1.26.0, a PDF utility library.

  • regex v. 2018.01.10, a regular expression library.

  • requests v. 2.22.0, supports HTTP requests to external services.

  • scikit-learn v. 0.21.3, a machine learning library.

  • scikit-image v. 0.16.2, an image processing library.

  • scipy v. 1.3.1, supports many math and image processing modules.

  • spacy v. 2.3.2, an NLP library.

  • xlwt v. 1.3.0, to write to Excel sheets.

  • xlsxwriter v. 1.0.2, to write to Excel sheets.
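Because the import name sometimes differs from the package name (for example, pillow is imported as PIL), it can help to check what is importable before writing a UDF. The following is a minimal sketch, not an Instabase API: the DEFAULT_LIBRARIES mapping and the find_missing helper are hypothetical names introduced here for illustration, and the script uses only the Python standard library.

```python
import importlib.util

# Package name (as listed above) -> name used in the import statement.
DEFAULT_LIBRARIES = {
    "beautifulsoup": "bs4",
    "dateparser": "dateparser",
    "nltk": "nltk",
    "numpy": "numpy",
    "opencv-contrib-python": "cv2",
    "pandas": "pandas",
    "pdfkit": "pdfkit",
    "pillow": "PIL",
    "pypdf2": "PyPDF2",
    "regex": "regex",
    "requests": "requests",
    "scikit-learn": "sklearn",
    "scikit-image": "skimage",
    "scipy": "scipy",
    "spacy": "spacy",
    "xlwt": "xlwt",
    "xlsxwriter": "xlsxwriter",
}


def find_missing(import_names):
    """Return the import names that cannot be located in this environment."""
    missing = []
    for name in import_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Treat lookup errors the same as "not found".
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Report which of the listed libraries are not importable here.
print("Not importable:", find_missing(DEFAULT_LIBRARIES.values()))
```

Run inside your deployment, the list should be empty. Run elsewhere, it shows which packages are absent from that environment.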

Enabling additional third-party libraries

In on-premises environments, your infrastructure administrator can install additional packages on top of the base container. After extending your Instabase images, you can reference your third-party libraries in UDFs.

To reference additional third-party libraries, you must extend celery-app-tasks.

Extending celery-app-tasks

On a machine with Docker installed and with access to Instabase and your local repo:

  1. Run vim Dockerfile.

  2. Reference the following Dockerfile to extend celery-app-tasks.

    FROM gcr.io/instabase-public/celery-app-tasks:YOUR_RECENT_RELEASE
    # Your infrastructure team will have received several images to deploy Instabase. Find the tag that matches your most recent release, and replace YOUR_RECENT_RELEASE with the tag.
    
    USER root
    # Package installations and changes must be performed as root.
    
    RUN pip install --user [package]
    # Install your list of packages here. For example: RUN pip install --user oracledb==1.0.1
    
    USER 9999
    # For security, we do not run our services as root. So, you must switch back to the ib-user UID.
    
  3. Run docker build -t gcr.io/instabase-public/celery-app-tasks:YOUR_RECENT_RELEASE . (the trailing dot sets the build context to the current directory) to build your extended container.

Note: As in the Dockerfile, replace YOUR_RECENT_RELEASE with the tag that matches your most recent release.

  4. Pull the extended image from Instabase and push it to a repository of your choice, then deploy it in place of the Instabase-provided image.

You can now reference your additional third-party libraries in UDFs.
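To confirm that the extended image carries the package version you pinned, you can check from a Python shell inside the container. This is a minimal sketch under two assumptions: the container runs Python 3.8 or later (importlib.metadata is not in the standard library before that), and oracledb==1.0.1 is used only because it is the example from the Dockerfile comment. The installed_version helper is a hypothetical name, not part of Instabase.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# oracledb==1.0.1 is the example package from the Dockerfile comment above.
expected = "1.0.1"
found = installed_version("oracledb")
if found is None:
    print("oracledb is not installed in this environment")
elif found != expected:
    print(f"oracledb {found} is installed, expected {expected}")
else:
    print(f"oracledb {found} is installed")
```

Substitute your own package name and pinned version for oracledb and 1.0.1.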