Services

Each service or container in an Instabase deployment exposes environment variables that you can configure at the deployment level for each platform installation.

Web services

These web services implement the Instabase webapp, user interface, and public APIs and SDK.

  • Webapp: Responsible for serving the backend information for all platform UI features, including user and account management, subspace and repository management, authentication and login (SSO and LDAP), and file-system operations.

  • API server: Implements the public APIs for all core-platform features including user/account management, Subspace/Repository management, file-system and mounting operations, as well as application APIs.

  • Apps server: Responsible for serving all Instabase applications built on the OS, such as Flow, Refiner, Sheets, Text-Editor, and more.

  • Server Nginx: Provides an entry-point to Instabase and proxies requests to all upstream Instabase Web Services. It also serves static content, including .css and .js files.

Core services

These services make up the Instabase platform layer.

  • Core-platform-service: Manages everything to drive the core platform functionality, such as user account management (Accounts, Organizations, Groups, sessions, access tokens, and other settings), Subspace management (Subspaces, marketplace and various Instabase applications), mounting functionality, and database management.

  • File tservice: Responsible for all file system operations on the platform.

  • Table tservice: Provides support for accessing any SQL-based server, including Oracle DB, SQL Server, PostgreSQL, and MySQL.

  • Job service: Manages the state of async tasks running on the platform.

  • Search tservice: Integrates with Elasticsearch to provide powerful indexing and querying capabilities used for metrics, audit logs, and document search.

  • License service: Supports our license manager and metering functionality.

Data services

The data services provide core functionality across various applications and typically implement CPU- or memory-intensive operations.

  • OCR MSFT LITE: Provides basic OCR extraction for English language documents.

  • OCR MSFT V3: Provides advanced OCR extraction, supporting English and non-Latin languages.

  • OCR service: Provides OCR conversion functionality and image processing functionality.

  • Model service: Provides model training and inference functionality.

  • Conversion service: Provides document conversion functionality from Microsoft file formats (including .doc, .docx, .xls, .xlsx, .ppt, and .pptx) into more flexible formats, such as PDF.

  • Ray head: The head node in a Ray cluster is a dedicated CPU node that performs singleton processes responsible for cluster management, dispatching remote tasks, and executing Ray drivers. Each training job is supervised by its own dedicated job supervisor actor running on the head node.

  • Ray model training worker: The worker node is a GPU node in the Ray cluster that performs the computation tasks for model training jobs.

Async task workers

The async task workers are celery workers that process tasks asynchronously.

  • App tasks: Processes long-running tasks for Instabase applications such as Flow, Refiner, Classifier, and many others.

  • Core tasks: Processes long-running I/O tasks such as copying, moving, or renaming of many files and the Git SDLC integration.

  • Webdriver tasks: Processes long-running NLP tasks for Instabase applications.

Core components

These services provide core functionality that is central to the Instabase platform and is used across a large number of services.

  • OpenSearch: OpenSearch provides powerful indexing, query, and aggregation functionality. OpenSearch is used across features to leverage its flexible query language and for indexing large datasets.

  • RabbitMQ HA: RabbitMQ is our broker for queue management. Built on the AMQP protocol, it is used for managing the lifecycle of all async tasks. RabbitMQ is used in high-availability (HA) mode. By default, there are two replicas running, with one designated as active and the other as passive. Using HA mode requires a ReadWriteMany (RWX) volume. You can use the single-replica, non-HA mode with a ReadWriteOnce (RWO) volume.

  • Redis: Redis is our central cache provider. It caches data specific to the core platform as well as our async tasks.
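As noted above, running RabbitMQ in HA mode requires a volume that supports the ReadWriteMany access mode. A hypothetical PersistentVolumeClaim illustrating the distinction is sketched below; the claim name and storage class are placeholders, not actual Instabase manifest values.

```yaml
# Hypothetical PVC for an HA queue volume. Only the accessModes field is
# the point here: HA mode needs ReadWriteMany (RWX), while single-replica,
# non-HA mode can use ReadWriteOnce (RWO).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rabbitmq-ha-data      # placeholder name
  namespace: ib
spec:
  accessModes:
    - ReadWriteMany           # use ReadWriteOnce for single-replica mode
  resources:
    requests:
      storage: 10Gi
  storageClassName: your-rwx-storage-class   # placeholder
```

Whether an RWX storage class is available depends on your cluster's storage provisioner (for example, NFS-backed classes typically support RWX, while many block-storage classes are RWO only).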

Observability infrastructure

These services provide metrics and visibility into your infrastructure.

  • Grafana: Service deployed to provide visualization for system stats.

  • Prometheus: Service responsible for collecting system stats from Instabase components and writing them to remote storage. This service also computes derived time series and any alerts based on the time series data.

  • Victoriametrics: Remote storage for Prometheus service, as well as the datasource for visualizations in Grafana.

  • Alertmanager: This service receives alerts from Prometheus. It takes care of deduplicating, grouping, and routing alerts to the correct receiver integration such as email, PagerDuty, or OpsGenie.

  • Kube-state-metrics: This service is responsible for collecting the state of Kubernetes objects (within the ib namespace) such as deployments, statefulsets, replicasets, services, endpoints, horizontal pod autoscalers, and more. This helps collect stats such as deployment metadata and the number of pods.

  • Jaeger: Service deployed to provide request tracing capabilities for improved debuggability.

  • Loki-Read: This service is a component within the Instabase log aggregation system focused on retrieving and serving log entries from the underlying storage backend. It enables users to query and analyze log data for insights into system behavior from Grafana.

  • Loki-Write: This service is a component within the Instabase log aggregation system designed for ingesting log entries from Fluent Bit and writing to the underlying storage backend.
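The note above that Prometheus "computes derived time series and any alerts" refers to recording and alerting rules. A hypothetical rule file is sketched below; the metric and alert names are illustrative placeholders, not metrics actually shipped with Instabase.

```yaml
# Hypothetical Prometheus rule file: one recording rule that derives a
# per-job 5xx error rate, and one alert that fires on it. Alertmanager
# then deduplicates, groups, and routes the resulting alert.
groups:
  - name: example-derived-series
    rules:
      - record: job:request_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      - alert: HighErrorRate
        expr: job:request_errors:rate5m > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
```

Recording rules keep Grafana dashboards fast by precomputing expensive aggregations; alerting rules hand off firing alerts to Alertmanager for routing to receivers such as email or PagerDuty.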

High availability services

The high availability (HA) feature allows you to run multiple replicas for selected services within Instabase, preventing platform downtime in cases where the underlying infrastructure becomes unavailable.

These replicas run in an active-passive failover fashion, where one replica is active and serving requests, while the rest of the replicas wait in passive mode to take over if needed.

HA is available for the job service, the license service, and RabbitMQ.

Configuration and monitoring

The HA feature is enabled by default for the job service, license service, and RabbitMQ. However, before you can use HA for services, you must set up a database table. See Creating a database table for services for details.

To disable HA, set the environment variable ENABLE_HA to false and update the service resourcing, reducing the number of replicas to one.
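The two changes described above, setting the environment variable and reducing the replica count, can be made in the service's Deployment spec. The excerpt below is a hypothetical sketch; the container name is a placeholder, and your deployment tooling may manage these fields elsewhere.

```yaml
# Hypothetical Deployment excerpt disabling HA for one service.
spec:
  replicas: 1                 # reduce from the HA replica count to one
  template:
    spec:
      containers:
        - name: job-service   # placeholder container name
          env:
            - name: ENABLE_HA
              value: "false"
```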

To check whether the HA feature is enabled and functioning for your deployment, check the HA Panels dashboard in Grafana. If the feature is enabled, you will see the replicas being managed under HA on this dashboard. This dashboard also displays details such as replica state, age, and initialization and shutdown durations over time.

Creating a database table for services

Before you begin

Enable any required environment variables.

  1. In the Admin app, go to Configuration.

  2. Check the Service Setup section. If any tables need to be set up to support enabled services, click Set up.

HA replicas and failover settings

The state for the replicas that HA manages is maintained in the database. The role of each replica, active or passive, is decided using a common time-based lease for all replicas of a service. Every active replica is responsible for renewing its lease to maintain its active status. If it fails to renew in time or releases the lock voluntarily, another replica is automatically chosen as the next active replica.
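The lease mechanism described above can be sketched in a few lines. This is a minimal in-memory illustration, assuming a single shared lease record; real HA replicas store the lease in the database table set up earlier, and the names here (`Lease`, `try_acquire`, `renew`) are illustrative, not Instabase APIs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Lease:
    """Stand-in for the shared lease row all replicas of a service contend on."""
    holder: Optional[str] = None
    expires_at: float = 0.0

def try_acquire(lease: Lease, replica: str, duration_s: float, now: float) -> bool:
    """A replica becomes active if the lease is free or has expired."""
    if lease.holder is None or now >= lease.expires_at:
        lease.holder = replica
        lease.expires_at = now + duration_s
        return True
    return lease.holder == replica  # already the active replica

def renew(lease: Lease, replica: str, duration_s: float, now: float) -> bool:
    """Only the current holder can extend its lease, and only before it expires."""
    if lease.holder == replica and now < lease.expires_at:
        lease.expires_at = now + duration_s
        return True
    return False

# Replica A acquires the lease; B stays passive until A misses a renewal.
lease = Lease()
assert try_acquire(lease, "A", duration_s=9.0, now=0.0)
assert not try_acquire(lease, "B", duration_s=9.0, now=3.0)   # A still active
assert renew(lease, "A", duration_s=9.0, now=6.0)             # A renews in time
assert try_acquire(lease, "B", duration_s=9.0, now=16.0)      # A's lease lapsed; B takes over
```

Passive replicas simply loop on `try_acquire` at the lease-attempt periodicity, so failover is automatic once the active replica stops renewing.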

The duration for switching from one active replica to another, also known as failover, depends on replica initialization and shutdown time, as well as a set of tunable parameters listed below.

  • HA_LEASE_RENEWAL_PERIODICITY_MS: How often the replica tries to renew a lease, in milliseconds. The default is 3 seconds.

  • HA_LEASE_ATTEMPT_PERIODICITY_MS: How often the replica tries to acquire a lease, in milliseconds. The default is 3 seconds.

  • HA_LEASE_DURATION_MS: The length of the lease, in milliseconds.

With the default settings for these parameters, the failover times are typically 5-10 seconds for graceful terminations and 15-20 seconds for ungraceful terminations. These times are within the retry windows for the remote procedure calls that are handled by these services. You should notice little, if any, performance impact when failovers happen.
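A rough back-of-envelope model shows how these parameters combine into the failover windows quoted above. The formulas below are an approximation under assumed behavior (a passive replica takes over at its next acquisition attempt after the lease lapses), not the exact internal computation, and the 9-second lease value is a hypothetical example since no default is documented here.

```python
def ungraceful_failover_ms(lease_duration_ms: int,
                           attempt_periodicity_ms: int,
                           init_ms: int) -> int:
    # Worst case: the active replica dies right after renewing, so passives
    # must wait out the full lease, then up to one acquisition attempt
    # period, then the new active replica must initialize.
    return lease_duration_ms + attempt_periodicity_ms + init_ms

def graceful_failover_ms(attempt_periodicity_ms: int,
                         shutdown_ms: int,
                         init_ms: int) -> int:
    # Graceful case: the active replica releases the lock on shutdown, so
    # passives only wait for their next acquisition attempt.
    return shutdown_ms + attempt_periodicity_ms + init_ms

# With the default 3-second attempt period, a hypothetical 9-second lease,
# and 2-second shutdown/initialization times:
print(graceful_failover_ms(3000, shutdown_ms=2000, init_ms=2000))  # 7000
print(ungraceful_failover_ms(9000, 3000, init_ms=2000))            # 14000
```

These illustrative numbers land in the same ballpark as the documented 5-10 second (graceful) and 15-20 second (ungraceful) windows.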

Note

Quarantining pods maintained under HA is not supported.