Document search

Document search provides indexing and search capabilities for the Instabase platform. With Document search, users can query for files and folders based on document metadata, and access files for which they have permissions.

Non-technical users can leverage this feature by searching for documents via the Spotlight search bar, and API users can conduct more elaborate queries via RESTful APIs provided by the platform.

Prerequisites

To enable this feature, set the environment variable ENABLE_DOC_SEARCH to true in core-platform-service, file-tservice, search-tservice, and api-server. In addition, set the environment variable DOCUMENT_EXPORT_ENDPOINT to RABBIT_MQ in core-platform-service and file-tservice.
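As a rough illustration, the settings above can be written down as a table and checked before rollout. This is a minimal sketch, not an Instabase tool: the `deployed` mapping and the `missing_settings` helper are hypothetical, and the comparison assumes the values are stored exactly as written in this section.

```python
# Required settings per service, copied from the prerequisites above.
REQUIRED_ENV = {
    "core-platform-service": {
        "ENABLE_DOC_SEARCH": "true",
        "DOCUMENT_EXPORT_ENDPOINT": "RABBIT_MQ",
    },
    "file-tservice": {
        "ENABLE_DOC_SEARCH": "true",
        "DOCUMENT_EXPORT_ENDPOINT": "RABBIT_MQ",
    },
    "search-tservice": {"ENABLE_DOC_SEARCH": "true"},
    "api-server": {"ENABLE_DOC_SEARCH": "true"},
}


def missing_settings(deployed):
    """Return sorted (service, variable, expected) tuples that are absent or wrong.

    `deployed` is a hypothetical mapping of service name -> {env var: value},
    for example collected from your deployment manifests.
    """
    problems = []
    for service, expected in REQUIRED_ENV.items():
        actual = deployed.get(service, {})
        for name, value in expected.items():
            # Assumption: values must match exactly as documented.
            if actual.get(name) != value:
                problems.append((service, name, value))
    return sorted(problems)
```

An empty result means every documented variable is present with the documented value.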

Usage

Use cases

Document search allows users to conduct queries based on document metadata, including file name, file path, file extension, or document type. It also supports non-exact matching, such as fuzzy query, prefix query, and wildcard query. Some typical use cases are given below (this list is not exhaustive):

  • Find documents named “Bank Statement.pdf”.

  • Find files in a folder named “My Demo”.

  • Find files in folders whose names start with the prefix “passport”, and return the results sorted by last modified time.

  • Find all .txt files in a folder.

  • Perform a fuzzy search (approximate string match) for a keyword.

  • Find files that include the word “string” in their file names.
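To make the difference between the matching modes concrete, the sketch below applies exact, prefix, wildcard, and fuzzy matching to a small list of file names using only the Python standard library. It does not call the Instabase Search API. The sample names and helper functions are hypothetical, and `difflib` similarity is used as a stand-in for whatever fuzzy scoring the platform applies.

```python
import difflib
import fnmatch

# Hypothetical file names standing in for indexed document metadata.
FILES = ["Bank Statement.pdf", "passport_scan.png", "notes.txt", "string_utils.txt"]


def exact(names, query):
    """Exact match on the full file name."""
    return [n for n in names if n == query]


def prefix(names, query):
    """Case-insensitive prefix match."""
    return [n for n in names if n.lower().startswith(query.lower())]


def wildcard(names, pattern):
    """Shell-style wildcard match, e.g. '*.txt' or '*string*'."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


def fuzzy(names, query, cutoff=0.8):
    """Approximate match: names whose similarity to the query is >= cutoff."""
    return difflib.get_close_matches(query, names, n=len(names), cutoff=cutoff)


print(exact(FILES, "Bank Statement.pdf"))
print(prefix(FILES, "passport"))
print(wildcard(FILES, "*.txt"))
print(fuzzy(FILES, "Bank Statment.pdf"))  # misspelled query still finds the file
```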

APIs

Searches can be run through the APIs that Instabase supports. For details, see the Search API documentation.

You can also search for documents on the platform via the Spotlight search bar. Click the search icon on the dock and you can then search for a variety of documents by entering keywords.

Note: The Spotlight feature needs to be enabled by setting the environment variable ENABLE_SEARCH to true in apps-server and webapp services.

Indexing Operations

The operations that trigger indexing of documents are summarized in the sections below.

File I/Os

The document search feature takes a best-effort approach to keeping the document index up to date with the file system. The following file operations update the index:

  • File creation: When a file is created, the file itself is indexed, along with any intermediate folders on its parent path.

  • File write: Writing to a file triggers the indexing of that file’s metadata.

  • Folder creation: When a folder is created, it is indexed. If intermediate folders are created along with it, those newly created folders are indexed as well.

  • Copy: When a file is copied, the copy is indexed. When a folder is copied, the copied folder and all its contents are indexed.

  • Move/Rename: When a file or folder is moved or renamed, the index entry for the old path is removed and the new path is indexed.

  • File & Folder deletion: When a file is deleted, its index entry is removed. When a folder is deleted, the index entries for the folder and all its contents are removed.
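The rules in the list above can be sketched with a toy in-memory index. This is only an illustration of the described behavior, not the platform's implementation: the `MetadataIndex` class and its method names are hypothetical, and a plain dictionary stands in for Elasticsearch.

```python
import posixpath


class MetadataIndex:
    """Toy in-memory stand-in for the document index (hypothetical)."""

    def __init__(self):
        self.entries = {}  # path -> "file" or "folder"

    def _index_parents(self, path):
        # Index every intermediate folder along the parent path.
        parent = posixpath.dirname(path)
        while parent and parent != "/":
            self.entries.setdefault(parent, "folder")
            parent = posixpath.dirname(parent)

    def _subtree(self, path):
        # The path itself plus everything underneath it.
        prefix = path.rstrip("/") + "/"
        return [p for p in self.entries if p == path or p.startswith(prefix)]

    def create_file(self, path):
        self._index_parents(path)
        self.entries[path] = "file"

    def create_folder(self, path):
        self._index_parents(path)
        self.entries[path] = "folder"

    def copy(self, src, dst):
        # The copy (and, for a folder, all its contents) is indexed.
        for p in self._subtree(src):
            self.entries[dst + p[len(src):]] = self.entries[p]

    def move(self, src, dst):
        # Old entries are removed and the new paths are indexed.
        for p in self._subtree(src):
            self.entries[dst + p[len(src):]] = self.entries.pop(p)

    def delete(self, path):
        # Deleting a folder also removes the entries for its contents.
        for p in self._subtree(path):
            del self.entries[p]


idx = MetadataIndex()
idx.create_file("/demo/in/a.pdf")   # also indexes /demo and /demo/in
idx.move("/demo/in", "/demo/out")   # re-keys the folder and its file
idx.delete("/demo/out")             # removes the folder and its contents
print(idx.entries)
```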

File Walker

To index all relevant documents on the platform, a background routine in each Search-Service pod slowly crawls and indexes the whole file system. This file walker takes a best-effort approach to keeping the document index consistent with the state of the file system. It lets the index catch up with file and folder operations that were dropped due to errors, and with files that were added or removed through non-Instabase APIs. The indexing rate is configured at the deployment level, so the site admin can control it.
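Conceptually, this kind of crawl is a rate-limited reconciliation between the file system and the index. The sketch below shows the idea under simplifying assumptions: both sides are given as plain lists of paths, only missing and stale entries are considered (not modified files), and the function names are hypothetical rather than taken from Search-Service.

```python
import time


def reconcile(fs_paths, indexed_paths):
    """Return (paths to index, paths to remove from the index)."""
    fs, idx = set(fs_paths), set(indexed_paths)
    return sorted(fs - idx), sorted(idx - fs)


def throttled(items, docs_per_sec, sleep=time.sleep):
    """Yield items one at a time, pausing so the rate stays at or below docs_per_sec."""
    interval = 1.0 / docs_per_sec
    for item in items:
        yield item
        sleep(interval)


# Files seen on disk versus entries currently in the index.
to_index, to_remove = reconcile(
    ["/a.txt", "/b.txt", "/c.txt"],
    ["/b.txt", "/old.txt"],
)
for path in throttled(to_index, docs_per_sec=25):
    print("index", path)
for path in to_remove:
    print("remove", path)
```

Passing `sleep` as a parameter keeps the pacing logic easy to exercise without real delays.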

Infrastructure Requirements

Before enabling the Document Search feature, you will need to ensure that your system meets these minimum requirements.

Storage requirements

The document index is persisted in Elasticsearch. On average, 1 million documents require about 1.5 to 2.0 GB of storage space in Elasticsearch. To size storage for the document search feature, the site admin needs to:

  1. Estimate how many files your file system will need to hold, in millions.

  2. Multiply that number of millions by 2 GB.

  3. Verify that your Elasticsearch cluster has sufficient storage.

If your Elasticsearch cluster does not have sufficient storage, you will need to upgrade it with more storage space before enabling this feature.
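The sizing steps above amount to a one-line calculation. The sketch below uses the upper figure of 2 GB per million documents from this section. The function name is hypothetical, and rounding the document count up to whole millions is an assumption made to keep the estimate conservative.

```python
import math

GB_PER_MILLION_DOCS = 2.0  # upper end of the 1.5 to 2.0 GB range quoted above


def estimated_index_storage_gb(expected_docs):
    """Estimate Elasticsearch storage (GB) for the given number of documents."""
    millions = math.ceil(expected_docs / 1_000_000)  # round up to whole millions
    return millions * GB_PER_MILLION_DOCS


print(estimated_index_storage_gb(12_000_000))  # 12 million documents -> 24.0 GB
```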

Search-Service is stateless and does not require significant storage.

CPU/Memory requirements

Setting the file walker’s indexing rate limit

Before provisioning resources for Search-Service and Elasticsearch, decide on the file walker’s document indexing rate limit. It is controlled by the environment variable INDEXER_RATE_LIMIT_DOCS_PER_SEC, which caps the number of documents indexed per second per pod. For example, if you set this value to 25 (25 docs/sec/pod) and run 4 Search-Service pods in your environment, the aggregated indexing rate of the file walker is 100 docs/sec. At this setting, up to 360,000 documents are indexed every hour, or up to 8,640,000 documents every day. The value you choose affects resource consumption, as described in the sections below.

The recommended range for INDEXER_RATE_LIMIT_DOCS_PER_SEC is 5 to 40.
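The throughput arithmetic can be reproduced with a short helper. The function names are hypothetical, and the crawl-time figure is a best case that assumes the walker runs at its rate limit the whole time.

```python
def indexing_throughput(docs_per_sec_per_pod, pods):
    """Aggregate file walker throughput for a given per-pod limit and pod count."""
    per_sec = docs_per_sec_per_pod * pods
    return {
        "per_sec": per_sec,
        "per_hour": per_sec * 3600,
        "per_day": per_sec * 86400,
    }


def min_days_for_full_crawl(total_docs, docs_per_sec_per_pod, pods):
    """Best-case number of days to index total_docs at the configured limit."""
    return total_docs / indexing_throughput(docs_per_sec_per_pod, pods)["per_day"]


# INDEXER_RATE_LIMIT_DOCS_PER_SEC=25 with 4 Search-Service pods:
print(indexing_throughput(25, 4))
print(round(min_days_for_full_crawl(20_000_000, 25, 4), 2))
```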

Search-Service

The file walker runs as a background routine and uses approximately 50m (5% of a full core) to 250m CPU per pod as the indexing rate varies from 5 to 40 docs/sec. It is recommended that you leave at least 50% of each pod’s CPU for non-background tasks. Therefore, if your index rate limit is set to 5 docs/sec/pod, provision at least 100m CPU for each Search-Service pod; if it is set to 40 docs/sec/pod, provision at least 500m CPU for each Search-Service pod.
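For rate limits between the two documented endpoints, one way to pick a CPU request is to interpolate. The sketch below assumes that file walker CPU usage grows linearly between 50m at 5 docs/sec and 250m at 40 docs/sec, which this section does not state, and then doubles the result so the walker uses no more than half of the pod’s CPU. The function name is hypothetical.

```python
import math


def search_service_cpu_request_m(docs_per_sec_per_pod):
    """Suggested minimum CPU request (millicores) per Search-Service pod."""
    if not 5 <= docs_per_sec_per_pod <= 40:
        raise ValueError("rate limit outside the recommended range of 5 to 40")
    # Assumption: linear interpolation between the two documented data points.
    walker_m = 50 + (docs_per_sec_per_pod - 5) * 200 / 35
    # Keep the walker at or below 50% of the pod's CPU.
    return math.ceil(2 * walker_m)


for rate in (5, 25, 40):
    print(rate, search_service_cpu_request_m(rate))
```

The endpoints reproduce the figures given in the text (100m and 500m); values in between are estimates only.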

The indexing rate does not heavily impact memory consumption of Search-Service pods. We recommend requesting at least 256MB of memory per pod.

Elasticsearch

For an aggregated indexing rate between 50 and 500 docs/sec, it is recommended that you allocate an additional 200m to 2000m of CPU in your Elasticsearch cluster for this feature.

The average memory usage of the Elasticsearch cluster is relatively insensitive to the indexing rate when the aggregated indexing rate is below 200 docs/sec. For an aggregated indexing rate between 200 and 500 docs/sec, it is recommended that you provision an additional 4 GB of memory for the Elasticsearch cluster.
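The Elasticsearch guidance can be turned into a quick lookup. Both documented CPU endpoints work out to 4m per doc/sec, so the sketch below assumes that ratio holds across the whole range. Treating a rate of exactly 200 docs/sec as needing the extra memory is also an assumption, and the function name is hypothetical.

```python
def elasticsearch_extra_resources(aggregated_docs_per_sec):
    """Return (additional CPU in millicores, additional memory in GB)."""
    if not 50 <= aggregated_docs_per_sec <= 500:
        raise ValueError("guidance only covers 50 to 500 docs/sec")
    # 200m at 50 docs/sec and 2000m at 500 docs/sec both equal 4m per doc/sec.
    extra_cpu_m = 4 * aggregated_docs_per_sec
    # Memory is largely insensitive to the rate below 200 docs/sec.
    extra_memory_gb = 4 if aggregated_docs_per_sec >= 200 else 0
    return extra_cpu_m, extra_memory_gb


# 4 pods at 25 docs/sec/pod give an aggregated rate of 100 docs/sec.
print(elasticsearch_extra_resources(100))
```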