Model training and inference requirements

Model inference and training have specific requirements to ensure the best performance. Model inference relies on sufficient local volume capacity to reach production-level throughput, while model training requires a supported GPU to handle the demands of training jobs.

Model inference

Model inference functionality is provided through the model service. To support reliable inference, you must allocate sufficient resources to the model service.

Model service pods must be provisioned with at least 16 GB of memory and four cores each. Inference is supported only on CPUs, not GPUs.

To ensure production-level throughput, Instabase uses local volumes to store models, so you must provision adequate disk capacity on nodes where model service pods are deployed. Allocate a minimum of 50 GB of local volume capacity to each model service pod; the capacity actually needed can vary based on the specific models required.

Minimum requirements:

  • Model service: 16 GB of memory, 4 CPUs, and a 50 GB local volume per pod.
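
As a rough illustration, these minimums correspond to Kubernetes resource requests like the following sketch. The deployment and container names, the mount path, and the use of an emptyDir volume for local model storage are assumptions for illustration; your installation's actual names and volume configuration may differ.

# target: deployment-model-service (names assumed for illustration)

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: model-service
          resources:
            requests:
              cpu: 4000m      # 4 CPUs minimum
              memory: 16Gi    # 16 GB minimum
          volumeMounts:
            - name: model-cache
              mountPath: /models    # illustrative mount path
      volumes:
        - name: model-cache
          emptyDir:
            sizeLimit: 50Gi   # 50 GB local volume for model storage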

Model training

The model training services in the Instabase platform support custom fine-tuning of deep learning models using your proprietary datasets. Model training lets you leverage large pre-trained models to achieve high accuracy in extraction and classification tasks, even with limited datasets.

The Instabase platform provides two infrastructure options to support model training:

  • Celery worker-based model training tasks

  • Ray model training (introduced in public preview in Release 23.04)

Model training tasks

Model training functionality is provided through model-training-tasks-gpu. Model training is supported only in environments with GPUs. The number of concurrent training tasks is capped by the number of model training task GPU replicas. Training jobs can run for up to six hours, and all models must be trained through Instabase applications: ML Studio, Annotator, or Classifier.
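
For example, because concurrency is capped by the replica count, an environment that needs to run two training jobs at once could set two replicas, as in this sketch (assuming the cluster has a GPU available for each replica):

# target: deployment-model-training-tasks-gpu
# Sketch: two replicas allow up to two concurrent training tasks,
# assuming one GPU is available per replica.

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 2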

Ray model training

The functionality for Ray model training is provided through ray-head and ray-model-training-worker. In a Ray cluster, the Ray head node is a dedicated CPU node that runs the singleton processes responsible for managing the cluster and assigning training tasks to worker nodes; the number of replicas for the Ray head is always set to 1. Worker nodes are GPU nodes that execute the computational work of model training jobs. As with model training tasks, all models must be trained through Instabase applications, such as ML Studio.
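
The replica relationship described above can be sketched as follows. The deployment-ray-head name is an assumption for illustration; deployment-ray-model-training-worker matches the target used in the patches later in this section.

# target: deployment-ray-head (name assumed for illustration)
# The Ray head is a singleton CPU node that orchestrates the cluster.

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 1   # always 1

# target: deployment-ray-model-training-worker
# GPU workers execute the training jobs; scale to match available GPU nodes.

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 1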

Disable Ray model training

As of release 23.07, Ray is the default infrastructure for model training. You can optionally disable Ray model training and use model training tasks instead. To make this change, use Deployment Manager to apply the following patches, in the order they are listed:

  • Patch 1: Disable ENABLE_RAY_MODEL_TRAINING

    This patch sets the ENABLE_RAY_MODEL_TRAINING environment variable to "False" in api-server. This change routes model training requests to model training tasks.

    # target: deployment-api-server
    
    apiVersion: apps/v1
    kind: Deployment
    spec:
      template:
        spec:
          containers:
          - name: api-server
            env:
              - name: ENABLE_RAY_MODEL_TRAINING
                value: "False"
    
  • Patches 2 and 3: Invert replica counts

    These patches invert the replica counts for deployment-model-training-tasks-gpu and deployment-ray-model-training-worker: deployment-ray-model-training-worker has its replica count set to 0, while deployment-model-training-tasks-gpu has its replica count set to 1. (If binary autoscaling is enabled, the maximum replica count is set instead.)

    Without binary autoscaling (apply both patches):

    # target: deployment-model-training-tasks-gpu
    
    apiVersion: apps/v1
    kind: Deployment
    spec:
      replicas: 1
    
    # target: deployment-ray-model-training-worker
    
    apiVersion: apps/v1
    kind: Deployment
    spec:
      replicas: 0
    

    With binary autoscaling (apply both patches):

    # target: deployment-model-training-tasks-gpu
    
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: deployment-model-training-tasks-gpu
      annotations:
        autoscaling/enabled: "true"
        autoscaling/max_replicas: "1"
        autoscaling/queries: max_over_time(rabbitmq_queue_messages_unacked{queue="celery-model-training-tasks-gpu"}[30m])&max_over_time(rabbitmq_queue_messages_ready{queue="celery-model-training-tasks-gpu"}[30m])
    
    # target: deployment-ray-model-training-worker
    
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: deployment-ray-model-training-worker
      annotations:
        autoscaling/enabled: "true"
        autoscaling/max_replicas: "0"
        autoscaling/queries: max_over_time(clamp_min(sum(ray_tasks{State!~"FINISHED|FAILED"}[15s]), 0)[30m]) or vector(0)
    

GPU requirements

List of supported GPUs and card performance:

GPU     FP16 TFLOPS    VRAM     NVLink
A100    312            40 GB    Yes
V100    112            32 GB    Yes
A10     125            24 GB    No
A30     165            24 GB    Yes
A40     150            48 GB    Yes

The A10 is the lowest-tier GPU that Instabase supports.

Alternatively, you can provide GPU support by meeting these requirements:

  • Hardware support for CUDA 11.7 (including any relevant drivers).

  • Ability to run the NVIDIA device plugin (see the example pod after this list).

  • Compute performance that meets or exceeds the A10 (see the table above).
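
One way to confirm that the device plugin is exposing GPUs is to schedule a pod that requests the nvidia.com/gpu resource, as in the following sketch. The pod name and image are placeholders; any CUDA-capable image works.

# Sketch: a throwaway pod that schedules only if the NVIDIA device plugin
# is advertising GPUs on a node. Pod name and image are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
    - name: gpu-check
      image: nvidia/cuda:11.7.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1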

For faster distributed multi-GPU training, we recommend GPUs with NVLink, which provides a direct connection between GPUs and significantly reduces the time spent on multi-GPU synchronization.

Node requirements

Model training tasks & Ray model training worker

  • Memory: The amount of memory required is determined by the GPU VRAM and varies depending on the dataset and model. The minimum amount of RAM needed is either 16 GB or the GPU’s VRAM size, whichever is larger.

  • CPU: A minimum of four provisioned cores, primarily for data preparation for model training. The number of CPUs must be greater than or equal to the number of GPUs.

Ray head

  • Memory: The minimum amount of RAM needed is 16 GB.

  • CPU: A minimum of two provisioned cores, primarily for task orchestration and cluster management (see the example resource requests after this list).
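
As an illustration, the Ray head minimums above correspond to resource requests like the following sketch (the deployment and container names are assumed). For GPU worker nodes, request at least max(16 GB, GPU VRAM) of memory and at least as many CPUs as GPUs, as shown in the multi-GPU patch example later in this section.

# target: deployment-ray-head (names assumed for illustration)

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: ray-head
          resources:
            requests:
              cpu: 2000m     # 2 CPUs minimum, for orchestration and cluster management
              memory: 16Gi   # 16 GB minimum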

Distributed multi-GPU training

Ray model training lets you specify a GPU allocation mode at the training-task level in ML Studio. There are three GPU allocation modes:

  • The single GPU mode is the default mode, where each training job is allocated one GPU. This mode is suitable for most training tasks and ensures dedicated GPU resources for each job.

  • The partial mode is ideal for small training tasks. In this mode, each training job utilizes 0.5 GPU, enabling the concurrent execution of two jobs on a single GPU.

  • The multi_gpu mode is ideal for heavy training tasks. In this mode, all available GPUs are utilized for the training job, leveraging the full computational power to accelerate the model training job.

Specifying the GPU allocation mode provides more granular control over GPU allocation, letting you optimize resource utilization based on the specific requirements of your training tasks.

To utilize the distributed multi-GPU training feature, use Deployment Manager to apply a patch that makes the following changes to the ray-model-training-worker configuration:

  • Define the number of available GPUs.
  • Define a NUM_GPUS_REQUESTED value.
  • Update the memory value by multiplying its current value by the NUM_GPUS_REQUESTED value. For example, if the current memory value is 12Gi and the NUM_GPUS_REQUESTED value is 4, define the new memory value as 48Gi.
  • Ensure that the CPU count is greater than or equal to the GPU count.

The following example patch for ray-model-training-worker makes the required changes:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: ray-model-training-worker
          env:
            - name: NUM_GPUS_REQUESTED
              value: "4"
          resources:
            limits:
              nvidia.com/gpu: 4
              cpu: 4000m
              memory: 48Gi
            requests:
              nvidia.com/gpu: 4
              cpu: 4000m
              memory: 48Gi

Binary autoscaling for GPU nodes

Binary autoscaling for GPU nodes is typically used in SaaS deployments. When there are no model training requests, GPU nodes are automatically scaled down to save costs. When a model training request is received, the model training infrastructure automatically scales up the number of GPU nodes to execute model training tasks.

To enable autoscaling, apply the following patches in Deployment Manager:

  • Model training tasks

    # This patch enables binary autoscaling for model-training-tasks-gpu
    # target: deployment-model-training-tasks-gpu
    
    apiVersion: apps/v1
    kind: Deployment
    metadata:
        name: deployment-model-training-tasks-gpu
        annotations:
            autoscaling/enabled: "true"
            autoscaling/max_replicas: "1"
            autoscaling/queries: "max_over_time(rabbitmq_queue_messages_unacked{queue=\"celery-model-training-tasks-gpu\"}[30m])&max_over_time(rabbitmq_queue_messages_ready{queue=\"celery-model-training-tasks-gpu\"}[30m])"
    spec:
        replicas:
    
  • Ray model training

    # This patch enables binary autoscaling for ray-model-training-worker
    # target: deployment-ray-model-training-worker
    
    apiVersion: apps/v1
    kind: Deployment
    metadata:
        name: deployment-ray-model-training-worker
        annotations:
            autoscaling/enabled: "true"
            autoscaling/max_replicas: "1"
            autoscaling/queries: "max_over_time(clamp_min(sum(ray_tasks{State!~\"FINISHED|FAILED\"}[15s]), 0)[30m]) or vector(0)"
    spec:
        replicas: