Logging and troubleshooting

Troubleshoot failing or slow flows using logs and debugging tools.

To troubleshoot a single flow job, use logs and traces. To troubleshoot overall Flow performance or availability, use the Flow dashboard in Grafana.

Tip

You can retry flows with errors from the Flow Dashboard. In the Job Status column, click the Retry Job icon and select Retry files that have any step errors.

  • Flow logs track the progress of a single flow job, along with associated errors. Flow logs are typically the quickest way to diagnose issues. Scan the logs to see why a particular step or flow didn’t perform as expected, or use visualizations to identify slowdowns for specific files or steps. If your flow includes user-defined functions, you can add debugging log statements and view them in flow logs.

    To access logs for a given flow, from the Flow Dashboard, click Logs for the flow.

  • Flow traces provide a detailed view of an individual flow execution. Traces show all the services involved in running a flow, corresponding function calls and duration, and any errors.

    To access the trace for a given flow, from the Flow Dashboard, click Logs for the flow, then switch to the Trace tab, or click the Trace link that appears when you hover over a line in the log viewer.

  • The Flow dashboard in Grafana visualizes overall Flow system availability and performance. Use the Flow dashboard to check general system status and identify bottlenecks or service issues.

    To access the Flow dashboard, from Grafana, select Dashboards > Browse, then click Flow Dashboard.

Using the Flow dashboard in Grafana

The Flow dashboard in Grafana is divided into three sections. If flows aren’t starting or are repeatedly failing, begin with Flow Services Availability. If flows are slower than expected, check Flow System Degradation. For details about performance and usage, see Flow Execution Stats.

Flow Services Availability

This section visualizes whether services are online, whether services are crashing and restarting, and whether requests to any service are experiencing a high error rate.

The Flow components availability by deployment graphs show the availability of each component used in Flow execution. Available components are displayed in green; unavailable components are shown in red.

To troubleshoot unavailable components, check whether any pods are unready or in a CrashLoopBackOff state. If so, investigate by examining the pod’s logs. If a pod is unscheduled, find the reason by running the Kubernetes describe command for the pod: kubectl describe pods <MY-POD>.
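
For example, the following kubectl commands can help identify why a component is unavailable (a minimal sketch; add -n <NAMESPACE> if the Flow components run in a dedicated namespace):

    # List pods and look for unready or CrashLoopBackOff pods
    kubectl get pods

    # Examine the logs of a crashing pod; --previous shows the last terminated container
    kubectl logs <MY-POD> --previous

    # For unscheduled pods, the Events section explains why scheduling failed
    kubectl describe pods <MY-POD>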

The RPC errors in downstream services graph shows the rate at which gRPC requests error out along with the type of error. Troubleshooting steps depend on the type of error.

  • DEADLINE_EXCEEDED or UNAVAILABLE:

    • Check the status of the downstream service in the Flow components availability by deployment graph, and by running the kubectl describe deployment command for the respective service (example commands follow this list).

    • If the component is available, check the logs of any downstream and upstream services.

    • If the downstream service is not receiving requests due to network issues, restart the downstream service.

    • If the upstream service is still unable to connect to downstream services due to stale connections, try restarting the upstream services.

  • RESOURCE_EXHAUSTED – Increase the number of pods for the service.

  • CANCELLED – Check for gRPC timeout configuration in the upstream service deployment configuration.

  • INTERNAL or NOT_FOUND – Check the logs of the downstream service.
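
The commands below illustrate the checks and restarts described in this list (a sketch; substitute your actual downstream and upstream deployment names):

    # Check the status and recent events of the downstream service's deployment
    kubectl describe deployment <DOWNSTREAM-SERVICE>

    # Review the logs of the downstream and upstream services
    kubectl logs deployment/<DOWNSTREAM-SERVICE>
    kubectl logs deployment/<UPSTREAM-SERVICE>

    # Restart a service with a rolling restart of its deployment
    kubectl rollout restart deployment <DOWNSTREAM-SERVICE>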

Flow System Degradation

This section helps you identify why flows are slow.

The Memory Usage and CPU Usage graphs show usage by deployment.

If CPU usage is significantly higher than average, check logs and incoming requests for the service.
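
For example, you can confirm which pods are consuming CPU and then review their logs (assuming the Kubernetes metrics server is installed, which kubectl top requires):

    # Show per-pod CPU and memory usage, sorted by CPU
    kubectl top pods --sort-by=cpu

    # Review recent logs for the busy service
    kubectl logs deployment/<SERVICE-NAME> --since=1h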

If memory usage is significantly higher than average, open a shell in the container by running kubectl exec -it <pod_name> -- /bin/bash. You can then check process memory usage by running ps aux or top.
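
For example, once inside the container (a sketch; exact ps options vary by base image):

    # Sort processes by memory usage; %MEM is the fourth column of ps aux output
    ps aux | sort -rnk 4 | head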

The 95th Percentile - RPC Latency by Service and Average RPC Latency graphs show the p95 and average RPC latency for each service, as observed from the celery-app-tasks service. If certain requests have very high latency, investigate the logs of the associated service.

For example, if the 95th percentile latency panel shows app-tasks to model-service RPC latency of 4 minutes, some model-service calls are taking a long time. Examine the model service logs or dashboard to determine which models are causing the slowdown.

Flow Execution Stats

This section provides information about the underlying flow execution system.

The Running Jobs Count graph shows the number of jobs—from flows, refiners, and other services—running in the task scheduler. This visual gives you insight into overall load on the flow execution system.

The Running Tasks Count graph shows the number of running tasks in the task scheduler. Each job in the task scheduler consists of one or more tasks.

The Tasks Completed Per Minute graph shows task completion over time. If no tasks are being completed, it could mean that long-running tasks in the system are causing slowdowns, blocking other flows from being executed. Use the Flow Dashboard to identify flow jobs that aren’t progressing and troubleshoot or cancel them.

The RabbitMQ - Celery Queue graph shows the number of messages in the celery queue in RabbitMQ that are currently being processed or waiting to be processed. If the number of messages waiting to be processed doesn’t decrease for a significant period, it could mean there’s a long-running job in the system.
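
To inspect the queue directly, you can run rabbitmqctl inside the RabbitMQ pod (a sketch; the pod name and namespace depend on your installation):

    # List queues with total, ready, and unacknowledged message counts
    kubectl exec -it <RABBITMQ-POD> -- rabbitmqctl list_queues name messages messages_ready messages_unacknowledged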

The Number of Tasks Pending or Waiting graph shows tasks in the task scheduler. Pending tasks are waiting to be scheduled and sent to RabbitMQ. Waiting tasks are on hold until dependent tasks—such as a Flow Review—are completed.

In Flow, the maximum number of active running tasks is equal to the number of workers available in the celery-app-tasks pods. If the number of tasks scheduled is larger than the number of tasks that can run, remaining jobs are held in pending state. A large number of pending tasks leads to processing delays.
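
To check available task capacity, you can inspect the celery-app-tasks deployment (a sketch; the deployment name is assumed to match the pod name used above, and worker concurrency per pod depends on your configuration):

    # Check how many celery-app-tasks pods are running
    kubectl get deployment celery-app-tasks

    # The pod spec shows the configured worker concurrency, if set via arguments or environment variables
    kubectl describe deployment celery-app-tasks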

If the number of pending or waiting tasks does not decrease for a long time, check for slowdowns in running flow jobs. If pending tasks continue to increase, cancel long-running flow jobs.

The Jobs Completed Per Minute graph shows job statuses over time. If progress remains stagnant for a significant period, check for slowdowns in running flow jobs.