
Metrics and Observability

NativeLink provides comprehensive metrics through OpenTelemetry (OTEL), enabling deep insights into cache performance, remote execution pipelines, and system health.

NativeLink automatically exports metrics when configured with OTEL environment variables. The metrics cover:

  • Cache Operations: Hit rates, latencies, evictions
  • Execution Pipeline: Queue depths, stage durations, success rates
  • System Health: Worker utilization, throughput, error rates
To try it with the bundled example stack:

Terminal window
# Clone the repository
git clone https://github.com/TraceMachina/nativelink
cd nativelink/deployment-examples/metrics
# Start the metrics stack
docker-compose up -d
# Configure NativeLink
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_SERVICE_NAME=nativelink
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=dev"
# Run NativeLink
nativelink /path/to/config.json

Once the stack is up, access the bundled services on the ports defined in the example's docker-compose file.
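
If you aren't running the bundled example, a minimal stand-in is a single OTEL Collector container exposing an OTLP/gRPC receiver on port 4317. The sketch below is an assumption for illustration (image tag, config path, and the health/telemetry ports depend on your collector configuration), not part of the example repository:

# Hypothetical docker-compose fragment; all values are assumptions.
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC, matches OTEL_EXPORTER_OTLP_ENDPOINT above
      - "13133:13133" # health_check extension (used in the troubleshooting steps below)
      - "8888:8888"   # collector's own metrics endpoint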

NativeLink uses standard OpenTelemetry environment variables:

Terminal window
# Core OTLP Configuration
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc # or http/protobuf
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token"
OTEL_EXPORTER_OTLP_COMPRESSION=gzip
# Resource Attributes (customize for your deployment)
OTEL_SERVICE_NAME=nativelink # Fixed value
OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod,region=us-east-1"
# Metric Export Intervals
OTEL_METRIC_EXPORT_INTERVAL=60000 # 60 seconds
OTEL_METRIC_EXPORT_TIMEOUT=30000 # 30 seconds
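
In containerized deployments, the same variables are usually set on the NativeLink container spec. A hedged Kubernetes sketch; the container name, image reference, and collector service address are assumptions:

# Fragment of a Deployment pod spec; names and addresses are assumptions.
containers:
  - name: nativelink
    image: nativelink # image reference is an assumption
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector.monitoring.svc.cluster.local:4317 # assumed collector address
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc
      - name: OTEL_SERVICE_NAME
        value: nativelink
      - name: OTEL_RESOURCE_ATTRIBUTES
        value: deployment.environment=prod,region=us-east-1
      - name: OTEL_METRIC_EXPORT_INTERVAL
        value: "60000"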

The OTEL Collector adds resource attributes and batches metrics:

processors:
  resource:
    attributes:
      - key: service.namespace
        value: nativelink
        action: upsert
  batch:
    timeout: 10s
    send_batch_size: 1024
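
For context, here is a sketch of how these processors could be wired into a complete metrics pipeline. The OTLP receiver and Prometheus remote-write exporter are assumptions about the surrounding collector config, not taken from the example above:

# Hedged collector config sketch; receiver/exporter choices are assumptions.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  resource:
    attributes:
      - key: service.namespace
        value: nativelink
        action: upsert
  batch:
    timeout: 10s
    send_batch_size: 1024
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write # assumed backend address
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheusremotewrite]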

Monitor cache performance and efficiency:

Metric | Description | Key Labels
------ | ----------- | ----------
nativelink_cache_operations | Operations count by type and result | cache_type, operation, result
nativelink_cache_operation_duration | Operation latency histogram | cache_type, operation
nativelink_cache_hit_rate | Calculated hit rate (recording rule) | cache_type
nativelink_cache_size | Current cache size in bytes | cache_type
nativelink_cache_eviction_rate | Evictions per second | cache_type

Track remote execution pipeline performance:

Metric | Description | Key Labels
------ | ----------- | ----------
nativelink_execution_active_count | Actions in each stage | execution_stage
nativelink_execution_completed_count | Completed actions | execution_result
nativelink_execution_queue_time | Queue wait time histogram | priority
nativelink_execution_stage_duration | Time per stage | execution_stage
nativelink_execution_success_rate | Success percentage (recording rule) | instance

Actions progress through these stages:

  1. unknown - Initial state
  2. cache_check - Checking for cached results
  3. queued - Waiting for worker
  4. executing - Running on worker
  5. completed - Finished (success/failure/cache_hit)

Example PromQL queries built from these metrics:

# Cache hit rate by type
sum(rate(nativelink_cache_operations{result="hit"}[5m])) by (cache_type) /
sum(rate(nativelink_cache_operations{operation="read"}[5m])) by (cache_type)
# P95 cache operation latency
histogram_quantile(0.95,
sum(rate(nativelink_cache_operation_duration_bucket[5m])) by (le, cache_type)
)
# Cache eviction rate
sum(rate(nativelink_cache_operations{operation="evict"}[5m])) by (cache_type)
# Execution success rate
sum(rate(nativelink_execution_completed_count{result="success"}[5m])) /
sum(rate(nativelink_execution_completed_count[5m]))
# Queue depth by priority
sum(nativelink_execution_active_count{stage="queued"}) by (priority)
# Average queue time
histogram_quantile(0.5,
sum(rate(nativelink_execution_queue_time_bucket[5m])) by (le)
)
# Worker utilization
count(nativelink_execution_active_count{stage="executing"} > 0) /
count(count by (worker_id) (nativelink_execution_active_count))
# Overall throughput (actions/sec)
sum(rate(nativelink_execution_completed_count[5m]))
# Error rate
sum(rate(nativelink_execution_completed_count{result="failure"}[5m])) /
sum(rate(nativelink_execution_completed_count[5m]))
# Stage transition rate
sum(rate(nativelink_execution_stage_transitions[5m])) by (instance)
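
The nativelink:* series referenced in the tables above and in the dashboard below are Prometheus recording rules rather than raw metrics. A sketch of how they might be defined from the queries above; the group name is an assumption:

groups:
  - name: nativelink-recording-rules # group name is an assumption
    rules:
      - record: nativelink:cache_hit_rate
        expr: |
          sum(rate(nativelink_cache_operations{result="hit"}[5m])) by (cache_type)
            /
          sum(rate(nativelink_cache_operations{operation="read"}[5m])) by (cache_type)
      - record: nativelink:execution_success_rate
        expr: |
          sum(rate(nativelink_execution_completed_count{result="success"}[5m]))
            /
          sum(rate(nativelink_execution_completed_count[5m]))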

Import the pre-built dashboard for comprehensive monitoring:

{
  "title": "NativeLink Metrics",
  "panels": [
    {
      "title": "Execution Success Rate",
      "targets": [{
        "expr": "nativelink:execution_success_rate"
      }]
    },
    {
      "title": "Cache Hit Rate",
      "targets": [{
        "expr": "nativelink:cache_hit_rate"
      }]
    },
    {
      "title": "Queue Depth",
      "targets": [{
        "expr": "sum(nativelink_execution_active_count{stage=\"queued\"})"
      }]
    }
  ]
}
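
Saving that JSON to disk and loading it through Grafana's file-based dashboard provisioning is one option; a minimal sketch, with the provider name and path as assumptions:

# Grafana dashboard provisioning file; names and paths are assumptions.
apiVersion: 1
providers:
  - name: nativelink # provider name is an assumption
    type: file
    options:
      path: /var/lib/grafana/dashboards # assumed directory holding the JSON above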

Focus monitoring on three areas:

  1. SLI/SLO Metrics (encoded as alert rules in the sketch after this list):

    • Execution success rate > 99%
    • Cache hit rate > 80%
    • P95 queue time < 30s
    • P95 cache latency < 100ms
  2. Capacity Planning:

    • Queue depth trends
    • Worker utilization
    • Cache size growth
    • Eviction rates
  3. Performance Optimization:

    • Stage duration breakdowns
    • Cache operation latencies
    • Output size distributions
    • Retry rates
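
A hedged sketch of alerting rules encoding the SLI/SLO targets from item 1, reusing the recording rules and histogram queries shown earlier. The rule names, `for` durations, and the assumption that queue-time histograms are recorded in seconds are all illustrative:

groups:
  - name: nativelink-slo # group name is an assumption
    rules:
      - alert: ExecutionSuccessRateLow
        expr: nativelink:execution_success_rate < 0.99
        for: 10m
        annotations:
          summary: "Execution success rate below the 99% target"
      - alert: CacheHitRateLow
        expr: nativelink:cache_hit_rate < 0.80
        for: 30m
        annotations:
          summary: "Cache hit rate below the 80% target"
      - alert: QueueTimeP95High
        # Assumes the queue-time histogram is recorded in seconds.
        expr: |
          histogram_quantile(0.95,
            sum(rate(nativelink_execution_queue_time_bucket[5m])) by (le)) > 30
        for: 10m
        annotations:
          summary: "P95 queue time above the 30s target"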

Prometheus is the best fit for most deployments, with excellent query capabilities:

# Enable OTLP receiver
prometheus --web.enable-otlp-receiver

# Configure out-of-order handling
storage:
  tsdb:
    out_of_order_time_window: 30m

Grafana Cloud is a managed solution with built-in dashboards:

Terminal window
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway.grafana.net/otlp
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer ${GRAFANA_TOKEN}"

ClickHouse works well for high-volume metrics that you want to query with SQL:

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    database: nativelink_metrics
    ttl_days: 90

Quickwit provides unified logs and metrics search:

exporters:
  otlp:
    endpoint: quickwit:7281
    headers:
      x-quickwit-index: nativelink-metrics

Define alerting rules for the failure modes that matter most:

- alert: HighErrorRate
  expr: |
    (1 - nativelink:execution_success_rate) > 0.05
  for: 5m
  annotations:
    summary: "Execution error rate above 5%"
- alert: QueueBacklog
  expr: |
    sum(nativelink_execution_active_count{stage="queued"}) > 100
  for: 15m
  annotations:
    summary: "Queue backlog exceeds 100 actions"
- alert: CacheEvictionHigh
  expr: |
    rate(nativelink_cache_operations{operation="evict"}[5m]) > 10
  for: 10m
  annotations:
    summary: "Cache eviction rate exceeds threshold"

If no metrics appear, work through these checks:

  1. Verify OTEL environment variables:

    Terminal window
    env | grep OTEL_
  2. Check collector health:

    Terminal window
    curl http://localhost:13133/health
  3. Verify metrics are being received (see the debug-exporter sketch after this list):

    Terminal window
    curl http://localhost:8888/metrics | grep otelcol_receiver
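
If the receiver counters above stay at zero, one option (an assumption about your collector setup, not part of the example config) is to temporarily add the collector's debug exporter to the metrics pipeline so incoming data points are logged:

# Hedged collector fragment for troubleshooting only.
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]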

If metric cardinality becomes a problem, reduce label dimensions at the collector:

processors:
  attributes:
    actions:
      - key: high_cardinality_label
        action: delete

If Prometheus rejects late samples as out of order, increase its out-of-order window:

storage:
  tsdb:
    out_of_order_time_window: 1h

To lower metrics overhead, export less often and batch more aggressively at the collector:

Terminal window
# Increase export interval for lower overhead
export OTEL_METRIC_EXPORT_INTERVAL=120000 # 2 minutes

# Batch metrics at the collector
processors:
  batch:
    send_batch_size: 2048
    timeout: 30s

Use Prometheus recording rules for expensive queries:

- record: nativelink:hourly_success_rate
  expr: |
    avg_over_time(nativelink:execution_success_rate[1h])
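
Recording and alerting rules only take effect once the rule file is referenced from the Prometheus configuration; a minimal fragment, with the file path as an assumption:

# prometheus.yml fragment
rule_files:
  - /etc/prometheus/rules/nativelink-rules.yaml # path is an assumption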

For high-volume deployments, sample metrics:

processors:
  probabilistic_sampler:
    sampling_percentage: 10 # Sample 10% of metrics