Metrics and Observability
NativeLink provides comprehensive metrics through OpenTelemetry (OTEL), enabling deep insights into cache performance, remote execution pipelines, and system health.
NativeLink automatically exports metrics when configured with OTEL environment variables. The metrics cover:
- Cache Operations: Hit rates, latencies, evictions
- Execution Pipeline: Queue depths, stage durations, success rates
- System Health: Worker utilization, throughput, error rates
```bash
# Clone the repository
git clone https://github.com/TraceMachina/nativelink
cd nativelink/deployment-examples/metrics

# Start the metrics stack
docker-compose up -d

# Configure NativeLink
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_SERVICE_NAME=nativelink
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=dev"

# Run NativeLink
nativelink /path/to/config.json
```

Access the services:
- Prometheus: http://localhost:9091
- Grafana: http://localhost:3000 (admin/admin)
- OTEL Collector: http://localhost:8888/metrics
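The pieces above fit together as OTLP in, Prometheus out. The sketch below shows a minimal collector pipeline using the standard `otlp` receiver and `prometheus` exporter with the ports listed above; the exact file shipped in deployment-examples may differ.

```yaml
# Sketch of the collector pipeline; exporter choice and ports are assumptions.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317    # NativeLink sends OTLP metrics here

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889        # Prometheus scrapes this endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```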
```bash
# Create namespace
kubectl create namespace nativelink

# Deploy OTEL Collector
kubectl apply -f deployment-examples/metrics/kubernetes/otel-collector.yaml

# Deploy Prometheus
kubectl apply -f deployment-examples/metrics/kubernetes/prometheus.yaml

# Configure NativeLink pods
kubectl set env deployment/nativelink \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
  OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
  OTEL_RESOURCE_ATTRIBUTES="k8s.cluster.name=main"
```

Prometheus can also receive OTLP metrics directly, without a collector in between:

```bash
# Start Prometheus with OTLP receiver
prometheus \
  --web.enable-otlp-receiver \
  --storage.tsdb.out-of-order-time-window=30m \
  --config.file=prometheus.yml
```
```bash
# Configure NativeLink
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://localhost:9090/api/v1/otlp/v1/metrics
export OTEL_SERVICE_NAME=nativelink
export OTEL_RESOURCE_ATTRIBUTES="service.instance.id=$(uuidgen)"

# Disable traces and logs
export OTEL_TRACES_EXPORTER=none
export OTEL_LOGS_EXPORTER=none
```

NativeLink uses standard OpenTelemetry environment variables:
```bash
# Core OTLP Configuration
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc                  # or http/protobuf
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token"
OTEL_EXPORTER_OTLP_COMPRESSION=gzip

# Resource Attributes (customize for your deployment)
OTEL_SERVICE_NAME=nativelink                      # Fixed value
OTEL_RESOURCE_ATTRIBUTES="deployment.environment=prod,region=us-east-1"

# Metric Export Intervals
OTEL_METRIC_EXPORT_INTERVAL=60000                 # 60 seconds
OTEL_METRIC_EXPORT_TIMEOUT=30000                  # 30 seconds
```

The OTEL Collector adds resource attributes and batches metrics:
```yaml
processors:
  resource:
    attributes:
      - key: service.namespace
        value: nativelink
        action: upsert
  batch:
    timeout: 10s
    send_batch_size: 1024
```

Monitor cache performance and efficiency:
| Metric | Description | Key Labels |
|---|---|---|
| nativelink_cache_operations | Operations count by type and result | cache_type, operation, result |
| nativelink_cache_operation_duration | Operation latency histogram | cache_type, operation |
| nativelink_cache_hit_rate | Calculated hit rate (recording rule) | cache_type |
| nativelink_cache_size | Current cache size in bytes | cache_type |
| nativelink_cache_eviction_rate | Evictions per second | cache_type |
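The hit-rate metric above is not exported directly; it is computed by a Prometheus recording rule. Below is a minimal sketch of such a rule, assuming the nativelink:cache_hit_rate name used by the Grafana dashboard later on this page and the hit-rate query from the examples below; the group name is illustrative.

```yaml
# Hypothetical recording-rule file; group name is illustrative.
groups:
  - name: nativelink-cache
    rules:
      - record: nativelink:cache_hit_rate
        expr: |
          sum(rate(nativelink_cache_operations{result="hit"}[5m])) by (cache_type)
            /
          sum(rate(nativelink_cache_operations{operation="read"}[5m])) by (cache_type)
```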
Track remote execution pipeline performance:
| Metric | Description | Key Labels |
|---|---|---|
| nativelink_execution_active_count | Actions in each stage | execution_stage |
| nativelink_execution_completed_count | Completed actions | execution_result |
| nativelink_execution_queue_time | Queue wait time histogram | priority |
| nativelink_execution_stage_duration | Time per stage | execution_stage |
| nativelink_execution_success_rate | Success percentage (recording rule) | instance |
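Likewise, the success-rate series is a recording rule. A sketch matching the nativelink:execution_success_rate name referenced by the dashboard and alerts below, with the expression taken from the query examples below (group name illustrative):

```yaml
# Hypothetical recording-rule file; group name is illustrative.
groups:
  - name: nativelink-execution
    rules:
      - record: nativelink:execution_success_rate
        expr: |
          sum(rate(nativelink_execution_completed_count{result="success"}[5m]))
            /
          sum(rate(nativelink_execution_completed_count[5m]))
```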
Actions progress through these stages:
- unknown - Initial state
- cache_check - Checking for cached results
- queued - Waiting for worker
- executing - Running on worker
- completed - Finished (success/failure/cache_hit)
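To watch how actions are distributed across these stages over time, a small recording rule over the active-count gauge can pre-aggregate the per-stage totals. This is only a sketch; the rule name is illustrative and the label name follows the metrics table above.

```yaml
# Hypothetical recording rule for per-stage action counts.
groups:
  - name: nativelink-stages
    rules:
      - record: nativelink:actions_by_stage
        expr: sum(nativelink_execution_active_count) by (execution_stage)
```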
```promql
# Cache hit rate by type
sum(rate(nativelink_cache_operations{result="hit"}[5m])) by (cache_type)
  /
sum(rate(nativelink_cache_operations{operation="read"}[5m])) by (cache_type)

# P95 cache operation latency
histogram_quantile(0.95, sum(rate(nativelink_cache_operation_duration_bucket[5m])) by (le, cache_type))

# Cache eviction rate
sum(rate(nativelink_cache_operations{operation="evict"}[5m])) by (cache_type)
```

```promql
# Execution success rate
sum(rate(nativelink_execution_completed_count{result="success"}[5m]))
  /
sum(rate(nativelink_execution_completed_count[5m]))

# Queue depth by priority
sum(nativelink_execution_active_count{stage="queued"}) by (priority)

# Median (P50) queue time
histogram_quantile(0.5, sum(rate(nativelink_execution_queue_time_bucket[5m])) by (le))

# Worker utilization
count(nativelink_execution_active_count{stage="executing"} > 0)
  /
count(count by (worker_id) (nativelink_execution_active_count))

# Overall throughput (actions/sec)
sum(rate(nativelink_execution_completed_count[5m]))

# Error rate
sum(rate(nativelink_execution_completed_count{result="failure"}[5m]))
  /
sum(rate(nativelink_execution_completed_count[5m]))

# Stage transition rate
sum(rate(nativelink_execution_stage_transitions[5m])) by (instance)
```

Import the pre-built dashboard for comprehensive monitoring:
{ "title": "NativeLink Metrics", "panels": [ { "title": "Execution Success Rate", "targets": [{ "expr": "nativelink:execution_success_rate" }] }, { "title": "Cache Hit Rate", "targets": [{ "expr": "nativelink:cache_hit_rate" }] }, { "title": "Queue Depth", "targets": [{ "expr": "sum(nativelink_execution_active_count{stage=\"queued\"})" }] } ]}-
- SLI/SLO Metrics (see the alert sketch after this list):
  - Execution success rate > 99%
  - Cache hit rate > 80%
  - P95 queue time < 30s
  - P95 cache latency < 100ms
- Capacity Planning:
  - Queue depth trends
  - Worker utilization
  - Cache size growth
  - Eviction rates
- Performance Optimization:
  - Stage duration breakdowns
  - Cache operation latencies
  - Output size distributions
  - Retry rates
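One of the SLO targets above, P95 queue time under 30 seconds, can be expressed as an alert by reusing the queue-time histogram. A sketch in the same style as the alerting rules later on this page; the alert name and `for:` duration are illustrative:

```yaml
- alert: QueueTimeSLOBreach        # illustrative name
  expr: |
    histogram_quantile(0.95,
      sum(rate(nativelink_execution_queue_time_bucket[5m])) by (le)
    ) > 30
  for: 10m
  annotations:
    summary: "P95 queue time above the 30s SLO"
```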
Prometheus is the best fit for most deployments and offers excellent query capabilities:
```bash
# Enable OTLP receiver
prometheus --web.enable-otlp-receiver
```

```yaml
# Configure out-of-order handling
storage:
  tsdb:
    out_of_order_time_window: 30m
```

Grafana Cloud is a managed solution with built-in dashboards:
```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway.grafana.net/otlp
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer ${GRAFANA_TOKEN}"
```

For high-volume metrics with SQL queries, export to ClickHouse:
```yaml
exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    database: nativelink_metrics
    ttl_days: 90
```

Quickwit provides unified logs and metrics search:
```yaml
exporters:
  otlp:
    endpoint: quickwit:7281
    headers:
      x-quickwit-index: nativelink-metrics
```

Example Prometheus alerting rules:

```yaml
- alert: HighErrorRate
  expr: |
    (1 - nativelink:execution_success_rate) > 0.05
  for: 5m
  annotations:
    summary: "Execution error rate above 5%"
```
```yaml
- alert: QueueBacklog
  expr: |
    sum(nativelink_execution_active_count{stage="queued"}) > 100
  for: 15m
  annotations:
    summary: "Queue backlog exceeds 100 actions"
```
```yaml
- alert: CacheEvictionHigh
  expr: |
    rate(nativelink_cache_operations{operation="evict"}[5m]) > 10
  for: 10m
  annotations:
    summary: "Cache eviction rate exceeds threshold"
```
- Verify OTEL environment variables:

  ```bash
  env | grep OTEL_
  ```

- Check collector health:

  ```bash
  curl http://localhost:13133/health
  ```

- Verify metrics are being received:

  ```bash
  curl http://localhost:8888/metrics | grep otelcol_receiver
  ```
Reduce label dimensions:
```yaml
processors:
  attributes:
    actions:
      - key: high_cardinality_label
        action: delete
```

Increase the Prometheus out-of-order window:
```yaml
storage:
  tsdb:
    out_of_order_time_window: 1h
```

```bash
# Increase export interval for lower overhead
export OTEL_METRIC_EXPORT_INTERVAL=120000  # 2 minutes
```
```yaml
# Batch metrics at collector
processors:
  batch:
    send_batch_size: 2048
    timeout: 30s
```

Use Prometheus recording rules for expensive queries:
```yaml
- record: nativelink:hourly_success_rate
  expr: |
    avg_over_time(nativelink:execution_success_rate[1h])
```

For high-volume deployments, sample metrics:
```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Sample 10% of metrics
```