OOMKilled in signoz-otel-collector despite high memory limits #9773

@lynyuxuan

Description

Bug description

The signoz-otel-collector pod is repeatedly OOMKilled despite being allocated a 12Gi memory request / 13Gi limit. In contrast, a standard OpenTelemetry Collector (e.g., opentelemetry-collector-contrib:0.128.0) with a similar pipeline runs stably on just 2Gi under the same load.

Pod event: Warning PodOOMKilling ... Reason: OOMKilled, Exit Code 137.

Expected behavior

The collector should not exhaust 13Gi of memory under moderate telemetry load, especially with memory_limiter set to 8000 MiB.

How to reproduce

  1. Deploy SigNoz with signoz-otel-collector:v0.129.11

Version information

  • SigNoz version: v0.129.11
  • Architecture: x86_64
    [image]

Additional context

apiVersion: v1
kind: ConfigMap
metadata:
  name: signoz-otel-collector
  namespace: signoz
data:
  otel-collector-config.yaml: |
connectors:
  signozmeter:
    dimensions:
    - name: service.name
    - name: deployment.environment
    - name: host.name
    metrics_flush_interval: 1h

exporters:
  clickhouselogsexporter:
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_LOG_DATABASE}
    timeout: 10s
    use_new_schema: true
    sending_queue:
      enabled: true
      queue_size: 1024
      num_consumers: 10
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 10s
      max_elapsed_time: 120s
  
  clickhousetraces:
    datasource: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_TRACE_DATABASE}
    low_cardinal_exception_grouping: ${env:LOW_CARDINAL_EXCEPTION_GROUPING}
    timeout: 10s
    use_new_schema: true
    sending_queue:
      enabled: true
      queue_size: 5000
      num_consumers: 10     
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 10s
      max_elapsed_time: 120s
  
  metadataexporter:
    cache:
      provider: in_memory
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/signoz_metadata
    tenant_id: ${env:TENANT_ID}
    timeout: 15s
  
  signozclickhousemeter:
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_METER_DATABASE}
    sending_queue:
      enabled: false
    timeout: 45s
  
  signozclickhousemetrics:
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
    timeout: 45s

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
  zpages:
    endpoint: localhost:55679

processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 8000
    spike_limit_mib: 1024
  
  batch:
    send_batch_size: 800
    timeout: 2s             
    send_batch_max_size: 1000
  
  batch/meter:
    send_batch_size: 800
    timeout: 2s               
    send_batch_max_size: 1000 
  
  signozspanmetrics/delta:
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    dimensions:
    - default: default
      name: service.namespace
    - default: default
      name: deployment.environment
    - name: signoz.collector.id
    dimensions_cache_size: 1000  # ⚡ lowered from 100000 to 10000
    latency_histogram_buckets:
    - 100us
    - 1ms
    - 2ms
    - 6ms
    - 10ms
    - 50ms
    - 100ms
    - 250ms
    - 500ms
    - 1000ms
    - 1400ms
    - 2000ms
    - 5s
    - 10s
    - 20s
    - 40s
    - 60s
    metrics_exporter: signozclickhousemetrics

receivers:
  httplogreceiver/heroku:
    endpoint: 0.0.0.0:8081
    source: heroku
  httplogreceiver/json:
    endpoint: 0.0.0.0:8082
    source: json
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
      http:
        endpoint: 0.0.0.0:4318

service:
  extensions:
  - health_check
  - zpages
  - pprof
  pipelines:
    logs:
      receivers:
      - otlp
      - httplogreceiver/heroku
      - httplogreceiver/json
      processors:
      - memory_limiter  # ⚡ first processor
      - batch
      exporters:
      - clickhouselogsexporter
      - metadataexporter
      - signozmeter
    
    metrics:
      receivers:
      - otlp
      processors:
      - memory_limiter  
      - batch
      exporters:
      - metadataexporter
      - signozclickhousemetrics
      - signozmeter
    
    metrics/meter:
      receivers:
      - signozmeter
      processors:
      - memory_limiter  
      - batch/meter
      exporters:
      - signozclickhousemeter
    
    traces:
      receivers:
      - otlp
      processors:
      - memory_limiter        
      - signozspanmetrics/delta
      - batch
      exporters:
      - clickhousetraces
      - metadataexporter
      - signozmeter
  
  telemetry:
    logs:
      level: warn  
      encoding: json

ConfigMap includes:

  • memory_limiter with limit_mib: 8000
  • Large sending queues (queue_size: 5000 for traces)
  • signozspanmetrics/delta with dimensions_cache_size: 1000

However, the container still exceeds 13Gi and gets killed.
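
As far as I understand, memory_limiter only samples Go heap usage every check_interval (5s here), so allocation bursts between checks, plus memory that lives outside the Go heap, can push the container RSS well past limit_mib; that may explain part of the gap, but not the whole 8Gi-to-13Gi difference.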

By comparison, a vanilla OTel Collector (no ClickHouse exporters, no spanmetrics) runs on 1 CPU / 2Gi without issues.

Suspected causes:

  • Unbounded memory in ClickHouse exporters
  • Spanmetrics cache growth
  • Queue sizes translating to GBs of buffered data (rough estimate below)
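
For a rough sense of the last point (a back-of-the-envelope estimate, assuming on the order of 1-2 KiB of in-memory size per span): clickhousetraces is configured with queue_size: 5000 and batches can reach send_batch_max_size: 1000 spans, so a full sending queue alone could hold around 5,000,000 spans, i.e. roughly 5-10 GiB, before it starts rejecting data.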

Suggested next steps: investigate the memory profile via pprof and reduce queue/cache sizes as a workaround; a config sketch follows below.
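
The pprof extension is already enabled on localhost:1777 in the config above, so a heap profile should be obtainable from inside the pod (or via a port-forward) at the /debug/pprof/heap path while memory is climbing; that should show whether the growth sits in the exporter queues, the spanmetrics cache, or elsewhere.

As a stopgap, overrides along the following lines might keep buffering bounded. This is only a sketch against the ConfigMap posted above; the numbers are guesses and would need tuning against the actual span rate:

processors:
  memory_limiter:
    check_interval: 1s        # check more often so the limiter can react before the kernel OOM killer does
    limit_mib: 8000
    spike_limit_mib: 2000     # guess: more headroom for allocation bursts between checks

exporters:
  clickhousetraces:
    sending_queue:
      enabled: true
      queue_size: 500         # guess: 10x smaller queue; bounds the worst-case number of buffered batches
      num_consumers: 10

If the heap profile instead points at the signozspanmetrics/delta dimensions cache or the metadata exporter's in-memory cache, shrinking those (or temporarily removing signozspanmetrics/delta from the traces pipeline) would be the next thing to try.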

Metadata

Labels

configure (Issues with configuring SigNoz), question (Questions about using SigNoz)
