OOMKilled in signoz-otel-collector despite high memory limits #9773

@lynyuxuan

Description

Bug description

The signoz-otel-collector pod is repeatedly OOMKilled despite being allocated a 12Gi memory request / 13Gi limit. In contrast, a standard OpenTelemetry Collector (e.g., opentelemetry-collector-contrib:0.128.0) with a similar pipeline runs stably on just 2Gi under the same load.

Pod event: Warning PodOOMKilling ... Reason: OOMKilled, Exit Code 137.

Expected behavior

The collector should not exhaust 13Gi of memory under moderate telemetry load, especially with memory_limiter set to 8000 MiB.

How to reproduce

  1. Deploy SigNoz with signoz-otel-collector:v0.129.11

Version information

  • SigNoz version: v0.129.11
  • Architecture: x86_64
    [image]

Additional context

apiVersion: v1
kind: ConfigMap
metadata:
  name: signoz-otel-collector
  namespace: signoz
data:
  otel-collector-config.yaml: |
connectors:
  signozmeter:
    dimensions:
    - name: service.name
    - name: deployment.environment
    - name: host.name
    metrics_flush_interval: 1h

exporters:
  clickhouselogsexporter:
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_LOG_DATABASE}
    timeout: 10s
    use_new_schema: true
    sending_queue:
      enabled: true
      queue_size: 1024
      num_consumers: 10
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 10s
      max_elapsed_time: 120s
  
  clickhousetraces:
    datasource: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_TRACE_DATABASE}
    low_cardinal_exception_grouping: ${env:LOW_CARDINAL_EXCEPTION_GROUPING}
    timeout: 10s
    use_new_schema: true
    sending_queue:
      enabled: true
      queue_size: 5000
      num_consumers: 10     
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 10s
      max_elapsed_time: 120s
  
  metadataexporter:
    cache:
      provider: in_memory
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/signoz_metadata
    tenant_id: ${env:TENANT_ID}
    timeout: 15s
  
  signozclickhousemeter:
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_METER_DATABASE}
    sending_queue:
      enabled: false
    timeout: 45s
  
  signozclickhousemetrics:
    dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
    timeout: 45s

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
  zpages:
    endpoint: localhost:55679

processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 8000
    spike_limit_mib: 1024
  
  batch:
    send_batch_size: 800
    timeout: 2s             
    send_batch_max_size: 1000
  
  batch/meter:
    send_batch_size: 800
    timeout: 2s               
    send_batch_max_size: 1000 
  
  signozspanmetrics/delta:
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    dimensions:
    - default: default
      name: service.namespace
    - default: default
      name: deployment.environment
    - name: signoz.collector.id
    dimensions_cache_size: 1000  # ⚡ lowered from 100000 to 10000
    latency_histogram_buckets:
    - 100us
    - 1ms
    - 2ms
    - 6ms
    - 10ms
    - 50ms
    - 100ms
    - 250ms
    - 500ms
    - 1000ms
    - 1400ms
    - 2000ms
    - 5s
    - 10s
    - 20s
    - 40s
    - 60s
    metrics_exporter: signozclickhousemetrics

receivers:
  httplogreceiver/heroku:
    endpoint: 0.0.0.0:8081
    source: heroku
  httplogreceiver/json:
    endpoint: 0.0.0.0:8082
    source: json
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
      http:
        endpoint: 0.0.0.0:4318

service:
  extensions:
  - health_check
  - zpages
  - pprof
  pipelines:
    logs:
      receivers:
      - otlp
      - httplogreceiver/heroku
      - httplogreceiver/json
      processors:
      - memory_limiter  # ⚡ first processor
      - batch
      exporters:
      - clickhouselogsexporter
      - metadataexporter
      - signozmeter
    
    metrics:
      receivers:
      - otlp
      processors:
      - memory_limiter  
      - batch
      exporters:
      - metadataexporter
      - signozclickhousemetrics
      - signozmeter
    
    metrics/meter:
      receivers:
      - signozmeter
      processors:
      - memory_limiter  
      - batch/meter
      exporters:
      - signozclickhousemeter
    
    traces:
      receivers:
      - otlp
      processors:
      - memory_limiter        
      - signozspanmetrics/delta
      - batch
      exporters:
      - clickhousetraces
      - metadataexporter
      - signozmeter
  
  telemetry:
    logs:
      level: warn  
      encoding: json

ConfigMap includes:

  • memory_limiter with limit_mib: 8000
  • Large sending queues (queue_size: 5000 for traces)
  • signozspanmetrics/delta with dimensions_cache_size: 1000

However, the container still exceeds 13Gi and gets killed.
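
As far as I understand, memory_limiter only samples Go heap usage every check_interval (5s here), so allocation bursts between checks, plus memory that lives outside the Go heap, can push the container RSS well past limit_mib; that may explain part of the gap, but not the whole 8Gi-to-13Gi difference.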

By comparison, a vanilla OTel Collector (no ClickHouse exporters, no spanmetrics) runs on 1 CPU / 2Gi without issues.

Suspected causes:

  • Unbounded memory in ClickHouse exporters
  • Spanmetrics cache growth
  • Queue sizes translating to GBs of buffered data (rough estimate below)
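
For a rough sense of the last point (a back-of-the-envelope estimate, assuming on the order of 1-2 KiB of in-memory size per span): clickhousetraces is configured with queue_size: 5000 and batches can reach send_batch_max_size: 1000 spans, so a full sending queue alone could hold around 5,000,000 spans, i.e. roughly 5-10 GiB, before it starts rejecting data.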

Suggested next steps: investigate the memory profile via pprof and reduce queue/cache sizes as a workaround; a config sketch follows below.
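
The pprof extension is already enabled on localhost:1777 in the config above, so a heap profile should be obtainable from inside the pod (or via a port-forward) at the /debug/pprof/heap path while memory is climbing; that should show whether the growth sits in the exporter queues, the spanmetrics cache, or elsewhere.

As a stopgap, overrides along the following lines might keep buffering bounded. This is only a sketch against the ConfigMap posted above; the numbers are guesses and would need tuning against the actual span rate:

processors:
  memory_limiter:
    check_interval: 1s        # check more often so the limiter can react before the kernel OOM killer does
    limit_mib: 8000
    spike_limit_mib: 2000     # guess: more headroom for allocation bursts between checks

exporters:
  clickhousetraces:
    sending_queue:
      enabled: true
      queue_size: 500         # guess: 10x smaller queue; bounds the worst-case number of buffered batches
      num_consumers: 10

If the heap profile instead points at the signozspanmetrics/delta dimensions cache or the metadata exporter's in-memory cache, shrinking those (or temporarily removing signozspanmetrics/delta from the traces pipeline) would be the next thing to try.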

Metadata

Labels

configure (Issues with configuring SigNoz), question (Questions about using SigNoz)
