Bug description
The signoz-otel-collector pod is repeatedly OOMKilled despite being allocated a 12Gi memory request / 13Gi limit. In contrast, a standard OpenTelemetry Collector (e.g., opentelemetry-collector-contrib:0.128.0) with a similar pipeline runs stably on just 2Gi under the same load.
Pod event: Warning PodOOMKilling ... Reason: OOMKilled, Exit Code 137.
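For reference, the request/limit described above corresponds to a container spec roughly like the following (a sketch of the described values, not copied from the reporter's manifest):

resources:
  requests:
    memory: 12Gi
  limits:
    memory: 13Gi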
Expected behavior
The collector should not exhaust 13Gi of memory under moderate telemetry load, especially when memory_limiter is configured with limit_mib: 8000 (8000 MiB).
How to reproduce
- Deploy SigNoz with signoz-otel-collector:v0.129.11 (a deployment sketch follows this list)
- Send a moderate telemetry load (traces, logs, and metrics) to the collector
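A minimal deployment sketch, assuming the SigNoz Helm chart is used; the values key for pinning the collector image tag is an assumption, not verified against the chart:

helm repo add signoz https://charts.signoz.io
helm repo update
# otelCollector.image.tag is assumed to be the relevant override in the chart's values
helm install signoz signoz/signoz -n signoz --create-namespace \
  --set otelCollector.image.tag=v0.129.11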
Version information
- Signoz version: v0.129.11
- Architecture: x86_64
Additional context
apiVersion: v1
kind: ConfigMap
metadata:
  name: signoz-otel-collector
  namespace: signoz
data:
  otel-collector-config.yaml: |
    connectors:
      signozmeter:
        dimensions:
          - name: service.name
          - name: deployment.environment
          - name: host.name
        metrics_flush_interval: 1h
    exporters:
      clickhouselogsexporter:
        dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_LOG_DATABASE}
        timeout: 10s
        use_new_schema: true
        sending_queue:
          enabled: true
          queue_size: 1024
          num_consumers: 10
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 10s
          max_elapsed_time: 120s
      clickhousetraces:
        datasource: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_TRACE_DATABASE}
        low_cardinal_exception_grouping: ${env:LOW_CARDINAL_EXCEPTION_GROUPING}
        timeout: 10s
        use_new_schema: true
        sending_queue:
          enabled: true
          queue_size: 5000
          num_consumers: 10
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 10s
          max_elapsed_time: 120s
      metadataexporter:
        cache:
          provider: in_memory
        dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/signoz_metadata
        tenant_id: ${env:TENANT_ID}
        timeout: 15s
      signozclickhousemeter:
        dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_METER_DATABASE}
        sending_queue:
          enabled: false
        timeout: 45s
      signozclickhousemetrics:
        dsn: tcp://${env:CLICKHOUSE_USER}:${env:CLICKHOUSE_PASSWORD}@${env:CLICKHOUSE_HOST}:${env:CLICKHOUSE_PORT}/${env:CLICKHOUSE_DATABASE}
        timeout: 45s
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
        endpoint: localhost:1777
      zpages:
        endpoint: localhost:55679
    processors:
      memory_limiter:
        check_interval: 5s
        limit_mib: 8000
        spike_limit_mib: 1024
      batch:
        send_batch_size: 800
        timeout: 2s
        send_batch_max_size: 1000
      batch/meter:
        send_batch_size: 800
        timeout: 2s
        send_batch_max_size: 1000
      signozspanmetrics/delta:
        aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
        dimensions:
          - default: default
            name: service.namespace
          - default: default
            name: deployment.environment
          - name: signoz.collector.id
        dimensions_cache_size: 1000 # ⚡ reduced from 100000 to 10000
        latency_histogram_buckets:
          - 100us
          - 1ms
          - 2ms
          - 6ms
          - 10ms
          - 50ms
          - 100ms
          - 250ms
          - 500ms
          - 1000ms
          - 1400ms
          - 2000ms
          - 5s
          - 10s
          - 20s
          - 40s
          - 60s
        metrics_exporter: signozclickhousemetrics
    receivers:
      httplogreceiver/heroku:
        endpoint: 0.0.0.0:8081
        source: heroku
      httplogreceiver/json:
        endpoint: 0.0.0.0:8082
        source: json
      jaeger:
        protocols:
          grpc:
            endpoint: 0.0.0.0:14250
          thrift_http:
            endpoint: 0.0.0.0:14268
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            max_recv_msg_size_mib: 16
          http:
            endpoint: 0.0.0.0:4318
    service:
      extensions:
        - health_check
        - zpages
        - pprof
      pipelines:
        logs:
          receivers:
            - otlp
            - httplogreceiver/heroku
            - httplogreceiver/json
          processors:
            - memory_limiter # ⚡ first processor
            - batch
          exporters:
            - clickhouselogsexporter
            - metadataexporter
            - signozmeter
        metrics:
          receivers:
            - otlp
          processors:
            - memory_limiter
            - batch
          exporters:
            - metadataexporter
            - signozclickhousemetrics
            - signozmeter
        metrics/meter:
          receivers:
            - signozmeter
          processors:
            - memory_limiter
            - batch/meter
          exporters:
            - signozclickhousemeter
        traces:
          receivers:
            - otlp
          processors:
            - memory_limiter
            - signozspanmetrics/delta
            - batch
          exporters:
            - clickhousetraces
            - metadataexporter
            - signozmeter
      telemetry:
        logs:
          level: warn
          encoding: json
The ConfigMap includes:
- memory_limiter with limit_mib: 8000
- Large sending queues (queue_size: 5000 for traces)
- signozspanmetrics/delta with dimensions_cache_size: 1000
However, the container still exceeds the 13Gi limit and gets OOMKilled.
By comparison, a vanilla OTel Collector (no ClickHouse exporters, no spanmetrics) runs on 1 CPU / 2 GiB (1C2G) without issues.
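For context, the vanilla baseline referred to is roughly of this shape (a sketch under assumptions; it is not the exact configuration used, and the downstream otlp endpoint is a placeholder):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 1500
  batch: {}
exporters:
  otlp:
    endpoint: some-backend:4317 # placeholder endpoint
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]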
Suspected causes:
- Unbounded memory growth in the ClickHouse exporters
- Spanmetrics cache growth
- Queue sizes translating to GBs of buffered data (rough arithmetic below)
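Rough, order-of-magnitude arithmetic for the traces sending queue alone; the in-memory span size is an assumption, not measured:

queue_size          = 5000 queued requests (batches)
send_batch_max_size = 1000 spans per batch
assumed span size   ≈ 1-2 KB in memory
worst case          ≈ 5000 * 1000 * 1.5 KB ≈ 7.5 GB buffered if ClickHouse backs up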
Suggest investigating the memory profile via the pprof extension and reducing queue/cache sizes as a workaround; a sketch of collecting a heap profile follows.
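A sketch of grabbing a heap profile from the pprof extension configured above (endpoint localhost:1777); the pod name is a placeholder:

kubectl -n signoz port-forward pod/<signoz-otel-collector-pod> 1777:1777
go tool pprof -top http://localhost:1777/debug/pprof/heap
# or save the profile for later analysis
curl -o heap.pprof http://localhost:1777/debug/pprof/heap
go tool pprof heap.pprof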