feat(duckdb sink): add DuckDB sink #25737

Open
dannote wants to merge 1 commit into vectordotdev:master from dannote:duckdb-sink

dannote commented Jul 1, 2026

feat(duckdb sink): add DuckDB sink

Summary

This PR adds an opt-in duckdb sink for writing log events into existing DuckDB tables.

The sink is designed around DuckDB's native table model rather than treating DuckDB as an opaque file output:

  • the destination table must already exist
  • Vector reads the table schema from information_schema.columns at startup
  • supported DuckDB column types are mapped to Arrow types
  • each Vector batch is encoded as an Arrow RecordBatch
  • batches are appended with DuckDB's native appender API
  • each request batch is written inside a transaction, so a failed batch is not partially committed

This avoids implicit schema inference or table creation in Vector. That felt like the safer behavior for an embedded analytical database where users may create, migrate, and query the database directly outside of Vector.
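
To illustrate the schema-driven approach described above, here is a minimal sketch of mapping DuckDB column types to Arrow-like types and rejecting unsupported ones up front. It is not the code in this PR: `ArrowType`, `map_duckdb_type`, and `map_schema` are hypothetical names, the enum is a stand-in for the `arrow` crate's `DataType`, and the set of supported types shown is an assumption.

```rust
/// Hypothetical stand-in for an Arrow data type; a real sink would use
/// the `arrow` crate's `DataType` instead.
#[derive(Debug, Clone, PartialEq)]
pub enum ArrowType {
    Boolean,
    Int32,
    Int64,
    Float64,
    Utf8,
}

/// Maps a DuckDB type name, as reported in the `data_type` column of
/// `information_schema.columns`, to an Arrow-like type.
/// Unsupported types return an error so the caller can fail at build time.
pub fn map_duckdb_type(data_type: &str) -> Result<ArrowType, String> {
    match data_type.trim().to_ascii_uppercase().as_str() {
        "BOOLEAN" => Ok(ArrowType::Boolean),
        "INTEGER" => Ok(ArrowType::Int32),
        "BIGINT" => Ok(ArrowType::Int64),
        "DOUBLE" => Ok(ArrowType::Float64),
        "VARCHAR" => Ok(ArrowType::Utf8),
        other => Err(format!("unsupported DuckDB column type: {other}")),
    }
}

/// Maps a whole table schema given as (column name, type name) pairs,
/// stopping at the first unsupported type.
pub fn map_schema(columns: &[(&str, &str)]) -> Result<Vec<(String, ArrowType)>, String> {
    columns
        .iter()
        .map(|(name, ty)| map_duckdb_type(ty).map(|t| (name.to_string(), t)))
        .collect()
}
```

Calling `map_schema(&[("host", "VARCHAR"), ("bytes", "BIGINT")])` yields one mapped entry per column, while a schema containing a type outside the match arms returns an error instead of a partial mapping.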

The sink is behind the sinks-duckdb feature and is not included in the default sinks feature set. The duckdb crate currently uses DuckDB's bundled native library in this integration, so keeping it opt-in avoids adding that native build cost to regular Vector builds.

Vector configuration

Example configuration:

[sources.in]
type = "file"
include = ["/var/log/events.ndjson"]
read_from = "beginning"
ignore_checkpoints = true

[transforms.parse]
type = "remap"
inputs = ["in"]
source = '''
. = parse_json!(.message)
'''

[sinks.duckdb]
type = "duckdb"
inputs = ["parse"]
endpoint = "duckdb:///var/lib/vector/events.duckdb"
database = "main"
table = "events"
batch.max_events = 1000
request.concurrency = 1

Example destination table:

CREATE TABLE events (
  host VARCHAR,
  user_identifier VARCHAR,
  datetime VARCHAR,
  method VARCHAR,
  request VARCHAR,
  protocol VARCHAR,
  status VARCHAR,
  bytes BIGINT,
  referer VARCHAR,
  service VARCHAR
);

The sink also accepts plain filesystem paths as endpoints:

endpoint = "/var/lib/vector/events.duckdb"
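
A minimal sketch of how the two endpoint forms could resolve to the same database file. This is an illustration only: `parse_endpoint` is a hypothetical name, and rejecting an empty path is an assumption rather than documented behavior of the sink.

```rust
use std::path::PathBuf;

/// Resolves a sink `endpoint` value to a database file path.
/// Accepts either a `duckdb://` URL or a plain filesystem path.
pub fn parse_endpoint(endpoint: &str) -> Result<PathBuf, String> {
    // `duckdb:///var/lib/x.duckdb` becomes `/var/lib/x.duckdb`;
    // a plain path is passed through unchanged.
    let path = endpoint.strip_prefix("duckdb://").unwrap_or(endpoint);
    if path.is_empty() {
        // Assumption: an endpoint without a file path is a config error.
        return Err("endpoint must name a database file".to_string());
    }
    Ok(PathBuf::from(path))
}
```

With this sketch, `duckdb:///var/lib/vector/events.duckdb` and `/var/lib/vector/events.duckdb` both resolve to `/var/lib/vector/events.duckdb`.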

How did you test this PR?

Unit and integration tests

Added coverage for:

  • config generation and parsing
  • DuckDB endpoint parsing
  • table schema discovery
  • DuckDB-to-Arrow scalar type mapping
  • missing tables failing at build time
  • unsupported DuckDB column types failing at build time
  • healthcheck behavior
  • writing events into DuckDB
  • writing to a configured non-main schema
  • missing non-nullable fields rejecting the batch
  • supported scalar values
  • ignored stress tests for high-volume ingestion and failed-batch atomicity

Ran:

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo check --no-default-features --features sinks-duckdb

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo test --no-default-features \
  --features duckdb-integration-tests \
  sinks::duckdb:: -- --nocapture

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo clippy --no-default-features \
  --features duckdb-integration-tests \
  --tests -- -D warnings

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo fmt --check

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo vdev check events

./scripts/check_changelog_fragments.sh

git diff --check

All passed.

I also checked that enabling the regular sinks feature set does not pull in DuckDB:

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo tree --no-default-features --features sinks -e features -i duckdb

Expected result:

error: package ID specification `duckdb` did not match any packages

Stress testing

The PR includes ignored stress tests that can be run manually. For example:

DUCKDB_STRESS_EVENTS=1000000 \
DUCKDB_STRESS_BATCH_EVENTS=1000 \
DUCKDB_STRESS_REQUEST_CONCURRENCY=2 \
PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo test --release --no-default-features \
  --features duckdb-integration-tests \
  stress_million_events -- --ignored --nocapture

Those tests validate row counts, min/max/distinct IDs, and aggregate sums after ingestion. They also sample DuckDB database/WAL file sizes and can optionally run a concurrent reader during writes.

A separate ignored stress test verifies failed-batch atomicity by writing one valid batch followed by one invalid batch and asserting that only the valid batch is committed.
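
The property that test checks can be modeled without DuckDB. The sketch below is an in-memory stand-in, not the sink's code: `Table`, `Row`, `append_batch`, and `row` are hypothetical names, and validating before committing replaces the real transaction and rollback that DuckDB provides.

```rust
use std::collections::HashMap;

/// One event: field name -> optional value (`None` models a missing field).
pub type Row = HashMap<String, Option<String>>;

/// In-memory stand-in for a table with a set of NOT NULL columns.
pub struct Table {
    pub not_null: Vec<String>,
    pub rows: Vec<Row>,
}

impl Table {
    /// Appends a batch with all-or-nothing semantics.
    pub fn append_batch(&mut self, batch: Vec<Row>) -> Result<usize, String> {
        // Validate every row before touching committed state, so an
        // invalid row anywhere in the batch rejects the whole batch.
        for (i, row) in batch.iter().enumerate() {
            for col in &self.not_null {
                if !matches!(row.get(col), Some(Some(_))) {
                    return Err(format!("row {i}: missing non-nullable column `{col}`"));
                }
            }
        }
        // Commit: only reached when the entire batch is valid.
        let written = batch.len();
        self.rows.extend(batch);
        Ok(written)
    }
}

/// Convenience constructor for a row from (field, value) pairs.
pub fn row(pairs: &[(&str, Option<&str>)]) -> Row {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.map(str::to_string)))
        .collect()
}
```

Writing a valid two-row batch and then a batch whose second row lacks a required field leaves exactly two rows in `rows`, which mirrors the assertion the stress test makes against the real database.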

Real pipeline testing

I also tested a realistic file ingestion pipeline using a generated 1M-line NDJSON file, about 213 MiB:

file source -> remap parse_json!(.message) -> duckdb sink

With:

batch.max_events = 1000
request.concurrency = 2

The DuckDB pipeline ran at effectively the same throughput as the identical file/remap pipeline writing to blackhole:

file -> remap -> duckdb:   ~193k events/s
file -> remap -> blackhole: ~195k events/s

That suggests the sink overhead is small for this workload; file reading and JSON parsing dominate the end-to-end pipeline.
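
For a sense of scale, the two figures above can be turned into a relative overhead and a wall-clock difference. The helper names below are hypothetical, and the inputs are the approximate rates quoted above, so the results are rough estimates.

```rust
/// Relative throughput drop of `measured_eps` versus `baseline_eps`,
/// expressed as a fraction of the baseline.
pub fn relative_overhead(baseline_eps: f64, measured_eps: f64) -> f64 {
    (baseline_eps - measured_eps) / baseline_eps
}

/// Wall-clock seconds needed to process `events` at `eps` events per second.
pub fn seconds_for(events: f64, eps: f64) -> f64 {
    events / eps
}
```

With a baseline of 195,000 events/s and a measured 193,000 events/s, `relative_overhead` gives roughly 1%, and `seconds_for` puts the difference over 1,000,000 events at about 0.05 s.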

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

Changelog fragment added:

changelog.d/duckdb_sink.feature.md

References

Notes

Users should create the destination table before starting Vector. A good initial tuning point is:

batch.max_events = 1000
request.concurrency = 1

request.concurrency = 2 may help when RecordBatch encoding is a significant part of the workload, while 1 is the safer default starting point for local DuckDB writes.

@dannote dannote requested review from a team as code owners July 1, 2026 20:15

github-actions Bot commented Jul 1, 2026

Contributor

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

github-actions Bot added the "docs review on hold", "domain: sinks", and "domain: external docs" labels and removed the "docs review on hold" label, Jul 1, 2026

dannote commented Jul 1, 2026

Author

I have read the CLA Document and I hereby sign the CLA

@domalessi domalessi left a comment

Contributor

Editorial review (docs team). Focused on the new .cue description strings and the changelog fragment, per Vector review guidance — no .md docs pages in this diff. Overall the new copy is clear and concrete. One minor consistency nit on heading case; one optional wording suggestion on the changelog entry. Non-blocking.

}

database: {
title: "Database/schema"

Nit: heading case is inconsistent within this how_it_works section. "Table Schema" above uses Title Case, but this and the next two headings use sentence case. Most existing sink files use Title Case for these (e.g. ClickHouse's "Data Formats," "Arrow Type Mappings"). Suggest:

Suggested change
- title: "Database/schema"
+ title: "Database/Schema"

}

type_mappings: {
title: "Type mappings"

Same case-consistency nit as above.

Suggested change
- title: "Type mappings"
+ title: "Type Mappings"

}

batching: {
title: "Batching and transactions"

Same case-consistency nit as above.

Suggested change
- title: "Batching and transactions"
+ title: "Batching and Transactions"

Comment thread changelog.d/duckdb_sink.feature.md Outdated
@@ -0,0 +1,3 @@
Add a DuckDB sink for writing log events into existing DuckDB tables using schema-aware Arrow RecordBatch appends.

Optional: "schema-aware Arrow RecordBatch appends" is fairly implementation-heavy for a changelog entry aimed at users. Consider simplifying, e.g.:

Suggested change
- Add a DuckDB sink for writing log events into existing DuckDB tables using schema-aware Arrow RecordBatch appends.
+ Add a DuckDB sink for writing log events into existing DuckDB tables.

github-actions Bot added the "docs review on hold" label, Jul 2, 2026

dannote commented Jul 2, 2026

Author

Oops, accidentally force-pushed while applying the docs suggestions.

Labels

docs review on hold The documentation team reviews PRs only after a PR is approved by the COSE team. domain: external docs Anything related to Vector's external, public documentation domain: sinks Anything related to the Vector's sinks
