feat(duckdb sink): add DuckDB sink by dannote · Pull Request #25737 · vectordotdev/vector

dannote · 2026-07-01T20:15:16Z

feat(duckdb sink): add DuckDB sink

Summary

This PR adds an opt-in duckdb sink for writing log events into existing DuckDB tables.

The sink is designed around DuckDB's native table model rather than treating DuckDB as an opaque file output:

the destination table must already exist
Vector reads the table schema from information_schema.columns at startup
supported DuckDB column types are mapped to Arrow types
each Vector batch is encoded as an Arrow RecordBatch
batches are appended with DuckDB's native appender API
each request batch is written inside a transaction, so a failed batch is not partially committed

This avoids implicit schema inference or table creation in Vector. That felt like the safer behavior for an embedded analytical database where users may create, migrate, and query the database directly outside of Vector.
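
The schema-discovery and type-mapping steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SCHEMA_QUERY`, `ArrowType`, and `map_duckdb_type` are hypothetical names, the enum is a stand-in for real Arrow data types, and the five types listed are an assumed subset, since the description does not enumerate which DuckDB types the sink supports.

```rust
/// Query a sink could run at startup to discover the destination table's columns.
/// Assumption: the PR description does not show its exact SQL; these are the
/// standard information_schema.columns fields.
pub const SCHEMA_QUERY: &str = "SELECT column_name, data_type, is_nullable \
     FROM information_schema.columns \
     WHERE table_schema = ? AND table_name = ? \
     ORDER BY ordinal_position";

/// Hypothetical stand-in for the Arrow data types a sink might target.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArrowType {
    Boolean,
    Int32,
    Int64,
    Float64,
    Utf8,
}

/// Map a DuckDB type name (as reported in `data_type`) to an Arrow type.
/// Returns `None` for anything unlisted so the caller can fail at build time
/// instead of guessing a conversion.
pub fn map_duckdb_type(data_type: &str) -> Option<ArrowType> {
    match data_type.trim().to_ascii_uppercase().as_str() {
        "BOOLEAN" => Some(ArrowType::Boolean),
        "INTEGER" => Some(ArrowType::Int32),
        "BIGINT" => Some(ArrowType::Int64),
        "DOUBLE" => Some(ArrowType::Float64),
        "VARCHAR" => Some(ArrowType::Utf8),
        _ => None,
    }
}
```

Returning `None` for unknown types mirrors the behavior described in this PR, where unsupported column types fail at build time rather than being silently coerced.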

The sink is behind the sinks-duckdb feature and is not included in the default sinks feature set. The duckdb crate currently uses DuckDB's bundled native library in this integration, so keeping it opt-in avoids adding that native build cost to regular Vector builds.

Vector configuration

Example configuration:

[sources.in]
type = "file"
include = ["/var/log/events.ndjson"]
read_from = "beginning"
ignore_checkpoints = true

[transforms.parse]
type = "remap"
inputs = ["in"]
source = '''
. = parse_json!(.message)
'''

[sinks.duckdb]
type = "duckdb"
inputs = ["parse"]
endpoint = "duckdb:///var/lib/vector/events.duckdb"
database = "main"
table = "events"
batch.max_events = 1000
request.concurrency = 1

Example destination table:

CREATE TABLE events (
  host VARCHAR,
  user_identifier VARCHAR,
  datetime VARCHAR,
  method VARCHAR,
  request VARCHAR,
  protocol VARCHAR,
  status VARCHAR,
  bytes BIGINT,
  referer VARCHAR,
  service VARCHAR
);

The sink also accepts plain filesystem paths as endpoints:

endpoint = "/var/lib/vector/events.duckdb"
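
For illustration, accepting both endpoint forms could look like the sketch below. `parse_endpoint` is a hypothetical helper, not the parser added in this PR, and it assumes only the two forms shown above (a `duckdb://` URL and a plain path) need to be handled.

```rust
use std::path::PathBuf;

/// Resolve a sink `endpoint` string to a database file path.
/// Assumption: only the two forms shown in the PR description are handled.
pub fn parse_endpoint(endpoint: &str) -> Result<PathBuf, String> {
    // `duckdb:///abs/path` -> strip the scheme and keep the path part.
    // Anything without that prefix is treated as a plain filesystem path.
    let path = endpoint.strip_prefix("duckdb://").unwrap_or(endpoint);
    if path.is_empty() {
        return Err(format!("endpoint {endpoint:?} does not name a database file"));
    }
    // Reject other URL schemes instead of treating them as file names.
    if path.contains("://") {
        return Err(format!("unsupported endpoint scheme in {endpoint:?}"));
    }
    Ok(PathBuf::from(path))
}
```

With this sketch, `duckdb:///var/lib/vector/events.duckdb` and `/var/lib/vector/events.duckdb` resolve to the same path.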

How did you test this PR?

Unit and integration tests

Added coverage for:

config generation and parsing
DuckDB endpoint parsing
table schema discovery
DuckDB-to-Arrow scalar type mapping
missing tables failing at build time
unsupported DuckDB column types failing at build time
healthcheck behavior
writing events into DuckDB
writing to a configured non-main schema
missing non-nullable fields rejecting the batch
supported scalar values
ignored stress tests for high-volume ingestion and failed-batch atomicity

Ran:

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo check --no-default-features --features sinks-duckdb

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo test --no-default-features \
  --features duckdb-integration-tests \
  sinks::duckdb:: -- --nocapture

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo clippy --no-default-features \
  --features duckdb-integration-tests \
  --tests -- -D warnings

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo fmt --check

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo vdev check events

./scripts/check_changelog_fragments.sh

git diff --check

All passed.

I also checked that enabling the regular sinks feature set does not pull in DuckDB:

PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo tree --no-default-features --features sinks -e features -i duckdb

Expected result:

error: package ID specification `duckdb` did not match any packages

Stress testing

The PR includes ignored stress tests that can be run manually. For example:

DUCKDB_STRESS_EVENTS=1000000 \
DUCKDB_STRESS_BATCH_EVENTS=1000 \
DUCKDB_STRESS_REQUEST_CONCURRENCY=2 \
PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
  cargo test --release --no-default-features \
  --features duckdb-integration-tests \
  stress_million_events -- --ignored --nocapture

Those tests validate row counts, min/max/distinct IDs, and aggregate sums after ingestion. They also sample DuckDB database/WAL file sizes and can optionally run a concurrent reader during writes.

A separate ignored stress test verifies failed-batch atomicity by writing one valid batch followed by one invalid batch and asserting that only the valid batch is committed.
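
The property that test checks can be illustrated without DuckDB at all. The sketch below swaps in an in-memory `Table` (a hypothetical stand-in with a single NOT NULL BIGINT column) for the real appender and transaction, and shows only the all-or-nothing shape: stage every row, commit only if the whole batch is valid.

```rust
/// In-memory stand-in for a table with one NOT NULL BIGINT column.
/// This is not DuckDB; it only models batch-level atomicity.
#[derive(Debug, Default)]
pub struct Table {
    pub rows: Vec<i64>,
}

impl Table {
    /// Append a batch all-or-nothing: rows are staged first and only
    /// committed if every row passes the NOT NULL check.
    pub fn append_batch(&mut self, batch: &[Option<i64>]) -> Result<usize, String> {
        let mut staged = Vec::with_capacity(batch.len());
        for (index, value) in batch.iter().enumerate() {
            match value {
                Some(v) => staged.push(*v),
                // Returning here drops `staged`, which plays the role of a rollback.
                None => return Err(format!("row {index}: missing non-nullable field")),
            }
        }
        let written = staged.len();
        // "Commit": the whole batch becomes visible at once.
        self.rows.extend(staged);
        Ok(written)
    }
}
```

Writing `[Some(1), Some(2)]` and then `[Some(3), None, Some(5)]` leaves only the first batch in `rows`, which matches the assertion described for the stress test.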

Real pipeline testing

I also tested a realistic file ingestion pipeline using a generated 1M-line NDJSON file, about 213 MiB:

file source -> remap parse_json!(.message) -> duckdb sink

With:

batch.max_events = 1000
request.concurrency = 2

The DuckDB pipeline ran within about 1% of the throughput of the same file/remap pipeline writing to blackhole:

file -> remap -> duckdb:   ~193k events/s
file -> remap -> blackhole: ~195k events/s

That suggests the sink overhead is small for this workload; file reading and JSON parsing dominate the end-to-end pipeline.
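
As a quick check of that reading, the relative slowdown can be computed from the two rounded figures above. `relative_slowdown` is a hypothetical helper, and because the inputs are approximate the result is only indicative.

```rust
/// Fraction by which `measured` throughput falls short of `baseline`.
/// Both values must use the same unit (here: events per second).
pub fn relative_slowdown(baseline: f64, measured: f64) -> f64 {
    (baseline - measured) / baseline
}
```

Calling `relative_slowdown(195_000.0, 193_000.0)` gives roughly 0.0103, so the DuckDB run was about 1% slower than the blackhole baseline in this measurement.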

Change Type

Is this a breaking change?

Yes
No

Does this PR include user facing changes?

Yes. Please add a changelog fragment based on our guidelines.
No. A maintainer will apply the no-changelog label to this PR.

Changelog fragment added:

changelog.d/duckdb_sink.feature.md

References

DuckDB Appender docs: https://duckdb.org/docs/current/data/appender
Related Arrow encoding optimization: perf(arrow codec): optimize scalar RecordBatch encoding #25734

Notes

Users should create the destination table before starting Vector. A good initial tuning point is:

batch.max_events = 1000
request.concurrency = 1

request.concurrency = 2 may help when RecordBatch encoding is a significant part of the workload, while 1 is the safer default starting point for local DuckDB writes.

github-actions · 2026-07-01T20:15:26Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

dannote · 2026-07-01T20:19:55Z

I have read the CLA Document and I hereby sign the CLA

domalessi

Editorial review (docs team). Focused on the new .cue description strings and the changelog fragment, per Vector review guidance — no .md docs pages in this diff. Overall the new copy is clear and concrete. One minor consistency nit on heading case; one optional wording suggestion on the changelog entry. Non-blocking.

domalessi · 2026-07-01T20:33:45Z

+		}
+
+		database: {
+			title: "Database/schema"


Nit: heading case is inconsistent within this how_it_works section. "Table Schema" above uses Title Case, but this and the next two headings use sentence case. Most existing sink files use Title Case for these (e.g. ClickHouse's "Data Formats," "Arrow Type Mappings"). Suggest:

Suggested change

-			title: "Database/schema"
+			title: "Database/Schema"

domalessi · 2026-07-01T20:33:45Z

+		}
+
+		type_mappings: {
+			title: "Type mappings"


Same case-consistency nit as above.

Suggested change

-			title: "Type mappings"
+			title: "Type Mappings"

domalessi · 2026-07-01T20:33:45Z

+		}
+
+		batching: {
+			title: "Batching and transactions"


Same case-consistency nit as above.

Suggested change

-			title: "Batching and transactions"
+			title: "Batching and Transactions"

domalessi · 2026-07-01T20:33:45Z

@@ -0,0 +1,3 @@
+Add a DuckDB sink for writing log events into existing DuckDB tables using schema-aware Arrow RecordBatch appends.


Optional: "schema-aware Arrow RecordBatch appends" is fairly implementation-heavy for a changelog entry aimed at users. Consider simplifying, e.g.:

Suggested change

-Add a DuckDB sink for writing log events into existing DuckDB tables using schema-aware Arrow RecordBatch appends.
+Add a DuckDB sink for writing log events into existing DuckDB tables.

dannote · 2026-07-02T04:29:19Z

Oops, accidentally force-pushed while applying the docs suggestions.

dannote requested review from a team as code owners July 1, 2026 20:15

domalessi reviewed Jul 1, 2026

feat(duckdb sink): add DuckDB sink

e291e60

dannote force-pushed the duckdb-sink branch from 67e4a1c to e291e60 on July 2, 2026 04:25

github-actions Bot added the "docs review on hold" label (The documentation team reviews PRs only after a PR is approved by the COSE team.) Jul 2, 2026

dannote wants to merge 1 commit into vectordotdev:master from dannote:duckdb-sink
