feat(duckdb sink): add DuckDB sink #25737
All contributors have signed the CLA ✍️ ✅
I have read the CLA Document and I hereby sign the CLA
domalessi
left a comment
Editorial review (docs team). Focused on the new .cue description strings and the changelog fragment, per Vector review guidance — no .md docs pages in this diff. Overall the new copy is clear and concrete. One minor consistency nit on heading case; one optional wording suggestion on the changelog entry. Non-blocking.
}

database: {
    title: "Database/schema"
Nit: heading case is inconsistent within this how_it_works section. "Table Schema" above uses Title Case, but this and the next two headings use sentence case. Most existing sink files use Title Case for these (e.g. ClickHouse's "Data Formats," "Arrow Type Mappings"). Suggest:
- title: "Database/schema"
+ title: "Database/Schema"
}

type_mappings: {
    title: "Type mappings"
Same case-consistency nit as above.
- title: "Type mappings"
+ title: "Type Mappings"
}

batching: {
    title: "Batching and transactions"
Same case-consistency nit as above.
- title: "Batching and transactions"
+ title: "Batching and Transactions"
@@ -0,0 +1,3 @@
Add a DuckDB sink for writing log events into existing DuckDB tables using schema-aware Arrow RecordBatch appends.
Optional: "schema-aware Arrow RecordBatch appends" is fairly implementation-heavy for a changelog entry aimed at users. Consider simplifying, e.g.:
- Add a DuckDB sink for writing log events into existing DuckDB tables using schema-aware Arrow RecordBatch appends.
+ Add a DuckDB sink for writing log events into existing DuckDB tables.
Oops, accidentally force-pushed while applying the docs suggestions.
feat(duckdb sink): add DuckDB sink
Summary
This PR adds an opt-in duckdb sink for writing log events into existing DuckDB tables.
The sink is designed around DuckDB's native table model rather than treating DuckDB as an opaque file output:
- … information_schema.columns at startup
- … RecordBatch
This avoids implicit schema inference or table creation in Vector. That felt like the safer behavior for an embedded analytical database where users may create, migrate, and query the database directly outside of Vector.
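As a rough illustration of the "existing tables only" behavior described above, the sketch below takes rows shaped like the result of a query against information_schema.columns and refuses to proceed when the table has no columns. This is not the sink's code: the query text, the function names, and the policy of filling missing fields with None are all assumptions made for this example.

```python
# Hypothetical sketch; none of these names come from the PR.
SCHEMA_QUERY = """
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = ? AND table_name = ?
ORDER BY ordinal_position
"""


def require_schema(rows, table):
    """Turn (column_name, data_type) rows into an ordered schema, or fail."""
    if not rows:
        # No implicit table creation: a missing table is a startup error.
        raise LookupError(f"table {table!r} not found; create it before starting")
    return [(name, dtype) for name, dtype in rows]


def project_event(event, schema):
    """Order an event's fields by the table's columns.

    Assumed policy: missing fields become None, and fields without a
    matching column are dropped.
    """
    return tuple(event.get(name) for name, _ in schema)


# Rows as they might come back from SCHEMA_QUERY for a two-column table.
schema = require_schema([("id", "BIGINT"), ("message", "VARCHAR")], "logs")
row = project_event({"message": "hello", "id": 1, "extra": True}, schema)
```

Failing at startup, rather than creating or altering the table, keeps the user in control of the schema, which matches the reasoning given above for a database that is also managed outside of Vector.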
The sink is behind the sinks-duckdb feature and is not included in the default sinks feature set. The duckdb crate currently uses DuckDB's bundled native library in this integration, so keeping it opt-in avoids adding that native build cost to regular Vector builds.
Vector configuration
Example configuration:
Example destination table:
The sink also accepts plain filesystem paths as endpoints:
How did you test this PR?
Unit and integration tests
Added coverage for:
- … main schema
Ran:
All passed.
I also checked that enabling the regular sinks feature set does not pull in DuckDB:
    PROTOC=/tmp/protoc-vector/protoc mise exec rust@1.95 -- \
      cargo tree --no-default-features --features sinks -e features -i duckdb
Expected result:
Stress testing
The PR includes ignored stress tests that can be run manually. For example:
Those tests validate row counts, min/max/distinct IDs, and aggregate sums after ingestion. They also sample DuckDB database/WAL file sizes and can optionally run a concurrent reader during writes.
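The post-ingestion checks described above come down to a handful of aggregates compared against values known from the generator. A minimal sketch, using Python's built-in sqlite3 module as a stand-in so it runs without DuckDB installed; the table name, column names, and generated data are assumptions for illustration, not the PR's test code:

```python
import sqlite3

N = 1000  # hypothetical number of generated events

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    ((i, i * 2) for i in range(N)),
)
conn.commit()

# One pass over the table yields every figure the checks compare.
count, min_id, max_id, distinct_ids, total = conn.execute(
    "SELECT count(*), min(id), max(id), count(DISTINCT id), sum(value) FROM events"
).fetchone()

# The generator is deterministic, so the expected values are known up front.
expected_total = 2 * sum(range(N))
assert (count, min_id, max_id, distinct_ids) == (N, 0, N - 1, N)
assert total == expected_total
```

Checking distinct IDs alongside the row count catches duplicated batches, and the aggregate sum catches rows whose values were altered in transit.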
A separate ignored stress test verifies failed-batch atomicity by writing one valid batch followed by one invalid batch and asserting that only the valid batch is committed.
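The atomicity property being tested can be sketched as follows. This again uses Python's sqlite3 module as a stand-in rather than DuckDB, and the table, the NOT NULL constraint used to make a batch invalid, and the write_batch helper are hypothetical; the PR's actual test may trigger the failure differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER NOT NULL, message TEXT NOT NULL)")


def write_batch(conn, rows):
    """Write rows in one transaction; return True if the batch committed."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


valid = [(1, "a"), (2, "b")]
invalid = [(3, "c"), (4, None)]  # second row violates NOT NULL

ok_first = write_batch(conn, valid)
ok_second = write_batch(conn, invalid)

# Row 3 was inserted before the failure but must not survive the rollback.
ids = [r[0] for r in conn.execute("SELECT id FROM events ORDER BY id")]
```

The point of the check is the partially valid batch: the row with id 3 is fine on its own, yet it must disappear along with the row that failed.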
Real pipeline testing
I also tested a realistic file ingestion pipeline using a generated 1M-line NDJSON file, about 213 MiB:
With:
The DuckDB pipeline ran at effectively the same throughput as an otherwise identical file/remap pipeline writing to blackhole:
That suggests the sink overhead is small for this workload; file reading and JSON parsing dominate the end-to-end pipeline.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
… no-changelog label to this PR.
Changelog fragment added:
References
Notes
Users should create the destination table before starting Vector. A good initial tuning point is:
request.concurrency = 2 may help when RecordBatch encoding is a significant part of the workload, while 1 is the safer default starting point for local DuckDB writes.