dxf: classify data errors in finished task metric by D3Hunter · Pull Request #69507 · pingcap/tidb

D3Hunter · 2026-06-29T02:08:09Z

What problem does this PR solve?

Issue Number: ref #61702

Problem Summary:

The DXF finished task metric currently classifies reverted tasks with user data errors as `failed`, which can trigger failure alerts even when the underlying cause is bad input data. Encode/value conversion errors from IMPORT INTO (Lightning) and duplicate-key errors from add index should be separated from infrastructure or framework failures.

What changed and how does it work?

Extract the metric label selection from onTaskFinished into getMetricState.

Add a new `data-error` metric state for reverted tasks whose error text matches known user-data failures:

- Lightning encode/value conversion errors containing `ErrEncodeKV` and `Truncated incorrect`.
- Add-unique-index duplicate-key errors containing `[kv:1062]` and `Duplicate entry`.

The matching is intentionally string-based so the DXF scheduler does not depend on Lightning error definitions. A code comment records the real Lightning error example and the add-index duplicate-entry shape for future refactoring when the error definitions can be split out.
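
The classification described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function names `getMetricState` and `isDataErrorForMetric` come from the review walkthrough, but their signatures, the constant names, the `cancelledByUser` flag, and the reading that both substrings of a pattern must be present are assumptions. The real matcher may use different or additional patterns.

```go
package main

import (
	"fmt"
	"strings"
)

// Label values for the finished-task metric. The values appear in the PR;
// the constant names are hypothetical.
const (
	metricStateFailed    = "failed"
	metricStateCancelled = "cancelled"
	metricStateDataError = "data-error"
)

// dataErrorPatterns holds groups of substrings. An error message counts as a
// data error when every substring of at least one group is present.
// Assumption: the groups mirror the two cases listed in the PR description.
var dataErrorPatterns = [][]string{
	{"ErrEncodeKV", "Truncated incorrect"},
	{"[kv:1062]", "Duplicate entry"},
}

// isDataErrorForMetric reports whether errMsg matches a known user-data
// failure, using plain substring checks so no Lightning package is imported.
func isDataErrorForMetric(errMsg string) bool {
	for _, group := range dataErrorPatterns {
		matched := true
		for _, sub := range group {
			if !strings.Contains(errMsg, sub) {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

// getMetricState picks the label for a reverted task. cancelledByUser is a
// stand-in for however the scheduler detects user cancellation.
func getMetricState(errMsg string, cancelledByUser bool) string {
	switch {
	case cancelledByUser:
		return metricStateCancelled
	case isDataErrorForMetric(errMsg):
		return metricStateDataError
	default:
		return metricStateFailed
	}
}

func main() {
	fmt.Println(getMetricState("[kv:1062]Duplicate entry '1' for key 't.id'", false))
	fmt.Println(getMetricState("context deadline exceeded", false))
}
```

Keeping the patterns in one table makes it easy to replace the substring checks with typed error checks later, if the error definitions are split out as the description anticipates.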

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)

add index

create table t(id int); insert into t values(1),(1);
alter table t add unique index(id);

...

tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="all"} 1
tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="data-error"} 1

import into

create table t(id datetime);
import into t from 's3://mybucket/string.csv?access-key=minioadmin&secret-access-key=minioadmin&endpoint=http%3a%2f%2f0.0.0.0%3a9000';

ERROR 1105 (HY000): when encoding 1-th data row in this chunk: encode kv error in file a.csv:0 at offset 0: Value conversion failed for column 'id'. Expected type: datetime BINARY, received value: "123123123". Reason: [types:1292]Incorrect datetime value: '123123123'.

tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="all"} 1
tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="data-error"} 1

create table t(id int);

ERROR 1105 (HY000): when encoding 1-th data row in this chunk: encode kv error in file string.csv:0 at offset 0: Value conversion failed for column 'id'. Expected type: int, received value: "aaa". Reason: [types:1292]Truncated incorrect DOUBLE value: 'aaa'.

tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="all"} 1
tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="data-error"} 1

No need to test
- I checked and no code files have been changed.

Unit and local validation commands:

./tools/check/failpoint-go-test.sh pkg/dxf/framework/scheduler -run TestOnTaskFinished -count=1
make lint
git diff --check

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

The affected behavior is the label value emitted by the DXF finished task metric for the matched user-data error cases.

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

DXF finished task metrics now classify known user-data errors as `data-error` instead of `failed`.

Summary by CodeRabbit

Bug Fixes
- Improved task completion metrics by computing a more accurate outcome for reverted tasks.
- User-cancelled tasks are now reported as cancelled rather than failed.
- Specific data import and duplicate-entry failures are now tracked separately as data-error.
- Total finished-task counts still include all completed tasks, with updated breakdowns across all, failed, cancelled, and data-error outcomes.
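
The relationship between the `all` series and the per-outcome series can be illustrated with a small stand-in counter. This sketch assumes that every finished task is counted once under `all` and once under its specific state, which matches the manual test output in the description; the `finishedTasks` type and `record` method are hypothetical and do not use the real metrics library.

```go
package main

import "fmt"

// finishedTasks stands in for the labelled counter; keys are values of the
// `state` label.
type finishedTasks map[string]int

// record counts one finished task under "all" and under its specific state.
func (c finishedTasks) record(state string) {
	c["all"]++
	c[state]++
}

func main() {
	c := finishedTasks{}
	c.record("data-error")
	c.record("failed")
	c.record("cancelled")
	fmt.Println(c["all"], c["data-error"], c["failed"], c["cancelled"])
}
```

Under this assumption, a dashboard that wants the pre-change notion of failure would add the `failed` and `data-error` series together.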

ti-chi-bot · 2026-06-29T02:08:13Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-29T02:08:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 865cd1c8-07f2-4365-a477-83525dd4cc18

📥 Commits

Reviewing files that changed from the base of the PR and between a35843e and fe5e663.

📒 Files selected for processing (1)

pkg/dxf/framework/scheduler/scheduler_nokit_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/dxf/framework/scheduler/scheduler_nokit_test.go

📝 Walkthrough

The scheduler now classifies finished-task metrics with getMetricState and isDataErrorForMetric, adds "all", "cancelled", and "data-error" labels, and updates tests for reverted tasks with conversion and duplicate-entry errors.

Task-finished metric classification

Layer / File(s)	Summary
Metric constants, classification helpers, and tests `pkg/dxf/framework/scheduler/scheduler.go`, `pkg/dxf/framework/scheduler/scheduler_nokit_test.go`	Adds `"all"`, `"cancelled"`, and `"data-error"` metric states; routes finished-task counting through `getMetricState` with string-based data-error detection; extends `TestOnTaskFinished` with reverted-path cases and updated counter expectations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I hopped through metrics, neat and bright,
With data-error labels tucked in right.
Cancelled, failed, and all in view,
The counters now know what to do.
A tidy bounce for code review!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly states the main change: classifying data errors in DXF finished task metrics.
Description check	✅ Passed	The description matches the template and includes issue reference, problem summary, changes, tests, side effects, documentation, and release note.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

D3Hunter · 2026-06-29T08:57:36Z

/cherry-pick release-nextgen-20251011
/cherry-pick release-nextgen-202603

ti-chi-bot · 2026-06-29T08:57:39Z

@D3Hunter: once the present PR merges, I will cherry-pick it on top of release-nextgen-20251011/release-nextgen-202603 in the new PR and assign it to you.

codecov · 2026-06-29T09:17:33Z

Codecov Report

❌ Patch coverage is 0% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.4279%. Comparing base (aa35069) to head (fe5e663).
⚠️ Report is 9 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #69507        +/-   ##
================================================
- Coverage   76.3257%   74.4279%   -1.8979%     
================================================
  Files          2041       2045         +4     
  Lines        561045     574870     +13825     
================================================
- Hits         428222     427864       -358     
- Misses       131922     146725     +14803     
+ Partials        901        281       -620

Flag	Coverage Δ
integration	`40.8918% <0.0000%> (+1.2637%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`60.4471% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`47.6078% <ø> (-15.1433%)`	⬇️

ingress-bot · 2026-06-29T09:26:36Z

🔍 Starting code review for this PR...

D3Hunter · 2026-06-29T09:40:03Z

/retest

ingress-bot

This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.

Summary

Total findings: 4
Inline comments: 4
Summary-only findings (no inline anchor): 0

Findings (highest risk first)

🟡 [Minor] (2)

- New "data-error" metric label value breaks the single-word naming convention of the counter's existing labels (pkg/dxf/framework/scheduler/scheduler.go:54)
- Reverted data-error tasks leave the failed metric series, silently lowering existing failure dashboards/alerts (pkg/dxf/framework/scheduler/scheduler.go:786, pkg/dxf/framework/dxfmetric/metric.go:70)

🧹 [Nit] (2)

- Test assertions use raw string literals for metric label constants defined in the same package (pkg/dxf/framework/scheduler/scheduler.go:52, pkg/dxf/framework/scheduler/scheduler_nokit_test.go:765)
- Implicit future-work intent in isDataErrorForMetric comment lacks a tracking reference (pkg/dxf/framework/scheduler/scheduler.go:800)

D3Hunter · 2026-06-29T10:47:30Z

/retest

D3Hunter · 2026-06-29T15:15:36Z

/retest

ti-chi-bot · 2026-06-30T02:14:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joechenrh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~pkg/dxf/OWNERS~~ [joechenrh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-06-30T02:14:03Z

[LGTM Timeline notifier]

Timeline:

2026-06-30 02:14:03.402457447 +0000 UTC m=+91985.102836880: ☑️ agreed by joechenrh.

D3Hunter · 2026-07-01T01:57:28Z

/hold

dxf: classify data errors in finished task metric

3e41958

ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 29, 2026

ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 29, 2026

dxf: classify datetime data errors in metrics

a35843e

D3Hunter marked this pull request as ready for review June 29, 2026 08:59

ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 29, 2026

ingress-bot reviewed Jun 29, 2026

Comment thread pkg/dxf/framework/scheduler/scheduler.go

Comment thread pkg/dxf/framework/scheduler/scheduler.go

Comment thread pkg/dxf/framework/scheduler/scheduler_nokit_test.go Outdated

Comment thread pkg/dxf/framework/scheduler/scheduler.go

dxf: use metric constants in scheduler test

fe5e663

ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 29, 2026

joechenrh approved these changes Jun 30, 2026

ti-chi-bot Bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 30, 2026

ti-chi-bot Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 1, 2026

4 participants
