Skip to content

dxf: classify data errors in finished task metric#69507

Open
D3Hunter wants to merge 3 commits into
pingcap:masterfrom
D3Hunter:codex/dxf-metric-data-error
Open

dxf: classify data errors in finished task metric#69507
D3Hunter wants to merge 3 commits into
pingcap:masterfrom
D3Hunter:codex/dxf-metric-data-error

Conversation

@D3Hunter

@D3Hunter D3Hunter commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: ref #61702

Problem Summary:

The DXF finished task metric currently classifies reverted tasks with user data errors as failed, which can trigger failure alerts even when the underlying cause is bad input data. Import Into Lightning encode/value conversion errors and add-index duplicate-key errors should be separated from infrastructure or framework failures.

What changed and how does it work?

Extract the metric label selection from onTaskFinished into getMetricState.

Add a new data-error metric state for reverted tasks whose error text matches known user-data failures:

  • Lightning encode/value conversion errors containing ErrEncodeKV and Truncated incorrect.
  • Add unique index duplicate-key errors containing [kv:1062] and Duplicate entry.

The matching is intentionally string-based so the DXF scheduler does not depend on Lightning error definitions. A code comment records the real Lightning error example and the add-index duplicate-entry shape for future refactoring when the error definitions can be split out.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

add index

create table t(id int); insert into t values(1),(1);
alter table t add unique index(id);

...

tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="all"} 1
tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="data-error"} 1

import into

create table t(id datetime);
import into t from 's3://mybucket/string.csv?access-key=minioadmin&secret-access-key=minioadmin&endpoint=http%3a%2f%2f0.0.0.0%3a9000';

ERROR 1105 (HY000): when encoding 1-th data row in this chunk: encode kv error in file a.csv:0 at offset 0: Value conversion failed for column 'id'. Expected type: datetime BINARY, received value: "123123123". Reason: [types:1292]Incorrect datetime value: '123123123'.

tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="all"} 1
tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="data-error"} 1

create table t(id int);

ERROR 1105 (HY000): when encoding 1-th data row in this chunk: encode kv error in file string.csv:0 at offset 0: Value conversion failed for column 'id'. Expected type: int, received value: "aaa". Reason: [types:1292]Truncated incorrect DOUBLE value: 'aaa'.

tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="all"} 1
tidb_dxf_finished_task_total{keyspace_name="SYSTEM",state="data-error"} 1
  • No need to test
    • I checked and no code files have been changed.

Unit and local validation commands:

./tools/check/failpoint-go-test.sh pkg/dxf/framework/scheduler -run TestOnTaskFinished -count=1
make lint
git diff --check

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

The affected behavior is the label value emitted by the DXF finished task metric for the matched user-data error cases.

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

DXF finished task metrics now classify known user-data errors as `data-error` instead of `failed`.

Summary by CodeRabbit

  • Bug Fixes
    • Improved task completion metrics by computing a more accurate outcome for reverted tasks.
    • User-cancelled tasks are now reported as cancelled rather than failed.
    • Specific data import and duplicate-entry failures are now tracked separately as data-error.
    • Total finished-task counts still include all completed tasks, with updated breakdowns across all, failed, cancelled, and data-error outcomes.

@ti-chi-bot

ti-chi-bot Bot commented Jun 29, 2026

Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 29, 2026
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 865cd1c8-07f2-4365-a477-83525dd4cc18

📥 Commits

Reviewing files that changed from the base of the PR and between a35843e and fe5e663.

📒 Files selected for processing (1)
  • pkg/dxf/framework/scheduler/scheduler_nokit_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/dxf/framework/scheduler/scheduler_nokit_test.go

📝 Walkthrough

Walkthrough

The scheduler now classifies finished-task metrics with getMetricState and isDataErrorForMetric, adds "all", "cancelled", and "data-error" labels, and updates tests for reverted tasks with conversion and duplicate-entry errors.

Task-finished metric classification

Layer / File(s) Summary
Metric constants, classification helpers, and tests
pkg/dxf/framework/scheduler/scheduler.go, pkg/dxf/framework/scheduler/scheduler_nokit_test.go
Adds "all", "cancelled", and "data-error" metric states; routes finished-task counting through getMetricState with string-based data-error detection; extends TestOnTaskFinished with reverted-path cases and updated counter expectations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I hopped through metrics, neat and bright,
With data-error labels tucked in right.
Cancelled, failed, and all in view,
The counters now know what to do.
A tidy bounce for code review!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly states the main change: classifying data errors in DXF finished task metrics.
Description check ✅ Passed The description matches the template and includes issue reference, problem summary, changes, tests, side effects, documentation, and release note.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 29, 2026
@D3Hunter

Copy link
Copy Markdown
Contributor Author

/cherry-pick release-nextgen-20251011
/cherry-pick release-nextgen-202603

@ti-chi-bot

Copy link
Copy Markdown
Member

@D3Hunter: once the present PR merges, I will cherry-pick it on top of release-nextgen-20251011/release-nextgen-202603 in the new PR and assign it to you.

Details

In response to this:

/cherry-pick release-nextgen-20251011
/cherry-pick release-nextgen-202603

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@D3Hunter D3Hunter marked this pull request as ready for review June 29, 2026 08:59
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 29, 2026
@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.4279%. Comparing base (aa35069) to head (fe5e663).
⚠️ Report is 9 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #69507        +/-   ##
================================================
- Coverage   76.3257%   74.4279%   -1.8979%     
================================================
  Files          2041       2045         +4     
  Lines        561045     574870     +13825     
================================================
- Hits         428222     427864       -358     
- Misses       131922     146725     +14803     
+ Partials        901        281       -620     
Flag Coverage Δ
integration 40.8918% <0.0000%> (+1.2637%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 60.4471% <ø> (ø)
parser ∅ <ø> (∅)
br 47.6078% <ø> (-15.1433%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ingress-bot

Copy link
Copy Markdown

🔍 Starting code review for this PR...

@D3Hunter

Copy link
Copy Markdown
Contributor Author

/retest

@ingress-bot ingress-bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.

Summary

  • Total findings: 4
  • Inline comments: 4
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

🟡 [Minor] (2)

  1. New "data-error" metric label value breaks the single-word naming convention of the counter's existing labels (pkg/dxf/framework/scheduler/scheduler.go:54)
  2. Reverted data-error tasks leave the failed metric series, silently lowering existing failure dashboards/alerts (pkg/dxf/framework/scheduler/scheduler.go:786, pkg/dxf/framework/dxfmetric/metric.go:70)

🧹 [Nit] (2)

  1. Test assertions use raw string literals for metric label constants defined in the same package (pkg/dxf/framework/scheduler/scheduler.go:52, pkg/dxf/framework/scheduler/scheduler_nokit_test.go:765)
  2. Implicit future-work intent in isDataErrorForMetric comment lacks a tracking reference (pkg/dxf/framework/scheduler/scheduler.go:800)

Comment thread pkg/dxf/framework/scheduler/scheduler.go
Comment thread pkg/dxf/framework/scheduler/scheduler.go
Comment thread pkg/dxf/framework/scheduler/scheduler_nokit_test.go Outdated
Comment thread pkg/dxf/framework/scheduler/scheduler.go
@D3Hunter

Copy link
Copy Markdown
Contributor Author

/retest

@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 29, 2026
@D3Hunter

Copy link
Copy Markdown
Contributor Author

/retest

@ti-chi-bot

ti-chi-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joechenrh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jun 30, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown

[LGTM Timeline notifier]

Timeline:

  • 2026-06-30 02:14:03.402457447 +0000 UTC m=+91985.102836880: ☑️ agreed by joechenrh.

@D3Hunter

D3Hunter commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/hold

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants