[test](pipeline) collect Cloud P0 resource usage#65125
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 29772 ms |
TPC-DS: Total hot run time: 173344 ms |
ClickBench: Total hot run time: 25.4 s |
|
/review |
There was a problem hiding this comment.
Reviewed PR #65125 at head bc9bb853256d66c9cc20ff770ebb171ddaf20d63.
No blocking issues found.
Critical checkpoint conclusions:
- Goal/test proof: The PR adds periodic system resource snapshots for Cloud P0 runs and stores them in
output/regression-test/log/system_resource_usage.log, which is covered by the existing Doris log archive path on failure. Syntax and whitespace checks pass; I also smoke-tested helper start/stop locally. - Scope/focus: The change is small and limited to the Cloud P0 runner plus shared pipeline utility helpers.
- Concurrency/lifecycle: The new background monitor has explicit manual stop before log collection and an EXIT trap fallback. Repeated stop is idempotent through the unset PID path.
- Configuration: Only optional environment knobs are introduced for monitor interval and top-N output; no Doris runtime config or dynamic config propagation is involved.
- Compatibility/parallel paths: No FE/BE protocol, storage, SQL semantics, or rolling-upgrade compatibility impact. Sibling regression runners are not semantically required to change for a Cloud P0-only monitor.
- Test coverage/results: This shell-only pipeline change does not need Doris regression
.outupdates. Verifiedbash -n,git diff --check, and a local helper smoke test. Full Cloud P0 was not run in this review environment. - Observability: The new log is collected in the existing regression log archive path and records memory/process snapshots around the Cloud P0 run.
User focus: No additional user-provided focus was present.
Subagent conclusions: The optimizer-rewrite subagent found no optimizer/rewrite, join, aggregate, or optimizer regression coverage impact. The tests-session-config subagent found no concrete session/config, expected-output, compatibility, CI, or style issue. Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for the same ledger/comment set.
bc9bb85 to
308e19b
Compare
|
run buildall |
TPC-H: Total hot run time: 29770 ms |
TPC-DS: Total hot run time: 173333 ms |
ClickBench: Total hot run time: 25.31 s |
|
/review |
There was a problem hiding this comment.
Automated review completed for PR #65125.
I found one lifecycle cleanup issue in the new Cloud P0 resource monitor and left it inline. The monitor log path itself is archived because it writes under ${DORIS_HOME}/be/log/, which archive_doris_logs already collects.
Critical checkpoint conclusions:
- Goal/test proof: The patch implements periodic resource snapshots for Cloud P0 runs, but cleanup is incomplete because the active
sleepchild can survive after stop. I verified the changed scripts withbash -n,git diff --checkagainst the fetched base/head range, and contained monitor lifecycle checks. Full Cloud P0 was not run because this runner lacksthirdparty/installedandprotoc. - Scope/focus: The implementation is small and limited to
cloud_p0/run.shplus common pipeline utilities. - Concurrency/lifecycle: The only new concurrency is the background shell loop. The parent monitor PID is stopped, but the current
sleepchild can be orphaned; this is the submitted inline issue. - Config/session compatibility: No Doris session variables, regression suite selection, or
run-regression-test.shparallel flags changed. - Persistence/FE-BE/storage/security: Not applicable for this pipeline-only logging change.
- Parallel paths: Existing cloud P0 regression invocation remains unchanged; vault/performance paths are not changed by this PR.
- Test coverage/results: No regression
.outchanges are involved. Syntax/range checks passed; full pipeline execution was not available in this review environment. - User focus: No additional user-provided focus was supplied.
Subagent conclusions:
optimizer-rewrite: no optimizer/rewrite, semantic-equivalence, or parallel join/aggregate issue found.tests-session-config: proposed TSC-1; after main verification it became inline comment MAIN-1.- Final convergence: round 2 ended with both live subagents returning
NO_NEW_VALUABLE_FINDINGSfor this exact final ledger/comment set.
| stop_system_resource_monitor() { | ||
| if [[ -z "${SYSTEM_RESOURCE_MONITOR_PID:-}" ]]; then return 0; fi | ||
| if kill -0 "${SYSTEM_RESOURCE_MONITOR_PID}" 2>/dev/null; then | ||
| kill "${SYSTEM_RESOURCE_MONITOR_PID}" 2>/dev/null || true |
There was a problem hiding this comment.
stop_system_resource_monitor only kills and waits for the monitor subshell, but the loop is often blocked in its foreground sleep "${interval}" child. I can reproduce this with SYSTEM_RESOURCE_MONITOR_INTERVAL_SECONDS=123: before stop the monitor owns sleep 123, and after this function returns the monitor PID is gone but that sleep is still alive with PPID 1. In the normal Cloud P0 path this leaves a stray sleep 60 after cleanup, and a custom interval can leave it around much longer. Please have the monitor worker clean up its active child on TERM, or stop the whole monitor process tree/process group, before reporting that the monitor has stopped.
There was a problem hiding this comment.
Fixed in 0f7197f. The monitor worker now tracks the active sleep child and kills/waits it from a TERM/INT cleanup trap before exiting.
308e19b to
0f7197f
Compare
|
/review |
|
run buildall |
|
skip buildall |
There was a problem hiding this comment.
Reviewed PR #65125 at head 0f7197f7681eacf378b3f2ab029f35fbd075bc70.
No new blocking issue found. The earlier lifecycle issue about stop_system_resource_monitor leaving the active sleep child behind is already covered by existing thread 3517166741, and current head fixes it by tracking the sleep PID and killing/waiting it from the monitor worker cleanup trap.
Critical checkpoint conclusions:
- Goal/test proof: The PR adds Cloud P0 system resource snapshots around the regression run and writes them under
${DORIS_HOME}/be/log/system_resource_usage.log, which is included in existing BE log archival. The Cloud P0 check is reported successful on the PR; I also ranbash -non both changed scripts and a local monitor start/stop probe. - Scope: The change is narrowly scoped to
regression-test/pipeline/cloud_p0/run.shandregression-test/pipeline/common/doris-utils.sh. - Concurrency/lifecycle: The monitor is a background subshell. Current head stops the monitor and its active sleep child cleanly on the explicit stop path and the EXIT trap no-ops after the explicit stop unsets the PID.
- Configuration/session compatibility: No Doris session variables, FE/BE protocol, storage format, or rolling-upgrade boundary is changed. The optional monitor interval/top-N environment knobs only affect diagnostic collection.
- Parallel paths: This is intentionally wired only into Cloud P0. Existing Vault P0 and other pipeline paths are unchanged.
- Test/style coverage:
bash -npassed. ShellCheck only reported indirect-function/trap-handler informational noise and unrelated pre-existing diagnostics outside the changed behavior. GitHub checks showcloud_p0and the code-review check passing;COMPILEwas still pending when queried after the PR had already been merged. - Observability: The new output is diagnostic-only and captures uptime, memory, process, and top snapshots in the existing archived BE log directory.
User focus: No additional user-provided review focus was present.
Subagent conclusions:
optimizer-rewrite: no optimizer/Nereids/join/aggregate surface in this shell-only patch; convergence round returnedNO_NEW_VALUABLE_FINDINGS.tests-session-config: no regression expected-output, session/config propagation, compatibility, or basic CI/style candidate found; convergence round returnedNO_NEW_VALUABLE_FINDINGS.- Final inline comment set is empty after duplicate suppression against existing thread 3517166741.
Proposed changes
Testing