fix(ci): gate Stage_3 deploy on MCR image availability instead of Stage_2 completion #1727

Open
zanejohnson-azure wants to merge 1 commit into ci_prod from zane/stage3-mcr-image-wait

@zanejohnson-azure

Contributor

Problem

Stage_3 (Deploy ama-logs to CI AKS Prod Clusters) depends on Stage_2, but Stage_2's Ev2 SDP rollout does not "complete" for ~24h because of the bake/monitoring window, even though the ama-logs images are published to MCR early in that rollout. As a result, Stage_3 waits ~24h before it can even start deploying.

Change

Decouple Stage_3 from Stage_2 and gate the deploys on actual image availability on MCR instead — mirroring the ama-metrics release pipeline.

  • Stage_3: dependsOn: Stage_2 → dependsOn: []; dropped eq(dependencies.Stage_2.result, 'Succeeded') from the condition (the other guards are retained: not-PR, main branch, non-empty tag).
  • Added a WaitForMCRImages gate job that polls MCR until both the Linux ($(AgentImageTagSuffix)) and Windows (win-$(AgentImageTagSuffix)) tags exist under mcr.microsoft.com/azuremonitor/containerinsights/ciprod (up to 24h, 1440 × 60s).
  • Each cluster deploy now has dependsOn: ['WaitForMCRImages'], so Helm only runs once the images are confirmed on MCR.
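
As a rough illustration of the gate's parameters, the sketch below derives the two tags and the polling budget described in the list above. The value assigned to AgentImageTagSuffix is a hypothetical stand-in for the pipeline variable, and the variable names simply follow the test script in this PR; the actual pipeline step may be laid out differently.

```bash
#!/usr/bin/env bash
# Hypothetical value; in the pipeline this comes from $(AgentImageTagSuffix).
AgentImageTagSuffix="3.4.0"

LINUX_TAG="${AgentImageTagSuffix}"
WINDOWS_TAG="win-${AgentImageTagSuffix}"

MAX_RETRIES=1440    # number of polling attempts
SLEEP_INTERVAL=60   # seconds slept between attempts

# Total sleep budget: 1440 x 60 s = 86400 s = 24 h.
budget_seconds=$(( MAX_RETRIES * SLEEP_INTERVAL ))
budget_hours=$(( budget_seconds / 3600 ))

echo "tags: ${LINUX_TAG} ${WINDOWS_TAG}; budget: ${budget_seconds}s (${budget_hours}h)"
```

The 24h figure counts only the sleep time between attempts; time spent in the HTTP requests themselves would add to it.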

Note: the check queries the manifest endpoint directly (HTTP 200 = exists) rather than grepping /tags/list. The ciprod repo currently has ~494 tags, and a substring grep would false-match (e.g. 3.4.0 inside 3.4.01 or win-3.4.0). Manifest lookup is exact.
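
The false match described in the note can be reproduced offline. The sketch below uses a made-up tag list (not real /tags/list output) in which plain 3.4.0 is deliberately absent, and compares a substring grep with a whole-line match. It only illustrates the failure mode; the PR itself avoids list matching altogether by querying the manifest endpoint.

```bash
#!/usr/bin/env bash
# Hypothetical tag list: the plain "3.4.0" tag is NOT present.
tags=$'3.4.01\nwin-3.4.0'

# Substring match: reports a hit because "3.4.0" occurs inside other tags.
if grep -qF '3.4.0' <<<"$tags"; then substring_hit=yes; else substring_hit=no; fi

# Whole-line, fixed-string match: only an identical tag counts.
if grep -qxF '3.4.0' <<<"$tags"; then exact_hit=yes; else exact_hit=no; fi

echo "substring=${substring_hit} exact=${exact_hit}"
```

Here the substring grep reports a hit for a tag that does not exist, while the whole-line match does not.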

Local verification

The check_tag function and retry loop were tested locally against live MCR using GNU bash 5.2.21.

Test script (core check_tag/loop is verbatim from the pipeline; the rest is the harness)
#!/usr/bin/env bash
set -uo pipefail

MCR_REGISTRY="mcr.microsoft.com"
PROD_MCR_AGENT_REPO="/azuremonitor/containerinsights/ciprod"

# --- verbatim from the pipeline step ---
check_tag() {
  local tag="$1"
  curl -sSL -o /dev/null -w "%{http_code}" \
    -H "Accept: application/vnd.docker.distribution.manifest.list.v2+json,application/vnd.oci.image.index.v1+json,application/vnd.docker.distribution.manifest.v2+json" \
    "https://${MCR_REGISTRY}/v2${PROD_MCR_AGENT_REPO}/manifests/${tag}" || echo "000"
}

run() {
  local LINUX_TAG="$1" WINDOWS_TAG="$2" MAX_RETRIES="$3" SLEEP_INTERVAL="$4"
  echo "== Test: linux=${LINUX_TAG} windows=${WINDOWS_TAG} (max=${MAX_RETRIES}) =="
  for i in $(seq 1 "$MAX_RETRIES"); do
    linux_code=$(check_tag "$LINUX_TAG")
    windows_code=$(check_tag "$WINDOWS_TAG")
    echo "Attempt ${i}/${MAX_RETRIES}: linux=${linux_code} windows=${windows_code}"
    if [ "$linux_code" = "200" ] && [ "$windows_code" = "200" ]; then
      echo "RESULT: both published -> exit 0"; return 0
    fi
    sleep "$SLEEP_INTERVAL"
  done
  echo "RESULT: not published in time -> exit 1"; return 1
}

echo "### POSITIVE (real tags 3.4.0 / win-3.4.0)"
run "3.4.0" "win-3.4.0" 2 1 && echo "PASS positive" || echo "FAIL positive"

echo; echo "### NEGATIVE (bogus linux tag)"
run "does-not-exist-xyz-9999" "win-3.4.0" 1 1 && echo "FAIL negative (should not pass)" || echo "PASS negative"

echo; echo "### SUBSTRING SAFETY (exact manifest match, not tag-list grep)"
echo "3.4.099 code: $(check_tag 3.4.099)  (expect 404)"
echo "3.4.0 code:   $(check_tag 3.4.0)    (expect 200)"

Output (live MCR):

### POSITIVE (real tags 3.4.0 / win-3.4.0)
== Test: linux=3.4.0 windows=win-3.4.0 (max=2) ==
Attempt 1/2: linux=200 windows=200
RESULT: both published -> exit 0
PASS positive

### NEGATIVE (bogus linux tag)
== Test: linux=does-not-exist-xyz-9999 windows=win-3.4.0 (max=1) ==
Attempt 1/1: linux=404 windows=200
RESULT: not published in time -> exit 1
PASS negative

### SUBSTRING SAFETY (exact manifest match, not tag-list grep)
3.4.099 code: 404  (expect 404)
3.4.0 code:   200    (expect 200)

Result: ✅ Succeeds only when both Linux and Windows tags exist on MCR; correctly fails on a missing tag; exact manifest match avoids substring false positives.
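
The verification above depends on live MCR. For checking the loop logic alone without network access, one option is to stub check_tag. The sketch below is an assumption-laden variant, not part of the pipeline: wait_for_tags and PUBLISHED_FILE are hypothetical names, and the stub treats a tag as published when it appears as a line in a temporary file.

```bash
#!/usr/bin/env bash
set -uo pipefail

# Hypothetical stub: returns 200 only if the tag is listed in $PUBLISHED_FILE.
PUBLISHED_FILE="$(mktemp)"
check_tag() {
  if grep -qxF -- "$1" "$PUBLISHED_FILE"; then echo 200; else echo 404; fi
}

# Same shape as the retry loop in the test script: succeed only when
# both tags report 200 within the allowed number of attempts.
wait_for_tags() {
  local linux_tag="$1" windows_tag="$2" max_retries="$3" sleep_interval="$4"
  local i linux_code windows_code
  for i in $(seq 1 "$max_retries"); do
    linux_code=$(check_tag "$linux_tag")
    windows_code=$(check_tag "$windows_tag")
    if [ "$linux_code" = "200" ] && [ "$windows_code" = "200" ]; then
      return 0
    fi
    sleep "$sleep_interval"
  done
  return 1
}

# Only the Windows tag is "published": the gate should fail.
echo "win-3.4.0" > "$PUBLISHED_FILE"
only_windows=0
wait_for_tags "3.4.0" "win-3.4.0" 2 0 || only_windows=$?

# Both tags are "published": the gate should pass.
echo "3.4.0" >> "$PUBLISHED_FILE"
both=0
wait_for_tags "3.4.0" "win-3.4.0" 2 0 || both=$?

rm -f "$PUBLISHED_FILE"
echo "only_windows=${only_windows} both=${both}"
```

A sleep interval of 0 keeps the run instantaneous; the stub exercises only the control flow and says nothing about how MCR itself responds.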

…ge_2 completion

Stage_2's Ev2 SDP rollout does not complete for ~24h due to the bake/monitoring window, but the ama-logs images are published to MCR early in the rollout. Decouple Stage_3 (dependsOn: []) and add a WaitForMCRImages gate job that polls MCR for the new Linux and Windows image tags, so cluster deploys start as soon as the images are available. Mirrors the ama-metrics release pipeline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
zanejohnson-azure requested a review from a team as a code owner on July 2, 2026 18:41