Skip to content

Conversation

@cfsmp3
Copy link
Contributor

@cfsmp3 cfsmp3 commented Dec 30, 2025

Summary

  • Remove blocking wait_for_operation in delete_expired_instances() (VM deletion now fire-and-forget)
  • Remove blocking wait_for_operation in start_test() (VM creation now optimistic)
  • NEW: Add VM deletion tracking and verification system to prevent orphaned VMs

Problem

These blocking calls were causing 504 timeouts on webhook deliveries because:

  1. GitHub has a 10-second webhook timeout
  2. When cron jobs run wait_for_operation (which can block for up to 30 minutes), gunicorn workers become blocked
  3. Webhook requests queue behind slow requests and exceed the timeout

However, simply making deletions fire-and-forget creates a new problem: if a deletion fails, the VM continues running and incurring charges with no way to detect or recover.

Solution

Part 1: Remove blocking waits (original PR)

  • delete_expired_instances: Delete VMs without waiting for confirmation
  • start_test: Check for immediate errors, then record instance optimistically

Part 2: VM Deletion Tracking System (new)

A robust system to ensure VMs are properly deleted:

Component Purpose
PendingDeletion model Tracks deletion operations in DB
delete_instance_with_tracking() Records pending deletion when initiating delete
verify_pending_deletions() Checks GCP operation status, retries failures (up to 5 times)
scan_for_orphaned_vms() Final safety net - finds VMs with no tracking and deletes them

Cron flow (every 10 minutes):

  1. verify_pending_deletions() - Check pending ops from previous runs
  2. scan_for_orphaned_vms() - Catch any orphans we missed
  3. delete_expired_instances() - Delete timed-out VMs (with tracking)
  4. gcp_instance() - Start new tests

This provides defense in depth: tracked deletions are verified and retried, and the orphan scan catches anything that slips through.

⚠️ Maintainer Action Required

After merging, run the database migration:

cd /var/www/sample-platform
source venv/bin/activate
flask db upgrade

This creates the pending_deletion table needed for tracking. The migration is safe and non-destructive.

No action required before merging - the new code is backwards compatible and will work even before the migration runs (it will just fail to track deletions until the table exists).

Test plan

  • Verify existing tests pass
  • Add 11 new tests for deletion tracking functionality
  • CI tests pass
  • Manual verification on staging that tests still run correctly

🤖 Generated with Claude Code

This PR removes the last two blocking wait_for_operation calls that were
causing gunicorn workers to be blocked for extended periods:

1. delete_expired_instances() - VM deletion is now fire-and-forget
2. start_test() - VM creation is now optimistic (recorded immediately)

These blocking calls were causing 504 timeouts on webhook deliveries
because GitHub has a 10-second webhook timeout. When cron jobs were
running wait_for_operation (which can take up to 30 minutes), all
gunicorn workers could become blocked, causing webhook requests to
queue and exceed the timeout.

Changes:
- delete_expired_instances: Remove blocking wait, log operation initiation
- start_test: Check for immediate errors, then record instance optimistically
- Tests: Remove unused wait_for_operation mocks from affected tests

If VM creation ultimately fails, the test won't report progress and will
be cleaned up by the expired instances cron job.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Implements a robust system to ensure GCP VMs are properly deleted:

- Add PendingDeletion model to track deletion operations
- Add delete_instance_with_tracking() to record pending deletions
- Add verify_pending_deletions() to check operation status and retry failures
- Add scan_for_orphaned_vms() as final safety net for missed deletions
- Integrate verification into start_platforms() cron flow
- Update both deletion points to use tracking
- Add database migration for pending_deletion table
- Add 11 tests for new functionality

This prevents orphaned VMs from continuing to incur billing charges
when deletion operations fail silently.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants