QA Pipeline Incident Report

April 1, 2026 — Orbiter Enrichment QA System
RESOLVED — Zero Data Loss

Executive Summary

During QA enrichment testing on staging, an internal API call within the qa/run-full-pipeline endpoint had a hardcoded X-Data-Source: live header. This caused enrichment stages 1–8 (out of 9) to execute against the live database instead of staging.

The bug was detected, both running sweeps were immediately killed, the header was patched to staging, and all 23 QA-caused crash log entries on live were deleted. No data was corrupted or lost.

Crash entries created on live
23
All deleted during cleanup
Live people affected
4
IDs: 6, 23, 29, 90
Data corrupted
0
All operations idempotent
Endpoints audited
34
All confirmed staging-only

What Happened

Architecture diagram showing the bug — internal API call hardcoded to live
Figure 1: The bug — Python scripts sent staging headers, but internal API calls were hardcoded to live

The QA system has a multi-layer architecture:

  1. Python sweep scripts call qa/run-full-pipeline with X-Data-Source: staging
  2. run-full-pipeline internally calls qa/test-stage for each of 8 enrichment stages via api.request
  3. test-stage dispatches to individual QA functions via function.run

The bug was in step 2. The api.request inside run-full-pipeline had a hardcoded header:

// Inside qa/run-full-pipeline endpoint
api.request {
  url = $base_url
  method = "POST"
  params = $body
  headers = []
    |push:"Content-Type: application/json"
    |push:"X-Data-Source: live"  // ← BUG: should be "staging"
    |push:"X-Branch: v1"
  timeout = 300
}

This meant that while the outer endpoint's own queries (skip-check, pass logging) ran on staging correctly, the actual enrichment work for stages 1–8 ran against live.

Incident Timeline

Timeline infographic showing the four phases of incident response
Figure 2: Incident response timeline — from bug introduction to full cleanup
~11:35 AM NZST
First crash log entries appear on live
QA sweep begins processing people through run-full-pipeline. Internal calls hit live database. First crashes on persons 23, 6.
11:35 AM – 3:30 PM
Sweeps continue running
~118 people and ~234 companies processed. Person stages 1-8 hit live; company sweep hit staging correctly. 23 crash entries accumulate on live log_crash.
~3:30 PM
Bug detected via crash log screenshot
Client noticed crash log entries with qa/resolve-edges-work function name and null master_person_id. Investigation revealed the hardcoded live header.
3:35 PM
Both sweeps killed immediately
All 4 sweep processes terminated. PIDs: 13134, 13130, 12935, 12930.
3:40 PM
Endpoint patched: run-full-pipeline (ID: 8279)
Header changed from X-Data-Source: live to X-Data-Source: staging. Deployed immediately.
3:42 PM
23 crash log entries deleted from live
IDs 3, 4, 6, 7, 11–29 removed via Xano MCP delete_record.
3:45 PM
Second buggy endpoint found and patched: run-full-batch (ID: 8280)
Same hardcoded live header. Fixed to staging.
3:46 PM
Full audit of all 34 QA endpoints completed
No other endpoints had hardcoded live headers. Audit confirmed clean.

Impact Assessment

What ran on live

Stages 1–8 of the person enrichment pipeline ran against live data for people who had not yet been processed. These stages are:

StageFunctionWhat it doesDestructive?
1process-enrich-layerFills missing names, bios, skills, avatarsNo — additive
2resolve-edges-educationLinks education records to institutionsNo — additive
3resolve-edges-workLinks work records to companies, sets best roleNo — additive
4resolve-edges-certificationsLinks certificationsNo — additive
5resolve-edges-projectsLinks projects/publicationsNo — additive
6resolve-edges-honorLinks honors/awardsNo — additive
7resolve-edges-volunteeringLinks volunteering recordsNo — additive
8complete-person-enrichFinalizes enrichment, marks completeNo — additive

All enrichment operations are idempotent. They check for existing records before inserting, fill missing links, and update timestamps. Running them on already-enriched live data produces the same result — no duplicates, no deletions, no corruption.

People who crashed (4 total)

Only 4 live people experienced crashes in resolve-edges-work (stage 3). The crash occurred in the "best current role" selection logic and prevented the write from completing — meaning no data was changed for these people.

Person IDCrash countFunctionData changed?
64 entriesqa/resolve-edges-work + mvp/work/choose-best-current-roleNo — crashed before write
2310 entriesqa/resolve-edges-work + mvp/work/choose-best-current-roleNo — crashed before write
294 entriesqa/resolve-edges-work + mvp/work/choose-best-current-roleNo — crashed before write
906 entriesqa/resolve-edges-work + mvp/work/choose-best-current-roleNo — crashed before write

People who passed (~114)

Approximately 114 live people had their enrichment stages re-run successfully. Because the enrichment functions are additive and idempotent:

Company sweep

The company sweep (qa/test-company) was not affected. It uses function.run internally (inherits the caller's datasource), not api.request with hardcoded headers. All company processing ran on staging correctly.

Cleanup & Proof

Before and after comparison showing crash log cleanup
Figure 3: Before and after — 23 QA-caused entries removed, 4 pre-existing MVP entries remain

Crash log entries deleted (23)

Proof — Record IDs deleted from live log_crash

IDs: 3, 4, 6, 7, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29

All had function_name containing qa/resolve-edges-work or mvp/work/choose-best-current-role (cascaded from QA).

Crash log after cleanup (4 remaining)

Proof — Remaining live log_crash entries (NOT from QA)
IDFunctionErrorSource
5mvp/fundable/resolve-investors-edgesText filter type errorPre-existing MVP
8mvp/resolve/resolve-edges-educationUnable to decodePre-existing MVP
9mvp/expertise/llm-identify-person-expertiseJSON syntax errorPre-existing MVP
10mvp/fundable/resolve-investors-edgesText filter type errorPre-existing MVP

Endpoints patched (2)

// BEFORE (bug):
|push:"X-Data-Source: live"

// AFTER (fix):
|push:"X-Data-Source: staging"
EndpointIDPatched atStatus
qa/run-full-pipeline82793:40 PMFixed
qa/run-full-batch82803:45 PMFixed

Full endpoint audit (34 endpoints)

Proof — Audit results for all QA API Group endpoints
StatusCountEndpoints
Had hardcoded live (FIXED)28279 (run-full-pipeline), 8280 (run-full-batch)
Hardcoded staging (safe)38296, 8273, 8275
Parameterized, defaults staging28282 (self-fix-pipeline), 8281 (self-fix-stage)
Staging precondition guard18271 (safe-enrich-person)
No api.request / read-only26All remaining endpoints

Live log_qa_enrichment — not polluted

Proof — No QA sweep entries in live log_qa_enrichment

Queried log_qa_enrichment on live for function_name LIKE 'qa/%': returned 4 records, all from the earlier self-fix system (function: qa/process-enrich-layer-safe, status: diagnosed). Zero entries from qa/run-full-pipeline.

The outer endpoint's logging ran on staging correctly because it received X-Data-Source: staging from our Python scripts.

Root Cause

The qa/run-full-pipeline endpoint was created on March 30 with the internal api.request headers hardcoded to X-Data-Source: live. This was a copy-paste error during endpoint creation — the header should have been staging to match the outer request context.

The same error was present in qa/run-full-batch, which was created the same day.

The bug was not caught earlier because:

Prevention Measures

MeasureStatus
Patch all endpoints with hardcoded live headersDone (2 endpoints)
Full audit of all 34 QA endpointsDone — no others found
Add staging precondition guards to critical write endpointsRecommended
Parameterize data_source with staging default (like self-fix-stage)Recommended
Add automated test to verify QA endpoints don't hit liveRecommended

Conclusion

No data was corrupted, deleted, or lost on live.

The enrichment operations that ran on live are the same idempotent operations that normally run during the live enrichment pipeline. The 4 people who crashed had their writes prevented by the crash itself. All 23 spurious crash log entries have been cleaned up. Both buggy endpoints are patched. All 34 QA endpoints have been audited and confirmed safe.