Building Hybrid Human-Agent Squads with Shared OKRs

Maggie Nanyonga · 2026-05-26

A generic enterprise blueprint for designing, staffing, and governing project squads that integrate autonomous agents as bounded contributors.


Executive Summary

This paper establishes the generic enterprise blueprint for designing, staffing, and governing hybrid human-and-agent project squads. The transition from passive copilots to active autonomous participants forces a re-examination of how work is delegated, how accountability is shared, and how productivity is measured. Three independent lines of evidence frame the problem space.

First, the Stanford/CMU Collaborative Gym study (Shao et al., ICLR 2026) found that the best-performing collaborative agents consistently outperform their fully autonomous counterparts, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users (arXiv:2412.15701). The same study reported communication and situational-awareness failures in 65% and 40% of cases respectively. The shape of the data is clear: collaborative agents materially outperform autonomous ones, while communication failures remain pervasive.

Second, the METR randomized controlled trial of experienced open-source developers found that when developers use AI tools, they take 19% longer than without — even though developers forecast that AI will reduce completion time by 24%, and after participation still estimate it saved them ~20% (METR report; arXiv:2507.09089). This is the Fluency Illusion in operational form: subjective speedup of ~20% against an objective slowdown of 19%.

Third, the economics of "workslop" — low-quality AI output requiring human rework — are now quantified from two independent sources. One 2026 global study found that nearly 40% of AI time savings are lost to rework, including correcting errors, rewriting content, and verifying outputs (Workday, Jan 2026). A separate study put per-instance numbers on that loss: an average of 1 hour 56 minutes per workslop incident, an invisible tax of \$186 per month per affected worker, 41% prevalence, and over \$9 million per year for a 10,000-person organization (HBR, Sept 2025).

The blueprint that follows turns these findings into a deployable system: a skill taxonomy, a six-role squad matrix, three operational topologies with a recommended adoption path, a shared-OKR model with a net-productivity formula that accounts for rework, risk-tiered autonomy, a three-layer governance architecture, handoff contracts, operating cadence, an 8-week pilot plan, and explicit rollback triggers.

1. Theoretical Foundations

1.1 Cognitive Partnership and Task Chaining

Rather than treating AI as a passive assistant invoked task-by-task, hybrid squads treat agents as cognitive partners operating across full workflows. Enterprise value emerges at the workflow level — the sequence of interdependent tasks that produce an outcome — not at any single isolated task. Squads that cluster adjacent, agent-suitable tasks into continuous automated flows reduce friction, eliminate handoff delays, and lower the coordination cost humans pay every time work crosses the human-machine boundary.

This continuous-flow design is task chaining: identifying contiguous spans of work that an agent can execute end-to-end without intermediate human gating, then placing a single deliberate review at the boundary where domain judgment is required. Chaining converts the agent from a tool that humans poke at into a peer that owns a defined slice of the value stream.

1.2 Intellectual Sovereignty

Intellectual sovereignty is the conscious preservation of human judgment and critical thinking when operating alongside highly capable AI. It serves as the ultimate safety valve against organizational cognitive decline. Just as over-reliance on GPS navigation erodes a driver's internal spatial mapping, blind reliance on autonomous systems erodes an organization's operational and strategic orientation. If the system fails or encounters a novel event, an organization that has forfeited its intellectual sovereignty will lack the capacity to recover.

In a hybrid team, intellectual sovereignty requires human-owned objectives and policy, review of high-impact decisions, manual practice for critical skills, override rights, explanation and evidence visibility, auditability, periodic review of autonomy expansion, and active monitoring for rubber-stamping.

1.3 The Three Cognitive Traps

When humans collaborate with capable autonomous systems, three predictable failure modes emerge. These are not bugs of any specific model release; they are stable properties of the human-machine interface.

Anchoring Drift occurs when an agent's first-proposed execution path quietly narrows the solution space. Alternative — often superior — paths are discarded because the team has unconsciously committed to the first viable option presented.

Fluency Illusion is the systematic over-trust of polished output. Generative systems produce code, copy, and summaries that read as confident and complete. The surface quality lowers human scrutiny, allowing logical errors, hallucinations, and security defects to pass through review. The METR finding that developers felt faster while objectively going slower is the canonical empirical signature of this trap.

Delegation Creep describes the slow expansion of agent scope. What begins as delegating routine administrative work gradually accumulates into substantive decision-making, with the human's role shifting from active director to passive rubber-stamper. Delegation creep is the mechanism by which skills atrophy and audit quality collapses.

Figure 1 · The three cognitive traps. Each is a stable failure mode of the human–agent interface with a quantified empirical signature and a corresponding structural control (§9).

These three traps are the design constraints for everything that follows. Roles, topologies, OKRs, and governance must all actively counter them.

2. Role Redesign: The FADE/RISE Skill Taxonomy

To defend against delegation creep and preserve human capability, organizations must explicitly partition work by required cognitive competency.

FADE skills, the work delegated to agents, consist of Familiar patterns, Assembling data, Documented procedures, and Ephemeral memory. These are routine, repeatable, well-bounded tasks that consume an estimated 60–80% of typical knowledge-worker capacity. They are precisely the tasks where agent reliability is high and human attention is poorly spent.

RISE skills, the work preserved for humans, consist of Reasoning under uncertainty, Imagination (handling unprecedented edge cases), Social/relational collaboration, and Sensemaking. These are the capabilities where autonomous agents reliably degrade and where the organization's source-layer value resides.

fig2

Figure 2 · FADE / RISE skill taxonomy. Every workflow task is binned into agent-delegable (FADE) or human-preserved (RISE) competencies. The boundary is the design surface for role assignment.

The taxonomy is not a static partition. It is a decision rule applied during workflow audit to bin every task in the existing process into one of the two categories. FADE-dominant tasks become candidates for automation; RISE-dominant tasks become candidates for handoff-certainty improvements and explicit human ownership.

2.1 The Six-Role Generic Matrix

| Role | Core Responsibility | Primary Actor | Decision Rights | Escalation Path |
|------|---------------------|---------------|-----------------|-----------------|
| Squad Manager | Owns OKRs, capacity, risk tolerance | Human | Final approval on scope, priority, safety override | Functional Director / Governance Board |
| Domain Operator | Frontline domain judgment, relationship work | Human | Approves routine domain actions; handles agent exceptions | Squad Manager |
| Knowledge Owner | Maintains accuracy of knowledge bases and prompts | Human | Edits, appends, redacts source-of-truth files | Squad Manager / Compliance |
| Automation Specialist | Builds and maintains connectors and workflow logic | Human | Approves technical executions and integrations | Engineering Lead |
| QA / Governance Owner | Samples outputs, tracks drift, audits logs | Human | Can pause or roll back agent automation | Squad Manager / Risk Committee |
| Execution Agent | Processes, summarizes, drafts, executes bounded API calls | Synthetic | Autonomy restricted to pre-approved, reversible, low-risk actions | Associated Domain Operator |

fig3

Figure 3 · Six-role squad matrix. Non-overlapping ownership; every decision class has exactly one named owner and one named escalation path.

This matrix is intentionally non-overlapping. Every category of decision has exactly one named owner and exactly one named escalation path. The "if everyone is responsible, no one is" failure mode is the single most common source of delegation creep, and an unambiguous role matrix is the structural defense against it.

3. Shared OKRs and the Workslop Tax

3.1 Net Productivity Gain

Raw automation throughput is not the same as net value delivered. The difference is the rework, review, coordination, and risk overhead that agent output imposes on the human team. We model this as Net Productivity Gain:

P<sub>net</sub>  =  T<sub>saved</sub>  −  ( T<sub>rework</sub>  +  T<sub>review</sub>  +  T<sub>coordination</sub>  +  T<sub>risk</sub> )

where T<sub>saved</sub> is the raw time returned by automated execution, T<sub>rework</sub> is the human time spent correcting agent output (the Workslop Tax), T<sub>review</sub> is the time spent checking output before use, T<sub>coordination</sub> is the cognitive handoff penalty paid each time work crosses the human-machine boundary, and T<sub>risk</sub> is the expected cost from policy breaches, reversals, incidents, and downstream defects.

The T<sub>rework</sub> term alone absorbs ~40% of time saved per one global survey (Workday, Jan 2026). A separate study put the per-instance economics at 1 hour 56 minutes per workslop incident, \$186/month per affected worker, 41% prevalence, and ~\$9M annual exposure for a 10,000-person organization (HBR, Sept 2025). Any squad whose OKRs measure only T<sub>saved</sub> will report success while losing money.
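As a worked illustration of the formula, the sketch below computes Net Productivity Gain for one hypothetical squad-week. The class name `WeeklyTimeLedger` and every hour value are assumptions chosen for illustration, not measurements from the cited studies; the 40-hour rework figure simply applies the ~40% rework share to 100 saved hours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyTimeLedger:
    """Hours per week for one squad. All values are illustrative assumptions."""
    saved: float         # T_saved: raw time returned by automated execution
    rework: float        # T_rework: correcting agent output (the Workslop Tax)
    review: float        # T_review: checking output before use
    coordination: float  # T_coordination: human-machine handoff penalty
    risk: float          # T_risk: expected cost of incidents, expressed in hours

    def overhead(self) -> float:
        """Sum of the four terms subtracted from T_saved."""
        return self.rework + self.review + self.coordination + self.risk

    def net_gain(self) -> float:
        """P_net = T_saved - (T_rework + T_review + T_coordination + T_risk)."""
        return self.saved - self.overhead()


ledger = WeeklyTimeLedger(saved=100.0, rework=40.0, review=25.0,
                          coordination=20.0, risk=20.0)
print(ledger.overhead())  # 105.0
print(ledger.net_gain())  # -5.0
```

In this invented week the squad would report 100 hours saved yet nets a loss of 5 hours once the four overhead terms are counted, which is the case an OKR tracking only T<sub>saved</sub> would miss.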

Figure 4 · Net Productivity Gain decomposition. Rework absorbs ~40% of time saved; review, coordination, and risk compound it.

3.2 Tagged OKRs and the Five Key Results

Every Key Result should be explicitly tagged as Agent-Owned, Human-Owned, or Shared. This eliminates ambiguity about who is accountable when a metric slips and prevents the team from conflating agent throughput with squad outcome.

Objective: Scale workflow throughput without introducing downstream defects or raising cognitive fatigue.

KR3 is the integration metric; it cannot be improved by either party alone. KR5 is the sovereignty guardrail; it catches delegation creep by monitoring whether the human override rate collapses toward zero (rubber-stamping) or spikes without explanation (distrust). Both are warning signals.
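One way to operationalize the KR5 guardrail is a two-sided band on the human override rate. The sketch below is a minimal illustration: the function name and the 2% floor and 30% ceiling are placeholder assumptions that each squad would need to calibrate against its own baseline.

```python
def classify_override_rate(overrides: int, reviewed: int,
                           floor: float = 0.02, ceiling: float = 0.30) -> str:
    """Classify the share of reviewed agent actions that a human overrode.

    floor and ceiling are hypothetical thresholds, not values from the source.
    """
    if reviewed <= 0:
        raise ValueError("reviewed must be a positive count")
    rate = overrides / reviewed
    if rate < floor:
        return "rubber_stamping_risk"  # override rate collapsing toward zero
    if rate > ceiling:
        return "distrust_risk"         # unexplained spike in overrides
    return "within_band"


print(classify_override_rate(1, 200))   # rubber_stamping_risk
print(classify_override_rate(20, 200))  # within_band
print(classify_override_rate(90, 200))  # distrust_risk
```

Both out-of-band results are warning signals to investigate rather than automatic verdicts.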

3.3 Decision-Point Attribution

To measure accountability rather than just output, every action in the squad's workflow should be classified into one of four modes: (1) agent-autonomous (agent decided and executed); (2) agent-proposed, human-approved (agent recommended, human confirmed); (3) human-authored, agent-assisted (human led, agent contributed); or (4) human-autonomous (human acted without agent involvement). Tracking the distribution of these four modes over time is how the squad detects delegation creep structurally rather than anecdotally: a rising agent-autonomous share, unaccompanied by an explicit governance decision to expand agent scope, is the creep signal.
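The attribution check can be sketched as a comparison of mode shares between two periods. This is a minimal sketch under stated assumptions: the `Mode` enum, the function names, and the 5-percentage-point tolerance are hypothetical, and a real squad would read the modes from its audit log.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    AGENT_AUTONOMOUS = "agent-autonomous"
    AGENT_PROPOSED_HUMAN_APPROVED = "agent-proposed, human-approved"
    HUMAN_AUTHORED_AGENT_ASSISTED = "human-authored, agent-assisted"
    HUMAN_AUTONOMOUS = "human-autonomous"


def mode_shares(actions: list) -> dict:
    """Return the fraction of actions falling into each of the four modes."""
    counts = Counter(actions)
    total = len(actions)
    return {mode: (counts[mode] / total if total else 0.0) for mode in Mode}


def creep_signal(previous: list, current: list,
                 scope_expansion_approved: bool, tolerance: float = 0.05) -> bool:
    """True when the agent-autonomous share rose by more than `tolerance`
    without a recorded governance decision to expand agent scope."""
    rise = (mode_shares(current)[Mode.AGENT_AUTONOMOUS]
            - mode_shares(previous)[Mode.AGENT_AUTONOMOUS])
    return rise > tolerance and not scope_expansion_approved


last_week = ([Mode.AGENT_AUTONOMOUS] * 2
             + [Mode.AGENT_PROPOSED_HUMAN_APPROVED] * 6
             + [Mode.HUMAN_AUTONOMOUS] * 2)
this_week = ([Mode.AGENT_AUTONOMOUS] * 5
             + [Mode.AGENT_PROPOSED_HUMAN_APPROVED] * 4
             + [Mode.HUMAN_AUTONOMOUS] * 1)

print(creep_signal(last_week, this_week, scope_expansion_approved=False))  # True
print(creep_signal(last_week, this_week, scope_expansion_approved=True))   # False
```

The same rise in autonomous share is acceptable when it follows an explicit scope decision and a creep signal when it does not.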

4. Operational Topologies

Three topologies cover the practical design space. The choice depends on existing org structure, risk tolerance, and the maturity of the agent fleet.

Figure 5 · Three operational topologies. Model B (highlighted) is the recommended starting point. The adoption path is B → A → C as agent maturity and fleet size grow.

Model A — Flat Hybrid (Pipeline/Mesh Pattern)

Humans and agents operate as peer-level contributors pulling from a shared queue. Routing is dynamic, governed by confidence-score cutoffs. Agent-to-agent execution chains run automatically until a step fails or requires human judgment. Strengths: maximum velocity; instant absorption of routine volume. Weaknesses: failure localization is hard; high vulnerability to delegation creep because humans trust the pipeline by default.

Model B — Tiered Command (Orchestrator-Worker Pattern)

Agents are isolated in a distinct tier overseen by a designated human "Agent Boss." Rules are hard-coded; agents surface proposals to the human manager before executing. This is a widely adopted enterprise pattern, with documented production deployments at major cloud and AI platforms (Databricks; IBM; Microsoft Azure). Strengths: strongest protection of intellectual sovereignty; complete auditability; lowest risk profile. Weaknesses: slower cycle time; the human manager becomes the operational bottleneck.

Model C — Two-Tier Hierarchical (SupervisorAgent Pattern)

The squad manager oversees a bifurcated team: a human specialist tier and a synthetic worker tier coordinated by a non-LLM meta-agent that monitors interaction points and intervenes only on anomalies. One study reports that on the GAIA benchmark this pattern reduces token consumption by an average of 29.68% without compromising success rate (arXiv:2510.26585). Strengths: substantial token efficiency at fleet scale. Weaknesses: high implementation complexity; the supervisor layer itself becomes a new place where delegation creep can occur.

4.1 Recommended Adoption Path

Start with Model B when risk tolerance is low and agent maturity is unproven. Move mature workflow classes toward Model A as confidence scores calibrate and policy-as-code coverage improves. Reserve Model C for when the agent fleet is large enough that human orchestration cost has become the bottleneck and the team has the engineering capacity to audit the supervisor layer. If quality degrades at any stage, contract autonomy back to the prior model; the path is designed for retreat as well as advance.

5. Risk-Tiered Autonomy

A hybrid team should not ask "Can the agent do this?" It should ask "Under what risk tier, with what evidence, and with what approval?"

Figure 6 · T0–T4 routing pyramid. Autonomy decreases and human ownership increases as actions become more consequential. Tier crossings without a policy match are themselves audit events.

| Tier | Decision Class | Autonomy | Human Touch |
|------|----------------|----------|-------------|
| T0 | Read-only data fetch, format conversion, summarization | Full | None |
| T1 | Drafting (text, code, config) for human review | Full | Mandatory review before commit |
| T2 | Reversible bounded actions (ticket creation, low-stakes API writes) | Conditional on confidence ≥ threshold | Sampled review |
| T3 | Irreversible or high-stakes actions (financial, customer-facing, safety-relevant) | None — proposal only | Mandatory approval |
| T4 | Novel or out-of-policy situations | None — escalation only | Full human ownership |

Routing between tiers is governed by the agent's confidence score, the action's reversibility class, and the policy-as-code firewall version in force at action time. Crossing a tier boundary without an explicit policy match is itself an audit event.
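The routing rule in the table can be sketched as a small deterministic function. This is an illustrative sketch, not a reference implementation: the `route` function, its return labels, the 0.9 default threshold, and the fallback of below-threshold T2 actions to proposal-only are all assumptions layered on the table.

```python
from enum import IntEnum


class Tier(IntEnum):
    T0 = 0  # read-only fetch, format conversion, summarization
    T1 = 1  # drafting for human review
    T2 = 2  # reversible bounded actions
    T3 = 3  # irreversible or high-stakes actions
    T4 = 4  # novel or out-of-policy situations


def route(tier: Tier, confidence: float, policy_matched: bool,
          threshold: float = 0.9) -> str:
    """Return the handling mode for a proposed action (labels are hypothetical)."""
    if not policy_matched:
        # No explicit policy match: escalate and record an audit event.
        return "escalate_and_audit"
    if tier == Tier.T0:
        return "execute"
    if tier == Tier.T1:
        return "draft_then_mandatory_review"
    if tier == Tier.T2:
        # Assumption: below-threshold T2 actions fall back to proposal-only.
        if confidence >= threshold:
            return "execute_with_sampled_review"
        return "propose_for_approval"
    if tier == Tier.T3:
        return "propose_for_approval"
    return "escalate_to_human_owner"  # T4


print(route(Tier.T2, 0.95, policy_matched=True))   # execute_with_sampled_review
print(route(Tier.T2, 0.70, policy_matched=True))   # propose_for_approval
print(route(Tier.T3, 0.99, policy_matched=True))   # propose_for_approval
print(route(Tier.T0, 0.99, policy_matched=False))  # escalate_and_audit
```

Note that a high confidence score never unlocks a T3 action; the tier caps autonomy regardless of how certain the agent reports itself to be.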

6. Governance Architecture

Governance is the layer that converts the operational design above into something an external auditor can verify. The architecture comprises three layers: intent and policy, cognitive orchestration, and runtime enforcement.

6.1 Intent and Policy Layer

The intent and policy layer translates human judgment into executable constraints. It defines approved task classes, risk tiers, escalation triggers, tool permissions, data boundaries, change windows, approval requirements, human override rules, rollback requirements, audit requirements, and decommissioning conditions. This is where intellectual sovereignty is encoded — the human squad manager owns this layer and no agent may modify it.

6.2 Cognitive Orchestration Layer

The orchestration layer governs how agents reason together. To counter anchoring drift, agents must generate at least two execution paths for any non-trivial task. To counter fluency illusion, they must expose confidence by sub-claim, data lineage, alternatives considered, and uncertainty flags. The output of this layer is not a polished recommendation — it is an evidence packet with visible seams, designed to be scrutinized rather than approved on appearance.

6.3 Runtime Enforcement Layer

The runtime layer gates action. It should be deterministic where possible. Before a state-changing action executes, the system checks: actor identity, authorized tool, allowed target, risk tier, policy version, required approval, confidence threshold, data lineage completeness, rollback plan, rate limits, and audit logging. If any required check fails, execution is blocked, frozen, or escalated. This layer is the firewall between agent intent and real-world consequence.
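A deterministic pre-execution gate covering a subset of these checks might look like the sketch below. Everything named here is hypothetical: the `ActionRequest` fields, the `ALLOWED_TOOLS` allowlist, the agent and tool identifiers, and the 0.9 threshold are assumptions for illustration, and a production gate would evaluate the full checklist against versioned policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionRequest:
    actor_id: str
    tool: str
    risk_tier: int                     # 0..4, matching T0..T4
    confidence: float
    has_rollback_plan: bool
    approved_by: Optional[str] = None  # human reviewer id, if any


# Hypothetical per-agent tool allowlist (least privilege).
ALLOWED_TOOLS = {"agent-7": {"ticket.create", "kb.search"}}
CONFIDENCE_THRESHOLD = 0.9  # assumed value


def gate(request: ActionRequest) -> tuple:
    """Run every check and return (decision, failed_checks)."""
    failures = []
    if request.tool not in ALLOWED_TOOLS.get(request.actor_id, set()):
        failures.append("tool_not_authorized")
    if request.risk_tier >= 3 and request.approved_by is None:
        failures.append("approval_missing")
    if request.confidence < CONFIDENCE_THRESHOLD:
        failures.append("confidence_below_threshold")
    if not request.has_rollback_plan:
        failures.append("rollback_plan_missing")
    return ("allow" if not failures else "block", failures)


print(gate(ActionRequest("agent-7", "ticket.create", 2, 0.95, True)))
print(gate(ActionRequest("agent-7", "payments.refund", 3, 0.95, True)))
```

Collecting all failed checks, rather than stopping at the first, gives the audit log a complete reason for each blocked action.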

6.4 AI Security Controls

Agentic systems introduce specific security risks beyond those of static AI deployments: indirect prompt injection (where malicious data in external content hijacks agent logic), goal hijacking, tool misuse, privilege escalation, data exfiltration, and unauthorized state changes. Minimum controls include: separating data parsing from command execution, treating external content as untrusted, enforcing least privilege per agent, maintaining tool allowlists, prohibiting agents from modifying their own policies or credentials, isolating execution environments, logging all tool calls, redacting sensitive inputs, requiring deterministic policy evaluation before state-changing actions, and testing rollback and kill-switch paths.

6.5 Adversarial and Metamorphic Evaluation

Static benchmarks are insufficient for agents that act in dynamic environments. Evaluation should measure the trajectory of actions, not only the final output. Squads should schedule bi-weekly adversarial drills where agents are tested against poisoned documents, misleading requests, stale or conflicting data, adversarial prompts embedded in external content, missing context, tool failures, simulated downstream incidents, and high-confidence wrong answers. The purpose is not to prove the agent never fails — it is to learn where the system must route to humans, narrow permissions, or redesign controls.

6.6 NIST AI RMF Alignment

The blueprint maps to the NIST AI Risk Management Framework. The six-role matrix satisfies GOVERN 3.2 (policies and procedures define and differentiate roles and responsibilities for human-AI configurations and oversight of AI systems). The OKR tagging is the responsibility specification. The escalation paths are the communication specification. The audit schema maps to MEASURE 2.4 (the functionality and behavior of the AI system are monitored in production) and MEASURE 2.7 (AI system security and resilience are evaluated and documented). The handoff contracts map to MANAGE 1.1 as part of the squad's documented response to risks. Specific field-level NIST mappings are provided in Appendix A.2.

6.7 Handoff Contracts

The handoff is where hybrid work most often breaks. Every agent-to-human, human-to-agent, and agent-to-agent handoff should carry enough context for the receiver to act without reconstructing the entire situation. The minimum payload structure — situation, background, assessment, recommendation, and receiver readback — is adapted from the I-PASS and SBAR protocols used in healthcare handoffs, which have evidence-based support for reducing information loss at shift boundaries. The full YAML schema is provided in Appendix A.1.
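A receiver-side completeness check for this minimum payload can be sketched as follows. The helper names and the sample values are assumptions for illustration; the section names and the `receiver_acknowledgment` field follow the schema in Appendix A.1.

```python
REQUIRED_SECTIONS = ("situation", "background", "assessment",
                     "recommendation", "readback")


def missing_sections(handoff: dict) -> list:
    """List required sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not handoff.get(name)]


def is_complete(handoff: dict) -> bool:
    """A handoff closes only when every section is present
    and the receiver has explicitly read it back."""
    if missing_sections(handoff):
        return False
    return handoff["readback"].get("receiver_acknowledgment") is True


# Illustrative payload; all values are invented.
handoff = {
    "situation": {"summary": "Ticket backlog above threshold", "severity": "medium"},
    "background": {"prior_actions": ["triage-run-1"]},
    "assessment": {"agent_confidence": 0.82},
    "recommendation": {"proposed_next_action": "reassign queue",
                       "human_decision_required": True},
    "readback": {"receiver_acknowledgment": False},
}

print(is_complete(handoff))  # False: the receiver has not acknowledged yet
handoff["readback"]["receiver_acknowledgment"] = True
print(is_complete(handoff))  # True
```

Treating the readback as a hard requirement mirrors the closed-loop confirmation step in the healthcare protocols the payload is adapted from.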

7. Operating Cadence

Hybrid teams need explicit rhythm because humans and agents operate at different tempos. The cadence below separates continuous automated loops from human-paced rituals.

| Ritual | Frequency | Purpose | Output |
|--------|-----------|---------|--------|
| KPI scan | Daily | Check KR movement, approval backlog, policy alerts | Anomaly list |
| Routing standup | Daily, 15 min | Assign blocked work and escalations | Owner decisions |
| QA sample review | Daily or 2×/week | Inspect agent outputs for quality drift | Accept/reject trends |
| Knowledge refresh | Weekly | Update prompts, SOPs, source-of-truth content | Versioned updates |
| Failure and near-miss review | Weekly | Analyze bad outputs, blocked actions, escalations | Controls backlog |
| Sovereignty audit | Weekly | Three questions: What did agents miss? Would we have decided differently? What if this was wrong? | Override calibration |
| Adversarial drill | Bi-weekly | Inject synthetic failures into sandbox; test boundary conditions | Updated controls, escalation rules |
| OKR review | Weekly | Decide expand, hold, contract, or redesign | Scope decision |
| Governance review | Monthly | Assess audit posture, risk, workforce effects, autonomy dial | Executive report |

The sovereignty audit is the weekly practice most squads will not have seen before. Its three questions — adapted from research on cognitive partnership — are designed to detect the exact moment when human judgment begins drifting toward passive approval. If the squad cannot answer "What did the agents not consider?" with specific examples, the oversight layer is degrading.

8. Implementation Roadmap: The 8-Week Pilot

A successful pilot treats agent onboarding with the same discipline as hiring a human colleague: defined scope, defined evaluation, defined probation.

Figure 7 · 8-week pilot timeline with Go/No-Go gates. Four phases of two weeks each; gate criteria must be met before advancing.

Phase 1 — Baseline and Scope (Weeks 1–2). Audit current workflows. Identify FADE-dominant task spans. Capture cycle times, error rates, and existing human workload distributions. Define success metrics — including the Workslop Tax baseline, which most organizations have never measured.

Phase 2 — Build Controls (Weeks 3–4). Define agent personas. Connect trusted data sources with read scopes only. Write the initial policy-as-code firewalls. Stand up sandboxed simulation environments. Specify the audit schema before the first agent action runs.

Phase 3 — Shadow Run (Weeks 5–6). Agents process live feeds and propose actions; humans execute manually. The purpose of shadow mode is not to test agent capability — that was Phase 2 — but to calibrate confidence scores and surface the prompts, tools, and contexts where the agent's self-reported confidence is miscalibrated.
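Calibration during the shadow run can be checked by comparing the agent's self-reported confidence with how often humans actually accepted its proposals. The sketch below is one simple way to do that; the function name, the bucket edges, and the sample records are assumptions, and a real analysis would use far more observations per bucket.

```python
def calibration_gaps(records, edges=(0.5, 0.8, 0.9)):
    """Group (confidence, accepted) pairs into buckets and return, per bucket,
    mean reported confidence minus observed acceptance rate.

    A positive gap means the agent is overconfident in that bucket.
    Buckets are keyed by their lower edge; `edges` is an assumed partition.
    """
    buckets = {}
    for confidence, accepted in records:
        lower = max([0.0] + [e for e in edges if confidence >= e])
        buckets.setdefault(lower, []).append((confidence, accepted))

    gaps = {}
    for lower, rows in sorted(buckets.items()):
        mean_confidence = sum(c for c, _ in rows) / len(rows)
        acceptance_rate = sum(1 for _, ok in rows if ok) / len(rows)
        gaps[lower] = round(mean_confidence - acceptance_rate, 3)
    return gaps


# Invented shadow-mode records: (self-reported confidence, human accepted?)
records = [(0.95, True), (0.95, False), (0.6, True), (0.6, True)]
print(calibration_gaps(records))  # {0.5: -0.4, 0.9: 0.45}
```

In this toy data the agent is overconfident in its highest bucket, which is exactly the region where a confidence threshold would have enabled auto-execution.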

Phase 4 — Controlled Rollout (Weeks 7–8). Auto-execution is enabled for low-risk, high-confidence actions only. Escalation thresholds are enforced strictly. The Workslop Tax is tracked weekly. If the rework rate is not decreasing, the autonomy dial is turned back to Phase 3 settings, because a rework rate that fails to fall indicates a structural failure mode rather than a transient one.

9. Failure Modes, Rollback, and Calibration

The three cognitive traps from §1.3 each have a quantified empirical signature and a corresponding control.

Anchoring Drift is controlled by requiring the agent to surface at least two execution paths for any non-trivial task. If the human only ever sees one option, the trap is unavoidable.

Fluency Illusion is controlled by the Workslop Tax metric itself (KR2). Teams that track rework rate against agent output volume catch the trap directly; teams that track only output volume do not.

Delegation Creep is controlled by the decision-point attribution distribution (§3.3) and periodic re-audit of the autonomy dial. A rising "agent-autonomous" share without a corresponding governance decision to expand scope is the drift signal. The QA/Governance Owner monitors this weekly.

9.1 Rollback and Freeze Triggers

Freeze or contract agent scope if any of the following occur:

A rollback is not a pilot failure. It is evidence that the control system is working.

10. Evidence Boundaries

This paper distinguishes between sourced findings, fielded examples, and synthesized operating recommendations. The distinction matters because the hybrid human-agent squad is not yet a mature industry standard — it is a proposed operating model grounded in emerging research, governance principles, and observed workflow patterns.

Source-grounded claims include measured results from specific studies (Co-Gym, METR, Workday, HBR/Stanford/BetterUp), vendor-documented product capabilities and controls, published governance framework requirements (NIST AI RMF), and documented multi-agent architecture patterns.

Synthesized recommendations include exact meeting cadences, staffing ratios, target thresholds, topology adoption sequencing, review-tax thresholds, handoff schema details, and pilot gate values. These should be treated as starting assumptions calibrated by the evidence, not as benchmarks.

Claims that require local validation include universal productivity benchmarks, exact vendor volume claims from secondary reporting, and claims that any single topology is industry-dominant. Any organization implementing this framework should baseline its own workflows before adopting the thresholds proposed here.

11. Closing: The Recursive Loop

The clearest fielded examples of the recursive pattern described in this paper come from large-scale customer-service platforms that have shipped a meta-agent — an AI agent whose sole job is to manage another customer-facing AI agent. In documented production deployments, a customer-facing agent handles millions of resolutions per week under explicit human-escalation rules (a Model B deployment at scale), while a supervisory meta-agent manages the front-line agent's prompts, escalations, and drift — a Model C layer supervising a Model B production system (VentureBeat, May 2026; Metaintro, May 2026).


Figure 8 · Recursive loop in a fielded production deployment. A Model C meta-agent supervises a Model B customer-facing agent. The same governance framework applies at every layer.

The loop is recursive in a specific sense: the team that builds the agent is itself a hybrid squad operating under the same OKRs and audit chain the agent enforces on customer interactions. The framework is not aspirational — it is the operating discipline of the organizations already shipping these systems.

Paper 2 in this series narrows the lens: a single topology applied to a single operational case (Network Operations Center), end-to-end, with the specific tool choices, audit fields, and OKR values that a NOC director would actually deploy. The framework here is the prerequisite. The case study is the proof.


Appendix A: Operational Reference

A.1 YAML Handoff Schema

    handoff:
      id: <ulid>
      from_actor: <agent_id | human_id>
      to_actor: <agent_id | human_id>
      timestamp: <iso8601>
      work_item_id: <system_of_record_id>
      risk_tier: <T0 | T1 | T2 | T3 | T4>
      situation:
        summary: <one-line state>
        severity: <low | medium | high | critical>
      background:
        workflow_id: <ulid>
        prior_actions: [<action_ref>]
        relevant_context: <free text, capped at 500 tokens>
        data_sources:
          timestamp: <iso8601>
          freshness: <fresh | stale | unknown>
      assessment:
        agent_confidence: <0.0..1.0>
        confidence_by_claim:
          confidence: <0.0..1.0>
        known_uncertainties: [<string>]
        detected_anomalies: [<string>]
        alternatives_considered: [<action_descriptor>]
      recommendation:
        proposed_next_action: <action_descriptor>
        rationale: <why>
        reversibility: <reversible | partially | irreversible>
        rollback_reference: <rollback_plan_or_none>
        human_decision_required: <bool>
      readback:
        receiver_acknowledgment: <bool>
        receiver_notes: <free text>

A.2 Audit Field List with NIST AI RMF Mapping

| Field | Description | NIST AI RMF Reference |
|-------|-------------|-----------------------|
| event_id | Unique trace identifier | MEASURE 2.4 |
| timestamp_utc | Sequence reconstruction | MEASURE 2.4 |
| work_item_id | Link to system of record | GOVERN 3.2 |
| actor_type | Human, agent, or supervisor agent | GOVERN 3.2 |
| actor_id | Identity of the executing actor | GOVERN 3.2 |
| agent_session_id | Traceable agent run | MEASURE 2.7 |
| action_type | Classify: query, draft, recommend, approve, execute, rollback | MAP 5.1 |
| risk_tier | T0–T4 authority class | MAP 5.1 |
| input_hash | Cryptographic hash of inputs at action time | MEASURE 2.4 |
| tool_calls | Ordered list of tools invoked with arguments | MEASURE 2.7 |
| output_hash | Cryptographic hash of produced output | MEASURE 2.4 |
| confidence_score | Agent's self-reported confidence at action time | MEASURE 2.9 |
| data_sources_accessed | Provenance of inputs consumed | MEASURE 2.7 |
| policy_version | Version of policy-as-code file in force | GOVERN 1.2 |
| policy_decision | Allow, block, escalate, or freeze | GOVERN 1.2 |
| approval_required | Whether human approval was needed | GOVERN 3.2 |
| reviewer_id | Identity of human reviewer (null if auto-approved) | GOVERN 3.2 |
| approval_outcome | Approved, rejected, or modified | GOVERN 3.2 |
| outcome | Succeeded, failed, reverted, or escalated | MEASURE 4.2 |
| reversibility | Reversibility class of the action taken | MAP 5.1 |
| override_reason | Human judgment capture when overriding agent | GOVERN 3.2 |
| quality_outcome | Downstream result classification | MEASURE 4.2 |
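To make the hash and timestamp fields concrete, the sketch below assembles a partial audit record using Python's standard library. It is an illustration only: the helper names, the canonical-JSON hashing convention, and the sample identifiers are assumptions, and only a subset of the fields in the table is populated.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(payload) -> str:
    """Hash a JSON-serializable payload so that key order does not matter."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_audit_record(event_id, actor_type, actor_id, action_type, risk_tier,
                       inputs, output, policy_version, policy_decision) -> dict:
    """Assemble a subset of the audit fields for one action."""
    return {
        "event_id": event_id,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "actor_type": actor_type,
        "actor_id": actor_id,
        "action_type": action_type,
        "risk_tier": risk_tier,
        "input_hash": sha256_hex(inputs),
        "output_hash": sha256_hex(output),
        "policy_version": policy_version,
        "policy_decision": policy_decision,
    }


# All identifiers and values below are invented for illustration.
record = build_audit_record(
    event_id="evt-0001", actor_type="agent", actor_id="agent-7",
    action_type="draft", risk_tier="T1",
    inputs={"ticket": 42, "prompt": "summarize"},
    output={"summary": "Customer reports login failure."},
    policy_version="v1", policy_decision="allow",
)
print(json.dumps(record, indent=2))
```

Hashing a canonical serialization lets an auditor confirm later that stored inputs and outputs match what the agent saw and produced, without keeping the raw content in the log itself.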

A.3 Phase Go/No-Go Checklists

Phase 1 → Phase 2 gate: workflow inventory complete; FADE/RISE binning applied; baseline cycle times captured; baseline error rates captured; Workslop Tax baseline captured; manager of record named.

Phase 2 → Phase 3 gate: agent personas defined; policy-as-code v1 written; sandboxes operational; audit schema deployed; handoff schema in use; rollback procedure tested at least once.

Phase 3 → Phase 4 gate: confidence calibration data collected across at least three task classes; rework rate measured in shadow mode; reviewer time-per-action measured; escalation triggers tested on at least one real edge case.

Phase 4 → Production gate: rework rate trending downward over a minimum two-week window; no policy-as-code violations in the trailing seven days; decision-point attribution distribution reviewed and accepted; QA/Governance Owner has signed off on the audit sample.
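The Phase 4 → Production gate can be expressed as a checklist function that returns whatever is still unmet. This is a hedged sketch: the `PilotStatus` fields and the criterion labels are hypothetical, and "trending downward" is interpreted here, as an assumption, to mean at least two weekly readings with each lower than the one before.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotStatus:
    weekly_rework_rates: list           # oldest first, one value per week
    policy_violations_last_7_days: int
    attribution_review_accepted: bool
    qa_signoff: bool


def production_gate(status: PilotStatus) -> list:
    """Return the unmet criteria; an empty list means Go."""
    unmet = []
    rates = status.weekly_rework_rates
    # Assumed reading of "trending downward": strictly decreasing week over week.
    if len(rates) < 2 or any(later >= earlier
                             for earlier, later in zip(rates, rates[1:])):
        unmet.append("rework_rate_not_trending_down")
    if status.policy_violations_last_7_days > 0:
        unmet.append("policy_violations_in_trailing_7_days")
    if not status.attribution_review_accepted:
        unmet.append("attribution_distribution_not_accepted")
    if not status.qa_signoff:
        unmet.append("qa_signoff_missing")
    return unmet


print(production_gate(PilotStatus([0.22, 0.18], 0, True, True)))   # []
print(production_gate(PilotStatus([0.18, 0.22], 1, True, False)))
```

Returning the list of unmet criteria, rather than a bare boolean, gives the squad manager a concrete remediation list at each No-Go.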

A.4 Source Reference Table

| Claim | Primary Source |
|-------|----------------|
| Co-Gym win rates 86% / 74% / 66% across three tasks | Shao et al., arXiv:2412.15701 |
| Co-Gym communication / situational awareness failure rates 65% / 40% | Shao et al., arXiv:2412.15701 |
| METR developer RCT: 19% slower, 24% predicted speedup, 20% post-study estimate | METR Blog · arXiv:2507.09089 |
| ~40% of AI time savings lost to rework | Workday Newsroom, Jan 2026 |
| Workslop per-instance economics: 1h56m / \$186 / 41% / \$9M | HBR, Sept 2025 |
| SupervisorAgent 29.68% token reduction on GAIA | Lin et al., arXiv:2510.26585 |
| Fielded meta-agent supervising customer-facing AI agent | VentureBeat, May 2026 · Metaintro, May 2026 |
| Supervisor / Tiered enterprise patterns | Databricks · IBM · Microsoft Azure |
| NIST AI Risk Management Framework | NIST AI RMF |
| I-PASS / SBAR handoff evidence | PMC Systematic Review, 2025 |