An open-source spec for Codex orchestration: Symphony. | OpenAI

An open-source spec for Codex orchestration: Symphony

By Alex Kotliarskyi, Victor Zhu, and Zach Brock

Six months ago, while working on an internal productivity tool, our team made a controversial (at the time) decision: we’d build our repo with no human-written code. Every line in our project repository had to be generated by Codex.

To make that work, we redesigned our engineering workflow from the ground up. We built an agent-friendly repository, invested heavily in automated tests and guardrails, and treated Codex as a full-fledged teammate. We documented that journey in our previous blog post on harness engineering.

And it worked, but then we ran into the next bottleneck: context switching.

To solve this new problem, we built a system called _Symphony_. Symphony is an agent orchestrator that turns a project-management board like Linear into a control plane for coding agents. Every open task gets an agent, agents run continuously, and humans review the results.

This post explains how we created Symphony—resulting in a 500% increase in landed pull requests on some teams—and how to use it to turn your own issue tracker into an always-on agent orchestrator.

The ceiling of interactive coding agents

Even as they get easier to use, coding agents, whether accessed through web apps or a CLI, are still interactive tools.

As the scale of agentic work increased at OpenAI, we found a new kind of burden. Each engineer would open a few Codex sessions, assign tasks, review the output, steer the agent, and repeat. In practice, most people could comfortably manage three to five sessions at a time before context switching became painful. Beyond that, productivity dropped. We'd forget which session was doing what, jump between terminals to nudge agents back on track, and debug long-running tasks that stalled halfway through.

The agents were fast, but we had a system bottleneck: human attention. We had effectively built a team of extremely capable junior engineers, then assigned our human engineers to micromanaging them. That wasn’t going to scale.

A shift in perspective

We realized we were optimizing the wrong thing. We were orienting our system around coding sessions and merged PRs, when PRs and sessions are really a means to an end. Software workflows are largely organized around deliverables: issues, tasks, tickets, milestones.

So we asked ourselves what would happen if we stopped supervising agents directly and instead let them pull work from our task tracker.

That idea became Symphony, a written spec that functions as a supervisor to orchestrate agentic work.

Turning our issue tracker into an agent orchestrator

Symphony started with a simple concept: any open task should get picked up and completed by an agent. Instead of managing Codex sessions in multiple tabs, we made our issue tracker the control plane.

In this setup, each open Linear issue maps to a dedicated agent workspace. Symphony continuously watches the task board and ensures that every active task has an agent running in the loop until it’s done. If an agent crashes or stalls, Symphony restarts it. If new work appears, Symphony picks it up and starts organizing work.
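The loop described above can be sketched in a few lines. This is a toy illustration rather than Symphony's implementation: the `Supervisor` class, the status strings (`alive`, `crashed`, `stalled`), and the simulated agents are all assumptions made for the example, and the active states are borrowed from the defaults listed in the spec further down.

```python
from dataclasses import dataclass, field

# Assumed active states, lowercased for comparison.
ACTIVE_STATES = {"todo", "in progress"}


@dataclass
class Supervisor:
    """Toy supervisor that keeps one simulated agent per active issue."""

    running: dict = field(default_factory=dict)  # issue_id -> agent status
    started: list = field(default_factory=list)  # log of (issue_id, reason)

    def tick(self, issues: dict) -> None:
        """Reconcile agents against the board; issues maps issue_id -> state."""
        # Stop agents whose issue is gone or no longer in an active state.
        for issue_id in list(self.running):
            if issues.get(issue_id, "").lower() not in ACTIVE_STATES:
                del self.running[issue_id]
        # Start an agent for new work and restart crashed or stalled agents.
        for issue_id, state in issues.items():
            if state.lower() not in ACTIVE_STATES:
                continue
            status = self.running.get(issue_id)
            if status is None:
                self._start(issue_id, "new")
            elif status in ("crashed", "stalled"):
                self._start(issue_id, "restart")

    def _start(self, issue_id: str, reason: str) -> None:
        # A real runner would launch an agent process in a per-issue workspace.
        self.running[issue_id] = "alive"
        self.started.append((issue_id, reason))


sup = Supervisor()
sup.tick({"ABC-1": "Todo", "ABC-2": "Done"})
sup.running["ABC-1"] = "crashed"  # simulate an agent crash between ticks
sup.tick({"ABC-1": "In Progress", "ABC-3": "Todo"})
print(sup.started)
```

Running the same reconciliation on every tick means new issues, crashed agents, and finished issues are all handled by one code path.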

We built our workflow based on ticket statuses, using the task manager Linear as a state machine.

Image 1: Coding agents use Linear as a state machine to work alongside us.

In practice, Symphony decouples work from sessions and from pull requests. Some issues produce multiple PRs across repos; others are pure investigation or analysis that never touch the codebase.

> Once work is abstracted this way, tickets can represent much larger units of work.

We regularly use Symphony to orchestrate complex features and infrastructure migrations. For example, we might file a task asking the agent to analyze the codebase, Slack, or Notion and produce an implementation plan. Once we’re happy with the plan, the agent generates a tree of tasks, breaking the work into stages and defining dependencies between tasks.

Agents only start working on tasks that aren’t blocked, so execution unfolds naturally: independent tasks run in parallel, and dependent tasks wait, following the dependency graph of the work, a DAG (directed acyclic graph). For example, we marked the React upgrade as blocked on a migration to Vite. As expected, agents started upgrading React only after the migration to Vite was complete.
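To make the blocking rule concrete, here is a small sketch of dispatching work in dependency order. The function names and the `update-docs` task are hypothetical, and the sketch assumes a task counts as unblocked once every task it is blocked by has finished.

```python
def dispatchable(tasks: dict, done: set) -> list:
    """Return unfinished tasks whose blockers have all finished."""
    return sorted(
        name
        for name, blockers in tasks.items()
        if name not in done and all(b in done for b in blockers)
    )


def run_in_waves(tasks: dict) -> list:
    """Group tasks into waves; tasks within a wave can run in parallel."""
    done, waves = set(), []
    while len(done) < len(tasks):
        ready = dispatchable(tasks, done)
        if not ready:
            # Nothing can start, so the graph has a cycle or an unknown blocker.
            raise ValueError("dependency cycle or missing blocker")
        waves.append(ready)
        done.update(ready)  # pretend every task in the wave completed
    return waves


# Each task maps to the list of tasks it is blocked by.
tasks = {
    "migrate-to-vite": [],
    "upgrade-react": ["migrate-to-vite"],
    "update-docs": [],  # hypothetical unrelated task
}
waves = run_in_waves(tasks)
print(waves)
```

In this example the Vite migration and the unrelated task start together, and the React upgrade only becomes eligible in the second wave.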

Agents can also create work themselves. During implementation or review, they often notice improvements that fall outside the scope of the current task: a performance issue, a refactoring opportunity, or a better architecture. When that happens, they simply file a new issue that we can evaluate and schedule later—many of these follow-up tasks also get picked up by agents. While we oversee this process, agents stay organized and keep work moving forward.

This way of working dramatically reduces the cognitive cost of kicking off ambiguous work. If the agent gets something wrong, that’s still useful information, and the cost to us is near zero. We can very cheaply file tickets for the agent to go prototype and explore, and throw away any explorations we don’t like.

Because the orchestrator runs on devboxes and never sleeps, we can add tasks from anywhere and know an agent will pick them up. For instance, one engineer on our team made three significant changes from the Linear app on his phone, from a cozy cabin on shoddy wifi.

An increase in exploration from working this way

When observing the effects of working with Symphony, the most obvious change was output. Among some teams at OpenAI, we saw the number of landed PRs increase by 500% in the first three weeks. Outside of OpenAI, Linear founder Karri Saarinen highlighted a spike in workspaces created as we released Symphony. However, the deeper shift is how teams think about work.

When our engineers no longer spend time supervising Codex sessions, the economics of code changes completely. The perceived cost of each change drops because we’re no longer investing human effort in driving the implementation itself.

That changed our behavior. It's become trivial to spin up speculative tasks in Symphony. Try an idea, explore a refactor, test a hypothesis, and only keep the results that look promising.

It also broadens who can initiate work. Our product manager and designer can now file feature requests directly into Symphony. They don’t need to check out the repo or manage a Codex session. They describe the feature and get back a review packet that includes a video walkthrough of the feature working inside the real product.

Symphony also shines in large monorepos (like the one we have at OpenAI) where the last mile of landing a PR is slow and fragile. The system watches CI, rebases when needed, resolves conflicts, retries flaky checks, and generally shepherds changes through the pipeline. By the time a ticket reaches Merging, we have high confidence the change will make it into the main branch without human babysitting.

Image 2: Before and after grid of Symphony

After implementing Symphony, we delegate more work to agents and focus on harder, more exploratory tasks.

Progress comes with new, different problems

Operating at this level comes with tradeoffs. When we moved from steering agents interactively to assigning them work at the ticket level, we lost the ability to constantly nudge them mid-flight and course-correct when needed. Sometimes the agent produced something that completely missed the mark. That was useful—those failures revealed gaps in the system and helped us make it more robust.

Instead of patching the result manually, we added guardrails and skills so the agents could succeed the next time. Over time, this led us to add new capabilities to our harness, like running end-to-end tests, driving the app through Chrome DevTools, and managing QA smoke tests. We significantly improved our documentation and clarified what good looks like.

Not every task fits the Symphony style of work. Some problems still require engineers working directly with interactive Codex sessions, especially ambiguous problems or work that requires strong judgment and expertise. In practice, these are usually the most interesting and enjoyable tasks for our engineers to spend time on.

The difference is that Symphony can handle the bulk of routine implementation work. That lets engineers focus on a single hard problem at a time instead of constantly context-switching between smaller tasks.

We also learned that treating agents as rigid nodes in a state machine doesn’t work well. Models get smarter and can solve bigger problems than the box we try to fit them in. Our early versions of agentic work only asked Codex to implement the task. That approach proved too limiting. Codex is perfectly capable of creating multiple PRs as well as reading review feedback and addressing it. So we gave it tools (the `gh` CLI, skills to read CI logs, and so on), and now we can ask Codex to do more, like closing old PRs or pulling reports on completed vs. abandoned work. These types of tasks fell way outside the initial feature implementation box.

So we eventually moved toward giving agents _objectives_ instead of strict transitions, much like a good manager would assign a goal to a direct report on their team. The power of models comes from their ability to reason, so give them tools and context and let them cook.

Using Symphony to build Symphony

When you open the Symphony repository, the first thing you’ll notice is that Symphony is technically just a `SPEC.md` file: a definition of the problem and the intended solution. Rather than building a complex supervision system, we wrote down the problem and the intended solution, giving agents high-level steering.

Markdown

`1# Symphony Service Specification23Status: Draft v1 (language-agnostic)45Purpose: Define a service that orchestrates coding agents to get project work done.67## 1. Problem Statement89Symphony is a long-running automation service that continuously reads work from an issue tracker10(Linear in this specification version), creates an isolated workspace for each issue, and runs a11coding agent session for that issue inside the workspace.1213The service solves four operational problems:1415- It turns issue execution into a repeatable daemon workflow instead of manual scripts.16- It isolates agent execution in per-issue workspaces so agent commands run only inside per-issue17 workspace directories.18- It keeps the workflow policy in-repo (`WORKFLOW.md`) so teams version the agent prompt and runtime19 settings with their code.20- It provides enough observability to operate and debug multiple concurrent agent runs.2122Implementations are expected to document their trust and safety posture explicitly. This23specification does not require a single approval, sandbox, or operator-confirmation policy; some24implementations may target trusted environments with a high-trust configuration, while others may25require stricter approvals or sandboxing.2627Important boundary:2829- Symphony is a scheduler/runner and tracker reader.30- Ticket writes (state transitions, comments, PR links) are typically performed by the coding agent31 using tools available in the workflow/runtime environment.32- A successful run may end at a workflow-defined handoff state (for example `Human Review`), not33 necessarily `Done`.3435## 2. 
Goals and Non-Goals3637### 2.1 Goals3839- Poll the issue tracker on a fixed cadence and dispatch work with bounded concurrency.40- Maintain a single authoritative orchestrator state for dispatch, retries, and reconciliation.41- Create deterministic per-issue workspaces and preserve them across runs.42- Stop active runs when issue state changes make them ineligible.43- Recover from transient failures with exponential backoff.44- Load runtime behavior from a repository-owned `WORKFLOW.md` contract.45- Expose operator-visible observability (at minimum structured logs).46- Support restart recovery without requiring a persistent database.4748### 2.2 Non-Goals4950- Rich web UI or multi-tenant control plane.51- Prescribing a specific dashboard or terminal UI implementation.52- General-purpose workflow engine or distributed job scheduler.53- Built-in business logic for how to edit tickets, PRs, or comments. (That logic lives in the54 workflow prompt and agent tooling.)55- Mandating strong sandbox controls beyond what the coding agent and host OS provide.56- Mandating a single default approval, sandbox, or operator-confirmation posture for all57 implementations.5859## 3. System Overview6061### 3.1 Main Components62631. `Workflow Loader`64 - Reads `WORKFLOW.md`.65 - Parses YAML front matter and prompt body.66 - Returns `{config, prompt_template}`.67682. `Config Layer`69 - Exposes typed getters for workflow config values.70 - Applies defaults and environment variable indirection.71 - Performs validation used by the orchestrator before dispatch.72733. `Issue Tracker Client`74 - Fetches candidate issues in active states.75 - Fetches current states for specific issue IDs (reconciliation).76 - Fetches terminal-state issues during startup cleanup.77 - Normalizes tracker payloads into a stable issue model.78794. 
`Orchestrator`80 - Owns the poll tick.81 - Owns the in-memory runtime state.82 - Decides which issues to dispatch, retry, stop, or release.83 - Tracks session metrics and retry queue state.84855. `Workspace Manager`86 - Maps issue identifiers to workspace paths.87 - Ensures per-issue workspace directories exist.88 - Runs workspace lifecycle hooks.89 - Cleans workspaces for terminal issues.90916. `Agent Runner`92 - Creates workspace.93 - Builds prompt from issue + workflow template.94 - Launches the coding agent app-server client.95 - Streams agent updates back to the orchestrator.96977. `Status Surface` (optional)98 - Presents human-readable runtime status (for example terminal output, dashboard, or other99 operator-facing view).1001018. `Logging`102 - Emits structured runtime logs to one or more configured sinks.103104### 3.2 Abstraction Levels105106Symphony is easiest to port when kept in these layers:1071081. `Policy Layer` (repo-defined)109 - `WORKFLOW.md` prompt body.110 - Team-specific rules for ticket handling, validation, and handoff.1111122. `Configuration Layer` (typed getters)113 - Parses front matter into typed runtime settings.114 - Handles defaults, environment tokens, and path normalization.1151163. `Coordination Layer` (orchestrator)117 - Polling loop, issue eligibility, concurrency, retries, reconciliation.1181194. `Execution Layer` (workspace + agent subprocess)120 - Filesystem lifecycle, workspace preparation, coding-agent protocol.1211225. `Integration Layer` (Linear adapter)123 - API calls and normalization for tracker data.1241256. 
`Observability Layer` (logs + optional status surface)126 - Operator visibility into orchestrator and agent behavior.127128### 3.3 External Dependencies129130- Issue tracker API (Linear for `tracker.kind: linear` in this specification version).131- Local filesystem for workspaces and logs.132- Optional workspace population tooling (for example Git CLI, if used).133- Coding-agent executable that supports JSON-RPC-like app-server mode over stdio.134- Host environment authentication for the issue tracker and coding agent.135136## 4. Core Domain Model137138### 4.1 Entities139140#### 4.1.1 Issue141142Normalized issue record used by orchestration, prompt rendering, and observability output.143144Fields:145146- `id` (string)147 - Stable tracker-internal ID.148- `identifier` (string)149 - Human-readable ticket key (example: `ABC-123`).150- `title` (string)151- `description` (string or null)152- `priority` (integer or null)153 - Lower numbers are higher priority in dispatch sorting.154- `state` (string)155 - Current tracker state name.156- `branch_name` (string or null)157 - Tracker-provided branch metadata if available.158- `url` (string or null)159- `labels` (list of strings)160 - Normalized to lowercase.161- `blocked_by` (list of blocker refs)162 - Each blocker ref contains:163 - `id` (string or null)164 - `identifier` (string or null)165 - `state` (string or null)166- `created_at` (timestamp or null)167- `updated_at` (timestamp or null)168169#### 4.1.2 Workflow Definition170171Parsed `WORKFLOW.md` payload:172173- `config` (map)174 - YAML front matter root object.175- `prompt_template` (string)176 - Markdown body after front matter, trimmed.177178#### 4.1.3 Service Config (Typed View)179180Typed runtime values derived from `WorkflowDefinition.config` plus environment resolution.181182Examples:183184- poll interval185- workspace root186- active and terminal issue states187- concurrency limits188- coding-agent executable/args/timeouts189- workspace hooks190191#### 4.1.4 
Workspace192193Filesystem workspace assigned to one issue identifier.194195Fields (logical):196197- `path` (workspace path; current runtime typically uses absolute paths, but relative roots are198 possible if configured without path separators)199- `workspace_key` (sanitized issue identifier)200- `created_now` (boolean, used to gate `after_create` hook)201202#### 4.1.5 Run Attempt203204One execution attempt for one issue.205206Fields (logical):207208- `issue_id`209- `issue_identifier`210- `attempt` (integer or null, `null` for first run, `>=1` for retries/continuation)211- `workspace_path`212- `started_at`213- `status`214- `error` (optional)215216#### 4.1.6 Live Session (Agent Session Metadata)217218State tracked while a coding-agent subprocess is running.219220Fields:221222- `session_id` (string, `-`)223- `thread_id` (string)224- `turn_id` (string)225- `codex_app_server_pid` (string or null)226- `last_codex_event` (string/enum or null)227- `last_codex_timestamp` (timestamp or null)228- `last_codex_message` (summarized payload)229- `codex_input_tokens` (integer)230- `codex_output_tokens` (integer)231- `codex_total_tokens` (integer)232- `last_reported_input_tokens` (integer)233- `last_reported_output_tokens` (integer)234- `last_reported_total_tokens` (integer)235- `turn_count` (integer)236 - Number of coding-agent turns started within the current worker lifetime.237238#### 4.1.7 Retry Entry239240Scheduled retry state for an issue.241242Fields:243244- `issue_id`245- `identifier` (best-effort human ID for status surfaces/logs)246- `attempt` (integer, 1-based for retry queue)247- `due_at_ms` (monotonic clock timestamp)248- `timer_handle` (runtime-specific timer reference)249- `error` (string or null)250251#### 4.1.8 Orchestrator Runtime State252253Single authoritative in-memory state owned by the orchestrator.254255Fields:256257- `poll_interval_ms` (current effective poll interval)258- `max_concurrent_agents` (current effective global concurrency limit)259- `running` 
(map `issue_id -> running entry`)260- `claimed` (set of issue IDs reserved/running/retrying)261- `retry_attempts` (map `issue_id -> RetryEntry`)262- `completed` (set of issue IDs; bookkeeping only, not dispatch gating)263- `codex_totals` (aggregate tokens + runtime seconds)264- `codex_rate_limits` (latest rate-limit snapshot from agent events)265266### 4.2 Stable Identifiers and Normalization Rules267268- `Issue ID`269 - Use for tracker lookups and internal map keys.270- `Issue Identifier`271 - Use for human-readable logs and workspace naming.272- `Workspace Key`273 - Derive from `issue.identifier` by replacing any character not in `[A-Za-z0-9._-]` with `_`.274 - Use the sanitized value for the workspace directory name.275- `Normalized Issue State`276 - Compare states after `lowercase`.277- `Session ID`278 - Compose from coding-agent `thread_id` and `turn_id` as `-`.279280## 5. Workflow Specification (Repository Contract)281282### 5.1 File Discovery and Path Resolution283284Workflow file path precedence:2852861. Explicit application/runtime setting (set by CLI startup path).2872. 
Default: `WORKFLOW.md` in the current process working directory.288289Loader behavior:290291- If the file cannot be read, return `missing_workflow_file` error.292- The workflow file is expected to be repository-owned and version-controlled.293294### 5.2 File Format295296`WORKFLOW.md` is a Markdown file with optional YAML front matter.297298Design note:299300- `WORKFLOW.md` should be self-contained enough to describe and run different workflows (prompt,301 runtime settings, hooks, and tracker selection/config) without requiring out-of-band302 service-specific configuration.303304Parsing rules:305306- If file starts with `---`, parse lines until the next `---` as YAML front matter.307- Remaining lines become the prompt body.308- If front matter is absent, treat the entire file as prompt body and use an empty config map.309- YAML front matter must decode to a map/object; non-map YAML is an error.310- Prompt body is trimmed before use.311312Returned workflow object:313314- `config`: front matter root object (not nested under a `config` key).315- `prompt_template`: trimmed Markdown body.316317### 5.3 Front Matter Schema318319Top-level keys:320321- `tracker`322- `polling`323- `workspace`324- `hooks`325- `agent`326- `codex`327328Unknown keys should be ignored for forward compatibility.329330Note:331332- The workflow front matter is extensible. 
Optional extensions may define additional top-level keys333 (for example `server`) without changing the core schema above.334- Extensions should document their field schema, defaults, validation rules, and whether changes335 apply dynamically or require restart.336- Common extension: `server.port` (integer) enables the optional HTTP server described in Section337 13.7.338339#### 5.3.1 `tracker` (object)340341Fields:342343- `kind` (string)344 - Required for dispatch.345 - Current supported value: `linear`346- `endpoint` (string)347 - Default for `tracker.kind == "linear"`: `https://api.linear.app/graphql`348- `api_key` (string)349 - May be a literal token or `$VAR_NAME`.350 - Canonical environment variable for `tracker.kind == "linear"`: `LINEAR_API_KEY`.351 - If `$VAR_NAME` resolves to an empty string, treat the key as missing.352- `project_slug` (string)353 - Required for dispatch when `tracker.kind == "linear"`.354- `active_states` (list of strings)355 - Default: `Todo`, `In Progress`356- `terminal_states` (list of strings)357 - Default: `Closed`, `Cancelled`, `Canceled`, `Duplicate`, `Done`358359#### 5.3.2 `polling` (object)360361Fields:362363- `interval_ms` (integer or string integer)364 - Default: `30000`365 - Changes should be re-applied at runtime and affect future tick scheduling without restart.366367#### 5.3.3 `workspace` (object)368369Fields:370371- `root` (path string or `$VAR`)372 - Default: `/symphony_workspaces`373 - `~` and strings containing path separators are expanded.374 - Bare strings without path separators are preserved as-is (relative roots are allowed but375 discouraged).376377#### 5.3.4 `hooks` (object)378379Fields:380381- `after_create` (multiline shell script string, optional)382 - Runs only when a workspace directory is newly created.383 - Failure aborts workspace creation.384- `before_run` (multiline shell script string, optional)385 - Runs before each agent attempt after workspace preparation and before launching the coding386 
agent.387 - Failure aborts the current attempt.388- `after_run` (multiline shell script string, optional)389 - Runs after each agent attempt (success, failure, timeout, or cancellation) once the workspace390 exists.391 - Failure is logged but ignored.392- `before_remove` (multiline shell script string, optional)393 - Runs before workspace deletion if the directory exists.394 - Failure is logged but ignored; cleanup still proceeds.395- `timeout_ms` (integer, optional)396 - Default: `60000`397 - Applies to all workspace hooks.398 - Non-positive values should be treated as invalid and fall back to the default.399 - Changes should be re-applied at runtime for future hook executions.400401#### 5.3.5 `agent` (object)402403Fields:404405- `max_concurrent_agents` (integer or string integer)406 - Default: `10`407 - Changes should be re-applied at runtime and affect subsequent dispatch decisions.408- `max_retry_backoff_ms` (integer or string integer)409 - Default: `300000` (5 minutes)410 - Changes should be re-applied at runtime and affect future retry scheduling.411- `max_concurrent_agents_by_state` (map `state_name -> positive integer`)412 - Default: empty map.413 - State keys are normalized (`lowercase`) for lookup.414 - Invalid entries (non-positive or non-numeric) are ignored.415416#### 5.3.6 `codex` (object)417418Fields:419420For Codex-owned config values such as `approval_policy`, `thread_sandbox`, and421`turn_sandbox_policy`, supported values are defined by the targeted Codex app-server version.422Implementors should treat them as pass-through Codex config values rather than relying on a423hand-maintained enum in this spec. To inspect the installed Codex schema, run424`codex app-server generate-json-schema --out

` and inspect the relevant definitions referenced425by `v2/ThreadStartParams.json` and `v2/TurnStartParams.json`. Implementations may validate these426fields locally if they want stricter startup checks.427428- `command` (string shell command)429 - Default: `codex app-server`430 - The runtime launches this command via `bash -lc` in the workspace directory.431 - The launched process must speak a compatible app-server protocol over stdio.432- `approval_policy` (Codex `AskForApproval` value)433 - Default: implementation-defined.434- `thread_sandbox` (Codex `SandboxMode` value)435 - Default: implementation-defined.436- `turn_sandbox_policy` (Codex `SandboxPolicy` value)437 - Default: implementation-defined.438- `turn_timeout_ms` (integer)439 - Default: `3600000` (1 hour)440- `read_timeout_ms` (integer)441 - Default: `5000`442- `stall_timeout_ms` (integer)443 - Default: `300000` (5 minutes)444 - If `<= 0`, stall detection is disabled.445446### 5.4 Prompt Template Contract447448The Markdown body of `WORKFLOW.md` is the per-issue prompt template.449450Rendering requirements:451452- Use a strict template engine (Liquid-compatible semantics are sufficient).453- Unknown variables must fail rendering.454- Unknown filters must fail rendering.455456Template input variables:457458- `issue` (object)459 - Includes all normalized issue fields, including labels and blockers.460- `attempt` (integer or null)461 - `null`/absent on first attempt.462 - Integer on retry or continuation run.463464Fallback prompt behavior:465466- If the workflow prompt body is empty, the runtime may use a minimal default prompt467 (`You are working on an issue from Linear.`).468- Workflow file read/parse failures are configuration/validation errors and should not silently fall469 back to a prompt.470471### 5.5 Workflow Validation and Error Surface472473Error classes:474475- `missing_workflow_file`476- `workflow_parse_error`477- `workflow_front_matter_not_a_map`478- `template_parse_error` (during prompt 
rendering)479- `template_render_error` (unknown variable/filter, invalid interpolation)480481Dispatch gating behavior:482483- Workflow file read/YAML errors block new dispatches until fixed.484- Template errors fail only the affected run attempt.485486## 6. Configuration Specification487488### 6.1 Source Precedence and Resolution Semantics489490Configuration precedence:4914921. Workflow file path selection (runtime setting -> cwd default).4932. YAML front matter values.4943. Environment indirection via `$VAR_NAME` inside selected YAML values.4954. Built-in defaults.496497Value coercion semantics:498499- Path/command fields support:500 - `~` home expansion501 - `$VAR` expansion for env-backed path values502 - Apply expansion only to values intended to be local filesystem paths; do not rewrite URIs or503 arbitrary shell command strings.504505### 6.2 Dynamic Reload Semantics506507Dynamic reload is required:508509- The software should watch `WORKFLOW.md` for changes.510- On change, it should re-read and re-apply workflow config and prompt template without restart.511- The software should attempt to adjust live behavior to the new config (for example polling512 cadence, concurrency limits, active/terminal states, codex settings, workspace paths/hooks, and513 prompt content for future runs).514- Reloaded config applies to future dispatch, retry scheduling, reconciliation decisions, hook515 execution, and agent launches.516- Implementations are not required to restart in-flight agent sessions automatically when config517 changes.518- Extensions that manage their own listeners/resources (for example an HTTP server port change) may519 require restart unless the implementation explicitly supports live rebind.520- Implementations should also re-validate/reload defensively during runtime operations (for example521 before dispatch) in case filesystem watch events are missed.522- Invalid reloads should not crash the service; keep operating with the last known good effective523 
configuration and emit an operator-visible error.524525### 6.3 Dispatch Preflight Validation526527This validation is a scheduler preflight run before attempting to dispatch new work. It validates528the workflow/config needed to poll and launch workers, not a full audit of all possible workflow529behavior.530531Startup validation:532533- Validate configuration before starting the scheduling loop.534- If startup validation fails, fail startup and emit an operator-visible error.535536Per-tick dispatch validation:537538- Re-validate before each dispatch cycle.539- If validation fails, skip dispatch for that tick, keep reconciliation active, and emit an540 operator-visible error.541542Validation checks:543544- Workflow file can be loaded and parsed.545- `tracker.kind` is present and supported.546- `tracker.api_key` is present after `$` resolution.547- `tracker.project_slug` is present when required by the selected tracker kind.548- `codex.command` is present and non-empty.549550### 6.4 Config Fields Summary (Cheat Sheet)551552This section is intentionally redundant so a coding agent can implement the config layer quickly.553554- `tracker.kind`: string, required, currently `linear`555- `tracker.endpoint`: string, default `https://api.linear.app/graphql` when `tracker.kind=linear`556- `tracker.api_key`: string or `$VAR`, canonical env `LINEAR_API_KEY` when `tracker.kind=linear`557- `tracker.project_slug`: string, required when `tracker.kind=linear`558- `tracker.active_states`: list of strings, default `["Todo", "In Progress"]`559- `tracker.terminal_states`: list of strings, default `["Closed", "Cancelled", "Canceled", "Duplicate", "Done"]`560- `polling.interval_ms`: integer, default `30000`561- `workspace.root`: path, default `/symphony_workspaces`562- `worker.ssh_hosts` (extension): list of SSH host strings, optional; when omitted, work runs563 locally564- `worker.max_concurrent_agents_per_host` (extension): positive integer, optional; shared per-host565 cap applied 
  across configured SSH hosts
- `hooks.after_create`: shell script or null
- `hooks.before_run`: shell script or null
- `hooks.after_run`: shell script or null
- `hooks.before_remove`: shell script or null
- `hooks.timeout_ms`: integer, default `60000`
- `agent.max_concurrent_agents`: integer, default `10`
- `agent.max_turns`: integer, default `20`
- `agent.max_retry_backoff_ms`: integer, default `300000` (5m)
- `agent.max_concurrent_agents_by_state`: map of positive integers, default `{}`
- `codex.command`: shell command string, default `codex app-server`
- `codex.approval_policy`: Codex `AskForApproval` value, default implementation-defined
- `codex.thread_sandbox`: Codex `SandboxMode` value, default implementation-defined
- `codex.turn_sandbox_policy`: Codex `SandboxPolicy` value, default implementation-defined
- `codex.turn_timeout_ms`: integer, default `3600000`
- `codex.read_timeout_ms`: integer, default `5000`
- `codex.stall_timeout_ms`: integer, default `300000`
- `server.port` (extension): integer, optional; enables the optional HTTP server, `0` may be used
  for ephemeral local bind, and CLI `--port` overrides it

## 7. Orchestration State Machine

The orchestrator is the only component that mutates scheduling state. All worker outcomes are
reported back to it and converted into explicit state transitions.

### 7.1 Issue Orchestration States

This is not the same as tracker states (`Todo`, `In Progress`, etc.). This is the service's internal
claim state.

1. `Unclaimed`
   - Issue is not running and has no retry scheduled.

2. `Claimed`
   - Orchestrator has reserved the issue to prevent duplicate dispatch.
   - In practice, claimed issues are either `Running` or `RetryQueued`.

3. `Running`
   - Worker task exists and the issue is tracked in `running` map.

4. `RetryQueued`
   - Worker is not running, but a retry timer exists in `retry_attempts`.

5. `Released`
   - Claim removed because issue is terminal, non-active, missing, or retry path completed without
     re-dispatch.

Important nuance:

- A successful worker exit does not mean the issue is done forever.
- The worker may continue through multiple back-to-back coding-agent turns before it exits.
- After each normal turn completion, the worker re-checks the tracker issue state.
- If the issue is still in an active state, the worker should start another turn on the same live
  coding-agent thread in the same workspace, up to `agent.max_turns`.
- The first turn should use the full rendered task prompt.
- Continuation turns should send only continuation guidance to the existing thread, not resend the
  original task prompt that is already present in thread history.
- Once the worker exits normally, the orchestrator still schedules a short continuation retry
  (about 1 second) so it can re-check whether the issue remains active and needs another worker
  session.

### 7.2 Run Attempt Lifecycle

A run attempt transitions through these phases:

1. `PreparingWorkspace`
2. `BuildingPrompt`
3. `LaunchingAgentProcess`
4. `InitializingSession`
5. `StreamingTurn`
6. `Finishing`
7. `Succeeded`
8. `Failed`
9. `TimedOut`
10. `Stalled`
11. `CanceledByReconciliation`

Distinct terminal reasons are important because retry logic and logs differ.

### 7.3 Transition Triggers

- `Poll Tick`
  - Reconcile active runs.
  - Validate config.
  - Fetch candidate issues.
  - Dispatch until slots are exhausted.

- `Worker Exit (normal)`
  - Remove running entry.
  - Update aggregate runtime totals.
  - Schedule continuation retry (attempt `1`) after the worker exhausts or finishes its in-process
    turn loop.

- `Worker Exit (abnormal)`
  - Remove running entry.
  - Update aggregate runtime totals.
  - Schedule exponential-backoff retry.

- `Codex Update Event`
  - Update live session fields, token counters, and rate limits.

- `Retry Timer Fired`
  - Re-fetch active candidates and attempt re-dispatch, or release claim if no longer eligible.

- `Reconciliation State Refresh`
  - Stop runs whose issue states are terminal or no longer active.

- `Stall Timeout`
  - Kill worker and schedule retry.

### 7.4 Idempotency and Recovery Rules

- The orchestrator serializes state mutations through one authority to avoid duplicate dispatch.
- `claimed` and `running` checks are required before launching any worker.
- Reconciliation runs before dispatch on every tick.
- Restart recovery is tracker-driven and filesystem-driven (no durable orchestrator DB required).
- Startup terminal cleanup removes stale workspaces for issues already in terminal states.

## 8. Polling, Scheduling, and Reconciliation

### 8.1 Poll Loop

At startup, the service validates config, performs startup cleanup, schedules an immediate tick, and
then repeats every `polling.interval_ms`.

The effective poll interval should be updated when workflow config changes are re-applied.

Tick sequence:

1. Reconcile running issues.
2. Run dispatch preflight validation.
3. Fetch candidate issues from tracker using active states.
4. Sort issues by dispatch priority.
5. Dispatch eligible issues while slots remain.
6. Notify observability/status consumers of state changes.

If per-tick validation fails, dispatch is skipped for that tick, but reconciliation still happens
first.

### 8.2 Candidate Selection Rules

An issue is dispatch-eligible only if all are true:

- It has `id`, `identifier`, `title`, and `state`.
- Its state is in `active_states` and not in `terminal_states`.
- It is not already in `running`.
- It is not already in `claimed`.
- Global concurrency slots are available.
- Per-state concurrency slots are available.
- Blocker rule for `Todo` state passes:
  - If the issue state is `Todo`, do not dispatch when any blocker is non-terminal.

Sorting order (stable intent):

1. `priority` ascending (1..4 are preferred; null/unknown sorts last)
2. `created_at` oldest first
3. `identifier` lexicographic tie-breaker

### 8.3 Concurrency Control

Global limit:

- `available_slots = max(max_concurrent_agents - running_count, 0)`

Per-state limit:

- `max_concurrent_agents_by_state[state]` if present (state key normalized)
- otherwise fallback to global limit

The runtime counts issues by their current tracked state in the `running` map.

Optional SSH host limit:

- When `worker.max_concurrent_agents_per_host` is set, each configured SSH host may run at most
  that many concurrent agents at once.
- Hosts at that cap are skipped for new dispatch until capacity frees up.

### 8.4 Retry and Backoff

Retry entry creation:

- Cancel any existing retry timer for the same issue.
- Store `attempt`, `identifier`, `error`, `due_at_ms`, and new timer handle.

Backoff formula:

- Normal continuation retries after a clean worker exit use a short fixed delay of `1000` ms.
- Failure-driven retries use `delay = min(10000 * 2^(attempt - 1), agent.max_retry_backoff_ms)`.
- Power is capped by the
  configured max retry backoff (default `300000` / 5m).

Retry handling behavior:

1. Fetch active candidate issues (not all issues).
2. Find the specific issue by `issue_id`.
3. If not found, release claim.
4. If found and still candidate-eligible:
   - Dispatch if slots are available.
   - Otherwise requeue with error `no available orchestrator slots`.
5. If found but no longer active, release claim.

Note:

- Terminal-state workspace cleanup is handled by startup cleanup and active-run reconciliation
  (including terminal transitions for currently running issues).
- Retry handling mainly operates on active candidates and releases claims when the issue is absent,
  rather than performing terminal cleanup itself.

### 8.5 Active Run Reconciliation

Reconciliation runs every tick and has two parts.

Part A: Stall detection

- For each running issue, compute `elapsed_ms` since:
  - `last_codex_timestamp` if any event has been seen, else
  - `started_at`
- If `elapsed_ms > codex.stall_timeout_ms`, terminate the worker and queue a retry.
- If `stall_timeout_ms <= 0`, skip stall detection entirely.

Part B: Tracker state refresh

- Fetch current issue states for all running issue IDs.
- For each running issue:
  - If tracker state is terminal: terminate worker and clean workspace.
  - If tracker state is still active: update the in-memory issue snapshot.
  - If tracker state is neither active nor terminal: terminate worker without workspace cleanup.
- If state refresh fails, keep workers running and try again on the next tick.

### 8.6 Startup Terminal Workspace Cleanup

When the service starts:

1. Query tracker for issues in terminal states.
2. For each returned issue identifier, remove the corresponding workspace directory.
3. If the terminal-issues fetch fails, log a warning and continue startup.

This prevents stale terminal workspaces from accumulating after restarts.

## 9. Workspace Management and Safety

### 9.1 Workspace Layout

Workspace root:

- `workspace.root` (normalized path; the current config layer expands path-like values and preserves
  bare relative names)

Per-issue workspace path:

- `/`

Workspace persistence:

- Workspaces are reused across runs for the same issue.
- Successful runs do not auto-delete workspaces.

### 9.2 Workspace Creation and Reuse

Input: `issue.identifier`

Algorithm summary:

1. Sanitize identifier to `workspace_key`.
2. Compute workspace path under workspace root.
3. Ensure the workspace path exists as a directory.
4. Mark `created_now=true` only if the directory was created during this call; otherwise
   `created_now=false`.
5. If `created_now=true`, run `after_create` hook if configured.

Notes:

- This section does not assume any specific repository/VCS workflow.
- Workspace preparation beyond directory creation (for example dependency bootstrap, checkout/sync,
  code generation) is implementation-defined and is typically handled via hooks.

### 9.3 Optional Workspace Population (Implementation-Defined)

The spec does not require any built-in VCS or repository bootstrap behavior.

Implementations may populate or synchronize the workspace using implementation-defined logic and/or
hooks (for example `after_create` and/or `before_run`).

Failure handling:

- Workspace population/synchronization failures return an error for the current attempt.
- If failure happens while creating a brand-new workspace, implementations may remove the partially
  prepared directory.
- Reused workspaces should not be destructively reset on population failure unless that policy is
  explicitly chosen and documented.

### 9.4 Workspace Hooks

Supported hooks:

- `hooks.after_create`
- `hooks.before_run`
- `hooks.after_run`
- `hooks.before_remove`

Execution contract:

- Execute in a local shell
  context appropriate to the host OS, with the workspace directory as
  `cwd`.
- On POSIX systems, `sh -lc