Maps every testable component in the planned architecture to concrete test cases, prioritised by risk. Covers CouncilState reducers, routing logic, agent node functions, dynamic graph builder, REST API, WebSocket events, God Mode, tool wrappers, and the React Flow blueprint parser. Includes tooling recommendations and a sequenced build order. https://claude.ai/code/session_01Dexzo7FAbhU5fMePHGVgRP
12 KiB
Test Coverage Analysis — CouncilOS
Date: 2026-02-20 Repository status: Pre-code / architecture stage Coverage today: 0% (no application code exists)
This document analyses the planned architecture defined in CLAUDE.md and README.md, identifies every testable unit, and prioritises the areas that carry the most risk if left untested.
1. Current State
The repository contains only README.md and CLAUDE.md. No production code, test files, or tooling configuration exist yet. Every recommendation below is therefore forward-looking: it defines the test strategy that must accompany each development phase rather than patching gaps in existing coverage.
2. Risk-Prioritised Coverage Areas
2.1 CouncilState and State Management (CRITICAL)
Why critical: CouncilState is the single source of truth passed between every agent. A bug here corrupts the entire pipeline silently.
What to test:
| Scenario | Type |
|---|---|
feedback_history uses operator.add and appends across loop iterations — never overwrites |
Unit |
messages reducer accumulates correctly when agents return partial state |
Unit |
route_decision is always one of the allowed string literals ("rework", "approve", custom values) |
Unit |
| Passing a state missing required fields raises a clear error at graph entry, not mid-run | Unit |
| A full loop (master → critic → rework → master → critic → approve) preserves all prior feedback in history | Integration |
File to test: backend/state.py
2.2 Routing Logic (CRITICAL)
Why critical: The conditional edges are the core business logic. A routing bug sends documents into infinite loops or skips the critic entirely — the two worst failure modes for the product's value proposition.
What to test:
| Scenario | Type |
|---|---|
Score < 8 → route_decision = "rework", graph returns to master |
Unit |
Score ≥ 8 → route_decision = "approve", graph advances to writer |
Unit |
| Critic returns a non-numeric or malformed score — routing falls back safely (no crash) | Unit |
| Loop terminates after N iterations even if score never reaches threshold (guard against infinite loops) | Integration |
| Custom routing values defined in user blueprints are handled without exception | Unit |
File to test: backend/agents/ (routing functions), backend/services/graph_builder.py
2.3 LangGraph Agent Node Functions (HIGH)
Each agent node function must be independently testable with mocked LLM responses.
What to test per node:
| Scenario | Type |
|---|---|
Node receives a valid CouncilState and returns a dict with only the keys it owns |
Unit |
Node appends to messages, never replaces the full list |
Unit |
| Node correctly uses its configured system prompt (verify prompt is sent as system message) | Unit |
Node handles an empty current_draft gracefully (first iteration edge case) |
Unit |
Critic node produces a numeric score in the expected range within route_decision output |
Unit |
Master node incorporates all items in feedback_history into its revised draft prompt |
Unit |
Key rule from CLAUDE.md: LLM calls must be mocked in CI — never make real API calls.
# Example mock pattern for every agent node test
from unittest.mock import patch, MagicMock
@patch("backend.agents.master_agent.ChatAnthropic")
def test_master_agent_node_appends_to_messages(mock_llm):
mock_llm.return_value.invoke.return_value = MagicMock(content="Draft v2")
state = CouncilState(
input_topic="Write a blog post",
current_draft="Draft v1",
feedback_history=["Too short"],
route_decision="",
messages=[],
)
result = master_agent_node(state)
assert len(result["messages"]) == 1
assert result["current_draft"] == "Draft v2"
2.4 Dynamic Graph Builder (HIGH)
Why high: From Phase 3 onward the graph is built at runtime from a user-supplied JSON blueprint. A parsing error or invalid edge definition produces a graph that silently misbehaves.
What to test:
| Scenario | Type |
|---|---|
A valid blueprint JSON produces a compilable StateGraph |
Unit |
| Blueprint with a conditional edge correctly wires the routing function | Unit |
| Blueprint with a cycle (A → B → A) compiles without flattening the cycle | Unit |
Blueprint referencing a non-existent node ID raises a descriptive ValueError, not a Python KeyError |
Unit |
| Blueprint with zero nodes raises a clear error | Unit |
| Blueprint with two nodes and no edges raises a clear error | Unit |
| A blueprint that was serialised from the frontend and round-tripped through PostgreSQL produces the same graph | Integration |
File to test: backend/services/graph_builder.py
2.5 Blueprint JSON Parser (HIGH — Frontend)
Why high: The parser converts a React Flow canvas into the JSON that controls backend execution. Any data loss or structural error here means user-designed councils don't run as intended.
What to test:
| Scenario | Type |
|---|---|
A two-node linear graph emits correct nodes + edges JSON |
Unit |
A conditional edge with a condition label is preserved in output |
Unit |
| An isolated node (no edges) is flagged as a warning, not silently dropped | Unit |
Node with empty systemPrompt field is preserved (validation happens at run time, not parse time) |
Unit |
Blueprint JSON includes a version field |
Unit |
| Re-parsing a previously saved blueprint produces identical output (idempotency) | Unit |
File to test: frontend/src/utils/parser.ts (or .js)
2.6 FastAPI REST Endpoints (MEDIUM-HIGH)
What to test:
| Endpoint | Scenario | Type |
|---|---|---|
POST /api/councils/ |
Creates a blueprint, returns 201 with ID | Integration |
POST /api/councils/ |
Payload missing required fields returns 422 | Integration |
GET /api/councils/{id} |
Returns stored blueprint | Integration |
GET /api/councils/{id} |
Non-existent ID returns 404 | Integration |
PUT /api/councils/{id} |
Updates blueprint, bumps version field |
Integration |
DELETE /api/councils/{id} |
Removes blueprint, subsequent GET returns 404 | Integration |
| All endpoints | Unauthenticated requests (when auth is added) return 401 | Integration |
Use httpx.AsyncClient + pytest-asyncio for async FastAPI tests. Use a dedicated test database, never the production instance.
2.7 WebSocket Agent Status Events (MEDIUM-HIGH)
Why this is risky: WebSocket behaviour is easy to get right manually and easy to break silently. The frontend's live-highlighting feature depends entirely on these events arriving in the correct order.
What to test:
| Scenario | Type |
|---|---|
Connecting to /ws/council/{run_id} returns a 101 upgrade |
Integration |
When LangGraph enters master_agent_node, the WebSocket emits {"node": "master_agent", "status": "running"} |
Integration |
When a node completes, a "completed" status event is emitted before the next node's "running" event |
Integration |
When the graph finishes, a "done" event is emitted with the final output |
Integration |
| Client disconnect during a run does not crash the backend process | Integration |
Two concurrent WebSocket sessions for different run_ids do not cross-contaminate events |
Integration |
2.8 Human-in-the-Loop / God Mode (MEDIUM)
This is the most stateful and interactive code path and the hardest to test manually.
What to test:
| Scenario | Type |
|---|---|
With interrupt_before configured, the graph pauses before the specified node |
Integration |
The paused state (including current_draft and feedback_history) is correctly surfaced to the frontend |
Integration |
| Sending "approve" resumes execution and the graph continues from the paused node | Integration |
Sending "reject" resumes execution with route_decision = "rework" |
Integration |
Modifying current_draft in the approval UI causes the next node to receive the modified content |
Integration |
Session timeout while paused: the run is marked as timed_out, not left in paused forever |
Integration |
2.9 External Tool Wrappers (MEDIUM)
Tools must be tested in isolation with mocked external calls.
Tavily Web Search:
| Scenario | Type |
|---|---|
Tool returns a list of results with url and snippet fields |
Unit |
| Tavily API returns 429 (rate limit) — tool raises a retriable error | Unit |
| Tavily API is unreachable — tool raises a non-retriable error with a clear message | Unit |
PyPDF + Vector Store:
| Scenario | Type |
|---|---|
| A valid PDF is chunked and embedded without error | Unit |
| A password-protected PDF raises a clear error | Unit |
| A zero-byte file raises a clear error | Unit |
| Semantic search returns the top-K most relevant chunks | Unit |
| Vector store survives a restart and retrieves previously stored embeddings | Integration |
2.10 React Flow Custom Node Components (MEDIUM — Frontend)
What to test:
| Scenario | Type |
|---|---|
| Rendering a node with no props does not throw | Unit |
Changing the systemPrompt field in the settings panel updates the node data |
Unit |
| Selecting a different LLM model updates the node's model field | Unit |
| Toggling "Web Search" on/off reflects in node data | Unit |
| Node label truncates at a defined character limit without overflowing the card UI | Unit |
Use React Testing Library. Do not snapshot-test node styling — it changes too often and generates false negatives.
3. Test Infrastructure Recommendations
Backend (backend/)
pytest
pytest-asyncio # async FastAPI route tests
httpx # async test client for FastAPI
pytest-cov # coverage reports
unittest.mock # LLM mocking (stdlib, no extra dep)
factory-boy # fixture factories for CouncilState
Minimum coverage target: 80% for agents/, 90% for state.py and services/graph_builder.py.
Configuration in pyproject.toml:
[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]
[tool.coverage.run]
omit = ["tests/*", "**/__init__.py"]
[tool.coverage.report]
fail_under = 80
Frontend (frontend/)
vitest (or jest)
@testing-library/react
@testing-library/user-event
msw # Mock Service Worker for API/WebSocket mocking
4. What to Write First (Priority Order)
Build tests in this sequence — it matches the development phases and catches the highest-risk bugs earliest:
state.py—CouncilStatereducers (write before any agent code)agents/— each node function (write alongside each node implementation)- Routing logic (write before connecting the graph)
graph_builder.py(write alongside Phase 3 dynamic graph work)- REST API endpoints (write alongside Phase 2 blueprint persistence)
- Frontend parser (write alongside Phase 2 React Flow work)
- WebSocket events (write alongside Phase 3 integration work)
- God Mode / Human-in-the-Loop (write alongside Phase 4)
- Tool wrappers (write alongside Phase 4)
5. What to Avoid
- Do not make real LLM API calls in any test. This adds cost, latency, and flakiness. Always mock
ChatAnthropicandChatOpenAI. - Do not snapshot-test React Flow canvas state. React Flow's internal node positions change non-deterministically and will make snapshot tests fail constantly.
- Do not use a shared test database. Each integration test that touches the database must run in a transaction that is rolled back after the test, or use a separate ephemeral schema.
- Do not test the routing score threshold at exactly 8. Always test both sides of the boundary (score = 7 → rework, score = 8 → approve, score = 9 → approve) to catch off-by-one errors in conditional logic.
- Do not write end-to-end browser tests before Phase 3 is stable. E2E tests on an unstable UI are expensive to maintain. Add Playwright/Cypress tests only once the API contract between frontend and backend is frozen.