diff --git a/docs/test-coverage-analysis.md b/docs/test-coverage-analysis.md new file mode 100644 index 0000000..6f4776e --- /dev/null +++ b/docs/test-coverage-analysis.md @@ -0,0 +1,286 @@ +# Test Coverage Analysis — CouncilOS + +**Date:** 2026-02-20 +**Repository status:** Pre-code / architecture stage +**Coverage today:** 0% (no application code exists) + +This document analyses the planned architecture defined in `CLAUDE.md` and `README.md`, identifies every testable unit, and prioritises the areas that carry the most risk if left untested. + +--- + +## 1. Current State + +The repository contains only `README.md` and `CLAUDE.md`. No production code, test files, or tooling configuration exist yet. Every recommendation below is therefore forward-looking: it defines the test strategy that must accompany each development phase rather than patching gaps in existing coverage. + +--- + +## 2. Risk-Prioritised Coverage Areas + +### 2.1 `CouncilState` and State Management (CRITICAL) + +**Why critical:** `CouncilState` is the single source of truth passed between every agent. A bug here corrupts the entire pipeline silently. + +**What to test:** + +| Scenario | Type | +|---|---| +| `feedback_history` uses `operator.add` and appends across loop iterations — never overwrites | Unit | +| `messages` reducer accumulates correctly when agents return partial state | Unit | +| `route_decision` is always one of the allowed string literals (`"rework"`, `"approve"`, custom values) | Unit | +| Passing a state missing required fields raises a clear error at graph entry, not mid-run | Unit | +| A full loop (master → critic → rework → master → critic → approve) preserves all prior feedback in history | Integration | + +**File to test:** `backend/state.py` + +--- + +### 2.2 Routing Logic (CRITICAL) + +**Why critical:** The conditional edges are the core business logic. A routing bug sends documents into infinite loops or skips the critic entirely — the two worst failure modes for the product's value proposition. + +**What to test:** + +| Scenario | Type | +|---|---| +| Score < 8 → `route_decision = "rework"`, graph returns to master | Unit | +| Score ≥ 8 → `route_decision = "approve"`, graph advances to writer | Unit | +| Critic returns a non-numeric or malformed score — routing falls back safely (no crash) | Unit | +| Loop terminates after N iterations even if score never reaches threshold (guard against infinite loops) | Integration | +| Custom routing values defined in user blueprints are handled without exception | Unit | + +**File to test:** `backend/agents/` (routing functions), `backend/services/graph_builder.py` + +--- + +### 2.3 LangGraph Agent Node Functions (HIGH) + +Each agent node function must be independently testable with mocked LLM responses. + +**What to test per node:** + +| Scenario | Type | +|---|---| +| Node receives a valid `CouncilState` and returns a dict with only the keys it owns | Unit | +| Node appends to `messages`, never replaces the full list | Unit | +| Node correctly uses its configured system prompt (verify prompt is sent as system message) | Unit | +| Node handles an empty `current_draft` gracefully (first iteration edge case) | Unit | +| Critic node produces a numeric score in the expected range within `route_decision` output | Unit | +| Master node incorporates all items in `feedback_history` into its revised draft prompt | Unit | + +**Key rule from `CLAUDE.md`:** LLM calls must be mocked in CI — never make real API calls. + +```python +# Example mock pattern for every agent node test +from unittest.mock import patch, MagicMock + +@patch("backend.agents.master_agent.ChatAnthropic") +def test_master_agent_node_appends_to_messages(mock_llm): + mock_llm.return_value.invoke.return_value = MagicMock(content="Draft v2") + state = CouncilState( + input_topic="Write a blog post", + current_draft="Draft v1", + feedback_history=["Too short"], + route_decision="", + messages=[], + ) + result = master_agent_node(state) + assert len(result["messages"]) == 1 + assert result["current_draft"] == "Draft v2" +``` + +--- + +### 2.4 Dynamic Graph Builder (HIGH) + +**Why high:** From Phase 3 onward the graph is built at runtime from a user-supplied JSON blueprint. A parsing error or invalid edge definition produces a graph that silently misbehaves. + +**What to test:** + +| Scenario | Type | +|---|---| +| A valid blueprint JSON produces a compilable `StateGraph` | Unit | +| Blueprint with a conditional edge correctly wires the routing function | Unit | +| Blueprint with a cycle (A → B → A) compiles without flattening the cycle | Unit | +| Blueprint referencing a non-existent node ID raises a descriptive `ValueError`, not a Python `KeyError` | Unit | +| Blueprint with zero nodes raises a clear error | Unit | +| Blueprint with two nodes and no edges raises a clear error | Unit | +| A blueprint that was serialised from the frontend and round-tripped through PostgreSQL produces the same graph | Integration | + +**File to test:** `backend/services/graph_builder.py` + +--- + +### 2.5 Blueprint JSON Parser (HIGH — Frontend) + +**Why high:** The parser converts a React Flow canvas into the JSON that controls backend execution. Any data loss or structural error here means user-designed councils don't run as intended. + +**What to test:** + +| Scenario | Type | +|---|---| +| A two-node linear graph emits correct `nodes` + `edges` JSON | Unit | +| A conditional edge with a `condition` label is preserved in output | Unit | +| An isolated node (no edges) is flagged as a warning, not silently dropped | Unit | +| Node with empty `systemPrompt` field is preserved (validation happens at run time, not parse time) | Unit | +| Blueprint JSON includes a `version` field | Unit | +| Re-parsing a previously saved blueprint produces identical output (idempotency) | Unit | + +**File to test:** `frontend/src/utils/parser.ts` (or `.js`) + +--- + +### 2.6 FastAPI REST Endpoints (MEDIUM-HIGH) + +**What to test:** + +| Endpoint | Scenario | Type | +|---|---|---| +| `POST /api/councils/` | Creates a blueprint, returns 201 with ID | Integration | +| `POST /api/councils/` | Payload missing required fields returns 422 | Integration | +| `GET /api/councils/{id}` | Returns stored blueprint | Integration | +| `GET /api/councils/{id}` | Non-existent ID returns 404 | Integration | +| `PUT /api/councils/{id}` | Updates blueprint, bumps `version` field | Integration | +| `DELETE /api/councils/{id}` | Removes blueprint, subsequent GET returns 404 | Integration | +| All endpoints | Unauthenticated requests (when auth is added) return 401 | Integration | + +Use `httpx.AsyncClient` + `pytest-asyncio` for async FastAPI tests. Use a dedicated test database, never the production instance. + +--- + +### 2.7 WebSocket Agent Status Events (MEDIUM-HIGH) + +**Why this is risky:** WebSocket behaviour is easy to get right manually and easy to break silently. The frontend's live-highlighting feature depends entirely on these events arriving in the correct order. + +**What to test:** + +| Scenario | Type | +|---|---| +| Connecting to `/ws/council/{run_id}` returns a 101 upgrade | Integration | +| When LangGraph enters `master_agent_node`, the WebSocket emits `{"node": "master_agent", "status": "running"}` | Integration | +| When a node completes, a `"completed"` status event is emitted before the next node's `"running"` event | Integration | +| When the graph finishes, a `"done"` event is emitted with the final output | Integration | +| Client disconnect during a run does not crash the backend process | Integration | +| Two concurrent WebSocket sessions for different `run_id`s do not cross-contaminate events | Integration | + +--- + +### 2.8 Human-in-the-Loop / God Mode (MEDIUM) + +This is the most stateful and interactive code path and the hardest to test manually. + +**What to test:** + +| Scenario | Type | +|---|---| +| With `interrupt_before` configured, the graph pauses before the specified node | Integration | +| The paused state (including `current_draft` and `feedback_history`) is correctly surfaced to the frontend | Integration | +| Sending "approve" resumes execution and the graph continues from the paused node | Integration | +| Sending "reject" resumes execution with `route_decision = "rework"` | Integration | +| Modifying `current_draft` in the approval UI causes the next node to receive the modified content | Integration | +| Session timeout while paused: the run is marked as `timed_out`, not left in `paused` forever | Integration | + +--- + +### 2.9 External Tool Wrappers (MEDIUM) + +Tools must be tested in isolation with mocked external calls. + +**Tavily Web Search:** + +| Scenario | Type | +|---|---| +| Tool returns a list of results with `url` and `snippet` fields | Unit | +| Tavily API returns 429 (rate limit) — tool raises a retriable error | Unit | +| Tavily API is unreachable — tool raises a non-retriable error with a clear message | Unit | + +**PyPDF + Vector Store:** + +| Scenario | Type | +|---|---| +| A valid PDF is chunked and embedded without error | Unit | +| A password-protected PDF raises a clear error | Unit | +| A zero-byte file raises a clear error | Unit | +| Semantic search returns the top-K most relevant chunks | Unit | +| Vector store survives a restart and retrieves previously stored embeddings | Integration | + +--- + +### 2.10 React Flow Custom Node Components (MEDIUM — Frontend) + +**What to test:** + +| Scenario | Type | +|---|---| +| Rendering a node with no props does not throw | Unit | +| Changing the `systemPrompt` field in the settings panel updates the node data | Unit | +| Selecting a different LLM model updates the node's model field | Unit | +| Toggling "Web Search" on/off reflects in node data | Unit | +| Node label truncates at a defined character limit without overflowing the card UI | Unit | + +Use React Testing Library. Do not snapshot-test node styling — it changes too often and generates false negatives. + +--- + +## 3. Test Infrastructure Recommendations + +### Backend (`backend/`) + +``` +pytest +pytest-asyncio # async FastAPI route tests +httpx # async test client for FastAPI +pytest-cov # coverage reports +unittest.mock # LLM mocking (stdlib, no extra dep) +factory-boy # fixture factories for CouncilState +``` + +Minimum coverage target: **80% for `agents/`**, **90% for `state.py` and `services/graph_builder.py`**. + +Configuration in `pyproject.toml`: + +```toml +[tool.pytest.ini_options] +asyncio_mode = "auto" +testpaths = ["tests"] + +[tool.coverage.run] +omit = ["tests/*", "**/__init__.py"] + +[tool.coverage.report] +fail_under = 80 +``` + +### Frontend (`frontend/`) + +``` +vitest (or jest) +@testing-library/react +@testing-library/user-event +msw # Mock Service Worker for API/WebSocket mocking +``` + +--- + +## 4. What to Write First (Priority Order) + +Build tests in this sequence — it matches the development phases and catches the highest-risk bugs earliest: + +1. **`state.py` — `CouncilState` reducers** (write before any agent code) +2. **`agents/` — each node function** (write alongside each node implementation) +3. **Routing logic** (write before connecting the graph) +4. **`graph_builder.py`** (write alongside Phase 3 dynamic graph work) +5. **REST API endpoints** (write alongside Phase 2 blueprint persistence) +6. **Frontend parser** (write alongside Phase 2 React Flow work) +7. **WebSocket events** (write alongside Phase 3 integration work) +8. **God Mode / Human-in-the-Loop** (write alongside Phase 4) +9. **Tool wrappers** (write alongside Phase 4) + +--- + +## 5. What to Avoid + +- **Do not make real LLM API calls in any test.** This adds cost, latency, and flakiness. Always mock `ChatAnthropic` and `ChatOpenAI`. +- **Do not snapshot-test React Flow canvas state.** React Flow's internal node positions change non-deterministically and will make snapshot tests fail constantly. +- **Do not use a shared test database.** Each integration test that touches the database must run in a transaction that is rolled back after the test, or use a separate ephemeral schema. +- **Do not test the routing score threshold at exactly 8.** Always test both sides of the boundary (score = 7 → rework, score = 8 → approve, score = 9 → approve) to catch off-by-one errors in conditional logic. +- **Do not write end-to-end browser tests before Phase 3 is stable.** E2E tests on an unstable UI are expensive to maintain. Add Playwright/Cypress tests only once the API contract between frontend and backend is frozen.