Merge pull request #3 from Kenearos/claude/analyze-test-coverage-tXWWZ

Add test coverage analysis and strategy document
Kenearos 2026-02-20 17:26:44 +01:00 committed by GitHub
commit 34dcfb3dcd


@ -0,0 +1,286 @@
# Test Coverage Analysis — CouncilOS
**Date:** 2026-02-20
**Repository status:** Pre-code / architecture stage
**Coverage today:** 0% (no application code exists)
This document analyses the planned architecture defined in `CLAUDE.md` and `README.md`, identifies every testable unit, and prioritises the areas that carry the most risk if left untested.
---
## 1. Current State
The repository contains only `README.md` and `CLAUDE.md`. No production code, test files, or tooling configuration exist yet. Every recommendation below is therefore forward-looking: it defines the test strategy that must accompany each development phase rather than patching gaps in existing coverage.
---
## 2. Risk-Prioritised Coverage Areas
### 2.1 `CouncilState` and State Management (CRITICAL)
**Why critical:** `CouncilState` is the single source of truth passed between every agent. A bug here corrupts the entire pipeline silently.
**What to test:**
| Scenario | Type |
|---|---|
| `feedback_history` uses `operator.add` and appends across loop iterations — never overwrites | Unit |
| `messages` reducer accumulates correctly when agents return partial state | Unit |
| `route_decision` is always one of the allowed string literals (`"rework"`, `"approve"`, custom values) | Unit |
| Passing a state missing required fields raises a clear error at graph entry, not mid-run | Unit |
| A full loop (master → critic → rework → master → critic → approve) preserves all prior feedback in history | Integration |
**File to test:** `backend/state.py`
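The reducer scenarios above can be pinned down before any agent exists. The following is a minimal sketch, not the project's implementation: the field names come from this document, the `Annotated` reducer declarations are assumed, and `apply_update` is a hypothetical stand-in for the merge step the graph runtime performs, so the example runs on the standard library alone.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class CouncilState(TypedDict):
    # Field names are from this document; the annotations are assumptions.
    input_topic: str
    current_draft: str
    feedback_history: Annotated[list[str], operator.add]
    route_decision: str
    messages: Annotated[list[str], operator.add]


def apply_update(state: dict, update: dict) -> dict:
    """Hypothetical stand-in for the graph runtime's merge step.

    Fields that declare a reducer in their Annotated metadata are combined
    with it; all other fields are overwritten by the update.
    """
    hints = get_type_hints(CouncilState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        reducer = metadata[0] if metadata else None
        merged[key] = reducer(state[key], value) if reducer else value
    return merged


def test_feedback_history_appends_across_iterations():
    state = {
        "input_topic": "Write a blog post",
        "current_draft": "Draft v1",
        "feedback_history": [],
        "route_decision": "",
        "messages": [],
    }
    # Two critic rounds, each returning only a partial state update.
    state = apply_update(state, {"feedback_history": ["Too short"]})
    state = apply_update(
        state, {"feedback_history": ["Needs sources"], "current_draft": "Draft v2"}
    )
    assert state["feedback_history"] == ["Too short", "Needs sources"]
    assert state["current_draft"] == "Draft v2"
```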
---
### 2.2 Routing Logic (CRITICAL)
**Why critical:** The conditional edges are the core business logic. A routing bug sends documents into infinite loops or skips the critic entirely — the two worst failure modes for the product's value proposition.
**What to test:**
| Scenario | Type |
|---|---|
| Score < 8 → `route_decision = "rework"`, graph returns to master | Unit |
| Score ≥ 8 → `route_decision = "approve"`, graph advances to writer | Unit |
| Critic returns a non-numeric or malformed score — routing falls back safely (no crash) | Unit |
| Loop terminates after N iterations even if score never reaches threshold (guard against infinite loops) | Integration |
| Custom routing values defined in user blueprints are handled without exception | Unit |
**File to test:** `backend/agents/` (routing functions), `backend/services/graph_builder.py`
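As an illustration of the routing scenarios above, here is a hedged sketch of a routing function with its boundary tests. `route_after_critic`, `MAX_ITERATIONS`, and the policy of forcing `"approve"` once the cap is reached are assumptions made for this example; only the threshold of 8 and the two route values come from this document.

```python
APPROVAL_THRESHOLD = 8  # threshold stated in the scenarios above
MAX_ITERATIONS = 5      # assumed cap; the document only says "N"


def route_after_critic(raw_score, iteration: int) -> str:
    """Return "approve" or "rework" for a critic score (hypothetical helper)."""
    if iteration >= MAX_ITERATIONS:
        # Assumed policy: stop looping and hand the draft on once the cap is hit.
        return "approve"
    try:
        score = float(raw_score)
    except (TypeError, ValueError):
        # Malformed score: fall back to another review round instead of crashing.
        return "rework"
    return "approve" if score >= APPROVAL_THRESHOLD else "rework"


def test_routing_boundaries_and_fallbacks():
    # Both sides of the boundary, plus the boundary itself.
    assert route_after_critic(7, iteration=0) == "rework"
    assert route_after_critic(8, iteration=0) == "approve"
    assert route_after_critic(9, iteration=0) == "approve"
    # Non-numeric and missing scores must not raise.
    assert route_after_critic("not a number", iteration=0) == "rework"
    assert route_after_critic(None, iteration=0) == "rework"
    # Guard: a permanently low score cannot loop past the cap.
    assert route_after_critic(1, iteration=MAX_ITERATIONS) == "approve"
```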
---
### 2.3 LangGraph Agent Node Functions (HIGH)
Each agent node function must be independently testable with mocked LLM responses.
**What to test per node:**
| Scenario | Type |
|---|---|
| Node receives a valid `CouncilState` and returns a dict with only the keys it owns | Unit |
| Node appends to `messages`, never replaces the full list | Unit |
| Node correctly uses its configured system prompt (verify prompt is sent as system message) | Unit |
| Node handles an empty `current_draft` gracefully (first iteration edge case) | Unit |
| Critic node produces a numeric score in the expected range within `route_decision` output | Unit |
| Master node incorporates all items in `feedback_history` into its revised draft prompt | Unit |
**Key rule from `CLAUDE.md`:** LLM calls must be mocked in CI — never make real API calls.
```python
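# Example mock pattern for every agent node test.
# Module paths follow the planned layout (backend/state.py, backend/agents/).
from unittest.mock import MagicMock, patch

from backend.agents.master_agent import master_agent_node
from backend.state import CouncilState


# Patching the class only works if the node instantiates ChatAnthropic at call time.
@patch("backend.agents.master_agent.ChatAnthropic")
def test_master_agent_node_appends_to_messages(mock_llm):
    # Every ChatAnthropic(...) instance returns a canned reply from .invoke().
    mock_llm.return_value.invoke.return_value = MagicMock(content="Draft v2")
    state = CouncilState(
        input_topic="Write a blog post",
        current_draft="Draft v1",
        feedback_history=["Too short"],
        route_decision="",
        messages=[],
    )
    result = master_agent_node(state)
    # The node returns only its own update: one new message and the new draft.
    assert len(result["messages"]) == 1
    assert result["current_draft"] == "Draft v2"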
```
---
### 2.4 Dynamic Graph Builder (HIGH)
**Why high:** From Phase 3 onward the graph is built at runtime from a user-supplied JSON blueprint. A parsing error or invalid edge definition produces a graph that silently misbehaves.
**What to test:**
| Scenario | Type |
|---|---|
| A valid blueprint JSON produces a compilable `StateGraph` | Unit |
| Blueprint with a conditional edge correctly wires the routing function | Unit |
| Blueprint with a cycle (A → B → A) compiles without flattening the cycle | Unit |
| Blueprint referencing a non-existent node ID raises a descriptive `ValueError`, not a Python `KeyError` | Unit |
| Blueprint with zero nodes raises a clear error | Unit |
| Blueprint with two nodes and no edges raises a clear error | Unit |
| A blueprint that was serialised from the frontend and round-tripped through PostgreSQL produces the same graph | Integration |
**File to test:** `backend/services/graph_builder.py`
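The error-handling rows above could be covered by a validation step that runs before the graph is compiled. This is a sketch under stated assumptions: the blueprint shape (`nodes` with an `id`, `edges` with `source` and `target`, as React Flow names them) and the function `validate_blueprint` are hypothetical, since no schema exists yet.

```python
def validate_blueprint(blueprint: dict) -> None:
    """Raise ValueError with a descriptive message for a malformed blueprint."""
    nodes = blueprint.get("nodes") or []
    edges = blueprint.get("edges") or []
    if not nodes:
        raise ValueError("Blueprint has no nodes")
    if len(nodes) > 1 and not edges:
        # Several nodes but no edges: nothing connects them.
        raise ValueError("Blueprint has multiple nodes but no edges")
    node_ids = {node["id"] for node in nodes}
    for edge in edges:
        for end in ("source", "target"):
            if edge[end] not in node_ids:
                # Explicit check so callers never see a bare KeyError.
                raise ValueError(
                    f"Edge {edge['source']!r} -> {edge['target']!r} "
                    f"references unknown node {edge[end]!r}"
                )


def error_message(blueprint: dict):
    """Return the ValueError message, or None if the blueprint is valid."""
    try:
        validate_blueprint(blueprint)
    except ValueError as exc:
        return str(exc)
    return None


def test_blueprint_validation():
    a, b = {"id": "A"}, {"id": "B"}
    cycle = {
        "nodes": [a, b],
        "edges": [{"source": "A", "target": "B"}, {"source": "B", "target": "A"}],
    }
    assert error_message(cycle) is None  # cycles are legal
    assert error_message({"nodes": [], "edges": []}) == "Blueprint has no nodes"
    assert "no edges" in error_message({"nodes": [a, b], "edges": []})
    dangling = {"nodes": [a, b], "edges": [{"source": "A", "target": "ghost"}]}
    assert "'ghost'" in error_message(dangling)
```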
---
### 2.5 Blueprint JSON Parser (HIGH — Frontend)
**Why high:** The parser converts a React Flow canvas into the JSON that controls backend execution. Any data loss or structural error here means user-designed councils don't run as intended.
**What to test:**
| Scenario | Type |
|---|---|
| A two-node linear graph emits correct `nodes` + `edges` JSON | Unit |
| A conditional edge with a `condition` label is preserved in output | Unit |
| An isolated node (no edges) is flagged as a warning, not silently dropped | Unit |
| Node with empty `systemPrompt` field is preserved (validation happens at run time, not parse time) | Unit |
| Blueprint JSON includes a `version` field | Unit |
| Re-parsing a previously saved blueprint produces identical output (idempotency) | Unit |
**File to test:** `frontend/src/utils/parser.ts` (or `.js`)
---
### 2.6 FastAPI REST Endpoints (MEDIUM-HIGH)
**What to test:**
| Endpoint | Scenario | Type |
|---|---|---|
| `POST /api/councils/` | Creates a blueprint, returns 201 with ID | Integration |
| `POST /api/councils/` | Payload missing required fields returns 422 | Integration |
| `GET /api/councils/{id}` | Returns stored blueprint | Integration |
| `GET /api/councils/{id}` | Non-existent ID returns 404 | Integration |
| `PUT /api/councils/{id}` | Updates blueprint, bumps `version` field | Integration |
| `DELETE /api/councils/{id}` | Removes blueprint, subsequent GET returns 404 | Integration |
| All endpoints | Unauthenticated requests (when auth is added) return 401 | Integration |
Use `httpx.AsyncClient` + `pytest-asyncio` for async FastAPI tests. Use a dedicated test database, never the production instance.
---
### 2.7 WebSocket Agent Status Events (MEDIUM-HIGH)
**Why this is risky:** WebSocket behaviour is easy to verify once by hand and easy to break silently afterwards. The frontend's live-highlighting feature depends entirely on these events arriving in the correct order.
**What to test:**
| Scenario | Type |
|---|---|
| Connecting to `/ws/council/{run_id}` returns a 101 upgrade | Integration |
| When LangGraph enters `master_agent_node`, the WebSocket emits `{"node": "master_agent", "status": "running"}` | Integration |
| When a node completes, a `"completed"` status event is emitted before the next node's `"running"` event | Integration |
| When the graph finishes, a `"done"` event is emitted with the final output | Integration |
| Client disconnect during a run does not crash the backend process | Integration |
| Two concurrent WebSocket sessions for different `run_id`s do not cross-contaminate events | Integration |
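The ordering rules above can be checked independently of the transport by recording the emitted events and validating the sequence. The sketch below assumes nodes run one at a time and that the final event looks like `{"status": "done", ...}`; `find_ordering_violations` is a hypothetical helper, and only the `node`/`status` event shape is taken from this document.

```python
def find_ordering_violations(events: list) -> list:
    """Return human-readable ordering problems in a recorded event stream."""
    violations = []
    running = None  # node currently marked as running, if any
    for index, event in enumerate(events):
        status = event.get("status")
        node = event.get("node")
        if status == "running":
            if running is not None:
                violations.append(f"{node} started before {running} completed")
            running = node
        elif status == "completed":
            if running != node:
                violations.append(f"{node} completed without a running event")
            running = None
        elif status == "done":
            if running is not None:
                violations.append(f"done emitted while {running} still running")
            if index != len(events) - 1:
                violations.append("done was not the last event")
    if not events or events[-1].get("status") != "done":
        violations.append("stream did not end with a done event")
    return violations


def test_event_ordering():
    good = [
        {"node": "master_agent", "status": "running"},
        {"node": "master_agent", "status": "completed"},
        {"node": "critic_agent", "status": "running"},
        {"node": "critic_agent", "status": "completed"},
        {"status": "done", "output": "Final text"},
    ]
    assert find_ordering_violations(good) == []

    # The critic starts before the master has reported completion.
    bad = [good[0], good[2], good[3], good[4]]
    assert find_ordering_violations(bad) == [
        "critic_agent started before master_agent completed"
    ]
```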
---
### 2.8 Human-in-the-Loop / God Mode (MEDIUM)
This is the most stateful and interactive code path and the hardest to test manually.
**What to test:**
| Scenario | Type |
|---|---|
| With `interrupt_before` configured, the graph pauses before the specified node | Integration |
| The paused state (including `current_draft` and `feedback_history`) is correctly surfaced to the frontend | Integration |
| Sending "approve" resumes execution and the graph continues from the paused node | Integration |
| Sending "reject" resumes execution with `route_decision = "rework"` | Integration |
| Modifying `current_draft` in the approval UI causes the next node to receive the modified content | Integration |
| Session timeout while paused: the run is marked as `timed_out`, not left in `paused` forever | Integration |
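The timeout scenario above is easiest to test when the clock is injected rather than read inside the function. The following sketch assumes a hypothetical `Run` record, a hypothetical `expire_paused_runs` sweep, and a 30-minute limit; only the `paused` and `timed_out` status names come from this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

PAUSE_TIMEOUT = timedelta(minutes=30)  # assumed value


@dataclass
class Run:
    run_id: str
    status: str
    paused_at: Optional[datetime] = None


def expire_paused_runs(runs: list, now: datetime) -> list:
    """Mark runs paused for at least PAUSE_TIMEOUT as timed_out; return their IDs."""
    expired = []
    for run in runs:
        if (
            run.status == "paused"
            and run.paused_at is not None
            and now - run.paused_at >= PAUSE_TIMEOUT
        ):
            run.status = "timed_out"
            expired.append(run.run_id)
    return expired


def test_paused_run_times_out():
    now = datetime(2026, 2, 20, 12, 0, tzinfo=timezone.utc)
    stale = Run("run-1", "paused", paused_at=now - timedelta(minutes=45))
    fresh = Run("run-2", "paused", paused_at=now - timedelta(minutes=5))
    active = Run("run-3", "running")
    assert expire_paused_runs([stale, fresh, active], now) == ["run-1"]
    assert stale.status == "timed_out"
    assert fresh.status == "paused"
    assert active.status == "running"
```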
---
### 2.9 External Tool Wrappers (MEDIUM)
Tools must be tested in isolation with mocked external calls.
**Tavily Web Search:**
| Scenario | Type |
|---|---|
| Tool returns a list of results with `url` and `snippet` fields | Unit |
| Tavily API returns 429 (rate limit) — tool raises a retriable error | Unit |
| Tavily API is unreachable — tool raises a non-retriable error with a clear message | Unit |
**PyPDF + Vector Store:**
| Scenario | Type |
|---|---|
| A valid PDF is chunked and embedded without error | Unit |
| A password-protected PDF raises a clear error | Unit |
| A zero-byte file raises a clear error | Unit |
| Semantic search returns the top-K most relevant chunks | Unit |
| Vector store survives a restart and retrieves previously stored embeddings | Integration |
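To show how the retriable and non-retriable cases above might be separated, here is a sketch of a search wrapper with an injected transport. Nothing in it reflects Tavily's actual client API: `web_search`, the `fetch(query) -> (status_code, payload)` callable, the payload shape, and both exception classes are hypothetical names chosen so the example can be mocked with the standard library.

```python
from unittest.mock import Mock


class RetriableToolError(Exception):
    """The caller may retry, for example after a rate limit."""


class FatalToolError(Exception):
    """Retrying will not help; surface the message to the user."""


def web_search(query: str, fetch) -> list:
    """Run a search through the injected transport and normalise the results."""
    try:
        status, payload = fetch(query)
    except ConnectionError as exc:
        raise FatalToolError(f"Search backend unreachable: {exc}") from exc
    if status == 429:
        raise RetriableToolError("Search backend rate-limited the request (HTTP 429)")
    if status != 200:
        raise FatalToolError(f"Search backend returned HTTP {status}")
    return [
        {"url": item["url"], "snippet": item["snippet"]}
        for item in payload["results"]
    ]


def test_web_search_wrapper():
    ok = Mock(return_value=(200, {"results": [{"url": "https://example.com", "snippet": "hi", "rank": 1}]}))
    assert web_search("councils", ok) == [{"url": "https://example.com", "snippet": "hi"}]
    ok.assert_called_once_with("councils")

    limited = Mock(return_value=(429, {}))
    try:
        web_search("councils", limited)
        raise AssertionError("expected RetriableToolError")
    except RetriableToolError:
        pass

    down = Mock(side_effect=ConnectionError("name resolution failed"))
    try:
        web_search("councils", down)
        raise AssertionError("expected FatalToolError")
    except FatalToolError as exc:
        assert "unreachable" in str(exc)
```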
---
### 2.10 React Flow Custom Node Components (MEDIUM — Frontend)
**What to test:**
| Scenario | Type |
|---|---|
| Rendering a node with no props does not throw | Unit |
| Changing the `systemPrompt` field in the settings panel updates the node data | Unit |
| Selecting a different LLM model updates the node's model field | Unit |
| Toggling "Web Search" on/off reflects in node data | Unit |
| Node label truncates at a defined character limit without overflowing the card UI | Unit |
Use React Testing Library. Do not snapshot-test node styling: it changes too often, so the snapshots fail even when nothing is broken.
---
## 3. Test Infrastructure Recommendations
### Backend (`backend/`)
```
pytest
pytest-asyncio # async FastAPI route tests
httpx # async test client for FastAPI
pytest-cov # coverage reports
unittest.mock # LLM mocking (stdlib, no extra dep)
factory-boy # fixture factories for CouncilState
```
Minimum coverage target: **80% for `agents/`**, **90% for `state.py` and `services/graph_builder.py`**.
Configuration in `pyproject.toml`:
```toml
[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]

[tool.coverage.run]
omit = ["tests/*", "**/__init__.py"]

[tool.coverage.report]
fail_under = 80
```
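Note that `fail_under` applies to the total figure, so the stricter per-path targets stated above need a separate check. One possible approach, sketched here under assumptions, is to read the report written by `coverage json` and aggregate per path prefix. `check_thresholds` is a hypothetical helper, and the path prefixes assume coverage is run from the repository root.

```python
def check_thresholds(report: dict, thresholds: dict) -> list:
    """Return one failure message per path prefix below its minimum coverage."""
    failures = []
    for prefix, minimum in thresholds.items():
        covered = statements = 0
        for path, data in report["files"].items():
            if path.startswith(prefix):
                covered += data["summary"]["covered_lines"]
                statements += data["summary"]["num_statements"]
        if statements == 0:
            failures.append(f"{prefix}: no measured files")
            continue
        percent = 100.0 * covered / statements
        if percent < minimum:
            failures.append(f"{prefix}: {percent:.1f}% < {minimum:.0f}%")
    return failures


# Hand-written sample in the layout of `coverage json` output (only used keys shown).
SAMPLE_REPORT = {
    "files": {
        "backend/state.py": {"summary": {"covered_lines": 9, "num_statements": 10}},
        "backend/agents/master_agent.py": {"summary": {"covered_lines": 7, "num_statements": 10}},
    }
}

TARGETS = {
    "backend/agents/": 80,
    "backend/state.py": 90,
    "backend/services/graph_builder.py": 90,
}

failures = check_thresholds(SAMPLE_REPORT, TARGETS)
```

In CI the report would come from `json.load` on the generated file, and a non-empty `failures` list would fail the build.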
### Frontend (`frontend/`)
```
vitest (or jest)
@testing-library/react
@testing-library/user-event
msw # Mock Service Worker for API/WebSocket mocking
```
---
## 4. What to Write First (Priority Order)
Build tests in this sequence — it matches the development phases and catches the highest-risk bugs earliest:
1. **`state.py`: `CouncilState` reducers** (write before any agent code)
2. **`agents/`: each node function** (write alongside each node implementation)
3. **Routing logic** (write before connecting the graph)
4. **`graph_builder.py`** (write alongside Phase 3 dynamic graph work)
5. **REST API endpoints** (write alongside Phase 2 blueprint persistence)
6. **Frontend parser** (write alongside Phase 2 React Flow work)
7. **WebSocket events** (write alongside Phase 3 integration work)
8. **God Mode / Human-in-the-Loop** (write alongside Phase 4)
9. **Tool wrappers** (write alongside Phase 4)
---
## 5. What to Avoid
- **Do not make real LLM API calls in any test.** This adds cost, latency, and flakiness. Always mock `ChatAnthropic` and `ChatOpenAI`.
- **Do not snapshot-test React Flow canvas state.** React Flow's internal node positions change non-deterministically and will make snapshot tests fail constantly.
- **Do not use a shared test database.** Each integration test that touches the database must run in a transaction that is rolled back after the test, or use a separate ephemeral schema.
- **Do not test the routing score threshold only at exactly 8.** Always test the boundary together with both sides of it (score = 7 → rework, score = 8 → approve, score = 9 → approve) to catch off-by-one errors in conditional logic.
- **Do not write end-to-end browser tests before Phase 3 is stable.** E2E tests on an unstable UI are expensive to maintain. Add Playwright/Cypress tests only once the API contract between frontend and backend is frozen.
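
The rollback rule for database tests in the list above can be illustrated with a small context manager. This sketch uses the standard library's `sqlite3` purely so that it runs standalone; the planned stack uses PostgreSQL, and the `councils` table and `rolled_back` helper are hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def rolled_back(conn):
    """Yield the connection, then undo everything written inside the block."""
    try:
        yield conn
    finally:
        conn.rollback()


def test_insert_is_invisible_after_rollback():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE councils (id INTEGER PRIMARY KEY, name TEXT)")
    conn.commit()
    with rolled_back(conn) as db:
        db.execute("INSERT INTO councils (name) VALUES (?)", ("draft council",))
        # Visible inside the open transaction.
        assert db.execute("SELECT COUNT(*) FROM councils").fetchone()[0] == 1
    # Gone after the rollback, so the next test starts from a clean table.
    assert conn.execute("SELECT COUNT(*) FROM councils").fetchone()[0] == 0
    conn.close()
```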