KI-Konzil/docs/test-coverage-analysis.md
Claude 852df60f68
Add test coverage analysis and strategy document
Maps every testable component in the planned architecture to concrete
test cases, prioritised by risk. Covers CouncilState reducers, routing
logic, agent node functions, dynamic graph builder, REST API, WebSocket
events, God Mode, tool wrappers, and the React Flow blueprint parser.
Includes tooling recommendations and a sequenced build order.

https://claude.ai/code/session_01Dexzo7FAbhU5fMePHGVgRP
2026-02-20 15:31:08 +00:00

12 KiB

Test Coverage Analysis — CouncilOS

Date: 2026-02-20 Repository status: Pre-code / architecture stage Coverage today: 0% (no application code exists)

This document analyses the planned architecture defined in CLAUDE.md and README.md, identifies every testable unit, and prioritises the areas that carry the most risk if left untested.


1. Current State

The repository contains only README.md and CLAUDE.md. No production code, test files, or tooling configuration exist yet. Every recommendation below is therefore forward-looking: it defines the test strategy that must accompany each development phase rather than patching gaps in existing coverage.


2. Risk-Prioritised Coverage Areas

2.1 CouncilState and State Management (CRITICAL)

Why critical: CouncilState is the single source of truth passed between every agent. A bug here corrupts the entire pipeline silently.

What to test:

Scenario Type
feedback_history uses operator.add and appends across loop iterations — never overwrites Unit
messages reducer accumulates correctly when agents return partial state Unit
route_decision is always one of the allowed string literals ("rework", "approve", custom values) Unit
Passing a state missing required fields raises a clear error at graph entry, not mid-run Unit
A full loop (master → critic → rework → master → critic → approve) preserves all prior feedback in history Integration

File to test: backend/state.py


2.2 Routing Logic (CRITICAL)

Why critical: The conditional edges are the core business logic. A routing bug sends documents into infinite loops or skips the critic entirely — the two worst failure modes for the product's value proposition.

What to test:

Scenario Type
Score < 8 → route_decision = "rework", graph returns to master Unit
Score ≥ 8 → route_decision = "approve", graph advances to writer Unit
Critic returns a non-numeric or malformed score — routing falls back safely (no crash) Unit
Loop terminates after N iterations even if score never reaches threshold (guard against infinite loops) Integration
Custom routing values defined in user blueprints are handled without exception Unit

File to test: backend/agents/ (routing functions), backend/services/graph_builder.py


2.3 LangGraph Agent Node Functions (HIGH)

Each agent node function must be independently testable with mocked LLM responses.

What to test per node:

Scenario Type
Node receives a valid CouncilState and returns a dict with only the keys it owns Unit
Node appends to messages, never replaces the full list Unit
Node correctly uses its configured system prompt (verify prompt is sent as system message) Unit
Node handles an empty current_draft gracefully (first iteration edge case) Unit
Critic node produces a numeric score in the expected range within route_decision output Unit
Master node incorporates all items in feedback_history into its revised draft prompt Unit

Key rule from CLAUDE.md: LLM calls must be mocked in CI — never make real API calls.

# Example mock pattern for every agent node test
from unittest.mock import patch, MagicMock

@patch("backend.agents.master_agent.ChatAnthropic")
def test_master_agent_node_appends_to_messages(mock_llm):
    mock_llm.return_value.invoke.return_value = MagicMock(content="Draft v2")
    state = CouncilState(
        input_topic="Write a blog post",
        current_draft="Draft v1",
        feedback_history=["Too short"],
        route_decision="",
        messages=[],
    )
    result = master_agent_node(state)
    assert len(result["messages"]) == 1
    assert result["current_draft"] == "Draft v2"

2.4 Dynamic Graph Builder (HIGH)

Why high: From Phase 3 onward the graph is built at runtime from a user-supplied JSON blueprint. A parsing error or invalid edge definition produces a graph that silently misbehaves.

What to test:

Scenario Type
A valid blueprint JSON produces a compilable StateGraph Unit
Blueprint with a conditional edge correctly wires the routing function Unit
Blueprint with a cycle (A → B → A) compiles without flattening the cycle Unit
Blueprint referencing a non-existent node ID raises a descriptive ValueError, not a Python KeyError Unit
Blueprint with zero nodes raises a clear error Unit
Blueprint with two nodes and no edges raises a clear error Unit
A blueprint that was serialised from the frontend and round-tripped through PostgreSQL produces the same graph Integration

File to test: backend/services/graph_builder.py


2.5 Blueprint JSON Parser (HIGH — Frontend)

Why high: The parser converts a React Flow canvas into the JSON that controls backend execution. Any data loss or structural error here means user-designed councils don't run as intended.

What to test:

Scenario Type
A two-node linear graph emits correct nodes + edges JSON Unit
A conditional edge with a condition label is preserved in output Unit
An isolated node (no edges) is flagged as a warning, not silently dropped Unit
Node with empty systemPrompt field is preserved (validation happens at run time, not parse time) Unit
Blueprint JSON includes a version field Unit
Re-parsing a previously saved blueprint produces identical output (idempotency) Unit

File to test: frontend/src/utils/parser.ts (or .js)


2.6 FastAPI REST Endpoints (MEDIUM-HIGH)

What to test:

Endpoint Scenario Type
POST /api/councils/ Creates a blueprint, returns 201 with ID Integration
POST /api/councils/ Payload missing required fields returns 422 Integration
GET /api/councils/{id} Returns stored blueprint Integration
GET /api/councils/{id} Non-existent ID returns 404 Integration
PUT /api/councils/{id} Updates blueprint, bumps version field Integration
DELETE /api/councils/{id} Removes blueprint, subsequent GET returns 404 Integration
All endpoints Unauthenticated requests (when auth is added) return 401 Integration

Use httpx.AsyncClient + pytest-asyncio for async FastAPI tests. Use a dedicated test database, never the production instance.


2.7 WebSocket Agent Status Events (MEDIUM-HIGH)

Why this is risky: WebSocket behaviour is easy to get right manually and easy to break silently. The frontend's live-highlighting feature depends entirely on these events arriving in the correct order.

What to test:

Scenario Type
Connecting to /ws/council/{run_id} returns a 101 upgrade Integration
When LangGraph enters master_agent_node, the WebSocket emits {"node": "master_agent", "status": "running"} Integration
When a node completes, a "completed" status event is emitted before the next node's "running" event Integration
When the graph finishes, a "done" event is emitted with the final output Integration
Client disconnect during a run does not crash the backend process Integration
Two concurrent WebSocket sessions for different run_ids do not cross-contaminate events Integration

2.8 Human-in-the-Loop / God Mode (MEDIUM)

This is the most stateful and interactive code path and the hardest to test manually.

What to test:

Scenario Type
With interrupt_before configured, the graph pauses before the specified node Integration
The paused state (including current_draft and feedback_history) is correctly surfaced to the frontend Integration
Sending "approve" resumes execution and the graph continues from the paused node Integration
Sending "reject" resumes execution with route_decision = "rework" Integration
Modifying current_draft in the approval UI causes the next node to receive the modified content Integration
Session timeout while paused: the run is marked as timed_out, not left in paused forever Integration

2.9 External Tool Wrappers (MEDIUM)

Tools must be tested in isolation with mocked external calls.

Tavily Web Search:

Scenario Type
Tool returns a list of results with url and snippet fields Unit
Tavily API returns 429 (rate limit) — tool raises a retriable error Unit
Tavily API is unreachable — tool raises a non-retriable error with a clear message Unit

PyPDF + Vector Store:

Scenario Type
A valid PDF is chunked and embedded without error Unit
A password-protected PDF raises a clear error Unit
A zero-byte file raises a clear error Unit
Semantic search returns the top-K most relevant chunks Unit
Vector store survives a restart and retrieves previously stored embeddings Integration

2.10 React Flow Custom Node Components (MEDIUM — Frontend)

What to test:

Scenario Type
Rendering a node with no props does not throw Unit
Changing the systemPrompt field in the settings panel updates the node data Unit
Selecting a different LLM model updates the node's model field Unit
Toggling "Web Search" on/off reflects in node data Unit
Node label truncates at a defined character limit without overflowing the card UI Unit

Use React Testing Library. Do not snapshot-test node styling — it changes too often and generates false negatives.


3. Test Infrastructure Recommendations

Backend (backend/)

pytest
pytest-asyncio         # async FastAPI route tests
httpx                  # async test client for FastAPI
pytest-cov             # coverage reports
unittest.mock          # LLM mocking (stdlib, no extra dep)
factory-boy            # fixture factories for CouncilState

Minimum coverage target: 80% for agents/, 90% for state.py and services/graph_builder.py.

Configuration in pyproject.toml:

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]

[tool.coverage.run]
omit = ["tests/*", "**/__init__.py"]

[tool.coverage.report]
fail_under = 80

Frontend (frontend/)

vitest (or jest)
@testing-library/react
@testing-library/user-event
msw                    # Mock Service Worker for API/WebSocket mocking

4. What to Write First (Priority Order)

Build tests in this sequence — it matches the development phases and catches the highest-risk bugs earliest:

  1. state.pyCouncilState reducers (write before any agent code)
  2. agents/ — each node function (write alongside each node implementation)
  3. Routing logic (write before connecting the graph)
  4. graph_builder.py (write alongside Phase 3 dynamic graph work)
  5. REST API endpoints (write alongside Phase 2 blueprint persistence)
  6. Frontend parser (write alongside Phase 2 React Flow work)
  7. WebSocket events (write alongside Phase 3 integration work)
  8. God Mode / Human-in-the-Loop (write alongside Phase 4)
  9. Tool wrappers (write alongside Phase 4)

5. What to Avoid

  • Do not make real LLM API calls in any test. This adds cost, latency, and flakiness. Always mock ChatAnthropic and ChatOpenAI.
  • Do not snapshot-test React Flow canvas state. React Flow's internal node positions change non-deterministically and will make snapshot tests fail constantly.
  • Do not use a shared test database. Each integration test that touches the database must run in a transaction that is rolled back after the test, or use a separate ephemeral schema.
  • Do not test the routing score threshold at exactly 8. Always test both sides of the boundary (score = 7 → rework, score = 8 → approve, score = 9 → approve) to catch off-by-one errors in conditional logic.
  • Do not write end-to-end browser tests before Phase 3 is stable. E2E tests on an unstable UI are expensive to maintain. Add Playwright/Cypress tests only once the API contract between frontend and backend is frozen.