Development & testing
Repository structure
Everything the installer and skills need lives in two top-level trees: coach/ (the Python installer/configurator
package) and skills/ (shared, harness-agnostic skill definitions). data/ is created on first run and gitignored.
coach/ # Python package (installer + sources)
├── __init__.py
├── cli.py # `coach setup`, `coach setup --source <name>`, `coach install --harness ...`
├── harness/
│   ├── base.py # BaseHarness ABC
│   ├── claude.py # ClaudeHarness
│   └── codex.py # CodexHarness
├── sources/
│   ├── registry.py # SOURCES registry (garmin/strava/google_calendar/outlook_calendar functional; rest scaffolded)
│   ├── base.py # SourceSpec dataclass + CAPABILITIES vocabulary constant
│   ├── garmin.py # Garmin source (Taxuspt/garmin_mcp): metrics + workout_calendar
│   ├── strava.py # Strava source (reuses scripts/setup_auth.py + strava_server.py): metrics only
│   ├── google_calendar.py # Google Calendar source (nspady/google-calendar-mcp): workout_calendar
│   └── outlook_calendar.py # Outlook Calendar source (softeria/ms-365-mcp-server): workout_calendar
├── storage/
│   ├── schema.py # Goal / PlanNote / ReadinessCheckin Pydantic models
│   └── store.py # file-based read/write helpers over data/ tree (+ personality)
├── analysis/
│   └── assemble.py # context assembler → agent-friendly payload (NO scoring/verdicts)
├── scheduling.py # parse_time_of_day/to_cron + local-only writers for Claude/Codex/launchd
└── prompts/
    └── coach_personality.md # base coach persona seed
skills/ # shared SKILL.md folders, authored once, installed to each harness
├── setup-coach-personality/SKILL.md # one skill, two modes: setup + refine
├── research-goal-plan/SKILL.md
├── evaluate-training/SKILL.md
├── generate-daily-workout/SKILL.md
├── readiness-check/SKILL.md
├── adjust-workout/SKILL.md
└── body-checkin/SKILL.md
data/ # local store (gitignored), created on first run
├── athlete/profile.json
├── coach/{personality.md, personality.json}
├── goals/{goals.json, research/*.md}
├── plan/<YYYY-MM-DD>.json
└── logs/readiness/<YYYY-MM-DD>.json
strava/ # Strava metrics-source assets
├── strava_server.py
├── strava_client.py
├── format_workout_file.py
└── (tools/get_*.py, explore_segments.py, get_route.py, export_route_*.py)
scripts/
└── setup_auth.py # Strava OAuth CLI flow
tests/ # schema, store, assemble, registry, harness installers
├── test_schema.py
├── test_store.py
├── test_assemble.py
├── test_registry.py
└── test_harness.py
pyproject.toml # deps + `coach` console entry-point
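The data/ layout above suggests a store keyed by date-named JSON files. As a rough illustration only, a file-based helper pair for data/plan/<YYYY-MM-DD>.json might look like the sketch below. The function names and the use of plain dicts are assumptions made for this example; the real coach/storage/store.py works with the Pydantic models and may be shaped differently.

```python
import json
from datetime import date
from pathlib import Path
from typing import Optional


def plan_path(data_dir: Path, day: date) -> Path:
    """Map a date to data/plan/<YYYY-MM-DD>.json (layout taken from the tree above)."""
    return data_dir / "plan" / f"{day.isoformat()}.json"


def write_plan_note(data_dir: Path, day: date, note: dict) -> Path:
    """Write one plan note, creating data/plan/ on first use (hypothetical helper)."""
    path = plan_path(data_dir, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(note, indent=2, sort_keys=True), encoding="utf-8")
    return path


def read_plan_note(data_dir: Path, day: date) -> Optional[dict]:
    """Return the stored note, or None when nothing was planned for that day."""
    path = plan_path(data_dir, day)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Passing data_dir explicitly, rather than hard-coding data/, is what lets Layer-1 tests point the same helpers at a temporary directory.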
Running the tests
uv sync
uv run pytest -q
Functional testing strategy
Most of Coach AI's "logic" lives in SKILL.md and CLAUDE.md/AGENTS.md — instructions an LLM reads, not code
whose reasoning can be unit-tested. The strategy below tests everything that is code (schema, store, capability
resolution, harness installers, scheduling, assemble.py) thoroughly and automatically, and treats agent reasoning
as something verified structurally — did the right files get written, the right tools get called? — plus a manual
walkthrough per milestone.
Four layers
| Layer | What's tested | How | Runs |
|---|---|---|---|
| 1. Unit | Pydantic schema round-trips, store.py CRUD, to_cron/parse_time_of_day, capability-set resolution | plain pytest, temp dirs, no network/MCP | every commit |
| 2. Functional (fixture-driven) | install_skills() renders the right SKILL.md per path; harness writers produce correct .mcp.json/CLAUDE.md/config.toml; assemble.py normalizes recorded MCP responses into the right shape (incl. fields omitted on Strava+Calendar); scheduling writers produce correct artifacts for a given time | pytest against recorded MCP-response fixtures (tests/fixtures/) and temp project dirs; asserts on file contents, never on agent output | every commit |
| 3. Integration (live MCP) | real Garmin/Strava/Google/Outlook MCP servers launch, authenticate, and respond to one read tool each | opt-in pytest marker (--run-integration), needs real OAuth caches/credentials in the environment | manual, before a milestone ships |
| 4. Agent-in-the-loop | a skill, run by a real agent in Claude Code/Codex, produces the expected data/ writes, calendar mutations, and a coherent recommendation | the annotated examples and end-to-end diagram in Daily workflow, run as a manual script (same steps as the installation verification checklist) | manual, once per path × harness, before a milestone ships |
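
To make Layer 1 concrete, here is a minimal sketch of what a time-parsing and cron-rendering unit test could look like. The signatures of parse_time_of_day and to_cron below are assumptions written for this example, not the ones in coach/scheduling.py, and the stand-in implementations exist only so the tests are self-contained.

```python
from datetime import time


def parse_time_of_day(text: str) -> time:
    """Stand-in parser (assumed signature): 'HH:MM' in 24-hour form to datetime.time."""
    hour_text, sep, minute_text = text.strip().partition(":")
    if not sep:
        raise ValueError(f"expected HH:MM, got {text!r}")
    # datetime.time raises ValueError for out-of-range hours or minutes.
    return time(hour=int(hour_text), minute=int(minute_text))


def to_cron(at: time) -> str:
    """Stand-in renderer (assumed signature): daily five-field cron, 'minute hour * * *'."""
    return f"{at.minute} {at.hour} * * *"


def test_daily_cron_round_trip():
    assert to_cron(parse_time_of_day("06:30")) == "30 6 * * *"


def test_rejects_out_of_range_hour():
    try:
        parse_time_of_day("25:00")
    except ValueError:
        return
    raise AssertionError("25:00 should not parse")
```

Tests of this kind need no network, MCP server, or credentials, which is why they can run on every commit.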
Fixture library
One JSON fixture per MCP read tool actually consumed by a skill, captured once from real accounts with personal data scrubbed, checked in, and reused across every Layer-2 test:
tests/fixtures/
├── garmin/
│   ├── training_readiness.json # get_training_readiness / get_morning_training_readiness
│   ├── hrv.json # get_hrv_data
│   ├── sleep.json # get_sleep_data
│   ├── body_battery.json # get_body_battery
│   ├── activities.json # get_activities / get_activity (incl. trainingEffect)
│   ├── training_status.json # get_training_status / get_vo2max_trend
│   └── scheduled_workouts.json # get_scheduled_workouts / get_workouts
├── strava/
│   ├── activities.json # get-activities / get-activity-streams
│   ├── athlete_stats.json # get-athlete-stats
│   └── segment_prs.json # get-segment-prs / get-athlete-zones
├── google_calendar/
│   ├── list_events.json # list-events
│   └── event.json # get-event / create-event / update-event
└── outlook_calendar/
    ├── list_events.json
    └── event.json
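Capturing a fixture "with personal data scrubbed" can be sketched as a recursive key filter applied before the JSON is written. The key names in PERSONAL_KEYS and the helper names are hypothetical, chosen for illustration; the real set of fields to drop depends on what each MCP server returns.

```python
import json
from pathlib import Path
from typing import Any

# Hypothetical keys treated as personal data; adjust per source.
PERSONAL_KEYS = {"email", "displayName", "userProfileId", "location"}


def scrub(value: Any) -> Any:
    """Recursively drop personal keys from a decoded JSON value."""
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items() if k not in PERSONAL_KEYS}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


def save_fixture(fixtures_dir: Path, source: str, name: str, payload: Any) -> Path:
    """Scrub a recorded response and store it as <fixtures_dir>/<source>/<name>.json."""
    path = fixtures_dir / source / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(scrub(payload), indent=2, sort_keys=True), encoding="utf-8")
    return path


def load_fixture(fixtures_dir: Path, source: str, name: str) -> Any:
    """Load a checked-in fixture for use in a Layer-2 test."""
    return json.loads((fixtures_dir / source / f"{name}.json").read_text(encoding="utf-8"))
```

Writing with sort_keys=True keeps re-captured fixtures diff-friendly when they are checked in again.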
The skill × path functional-test matrix
Layer-2 tests run every row below twice — once with Garmin fixtures, once with Strava+Calendar fixtures — and assert
the rendered SKILL.md and/or assemble.py output match the Full/Degraded/Unavailable behavior from
Capabilities & paths:
| Skill | Garmin-path assertion | Strava+Calendar-path assertion |
|---|---|---|
| research-goal-plan | capability-independent: rendered SKILL.md identical on both paths | same |
| setup-coach-personality | step 3 resolves to get_training_status/get_training_readiness/get_vo2max_trend | Degraded: step 3 resolves to get-athlete-stats/get-activities/get-segment-prs |
| readiness-check | assemble.py output includes a populated metrics_snapshot from the Garmin fixtures | Degraded: output has no metrics_snapshot key (omitted, not null/zero) |
| generate-daily-workout | {{tool: structured_workout_create}} resolves to create_strength_workout/create_z2_walk_workout → schedule_workout | Degraded: resolves to create-event; plan note has workout_source: "google_calendar" |
| evaluate-training | assemble.py includes training_effect from get_activity | Degraded: training_effect absent; HR/power streams present instead |
| adjust-workout | {{tool: workout_modify}} resolves to unschedule_workout → create_* → schedule_workout | Full: resolves to a single update-event call |
| body-checkin | capability-independent: Full on both paths | same |
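
Two of the assertions in the matrix can be illustrated with a small sketch: resolving a {{tool: ...}} placeholder per path, and omitting metrics_snapshot rather than setting it to null. The BINDINGS mapping, the path keys, and both function signatures are assumptions for this example; only the tool names come from the matrix above.

```python
import re
from typing import Optional

# Hypothetical bindings: capability name -> ordered tool calls, per path.
BINDINGS = {
    "garmin": {
        "workout_modify": ["unschedule_workout", "create_*", "schedule_workout"],
    },
    "strava_calendar": {
        "workout_modify": ["update-event"],
    },
}

PLACEHOLDER = re.compile(r"\{\{tool:\s*([a-z_]+)\s*\}\}")


def render_skill(template: str, path: str) -> str:
    """Replace each {{tool: name}} with the tool sequence bound on this path."""
    def substitute(match):
        return " → ".join(BINDINGS[path][match.group(1)])
    return PLACEHOLDER.sub(substitute, template)


def assemble(day: str, metrics: Optional[dict] = None) -> dict:
    """Build a payload; leave metrics_snapshot out entirely when no metrics exist."""
    payload = {"date": day}
    if metrics:
        payload["metrics_snapshot"] = metrics
    return payload
```

A Layer-2 test would then check key absence directly, for example `assert "metrics_snapshot" not in assemble("2025-01-15")`, so that a null or zero placeholder cannot pass as "omitted".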
Harness parity
A separate parametrized test runs the same resolved capability set through ClaudeHarness and CodexHarness and
asserts structural equivalence — same set of skills installed, same {{tool: ...}} resolutions, same MCP server
entries (different file targets, per the file-target matrix). This is
what makes "Codex parity" in the installation verification checklist a quick re-run rather than a second
from-scratch verification.
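The idea of structural equivalence can be sketched as follows: install the same capability set through two harnesses, then compare everything except the file target. StubHarness and InstallResult are stand-ins invented for this example, and the pairing of .mcp.json and config.toml with particular harnesses is an assumption; the real ClaudeHarness and CodexHarness interfaces are not shown on this page.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class InstallResult:
    skills: FrozenSet[str]
    mcp_servers: FrozenSet[str]
    config_target: str  # expected to differ per harness


class StubHarness:
    """Stand-in for ClaudeHarness/CodexHarness (hypothetical interface)."""

    def __init__(self, config_target: str) -> None:
        self.config_target = config_target

    def install(self, skills: Iterable[str], mcp_servers: Iterable[str]) -> InstallResult:
        return InstallResult(frozenset(skills), frozenset(mcp_servers), self.config_target)


def structural_view(result: InstallResult) -> Tuple[FrozenSet[str], FrozenSet[str]]:
    """Project away the file target so only harness-independent facts are compared."""
    return (result.skills, result.mcp_servers)


def test_harness_parity():
    skills = ["readiness-check", "adjust-workout"]
    servers = ["garmin"]
    claude = StubHarness(".mcp.json").install(skills, servers)
    codex = StubHarness("config.toml").install(skills, servers)
    assert structural_view(claude) == structural_view(codex)
    assert claude.config_target != codex.config_target
```

In the real suite the two harness classes would presumably be supplied through pytest parametrization rather than constructed inline.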
Agent reasoning isn't unit-testable
Whether a recommendation is good coaching — "ease off because HRV is low and calves are sore" — depends on the LLM, the personality dials, and the moment. Layer 2 guarantees the agent has the right data and tools for each path; Layer 4 is the only place that confirms the agent uses them sensibly. Milestone 1 keeps Layer 4 manual; a future milestone could script it with an agent SDK driving a real session against fixture MCP servers.
CI
# every commit — Layers 1+2, no credentials needed
uv run pytest
# before a milestone ships — adds Layer 3, needs real OAuth caches/env vars
uv run pytest --run-integration
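An opt-in flag such as --run-integration is commonly wired up in conftest.py using pytest's option and collection hooks. The sketch below follows that standard pattern; the marker name integration is an assumption, since this page names the flag but not the marker, and the repo's actual conftest.py may differ.

```python
# conftest.py (sketch)
import pytest


def pytest_addoption(parser):
    # Register the opt-in flag used for Layer-3 runs.
    parser.addoption(
        "--run-integration",
        action="store_true",
        default=False,
        help="run Layer-3 tests that need live MCP servers and credentials",
    )


def pytest_configure(config):
    # Declare the (assumed) marker so pytest does not warn about it.
    config.addinivalue_line(
        "markers", "integration: needs live MCP servers and credentials"
    )


def pytest_collection_modifyitems(config, items):
    # With the flag set, run everything; otherwise skip marked tests.
    if config.getoption("--run-integration"):
        return
    skip = pytest.mark.skip(reason="needs --run-integration")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip)
```

A Layer-3 test would then be decorated with `@pytest.mark.integration`, and a plain `uv run pytest` reports it as skipped instead of failing for lack of credentials.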
What survives the pivot
The pivot from the original LlamaIndex ReActAgent runtime is mostly subtractive, but several pieces of working code
moved forward, sometimes reshaped:
| Existing | Action |
|---|---|
| tools/baseline_fitness.py | Salvaged only normalization helpers (unit conversion, per-day/week grouping) into coach/analysis/assemble.py. Dropped VO2max estimation, TSS/CTL/ATL/TSB, strength-baseline, recovery-capacity, and the overall 0–100 fitness score; judgment moved to the agent. |
| tools/user_profile_schema.py | Reworked into coach/storage/schema.py as Goal / PlanNote / ReadinessCheckin: no monolithic profile, no embedded completed-workout list. |
| tools/workout_db_tools.py (TinyDB) | Dropped; replaced by coach/storage/store.py file-based helpers. |
| prompts.py (FITNESS_COACH_SYSTEM_PROMPT) | Seeded coach/prompts/coach_personality.md (stripped of ReAct scaffolding), then removed. |
| scripts/setup_auth.py, strava_server.py, strava_client.py, tools/get_*.py, format_workout_file.py | Kept, moved under strava/ as coach/sources/strava.py's assets; functional as the metrics source on the Strava + Calendar path. format_workout_file.py remains available for ad-hoc .zwo export but is outside the core skill flows. |
| main.py, workout_db_server.py | Removed: LlamaIndex runtime and TinyDB MCP server retired. |
| pyproject.toml | Dropped llama-index*, mcp[cli], tinydb, duckduckgo, and tavily; research now runs on the harness's native WebSearch/WebFetch/web search tools, so no third-party search dependency remains. Added a coach console entry-point; kept pydantic, requests, python-dotenv. |
| tests/ | Replaced with schema round-trip, store CRUD, assemble.py, registry, and harness-installer tests, run against both Garmin and Strava + Calendar fixtures. |
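
The kind of normalization helper that moved into assemble.py (unit conversion and per-week grouping, with no scoring) can be sketched as below. The input field names date and distance_m and both function names are hypothetical; the helpers salvaged from baseline_fitness.py may use different shapes.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def meters_to_km(meters: float) -> float:
    """Unit conversion only; no judgment about whether the distance is 'good'."""
    return meters / 1000.0


def weekly_distance_km(activities: Iterable[dict]) -> Dict[Tuple[int, int], float]:
    """Sum distance per (ISO year, ISO week).

    Each activity is assumed to carry 'date' (YYYY-MM-DD) and 'distance_m'.
    """
    totals: Dict[Tuple[int, int], float] = defaultdict(float)
    for activity in activities:
        iso = date.fromisoformat(activity["date"]).isocalendar()
        totals[(iso[0], iso[1])] += meters_to_km(activity["distance_m"])
    return dict(totals)
```

The output is a plain aggregate for the agent to interpret, in line with the "judgment moved to the agent" decision above.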
New, not reused: MCP servers for the Strava + Calendar path
The Strava + Calendar path adds two or three small TypeScript/stdio MCP servers that have no Python codebase analog. They are configuration, not code this repo maintains:
- r-huijts/strava-mcp: optional alternative to the existing strava_server.py for the metrics role; either works against the same Strava API.
- nspady/google-calendar-mcp: workout-calendar role, lead binding. Requires a one-time gcp-oauth.keys.json desktop OAuth setup.
- softeria/ms-365-mcp-server: workout-calendar role, second binding (Outlook). Requires an Azure app registration with Calendars.ReadWrite.
Contributing
Issues and PRs are welcome — please open an issue first for new
sources, skills, or larger changes so we can align on approach. Priority areas: new SourceSpecs (Apple Health,
Whoop, Polar, Oura), additional harness support beyond Claude Code/Codex, and refinements to the skill catalog.