#005 · March 26, 2026

Fitness Coach Bot — A Personal Product That Reviews Itself

Human effort: ~40 hours
Acceleration: 3×
Cost: ~€200
Calendar time: ~5 weeks
Prompts: 77
Daily self-reviews: ~9

Problem

Every gym session starts with the same three questions: what should I do today, what did I do last time, and how is this week tracking against the goal I set for the quarter. Answering any one of them well takes more attention than is realistic standing on the gym floor between sets. The standard alternatives — a spreadsheet, a notes app, one of the many commercial fitness apps — each have their own version of the same problem: either they require the operator to be the brain (entering plans, tracking history, evaluating progress), or they generate plans without remembering yesterday's session, last month's injury, or this quarter's goal.

The first attempt at solving this used Claude's mobile app directly with a pasted system prompt and a manual session log. Two real workouts revealed the limits — no persistent memory across sessions, no enforcement of user-specific rules (notably a personal injury constraint that has to exclude certain exercises), no ability to receive a screenshot from a wearable and have heart-rate data extracted, no scheduled nudges, and no measurable evaluation of whether the bot's plan was any good.

The goal: a system that survives the gym (mobile-first), remembers profile, goals, and history, takes wearable screenshots as visual input, sends scheduled motivational messages, and — distinctively — includes a self-review loop where every day's sessions are evaluated by a senior-coach-quality reviewer that proposes concrete fixes for the bot's own behaviour.


AI Approach

The entire system was built using Claude Code over many short sessions across roughly five weeks. The development pattern matters: this was an iterative personal product, not a contained project. The architecture emerged from the friction points of real workouts rather than from an up-front design document.

Architecture chosen by Claude Code (with operator review):

  • Telegram Bot API — chosen because the bot is fully usable from any phone the moment a user starts a chat, with no app-store distribution and no client app to maintain
  • Vercel serverless functions — api/webhook.ts (inbound Telegram messages), api/notify.ts (scheduled outbound), api/setup.ts (one-time configuration); a minimal webhook sketch follows this list
  • Supabase Postgres — schema for profile, goals, workouts, conversations, errors, pending check-ins; five migrations evolved as features were added
  • Claude API on Opus 4.6 (1M context) for reply generation, with vision capabilities for wearable screenshots
  • Vitest test suite — 41 tests covering pure functions for prompt building, injury guardrails, JSON extraction, and the webhook entry point
  • A daily-review skill (.claude/scheduled-tasks/fitness-bot-review/) that runs autonomously each day, reviews the previous 24 hours of sessions for rule violations, plan quality, and database write integrity, and writes its findings to a markdown report committed to the repo
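
Of the three functions, api/webhook.ts carries most of the behaviour. A minimal sketch of the inbound path, assuming Telegram's standard secret-token check and a hypothetical handleMessage helper (the real handler also manages conversation state):

```ts
// api/webhook.ts: sketch of the inbound path. handleMessage is a
// hypothetical helper; the real handler also manages conversation state.
import type { VercelRequest, VercelResponse } from "@vercel/node";

export default async function handler(req: VercelRequest, res: VercelResponse) {
  // Telegram only ever POSTs updates to the webhook.
  if (req.method !== "POST") return res.status(405).end();

  // Verify the secret token registered via setWebhook, so only Telegram
  // can reach this endpoint.
  if (req.headers["x-telegram-bot-api-secret-token"] !== process.env.TELEGRAM_SECRET) {
    return res.status(401).end();
  }

  const update = req.body; // Telegram Update object
  const chatId: number | undefined = update?.message?.chat?.id;
  const text: string | undefined = update?.message?.text;

  if (chatId && text) {
    await handleMessage(chatId, text); // session routing, Claude call, DB writes
  }

  // Acknowledge quickly; otherwise Telegram re-delivers the same update.
  return res.status(200).json({ ok: true });
}

async function handleMessage(chatId: number, text: string): Promise<void> {
  // ...build the prompt, call Claude, persist to Supabase, reply via sendMessage
}
```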

Key design decisions:

  • Started in Claude mobile app to validate the conversational design with two real workouts before any code was written. The discovery from those two workouts (need for persistence, vision, scheduled prompts, evaluation) shaped the architecture, not the other way round.
  • Claude Opus 4.6 selected for reply generation after Sonnet's tone occasionally drifted from the system prompt's no-emoji, no-markdown rule. The roughly 5× higher per-reply cost is negligible given the low message volume of a personal product.
  • Smart intent router for free-text messages outside an active session — classifies the message as workout / check-in / ad-hoc question and routes accordingly. Replaces an early dead-end where the bot would say "no active session" to anything outside its narrow expected flow.
  • Multi-user data model from day one — the user is keyed in every Supabase table even though only one user is enrolled today. The added schema cost is small; opening the bot to a second user later requires no migration.

What was not automated:

  • The Telegram bot creation and token issuance.
  • The Vercel project setup and domain attachment.
  • Real workouts — the source of every feedback signal that shaped the product.
  • The daily-review reading and triage. The bot writes reviews of itself; the operator reads each one and decides which findings to feed back as fixes.

Human Effort

This is where the iterative-product nature matters for honest accounting. Total active engagement was approximately 40 hours, but it was spread across roughly five calendar weeks in many short bursts — sometimes ten-minute fixes after a real workout exposed a bug, sometimes longer architectural sessions when a daily review surfaced a class of issues.

Active effort breakdown:

Phase | Active time | What happened
Initial design + two real-workout validations in Claude mobile | ~3 hours | Established the requirements that drove the rest
Telegram + Vercel + Supabase scaffolding | ~6 hours | Webhook, signatures, conversation state, five DB migrations
Claude API integration (chat + vision) and prompt engineering | ~8 hours | Including multiple system-prompt iterations
Coach domain logic (plan generation, injury rules, goal tracking) | ~7 hours | The slowest part; domain rules emerge slowly from real use
Smart intent router | ~2 hours | Replaced the "no active session" dead-end
Self-review skill + report generation | ~5 hours | The distinguishing feature of the whole project
Vitest suite | ~3 hours | 41 tests, written as the codebase stabilised
Feedback-driven improvements (variable session length, evening check-ins, additional injury types, rule code-vs-data split) | ~6 hours | Spread over weeks; each came from a real-world finding

Of the 77 user prompts in the main Claude Code session, 15 contained gotcha keywords (fix / wrong / broken / error). That density (roughly one in five) is much higher than in a contained build project, and characteristic of a multi-week product where each real-world use exposes fresh issues.


Traditional Benchmark

A Telegram bot of this scope, built without AI assistance, would require a single developer who is comfortable across the full stack:

Item | Estimate
Telegram bot wiring (webhook, signatures, conversation state) | 12–20 hours
Vercel + Supabase scaffolding, schema design, five migrations | 8–12 hours
Claude API integration (chat + vision) and prompt engineering | 20–30 hours
Coach domain logic (plan generation, injury rules, goal tracking) | 25–35 hours
Vision-from-wearable extraction (testing against real screenshots) | 8–12 hours
Scheduled notifications and check-in flow | 8–12 hours
Self-review skill + report generation | 12–20 hours
Test suite | 10–15 hours
Total | 103–156 hours
Cost (€60–100/hr freelance) | €6,200–15,600
Calendar time | 6–10 weeks

Mid-range planning value: ~120 hours over 8 weeks.


Acceleration Factor

Metric | Traditional (mid-range) | AI-assisted | Factor
Active human hours | ~120 | ~40 | 3×
Calendar time | 6–10 weeks | ~5 weeks (intermittent) | ~1.5×
Direct cost | ~€10,000 | ~€200 + time | ~50×

The 3× effort acceleration is more modest than Revalia Homes' 27×, and the reason is structural. This was an iterative product, not a contained project. Most of the time was not spent generating code — it was spent reacting to real-world feedback. Daily reviews caught rule violations. Real workouts surfaced UX dead-ends. Gym-style abbreviations the bot didn't recognise required new parser cases. AI accelerates code generation; it does not eliminate the inherent time cost of "use it for two weeks, see what's wrong, fix it, repeat." That cost is the price of building something good.

The interesting metric is not the acceleration factor at all. It is the self-review loop: how many bugs the daily-review skill caught that the operator would otherwise have lived with. Several, by direct count — including one day where a cardio-only session completed but did not write to the database, which would have appeared to "work" indefinitely without that check.
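
For illustration, the core of that integrity check can be sketched as a diff between sessions the conversation log marks completed and workout rows actually written. The table names follow the schema above; the columns and event marker are assumptions:

```ts
// Sketch of a write-integrity check over the last 24 hours. The
// "conversations" and "workouts" tables are from the schema above;
// the column names and event marker are assumptions.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

export async function findUnsavedSessions(userId: string, sinceIso: string) {
  // Sessions the conversation log marked as completed...
  const { data: completed } = await supabase
    .from("conversations")
    .select("session_id")
    .eq("user_id", userId)
    .eq("event", "session_completed")
    .gte("created_at", sinceIso);

  // ...versus workout rows actually persisted in the same window.
  const { data: saved } = await supabase
    .from("workouts")
    .select("session_id")
    .eq("user_id", userId)
    .gte("created_at", sinceIso);

  const savedIds = new Set((saved ?? []).map((w) => w.session_id));
  // Any completed session without a matching workout row is silent data loss.
  return (completed ?? []).filter((c) => !savedIds.has(c.session_id));
}
```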


Quality Assessment

The bot is in production, has been used for real workouts for several weeks, and has survived multiple feedback cycles where issues were caught by the daily-review skill before the operator noticed.

What met production standards:

  • 41-test Vitest suite passes. Pure-function coverage on coach prompt building, injury guardrails, JSON extraction; a representative guardrail test is sketched after this list.
  • Self-review loop is real. Every day, the bot's previous-day output is reviewed by a coach-quality reviewer, and the review file is committed to the repo.
  • Vision path handles real wearable screenshots and extracts heart-rate data reliably enough for use.
  • Smart intent router replaced a dead-end UX with a classifier that handles free-text outside an active session.
  • Multi-user data model means the bot can be opened to other users without a schema rewrite.
  • Reply tone is consistently plain text — no emoji, no markdown — as required by the system prompt after this rule was promoted from "guideline" to enforced.
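
For a flavour of the suite, here is a representative guardrail test. The violatesInjuryRules import and rule shape are illustrative assumptions, not the project's actual API:

```ts
// Representative guardrail test in the style of the suite. The
// violatesInjuryRules import path and rule shape are assumptions.
import { describe, expect, it } from "vitest";
import { violatesInjuryRules } from "../src/coach/guardrails";

describe("injury guardrails", () => {
  const rules = { excludedExercises: ["overhead press", "upright row"] };

  it("rejects a plan containing an excluded exercise", () => {
    const plan = ["goblet squat", "overhead press", "plank"];
    expect(violatesInjuryRules(plan, rules)).toBe(true);
  });

  it("accepts a plan that avoids every exclusion", () => {
    const plan = ["goblet squat", "seated row", "plank"];
    expect(violatesInjuryRules(plan, rules)).toBe(false);
  });
});
```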

What a senior coach with bespoke software would do better:

  • Periodisation across multi-month training blocks. The current implementation handles per-quarter goals but does not vary intensity across weeks within a block.
  • Recovery-aware programming based on subjective wellness ratings beyond a single evening check-in.
  • Exercise substitution from a curated library, rather than relying on Claude to propose alternatives at session time.

Open quality items:

  • Variable session length is a recent addition; needs more real-workout data to validate against goal progression.
  • Evening check-in flow ("anything besides sport that created physical stress today") is recent; the prompt tuning to make this feel like a friendly question and not an interrogation is ongoing.

Gotchas & Limitations

Every project has friction. Documenting it honestly is the point of this framework.

1. The opening message broke its own rules. The system prompt explicitly said no markdown, no emojis. The very first message of every plan — the highest-stakes message because it sets the workout — ignored this and used bold text and tables. The first daily review caught it immediately. The fix was simple; the lesson is that in a personal product, the first message is the worst place to fail, because it's the user's first impression on every single workout.
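
The durable fix for this class of failure is to enforce the rule in code after generation rather than only stating it in the prompt. A hypothetical enforcement pass (the project's exact mechanism may differ):

```ts
// Hypothetical enforcement pass: the prompt asks for plain text, and
// code guarantees it before the reply reaches Telegram. The stripping
// rules here are illustrative, not the project's exact implementation.
export function enforcePlainText(reply: string): string {
  return reply
    .replace(/\*\*|__|`/g, "")                    // bold / underline / code markers
    .replace(/^#{1,6}\s+/gm, "")                  // markdown heading prefixes
    .replace(/^\s*\|.*\|\s*$/gm, "")              // markdown table rows
    .replace(/\p{Extended_Pictographic}/gu, "");  // emoji
}
```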

2. Rule split: code vs. system prompt. Some user-specific constraints (the injury exclusion list) were buried in code; others were in the system prompt. When a review caught the bot recommending an exercise that violated the constraint, the fix had to be made in both places. The eventual decision was to move user-specific rules into per-user data so the prompt is generic and the constraints are runtime config. This is the right pattern for a multi-user-ready product.
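
A sketch of that pattern, under assumed table and column names: user-specific constraints live as rows, and a single loader feeds both the prompt and the guardrail:

```ts
// Sketch of the rules-as-data pattern. The user_rules table and its
// columns are assumed names; the point is that constraints are rows,
// not prompt text or hard-coded lists.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

export async function loadExcludedExercises(userId: string): Promise<string[]> {
  const { data } = await supabase
    .from("user_rules")
    .select("kind, value")
    .eq("user_id", userId);

  return (data ?? [])
    .filter((r) => r.kind === "exclude_exercise")
    .map((r) => r.value);
}

// The same list feeds both sides: injected into the (otherwise generic)
// system prompt, and checked by the code-level guardrail, so prompt and
// code can no longer drift apart.
export function constraintBlock(excluded: string[]): string {
  return excluded.length
    ? `Never program these exercises for this user: ${excluded.join(", ")}.`
    : "";
}
```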

3. The daily review caught a workout that was never saved. On one cardio-only day in mid-April, the bot completed a session but did not write the workout to the database. It was detected only by the daily review's database-integrity check. Without that check, the bot would have appeared to work while silently losing data for who knows how many weeks. Self-review is non-optional for a personal-data product.

4. The "no active session" dead-end. For the first few weeks the bot replied "no active session" to free-text outside a session window. Real-world test: ad-hoc messages got nothing useful. Replaced with a smart intent router that classifies the message and either routes it to the right handler or asks a clarifying question.

5. Real personal data in production complicates testing. Real fitness data lives in the same database the bot writes to. Testing changes meant either polluting the operator's own training history or risking destructive migrations. Resolved by adding a dry-run mode and a separate test-user fixture, but the temptation to "just deploy and see" had to be resisted explicitly.
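
One way to implement that guard, assuming an environment flag and a dedicated fixture user (all names illustrative):

```ts
// Sketch of the dry-run guard. DRY_RUN, TEST_MODE, and TEST_USER_ID are
// assumed environment flags; the real resolution may differ.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

export async function saveWorkout(userId: string, workout: Record<string, unknown>) {
  if (process.env.DRY_RUN === "1") {
    // Log the intended write instead of executing it against production.
    console.log("[dry-run] would insert workout", { userId, workout });
    return;
  }

  // In test mode, route writes at a dedicated fixture user so the
  // operator's real training history stays untouched.
  const targetUser = process.env.TEST_MODE === "1" ? process.env.TEST_USER_ID! : userId;

  const { error } = await supabase.from("workouts").insert({ ...workout, user_id: targetUser });
  if (error) throw new Error(`workout insert failed: ${error.message}`);
}
```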

6. Cost of context. The cache-heavy pattern of an iterative product is the right one — it amortises the system prompt across sessions — but the absolute cache-read token count over five weeks is a planning input for any similar project. Roughly 359 million cache-read tokens accumulated, mostly from the system prompt being re-read on every Telegram message.
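
The underlying mechanism is Anthropic prompt caching: mark the long, stable system prompt as cacheable so each message re-reads it rather than re-processing it. A sketch, with the model id left to configuration:

```ts
// Sketch of the cache-heavy call pattern using Anthropic prompt caching:
// the long, stable system prompt is marked cacheable, so every Telegram
// message re-reads it from cache instead of re-processing it.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

export async function coachReply(
  systemPrompt: string,
  history: Anthropic.Messages.MessageParam[],
) {
  return anthropic.messages.create({
    model: process.env.CLAUDE_MODEL!,
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" }, // cached across messages
      },
    ],
    messages: history,
  });
}
```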


Replicability Score

4 out of 5.

The Telegram + Vercel + Supabase + Claude API pattern is widely applicable for any personal-product LLM workflow. The specific elements that could be lifted by another builder:

  • The Vercel function structure (api/webhook.ts + api/notify.ts + api/setup.ts).
  • The Supabase migration set (multi-user from day one).
  • The scheduled-task self-review pattern, including the report-file format and the daily-review-into-fix loop.
  • The vision-from-screenshot integration.
  • The smart intent router pattern for handling free-text outside an active conversation.

What does not transfer cleanly:

  • The injury-rule set is fitness-specific.
  • The system prompt is sport-specific.
  • The Telegram-as-frontend choice depends on the audience already being on Telegram. (For a Western European audience, this is a real constraint — Telegram is not the default messenger.)

A developer wanting to build a personal nutrition coach, a study coach, a personal CFO, or any other "agent that knows you and tracks your progress" could lift 60–70% of the architecture and rewrite the domain logic. The replicability score reflects this honestly: high pattern reuse, low domain reuse.


Verdict

This project demonstrates a category that the earlier Kodulabor case studies did not cover: an iterative personal product built by its own user, refined over weeks of real use, with a self-review loop that catches problems before the user does.

The 3× effort-acceleration is real but not the headline. The headline is this: a single individual built and now operates a multi-user-ready Telegram coach with vision input, scheduled notifications, and an automated quality-review loop, in roughly 40 hours of total time, at ~€200 in API costs. A traditional build of the same scope would have taken at least six weeks and several thousand euros, and would not have built the self-review loop — because nobody would have asked for it explicitly.

The self-review loop is the most transferable lesson. It is cheap to build, it surfaces problems the operator does not see in normal use, and it pays for itself the first time it catches a silent data-write bug. Any AI-assisted product that handles personal data should have one.

For someone considering a similar build: the iterative pattern (use it → review it → fix it → repeat) is the work. AI accelerates the code; it does not replace the discipline of using your own product on yourself. The discipline is the product.


This case study was produced using the Kodulabor Assessment Framework. Methodology and findings published openly at kodulabor.ai.


Data Appendix

Metric | Value
Calendar window | ~5 weeks (intermittent)
Estimated active effort | ~40 hours
User prompts in main session | 77
Assistant messages | 1,134
Bash commands | 190
Files created | 36
Files edited | 21
Output tokens | ~344,000
Cache-read tokens | ~359,000,000
Cache-create tokens | ~21,000,000
Estimated direct cost | ~€200 (Opus reply generation, cache-heavy pattern)
Deployment | Vercel serverless, production
Backend services | Telegram Bot API, Supabase (five migrations), Vercel
Test suite | 41 tests (Vitest), all passing at session close
Self-review loop | .claude/scheduled-tasks/fitness-bot-review/ (daily)
Daily review files published | ~9 across the project window