#003 · March 19, 2026

Building the Kodulabor Website — Why AI Needs Rigor and Taste

  • Total time: ~4 hours
  • Acceleration: 12–18×
  • Lines generated: 6,381
  • Rework ratio: 25.2%
  • Corrections: 5 rounds
  • AI review rounds: 2
From brief to bilingual production website
6,381 lines · 9 commits · 5 correction rounds
01 Source
Project brief + case study documents
Positioning, naming, framework definition
Two existing case studies as content
Resume, photo, internal announcement
02 AI Processing
[AI]
Subagents scaffold site + generate content
Next.js + i18n + middleware in ~30 min
Translation files, page components, data layer
Technical infrastructure: 0 errors
03 Human Review
[HUMAN]
Caught 4 major fabrication incidents
Case studies: entirely invented content
About page: wrong voice, fabricated details
Methodology: wrong framework, positioning contradiction
04 AI Review
[AI]
Separate model critiques strategically
Fresh context, no accumulated blind spots
Identified positioning self-sabotage
2 rounds of structured feedback
05 Human Decision
[HUMAN]
Taste as final filter
Accepted, modified, or rejected each suggestion
Chose compression level, tone, framing
25.2% of initial output replaced
Impact
  • Time: ~4 hours vs 1–2 weeks
  • Cost: API costs vs €3,000+
  • Effort: 30 min build + 2.5 h corrections

Problem

Kodulabor needed a website. Not a placeholder — a credible launch vehicle for a new business line, timed to a public LinkedIn announcement. The site needed to establish positioning, present the assessment framework, host two published case studies in full, support English and Estonian localization across two domains (kodulabor.ai and kodulabor.ee), and be deployable on Vercel from a GitHub repository.

The scope was clear: home page, about page, methodology page, case studies section with detail pages, contact page. Bilingual. Markdown rendering for full-length case studies. Domain-based locale routing middleware. All in a single working session.
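Domain-based locale routing of this kind reduces to a host-to-locale mapping applied in middleware. A minimal sketch, assuming the .ee domain serves Estonian and the .ai domain English (the function name and the mapping are illustrative, not taken from the actual repository):

```typescript
// Hypothetical host → locale mapping, as Next.js middleware might use it.
// Assumption: kodulabor.ee serves Estonian, anything else defaults to English.
type Locale = "en" | "et";

export function localeForHost(host: string): Locale {
  // Strip an optional port (e.g. localhost-style hosts) before matching.
  const bareHost = host.replace(/:\d+$/, "");
  return bareHost.endsWith("kodulabor.ee") ? "et" : "en";
}

// In middleware.ts this could drive a rewrite such as /about → /et/about:
// export function middleware(req: NextRequest) {
//   const locale = localeForHost(req.headers.get("host") ?? "");
//   return NextResponse.rewrite(
//     new URL(`/${locale}${req.nextUrl.pathname}`, req.url)
//   );
// }
```

Keeping the mapping a pure function makes it testable without spinning up the middleware runtime.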

This sounds like a straightforward AI-assisted web development project. It was not. The technical build was fast. The content quality was a disaster that required five correction rounds.


AI Approach

The website was built using Claude Cowork — Anthropic's desktop AI assistant — running on Claude Opus. The development method combined direct prompting for architecture decisions with delegated subagents for bulk content generation and code scaffolding.

Technical stack: Next.js 16 (App Router), TypeScript, Tailwind v4, react-markdown, deployed to Vercel.

AI architecture — and the root cause of problems:

The session used a parent-child agent pattern. I (the parent Claude session) handled architecture, decisions, and quality review. Subagents were delegated to handle:

  • Creating all translation files (English and Estonian)
  • Building all Next.js page components
  • Generating case study data

The subagents worked fast. They also worked without the full context. The parent session had the brief, the case study documents, the resume, the positioning decisions from the conversation — but the subagents received only their task descriptions. They filled in gaps by fabricating.


Human Effort

Session duration: ~4 hours (13:00–17:00 UTC, March 19, 2026)

Commits: 9

Total lines of code generated: 6,381

Prompt count: ~25 human messages

Effort breakdown by phase:

Phase | Time | Activity
Concept and brief | ~45 min | Naming, positioning, brief document, planning
Initial build | ~30 min | Next.js scaffolding, all pages, i18n, middleware
First correction: case study content | ~20 min | Discovered fabricated content, replaced with originals
Second correction: About page | ~15 min | Third-person voice, fabricated career details
Third correction: Methodology | ~20 min | Wrong framework (process steps, not dimensions)
Fourth correction: Full rebuild | ~40 min | Repositioning based on independent feedback
Fifth correction: Sharpening | ~30 min | Consulting contradiction, bio compression, pain-led copy
Total | ~4 hours |

The revealing ratio: Of the ~4 hours, roughly 30 minutes was productive AI-assisted building. Roughly 2.5 hours was catching and correcting AI-generated content that was plausible but wrong.


Traditional Benchmark

Building this website without AI:

Item | Estimate
Design & frontend (Next.js, Tailwind, responsive) | 20–30 hours
Content writing (5 pages × 2 languages) | 15–20 hours
Case study integration (markdown rendering, data layer) | 8–12 hours
i18n & middleware (locale routing, domain mapping) | 5–8 hours
Total | 48–70 hours
Calendar time | 1–2 weeks

Acceleration Factor

Metric | Traditional | AI-assisted | Factor
Wall clock time | 1–2 weeks | 1 afternoon (~4 hours) | ~20×
Human effort (total) | 48–70 hours | ~4 hours | 12–18×
Human effort (productive) | 48–70 hours | ~1.5 hours | 32–47×
Human effort (corrections) | 0 hours | ~2.5 hours | N/A

The acceleration is real but misleading if you only count productive time. The actual experience was: 30 minutes of impressive generation, then 2.5 hours of quality control. The acceleration factor on the technical build is extraordinary (the Next.js scaffolding, middleware, and page structure appeared in minutes). The acceleration factor on content that required judgment was much lower — and in some cases negative, because fixing plausible-but-wrong content is harder than writing it from scratch.


Quality Assessment

What was generated correctly on first attempt

The technical infrastructure was flawless:

  • Next.js project structure with App Router, TypeScript, proper configs
  • i18n system with [locale] dynamic segments and translation loading
  • Domain-based locale routing middleware
  • Markdown rendering component with styled table, code, and heading support
  • Static generation with generateStaticParams for all routes
  • Build passed on every attempt

Technical verdict: 9/10. The AI excels at well-trodden infrastructure patterns.
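The static-generation piece follows the standard App Router pattern: enumerate every locale × slug pair up front so each case study page can be pre-rendered. A minimal sketch under assumed names (the slugs and file path are illustrative, not from the actual repo):

```typescript
// Sketch of the generateStaticParams pattern: pre-compute every
// { locale, slug } pair so case-study routes render statically.
// The locale list matches the site (en/et); the slugs are placeholders.
const locales = ["en", "et"] as const;

export function staticParamsFor(slugs: string[]) {
  return locales.flatMap((locale) => slugs.map((slug) => ({ locale, slug })));
}

// In app/[locale]/case-studies/[slug]/page.tsx this would be wired as:
// export function generateStaticParams() {
//   return staticParamsFor(["revalia-homes", "automated-case-studies"]);
// }
```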

What was fabricated and had to be replaced

The content was a catastrophe:

1. Case study content — entirely fabricated (Commit 4)

The subagent was told to create case study data. Instead of using the actual case study documents we had already written, it invented completely different content. The Revalia Homes study became a fictional story about "copywriting automation" with made-up metrics ("83% time reduction across 12 sessions, $0.24 per session"). The automated case studies became a story about "CSV upload pipelines." Both were plausible, well-structured, and entirely wrong.

Lines replaced: 423

2. About page — wrong voice, fabricated details (Commit 5)

The subagent wrote the About page in third person ("He") despite the instruction to use first person. It also fabricated career details. "Skype (7 years)" became "seven years" (it was 5). "Joined at 20 employees" was correct but surrounded by invented narrative. The claimed "400+ engineers in Estonia known by first name" appeared despite not being in the brief or resume.

Lines replaced: 38

3. Methodology — wrong framework entirely (Commit 7)

The brief defined a 9-dimension assessment framework (Problem, AI Approach, Human Effort, Traditional Benchmark, Acceleration Factor, Quality Assessment, Gotchas, Replicability, Verdict). The subagent invented a 9-step process (Project intake, Baseline measurement, AI integration design...) that sounded professional but didn't match anything in the brief or the actual case studies. The methodology page and the case studies described completely different things.

Lines replaced: 65

4. Positioning — "Not a consultancy" self-sabotage (Commits 8–9)

The initial content included the line "Not a consultancy. Not an agency." — memorable copy that actively contradicted the business model. The contact page described paid consulting engagements while the home page rejected the category. It took external feedback to identify this as a structural problem, not a style issue.

Lines replaced: 206 across two correction rounds

Content verdict: 3/10. The AI produces content that reads well but says the wrong things. It is fluent without being accurate. It fills gaps in its context by pattern-matching against similar content it has seen, producing plausible fabrications that require expertise to catch.


Gotchas & Limitations

1. Subagent context loss is the root cause

The parent session had 25+ messages of accumulated context: the brief, the positioning decisions ("kitchen table, not government"), the naming ontology, the case study documents, the resume, the photo, the internal Bolt announcement. When a subagent was delegated a task like "create the translation files," it received a summary of this context — not the full context. Every fabrication traces back to the subagent filling in what it didn't know.

Lesson: Delegating to subagents without passing the source documents is like briefing a junior copywriter verbally and expecting them to get the details right. They won't. They'll write something that sounds right.
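One way to make that lesson mechanical is to refuse delegation unless the source documents travel with the task. A hypothetical sketch — the task shape and names are mine, not Cowork's API:

```typescript
// Hypothetical subagent task payload. The point: attach the actual
// source documents verbatim, never just a summary of them.
interface SourceDocument {
  name: string;
  content: string; // full text, not a paraphrase
}

interface SubagentTask {
  instruction: string;
  sourceDocuments: SourceDocument[];
}

export function makeTask(
  instruction: string,
  docs: SourceDocument[]
): SubagentTask {
  // Guard rail: content work with zero attached sources invites fabrication.
  if (docs.length === 0) {
    throw new Error("Refusing to delegate content work without source documents");
  }
  return { instruction, sourceDocuments: docs };
}
```

The guard does not prevent fabrication, but it removes the most common cause: a subagent that never saw the originals.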

2. Plausible fabrication is worse than obvious failure

When AI generates code that doesn't compile, you catch it immediately. When AI generates content that is well-written, properly formatted, internally consistent, and factually wrong — you might not catch it until someone else reads it. The fabricated case study content could have been published. It read fine. It just wasn't true.

Lesson: Content review cannot be skipped, even when the output looks professional. Especially when it looks professional.

3. Positioning requires taste, not generation

The AI was asked to write website copy. It produced copy that was clear, well-structured, and strategically incoherent. "Not a consultancy" on one page, "paid consulting engagements" on another. No amount of prompt engineering fixes this, because the problem isn't generation quality — it's strategic judgment. The AI doesn't know whether you should lean into or away from the consulting category. That's a human decision.

Lesson: AI can draft positioning. It cannot decide positioning. The human must own the strategic frame.

4. The most valuable feedback came from another AI model — used differently

The session quality improved dramatically after I fed the site content and the original brief into a separate AI model (outside this Cowork session) and asked it to critique the site as a positioning reviewer. That model, working with fresh context and no accumulated blind spots, produced two rounds of structured feedback that identified the positioning contradiction, the biography-heavy About page, the methodology mismatch, the missing conversion path, and the consulting self-sabotage.

This is the most instructive part of the whole project. The AI that built the site could not see its own strategic errors. A different AI instance, given the right framing ("critique this as a launch vehicle for a new business line"), caught them immediately. The problem was never AI capability — it was context contamination. The building session had accumulated so many incremental decisions that it lost the ability to evaluate the whole.

Lesson: AI reviewing AI works — but only when the reviewer has clean context and a different role. Using the same session to both build and critique produces blind spots. The review model saw the "Not a consultancy" contradiction instantly because it wasn't the one who wrote it. Separation of concerns applies to AI workflows, not just code architecture.

5. Human taste was still the final filter

Even with AI-on-AI review, the human made the final calls. Which feedback to accept ("remove 'Not a consultancy'" — yes), which to modify ("shorten the bio by 30–40%" — yes, but I chose the compression level), and which to reject or defer. The AI reviewer suggested specific copy; I used the direction but not the exact words. Taste — knowing what sounds like you, what matches the tone, what your audience will believe — is still a human function.

Lesson: The best workflow was: AI builds fast → different AI critiques strategically → human decides what's true. Three layers, not two.

6. The 25% rework ratio

Of the 2,898 lines in the initial build, 732 lines (25.2%) were deleted in subsequent correction commits. That means one in four lines generated by subagents was wrong enough to require replacement. On a small project, this is manageable. On a large project, a 25% fabrication rate would be disastrous.
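The ratio is plain commit arithmetic; the per-correction deletion counts come from the commit history described in this case study:

```typescript
// Rework ratio: lines deleted across the five correction commits,
// divided by the size of the initial build.
const initialBuildLines = 2898;
const deletedPerCorrection = [423, 38, 65, 161, 45]; // corrections 1–5

const deleted = deletedPerCorrection.reduce((a, b) => a + b, 0); // 732
export const reworkRatio = deleted / initialBuildLines; // ≈ 0.2526
```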


Replicability Score

3 out of 5

The technical pattern (Next.js + i18n + Tailwind + Vercel) is highly replicable. The content failures would reproduce identically for anyone using the same subagent delegation pattern without full context passing. The correction process required:

  • An experienced engineer who noticed the fabrications
  • Actual source documents to compare against
  • A separate AI model used as a strategic reviewer (not the same session that built the site)
  • Human taste as the final filter on what feedback to accept

The three-layer workflow (AI builds → different AI critiques → human decides) is replicable. The specific judgment calls are not. Someone without the experience to evaluate the AI reviewer's suggestions would either accept everything (overcorrection) or nothing (wasted feedback).


Verdict

This project is the most instructive Kodulabor case study so far, because it shows where AI assistance fails — and the failure mode is insidious rather than obvious.

The technical build was genuinely impressive. A complete bilingual Next.js website with 17 pre-rendered routes, domain-based middleware, markdown rendering, and proper static generation — scaffolded in under 30 minutes. No human developer matches that speed on infrastructure.

The content was genuinely bad. Not obviously bad — subtly bad. Fabricated metrics, wrong voice, mismatched framework, contradictory positioning. Each piece of content read well in isolation. The problems only emerged when you compared the output against source documents, checked it against the brief, or asked someone with strategic judgment to review it.

The key insight: AI acceleration is real on the structural and technical layers. It is dangerous on the content and strategic layers — not because the AI is slow, but because it is confidently wrong. The 12–18x overall acceleration is real, but it hides a split: infrastructure was ~50x faster, content was ~2x faster after corrections, and strategic positioning required purely human judgment.

The most surprising finding: the best reviewer was also an AI — just a different one. Feeding the site content and the original brief into a separate AI model for critique produced sharper, more actionable feedback than the building session could generate internally. The building session had context blindness; the review session had fresh eyes. This suggests that AI-assisted projects should build separation of concerns into their workflow: one AI builds, a different AI reviews, and a human makes the final calls.

The practical takeaways for Kodulabor projects: never delegate content to subagents without passing the full source documents. Always review content against originals. Use a separate AI instance for strategic review. And keep the human in the loop as the taste layer — the one who decides what's true, what sounds right, and what the audience will actually believe.

The AI builds the house fast. A different AI checks if it's the right house. The human decides whether to live in it.


This case study was written during the same Cowork session it describes. The data comes from git commit history, line counts, and timestamps. The strategic feedback came from a separate AI model given the brief and site content for independent review. The irony of the whole thing is not lost on me.


Data Appendix

Metric | Value
Session date | March 19, 2026
Session duration | ~4 hours
Total commits | 9
Total lines generated | 6,381
Total lines deleted (corrections) | 732
Rework ratio | 25.2%
Correction rounds | 5
Human prompts | ~25
Subagent delegations | 6
Fabrication incidents | 4 (case studies, about, methodology, positioning)
External AI review rounds | 2 (separate model, fresh context)
Build failures | 0
Technical infrastructure errors | 0
Content/strategic errors | 4 major, multiple minor
Final build | 17 pre-rendered routes, 2 languages
Stack | Next.js 16, TypeScript, Tailwind v4, react-markdown
Deployment target | Vercel (kodulabor.ai / kodulabor.ee)

Correction Timeline

13:51  Initial commit — 2,898 lines, all pages, both languages
  ✓ Infrastructure: perfect
  ✗ Case study content: fabricated
  ✗ About page: third-person, fabricated details
  ✗ Methodology: wrong framework
  ✗ Positioning: "Not a consultancy"
13:54  Middleware — working correctly
13:59  README/CLAUDE.md — working correctly
14:12  CORRECTION 1: Case study content replaced (423 lines deleted, real documents inserted)
14:22  CORRECTION 2: About page rewritten (38 lines deleted, first-person voice, real resume data)
14:52  CORRECTION 3: Methodology rewritten (65 lines deleted, framework dimensions replace process steps)
15:06  CORRECTION 4: Major rebuild from external feedback (161 lines deleted, positioning + structure overhaul)
15:46  CORRECTION 5: Sharpening from second feedback round (45 lines deleted, contradiction fixed, copy tightened)