ContentForge AI
An agentic system that drafts content in your voice, not the model's. Built because every generic AI writer produces the same averaged blog post.
I write fairly often. Books, essays, the notes section of this site. Every commercial AI writer I tried frustrated me the same way: the draft was technically correct, fluent, and unmistakably not me. Run any of them at scale and the output is indistinguishable from everyone else's. The exact opposite of what writers actually want.
ContentForge is what I built when I got tired of complaining about it.
How it works
The system inverts the usual prompt-and-pray flow. Before drafting anything, it builds a fingerprint of the writer's voice from their existing corpus. Diction, sentence rhythm, paragraph length distribution, the metaphors they reach for under pressure, where they tend to break a thought. Some of this is straightforward statistical feature extraction. Some of it is LLM-extracted rules ("Abel almost never uses semicolons. He uses fragments. Often."). The fingerprint is just a structured document the drafting agent has to operate under.
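The statistical half of the fingerprint can be sketched in a few lines. This is an illustrative sketch, not ContentForge's actual feature set; the function name, the feature names, and the fragment heuristic (four words or fewer) are all assumptions:

```python
import re
from statistics import mean, stdev

def voice_fingerprint(corpus: str) -> dict:
    """Extract simple statistical voice features from a writer's corpus.

    Illustrative only: real feature sets would be richer (syntax, diction,
    metaphor patterns), and the fragment heuristic here is a guess.
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", corpus) if s.strip()]
    paragraphs = [p for p in corpus.split("\n\n") if p.strip()]
    sent_lens = [len(s.split()) for s in sentences]
    return {
        # central tendency and spread of sentence length (rhythm)
        "avg_sentence_len": mean(sent_lens),
        "sentence_len_stdev": stdev(sent_lens) if len(sent_lens) > 1 else 0.0,
        # paragraph density
        "avg_sents_per_paragraph": len(sentences) / max(len(paragraphs), 1),
        # punctuation habits ("almost never uses semicolons")
        "semicolon_rate": corpus.count(";") / max(len(sentences), 1),
        # crude proxy for fragment use: very short "sentences"
        "fragment_rate": sum(1 for n in sent_lens if n <= 4) / len(sent_lens),
    }
```

The LLM-extracted rules would sit alongside these numbers in the same structured document, so the drafting agent gets both hard metrics and soft guidance.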
Then the agent loop runs: research, outline, draft, self-critique against the fingerprint, revise. If the draft drifts too far from the voice metrics, it gets rejected and the loop iterates again.
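The control flow of that loop is simple enough to sketch. Everything here is hypothetical: the function names, the 0-to-1 voice score, the threshold, and the iteration cap are assumptions, and the real research and outline stages are folded into `draft_fn` for brevity:

```python
def agent_loop(topic, fingerprint, draft_fn, critique_fn, revise_fn,
               threshold=0.8, max_iters=5):
    """Draft, self-critique against the fingerprint, revise until the
    voice score clears the threshold. Illustrative control flow only.

    critique_fn returns (score, notes): a 0..1 voice-match score and
    revision notes the reviser can act on.
    """
    draft = draft_fn(topic, fingerprint)  # research + outline + first draft
    for _ in range(max_iters):
        score, notes = critique_fn(draft, fingerprint)
        if score >= threshold:
            return draft, score  # close enough to the voice: accept
        draft = revise_fn(draft, notes, fingerprint)  # rejected: iterate
    return draft, score  # best effort after max_iters
```

The rejection branch is the whole point: a draft that drifts from the voice metrics never leaves the loop unrevised.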
What surprised me
Most agents don't need a vector DB. Stop putting one in the diagram. ContentForge does need one, but for a specific reason: paragraph-level retrieval over the writer's past work, so the drafting agent can pull stylistic examples that match the topic at hand. That's a real use case for embeddings. Most chatbots don't have one.
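The retrieval step itself is ordinary nearest-neighbor search over paragraph embeddings. A minimal sketch, assuming an index of `(paragraph_text, embedding)` pairs built offline from the writer's past work and a topic embedding from the same model; all names are hypothetical, and a real system would use a proper vector store rather than a linear scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_style_examples(topic_vec, paragraph_index, k=3):
    """Return the k paragraphs from the writer's corpus whose embeddings
    best match the topic vector, to feed the drafting agent as
    in-context stylistic examples. Linear scan for illustration.
    """
    ranked = sorted(paragraph_index,
                    key=lambda pair: cosine(topic_vec, pair[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

The payoff is that the agent sees how the writer actually handled nearby topics, not just abstract rules about their style.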
The hardest part wasn't the agent loop. It was the eval gate. How do you measure "this sounds like me"? I tried a few approaches and landed on a hybrid: a small set of objective metrics (sentence length, lexical overlap with the corpus, syntactic patterns) plus a model-graded judgment against a held-out sample of real writing. Neither alone was enough. Together they're roughly good enough.
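The gate's hybrid structure can be sketched as two conditions ANDed together: objective metrics within tolerance of the corpus baseline, plus a model-graded score above a threshold. The tolerance, threshold, and metric names below are assumptions, not the post's actual values:

```python
def eval_gate(draft_metrics, corpus_metrics, judge_score,
              tolerance=0.25, judge_threshold=0.7):
    """Hybrid eval gate: pass only if every objective metric is within
    `tolerance` (relative) of the corpus baseline AND the model-graded
    judge score clears its threshold. Thresholds are illustrative.

    Returns (passed, reason).
    """
    for key, baseline in corpus_metrics.items():
        observed = draft_metrics.get(key, 0.0)
        if baseline and abs(observed - baseline) / baseline > tolerance:
            return False, f"{key} off baseline: {observed:.2f} vs {baseline:.2f}"
    if judge_score < judge_threshold:
        return False, f"judge score {judge_score:.2f} below {judge_threshold}"
    return True, "pass"
```

Requiring both halves is the point: the objective metrics catch drafts the judge model is too charitable about, and the judge catches drafts that game the numbers.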
What's still rough
The fingerprint takes a meaningful corpus to be useful. Below maybe 30,000 words of source material, the system mostly produces a generic-but-slightly-warmer-than-default voice. With 100,000+ words it starts to feel uncanny in the right way.
If I rebuilt it today I'd spend more time on the eval gate. The drafting loop is fine; the bottleneck is knowing whether the output is good. That's the lesson I keep relearning across every AI project.