My first RAG, part 2: I built generation. One endpoint sang. The other lied about people needing the bathroom.

In part 1 I built the retrieval half of a RAG pipeline. That article was a polished retrospective: I waited until things worked, then wrote it up. This one is different. I’m still in the middle of generation work, things are bending and unbending, and the lessons feel fresher when I write them while the git diff is open.

A note on the equation I’m solving against: quality, cost, UX honesty, and how easy the result is to maintain later all pull at each other. Some of the choices below were cost-driven (LLM bills are not free, especially when iterating). Some were quality-driven (the model getting the wrong end of the stick about what a video actually was). Some were UX-driven (one endpoint should not pretend it answers every shape of question). A few were just future-me being grumpy at past-me. I’ll flag which factor is winning each decision as we go.

I started with /summarize because it felt like the easier one

The plan from part 1’s “what’s next” was to build /rag/answer — embed the question, retrieve top-k chunks, prompt the LLM with those chunks. The textbook RAG loop. But I had a feeling broad questions (“what’s this video about?”) wouldn’t survive that loop, so I built /rag/summarize first as the lower-risk warm-up.

The shape: no retrieval at all. Pull every chunk for the video out of Postgres, paste the whole transcript into the prompt, ask for either an "overview" (one sentence + 3-7 topic bullets) or a "detailed" chronological walkthrough.

Tested on a Mandarin language-learning video. First call worked. Overview correctly named it as a series of conversational phrases, detailed listed every topic with timestamps. I was a little suspicious of how easy that was.

The catch came at the next thing I checked: the bill.

A 30-minute video transcript is ~15-20k input tokens. Each language in ["en", "source"] runs as its own LLM call (in parallel: asyncio.gather keeps wall-clock down even though the dollar cost doubles). So a single Summarize click sends ~30-40k input tokens to the LLM. That’s small money, but the cost compounds quickly if you let users hit the same button repeatedly.
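
A minimal sketch of that fan-out, assuming a hypothetical call_llm coroutine standing in for whatever LLM client is actually used (here it only sleeps and echoes). It shows why wall-clock stays at roughly one call while the token bill scales with the number of languages.

```python
import asyncio

# Hypothetical stand-in for a real LLM client call; it only echoes its inputs.
async def call_llm(prompt: str, language: str) -> str:
    await asyncio.sleep(0.01)  # simulates network latency
    return f"[{language}] summary of {len(prompt)} chars"

async def summarize(transcript: str, languages: list[str]) -> dict[str, str]:
    # One independent call per language. gather runs them concurrently and
    # returns results in the same order as the inputs.
    results = await asyncio.gather(
        *(call_llm(transcript, lang) for lang in languages)
    )
    return dict(zip(languages, results))

def input_tokens_per_click(transcript_tokens: int, n_languages: int) -> int:
    # Every language call re-sends the full transcript.
    return transcript_tokens * n_languages

summaries = asyncio.run(summarize("full transcript text", ["en", "source"]))
```

With a 20k-token transcript and two languages, input_tokens_per_click(20_000, 2) gives the 40k upper end of the estimate.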

So I added a cache. New summaries table keyed on (source_id, detail, model, languages) — a JSONB column with the LLM outputs, a created_at timestamp, a unique constraint on the key. The endpoint checks cache first; on hit it returns immediately and flags cached: true in the response.

I added cached: true/false to the response almost as an afterthought and now I look at it on every test. The frontend shows “served from cache · 0 LLM calls” in the footer of the result. Cheap transparency, surprisingly load-bearing for trust. It also makes the “free repeat” math obvious to the user.

One small but load-bearing detail I want to flag: I put languages in the cache key, not just source_id. If a caller switches from ["en"] to ["en", "source"], that’s a cache miss and we generate the new entry. If I’d ignored languages in the key, switching would silently return a wrong-shape response. Schema keys are not the place to be clever about deduplication.
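
To make the key shape concrete, here is a sketch that uses an in-memory dict as a stand-in for the Postgres summaries table. The function names and the generate callback are illustrative assumptions, not the real endpoint code.

```python
from typing import Callable

# In-memory stand-in for the summaries table; the real store is Postgres.
_cache: dict[tuple, dict] = {}

def cache_key(source_id: str, detail: str, model: str, languages: list[str]) -> tuple:
    # Languages are part of the key, so ["en"] and ["en", "source"]
    # never share an entry.
    return (source_id, detail, model, tuple(languages))

def get_or_generate(
    source_id: str,
    detail: str,
    model: str,
    languages: list[str],
    generate: Callable[[], dict],
) -> dict:
    key = cache_key(source_id, detail, model, languages)
    if key in _cache:
        return {"summaries": _cache[key], "cached": True}
    _cache[key] = generate()  # the only place an LLM call would happen
    return {"summaries": _cache[key], "cached": False}

first = get_or_generate("vid1", "overview", "some-model", ["en"], lambda: {"en": "..."})
again = get_or_generate("vid1", "overview", "some-model", ["en"], lambda: {"en": "..."})
wider = get_or_generate(
    "vid1", "overview", "some-model", ["en", "source"],
    lambda: {"en": "...", "source": "..."},
)
```

The second call is a hit, and the third is a miss because the languages differ, which is the behaviour described above.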

The numbered-list bug that ate an afternoon

While I was prompt-tuning the overview output, the model started emitting something weird:

1. The video provides a series of common conversational phrases ...

2.

- The speaker expresses a desire to go to the restroom (2:01).
- ...

A literal 2. with nothing after it. Looked at the prompt:

Produce a summary in this exact format:
  1. A single-sentence overview of what the video is about.
  2. A blank line.
  3. 3-7 bullet points covering the main topics ...

The model had read my structural numbering (steps 1, 2, 3 of the spec) as literal output numbering. Item 2 in my spec was “a blank line”, so item 2 in its output was, well, blank — with a “2.” header.

Fix that worked: describe shape in prose with a skeleton showing the literal pattern in angle brackets.

Output exactly this shape:

    <one plain sentence>

    - <topic 1, ending with (M:SS)>
    - <topic 2 (M:SS)>
    ...

After this change the artefact stopped appearing on five different test videos. Not a controlled experiment but consistent enough to ship. Lesson I’m keeping: structural directives in prompts shouldn’t visually resemble the format you want as output. The model can’t tell them apart reliably.
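
A sketch of how the skeleton prompt and a cheap regression check for the stray "2." might look. Both function names are hypothetical, and the check is only an illustration of how the artefact could be caught automatically rather than by eye.

```python
import re

def overview_prompt(transcript: str) -> str:
    # The shape is shown as a literal skeleton with angle-bracket placeholders,
    # not as numbered steps the model might copy into its output.
    return (
        "Output exactly this shape:\n\n"
        "    <one plain sentence>\n\n"
        "    - <topic 1, ending with (M:SS)>\n"
        "    - <topic 2 (M:SS)>\n"
        "    ...\n\n"
        "Transcript:\n" + transcript
    )

# Matches a line that is only a list number, such as "2." with nothing after it.
_BARE_NUMBER = re.compile(r"^[ \t]*\d+\.[ \t]*$", re.MULTILINE)

def has_bare_number_line(output: str) -> bool:
    return bool(_BARE_NUMBER.search(output))
```

Running has_bare_number_line over the outputs of a handful of test videos turns "the artefact stopped appearing" into something a script can assert.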

Then I built /answer and got politely refused

/rag/answer: embed the query, fetch the top 5 nearest chunks via pgvector cosine distance, stuff them in a prompt, and tell the model to answer only from that context. Run it in parallel for each requested language, same ["en", "source"] shape as summarize.

Asked it the obvious first question: “Give me a summary.”

“I don’t have enough information to answer that based on this video.”

Hmm. Tried “What is this video about?”. Same answer.

Asked something narrower — “How does she say ‘I want to go to the bathroom’?” — and it came back perfectly, with a timestamp citation pulled from the right chunk.

OK so the failure mode was clear: broad questions don’t embed near any specific chunk because the answer is “all of the chunks”. Top-5 retrieval returns 5 arbitrary slices that don’t cover “what is this video about”. The prompt explicitly says “only answer from these chunks”. The model honours the instruction and refuses.

By construction, not a bug. But I’d been mentally treating /answer as the canonical chat-with-this-video endpoint and the refusal made that framing untenable. /answer is for narrow, factual, grounded questions. For broad ones, retrieval can’t help — you need the whole transcript, like summarize does.
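
The loop can be sketched in plain Python. This is an in-memory stand-in: in the real pipeline pgvector does the ranking inside Postgres, and the 2-D vectors below are toy values, not real embeddings.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    # chunks is a list of (text, embedding); returns the k most similar texts.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def answer_prompt(question: str, context_chunks: list[str]) -> str:
    context = "\n---\n".join(context_chunks)
    return (
        "Answer only from the context below. If the context does not contain "
        "the answer, say you don't have enough information.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

chunks = [
    ("bathroom phrase (2:01)", [1.0, 0.0]),
    ("restaurant phrase (3:25)", [0.0, 1.0]),
    ("speaking slowly (4:52)", [0.7, 0.7]),
]
picked = top_k([0.9, 0.1], chunks, k=2)
```

A narrow query vector lands near one chunk and the prompt has what it needs. A broad question has no such neighbour, so whatever k slices come back are not enough, and the "only from the context" instruction produces the refusal.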

I now think of them as two products. UI-wise I’m leaning toward a mode toggle (more on that later) instead of trying to auto-detect.

The bathroom incident

This is the one that bothered me for a full afternoon.

Same test video: a Mandarin language-learning channel, title “70 Must-Know Chinese Sentences: Listen Once A Day, Naturally Understand Fast Chinese”. The speaker is teaching common Chinese phrases by demonstrating them, saying things like “I want to go to the bathroom”, “Can you speak more slowly?”, “Where’s the nearest restaurant?” as examples.

The detailed /summarize walkthrough returned, verbatim:

- 2:01 – 2:18: The speaker expresses a desire to go to the restroom.
- 2:21 – 3:00: The speaker asks how someone met another person.
- 3:25 – 3:50: The speaker asks about the signature dish of the
                other person's hometown.
- 4:52 – 5:39: The speaker admits to never having been to China
                and struggles with speaking the language.

Read that again. The LLM thinks the speaker urgently needs the bathroom at 2:01, then immediately gets curious about someone’s romantic history, then confesses to never having visited China.

That isn’t what’s happening. The speaker is teaching how to say those phrases. Every “the speaker says X” should be “the speaker demonstrates how to say X”. It’s a tutorial. The transcript chunks contain literal example phrases the speaker is teaching.

Why did the LLM get this so wrong? I sat with it for a while.

The model had no way of knowing. The prompt contained the chunks. The chunks contained the phrases. Nothing in the prompt said this is a tutorial. So the model interpreted the chunks at face value — exactly as I’d asked it to.

I had context the model didn’t. The title literally says “Chinese Sentences” and “Naturally Understand”. The channel name — ShuoshuoChinese说说中文 — gives it away even further. None of it was in the prompt. I had populated documents.title with NULL all along because I never threaded it through the ingest path.

That afternoon went from “what’s wrong with my prompt?” to “I have been doing this all wrong” to “this is the cheapest fix in the project so far.”

So I went back to ingest and asked yt-dlp for everything

yt-dlp’s extract_info(download=False) returns a fat metadata dict: title, uploader (the channel), duration, description, categories, tags, and — when the creator has added them — chapters (the clickable section markers on the YouTube timeline).

I added all of it to documents:

title (was already there, finally populated)
author (new column, the uploader/channel)
duration_seconds
description — truncated to 2000 chars because YouTube descriptions are routinely 5-10kB of sponsor links, hashtags, and timestamp navigation, none of which the LLM needs
metadata (JSONB) — bucket for the rest: thumbnail URLs, categories, tags, chapters, upload date. JSONB so I can grow the shape without another migration
info_fetched_at — timestamp marker; null means “haven’t fetched yet”, non-null means “already cached, no need to re-hit yt-dlp”
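
Roughly how the mapping from yt-dlp’s info dict to those columns could look. The function names and the exact metadata keys are my illustration, not the actual ingest code; the yt_dlp import sits inside fetch_info so the pure mapping can be exercised without the library or a network connection.

```python
from datetime import datetime, timezone

MAX_DESCRIPTION_CHARS = 2000

def fetch_info(url: str) -> dict:
    # Imported lazily so the mapping below works without yt-dlp installed.
    import yt_dlp
    with yt_dlp.YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        return ydl.extract_info(url, download=False)

def to_document_fields(info: dict) -> dict:
    description = info.get("description") or ""
    return {
        "title": info.get("title"),
        "author": info.get("uploader"),  # the channel name
        "duration_seconds": info.get("duration"),
        "description": description[:MAX_DESCRIPTION_CHARS],
        # Everything else goes into the JSONB bucket.
        "metadata": {
            "thumbnail": info.get("thumbnail"),
            "categories": info.get("categories") or [],
            "tags": info.get("tags") or [],
            "chapters": info.get("chapters") or [],  # absent when the creator added none
            "upload_date": info.get("upload_date"),
        },
        "info_fetched_at": datetime.now(timezone.utc),
    }
```

Fields that yt-dlp omits or returns as None fall back to None or an empty list, so the row shape stays stable across videos.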

The cost of this was about 100 lines of Python plus a small schema migration. The win was the whole project working better. With title + description in the prompt, the bathroom bullets became:

- 2:01 – 2:18: Teaches how to say "I want to use the bathroom"
                in Chinese.
- 2:21 – 3:00: Teaches a question about how someone met someone.

Same model, same chunks, same temperature. Just 50-500 extra tokens of “here’s what kind of video this is” at the top of the prompt.

This was the single biggest quality jump in the project, and it cost effectively nothing: fractions of a cent per call. The metadata row also doubles as a cache (one row per video), so subsequent calls don’t re-hit yt-dlp. And the same DELETE FROM documents WHERE source_id = ... still cascades through chunks, summaries, and metadata in a single transaction.

While I was there I also added a generic preamble in the prompt:

“Videos can be tutorials/teaching, documentaries, vlogs, interviews, news, gaming/reviews, etc. Interpret content according to the apparent video type — in a teaching/tutorial video, the speaker isn’t doing the things they say; they’re teaching how to say or do them.”

Not video-specific. It nudges the model to think about genre when it sees the title and description. A few dozen tokens. Free.
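
As a sketch, the prompt assembly might look like this. The helper names, the "Title:" / "Channel:" labels, and the 500-character description cap are assumptions for illustration; the idea is just that the metadata and the genre preamble sit above the transcript.

```python
from typing import Optional

def context_header(
    title: Optional[str],
    author: Optional[str],
    description: Optional[str],
    max_description_chars: int = 500,
) -> str:
    # Only include fields that are actually populated.
    lines = []
    if title:
        lines.append(f"Title: {title}")
    if author:
        lines.append(f"Channel: {author}")
    if description:
        lines.append(f"Description: {description[:max_description_chars]}")
    return "\n".join(lines)

def build_prompt(preamble: str, header: str, transcript: str, task: str) -> str:
    # Preamble and metadata come first, so the model knows what kind of
    # video it is reading before it sees any chunk text.
    parts = [preamble, header, "Transcript:\n" + transcript, task]
    return "\n\n".join(p for p in parts if p)
```

If the metadata is missing, the header is simply empty and the prompt degrades to the old transcript-only shape.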

What I’m trying for /answer

The bathroom fix made /summarize great. /answer still has the narrow-vs-broad question split. Three pulls on the design: quality (broad questions should not silently fail), UX honesty (the user should know what they’re getting), cost (whole-transcript prompts aren’t free). Here’s the shape I’m landing on:

Shipping: Quick / Broad mode toggle, user picks

The current /answer becomes Quick mode — k=5 retrieval, ~2k input tokens per call, the cheap path. It works great for narrow factual questions, which is what RAG is good at.

Broad mode skips retrieval entirely and puts the whole transcript in the prompt, the same way /summarize does. ~10× more expensive per call. But it’s the honest answer to “what topics are taught?” or “list every phrase about ordering food”, because those questions genuinely need the whole transcript.

The user picks the mode in the UI. I considered an auto-router — a tiny LLM call to classify “narrow vs broad” before the main call. Skipped it. Reasons in order:

It’s another LLM call, even if a cheap one.
It adds latency on every Ask.
A user-facing toggle is more honest about the cost. People understand “this option is more expensive but more thorough”.
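
The toggle itself is a small dispatch. In this sketch, retrieve and load_all are hypothetical callables standing in for the top-k retrieval and the fetch-every-chunk query.

```python
from typing import Callable, Literal

Mode = Literal["quick", "broad"]

def select_context(
    mode: Mode,
    question: str,
    retrieve: Callable[[str, int], list[str]],
    load_all: Callable[[], list[str]],
    k: int = 5,
) -> list[str]:
    if mode == "quick":
        # Cheap path: only the k nearest chunks go into the prompt.
        return retrieve(question, k)
    if mode == "broad":
        # Expensive path: no retrieval, the whole transcript goes in.
        return load_all()
    raise ValueError(f"unknown mode: {mode!r}")
```

Because the user chooses the mode, there is no classifier call and no extra latency before the main LLM call.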

Shipping: cache answers like I cache summaries

New answers table mirroring summaries, keyed on (source_id, q_hash, mode, model, languages). First time you ask a question, it costs one LLM call. Second time anyone asks the same question of the same video, it’s a Postgres read.

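q_hash matters because question text can be long and arbitrary. Hashing it gives a short, fixed-width key that keeps the index efficient, and normalising whitespace before hashing means questions that differ only in spacing dedupe to the same row.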
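
One way to compute such a hash, as a sketch: SHA-256 and whitespace collapsing are my assumptions here, and this version deliberately does not fold case. Note that the hash alone would not dedupe anything; the normalisation step in front of it does that.

```python
import hashlib

def q_hash(question: str) -> str:
    # Collapse runs of whitespace and trim, so "a  b" and " a b " hash alike.
    normalised = " ".join(question.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
```

The result is always 64 hex characters, however long the question is, which suits a unique index better than the raw text.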

Caching here serves several goals at once. Cost: avoid paying for the same generation twice. UX: a repeat question returns in milliseconds and the user sees that it was cached. Correctness: the same input produces the same output (no temperature noise on repeat). All three reasons point at the same design, which is the nicest kind of design decision to make.

Maybe later: bump k from 5 to 10 in Quick mode

Roughly doubles the retrieved-context tokens per call. Probably worth it for the marginal quality gain. I’ll wait until the answer cache lands so the impact amortises; the extra cost of a bigger k matters less if the cache is doing the heavy lifting.

Skipping (for now): multi-turn Ask

Conversation history would be nice but the token cost grows fast: each turn carries all the previous turns in context, so the cost per question grows linearly with the length of the conversation. At my scale, the incremental UX win doesn’t justify that growth. Maybe later.

The rename that came along for the ride

Mid-pass I noticed something. My documents table had columns named for one specific source type — video_id, transcript_lang. The table itself was conceptually generic (one row per thing we’ve ingested), but its columns assumed which thing. If I ever want to add another source type (a podcast episode, a Substack article, a PDF) it would either need a parallel table or columns that don’t make sense.

So mid-stream I renamed:

documents.video_id → documents.source_id
documents.transcript_lang → documents.content_lang
summaries.video_id → summaries.source_id
added documents.source_type with default 'youtube_video'

Postgres’ ALTER TABLE ... RENAME COLUMN is non-destructive — preserves rows, indexes, and even foreign-key references. Wrapped each rename in a DO $$ ... IF EXISTS check ... END $$ block so re-running the schema file is a no-op on already-migrated branches.
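
A sketch of what each guarded rename could look like, generated from Python so the pattern is written once. The information_schema lookup is my assumption about how the IF EXISTS check might be done, and it ignores schema names for brevity.

```python
def idempotent_rename_sql(table: str, old: str, new: str) -> str:
    # Identifiers are interpolated directly, so only pass trusted,
    # hard-coded names here, never user input.
    return f"""DO $$
BEGIN
  IF EXISTS (
    SELECT 1 FROM information_schema.columns
    WHERE table_name = '{table}' AND column_name = '{old}'
  ) THEN
    ALTER TABLE {table} RENAME COLUMN {old} TO {new};
  END IF;
END $$;"""

migration = "\n\n".join([
    idempotent_rename_sql("documents", "video_id", "source_id"),
    idempotent_rename_sql("documents", "transcript_lang", "content_lang"),
    idempotent_rename_sql("summaries", "video_id", "source_id"),
])
```

On a database that has already been migrated the old column no longer exists, so the IF branch is skipped and the block does nothing.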

The HTTP boundary still uses video_id in request/response bodies, because the endpoints are surfaced via YouTube tools and “video_id” is clearer to callers. The translation lives in a single SQL alias:

SELECT d.source_id AS video_id, ...

A future podcast / article / PDF source becomes a new module under api/modules/<source>/ that ingests into the same documents + chunks tables with a different source_type. The RAG layer doesn’t change. This wasn’t strictly necessary for the generation work, but doing it now while everything was already in my head was much cheaper than doing it later.

The general rule I noted for myself: column names should match the abstraction level of the table they live in. If the table is generic, the columns should be generic; if the columns are source-specific, the table is probably misnamed.

What I wish I’d internalised earlier

Two things that present-me would tell three-weeks-ago-me:

The model has zero context outside the prompt. Title, author, description, video type — all of these exist in your ingest data and do not exist in the prompt unless you put them there. The bathroom incident was mine to prevent. I didn’t, because I assumed the chunks would carry enough context. They don’t. The metadata always carries more signal than I expect.
/answer is not a general-purpose “chat with this video” endpoint. It’s “find me the bit that answers this specific factual question”. RAG is for grounded retrieval, not open conversation. Build the UI around that truth instead of pretending one endpoint solves every shape of question.

The retrieval pipeline I built in part 1 still works. The honest update on what I thought RAG was: retrieval is about a third of what makes a useful RAG product. The other two thirds are generation prompts that know what they’re looking at, and a UI that doesn’t oversell what RAG can do.

What’s next (still)

Same list as part 1, mostly. The eval harness moved from “would be nice” to “increasingly necessary” — every prompt tweak in this pass I validated by reading output. Worked at this scale. Won’t scale much further. Twenty hand-curated (question, video, expected answer) pairs and a script to diff against my baseline would let me iterate on prompts and retrieval choices with actual confidence instead of vibes.
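
The harness doesn’t exist yet, so this is only a sketch of the smallest version that would beat reading output by eye. EvalCase, run_eval, and the substring-match scoring are all assumptions; a real harness would likely need fuzzier matching.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    video_id: str
    question: str
    expected: str  # a phrase the answer should contain

def run_eval(cases: list[EvalCase], answer: Callable[[str, str], str]) -> float:
    # Fraction of cases whose answer contains the expected phrase
    # (case-insensitive substring match).
    if not cases:
        return 0.0
    hits = sum(
        1 for c in cases
        if c.expected.lower() in answer(c.video_id, c.question).lower()
    )
    return hits / len(cases)
```

Comparing this score before and after a prompt or retrieval change gives a baseline diff instead of vibes.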

bge-m3 (the multilingual embedding model that runs locally) is still on the list and will matter more once I have something to measure against. Cross-lingual retrieval quality on the Mandarin videos is probably my biggest unmeasured-but-suspected gap.

If you’re building your own RAG and you find yourself thinking “retrieval is the hard part” — that was me too, three weeks ago. Retrieval is the foundational part. The hard part is realising the generation prompt is doing a lot of heavy lifting you hadn’t accounted for yet, and that some of the biggest quality wins are 50 tokens of meta-context rather than better embeddings, fancier chunking, or a more expensive model.

Back to the workbench.
