I’d been writing TypeScript professionally for years, shipped a few production Node services, and read enough about transformer models to hold a conversation. What I hadn’t done was build either a Python service or anything involving retrieval-augmented generation.

A part-time job ad caught my eye. It listed Python + Postgres + RAG + TypeScript — three of those I had, one I didn’t. So I decided to fix the gap the most useful way I know: build something that exercises it end-to-end and write about what I learned.

This is that article.

What RAG actually is (the honest 60-second version)

Before I built anything, I had to admit I was fuzzy on what RAG was actually doing. The marketing copy made it sound like specialising the LLM. It isn’t.

RAG is two systems stapled together:

  1. A retrieval pipeline. You take a corpus (documents, transcripts, web pages), chop it into chunks, embed each chunk into a vector, and store the vectors in a database. At query time you embed the user’s question with the same model and ask the database “what’s nearest to this vector?”
  2. A generation step. You take those nearest chunks and stuff them into the prompt for an LLM. The LLM never learned your corpus — you just gave it the relevant excerpts every time it answers.

There is no fine-tuning. The base model is unchanged. RAG is closer to “smart search, then prompt the LLM with the search results” than “teach the LLM your domain.”

The retrieval system is the heart of it. Generation is a thin wrapper around an OpenAI / Anthropic call.

What I built

A small FastAPI microservice that ingests YouTube videos and lets you search across them by meaning:

  • POST /rag/ingest — fetch a video’s transcript, chunk it into ~30s windows with 5s overlap, embed each chunk via OpenAI’s text-embedding-3-small, persist to Postgres.
  • POST /rag/query — embed the query, do an HNSW cosine-similarity search against the chunk vectors, return the top-k with timestamps.

It’s deliberately the R of RAG. The G (asking an LLM to answer based on those chunks) is the natural next step but is the boring part — a templated prompt and an API call. The interesting design decisions all live in the retrieval pipeline.

The decisions I made (and why)

Postgres with pgvector, not a dedicated vector DB. Pinecone, Qdrant, Weaviate — the dedicated vector databases exist, and at billion-scale they probably win. At my scale (one developer, learning, dozens of videos) the simplest thing was the one I already understood. pgvector adds a VECTOR(1536) column type and an <=> cosine-distance operator. That’s the entire integration. One less moving part, one less bill, one less mental model.

Neon for managed Postgres. I considered self-hosting on a $6 droplet. Three things pushed me to Neon: the free tier is generous, pgvector ships built-in (no apt install postgresql-16-pgvector), and the branching model lets me mirror the four-environment split I use on the Mongo side (local / dev / qa / prod) as four cheap branches off one project.

OpenAI’s text-embedding-3-small, not a local model. For an English- heavy first project this is the path of least resistance: $0.02 per million tokens, 1536-dim vectors, well-supported by every vector database. I architected behind an EmbeddingProvider protocol so a local model can slot in later. The interesting future swap is BAAI’s bge-m3 for Mandarin — same shape, just a different embed() implementation.

FastAPI with a modular layout. The Fastify backend I’d already written uses a server/ foundation plus modules/auth, modules/ payment plugged in. I mirrored that exactly in Python: api/server/ holds the app factory, DB pool, and config; api/modules/rag/ and api/modules/youtube/ each expose an APIRouter and get include_router-ed at startup. The point isn’t elegance — it’s that I can think about the Python service with the same mental shape as the Node service.

Gotchas that cost me time

A few things I didn’t see coming:

  • Async everywhere or nowhere. FastAPI is ASGI; asyncpg is async; the OpenAI client has an async variant. Mix one sync call in and you block the event loop for the whole request. The fix wasn’t hard (just AsyncOpenAI and await provider.embed(...)) but it was a Python idiom I’d never thought about — Node forces async on you; Python lets you mix the two.
  • Postgres vs Mongo mental shift. I’d been on MongoDB for years. Postgres’ CREATE EXTENSION vector, the strict schema, BIGSERIAL PRIMARY KEY, the way you bootstrap migrations — none of it was unfamiliar, but the Mongo “store a document, ask for it back” reflex kept trying to take over.
  • The Neon endpoint timing race. When you create a branch via the API, Neon auto-provisions a compute endpoint asynchronously. The first connection_uri call after POST /branches returns endpoint not found because the endpoint isn’t ready yet. The symptom is psql falling back to its local Unix socket and complaining No such file or directory. Took me a couple of diagnostic curls to figure out the endpoint was being provisioned and I was just being impatient.
  • The peaks-vs-scenes problem. Vector search finds individual high-similarity chunks. A conversation about food might span five consecutive chunks but only the most on-topic one will score in the top-k. The fix — “neighbour expansion”: pull the top-k chunks, then also grab their temporal neighbours from the same video — is the most interesting bit of retrieval design I learned, and it’s not in most RAG tutorials.

Chunking — the bug that ate 56 minutes of video

The chunker is the most boring-looking part of the pipeline. Splitting text on time windows is not interesting work. It also turned out to be the place I lost the most time, because the boring code makes silent assumptions about its input.

The strategy: fixed-time windows with overlap

I went with the boring-and-reliable baseline. Each chunk is a 30-second window of speech, walking forward in 25-second steps so each chunk overlaps its neighbour by 5 seconds. The overlap is small but matters: a query that lands near a boundary still has context on both sides instead of falling between two chunks.

window_seconds = 30.0
overlap_seconds = 5.0
step = window_seconds - overlap_seconds   # 25.0

There are smarter strategies — semantic chunking that splits on topic shifts, sentence-aware chunking that respects clause boundaries. Each adds a model call per chunk or an NLP dependency. For conversational YouTube content, time-window chunking captures most of the value. Easy to explain, easy to A/B test later when I add an eval harness.

A one-hour video came back with four chunks

When I ingested my first long video — a one-hour Mandarin interview — the response said chunks_inserted: 4. For an hour of speech. I’d expected ~140.

Diagnostic first. The transcript itself had 2030 raw segments and ~57 minutes of speech, confirmed by listing the segments directly. The data WAS there. Something inside the chunker was dropping 56 minutes of it on the floor.

Found it in the loop. The chunker decides when to stop by looking at the last segment in the array:

last_segment_end = segments[-1].start + segments[-1].duration

segments[-1] is the last element of the list, not the segment with the latest start time. The chunker assumed those were the same. They weren’t. youtube-transcript-api v1.x returned this particular video’s segments out of chronological order — segments[-1] happened to land at ~80 seconds. The loop terminated at 80 seconds, ignoring the other 3,400 seconds of speech that came after it in the array but earlier in the timeline.

Sort at the boundary, not in every consumer

One-line fix:

segments.sort(key=lambda s: s.start)

I put it in transcript.py, not the chunker. Every consumer of fetch_transcript() wants chronological order — the chunker, the caption-list HTTP endpoint, the future neighbour-expansion logic. Sorting once at the data boundary means downstream code can trust it forever; sorting inside the chunker would have left the bug latent in every other consumer.

Wider lesson: library return values are contractually weaker than you assume. The docs said “returns transcript snippets” — they did not say “in chronological order”. Read the upstream more carefully than your reflexes, then canonicalise at your system’s boundary, not at every internal callsite.

Caching — paying for the LLM only when you have to

The R in RAG is cheap. Retrieval is a vector multiplication and an index lookup. Both run in milliseconds and cost fractions of a cent. The G is expensive. Every /answer and /summarise call is dollars out the door and seconds of latency. As soon as I had the endpoints working end-to-end, I noticed I was paying for the same answer multiple times because nothing cached.

Why caching is non-optional

The first time I summarised my 1-hour Mandarin video, the LLM call took 11 seconds and burned through several thousand input tokens. The second time I asked the same question with the same parameters: same 11 seconds, same cost. No browser-level cache to help me, no service-level memo, nothing.

For a personal portfolio this is fine — I’m hitting it maybe ten times a day. For a real product where users re-open the same video page or review the same summary, this scales linearly with traffic. The cost profile makes you want to cache aggressively. The latency profile makes you NEED to.

What goes into the cache key

The cache row is keyed by (video_id, detail, model, languages). That’s the smallest set of inputs that, when identical between two calls, MUST produce identical output:

  • video_id — which video’s transcript
  • detailoverview (1-sentence + bullets) vs detailed (chronological walkthrough)
  • model — different model = different output, even from same prompt
  • languages — an ordered array; ["en", "zh-Hans"] produces different rows than ["zh-Hans", "en"] (the order in the response matters)

Anything that doesn’t affect output stays out of the key. I considered including the prompt template version but my prompts aren’t versioned yet — when they are, the key gets a fifth dimension.

POST returns + caches (vs. POST stores only)

I spent half an hour deciding the HTTP semantics. The two designs:

  1. POST returns the data and caches it. Single round-trip. Cache hit on every subsequent call. The response tells you whether the data came from cache or a fresh LLM call.
  2. POST stores only (202 Accepted), GET reads the cache. Two round-trips first time; every read goes through the cache by construction.

Design 2 is more REST-pure. Design 1 is friendlier for the synchronous LLM-call pattern — the caller already waited for the LLM, may as well hand them the result.

I went with #1, then added a cached: bool field to the response. That field is doing real work. It makes the cache invisible-but-honest: callers can ignore it and use the data, OR they can log it and notice when a call they expected to be cached wasn’t. It also explains latency variance without forcing the caller to interpret it. If I were running this at scale where billing-per-call became a serious concern, I’d flip to #2 — but for now, transparency via cached is enough.

Deleting one row deletes everything

The cache table has a foreign key:

video_id TEXT NOT NULL REFERENCES documents(video_id) ON DELETE CASCADE

That ON DELETE CASCADE looked unremarkable at the time. Two days later when I built the “delete this video” endpoint, I realised what it meant.

A single DELETE FROM documents WHERE video_id = $1 triggers two cascades automatically:

  • chunks via chunks.document_id → documents.id
  • summaries via summaries.video_id → documents.video_id

One SQL statement. Atomic. No application-level coordination, no half-deleted state if the process crashes between calls. In Mongo I’d have written three deletes and a transaction to wrap them. In Postgres with proper FKs, the schema enforces the lifecycle and the application code stays a single line.

This is the underrated half of “boring relational database”. The HNSW vector index gets all the press; the foreign-key constraints quietly do half the work.

What I’d tell my past self

Most of the RAG content online makes it sound more magical than it is. The actual code is ~400 lines across config, schema, transcript fetching, chunking, embedding, and a couple of SQL queries. The interesting part isn’t the pipeline — it’s the dial-twisting that comes after: chunk size, overlap strategy, embedding model choice, top-k, neighbour expansion, hybrid lexical+vector retrieval, a reranker pass. Building the pipeline is the price of admission; tuning it is where the engineering lives.

I also wish I’d appreciated earlier that the database is the co-engineer. pgvector’s HNSW index isn’t just a storage trick — the choice of index (HNSW vs IVFFlat), the cosine-vs-L2 metric, the similarity threshold for filtering noise — these are retrieval quality decisions that live in SQL. A vector DB isn’t a checkbox in your architecture diagram. It’s a dial you turn.

What’s next

A few things on the list:

  • The /answer endpoint. Take the top-k chunks, expand to neighbours, build a prompt, ask GPT-4o-mini. The boring part. The point at which “search service” becomes “RAG service.”
  • A local embedding option (bge-m3). I’m interested in studying Mandarin via YouTube videos, and English-trained embedding models are demonstrably weaker on Chinese. bge-m3 runs on my M-series laptop, has Chinese-native quality, and slots in behind the EmbeddingProvider protocol without changing call sites.
  • An eval harness. Twenty hand-curated (query, expected-chunk) pairs, ran on every retrieval-config change. Without this you’re just guessing whether your dial-twisting helped.

If you’re staring at “I should learn RAG” and wondering where to start: pick a corpus you actually care about (mine was a Mandarin study channel), wire up the dumbest possible chunk-embed-store-query pipeline against pgvector, and only then start asking which dials matter. The mystique disappears around hour three.

← All articles