The Tech Noob
rag · llm · typescript · system-design · learnings

Learnings from: Building a Production-Ready RAG Agentic System

A technical deep-dive into my 3-month journey at Turing College’s AI Engineering program, moving from traditional web development to building agentic systems with Vercel AI SDK, pgvector, and hybrid retrieval.

December 27, 2025 · 10 min read

Beyond Prompts: Building a Production-Ready RAG Agentic System

Earlier this year, I decided to dive deeper into the world of AI Engineering through Turing College's AI Engineering Program.

I wasn’t looking to just learn how to prompt better. My goal was to understand the engineering reality behind modern AI systems: how they retrieve data, how they stream responses, how they fail safely, and how they behave in production.

After three intense months of sprints and obtaining my certification, I want to share both what the program looked like and what I learned building a real, agentic RAG system.


The 3-Month Roadmap

The program was structured into three high-intensity monthly sprints, each building upon the last:

1. Foundations of LLM Application Development

We started with the fundamentals: how large language models actually work in practice.

This sprint focused on:

  • Tokenization, temperature, and model parameters
  • Prompt engineering (and its limits)
  • Calling LLM provider APIs (OpenAI, Anthropic, Gemini)

This phase set the baseline: understanding what models can and can’t do on their own.

2. Retrieval-Augmented Generation (RAG)

Next, we moved beyond the model’s internal knowledge.

This sprint introduced:

  • Vector databases (ChromaDB, Qdrant)
  • Embeddings and semantic similarity
  • Frameworks like LangChain and LangGraph

Here’s where things clicked: most real-world AI systems are data systems first, model systems second.

3. AI Agents & Tool Use

The final sprint shifted from linear pipelines to agentic systems:

  • Tool calling
  • Multi-step reasoning
  • Memory and control flow
  • Human-in-the-loop patterns

All of this culminated in the Capstone Project.


The Capstone: A Documentation-Aware Support Ecosystem

For my final project, I wanted to solve a problem I’ve seen repeatedly in production systems: Support bots that sound confident but are wrong, outdated, or generic.

The result was a documentation-aware, agentic support system that:

  • Combines internal documentation with live web search.
  • Uses an orchestrator to decide how to answer each query.
  • Reduces hallucinations through confidence checks and fallbacks.
  • Exposes its internal reasoning to the user via streaming.
  • Can use note-taking app APIs to store results and guidance.

The Technical Stack

I chose a stack that prioritizes type-safety and streaming performance. Instead of sticking strictly to the technologies mentioned in the course, I re-implemented the same concepts using technologies I use daily. This turned out to be one of the most valuable learning experiences of the program: to swap tools, I first had to internalize each concept, then understand how every tool implements it and where the replacement could work or be improved on.

Core stack for my Capstone project:

  • Frontend: Next.js 14, Tailwind CSS
  • Backend: Node.js 22
  • Orchestration: Vercel AI SDK (streaming, embeddings, tool calling)
  • Database: PostgreSQL + pgvector
  • Query Builder: Kysely (type-safe SQL)
  • Caching: Redis (semantic cache)

It was also really fun hearing how other students, with different use cases and different tech stacks, obtained different results from mine when trying out other agent orchestration or RAG retrieval techniques.


Architectural Deep-Dive

The Foundation: A Context-Aware Data Pipeline

In RAG (Retrieval-Augmented Generation), your model is only as good as the context you feed it. I quickly realized that standard header-based chunking worked for clean documentation but failed miserably for unstructured data like support ticket comments or YouTube transcriptions.

Smart Labeling & Keyword Indexing

To solve this, I evolved the pipeline from simple ingestion to a keyword-based labeling system.

  • The Problem: Vector similarity alone is resource-expensive and can sometimes surface "semantically close" but contextually irrelevant noise.
  • The Solution: I used PostgreSQL to store keyword tags for every chunk. This allowed me to perform high-speed SQL filtering before any vector operations.

By filtering the search space via specific labels (e.g., category: "billing", source: "github_issue"), I significantly reduced the "needle in the haystack" problem and lowered the compute load on vector search workloads.
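
To make the ingestion side concrete, here is a minimal sketch of how chunks can be labeled and stored, assuming a Kysely db instance, an embeddings table with a JSONB metadata column, and the AI SDK's embedMany helper; the table shape, keyword extraction, and model choice are illustrative rather than the exact capstone code.

// ingest.ts — hedged sketch of chunk labeling + embedding storage.
// Assumes a Kysely `db` instance and an `embeddings` table with a JSONB
// `metadata` column; names, paths, and the model are illustrative.
import { embedMany } from 'ai';
import { openai } from '@ai-sdk/openai';
import { db } from './db';

interface Chunk {
  content: string;
  source: string;     // e.g. "github_issue", "docs"
  keywords: string[]; // labels extracted at ingestion time
}

export async function ingestChunks(chunks: Chunk[]) {
  // Embed all chunks in one batch call
  const { embeddings } = await embedMany({
    model: openai.embedding('text-embedding-3-small'),
    values: chunks.map((c) => c.content),
  });

  // Store each chunk with its vector and keyword labels for SQL pre-filtering
  await db
    .insertInto('embeddings')
    .values(
      chunks.map((chunk, i) => ({
        content: chunk.content,
        embedding: `[${embeddings[i].join(',')}]`, // pgvector literal
        metadata: JSON.stringify({
          source: chunk.source,
          keywords: chunk.keywords,
        }),
      }))
    )
    .execute();
}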

The retrieval logic in this project follows a three-tier sequential design to balance speed and precision:

  1. SQL Pre-Filtering: Narrowing the database using metadata and keywords.
  2. Semantic Search (pgvector): Finding the top 10 most similar chunks in the filtered subset.
  3. LLM Reranking: When similarity is very low, another LLM can rerank the filtered subset for actual relevance to the user's specific intent. This increases latency and LLM usage costs, but yields more accurate results.

// Example of pre-filtering with Kysely before vector similarity
// (imports added for context; the db module path and class name are illustrative)
import { sql } from 'kysely';
import { db } from './db';

export class EmbeddingRepository {
  static async searchWithMetadata(queryEmbedding: number[], tags: string[]) {
    const vectorString = `[${queryEmbedding.join(',')}]`;

    return await db
      .selectFrom('embeddings')
      .select([
        'id',
        'content',
        'metadata',
        sql`1 - (embedding <=> ${vectorString}::vector)`.as('similarity'),
      ])
      // High-efficiency SQL filtering before expensive vector math
      .where('metadata', '@>', JSON.stringify({ keywords: tags }))
      // Discard chunks below a minimum similarity threshold
      .where(sql`1 - (embedding <=> ${vectorString}::vector)`, '>', 0.3)
      .orderBy(sql`embedding <=> ${vectorString}::vector`, 'asc')
      .limit(10)
      .execute();
  }
}
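
For the third tier, here is a hedged sketch of what LLM reranking can look like, using the AI SDK's generateObject with a Zod schema so the model returns a strict list of relevant chunk ids; the function name, model, and prompt wording are assumptions, not the exact capstone implementation.

// rerank.ts — hedged sketch of LLM reranking over the filtered subset.
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

interface RetrievedChunk {
  id: string;
  content: string;
  similarity: number;
}

export async function rerankChunks(query: string, chunks: RetrievedChunk[]) {
  const { object } = await generateObject({
    model: openai('gpt-4o-mini'), // a small, cheap model is enough for reranking
    schema: z.object({
      // ids of the chunks that are actually relevant, best first
      relevantIds: z.array(z.string()),
    }),
    prompt: [
      `User question: ${query}`,
      'Rank the following chunks by how well they answer the question.',
      'Return only the ids of chunks that are genuinely relevant, best first.',
      ...chunks.map((c) => `[${c.id}] ${c.content}`),
    ].join('\n\n'),
  });

  // Keep the LLM's ordering, dropping anything it judged irrelevant
  const byId = new Map(chunks.map((c) => [c.id, c]));
  return object.relevantIds
    .map((id) => byId.get(id))
    .filter((c): c is RetrievedChunk => c !== undefined);
}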

On top of this, I added a Redis-backed semantic cache. If a query is semantically similar to a previous one, latency drops from seconds to milliseconds. In practice, this relies on similarity thresholds and cache invalidation strategies tied to document updates.
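
As a rough illustration of the cache lookup (a sketch of the checkCachedQA helper used by the orchestrator below), assuming a node-redis client and an application-level cosine similarity check; key names, thresholds, and the embedding model are placeholders.

// semantic-cache.ts — hedged sketch of a Redis-backed semantic cache.
import { createClient } from 'redis';
import { embed } from 'ai';
import { openai } from '@ai-sdk/openai';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

const CACHE_KEY = 'qa-cache';      // placeholder key name
const SIMILARITY_THRESHOLD = 0.92; // tune per use case

interface CachedQA {
  question: string;
  answer: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

export async function checkCachedQA(query: string) {
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: query,
  });

  // Scan recent cache entries and return the first match above the threshold
  const entries = await redis.lRange(CACHE_KEY, 0, 99);
  for (const raw of entries) {
    const cached: CachedQA = JSON.parse(raw);
    if (cosineSimilarity(embedding, cached.embedding) >= SIMILARITY_THRESHOLD) {
      return { found: true as const, answer: cached.answer };
    }
  }
  return { found: false as const };
}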

The Agent Orchestrator

The "brain" of the system is the Orchestrator. Instead of a monolithic prompt, it follows a logic-gate flow: Semantic Cache Check → Query Routing → Specialized Agent Execution.

If a user asks a question that has been answered before with high similarity, the system retrieves it from the Redis-backed semantic cache instantly, saving LLM costs and latency.

// agent-orchestrator.ts logic flow
export async function orchestrate(query: string, history: any[]) {
  // 1. Check for semantically similar questions in the cache
  const cachedQA = await checkCachedQA(query);
  if (cachedQA.found) return cachedQA.answer;

  // 2. Route the query to the appropriate specialist
  const routing = await routeQuery(query);

  // 3. Trigger Specialist (Documentation Expert or Web Search Agent)
  if (routing.agent === 'docs-expert') {
    return await generateRAGResponse(query, history);
  } else {
    return await generateWebSearchResponse(query, history);
  }
}

This keeps reasoning explicit and debuggable, something that’s critical as systems grow.

One key architectural learning with agent orchestrators is that giving an agent too many routing options generally makes it worse at selecting where the information needs to be forwarded. If you call an LLM with just one or two options that are clearly distinct, it will very likely make the right choice without much iteration. However, you may be dealing with a more complex use case where many different tool options must be available.

Creating a hierarchy between agents will generally increase accuracy in tool selection, but it will also add latency. Depending on the use case, you may want to keep a flatter hierarchy for faster responses, or compromise on speed to get better answers. You may even want the user to directly choose which specialized agent(s) should handle their prompt.
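
To make the routing step concrete, here is a minimal sketch of a routeQuery function that constrains the model to a small, clearly distinct set of specialists via a Zod enum and the AI SDK's generateObject; the model choice and prompt wording are assumptions.

// router.ts — hedged sketch of LLM-based query routing with a constrained enum.
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

export async function routeQuery(query: string) {
  const { object } = await generateObject({
    model: openai('gpt-4o-mini'),
    schema: z.object({
      // Keeping the option set small and clearly distinct improves accuracy
      agent: z.enum(['docs-expert', 'web-search']),
      reason: z.string().describe('One sentence explaining the choice'),
    }),
    prompt:
      `Route this support question to the best specialist.\n` +
      `Use "docs-expert" for questions answerable from internal documentation, ` +
      `"web-search" for anything that needs fresh or external information.\n\n` +
      `Question: ${query}`,
  });

  return object; // { agent, reason }
}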

Agentic Fallbacks & Human-in-the-Loop

When confidence is low, the system must not hallucinate. Instead, it can:

  • Trigger a Web Search Agent.
  • Ask the user for clarification.
  • Combine live data with internal knowledge.

This balance is crucial: too many follow-up questions frustrate users; too few lead to confident nonsense.
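
A sketch of how such a confidence gate can look in practice, with thresholds and helper names that are purely illustrative:

// fallback.ts — hedged sketch of a confidence gate before answering.
interface RetrievalResult {
  chunks: { content: string; similarity: number }[];
}

type FallbackAction =
  | { kind: 'answer-from-docs' }
  | { kind: 'web-search' }
  | { kind: 'ask-clarification'; question: string };

export function decideFallback(query: string, retrieval: RetrievalResult): FallbackAction {
  const best = retrieval.chunks[0]?.similarity ?? 0;

  // Strong match: answer directly from internal documentation
  if (best >= 0.75) return { kind: 'answer-from-docs' };

  // Weak-but-plausible match: verify against live web results instead of guessing
  if (best >= 0.4) return { kind: 'web-search' };

  // No usable context: ask the user rather than hallucinate
  return {
    kind: 'ask-clarification',
    question: `I couldn't find anything solid on "${query}". Could you add more detail or a product area?`,
  };
}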

Type-Safe Tool Use with Zod & Agentic Integrations

One of the core features is the Note-Taking Integration Agent. I wanted the bot to be able to save summaries or links directly to tools like Capacities or Notion.

The challenge with tool-calling is ensuring the LLM provides the exact schema the external API expects. To make this reliable:

  • Tool schemas are defined with Zod.

  • Outputs are validated before hitting external APIs.

  • Invalid responses fail fast.

This turned tool calling from a “best effort” feature into something production-safe.
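
For illustration, here is a minimal sketch of how such a tool can be declared with the Vercel AI SDK and Zod; the saveNote tool, its fields, and the note-app client are placeholders rather than the actual integration code.

// note-tool.ts — hedged sketch of a Zod-validated tool definition.
import { tool } from 'ai';
import { z } from 'zod';

// Placeholder client for a note-taking API (Capacities, Notion, ...)
async function createNote(input: { title: string; body: string; tags: string[] }) {
  // A real implementation would call the external API here
  return { id: 'note_123', ...input };
}

export const saveNote = tool({
  description: "Save a summary or link to the user's note-taking app.",
  // In AI SDK v5 this field is named inputSchema instead of parameters
  parameters: z.object({
    title: z.string().min(1).max(120),
    body: z.string().min(1),
    tags: z.array(z.string()).default([]),
  }),
  execute: async ({ title, body, tags }) => {
    // By the time execute runs, the arguments have already been validated
    // against the Zod schema, so invalid LLM output fails fast upstream.
    return await createNote({ title, body, tags });
  },
});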

UX: Opening the "Black Box" with Custom Streaming

A major friction point in AI apps is the "thinking" silence. If your agents form a multi-level hierarchy, users will most likely need to wait a couple of seconds for a response, and without feedback in the UI the system will seem broken. Streaming is now a baseline UX requirement for multi-step AI systems. To keep the user engaged, I added custom stream events via the Vercel AI SDK to surface the steps and decisions being made in the backend.

I modified the streaming functions to forward metadata about the agent's internal state. The frontend receives real-time updates like:

  • “Docs Expert is searching the knowledge base...”

  • “Web Agent is double-checking online...”

  • “Saving results to your notes...”

By exposing the agent type and current operation, the UI feels more responsive and the "agentic" nature of the system becomes visible to the user, building trust in the process. With custom states, you can also tailor the feedback messages to your application's context.
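
Here is a rough sketch of how this can look, assuming the AI SDK 4-style data stream helpers (createDataStreamResponse and writeData); the event payload shape and the retrieveContext helper are my own placeholders.

// app/api/chat/route.ts — hedged sketch of streaming agent status updates.
import { createDataStreamResponse, streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Placeholder for the app's RAG retrieval step
async function retrieveContext(messages: unknown[]): Promise<string> {
  return 'relevant documentation chunks...';
}

export async function POST(req: Request) {
  const { messages } = await req.json();

  return createDataStreamResponse({
    execute: async (dataStream) => {
      // Custom event: tell the UI which agent is working and on what
      dataStream.writeData({
        agent: 'docs-expert',
        status: 'Docs Expert is searching the knowledge base...',
      });

      const context = await retrieveContext(messages);

      dataStream.writeData({
        agent: 'docs-expert',
        status: 'Generating an answer from the retrieved docs...',
      });

      const result = streamText({
        model: openai('gpt-4o-mini'),
        system: `Answer using only this context:\n${context}`,
        messages,
      });

      // Merge the token stream with the custom status events
      result.mergeIntoDataStream(dataStream);
    },
  });
}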

The Monitoring Agent

In production, you can't manually verify every response, and there are too many shades of gray in how well an AI system is doing out in the wild. I decided to build an Asynchronous Monitoring Agent that samples interactions and scores them on three pillars:

  • Faithfulness: Did the AI stick to the provided context or "go rogue"?

  • Tone: Is the response professional and aligned with the brand?

  • Safety: Is the system responding well to potential prompt injections?

Keeping it asynchronous means that you don't increase latency by adding another sequential step to your client responses. Instead you run your evaluation metrics on a background job to determine if the system is doing a good overall job, and you can continuously analyze metrics from your deployed AI system.

One tricky part here is designing how the monitoring agent should evaluate the prompt-response samples. The scores I came up with are heuristic signals rather than ground truth, but they provide trend-level visibility that's specific to the use cases and context the system is designed to operate in. There's some very interesting literature on LLM-as-a-judge and human feedback gathering; I really liked this technical article from Datadog that goes into the topic.
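
As a sketch of the scoring step, assuming an LLM-as-a-judge approach with the AI SDK's generateObject; the three pillars come from above, while the model, schema fields, and prompt are illustrative:

// monitor.ts — hedged sketch of the asynchronous monitoring agent's scoring step.
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

interface InteractionSample {
  question: string;
  retrievedContext: string;
  answer: string;
}

export async function scoreInteraction(sample: InteractionSample) {
  const { object } = await generateObject({
    model: openai('gpt-4o'),
    schema: z.object({
      faithfulness: z.number().min(0).max(1)
        .describe('Did the answer stick to the provided context?'),
      tone: z.number().min(0).max(1)
        .describe('Is the answer professional and on-brand?'),
      safety: z.number().min(0).max(1)
        .describe('Does the answer resist prompt injection and unsafe requests?'),
      notes: z.string(),
    }),
    prompt:
      `Score this support interaction on faithfulness, tone, and safety (0 to 1).\n\n` +
      `Context:\n${sample.retrievedContext}\n\n` +
      `Question: ${sample.question}\n\n` +
      `Answer: ${sample.answer}`,
  });

  // Scores are heuristic signals; persist them for trend analysis, not as ground truth
  return object;
}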


Key Takeaways from the Journey

This 3-month training confirmed my expectations going into it: AI Engineering is 10% AI and 90% Engineering.

  • Streaming is UX: In the world of LLMs, a simple loading spinner is a UX killer. Mastering streaming patterns is essential for keeping users engaged while different agents go about their tasks.

  • Retrieval is an Art: Designing a RAG system that produces fast and accurate embedding lookups is an incredible feat. The data needs to be in good shape, and there are already many different strategies whose choice will greatly impact the project outcome depending on the use case you're trying to support.

  • Ethics as Architecture: Privacy and bias mitigation shouldn't be "add-ons". They must be handled at the database and filtering layers.

  • Data Architecture is King: A great RAG system starts with a smart filtering system and a refined chunking strategy, not just a bigger model.

  • Type Safety means Reliability: Enforcing strict schemas on tools with libraries like Zod is the only way to build reliable, multi-step agentic workflows that don't break on API calls.


Let's Connect

I’m incredibly grateful for the mentors and peers at Turing College. I shared more about the personal impact of this transition in a LinkedIn post.

If you're working on RAG pipelines or exploring agentic workflows, I'd love to chat! Feel free to reach out!