AI Engineering Guidebook
2025 Edition
AI Engineering: System Design Patterns for LLMs, RAG and Agents
Table of contents
LLMs....................................................................................................................................... 6
What is an LLM?..........................................................................................................................................7
Need for LLMs............................................................................................................................................ 10
What makes an LLM ‘large’?...............................................................................................................12
How are LLMs built?................................................................................................................................ 14
How to train LLM from scratch?.........................................................................................................18
How do LLMs work?............................................................................................................................... 23
7 LLM Generation Parameters...........................................................................................................29
4 LLM Text Generation Strategies................................................................................................... 33
3 Techniques to Train An LLM Using Another LLM.................................................................. 37
4 Ways to Run LLMs Locally.............................................................................................................. 40
Transformer vs. Mixture of Experts in LLMs............................................................................... 44
Prompt Engineering................................................................................... 49
What is Prompt Engineering?..............................................................................................................50
3 prompting techniques for reasoning in LLMs........................................................................... 51
Verbalized Sampling............................................................................................................................... 58
JSON prompting for LLMs..................................................................................................................... 61
Fine-tuning...................................................................................................66
What is Fine-tuning?............................................................................................................................... 67
Issues with traditional fine-tuning................................................................................................. 68
5 LLM Fine-tuning Techniques............................................................................................................70
Implementing LoRA From Scratch....................................................................................................78
How does LoRA work?...........................................................................................................................80
Implementation........................................................................................................................................ 82
Generate Your Own LLM Fine-tuning Dataset (IFT)................................................................86
SFT vs RFT.................................................................................................................................................. 92
Build a Reasoning LLM using GRPO [Hands On]....................................................................... 94
Bottleneck in Reinforcement Learning...........................................................................................101
The Solution: The OpenEnv Framework.........................................................................................101
Agent Reinforcement Trainer (ART)................................................................................................103
RAG.............................................................................................................. 105
What is RAG?........................................................................................................................................... 106
What are vector databases?............................................................................................................. 107
The purpose of vector databases in RAG................................................................................... 109
Workflow of a RAG system.................................................................................................................113
5 chunking strategies for RAG..........................................................................................................118
Prompting vs. RAG vs. Finetuning?................................................................................................. 124
8 RAG architectures..............................................................................................................................126
RAG vs Agentic RAG.............................................................................................................................128
Traditional RAG vs HyDE..................................................................................................................... 131
Full-model Fine-tuning vs. LoRA vs. RAG.................................................................................... 134
RAG vs REFRAG...................................................................................................................................... 139
RAG vs CAG............................................................................................................................................... 141
RAG, Agentic RAG and AI Memory...............................................................................................143
Context Engineering................................................................................. 146
What is Context Engineering?...........................................................................................................147
Context Engineering for Agents...................................................................................................... 149
6 Types of Contexts for AI Agents................................................................................................ 156
Build a Context Engineering workflow......................................................................................... 158
Context Engineering in Claude Skills.............................................................................................168
Manual RAG Pipeline vs Agentic Context Engineering.......................................................... 171
AI Agents................................................................................................... 176
What is an AI Agent?...........................................................................................................................177
Agent vs LLM vs RAG.......................................................................................................................... 180
Building blocks of AI Agents............................................................................................................. 181
Memory Types in AI Agents..............................................................................................................195
Importance of Memory for Agentic Systems............................................................................ 196
5 Agentic AI Design Patterns........................................................................................................... 199
ReAct Implementation from Scratch............................................................................................ 205
5 Levels of Agentic AI Systems....................................................................................................... 231
30 Must-Know Agentic AI Terms.................................................................................................. 235
4 Layers of Agentic AI.......................................................................................................................240
7 Patterns in Multi-Agent Systems...............................................................................................242
Agent2Agent (A2A) Protocol.............................................................................................................245
Agent-User Interaction Protocol (AG-UI)................................................................................... 248
Agent Protocol Landscape................................................................................................................ 252
LLMs
What is an LLM?
Imagine someone begins a sentence and you instinctively know how it should continue, or they trail off mid-thought and you can guess the next word.
This simple act of predicting what comes next is the foundation of how large language models (LLMs) operate.
With enough exposure, the model becomes remarkably good at continuing any
piece of text in a coherent, meaningful way.
At the technical level, an LLM processes text in small units called tokens. A
token may be a word, part of a word or even punctuation.
The model looks at the tokens so far and predicts the next one. Repeating this
process generates full answers, explanations, or code.
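To make this concrete, here is a minimal sketch of that predict-one-token-at-a-time loop, assuming the Hugging Face transformers library with GPT-2 as a stand-in model (any causal LM would work the same way):

```python
# Minimal sketch: repeated next-token prediction with a small causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The capital of France is"
input_ids = tokenizer(text, return_tensors="pt").input_ids

for _ in range(5):                              # generate 5 tokens, one at a time
    logits = model(input_ids).logits            # scores over the whole vocabulary
    next_id = logits[0, -1].argmax()            # greedy: pick the most likely token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```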
Need for LLMs
Before LLMs, each new language problem required a new model and a new pipeline. This created a fragmented landscape that didn’t scale.
As a result, a single system could now answer questions, write code, analyze text
and more simply by predicting the continuation of your input.
What makes an LLM ‘large’?
Parameters are the internal values that the model adjusts during training. Each parameter represents a small piece of the patterns the model has learned.
Earlier language models were much smaller and could only capture surface-level
text patterns.
They could mimic style but struggled with tasks that required reasoning,
abstraction or generalization.
This wasn’t the result of adding new rules or programming specific behaviors. It
emerged naturally from giving the model enough capacity to learn deeper
relationships in language.
This effect held consistently: models with more parameters, trained on broader
data, produced more reliable, coherent and adaptable outputs.
In practical terms, the “large” in large language model is what enables these
capabilities.
How are LLMs built?
Modern LLMs are built on the Transformer architecture, which is made up of several core components that work together to turn raw text into structured representations the model can learn from.
Transformer
A Transformer is designed to look at all tokens in the input at once and identify
which parts of the text are most relevant to each other.
This lets the model follow long sentences, track references, and understand
relationships that appear far apart in the sequence.
Tokenization
Text is first broken into tokens. A token may be a word or part of a word,
depending on how common it is.
This approach keeps the vocabulary manageable and allows the model to handle
any language input.
These tokens are then mapped to numerical representations so the model can
work with them.
Transformer Layers
The model contains many Transformer layers stacked on top of each other.
As the sequence moves through these layers, the model builds a deeper view of
the text.
Positional Encoding
Transformers do not naturally know the order in which tokens appear, so positional encodings are added to the token representations to convey each token’s position in the sequence.
Parameters
Inside the architecture are parameters - the values the model adjusts during
training.
They store the patterns the model learns from text and form the basis for its
ability to understand and generate language.
Because these models are too large for a single machine, they are trained across
many GPUs in parallel.
The model’s parameters, computations and training data are distributed so the
system can process massive datasets and update billions of parameters efficiently.
How to train LLM from scratch?
We’ll cover:
● Pre-training
● Instruction fine-tuning
● Preference fine-tuning
● Reasoning fine-tuning
You ask, “What is an LLM?” and get gibberish like “try peter hand and hello
448Sn”.
It hasn’t seen any data yet and possesses just random weights.
1) Pre-training
This stage teaches the LLM the basics of language by training it on massive
corpora to predict the next token. This way, it absorbs grammar, world facts, etc.
But it’s not good at conversation because when prompted, it just continues the
text.
2) Instruction fine-tuning
To make it conversational, we do Instruction Fine-tuning by training on
instruction-response pairs. This helps it learn how to follow prompts and format
replies.
Now it can:
● Answer questions
● Summarize content
● Write code, etc.
3) Preference fine-tuning
When users react to or rate the model’s responses, that’s not just feedback; it’s valuable human preference data.
In PFT, a reward model is trained to predict human preferences, and the LLM is then updated using RL against it.
It teaches the LLM to align with humans even when there’s no "correct" answer.
4) Reasoning fine-tuning
In reasoning tasks (maths, logic, etc.), there's usually just one correct response
and a defined series of steps to obtain the answer.
So we don’t need human preferences, and we can use correctness as the signal.
Steps:
● Prompt the model to produce a reasoning trace followed by a final answer.
● Check the final answer against the known correct response.
● Use correctness as the reward signal and update the model with RL (GRPO, covered later in this book, is a popular choice).
How do LLMs work?
At their core, LLMs work with conditional probability. For instance, if we're predicting whether it will rain today (event A), knowing that it's cloudy (event B) might impact our prediction.
it's cloudy (event B) might impact our prediction.
As it's more likely to rain when it's cloudy, we'd say the conditional probability
P(A|B) is high.
These models are tasked with predicting/guessing the next word in a sequence.
This is a question of conditional probability: given the words that have come
before, what is the most likely next word?
To predict the next word, the model calculates the conditional probability for
each possible next word, given the previous words (context).
The word with the highest conditional probability is chosen as the prediction.
If we always pick the word with the highest probability, we end up with repetitive
outputs, making LLMs almost useless and stifling their creativity.
To make LLMs more creative, instead of selecting the best token (for simplicity
let's think of tokens as words), they "sample" the prediction.
So even if “Token 1” has the highest score, it may not be chosen since we are
sampling.
Now, temperature introduces the following tweak in the softmax function, which,
in turn, influences the sampling process:
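The original formula was shown as an image; reconstructed from the standard temperature-scaled softmax, the tweak divides each logit z_i by a temperature T before normalizing:

P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

With T < 1 the distribution becomes sharper (more deterministic picks); with T > 1 it becomes flatter, so lower-probability tokens get sampled more often.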
7 LLM Generation Parameters
Knowing how to tune the generation parameters is important so that you can produce sharper, more controlled outputs.
1) Max tokens
This is a hard cap on how many tokens the model can generate in one response.
Too low → truncated outputs; too high → could lead to wasted compute.
2) Temperature
3) Top-k
The default way to generate the next token is to sample from all tokens,
proportional to their probability.
Example: k=5 → model only considers 5 most likely next tokens during sampling.
Helps enforce focus, but overly small k may give repetitive outputs.
Instead of picking from all tokens or top k tokens, model samples from a
probability mass up to p.
Example: top_p=0.9 → only the smallest set of tokens covering 90% probability are
considered.
More adaptive than top_k, useful when balancing coherence with diversity.
5) Frequency penalty
6) Presence penalty
Encourages the model to bring in new tokens not yet seen in the text.
Higher values push for novelty, lower values make the model stick to known
patterns.
7) Stop sequences
Lets you enforce strict response boundaries without heavy prompt engineering.
Bonus: Min-p sampling
Min-p looks at the probability of the most likely token and only keeps tokens that are at least a certain fraction (say 10%) as likely.
For instance, if your top token has 60% probability, then only a few options remain. But if it’s just 20%, many other tokens can pass the threshold.
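Here is a hedged sketch of how these knobs typically appear in an OpenAI-style chat completion call (the model name is just an example; parameter names and availability vary across providers, and top_k is not exposed by every API):

```python
# Sketch: tuning generation parameters via an OpenAI-style API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize RAG in two sentences."}],
    max_tokens=150,          # 1) hard cap on generated tokens
    temperature=0.7,         # 2) randomness of sampling
    top_p=0.9,               # 4) nucleus (probability-mass) sampling
    frequency_penalty=0.3,   # 5) discourage repeating frequent tokens
    presence_penalty=0.2,    # 6) encourage introducing new tokens
    stop=["\n\n"],           # 7) stop sequences
)
print(response.choices[0].message.content)
```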
4 LLM Text Generation Strategies
But here’s the catch: predicting probabilities is not enough. We still need a strategy to pick which token to use at each step.
The naive approach greedily chooses the word with the highest probability from
the probability vector, and autoregresses. This is often not ideal since it leads to
repetitive sentences.
Ideally, we’d pick the sequence with the highest joint probability, i.e., the product of the step-wise conditional probabilities. To maximize this product, you’d need to know future conditionals (what comes after each candidate).
But when decoding, we only know the probabilities for the next step, not the downstream continuation. Beam search approximates this by keeping several candidate sequences (beams) alive at each step.
Some beams may have started with less probable tokens initially, but lead to
much higher-probability completions.
By keeping alternatives alive, beam search explores more of the probability tree.
This is widely used in tasks like machine translation, where correctness matters
more than creativity.
Applies a penalty if the token is too similar to what’s already been generated.
This way, it also prevents “stuck in a loop” problems while keeping coherence
high.
It’s especially effective for longer generations like stories, where repetition can
easily creep in.
SLED introduces a small but meaningful change: instead of using only the final
layer’s logits, it looks at how logits evolve across all layers. Each layer contributes
its own prediction, and SLED measures how closely these predictions agree. It
then nudges the final logits toward this layer-wise consensus before selecting the
next token.
3 Techniques to Train An LLM Using Another LLM
Often, we want to train a smaller LLM with the help of a larger, more capable one. Distillation helps us do so, and the three popular techniques are described below.
The idea is to transfer "knowledge" from one LLM to another, which has been quite common in traditional deep learning.
● Pre-training
○ Train the bigger Teacher LLM and the smaller student LLM
together.
○ Llama 4 did this.
● Post-training:
○ Train the bigger Teacher LLM first and distill its knowledge to the
smaller student LLM.
○ DeepSeek did this by distilling DeepSeek-R1 into Qwen and Llama
3.1 models.
You can also apply distillation during both stages, which Gemma 3 did.
1) Soft-label distillation
However, you must have access to the Teacher’s weights to get the output
probability distribution.
Say your vocab size is 100k tokens and your data corpus is 5 trillion tokens.
Since we generate softmax probabilities of each input token over the entire
vocabulary, you would need 500 million GBs of memory to store soft labels under
float8 precision.
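For intuition, here is a minimal sketch of the soft-label distillation objective, assuming the standard KL-divergence between temperature-softened teacher and student distributions (illustrative only, not the exact recipe any specific model used):

```python
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # batchmean + T^2 scaling is the usual convention in the distillation literature
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T * T)

# Toy usage: a batch of 4 positions over a 10-token vocabulary.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
print(soft_label_distillation_loss(student_logits, teacher_logits))
```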
2) Hard-label distillation
● Use a fixed pre-trained Teacher LLM to get just the final one-hot output token (the hard label).
● Use the untrained Student LLM to get softmax probabilities over the vocabulary for the same data.
● Train the Student LLM to match the Teacher’s hard labels.
DeepSeek did this by distilling DeepSeek-R1 into Qwen and Llama 3.1 models.
3) Co-distillation
Llama 4 did this to train Llama 4 Scout and Maverick from Llama 4 Behemoth.
Of course, during the initial stages, the soft labels of the Teacher LLM won’t be accurate.
That is why the Student LLM is trained using both the Teacher’s soft labels and the ground-truth hard labels.
4 Ways to Run LLMs Locally
1) Ollama
Ollama is one of the simplest ways to download and run open-weight LLMs on your own machine.
Done!
Now, you can download and run any of the supported models with Ollama’s CLI (ollama pull and ollama run).
For programmatic usage, you can also install the Python package of Ollama or its integrations with orchestration frameworks like Llama Index or CrewAI:
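A minimal sketch with the Ollama Python package (the model name is just an example; check Ollama’s model library for what is available):

```python
# First, from the terminal:
#   ollama pull llama3.2
#   ollama run llama3.2
# Then, for programmatic usage (pip install ollama):
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain LoRA in one paragraph."}],
)
print(response["message"]["content"])
```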
2) LMStudio
The app does not collect data or monitor your actions. Your data stays local on
your machine. It’s free for personal use.
It offers a ChatGPT-like interface, allowing you to load and eject models as you chat.
3) vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving (more details in the LLM deployment section).
With just a few lines of code, you can locally run LLMs (like DeepSeek) in an
OpenAI-compatible format:
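A minimal sketch of vLLM’s offline Python API (the model id is only an example; vLLM can also expose an OpenAI-compatible server via its vllm serve command):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")  # example model id
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(["What is KV caching?"], params)
print(outputs[0].outputs[0].text)
```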
4) LlamaCPP
LlamaCPP enables LLM inference with minimal setup and good performance.
Transformer vs. Mixture of Experts in LLMs
All modern LLMs rely on the Transformer architecture, but there is another important question:
How do we scale models further without making them impossibly large and
expensive to run?
MoE keeps the overall parameter count large but activates only a small subset of
“experts” for each token.
With that context, let’s compare the traditional Transformer block with the MoE
alternative.
During inference, only a subset of the experts is activated, which makes inference faster in MoE models.
But how does the model decide which experts each token should be routed to?
The router is essentially a multi-class classifier that produces softmax scores over the experts. Based on these scores, we select the top K experts.
The router is trained along with the rest of the network, and it learns to select the best experts.
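Here is a toy sketch of that routing step (top-K selection over router scores; a real MoE block also handles load balancing and expert capacity, which are discussed next):

```python
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Toy router: scores each token against all experts and keeps the top-K."""
    def __init__(self, hidden_dim, num_experts, k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)
        self.k = k

    def forward(self, x):                           # x: [tokens, hidden_dim]
        logits = self.gate(x)                       # [tokens, num_experts]
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = torch.softmax(topk_vals, dim=-1)  # mixing weights over the chosen experts
        return weights, topk_idx

router = TopKRouter(hidden_dim=512, num_experts=8, k=2)
tokens = torch.randn(4, 512)
weights, experts = router(tokens)
print(experts)  # indices of the 2 experts selected for each token
```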
Challenge 1) In the early stages of training, all experts are similarly (un)trained, which can create a rich-get-richer loop:
● The model selects "Expert 2" (randomly, since all experts are similar).
● The selected expert gets a bit better.
● It may get selected again since it’s the best.
● This expert learns more.
● The same expert can get selected again since it’s the best.
● It learns even more.
● And so on!
● Add noise to the feed-forward output of the router so that other experts
can get higher logits.
● Set all but top K logits to -infinity. After softmax, these scores become
zero.
Challenge 2) Some experts may get exposed to more tokens than others, leading to under-trained experts.
To handle this, we can cap the number of tokens each expert processes (its capacity). If an expert reaches the limit, the input token is passed to the next best expert instead.
MoEs have more parameters to load. However, a fraction of them are activated
since we only select some experts.
This leads to faster inference. Mixtral 8x7B by MistralAI is one famous LLM that
is based on MoE.
Prompt Engineering
What is Prompt Engineering?
You’re not changing weights (the learned parameters inside the model). You’re changing instructions, and that changes everything.
With well-crafted prompts, you can get a model to:
● Think step-by-step
● Follow constraints
● Stay focused
● Avoid shallow answers
It’s the fastest, lowest-effort way to get better results from any model.
3 prompting techniques for reasoning in LLMs
Just as we think through a hard coding problem before writing the solution, LLMs benefit when we prompt them to reason through complex tasks like math, logic, or multi-step problems.
Let’s look at three popular prompting techniques that help LLMs think more clearly before they answer.
1) Chain-of-Thought (CoT) prompting
Instead of asking the LLM to jump straight to the answer, we nudge it to reason step by step.
This often improves accuracy because the model can walk through its logic before committing to a final output.
For instance, simply appending a cue like "Let's think step by step" to the question prompts the model to lay out its reasoning before answering.
It’s a simple example, but this tiny nudge can unlock reasoning capabilities that standard zero-shot prompting could miss.
2) Self-Consistency
If you prompt the same question multiple times, you might get different answers depending on the temperature setting (we covered temperature in the LLMs chapter).
With self-consistency, you ask the LLM to generate multiple reasoning paths and then select the most common final answer.
It’s a simple idea: when in doubt, ask the model several times and trust the
majority.
However, it doesn’t evaluate how the reasoning was done—just whether the final
answer is consistent across paths.
3) Tree of Thoughts (ToT)
While Self-Consistency varies the final answer, Tree of Thoughts varies the steps of reasoning at each point and then picks the best path overall.
At every reasoning step, the model explores multiple possible directions. These branches form a tree, and a separate process evaluates which path looks the most promising at that point.
Think of it like a search algorithm over reasoning paths, where we try to find the
most logical and coherent trail to the solution.
CoT, Self-Consistency, and ToT all improve how the model reasons through a
problem.
But they still rely on free-form thinking, which breaks down in long, rule-heavy
tasks.
Bonus: ARQ
Here’s the core problem with current techniques that this new approach solves.
We have enough research to conclude that LLMs often struggle to assess what
truly matters in a particular stage of a long, multi-turn conversation.
For instance, when you give Agents a 2,000-word system prompt filled with
policies, tone rules, and behavioral dos and don’ts, you expect them to follow it
word by word.
And finally, the LLM that was supposed to “never promise a refund” is happily
offering one.
This means they can easily ignore crucial rules (stated initially) halfway through
the process.
But even with methods like CoT, reasoning remains free-form, i.e., the model
“thinks aloud” but it has limited domain-specific control.
That’s the exact problem the new technique, called Attentive Reasoning Queries
(ARQs), solves.
Instead of letting LLMs reason freely, ARQs guide them through explicit,
domain-specific questions.
By the time the LLM generates the final response, it’s already walked through a
sequence of *controlled* reasoning steps, which did not involve any free text
exploration (unlike techniques like CoT or ToT).
In the paper’s evaluation, ARQs achieved the highest success rate:
● ARQ - 90.2%
● CoT reasoning - 86.1%
● Direct response generation - 81.5%
When you make reasoning explicit, measurable, and domain-aware, LLMs stop
improvising and start reasoning with intention. Free-form thinking sounds
powerful, but in high-stakes or multi-turn scenarios, structure always wins.
But there’s another challenge: many aligned LLMs stop exploring alternative
answers altogether.
Even with good reasoning steps, the model may collapse into the same safe,
typical responses.
To regain that lost diversity without retraining the model, we use Verbalized
Sampling.
Verbalized Sampling
Post-training alignment methods, such as RLHF, are designed to make LLMs helpful and safe. A side effect, however, is mode collapse: the aligned model keeps returning the same few typical responses.
According to the paper that introduced verbalized sampling, mode collapse happens because the human preference data used to train the LLM has a hidden flaw called typicality bias.
Annotators are asked to rate different responses from an LLM, and later, the
LLM is trained using a reward model that learns to mimic these human
preferences.
However, it is observed that annotators naturally tend to favor answers that are
more familiar, easy to read, and predictable. This is the typicality bias. So even if
a new, creative answer is just as good (or correct) as a common one, the human’s
preference often leans toward the common one.
Due to this, the reward model boosts responses that the original (pre-aligned)
model already considered likely.
That said, this is not an irreversible effect, and the LLM still has two "personalities" after alignment:
● The original model, which learned the rich range of possibilities during pre-training.
● The safety-focused, post-aligned model which, due to typicality bias, has been unintentionally pushed to strongly favor the most predictable responses.
The core idea of verbalized sampling is that the prompt itself acts like a mental
switch.
When you directly prompt “Tell me a joke”, the aligned personality immediately
takes over and outputs the most reinforced answer.
But in verbalized sampling, you prompt it with “Generate 5 responses with their
corresponding probabilities. Tell me a joke.”
In this case, the prompt does not request an instance, but a distribution.
This nudges the aligned model to surface its full knowledge, forcing it to tap into the diverse distribution it learned during pre-training.
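In practice, verbalized sampling is just a prompt change. A minimal sketch (the wording follows the example above; the model name is an example):

```python
from openai import OpenAI

client = OpenAI()

vs_prompt = (
    "Generate 5 responses with their corresponding probabilities. "
    "Tell me a joke."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": vs_prompt}],
)
# The output is a small distribution: 5 candidate jokes, each with a verbalized probability.
print(resp.choices[0].message.content)
```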
Larger, more capable models like GPT-4.1 and Gemini-2.5-Pro benefit more from
verbalized sampling, showing diversity gains up to 2 times greater than smaller
models.
JSON prompting for LLMs
Often, the problem isn’t the model; it’s the lack of structure in the prompt.
For tasks like extraction, reporting, automation, or analysis, you need the output to stay consistent every single time.
Let us discuss exactly what JSON prompting is and how it can drastically
improve your AI outputs!
When you give instructions like "summarize this email" or "give me key
takeaways," you leave room for interpretation, which can lead to hallucinations.
The reason JSON is so effective is that AI models are trained on massive amounts
of structured data from APIs and web applications.
When you speak their "native language," they respond with laser precision!
JSON forces you to think in terms of fields and values, which is a gift.
Prompting isn't just about what you ask; it's about what you expect back.
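As an illustration, here is a sketch of a JSON-structured prompt for email summarization (the field names are a hypothetical schema you would define for your own task; JSON-mode support varies by provider):

```python
import json
from openai import OpenAI

client = OpenAI()

prompt = {
    "task": "summarize_email",
    "input": "<email text here>",
    "output_format": {
        "summary": "string, max 2 sentences",
        "action_items": "list of strings",
        "urgency": "one of: low, medium, high",
    },
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # ask for JSON output where supported
    messages=[
        {"role": "system", "content": "Respond with valid JSON matching output_format."},
        {"role": "user", "content": json.dumps(prompt)},
    ],
)
print(json.loads(resp.choices[0].message.content))
```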
And this works irrespective of what you are doing, like generating content,
reports, or insights. JSON prompts ensure a consistent structure every time.
You can turn JSON prompts into shareable templates for consistent outputs.
Teams can plug results directly into APIs, databases, and apps; no manual
formatting, so work stays reliable and moves much faster.
To summarise:
Structured JSON prompting for LLMs is like writing modular code; it brings clarity of thought, makes adding new requirements effortless, and creates better communication with AI.
It’s not just a technique; it’s a habit worth developing for cleaner AI interactions.
Fine-tuning
What is Fine-tuning?
In the pre-LLM era, whenever someone open-sourced any high-utility model for
public use, in most cases, practitioners would fine-tune that model to their
specific task.
When the model was developed, it was trained on a specific dataset that might
not perfectly match the characteristics of the data a practitioner wants to use it
on.
The original dataset might have had slightly different distributions, patterns, or
levels of noise compared to the new dataset.
Fine-tuning allows the model to adapt to these differences, learning from the
new data and adjusting its parameters to improve its performance on the specific
task at hand.
Consider BERT, for instance. It’s open-source.
BERT was pre-trained on a large corpus of text data, which might be very different from the data someone else wants to use it on.
Thus, when using it on any downstream task, we can adjust the weights of the
BERT model along with the augmented layers, so that it better aligns with the
nuances and specificities of the new dataset.
Issues with traditional fine-tuning
However, this traditional approach breaks down for LLMs because, as you may already know, these models are huge - billions or even trillions of parameters.
Traditional fine-tuning is just not practically feasible here, and in fact, not
everyone can afford to do it due to a lack of massive infrastructure.
5 LLM Fine-tuning Techniques
Thankfully, today we have many more efficient ways to fine-tune LLMs, and five popular techniques are described below:
1) LoRA
Add two low-rank matrices, A and B, alongside the weight matrix; these contain the trainable parameters. Instead of fine-tuning W directly, the weight updates are captured by these low-rank matrices.
2) LoRA-FA
LoRA-FA (Frozen-A) is a lighter variant of LoRA: matrix A is frozen after initialization and only matrix B is trained, which further reduces the memory required during fine-tuning.
3) VeRA
In LoRA, every layer has a different pair of low-rank matrices A and B, and both
matrices are trained. In VeRA, however, matrices A and B are frozen, random,
and shared across all model layers. VeRA focuses on learning small, layer-specific
scaling vectors, denoted as b and d, which are the only trainable parameters in
this setup.
4) Delta-LoRA
Here, in addition to training low-rank matrices, the matrix W is also adjusted but
not in the traditional way. Instead, the difference (or delta) between the product
of the low-rank matrices A and B in two consecutive training steps is added to W:
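The update formula itself was shown as an image; reconstructed from the description above (with λ as an update-ratio hyperparameter and t indexing training steps), the Delta-LoRA update is commonly written as:

W^{(t+1)} = W^{(t)} + \lambda \left( A^{(t+1)} B^{(t+1)} - A^{(t)} B^{(t)} \right)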
5) LoRA+
In LoRA, both matrices A and B are updated with the same learning rate.
Authors found that setting a higher learning rate for matrix B results in more
optimal convergence.
Bonus: LoRA-drop
LoRA-drop observes that not all layers benefit equally from LoRA updates. It first
adds low-rank matrices to every layer and trains briefly, then measures each
layer’s activation strength to see which layers actually matter.
Layers whose LoRA activations stay near zero have minimal influence on the
model's output and can be removed.
For instance, consider your model has over a million parameters, each
represented with 32-bit floating-point numbers.
While reducing the bit-width of parameters makes the model smaller, it also
leads to a loss of precision.
This means the model's predictions become somewhat more approximate than those of the original, full-precision model.
This is closely related to quantization in model compression, which we will cover in the LLM optimization section.
Bonus: DoRA
At its core, DoRA builds upon the principles of LoRA but introduces a
decomposition step that separates a pretrained weight matrix W into two
components: magnitude (m) and direction (V).
Implementing LoRA From Scratch
How does LoRA work?
Consider the current weights of some layer in the pre-trained model to be W, of dimensions d ∗ k, and suppose we wish to fine-tune the model on another dataset.
During fine-tuning, the gradient update rule suggests that we must add ΔW to
get the updated parameters:
For simplicity, you can think about ΔW as the update obtained after running
gradient descent on the new dataset:
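Since the original equations were images, here they are reconstructed in standard form (η is the learning rate and L is the loss on the new dataset):

W_{\text{updated}} = W + \Delta W, \qquad \Delta W = -\eta \, \nabla_W L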
In fact, across all the fine-tuning iterations, W can be kept static, and all weight updates from gradient computation can be incorporated into ΔW instead.
The matrix W is already huge, and we are talking about introducing another
matrix that is equally big.
So, we must introduce some smart tricks to manipulate ΔW so that we can fulfill
the fine-tuning objective while ensuring we do not consume high memory.
Now, we really can’t do much about W, as these weights belong to the pre-trained model. So all optimization (if we intend to do any) must happen on ΔW instead.
While doing so, we must also remember that both W and ΔW have the same dimensions. Given that W is already huge, we must ensure that we don’t end up training a full matrix of that size, as this would defeat the entire purpose of efficient fine-tuning.
But how can we even add two matrices if they have different dimensions?
Thus, the dimensions of ΔW must also be d ∗ k. But this does not mean that the number of trainable parameters in the A and B matrices must match the size of ΔW.
Instead, A and B can be extremely small matrices, and the only thing we must
ensure is that their product results in a matrix, which has dimensions d ∗ k.
Thus:
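Concretely, and consistent with the text above, the decomposition can be written as:

\Delta W = A \, B, \quad A \in \mathbb{R}^{d \times r}, \; B \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k)

The product AB still has dimensions d ∗ k, while A and B together contain only r(d + k) trainable parameters.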
During training, only matrices A and B are trained, while the rest of the network’s weights are kept fixed.
In the above image, every point denotes a possible LoRA configuration. Also, the
upper right corner refers to full fine-tuning.
Implementation
While a few open-source implementations of LoRA are already available, we shall implement it from scratch using PyTorch to get a better feel for the practical details.
As discussed above, a typical LoRA layer comprises two matrices, A and B. These
have been implemented in the LoRAWeights class below along with the forward
pass:
Also, both self.A and self.B are learnable parameters of the module, representing
the matrices used in the decomposition.
In the forward method, the input x is multiplied by the matrices A and B, and then
scaled by alpha. The result is returned as the output of the module.
● A higher value of alpha means that the changes made by the LoRA layer
will be more significant, potentially leading to more pronounced
adjustments in the model's behavior.
Since LoRA is applied after pre-training, we will already have a trained model available. Let’s say it is accessible via the model object.
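The original network definition was also shown as an image. As a stand-in, assume model is a simple four-layer MLP like the hypothetical sketch below (the layer sizes are made up; what matters is that it exposes fc1 to fc4):

```python
import torch
import torch.nn as nn

class MyNeuralNetwork(nn.Module):
    """Hypothetical pre-trained network with four fully connected layers."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        return self.fc4(x)

model = MyNeuralNetwork()  # assume this has already been trained
```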
Now, our primary objective is to attach the matrices in LoRAWeights class with
the matrices in the layers of the above network. And, of course, each layer (fc1,
fc2, fc3, fc4) will have its respective LoRAWeights layer.
Note: Of course, it is not necessary that every layer must have a corresponding LoRAWeights layer. In fact, the original paper mentions that the authors limited their study to adapting only the attention weights for downstream tasks and froze the multi-layer perceptron (feed-forward) units of the Transformer for parameter efficiency.
In our case, for instance, we can freeze the fc4 layer as it is not enormously big
compared to other layers in the network.
Also, we must remember that the network is trained just as we would train any other neural network, except that only the weight matrices A and B are updated, i.e., the pre-trained model (model) is frozen. We do this as follows:
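A sketch of the freezing step:

```python
# Freeze every parameter of the pre-trained model; only the LoRA matrices will train.
for param in model.parameters():
    param.requires_grad = False
```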
Done!
Next, we utilize the LoRAWeights class to define the fine-tuning network below:
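Again, the original listing was an image; the sketch below follows the description given after it (LoRA adapters on fc1-fc3, with fc4 left untouched, continuing from the previous snippets):

```python
class MyNeuralNetworkwithLoRA(nn.Module):
    """Wraps the frozen pre-trained model and adds LoRA adapters to fc1-fc3."""
    def __init__(self, model, rank=4, alpha=1.0):
        super().__init__()
        self.model = model
        self.loralayer1 = LoRAWeights(model.fc1.in_features, model.fc1.out_features, rank, alpha)
        self.loralayer2 = LoRAWeights(model.fc2.in_features, model.fc2.out_features, rank, alpha)
        self.loralayer3 = LoRAWeights(model.fc3.in_features, model.fc3.out_features, rank, alpha)

    def forward(self, x):
        # Each frozen layer's output is summed with its LoRA adapter's output.
        x = torch.relu(self.model.fc1(x) + self.loralayer1(x))
        x = torch.relu(self.model.fc2(x) + self.loralayer2(x))
        x = torch.relu(self.model.fc3(x) + self.loralayer3(x))
        return self.model.fc4(x)

lora_model = MyNeuralNetworkwithLoRA(model)
```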
As depicted above:
● The LoRA layers are applied over the fully connected layer (fc1, fc2, fc3) in
the existing model. More specifically, we create three LoRAWeights layers
(loralayer1, loralayer2, loralayer3) based on the dimensions of the fully
connected layers (fc1, fc2, fc3) in the model.
● In the forward method, we pass the input through the first fully connected
layer (fc1) of the original model and add the output to the result of the
LoRA layer applied to the same input (self.loralayer1(x)). Next, we apply a
ReLU activation function to the sum. We repeat the process for the second
and third fully connected layers (fc2, fc3) and lastly, return the final output
of the last fully connected layer of the pre-trained model (fc4).
Done!
Now this MyNeuralNetworkwithLoRA model can be trained like any other neural network.
We have ensured that the pre-trained model (model) does not update during fine-tuning and that only the weights in the LoRAWeights layers are learned.
So far, we explored how to update model weights efficiently (LoRA and its
variants).
But fine-tuning also depends on what data you use to update the model.
Generate Your Own LLM Fine-tuning Dataset (IFT)
Instruction fine-tuning (IFT) is the foundation of supervised fine-tuning (SFT), and most modern LLMs rely on some form of IFT during training.
For instance, check this to understand how a pre-trained LLM behaves when
prompted:
● Input an instruction.
● Two LLMs generate responses.
● A judge LLM rates the responses.
● The best response is paired with the instruction.
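A compact sketch of that pipeline using an OpenAI-style client (the generator model names and the judge prompt are placeholders; in practice you would typically use a dedicated synthetic-data framework):

```python
from openai import OpenAI

client = OpenAI()

def generate(model, prompt):
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def build_ift_pair(instruction):
    # Two candidate responses from two (placeholder) models.
    a = generate("gpt-4o-mini", instruction)
    b = generate("gpt-4o", instruction)
    # A judge LLM rates the responses and picks the better one.
    verdict = generate(
        "gpt-4o",
        f"Instruction: {instruction}\n\nResponse A: {a}\n\nResponse B: {b}\n\n"
        "Which response is better? Reply with exactly 'A' or 'B'.",
    )
    best = a if verdict.strip().upper().startswith("A") else b
    return {"instruction": instruction, "response": best}

print(build_ift_pair("Explain overfitting to a 10-year-old."))
```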
Once the pipeline has been defined, we need to execute it by giving it a seed
dataset.
The seed dataset helps it generate new but similar samples. So we execute the
pipeline with our seed dataset as follows:
Done!
So far, we’ve explored how to fine-tune a model efficiently using LoRA and its
variants, and what kind of data is typically used through instruction fine-tuning.
The next question is: how do different fine-tuning objectives actually change the
learning process?
Both update the model using LoRA or similar PEFT methods, but their goals and
training signals differ dramatically.
SFT vs RFT
Before diving deeper, it's crucial to understand how we usually fine-tune LLMs
using SFT, or supervised fine-tuning.
SFT process: collect a labeled dataset of prompt-response pairs and train the model to imitate the reference responses.
RFT process: let the model generate its own responses, score them with a reward function, and update the model with reinforcement learning to maximize that reward.
SFT uses static data and often memorizes answers. RFT, being online, learns
from rewards and explores new strategies.
This flowchart gives a quick guide on which fine-tuning method to use based on
your data and the nature of the task.
Overall, this decision tree helps you quickly identify the most efficient and
reliable fine-tuning strategy for your use case.
Now that we understand when to use SFT or RFT, let's apply RFT in practice.
For reasoning-heavy tasks like math or logic, GRPO (Group Relative Policy
Optimization) is one of the most effective RFT methods available.
Let’s walk through how to fine-tune a model using GRPO with Unsloth.
Let’s dive into the code to see how we can use GRPO to turn any model into a
reasoning powerhouse without any labeled data or human intervention.
We’ll use:
The code is available here: Build a reasoning LLM from scratch using GRPO. You
can run it without any installations by reproducing our environment below:
Let’s begin!
We'll use LoRA to avoid fine-tuning the entire model weights. In this code, we
use Unsloth's PEFT by specifying:
● The model
● LoRA low-rank (r)
● Modules for fine-tuning, etc.
We load the Open R1 Math dataset (a math problem dataset) and format it for
reasoning.
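A condensed sketch of the setup (the base model name, dataset id, hyperparameters, and the reward-function signature are illustrative; refer to the linked notebook for the exact code):

```python
from unsloth import FastLanguageModel
from datasets import load_dataset

# 1) Load a base model and wrap it with LoRA adapters via Unsloth's PEFT helper.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",  # example base model
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                                      # LoRA low-rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # modules to fine-tune
    lora_alpha=16,
)

# 2) Load the math dataset and keep (question, answer) pairs for reward checking.
dataset = load_dataset("open-r1/OpenR1-Math-220k", split="train")  # example dataset id

# 3) A simple correctness reward: 1 if the reference answer appears in the completion.
#    (The exact signature depends on the trainer; this is illustrative.)
def correctness_reward(completions, answers, **kwargs):
    return [1.0 if ans in comp else 0.0 for comp, ans in zip(completions, answers)]
```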
Now that we have the dataset and reward functions ready, it's time to apply
GRPO.
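Unsloth piggybacks on TRL's GRPO trainer; here is a hedged sketch of the training call (argument names follow TRL's GRPOConfig and may differ across versions):

```python
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="grpo-reasoning",
    learning_rate=5e-6,
    num_generations=8,          # responses sampled per prompt (the "group" in GRPO)
    max_completion_length=512,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    reward_funcs=[correctness_reward],  # from the previous snippet
)
trainer.train()
```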
Comparison
Again, we can see how GRPO turned a base model into a reasoning powerhouse.
RFT methods like GRPO work best when paired with reliable reinforcement
learning environments. This brings us to an important component of RL-based
fine-tuning: how agents interact with environments.
Bottleneck in Reinforcement Learning
A central difficulty in reinforcement learning lies not in training the agent but in
managing the environment in which the agent operates.
The environment defines the task, the rules, the available actions and the reward
structure. Because there is no standard way to construct these environments,
each project tends to develop its own APIs and interaction patterns.
The Solution: The OpenEnv Framework
OpenEnv tackles this by standardizing how agents interact with environments. Because the interface is stable and uniform, the same pattern applies to a wide variety of tasks, from simple games to complex, custom-built worlds.
Agent Reinforcement Trainer (ART)
Reinforcement learning becomes more complex when the “agent” is an LLM.
Instead of choosing a simple action like moving left or right - an LLM agent
produces multi-step reasoning traces, tool calls, conversations and plans.
Training such agents requires a system that can collect these trajectories, assign
rewards and update the model reliably.
ART uses a lightweight client that wraps your existing agent with minimal
changes. The client communicates with an ART training server, which manages
rollouts, reward computation, batching and optimization.
A key feature is ART’s support for Group Relative Policy Optimization (GRPO),
an RL algorithm widely used for training LLMs. GRPO allows the model to learn
from trajectory-level rewards rather than token-level labels, which is essential for
improving behaviors like planning, correction and tool use.
● You start with your existing agent code - ART simply wraps it so you don’t
need to rewrite anything.
● The agent runs and produces a trajectory.
● The trajectory is scored using a reward function.
● ART applies GRPO (or another supported RL method) to update the policy.
● The loop repeats, gradually improving the agent’s behavior.
RAG
What is RAG?
Up to this point, we have seen two ways to adapt an LLM to a task: prompt it with better instructions and context, or fine-tune it on task-specific data.
Both approaches are powerful, but they share one fundamental limitation: the model can only use the knowledge it already contains.
With RAG, the language model can use the retrieved information (which is
expected to be reliable) from the vector database to ensure that its responses are
grounded in real-world knowledge and context, reducing the likelihood of
hallucinations.
This makes the model's responses more accurate, reliable, and contextually
relevant, while also ensuring that we don't have to train the LLM repeatedly on
new data. This makes the model more "real-time" in its responses.
Each data point, whether a word, a document, an image, or any other entity, is
transformed into a numerical vector using ML techniques (which we shall see
ahead).
This numerical vector is called an embedding, and the model is trained in such a
way that these vectors capture the essential features and characteristics of the
underlying data.
This shows that embeddings can learn the semantic characteristics of entities
they represent (provided they are trained appropriately).
Once stored in a vector database, we can retrieve original objects that are similar
to the query we wish to run on our unstructured data.
The purpose of vector databases in RAG
Once we have trained our LLM, it will have some model weights for text generation. Where do vector databases fit in here?
For instance, if the model was deployed after considering the data until 31st Jan
2024, and we use it, say, a week after training, it will have no clue about what
happened in those days.
Repeatedly training a new model (or adapting the latest version) every single day
on new data is impractical and cost-ineffective. In fact, LLMs can take weeks to
train.
Also, what if we open-sourced the LLM and someone else wants to use it on their
privately held dataset, which, of course, was not shown during training?
But if you think about it, is it really our objective to train an LLM to know every
single thing in the world?
Not at all!
Instead, it is more about helping the LLM learn the overall structure of the
language, and how to understand and generate it.
So, once we have trained this model on a ridiculously large enough training
corpus, it can be expected that the model will have a decent level of language
understanding and generation capabilities.
Thus, if we could figure out a way for LLMs to look up new information they
were not trained on and use it in text generation (without training the model
again), that would be great!
But since LLMs usually have a limit on the context window (number of
words/tokens they can accept), the additional information can exceed that limit.
When the LLM needs to access this information, it can query the vector database
using an approximate similarity search with the prompt vector to find content
that is similar to the input query vector.
Once the approximate nearest neighbors have been retrieved, we gather the
context corresponding to those specific vectors, which were stored at the time of
indexing the data in the vector database (this raw data is stored as payload, which
we will learn during implementation).
The above search process retrieves context that is similar to the query vector,
which represents the context or topic the LLM is interested in.
We can augment this retrieved content along with the actual prompt provided by
the user and give it as input to the LLM.
Consequently, the LLM can easily incorporate this info while generating text
because it now has the relevant details available in the prompt.
Now that we understand the purpose, let's get into the technical details.
Workflow of a RAG system
We start with some external knowledge that wasn't seen during training and that we want to augment the LLM with.
The first step is to break down this additional knowledge into chunks before embedding and storing it in the vector database.
Moreover, if we don't chunk, the entire document will have a single embedding,
which won't be of any practical use to retrieve relevant context.
Since these are “context embedding models” (not word embedding models),
models like bi-encoders are highly relevant here.
This shows that a vector database acts as a memory for your RAG application
since this is precisely where we store all the additional knowledge, using which,
the user's query will be answered.
A vector database also stores the metadata and original content along with the vector
embeddings.
With that, our vector database has been created and information has been added.
More information can be added to this if needed.
Next, the user inputs a query, a string representing the information they're
seeking.
This query is transformed into a vector using the same embedding model we used
to embed the chunks earlier in Step 2.
The vectorized query is then compared against our existing vectors in the
database to find the most similar information.
After retrieval, the selected chunks might need further refinement to ensure the
most relevant information is prioritized.
This process rearranges the chunks so that the most relevant ones are prioritized
for the response generation.
That said, not every RAG app implements this, and typically, they just rely on the
similarity scores obtained in step 6 while retrieving the relevant context from the
vector database.
Almost done!
Once the most relevant chunks are re-ranked, they are fed into the LLM.
This model combines the user's original query with the retrieved chunks in a
prompt template to generate a response that synthesizes information from the
selected documents.
The LLM leverages the context provided by the chunks to generate a coherent
and contextually relevant answer that directly addresses the user’s query.
Since chunking is the very first step in any RAG pipeline, it’s important to
understand the different ways it can be done.
5 chunking strategies for RAG
Since the additional document(s) can be large, the first step also involves chunking, wherein a large document is divided into smaller, manageable pieces.
This step is crucial since it ensures the text fits the input size of the embedding
model.
1) Fixed-size chunking
Since a direct split can disrupt the semantic flow, it is recommended to maintain
some overlap between two consecutive chunks (the blue part above).
This is simple to implement. Also, since all chunks are of equal size, it simplifies
batch processing.
But this usually breaks sentences (or ideas) in between. Thus, important
information will likely get distributed between chunks.
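A minimal sketch of fixed-size chunking with overlap (sizes here are in characters for simplicity; token-based splitting works the same way):

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks, overlapping consecutive chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

chunks = fixed_size_chunks("some very long document ..." * 100)
print(len(chunks), len(chunks[0]))
```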
2) Semantic chunking
Let’s say we start with the first segment and its embedding.
If the first segment’s embedding has a high cosine similarity with that of the
second segment, both segments form a chunk.
Unlike fixed-size chunks, this maintains the natural flow of language and
preserves complete ideas.
Since each chunk is richer, it improves the retrieval accuracy, which, in turn,
produces more coherent and relevant responses by the LLM.
3) Recursive chunking
Recursive chunking first splits the document using natural separators, such as sections or paragraphs.
Next, each chunk is split into smaller chunks if its size exceeds a pre-defined chunk-size limit. If, however, the chunk fits the chunk-size limit, no further splitting is done.
As shown above:
Unlike fixed-size chunks, this approach also maintains the natural flow of
language and preserves complete ideas.
4) Document structure-based chunking
This approach uses the document's inherent structure (headings, sections, paragraphs) to define chunk boundaries.
That said, it assumes that the document has a clear structure, which may not be true.
Also, chunks may vary in length, possibly exceeding model token limits. You can
try merging it with recursive splitting.
5) LLM-based chunking
This method ensures high semantic accuracy since the LLM can understand
context and meaning beyond simple heuristics (used in the above four
approaches).
But this is the most computationally demanding chunking technique of all five
techniques discussed here.
Also, since LLMs typically have a limited context window, that is something to be
taken care of.
We have observed that semantic chunking works pretty well in many cases, but
again, you need to test.
The choice will depend on the nature of your content, the capabilities of the
embedding model, computational resources, etc.
Prompting vs. RAG vs. Fine-tuning
To adapt an LLM to your use case, you can typically choose between:
● Prompt engineering
● Fine-tuning
● RAG
● Or a hybrid approach (RAG + fine-tuning)
The following visual will help you decide which one is best for you:
That's it!
Once you've decided that RAG is the right approach, the next step is choosing
the right RAG architecture for your use case.
8 RAG architectures
We prepared the following visual that details 8 types of RAG architectures used
in AI systems:
1) Naive RAG
Works best for simple, fact-based queries where direct semantic matching
suffices.
2) Multimodal RAG
Handles multiple data types (text, images, audio, etc.) by embedding and
retrieving across modalities.
Ideal for cross-modal retrieval tasks like answering a text query with both text
and image context.
3) HyDE
This technique generates a hypothetical answer document from the query before
retrieval.
Uses this generated document’s embedding to find more relevant real documents.
4) Corrective RAG
Validates retrieved results by comparing them against trusted sources (e.g., web
search).
5) Graph RAG
Retrieves from a knowledge graph, so answers can draw on entities and the relationships between them rather than isolated text chunks.
6) Hybrid RAG
Combines vector-based retrieval over unstructured text with retrieval from structured or graph data.
Useful when the task requires both unstructured text and structured relational data for richer answers.
7) Adaptive RAG
Breaks complex queries into smaller sub-queries for better coverage and
accuracy.
8) Agentic RAG
Best suited for complex workflows that require tool use, external APIs, or
combining multiple RAG techniques.
RAG vs Agentic RAG
Traditional RAG systems retrieve once and generate once. This means that if the retrieved context isn't enough, the LLM cannot dynamically search for more information.
RAG systems may provide relevant context but don't reason through complex
queries. If a query requires multiple retrieval steps, traditional RAG falls short.
There's little adaptability. The LLM can't modify its strategy based on the
problem at hand.
Due to this, Agentic RAG is becoming increasingly popular. Let's understand this
in more detail.
Agentic RAG
Note: The diagram above is one of many blueprints that an agentic RAG system may
possess. You can adapt it according to your specific use case.
As shown above, the idea is to introduce agentic behaviors at each stage of RAG.
Think of agents as someone who can actively think through a task - planning, adapting, and iterating until they arrive at the best solution, rather than just executing a single, fixed retrieve-and-generate pass.
Steps 1-2) The user inputs the query, and an agent rewrites it (removing spelling
mistakes, simplifying it for embedding, etc.)
Step 3) Another agent decides whether it needs more details to answer the query.
Step 5-8) If yes, another agent looks through the relevant sources it has access to
(vector database, tools & APIs, and the internet) and decides which source should
be useful. The relevant context is retrieved and sent to the LLM as a prompt.
Step 10) A final agent checks if the answer is relevant to the query and context.
Step 12) If not, go back to Step 1. This procedure continues for a few iterations
until the system admits it cannot answer the query.
This makes the RAG much more robust since, at every step, agentic behavior
ensures that individual outcomes are aligned with the final goal.
That said, it is also important to note that building RAG systems typically boils
down to design preferences/choices.
Traditional RAG vs HyDE
A known issue with traditional RAG is that questions are not structurally (or semantically) similar to the documents that answer them. As a result, several irrelevant contexts get retrieved because they have a higher cosine similarity to the query than the documents actually containing the answer.
The following visual depicts how HyDE differs from traditional RAG.
Use an LLM to generate a hypothetical answer H for the query Q (this answer
does not have to be entirely correct).
Embed the answer using a contriever model to get E (Bi-encoders are famously
used here).
Use the embedding E to query the vector database and fetch relevant context (C).
Done!
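A sketch of those steps (search_vector_db is a placeholder for whatever database client you use; model names are examples):

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
contriever = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a contriever-style bi-encoder

def hyde_retrieve(query, search_vector_db, k=5):
    # 1) Generate a hypothetical answer H (it need not be fully correct).
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a short passage answering: {query}"}],
    )
    hypothetical_answer = resp.choices[0].message.content
    # 2) Embed the hypothetical answer (E), not the raw query.
    e = contriever.encode(hypothetical_answer)
    # 3) Use that embedding to fetch relevant context (C) from the vector database.
    return search_vector_db(e, k=k)  # placeholder for your DB's query call
```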
The hypothetical answer will likely contain hallucinated details, but this does not severely affect performance, thanks to the contriever model that does the embedding.
More specifically, this model is trained using contrastive learning, and it also functions as a near-lossless compressor whose task is to filter out the hallucinated details of the fake document.
Several studies have shown that HyDE improves the retrieval performance
compared to the traditional embedding model.
But this comes at the cost of increased latency and more LLM usage.
Full-model Fine-tuning vs. LoRA vs. RAG
All three techniques are used to augment the knowledge of an existing model with additional data.
1) Full fine-tuning
While this fine-tuning technique has been used successfully for a long time, problems arise when we apply it to much larger models like LLMs, primarily because of their sheer size and the compute and memory needed to update every parameter.
2) LoRA fine-tuning
The core idea is to decompose the weight matrices (some or all) of the original
model into low-rank matrices and train them instead.
For instance, in the graphic below, the bottom network represents the large
pre-trained model, and the top network represents the model with LoRA layers.
The idea is to train only the LoRA network and freeze the large model.
But the LoRA model has more neurons than the original model. How does that
help?
To understand this, you must make it clear that neurons don't have anything to
do with the memory of the network. They are just used to illustrate the
dimensionality transformation from one layer to another.
It is the weight matrices (or the connections between two layers) that take up
memory.
Looking at the above visual, it is pretty clear that the LoRA network has
relatively very few connections.
3) RAG
There are 7 steps, which are also marked in the above visual:
Step 1-2: Take additional data, and dump it in a vector database after embedding.
(This is only done once. If the data is evolving, just keep dumping the
embeddings into the vector database. There’s no need to repeat this again for the
entire data)
Step 3: Use the same embedding model to embed the user query.
Step 4-5: Find the nearest neighbors in the vector database to the embedded
query.
Step 6-7: Provide the original query and the retrieved documents (for more
context) to the LLM to get a response.
In fact, even its name describes what we do with this technique: retrieve relevant context, augment the prompt with it, and generate a response.
Of course, there are problems with RAG too. For instance:
138
[Link]
RAGs involve similarity matching between the query vector and the vectors of
the additional documents. However, questions are structurally very different from
answers.
So, it’s pretty clear that RAG has both pros and cons.
RAG vs REFRAG
Most of what we retrieve in RAG setups never actually helps the LLM.
That’s the exact problem Meta AI’s new method REFRAG solves.
It fundamentally rethinks retrieval and the diagram below explains how it works.
Essentially, instead of feeding the LLM every chunk and every token, REFRAG
compresses and filters context at a vector level:
This way, the model processes just what matters and ignores the rest.
Step 1-2) Encode the docs and store them in a vector database.
Step 3-5) Encode the full user query and find relevant chunks. Also, compute the
token-level embeddings for both the query (step 7) and matching chunks.
Step 6) Use a relevance policy (trained via RL) to select chunks to keep.
Step 8) Concatenate the token-level representations of the input query with the
token-level embedding of selected chunks and a compressed single-vector
representation of the rejected chunks.
That means you can process 16x more context at 30x the speed, with the same
accuracy.
RAG vs CAG
RAG changed how we build knowledge-grounded systems, but it still has a
weakness.
Every time a query comes in, the model often re-fetches the same context from
the vector DB, which can be expensive, redundant, and slow.
And you can take this one step further by fusing RAG and CAG, as depicted below:
● In a regular RAG setup, your query goes to the vector database, retrieves
relevant chunks, and feeds them to the LLM.
● But in RAG + CAG, you divide your knowledge into two layers.
○ The static, rarely changing data, like company policies or reference
guides, gets cached once inside the model’s KV memory.
This way, the model doesn’t have to reprocess the same static information every
time.
It uses it instantly from cache, and supplements it with whatever’s new via
retrieval to give faster inference.
You should only include stable, high-value knowledge that doesn’t change often.
If you cache everything, you’ll hit context limits, so separating “cold” (cacheable)
and “hot” (retrievable) data is what keeps this system reliable.
Many APIs like OpenAI and Anthropic already support prompt caching, so you
can start experimenting right away.
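The exact API shape varies by provider, but a rough sketch of the RAG + CAG split with Anthropic's prompt caching could look like this (the model name, file name, and policy text are placeholders; check the provider docs for current cache limits):

```python
# Sketch: cache the "cold" knowledge once, retrieve the "hot" chunks per query.
import anthropic

client = anthropic.Anthropic()

STATIC_KNOWLEDGE = open("company_policies.md").read()   # cold data, cached once

def answer(query: str, retrieved_chunks: list[str]) -> str:
    return client.messages.create(
        model="claude-sonnet-4-20250514",               # placeholder model name
        max_tokens=512,
        system=[{
            "type": "text",
            "text": STATIC_KNOWLEDGE,
            "cache_control": {"type": "ephemeral"},     # reuse this prefix across calls
        }],
        messages=[{
            "role": "user",
            # hot data: fresh chunks retrieved from the vector DB per query
            "content": f"Context:\n{chr(10).join(retrieved_chunks)}\n\nQuestion: {query}",
        }],
    ).content[0].text
```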
RAG (2020-2023):
Agentic RAG:
AI Memory:
144
[Link]
The agent can now “remember” things, like user preferences, past conversations
and important dates. All stored and retrievable for future interactions.
Instead of being frozen at training time, agents can now accumulate knowledge
from every interaction. They improve over time without retraining.
Memory is the bridge between static models and truly adaptive AI systems.
145
[Link]
Context Engineering
146
[Link]
Most AI agents (or LLM apps) fail not because the models are bad, but because
they lack the right context to succeed.
For instance, a RAG workflow is typically 80% retrieval and 20% generation.
Thus:
If your RAG isn't working, most likely, it's a context retrieval issue.
In the same way, LLMs aren't mind readers. They can only work with what you
give them.
147
[Link]
Dynamic information flow: Context comes from multiple sources: users, previous
interactions, external data, and tool calls. Your system needs to pull it all together
intelligently.
148
[Link]
Smart tool access: If your AI needs external information or actions, give it the
right tools. Format the outputs so they're maximally digestible.
Memory management:
Context engineering is becoming the new core skill since it addresses the real bottleneck, which is not model capability but how information is structured and delivered to the model.
149
[Link]
Agents today have evolved into much more than just chatbots.
The graphic below summarizes the 6 types of contexts an agent needs to function
properly, which are:
● Instructions
● Examples
● Knowledge
● Memory
● Tools
● Guardrails
This tells you that it's not enough to simply "prompt" the agents.
151
[Link]
● If the LLM is the CPU,
● then the context window is the RAM.
You're essentially programming the "RAM" with the perfect instructions for your AI.
How do we do it?
● Writing Context
● Selecting Context
● Compressing Context
● Isolating Context
1) Writing context
Writing context means saving it outside the context window to help an agent
perform a task.
153
[Link]
2) Selecting context
Selecting (reading) context means pulling it into the context window to help an agent perform a task. This context can come from:
● A tool
● Memory
● Knowledge base (docs, vector DB)
3) Compressing context
Compressing context means keeping only the tokens needed for a task.
154
[Link]
4) Isolating context
● Using multiple agents (or sub-agents), each with its own context
● Using a sandbox environment for code storage and execution
● And using a state object
So essentially, when you are building a context engineering workflow, you are
engineering a “context” pipeline so that the LLM gets to see the right
information, in the right format, at the right time.
155
[Link]
That is why production-grade LLM apps don’t just need instructions but rather
structure, which is the full ecosystem of context that defines their reasoning,
memory, and decision loops.
Here’s the mental model to use when you think about the types of contexts for
Agents:
156
[Link]
1) Instructions
2) Examples
3) Knowledge
4) Memory
You want your Agent to remember what it did in the past. This layer gives it
continuity across sessions.
5) Tools
This layer extends the Agent’s power beyond language and takes real-world
action.
157
[Link]
6) Tool Results
● This layer feeds the tool’s results back to the model to enable
self-correction, adaptation, and dynamic decision-making.
These are the exact six layers that help you build fully context-aware Agents.
This Agent will gather its context across 4 sources: Documents, Memory, Web
search, and Arxiv.
158
[Link]
Tech stack:
Let's go!
The extracted data can be directly embedded and stored in a vector DB without
further processing.
Now that we have RAG-ready chunks along with the metadata, it's time to store
them in a self-hosted Milvus vector database.
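The book's snippet is shown as an image; here's a minimal sketch of that step using pymilvus (it uses a local Milvus Lite file for simplicity, a small SentenceTransformer embedder, and made-up sample chunks; a self-hosted server would use a URI like http://localhost:19530):

```python
# Sketch: embed RAG-ready chunks, store them in Milvus, and run a similarity search.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")      # 384-dim embeddings
client = MilvusClient("rag_demo.db")                   # local Milvus Lite file
client.create_collection(collection_name="docs", dimension=384)

chunks = [  # made-up sample chunks with metadata
    {"text": "Section 2.1 covers quarterly revenue trends.", "source": "report.pdf"},
    {"text": "Q3 churn decreased by 4% quarter over quarter.", "source": "report.pdf"},
]
client.insert(
    collection_name="docs",
    data=[
        {"id": i, "vector": encoder.encode(c["text"]).tolist(), **c}
        for i, c in enumerate(chunks)
    ],
)

hits = client.search(
    collection_name="docs",
    data=[encoder.encode("What happened to churn in Q3?").tolist()],
    limit=2,
    output_fields=["text", "source"],
)
```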
Zep acts as the core memory layer of our workflow. It creates temporal
knowledge graphs to organize and retrieve context for each interaction.
We use it to store and retrieve context from chat history and user data.
162
[Link]
We use Firecrawl web search to fetch the latest news and developments related to
the user query.
163
[Link]
To further support research queries, we use the arXiv API to retrieve relevant
results from their data repository based on the user query.
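A small sketch of that lookup, assuming the community arxiv package (pip install arxiv):

```python
# Sketch: fetch relevant papers from arXiv for a research query.
import arxiv

def search_arxiv(query: str, max_results: int = 5) -> list[dict]:
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.Relevance,
    )
    return [
        {"title": r.title, "summary": r.summary, "url": r.entry_id}
        for r in arxiv.Client().results(search)
    ]

papers = search_arxiv("retrieval augmented generation")
```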
Now, we pass our combined context to the context evaluation agent that filters
out irrelevant context.
This filtered context is then passed to the synthesizer agent that generates the
final response.
165
[Link]
Based on the query, we notice that the RAG tool, powered by Tensorlake, was the
most relevant source for the LLM to generate a response.
The workflow explained above is one of the many blueprints. Your implementation can
vary.
Skills solve a practical issue in agent design: LLMs forget everything unless all instructions, examples, and edge cases are restated each time.
Skills package this information into small, self-contained units that Claude loads
only when they’re relevant.
This allows an agent to use hundreds of specialized workflows while keeping its
active context lightweight.
To make this scalable, Skills use a three-layer context management system that lets an agent use hundreds of skills without hitting context limits.
168
[Link]
Supporting files like scripts and templates aren’t pre-loaded but accessed directly
when in use, consuming zero tokens.
Now let’s zoom into the main ideas behind Skills, because understanding what
they are clarifies why this 3-layer system matters.
169
[Link]
Instead of re-explaining steps, examples, constraints, and edge cases every time,
you define the workflow once and reuse it forever.
Anatomy of a Skill
This separation lets Claude stay lightweight until a specific skill is activated.
Each solves a different layer of the agent problem, and skills serve as the
procedural knowledge base.
170
[Link]
Claude Desktop even includes a “Skill Creator” skill that helps generate the
structure for you.
...and their approach would be: “Embed the data, store in a vector DB and do
RAG.”
171
[Link]
What’s blocking the Chicago office project, and when’s our next meeting about it?
Answering this single query requires searching across sources like Linear (for
blockers), Calendar (for meetings), Gmail (for emails), and Slack (for discussions).
● Ingestion layer:
○ Connect to apps without auth headaches.
○ Process different data sources properly before embedding (email vs
code vs calendar).
○ Detect if a source is updated and refresh embeddings (ideally,
without a full refresh).
172
[Link]
● Retrieval layer:
○ Expand vague queries to infer what users actually want.
○ Direct queries to the correct data sources.
○ Layer multiple search strategies like semantic-based,
keyword-based, and graph-based.
○ Ensure retrieving only what users are authorized to see.
○ Weigh old vs. new retrieved info (recent data matters more, but old
context still counts).
● Generation layer:
○ Provide a citation-backed LLM response.
...but this is precisely how giants like Google (in Vertex AI Search), Microsoft (in
M365 products), AWS (in Amazon Q Business), etc., are solving it.
173
[Link]
For instance, to detect updates and initiate a re-sync, one might do timestamp
comparisons.
But this does not tell if the content actually changed (maybe only the permission
was updated), and you might still re-embed everything unnecessarily.
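One common workaround (a generic sketch, not tied to any particular connector) is to fingerprint the content itself, so that permission-only or metadata-only changes don't trigger a re-embed:

```python
# Sketch: hash document content to decide whether re-embedding is actually needed.
import hashlib

def content_fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembedding(doc_id: str, new_text: str, fingerprint_store: dict) -> bool:
    new_fp = content_fingerprint(new_text)
    if fingerprint_store.get(doc_id) == new_fp:
        return False          # timestamp/permissions may have changed, but the text didn't
    fingerprint_store[doc_id] = new_fp
    return True               # content changed: re-chunk and re-embed just this document
```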
You can see the full implementation on GitHub and try it yourself.
But the core insight applies regardless of the framework you use:
You need to build for continuous sync, intelligent chunking, and hybrid search
from day one.
175
[Link]
AI Agents
176
[Link]
What is an AI Agent?
Imagine you want to generate a report on the latest trends in AI research. If you
use a standard LLM, you might:
This iterative process takes time and effort, requiring you to act as the
decision-maker at every step.
177
[Link]
● A Filtering Agent scans the retrieved papers, identifying the most relevant
ones based on citation count, publication date, and keywords.
178
[Link]
Here, the AI agents not only execute the research process end-to-end but also
self-refine their outputs, ensuring the final report is comprehensive, up-to-date,
and well-structured - all without requiring human intervention at every step.
To formalize: AI Agents are autonomous systems that can reason, think, plan, figure out the relevant sources and extract information from them when needed, take actions, and even correct themselves if something goes wrong.
179
[Link]
It can reason, generate, and summarize, but only using what it already knows (i.e., its training data).
It’s smart, but static. It can’t access the web, call APIs, or fetch new facts on its
own.
180
[Link]
RAG makes the LLM aware of updated, relevant info without retraining.
Agent
An Agent adds autonomy to the mix.
An Agent uses an LLM, calls tools, makes decisions, and orchestrates workflows
just like a real assistant.
181
[Link]
To be effective, agents must be built with certain key principles in mind. There are six essential building blocks that make AI agents more reliable, intelligent, and useful in real-world applications:
1. Role-playing
2. Focus
3. Tools
4. Cooperation
5. Guardrails
6. Memory
Let’s explore each of these concepts and understand why they are fundamental to
building great AI agents.
1) Role-playing
One of the simplest ways to boost an agent’s performance is by giving it a clear,
specific role.
A generic AI assistant may give vague answers. But define it as a “Senior contract
lawyer,” and it responds with legal precision and context.
Why?
Because role assignment shapes the agent’s reasoning and retrieval process. The
more specific the role, the sharper and more relevant the output.
2) Focus/Tasks
Focus is key to reducing hallucinations and improving accuracy.
182
[Link]
Giving an agent too many tasks or too much data doesn’t help - it hurts.
For example, a marketing agent should stick to messaging, tone, and audience, not pricing or market analysis.
3) Tools
Agents get smarter when they can use the right tools.
CrewAI supports several tools that you can integrate with Agents, as depicted
below:
Below, let's look at how you can build one for your custom needs in the CrewAI
framework.
185
[Link]
Next, we define the input fields the tool expects using Pydantic.
Every tool class should have a _run method, which is executed whenever the Agent wants to make use of the tool.
In the above code, we fetch live exchange rates using an API request. We also
handle errors if the request fails or the currency code is invalid.
Now, we define an agent that uses the tool for real-time currency analysis and
attach our CurrencyConverterTool, allowing the agent to call it directly if needed:
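The book shows the actual snippets as images; below is a minimal sketch of what such a tool and agent might look like, assuming CrewAI's BaseTool interface. The exchange-rate URL, the response shape, and the API key variable are placeholders, not a real service.

```python
# Sketch of a custom CrewAI tool plus an agent that uses it (placeholder API details).
import os
import requests
from pydantic import BaseModel, Field
from crewai import Agent
from crewai.tools import BaseTool

class CurrencyConverterInput(BaseModel):
    amount: float = Field(..., description="Amount to convert")
    from_currency: str = Field(..., description="Source currency code, e.g. USD")
    to_currency: str = Field(..., description="Target currency code, e.g. EUR")

class CurrencyConverterTool(BaseTool):
    name: str = "Currency Converter"
    description: str = "Converts an amount between currencies using live exchange rates."
    args_schema: type[BaseModel] = CurrencyConverterInput

    def _run(self, amount: float, from_currency: str, to_currency: str) -> str:
        resp = requests.get(
            f"https://api.exchangerate.example/latest/{from_currency}",  # placeholder URL
            params={"apikey": os.environ["EXCHANGE_API_KEY"]},
            timeout=10,
        )
        if resp.status_code != 200:
            return "Failed to fetch exchange rates."
        rate = resp.json().get("rates", {}).get(to_currency)
        if rate is None:
            return f"Invalid currency code: {to_currency}"
        return f"{amount} {from_currency} = {amount * rate:.2f} {to_currency}"

currency_analyst = Agent(
    role="Currency Analyst",
    goal="Provide real-time currency conversions and insights.",
    backstory="A finance assistant that relies on live exchange-rate data.",
    tools=[CurrencyConverterTool()],
    verbose=True,
)
```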
Finally, we create a Crew, assign the agent to the task, and execute it.
Works as expected!
188
[Link]
Instead of embedding the tool directly in every Crew, we’ll expose it as a reusable
MCP tool - making it accessible across multiple agents and flows via a simple
server.
We’ll now write a lightweight [Link] script that exposes the currency converter
tool. We start with the standard imports:
189
[Link]
This function takes three inputs - amount, source currency, and target currency
and returns the converted result using the real-time exchange rate API.
To make the tool accessible, we need to run the MCP server. Add this at the end
of your script:
This starts the server and exposes your convert_currency tool at:
[Link]
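Putting those pieces together, a minimal sketch of such a server could look like the following, assuming the MCP Python SDK's FastMCP; the demo rates stand in for the real exchange-rate request shown earlier, and the transport/port details may differ from the book's setup.

```python
# server.py - minimal MCP server exposing the currency converter as a tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("currency-tools")

DEMO_RATES = {("USD", "EUR"): 0.92, ("EUR", "USD"): 1.09}   # stand-in for the live API call

@mcp.tool()
def convert_currency(amount: float, from_currency: str, to_currency: str) -> str:
    """Convert an amount between currencies (demo rates stand in for the live API)."""
    rate = DEMO_RATES.get((from_currency.upper(), to_currency.upper()))
    if rate is None:
        return f"No rate available for {from_currency} -> {to_currency}."
    return f"{amount} {from_currency} = {amount * rate:.2f} {to_currency}"

if __name__ == "__main__":
    # exposes the tool over HTTP so any MCP client can connect
    mcp.run(transport="streamable-http")
```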
Now any CrewAI agent can connect to it using MCPServerAdapter. Let’s now
consume this tool from within a CrewAI agent.
First, we import the required CrewAI classes. We’ll use Agent, Task, and Crew
from CrewAI, and MCPServerAdapter to connect to our tool server.
Next, we connect to the MCP tool server. Define the server parameters to
connect to your running tool (from [Link]).
This agent is assigned the convert_currency tool from the remote server. It can
now call the tool just like a locally defined one.
191
[Link]
Finally, we create the Crew, pass in the inputs and run it:
192
[Link]
4) Cooperation
Multi-agent systems work best when agents collaborate and exchange feedback.
Instead of one agent doing everything, a team of specialized agents can split tasks
and improve each other’s outputs.
5) Guardrails
Agents are powerful, but without constraints, they can go off track.
Guardrails ensure that agents stay on track and maintain quality standards.
For example, an AI-powered legal assistant should avoid outdated laws or false
claims - guardrails ensure that.
6) Memory
Finally, we have memory, which is one of the most critical components of AI
agents.
Without memory, an agent would start fresh every time, losing all context from
previous interactions. With memory, agents can improve over time, remember
past actions, and create more cohesive responses.
This memory isn’t just nice-to-have but it enables agents to learn from past
interactions without retraining the model.
This is especially powerful for continual learning: letting agents adapt to new
tasks without touching LLM weights.
196
[Link]
This means the Agent is mostly stateless, and it has no recall abilities.
197
[Link]
It doesn’t matter if the user told the Agent their name five seconds ago, it’s
forgotten. If the Agent helped troubleshoot an issue in the last session, it won’t
remember any of it now.
If you dive deeper, it follows a structured and intuitive architecture with several
types of Memory.
● Short-Term Memory
● Long-Term Memory
● Entity Memory
● Contextual Memory, and
● User Memory
Each serves a unique purpose in helping agents “remember” and utilize past
information.
To simulate memory, the system has to manage context explicitly: choosing what
to keep, what to discard, and what to retrieve before each new model call.
This is why memory is not a property of the model itself. It is a system design
problem.
The following visual depicts the 5 most popular design patterns employed in
building AI agents.
199
[Link]
1) Reflection pattern
The AI reviews its own work to spot mistakes and iterate until it produces the
final response.
2) Tool use pattern
This is helpful since the LLM is not solely reliant on its internal knowledge.
3) ReAct pattern
As shown above, the Agent is going through a series of thought activities before
producing a response.
More specifically, under the hood, many such frameworks use the ReAct
(Reasoning and Acting) pattern to let LLM think through problems and use tools
to act on the world.
This enhances an LLM agent’s ability to handle complex tasks and decisions by
combining chain-of-thought reasoning with external tool use.
203
[Link]
4) Planning pattern
● Subdividing tasks
● Outlining objectives
5) Multi-Agent pattern
● There are several agents, each with a specific role and task.
● Each agent can also access tools.
All agents work together to deliver the final outcome, while delegating tasks to
other agents if needed.
We'll manually simulate each round of the agent's reasoning, pausing, acting and
observing exactly as a ReAct loop is meant to function.
By running the logic cell-by-cell, we will gain full visibility and control over the
thinking process, allowing us to debug and validate the agent’s behavior at each
step.
To begin, we load the environment variables (like your LLM API key) and import completion from LiteLLM (install it first with pip install litellm), a lightweight wrapper to query LLMs like OpenAI models or local models via Ollama.
205
[Link]
● system (str): This is the system prompt that sets the personality and
behavioral constraints for the agent. If passed, it becomes the very first
message in the conversation just like in OpenAI Chat APIs.
● [Link]: This list acts as the conversation memory. Every interaction,
whether it’s user input or assistant output is appended to this list. This
history is crucial for LLMs to behave coherently across multiple turns.
● If system is provided, it's added to the message list using the special "role":
"system" identifier. This ensures that every completion that follows is
conditioned on the system instructions.
206
[Link]
This is the core interface you’ll use to interact with your agent.
● If a message is passed:
○ It gets appended as a "user" message to [Link].
○ This simulates the human asking a question or giving instructions.
● Then, [Link]() is called (which we will define shortly). This method
sends the full conversation history to the LLM.
● The model’s reply (stored in result) is then appended to [Link] as an
"assistant" role.
● Finally, the reply is returned to the caller.
This method handles the actual API call to your LLM provider - in this case, via
LiteLLM, using the "openai/gpt-4o" model.
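Since the book's code is shown as images, here is a minimal sketch of the class described above; the attribute and method names (messages, complete) are assumptions based on the surrounding text.

```python
# Sketch of the conversational agent class: system prompt, message history, LiteLLM call.
from litellm import completion

class Agent:
    def __init__(self, system: str = ""):
        self.system = system
        self.messages = []                       # conversation memory
        if self.system:
            self.messages.append({"role": "system", "content": system})

    def __call__(self, message: str = "") -> str:
        if message:
            self.messages.append({"role": "user", "content": message})
        result = self.complete()                 # send the full history to the LLM
        self.messages.append({"role": "assistant", "content": result})
        return result

    def complete(self) -> str:
        response = completion(model="openai/gpt-4o", messages=self.messages)
        return response.choices[0].message.content
```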
At this stage, if we ask it about the previous message, we get the correct output,
which shows the assistant has visibility on the previous context:
Now that our conversational class is set up, we come to the most interesting part: defining a ReAct-style prompt.
Before an LLM can behave like an agent, it needs clear instructions - not just on
what to answer, but how to go about answering. That’s exactly what this
system_prompt does, which is defined below:
This isn’t just a prompt. It’s a behavioral protocol - defining what structure the
agent should follow, how it should reason, and when it should stop.
This is the framing sentence. It tells the LLM not to rush toward an answer.
Instead, it should proceed step by step, following a defined pattern in a loop -
mirroring how a ReAct agent works.
3) "Action" to decide what action to take from the list of actions available to you.
Here, we give the LLM a reasoning template. These are the same primitives
found in all ReAct-style agents.
By splitting this into explicit parts, we avoid hallucinations and ensure the agent
works in a controlled loop.
This tells the agent: once it has all the required information - break the loop and
give the final answer. No need to keep reasoning indefinitely.
math:
lookup_population:
By using clear formatting and examples, we teach the model how to interface
with tools in a safe, predictable way.
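For reference, a condensed version of such a prompt might look like this; the book's exact wording differs, but the two tools (math and lookup_population) match the ones it describes.

```python
# A condensed ReAct-style system prompt (illustrative; not the book's exact text).
system_prompt = """
You run in a loop of Thought, Action, PAUSE, Observation.
1) Use "Thought" to reason about the question you were asked.
2) Use "Action" to decide what action to take from the list of actions available to you,
   then output PAUSE and stop.
3) "Observation" will be the result of running that action; it is provided to you.

Your available actions are:
math:
    e.g. Action: math: 1250000000 + 250000000
    Evaluates a basic arithmetic expression.
lookup_population:
    e.g. Action: lookup_population: India
    Returns the population of the given country.

Whenever you have the answer, stop the loop and output it as:
Answer: <your final answer>
""".strip()
```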
Iteration 1:
Iteration 2:
PAUSE
...
Iteration 9:
Observation: 250000000
Iteration 10:
This worked-out example gives the LLM a pattern to follow. Even more
importantly, it provides the developer (you) a way to intervene at each step -
injecting tool results or validating whether the flow is working correctly.
Whenever you have the answer, stop the loop and output it to the user.
Without this explicit stop signal, the LLM might continue indefinitely. You're
telling it: "When you have all the puzzle pieces, just say the answer and exit the
loop."
213
[Link]
Iteration 1:
214
[Link]
We, as a user, don't have any input to give at this stage so we just invoke the
complete() method again:
Iteration 2:
PAUSE
Yet again, we, as a user, don't have any input to give at this stage so we just
invoke the complete() method again:
Iteration 3:
215
[Link]
We still don't have any input to give at this stage so we just invoke the complete()
method again:
Iteration 4:
PAUSE
At this stage, it needs to get the tool output in the form of an observation. Here,
let's intervene and provide it with the observation:
Iteration 5:
216
[Link]
Iteration 6:
PAUSE
Iteration 7:
At this stage, it needs to get the tool output in the form of an observation. Here,
let's again intervene and provide it with the observation:
217
[Link]
Iteration 8:
Thought: I now have the populations of both India and Japan. I need to add them
together.
Iteration 9:
218
[Link]
Iteration 10:
PAUSE
Iteration 11:
Answer: The sum of the population of India and the population of Japan is
1,525,000,000.
Great!!
219
[Link]
In the next part, we’ll fully automate this - no manual calls required and build a
full controller that simulates this entire loop programmatically.
Now that we have understood how the above ReAct execution went, we can easily
automate that to remove our interventions.
220
[Link]
It takes:
221
[Link]
Next, inside this function, we initialize the Agent and available tools:
222
[Link]
previous_step helps track the last stage (e.g., Thought, Action) for better control
flow.
Next, we run the reasoning loop, which continues until the agent produces a final
answer. The answer is expected to be marked with Answer: based on our prompt
design:
Next, we catch the first PAUSE right after the Thought. Nothing else needs to be
done here - we just move to the next step.
For example, in Action: lookup_population: India, the regex pulls out the tool name (lookup_population) and its argument (India).
● If the tool name is valid, we call it like a Python function and capture the
result.
● We format the output into Observation: ... so the agent can use it in the
next step.
● If the tool doesn't exist, we ask the agent to retry.
Done!
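Putting it all together, a minimal sketch of the controller could look like this. It reuses the Agent class and system_prompt sketched earlier; the tool stubs and population figures are placeholders consistent with the worked example above.

```python
# Sketch of the automated ReAct loop: parse "Action: tool: input", run the tool, feed back
# the observation, and stop once an "Answer:" appears.
import re

POPULATIONS = {"India": 1_275_000_000, "Japan": 250_000_000}   # placeholder figures

def run_agent(question: str, max_iterations: int = 10) -> str | None:
    agent = Agent(system=system_prompt)
    tools = {
        "math": lambda expr: eval(expr),                       # demo only - never eval untrusted input
        "lookup_population": lambda c: POPULATIONS.get(c.strip(), "unknown"),
    }
    next_prompt = question

    for _ in range(max_iterations):
        result = agent(next_prompt)

        if "Answer:" in result:                                # final answer reached - stop the loop
            return result.split("Answer:", 1)[1].strip()

        match = re.search(r"Action:\s*(\w+):\s*(.+)", result)
        if match:
            tool_name, tool_input = match.group(1), match.group(2).strip()
            if tool_name in tools:
                observation = tools[tool_name](tool_input)
                next_prompt = f"Observation: {observation}"
            else:
                next_prompt = f"Observation: unknown tool '{tool_name}', please retry."
        else:
            next_prompt = ""                                   # just a Thought/PAUSE - keep looping
    return None
```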
You now have a fully working ReAct loop without needing any external
framework.
This approach works well for a tightly controlled setup like this demo. However,
it’s brittle:
● If the agent slightly deviates from the expected format (e.g., adds extra
whitespace, uses different casing, or mislabels an action), the regex could
fail to match.
230
[Link]
● We’re also assuming that the agent will never call a tool that doesn’t exist,
and that all tools will succeed silently.
To harden this setup, you could:
● Add more robust parsing (e.g., structured prompts with JSON outputs or function calling).
● Include tool validation, retries, and exception handling.
● Use guardrails or output formatters to constrain what the LLM is allowed
to emit.
But for the purpose of understanding how ReAct-style loops work under the
hood, this is a clean and minimal place to start. It gives you complete
transparency into what’s happening at each stage of the agent’s reasoning and
execution process.
This loop demonstrates how a simple agent can think, act, and observe, all
powered by your own Python + local LLM stack.
231
[Link]
1) Basic responder
The LLM is just a generic responder that receives an input and produces an
output. It has little control over the program flow.
2) Router pattern
The LLM makes basic decisions on which function or path it can take.
3) Tool calling
233
[Link]
A human defines a set of tools the LLM can access to complete a task.
LLM decides when to use them and also the arguments for execution.
4) Multi-agent pattern
A manager agent coordinates multiple sub-agents and decides the next steps
iteratively.
A human lays out the hierarchy between agents, their roles, tools, etc.
5) Autonomous pattern
234
[Link]
The most advanced pattern, wherein the LLM generates and executes new code independently, effectively acting as an independent AI developer.
235
[Link]
Agent: An autonomous AI entity that perceives, reasons, and acts toward a goal
(covered with full implementations here).
Observation: The data or input an agent receives from its environment at any
given moment.
LLMs: Large Language Models that enable agents to reason and generate natural
language.
Tools: APIs or utilities agents use to extend their functionality and capabilities to
interact with the world.
236
[Link]
Evaluation: The process of assessing how well an agent performs against its
intended goals (covered here with implementation).
Planning: Determining the sequence of steps an agent must take to reach its goal
(implemented from scratch in pure Python here).
ReAct: A framework where reasoning (thought) and acting (tool use) are
combined step by step (implemented from scratch in pure Python here).
237
[Link]
Few-shot learning: Teaching an agent new behaviors or tasks with just a few
examples.
Knowledge base: A structured repository of information that agents can use for
reasoning and decision-making (covered in detail here with code).
238
[Link]
MCP: A standardized way for agents to connect to external tools, APIs, and data
sources (learn how to build MCP servers, MCP clients, JSON-RPC, Sampling,
Security, Sandboxing in MCPs, and using
LangGraph/LlamaIndex/CrewAI/PydanticAI with MCP here).
239
[Link]
Router: A mechanism that directs tasks to the most appropriate agent or tool.
Each of these terms forms a key piece of the agentic AI ecosystem that AI
engineers should know.
4 Layers of Agentic AI
The following graphic depicts a layered overview of Agentic AI concepts,
depicting how the ecosystem is structured from the ground up (LLMs) to
higher-level orchestration (Agentic Infrastructure).
240
[Link]
● Tokenization & inference parameters: how text is broken into tokens and
processed by the model.
● Prompt engineering: designing inputs to get better outputs.
● LLM APIs: programmatic interfaces to interact with the model.
Agents wrap around LLMs to give them the ability to act autonomously.
Key responsibilities:
● Tool usage & function calling: connecting the LLM to external APIs/tools.
● Agent reasoning: reasoning methods like ReAct (reasoning + act) or
Chain-of-Thought.
● Task planning & decomposition: breaking a big task into smaller ones.
● Memory management: keeping track of history, context, and long-term
info.
Agents are the brains that make LLMs useful in real-world workflows.
Features:
241
[Link]
4) Agentic Infrastructure
The top layer ensures these systems are robust, scalable, and safe.
This includes:
Overall, Agentic AI, as a whole, involves a stacked architecture, where each outer
layer adds reliability, coordination, and governance over the inner layers.
7 Patterns in Multi-Agent Systems
Monolithic agents (single LLMs stuffed with system prompts) didn't remain viable for long.
242
[Link]
This visual explains the 7 core patterns of multi-agent orchestration, each suited
for specific workflows:
1) Parallel
Each agent tackles a different subtask, like data extraction, web retrieval, and
summarization, and their outputs merge into a single result.
2) Sequential
243
[Link]
Each agent adds value step-by-step, like one generates code, another reviews it,
and a third deploys it.
You’ll see this in workflow automation, ETL chains, and multi-step reasoning
pipelines.
3) Loop
Agents continuously refine their own outputs until a desired quality is reached.
Great for proofreading, report generation, or creative iteration, where the system
thinks again before finalizing results.
4) Router
Here, a controller agent routes tasks to the right specialist. For instance, user
queries about finance go to a FinAgent, legal queries to a LawAgent.
5) Aggregator
Many agents produce partial results that the main agent combines into one final
output. So each agent forms an opinion, and a central one aggregates them into a
consensus.
6) Network
There’s no clear hierarchy here, and agents just talk to each other freely, sharing
context dynamically.
7) Hierarchical
A top-level planner agent delegates subtasks to workers, tracks their progress,
and makes final calls. This is exactly like a manager and their team.
It’s easy to spin up 10 agents and call it a team. What’s hard is designing the
communication flow so that:
Agent2Agent(A2A) Protocol
Agentic applications require both A2A and MCP.
245
[Link]
While MCP lets agents connect to tools and data, A2A allows agents to connect with other agents and collaborate in teams.
Let's clearly understand what A2A is and how it can work with MCP.
If you don't know about MCP servers, we cover them in detail in the next section.
In a nutshell:
So using A2A, while two Agents might be talking to each other...they themselves
might be communicating to MCP servers.
246
[Link]
Using this, AI agents connecting to an MCP server can discover new agents to
collaborate with and connect via the A2A protocol.
Clients use this to find and communicate with the best agent for a task.
247
[Link]
● Secure collaboration
● Task and state management
● Capability discovery
● Agents from different frameworks working together (LlamaIndex, CrewAI,
etc.)
Agent-User Interaction Protocol (AG-UI)
In the realm of Agents:
248
[Link]
The problem
Today, you can build powerful multi-step agentic workflows using a toolkit like
LangGraph, CrewAI, Mastra, etc.
But the moment you try to bring that Agent into a real-world app, things fall
apart:
249
[Link]
● You want to display tool execution progress as it happens, pause for human
feedback, without blocking or losing context.
● You want to sync large, changing objects (like code or tables) without
re-sending everything to the UI.
● You want to let users interrupt, cancel, or reply mid-agent run, without
losing context.
Every Agent backend has its own mechanisms for tool calling, ReAct-style
planning, state diffs, and output formats.
So if you use LangGraph, the front-end will implement custom WebSocket logic,
messy JSON formats, and UI adapters specific to LangGraph.
It standardizes the interaction layer between backend agents and frontend UIs
(the green layer below).
250
[Link]
Technically speaking…
Each event has an explicit payload (like keys in a Python dictionary) like:
251
[Link]
And it comes with SDKs in TypeScript and Python to make this plug-and-play for
any stack, like shown below:
In the above image, the response from the Agent is not specific to any toolkit. It
is a standardized AG-UI response.
This means you need to write your backend logic once and hook it into AG-UI,
and everything just works:
This is the layer that will make your Agent apps feel like real software, not just
glorified chatbots.
252
[Link]
But finally, the industry is converging around three protocols that work together.
These are:
MCP (Model Context Protocol):
● The standard for how agents connect to tools, data, and workflows.
● Started by Anthropic, now adopted everywhere.
A2A (Agent-to-Agent):
AG-UI can handshake with both MCP and A2A, meaning tool outputs and
multi-agent collaboration can flow seamlessly to your user interface.
253
[Link]
Your frontend stays connected to the entire agent ecosystem through one unified
protocol layer.
It acts as the practical layer that lets you actually build with these protocols
without dealing with the complexity.
254
[Link]
It breaks down handshakes, misconceptions and real examples and shows exactly
how to start building.
Let’s learn how to use the Opik Agent Optimizer toolkit that lets you
automatically optimize prompts for LLM apps.
The idea is to start with an initial prompt and an evaluation dataset, and let an
LLM iteratively improve the prompt based on evaluations.
255
[Link]
To begin, install Opik and its optimizer package, and configure Opik:
Next, import all the required classes and functions from opik and
opik_optimizer:
256
[Link]
Moving on, configure the evaluation metric, which tells the optimizer how to
score the LLM’s outputs against the given label:
Next, define your base prompt, which is the initial instruction that the
MetaPromptOptimizer will try to enhance:
257
[Link]
Then it iterates through several different prompts (written by AI), evaluates them,
and prints the most optimal prompt. You can invoke [Link]() to see a
summary of the optimization, the best prompt found and its score:
The optimization results are also available in the Opik dashboard for further
analysis and visualization:
And that’s how you can use Opik Agent Optimizer to enhance the performance
and efficiency of your LLM apps.
Note: While we used GPT-4o, everything here can be executed 100% locally, since you can use any other LLM and Opik is fully open-source.
259
[Link]
1) Batch deployment
You can think of this as a scheduled automation.
260
[Link]
2) Stream deployment
Here, the Agent becomes part of a streaming data pipeline.
3) Real-Time deployment
This is where Agents act like live backend services.
261
[Link]
4) Edge deployment
The agent runs directly on user devices: mobile phones, smartwatches, and
laptops so no server round-trip is needed.
To summarize:
262
[Link]
Each pattern serves different needs. The key is matching your deployment
strategy to your specific use case, performance requirements, and user
expectations.
263
[Link]
MCP
264
[Link]
What is MCP?
Imagine you only know English. To get info from a person who only knows:
265
[Link]
It lets you (Agents) talk to other people (tools or other capabilities) through a
single interface.
If they need to access real-time information, they must use external tools and
resources on their own.
266
[Link]
If you had three AI applications and three external tools, you might end up
writing nine different integration modules (each AI x each tool) because there
was no common standard. This doesn’t scale.
Developers of AI apps were essentially reinventing the wheel each time, and tool
providers had to support multiple incompatible APIs to reach different AI
platforms.
The problem
Before MCP, the landscape of connecting AI to external data and actions looked
like a patchwork of one-off solutions.
Either you hard-coded logic for each tool or managed fragile, one-off prompt chains.
The diagram below illustrates this complexity: each AI (each “Model”) might
require unique code to connect to each external service (database, filesystem,
calculator, etc.), leading to spaghetti-like interconnections.
The solution
MCP tackles this by introducing a standard interface in the middle. Instead of M
× N direct integrations, we get M + N implementations: each of the M AI
applications implements the MCP client side once, and each of the N tools
implements an MCP server once.
Now everyone speaks the same “language”, so to speak, and a new pairing doesn’t
require custom code since they already understand each other via MCP.
268
[Link]
● On the left (pre-MCP), every model had to wire into every tool.
● On the right (with MCP), each model and tool connects to the MCP layer,
drastically simplifying connections. You can also relate this to the
translator example we discussed earlier.
However, the terminology is tailored to the AI context. There are three main
roles to understand: the Host, the Client, and the Server.
269
[Link]
Host
The Host is the user-facing AI application, the environment where the AI model
lives and interacts with the user.
Host is the one that initiates connections to the available MCP servers when the
system needs them. It captures the user's input, keeps the conversation history,
and displays the model’s replies.
270
[Link]
Client
The MCP Client is a component within the Host that handles the low-level
communication with an MCP Server.
Think of the Client as the adapter or messenger. While the Host decides what to
do, the Client knows how to speak MCP to actually carry out those instructions
with the server.
271
[Link]
Server
The MCP Server is the external program or service that actually provides the
capabilities (tools, data, etc.) to the application.
Servers can run locally on the same machine as the Host or remotely on some
cloud service since MCP is designed to support both scenarios seamlessly. The
key is that the Server advertises what it can do in a standard format (so the client
can query and understand available tools) and will execute requests coming from
the client, then return results.
272
[Link]
Tools
Tools are what they sound like: functions that do something on behalf of the AI
model. These are typically operations that can have effects or require
computation beyond the AI’s own capabilities.
Importantly, tools are usually triggered by the AI model’s choice, which means
the LLM (via the host) decides to call a tool when it determines it needs that
functionality.
Suppose we have a simple tool for weather. In an MCP server’s code, it might
look like:
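The book's snippet is shown as an image; a minimal sketch assuming the MCP Python SDK's FastMCP (with hard-coded data standing in for a real weather API) is:

```python
# Sketch of a get_weather tool exposed by an MCP server.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-server")

@mcp.tool()
def get_weather(location: str) -> dict:
    """Return current weather conditions for a location."""
    # A real server would call a weather API here.
    return {"location": location, "temperature_c": 18, "conditions": "Partly cloudy"}
```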
This Python function, registered with @[Link](), can be invoked by the AI via
MCP.
When the AI calls tools/call with name "get_weather" and {"location": "San
Francisco"} as arguments, the server will execute get_weather("San Francisco")
and return the dictionary result.
The client will get that JSON result and make it available to the AI. Notice the
tool returns structured data (temperature, conditions), and the AI can then use or
verbalize (generate a response) that info.
Since tools can do things like file I/O or network calls, an MCP implementation
often requires that the user permit a tool call.
For example, Claude’s client might pop up “The AI wants to use the ‘get_weather’
tool, allow yes/no?” the first time, to avoid abuse. This ensures the human stays in
control of powerful actions.
Tools are analogous to “functions” in classic function calling, but under MCP,
they are used in a more flexible, dynamic context. They are model-controlled but
developer/governance-approved in execution.
274
[Link]
Resources
Resources provide read-only data to the AI model.
These are like databases or knowledge bases that the AI can query to get
information, but not modify.
Unlike tools, resources typically do not involve heavy computation or side effects,
since they are often just information lookups.
Another key difference is that resources are usually accessed under the host
application’s control (not spontaneously by the model). In practice, this might
mean the Host knows when to fetch a certain context for the model.
For instance, if a user says, “Use the company handbook to answer my question,”
the Host might call a resource that retrieves relevant handbook sections and
feeds them to the model.
Resources could include a local file’s contents, a snippet from a knowledge base
or documentation, a database query result (read-only), or any static data like
configuration info.
275
[Link]
The AI (or Host) could ask the server for [Link] with a URI like file://home/user/[Link], and the server would call read_file("/home/user/[Link]") and return the text.
Notice that resources are usually identified by some identifier (like a URI or
name) rather than being free-form functions.
They are also often application-controlled, meaning the app decides when to
retrieve them (to avoid the model just reading everything arbitrarily).
From a safety standpoint, since resources are read-only, they are less dangerous,
but still, one must consider privacy and permissions (the AI shouldn’t read files
it’s not supposed to).
The Host can regulate which resource URIs it allows the AI to access, or the
server might restrict access to certain data.
In summary, Resources give the AI knowledge without handing over the keys to
change anything.
They’re the MCP equivalent of giving the model reference material when needed,
which acts like a smarter, on-demand retrieval system integrated through the
protocol.
Prompts
Prompts in the MCP context are a special concept: they are predefined prompt
templates or conversation flows that can be injected to guide the AI’s behavior.
Think of recurring patterns: e.g., a prompt that sets up the system role as “You
are a code reviewer,” and the user’s code is inserted for analysis.
Rather than hardcoding that in the host application, the MCP server can supply
it.
The model doesn’t spontaneously decide to use prompts the way it does tools.
Rather, the prompt sets the stage before the model starts generating. In that
sense, prompts are often fetched at the beginning of an interaction or when the
user chooses a specific “mode”.
Suppose we have a prompt template for code review. The MCP server might have:
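The book's snippet is an image; a sketch of the same idea using the MCP Python SDK's prompt helpers (which play the role of the OpenAI-format message objects described below) could look like this:

```python
# Sketch of a code-review prompt capability on an MCP server.
from mcp.server.fastmcp import FastMCP
from mcp.server.fastmcp.prompts import base

mcp = FastMCP("prompt-server")

@mcp.prompt()
def code_review(code: str) -> list[base.Message]:
    """Set up a code-review conversation for the given code."""
    return [
        base.UserMessage("You are a meticulous senior code reviewer. "
                         "Review the following code for bugs and style issues:"),
        base.UserMessage(code),
        base.AssistantMessage("Understood. Here is my review of the code:"),
    ]
```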
This prompt function returns a list of message objects (in OpenAI format) that
set up a code review scenario.
When the host invokes this prompt, it gets those messages and can insert the
actual code to be reviewed into the user content.
Then it provides these messages to the model before the model’s own answer.
Essentially, the server is helping to structure the conversation.
While we have personally not seen much applicability of this yet, common use
cases for prompt capabilities include things like “brainstorming guide,”
“step-by-step problem solver template,” or domain-specific system roles.
By having them on the server, they can be updated or improved without changing
the client app, and different servers can offer different specialized prompts.
An important point to note here is that prompts, as a capability, blur the line
between data and instructions.
In a way, MCP prompts are similar to how ChatGPT plugins can suggest how to
format a query, but here it’s standardized and discoverable via the protocol.
Key differences:
● MCPs aim to standardize how AI agents interact with tools, while APIs
can vary greatly in their implementation.
● MCPs are designed to manage dynamic, evolving context, including data
resources, executable tools, and prompts for workflows.
● MCPs are particularly well-suited for AI agents that need to adapt to new
capabilities and tools without pre-programming.
278
[Link]
● If your API initially requires two parameters (e.g., location and date for a
weather service), users integrate their applications to send requests with
those exact parameters.
● Later, if you decide to add a third required parameter (e.g., unit for
temperature units like Celsius or Fahrenheit), the API’s contract changes.
● This means all users of your API must update their code to include the new
parameter. If they don’t update, their requests might fail, return errors, or
provide incomplete results.
279
[Link]
● The server responds with details about its available tools, resources,
prompts, and parameters. For example, if your weather API initially
supports location and date, the server communicates these as part of its
capabilities.
● If you later add a unit parameter, the MCP server can dynamically update
its capability description during the next exchange. The client doesn’t need
to hardcode or predefine the parameters since it simply queries the server’s
current capabilities and adapts accordingly.
● This way, the client can then adjust its behavior on-the-fly, using the
updated capabilities (e.g., including unit in its requests) without needing to
rewrite or redeploy code.
We’ll understand this topic better in Part 3 of this course, when we build a
determines which function to invoke by analyzing the user's prompt. The process
involves:
MCP offers a standardized protocol for integrating LLMs with external tools and
data sources. It decouples tool implementation from their consumption, allowing
for more modular and scalable AI systems.
But unlike simple tool calling, MCP creates a two-way communication between
your AI apps and servers.
Here’s a breakdown of the 6 core MCP primitives that make MCPs powerful
(explained with examples):
282
[Link]
Let’s start with the client, the entity that facilitates conversation between the
LLM app and the server, offering 3 key capabilities:
1) Sampling
The client side always has an LLM.
Thus, if needed, the server can ask the client’s LLM to generate some
completions, while the client still controls permissions and safety.
For example, an MCP server with travel tools can ask the LLM to pick the
optimal flight from a list.
2) Roots
This allows the client to define what files the server can access, making
interactions secured, sandboxed, and scoped.
For example, a server for booking travel may be given access to a specific
directory, from which it can read a user’s calendar.
3) Elicitations
Elicitations let the server request additional, structured input from the user mid-interaction.
For example, a server booking travel may ask for the user's preferences on airplane seats, room type, or their contact number to finalize a booking.
4) Tools
Controlled by the model, tools are functions that do things: write to DBs, trigger
logic, send emails, etc.
For example:
● search flights
● send messages
● create calendar events
5) Resources
Controlled by the app, resources are the passive, read-only data like files,
calendars, knowledge bases, etc.
Examples:
● retrieve docs
● read calendars
● access knowledge bases
6) Prompts
Controlled by the user, prompts are pre-built instruction templates that guide
how the LLM uses tools/resources.
Examples:
● plan a vacation
● draft an email
284
[Link]
● summarize my meetings
It shows that MCP is not just another tool calling standard. Instead, it creates a
two-way communication between your AI apps and servers to build powerful AI
workflows.
With the core ideas of MCP in place, we’re now ready to see how this translates
into real development workflows.
MCP defines the structure, but developers still need a straightforward way to
build agents, configure clients and expose capabilities through servers.
mcp-use supports the full MCP ecosystem, including agents, clients, and servers, to help build end-to-end workflows suitable for both experimentation and production environments.
285
[Link]
It sets up the MCP client, connects to one or more servers, discovers available
tools, and exposes them to the LLM in a structured way.
This allows the agent to decide when to call a tool, while the framework manages
capability loading and communication under the hood.
From here, the LLM can request tool calls naturally during reasoning, while
mcp-use handles execution and streaming.
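As a rough sketch of that setup (assuming mcp-use's Python API and a LangChain chat model; the server command and arguments are placeholders):

```python
# Sketch: connect an MCP client to a local server and let an agent call its tools.
import asyncio
from langchain_openai import ChatOpenAI
from mcp_use import MCPAgent, MCPClient

config = {
    "mcpServers": {
        "currency": {"command": "python", "args": ["server.py"]},   # placeholder server
    }
}

async def main():
    client = MCPClient.from_dict(config)          # connects and discovers available tools
    agent = MCPAgent(llm=ChatOpenAI(model="gpt-4o"), client=client, max_steps=20)
    result = await agent.run("Convert 100 USD to EUR")
    print(result)

asyncio.run(main())
```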
These issues arise from typical LLM behavior when exposed to large toolsets.
287
[Link]
Instead of exposing every tool from every server at once - something that often
leads to tool-name hallucinations, confusion between similar tools and degraded
reasoning, the Server Manager keeps the agent’s active toolset intentionally
narrow and context-driven.
288
[Link]
The client is the component responsible for all communication between the
agent and any MCP server.
It manages the transport layer and ensures that capabilities (tools, resources,
prompts, etc.) remain synchronized throughout an interaction.
In short, the client keeps the agent and server aligned, ensuring both sides speak
the same protocol.
mcp-use provides simple ways to create an MCP client depending on how you
prefer to manage server settings.
289
[Link]
This approach is ideal when working with multiple environments or when you
prefer keeping server settings version-controlled.
This mirrors the same structure as configuration files but allows programmatic
customization inside Python.
Although the agent manages the client internally, mcp-use still allows you to
inspect the client directly when needed.
290
[Link]
MCP Server
Agents and clients decide how to act, but MCP servers are what make those
actions possible.
Once a server exposes these capabilities, any standard MCP client can discover
and use them.
The following sections walk through how to create, run, test and deploy MCP
servers using mcp-use.
291
[Link]
mcp-use includes a project generator that lets you create a new server project
with these commands:
● A TypeScript entrypoint
● Example tools, prompts and resources
● Configuration files
● Built-in support for the MCP Inspector
Earlier we discussed the six core MCP primitives that power the protocol.
Using mcp-use, an MCP server can expose them - tools, resources, prompts,
sampling, elicitation and notifications through small, declarative definitions.
This keeps the server’s surface area simple to describe and easy for agents to
discover automatically during capability negotiation.
Tools
Tools represent actions the agent can perform. They are the main way an MCP
server exposes functionality - anything from API calls to calculations to workflow
steps.
292
[Link]
Each tool includes a name, input parameters and a callback that returns content
to the client. Here is a minimal tool example:
This example defines a complete MCP server with a single tool: get_weather,
which returns a basic weather response.
Any MCP-compatible client can automatically discover and invoke this tool
during capability negotiation.
Resources
293
[Link]
Prompts
Prompts define reusable instruction templates that agents can invoke to generate
structured messages.
They let your server provide consistent, well-formed prompts for common tasks.
Sampling
Sampling lets your server ask the client’s model to generate text mid-workflow.
It’s useful when the server needs the model to decide, summarize or choose
between options.
294
[Link]
Elicitation
Elicitation requests structured input from the user, such as selecting an option or
entering text.
Notifications
295
[Link]
Together, these primitives cover the full MCP surface: operations, structured
context retrieval, user interactions and asynchronous messaging.
3) MCP Inspector
When you start your server in development mode (npm run dev), mcp-use
automatically launches the MCP Inspector, a web-based dashboard for
inspecting and debugging MCP servers.
296
[Link]
It’s the fastest way to verify your server’s capabilities before connecting it to an
agent.
4) MCP-UI
MCP-UI widgets let your server surface status, previews, or quick visual outputs without requiring a full application.
297
[Link]
They’re optional, but they significantly enhance the developer experience when
building or testing MCP servers.
5) Apps SDK
The Apps SDK is OpenAI’s framework for building interactive UI widgets that
appear directly inside ChatGPT or other Apps-SDK-compatible clients.
These widgets are written in React and allow tools to return interfaces such as
cards, previews or small apps rather than plain text.
MCP servers can expose these widgets as capabilities, enabling richer workflows
with minimal overhead.
298
[Link]
It defines metadata (so the server can expose the widget as a capability) and a
React component (which the client renders inside ChatGPT).
6) Tunneling
299
[Link]
mcp-use provides a tunneling command that exposes your local MCP server
through a temporary, secure public endpoint:
If you're using the built-in development runner, you can enable tunneling there as well.
This automatically spins up your local server and creates a tunnel for it.
Other tunneling tools such as ngrok can also be used, provided the public URL
maps to the /mcp endpoint:
Tunneling enables:
MCP servers can run anywhere [Link] runs. Once deployed, any MCP client can
reach your server and automatically discover its tools, resources, prompts, and
other capabilities through standard MCP negotiation.
301
[Link]
● Local machines
● Cloud VMs
● Docker containers
● Serverless platforms
● Edge runtimes
● mcp-use Cloud
Regardless of the platform, the flow is the same: build your project and expose
the /mcp endpoint publicly.
302
[Link]
If your project is on GitHub, the CLI automatically detects it and can deploy
directly from your latest commit.
Whether you update a tool, change a prompt, or add UI widgets, agents pick up
the changes during their next capability negotiation.
303
[Link]
LLM Optimization
304
[Link]
As a result, models grow larger and more complex because bigger models often
perform better during training.
These constraints determine whether a model can reliably serve real users. Even
a highly accurate system may be unusable if inference is slow or memory-hungry.
305
[Link]
Model compression techniques help address this need. They reduce the model’s
size and computational cost while preserving most of its performance, making
the model practical for real-world deployment.
Model Compression
As the name suggests, model compression is a set of techniques used to reduce
the size and computational complexity of a model while preserving or even
improving its performance.
306
[Link]
They aim to make the model smaller, hence the name “model compression.” The four most common techniques are:
● Knowledge Distillation
● Pruning
● Low-rank Factorization
● Quantization
1) Knowledge Distillation
307
[Link]
Knowledge distillation is one of the simplest and most effective ways to shrink a
model without sacrificing much performance.
● Train the large model as you typically would. This is called the “teacher”
model.
● Train a smaller model, which is intended to mimic the behavior of the
larger model. This is also called the “student” model.
This allows the student model to achieve comparable performance with fewer
parameters and reduced computational complexity.
309
[Link]
But with consistent training, we can create a smaller model that is almost as good
as the larger one.
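A small sketch of the usual training objective (not from the book): the student matches the teacher's softened output distribution while still learning from the ground-truth labels.

```python
# Sketch of a distillation loss in PyTorch.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between softened distributions (temperature T), scaled by T^2
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside the training loop, the teacher runs in eval mode with no gradients:
#   with torch.no_grad(): teacher_logits = teacher(batch)
#   loss = distillation_loss(student(batch), teacher_logits, labels)
```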
2) Pruning
Pruning is commonly used in tree-based models, where it involves removing
branches (or nodes) to simplify the model.
Thus, in the case of decision trees, the core idea is to iteratively drop sub-trees,
which, after removal, leads to:
310
[Link]
In the image above, both sub-trees result in the same increase in cost. However,
it makes more sense to remove the sub-tree with more nodes to reduce
computational complexity.
As you may have guessed, pruning in neural networks involves identifying and
eliminating specific connections or neurons that contribute minimally to the
model’s overall performance.
What’s more, it isn't easy to quantify the contribution of a specific layer towards
the final output.
311
[Link]
With pruning, the goal is to create a more compact neural network while
retaining as much predictive power as possible.
Neuron pruning:
Weight pruning:
312
[Link]
It is evident from the above ideas that by removing unimportant weights (or
nodes) from a network, several improvements can be expected:
● Better generalization
● Improved speed of inference
● Reduced model size
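For a concrete, minimal example of weight pruning (a sketch using PyTorch's built-in pruning utilities, not the book's code), the lowest-magnitude half of each Linear layer's weights can be zeroed out like this:

```python
# Sketch: magnitude-based (L1) unstructured weight pruning in PyTorch.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)  # mask low-magnitude weights
        prune.remove(module, "weight")                            # make the pruning permanent

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Layer 0 sparsity: {sparsity:.0%}")   # roughly 50% of the weights are now zero
```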
3) Low-rank Factorization
313
[Link]
The idea will become more clear if we understand these individual terms:
Low-rank:
Factorization:
Thus, Low-rank Factorization means breaking down a given weight matrix into
the product of two or more matrices of lower dimensions.
314
[Link]
There are many different matrix factorization methods available, such as Singular
Value Decomposition (SVD), Non-negative Matrix Factorization (NMF), or
Truncated SVD.
In matrix factorization, you'll typically have to choose the rank k for the
lower-rank approximation.
It determines the number of singular values (for SVD) or the number of factors
(for factorization methods).
315
[Link]
The choice of rank k is directly linked to the trade-off between model size
reduction and preservation of information.
Once you've obtained the lower-rank matrices, you can use them to transform the
input instead of the original weight matrices.
The benefit of doing this is that it reduces the computational complexity of the neural network, as well as its size, while retaining the important features learned during training.
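For intuition, here is a small sketch (not from the book) of truncating an SVD of a weight matrix: a 512 x 512 matrix is replaced by two rank-k factors, cutting its parameters from 262,144 to 2 x 512 x k.

```python
# Sketch: low-rank approximation of a weight matrix via truncated SVD.
import torch

W = torch.randn(512, 512)                      # original weight matrix
k = 32                                         # chosen rank (size vs. accuracy trade-off)

U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :k] * S[:k]                           # (512, k)
B = Vh[:k, :]                                  # (k, 512)

W_approx = A @ B                               # low-rank approximation of W
print((W - W_approx).norm() / W.norm())        # relative approximation error

# At inference, x @ W.T is replaced by (x @ B.T) @ A.T: two skinny matmuls instead of one big one.
```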
4) Quantization
Typically, the parameters of a neural network (layer weights) are represented
using 32-bit floating-point numbers. This is useful because it offers a high level
of precision.
Also, since parameters are typically not constrained to any specific range of values, deep learning frameworks default to this relatively high-precision 32-bit type for them.
But a wider data type also means consuming more memory.
For instance, consider a model with over a million parameters, each represented as a 32-bit float: at 4 bytes per parameter, that is already more than 4 MB of weights. For an LLM with 70 billion parameters, the same precision requires about 280 GB, which is why quantization stores parameters in lower-precision formats such as 16-bit, 8-bit, or even 4-bit numbers.
While reducing the bit-width of parameters makes the model smaller, it also
leads to a loss of precision.
This means the model's predictions become slightly more approximate than those of the original, full-precision model.
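A minimal sketch of symmetric 8-bit quantization of a weight tensor is shown below; real schemes (per-channel scales, GPTQ, AWQ, etc.) are more sophisticated, but the memory-vs-precision trade-off is the same.

import torch

def quantize_int8(w):
    # Store int8 values plus a single floating-point scale per tensor.
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)                       # 4 bytes/param -> 1 byte/param
error = (dequantize(q, scale) - w).abs().mean()   # the precision we gave up
print(f"mean absolute rounding error: {error:.6f}")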
Model compression makes models smaller and faster, but LLMs also introduce
challenges that show up only during inference.
Continuous batching
Traditional models, like CNNs, have a fixed-size image input and a fixed-length
output (like a label). This makes batching easy.
LLMs, however, deal with variable-length inputs (the prompt) and generate
variable-length outputs.
So if you batch some requests, they will all finish at different times, and the GPU would have to wait for the longest request to finish before it can process new requests. This leads to idle time on the GPU:
Instead of waiting for the entire batch to finish, the system monitors all sequences and swaps completed ones (those that emit the <EOS> token) with new queries from the waiting queue:
Prefill-decode disaggregation
LLM inference is a two-stage process with fundamentally different resource
requirements.
● The prefill stage processes all the input prompt tokens at once, so this is
compute-heavy.
● The decode stage autoregressively generates the output, and this demands
low latency.
Running both stages on the same GPU means the compute-heavy prefill requests will interfere with the latency-sensitive decode requests, which is why large deployments often disaggregate the two stages onto separate GPU pools.
Prefix caching
The KV cache (the key and value vectors computed during attention, covered in detail later in this chapter) grows linearly with the total length of the conversation history.
But in many workflows, inputs like the system prompt are shared across many requests. So we can avoid recomputing them by reusing these KV vectors across all chats:
That said, the KV cache takes up significant memory since it's stored in contiguous blocks. This wastes GPU memory and leads to memory fragmentation:
Prefix-aware routing
To scale standard ML models, you can simply replicate the model across multiple
servers/GPUs and use straightforward load-balancing schemes like Round Robin
or routing to the least-busy server.
But LLMs heavily rely on caching (like the shared KV prefix discussed above), so
requests are no longer independent.
If a new query comes in with a shared prefix that has already been cached on
Replica A, but the router sends it to Replica B (which is less busy), Replica B has
to recompute the entire prefix’s KV cache.
Generally, prefix-aware routing requires the router to maintain a map or table (or
use a predictive algorithm) that tracks which KV prefixes are currently cached on
which GPU replicas.
When a new query arrives, the router sends the query to the replica that has the
relevant prefix already cached.
Expert parallelism
For Mixture of Experts (MoE) models, the model itself is sharded across GPUs: each GPU holds the full weights of only some experts, not all. This means that each GPU processes only the tokens assigned to the experts stored on that GPU.
Now, when a query arrives, the gating network in the MoE layer dynamically
decides which GPU it should go to, depending on which experts are activated.
This is a complex internal routing problem that cannot be treated like a simple
replicated model. You need a sophisticated inference engine to manage the
dynamic flow of computation across the sharded expert pool.
KV Caching in LLMs
KV caching is a popular technique to speed up LLM inference.
To get some perspective, look at the inference speed difference in the image
below:
The key observation is that to generate a new token, we only need the hidden state of the most recent token; none of the other hidden states are required.
Next, let's see how the last hidden state is computed within the transformer layer
from the attention mechanism.
During attention, we first compute the product of the query and key matrices, and the last row of this product involves the last token's query vector and all the key vectors:
Likewise, the last row of the final attention result involves the last query vector and all the key and value vectors. Check this visual to understand it better:
The above insight suggests that to generate a new token, every attention operation in the network only needs the query vector of the newest token, along with the key and value (KV) vectors of all the tokens processed so far.
As we generate new tokens, the KV vectors used for ALL previous tokens do not
change.
Thus, we just need to generate a KV vector for the token generated one step
before.
The rest of the KV vectors can be retrieved from a cache to save compute and
time.
To generate a token:
● Generate QKV vector for the token generated one step before.
● Get all other KV vectors from the cache.
● Compute attention.
● Store the newly generated KV values in the cache.
In fact, this is why ChatGPT takes noticeably longer to generate the first token than the subsequent ones. During that little pause, the KV cache of the entire prompt is being computed.
To estimate how much memory this cache needs, consider a model with:
● total layers = 80
● hidden size = 8k
● max output size = 4k
A rough calculation is shown below.
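Assuming 16-bit precision and standard multi-head attention (i.e., no grouped-query attention), a back-of-the-envelope estimate looks like this:

layers     = 80        # transformer layers
hidden     = 8192      # hidden size (8k)
seq_len    = 4096      # max output size (4k)
bytes_fp16 = 2         # 16-bit precision

# Each layer stores one key vector and one value vector per token.
kv_per_token = 2 * layers * hidden * bytes_fp16      # ~2.6 MB per token
kv_per_seq   = kv_per_token * seq_len                # ~10.7 GB per sequence

That is over 10 GB of GPU memory for a single 4k-token conversation, which is why KV-cache management (paging, prefix sharing, eviction) matters so much when serving LLMs.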
LLM Evaluation
Optimizing a model makes it faster and cheaper, but it doesn't tell you whether the system is actually good.
In practice, this means measuring how well the model reasons, follows
instructions, uses tools, stays consistent across turns, and remains safe under
adversarial pressure.
G-Eval
If you are building with LLMs, you absolutely need to evaluate them.
Let’s understand how to create any evaluation metric for your LLM apps in pure
English with Opik - an open-source, production-ready end-to-end LLM
evaluation platform.
Let’s begin!
The problem
Standard metrics are usually not that helpful since LLMs can produce varying
outputs while conveying the same message.
The solution
G-Eval is a task-agnostic LLM as a Judge metric in Opik that solves this.
It allows you to specify a set of criteria for your metric (in English), after which it
will use a Chain of Thought prompting technique to create evaluation steps and
return a score.
First, import the GEval class and define a metric in natural language:
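A minimal sketch is shown below, assuming Opik's GEval metric with task_introduction and evaluation_criteria arguments; the criteria text itself is illustrative.

from opik.evaluation.metrics import GEval

metric = GEval(
    task_introduction=(
        "You are an expert judge evaluating whether an AI-generated answer "
        "is faithful to the provided context."
    ),
    evaluation_criteria=(
        "The OUTPUT must not contradict the CONTEXT and must not introduce "
        "information that is absent from the CONTEXT."
    ),
)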
Done!
Next, invoke the score method to generate a score and a reason for that score.
Below, we have a related context and output, which leads to a high score:
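For instance, something along these lines (the context and output are illustrative, and the returned object is assumed to expose a value and a reason):

result = metric.score(
    output="""
    CONTEXT: The Eiffel Tower is located in Paris and was completed in 1889.
    OUTPUT: The Eiffel Tower, finished in 1889, stands in Paris.
    """
)
print(result.value, result.reason)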
However, with unrelated context and output, we get a low score as expected:
Under the hood, G-Eval first uses the task introduction and evaluation criteria to generate a list of evaluation steps via chain-of-thought prompting.
Next, these evaluation steps are combined with the task to return a single score.
That said, you can easily self-host Opik, so your data stays where you want.
G-Eval works well for single-output scoring, but it still treats each output in
isolation.
LLM Arena-as-a-Judge
Typical LLM-powered evals can easily mislead you to believe that one model is
better than the other, primarily due to the way they are set up.
For instance, techniques like G-Eval assume you’re scoring one output at a time
in isolation, without understanding the alternative.
So when prompt A scores 0.72 and prompt B scores 0.74, you still don’t know
which one’s actually better.
This is unlike scoring, say, classical ML models, where metrics like accuracy, F1,
or RMSE give a clear and objective measure of performance.
There’s no room for subjectivity, and the results are grounded in hard numbers,
not opinions.
LLM Arena-as-a-Judge is a new technique that addresses this issue with LLM
evals.
In a gist, instead of assigning scores, you just run A vs. B comparisons and pick
the better output.
Just like G-Eval, you can define what "better" means (e.g., more helpful, more concise, more polite), and use any LLM to act as the judge.
Note: LLM Arena-as-a-Judge can either be referenceless (as shown in the snippet below) or reference-based. If needed, you can provide an expected output for the given input test case and reference it in the evaluation parameters.
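Since the snippet itself isn't reproduced here, below is a framework-agnostic sketch of the idea: ask a judge LLM to pick the better of two candidate outputs against criteria you define. The judge model and prompt wording are illustrative.

from openai import OpenAI

client = OpenAI()

def arena_judge(question, output_a, output_b, criteria="more helpful and more concise"):
    # Pairwise comparison: returns "A" or "B" instead of an absolute score.
    prompt = (
        "You are judging two answers to the same question.\n"
        f"Question: {question}\n\n"
        f"Answer A: {output_a}\n\n"
        f"Answer B: {output_b}\n\n"
        f"Pick the answer that is {criteria}. Reply with exactly one letter: A or B."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

To reduce position bias, it is common to run each comparison twice with the answer order swapped and keep only consistent verdicts.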
Many real LLM apps are multi-turn conversations that operate under domain regulations. This means the AI's behavior must be consistent, compliant, and context-aware across turns, not just accurate in one-shot outputs.
The code snippet below depicts how to use DeepEval (open-source) to run
multi-turn, regulation-aware evaluations in just a few lines:
Define your multi-turn test case: Use ConversationalTestCase and pass in a list of
turns, just like OpenAI’s message format:
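A minimal sketch, assuming DeepEval's ConversationalTestCase and Turn classes; the conversation content is illustrative.

from deepeval.test_case import ConversationalTestCase, Turn

test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="I missed a loan payment. What happens now?"),
        Turn(role="assistant", content="A late fee may apply. I can walk you through your options."),
        Turn(role="user", content="Can you just waive the fee? I'm a long-time customer."),
        Turn(role="assistant", content="I can't waive fees myself, but I can raise a request with support."),
    ],
)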
Done!
This will provide a detailed breakdown of which conversations passed and which
failed, along with a score distribution:
In MCP apps, we must evaluate not only what the model says but how it uses
tools.
Let's learn how to evaluate any MCP workflow using DeepEval’s latest MCP
evaluations (open-source).
#1) Setup
First, we install DeepEval to run MCP evals.
Next, we define our own MCP server with two tools that the LLM app can
interact with.
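Here is a minimal sketch of such a server using the MCP Python SDK's FastMCP helper; the two tools (a calculator and a word counter) are placeholders for whatever capabilities your app needs.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

@mcp.tool()
def count_words(text: str) -> int:
    """Count the number of words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()   # serves the tools over stdio by default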
Next comes the client: the layer that sits between the LLM and the MCP server. It forwards each query to the LLM along with the server's tools and executes any tool calls the model makes.
We then filter the tool calls from the response and wrap each one in an object of DeepEval's MCPToolCall class.
DeepEval's MCPUseMetric then evaluates things like:
● How well did the LLM utilize the MCP capabilities given to it?
● How well did the LLM ensure argument correctness for each tool call?
For every test case, the results include:
● query
● response
● failure/success reason
● tools invoked and their params, etc.
As expected, the app failed on most queries, and our MCPUseMetric spotted that
correctly.
This evaluation helped us improve the app by writing better tool docstrings; the app, which initially passed only 1 or 2 out of 24 test cases, now achieves a 100% success rate:
Issues can come from the retriever, the model, or the tool handler, so we need to evaluate each component separately.
The typical evaluation setup treats the app as a single black box: feed the input → get the output → run evals on the overall end-to-end system.
But LLM apps need component-level evals and tracing since the issue can be
anywhere inside the box, like the retriever, tool call, or the LLM itself.
Define your LLM app in a method decorated with the @observe decorator:
Done!
Finally, we define some test cases and run component-level evals on the LLM
app:
You can also inspect individual tests to understand why they failed/passed:
We also need to test how the system behaves under adversarial pressure, because none of the metrics above tells us how easily the model can be exploited to do something it should never do. This is where red teaming comes in.
A well-crafted prompt can make even the safest model leak PII, generate harmful
content, or give away internal data. That’s why every major AI lab treats red
teaming as a core part of model development.
Alongside these strategies, you need well-crafted and clever prompts that mimic
real hackers.
This will help in evaluating the LLM’s response against PII leakage, bias, toxic
outputs, unauthorized access, and harmful content generation.
Below, we have our LLM app we want to perform red teaming on:
We have kept a simple LLM call here for simplicity, but you can have any LLM
app here (RAG, Agent, etc.)
So we define the vulnerabilities we want to detect (Bias and Toxicity) and the attack strategy used to probe for them (Prompt Injection in this case, meaning prompts that try to smuggle in biased or toxic behavior through injected instructions):
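Below is a minimal sketch, assuming DeepTeam's red_team interface with vulnerabilities and attacks; the model callback is just a plain chat-completion call standing in for your app.

from openai import AsyncOpenAI
from deepteam import red_team
from deepteam.vulnerabilities import Bias, Toxicity
from deepteam.attacks.single_turn import PromptInjection

client = AsyncOpenAI()

# The LLM app under test: a single chat-completion call for simplicity.
async def model_callback(input: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[Bias(), Toxicity()],   # what to probe for
    attacks=[PromptInjection()],            # how to probe for it
)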
Done!
Running this script (uv run llm_tests.py) produces a detailed report about the
exact prompt produced by DeepTeam, the LLM response, whether the test
passed, and the reason for test success/failure:
Lastly, you can further assess the risk report by logging everything in your
Confident AI dashboard:
The framework also implements many state-of-the-art red teaming techniques from recent research.
Lastly, the setup does not require any dataset because adversarial attacks are
dynamically simulated at run-time based on the specified vulnerabilities.
But the core insight applies regardless of the framework you use: adversarial testing has to be a core part of building the system, not an afterthought.
LLM Deployment
Up to this point, we've optimized our model and evaluated how well it behaves.
A production LLM system must also be fast, stable, and scalable under load. To do that, the serving stack needs to handle a few things:
1) Continuous Batching
Earlier, we saw how LLM requests finish at different times, leaving GPUs idle
unless batching is managed carefully.
vLLM implements continuous batching out of the box, swapping finished sequences for waiting requests on the fly. This keeps the GPU pipeline fully utilized without any code changes on your side.
2) PagedAttention
We previously discussed how KV-cache grows with every token and how
contiguous memory leads to fragmentation.
vLLM avoids this by storing KV-cache in small non-contiguous pages rather than
a single block.
A lightweight lookup table tells the model where each page lives, allowing vLLM to allocate KV memory on demand, avoid fragmentation, and share cached pages across requests with common prefixes.
4) Prefix-Aware Routing
Shared prefixes (like system prompts) can be reused across requests, but only if
they run on the same replica.
vLLM can also serve multiple LoRA adapters on top of a single base model. Because LoRA adapters are lightweight, it loads them once and applies them per-request without duplicating memory.
It can also host multiple models inside one server, making it easy to support:
● Personalization
● A/B testing
● Multi-feature applications
Since vLLM exposes the same API structure as OpenAI’s Chat Completions API,
migrating your application is often as simple as changing the base_url.
All these challenges explain why specialized inference engines exist and why
frameworks like vLLM, SGLang, TensorRT-LLM, and LMCache are necessary.
In this chapter, we focus on vLLM, one of the simplest and fastest ways to serve
LLMs in practice.
Out of the box, vLLM gives you:
● Continuous batching
● PagedAttention for memory-efficient KV caching
● An OpenAI-compatible API
In practice, using vLLM feels almost identical to using the OpenAI API, except that you are hosting the model yourself.
Let’s walk through how to serve a model using vLLM, step by step.
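First, start the server from the command line; the model name below is just an example, and any model you have access to locally or on Hugging Face works.

vllm serve meta-llama/Llama-3.1-8B-Instruct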
This loads the model into GPU memory and exposes a /v1 endpoint that follows
the OpenAI API format.
We use the standard Chat Completions format and simply point the client to our
vLLM server.
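For example, assuming the server from the previous step is running locally on its default port:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain KV caching in one sentence."}],
)
print(response.choices[0].message.content)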
At this point, our local deployment behaves like any hosted LLM endpoint.
If we need more throughput or want to serve a larger model, we can scale across
multiple GPUs.
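For instance, vLLM can shard the model across GPUs with tensor parallelism (here four GPUs; adjust to your hardware and model size):

vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 4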
We can also support fine-tuned LoRA variants without loading separate models.
This lets us serve multiple LoRA adapters from the same base model.
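A sketch of what this looks like with vLLM's LoRA support; the adapter name and path are placeholders.

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules support-bot=/path/to/lora_adapter

Requests can then target either the base model or a specific adapter by passing its name in the model field of the request.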
Finally, if our application requires more than one model, we can host them in a
single server.
LitServe
With vLLM, we now have a reliable way to serve LLMs efficiently. It handles the
core challenges of inference like batching, KV-cache management and routing
while letting us expose a simple API endpoint.
However, deployment often requires more than just running the model. A typical service also needs:
● request validation
● preprocessing and postprocessing
● custom routing
● authentication
● logging and monitoring
To support these pieces, we typically wrap the model inside a broader application
server.
This is where LitServe comes in: an open-source framework that lets you build your own custom inference engine.
It gives you control over how requests are handled - batching, streaming, routing,
and coordinating multiple models or components.
Because it works at this level, LitServe can serve a wide range of model types,
including vision, audio, text and multimodal systems.
To make this concrete, let’s start with a minimal example and then break it down.
This example deploys a Llama 3 model with LitServe in a simple end-to-end flow.
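Below is a minimal sketch of that flow, assuming LitServe's LitAPI/LitServer interface and a Hugging Face Llama 3 checkpoint; the model ID and generation settings are illustrative.

import litserve as ls
from transformers import pipeline

class Llama3API(ls.LitAPI):
    def setup(self, device):
        # Load the model once, when the server starts.
        self.generator = pipeline(
            "text-generation",
            model="meta-llama/Meta-Llama-3-8B-Instruct",
            device=device,
        )

    def decode_request(self, request):
        # Pull out what the model needs from the incoming request.
        return request["prompt"]

    def predict(self, prompt):
        # Run inference.
        return self.generator(prompt, max_new_tokens=200)[0]["generated_text"]

    def encode_response(self, output):
        # Shape the response sent back to the client.
        return {"output": output}

if __name__ == "__main__":
    server = ls.LitServer(Llama3API(), accelerator="auto")
    server.run(port=8000)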
Each part of the service maps to one function in the API class.
Next, we’ll break down each part to understand the LitServe pattern clearly.
1) Setup
Here we load the Llama 3 model into memory so it's ready for inference.
2) Decode the request
This method extracts the part of the request the model needs - in this case, the prompt.
3) Run Inference
Once it is exposed through an API, it encounters real traffic, fluctuating load and
the operational constraints of a production environment.
At this stage, the focus moves from serving the model to understanding how the
system behaves as it runs.
LLM Observability
Once a model is deployed, its behavior is no longer defined only by its weights or
the test set.
Real users, prompts, edge cases and system interactions now shape how the
application performs.
At this stage, the core question shifts from “Is the model good?” to “What is actually
happening inside the system?”
Evaluation vs Observability
Before we dive deeper, let's understand the difference between evaluation and observability.
Evaluation measures how well the system performs on a defined set of tasks.
Observability ensures the system continues to meet that standard under real
operating conditions.
Implementation
Now that we understand what observability means in an LLM system, the next
step is to implement it in practice. For this, we will use Opik, an open-source
framework that provides tracing and monitoring tools for LLM applications.
Opik offers a lightweight way to capture prompts, model outputs, and full execution traces. It helps debug pipelines, whether they consist of simple function calls, LLM interactions, or more complex systems like RAG pipelines and multi-step agents.
In the following sections, we’ll start with a minimal example and build up from
there.
Imagine we want to track all the invocations to this simple Python function
specified below:
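For instance, something like this, where the function body is just a stand-in for real application logic:

from opik import track

@track
def generate_reply(user_message: str) -> str:
    # A placeholder for your actual logic (retrieval, LLM calls, etc.).
    return f"You said: {user_message}"

generate_reply("What is KV caching?")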
That's it!
By wrapping any function with this decorator, you can automatically trace and
log its execution inside the Opik dashboard.
If we run the above code and then open the Opik dashboard, we will find this:
In this project, you can explore the inputs provided to the function, the outputs it
produced, and everything that happened during its execution.
For instance, once we open this project, we see the following invocation of the
function created above, along with the input, the output produced by the
function, and the time it took to generate a response.
Opening any specific invocation, we can look at the inputs and the outputs in a
clean YAML format, along with other details that were tracked by Opik:
This seamless integration makes it easy to monitor and debug your workflows
without adding any complex boilerplate code.
After specifying the OpenAI API key in the .env file, we will load it into our
environment:
Moving on, we wrap the OpenAI client with Opik’s track_openai function. This
ensures that all interactions with the OpenAI API are tracked and logged in the
Opik dashboard. Any API calls made using this client will now be automatically
monitored, including their inputs, outputs, and associated metadata.
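A minimal sketch of both steps, assuming the key is stored as OPENAI_API_KEY in the .env file and using an illustrative model name:

from dotenv import load_dotenv
from openai import OpenAI
from opik.integrations.openai import track_openai

load_dotenv()                      # loads OPENAI_API_KEY into the environment

client = track_openai(OpenAI())    # every call through this client is logged to Opik

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize what LLM observability means."}],
)
print(response.choices[0].message.content)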
Yet again, if we go to the dashboard, we can see the input and the output of the
LLM:
Opening this specific run highlights so many details about the LLM invocation,
like the input, the output, the number of tokens used, the cost incurred for this
specific run and more.
This shows that by using track_openai, every input, output and intermediate
detail is logged in the Opik dashboard, for improved observability.
We can also do the same with Ollama for LLMs running locally.
We shall again use Opik's OpenAI integration for this demo, which is imported
below, along with the OpenAI library:
Next, we again create an OpenAI client, but this time, we specify the base_url as
[Link]
Next, to log all the invocations made to our client, we pass the client to the
track_openai method:
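Putting those steps together might look like this, assuming Ollama's default OpenAI-compatible endpoint on localhost and a locally pulled model such as llama3:

from openai import OpenAI
from opik.integrations.openai import track_openai

# Ollama exposes an OpenAI-compatible API locally (default port 11434).
ollama_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

client = track_openai(ollama_client)   # log every local LLM call to Opik

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Why do we trace LLM calls?"}],
)
print(response.choices[0].message.content)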
Opening the latest (top) invocation, we can again see details similar to those we saw with OpenAI - the input, the output, the number of tokens used, the cost, and more.