Welcome to  Kernel Labs  by Kuriko IWAI.

A comprehensive framework for Machine Learning and MLOps.

This website hosts a comprehensive framework covering the entire machine learning lifecycle, from algorithmic deep-dives to robust MLOps exercises.

Shipping AI Systems?

I help teams design and deploy scalable ML / RAG / LLM pipelines and MLOps infrastructure.



Or explore:

Kuriko IWAI - Architect of Kernel Labs

Hosted by Kuriko IWAI

Recommended Reads

What's New

Architecting Semantic Chunking Pipelines for High-Performance RAG

Master critical chunking strategies for RAG to enhance retrieval accuracy and context retention.

Machine Learning, Deep Learning, Data Science, Python, Agentic AI, LLM

In Retrieval-Augmented Generation (RAG), your model’s output is strictly capped by the quality of the retrieved context.

This technical deep-dive explores the transition from arbitrary text slicing to semantic optimization. We evaluate the trade-offs between fixed-token splits and advanced hierarchical structures, providing Python implementation patterns to ensure your vector database delivers coherent, context-rich information for complex queries.

Chunking is the process of breaking down large bodies of text into smaller, manageable pieces (chunks) before they are converted into mathematical representations called embeddings.

The diagram below illustrates the data ingestion pipeline and the key role chunking plays in it:

Figure A.  Technical diagram of the RAG data ingestion pipeline illustrating the flow from raw data to chunking, embedding, and high-dimensional vector storage (Created by Kuriko IWAI)


The process begins with raw, unstructured data in various formats like text (PDFs, docs), images, audio, and video.

Chunking happens right after the raw data is retrieved from its sources (second box, Figure A): the data is broken down into smaller pieces called chunks.

By splitting a long document into smaller paragraphs or segments, the system can later retrieve only the specific part that answers a user's question, rather than the entire file.

Then each chunk is passed through an embedding model (an AI algorithm) that converts the content into a vector embedding, a long list of numbers (coordinates) that represent the semantic meaning of the chunk.

Lastly, these embeddings are stored in a vector store (database).

The vector space in Figure A has three dimensions for demonstration purposes, but in practice the space is high-dimensional.

These data points are stored close together when they are considered related.

For example, if we have a chunk about "Golden Retrievers" and another about "Labradors," they will be mathematically near each other in the database, allowing the retriever to find relevant information for the LLM almost instantly.
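As a minimal sketch of what "mathematically near" means, the snippet below compares made-up three-dimensional vectors with cosine similarity. The vector values and the `cosine_similarity` helper are illustrative assumptions; real embeddings come from an embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; the numbers are invented for illustration.
toy_embeddings = {
    "Golden Retrievers": [0.9, 0.8, 0.1],
    "Labradors": [0.85, 0.75, 0.2],
    "Solar inverters": [0.1, 0.2, 0.95],
}

dogs = cosine_similarity(toy_embeddings["Golden Retrievers"], toy_embeddings["Labradors"])
mixed = cosine_similarity(toy_embeddings["Golden Retrievers"], toy_embeddings["Solar inverters"])
print(f"retriever vs labrador: {dogs:.3f}")
print(f"retriever vs inverter: {mixed:.3f}")
```

The two dog-related vectors point in nearly the same direction, so their score is much closer to 1 than the score for the unrelated pair.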

Why Chunking Matters

Chunking matters for the following reasons:

  • Relevance: Small chunks ensure that RAG can retrieve the exact piece from the large document.

  • Cost efficiency: Processing smaller, targeted snippets saves on tokens and computation time.

  • Context retention: Well-chunked data keeps enough surrounding information for the LLM to understand the why and how behind a fact.

Overall, chunking helps the system retrieve the context most relevant to the user query while saving input tokens and fitting that context into the LLM's context window.

Comparative Analysis: 5 Industry-Standard Chunking Strategies

Selecting a chunking strategy is not a one-size-fits-all decision.

This section explores five major chunking strategies:

  • Fixed-size chunking.

  • Recursive character chunking.

  • Document-specific chunking.

  • Semantic chunking.

  • Parent–Child (Hierarchical) chunking.

To understand how these methods diverge in practice, we will apply each to a sample text regarding Solar Energy Infrastructure.

Sample Text

I'll use the sample text to see how each chunking strategy splits the text:

text = "Solar panels, or photovoltaic cells, convert sunlight into electricity. This process happens at the atomic level. Some materials exhibit a property known as the photoelectric effect. This causes them to absorb photons and release electrons. Beyond the cells, an inverter is required to convert DC to AC. Large-scale solar farms also require battery storage systems to manage peak load during non-sunny hours."

Fixed-Size Chunking

Fixed-size chunking is the most straightforward approach: it defines a specific number of characters or tokens per chunk.

The below diagram illustrates how it works:

Figure B. Visualization of fixed-size chunking showing the sliding window mechanism with defined chunk size and yellow-highlighted overlap section (Created by Kuriko IWAI)


For example, the chunk size (green cells, Figure B) and the chunk overlap, also called the sliding window (yellow cells, Figure B), can be set to 500 tokens and 50 tokens respectively.

The overlap ensures that context isn't lost if a key sentence is split in half.
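The sliding window can be sketched in a few lines of plain Python. This is a simplified illustration that counts characters rather than tokens; the `fixed_size_chunks` name and the sample string are assumptions made for the example.

```python
def fixed_size_chunks(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window slides each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in fixed_size_chunks(sample, chunk_size=10, overlap=3):
    print(chunk)
```

Each window starts `chunk_size - overlap` characters after the previous one, so the last three characters of one chunk reappear at the start of the next.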

Pros

Computationally affordable and easy to implement.

Best For

  • Quick prototyping.

  • Handling simple text.

  • General use cases where speed is prioritized over granular semantic accuracy.

Practical Implementation

The CharacterTextSplitter class from the langchain_text_splitters library can split the sample text:

from langchain_text_splitters import CharacterTextSplitter

# fixed size chunking
fixed_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=100,
    chunk_overlap=20
)

fixed_chunks = fixed_splitter.split_text(text)

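The configuration caps each chunk at 100 characters with a 20-character overlap (CharacterTextSplitter measures length in characters by default, not tokens).

Resulting Chunks:

Each chunk is filled up to the character limit, regardless of word boundaries: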

  • Chunk 1: "Solar panels, or photovoltaic cells, convert sunlight into electricity. This process happens at the a"

  • Chunk 2: "ns at the atomic level. Some materials exhibit a property known as the photoelectric effect. This ca"

In this method, a word like "atomic" is sliced.

Although computationally fastest, the method can create semantic noise as the LLM receives partial words, which can degrade the quality of the generated response.

Recursive Character Chunking

Recursive character text splitting uses a hierarchy of separators to find a natural breaking point, instead of cutting text at a hard character limit.

The below diagram illustrates how the method works:

Figure C. Diagram of recursive character splitting logic demonstrating the hierarchical priority of separators like newlines and periods to preserve sentence integrity (Created by Kuriko IWAI)


The method first attempts to split by the most significant separator, period, and then moves down the list for commas and spaces, until the chunk size requirement (pink box, Figure C) is met.
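The separator hierarchy can be sketched as a small recursive function. This is a simplified illustration, not LangChain's implementation: it has no overlap handling, and the `recursive_split` name and the default separator list are assumptions for the example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, falling back to finer ones when needed."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    parts = text.split(sep)
    # Re-attach the separator to the piece it terminated, so periods survive.
    pieces = [part + sep for part in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for piece in pieces:
        if len((current + piece).strip()) <= chunk_size:
            current += piece  # keep merging pieces while they fit
            continue
        if current.strip():
            chunks.append(current.strip())
        if len(piece.strip()) <= chunk_size:
            current = piece
        else:
            # A single piece is still too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks


demo = "Alpha beta gamma. Delta epsilon. Zeta eta theta iota."
for chunk in recursive_split(demo, chunk_size=20):
    print(repr(chunk))
```

With a 20-character limit, the demo text breaks at sentence boundaries because each sentence fits on its own but no two fit together.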

Pros

  • Keeps related ideas together better than fixed-size splitting.

Best For

  • Maintaining the integrity of paragraphs and sentences.

  • Articles and blogs.

Practical Implementation

The RecursiveCharacterTextSplitter class from the langchain_text_splitters library can split the sample text:

from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separators=["\n\n", "\n", " ", ""]
)

recursive_chunks = recursive_splitter.split_text(text)

The configuration caps each chunk at 100 characters with a 20-character overlap.

The separators are tried in priority order: paragraph breaks (double newlines), line breaks (single newlines), spaces, and finally individual characters.

Resulting Chunks

Each chunk stays within the 100-character limit:

  • Chunk 1: 'Solar panels, or photovoltaic cells, convert sunlight into electricity.'

  • Chunk 2: 'This process happens at the atomic level. Some materials exhibit a property'

Compared to the fixed-size chunking method, the recursive splitter identifies the period at the end of the first sentence and stops there, preserving grammatical integrity.

This makes the method more readable for the LLM.

Document-Specific Chunking

Document-specific chunking respects the inherent structure of file types such as Markdown, HTML, LaTeX, or source code.

For example:

  • Markdown: Splits by headers (#, ##, ###).

  • HTML: Splits by html tags, comma, or dots.

  • Code: Splits by function or class definitions.
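For the code case, a minimal sketch using Python's standard `ast` module shows one way to split a source file at top-level function and class definitions. The `split_python_source` helper and the sample source are assumptions made for illustration; dedicated code splitters handle more languages and nested structures.

```python
import ast


def split_python_source(source):
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # exact source text of the definition, including its body
                "text": ast.get_source_segment(source, node),
            })
    return chunks


# Hypothetical source file content used only for this example.
example_source = '''
def to_ac(dc_power):
    return dc_power * 0.96

class Battery:
    def __init__(self, capacity_kwh):
        self.capacity_kwh = capacity_kwh
'''

for chunk in split_python_source(example_source):
    print(chunk["kind"], chunk["name"])
    print(chunk["text"])
```

Keeping the definition name alongside the text gives each chunk metadata that can be stored with its embedding.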

Pros

  • Preserves the logical hierarchy of the document.

Best For

  • Highly structured technical documentation like PDF reports, manuals.

  • Codebases.

Practical Implementation

The MarkdownHeaderTextSplitter class from the langchain_text_splitters library can define the headers to split on and split the document:

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = """
# Solar Energy Guide

## The Physics
Solar panels, or photovoltaic cells, convert sunlight into electricity. 
This process happens at the atomic level. 
Some materials exhibit the photoelectric effect.

## The Hardware
Beyond the cells, an inverter is required to convert DC to AC. 
Large-scale solar farms also require battery storage systems.

### Maintenance
Regular cleaning of panels ensures maximum photon absorption.
"""

# define the headers
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

# initialize the splitter
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# split
md_header_splits = markdown_splitter.split_text(markdown_document)

Resulting Chunks:

Each chunk has been split at the headers defined in the code snippet:

Chunk 1.

  • Content: Solar panels, or photovoltaic cells, convert sunlight into electricity. This pro...

  • Metadata: {'Header 1': 'Solar Energy Guide', 'Header 2': 'The Physics'}

Chunk 2.

  • Content: Beyond the cells, an inverter is required to convert DC to AC. Large-scale solar...

  • Metadata: {'Header 1': 'Solar Energy Guide', 'Header 2': 'The Hardware'}

Chunk 3.

  • Content: Regular cleaning of panels ensures maximum photon absorption....

  • Metadata: {'Header 1': 'Solar Energy Guide', 'Header 2': 'The Hardware', 'Header 3': 'Maintenance'}

The method can leverage the author's intent, ensuring that "The Physics" and "The Hardware" are never accidentally blended into the same chunk, which is vital for technical manuals.

Semantic Chunking

Semantic chunking is a more advanced technique that measures the semantic distance between adjacent sentences to determine where a topic changes.

The method looks at the cosine similarity between the embeddings of consecutive sentences.

When the similarity drops below a certain threshold, it assumes a new topic has started and creates a break to move onto a new chunk.

Pros

  • High retrieval accuracy because chunks represent complete ideas.

Best For

  • Academic papers.

  • Narrative-heavy documents.

  • Long-form essays (where paragraph breaks don't align with topic shifts).

Practical Implementation

I'll first create vector embeddings using the SentenceTransformer class from the sentence_transformers library.

Then, I'll group consecutive sentences into chunks by calculating cosine similarity scores between their embeddings. The split_into_sentences helper, the MODEL name, and the threshold value below are assumptions added so the snippet runs on its own:

import re

from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

MODEL = "all-MiniLM-L6-v2"  # assumed model name; swap in your own


def split_into_sentences(text):
    # assumed helper: naive split after ., !, or ? followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def semantic_chunk(text, similarity_threshold):
    # split the text into sentences
    sentences = split_into_sentences(text)
    if not sentences:
        return []

    # create vector embeddings
    model = SentenceTransformer(MODEL)
    embeddings = model.encode(sentences)

    # compute cosine similarity to split into chunks
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        prev_emb = embeddings[i - 1].reshape(1, -1)
        curr_emb = embeddings[i].reshape(1, -1)

        similarity = cosine_similarity(prev_emb, curr_emb)[0][0]
        print(f"Similarity between sentence {i-1} and {i}: {similarity:.3f}")

        # if similarity drops below the threshold, treat it as a topic shift and split
        if similarity < similarity_threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    # add last chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks


# the threshold is an assumed value; tune it against your own corpus
chunks = semantic_chunk(text, similarity_threshold=0.6)

Resulting Chunks

In this method, the system notices a shift in meaning between "release electrons" and "Beyond the cells."

  • Chunk 1. Solar panels, or photovoltaic cells, convert sunlight into electricity.

    • Cosine similarity: 0.196

  • Chunk 2. This process happens at the atomic level.

    • Cosine similarity: 0.313

  • Chunk 3. Some materials exhibit a property known as the photoelectric effect.

    • Cosine similarity: 0.546

  • Chunk 4. This causes them to absorb photons and release electrons.

    • Cosine similarity: 0.266

  • Chunk 5. Beyond the cells, an inverter is required to convert DC to AC.

    • Cosine similarity: 0.336

  • Chunk 6. Large-scale solar farms also require battery storage systems to manage peak load during non-sunny hours.

Semantic chunking recognizes that the topic changed from physics to engineering and forces a split.
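To see how the threshold shapes the result, the sketch below reuses the five similarity scores reported above and groups the six sentences under different thresholds. The `group_by_similarity` helper, the short S1 to S6 labels, and the threshold values are assumptions for illustration; the threshold actually used for the results above is not stated.

```python
def group_by_similarity(sentences, similarities, threshold):
    """Start a new chunk wherever similarity to the previous sentence falls below threshold.

    similarities[i] is the score between sentences[i] and sentences[i + 1].
    """
    if not sentences:
        return []
    if len(similarities) != len(sentences) - 1:
        raise ValueError("expected one similarity score per adjacent sentence pair")
    chunks = [[sentences[0]]]
    for sentence, score in zip(sentences[1:], similarities):
        if score < threshold:
            chunks.append([sentence])  # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


labels = ["S1", "S2", "S3", "S4", "S5", "S6"]  # stand-ins for the six sample sentences
scores = [0.196, 0.313, 0.546, 0.266, 0.336]   # scores reported in the results above

for threshold in (0.25, 0.3, 0.6):  # assumed thresholds to compare
    print(threshold, group_by_similarity(labels, scores, threshold))
```

A threshold of 0.3 splits after the first and fourth sentences, while 0.6 separates every sentence, so the threshold directly controls how coarse the chunks are.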

Parent–Child (Hierarchical) Chunking

Parent-child (hierarchical) chunking involves storing two versions of the same data: a small child chunk for searching and a larger parent chunk for context.

The system first searches against small, highly specific chunks (e.g., 100 tokens).

Then, once a match is found, it retrieves the larger surrounding parent document (e.g., 1000 tokens) to provide to the LLM.

Pros

  • Helps avoid the "lost in the middle" problem by giving the LLM plenty of background info.

Best For

  • Enterprise-grade RAG systems.

  • Balancing high-precision search with comprehensive context.

Practical Implementation

I'll first create a parent chunk that contains the entire sample text, and then create child chunks, each of which carries the vector embedding of one sentence. The split_into_sentences helper below is an assumed, naive regex-based splitter added so the snippet runs on its own:

import re, uuid
from sentence_transformers import SentenceTransformer

# create parent chunk
parent_chunk = {
    "id": str(uuid.uuid4()),
    "text": text.strip() # the entire sample text
}


# load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

def split_into_sentences(text):
    # assumed helper: naive split after ., !, or ? followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def create_child_chunks(parent_chunk):
    sentences = split_into_sentences(parent_chunk["text"])

    children = []
    for sentence in sentences:
        child = {
            "id": str(uuid.uuid4()),
            "parent_id": parent_chunk["id"],  # link back to the parent
            "text": sentence,
            "embedding": model.encode(sentence)
        }
        children.append(child)
    return children

child_chunks = create_child_chunks(parent_chunk)

Results

Parent context:
Solar panels, or photovoltaic cells, convert sunlight into electricity.
This process happens at the atomic level.
Some materials exhibit a property known as the photoelectric effect.
This causes them to absorb photons and release electrons.
Beyond the cells, an inverter is required to convert DC to AC.
Large-scale solar farms also require battery storage systems to manage peak load during non-sunny hours.

Matched child sentence:
Large-scale solar farms also require battery storage systems to manage peak load during non-sunny hours.

In the process, when the retriever finds that the query matches a child's vector embedding, it pulls the parent chunk instead of returning just the matched sentence.

This enables the LLM to receive the full paragraph and assess the relationship between solar panels, inverters, and battery storage. In other words, it explains the "Why" (infrastructure) rather than just the "What" (batteries).
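The retrieval step can be sketched as follows. To keep the example free of model downloads, a bag-of-words vector stands in for the embedding model; that substitution, the `retrieve_parent` helper, and the single-parent store are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_parent(query, children, parents):
    """Search against the small child chunks, then return the best child's parent text."""
    query_vec = bag_of_words(query)
    best = max(children, key=lambda child: cosine(query_vec, child["vector"]))
    return best, parents[best["parent_id"]]


# Hypothetical store with one parent and its sentence-level children.
parents = {
    "p1": "Beyond the cells, an inverter is required to convert DC to AC. "
          "Large-scale solar farms also require battery storage systems to manage "
          "peak load during non-sunny hours."
}
children = [
    {"parent_id": "p1", "text": sentence, "vector": bag_of_words(sentence)}
    for sentence in parents["p1"].split(". ")
]

child, parent_text = retrieve_parent("Why do solar farms need battery storage?", children, parents)
print("Matched child:", child["text"])
print("Context sent to the LLM:", parent_text)
```

The search runs over the small child chunks, but the text handed to the LLM is the parent, which is what gives the model the surrounding context.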

Wrapping Up

Chunking is not just slicing data.

With a proper strategy, it can work as semantic optimization.

Proper grouping ensures that when a user asks a question, the vector database returns a coherent piece of information rather than a fragmented snippet that leaves the AI guessing.

Here are key considerations for choosing the optimal chunking strategy.

Implementation Roadmap: Choosing Optimal Strategy

1. Identify Patterns & Logical Structures

Look for repetitions, sequences, or inherent connections in the data.

In a technical context, this means identifying document headers, Markdown tags, or paragraph breaks to ensure a chunk doesn't cut off in the middle of a vital sentence or thought.

  • The key question:

Does my data have a predictable layout or specific syntax that carries meaning?

  • Strategy to choose: Document-specific (structure-aware) chunking.

Use parsers that respect the document's native format (e.g., Markdown, HTML, or LaTeX) to split data at logical boundaries rather than arbitrary character counts.

2. Maintain Semantic Context (The Mnemonic for AI)

Just as humans use mnemonics to link ideas, AI systems use overlapping chunks.

By including a small portion of the previous chunk at the start of the next one, you create a narrative bridge that prevents the model from losing the broader context of the data.

  • The key question to ask:

If I read this chunk in isolation, would I still understand the subject of the sentence?

  • Strategy to choose: Fixed-size chunking with a sliding window (overlap).

Setting an overlap of 10–20% between chunks ensures that the end of one chunk and the beginning of the next share enough connective tissue to maintain semantic flow.

3. Prioritize Categorical Grouping

Organize information by category or hierarchy (e.g., grouping a grocery list by "produce" or "dairy").

  • The key question to ask:

How granular is the information my users are looking for—specific facts or broad overviews?

  • Strategy to choose: Recursive Character Splitting.

Start with a large separator (like a double newline) and progressively move to smaller separators (space, character) until the desired chunk size is reached. This keeps neighboring ideas in the same bucket.

4. Optimize for Retrieval & Relevance

Lastly, whichever chunking strategy we choose, it is best to regularly test the chunk size against real-world queries: if chunks are too small, they lack context; if they are too large, they introduce noise that can confuse the LLM.

The key question to ask:

Am I retrieving irrelevant fluff that wastes my model's context window, or am I missing the answer entirely?

If either is the case, experimental benchmarking works best.

Experimental benchmarking runs chunking with different chunk sizes (e.g., 256, 512, and 1024 tokens) and evaluates each of them using metrics like Hit Rate or MRR (Mean Reciprocal Rank).

It allows one to determine which size consistently yields the most accurate answers for the task at hand.
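Both metrics are simple to compute once each test query has a known relevant chunk. The sketch below uses invented retrieval results for two chunk sizes; the chunk ids, the `runs` data, and the function names are hypothetical.

```python
def hit_rate(ranked_results, relevant_ids, k):
    """Fraction of queries whose relevant chunk appears in the top-k results."""
    hits = sum(1 for ranked, rel in zip(ranked_results, relevant_ids) if rel in ranked[:k])
    return hits / len(relevant_ids)


def mean_reciprocal_rank(ranked_results, relevant_ids):
    """Average of 1/rank of the relevant chunk (counted as 0 when it is not retrieved)."""
    total = 0.0
    for ranked, rel in zip(ranked_results, relevant_ids):
        if rel in ranked:
            total += 1.0 / (ranked.index(rel) + 1)
    return total / len(relevant_ids)


# Hypothetical benchmark: the relevant chunk id per query, and the ranked
# chunk ids returned by the retriever for each chunk size under test.
relevant = ["c1", "c7", "c3"]
runs = {
    256: [["c1", "c2", "c4"], ["c5", "c7", "c9"], ["c8", "c6", "c2"]],
    512: [["c2", "c1", "c4"], ["c7", "c5", "c9"], ["c6", "c3", "c2"]],
}

for size, ranked in runs.items():
    print(size,
          round(hit_rate(ranked, relevant, k=3), 3),
          round(mean_reciprocal_rank(ranked, relevant), 3))
```

Hit Rate only asks whether the right chunk was retrieved at all, while MRR also rewards ranking it near the top, so the two can disagree about which chunk size is best.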

Continue Reading

How to Build Reliable RAG: A Deep Dive into 7 Failure Points and Evaluation Frameworks

Master how to evaluate the RAG pipeline and solve common failures with DeepEval, RAGAS, TruLens, and Phoenix.

Machine Learning, Deep Learning, Data Science, Python, Agentic AI, LLM

Building a RAG prototype is easy; ensuring it doesn't hallucinate in production is the real engineering challenge.

This article dissects the Seven Failure Points (FPs) of RAG—from missing content to incorrect specificity—and provides a technical roadmap for mitigation using industry-leading evaluation frameworks like DeepEval, RAGAS, and Arize Phoenix.

According to researchers Barnett et al., Retrieval Augmented Generation (RAG) systems encounter seven specific Failure Points (FPs) throughout the pipeline.

The below diagram illustrates these stages:

Figure A. Indexing and Query processes required for creating a RAG system. The indexing process is done at development time and queries at runtime. Failure points identified in this study are shown in red boxes (source)


Let us explore each FP arranged according to the pipeline sequence, following the top-left to bottom-right progression shown in Figure A.

FP1. Missing Content

Missing content happens when the system is asked a question that cannot be answered because the relevant information is not present in the available vector store in the first place.

The failure occurs when an LLM provides a plausible-sounding but incorrect response instead of stating it doesn't know.

FP2. Missed the Top-Ranked Documents

This is a situation where a correct document exists in the vector store, but the retriever fails to rank it highly enough to include it in top-k documents fed to an LLM as context.

In consequence, the correct information never reaches the LLM.

FP3. Not in Context (Consolidation Strategy Limitations)

This is a situation where a correct document exists and is retrieved from the vector store, but is excluded during the consolidation process.

This happens when too many documents are returned and the system must filter them down to fit within an LLM's context window, token limits, or rate limits.
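A minimal sketch of such a consolidation step is shown below: it keeps documents in rank order until a token budget is used up. The `consolidate` helper is hypothetical, and whitespace word counts stand in for a real tokenizer.

```python
def consolidate(ranked_docs, token_budget, count_tokens=lambda text: len(text.split())):
    """Keep documents in rank order while they fit; report what was dropped."""
    kept, dropped, used = [], [], 0
    for doc in ranked_docs:
        cost = count_tokens(doc)
        if used + cost <= token_budget:
            kept.append(doc)
            used += cost
        else:
            dropped.append(doc)  # a dropped doc may hold the answer (FP3)
    return kept, dropped


ranked_docs = [
    "inverters convert DC to AC",
    "battery storage manages peak load during non-sunny hours",
    "panels need cleaning",
]
kept, dropped = consolidate(ranked_docs, token_budget=8)
print("kept:", kept)
print("dropped:", dropped)
```

Logging the dropped documents makes FP3 visible: if the correct document regularly lands in `dropped`, the budget or the ranking needs attention rather than the generator.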

FP4. Not Extracted

This is a situation where an LLM fails to identify the correct information in the context, even though the correct information was in the vector store, and successfully retrieved/consolidated.

This happens when the context is overly noisy or contains contradictory information that confuses the LLM.

FP5. Wrong Format

This is a situation where storage, retrieval, consolidation, and LLM interpretation are successfully handled, but the LLM fails to follow specific formatting instructions provided in the prompt, such as a table, a bulleted list, or a JSON schema.
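For the JSON case, a format failure can be caught programmatically before the output reaches the user. The sketch below uses only the standard library; the `check_json_output` helper and the required keys are assumptions chosen for the example.

```python
import json


def check_json_output(raw_output, required_keys):
    """Return (ok, detail): whether the LLM output is a JSON object with the required keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return False, "expected a JSON object"
    missing = [key for key in required_keys if key not in parsed]
    if missing:
        return False, f"missing keys: {missing}"
    return True, "ok"


# Hypothetical LLM outputs for a prompt that asked for {"answer": ..., "sources": ...}.
outputs = [
    '{"answer": "Use an inverter.", "sources": ["doc-1"]}',
    'Sure! Here is the answer: use an inverter.',
    '{"answer": "Use an inverter."}',
]
for raw in outputs:
    print(check_json_output(raw, required_keys=["answer", "sources"]))
```

A check like this can trigger a retry or a fallback response whenever the output does not match the requested structure.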

FP6. Incorrect Specificity

The answer is technically present in an LLM's output, but the output is either too general or too complex for the user's needs.

For example, an LLM generates a simplistic answer to a user query with a complex professional goal.

FP7. Incomplete Answers

This is a situation where an LLM generates an output not necessarily wrong, but missing key pieces of information that were available in the context.

For example, when a user asks a complex question like "What are the key points in documents A, B, and C?", the LLM only addresses one or two of the sources.

How FPs Compromise RAG Pipeline Performance

Each of these FPs impacts the performance of RAG pipelines:

Data Integrity & Trust Failures

When missing or incorrect information is present, the system is no longer a reliable source of information. Primary FPs include:

  • FP1 (Missing Content): The answer is not in the doc in the first place.

  • FP4 (Not Extracted): The LLM decides to ignore the correct answer in the doc.

  • FP7 (Incomplete): The LLM gives half-truths, missing important pieces.

Retrieval & Efficiency Bottlenecks

The RAG pipeline can be inefficient when it misses key information in the retrieval and consolidation stages. Primary FPs include:

  • FP2 (Missed Top Ranked): The retriever fails to rank the correct document within the top-k results.

  • FP3 (Consolidation Strategy): The script to trim docs to fit the LLM limits drops the most important parts.

User Experience & Formatting Errors

Although correct, an output with poor readability or in a wrong format can compromise user experience. Primary FPs include:

  • FP5 (Wrong Format): The LLM fails to follow the specific output format like JSON.

  • FP6 (Incorrect Specificity): The LLM generates a lengthy output for a simple yes/no question, or vice versa (an answer that is too brief for a complicated question).

The Evaluation Stack: Frameworks to Mitigate FPs

Evaluation frameworks are designed to systematically detect and mitigate these FPs.

This section explores major evaluation frameworks with practical use cases.

Major RAG Evaluation Frameworks:

  • DeepEval

  • RAGAS

  • TruLens

  • Arize Phoenix

  • Braintrust

DeepEval - The Unit Test before Deployment

DeepEval calculates a weighted score based on user-defined criteria.

An LLM-as-a-judge (e.g., GPT-4o) evaluates each criterion against an LLM's output.

DeepEval leverages G-Eval, a chain-of-thought (CoT) framework that takes a multi-step approach to evaluate the output:

  1. Define the criteria to measure (e.g., "coherence," "fluency," or "relevance").

  2. Generate evaluation steps (using an evaluator LLM).

  3. Follow the evaluation steps to analyze the input and the LLM's output.

  4. Calculate an expected weighted sum of the scores for each criterion.

Leveraging this approach, DeepEval measures the score:

$$\text{Score} = \sum_{i=1}^{n} w_i \, f(C_i, O) \tag{1}$$

where:

  • w_i: The weight of a specific criterion, such as tone or helpfulness.

  • C_i: The i-th criterion being scored against an output O.

  • f: The LLM's Likert-scale assessment:

| Types | Response Options | | | | |
|---|---|---|---|---|---|
| Agreement | Strongly Agree | Agree | Neutral | Disagree | Strongly Disagree |
| Likelihood | Very Likely | Likely | Neutral | Unlikely | Very Unlikely |
| Quality | Excellent | Above Average | Average | Below Average | Poor |
| Frequency | Very Often | Often | Sometimes | Rarely | Never |
| Numeric | 5 | 4 | 3 | 2 | 1 |

Table 1. The Likert-Scale Framework for LLM-as-a-Judge Scoring.
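As a worked example of Equation (1) as written here, the sketch below combines per-criterion Likert scores on the 1 to 5 numeric scale with weights. The criteria, scores, and weights are invented, and the snippet follows the formula in this article rather than DeepEval's internal implementation.

```python
def weighted_score(assessments, weights):
    """Compute sum_i w_i * f(C_i, O) from per-criterion scores and weights."""
    if set(assessments) != set(weights):
        raise ValueError("assessments and weights must cover the same criteria")
    return sum(weights[name] * assessments[name] for name in weights)


# Hypothetical judge output on the 1-5 numeric scale from Table 1.
assessments = {"coherence": 4, "fluency": 5, "relevance": 3}
# Hypothetical weights; they sum to 1 so the result stays on the 1-5 scale.
weights = {"coherence": 0.5, "fluency": 0.2, "relevance": 0.3}

score = weighted_score(assessments, weights)
print(round(score, 2))
```

Here the result is 0.5 × 4 + 0.2 × 5 + 0.3 × 3 = 3.9, which can then be compared against a pass threshold.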

Common Scenario in Practice

  • Situation: A technical documentation assistant (bot) for a complex software product seems to keep working every time the engineering team updates the codebase.

  • Problem: There is no quantitative proof that the bot can still answer user queries (you just "think" it's working...).

  • Solution: Integrate a pytest function as a CI/CD regression suite in GitHub Actions, where DeepEval runs G-Eval and other metrics over a test case:

# pytest component
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval


def test_bot_silent_regression():
    # set up metrics with a pass/fail threshold
    relevancy = AnswerRelevancyMetric(threshold=0.85)
    faithfulness = FaithfulnessMetric(threshold=0.85)

    # geval (llm judge)
    geval_correctness = GEval(
        name="Correctness",
        criteria="Determine if the actual output is factually accurate based on the expected output.",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        threshold=0.85,
    )

    # define a test case
    test_case = LLMTestCase(
        input="How do I rotate API keys in the dashboard?",
        actual_output="To rotate keys, go to Settings > Security and click 'Regenerate'.",
        retrieval_context=["The security tab allows users to regenerate API keys for safety."],
        expected_output="Users can rotate API keys via the Security section in Settings.",
    )

    # assert the test case against the metrics
    assert_test(test_case, [relevancy, faithfulness, geval_correctness])

  • Expected results: If any score of the metrics drops below the threshold (0.85), the PyTest raises AssertionError - immediately failing the CI build, preventing the silent regression from reaching production.

Pros

  • A variety of metrics (50+) including specialized bias and toxicity checks are available.

  • Seamlessly integrates with existing CI/CD pipelines.

  • Reference-free metrics are available: they assess an output based solely on the prompt and the provided context.

Cons

  • The quality of evaluation heavily depends on the judge LLM's capabilities.

  • Computationally expensive when the judge LLM is a high-end model.

Developer Note - The Test Case for DeepEval
A set of LLMTestCase objects defines the test cases that DeepEval runs.

In practice, the set should contain the most important user queries and labeled outputs along with the retrieved context.

These can be loaded from a JSON or CSV file.

RAGAS - The Needle in a Haystack Optimizer

Retrieval Augmented Generation Assessment (Ragas) aims to evaluate RAG without a human-annotated dataset by generating synthetic test sets.

Then, it computes its flagship metrics:

Figure B. The RAGAS evaluation triad diagram connecting Question, Context, and Answer through Precision, Recall, Faithfulness, and Relevancy metrics (Created by Kuriko IWAI)


The flagship metrics fall into three groups:

  • Retrieval pipeline (black, solid line, Figure B): Context precision, context recall.

  • Generation pipeline (black, dotted line, Figure B): Faithfulness, answer relevancy.

  • Ground truth (red box, Figure B): Answer semantic similarity, answer correctness.

For example, faithfulness calculates the overlap between the claims in the response and the retrieved context such that:

$$\text{Faithfulness} = \frac{X_{AC}}{X_A} \tag{2}$$

where:

  • X_AC: The number of the claims in the answer (response) A supported by the given context, and

  • X_A: The total number of the claims in the response A (with or without the context supported).
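Once each claim in the answer has a supported or unsupported verdict, the ratio is straightforward to compute. In the sketch below the claims and verdicts are invented by hand; in Ragas itself an LLM extracts the claims and judges them against the context.

```python
def faithfulness_score(claim_supported):
    """Share of answer claims supported by the retrieved context (X_AC / X_A)."""
    if not claim_supported:
        return 0.0  # no claims to verify
    return sum(claim_supported) / len(claim_supported)


# Hypothetical claims extracted from an answer, each with a verdict on
# whether the retrieved context supports it.
claims = [
    ("Inverters convert DC to AC.", True),
    ("Solar farms use battery storage.", True),
    ("Batteries manage peak load.", True),
    ("Panels last exactly 40 years.", False),  # not found in the context
]

score = faithfulness_score([supported for _, supported in claims])
print(score)
```

Three of the four claims are supported, so the score is 3 / 4 = 0.75; the unsupported claim is the one to inspect for hallucination.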

Common Scenario in Practice

  • Situation: The RAG system for legal contracts is missing key clauses. You are unsure if the problem is in the Search (Retriever) or the Reading (Generator).

  • Problem: The optimal top-k (number of chunks retrieved) is unknown.

  • Solution: Use RAGAS to create a synthetic test set with 100 pairs of questions and evidence. Then, run the RAG pipeline against the test set to calculate context recall and context precision:

from datasets import Dataset
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness
from ragas.testset.synthesizers.generate import TestsetGenerator

# NOTE: Ragas import paths and testset column names differ between releases;
# check them against the version you have installed.

# setup models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o")
embedding_model = OpenAIEmbeddings()

# setup documents (doc_1 and doc_2 are placeholders for your own text)
langchain_docs = [
    Document(page_content=doc_1),
    Document(page_content=doc_2),
]

# instantiate generator and generate synthetic testset w/ 100 pairs
generator = TestsetGenerator.from_langchain(
    llm=generator_llm,
    embedding_model=embedding_model
)
testset = generator.generate_with_langchain_docs(langchain_docs, testset_size=100)
test_df = testset.to_pandas()

### <--- rag pipeline execution --->

# create results dataset w/ the rag pipeline
# (rag_pipeline is a placeholder for your own retriever + generator)
questions = test_df["question"].tolist()
results_data = {
    "question": questions,
    "contexts": [rag_pipeline.get_chunks(q) for q in questions],
    "answer": [rag_pipeline.get_answer(q) for q in questions],
    "ground_truth": test_df["ground_truth"].tolist(),
}
result_dataset = Dataset.from_dict(results_data)

# evaluate the results w/ metrics
score = evaluate(
    dataset=result_dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    llm=critic_llm,
    embeddings=embedding_model,
)


Expected result: Depending on the metric results, the action plan can be as follows:

| Metric | Score | Diagnostic | Action Plan |
|---|---|---|---|
| Context Recall | Low | The retriever missed the correct info. | Increase top-k; try hybrid search (BM25 + Vector). |
| Context Precision | Low | Top-k chunks contain too much filler and noise, confusing the LLM. | Decrease top-k; implement a Reranker (e.g., Cohere). |
| Faithfulness | Low | The generator is hallucinating despite having data. | Adjust system prompt; check for context window limits. |

Table 2. RAGAS Diagnostic Action Plan - Mapping Scores to System Adjustments.

Pros

  • Excellent for an early-stage project without ground-truth datasets (as we saw in the code snippet, RAGAS can make a synthetic test set).

Cons

  • Synthetic test set might miss nuanced factual errors.

  • Requires a robust extractor model to break down answers into individual claims (I used gpt-4o in the example).

TruLens - The Feedback Loop Specialist

TruLens focuses on the internal mechanics of the RAG process rather than just the final output by using feedback functions.

For example, it measures answer relevance with a cosine similarity:

$$\text{Relevance} = \cos(\theta) = \frac{V_Q \cdot V_R}{\lVert V_Q \rVert \, \lVert V_R \rVert} \tag{3}$$

where V_Q and V_R are the embedding vectors of the query Q and the response R, respectively.

It also uses an LLM-based score reflecting how well the response satisfies the query's intent, using a 4-point Likert scale (0-3), making it superior for ranking the quality of different search results.

Common Scenario in Practice

  • Situation: A medical advisor bot answers a user's question correctly but adds a pro-tip that isn't in the vetted PDF base.

  • Problem: The add-on pro-tip might be helpful, but not grounded.

  • Solution: Use TruLens to implement a groundedness feedback function with a threshold like score > 0.8:

from trulens_eval import Tru, Feedback, Select, TruCustomApp
from trulens_eval.feedback.provider.openai import OpenAI as tOpenAI

# instantiate tru and evaluator model (MODEL_NAME: evaluator model name defined elsewhere)
tru = Tru()
provider = tOpenAI(model_engine=MODEL_NAME)

# define the feedback func - compare the output vs source (chunks)
f_groundedness = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons,
        name="Groundedness"
    )
    .on(Select.RecordCalls.func.args.context)  # source (chunks)
    .on_output()  # output (llm response)
)

# wrap the rag pipeline w/ tru recorder (rag_query_engine: the RAG app under test)
tru_recorder = TruCustomApp(
    rag_query_engine,  # rag app
    app_id="app_ver1",
    feedbacks=[f_groundedness]
)

# execute the query w/ tru recorder
with tru_recorder as recording:
    res = rag_query_engine(
        prompt="What are the side effects of Ibuprofen?"
    )

# retrieve the tru assessment results
record_df, feedback_cols = tru.get_records_and_feedback(app_ids=['app_ver1'])

  • Expected results: When the LLM generates a response that contains information not present in the retrieved chunks, TruLens flags the record in your dashboard.

Pros

  • Visualizes the reasoning chain to identify exactly where the agent went off-track.

  • Provides built-in support for grounding to catch hallucinations in real-time.

Cons

  • Learning curve for defining custom feedback functions.

  • The dashboard can feel heavyweight for simple scripts.

Arize Phoenix - The Silent Failure Map

Arize Phoenix is an open-source observability and evaluation tool to evaluate LLM outputs, including complex RAG systems.

Built on OpenTelemetry by Arize AI, it focuses on observability by treating LLM evaluation as a subset of MLOps.

In the context of RAG evaluation, Phoenix excels at embedding analysis, using Uniform Manifold Approximation and Projection (UMAP) to reduce high-dimensional vector embeddings into 2D/3D space:

$$f: \mathbb{R}^n \to \mathbb{R}^d \quad \text{where } d \ll n \tag{4}$$

where n is the original dimension of the vector space, and d is the reduced dimension (e.g., d = 3)

This embedding analysis mathematically reveals if the failed queries are semantically grouped together, which indicates a gap in the vector database.

Common Scenario in Practice

  • Situation: A customer support bot works great for refunds, but gives nonsensical answers to warranty claims.

  • Problem: A data hole in the vector database that cannot be found in the logs.

  • Solution: Use Arize Phoenix to generate a UMAP embedding visualization, a 3D map of the vector database, and overlay user queries on the document chunks.

  • Expected results: A cluster of user queries lands in a dark zone where no documents exist, indicating that the relevant documents were never uploaded to the vector store.

Pros

  • OpenTelemetry-native; integrates with existing enterprise monitoring stacks.

  • The best tool for visualizing blind spots of the vector store.

Cons

  • Less focused on scoring, more on observing.

  • Can be overkill for small-scale applications or single-agent tools.

Braintrust - The Prompt Regression Safety Net

Braintrust is designed for high-frequency iteration cycles by using cross-model comparison.

It assesses if Model A + Prompt B is mathematically superior to Model C + Prompt D:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}} \tag{5}$$

where E_A is the expected score of configuration A, and R_A and R_B are the performance ratings of two different RAG configurations.
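To make Eq. 5 concrete, here is a minimal sketch of the expected-score calculation in plain Python. The function name `expected_score` and the sample ratings are illustrative assumptions; this is not a Braintrust API.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of configuration A against configuration B (Eq. 5)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# hypothetical ratings for two RAG configurations
e_equal = expected_score(1500, 1500)  # equal ratings
e_ahead = expected_score(1600, 1500)  # A is rated 100 points higher than B

print(f"equal ratings: {e_equal:.3f}")
print(f"A ahead by 100: {e_ahead:.3f}")
```

Because the formula is symmetric, the expected scores of A against B and of B against A always sum to 1, so a higher rating gap translates directly into a stronger preference for one configuration.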

Common Scenario in Practice

  • Situation: An engineering team upgrades the prompt from "Answer the question" (Case A) to a more complex 500-word system instruction (Case B).

  • Problem: Improving the prompt for Case B might accidentally break Case A.

  • Solution: Use Braintrust to create a golden dataset with a set of N perfect examples (e.g., N = 50). Let Braintrust run side-by-side (SxS) comparison every time the team updates a single word in prompt:

import braintrust
from autoevals import Levenshtein

# initialize the project
project = braintrust.init(project="Prompt-Upgrade-Regression")

# define the ground truth dataset (N=50)
dataset = [
    {"input": "What is 2+2?", "expected": "4"},  # case a (simple)
    {"input": "Explain quantum entanglement in the style of a pirate.", "expected": "Arr, particles be linked..."},  # case b (complex)
    ...
]

# evaluate (prompt_case_a / prompt_case_b: the two prompt variants under test)
braintrust.Eval(
    name="Prompt Upgrade SxS",
    data=dataset,
    task=lambda input: {
        "case_a": prompt_case_a(input),  # current prompt
        "case_b": prompt_case_b(input),  # new, complex prompt
    },
    scores=[Levenshtein],
)

  • Expected result: A difference report showing exactly which of the N = 50 golden examples got better or worse.

Pros

  • Extremely fast to test before the deployment.

  • Great UI for non-technical stakeholders to review and grade the output.

Cons

  • Proprietary/SaaS-focused (though they have open-source components).

  • Fewer built-in deep-tech metrics compared to DeepEval or Ragas.

Wrapping Up

When handled with proper evaluation frameworks, RAG can be a competitive tool to provide an LLM context most relevant to the user query.

Implementation Strategy: Mapping Metrics to Failure Points

Although there is no one-size-fits-all solution, Table 3 shows which evaluation tool and metric to apply for each failure point (FP) we covered in this article:

| Failure Point | Evaluation Metric Idea | Feature to Use |
| --- | --- | --- |
| FP1: Missing Content | RAGAS | Faithfulness / Answer Correctness |
| FP2: Missed Ranking | TruLens | Context Recall / Precision |
| FP3: Consolidation | Arize Phoenix | Retrieval Tracing & Latency Analysis |
| FP4: Not Extracted | DeepEval | Faithfulness / Contextual Recall |
| FP5: Wrong Format | DeepEval | G-Eval (Custom Rubric) |
| FP6: Specificity | Braintrust | Manual Grading & Side-by-Side Eval |
| FP7: Incomplete | RAGAS | Answer Relevancy |

Table 3. The Failure Point Mitigation Matrix - Which Tool Solves Which FP?

DeepEval and RAGAS can leverage their faithfulness metrics to measure data integrity failures (FP1, FP4, FP7).

TruLens leverages its context precision / recall to measure the context relevance to the output - effectively assessing FP2.

Arize Phoenix provides a visual trace of the retrieval process, making it easy to see if the document retrieved was lost during the consolidation (FP3).

For UX failures, DeepEval supports custom metrics, while Braintrust excels at comparison against a ground-truth dataset.

Continue Reading

Understanding Vector Databases and Embedding Pipelines

Explore the mechanics of vector databases, text embedding (Dense, Sparse, Hybrid), and similarity metrics like Cosine Similarity with coding examples.

Machine Learning, Deep Learning, Data Science, Python, Agentic AI, LLM

Traditional databases excel at keywords but fail at context.

To bridge the gap between structured storage and neural processing, engineers utilize Vector Databases and Vectorization.

This technical deep-dive explains how unstructured data is transformed into high-dimensional coordinates, explores the mathematical foundations of similarity scoring, and provides practical Python implementations for dense, sparse, and hybrid embedding tactics.

A vector database (DB) is a specialized system designed to store, index, and query vector embeddings: long arrays of numbers that represent the semantic meaning of unstructured data like text, images, audio, and video.

The diagram below illustrates how unstructured data is stored in the vector database:

Figure A. Diagram of the vector database workflow showing raw data ingestion, embedding generation, and storage in a 3D coordinate space (Created by Kuriko IWAI)


For example, in a simplified 3-dimensional space (right, Figure A), unstructured text data:

I am a cat.

becomes a vector embedding:

$$V = [\,0.12,\ -0.98,\ 0.45\,] \tag{a}$$

In real-world models like Gemini or BERT, these vectors are much larger, with 768 or 1,536 dimensions.

Each number represents a specific feature of the sentence's semantic meaning that the AI has learned during the pre-training.

Traditional databases like SQL can only search for exact matches.

So, when a user queries "cats", they'd miss documents on "animals", "pets", or "felines" if those documents don't include the exact word "cats".

Conversely, vector databases can perform similarity searches by finding the vector embeddings closest to the given query.

Because "felines" is considered close to "cats", a similarity search can find documents that mention only "felines" and never "cats".

This is useful for:

  • Semantic search: Finds information based on semantic meaning, not just text.

  • Retrieval-Augmented Generation (RAG): Provides LLMs with long-term memory relevant to the user query.

  • Recommendation engines: Suggests products based on user behavior patterns.

The Vectorization Pipeline

Vectorization is the process of converting unstructured, raw data into vector embeddings to populate a vector database (left → middle, Figure A).

The process involves the following five steps:

  1. Load: Pull data from sources (PDFs, Notion, SQL, Slack).

  2. Clean: Remove noise like headers, footers, or HTML tags.

  3. Chunk: Split long data into smaller pieces.

  4. Embed: Pass the chunks through an embedding model to turn them into lists of vectors.

  5. Index: Store the vectors in a Vector database.

Taking those steps, vectorization dictates how to feed data into the model, which is as important a process as selecting the model itself.
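The five steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names are hypothetical.

```python
import hashlib
import math
import re


def clean(text: str) -> str:
    """Step 2: strip HTML tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()


def chunk(text: str, max_words: int = 8) -> list[str]:
    """Step 3: split into fixed-size word windows (the simplest strategy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Step 4: stand-in embedding (hashed bag of words, L2-normalized)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Step 1: "load" a raw document (hard-coded here for illustration)
raw_docs = ["<p>A cat is sitting outside.</p> <p>The cat watches birds in the garden all day.</p>"]

# Step 5: "index" the chunks as an in-memory list of records
index = []
for raw in raw_docs:
    for piece in chunk(clean(raw)):
        index.append({"text": piece, "vector": toy_embed(piece)})

for record in index:
    print(record["text"], "->", len(record["vector"]), "dims")
```

In a real pipeline, `toy_embed` would be replaced by an embedding model and the list by a vector database; the order of the steps stays the same.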

Embedding has three primary types: Text, Vision, and Audio embeddings.

Text Embedding

Text embedding is the most common strategy to vectorize text data while capturing its semantic meaning and context.

Chunking (Step 3) plays a key role in the process because chunks that are too large confuse the model, while chunks that are too small lose context.

To tackle this challenge, there are two primary chunking strategies:

  • Late chunking: Embed the entire document first, then break it into small chunks so that each chunk retains the context of the whole document.

  • Semantic chunking: Use AI to find natural breaks in the meaning, instead of chunking the fixed number of characters (e.g., every 500 characters).

The primary text embedding methods pair with these chunking strategies as follows:

| Method | Preferred Chunking | Best For | Main Weakness |
| --- | --- | --- | --- |
| Dense | Late | Finding synonyms; maximizing context awareness. | Struggles with exact IDs or rare jargon. |
| Sparse | Semantic | Exact matches, part numbers, names (avoids splitting terms like "Apple / Watch"). | Fails if the user uses different words. |
| Hybrid | Late + Semantic | Production-grade RAG systems; uses semantic boundaries but applies late-chunking logic to keep the document's global context. | More complex and expensive to run. |

Table 1. Comparison of Embedding Methods: Dense, Sparse, and Hybrid.

Vision Embedding

Vision embedding treats images as data points, enabling the direct search for images without labelling:

Figure B. Architecture of a Vision Embedding model (CNN) processing images into patches for vector representation  (Created by Kuriko IWAI)


Its key methods include:

  • Contrastive Language-Image Pre-training (CLIP): Maps images and text into the same vector space to search images with words.

  • Vision Transformers (ViT): Breaks images into patches to process them like text tokens.

Major players like OpenAI (CLIP), Meta (DINOv2), and Google (SigLIP) have developed their own models.

Audio Embedding

Audio embedding is useful for music recommendation, speech recognition, or identifying similar-sounding patterns in industrial sensor data.

Figure C. Process flow of raw audio signals being converted into spectrograms and embedded via an encoder network (source)


Its key methods include:

  • Spectrogram Analysis: Converts sound waves into visual representations and then embeds them.

  • Contrastive Language-Audio Pre-training (CLAP): Similar to CLIP but applies to sound.

Key players include Microsoft (CLAP) and Meta (Audio2Vec).

Measuring Vector Relationships

Once the raw data is vectorized, the model calculates the proximity between vectors using specific mathematical formulas.

Vectors that are closer together share similar semantic, audio, or visual meanings.

The diagram below shows how vector math interprets semantic meaning, using two text embeddings that represent words like "cat" in a two-dimensional space as an example:

Figure D. Geometric representation of vector relationships showing small, wide, and zero-degree angles between concepts (Created by Kuriko IWAI)

When the two vectors represent related concepts (left, Cat vs. Dog, Figure D), the vectors are pointing in a similar direction with a small orientation (angle) θ, indicating that "Cat" and "Dog" share many features (both are pets, mammals, have four legs, etc) in the two-dimensional vector space.

When the two vectors represent unrelated concepts (middle, Cat vs. Crocodile, Figure D), the vectors are pointing in very different directions with a wide θ.

Lastly, when the vectors share the identical concept but with different magnitude (right, Cat vs. Cat Cat Cat, Figure D), both vectors sit on the exact same line (θ = 0), but vector B is much longer.

This represents a situation where the topic is identical, but the magnitude is different (perhaps a document that mentions "Cat" many more times in a similar concept).

But how can we measure the differences between the vectors?

This section explores the three mathematical metrics:

  • Dot Product,

  • Cosine Similarity, and

  • Euclidean Distance.

Dot Product

The dot product is the fundamental operation that measures the relationship between two vectors, considering both their orientations and magnitudes.

The dot product of two vectors A and B in an n-dimensional space is defined as:

$$A \cdot B = \sum_{i=1}^{n} A_i B_i = A_1 B_1 + A_2 B_2 + \cdots + A_n B_n \tag{1.1}$$

where A_i and B_i represent the i-th entry of the vectors A and B, respectively.

Alternatively, when the orientation (angle) θ between the two vectors is known, Eq. 1.1 can be denoted:

$$A \cdot B = \lVert A \rVert \, \lVert B \rVert \cos(\theta) \tag{1.2}$$

where ||A|| and ||B|| represent the magnitudes of the vectors A and B, respectively.

In the case of Figure D, when Vector A (Cat) = [ 2, 2 ], each scenario is measured:


|  | Cat vs. Dog | Cat vs. Crocodile | Cat vs. Cat Cat Cat |
| --- | --- | --- | --- |
| Vector B Coordinates | B = [ 3, 1.5 ]. Similar direction, similar size. | B = [ 4, -2 ]. Different direction, different size. | B = [ 6, 6 ]. Exactly the same direction, but 3x longer. |
| Dot Product Score | 9.0 | 4.0 | 24.0 |
| Similarity Interpretation | Moderate/High | Low | Very High |

Table 2.1. Similarity Metric Comparison (Dot Product)

Cosine Similarity

The cosine similarity focuses purely on the orientation between two vectors A and B, effectively ignoring their magnitudes.

The formula is essentially the dot product (Eq. 1.1) divided by the product of the vectors' magnitudes ||A|| ||B||:

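$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \tag{2}$$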

where A_i and B_i represent the i-th entry of the vector A and B, respectively.

This makes it well suited to text analysis, where word frequency (magnitude) might vary while the context (direction) remains the same.

In the case of Figure D, using the same Vector A & B coordinates, the similarity of each scenario is measured:


|  | Cat vs. Dog | Cat vs. Crocodile | Cat vs. Cat Cat Cat |
| --- | --- | --- | --- |
| Cosine Similarity | 0.95 | 0.32 | 1.00 |
| Similarity Interpretation | High | Low | Perfect Match |

Table 2.2. Similarity Metric Comparison (Cosine Similarity).
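As a worked check, the Cat vs. Dog column can be reproduced by hand from the cosine formula with A = [2, 2] and B = [3, 1.5] (intermediate values rounded to two decimals):

```latex
\begin{aligned}
A \cdot B &= (2)(3) + (2)(1.5) = 9 \\
\lVert A \rVert &= \sqrt{2^2 + 2^2} = \sqrt{8} \approx 2.83 \\
\lVert B \rVert &= \sqrt{3^2 + 1.5^2} = \sqrt{11.25} \approx 3.35 \\
\cos(\theta) &= \frac{9}{\sqrt{8}\,\sqrt{11.25}} = \frac{9}{\sqrt{90}} \approx 0.95
\end{aligned}
```

The numerator is the same 9.0 reported for the dot product; dividing by the two magnitudes is what removes the effect of vector length.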

Euclidean Distance (L2 Norm)

The Euclidean distance (L2 norm) measures the absolute displacement in space between two points, p and q:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{3}$$

where:

  • d: The Euclidean distance (the scalar result).

  • p, q: The two vectors (or points) in n-dimensional space.

  • i: The index of the current dimension calculated (from 1 to n).

  • p_i, q_i: The specific coordinates of vectors p and q at the i-th dimension.

The Euclidean distance is highly sensitive to the magnitude of the vectors because it measures the straight-line gap between the two points, which depends on how far each point sits from the origin.

When taking two vectors pointing in the exact same direction and multiplying the values of one vector by 10, the Euclidean distance will increase significantly because the tip of the second vector is physically much further away in the coordinate system.

This means when comparing two documents such that:

  • p: A short document.

  • q: A much longer version of document p that repeatedly uses the same words.

The Euclidean distance places them far apart, even though the context is similar, simply because q has much higher word counts.

In the case of Figure D, the similarity of each scenario is measured, using the same Vector A & B coordinates:


|  | Cat vs. Dog | Cat vs. Crocodile | Cat vs. Cat Cat Cat |
| --- | --- | --- | --- |
| Euclidean Distance | 1.12 | 4.47 | 5.66 |
| Similarity Interpretation | Closest Match | Unrelated | Unrelated (The Trap) |

Table 2.3. Similarity Metric Comparison (Euclidean distance).

The third scenario (Cat vs Cat Cat Cat) demonstrates the trap of the Euclidean distance because it has the largest gap (even though the meaning is identical) simply because Vector B is much longer.
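The three metrics can be checked against the Figure D vectors with a short, dependency-free sketch. The helper names (`dot`, `cosine`, `euclidean`) are illustrative; in practice a library such as NumPy would typically be used.

```python
import math


def dot(a, b):
    """Dot product (Eq. 1.1)."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    """Euclidean (L2) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


cat = [2, 2]
others = {"Dog": [3, 1.5], "Crocodile": [4, -2], "Cat Cat Cat": [6, 6]}

results = {
    name: (dot(cat, b), round(cosine(cat, b), 2), round(euclidean(cat, b), 2))
    for name, b in others.items()
}
for name, (d, c, e) in results.items():
    print(f"Cat vs. {name}: dot={d}, cosine={c}, euclidean={e}")
```

Rounded to two decimals, the values agree with Tables 2.1 to 2.3: only cosine similarity treats "Cat Cat Cat" as a perfect match, while the dot product rewards its length and the Euclidean distance penalizes it.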

When to Use Which Metric

Given how sensitive these methods are to the scale, each mathematical approach serves a specific use case:

| Metric | Sensitive to Magnitude? | Best Use Case |
| --- | --- | --- |
| Dot Product | Yes | Neural network layers, signal processing. |
| Cosine Similarity | No | Document similarity, recommendation systems. |
| Euclidean Distance | Strongly Yes | Clustering (k-means), physical sensor data. |

Table 2.4. Use Cases by Similarity Metrics.


Deep Dive: Implementing Dense, Sparse, and Hybrid Tactics

As Figure A shows, creating text embeddings involves converting human language into a vector that a computer can reason with.

These tactics range from simple keyword counting to advanced models that understand intent and instructions.

In every case, the system must vectorize the data and the query using the same embedding model and calculate the similarity score:

Figure E. Workflow of computing similarity scores between a query embedding and a document corpus in a RAG pipeline (Created by Kuriko IWAI)


I’ll demonstrate how the embedding methods measure the scores using sample data and query:

  • Dense embedding.

  • Sparse embedding.

  • Hybrid embedding.

Sample data:

# sample corpus (named docs to match the snippets that follow)
docs = [
    "A cat is sitting outside",
    "A dog is playing guitar",
    "The new movie is awesome"
]

query = "Tell me about felines"


Dense Embedding in Action

Dense embedding is the default choice for modern semantic search.

The embedding model uses a small, fixed number of dimensions, all filled with non-zero numbers, to capture broad semantic meaning. No single position maps to a specific word; the combination of numbers represents the meaning.

In the case of Eq. a, each dimension can represent a weight on a scale, such as:

  • Dimension 1 (0.12): Living vs. Non-living (where 1.0 is a human and -1.0 is a rock).

  • Dimension 2 (-0.98): Size (where -1.0 is tiny and 1.0 is massive).

  • Dimension 3 (0.45): Domesticity (where 1.0 is a pet and -1.0 is a wild predator).

In this scenario, the first dimension 0.12 suggests the subject is "somewhat living/animate" but perhaps not as high-ranking as a human in the model's hierarchy.

Common tactics involve:

  • Bi-Encoder.

  • Instruction Tuning.

  • Late Interaction.

Bi-Encoder

The bi-encoder is the most common method for text embedding.

It encodes the query and the document separately. It's fast because you can pre-calculate the document vectors.

from sentence_transformers import SentenceTransformer, util

# load bi-encoder model
model = SentenceTransformer('all-MiniLM-L6-v2')

# encode (docs and query come from the sample data above)
doc_emb = model.encode(docs)
query_emb = model.encode(query)

# compute cosine similarity
scores = util.cos_sim(query_emb, doc_emb)


doc_emb - The vectorized docs look like the following:

[ [ 0.1230927 -0.00072824 0.04190801 ... 0.03736668 -0.03583647 0.06841106 ]
[ 0.02273473 -0.02657051 0.03814451 ... 0.03001389 0.09356179 0.02145582 ]
[ -0.10044324 -0.07739273 -0.00137412 ... -0.00104974 0.07181141 0.02205478 ] ]

Results:

  1. A cat is sitting outside: 0.2984

  2. A dog is playing guitar: 0.1693

  3. The new movie is awesome: 0.1015

Instruction Tuning

Instruction tuning is the process of fine-tuning a pre-trained model on a dataset of explicit instructions (e.g., "Summarize the document," or "Translate to French").

The instruction tells the model how to shape the vector for the task at hand, so that the model's output aligns with the user's specific intent rather than just predicting the next most likely word. Open-source models hosted on Hugging Face, such as E5 or BGE, support this pattern.

The following snippet uses BGE:

from sentence_transformers import SentenceTransformer, util

# load bge model
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# bge instruction
instruction = "Represent this sentence for searching relevant passages: "

# encode (with instruction)
query_emb = model.encode(instruction + query, normalize_embeddings=True)
doc_embs = model.encode(docs, normalize_embeddings=True)

# compute cosine similarity
scores = util.cos_sim(query_emb, doc_embs)[0]


Results:

  1. A cat is sitting outside: 0.4718

  2. A dog is playing guitar: 0.4010

  3. The new movie is awesome: 0.3493

Late Interaction (ColBERT)

Instead of one vector per document, Late Interaction keeps multiple vectors (one per token), allowing for much more granular matching.

from ragatouille import RAGPretrainedModel

# load model
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# colbert creates an index to store token-level embeddings
RAG.index(
    collection=docs,
    index_name="sample_index",
    max_document_length=180,
    split_documents=False
)

# search index
results = RAG.search(query=query, k=3)


Results:

  1. A cat is sitting outside: 14.03

  2. A dog is playing guitar: 9.33

  3. The new movie is awesome: 5.30

Developer Note: RAGatouille

RAGatouille is one of the common libraries for using ColBERT in Python-based RAG pipelines.

It simplifies the complex original Stanford implementation into a few lines of code for indexing and retrieval, and integrates with major AI frameworks like LangChain and LlamaIndex.

It is currently transitioning from the original Stanford ColBERT backend to PyLate for better compatibility.

Performance Summary

In all three cases, the sentence 'A cat is sitting outside' achieved the highest score, as it was identified as the closest match to the query embedding.

However, their practical applications diverge:

| Tactic | Speed | Storage Cost | Accuracy | Primary Use Case |
| --- | --- | --- | --- | --- |
| Bi-Encoder | Blazing Fast | Low | Good | Initial retrieval from massive datasets (millions of docs) where speed is the priority. |
| Instruction Tuning | Fast | Low | Very Good | Task-specific search (e.g., "find code snippets" vs "summarize") where intent matters. |
| Late Interaction | Moderate | High | Excellent | High-precision reranking or complex queries where term-level interaction is needed. |

Table 3. Performance Matrix: Bi-Encoder vs. Instruction Tuning vs. Late Interaction.


Sparse Embedding in Action

Sparse embedding is a vector representation where most values are zero.

Unlike dense embedding, sparse embedding maps each specific token or keyword to a non-zero value, making it highly interpretable and excellent for exact keyword retrieval.
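To illustrate the idea, the sketch below builds the simplest possible sparse vectors (raw term counts over the corpus vocabulary) for the sample data. It is an assumption-laden toy, not BM25 or SPLADE, and the helper name `to_sparse_vector` is hypothetical.

```python
docs = [
    "A cat is sitting outside",
    "A dog is playing guitar",
    "The new movie is awesome",
]
query = "Tell me about felines"

# one dimension per distinct token in the corpus
vocab = sorted({tok for doc in docs for tok in doc.lower().split()})


def to_sparse_vector(text):
    """Term-count vector over the corpus vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]


doc_vecs = [to_sparse_vector(d) for d in docs]
query_vec = to_sparse_vector(query)

# dot product between the query and each document
scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]

print(vocab)
print(doc_vecs[0])
print(scores)
```

Each position corresponds to one readable token, which is what makes the representation interpretable. It also shows the weakness: none of the query's words occur in the corpus vocabulary, so the query vector is all zeros and every document scores zero.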

Common tactics involve:

  • Best Matching 25 (BM25).

  • Sparse Lexical and Expansion (SPLADE).

Best Matching 25 (BM25)

Best Matching 25 (BM25) is the industry standard for keyword search.

It ranks documents based on the appearance of query terms, adjusting for document length.

In the following code snippet, I added "A cat" and "cat" as extra queries:

from rank_bm25 import BM25Okapi

# bm25 works on words (tokens) not vectors
tokenized_corpus = [doc.lower().split(" ") for doc in docs]

# initialize bm25
bm25 = BM25Okapi(tokenized_corpus)

queries = ["Tell me about felines", "A cat", "cat"]
for query in queries:
    tokenized_query = query.lower().split(" ")

    # score each doc
    doc_scores = bm25.get_scores(tokenized_query)

    # get top 3
    top_n = bm25.get_top_n(tokenized_query, docs, n=3)


Results

Query: Tell me about felines

  • A cat is sitting outside: 0.0000

  • A dog is playing guitar: 0.0000

  • The new movie is awesome: 0.0000

Query: A cat

  1. A cat is sitting outside: 0.5661

  2. A dog is playing guitar: 0.0552

  3. The new movie is awesome: 0.0000

Query: cat

  1. A cat is sitting outside: 0.5108

  2. The new movie is awesome: 0.0000

2(tie). A dog is playing guitar: 0.0000

For the query "A cat", the score of the Rank 2 sentence, "A dog is playing guitar", is far lower than the score of the Rank 1 sentence, "A cat is sitting outside".

This is because BM25 uses a specific formula to ensure common words (like "the" or "is") don't drown out rare, important words (e.g., "felines").

It balances:

  • Term Frequency (TF): How many times does the word appear? The more, the better.

  • Inverse Document Frequency (IDF): Is this word rare in the whole collection? Rare words like "feline" get more points than "the".

  • Document length normalization: Penalizes very long documents so that a 500-page book doesn't win just because it happens to repeat the keyword by accident.
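The three ingredients above can be sketched from scratch. This is a minimal illustration of the textbook Okapi BM25 formula, assuming the common defaults k1 = 1.5 and b = 0.75 and a non-negative (Lucene-style) IDF; `rank_bm25` uses a different IDF variant, so absolute scores will not match the results shown earlier.

```python
import math

docs = [
    "A cat is sitting outside",
    "A dog is playing guitar",
    "The new movie is awesome",
]
tokenized = [d.lower().split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N  # average document length


def idf(term):
    """Inverse document frequency: rarer terms get a larger weight."""
    n = sum(1 for d in tokenized if term in d)
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_score(query, doc_tokens, k1=1.5, b=0.75):
    """Sum over query terms of IDF x saturated, length-normalized TF."""
    score = 0.0
    for term in query.lower().split():
        tf = doc_tokens.count(term)  # term frequency in this doc
        length_norm = 1 - b + b * len(doc_tokens) / avgdl
        score += idf(term) * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


scores_cat = [bm25_score("cat", d) for d in tokenized]
scores_felines = [bm25_score("felines", d) for d in tokenized]

print(scores_cat)
print(scores_felines)
```

The query "felines" still scores zero everywhere because its term frequency is zero in every document, which is the vocabulary mismatch that keyword-only search cannot overcome.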

Sparse Lexical and Expansion (SPLADE)

Sparse Lexical and Expansion (SPLADE) is an AI-powered BM25 where a neural network adds weight to latent words to expand a source document with related keywords.

Specifically, SPLADE takes the following two steps:

  1. Sparsity: Like BM25, transforms a sentence into a sparse vector - caring only about specific words to make it efficient for standard search engines like Elasticsearch.

  2. Expansion: Adds weight to latent words to fix the vocabulary mismatch problem.

For example, if a document says "A cat is sitting outside," SPLADE internally adds weights for words like "feline," "pet," or "animal," even if they don't appear in the text.

This can improve the precision of keyword search while keeping the intelligence of neural search.

I'll use the splade-v3 model from Naver Labs (creator of SPLADE) for demonstration:

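import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# load model
model_id = "naver/splade-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# helper (not shown in the original snippet; assumed here to follow the
# standard SPLADE pooling: max over tokens of log(1 + ReLU(logits)))
def get_splade_vector(text):
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**tokens).logits  # (1, seq_len, vocab_size)
    weights = torch.log1p(torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1)
    return weights.max(dim=1).values.squeeze(0)  # (vocab_size,)

# encode
query_vec = get_splade_vector(query)
doc_vecs = [get_splade_vector(d) for d in docs]

# calc similarity score (dot product)
scores = []
for i, d_vec in enumerate(doc_vecs):
    score = torch.dot(query_vec, d_vec).item()
    scores.append((docs[i], score))
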

Results

  1. A cat is sitting outside: 9.4954

  2. A dog is playing guitar: 0.1755

  3. The new movie is awesome: 0.0000

SPLADE successfully identifies the sentence most similar to the query: A cat is sitting outside.

Comparison of BM25 vs. SPLADE

So, here is the comparison of BM25 and SPLADE:

| Feature | BM25 | SPLADE |
| --- | --- | --- |
| Input | Raw text | Neural-weighted tokens |
| Expansion | No (only exact words) | Yes (adds synonyms/related terms) |
| Infrastructure | Standard CPU / Database | Requires GPU for encoding |
| Vector Type | Sparse | Enriched sparse |

Table 4. Comparison of BM25 and SPLADE.

Hybrid Embedding in Action

Relying on just one type of vector sometimes fails in production.

For example, dense vectors are great at finding "adorable kittens" when searching for "cute cats", but might fail to find a specific character like "Marie" from the Aristocats (even though she is an adorable kitten):

Figure. Marie from the Aristocats (Disney)

Hybrid embedding can avoid this challenge by combining dense and sparse embeddings.

Its common methods involve:

  • Reciprocal Rank Fusion (RRF).

  • Reranking

Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion (RRF) only cares about the rank among the sentences, rather than the absolute scores:

$$\text{RRFscore}(d) = \sum_{r \in R} \frac{1}{k + r(d)} \tag{4}$$

Where:

  • R is the set of ranked lists, and r(d) is the rank of document d in ranked list r, and

  • k is a constant (usually k = 60) to prevent top-ranked documents from overwhelming the rest.

I'll use Bi-Encoder and SPLADE for demonstration:

# docs sorted by rank for each retriever
rank_list_biencoder = [
    'A cat is sitting outside',
    'The new movie is awesome',
    'A dog is playing guitar'
]
rank_list_splade = [
    'A cat is sitting outside',
    'A dog is playing guitar',
    'The new movie is awesome'
]

# compute rrf scores for each doc across both rank lists
k = 60
scores = {}
for rank_list in (rank_list_biencoder, rank_list_splade):
    for rank, doc in enumerate(rank_list):
        scores[doc] = scores.get(doc, 0) + 1 / (k + (rank + 1))


Results

  1. A cat is sitting outside: 0.5000

  2. A dog is playing guitar: 0.0366

2(tie). The new movie is awesome: 0.0366

Bi-encoder excels at understanding that “feline” and “cat” are related even if the words don’t match.

On the other hand, SPLADE excels at identifying specific relevant words.

RRF acts as the referee; if a document appears high in both lists, its RRF score will skyrocket. If it only appears in one, it stays in the middle.

RRF yielded the highest score for the sentence "A cat is sitting outside" because the sentence tops the rank lists of both the bi-encoder and SPLADE.

Reranking

Reranking is a method where the system applies a quick dense search to get the top results first, and then passes the top results through cross-encoders to give a final score.

Cross-encoders examine the query and the document together, and are trained to rank items rather than to output a percentage of correctness.

A negative score means the model thinks the document is unlikely to be a perfect match, while a positive score means the model is very confident it's a match.

For demonstration, I'll use the CrossEncoder module from the sentence_transformers library:

from sentence_transformers import CrossEncoder

# load the model
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# create query-doc pairs
# (assumes `query` is a string and `docs` is a list of strings defined beforehand)
sentence_pairs = [[query, doc] for doc in docs]

# score each pair; a higher score means a closer match
scores = model.predict(sentence_pairs)

Results

  1. A cat is sitting outside: -10.0513

  2. A dog is playing guitar: -10.9967

  3. The new movie is awesome: -11.3669

Although all the sentences are recognized as "unlikely to be a perfect match", the sentence "A cat is sitting outside" is still the closest match among all.

Wrapping Up

Vector databases and vectorization are key to handling unstructured data.

They can enable semantic search and provide long-term memory for LLMs through RAG.

The Storage Landscape: Choosing Your Vector Storage Tier

If you are looking for where to store these vectors, the market is split into four camps:

| Category | Key Players | Best For... |
| --- | --- | --- |
| Vector-Native | Pinecone, Weaviate, Milvus, Qdrant | High-performance, specialized AI applications with massive scale. |
| Cloud Providers | AWS OpenSearch, Google Vertex AI Search, Azure AI Search | You are locked into a specific cloud ecosystem. |
| Traditional/SQL | pgvector (PostgreSQL), Supabase, Oracle | Keep an existing database, and add vector capabilities. |
| NoSQL/Document | MongoDB Atlas Vector Search, Cassandra, Redis | Real-time applications. Keeps JSON-like structures. |

Table 5. Market Analysis: Vector-Native vs. Traditional Database Providers.

Continue Reading

How to Design a Production-Ready RAG System (Architecture + Tradeoffs) (2026 Edition)

Master industry-standard RAG architectures and how to architect an optimal RAG pipeline, balancing cost, latency, and precision.

Machine Learning, Deep Learning, Agentic AI

Vector search alone is no longer enough for enterprise AI.

While a simple NaiveRAG works for basic FAQs, complex reasoning and multi-document synthesis require specialized pipelines.

This guide dissects the six primary RAG architectures—including GraphRAG and Agentic RAG—and provides a rigorous decision framework to help you choose the right stack for your data’s complexity, reliability requirements, and budget.

Retrieval-Augmented Generation (RAG) is a technique to retrieve documents from a knowledge base and use them to generate more relevant answers.

The diagram below illustrates its workflow:

Figure A. Standard RAG workflow diagram showing the interface between document storage, vector retrieval, and LLM generation (Created by Kuriko IWAI)

The core concept of RAG is to combine information retrieval (Storage to Retrieval, Figure A) with generative AI.

RAG helps language models stay grounded in factual information by injecting relevant context into the prompt.

For example, RAG can scan pages of domain-specific documents in seconds to enable LLMs to answer a user query with accuracy and speed.

How RAG Works - The 3-Stage RAG Pipeline

A common RAG workflow is split into two distinct phases:

  • Phase 1. The offline phase: Ingests raw data to prepare searchable structured data.

  • Phase 2. The online phase: Retrieves the relevant data and answers the query.

I'd also add a final Phase 3: a feedback loop where the retrieved context is evaluated continuously.

Phase 1. The Offline Phase - Ingestion Pipeline

The first phase is to turn raw files into structured, searchable data in a database.

The process involves:

  1. Load: Pull data from sources (PDFs, Notion, SQL, Slack) using LlamaIndex or LangChain.

  2. Clean: Remove noise like headers, footers, or HTML tags.

  3. Chunk: Split long documents into smaller, meaningful pieces (e.g., 500 words each). If chunks are too big, the AI gets confused; too small, and it loses context.

  4. Embed: Pass those chunks through an embedding model (e.g., OpenAI's text-embedding-3-small) to turn each chunk of text into a vector (a list of numbers).

  5. Index: Store those vectors in a Vector Database (e.g., Pinecone, Chroma) to perform semantic search using vectors (meaning).
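The five steps above can be sketched end to end in plain Python. This is a minimal illustration rather than a production pipeline: `toy_embed` is a hypothetical hashing-based stand-in for a real embedding model, the "index" is just a Python list instead of a vector database, and the chunk size and overlap values are arbitrary assumptions.

```python
import hashlib
import math
import re

DIM = 64  # toy embedding size (assumption; real models use far more dimensions)

def clean(text: str) -> str:
    """Clean: strip HTML tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Chunk: split into windows of `size` words that overlap by `overlap` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]

def toy_embed(text: str) -> list[float]:
    """Embed: hash each word into one of DIM buckets, then L2-normalize.
    Hypothetical stand-in for a real embedding model."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def ingest(raw_docs: list[str]) -> list[dict]:
    """Load -> Clean -> Chunk -> Embed -> Index (here: an in-memory list)."""
    index = []
    for doc_id, raw in enumerate(raw_docs):
        for chunk_id, piece in enumerate(chunk(clean(raw))):
            index.append({
                "doc_id": doc_id,
                "chunk_id": chunk_id,
                "text": piece,
                "vector": toy_embed(piece),
            })
    return index

index = ingest(["<p>To reset your password, open Settings and choose Security.</p>"])
print(index[0]["text"])
```

Swapping `toy_embed` for a real embedding call and the list for a vector database client would keep the same overall shape.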

Phase 2. The Online Phase - Retrieval & Generation Pipeline

When a user asks a question, this process happens in milliseconds.

  1. Query transformation: The system takes the user's question (e.g., "How do I reset my password?") and turns it into a vector using the same embedding model from Phase 1.

  2. Retrieval: The system looks into the Vector Database to find the top 3–5 chunks mathematically closest to the user's question.

  3. Reranking: A reranker model double-checks the results to ensure the most relevant piece is at the very top.

  4. Augmentation: The system stuffs the retrieved chunks into a prompt for the LLM.

    Prompt Example: "You are a helpful assistant. Use the following pieces of context to answer the user's question. Context: [Chunk 1], [Chunk 2]. Question: How do I reset my password?"

  5. Generation: The LLM reads the prompt + context and writes a natural language answer based only on that given data.
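Here is a rough sketch of the query transformation, retrieval, and augmentation steps. It assumes a toy bag-of-words `embed` function in place of the Phase 1 embedding model, skips the reranking step, and stops at building the prompt, since the generation step depends on whichever LLM client you use. All function names and sample chunks are illustrative.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Embed the query and return the top_k most similar chunks."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Augmentation: place the retrieved chunks into the prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "You are a helpful assistant. Use the following pieces of context "
        "to answer the user's question.\n"
        f"Context:\n{joined}\n"
        f"Question: {query}"
    )

chunks = [
    "To reset your password, open Settings and choose Security.",
    "Invoices are emailed on the first day of each month.",
    "Password resets require access to your registered email.",
]
query = "How do I reset my password?"
top = retrieve(query, chunks)
prompt = build_prompt(query, top)
print(prompt)
```

The invoice chunk shares no words with the query, so it is left out of the prompt.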

Phase 3. The Feedback Loop Pipeline

Lastly, the retrieval results are constantly evaluated to avoid hallucination.

  • Evaluation: Score if the answer was actually based on the context.

  • Observability: See exactly which chunk caused a wrong answer.

This process will allow us to fix the chunking strategy in Phase 1.
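As a very rough illustration of the evaluation step, the sketch below scores how much of an answer is covered by its retrieved chunks and flags weakly supported answers for review. The token-overlap metric is a crude proxy I'm assuming for demonstration; dedicated tools use far more robust, model-based checks. The log entries and the 0.5 threshold are made up.

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def groundedness(answer: str, context: list[str]) -> float:
    """Fraction of answer tokens that also appear in the retrieved context."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(tokens(c) for c in context))
    return len(answer_tokens & context_tokens) / len(answer_tokens)

def flag_for_review(log: list[dict], threshold: float = 0.5) -> list[dict]:
    """Observability: return interactions whose answers look weakly supported."""
    return [row for row in log if groundedness(row["answer"], row["chunks"]) < threshold]

# hypothetical interaction log
log = [
    {"answer": "Open Settings and choose Security.",
     "chunks": ["To reset your password, open Settings and choose Security."]},
    {"answer": "Call the billing hotline.",
     "chunks": ["Invoices are emailed on the first day of each month."]},
]
flagged = flag_for_review(log)
print(flagged)
```

Each flagged row keeps its chunks, so it is possible to trace a weak answer back to the chunk that caused it.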

Tooling Landscape: From Vector DBs to Observability

RAG tools are the building blocks that connect your data to powerful language models to deliver accurate results.

RAG tools are categorized based on the pipeline they support.

Choosing the right tool depends on what Phase we are working on.

Here are common RAG tools by Phase and key actions:

| Phase | Step | What Happens? | Example Tools |
| --- | --- | --- | --- |
| Phase 1 | Ingestion | Documents → Chunks | LlamaIndex |
| Phase 1 | Storage | Chunks → Vectors in DB | Pinecone, Chroma |
| Phase 2 | Retrieval | Question → Relevant Chunks | Weaviate, Meilisearch |
| Phase 2 | Generation | Chunks (Context) + Prompt → Answer | GPT-5, Claude 4 |
| Phase 3 | Evaluation | Check for accuracy | Ragas, DeepEval |
| Phase 3 | Observability | Check for root cause (wrong chunk) | LangSmith, Arize Phoenix |

Table 1: The RAG Tech Stack Categorized by Pipeline Phase.

Developer Note: Is Vector DB Necessary?

Storing vectors in a dedicated database is not strictly necessary to build a RAG system. Two common alternatives are:

  1. Zero-DB RAG: Only a few documents are available.

→ Turns the docs into chunks and stores them in RAM (memory).

  • Pros: Fast, free, zero setup.

  • Cons: The memory is wiped when the app restarts. Slow to handle large docs.

  2. Traditional DB: Rule-based keyword search.

→ Stores text chunks in a standard database like PostgreSQL. The system only looks for exact matching words.

  • Pros: Good for finding unique phrases (e.g., a person's name, ID).

  • Cons: No semantic understanding (e.g., a user asks about "mammals", the document says "dogs", and the keyword search misses the document).
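The trade-off of exact keyword matching is easy to reproduce. The sketch below keeps two made-up documents in a plain dict (so it also doubles as a tiny in-memory store) and uses a hypothetical `keyword_search` helper that requires every query word to appear verbatim.

```python
import re

# hypothetical in-memory document store
docs = {
    "doc-1": "Dogs need daily walks and regular vet visits.",
    "doc-2": "Employee ID 48213 belongs to Jane Doe.",
}

def keyword_search(query: str, store: dict[str, str]) -> list[str]:
    """Return ids of documents that contain every query word exactly."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    hits = []
    for doc_id, text in store.items():
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if all(term in words for term in terms):
            hits.append(doc_id)
    return hits

print(keyword_search("48213", docs))    # exact identifier: found
print(keyword_search("mammals", docs))  # related concept, different word: missed
```

The ID lookup succeeds, while the "mammals" query returns nothing even though doc-1 is about dogs.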

Comparative Analysis: 6 Industry-Standard RAG Architectures

Different types of RAG architecture exist because no single setup works well in every situation.

Some tasks require speed and simplicity, while others call for deeper analysis, multiple sources, or even different types of input, such as images or graphs.

This section introduces six major RAG architectures and their common applications:

  • Naive RAG

  • Advanced RAG

  • Modular RAG

  • Corrective RAG (CRAG)

  • GraphRAG

  • AgenticRAG

Naive RAG

Naive RAG is the simplest form of RAG. It pulls documents based on the user query and passes them straight to the model without making any adjustments.

Figure B. Naive RAG architecture diagram illustrating simple top-K vector similarity matching (Created by Kuriko IWAI)

NaiveRAG leverages a simple matching algorithm.

It converts the query to a vector, pulls the top-K similar chunks from a vector DB, and feeds them to the LLM.

  • Search method: Vector Similarity (Semantic Search).

  • Complexity: Low.

Pros:

  • Fastest response times and lowest computational cost.

  • Extremely easy to set up with standard libraries like LangChain, LlamaIndex.

  • Effective for basic fact retrieval from clean documents.

Cons:

  • High risk of noise - retrieving irrelevant chunks that confuse the LLM.

  • Struggles with complex or multi-part questions.

  • No self-correction; if the retrieval fails, the answer will be a hallucination.

Best for:

  • Simple Q&A on small, clean datasets.

Common applications:

  • Personal document Q&A.

  • Internal company FAQs.

  • Simple chat-with-your-document apps.

Advanced RAG

Advanced RAG adds sophisticated logic like query routing or reranking before and after the retrieval step to get more accurate results:

Figure C. Advanced RAG architecture showing pre-retrieval query transformation and post-retrieval reranking steps (Created by Kuriko IWAI)

AdvancedRAG works by layering several RAG techniques. In the pre-retrieval step, it rewrites the query to make it more straightforward. In the post-retrieval step, it reranks the results and checks whether they make sense, so that the generated response is as relevant and accurate as possible.

  • Search method: Hybrid Search (Vector + Keyword) + Reranking.

  • Complexity: Medium.
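To make the layering concrete, here is a minimal sketch of the retrieval side: rewrite the query, run two rankers, fuse their lists with RRF, then rerank the fused candidates. Every component is a toy stand-in that I'm assuming for illustration: `keyword_ranker` plays the keyword search, `trigram_ranker` plays the vector search, and `overlap_score` plays the reranker. In practice these would be replaced by real retrievers and a cross-encoder.

```python
import re
from typing import Callable

Ranker = Callable[[str, list[str]], list[str]]  # returns docs ordered best-first

def rewrite(query: str) -> str:
    """Pre-retrieval: normalize the query (lowercase, drop punctuation)."""
    return re.sub(r"[^a-z0-9 ]", "", query.lower()).strip()

def keyword_ranker(query: str, docs: list[str]) -> list[str]:
    q = set(query.split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)

def trigrams(text: str) -> set[str]:
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def trigram_ranker(query: str, docs: list[str]) -> list[str]:
    q = trigrams(query)
    return sorted(docs, key=lambda d: len(q & trigrams(d)), reverse=True)

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Hybrid search: merge best-first rankings with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def overlap_score(query: str, doc: str) -> float:
    """Post-retrieval: toy reranker scoring the share of doc words in the query."""
    q = set(query.split())
    words = set(doc.lower().split())
    return len(q & words) / len(words)

def advanced_retrieve(query: str, docs: list[str], rankers: list[Ranker], top_k: int = 2) -> list[str]:
    rewritten = rewrite(query)
    fused = rrf_fuse([ranker(rewritten, docs) for ranker in rankers])
    candidates = fused[: top_k * 2]
    reranked = sorted(candidates, key=lambda d: overlap_score(rewritten, d), reverse=True)
    return reranked[:top_k]

docs = [
    "A cat is sitting outside",
    "The new movie is awesome",
    "A dog is playing guitar",
]
result = advanced_retrieve("Where is the CAT sitting??", docs, [keyword_ranker, trigram_ranker])
print(result)
```

Each extra stage is another call per query, which is where the added latency and cost listed below come from.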

Pros:

  • Handles complex questions better

  • Smart enough to know which approach works best for different situations

  • Offers more control over how results are generated

Cons:

  • Higher latency (reranking takes extra time).

  • More expensive to run due to multiple model calls per query.

  • More moving parts to debug and maintain.

  • Requires fine-tuning to ensure all parts work together effectively.

Best for:

  • Systems that require high fidelity where making mistakes is not an option.

Common applications:

  • Professional knowledge base.

  • Customer support bot.

Modular RAG

ModularRAG leverages a plug-and-play architecture where different modules (pink boxes, Figure D) handle different parts of the workflow:

Figure D. Modular RAG architecture highlighting plug-and-play components like Search and Memory modules (Created by Kuriko IWAI)

ModularRAG works by breaking the system into separate components like Search Module or Memory Module, allowing us to customize each part without rebuilding the entire system.

For example, one can swap in a new retriever, a better reranker, or a different generator as a component.

  • Search method: Multi-source retrieval (API, Database, Web).

  • Complexity: High (Requires a sophisticated orchestration layer).
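One way to picture the plug-and-play idea is with small interfaces that each module implements. The sketch below uses `typing.Protocol` for this; the class names (`KeywordRetriever`, `StaticRetriever`, `EchoGenerator`, `RagPipeline`) are hypothetical, and the generator is a stub rather than a real LLM call.

```python
from dataclasses import dataclass
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str) -> list[str]: ...

class Generator(Protocol):
    def generate(self, query: str, context: list[str]) -> str: ...

@dataclass
class KeywordRetriever:
    """Search module backed by a simple keyword match."""
    docs: list[str]

    def retrieve(self, query: str) -> list[str]:
        terms = set(query.lower().split())
        return [d for d in self.docs if terms & set(d.lower().split())]

@dataclass
class StaticRetriever:
    """Hypothetical replacement module, e.g. a wrapper around an external API."""
    results: list[str]

    def retrieve(self, query: str) -> list[str]:
        return list(self.results)

class EchoGenerator:
    """Stand-in for an LLM call: reports how many chunks it was given."""
    def generate(self, query: str, context: list[str]) -> str:
        return f"Answer to '{query}' using {len(context)} chunk(s)"

@dataclass
class RagPipeline:
    retriever: Retriever
    generator: Generator

    def run(self, query: str) -> str:
        return self.generator.generate(query, self.retriever.retrieve(query))

pipeline = RagPipeline(
    KeywordRetriever(["reset password in settings", "monthly invoices"]),
    EchoGenerator(),
)
first = pipeline.run("reset password")

# swap the search module without touching the rest of the pipeline
pipeline.retriever = StaticRetriever(["chunk from Jira", "chunk from Slack"])
second = pipeline.run("reset password")
print(first, second, sep="\n")
```

Because `RagPipeline` only depends on the two interfaces, replacing a module is a one-line change.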

Pros:

  • Easy to optimize each component - great for customizing workflow.

  • Easy to upgrade or replace components without starting from scratch.

Cons:

  • Very high setup cost and architectural complexity. Needs thorough planning in advance.

  • Requires a strong engineering team to manage the orchestration layer.

  • Potential for integration headaches between different module versions.

Best for:

  • Complex enterprise systems which require deep customization.

Common applications:

  • Enterprise AI assistants (checking multiple sources like Jira, Slack, and Google Drive simultaneously).

Corrective RAG (CRAG)

Corrective RAG (CRAG) is designed to double-check its answers and correct them if something is wrong:

Figure E. Corrective RAG (CRAG) flow displaying the evaluation step and fallback web search logic (Created by Kuriko IWAI)

In CRAG, an evaluator model in the system scores retrieved documents.

If the score is too low, the system ignores the internal DB and triggers a web search to find the correct answer elsewhere (dashed arrow, Figure E).

  • Search method: Evaluated retrieval + fallback web search.

  • Complexity: High (Involves logic-based branching and external API triggers).
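The branching logic can be sketched in a few lines. This is a simplified reading of the flow, assuming a single threshold and a single fallback. The retriever, evaluator, and web search are passed in as plain functions, and the ones used here (`overlap_score`, `internal_db`, `web`) are toy placeholders, not a real evaluator model or search API.

```python
from typing import Callable

def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    evaluate: Callable[[str, str], float],
    web_search: Callable[[str], list[str]],
    threshold: float = 0.5,
) -> tuple[list[str], str]:
    """Return (documents, source), where source is 'internal' or 'web'."""
    docs = retrieve(query)
    good = [d for d in docs if evaluate(query, d) >= threshold]
    if good:
        return good, "internal"
    # every internal document scored too low: fall back to web search
    return web_search(query), "web"

def overlap_score(query: str, doc: str) -> float:
    """Toy evaluator: share of query words found in the document."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q) if q else 0.0

def internal_db(query: str) -> list[str]:
    return ["dosage guidance for drug a in adults"]

def web(query: str) -> list[str]:
    return [f"web result for: {query}"]

hit = corrective_retrieve("drug a dosage", internal_db, overlap_score, web)
miss = corrective_retrieve("drug b interactions", internal_db, overlap_score, web)
print(hit)
print(miss)
```

This version evaluates once and falls back once, so it cannot loop. A design that re-evaluates the fallback results would need an explicit retry limit.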

Pros:

  • Fixes poor search results before a user sees them.

  • Improves the reliability and accuracy of generated responses by adding an extra layer of quality controls.

Cons:

  • The fallback web search takes longer and consumes more computational resources.

  • Can get stuck in loops if it is never satisfied with what it finds.

Best for:

  • High-stakes tasks where wrong/outdated information is strictly prohibited.

Common applications:

  • Medical research.

  • Legal research.

GraphRAG

GraphRAG uses a knowledge graph to structure the relationships between pieces of information rather than just text similarity:

Figure F-1. GraphRAG architecture showing knowledge graph traversal and community summary generation (Created by Kuriko IWAI)

After creating the knowledge graph, GraphRAG traverses the graph to find patterns between pieces of data rather than just matching words, so that it can find how Entity A is related to Entity B, even if they are mentioned in different documents.

  • Search method: Knowledge Graph Traversal + Community Summary.

  • Complexity: Very High (Requires building and maintaining a structured graph database).

Pros:

  • Great for complex questions connecting multiple concepts. Prevents scattered answers.

  • Can provide unexpected but relevant responses by connecting dots.

Cons:

  • Requires significant work to build a knowledge graph.

  • Slower than basic RAG systems.

  • The quality of the knowledge graph sets the performance cap. The system is only as good as the connections in the knowledge graph.

Best for:

  • Understanding the big picture.

  • Complex, multi-hop reasoning across multiple data sources.

Common applications:

  • Investigative journalism (e.g., Fraud detection).

  • Drug discovery.

Figure F-2. LLM-generated knowledge graph built from a private dataset using GPT-4 Turbo (source).

AgenticRAG

AgenticRAG is a dynamic RAG where an AI agent (blue box, Figure G) acts as a coordinator to plan, retrieve, and refine the response:

Figure G. AgenticRAG workflow featuring an AI coordinator agent using tools for multi-step reasoning (Created by Kuriko IWAI)

Instead of just retrieving the first relevant documents, AgenticRAG plans its approach, decides what to investigate, and then takes action using associated tools.

Agentic RAG works by breaking down a task into smaller steps.

It searches various data sources for information relevant to the given query, and then checks whether the information answers the query. If not, AgenticRAG keeps searching for relevant information.
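The plan, search, and check loop might look roughly like this. In a real system, `plan` and `is_sufficient` would typically be LLM calls and the tools would wrap real data sources; here they are all hypothetical stubs, and `max_steps` is an assumed safety cap so the agent cannot search forever.

```python
from typing import Callable

def agentic_rag(
    query: str,
    plan: Callable[[str], list[str]],
    tools: dict[str, Callable[[str], list[str]]],
    is_sufficient: Callable[[str, list[str]], bool],
    max_steps: int = 5,
) -> tuple[list[str], list[str]]:
    """Return (evidence, trace) after searching sources until the check passes."""
    evidence: list[str] = []
    trace: list[str] = []
    for tool_name in plan(query)[:max_steps]:
        evidence.extend(tools[tool_name](query))  # act: call the chosen tool
        trace.append(tool_name)
        if is_sufficient(query, evidence):        # check: stop once the query is covered
            break
    return evidence, trace

# hypothetical stubs for demonstration
def plan(query: str) -> list[str]:
    return ["case_law", "regulations", "news"]

tools = {
    "case_law": lambda q: ["precedent X"],
    "regulations": lambda q: ["rule Y"],
    "news": lambda q: ["headline Z"],
}

def is_sufficient(query: str, evidence: list[str]) -> bool:
    return len(evidence) >= 2

evidence, trace = agentic_rag("Is clause 4 enforceable?", plan, tools, is_sufficient)
print(evidence, trace)
```

The agent stops after the second tool because the sufficiency check passes, so the third source is never queried.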

Pros:

  • Good for multi-step reasoning.

  • Intelligent decision-making about information gathering.

  • Can improve performance on complex queries.

Cons:

  • Costs more to run due to multiple searches.

  • Takes longer to respond since it is doing actual research work.

Best for:

  • Tasks that require methodical planning.

Common applications:

  • Legal research to conduct comprehensive case analysis.

  • Financial analysis to combine market data with regulatory information.

The RAG Decision Path - A Framework for Architects

Although there's no one-size-fits-all rule when it comes to architecting a RAG system, here is a common decision path for the task at hand:

Figure H. RAG Decision Path flowchart helping architects choose between Simple, Graph, Corrective, and Advanced RAG based on task constraints (Created by Kuriko IWAI)

Here is the breakdown:

Step 1. The Complexity Test

The first step is to ask whether an answer is in a single document.

If yes, No-DB RAG (for a relatively small document) or Simple RAG is the best option, as it is fast, low-cost, and takes almost no effort to build.

Step 2. The Relationship Test

The next step is to ask whether the query requires understanding deep hierarchies or hidden connections among multiple data sources.

If yes, GraphRAG can map these resources and relationships, which standard text search misses.

For example, a query like "Which departments are affected by Policy A, and who are their managers?" requires connecting department and employee data sources.

GraphRAG maps departments and employees as linked entities.
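The Policy A question is a two-hop traversal, which a tiny example makes visible. The sketch below stores a made-up knowledge graph as (subject, relation, object) triples. The department and manager names are invented, and a real GraphRAG system would build and query the graph with dedicated tooling rather than a Python list.

```python
# hypothetical knowledge graph as (subject, relation, object) triples
triples = [
    ("Policy A", "affects", "Finance"),
    ("Policy A", "affects", "Legal"),
    ("Policy B", "affects", "Sales"),
    ("Finance", "managed_by", "Alice"),
    ("Legal", "managed_by", "Bob"),
    ("Sales", "managed_by", "Carol"),
]

def neighbors(entity: str, relation: str) -> list[str]:
    """Follow one relation outward from an entity."""
    return [o for s, r, o in triples if s == entity and r == relation]

def managers_affected_by(policy: str) -> dict[str, list[str]]:
    """Two-hop traversal: policy -> departments -> managers."""
    return {dept: neighbors(dept, "managed_by") for dept in neighbors(policy, "affects")}

answer = managers_affected_by("Policy A")
print(answer)
```

A plain text search would need a single chunk that mentions the policy, the department, and the manager together. The traversal joins facts that may live in different documents.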

Step 3. The Reliability & Reasoning Test

The third step is to ask if the system can afford to be wrong.

If it requires extremely high precision, CorrectiveRAG or Self-RAG can take a dedicated step to evaluate if the retrieved data actually answers the question before showing it to the user.

And when we need multi-step reasoning in such a task, AgenticRAG can handle planning and self-refinement by leveraging its AI agents.

Step 4. The Performance vs. Cost Trade-off

The last question is the trade-off between performance and cost.

When latency is the top priority, a simpler RAG like No-DB RAG, NaiveRAG, or SimpleRAG works best.

If quality matters more than cost, but not to the extent of the extremely high precision required in Step 3, AdvancedRAG can understand the intent behind the query better than simpler RAGs, although it takes more time to process the query.
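The four steps can also be written down as a small function. This encodes my reading of the steps in the order given above; the function name, the flag names, and the exact order of the checks are assumptions for illustration, and real projects will often weigh these factors rather than short-circuit on the first match.

```python
def choose_rag_architecture(
    answer_in_single_doc: bool,
    needs_relationships: bool,
    needs_high_precision: bool,
    needs_multi_step_reasoning: bool,
    latency_is_top_priority: bool,
) -> str:
    """Walk through Steps 1-4 and return a suggested architecture."""
    if answer_in_single_doc:                 # Step 1: complexity test
        return "No-DB RAG or Simple RAG"
    if needs_relationships:                  # Step 2: relationship test
        return "GraphRAG"
    if needs_high_precision:                 # Step 3: reliability & reasoning test
        if needs_multi_step_reasoning:
            return "AgenticRAG"
        return "CorrectiveRAG or Self-RAG"
    if latency_is_top_priority:              # Step 4: performance vs. cost
        return "NaiveRAG"
    return "AdvancedRAG"

suggestion = choose_rag_architecture(
    answer_in_single_doc=False,
    needs_relationships=True,
    needs_high_precision=False,
    needs_multi_step_reasoning=False,
    latency_is_top_priority=False,
)
print(suggestion)
```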

Wrapping Up

RAG enables LLMs to provide context-aware answers from massive, private datasets without the need for constant retraining.

But RAG is not the right fit for every occasion, specifically when the logic of the task requires deep reasoning, creative synthesis, or a mastery of the underlying language patterns rather than just looking up a document.

Strategic Boundaries: When not to Use RAG

Here are typical scenarios where it is better to skip RAG:

  • Broad reasoning to understand common sense, ethical nuances, or general human behavior.
    → Leverage pre-trained models' internalized weights, not RAG.

  • General knowledge queries about well-established facts that haven't changed in decades.
    → Leverage pre-trained models' internalized weights. In this case, RAG only adds latency, not accuracy.

  • Creative writing that requires specific tone and styles.
    → Leverage pre-trained models or fine-tuning. RAG is overkill.

  • Deep math or logic problems that require multi-step logic and computation.
    → Leverage fine-tuning or Chain-of-Thought (CoT) prompting.

  • Extremely low-latency requirements to serve near-instantaneous responses.
    → Avoid RAG as it adds embedding, searching, and injecting overheads to the pipeline.

  • Small datasets that fit entirely in the prompt.
    → Simply pasting the data into the prompt works. No need to build a complex RAG pipeline.

Summary Table: RAG vs. Alternatives

| If you need... | Use... |
| --- | --- |
| Up-to-the-minute facts | RAG |
| Specialized vocabulary or unique tone | Fine-tuning |
| Complex logic/math | Pre-trained model (reasoning model) + tools |
| Immediate speed | Pre-trained model + prompt engineering |

Table 2: Performance Comparison - RAG vs. Fine-Tuning vs. Reasoning Models


