rag-implementation

Build Retrieval-Augmented Generation (RAG) systems for LLM applications with vector databases and semantic search. Use when implementing knowledge-grounded AI, building document Q&A systems, or integrating LLMs with external knowledge bases.


RAG Implementation

Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.

Use this skill when

  • Building Q&A systems over proprietary documents

  • Creating chatbots with current, factual information

  • Implementing semantic search with natural language queries

  • Reducing hallucinations with grounded responses

  • Enabling LLMs to access domain-specific knowledge

  • Building documentation assistants

  • Creating research tools with source citation

Do not use this skill when

  • You only need purely generative writing without retrieval

  • The dataset is too small to justify embeddings

  • You cannot store or process the source data safely

Instructions

  • Define the corpus, update cadence, and evaluation targets.

  • Choose embedding models and vector store based on scale.

  • Build ingestion, chunking, and retrieval with reranking.

  • Evaluate with grounded QA metrics and monitor drift.

Safety

  • Redact sensitive data and enforce access controls.

  • Avoid exposing source documents in responses when restricted.
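
A minimal sketch of pre-index redaction, assuming a regex-based scrubber over a hypothetical raw_documents list; production systems should prefer a dedicated PII-detection service:

    import re

    # Mask obvious PII before text is chunked and embedded, so sensitive
    # values never enter the vector store. Patterns here are illustrative.
    PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def redact(text):
        for label, pattern in PATTERNS.items():
            text = pattern.sub("[REDACTED_" + label.upper() + "]", text)
        return text

    documents = [redact(text) for text in raw_documents]  # raw_documents: your corpus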

Core Components

    1. Vector Databases


    Purpose: Store and retrieve document embeddings efficiently

    Options:

  • Pinecone: Managed, scalable, fast queries

  • Weaviate: Open-source, hybrid search

  • Milvus: High performance, on-premise

  • Chroma: Lightweight, easy to use

  • Qdrant: Fast, filtered search

  • FAISS: Meta's similarity-search library, local deployment (see the sketch below)
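
    As a quick illustration of the lightest-weight end of this list, a local FAISS flat index (the dimension matches all-MiniLM-L6-v2; the vectors are random stand-ins for real embeddings):

    import faiss
    import numpy as np

    dim = 384                       # all-MiniLM-L6-v2 output size
    index = faiss.IndexFlatL2(dim)  # exact L2 search, no training step

    vectors = np.random.rand(1000, dim).astype("float32")  # stand-in embeddings
    index.add(vectors)

    query = np.random.rand(1, dim).astype("float32")
    distances, ids = index.search(query, 5)  # top-5 nearest neighbors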

    2. Embeddings


    Purpose: Convert text to numerical vectors for similarity search

    Models:

  • text-embedding-ada-002 (OpenAI): General purpose, 1536 dims

  • all-MiniLM-L6-v2 (Sentence Transformers): Fast and lightweight (see the sketch after this list)

  • e5-large-v2: High quality; multilingual variants available (multilingual-e5)

  • Instructor: Embeddings conditioned on task-specific instructions

  • bge-large-en-v1.5: Strong English retrieval performance (led the MTEB leaderboard at release)
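
    A minimal embedding sketch with the lightweight option above, using the sentence-transformers API:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [
        "RAG grounds LLM answers in retrieved documents.",
        "Embeddings map text to vectors for similarity search.",
    ]
    embeddings = model.encode(texts, normalize_embeddings=True)
    print(embeddings.shape)  # (2, 384)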

    3. Retrieval Strategies


    Approaches:
  • Dense Retrieval: Semantic similarity via embeddings

  • Sparse Retrieval: Keyword matching (BM25, TF-IDF); see the sketch after this list

  • Hybrid Search: Combine dense and sparse scores

  • Multi-Query: Generate multiple query variations and merge their results

  • HyDE: Embed an LLM-generated hypothetical answer, then retrieve real documents near it
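
    Sparse retrieval on its own takes a few lines with the rank_bm25 package (tokenization here is deliberately naive; chunks is the chunk list built in the Quick Start below):

    from rank_bm25 import BM25Okapi

    corpus = [doc.page_content for doc in chunks]
    tokenized = [text.lower().split() for text in corpus]  # naive whitespace tokens

    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores("main features".split())
    top_ids = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]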

    4. Reranking


    Purpose: Improve retrieval quality by reordering results

    Methods:

  • Cross-Encoders: BERT-based rerankers that score query-document pairs jointly

  • Cohere Rerank: API-based reranking

  • Maximal Marginal Relevance (MMR): Diversity + relevance (see the sketch after this list)

  • LLM-based: Use an LLM to score relevance
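
    MMR itself fits in a short function; a minimal numpy sketch, assuming unit-normalized vectors so dot products equal cosine similarities:

    import numpy as np

    def mmr(query_vec, doc_vecs, k=5, lambda_mult=0.5):
        # Greedily pick documents, trading query relevance against redundancy
        # with already-selected documents. lambda_mult=1 is pure relevance,
        # 0 is pure diversity.
        selected, remaining = [], list(range(len(doc_vecs)))
        relevance = doc_vecs @ query_vec
        while remaining and len(selected) < k:
            if selected:
                redundancy = np.max(doc_vecs[remaining] @ doc_vecs[selected].T, axis=1)
            else:
                redundancy = np.zeros(len(remaining))
            score = lambda_mult * relevance[remaining] - (1 - lambda_mult) * redundancy
            best = remaining[int(np.argmax(score))]
            selected.append(best)
            remaining.remove(best)
        return selected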

    Quick Start

    from langchain.document_loaders import DirectoryLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Chroma
    from langchain.chains import RetrievalQA
    from langchain.llms import OpenAI

    # 1. Load documents
    loader = DirectoryLoader('./docs', glob="**/*.txt")
    documents = loader.load()

    # 2. Split into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_documents(documents)

    # 3. Create embeddings and vector store
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(chunks, embeddings)

    # 4. Create retrieval chain
    qa_chain = RetrievalQA.from_chain_type(
        llm=OpenAI(),
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        return_source_documents=True
    )

    # 5. Query
    result = qa_chain({"query": "What are the main features?"})
    print(result['result'])
    print(result['source_documents'])

    Advanced RAG Patterns

    Pattern 1: Hybrid Search


    from langchain.retrievers import BM25Retriever, EnsembleRetriever

    # Sparse retriever (BM25)
    bm25_retriever = BM25Retriever.from_documents(chunks)
    bm25_retriever.k = 5

    # Dense retriever (embeddings)
    embedding_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

    # Combine with weights
    ensemble_retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, embedding_retriever],
        weights=[0.3, 0.7]
    )

    Pattern 2: Multi-Query Retrieval


    from langchain.retrievers.multi_query import MultiQueryRetriever

    # Generate multiple query perspectives
    retriever = MultiQueryRetriever.from_llm(
        retriever=vectorstore.as_retriever(),
        llm=OpenAI()
    )

    # Single query → multiple variations → combined results
    results = retriever.get_relevant_documents("What is the main topic?")

    Pattern 3: Contextual Compression


    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainExtractor

    llm = OpenAI()
    compressor = LLMChainExtractor.from_llm(llm)

    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=vectorstore.as_retriever()
    )

    # Returns only relevant parts of documents
    compressed_docs = compression_retriever.get_relevant_documents("query")

    Pattern 4: Parent Document Retriever


    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import InMemoryStore

    # Store for parent documents
    store = InMemoryStore()

    # Small chunks for retrieval, large chunks for context
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter
    )

    Document Chunking Strategies

    Recursive Character Text Splitter


    from langchain.text_splitter import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]  # Try these in order
    )

    Token-Based Splitting


    from langchain.text_splitter import TokenTextSplitter

    splitter = TokenTextSplitter(
        chunk_size=512,
        chunk_overlap=50
    )

    Semantic Chunking


    from langchain_experimental.text_splitter import SemanticChunker

    splitter = SemanticChunker(
        embeddings=OpenAIEmbeddings(),
        breakpoint_threshold_type="percentile"
    )

    Markdown Header Splitter


    from langchain.text_splitter import MarkdownHeaderTextSplitter

    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]

    splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

    Vector Store Configurations

    Pinecone


    import pinecone
    from langchain.vectorstores import Pinecone

    pinecone.init(api_key="your-api-key", environment="us-west1-gcp")

    index = pinecone.Index("your-index-name")

    vectorstore = Pinecone(index, embeddings.embed_query, "text")

    Weaviate


    import weaviate
    from langchain.vectorstores import Weaviate

    client = weaviate.Client("http://localhost:8080")

    vectorstore = Weaviate(client, "Document", "content", embeddings)

    Chroma (Local)


    from langchain.vectorstores import Chroma

    vectorstore = Chroma(
        collection_name="my_collection",
        embedding_function=embeddings,
        persist_directory="./chroma_db"
    )

    Retrieval Optimization

    1. Metadata Filtering


    # Add metadata during indexing
    # (determine_category is a placeholder for your own labeling helper)
    chunks_with_metadata = []
    for i, chunk in enumerate(chunks):
        chunk.metadata = {
            "source": chunk.metadata.get("source"),
            "page": i,
            "category": determine_category(chunk.page_content)
        }
        chunks_with_metadata.append(chunk)

    # Filter during retrieval
    results = vectorstore.similarity_search(
        "query",
        filter={"category": "technical"},
        k=5
    )

    2. Maximal Marginal Relevance


    # Balance relevance with diversity
    results = vectorstore.max_marginal_relevance_search(
        "query",
        k=5,
        fetch_k=20,      # fetch 20 candidates, return the 5 most diverse
        lambda_mult=0.5  # 0 = max diversity, 1 = max relevance
    )

    3. Reranking with Cross-Encoder


    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    # Get initial results
    query = "What are the main features?"
    candidates = vectorstore.similarity_search(query, k=20)

    # Rerank query-document pairs with the cross-encoder
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by score and take the top k
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]

    Prompt Engineering for RAG

    Contextual Prompt


    prompt_template = """Use the following context to answer the question. If you cannot answer based on the context, say "I don't have enough information."

    Context:
    {context}

    Question: {question}

    Answer:"""

    With Citations


    prompt_template = """Answer the question based on the context below. Include citations using [1], [2], etc.

    Context:
    {context}

    Question: {question}

    Answer (with citations):"""

    With Confidence


    prompt_template = """Answer the question using the context. Provide a confidence score (0-100%) for your answer.

    Context:
    {context}

    Question: {question}

    Answer:
    Confidence:"""
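
    To wire one of these templates into the Quick Start chain, the classic LangChain pattern is to pass a PromptTemplate through chain_type_kwargs; a sketch reusing the variables defined above:

    from langchain.prompts import PromptTemplate

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )

    qa_chain = RetrievalQA.from_chain_type(
        llm=OpenAI(),
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt}
    )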

    Evaluation Metrics

    def evaluate_rag_system(qa_chain, test_cases):
        # calculate_accuracy, evaluate_retrieved_docs, and check_groundedness
        # are placeholders for your own scoring functions.
        metrics = {
            'accuracy': [],
            'retrieval_quality': [],
            'groundedness': []
        }

        for test in test_cases:
            result = qa_chain({"query": test['question']})

            # Check if the answer matches the expected one
            accuracy = calculate_accuracy(result['result'], test['expected'])
            metrics['accuracy'].append(accuracy)

            # Check if relevant docs were retrieved
            retrieval_quality = evaluate_retrieved_docs(
                result['source_documents'],
                test['relevant_docs']
            )
            metrics['retrieval_quality'].append(retrieval_quality)

            # Check if the answer is grounded in the retrieved context
            groundedness = check_groundedness(
                result['result'],
                result['source_documents']
            )
            metrics['groundedness'].append(groundedness)

        return {k: sum(v) / len(v) for k, v in metrics.items()}
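
    A sketch of the test-case shape the function expects (field names follow the code above; the contents are illustrative):

    test_cases = [
        {
            "question": "What are the main features?",
            "expected": "The product supports hybrid search and reranking.",
            "relevant_docs": ["docs/features.txt"],
        },
    ]

    scores = evaluate_rag_system(qa_chain, test_cases)
    print(scores)  # per-metric averages, e.g. {'accuracy': 0.8, ...}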

    Resources

  • references/vector-databases.md: Detailed comparison of vector DBs

  • references/embeddings.md: Embedding model selection guide

  • references/retrieval-strategies.md: Advanced retrieval techniques

  • references/reranking.md: Reranking methods and when to use them

  • references/context-window.md: Managing context limits

  • assets/vector-store-config.yaml: Configuration templates

  • assets/retriever-pipeline.py: Complete RAG pipeline

  • assets/embedding-models.md: Model comparison and benchmarks

    Best Practices

  • Chunk Size: Balance between context and specificity (500-1000 tokens; see the tiktoken sketch after this list)

  • Overlap: Use 10-20% overlap to preserve context at boundaries

  • Metadata: Include source, page, timestamp for filtering and debugging

  • Hybrid Search: Combine semantic and keyword search for best results

  • Reranking: Improve top results with cross-encoder

  • Citations: Always return source documents for transparency

  • Evaluation: Continuously test retrieval quality and answer accuracy

  • Monitoring: Track retrieval metrics in production
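
    For token-based sizing (the Chunk Size guidance above), a tiktoken length function can be plugged into the character splitter; a minimal sketch:

    import tiktoken
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    encoding = tiktoken.get_encoding("cl100k_base")

    def token_len(text):
        return len(encoding.encode(text))

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,             # within the 500-1000 token guidance
        chunk_overlap=120,          # roughly 15% overlap
        length_function=token_len   # measure length in tokens, not characters
    )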

    Common Issues

  • Poor Retrieval: Check embedding quality, chunk size, query formulation

  • Irrelevant Results: Add metadata filtering, use hybrid search, rerank

  • Missing Information: Ensure documents are properly indexed

  • Slow Queries: Optimize vector store, use caching, reduce k

  • Hallucinations: Improve grounding prompt, add verification step
