Why RAG is Hard | Infinity Services

If you've followed a standard RAG tutorial, you know the drill: parse a PDF, chunk it into 512-token segments, embed the chunks with OpenAI's `text-embedding-3-small`, and shove them into Pinecone. When a user asks a question, you grab the top 3 matches and feed them to GPT-4.

This works beautifully for a 5-page employee handbook. It fails catastrophically for a 50,000-page technical archive.

The "Lost in the Middle" Phenomenon

Vector search ranks by similarity, not by correctness. When you rely solely on cosine similarity, you are optimizing for semantic overlap with the query, not for whether a chunk actually contains the answer.

A major failure mode in production is the "Lost in the Middle" effect. LLMs tend to prioritize information at the very beginning and very end of their context window. If your retriever fetches 10 chunks and the critical answer is in chunk #5, the model might hallucinate an answer because it simply glossed over the middle.
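One common mitigation, shown here as a minimal sketch rather than anything this article prescribes, is to reorder the retrieved chunks so the strongest ones sit at the edges of the prompt and the weakest land in the middle. The function name `reorder_for_edges` and the chunk labels are made up for illustration, and the input is assumed to be sorted best-first.

```python
def reorder_for_edges(chunks_best_first):
    """Place the strongest chunks at the start and end of the context.

    Input is ordered best-first. Alternating chunks go to the front and
    the back, so the weakest chunks end up in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk is the very last one.
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
print(reorder_for_edges(ranked))  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

Whether this helps depends on the model and the prompt, so treat it as something to measure, not a guaranteed fix.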

Why Naive RAG Fails
Most RAG tutorials teach you to chunk text into 512-token segments, embed them with OpenAI's `text-embedding-3-small`, and perform a cosine similarity search. Failures of this approach usually fall into three buckets:
                        

1. Semantic Drift
Dense embeddings are great at capturing concepts but terrible at capturing specifics. If you search for "Error Code 503", a semantic search might return documents about "Server Outages" generally, but miss the specific manual page for "Error 503" because the embedding captured the concept of "error" rather than the exact string.
                        

2. The "Lost in the Middle" Phenomenon
LLMs are not perfect readers. Research from Stanford shows that models are excellent at using information at the beginning and end of a prompt, but degrade significantly when accessing information in the middle. If you retrieve 10 chunks and the answer is in chunk #5, the model might ignore it.
                        

                            3. Structure Blindness
                            A PDF isn't just a string of text. It has tables, headers, and layouts. Naive chunking
                            destroys this structure. If you chunk a table row by row, you lose the column headers. The
                            context is gone.
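As a minimal sketch of how to avoid the table problem described above, the snippet below repeats the column headers in every chunk so each chunk still makes sense on its own. The function name `chunk_table`, the pipe-separated text layout, and the sample rows are illustrative assumptions, not part of any specific parser.

```python
def chunk_table(header, rows, rows_per_chunk=2):
    """Split a table into chunks, repeating the header in every chunk."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        lines = [" | ".join(header)] + [" | ".join(row) for row in group]
        chunks.append("\n".join(lines))
    return chunks


# Made-up sample table for illustration only.
header = ["Error Code", "Meaning", "Fix"]
rows = [
    ["502", "Bad Gateway", "Check upstream"],
    ["503", "Service Unavailable", "Restart service"],
    ["504", "Gateway Timeout", "Raise timeout"],
]
table_chunks = chunk_table(header, rows)
for chunk in table_chunks:
    print(chunk, end="\n\n")
```

Every chunk now starts with the header row, so a retrieved row for "503" still carries the meaning of each column.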
                        

                            The Production Stack: Hybrid Search & Reranking
                            To build a production-grade system, we abandon the single-step retrieval model. Instead, we
                            use a Two-Stage Retrieval process.
                        
Stage 1: Hybrid Retrieval (The Wide Net)
                            
We query two indexes simultaneously:

Vector Index: Finds conceptually related content ("How do I fix the server?").
Keyword Index (BM25): Finds exact matches ("Error 503").

We weight these results (alpha=0.7) and fuse them into a single list of ~50 candidates.
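A minimal sketch of what an alpha-weighted fusion could look like follows. It assumes that alpha is the weight on the vector score and that both score sets are min-max normalized first, since cosine similarities and BM25 scores live on different scales. The function names, document IDs, and scores are invented for illustration.

```python
def min_max(scores):
    """Scale a {doc_id: score} mapping to the 0-1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(vector_scores, keyword_scores, alpha=0.7, k=50):
    """Blend normalized scores: alpha * vector + (1 - alpha) * keyword."""
    v, kw = min_max(vector_scores), min_max(keyword_scores)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(v) | set(kw)
    }
    # Highest blended score first, capped at k candidates.
    return sorted(fused, key=fused.get, reverse=True)[:k]


# Made-up scores: cosine similarities on the left, BM25 scores on the right.
vector_scores = {"outage-guide": 0.82, "error-503-page": 0.74, "faq": 0.60}
keyword_scores = {"error-503-page": 12.4, "changelog": 3.1}
print(weighted_fusion(vector_scores, keyword_scores))
```

In this toy example the exact-match page comes out ahead of the general outage guide, because it scores reasonably well on both indexes.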
                            
Stage 2: Cross-Encoder Reranking (The Filter)
                                Bi-encoders (Vectors) are fast but imprecise. Cross-Encoders are slow but incredibly
                                accurate.
                            

                                We take those 50 candidates and pass them through a Cross-Encoder (like
                                `bge-reranker-v2-m3`). This model reads the Query and the Document together and
                                outputs a relevance score (0-1). We keep the top 5.
                            

                            Implementation Pattern
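from sentence_transformers import CrossEncoder

# Load the reranker once; reloading the model on every call is slow.
reranker = CrossEncoder('BAAI/bge-reranker-v2-m3')

def retrieve_and_rerank(query: str, top_k: int = 5):
    # vector_db, bm25, and fuse_results are placeholders for your own
    # vector store, keyword index, and fusion helper.

    # 1. Hybrid Search (Broad Recall)
    vector_results = vector_db.search(query, k=50)
    keyword_results = bm25.search(query, k=50)

    # Reciprocal Rank Fusion (RRF); assumed to return a list of text passages
    candidates = fuse_results(vector_results, keyword_results)

    # 2. Reranking (High Precision)
    # rank() scores each (query, passage) pair and returns the hits best-first
    ranked_results = reranker.rank(query, candidates)

    # 3. Context Window Optimization: only the top_k hits reach the LLM
    return ranked_results[:top_k]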
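The `fuse_results` helper that the pattern above calls is not defined in this article. A minimal sketch of Reciprocal Rank Fusion, assuming each retriever returns a best-first list of document IDs and using the commonly quoted constant k=60, could look like this. The `limit` parameter and the sample IDs are assumptions for illustration.

```python
def fuse_results(*ranked_lists, k=60, limit=50):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents that rank well in several lists accumulate the highest score.
    return sorted(scores, key=scores.get, reverse=True)[:limit]


vector_results = ["doc_a", "doc_b", "doc_c"]
keyword_results = ["doc_c", "doc_a", "doc_d"]
fused = fuse_results(vector_results, keyword_results)
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

RRF only looks at ranks, so it needs no score normalization, which is why it is a popular default when the two indexes produce scores on different scales.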

                        

Building a RAG demo takes an afternoon. Building a RAG system that doesn't hallucinate in production takes six months.
                            

The vector database ecosystem has sold us a lie: that "semantic search" is magic. You just chunk your PDFs, embed them with OpenAI, and voila, you have chat-with-your-data.
                            

                                The reality? 70% of RAG projects fail in production. They fail because
                                "Naive RAG" (the standard tutorial architecture) cannot handle the nuance of real-world
                                enterprise data. This article explores the failure modes of Naive RAG and details the
                                Advanced RAG Architecture required to fix them.
                            
The "Naive RAG" Death Spiral
                                Production RAG systems suffer from three primary failure modes. If you have deployed a
                                chatbot, you have likely seen these already.
                            
1. The "Lost in the Middle"
You retrieve 10 documents. The answer is in Document 5. The LLM attends to the beginning (Doc 1) and the end (Doc 10) but often overlooks the middle.
                                    
2. The Keyword Mismatch
                                        User asks for "PO #1234". Embedding models treat "1234" as generic. They
                                        retrieve "PO #5678" because it's semantically similar (both are purchase
                                        orders), but factually wrong.
                                    
3. Context Fragmentation
                                        The answer is split across two chunks. Chunk A has the question, Chunk B has the
                                        answer. Neither chunk has enough semantic meaning on its own to be retrieved.
                                    

                                The Fix: Hybrid Search & Reranking
To solve the Keyword Mismatch, we must admit that embeddings are not perfect. Sometimes, old-school BM25 (Keyword Search) beats vector search.
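To make the keyword argument concrete, here is a small, self-contained sketch of Okapi BM25 scoring in plain Python. It is a teaching version with commonly used default parameters (k1=1.5, b=0.75) and a naive regex tokenizer, not a replacement for a real search engine. The documents are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters, so '#1234' becomes '1234'."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Purchase order PO #5678 for office chairs",
    "Purchase order PO #1234 for server racks",
    "Guide to resetting your password",
]
scores = bm25_scores("PO #1234", docs)
best = docs[scores.index(max(scores))]
print(best)  # Purchase order PO #1234 for server racks
```

Because "1234" appears in only one document, it gets a high inverse document frequency and pulls the right purchase order to the top, which is exactly the signal a dense embedding tends to blur.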
                            

Hybrid Search runs both queries in parallel:

1. Dense Vector Search: Finds concepts ("How do I reset my password?")
2. Sparse Keyword Search: Finds exact matches ("Error Code 503")
                            

                                We then merge the results using Reciprocal Rank Fusion (RRF) and pass
                                them to a Cross-Encoder Reranker.
                            

# Advanced RAG Pipeline
# vector_db, bm25, deduplicate, and CohereRerank are placeholders for your
# own vector store, keyword index, merge helper, and reranker client.
def retrieve(query):
    vector_results = vector_db.search(query)   # Semantic
    keyword_results = bm25.search(query)       # Exact Match

    # Reranking Step (Crucial)
    candidates = deduplicate(vector_results + keyword_results)
    ranked_docs = CohereRerank.rerank(query, candidates)

    return ranked_docs[:5]  # Only Top 5 go to LLM
                            

                                Evaluation: The Ragas Framework
                                How do you know if your RAG system is good? "It feels right" is not a metric.
                            

                                We use the Ragas (Retrieval Augmented Generation Assessment) framework
                                to mathematically score our pipelines. We track three core metrics:
                            
1. Faithfulness (The Safety Check)
                                        Did the LLM make things up? We check if every sentence in the answer can be
                                        attributed to a sentence in the retrieved context. Low Faithfulness = Hallucination.
                                    
2. Context Precision (The Noise Filter)
                                        Did we retrieve garbage? If we retrieved 10 docs and only #9 was relevant, our
                                        precision is low. This confuses the LLM.
                                    
3. Answer Relevancy (The User Check)
                                        Did we actually answer the question? A faithful answer that ignores the user's
                                        intent is useless.
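The arithmetic behind two of these metrics can be sketched in a few lines. This is a simplified illustration, not the Ragas library itself: in Ragas the per-chunk relevance flags and per-claim verdicts come from an LLM judge, whereas here they are hand-written inputs, and the function names are our own. Context precision is computed as a rank-aware average of precision@k over the positions that hold a relevant chunk, which is our understanding of how Ragas defines it.

```python
def context_precision(relevance):
    """Rank-aware precision over retrieved chunks.

    `relevance` is a best-first list of 0/1 flags, one per retrieved chunk.
    Averages precision@k over the positions that hold a relevant chunk.
    """
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def faithfulness(claim_supported):
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# 10 retrieved docs, only #9 relevant: the scenario described above.
only_ninth = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(round(context_precision(only_ninth), 3))  # 0.111

# 4 claims in the answer, 3 of them backed by the context.
print(faithfulness([True, True, False, True]))  # 0.75
```

The same relevant document at rank 1 would score 1.0, which is why reranking before generation moves this metric so much.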
                                    
Stop Guessing. Start Engineering.
                                    We build Production RAG systems with Hybrid Search, Reranking, and Ragas Evaluation
                                    pipelines.
                                

Your Vector Database is Lying to You.
