Proxy-Pointer RAG Achieves Vectorless Accuracy at Vector RAG Scale and Cost

In a significant advancement for Retrieval Augmented Generation (RAG) systems, a novel architecture named Proxy-Pointer RAG has demonstrated unprecedented accuracy and efficiency, particularly when processing complex, structured documents. This development addresses a long-standing challenge in the field: bridging the gap between the precision of "vectorless" RAG approaches and the scalability required for enterprise-level applications. The initial proof-of-concept, detailed in a prior publication, laid the theoretical groundwork, but a recent comprehensive benchmark test has now provided robust evidence of its production readiness, especially for demanding data types like financial filings.

The core innovation of Proxy-Pointer RAG lies in its ability to leverage the inherent structure within documents, a feature typically discarded by conventional RAG systems. Standard vector RAG models often break documents into uniform "chunks," discarding the hierarchical information carried by section headings, subheadings, and other organizational elements. This fragmentation leads to piecemeal context and suboptimal responses from Large Language Models (LLMs). Proxy-Pointer, by contrast, preserves and exploits this structure, enabling more precise retrieval and significantly improving the quality of generated answers.

The Need for Structured Document Processing in RAG

The vast majority of documents encountered in enterprise environments—technical manuals, research papers, legal contracts, policy reports, financial statements, and compliance documents—are inherently structured. This structure is not merely cosmetic; it encapsulates meaning and reflects how humans intuitively comprehend and navigate complex information. By organizing content hierarchically, these documents guide the reader through logical flows and interconnections. Traditional RAG systems, however, often treat these rich documents as flat, unstructured text, leading to a loss of critical context. This can result in LLMs "hallucinating" information or failing to retrieve the most relevant data, even when it exists within the document.

Proxy-Pointer RAG aims to rectify this by integrating the document’s structural information directly into the vector index. This approach allows for a more "surgical" retrieval of information, akin to how a human expert would pinpoint relevant sections based on headings and subheadings, without the inherent scalability and cost limitations of purely vector-based methods.

Stress-Testing with Financial Filings: The Ultimate Challenge

To rigorously validate Proxy-Pointer RAG’s capabilities, researchers opted for a particularly demanding testbed: financial filings. These documents, such as the annual 10-K reports filed with the U.S. Securities and Exchange Commission (SEC), are characterized by deep nesting, intricate cross-referencing across multiple financial statements, and a critical need for precise numerical reasoning. A single misplaced decimal or overlooked footnote can have significant implications for analysis. The hypothesis is that if Proxy-Pointer RAG can successfully navigate these complex documents, it can handle virtually any structured text with headings.

The benchmark involved analyzing four publicly available FY2022 10-K filings from prominent companies: AMD (121 pages), American Express (260 pages), Boeing (190 pages), and PepsiCo (500 pages). A total of 66 questions were posed across two distinct benchmark datasets, including adversarial queries specifically engineered to challenge naive retrieval systems. The results, detailed below, have been described as "decisive."

Open-Sourcing the Pipeline for Reproducibility and Advancement

A significant aspect of this development is the open-sourcing of the complete Proxy-Pointer RAG pipeline. This initiative allows users to replicate the benchmark results, apply the system to their own documents, and contribute to its further development. The repository includes all necessary code for extraction, indexing, and querying, aiming for a quick and accessible integration for developers and researchers.

A Quick Recap: The Core Mechanism of Proxy-Pointer RAG

At its heart, Proxy-Pointer RAG differentiates itself from standard vector RAG in several key ways. Standard RAG splits documents into chunks, embeds them, and retrieves the top-K most similar embeddings. The synthesizing LLM then receives fragmented text, often lacking crucial context, leading to potential inaccuracies.

Proxy-Pointer RAG addresses this by incorporating five zero-cost engineering techniques:

  1. Hierarchical Indexing: Instead of flat chunks, documents are indexed with their structural hierarchy intact. This means each piece of text retains its relationship to its parent headings and sibling sections.
  2. Section-Aware Chunking: Chunks are not arbitrary divisions but are contextually aware of the sections they belong to. This ensures that when a chunk is retrieved, it’s part of a coherent textual unit.
  3. Structural Pointers: Each chunk is augmented with "proxy pointers" that explicitly map its position within the document’s hierarchy. This allows the retrieval system to understand the "where" of the information.
  4. LLM-Synthesized Sections: The LLM synthesizer receives complete sections of text, not isolated fragments, significantly enhancing its ability to understand context and generate accurate responses.
  5. Zero-Cost Structural Embedding: The structural metadata is integrated without requiring additional LLM calls or expensive embedding computations during indexing, thus maintaining low cost.

The outcome is a retrieval system where every chunk knows its precise location within the document’s structure, and the LLM synthesizer has access to complete, contextually rich sections.
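To make the "every chunk knows its location" idea concrete, here is a minimal sketch of a chunk carrying a structural proxy pointer as a breadcrumb of ancestor headings. This is an illustration, not the released implementation; the `Chunk` class and its field names are assumptions.

```python
# Sketch (not the released code): a chunk that carries a structural
# "proxy pointer" -- the path of ancestor headings above it -- so the
# retrieval system knows *where* the text lives in the document.
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    # Proxy pointer: ancestor headings, root first.
    breadcrumb: list = field(default_factory=list)

    def pointer(self) -> str:
        """Render the structural pointer as a breadcrumb string."""
        return " > ".join(self.breadcrumb)


chunk = Chunk(
    text="Cash used for capital expenditures was $5,207 million.",
    breadcrumb=["PepsiCo 10-K", "Financial Statements",
                "Consolidated Statement of Cash Flows"],
)
print(chunk.pointer())
# PepsiCo 10-K > Financial Statements > Consolidated Statement of Cash Flows
```

A breadcrumb rendered this way doubles as the source-grounding trail shown in the benchmark responses (e.g. "Boeing > Liquidity and Capital Resources > Cash Flow Summary").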

Refinements Elevating Prototype to Production Readiness

Since its initial introduction, the Proxy-Pointer RAG pipeline has undergone significant refinements, transforming it from a promising prototype into a robust, production-grade retrieval engine. These enhancements primarily target the indexing and retrieval pipelines.

Indexing Pipeline Enhancements:

  • Standalone Architecture: The original implementation relied on external libraries for generating skeleton trees (hierarchical document structures). The new version features a completely self-contained, pure-Python tree builder. This ~150-line module parses Markdown headings into a hierarchical JSON structure, eliminating external dependencies and LLM calls for this process, achieving millisecond-level performance.
  • LLM-Powered Noise Filter: Previous versions used a hardcoded list of common noise titles (e.g., "Contents," "Foreword"). This approach proved brittle, failing to identify semantically equivalent titles like "Note of Thanks" versus "Acknowledgments" or variations of "Table of Contents." The updated pipeline now leverages a lightweight LLM (gemini-flash-lite) to analyze the skeleton tree and identify noise nodes across six categories. This semantic filtering is far more robust than regex-based approaches and can catch variations that would otherwise be missed.
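The skeleton-tree idea can be sketched in a few dozen lines. The following is a simplified stand-in for the ~150-line pure-Python module described above (the function name and node schema are assumptions, not the repository's actual API): it parses Markdown `#` headings into a nested structure using a stack.

```python
# Simplified sketch of a pure-Python skeleton-tree builder: parse
# Markdown headings into a hierarchical dict, no LLM calls required.
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def build_skeleton(markdown: str) -> dict:
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]  # stack[-1] is the deepest currently-open node
    for line in markdown.splitlines():
        m = HEADING.match(line)
        if not m:
            continue  # body text is attached to nodes in a later pass
        level, title = len(m.group(1)), m.group(2).strip()
        # Pop until the top of the stack can be this heading's parent.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"title": title, "level": level, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


doc = """# Item 8. Financial Statements
## Consolidated Statement of Cash Flows
## Notes to Financial Statements
### Note 1. Basis of Presentation
"""
tree = build_skeleton(doc)
print(tree["children"][0]["children"][1]["children"][0]["title"])
# Note 1. Basis of Presentation
```

Because this is plain string processing, it runs in milliseconds even on a 500-page filing, which is the property the standalone architecture exploits.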

The updated indexing pipeline flow is visualized as follows: [Image: Embedding pipeline, showing document parsing, skeleton tree generation, noise filtering, and chunking leading to FAISS index].

Retrieval Pipeline Enhancements:

  • Two-Stage Retrieval: Semantic + LLM Re-Ranker: The initial benchmark relied on a straightforward top-K retrieval from a FAISS index. The refined pipeline introduces a sophisticated two-stage retrieval process:
    • Initial Semantic Search: A standard semantic search is performed to retrieve a larger set of candidate chunks (e.g., top-10).
    • LLM-Powered Re-Ranking: A lightweight LLM then re-ranks these candidate chunks based on their relevance to the query, considering the structural context. This ensures that the most pertinent sections, even if not the absolute top semantic matches, are prioritized.
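The two stages above can be sketched as follows. This is an illustrative outline, not the repository's code: brute-force cosine similarity stands in for the FAISS search, and `llm_relevance` is a placeholder for the lightweight LLM re-ranker (gemini-flash-lite in the article); both names are assumptions.

```python
# Sketch of two-stage retrieval: semantic top-k candidates, then
# re-ranking by a pluggable relevance scorer (an LLM in the real system).
import numpy as np


def cosine_top_k(query_vec, chunk_vecs, k):
    """Stage 1: semantic search -> indices of the top-k candidates."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q
    return list(np.argsort(-sims)[:k])


def two_stage_retrieve(query_vec, chunk_vecs, chunks, llm_relevance,
                       k_candidates=10, k_final=5):
    """Stage 2: re-rank the candidate set and keep the best k_final."""
    candidates = cosine_top_k(query_vec, chunk_vecs, k_candidates)
    ranked = sorted(candidates, key=lambda i: llm_relevance(chunks[i]),
                    reverse=True)
    return [chunks[i] for i in ranked[:k_final]]


# Toy usage with a trivial scorer standing in for the LLM call:
result = two_stage_retrieve(np.array([1.0, 0.0, 0.0]), np.eye(3),
                            ["a", "bb", "ccc"], llm_relevance=len,
                            k_candidates=3, k_final=2)
print(result)  # ['ccc', 'bb']
```

The design point is that stage 1 is cheap and recall-oriented (e.g. top-10), while stage 2 spends a small LLM budget only on that candidate set to decide which sections actually matter.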

The updated retrieval pipeline is illustrated here: [Image: Retrieval pipeline, showing initial semantic search, re-ranking by LLM, and final selection for the synthesizer].

These refinements have been crucial in transforming Proxy-Pointer from a proof-of-concept into a system capable of meeting the stringent demands of enterprise applications.

Rigorous Benchmarking: Two Tests, 66 Questions, Four Companies

The evaluation strategy involved downloading the FY2022 10-K annual filings for AMD, American Express (AMEX), Boeing, and PepsiCo. These documents were first converted to Markdown format using LlamaParse and then indexed through the refined Proxy-Pointer pipeline. The testing was conducted across two distinct benchmark methodologies.

Benchmark 1: FinanceBench (26 Questions)

FinanceBench is an established benchmark dataset designed to assess the performance of RAG systems on financial filings. It comprises qualitative and quantitative questions that cover numerical reasoning, information extraction, and logical inference. All 26 questions from the FinanceBench dataset that spanned the four selected companies were utilized. Due to licensing restrictions, the specific questions and ground truth answers from FinanceBench are not included in the open-source repository, but comprehensive scorecards detailing Proxy-Pointer’s responses are provided for reference. A benchmark.py script is included to facilitate reproducibility, allowing users to run the full evaluation on FinanceBench or custom datasets and generate detailed logs and scorecards.

Benchmark 2: Comprehensive Stress Test (40 Questions)

Recognizing that FinanceBench primarily tests factual recall, a more challenging benchmark was developed. This "Comprehensive Stress Test" consisted of 40 custom questions, with 10 questions per company, specifically designed to push the limits of numerical reasoning, multi-hop retrieval, adversarial robustness, and cross-statement reconciliation. These queries were crafted to break systems that rely on superficial chunk matching.

The full question-and-answer logs, along with scorecards comparing the system’s output to pre-computed ground truth, are available in the open-source repository. To illustrate the complexity, several example queries highlight the demanding nature of this benchmark:

  • Multi-hop Numerical Reasoning (AMEX): "Calculate the proportion of net interest income to total revenues net of interest expense for 2022 and compare it to 2021. Did dependence increase?" This query requires locating two distinct line items across two fiscal years, performing calculations for each year, and then comparing the results. It necessitates precise retrieval from the income statement and a multi-step reasoning process.
  • Adversarial Numerical Reasoning (AMD): "Estimate whether inventory buildup contributed significantly to cash flow decline in FY2022." This question is deliberately adversarial because it presupposes a cash flow decline (which was marginal) and asks the model to quantify the impact of a balance sheet item (inventory) on a cash flow statement metric. A naive retriever might fetch balance sheet data but fail to connect it to the cash flow context.
  • Complex Calculation (PepsiCo): "Calculate the reinvestment rate defined as Capex divided by (Operating Cash Flow minus Dividends)." This task involves extracting three specific figures from the cash flow statement and performing a non-standard calculation not explicitly stated anywhere in the 10-K.
  • Counterintuitive Financial Metric (Boeing): "What percentage of operating cash flow in FY2022 was consumed by changes in working capital?" The expected answer here is counterintuitive: 0% consumed, because working capital acted as a source of cash. This requires the system to correctly interpret the signs of financial figures and understand the qualitative meaning of a net positive contribution from working capital.
  • Attributional Analysis (AMEX): "Estimate how much of total revenue growth is attributable to discount revenue increase." This query demands the calculation of two deltas (change in discount revenue and change in total revenue) and then expressing one as a percentage of the other. This calculation is not directly presented in the filing.

Each question in this stress test has a pre-computed ground truth answer with specific numerical values, ensuring unambiguous evaluation.

Decisive Results: 100% Accuracy in Primary Configuration

The benchmark results, particularly in the primary configuration using k=5 (retrieving 5 document sections), were unequivocally positive.

k=5 Configuration (Primary)

In this setup, the retriever selects a set of 5 nodes (k_final = 5), and the corresponding sections are fed to the synthesizer LLM.

Benchmark                     | Score   | Accuracy
FinanceBench (26 questions)   | 26 / 26 | 100%
Comprehensive (40 questions)  | 40 / 40 | 100%
Total                         | 66 / 66 | 100%

The system achieved a perfect score across all 66 questions. This means every numerical value precisely matched the ground truth, and every qualitative assessment accurately reflected the data presented in the filings.

Illustrative Bot Responses:


To demonstrate the system’s capabilities, two specific bot responses from the benchmark run are highlighted, showcasing both the retrieval path and the synthesized answer:

  • PepsiCo: Reinvestment Rate: For the query about the reinvestment rate, the bot retrieved sections from the "Consolidated Statement of Cash Flows." It then accurately computed the reinvestment rate as approximately 112.24% (5,207M Capex / (10,811M OCF – 6,172M Dividends)). Notably, the bot proactively computed the same ratio for prior fiscal years (FY2021 and FY2020), revealing a trend of accelerating reinvestment that was not explicitly requested but provided valuable additional insight.
  • Boeing: Cash Flow Quality: In response to the question about working capital’s consumption of operating cash flow, the bot identified sections from "Boeing > Liquidity and Capital Resources > Cash Flow Summary" and the "Consolidated Statements of Cash Flows." It correctly stated that changes in working capital provided $4,139M in cash, while net operating cash flow was $3,512M. Consequently, it accurately concluded that working capital did not consume any percentage of OCF, as it was a net source. This demonstrates the system’s ability to handle counterintuitive financial scenarios and correctly interpret the sign conventions in financial statements.
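Both figures above can be checked by hand from the values quoted in the responses (all in $M):

```python
# Verifying the two bot answers quoted above, using the figures from the
# responses themselves (PepsiCo and Boeing FY2022 cash-flow items, in $M).

# PepsiCo reinvestment rate = Capex / (Operating Cash Flow - Dividends)
capex, ocf, dividends = 5_207, 10_811, 6_172
reinvestment_rate = capex / (ocf - dividends)
print(f"{reinvestment_rate:.2%}")  # 112.24%

# Boeing: working capital *provided* $4,139M of cash against $3,512M of
# net OCF, so it consumed 0% of operating cash flow.
wc_change, boeing_ocf = 4_139, 3_512
consumed = max(0.0, -wc_change) / boeing_ocf
print(f"{consumed:.0%}")  # 0%
```

The Boeing case shows why sign handling matters: a naive system that drops the sign of the working-capital change would report a large (and wrong) consumption percentage.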

k=3 Configuration: Understanding Failure Boundaries

To explore the system’s limits, the benchmarks were re-run with k_final=3, reducing the number of retrieved document sections to three. This constraint tests the robustness of the retrieval precision when context is intentionally limited.

Benchmark                     | Score   | Accuracy
FinanceBench (26 questions)   | 25 / 26 | 96.2%
Comprehensive (40 questions)  | 37 / 40 | 92.5%
Total                         | 62 / 66 | 93.9%

While the accuracy remained high, the failures in the k=3 run offered valuable insights:

  • Insufficient Context for Complex Queries: The observed failures were consistently attributable to insufficient context coverage rather than incorrect retrieval. The re-ranker selected the right primary sections, but the limited context window (k=3) prevented it from also including the secondary sections needed for complex reconciliation queries.
  • Trade-off Between Speed and Coverage: This highlights an important architectural insight: k=5 provides adequate coverage for intricate cross-referencing, whereas k=3 introduces retrieval gaps for the most demanding reconciliations. For the majority of typical queries that target a single section or statement, k=3 would likely suffice and offer faster performance.

Qualitative Strengths Beyond Raw Scores

Beyond the quantitative metrics, the benchmark testing revealed several qualitative strengths that are critical for real-world RAG applications:

  • Source Grounding: Every response was meticulously grounded in specific sources, using structural breadcrumbs (e.g., "AMD > Financial Condition > Liquidity and Capital Resources"). This provides an auditable trail for analysts, allowing them to trace answers directly to their origin in the filing.
  • Adversarial Robustness: The system demonstrated impressive robustness against adversarial queries. For instance, when asked about non-existent cryptocurrency revenue at AMEX, it correctly stated "No evidence" rather than fabricating data. Similarly, when faced with a mathematically undefined Debt/Equity ratio for Boeing due to negative equity, it explained why the metric was not meaningful, avoiding the generation of nonsensical numbers. These are precisely the types of queries that expose weaknesses in retrieval systems that rely on surface-level matching and can lead to hallucinations.
  • Outperforming Ground Truth: In several instances, the bot’s responses were arguably more insightful or precise than the pre-computed ground truth. For example, the bot calculated exact figures for PepsiCo’s reinvestment rate and Boeing’s backlog change, providing more granular detail than the initial ground truth estimates. This is attributed to the synthesizer having access to complete, unedited sections of text, enabling more comprehensive analysis.

Open-Source Repository and Quickstart Guide

In a significant move to foster community adoption and further development, Proxy-Pointer RAG has been released under the MIT License. The complete codebase is accessible via a dedicated GitHub repository. The project is designed for a rapid "5-minute quickstart," with a clear directory structure and included components:

  • src/: Contains core modules for configuration, extraction (PDF to Markdown via LlamaParse), indexing (including the pure-Python tree builder, noise filter, chunking, and FAISS index creation), and the RAG agent (interactive bot and benchmarking script).
  • data/: Includes sample documents, such as the four 10-K filings and pre-extracted Markdown for AMD, facilitating immediate testing.
  • Benchmark/: Houses full scorecards and comparison logs from the benchmark runs.

The entire pipeline operates using a single Gemini API key, leveraging the cost-effective gemini-flash-lite model. No GPU infrastructure is required, and the indexing process is streamlined, avoiding complex, token-intensive tree building. The user experience is intended to be straightforward: clone, configure, index, and query.

Conclusion: A Paradigm Shift in Structured Document Retrieval

The initial hypothesis behind Proxy-Pointer RAG was that achieving structurally aware retrieval did not necessitate complex, LLM-navigated indexing trees; clever integration of structural metadata was sufficient. The initial 10-query comparison provided compelling evidence. This latest article presents definitive proof, moving beyond hypothesis to demonstrable capability.

With a perfect 100% accuracy across 66 challenging questions derived from four major Fortune 500 companies, Proxy-Pointer RAG has proven its mettle. It successfully tackled multi-hop numerical reasoning, cross-statement reconciliation, adversarial edge cases, and counterintuitive financial metrics. Even when intentionally constrained to k=3, it maintained an impressive 93.9% accuracy, with failures occurring only in scenarios that genuinely required more than three document sections for comprehensive analysis.

The implications for production RAG systems are profound. This achievement signifies:

  • Unparalleled Accuracy: Demonstrates a significant leap in retrieval precision for structured documents.
  • Scalability and Cost-Effectiveness: Achieves this accuracy without the prohibitive costs associated with traditional LLM-driven indexing strategies.
  • Production Readiness: The refined architecture and successful benchmarking indicate suitability for enterprise deployment.
  • Transparency and Auditability: Structural grounding provides a clear audit trail for all generated responses.
  • Robustness: Exhibits resilience against adversarial queries and nuanced interpretations of financial data.

The core takeaway is clear: if a retrieval system struggles with complex, structured documents, the issue often lies not with the embedding model itself, but with the index’s inability to understand the document’s inherent organization. Once a retrieval system is equipped with structural awareness, accuracy naturally follows.

The open-source release of Proxy-Pointer RAG invites the broader community to explore its capabilities, test it on their own datasets, and contribute to its evolution. This initiative promises to set a new standard for how RAG systems interact with and derive insights from the wealth of structured information that permeates the enterprise landscape.

For further engagement and discussion, individuals can connect with the author on LinkedIn at www.linkedin.com/in/partha-sarkar-lets-talk-AI. All documents utilized in this benchmark are publicly available FY2022 10-K filings from SEC.gov. The code and benchmark results are released under the MIT License. Images used in this article were generated using Google Gemini.
