The Retrieval Worked Exactly as Designed. The Answer Was Still Wrong.
A critical vulnerability in Retrieval-Augmented Generation (RAG) systems, often overlooked in standard performance evaluations, has been brought to light through a reproducible demonstration. The core issue lies not in the retrieval of relevant documents, but in the subsequent assembly and processing of that information, a phase where conflicting data can lead to confidently incorrect answers without any discernible warning signal. This revelation challenges the prevailing trust placed in retrieval scores and highlights a significant blind spot in current RAG pipeline architectures.
The problem manifests when a RAG system, despite successfully identifying and retrieving multiple documents that contain contradictory information, proceeds to synthesize an answer that is factually incorrect. This is not a case of hallucination, where the model invents information, nor a failure of retrieval, where relevant documents are missed. Instead, the model, presented with a "contradictory brief," is ill-equipped to arbitrate between competing claims. It is a scenario where the system is asked to referee a dispute it was never designed to judge, leading to a silent and confident misrepresentation of facts.
This failure mode eludes typical retrieval metrics and standard hallucination detectors, existing instead in the crucial but often unevaluated step between context assembly and final generation. A newly released, self-contained demonstration project, runnable on standard CPU hardware with minimal resource requirements (approximately 220 MB of data and no API keys or cloud dependencies), has been developed to isolate and illustrate this critical flaw. The project provides complete source code, allowing for direct replication and verification of the findings.
The Core Experiment: Confronting Contradictory Information
The experimental setup is deliberately streamlined to focus on the precise point of failure. It involves three distinct scenarios, each designed around a knowledge base containing pairs of documents with directly opposing claims on the same factual subject. Crucially, the retrieval mechanism is meticulously tuned to ensure that both conflicting documents are consistently retrieved for each query.
The central question posed by the experiment is not whether retrieval can identify relevant documents, but rather, what a language model does when presented with a contradictory set of facts and tasked with providing a definitive answer with high confidence. The observed behavior reveals a tendency for the model to unilaterally "pick a side," doing so silently, with unwavering confidence, and without any indication that a choice between conflicting sources was necessary.
Production-Derived Scenarios Expose RAG Vulnerabilities
The experiment draws its scenarios from common real-world situations where conflicting information is likely to accumulate within a knowledge base over time. These include financial restatements, policy revisions, and versioned technical documentation – all commonplace in enterprise environments.
Scenario A: The Earnings Restatement Conundrum
This scenario simulates a typical financial reporting cycle. A company’s initial Q4 earnings release reports an annual revenue of $4.2 million for fiscal year 2023. Three months later, following an external audit, this figure is officially restated to $6.8 million. Both the preliminary release and the audited revision are present in the knowledge base. When a user queries, "What was Acme Corp’s revenue for fiscal year 2023?", the RAG system retrieves both documents. With similarity scores of 0.863 for the preliminary report and 0.820 for the revised annual report, both are considered highly relevant. However, the language model, when tasked with answering, selects the preliminary figure of $4.2 million, reporting it with 80.3% confidence. This choice is driven by subtle retrieval score differences, not by the fact that the later, audited figure is the authoritative one. The system offers no mechanism to signal that a more authoritative source had superseded the initial claim.

Scenario B: The Evolving Remote Work Policy
In this HR-related scenario, an initial policy enacted in June 2023 mandates employees to be in the office three days a week. A subsequent revision in November 2023 explicitly overturns this, allowing for fully remote work. Both documents are retrieved when an employee inquires about the current remote work policy, scoring 0.806 for the June policy and 0.776 for the November revision. The RAG system’s response defaults to the older, stricter June policy, providing an answer that is no longer applicable. The confidence score for this incorrect answer is 78.3%.
Scenario C: Outdated API Documentation
This scenario reflects a common challenge in technical documentation. Version 1.2 of an API reference document states a rate limit of 100 requests per minute. Version 2.0, published after a significant infrastructure upgrade, raises this limit to 500 requests per minute. Both documents are retrieved, with scores of 0.788 for v1.2 and 0.732 for v2.0. The RAG system confidently answers with "100 requests per minute," leading a developer to incorrectly configure their application and operate at only one-fifth of their actual allowed capacity. The confidence for this answer is 81.0%.
These examples are not isolated incidents; they represent recurring patterns in virtually any production knowledge base that undergoes updates over time. The current RAG architecture, without an explicit conflict resolution layer, fails to detect or appropriately handle these inherent contradictions.
The Mechanics of Model Decision-Making in Conflict
The underlying reason for these incorrect answers lies in the fundamental design and training of the language models used in RAG pipelines. The employed extractive question-answering model, deepset/minilm-uncased-squad2, operates by identifying the most probable span of text within a given context string. It lacks an intrinsic mechanism to recognize or flag situations where multiple, contradictory spans are present.
When presented with context containing both $4.2M and $6.8M, the model evaluates token-level scores across the entire string. The decision of which span to select is influenced by several factors, none of which relate to the recency or authority of the information:
- Position Bias: Earlier segments of the context often receive slightly higher attention scores due to the underlying encoder architecture. In Scenario A, the preliminary earnings report, having a marginally higher retrieval score, appeared earlier in the concatenated context, thus influencing the model’s attention.
- Language Strength: Direct, declarative statements tend to be favored over hedged or conditional phrasing. For instance, a sentence stating "revenue was $4.2M" might be prioritized over a sentence like "following the restatement, the revised figure is $6.8M."
- Lexical Alignment: Spans that share more vocabulary with the original query tokens are also favored, irrespective of the factual accuracy or currency of the information.
Crucially, factors such as the document’s publication date, its audit status, or whether one claim explicitly supersedes another are entirely invisible to this type of extractive model. These crucial contextual signals are not part of its decision-making process.
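The position bias above is partly a product of how the context string is assembled in the first place. The sketch below, a minimal illustration rather than the demo's actual code, assumes the common convention of concatenating retrieved documents in descending order of retrieval score, which is exactly what puts the higher-scoring stale claim first:

```python
# Sketch of a typical context-assembly step (an assumption, not the demo's
# exact code): documents are concatenated highest retrieval score first,
# so the marginally higher-scoring stale document benefits from position bias.

def assemble_context(retrieved: list[tuple[float, str]]) -> str:
    """Concatenate document texts, highest retrieval score first."""
    ordered = sorted(retrieved, key=lambda pair: pair[0], reverse=True)
    return "\n\n".join(text for _, text in ordered)

docs = [
    (0.820, "Following the audit, FY2023 revenue was restated to $6.8M."),
    (0.863, "Preliminary Q4 release: FY2023 revenue was $4.2M."),
]
context = assemble_context(docs)
# The preliminary $4.2M claim lands first in the context the QA model reads.
```

Under this convention, nothing about recency or authority ever influences the ordering; only the similarity score does.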
This phenomenon is not limited to extractive models. Generative LLMs, while producing answers in fluent prose rather than verbatim extraction, exhibit similar behaviors. Research, such as that presented at ICLR 2025 by Joren et al., indicates that even frontier models like Gemini 1.5 Pro, GPT-4o, and Claude 3.5 frequently provide incorrect answers rather than abstaining when context is insufficient or conflicting. Furthermore, their expressed confidence levels do not reliably reflect the accuracy of their responses. The root cause is identified as an architectural deficiency in the RAG pipeline itself – the absence of a stage dedicated to detecting contradictions before the generation phase.

Introducing a Conflict Detection Layer
To address this critical gap, a novel component has been integrated into the RAG pipeline: a ConflictDetector. This layer operates between the retrieval stage and the generation stage, meticulously examining pairs of retrieved documents to identify and flag contradictions before the QA model processes the context. A key efficiency is that embeddings for all retrieved documents are computed in a single batched forward pass, ensuring that each document is encoded only once, regardless of how many pairwise comparisons it participates in.
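The batched-embedding idea can be sketched as follows. This is an illustration with random stand-in vectors, not the detector's actual code; the demo itself encodes documents with all-MiniLM-L6-v2, but the matrix algebra is the same: encode all k documents once, then read every pairwise cosine similarity off a single k × k matrix product.

```python
import numpy as np

# Stand-in embeddings: k=4 documents, d=8 dimensions, rows normalized to
# unit length so the plain dot product equals cosine similarity.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# One matrix product yields all pairwise cosine similarities at once.
sims = embeddings @ embeddings.T

# Only the upper triangle is needed: each unordered pair is checked once.
pairs = [(i, j, sims[i, j]) for i in range(4) for j in range(i + 1, 4)]
# k*(k-1)/2 = 6 pairwise comparisons, but only one encoding pass.
```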
The detector employs two primary heuristics for identifying potential conflicts:
Heuristic 1: Numerical Contradiction
This heuristic flags documents that are semantically similar but contain distinct numerical values. To mitigate the issue of ubiquitous small integers and years generating false positives, the implementation specifically filters out years (1900-2099) and single-digit integers (1-9) when they appear as standalone claim values. The code snippet below illustrates the extraction of "meaningful numbers":
```python
@classmethod
def _extract_meaningful_numbers(cls, text: str) -> set[str]:
    results = set()
    for m in cls._NUM_RE.finditer(text):  # _NUM_RE matches number-like tokens
        raw = m.group().strip()
        numeric_core = re.sub(r"[$¢£MBK%,]", "", raw, flags=re.IGNORECASE).strip()
        try:
            val = float(numeric_core)
        except ValueError:
            continue
        if 1900 <= val <= 2099 and "." not in numeric_core:
            continue  # skip years
        if val < 10 and re.fullmatch(r"\d+", raw):
            continue  # skip bare small integers
        results.add(raw)
    return results
```
In Scenario A, this heuristic correctly identifies that fin-001 contains $4.2M and fin-002 contains $6.8M. With an empty intersection, a conflict is flagged.
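A self-contained sketch of that pair check follows. The regex is simplified relative to the project's `_NUM_RE`, and `numbers_conflict` is a hypothetical helper rather than the detector's actual API, but the logic mirrors the heuristic: extract each document's meaningful numbers and flag a conflict when both make numeric claims with no overlap.

```python
import re

# Simplified number pattern: optional $, digits, optional decimal, optional
# magnitude/percent suffix. The real _NUM_RE also handles ¢, £, and commas.
_NUM_RE = re.compile(r"\$?\d+(?:\.\d+)?[MBK%]?", re.IGNORECASE)

def meaningful_numbers(text: str) -> set[str]:
    """Extract numeric claims, skipping years and bare single digits."""
    results = set()
    for m in _NUM_RE.finditer(text):
        raw = m.group()
        core = re.sub(r"[$MBK%]", "", raw, flags=re.IGNORECASE)
        val = float(core)
        if 1900 <= val <= 2099 and "." not in core:
            continue  # skip years
        if val < 10 and re.fullmatch(r"\d+", raw):
            continue  # skip bare small integers
        results.add(raw)
    return results

def numbers_conflict(doc_a: str, doc_b: str) -> bool:
    """Flag a conflict when both docs carry numbers but share none."""
    a, b = meaningful_numbers(doc_a), meaningful_numbers(doc_b)
    return bool(a) and bool(b) and not (a & b)

numbers_conflict("FY2023 revenue was $4.2M.",
                 "Restated FY2023 revenue: $6.8M.")  # -> True
```

Note how "2023" in both documents is correctly discarded as a year, so only the genuinely contradictory dollar figures drive the decision.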
Heuristic 2: Contradiction Signal Asymmetry
This heuristic identifies documents discussing the same topic where one document contains specific "contradiction tokens" that the other lacks. These tokens are categorized into two sets:
- Negation and Uncertainty: 'no', 'not', 'never', 'none', 'n't', 'without', 'lack', 'fail', 'unable', 'impossible', 'doubt', 'unlikely', 'disagree', 'reject', 'deny'
- Directional and Comparative: 'increase', 'decrease', 'rise', 'fall', 'up', 'down', 'more', 'less', 'higher', 'lower', 'change', 'update', 'new', 'old', 'past', 'future', 'previous', 'current', 'former', 'latest', 'earlier', 'later', 'supercede', 'replace', 'revise', 'amend', 'cancel', 'remove'
These tokens are combined into a CONTRADICTION_SIGNALS set. This approach allows for domain-specific tuning; for example, a legal corpus might require a broader set of negation terms, while a changelog corpus might benefit from more directional tokens.
Applied to Scenario B, hr-002 contains "no" (from "no longer required"), which is absent in hr-001, flagging an asymmetry. Similarly, in Scenario C, api-002 includes "increased," which is not present in api-001, also indicating an asymmetry.
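The asymmetry check can be sketched as a set operation. This is an illustration, not the detector's exact code: the signal set below is a small subset of CONTRADICTION_SIGNALS, and the tokenizer is a simplified lowercase word split.

```python
import re

# Illustrative subset of the CONTRADICTION_SIGNALS set described above.
CONTRADICTION_SIGNALS = {
    "no", "not", "never", "n't",
    "increase", "decrease", "new", "old", "previous", "replace",
}

def signal_asymmetry(doc_a: str, doc_b: str) -> set[str]:
    """Return signal tokens present in exactly one of the two documents."""
    tokenize = lambda t: set(re.findall(r"[a-z']+", t.lower()))
    a = tokenize(doc_a) & CONTRADICTION_SIGNALS
    b = tokenize(doc_b) & CONTRADICTION_SIGNALS
    return a ^ b  # symmetric difference: tokens only one side contains

signal_asymmetry(
    "Employees must work from the office three days a week.",
    "Employees are no longer required to work from the office.",
)
# -> {'no'}: the revision carries a negation the original lacks
```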
Both heuristics are gated by a topic_sim >= 0.68 threshold, ensuring that only documents discussing genuinely related topics trigger conflict detection. This threshold, calibrated for the specific embedding model (all-MiniLM-L6-v2) and document set used in the demonstration, serves as a starting point and may require recalibration for different embedding models or domains.
Resolution Strategy: Cluster-Aware Recency
When conflicts are detected, the system resolves them by prioritizing the most recently timestamped document within each identified conflict cluster. The critical design element here is "cluster-aware." A top-k retrieval result might contain multiple independent conflict clusters. A naive approach of simply selecting the single most recent document overall could erroneously discard valid winning documents from other unrelated clusters.

The implemented resolution strategy constructs a conflict graph and uses iterative Depth-First Search (DFS) to identify connected components (clusters). Each cluster is then resolved independently. Documents not involved in any conflict pass through unchanged. For each identified conflict cluster, the document with the latest timestamp (determined using ISO-8601 formatted timestamps that sort lexicographically) is retained.
```python
@staticmethod
def _resolve_by_recency(
    contexts: list[RetrievedContext],
    conflict: ConflictReport,
) -> list[RetrievedContext]:
    # Build adjacency list (defaultdict from collections)
    adj: dict[str, set[str]] = defaultdict(set)
    for a_id, b_id in conflict.conflict_pairs:
        adj[a_id].add(b_id)
        adj[b_id].add(a_id)

    # Connected components via iterative DFS
    visited: set[str] = set()
    clusters: list[set[str]] = []
    for start in adj:
        if start not in visited:
            cluster: set[str] = set()
            stack = [start]
            while stack:
                node = stack.pop()
                if node not in visited:
                    visited.add(node)
                    cluster.add(node)
                    stack.extend(adj[node] - visited)
            clusters.append(cluster)

    all_conflicting_ids = set().union(*clusters) if clusters else set()
    non_conflicting = [c for c in contexts if c.document.doc_id not in all_conflicting_ids]

    resolved_docs = []
    for cluster in clusters:
        cluster_ctxs = [c for c in contexts if c.document.doc_id in cluster]
        # ISO-8601 timestamps sort lexicographically -> max() gives most recent
        best = max(cluster_ctxs, key=lambda c: c.document.timestamp)
        resolved_docs.append(best)
    return non_conflicting + resolved_docs
```
This approach ensures that only one document per distinct conflict cluster is passed to the generation model, and that document is the most recent one.
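The end-to-end behavior can be demonstrated with lightweight stand-ins for the pipeline's types. The `Doc` dataclass and `resolve_by_recency` function below are hypothetical simplifications that mirror the listing above, not the project's actual classes:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    timestamp: str  # ISO-8601, so lexicographic max() is the most recent

def resolve_by_recency(docs: list[Doc], pairs: list[tuple[str, str]]) -> list[str]:
    """Keep non-conflicting docs plus the newest doc of each conflict cluster."""
    adj: dict[str, set[str]] = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    by_id = {d.doc_id: d for d in docs}
    visited: set[str] = set()
    kept: list[str] = []
    for d in docs:
        if d.doc_id not in adj:
            kept.append(d.doc_id)  # untouched by any conflict
        elif d.doc_id not in visited:
            stack, cluster = [d.doc_id], set()
            while stack:  # iterative DFS over one cluster
                n = stack.pop()
                if n not in visited:
                    visited.add(n)
                    cluster.add(n)
                    stack.extend(adj[n] - visited)
            kept.append(max(cluster, key=lambda i: by_id[i].timestamp))
    return kept

docs = [
    Doc("fin-001", "2024-01-15"), Doc("fin-002", "2024-04-20"),  # cluster 1
    Doc("hr-001", "2023-06-01"), Doc("hr-002", "2023-11-15"),    # cluster 2
    Doc("misc-001", "2022-05-05"),                               # no conflict
]
resolve_by_recency(docs, [("fin-001", "fin-002"), ("hr-001", "hr-002")])
# -> ['fin-002', 'hr-002', 'misc-001']
```

A naive global "keep newest" rule would have discarded hr-002 in favor of fin-002; per-cluster resolution keeps one winner from each dispute while the uninvolved bystander passes through.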
Phase 2: Conflict-Aware RAG in Action
When the conflict detection and resolution layer is enabled, the same three scenarios yield correct answers.
In Scenario A, the system now correctly identifies $6.8M as the annual revenue, with a confidence score of 79.6%. The log explicitly states that the conflict was resolved by keeping the more recent document.
Scenario B’s remote work policy query now returns the correct information: "employees are no longer required to maintain a fixed in-office schedule," with a confidence of 78.0%. The resolution log indicates that the November 2023 policy was chosen over the older one.
For Scenario C, the API rate limit query correctly yields "500 requests per minute," with a confidence of 80.9%. The system again notes the resolution based on the newer API reference.
Remarkably, the confidence scores remain almost identical to those in Phase 1. This underscores the initial point: confidence alone is not a reliable indicator of correctness when conflicts exist. The significant change is not in the model’s confidence, but in the pipeline’s architecture, which now incorporates a mechanism for conflict resolution.
Limitations of the Heuristics
It is imperative to acknowledge the limitations of these heuristics to avoid overstating their capabilities.
- Paraphrased Conflicts: The current heuristics are effective for numerical discrepancies and explicit contradiction tokens. However, they will not detect conflicts expressed through paraphrasing, such as "the service was retired" versus "the service is currently available." Addressing such nuanced conflicts would require more sophisticated Natural Language Inference (NLI) models, like cross-encoder/nli-deberta-v3-small. The ConflictDetector class is designed for extensibility, allowing for the integration of NLI models into the _pair_conflict_reason method.
- Non-Temporal Conflicts: The recency-based resolution strategy is ideal for versioned documents and policy updates. It is less suitable for scenarios involving expert opinion disagreements (where a minority view might be correct), conflicts arising from different methodologies, or queries requiring multiple perspectives. In such cases, the ConflictReport data structure can be utilized to present both claims, flag the issue for human review, or prompt the user for clarification.
- Scalability: The current pair-comparison approach has a time complexity of O(k^2), where k is the number of retrieved documents. While efficient for small k (e.g., k=3 to 20), it may become computationally intensive for pipelines retrieving a very large number of documents (k=100 or more). For such scenarios, pre-indexing known conflict pairs or employing cluster-based detection methods would be more appropriate.
The Research Landscape and Future Directions
The work presented here aligns with and builds upon ongoing research in the field of knowledge conflict resolution in RAG systems. Active research is tackling this problem with increasingly sophisticated methods.

The CONFLICTS benchmark, introduced by Cattan et al. (2025), provides a structured taxonomy for understanding how models handle knowledge conflicts across categories like freshness, conflicting opinions, complementary information, and misinformation. Their experiments reveal that LLMs frequently struggle with these conflicts, though explicit prompting to reason about conflicts significantly improves response quality.
Ye et al. (2026) proposed TCR (Transparent Conflict Resolution), a framework that disentangles semantic relevance from factual consistency using dual contrastive encoders. Their approach enhances conflict detection by 5-18 F1 points while adding minimal parameters.
Gao et al. (2025) introduced CLEAR (Conflict-Localized and Enhanced Attention for RAG), which probes LLM hidden states to detect internal manifestations of conflicting knowledge, leading to more accurate evidence integration.
A consistent finding across these research efforts mirrors the demonstration: retrieval quality and answer quality are distinct dimensions, and the gap between them is substantial. The key differentiator of this demonstration is its accessibility – a fully reproducible solution requiring minimal resources and no complex authentication.
Practical Recommendations for Implementation
Based on these findings, several actionable steps are recommended for improving RAG systems:
- Integrate a Conflict Detection Layer: Implement a conflict detection mechanism upstream of the generation stage. Even simple heuristics can effectively address common conflict patterns in enterprise corpora, such as restatements, policy updates, and versioned documentation.
- Differentiate Conflict Types: Recognize that temporal conflicts (resolved by recency), factual disputes (requiring human review), and opinion conflicts (requiring dual presentation) necessitate distinct resolution strategies. Applying a single strategy universally can introduce new failure modes.
- Log Conflict Reports: Systematically log all detected conflicts. Analyzing this data over time will provide invaluable insights into the frequency, types, and query patterns associated with conflicts specific to your organization’s corpus.
- Surface Uncertainty: When conflicts cannot be resolved definitively, the most responsible approach is to acknowledge and communicate this uncertainty. Instead of arbitrarily selecting one answer and masking the ambiguity, systems should provide responses that highlight the conflicting information and its resolution status.
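The logging recommendation can be made concrete with a small sketch. The record shape and field names below are assumptions for illustration, not the demo project's actual schema:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rag.conflicts")

def log_conflict(query: str, pair: tuple[str, str],
                 reason: str, winner: str) -> str:
    """Emit one structured conflict record and return it for inspection."""
    record = json.dumps({
        "query": query,
        "conflicting_docs": list(pair),
        "reason": reason,          # e.g. "numerical_contradiction"
        "resolution": "recency",
        "winner": winner,
    }, sort_keys=True)
    log.info(record)
    return record

entry = log_conflict(
    "What was Acme Corp's revenue for fiscal year 2023?",
    ("fin-001", "fin-002"),
    "numerical_contradiction",
    "fin-002",
)
```

Because each record is plain JSON, the accumulated log can be aggregated later to answer questions like which document pairs conflict most often and which queries keep hitting them.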
The Takeaway: Bridging the Gap
While vector search and retrieval mechanisms are largely mature and well-understood, the problem of context assembly in RAG systems remains largely unaddressed. The significant gap between retrieving "correct documents" and producing a "correct answer" is a pervasive issue, leading to confident, yet incorrect, responses without any clear indication of failure.
The solution does not necessitate larger models, entirely new architectures, or extensive re-training. It requires the addition of a single pipeline stage that operates on existing embeddings, incurring minimal marginal latency. The demonstration experiment, runnable in approximately thirty seconds on a standard laptop, highlights the feasibility of such an addition. The critical question for organizations deploying RAG systems is not whether their retrieval is optimized, but whether they possess a mechanism to handle the inevitable conflicts that arise from evolving knowledge bases. The risk lies in the silent, confident propagation of misinformation, a risk that can be mitigated with a dedicated conflict detection and resolution layer.
References
[1] Ye, H., Chen, S., Zhong, Z., Xiao, C., Zhang, H., Wu, Y., & Shen, F. (2026). Seeing through the conflict: Transparent knowledge conflict handling in retrieval-augmented generation. arXiv:2601.06842. https://doi.org/10.48550/arXiv.2601.06842
[2] Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv:2310.11511. https://doi.org/10.48550/arXiv.2310.11511

[3] Cattan, A., Jacovi, A., Ram, O., Herzig, J., Aharoni, R., Goldshtein, S., Ofek, E., Szpektor, I., & Caciularu, A. (2025). DRAGged into conflicts: Detecting and addressing conflicting sources in search-augmented LLMs. arXiv:2506.08500. https://doi.org/10.48550/arXiv.2506.08500
[4] Gao, L., Bi, B., Yuan, Z., Wang, L., Chen, Z., Wei, Z., Liu, S., Zhang, Q., & Su, J. (2025). Probing latent knowledge conflict for faithful retrieval-augmented generation. arXiv:2510.12460. https://doi.org/10.48550/arXiv.2510.12460
[5] Jin, Z., Cao, P., Chen, Y., Liu, K., Jiang, X., Xu, J., Li, Q., & Zhao, J. (2024). Tug-of-war between knowledge: Exploring and resolving knowledge conflicts in retrieval-augmented language models. arXiv:2402.14409. https://doi.org/10.48550/arXiv.2402.14409
[6] Joren, H., Zhang, J., Ferng, C.-S., Juan, D.-C., Taly, A., & Rashtchian, C. (2025). Sufficient context: A new lens on retrieval augmented generation systems. arXiv:2411.06037. https://doi.org/10.48550/arXiv.2411.06037
[7] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv:2005.11401. https://doi.org/10.48550/arXiv.2005.11401
[8] Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., & Hajishirzi, H. (2023). When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv:2212.10511. https://doi.org/10.48550/arXiv.2212.10511
[9] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv:1908.10084. https://doi.org/10.48550/arXiv.1908.10084
[10] Xu, R., Qi, Z., Guo, Z., Wang, C., Wang, H., Zhang, Y., & Xu, W. (2024). Knowledge conflicts for LLMs: A survey. arXiv:2403.08319. https://doi.org/10.48550/arXiv.2403.08319
[11] Xie, J., Zhang, K., Chen, J., Lou, R., & Su, Y. (2023). Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. arXiv:2305.13300. https://doi.org/10.48550/arXiv.2305.13300

Complete Source Code: https://github.com/Emmimal/rag-conflict-demo
Models Used:
- sentence-transformers/all-MiniLM-L6-v2 (approx. 90 MB)
- deepset/minilm-uncased-squad2 (approx. 130 MB)
Both models download automatically on first run and cache locally. No API key or Hugging Face authentication is required.