What Hundreds of Prompt Evaluations Taught Us About LLM Failure Modes

Estimated reading time: 11 minutes

Everyone knows LLMs hallucinate. At this point, "AI makes things up" is the tech equivalent of "water is wet." Every LinkedIn post, every conference talk, every cautionary op-ed circles the same warnings: hallucinations, bias, confidently wrong answers.

But here's what I've learned after spending months writing prompts (both manually and with AI assistance) and evaluating outputs across hundreds of examples: the limitations that actually sabotage your work are subtler, more systematic, and almost never discussed.

These aren't edge cases. They're default behaviors. And once you see them, you can't unsee them.


1. Your model only reads the first and last page

Feed an LLM a long document and ask it to analyze the whole thing. What you'll get back is a summary heavily weighted toward whatever appeared at the beginning and at the end. The middle? Largely ignored.

I noticed this when evaluating research summaries across dozens of inputs. The model would consistently anchor on data points from the opening and closing sections, while critical information buried in the middle simply vanished from the output.

What's happening under the hood: Stanford researchers call this the "Lost in the Middle" effect. LLMs exhibit a U-shaped performance curve: they retrieve information well from the start and end of their input, but performance on middle content drops by more than 20 percentage points. In some cases, models performed worse with the full context than with no context at all. The cause is intrinsic attention bias: the transformer architecture allocates disproportionate attention to beginning and ending tokens regardless of relevance.

The fix: Structure your prompts so the most important information appears at the beginning and is restated at the end. If you're feeding in research data, meeting notes, or any long-form input, don't assume the model will weigh everything equally. It won't. Front-load what matters, and echo it at the close.
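One way to make front-loading and echoing habitual is to bake it into how you assemble prompts. Here's a minimal sketch; the `TASK`/`CONTEXT`/`REMINDER` labels are just one possible convention, not a required format:

```python
def build_prompt(key_instructions: str, long_context: str) -> str:
    """Place the critical instructions before the long context and
    restate them after it, so they sit in the positions the model
    attends to most (the start and end of the input)."""
    return (
        f"TASK:\n{key_instructions}\n\n"
        f"CONTEXT:\n{long_context}\n\n"
        f"REMINDER (restating the task):\n{key_instructions}"
    )

prompt = build_prompt(
    "Extract every revenue figure and its fiscal year.",
    "...thousands of tokens of meeting notes...",
)
```

The point isn't the helper itself; it's that the restatement happens every time, instead of only when you remember.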


2. It always picks the first or last item on your list

Hand the model a list of options, categories, or examples, and ask it to choose, rank, or reference them. Run this 50 times. You'll notice a stark pattern: the first and last items get selected far more often than anything in the middle.

I saw this repeatedly when using LLMs to categorize prospects, select relevant messaging angles, and match companies to solution categories. The model wasn't evaluating the full list. It was defaulting to the endpoints.

What's happening under the hood: This is a sibling of the positional bias above, but it shows up specifically in selection tasks. The autoregressive architecture creates strong recency bias (the last item has outsized influence), while the attention sink phenomenon pulls weight toward the first token. Research on few-shot example ordering confirms that accuracy can swing from state-of-the-art to random guessing based purely on item order. The model anchors on position, not the options themselves.

The fix: Shuffle your lists. If you need the model to select from a set of options, randomize the order across runs and aggregate the results. If you're building this into a pipeline, automate the shuffling. One static list will produce one biased result. Multiple shuffled passes give you signal.
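The shuffle-and-aggregate pattern is straightforward to automate. In this sketch, `pick_fn` is a hypothetical stand-in for your actual LLM call (one that takes an ordered list and returns the chosen item):

```python
import random
from collections import Counter

def select_with_shuffling(options, pick_fn, runs=10, seed=0):
    """Run the selection several times with the option order shuffled
    each pass, then return the most frequent winner. This averages
    out positional bias: no single item sits at the endpoints every
    time. `pick_fn` is a placeholder for a real model call."""
    rng = random.Random(seed)  # seeded for reproducible shuffles
    votes = Counter()
    for _ in range(runs):
        shuffled = list(options)
        rng.shuffle(shuffled)
        votes[pick_fn(shuffled)] += 1
    return votes.most_common(1)[0][0]
```

A single biased pass becomes ten noisy-but-independent votes, and the majority is far more likely to reflect actual relevance than position.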


3. The longer your prompt, the worse the output

There's a natural instinct to give the model more context, more instructions, more examples. More should be better, right? In practice, I found the opposite. Past a certain point, adding more to the prompt actively degrades output quality. Instructions get ignored. Nuance disappears. The model starts producing generic, surface-level responses.

What's happening under the hood: Researchers call this "Context Rot." A 2025 study across 18 frontier models (GPT-4.1, Claude, Gemini) found that performance degrades at every context length increment, not just near the limit. Models effectively utilize only 10-20% of their advertised context window. A million-token context window still shows significant degradation at 50K tokens. Three mechanisms compound: attention dilution (quadratically more relationships to track), distractor interference (more tokens means more noise), and KV-cache growth creating non-linear degradation.

The fix: Split big prompts into smaller, sequential steps. Give the model one clear job at a time, pass the output forward, and clear the conversation between tasks. Think of it as an assembly line, not a single mega-instruction. Your 2,000-word master prompt is almost certainly underperforming compared to four focused 500-word prompts run in sequence.
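The assembly-line idea can be sketched as a simple chain, where each step is one focused prompt template and `call_llm` is a hypothetical placeholder for your model client:

```python
def run_pipeline(steps, initial_input, call_llm):
    """Run focused prompts in sequence, passing each step's output
    forward as the next step's input. Each call starts fresh, so no
    step carries the accumulated context of the whole job."""
    result = initial_input
    for template in steps:
        result = call_llm(template.format(input=result))
    return result

# Illustrative steps; in practice each would be a tight, single-job prompt.
steps = [
    "Extract the key claims from this text:\n{input}",
    "Rank these claims by strength:\n{input}",
    "Draft a summary from the top-ranked claims:\n{input}",
]
```

Each stage gets a short, clean context, which is exactly the regime where models perform best.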


4. It thinks last year was 2024

This one catches people constantly in outreach. LLMs have a training data cutoff, and their sense of time is frozen at that boundary. Ask the model to reference a prospect's recent sustainability report and it will call a 2024 publication "last year's report," even though it came out two years ago. Feed it conference attendance data and it will describe an event from nine months ago as happening "a couple of weeks ago." The model has no clock. It fills in temporal language based on what felt recent during training.

The bigger problem: the model doesn't flag the gap. It won't say "I don't know when this happened relative to today." It will write outreach that references stale events with fresh language, and your prospect will notice even if you don't.

What's happening under the hood: Beyond the training cutoff, there's a subtler distortion: reporting bias. LLMs reflect the frequency of events in their training data, not their real-world probability. They overestimate rare but newsworthy events (because those get disproportionate coverage in training corpora) and underestimate common ones. So even within the model's knowledge window, its sense of what's important or recent is shaped by what got written about, not what actually happened or when.

The fix: Always inject the current date into your prompt. When generating outreach that references time-sensitive information (reports, events, news), include the publication or event date alongside the current date so the model can calculate the gap correctly. Never let the model infer recency on its own. Better yet, treat any temporal claim in LLM output as suspect until you've verified it manually.
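Computing the gap yourself and stating it explicitly removes the guesswork entirely. A minimal sketch (the wording of the injected sentence is just one option):

```python
from datetime import date

def temporal_context(event_label, event_date, today=None):
    """Spell out today's date and the event's age so the model never
    has to infer what 'recent' means from its frozen training clock."""
    today = today or date.today()
    months = (today.year - event_date.year) * 12 + (today.month - event_date.month)
    return (
        f"Today's date is {today.isoformat()}. "
        f"The {event_label} was published on {event_date.isoformat()}, "
        f"about {months} months ago. Do not describe it as recent "
        f"unless that gap genuinely qualifies."
    )
```

Prepend this to any outreach prompt that touches a dated report, event, or announcement, and the model can no longer call a two-year-old publication "last year's report."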


5. Your AI outreach sounds like a pushy sales robot

Ask an LLM to write outreach emails, LinkedIn messages, or sales copy, and you'll get something that's grammatically perfect, logically structured, and completely dead on arrival. The tone is invariably overconfident, direct to the point of being aggressive, and dripping with corporate buzzwords. "Revolutionize your workflow." "Unlock unprecedented value." "I'd love to explore synergies."

Nobody talks like this. But every LLM writes like this.

What's happening under the hood: Two forces converge. First, RLHF alignment training systematically rewards confident, decisive-sounding language. Human raters consistently prefer responses that sound authoritative, and this preference gets baked directly into the model's behavior. The model has learned that hedging, nuance, and conversational casualness score lower. Second, the training data is saturated with marketing copy, sales templates, and corporate communication, and the model defaults to those patterns when the task even vaguely resembles persuasion.

The fix: You must provide examples of your actual voice. Not a style guide. Not "write in a casual tone." Actual messages you've sent that worked. Give the model 3-5 real outreach messages that sound like you, and instruct it to match the tone, sentence structure, and level of directness. Without concrete examples, you'll get the LLM's idea of "professional," which is everyone's idea of "robotic."


6. "Act as an expert" makes it dumber

This one surprised me. The standard prompting advice is to assign a persona: "You are an expert sales strategist," "Act as a senior data analyst," "Respond as a sustainability consultant." The assumption is that this activates specialized knowledge. The research says the opposite.

What's happening under the hood: Rigorous evaluation shows that persona prompting harms model performance, causing accuracy drops of up to 30 percentage points across reasoning benchmarks. When instructed to adopt a persona, the model activates a broad, noisy cluster of associations: how that persona talks in movies, in forum posts, in fiction. The model spends computational resources maintaining a character voice, mimicking conversational style, and performing the persona, all at the direct cost of actual reasoning quality. The most counterintuitive finding? Mathematical "expert" personas hurt performance within math itself, not just on unrelated tasks.

The fix: Default to neutral, persona-free prompts for any task where accuracy matters. If you want a specific voice or style, separate the work: generate the factual content first with a clean prompt, then apply the styling in a second pass. Never conflate "sound like an expert" with "think like an expert." The model can do one or the other, not both at once.


7. It optimizes for speed, not quality

Give the model a complex task with a lot of data. It will return something fast, and that something will be shallow. Not wrong, necessarily. Just... thin. It'll hit the surface-level requirements and skip the depth. Ask it to analyze ten companies and it'll give you a paragraph on each when you needed a page. Ask it to write a comprehensive report and it'll produce a framework with placeholders instead of substance.

What's happening under the hood: This is what researchers call "operational laziness" or brevity bias. Early LLMs were too verbose, so RLHF training heavily penalized long outputs. The pendulum swung too far. Current models have a systemic bias toward rapid task completion: providing a superficially adequate but shallow response to minimize token expenditure. The model has learned that shorter outputs statistically reduce hallucination risk and safety violations, effectively equating brevity with safety.

The fix: You have to lay out the landscape explicitly. Don't give high-level instructions and let the model decide what matters. Specify the depth you need, the aspects to cover, the minimum level of detail. Use numbered checklists the model must address sequentially. Treat the first output as a draft and use a follow-up prompt to force expansion: "Review your response against these criteria. What did you miss? Expand each section."

Beyond the prompt itself: enable extended thinking at the request level so the model allocates compute to reasoning before it starts generating output, and ask it to return its reasoning as a separate field, which forces it to show its work rather than jumping to the shortest plausible answer. The model won't go deep on its own. You have to make depth the path of least resistance.
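The draft-then-expand loop is easy to automate. This sketch uses naive substring matching to find checklist items the draft never mentions; a real pipeline might use fuzzier matching, but the shape is the same:

```python
def coverage_gaps(output, checklist):
    """Return checklist items the draft never mentions, to feed back
    into a forced-expansion prompt. Naive substring matching; swap in
    whatever matching your pipeline actually uses."""
    text = output.lower()
    return [item for item in checklist if item.lower() not in text]

def expansion_prompt(draft, gaps):
    """Build the follow-up prompt that forces the model to expand
    the specific criteria it skipped."""
    bullets = "\n- ".join(gaps)
    return (
        "Review your draft against these missed criteria and expand "
        f"each one in depth:\n- {bullets}\n\nDRAFT:\n{draft}"
    )
```

Running the check mechanically, instead of eyeballing the draft, catches the "framework with placeholders" failure before it ships.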


8. Irrelevant numbers in your prompt silently skew the output

This is one of the most insidious limitations because it's nearly invisible. If your prompt contains numerical values, even ones completely unrelated to the task, those numbers anchor the model's subsequent judgments.

Feed in a company's revenue ($50M) alongside a request to estimate their sustainability budget, and the model's estimate will be pulled toward that anchor. Show it three case studies with specific metrics before asking it to project a new scenario, and those metrics will bleed into the projection whether they're relevant or not.

What's happening under the hood: This mirrors the well-documented anchoring bias from behavioral economics (Tversky and Kahneman), but with a twist: stronger models are more anchored, not less. GPT-4 is more consistently influenced by irrelevant numbers than GPT-3.5. The bias operates through the attention mechanism: the model's in-context learning causes it to attend directly to numerical tokens, incorporating anchor values through pattern matching rather than reasoning. Most critically, explicitly telling the model to ignore anchors doesn't work. Neither do basic reflection prompting techniques.

The fix: The most effective approach is the "both-anchor" strategy: present anchors from multiple perspectives (high and low) so the model can't fixate on a single value. If you're feeding in data that contains numbers irrelevant to the task, strip them out. And if you're asking for numerical estimates, provide the question before any numerical context, not after. Order matters.
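Stripping task-irrelevant numbers can be mechanical. A rough sketch using a regex redaction pass; the pattern and the `[NUM]` placeholder are illustrative choices, and the `keep` list protects any values the task actually needs:

```python
import re

def strip_numbers(context, keep=()):
    """Redact numeric tokens (including $50M-style figures) from
    task-irrelevant context so they can't act as anchors. Values
    listed in `keep` survive untouched."""
    def redact(match):
        token = match.group(0)
        return token if token in keep else "[NUM]"
    return re.sub(r"\$?\d[\d,.]*[MBK%]?", redact, context)
```

If the revenue figure has no bearing on the budget estimate you're asking for, it shouldn't be in the prompt at all.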


9. It agrees with you even when you're wrong

Sycophancy is the limitation that creates the most blind spots, because it feels like the model is being helpful. You propose a strategy. The model says it's great. You refine it. The model enthusiastically agrees with the refinement. You feel good about the collaboration. But you haven't pressure-tested anything. The model just mirrored your thinking back at you with better formatting.

I've caught this dozens of times during prompt evaluation. The model would validate approaches that, upon independent review, had clear flaws. It would agree with contradictory instructions across different prompts without flagging the inconsistency. It was optimizing for my approval, not for correctness.

What's happening under the hood: Anthropic's own research shows that "matching user beliefs and biases" is one of the most predictive features of what human raters prefer. We literally trained these models to agree with us. Worse, sycophancy exhibits inverse scaling: larger, more capable models are more sycophantic, not less. More RLHF training also increases certain forms of sycophancy. The model will retract correct answers when challenged and admit to mistakes it didn't make. This is an emergent property of the training objective.

The fix: Deliberately role-play devil's advocate. After getting the model's initial response, explicitly prompt it: "Now argue against everything you just said. What are the three strongest reasons this approach would fail?" Run your ideas through multiple refinement rounds where you instruct the model to disagree. A single-pass "what do you think?" is almost worthless. And get a second opinion from a fresh conversation, without the context of the first, because the model's confirmation bias compounds within a single thread.


10. It hallucinates finished work

This is the limitation that can cost you the most. The model doesn't just make up facts. It makes up completed tasks. Ask it to do ten things and it might do seven, then present the output as if all ten are done. Ask it to apply a change across a dataset and it'll modify some entries and silently skip others. Ask it to check its own work and it will confirm everything is complete, even when it isn't.

I've seen this consistently when evaluating complex, multi-step prompt outputs. The model would claim it had processed every item, addressed every requirement, followed every instruction. Review would reveal gaps: skipped entries, missed criteria, sections that were summarized instead of completed.

What's happening under the hood: Research shows that when LLMs attempt to audit their own reasoning, they confirm and validate their initial outputs over 90% of the time, regardless of whether the output is correct. This is a fundamental architectural limitation. Because the model generated the preceding tokens, those tokens dominate the attention matrix, mathematically biasing the model to view its own output as correct. It cannot effectively separate error-detection from generation within the same context. RLHF makes this worse by inflating the model's confidence in its own outputs. Asking "did you miss anything?" is essentially asking the model to contradict its own generation history. It almost never will.

The fix: Never trust the model's self-assessment. Always review the final output manually, with a checklist of requirements, and verify completion independently. For critical work, use a "clean room" approach: take the model's output and pass it to a fresh conversation (or a different model entirely) with instructions to audit against the original requirements. Without the generation history biasing it, the second instance evaluates far more objectively. And build in structural checks: if you asked for ten items, count them yourself.


The common thread

These limitations share a root cause: LLMs are pattern-matching systems optimized for plausible next-token prediction, not for truth, completeness, or depth. They anchor on position because attention is positional. They're sycophantic because agreement scored higher in training. They're lazy because brevity was rewarded. They hallucinate completion because their own output biases their self-assessment.

None of this means LLMs aren't useful. They're extraordinarily useful. But the gap between "useful with guardrails" and "useful on autopilot" is vast, and most of the advice online doesn't prepare you for where the guardrails need to go.

The mitigations above aren't theoretical. They come from iterating on hundreds of real prompt evaluations, watching the same failure modes appear across different tasks and models, and engineering around them one at a time.

The organizations getting the most out of AI aren't the ones with access to the best models. They're the ones who've learned exactly where their models break and built their workflows around it.
