Google’s AI Overviews are becoming more accurate, but a new analysis suggests the improvement still leaves a problem too large to ignore. According to a review highlighted by The New York Times, the Gemini-powered feature now answers factual questions correctly about 90% of the time. On the surface, that sounds like progress. In practice, however, it means roughly 1 in 10 answers can still be wrong, a miss rate that becomes far more serious once scaled to the enormous volume of Google searches performed every day.
That is the real tension behind the latest findings. AI Overviews may no longer be failing in the spectacular ways that defined some of the criticism after launch, but they are still fallible in a setting where users often expect certainty. Search occupies a very different place in users' expectations than a chatbot does. When an answer appears at the top of Google, many people are likely to treat it as authoritative, even when the system itself quietly warns that mistakes are possible.
The result is a more complicated phase for Google’s AI search experiment. The technology is clearly getting better, but the standard for success is much higher when even a small error rate can translate into millions of inaccurate responses.
Accuracy is improving, but not enough to remove risk
The analysis cited in the report was conducted with help from Oumi, a startup involved in AI model development, using a benchmark known as SimpleQA. This test includes thousands of factual questions with verifiable answers and is designed to measure how reliably generative AI systems respond to straightforward prompts.
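As a rough illustration of how a SimpleQA-style evaluation works, the sketch below grades a model's answers against reference answers and reports an accuracy score. This is not Oumi's actual harness: the `ask_model` stub and the substring grader are simplifications, and published harnesses typically use an LLM grader that labels each response correct, incorrect, or not attempted.

```python
# Illustrative sketch of a SimpleQA-style accuracy check, not Oumi's actual
# harness. `ask_model` stands in for whatever system is under evaluation.

def normalize(text: str) -> str:
    """Lowercase and drop punctuation so formatting differences aren't scored as errors."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

def evaluate(benchmark: list[tuple[str, str]], ask_model) -> float:
    """Return the fraction of answers that contain the reference fact.

    Real harnesses usually have an LLM grader judge each response rather
    than relying on substring matching; matching keeps this sketch
    self-contained and runnable.
    """
    correct = sum(
        normalize(reference) in normalize(ask_model(question))
        for question, reference in benchmark
    )
    return correct / len(benchmark)

# Hypothetical usage with a hard-coded stand-in model:
sample = [("In what year did the Eiffel Tower open?", "1889")]
print(evaluate(sample, ask_model=lambda q: "It opened to the public in 1889."))  # 1.0
```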
When Oumi first ran the benchmark last year, while Gemini 2.5 was still Google’s leading model, AI Overviews scored about 85% accuracy. After the rollout of Gemini 3, the result reportedly improved to 91%. That is a noticeable gain and suggests Google has made meaningful technical progress since the feature’s rocky 2024 debut.
Even so, the improvement does not resolve the core issue. A system that is wrong 9% or 10% of the time may sound strong by AI standards, but it remains problematic when presented as a fast, polished summary layer on top of search. The larger the search volume, the more those errors accumulate into a steady stream of misinformation.
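A back-of-the-envelope calculation makes the scaling problem concrete. The daily volume below is a purely hypothetical placeholder, since neither the report nor Google says how many searches actually trigger an overview:

```python
# Back-of-the-envelope arithmetic with an explicitly assumed input: the number
# of searches that surface an AI Overview each day is not public.
assumed_overviews_per_day = 1_000_000_000  # hypothetical volume, for illustration only
error_rate = 0.09                          # roughly the miss rate in the report

wrong_per_day = assumed_overviews_per_day * error_rate
print(f"{wrong_per_day:,.0f} incorrect answers per day")  # 90,000,000
```

Even if the real volume were an order of magnitude lower, the arithmetic still lands in the millions of wrong answers per day.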
Examples show how AI can still fail with confidence
The report highlights several cases that show why the remaining error rate matters. In one example, AI Overviews was asked when Bob Marley’s former home became a museum. It cited three sources, but two did not address the date, and the third, Wikipedia, contained conflicting years. The system selected the wrong answer anyway.
In another example, it was asked about the date on which Yo-Yo Ma was inducted into the Classical Music Hall of Fame. Despite citing a source that listed the induction, the overview reportedly answered that no such institution existed. These are not vague interpretive errors. They are factual mistakes delivered in the confident style that makes generative AI especially misleading when it goes wrong.
This remains one of the central problems with AI summaries in search. They do not merely present raw links for users to inspect. They condense and assert, which means a mistake can arrive already packaged as a conclusion.
Google disputes the benchmark itself
Google has pushed back against the findings, arguing that the benchmark used in the study does not accurately reflect what people are actually searching for. The company says it prefers a related evaluation method called SimpleQA Verified, which uses a smaller and more heavily checked set of questions. In Google’s view, the test used in the report contains flaws and possibly incorrect underlying data.
That response points to a wider problem in AI evaluation. Benchmarks are now a battlefield of their own, with companies favoring the tests that best support their claims and critics relying on outside measures that may expose different weaknesses. Because generative AI systems can produce different answers to the same question across repeated runs, and because some evaluations themselves rely on AI tools, certainty is hard to achieve.
So while the dispute over methodology is real, it does not eliminate the more obvious conclusion: measuring factuality in AI remains messy, and no single benchmark is likely to settle the issue completely.
AI Overviews are not one thing, and that matters
Another complication is that AI Overviews do not rely on a single model for every search. Google has said it uses the model best suited to a given query, which means the system may shift between more powerful but slower options and faster, lighter models optimized for speed and cost. That helps explain why performance can feel uneven.
In theory, the most advanced Gemini models could provide stronger answers more consistently. In practice, search has to operate at massive scale and near-instant speed. That means Google often leans on faster Gemini Flash variants rather than always using the most capable model available. The trade-off is clear: lower latency and lower cost may come at the expense of reliability.
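A toy sketch of that kind of per-query routing is shown below. The model names and the complexity heuristic are invented for illustration; Google has not published how its routing actually decides.

```python
# Toy sketch of per-query model routing of the kind described above. The model
# names and the heuristic are invented; Google's real routing logic is not public.

FAST_MODEL = "flash-style-model"    # low latency, low cost, less capable
STRONG_MODEL = "pro-style-model"    # slower, costlier, more capable

def route(query: str) -> str:
    """Send short factual lookups to the fast model and harder-looking queries
    to the strong one. The word-count and keyword test is purely illustrative."""
    hard_signals = ("why", "compare", "explain", "analyze")
    if len(query.split()) > 12 or any(w in query.lower() for w in hard_signals):
        return STRONG_MODEL
    return FAST_MODEL

print(route("bob marley museum opening year"))                        # flash-style-model
print(route("compare gemini 3 and gemini 2.5 on factual accuracy"))   # pro-style-model
```

Under a design like this, speed and cost win by default, and the extra capability of the stronger model is spent only when a heuristic decides a query needs it, which is exactly the unevenness described above.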
For users, that technical architecture is mostly invisible. What they see is one answer box at the top of the page. But behind that box is a system balancing speed, expense and factual accuracy in ways that can shape the trustworthiness of the final result.
The bigger issue is how people use the answers
Perhaps the most important point is not whether AI Overviews are right 85%, 90% or 91% of the time. It is how users respond to them. Traditional Google search encouraged people to evaluate sources manually through blue links. AI Overviews change that habit by offering a ready-made summary that many users may accept without checking the underlying pages.
That makes even a relatively low error rate more dangerous than it might seem. The correct information may still exist in the cited sources, but the overview layer can discourage users from doing the work of verification. In effect, it makes convenience compete directly with skepticism.
Google itself acknowledges this in the small disclaimer that appears beneath the feature: AI can make mistakes, so responses should be double-checked. That warning is honest, but it also captures the contradiction at the center of the product. The feature is designed to save users time by summarizing the web for them, while also asking those same users to verify the summary themselves. As long as that tension remains unresolved, better accuracy alone will not fully settle the debate over whether AI Overviews improve search or simply make its mistakes more persuasive.
