One of the most criticized behaviors of AI-powered chatbots is so-called hallucination, where the AI convincingly answers a question while providing factually incorrect information. Simply put, the artificial intelligence makes things up in an attempt to satisfy its user.
This is less of an issue in tools that use generative AI to create pictures or videos. Indeed, renowned expert Andrej Karpathy, who only recently departed from OpenAI, went so far as to say that the ability to hallucinate is the biggest feature of large language models (LLMs), the technology underlying generative AI.
However, hallucinations are a big no-no in text-focused, LLM-based chatbots, where users expect the information they receive to be factually accurate.
Preventing AI from hallucinating is a technological challenge – and not an easy one. It seems, though, that Google DeepMind and Stanford University have found a workaround of some sort, as reported by Marktechpost.com.
The researchers came up with an LLM-based system – Search-Augmented Factuality Evaluator, or SAFE – that essentially fact-checks long-form responses generated by AI chatbots. Their findings are available as a preprint on arXiv along with all the experimental code and datasets.
The system analyzes, processes, and evaluates an answer in four steps to verify its accuracy and factuality. First, SAFE splits the answer into individual facts; it then revises each fact so it stands on its own, checks whether the fact is relevant to the original question, and finally verifies each relevant fact against Google Search results.
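To make that four-step loop concrete, here is a minimal Python sketch of a SAFE-style checker. The helper callables `llm` and `search`, the prompts, and the `FactVerdict` structure are illustrative assumptions for this article, not the researchers' actual implementation (which is published alongside the preprint).

```python
# Minimal sketch of a SAFE-style fact-checking pipeline (not the authors' code).
# `llm` and `search` are placeholders supplied by the caller: an LLM call and a
# Google Search lookup, respectively. Both are assumptions for illustration.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FactVerdict:
    fact: str
    relevant: bool
    supported: bool


def safe_check(prompt: str,
               response: str,
               llm: Callable[[str], str],
               search: Callable[[str], str]) -> List[FactVerdict]:
    """Evaluate a long-form response in four LLM-driven steps."""
    # Step 1: split the response into individual facts.
    facts = llm(
        f"List each individual fact in the following text, one per line:\n{response}"
    ).splitlines()

    verdicts: List[FactVerdict] = []
    for fact in filter(None, (f.strip() for f in facts)):
        # Step 2: revise the fact so it is self-contained (resolve pronouns, add context).
        revised = llm(
            f"Rewrite this fact so it stands alone, given the full text.\n"
            f"Fact: {fact}\nText: {response}"
        )

        # Step 3: check whether the fact is relevant to the original question.
        relevant = "yes" in llm(
            f"Is this fact relevant to answering '{prompt}'? Answer yes or no.\nFact: {revised}"
        ).lower()
        if not relevant:
            verdicts.append(FactVerdict(revised, relevant=False, supported=False))
            continue

        # Step 4: verify the fact against Google Search results.
        evidence = search(revised)
        supported = "supported" in llm(
            f"Given the search results below, is the fact supported or not supported?\n"
            f"Fact: {revised}\nResults: {evidence}"
        ).lower()
        verdicts.append(FactVerdict(revised, relevant=True, supported=supported))

    return verdicts
```

A caller would plug in its own model and search backends, then tally how many facts come back supported versus unsupported to score the response.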
To evaluate the performance of SAFE, the researchers created LongFact, a set of fact-seeking prompts, and tested the system on roughly 16,000 individual facts drawn from the responses of 13 LLMs spanning four model families (Claude, Gemini, GPT, PaLM-2).
In 72% of cases, SAFE provided the same results as human annotators. In cases of disagreement, SAFE was correct 76% of the time.
On top of that, the researchers claim that using SAFE is 20 times cheaper than relying on human annotators or fact-checkers, making it an economically viable solution that, ambitiously, can be applied at scale.