ChatGPT took the tech world by storm when it arrived in the final months of 2022. The launch was big enough to shake things up under Google's roof, and the search giant came up with its own generative AI offering. While ChatGPT doesn't suggest you should add glue to your pizza, the do-it-all chatbot isn't perfect and can make mistakes.
One of the tasks ChatGPT can handle is writing code snippets from user prompts. OpenAI has trained a GPT-4-based AI model called CriticGPT to find errors in the code output provided by the chatbot. It writes critiques that highlight inaccuracies in ChatGPT's answers. The model is being used internally, and OpenAI has published a research paper describing it in detail.
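To picture what that means in practice, here is a hypothetical example (not taken from OpenAI's paper) of the kind of subtle bug CriticGPT is designed to flag, along with the style of critique it produces:

```python
# Hypothetical ChatGPT-style answer to "write a function that returns
# the median of a list" containing a subtle bug:

def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    # Bug: for even-length lists, the median should be the average of
    # the two middle elements, not a single element.
    return ordered[mid]

# A CriticGPT-style critique points at the exact spot, e.g.:
# "The function returns one middle element for even-length inputs;
#  it should average ordered[mid - 1] and ordered[mid]."
```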
CriticGPT is meant to assist the human AI trainers whose job is to train and improve GPT-4's responses using a technique called Reinforcement Learning from Human Feedback (RLHF), which involves trainers rating different ChatGPT responses against each other.
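In broad strokes, those pairwise ratings become training data for a reward model that learns to score the preferred response above the rejected one. Here is a minimal sketch of the standard pairwise preference loss used in RLHF reward modeling; the numeric rewards are stand-ins for illustration, not OpenAI's actual setup:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry) loss: minimized when the reward model
    scores the human-preferred response above the rejected one."""
    # Equivalent to -log(sigmoid(reward_chosen - reward_rejected))
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correct ranking yields a small loss; a wrong ranking yields a large one:
print(preference_loss(2.0, -1.0))  # ~0.049
print(preference_loss(-1.0, 2.0))  # ~3.049
```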
However, the job is getting harder for the AI trainers as ChatGPT becomes more accurate and its mistakes grow more subtle. "This is a fundamental limitation of RLHF, and it may make it increasingly difficult to align models as they gradually become more knowledgeable than any person that could provide feedback," OpenAI said.
CriticGPT comes into the picture to save the day, but it's still an AI model, and its responses aren't always correct. It's also susceptible to problems like hallucination. Even so, the model can help humans point out errors better than they do on their own.
OpenAI said that "a second random trainer preferred critiques from the Human+CriticGPT team over those from an unassisted person more than 60% of the time." CriticGPT was itself trained using RLHF and tasked with analyzing and critiquing a large number of inputs that contained mistakes.
The model had to find both mistakes deliberately inserted by humans and "naturally occurring" ChatGPT bugs previously caught by a trainer.
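A toy illustration of the first kind of training data, where a trainer takes a correct snippet and plants a subtle, documented bug for CriticGPT to find (the specific edit and record format here are invented for illustration, not OpenAI's tooling):

```python
# A correct snippet a trainer starts from:
correct_snippet = "def is_even(n):\n    return n % 2 == 0\n"

# Tampered version: flipping the comparison creates a subtle logic bug
# (is_even now returns True for odd numbers).
tampered_snippet = correct_snippet.replace("== 0", "== 1")

# The trainer records where the bug was inserted, so CriticGPT's
# critiques can later be checked against the known ground truth:
training_example = {
    "code": tampered_snippet,
    "bug_description": "is_even returns True for odd numbers",
}
print(training_example["code"])
```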
That said, there are a few limitations OpenAI is currently working to eliminate. CriticGPT was trained on short ChatGPT answers, and new methods will need to be developed to help trainers understand long and complex tasks. Hallucinations also carry consequences: trainers who see them might make labeling mistakes.
Currently, CriticGPT is best at catching errors that can be pointed out in one specific place in ChatGPT's responses. OpenAI notes that real-world mistakes can spread across many parts of an answer, something it needs to tackle in the future.