According to Aidan Gomez, CEO of AI firm Cohere, synthetic data is already being used to train AI models. With platforms like Reddit and Twitter charging exorbitant fees for access to their data, AI firms such as Microsoft, OpenAI, and Cohere are turning to synthetic data instead.
Gomez revealed that synthetic data is already used extensively, even if that use isn't broadcast widely. One example he gave: to train a model in advanced mathematics, a firm can set up two AI models playing the parts of teacher and student, have them discuss a topic such as trigonometry, and have a human observer correct the conversation if anything is stated incorrectly.
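As a rough sketch of the pattern Gomez describes, the Python below pairs two model roles in a loop and leaves the final say to a human reviewer. The `generate` function is a hypothetical stand-in for whatever LLM API a firm actually uses; none of the names here come from Cohere itself.

```python
# Minimal sketch of teacher/student synthetic data generation.
# `generate` is a hypothetical placeholder for a real LLM API call.

def generate(role_prompt: str, conversation: list[str]) -> str:
    """Placeholder: in practice this would call an LLM with the role
    prompt plus the conversation so far and return the next turn."""
    return f"[{role_prompt.split(':')[0]} reply, turn {len(conversation) + 1}]"

def synthesize_dialogue(topic: str, turns: int = 4) -> list[str]:
    """Alternate between a teacher role and a student role."""
    teacher = f"Teacher: explain {topic} step by step"
    student = f"Student: ask clarifying questions about {topic}"
    conversation: list[str] = []
    for i in range(turns):
        role = teacher if i % 2 == 0 else student
        conversation.append(generate(role, conversation))
    return conversation

def human_review(conversation: list[str]) -> list[str]:
    """A human observer keeps, edits, or drops each turn before the
    dialogue is used as training data."""
    kept: list[str] = []
    for turn in conversation:
        answer = input(f"Keep this turn? [y/n/e]\n{turn}\n> ")
        if answer == "y":
            kept.append(turn)
        elif answer == "e":
            kept.append(input("Corrected turn: "))
    return kept

if __name__ == "__main__":
    dialogue = synthesize_dialogue("trigonometry")
    training_sample = human_review(dialogue)
    print("\n".join(training_sample))
```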
While synthetic data has been used to train models and has been the focus of several research papers, models are still mainly trained on data scraped from the internet, including digital books, news articles, blogs, social media, Flickr, and more. Humans then give feedback and fill gaps in the information through reinforcement learning from human feedback (RLHF).
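For context on what that human-feedback step looks like: annotators typically compare pairs of model outputs, and those preferences are used to train a reward model that guides fine-tuning. The sketch below covers only the preference-collection step, with illustrative names rather than any lab's real pipeline.

```python
# Minimal sketch of the human-preference step behind RLHF.
# The names here are illustrative, not a real library's API.

from dataclasses import dataclass

@dataclass
class Preference:
    prompt: str
    chosen: str    # response the human preferred
    rejected: str  # response the human ranked lower

def collect_preference(prompt: str, response_a: str, response_b: str) -> Preference:
    """A human annotator picks the better of two model responses."""
    pick = input(f"Prompt: {prompt}\nA) {response_a}\nB) {response_b}\nBetter? [a/b] ")
    if pick.lower() == "a":
        return Preference(prompt, response_a, response_b)
    return Preference(prompt, response_b, response_a)

# Pairs like these train a reward model, which in turn guides
# reinforcement-learning fine-tuning of the base model.
```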
This approach carries risks, including potential copyright infringement and privacy violations that could land companies in legal trouble.
The Financial Times pointed to an interesting Microsoft Research paper, 'Textbooks Are All You Need', which showed that a coding model trained on textbook-quality data performed surprisingly well on coding tasks. Similar results hold for language: a model trained on simple words and sentences can go on to produce fluent, grammatically correct stories.
Of course, while creating synthetic data to train models could lead to breakthroughs, companies also have to be careful: training on poor-quality synthetic data can degrade model performance over time.
Coupled with chain-of-thought techniques being developed by the likes of OpenAI and Anthropic to reduce hallucinations, synthetic data could help AI systems tackle a wider range of problems.
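For readers unfamiliar with chain-of-thought prompting, the core idea is simply to ask the model to reason step by step before answering. The snippet below shows the prompt shape; `ask_llm` is a hypothetical placeholder for any chat-completion API.

```python
# Illustrative chain-of-thought prompt shape; `ask_llm` is a
# hypothetical stand-in for a real chat-completion API.

def ask_llm(prompt: str) -> str:
    """Placeholder: would send the prompt to a model and return text."""
    return "(model response)"

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Direct prompt: the model answers in one shot.
direct = ask_llm(question)

# Chain-of-thought prompt: the model is asked to show intermediate
# reasoning, which tends to reduce arithmetic and logic errors.
cot = ask_llm(question + "\nLet's think step by step, then state the final answer.")

print(direct)
print(cot)
```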
Source: Financial Times