Microsoft has officially announced the third version of its Turing Bletchley multilingual vision-language foundation model. It's now being rolled out to a number of Microsoft's products, including Bing for improvements in image searches.
Microsoft launched the first version of the Turing Bletchley model back in November 2021. In a post today on the official Bing blog, Microsoft said it started testing for the third version of the model in the fall of 2022 before adding it to Bing and other products.
The model uses both text and image inputs to find what a person is looking for on Microsoft's Bing search engine. The goal is for a text query such as "a dog eating ice cream" to return images in the search results that match that description as closely as possible.
Part of the way Turing Bletchley v3 makes these connections is through extensive pretraining of the model. Microsoft states:
Given an image and a caption describing the image, some words in the caption are masked. A neural network is then trained to predict the hidden words conditioned on both the image and the text. The task can also be flipped to mask out pixels instead of words. This type of masked training together with a large transformer-based model leads to a strong pre-trained model which can be finetuned on a diverse set of downstream tasks.
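To make the masking step concrete, here is a minimal, illustrative sketch of the caption-masking idea described above. This is not Microsoft's implementation; the tokenization, mask token, and masking rate are all assumptions for demonstration, and in the real model a transformer would then predict the hidden words conditioned on both the image and the remaining text.

```python
import random

# Hypothetical placeholder token; real systems use a tokenizer-specific mask.
MASK_TOKEN = "[MASK]"

def mask_caption(tokens, mask_prob=0.3, rng=None):
    """Hide a random fraction of caption tokens, BERT-style.

    Returns the masked token sequence and a dict of {position: original token}
    that a model would be trained to predict from the image and the
    surrounding text.
    """
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

caption = "a dog eating ice cream".split()
masked, targets = mask_caption(caption)
print(masked, targets)
```

The same scheme can be flipped, as the quote notes, by masking image patches instead of words and training the network to reconstruct the hidden pixels.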
In addition to powering image searches in Bing, the new Turing Bletchley v3 model is being used for content moderation on Microsoft's Xbox gaming service. It helps that team identify images and videos uploaded by Xbox players to their profiles that would be considered inappropriate and in violation of the company's community standards on the Xbox platform.