Meta announced today that it's open-sourcing a new AI model called ImageBind. It's a multimodal AI designed to work with six types of data: text, audio, video, 3D, thermal, and motion. ImageBind takes input in any one of these modalities and relates it to the others.
For instance, it can find the sound of waves when given a picture of a beach. When it's fed a photo of a tiger and the sound of a waterfall, the system can produce a video that combines both, Meta CEO Mark Zuckerberg explained on his Instagram broadcast channel. "This is a step towards AIs that understand the world around them more like we do, which will make them a lot more useful and will open up totally new ways to create things," he said.
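To make that idea concrete, the sketch below shows how cross-modal retrieval can work when every modality is projected into one shared embedding space, which is the approach ImageBind takes. The encoder functions here are random stand-ins rather than ImageBind's actual API; the snippet only illustrates the retrieval step.

```python
import numpy as np

# Stand-in encoders: in ImageBind, each modality has its own encoder that
# projects into one shared embedding space. Here they are faked with random
# vectors so the retrieval logic below is runnable on its own.
rng = np.random.default_rng(0)
EMBED_DIM = 1024

def embed_image(path: str) -> np.ndarray:
    return rng.standard_normal(EMBED_DIM)  # placeholder for a vision encoder

def embed_audio(path: str) -> np.ndarray:
    return rng.standard_normal(EMBED_DIM)  # placeholder for an audio encoder

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Cross-modal retrieval: embed an image query, then rank audio clips by how
# close their embeddings are to it in the shared space.
query = embed_image("beach.jpg")
clips = ["waves.wav", "traffic.wav", "birdsong.wav"]
scores = {clip: cosine_similarity(query, embed_audio(clip)) for clip in clips}
print(max(scores, key=scores.get))  # with real encoders, ideally "waves.wav"
```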
Meta explains in a blog post that ImageBind takes an approach similar to how humans gather information from multiple senses and process all of it simultaneously and holistically. In the future, it plans to expand the supported data modes to other senses such as touch, speech, smell, and brain fMRI signals, which it says will enable richer, human-centric AI models.
For reference, existing AI models like OpenAI's DALL-E 2, Midjourney, and Stable Diffusion are trained to link text and images: they take natural-language text prompts as input and generate an image accordingly.
ImageBind has a range of potential applications. For instance, it could improve search across pictures, videos, audio files, and text messages using a combination of text, audio, and image queries. Meta's AI tool Make-A-Scene, which currently generates images from text prompts, could leverage ImageBind to generate images from audio. Meta has published a research paper [PDF] describing its open-source AI model, but it has yet to release a tool or consumer product based on it.
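Mixed-modality search of the kind Meta describes could, in principle, combine the embeddings of a text prompt and an audio clip into a single query vector. The sketch below assumes placeholder encoders that share one embedding space; the function names are hypothetical and not taken from ImageBind's released code.

```python
import numpy as np

rng = np.random.default_rng(1)
EMBED_DIM = 1024

# Placeholder encoders standing in for text, audio, and vision towers that
# all map into the same embedding space.
def embed_text(query: str) -> np.ndarray:
    return rng.standard_normal(EMBED_DIM)

def embed_audio(path: str) -> np.ndarray:
    return rng.standard_normal(EMBED_DIM)

def embed_image(path: str) -> np.ndarray:
    return rng.standard_normal(EMBED_DIM)

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# A mixed-modality query: average the normalized embeddings of a text prompt
# and an audio clip to get one query vector in the shared space.
query = normalize(normalize(embed_text("sunset on a beach")) +
                  normalize(embed_audio("waves.wav")))

# Rank a small image library by cosine similarity to the combined query.
library = ["beach.jpg", "city.jpg", "forest.jpg"]
ranked = sorted(library, key=lambda p: -float(normalize(embed_image(p)) @ query))
print(ranked)  # with real encoders, beach.jpg should rank first
```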