Computer vision is a field that has been researched extensively over the past few decades, primarily because of its immediate and obvious applications in building autonomous vehicles and other tools that can "see" the world as humans do. However, one area that had not seen this level of research until recently is the use of sound, rather than sight, to model an environment. Now, researchers at the Massachusetts Institute of Technology (MIT) have penned a research paper on the construction of a machine learning (ML) model trained in this domain.
A blog post over on the MIT News website describes how researchers at MIT and the MIT-IBM Watson AI Lab have collaborated to build an ML model that uses spatial acoustics to see and model the environment. Simply stated, the model maps an environment by figuring out how a listener would hear a sound that originates at one point and propagates to different positions.
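Conceptually, such a model can be thought of as a learned function that takes a sound source position and a listener position and returns the acoustic response heard at that listener. The following is a minimal, hypothetical sketch of that interface in PyTorch; the class name, network shape, and spectrogram dimensions are illustrative choices rather than details from the paper:

```python
import torch
import torch.nn as nn

class AcousticResponseModel(nn.Module):
    """Hypothetical stand-in for a model that maps an (emitter, listener)
    position pair to a short spectrogram of the acoustic response."""

    def __init__(self, spec_bins: int = 64, spec_frames: int = 32):
        super().__init__()
        self.spec_bins = spec_bins
        self.spec_frames = spec_frames
        # 2D emitter position + 2D listener position -> flattened spectrogram
        self.net = nn.Sequential(
            nn.Linear(4, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, spec_bins * spec_frames),
        )

    def forward(self, emitter_xy: torch.Tensor, listener_xy: torch.Tensor) -> torch.Tensor:
        # Concatenate the two positions and predict a spectrogram for that pair.
        x = torch.cat([emitter_xy, listener_xy], dim=-1)
        return self.net(x).view(-1, self.spec_bins, self.spec_frames)

# Query: what would a sound emitted at (1.0, 2.0) sound like to a listener at (3.5, 0.5)?
model = AcousticResponseModel()
spec = model(torch.tensor([[1.0, 2.0]]), torch.tensor([[3.5, 0.5]]))
print(spec.shape)  # torch.Size([1, 64, 32])
```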
There are numerous benefits to this technique, since it allows the underlying 3D geometry of objects in an environment to be determined using sound alone; accurate visuals can then be rendered to reconstruct the environment. Potential applications include virtual and augmented reality, along with augmenting AI agents so they can use both sound and sight to better visualize their environments. For example, an underwater exploration robot could use acoustics to determine the location of certain objects better than it could with computer vision alone.
The researchers have emphasized that building this ML model based on sound was considerably more complex than building one based on computer vision. This is because computer vision models leverage a property called photometric consistency, which means that an object looks roughly the same when viewed from different angles. This does not apply to sound: depending upon your location and any obstacles in between, what you hear from a source can be highly variable.
In order to tackle this problem, the researchers used two other properties called reciprocity and local geometry. The former basically means that even if you swap the locations of the speaker and the listener, the sound that is heard will be exactly the same. Meanwhile, the local geometry property was incorporated by combining it with reciprocity in a neural acoustic field (NAF) that captures objects and other architectural components. The results are quite fascinating.
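One simple way to build reciprocity into such a model, sketched here as a toy example rather than the paper's actual NAF architecture, is to encode the speaker and listener positions with the same network and combine the encodings symmetrically, so that swapping the two inputs cannot change the prediction:

```python
import torch
import torch.nn as nn

class ReciprocalNAF(nn.Module):
    """Toy neural acoustic field that is reciprocal by construction:
    speaker and listener positions are combined symmetrically, so
    swapping them yields exactly the same predicted response."""

    def __init__(self, hidden: int = 256, out_dim: int = 1024):
        super().__init__()
        self.point_enc = nn.Sequential(nn.Linear(2, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, speaker_xy: torch.Tensor, listener_xy: torch.Tensor) -> torch.Tensor:
        a = self.point_enc(speaker_xy)
        b = self.point_enc(listener_xy)
        # Summing the two encodings is order-invariant: f(s, l) == f(l, s).
        return self.head(a + b)

naf = ReciprocalNAF()
s = torch.tensor([[0.0, 1.0]])
l = torch.tensor([[4.0, 2.5]])
assert torch.allclose(naf(s, l), naf(l, s))  # reciprocity holds by design
```

The published model likely encodes local geometry on top of this, for example by conditioning on features of the scene near the listener, which the toy sketch above leaves out.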
To get the ML model working in test environments, it needs to be fed some visual information along with spectrograms containing samples of what the audio would sound like at specified locations for the sound source and the listener. Given these inputs, the model can accurately determine how the sound will change as the listener moves around the environment.
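As a rough, hypothetical illustration of that setup (the sample rate, coordinates, and spectrogram settings below are arbitrary, and AcousticResponseModel is the illustrative stand-in defined in the earlier sketch), an audio clip can be turned into a spectrogram target for a known source/listener pair, and the trained model then queried at new listener positions:

```python
import torch
import torchaudio

# Synthesize a short stand-in clip (1 s of noise) in place of a real recording
# made at a known source/listener pair; a real dataset would use measured audio.
sample_rate = 16000
waveform = torch.randn(1, sample_rate)

# Convert the audio into a spectrogram that serves as the training target.
spec_transform = torchaudio.transforms.Spectrogram(n_fft=1024, hop_length=256)
target_spec = torch.log1p(spec_transform(waveform))

emitter_xy = torch.tensor([[1.0, 2.0]])   # where the sound originated
listener_xy = torch.tensor([[3.5, 0.5]])  # where this clip was "recorded"

# After training on many such (positions, spectrogram) pairs, the model can be
# queried to predict how the sound changes as the listener moves around.
model = AcousticResponseModel()  # hypothetical model from the earlier sketch
for x in (0.5, 1.5, 2.5, 3.5):
    predicted = model(emitter_xy, torch.tensor([[x, 0.5]]))
    print(f"listener at ({x}, 0.5): predicted spectrogram {tuple(predicted.shape)}")
```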
The research paper's lead author, Andrew Luo, noted that:
If you imagine standing near a doorway, what most strongly affects what you hear is the presence of that doorway, not necessarily geometric features far away from you on the other side of the room. We found this information enables better generalization than a simple fully connected network.
Moving forward, the researchers want to further enhance the model so it can visualize bigger and more complex environments such as a building or even an entire city. In the meantime, you can read their research paper here.