A new AI translation system for headphones clones multiple voices simultaneously




Spatial Speech Translation consists of two AI models, the first of which divides the space surrounding the person wearing the headphones into small regions and uses a neural network to search for potential speakers and pinpoint their direction. 
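
The article doesn't reproduce the team's code, but the search-over-regions idea is easy to sketch. In the toy version below, `speaker_probability` stands in for the neural network, and the delay-and-sum beamformer, the 16-region angular grid, and all parameter values are illustrative assumptions rather than details from the paper.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at room temperature

def delay_and_sum(mic_audio, angle_deg, mic_spacing=0.15, sample_rate=16000):
    """Crude delay-and-sum beamformer for a two-microphone (binaural) array.

    mic_audio: float array of shape (2, num_samples).
    Returns a mono signal emphasizing sound arriving from angle_deg.
    """
    delay_sec = mic_spacing * np.sin(np.radians(angle_deg)) / SPEED_OF_SOUND
    shift = int(round(delay_sec * sample_rate))
    left, right = mic_audio
    return 0.5 * (left + np.roll(right, -shift))

def locate_speakers(mic_audio, speaker_probability, num_regions=16, threshold=0.5):
    """Scan a grid of angular regions around the wearer and return the
    directions (in degrees) where the model judges a speaker to be present."""
    directions = []
    for i in range(num_regions):
        angle = i * 360.0 / num_regions               # center of this region
        steered = delay_and_sum(mic_audio, angle)     # emphasize this direction
        if speaker_probability(steered) > threshold:  # neural "is someone here?"
            directions.append(angle)
    return directions
```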

The second model then translates the speakers’ words from French, German, or Spanish into English text using publicly available data sets. The same model extracts the unique characteristics and emotional tone of each speaker’s voice, such as the pitch and the amplitude, and applies those properties to the text, essentially creating a “cloned” voice. This means that when the translated version of a speaker’s words is relayed to the headphone wearer a few seconds later, it sounds as if it’s coming from the speaker’s direction and the voice sounds a lot like the speaker’s own, not a robotic-sounding computer.
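
As a rough sketch of that pipeline's shape (every function name here is a hypothetical stand-in; only the order of operations comes from the article's description):

```python
def translate_preserving_voice(speech, direction_deg,
                               speech_to_english_text,   # translation model
                               extract_voice_profile,    # pitch, amplitude, tone
                               synthesize_cloned_voice,  # expressive TTS
                               render_binaural):         # spatial audio output
    """Translate to English text, capture the speaker's vocal characteristics,
    re-synthesize the text in a cloned voice, and play it back so that it
    appears to come from the speaker's direction."""
    english_text = speech_to_english_text(speech)        # e.g. German -> English
    profile = extract_voice_profile(speech)              # pitch, amplitude, emotion
    cloned_audio = synthesize_cloned_voice(english_text, profile)
    return render_binaural(cloned_audio, direction_deg)  # sounds like the speaker,
                                                         # from where they stand
```

Note that in the system the article describes, a single model handles both the translation and the voice-characteristic extraction; the sketch separates them only for readability.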

Given that separating out human voices is hard enough for AI systems, being able to incorporate that ability into a real-time translation system, map the distance between the wearer and the speaker, and achieve decent latency on a real device is impressive, says Samuele Cornell, a postdoctoral researcher at Carnegie Mellon University’s Language Technologies Institute, who did not work on the project.

“Real-time speech-to-speech translation is incredibly hard,” he says. “Their results are very good in the limited testing settings. But for a real product, one would need much more training data—possibly with noise and real-world recordings from the headset, rather than purely relying on synthetic data.”

Gollakota’s team is now focusing on reducing the amount of time it takes for the AI translation to kick in after a speaker says something, which will accommodate more natural-sounding conversations between people speaking different languages. “We want to really get down that latency significantly to less than a second, so that you can still have the conversational vibe,” Gollakota says.

This remains a major challenge, because the speed at which an AI system can translate one language into another depends on the languages’ structure. Of the three languages Spatial Speech Translation was trained on, the system was quickest to translate French into English, followed by Spanish and then German—reflecting how German, unlike the other languages, places a sentence’s verbs and much of its meaning at the end and not at the beginning, says Claudio Fantinuoli, a researcher at the Johannes Gutenberg University of Mainz in Germany, who did not work on the project. 

Reducing the latency could make the translations less accurate, he warns: “The longer you wait [before translating], the more context you have, and the better the translation will be. It’s a balancing act.”
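
Wait-k decoding, a standard policy from the simultaneous-translation literature, makes this balancing act concrete. There is no indication that Spatial Speech Translation itself uses it, so treat the sketch below purely as an illustration of the latency-versus-context trade-off:

```python
def wait_k_translate(source_tokens, translate_next_token, k):
    """Toy wait-k streaming policy: wait for k source tokens before emitting
    the first target token, then emit one target token per source token read.
    A larger k gives the model more context, which matters for verb-final
    languages like German, but every extra token waited adds latency."""
    target = []
    for num_read in range(1, len(source_tokens) + 1):
        if num_read >= k:  # enough lookahead accumulated
            target.append(translate_next_token(source_tokens[:num_read], target))
    # Once the source sentence has ended, a real decoder would keep emitting
    # tokens until it produces end-of-sentence; elided here for brevity.
    return target
```

With k=1 the system starts speaking almost immediately but may have to guess a German verb it hasn't heard yet; with a large k it effectively waits for the whole clause, gaining accuracy at the cost of conversational flow.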
