Vall-E AI can recreate any voice with just a 3-second sample

Microsoft has developed a new AI tool that it says is able to imitate any voice with just a three-second audio sample.

The tool is called Vall-E. Trained on 60,000 hours of English speech data, it can recreate a speaker’s voice and tone whilst maintaining emotion, according to the researchers’ paper, published on arXiv, the preprint server hosted by Cornell University.

“Vall-E emerges in-context learning capabilities and can be used to synthesise high-quality personalised speech with only a three-second enrolled recording of an unseen speaker as an acoustic prompt,” said researchers.

“Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot [text to speech] system in terms of speech naturalness and speaker similarity.

“In addition, we find Vall-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

Microsoft has released a range of samples on GitHub showing how realistic the AI’s recreation of speech is.

Samples vary in quality; however, some are eerily accurate, making it very hard to tell the speaker prompt apart from the speech generated by Vall-E.

Whilst the AI is not yet available to the public and is being used only for research purposes, technology like this carries obvious security risks.

For example, combined with deepfake footage, it could be used to create a phony video of a person saying something they never said, potentially ruining their career or, even worse, compromising security if the target is a government official or a nation’s leader.

Cyber criminals could also use it to recreate a potential victim’s voice, allowing them to steal identities or enter into verbal agreements over the phone.

Vall-E, despite being unavailable to the public, comes with a warning that addresses this issue.

“Since Vall-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” said researchers on the Vall-E website.

“We conducted the experiments under the assumption that the user agree [sic] to be the target speaker in speech synthesis.

“When the model is generalised to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesised speech detection model.”

Concern that AI will make cyber attacks easier for bad actors is growing.

Researchers recently discovered that OpenAI’s ChatGPT AI could be used to write malicious code and phishing emails.

Similarly, experts at Microsoft and the Universities of California and Virginia have found ways to train AI coding assistants to suggest malicious code to developers, which could lead to vulnerabilities and major supply chain attacks.

Daniel Croft

Born in the heart of Western Sydney, Daniel Croft is a passionate journalist with an understanding of, and experience writing in, the technology space. Having studied at Macquarie University, he joined Momentum Media in 2022, writing across a number of publications including Australian Aviation, Cyber Security Connect and Defence Connect. Outside of writing, Daniel has a keen interest in music and spends his time playing in bands around Sydney.
