Cyber Daily


Vall-E AI can recreate any voice with just a 3-second sample

Microsoft has developed a new AI tool that it says is able to imitate any voice with just a three-second audio sample.

Daniel Croft
Fri, 13 Jan 2023

The tool is called Vall-E. Trained on 60,000 hours of English speech data, it can recreate a speaker’s voice and tone whilst maintaining emotion, according to the Microsoft researchers behind it, whose paper was published on arXiv, the preprint server hosted by Cornell University.

“Vall-E emerges in-context learning capabilities and can be used to synthesise high-quality personalised speech with only a three-second enrolled recording of an unseen speaker as an acoustic prompt,” said researchers.

“Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot [text to speech] system in terms of speech naturalness and speaker similarity.


“In addition, we find Vall-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

Microsoft has published a range of samples on GitHub that show how realistic the AI’s recreation of speech is.

The samples vary in quality; however, some are eerily accurate, making it very hard to tell the speaker prompt apart from the speech created by Vall-E.

Whilst the AI is not yet available to the public and is being used only for research purposes, technology like this carries obvious security risks.

For example, combined with deepfake footage, it could be used to create a phony video of a person saying something they never said, potentially ruining their career or, even worse, compromising security if the target is a government official or a nation’s leader.

Cyber criminals could also use it to recreate a potential victim’s voice, allowing them to steal identities or enter into verbal agreements over the phone in the victim’s name.

Vall-E, despite being unavailable to the public, comes with a warning that addresses this issue.

“Since Vall-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” said researchers on the Vall-E website.

“We conducted the experiments under the assumption that the user agree [sic] to be the target speaker in speech synthesis.

“When the model is generalised to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesised speech detection model.”

Concern is growing that AI will make cyber attacks easier for bad actors.

Researchers recently discovered that OpenAI’s ChatGPT AI could be used to write malicious code and phishing emails.

Similarly, experts at Microsoft and the Universities of California and Virginia have found ways to train AI coding assistants to suggest malicious code to developers, which could lead to vulnerabilities and major supply chain attacks.

Daniel Croft

Born in the heart of Western Sydney, Daniel Croft is a passionate journalist with an understanding for and experience writing in the technology space. Having studied at Macquarie University, he joined Momentum Media in 2022, writing across a number of publications including Australian Aviation, Cyber Security Connect and Defence Connect. Outside of writing, Daniel has a keen interest in music, and spends his time playing in bands around Sydney.
