Microsoft’s new AI tool VASA-1, an image-to-video model that can generate videos from just one photo and a speech audio clip, could turn into a deepfake nightmare machine, as Daniel John, senior news editor at Creative Bloq, warned in an article late last week. Precisely this realisation seems to have prompted Microsoft to state: “Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications. It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans.”
Critics might term that anticipatory bail or a standard disclaimer, but given where AI’s evolution is headed, the IT leviathan has no option but to keep innovating, lest it fall behind. Microsoft describes VASA-1 as a framework for generating lifelike talking faces of virtual characters with appealing visual affective skills (VAS), given a single static image and a speech audio clip.
The premiere model, VASA-1, “is capable of not only producing lip movements that are exquisitely synchronised with the audio, but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness.
“The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512x512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviours.”
A perusal of sample VASA-1 creations, based on virtual portrait images, proves it is becoming increasingly difficult to tell fake pictures and videos apart from real ones. Hence Microsoft’s statement: “We are opposed to any behaviour to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection. Currently, the videos generated by this method still contain identifiable artifacts, and the numerical analysis shows that there’s still a gap to achieve the authenticity of real videos.
“While acknowledging the possibility of misuse, it’s imperative to recognise the substantial positive potential of our technique. The benefits – such as enhancing educational equity, improving accessibility for individuals with communication challenges, offering companionship or therapeutic support to those in need, among many others – underscore the importance of our research and other related explorations. We are dedicated to developing AI responsibly, with the goal of advancing human well-being. Given such context, we have no plans to release an online demo, API (Application Programming Interface), product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.”
Coincidentally, TBD co-founder and COO Emily Chiu has said it is going to get more and more difficult to tell the difference between AI-generated and human-made content: “Now we have face and voice and video that’s mixed reality. ... It’s going to be increasingly hard for us to detect what’s real; what’s not; what’s fraudulent.” Speaking on Web Summit Rio’s Center Stage, also late last week, the investment banker turned entrepreneur said that AI’s imperfections are actually the fault of its human creators. But what happens when most of those imperfections are ironed out?