Scientists from KAIST, the Korea Advanced Institute of Science and Technology in South Korea, have described a new model called StyleTalker, which takes a single image of a person as input and produces a video of them talking "with accurately audio-synced lip shapes, realistic head poses, and eye blinks".
StyleTalker combines AI techniques for "audio-driven generation" (generating realistic lip movements from audio) with "motion-controllable" generation, which can do things like take the head movements and gestures from one video and apply them to a new video with a new face.
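To make the two-branch idea concrete, here is a minimal toy sketch of that kind of pipeline: one branch maps audio to lip-shape parameters, the other extracts head pose from a driving video, and the generator combines them per frame, conditioned on the single source image. All function and parameter names here are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of a one-shot talking-head pipeline.
# Names and parameterizations are illustrative, not from the StyleTalker paper.

def lip_params_from_audio(audio_frames):
    # Audio-driven branch: map each audio frame to a lip-shape parameter.
    # A real model would use a learned audio encoder; this is a placeholder.
    return [{"lip_open": a % 1.0} for a in audio_frames]

def pose_params_from_video(driving_frames):
    # Motion-controllable branch: extract head pose and blinks from a
    # driving video, so motion can be reused with a different face.
    return [{"head_yaw": f * 0.1, "blink": f % 2 == 0} for f in driving_frames]

def generate_talking_head(source_image, audio_frames, driving_frames):
    # Combine the two parameter streams frame-by-frame; each output "frame"
    # is just a dict here, standing in for a rendered image conditioned on
    # the single source photo.
    lips = lip_params_from_audio(audio_frames)
    poses = pose_params_from_video(driving_frames)
    return [
        {"source": source_image, **lip, **pose}
        for lip, pose in zip(lips, poses)
    ]

frames = generate_talking_head("face.png", [0.2, 0.7, 0.4], [0, 1, 2])
print(len(frames))  # → 3, one output frame per input frame
```

The point of the split is that lip shapes and head motion are controlled independently: you can swap the driving video (new gestures) or the audio (new speech) without touching the other branch.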
The work builds on a recent boom in neural lip-synced video generation, a research field that aims at "transforming the lip region of the person in the target video, generating new videos with the lip shapes that match the input audio." (This could be used, for example, when movies are dubbed from one language into another.)
The researchers claim StyleTalker "can generate more natural and robust talking head videos compared to other models" described previously. It is a step towards more realistic fake videos, but is that a good thing? Let us know in the comments how you think this technology could be used for good and for bad.
Citation: Dongchan Min, Minyoung Song, Eunji Ko, Sung Ju Hwang. StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation. arXiv preprint (2022). https://arxiv.org/abs/2208.10922 (open access)