The “AI face swap” trend is heating up again these days. As always, Ozmca takes a brief look at some of the most important AI face-swapping technologies of recent years from a technical perspective.
CycleGAN can be considered an important early attempt at general face conversion. Amid the wave of generative adversarial networks (GANs), researchers found that, given samples from a source category and samples from a target category, a GAN can readily learn the mapping between the two, which makes it a natural fit for “image-to-image translation” problems such as turning winter into summer or a horse into a zebra in the same landscape photo. The core idea of CycleGAN is cycle consistency: if the model can convert an image from the source domain to the target domain and then back again, recovering the original, it can be considered to have learned the mapping between the two domains well, which in turn helps ensure the quality of the converted images. However, CycleGAN is not especially effective at face replacement; it is, after all, a general-purpose method for all categories of images.
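The cycle-consistency idea above can be sketched as a loss term: map an image x through a source-to-target generator G and back through a target-to-source generator F, then penalize the L1 distance from the original. The toy "generators" below are hypothetical stand-ins for illustration, not CycleGAN's actual trained networks.

```python
import numpy as np

def cycle_consistency_loss(x, G, F):
    """L1 cycle-consistency term: mean |F(G(x)) - x| over all pixels."""
    return float(np.mean(np.abs(F(G(x)) - x)))

# Toy stand-in "generators": G brightens an image, F darkens it back.
G = lambda img: img + 0.1
F = lambda img: img - 0.1

x = np.random.rand(8, 8, 3)                 # a fake 8x8 RGB image
loss = cycle_consistency_loss(x, G, F)
print(round(loss, 6))                       # exact inverse pair -> 0.0
```

In the real model this term is added to the usual adversarial losses for both directions, so the generators are pushed toward mappings that are invertible rather than collapsing to arbitrary target-domain images.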
Face2Face is a more “standard, rule-based” attempt: it uses dlib and OpenCV to detect faces in the source images and locate key facial landmarks, then feeds those landmarks into a pix2pix translation model trained on faces to generate the target face image. Perhaps because this pipeline leaves less room for deep learning to do the heavy lifting, its results are not as good as they could be.
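The landmark-to-image step in that pipeline amounts to rasterizing the detected landmark points into a sparse image that serves as the conditioning input for the pix2pix generator. A minimal sketch, using hypothetical hand-picked points in place of dlib's real 68-point detector output:

```python
import numpy as np

def landmarks_to_condition(landmarks, size=64):
    """Rasterize (x, y) facial landmarks into a sparse single-channel
    image, usable as the conditioning input of a pix2pix-style model."""
    canvas = np.zeros((size, size), dtype=np.float32)
    for x, y in landmarks:
        canvas[int(y), int(x)] = 1.0      # mark each landmark pixel
    return canvas

# Hypothetical landmarks (dlib's predictor returns 68 such (x, y) points).
pts = [(20, 30), (44, 30), (32, 40), (26, 50), (38, 50)]
cond = landmarks_to_condition(pts)
print(cond.shape, int(cond.sum()))        # (64, 64) 5
```

In practice the landmarks are usually connected into line segments (jaw, brows, lips) before being fed to the generator, but the principle is the same: the network only sees geometry, and must hallucinate all texture.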
Since then, researchers at NVIDIA and UC Berkeley have improved pix2pixHD based on pix2pix to improve face image generation and still maintain the multi-category generality of the original pix2pix model.
The hottest and most popular deep-learning face-swapping model is without doubt DeepFakes, an autoencoder-decoder approach that emerged in late 2017. Using hundreds of photos (the more the better) of both the source person and the target person, it trains the model to recognize and reconstruct each face separately: a shared encoder learns a common face representation, while each person gets their own decoder. The swap is then performed by encoding photos of the source person and decoding them with the target person's decoder. It also has good support for video-to-video conversion.
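The shared-encoder, two-decoder layout and the swap step can be sketched with plain linear maps. The weights below are random stand-ins, not trained networks; in the real model, training minimizes each person's reconstruction error so that decoding A's latent code with B's decoder yields A's expression rendered as B's face.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 256   # flattened face vector size (stand-in for an image)
Z = 32    # latent code size

W_enc  = rng.normal(size=(Z, D)) * 0.05   # shared encoder
W_decA = rng.normal(size=(D, Z)) * 0.05   # decoder for person A
W_decB = rng.normal(size=(D, Z)) * 0.05   # decoder for person B

encode = lambda face: W_enc @ face

# Training would minimize ||W_decA @ encode(a) - a|| over A's photos
# and ||W_decB @ encode(b) - b|| over B's photos.

# The swap itself: encode a face of A, decode with B's decoder.
face_A  = rng.random(D)
swapped = W_decB @ encode(face_A)
print(swapped.shape)                      # (256,) -- A's pose in B's style
```

Because the encoder is shared, it is forced to capture identity-independent structure (pose, expression, lighting), which is exactly what makes the decoder swap meaningful.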
The disadvantage of DeepFakes is that it cannot work with small samples: it is impossible to swap two faces given only one or two photos of each, and training the model also consumes substantial resources.
When DeepFakes was first made public, it circulated only among technology enthusiasts, and no formal paper was published. But some face-swapped GIFs of Gal Gadot immediately drew attention. The “Yang Mi face swap with Zhu Yin” video that went viral earlier this year was also likely produced with this method: after sufficient training, the encoder-decoder pair in DeepFakes can indeed convert any input face (such as Zhu Yin's) into a high-quality, highly realistic target face (Yang Mi's).
The project is still being updated and upgraded, and a desktop application called FakeApp has been released to make it easier for novice users who cannot handle TensorFlow to try it out. For an in-depth technical analysis, see Deepfake.
Converting facial actions from a single photo
It may be difficult for DeepFakes-style “replace the face in the target image with another face” methods to ever shed their sample-size and compute requirements, so another line of work takes a different approach: given a single face image, make the person in the picture “move” according to a given set of actions. A paper published this May by Samsung's AI research center in Moscow, together with the Skolkovo Institute of Science and Technology, showed good results with this idea: they can animate not only photos of real people, but even make faces in paintings speak naturally.