{"id":17875,"date":"2017-07-12T08:04:35","date_gmt":"2017-07-12T08:04:35","guid":{"rendered":"http:\/\/www.kurzweilai.net\/?p=303130"},"modified":"2017-07-15T03:41:33","modified_gmt":"2017-07-15T03:41:33","slug":"how-to-turn-audio-clips-into-realistic-lip-synced-video","status":"publish","type":"post","link":"https:\/\/hoo.central12.com\/fugic\/2017\/07\/12\/how-to-turn-audio-clips-into-realistic-lip-synced-video\/","title":{"rendered":"How to turn audio clips into realistic lip-synced video"},"content":{"rendered":"<p><iframe frameborder=\"0\" height=\"315\" src=\"https:\/\/www.youtube.com\/embed\/UCwbJxW-ZRg?rel=0\" width=\"560\"><\/iframe><br \/>\n<em>UW (University of Washington) | UW researchers create realistic video from audio files alone<\/em><\/p>\n<p>University of Washington researchers at the <a href=\"http:\/\/grail.cs.washington.edu\/\" >UW Graphics and Image Laboratory<\/a> have developed new algorithms that <a href=\"http:\/\/grail.cs.washington.edu\/projects\/AudioToObama\/\" >turn audio clips into a realistic, lip-synced video<\/a>, starting with an existing video of that person speaking on a different topic.<\/p>\n<p>As detailed in a\u00a0<a href=\"http:\/\/grail.cs.washington.edu\/projects\/AudioToObama\/siggraph17_obama.pdf\" >paper<\/a>\u00a0to be presented Aug. 
2 at\u00a0<a href=\"http:\/\/s2017.siggraph.org\/\" >SIGGRAPH 2017<\/a>, the team successfully generated a <a href=\"https:\/\/www.youtube.com\/watch?v=MVBe6_o4cMI\" >highly realistic video<\/a>\u00a0of former president Barack Obama talking about terrorism, fatherhood, job creation and other topics, using audio clips of those speeches and existing weekly video addresses in which he originally spoke on a different topic.<\/p>\n<p>Realistic audio-to-video conversion has practical applications, such as improving video conferencing for meetings (streaming audio over the internet takes up far less bandwidth than video, reducing video glitches) or holding a conversation with a historical figure in virtual reality, said <a href=\"http:\/\/homes.cs.washington.edu\/~kemelmi\/\" >Ira Kemelmacher-Shlizerman<\/a>, an assistant professor at the UW\u2019s Paul G. Allen School of Computer Science &amp; Engineering.<\/p>\n<p><iframe frameborder=\"0\" height=\"315\" src=\"https:\/\/www.youtube.com\/embed\/MVBe6_o4cMI?rel=0\" width=\"560\"><\/iframe><br \/>\n<em>Supasorn Suwajanakorn | Teaser &#8212; Synthesizing Obama: Learning Lip Sync from Audio<\/em><\/p>\n<p>This approach improves on previous audio-to-video conversion processes, which involved filming multiple people in a studio saying the same sentences over and over to try to capture how a particular sound correlates to different mouth shapes, a process that is expensive, tedious and time-consuming. 
The new machine learning tool may also help overcome the \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/Uncanny_valley\" >uncanny valley<\/a>\u201d problem, which has dogged efforts to create realistic video from audio.<\/p>\n<p><strong>How to do it<\/strong><\/p>\n<div id=\"attachment_303131\" class=\"wp-caption aligncenter\" style=\"width: 301px;  border: 1px solid #dddddd; background-color: #f3f3f3; padding-top: 4px; margin: 10px; text-align:center; display: block; margin-right: auto; margin-left: auto;\"><a href=\"http:\/\/www.kurzweilai.net\/images\/Obama-lip-Sync-Graphic.jpg\"><img class=\" wp-image-303131 noshadow\" title=\"Obama lip Sync Graphic\" src=\"http:\/\/www.kurzweilai.net\/images\/Obama-lip-Sync-Graphic.jpg\" alt=\"\" width=\"291\" height=\"361\" \/><\/a><p style=' padding: 0 4px 5px; margin: 0;'  class=\"wp-caption-text\">A neural network first converts the sounds from an audio file into basic mouth shapes. Then the system grafts and blends those mouth shapes onto an existing target video and adjusts the timing to create a realistic, lip-synced video of the person delivering the new speech. (credit: University of Washington)<\/p><\/div>\n<p>1. Find or record a video of the person (or use video chat tools like Skype to create a new video) for the neural network to learn from. There are millions of hours of video that already exist from interviews, video chats, movies, television programs and other sources, the researchers note. (Obama was chosen because there were hours of presidential videos in the public domain.)<\/p>\n<p>2. Train the neural network to watch videos of the person and translate different audio sounds into basic mouth shapes.<\/p>\n<p>3. The system then uses the audio of an individual\u2019s speech to generate realistic mouth shapes, which are then grafted onto and blended with the head of that person. A small time shift enables the neural network to anticipate what the person is going to say next.<\/p>\n<p>4. 
Currently, the neural network is designed to learn on one individual at a time, meaning that Obama\u2019s voice &#8212; speaking words he actually uttered &#8212; is the only information used to \u201cdrive\u201d the synthesized video. Future steps, however, include helping the algorithms generalize across situations to recognize a person\u2019s voice and speech patterns with less data: only an hour of video to learn from, for instance, instead of 14 hours.<\/p>\n<p><strong>Fakes of fakes<\/strong><\/p>\n<p>So the obvious question is: Can you use someone else&#8217;s voice to drive such a video (given enough training video)? The researchers said they decided against going down that path, but they didn&#8217;t say it was impossible.<\/p>\n<p>Even more pernicious: the words of the person in the original video (not just the voice) could be faked using Princeton\/Adobe&#8217;s \u201c<a href=\"http:\/\/www.kurzweilai.net\/princetonadobe-technology-will-let-you-edit-voices-like-text\" >VoCo<\/a>\u201d software (when available) &#8212; simply by editing a text transcript of their voice recording &#8212; or the fake voice itself could be modified.<\/p>\n<p>Or Disney Research\u2019s <a href=\"http:\/\/www.kurzweilai.net\/princetonadobe-technology-will-let-you-edit-voices-like-text\" >FaceDirector<\/a> could be used to edit recorded substitute facial expressions (along with the fake voice) into the video.<\/p>\n<p>However, by reversing the process &#8212; feeding video into the neural network instead of just audio &#8212; one could also potentially develop algorithms that detect whether a video is real or manufactured, the researchers note.<\/p>\n<p>The research was funded by Samsung, Google, Facebook, Intel, and the UW Animation Research Labs. 
You can contact the research team at\u00a0<a href=\"mailto:audiolipsync@cs.washington.edu\" >audiolipsync@cs.washington.edu<\/a>.<\/p>\n<hr \/>\n<h4>Abstract of\u00a0<em>Synthesizing Obama: Learning Lip Sync from Audio<\/em><\/h4>\n<p>Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>UW (University of Washington) | UW researchers create realistic video from audio files alone University of Washington researchers at the UW Graphics and Image Laboratory have developed new algorithms that turn audio clips into a realistic, lip-synced video, starting with an existing video of&nbsp; that person speaking on a different topic. 
As detailed in a&nbsp;paper&nbsp;to [&#8230;]<\/p>\n","protected":false},"author":13,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[46,48,51,43,69],"tags":[],"class_list":["post-17875","post","type-post","status-publish","format-standard","hentry","category-airobotics","category-electronics","category-entertainmentnew-media","category-news","category-socialethicallegal"],"_links":{"self":[{"href":"https:\/\/hoo.central12.com\/fugic\/wp-json\/wp\/v2\/posts\/17875"}],"collection":[{"href":"https:\/\/hoo.central12.com\/fugic\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hoo.central12.com\/fugic\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hoo.central12.com\/fugic\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/hoo.central12.com\/fugic\/wp-json\/wp\/v2\/comments?post=17875"}],"version-history":[{"count":2,"href":"https:\/\/hoo.central12.com\/fugic\/wp-json\/wp\/v2\/posts\/17875\/revisions"}],"predecessor-version":[{"id":17953,"href":"https:\/\/hoo.central12.com\/fugic\/wp-json\/wp\/v2\/posts\/17875\/revisions\/17953"}],"wp:attachment":[{"href":"https:\/\/hoo.central12.com\/fugic\/wp-json\/wp\/v2\/media?parent=17875"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hoo.central12.com\/fugic\/wp-json\/wp\/v2\/categories?post=17875"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hoo.central12.com\/fugic\/wp-json\/wp\/v2\/tags?post=17875"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}