AI may have outwitted itself with fake video

The US military is funding an effort to determine whether AI-generated video and audio will soon be indistinguishable from the real thing—even for another AI.

The Defense Advanced Research Projects Agency (DARPA) is holding a contest this summer to generate the most convincing AI-created videos and the most effective tools to spot the counterfeits.

Some of the most realistic fake footage is created by generative adversarial networks, or GANs. GANs pit AI systems against each other to refine their creations and make a product real enough to fool the other AI. In other words, the final videos are literally made to dupe detection tools.
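To make the adversarial idea concrete, here is a minimal sketch of a toy GAN on one-dimensional numbers rather than video. Everything in it is an illustrative assumption, not how any real deepfake tool works: the "real" data is drawn from a normal distribution centered on 4, the generator is a linear function of noise, and the discriminator is a single logistic unit. The function name `train_toy_gan` is hypothetical.

```python
import math
import random


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps: int = 300, lr: float = 0.05, batch: int = 32, seed: int = 0):
    """Alternate discriminator and generator updates on 1-D data.

    Assumptions (illustrative only): real samples ~ N(4, 1);
    generator g(z) = a*z + b; discriminator D(x) = sigmoid(w*x + c).
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.0, 0.0  # discriminator parameters

    for _ in range(steps):
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x_r, x_f in zip(real, fake):
            s_r = sigmoid(w * x_r + c)
            s_f = sigmoid(w * x_f + c)
            grad_w += -(1.0 - s_r) * x_r + s_f * x_f
            grad_c += -(1.0 - s_r) + s_f
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Generator step: minimize -log D(fake), i.e. try to fool the discriminator.
        grad_a = grad_b = 0.0
        for z in noise:
            s_f = sigmoid(w * (a * z + b) + c)
            d_fake = -(1.0 - s_f) * w  # derivative of the loss w.r.t. the fake sample
            grad_a += d_fake * z
            grad_b += d_fake
        a -= lr * grad_a / batch
        b -= lr * grad_b / batch

    return a, b, w, c
```

Calling `train_toy_gan()` returns the final generator parameters `(a, b)` and discriminator parameters `(w, c)`. Because each side is trained only against the other, the generator's output is shaped by whatever the discriminator currently finds suspicious, which is the point the paragraph above makes about videos built to dupe detectors.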

Why does it matter? The software to create these videos is becoming increasingly advanced and accessible, which could cause real harm. Earlier this year, actor and filmmaker Jordan Peele warned of the dangers of deepfakes by manipulating a video of a Barack Obama speech.

A computer sees a masterpiece

On a whim, I showed Edvard Munch’s “The Scream” to the Google Cloud Vision API and, to my surprise, its computer vision algorithm “saw” something I suspect most human eyes wouldn’t notice. The API’s landmark detection feature, which detects popular natural and man-made structures within an image, returned a bounding box tagged National Congress of Brazil around a specific area of the painting. Apparently, our Congress is a fright to the machine’s eyes.

Note: The Cloud Vision API doesn’t detect the National Congress of Brazil in all images of “The Scream” available on the web. The image I used was from this page.
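For readers who want to try something similar, here is a rough sketch of how a landmark request to the Cloud Vision REST endpoint can be assembled and how its reply can be read. It only builds and parses JSON and does not send anything, since a real call needs an API key or OAuth token. The helper names are hypothetical, and the score and coordinates in `sample_response` are placeholders I made up for illustration, not the values the API returned for the painting.

```python
import base64

# Public REST endpoint for image annotation (credentials are required to call it).
ANNOTATE_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_landmark_request(image_bytes: bytes, max_results: int = 5) -> dict:
    """Build the JSON body for a LANDMARK_DETECTION request."""
    return {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [
                    {"type": "LANDMARK_DETECTION", "maxResults": max_results}
                ],
            }
        ]
    }


def extract_landmarks(response: dict) -> list:
    """Return (description, score, bounding-box vertices) for each landmark."""
    found = []
    for result in response.get("responses", []):
        for ann in result.get("landmarkAnnotations", []):
            vertices = [
                (v.get("x", 0), v.get("y", 0))
                for v in ann.get("boundingPoly", {}).get("vertices", [])
            ]
            found.append((ann.get("description"), ann.get("score"), vertices))
    return found


# Placeholder reply: the score and coordinates are invented for illustration.
sample_response = {
    "responses": [
        {
            "landmarkAnnotations": [
                {
                    "description": "National Congress of Brazil",
                    "score": 0.5,
                    "boundingPoly": {
                        "vertices": [
                            {"x": 10, "y": 20},
                            {"x": 110, "y": 20},
                            {"x": 110, "y": 90},
                            {"x": 10, "y": 90},
                        ]
                    },
                }
            ]
        }
    ]
}

for description, score, box in extract_landmarks(sample_response):
    print(description, score, box)
```

The body from `build_landmark_request` would be sent as a POST to `ANNOTATE_URL`; each entry in `landmarkAnnotations` carries the tag and the bounding box described above.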

Curating photography with neural networks

“Computed Curation” is a 95-foot-long accordion photobook created by a computer. Taking the human editor out of the loop, it uses machine learning and computer vision tools to curate a series of photos from Philipp Schmitt’s personal archive.

The book features 207 photos taken between 2013 and 2017. Considering both image content and composition, the algorithms uncover unexpected connections among the photographs, along with interpretations that a human editor might have missed.

A spread of the accordion book looks like this: on one page, a photograph is centered with a caption above it: “a harbor filled with lots of traffic” [confidence: 56.75%]. Location and date appear next to the photo, like a credit: Los Angeles, USA. November, 2016. Below the photo, some tags are listed: “marina, city, vehicle, dock, walkway, sport venue, port, harbor, infrastructure, downtown”. The next page has the same layout with different content: a picture is captioned “a crowd of people watching a large umbrella” [confidence: 67.66%]. Location and date: Berlin, Germany. August, 2014. Tags: “crowd, people, spring, festival, tradition”.

Metadata from the camera (date and location) is collected using Adobe Lightroom. Visual features (tags and colors) are extracted from the photos using Google’s Cloud Vision API. Automated captions, with their corresponding confidence scores, are generated using Microsoft’s Cognitive Services API. Finally, image composition is analyzed using histograms of oriented gradients (HOG). These components are then fed to a t-SNE learning algorithm, which arranges the images in a two-dimensional space according to their similarities. A genetic TSP algorithm computes the shortest path through the arrangement, thereby defining the page order. Schmitt recorded the process in a video.
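The last step, turning a two-dimensional arrangement into a page order, can be sketched in a few lines. This is not Schmitt's code: where the book uses a genetic TSP algorithm, the sketch below swaps in a simpler heuristic (a greedy nearest-neighbour pass refined with 2-opt segment reversals), and the coordinates in `layout` are hypothetical stand-ins for positions that t-SNE would produce.

```python
import math


def path_length(points, order):
    """Total length of the open path that visits points in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[i + 1]])
        for i in range(len(order) - 1)
    )


def nearest_neighbour_order(points, start=0):
    """Greedy ordering: always move to the closest photo not yet placed."""
    if not points:
        return []
    unvisited = set(range(len(points)))
    unvisited.remove(start)
    order = [start]
    while unvisited:
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        unvisited.remove(nxt)
        order.append(nxt)
    return order


def two_opt(points, order):
    """Reverse segments for as long as doing so shortens the path."""
    best = list(order)
    best_len = path_length(points, best)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cand_len = path_length(points, candidate)
                if cand_len < best_len - 1e-12:
                    best, best_len = candidate, cand_len
                    improved = True
    return best


# Hypothetical 2-D positions, one per photo (stand-ins for a t-SNE layout).
layout = [(0.0, 0.0), (3.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
page_order = two_opt(layout, nearest_neighbour_order(layout))
```

Here `page_order` is a list of photo indices in reading order, so neighbouring pages hold photos that sit close together in the layout. A genetic algorithm explores the same space of orderings differently, but the objective, a short path through every point, is the same.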

About machine capabilities versus human sensitivities

Recognition is an artificial intelligence program that pairs masterpieces from Tate’s collection with news photographs provided by Reuters. For it, there are visual or thematic similarities between two images: a photo of a woman with #foratemer (“Temer out”) written on her face during a protest against the constitutional amendment known as PEC 55, and a portrait of a seventeenth-century aristocrat whose attire denotes his sovereignty and authority. At a time when intelligent, thinking machines such as chatbots are so widely discussed, I wonder whether the algorithms that created the dialogue between these two images are aware of the conflicting, but no less interesting, relationship between resistance and power that they establish.