Multiplicity project at the 123data exhibition in Paris

Multiplicity is a collective photographic portrait of Paris. Conceived and designed by Moritz Stefaner for the 123 data exhibition, this interactive installation provides an immersive dive into the image space spanned by hundreds of thousands of photos taken across the Paris city area and shared on social media.

Content selection and curation aspects

The original image dataset consisted of 6.2 million geo-located social media photos posted in Paris in 2017. From these, a curated selection of 25,000 photos was made according to a list of criteria; the reasons for the reduction are not fully explained (perhaps a technical constraint). Moritz highlights that his intention was not to measure the city but to portray it. He says: “Rather than statistics, the project presents a stimulating arrangement of qualitative contents, open for exploration and to interpretation — consciously curated and pre-arranged, but not pre-interpreted.” This curatorial method wasn’t just used for data selection but also for bridging the t-SNE visualization and the grid visualization; watch the transition effect in the video below. As a researcher interested in user interfaces and visualization techniques that support knowledge discovery in digital image collections, I wonder whether such a curation-driven method could be adopted in a Digital Humanities approach.

Data Processing

Using machine learning techniques, the images are organized by similarity and image content, allowing visual exploration of niches and microgenres of image styles and subjects. More precisely, the project uses t-SNE dimensionality reduction on features from the last layer of a pre-trained neural network to cluster the images of Paris. The author says: “I used feature vectors normally intended for classification to calculate pairwise similarities between the images. The map arrangement was calculated using t-SNE — an algorithm that finds an optimal 2D layout so that similar images are close together.”
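Stefaner’s exact pipeline isn’t published in this post, but a minimal sketch of the general technique (pre-trained CNN features followed by t-SNE) might look like the following. The folder name, the choice of ResNet-18 and all parameters are my own illustrative assumptions, not the project’s actual setup:

```python
# Minimal sketch: embed images with a pre-trained CNN, then lay them out with t-SNE.
# Assumes a local folder of JPEGs; requires torch, torchvision, scikit-learn, Pillow.
from pathlib import Path

import torch
from PIL import Image
from sklearn.manifold import TSNE
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()   # keep the 512-d feature vector, drop the classifier head
model.eval()

paths = sorted(Path("photos").glob("*.jpg"))
with torch.no_grad():
    feats = torch.stack([
        model(preprocess(Image.open(p).convert("RGB")).unsqueeze(0)).squeeze(0)
        for p in paths
    ])

# t-SNE finds a 2D layout in which images with similar features end up close together.
# (perplexity must be smaller than the number of images)
layout = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats.numpy())
for p, (x, y) in zip(paths, layout):
    print(p.name, round(float(x), 2), round(float(y), 2))
```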

While the t-SNE algorithm takes care of the clustering and neighborhood structure, manual annotations help identify curated map areas. These areas can be zoomed on demand, enabling close viewing of similar photos.

 

Gugelmann Galaxy

Gugelmann Galaxy is an interactive demo by Mathias Bernhard exploring items from the Gugelmann Collection, a group of 2,336 works by the Schweizer Kleinmeister – Swiss 18th-century masters. Gugelmann Galaxy is built on Three.js, a lightweight JavaScript library for creating animated 3D visualizations in the browser using WebGL.

The images are grouped according to specific parameters that are automatically calculated by image analysis and by text analysis of the metadata. The high-dimensional feature space is then projected onto a 3D space while preserving topological neighborhoods between images in the original space. More explanation about the dimensionality reduction can be read here.
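The galaxy itself is rendered with Three.js in the browser; on the data-preparation side, a rough sketch of the reduction step (high-dimensional descriptors projected to 3D coordinates that a WebGL scene can consume) could look like this. The file names and the use of t-SNE here are my assumptions, not necessarily Bernhard’s actual choices:

```python
# Sketch: reduce high-dimensional image/text descriptors to 3D coordinates and
# export them as JSON for a Three.js/WebGL front end. "features.npy" and
# "galaxy_points.json" are hypothetical placeholders.
import json

import numpy as np
from sklearn.manifold import TSNE

features = np.load("features.npy")   # shape (n_items, n_dims): color, technique, text descriptors
coords = TSNE(n_components=3, perplexity=25).fit_transform(features)

# Normalize into a [-1, 1] cube, convenient for positioning sprites in a 3D scene.
coords = (coords - coords.min(axis=0)) / (coords.max(axis=0) - coords.min(axis=0)) * 2 - 1

with open("galaxy_points.json", "w") as f:
    json.dump(
        [{"id": i, "x": float(x), "y": float(y), "z": float(z)}
         for i, (x, y, z) in enumerate(coords)],
        f,
    )
```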

The user interface allows four types of image arrangement: by color distribution, by technique, by description and by composition. As the mouse hovers over an item, an info box with some of its metadata is displayed on the left. The user can also rotate, zoom, and pan.

The author wrote on his site:

The project renounces coming up with a rigid ontology and forcing the items to fit into premade categories. It rather sees clusters emerge from attributes contained in the images and texts themselves. Groupings can be derived but are not dictated.

 

A computer sees a masterpiece

On a whim, I showed Edvard Munch’s “The Scream” to the Google Cloud Vision API and, to my surprise, its computer vision algorithm “saw” an interesting aspect I guess most human eyes wouldn’t notice. The Cloud Vision API’s landmark detection feature, which detects popular natural and man-made structures within an image, printed out a bounding box along with the tag National Congress of Brazil over a specific area of the painting. Apparently, our Congress is a fright to the machine’s eyes.

Note: The Cloud Vision API doesn’t detect the National Congress of Brazil in all images of “The Scream” available on the web. The image I used was from this page.
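For anyone who wants to repeat the experiment, a minimal sketch with the google-cloud-vision Python client could look like the following; the file name is a placeholder, credentials are assumed to be configured, and results may change as Google updates its models:

```python
# Sketch: run Cloud Vision landmark detection on a local copy of the painting.
# "the_scream.jpg" is a placeholder path.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("the_scream.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.landmark_detection(image=image)
for landmark in response.landmark_annotations:
    box = [(v.x, v.y) for v in landmark.bounding_poly.vertices]
    print(landmark.description, round(landmark.score, 2), box)
```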

Curating photography with neural networks

“Computed Curation” is a 95-foot-long accordion photobook created by a computer. Taking the human editor out of the loop, it uses machine learning and computer vision tools to curate a series of photos from Philipp Schmitt’s personal archive.

The book features 207 photos taken between 2013 and 2017. Considering both image content and composition, the algorithms uncover unexpected connections among the photographs, and interpretations that a human editor might have missed.

A spread of the accordion book looks like this: on one page, a photograph is centered with a caption above it: “a harbor filled with lots of traffic” [confidence: 56.75%]. Location and date appear next to the photo, as a credit: Los Angeles, USA. November, 2016. At the bottom of the photo, some tags are listed: “marina, city, vehicle, dock, walkway, sport venue, port, harbor, infrastructure, downtown”. On the next page, the same layout carries different content: a picture captioned “a crowd of people watching a large umbrella” [confidence: 67.66%]. Location and date: Berlin, Germany. August, 2014. Tags: “crowd, people, spring, festival, tradition”.

Metadata from the camera (date and location) is collected using Adobe Lightroom. Visual features (tags and colors) are extracted from the photos using Google’s Cloud Vision API. Automated captions, with their corresponding confidence scores, are generated using Microsoft’s Cognitive Services API. Finally, image composition is analyzed using histograms of oriented gradients (HOG). These components are then fed to a t-SNE algorithm, which sorts the images in a two-dimensional space according to their similarities, and a genetic TSP algorithm computes the shortest path through the arrangement, thereby defining the page order. You can check out the process, recorded in his video below.
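As a rough illustration of the two last steps (the t-SNE layout and the path that becomes the page order), here is a minimal sketch. It is not Schmitt’s actual code: the feature vectors are random placeholders, and a greedy nearest-neighbor walk stands in for the genetic TSP solver:

```python
# Sketch: lay out combined per-photo features with t-SNE, then derive a page
# order from a path through the layout. Random vectors stand in for the real
# features (tags, colors, captions, HOG); a greedy walk stands in for the
# genetic TSP solver.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(207, 512))          # placeholder for the real feature vectors

layout = TSNE(n_components=2, perplexity=30).fit_transform(features)

# Greedy path: start somewhere, always jump to the nearest unvisited photo.
unvisited = set(range(len(layout)))
order = [unvisited.pop()]
while unvisited:
    last = layout[order[-1]]
    nearest = min(unvisited, key=lambda i: np.linalg.norm(layout[i] - last))
    unvisited.remove(nearest)
    order.append(nearest)

print("page order:", order[:10], "...")
```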

 

 

About machine capabilities versus human sensitivities

Recognition is an artificial intelligence program that pairs Tate’s masterpieces with news photographs provided by Reuters. For Recognition, there are visual or thematic similarities between the photo of a woman with a phrase on her face reading #foratemer (“out with Temer”) during a protest against the constitutional amendment known as PEC 55, and the seventeenth-century portrait of an aristocratic man in costume that denotes his sovereignty and authority. In times when intelligent, thinking machines, such as chatbots, are a widely discussed topic, I wonder whether the algorithms that created the dialogue between these two images would be aware of the conflicting, but no less interesting, relationship between resistance and power established between them.

Machine Learning Foundations – Week 1: course overview

I decided to take the online course “Machine Learning Foundations – A Case Study Approach”, offered on Coursera and taught by Carlos Guestrin and Emily Fox (professors at the University of Washington).

This introductory, intuitive course treats each machine learning method as a black box. The idea is to learn ML concepts through a case-study approach, so the course doesn’t go deep into how to specify an ML model and optimize it.

It’s a 6-week course and I’ll share here the highlights related to my research.

Week 1 – course overview

Slides
Videos

Machine learning is changing the world: In fact, if you look at some of the most successful companies in industry today – companies that are called disruptive – they’re often differentiated by intelligent applications, by intelligence that uses machine learning at its core. So, for example, in its early days Amazon really disrupted the retail market by bringing product recommendations into their website. We saw Google disrupting the advertising market by really targeting advertising with machine learning to figure out what people would click on. You saw Netflix, the movie distribution company, really change how movies are seen. Now we don’t go to a shop and rent movies anymore; we go to the web and we stream data. Netflix really changed that, and at the core there was a recommender system that helped me find the movies that I liked, the movies that are good for me, out of the many, many, many thousands of movies they were serving. You see companies like Pandora providing a music recommendation system where I find music that I like, and I find streams that are good for the morning when I’m sleepy or at night when I’m ready to go to bed and want to listen to different music; they really find good music for us. And you see that in many places, in many industries: you see Facebook connecting me with people who I might want to be friends with, and you even see companies like Uber disrupting the taxi industry by really optimizing how to connect drivers with people in real time. So, in all these areas, machine learning is one of the core technologies, the technology that makes that company’s product really special.

The Machine Learning pipeline: the data to intelligence pipeline. We start from data and bring in a machine learning method that provides us with a new kind of analysis of the data. And that analysis gives us intelligence. Intelligence like what product am I likely to buy right now?

Case study 1: Predicting house prices

Machine learning can be used to predict house values. The intelligence we’re deriving is a value associated with some house that’s not on the market, so we don’t know what its value is, and we want to learn that from data. And what’s our data? In this case, we look at other houses and their sale prices to inform the value of the house we’re interested in. In addition to the sale prices, we look at other features of the houses, like the number of bedrooms, the number of bathrooms, the square footage, and so on. What the machine learning method does is relate the house attributes to the sale price: if we can learn this model – the relationship from house-level features to the observed sale price – then we can use it to predict on a new house, taking its attributes and predicting its sale price. This method is called regression.
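A minimal sketch of the regression idea with scikit-learn; the features and prices are invented for illustration:

```python
# Sketch: relate house attributes to observed sale prices, then predict the
# price of a house that is not on the market. All numbers are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

# [bedrooms, bathrooms, square feet] for houses that already sold
X = np.array([[3, 2, 1500], [4, 3, 2200], [2, 1, 900], [5, 4, 3100]])
y = np.array([310_000, 450_000, 180_000, 620_000])   # their sale prices

model = LinearRegression().fit(X, y)

new_house = np.array([[3, 2, 1700]])                 # the house we want to value
print(f"predicted price: ${model.predict(new_house)[0]:,.0f}")
```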

Case study 2: Sentiment analysis

Machine learning can be used for a sentiment analysis task where the training data are restaurant reviews. In this case, a review can say the sushi was awesome, the drink was awesome, but the service was awful. A possible ML goal in this scenario is to take a single review and classify whether or not it has a positive sentiment: if it is a good review, thumbs up; if it has negative sentiment, thumbs down. To do so, the ML pipeline analyzes a lot of other reviews (the training data), considering both the text and the rating of each review, in order to learn the relationship needed to classify sentiment. For example, the model might analyze the text of a review in terms of how many times the word “awesome” versus the word “awful” was used. Doing so for all reviews, the model learns – based on the balance of usage of these words – a decision boundary between positive and negative reviews, and the way it learns from those other reviews is based on the ratings associated with their text. This method is called classification.
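A minimal sketch of such a classifier, counting words in the spirit of the awesome-versus-awful example; the tiny training set is invented:

```python
# Sketch: learn a positive/negative decision boundary from word counts and ratings.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "the sushi was awesome and the drink was awesome",
    "awesome service and great food",
    "the service was awful",
    "awful food and an awful wait",
]
labels = [1, 1, 0, 0]   # 1 = thumbs up (from high ratings), 0 = thumbs down

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(reviews, labels)

print(clf.predict(["the sushi was awesome but the service was awful"]))
```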

Case study 3: Document retrieval

The third case study is about a document retrieval task. Given a huge collection of articles and books (the dataset) that the system could recommend, the challenge is to use machine learning to surface the readings most interesting to a specific person. In this case, the ML model tries to find structure in the dataset based on groups of related articles (e.g. sports, world news, entertainment, science, etc.). By finding this structure and annotating the corpus (the collection of documents), the machine can use those labels to build a document retrieval engine. If a reader is currently reading an article about world news and wants to retrieve another one, then, aware of that label, he or she knows which category to keep searching within. This type of approach is called clustering.
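A minimal sketch of the clustering-then-retrieval idea with TF-IDF and k-means; the corpus is a toy stand-in:

```python
# Sketch: find groups of related documents without labels, then suggest
# reading from the cluster of the article currently being read.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the team won the championship game last night",
    "the striker scored twice in the final match",
    "parliament passed the new budget bill today",
    "the election results surprised many observers",
    "astronomers discovered a distant exoplanet",
    "the telescope captured images of a new galaxy",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

current = 0   # the reader is on the first article
related = [docs[i] for i, lab in enumerate(labels) if lab == labels[current] and i != current]
print("suggested next:", related)
```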

Case study 4: Product recommendation

The fourth case study addresses an approach called collaborative filtering, which has had a lot of impact in many domains over the last decade. Specifically, the task is to build a product recommendation application, where the ML model gets to know the customer’s past purchases and tries to use them to recommend a set of other products the customer might be interested in purchasing. The relationship the model tries to capture is between the products the customer bought before and what he or she is likely to buy in the future. To learn this relationship, the model looks at the purchase histories of many past customers and possibly features of those customers (e.g. age, gender, family role, location, …).
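A minimal sketch of item-to-item collaborative filtering on an invented purchase matrix:

```python
# Sketch: recommend a product by scoring items against what similar purchase
# histories contain. The matrix is invented: rows are customers, columns products.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

purchases = np.array([          # 1 = bought, 0 = not bought
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
])
products = ["coffee", "mug", "tea", "kettle", "grinder"]

item_sim = cosine_similarity(purchases.T)        # how often products co-occur in baskets

customer = purchases[0].astype(float)
scores = item_sim @ customer                     # score items by similarity to past purchases
scores[customer == 1] = -1                       # don't recommend what was already bought
print("recommend:", products[int(np.argmax(scores))])
```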

Case study 5:  Visual product recommender

The last case study is about a visual product recommender. The concept is much like the previous example: the task is also a recommendation application, but the ML model learns from the visual features of an image, and the outcome is also an image. Here, the input is an image of a product (e.g. a black shoe, a black boot, a high heel, a running shoe or some other shoe) chosen by a user in a browser, and the goal of the application is to retrieve a set of shoe images visually similar to the input image. The model does so by learning visual relations between different shoes. Usually, these models are built on a specific kind of architecture called a convolutional neural network (CNN). In a CNN, every layer of the network provides more and more descriptive features: the first layer typically detects simple features such as edges, by the second layer the model begins to detect corners and more complex patterns, and as we go deeper and deeper into the layers, more intricate visual features arise.
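A minimal sketch of the retrieval step, assuming deep features have already been extracted for the catalog (random placeholders here so the snippet stands on its own):

```python
# Sketch: return the shoes whose deep features are closest to the query image's.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
shoe_features = rng.normal(size=(1000, 512))     # placeholder: one feature row per catalog image
query = shoe_features[42]                        # features of the shoe the user clicked on

nn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(shoe_features)
_, idx = nn.kneighbors(query.reshape(1, -1))
print("visually similar items:", idx[0][1:])     # drop the query itself
```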

DH2017 – Computer Vision in DH workshop (Papers – Third Block)

Third block: Deep Learning
Chair: Thomas Smits (Radboud University)

6) Aligning Images and Text in a Digital Library (Jack Hessel & David Mimno)

Abstract
Slides
Website David Mimno
Website Jack Hessel

Problem: correspondence between text and images.
  • In this work, the researchers train machine learning algorithms to match images from book scans with the text in the pages surrounding those images.
  • Using 400K images collected from 65K volumes published between the 14th and 20th centuries and released into the public domain by the British Library, they build information retrieval systems capable of performing cross-modal retrieval, i.e., searching images using text, and vice versa (a toy sketch of the idea follows this list).
  • Previous multi-modal work:
    • Datasets: Microsoft Common Objects in Context (COCO) and Flickr (images with user-provided tags);
    • Tasks: Cross-modal information retrieval (ImageCLEF) and Caption search / generation
  • Project Goals:
    • Use text to provide context for the images we see in digital libraries, and as a noisy “label” for computer vision tasks
    • Use images to provide grounding for text.
  • Why is this hard? Most relationships between text and images are weakly aligned, that is, very vague. A caption is an example of strong alignment between text and image; an article is an example of weak alignment.
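As a toy illustration of cross-modal retrieval (not Hessel and Mimno’s actual method), one can project image features and text features into a shared space with CCA and then search across modalities there; the feature matrices below are random placeholders:

```python
# Toy sketch: fit CCA between paired image and text features, then rank images
# for a text query by cosine similarity in the shared space.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
img_feats = rng.normal(size=(500, 256))    # one row per scanned image (e.g. CNN features)
txt_feats = rng.normal(size=(500, 300))    # one row per surrounding-text passage

cca = CCA(n_components=32).fit(img_feats, txt_feats)
img_shared, txt_shared = (normalize(m) for m in cca.transform(img_feats, txt_feats))

query = txt_shared[42]                         # pretend this is an embedded text query
ranking = np.argsort(-(img_shared @ query))    # images most aligned with the text
print("top images for the query:", ranking[:5])
```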

7) Visual Trends in Dutch Newspaper Advertisements (Melvin Wevers & Juliette Lonij)

Abstract
Slides

Live Demo of SIAMESE: Similar advertisement search.
  • The context of advertisements for historical research:
    • “insight into the ideals and aspirations of past realities …”
    • “show the state of technology, the social functions of products, and provide information on the society in which a product was sold” (Marchand, 1985).
  • Research question: How can we combine non-textual information with textual information to study trends in advertisements?
  • Data: ~1.6M advertisements from two Dutch national newspapers, Algemeen Handelsblad and NRC Handelsblad, between 1948 and 1995
  • Metadata: title, date, newspaper, size, position (x, y), OCR, page number, total number of pages.
  • Approach: Visual Similarity:
    • Group images together based on visual cues;
    • Demo: SIAMESE: SImilar AdvertiseMEnt SEarch;
    • Approximate nearest neighbors on penultimate-layer features of an ImageNet Inception model (see the sketch after this list).
  • Final remarks:
    • Object detection and visual similarity approaches offer trends on different layers, similar to close and distant reading;
    • Visual Similarity is not always Conceptual Similarity;
    • Combination of text/semantic and visual similarity as a way to find related advertisements.
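A minimal sketch of this retrieval step, using the Annoy library for approximate nearest neighbors over penultimate-layer features; the feature matrix is a random placeholder rather than real Inception activations:

```python
# Sketch: index advertisement feature vectors with Annoy, then query for lookalike ads.
import numpy as np
from annoy import AnnoyIndex

rng = np.random.default_rng(0)
ad_features = rng.normal(size=(10_000, 2048)).astype("float32")   # placeholder features

index = AnnoyIndex(ad_features.shape[1], "angular")   # angular distance ~ cosine similarity
for i, vec in enumerate(ad_features):
    index.add_item(i, vec.tolist())
index.build(50)                                       # 50 trees: more trees, better recall

query_id = 123                                        # some advertisement of interest
print("similar ads:", index.get_nns_by_item(query_id, 10))
```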

8) Deep Learning Tools for Foreground-Aware Analysis of Film Colors (Barbara Flueckiger, Noyan Evirgen, Enrique G. Paredes, Rafael Ballester-Ripoll, Renato Pajarola)

The research project FilmColors, funded by an Advanced Grant of the European Research Council, aims at a systematic investigation into the relationship between film color technologies and aesthetics.

Initially, the research team analyzed a large group of 400 films from 1895 to 1995 with a protocol that consists of about 600 items per segment to identify stylistic and aesthetic patterns of color in film.

This human-based approach is now being extended with advanced software that detects the figure-ground configuration and plots the results as color schemes in a perceptually uniform color space (see Flueckiger 2011 and Flueckiger 2017, in press).
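As a rough illustration of one ingredient of such a pipeline, the sketch below converts a frame to CIELAB (a perceptually uniform color space) and summarizes its dominant colors; this is not the FilmColors software and it skips the figure-ground segmentation entirely:

```python
# Sketch: dominant colors of a film frame, computed in the perceptually uniform
# CIELAB space with k-means. "frame.png" is a placeholder path.
import numpy as np
from PIL import Image
from skimage import color
from sklearn.cluster import KMeans

frame = np.asarray(Image.open("frame.png").convert("RGB")) / 255.0
lab_pixels = color.rgb2lab(frame).reshape(-1, 3)

# Subsample pixels to keep k-means fast on full-resolution frames.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(lab_pixels[::50])
palette_lab = kmeans.cluster_centers_
palette_rgb = color.lab2rgb(palette_lab.reshape(1, -1, 3)).reshape(-1, 3)

for lab_c, rgb_c in zip(palette_lab, palette_rgb):
    print("L*a*b*:", np.round(lab_c, 1), "-> RGB:", np.round(rgb_c * 255).astype(int))
```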

ERC Advanced Grant FilmColors

DH2017 – Computer Vision in DH workshop (Papers – Second Block)

Second block: Tools
Chair: Melvin Wevers (Utrecht University)

4) A Web-Based Interface for Art Historical Research (Sabine Lang & Björn Ommer)

Abstract
Slides
Computer Vision Group (University of Heidelberg)

  • Area: art history <-> computer vision
  • First experiment: Can computers propel the understanding and reconstruction of drawing processes?
  • Goal: Study production process. Understand the types and degrees of transformation between an original piece of art and its reproductions.
  • Experiment 2: Can computers help with the analysis of large image corpora, e.g. find gestures?
  • Goal: Find visual similarities and do formal analysis.
  • Central questions: which gestures can we identify? Do there exist varying types of one gesture?
  • Results: Visuelle Bildsuche (interface for art historical research)
Visuelle Bildsuche – Interface start screen. Data collection Sachsenspiegel (c1220)
  • Interesting and potential feature: in the image, you can mark up areas and find other images with visual similarities:
Search results with visual similarities based on selected bounding boxes
Bautista, Miguel A., Artsiom Sanakoyeu, Ekaterina Sutter, and Björn Ommer. “CliqueCNN: Deep Unsupervised Exemplar Learning.” arXiv:1608.08792 [Cs], August 31, 2016. http://arxiv.org/abs/1608.08792.

5) The Media Ecology Project’s Semantic Annotation Tool and Knight Prototype Grant (Mark Williams, John Bell, Dimitrios Latsis, Lorenzo Torresani)

Abstract
Slides
Media Ecology Project (Dartmouth)

The Semantic Annotation Tool (SAT)

SAT is a drop-in module that facilitates the creation and sharing of time-based media annotations on the Web.

Knight News Challenge Prototype Grant

Knight Foundation has awarded a Prototype Grant for Media Innovation to The Media Ecology Project (MEP) and Prof. Lorenzo Torresani’s Visual Learning Group at Dartmouth, in conjunction with The Internet Archive and the VEMI Lab at The University of Maine.

“Unlocking Film Libraries for Discovery and Search” will apply existing software for algorithmic object, action, and speech recognition to a varied collection of 100 educational films held by the Internet Archive and Dartmouth Library. We will evaluate the resulting data to plan future multimodal metadata generation tools that improve video discovery and accessibility in libraries.

 

DH2017 – Computer Vision in DH workshop (Keynote)

Robots Reading Vogue Project

A keynote by Lindsay King & Peter Leonard (Yale University) on “Processing Pixels: Towards Visual Culture Computation”.

SLIDES HERE

Abstract: This talk will focus on an array of algorithmic image analysis techniques, from simple to cutting-edge, on materials ranging from 19th century photography to 20th century fashion magazines. We’ll consider colormetrics, hue extraction, facial detection, and neural network-based visual similarity. We’ll also consider the opportunities and challenges of obtaining and working with large-scale image collections.

The Robots Reading Vogue project at the Digital Humanities Lab, Yale University Library

1) The project:

  • 121 years of Vogue (2,700 covers, 400,000 pages, 6 TB of data). First experiments: n-grams, topic modeling.
  • With images, humans are better at seeing “distant vision” patterns with their own eyes than they are at “distant reading” text.
  • A simple interface laying out covers by month and year reveals Vogue’s seasonal patterns.
  • The interface is not technically difficult to implement.
  • It does not use computer vision for analysis.

2) Image analysis in RRV (sorting covers by color to enable browsing)

  • Media visualization (Manovich) to show saturation and hue by month. Result: differences by season of the year. Tool used: ImagePlot (a rough sketch of this colormetrics step appears at the end of this section).
  • “The average color problem”. Solutions:
  • Slice histograms: the visualization Peter showed.

The slice histograms give us a zoomed-out view unlike any other visualizations we’ve tried. We think of them as “visual fingerprints” that capture a macroscopic view of how the covers of Vogue changed through time.
  • “Face detection is kind of a hot topic people talk about, but I only think it is of use when it is combined with other techniques”; see e.g. the face detection experiments below.
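Returning to the colormetrics step above: not the RRV code, but a minimal sketch of computing mean hue and saturation per cover so the values can be plotted by month and year (the covers/YYYY-MM.jpg naming is a made-up convention):

```python
# Sketch: average hue and saturation per cover image, ready to plot by month/year.
from pathlib import Path

import numpy as np
from PIL import Image

rows = []
for path in sorted(Path("covers").glob("*.jpg")):
    hsv = np.asarray(Image.open(path).convert("HSV"), dtype=float) / 255.0
    year, month = path.stem.split("-")                 # e.g. "1955-07"
    rows.append((int(year), int(month), hsv[..., 0].mean(), hsv[..., 1].mean()))

for year, month, hue, sat in rows[:5]:
    print(year, month, "mean hue:", round(hue, 3), "mean saturation:", round(sat, 3))
```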

3) Experiment: Face Detection + geography

  • Photogrammar
Face Detection + Geography
  • Code on Github
  • Idea: Place image as thumbnail in a map
  • Face Detection + composition (a generic detection sketch follows below)
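As a generic stand-in for the detection step (not necessarily the detector the Yale team used), an off-the-shelf OpenCV Haar cascade can locate faces; the detections could then be joined with geographic metadata for a map view, or with position data for composition analysis:

```python
# Sketch: detect faces in a photo with OpenCV's bundled Haar cascade.
# "photo.jpg" is a placeholder path.
import cv2

cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    print(f"face at x={x}, y={y}, width={w}, height={h}")
print(f"{len(faces)} face(s) found")
```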

4) Visual Similarity

  • What if we could search for pictures that are visually similar to a given image?
  • Neural networks approach
  • Demo of the Visual Similarity experiment: in the main interface, you select an image and it shows its closest neighbors.

Other related works on Visual Similarities:

  • John Resig’s Ukiyo-e (Japanese woodblock prints project). Article: Resig, John. “Aggregating and Analyzing Digitized Japanese Woodblock Prints.” Japanese Association of Digital Humanities conference, 2013.
  • TinEye MatchEngine, used in Resig’s project (finds duplicate, modified and even derivative images in an image collection).
  • Carl Stahmer – Arch Vision (Early English Broadside / Ballad Impression Archive)
  • Article: Stahmer, Carl. (2014). “Arch-V: A platform for image-based search and retrieval of digital archives.” Digital Humanities 2014: Conference Abstracts
  • ARCHIVE-VISION Github code here
  • Peter refers to a paper Benoit presented in Krakow.

5) Final thoughts and next steps

  • Towards Visual Culture Computation
  • NNs are “indescribable”… but we can dig in to look at pixels that contribute to classifications: http://cs231n.github.io/understanding-cnn/
  • The Digital Humanities Lab at Yale University Library is currently working with an image dataset from the Yale library, using a deep learning approach to detect visual similarities.
  • This project is called Neural Neighbors and there is a live demo of neural-network visual similarity on 19th-century photos
Neural Neighbors seeks to show visual similarities in 80,000 19th Century photographs
  • The idea is to combine signal from pixels with signal from text
  • Question: how to organize this logistically?
  • Consider intrinsic metadata of available collections
  • Approaches to handling copyright licensing restrictions (perpetual license and transformative use)
  • Increase the number of open image collections available: museums, governments collections, social media
  • Computer science departments working on computer vision with training datasets.