Machine Learning Foundations – Week 1: course overview

I decided to take the online course “Machine Learning Foundations – A Case Study Approach” offered by Coursera and taught by Carlos Guestrin and Emily Fox (professors from University of Washington).

This introductory and intuitive course treats the Machine Learning method as a black box. The idea is to learn ML concepts through a case study approach, so the course doesn’t deepen on how to describe a ML model and optimize it.

It’s a 6-week course and I’ll share here the highlights related to my research.

Week 1 – course overview


Machine learning is changing the world: In fact, if you look some of the most industry successful companies today – Companies that are called disruptive – they’re often differentiated by intelligent applications, by intelligence that uses machine learning at its core. So, for example, early days Amazon really disrupted the retail market by bringing in product recommendations into their website. We saw Google disrupting the advertising market by really targeting advertising with machine learning to figure out what people would click on. You saw Netflix, the movie distribution company, really change how movies are seen. Now we don’t go to a shop and rent movies anymore. We go to the web and we stream data. Netflix really changed that. And at the core, there was a recommender system that helped me find the movies that I liked, the movies that are good for me out of the many, many, many thousands of movies they were serving. You see companies like Pandora, where they’re providing a music recommendation system where I find music that I like. And I find streams that are good for the morning when I’m sleepy or at night when I’m ready to go to bed and I want to listen to different music. And they really find good music for us. And you see that in many places, in many industries, you see Facebook connecting me with people who I might want to be friends with. And you even see companies like Uber disrupting the taxi industry by really optimizing how to connect drivers with people in real time. So, in all these areas, machine learning is one of the core technologies, the technology that makes that company’s product really special.

The Machine Learning pipeline: the data to intelligence pipeline. We start from data and bring in a machine learning method that provides us with a new kind of analysis of the data. And that analysis gives us intelligence. Intelligence like what product am I likely to buy right now?

Case study 1: Predicting house prices

Machine Learning can be used to predict house values. So, the intelligence we’re deriving is a value associated with some house that’s not on the market. So, we don’t know what its value is and we want to learn that from data. And what’s our data? In this case, we look at other houses and look at their house sales prices to inform the house value of this house we’re interested in. And in addition to the sales prices, we look at other features of the houses. Like how the number of bedrooms, bathrooms, the number of square feet, and so on. What the machine learning method does it to relate the house attributes to the sales price. Because if we can learn this model – this relationship from house level features to the observed sales price – then we can use that for predicting on this new house. We take its house attribute and predict its house sales price. And this method is called regression.

Case study 2: Sentiment analysis

Machine Learning can be used to a sentiment analysis task where the training data are reviews of restaurants. In this case, a review can say the sushi was awesome, the drink was awesome, but the service was awful. A possible ML goal in this scenario can be to take this single review and classify whether or not it has a positive sentiment. If it is a good review, thumbs up; if it has negative sentiment, thumbs down. To do so, the ML pipeline analyses a lot of other reviews (training data) considering the text and the rating of the review in order to understand what’s the relationship here, for classification of this sentiment. For example, the ML model might analyze the text of this review in terms of how many time the word “awesome” versus how many times the word “awful” was used. And doing so for all reviews, the model will learn – based on the balance of usage of these words – a decision boundary between whether it’s a positive or negative review. And the way the model learn from these other reviews is based on the ratings associated with that text. This method is called a classification method.

Case study 3: Document retrieval

The third case study it’s about a document retrieval task. From a huge collection of articles and books (dataset) the system could recommend, the challenge is to use machine learning to indicate those readings more interesting to a specific person. In this case, the ML model tries to find structure in the dataset based on groups of related articles (e.g. sports, world news, entertainment, science, etc.). By finding this structure and annotating the corpus (the collection of documents) then the machine can use the labels to build a document retrieval engine. And if a reader is currently reading some article about world news and wants to retrieve another one, then, aware of its label, he or she knows which type of category to keep searching over. This type of approach is called clustering.

Case study 4: Product recommendation

The fourth case study addresses an approach called collaborative filtering that’s had a lot of impact in many domains in the last decade. Specifically, the task is to build a product recommendation applications, where the ML model gets to know the costumer’s past purchases and tries to use those to recommend some set of other products the customer might be interested in purchasing. The relation the model tries to understand to make the recommendation is on the products the consumer bought before and what he or she is likely to buy in the future. And to learn this relation the model looks at the purchase histories of a lot of past customers and possibly features of those customers (e.g. age, genre, family role, location …).

Case study 5:  Visual product recommender

The last case study is about a visual product recommender. The concept idea is pretty much like the latter example. The task here is also a recommendation application, but the ML model learns from visual features of an image and the outcome is also an image. Here, the data is an input image (e.g. black shoe, black boot, high heel, running shoe or some other shoe) chosen by a user on a browser. And the goal of the application is to retrieve a set of images of shoes visually similar to the input image. The model does so by learning visual relations between different shoes. Usually, these models are trained on a specific kind of architecture called Convolutional Neural Network (CNN). In CNN architecture, every layer of the neural network provides more and more descriptive features. The first layer is supposed to just detect features like different edges. By the second layer, the model begins to detect corners and more complex features. And as we go deeper and deeper in these layers, we can observe more intricate visual features arising.

Your face in 3D

Reconstructing a 3-D model of a face is a fundamental Computer Vision problem that usually requires multiple images. But a recent publication presents an artificial intelligence approach to tackle this problem. And it does an impressive job!

In this work, the authors train a Convolutional Neural Network (CNN) on an appropriate dataset consisting of 2D images and 3D facial models or scans. See more information at their project website.

Try their online demo!
Reference: Jackson, Aaron S., Adrian Bulat, Vasileios Argyriou, and Georgios Tzimiropoulos. “Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression.” arXiv:1703.07834 [Cs], March 22, 2017.

DH2017 – Computer Vision in DH workshop (lightining talks part 1)

To facilitate the exchange of current ongoing work, projects or plans, the workshop allowed participants to give very short lightning talks and project pitches of max 5 minutes.

Part 1
Chair: Martijn Kleppe (National Library of the Netherlands)

1. How can Caffe be used to segment historical images into different categories?
Thomas Smits (Radboud University)

Number of images by identified categories.
  • Challenge: how to attack the “unknown” category and make data more discoverable?

2. The Analysis Of Colors By Means Of Contrasts In Movies 
Niels Walkowski (BBAW / KU Leuven)

  • Slides 
  • Cinemetrics, Colour Analysis & Digital Humanities:
    • Brodbeck (2011) “Cinemetrics”: the project is about measuring and visualizing movie data, in order to reveal the characteristics of films and to create a visual “fingerprint” for them. Information such as the editing structure, color, speech or motion are extracted, analyzed and transformed into graphic representations so that movies can be seen as a whole and easily interpreted or compared side by side.

      Film Data Visualization
    • Burghardt (2016) “Movieanalyzer
Movieanalyzer (2016)

3. New project announcement INSIGHT: Intelligent Neural Networks as Integrated Heritage Tools
Mike Kestemont (Universiteit Antwerpen)

  • Slides
  • Data from two museums Museums: Royal Museums of Fine Arts of Belgium and Royal Museums of Art and History;
  • Research opportunity: how can multimodal representation learning (NPL + Vision) help to organize and explore this data;
  • Transfer knowledge approach:
    • Large players in the field have massive datasets;
    • How easily can we transfer knowledge from large to small collections? E.g. automatic dating or object description;
  • Partner up: the Departments of Literature and Linguistics (Faculty of Arts and Philosophy) of the University of Antwerp and the Montefiore Institute (Faculty of Applied Sciences) of the University of Liège are seeking to fill two full-time (100%) vacancies for Doctoral Grants in the area of machine/deep learning, language technology, and/or computer vision for enriching heritage collections. More information.

4. Introduction of CODH computer vision and machine learning datasets such as old Japanese books and characters
Asanobu KITAMOTO (CODH -National Institute of Informatics)

  • Slides;
  • Center for Open Data in the Humanities (CODH);
  • It’s a research center in Tokyo, Japan, officially launched on April 1, 2017;
  • Scope: (1) humanities research using information technology and (2) other fields of research using humanities data.
  • Released datasets:
    • Dataset of Pre-Modern Japanese Text (PMJT): Pre-Modern Japanese Text, owned by National Institute of Japanese Literature, is released image and text data as open data. In addition, some text has description, transcription, and tagging data.

      Pre-Modern Japanese Text Dataset: currently 701 books
    • PMJT Character Shapes;
    • IIIF Curation Viewer

      Curation Viewer
  • CODH is looking for a project researcher who is interested in applying computer vision to humanities data. Contact:

5. Introduction to the new South African Centre for Digital Language Resources (SADiLaR )
Juan Steyn

  • Slides;
  • SADiLaR is a new research infrastructure set up by the Department of Science and Technology (DST) forming part of the new South African Research Infrastructure Roadmap (SARIR).
  • Officially launched on October, 2016;
  • SADiLaR runs two programs:
    • Digitisation program: which entails the systematic creation of relevant digital text, speech and multi-modal resources related to all official languages of South Africa, as well as the development of appropriate natural language processing software tools for research and development purposes;
    • A Digital Humanities program; which facilitates research capacity building by promoting and supporting the use of digital data and innovative methodological approaches within the Humanities and Social Sciences. (See

DH2017 – Computer Vision in DH workshop (Papers – First Block)

Seven papers have been selected by a review commission and authors had 15 minutes to present during the Workshop. Papers were divided into three thematic blocks:

First block: Research results using computer vision
Chair: Mark Williams (Darthmouth College)

1) Extracting Meaningful Data from Decomposing Bodies (Alison Langmead, Paul Rodriguez, Sandeep Puthanveetil Satheesan, and Alan Craig)

Full Paper

Each card used a pre-established set of eleven anthropometrical measurements (such as height, length of left foot, and width of the skull) as an index for other identifying information about each individual (such as the crime committed, their nationality, and a pair of photographs).

This presentation is about Decomposing Bodies, a large-scale, lab-based, digital humanities project housed in the Visual Media Workshop at the University of Pittsburgh that is examining the system of criminal identification introduced in France in the late 19th century by Alphonse Bertillon.

  • Data: System of criminal identification from American prisoners from Ohio.
  • ToolOpenFace. Free and open source face recognition with deep neural networks.
  • Goal: An end-to-end system for extracting handwritten text and numbers from scanned Bertillon cards in a semi-automated fashion and also the ability to browse through the original data and generated metadata using a web interface.
  • Character recognition: MNIST database
  • Mechanical Turk: we need to talk about it”: consider Mechanical Turk if public domain data and task is easy.
  • Findings: Humans deal very well with understanding discrepancies. We should not ask the computer to find these discrepancies to us, but we should build visualizations that allow us to visually compare images and identify de similarities and discrepancies.

2) Distant Viewing TV (Taylor Arnold and Lauren Tilton, University of Richmond)


Distant Viewing TV applies computational methods to the study of television series, utilizing and developing cutting-edge techniques in computer vision to analyze moving image culture on a large scale.

Screenshots of analysis of Bewitched
  • Code on Github
  • Both presenters are authors o Humanities Data in R
  • The project was built on work with libraries with low-level features (dlib, cvv and OpenCV) + many papers that attempt to identify mid-level features. Still:
    • code often nonexistent;
    • a prototype is not a library;
    • not generalizable;
    • no interoperability
  • Abstract-features such as genre and emotion, are new territories
Feature taxonomy
  • Pilot study: Bewitched (serie)
  • Goal: measure character presence and position in the scene
  • Algorithm for shot detection 
  • Algorithm for face detection
  •  Video example
  • Next steps:
    • Audio features
    • Build a formal testing set

3) Match, compare, classify, annotate: computer vision tools for the modern humanist (Giles Bergel)

The Printing Machine (Giles Bergel research blog)

This presentation related the University of Oxford’s Visual Geometry Group’s experience in making images computationally addressable for humanities research.

The Visual Geometry Group has built a number of systems for humanists, variously implementing (i) visual search, in which an image is made retrievable; (ii) comparison, which assists the discovery of similarity and difference; (iii) classification, which applies a descriptive vocabulary to images; and (iv) annotation, in which images are further described for both computational and offline analysis

a) Main Project Seebibyte

  • Idea: Visual Search for the Era of Big Data is a large research project based in the Department of Engineering Science, University of Oxford. It is funded by the EPSRC (Engineering and Physical Sciences Research Council), and will run from 2015 – 2020.
  • Objectives: to carry out fundamental research to develop next generation computer vision methods that are able to analyse, describe and search image and video content with human-like capabilities. To transfer these methods to industry and to other academic disciplines (such as Archaeology, Art, Geology, Medicine, Plant sciences and Zoology)
  • Demo: BBC News Search (Visual Search of BBC News)

Tool: VGG Image Classification (VIC) Engine

This is a technical demo of the large-scale on-the-fly web search technologies which are under development in the Oxford University Visual Geometry Group, using data provided by BBC R&D comprising over five years of prime-time news broadcasts from six channels. The demo consists of three different components, which can be used to query the dataset on-the-fly for three different query types: object search, image search and people search.

The demo consists of three different components, which can be used to query the dataset on-the-fly for three different query types.
An item of interest can be specified at run time by a text query, and a discriminative classifier for that item is then learnt on-the-fly using images downloaded from Google Image search.

ApproachImage classification through Machine Learning.
Tool: VGG Image Classification Engine (VIC)

The objective of this research is to find objects in paintings by learning classifiers from photographs on the internet. There is a live demo that allows a user to search for an object of their choosing (such as “baby”, “bird”, or “dog, for example) in a dataset of over 200,000 paintings, in a matter of seconds.

It allows computers to recognize objects in images, what is distinctive about our work is that we also recover the 2D outline of the object. Currently, the project has trained this model to recognize 20 classes. The demo allows the user to test our algorithm on their images.

b) Other projects

Approach: Image searching
Tool: VGG Image Search Engine (VISE)

Approach: Image annotation
Tool: VGG Image Annotator (VIA)


DH2017 – Computer Vision in DH workshop (Keynote)

Robots Reading Vogue Project

A keynote by Lindsay King & Peter Leonard (Yale University) on “Processing Pixels: Towards Visual Culture Computation”.


Abstract: This talk will focus on an array of algorithmic image analysis techniques, from simple to cutting-edge, on materials ranging from 19th century photography to 20th century fashion magazines. We’ll consider colormetrics, hue extraction, facial detection, and neural network-based visual similarity. We’ll also consider the opportunities and challenges of obtaining and working with large-scale image collections.

Project Robots Reading Vogue project at Digital Humanities Lab Yale University Library

1) The project:

  • 121 yrs of Vogue (2,700 covers, 400,000 pages, 6 TB of data). First experiments: N-Grams, topic modeling.
  • Humans are better at seeing “distant vision” (images) patterns with their own eyes than  “distant reading” (text)
  • A simple layout interface of covers by month and year reveals patterns about Vogue’s seasonal patterns
  • The interface is not technically difficult do implement
  • Does not use computer vision for analysis

2) Image analysis in RRV (sorting covers by color to enable browsing)

    • Media visualization (Manovich) to show saturation and hue by month. Result: differences by the season of the year. Tool used:  ImagePlot
    • “The average color problem”. Solutions:
    • Slice histograms: Visualization Peter showed.

The slice histograms give us a zoomed-out view unlike any other visualizations we’ve tried. We think of them as “visual fingerprints” that capture a macroscopic view of how the covers of Vogue changed through time.
  • “Face detection is kinda of a hot topic people talk about but I only think it is of use when it is combined with other techniques’ see e.g. face detection within 

    3. Experiment Face Detection + geography 

  •  Photogrammer
Face Detection + Geography
  • Code on Github
  • Idea: Place image as thumbnail in a map
  • Face Detection + composition
Face Detection + composition

4. Visual Similarity 

  • What if we could search for pictures that are visually similar to a given image
  • Neural networks approach
  • Demo of Visual Similarity experiment:
In the main interface, you select an image and it shows its closest neighbors.
  • In the main interface, you select an image and it shows its closest neighbors.

Other related works on Visual Similarities:

  • John Resig’s Ukiyo-e  (Japenese woodblock prints project). Article: Resig, John. “Aggregating and Analyzing Digitized Japanese Woodblock Prints.” Japanese Association of Digital Humanities conference, 2013.
  • John Resig’s  TinEye MatchEngine (Finds duplicate, modified and even derivative images in your image collection).
  • Carl Stahmer – Arch Vision (Early English Broadside / Ballad Impression Archive)
  • Article: Stahmer, Carl. (2014). “Arch-V: A platform for image-based search and retrieval of digital archives.” Digital Humanities 2014: Conference Abstracts
  • ARCHIVE-VISION Github code here
  • Peter refers to paper Benoit presented in Krakow.

5. Final thoughts and next steps

  • Towards Visual Cultures Computation
  • NNs are “indescribable”… but we can dig in to look at pixels that contribute to classifications:
  • The Digital Humanities Lab at Yale University Library is currently working with as image dataset from YALE library through Deep Learning approach to detect visual similarities.
  • This project is called Neural Neighbors and there is a live demo of neural network visual similarity on 19thC photos
Neural Neighbors seeks to show visual similarities in 80,000 19th Century photographs
  • The idea is to combine signal from pixels with signal from text
  • Question: how to organize this logistically?
  • Consider intrinsic metadata of available collections
  • Approaches to handling copyright licensing restrictions (perpetual license and transformative use)
  • Increase the number of open image collections available: museums, governments collections, social media
  • Computer science departments working on computer vision with training datasets.