AI is learning how to invent new fashions

In a paper published on arXiv, researchers from the University of California and Adobe outline a way for AI not only to learn a person’s style but also to create computer-generated images of items that match that style. This kind of computer vision task is being called “predictive fashion” and could let retailers create personalized pieces of clothing.

The model can be used for both personalized recommendation and design. Personalized recommendation is achieved with a ‘visually aware’ recommender based on Siamese CNNs; generation is achieved with a Generative Adversarial Network that synthesizes new clothing items in the user’s personal style (Kang et al., 2017).
Reference: Kang, Wang-Cheng, Chen Fang, Zhaowen Wang, and Julian McAuley. “Visually-Aware Fashion Recommendation and Design with Generative Image Models.” arXiv:1711.02231 [Cs], November 6, 2017. http://arxiv.org/abs/1711.02231.
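To make the idea concrete, here is a minimal, hypothetical sketch of a “visually aware” preference scorer in the spirit of the paper: a shared (Siamese-style) CNN embeds item images, and a per-user latent vector scores how well an item matches that user’s style. The architecture, sizes, and names are illustrative assumptions, not the authors’ implementation.

```python
# Hypothetical sketch only: a tiny "visually aware" preference scorer.
# A shared (Siamese-style) CNN embeds item images; a per-user latent vector
# scores how well an item matches that user's style. Sizes/names are invented.
import torch
import torch.nn as nn

class VisualPreferenceModel(nn.Module):
    def __init__(self, num_users, embed_dim=64):
        super().__init__()
        # Shared image encoder: the same weights embed every item image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # One latent "style" vector per user.
        self.user_embed = nn.Embedding(num_users, embed_dim)

    def forward(self, user_ids, images):
        item_vecs = self.encoder(images)           # (batch, embed_dim)
        user_vecs = self.user_embed(user_ids)      # (batch, embed_dim)
        return (item_vecs * user_vecs).sum(dim=1)  # one preference score per pair

model = VisualPreferenceModel(num_users=1000)
scores = model(torch.tensor([0, 1]), torch.randn(2, 3, 64, 64))
print(scores.shape)  # torch.Size([2])
```

Training such a recommender would typically use a pairwise ranking loss over implicit feedback; the GAN that generates new designs is a separate model conditioned on the same user representation, as the summary above notes.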

Webinar: ImageNet – Where have we been? Where are we going?

ACM Learning Webinar
ImageNet: Where have we been? Where are we going?
Speaker: Fei-Fei Li
Chief Scientist of AI/ML at Google Cloud; Associate Professor at Stanford, Director of Stanford A.I. Lab

Slides

Webinar abstract: It took nature and evolution more than 500 million years to develop a powerful visual system in humans. The journey of AI and computer vision is about half a century long. In this talk, Dr. Li will briefly discuss the key ideas and the cutting-edge advances in the quest for visual intelligence in computers, focusing on work done to develop ImageNet over the years.

_____

Some highlights of this webinar:

1) The impact of ImageNet on AI/ ML research:
  • First, what’s ImageNet? It’s an image database, a “… large-scale ontology of images built upon the backbone of the WordNet structure”;
  • The article “ImageNet: A Large-Scale Hierarchical Image Database” (1) had ~4,386 citations on Google Scholar at the time of the talk;
  • The dataset gave rise to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (2), a benchmark for image classification and object detection that has run annually since 2010;
  • Many ImageNet Challenge contestants became startups (e.g. Clarifai, VizSense);
  • ImageNet became a key driving force for the adoption of deep learning and helped spread the culture of building structured, annotated datasets for specific domains:
Annotated datasets for specific domains.
  • Kaggle: a platform for predictive-modeling and analytics competitions, in which companies and researchers post data and statisticians and data miners compete to produce the best models for predicting and describing those data

“Datasets – not algorithms – might be the key limiting factor to development of human-level artificial intelligence.” (Alexander Wissner-Gross, 2016)

2) The background of ImageNet
  • The beginning: Publication about ImageNet in CVPR (2009);
  • There are a lot of previous datasets that should be acknowledged:
Previous image datasets.
  • The reason ImageNet became so popular is that the dataset has the right characteristics for tackling Computer Vision (CV) tasks with a Machine Learning (ML) approach;
  • By 2005, the marriage of ML and CV became a trend in the scientific community;
  • There was a shift in the way ML was applied for visual recognition tasks: from a modeling-oriented approach to having lots of data.
  • This shift was partly enabled by the rapid growth of the internet, which made it possible to collect large-scale visual data.
3) From WordNet to ImageNet
  • ImageNet was built upon the backbone of WordNet, a dataset that has enabled a great deal of work in Natural Language Processing (NLP) and related tasks.
  • What’s WordNet? It’s a large lexical database of English. The original paper (3) by George Miller et al. has been cited over 5,000 times. The database organizes over 150k words into 117k categories and establishes ontological and lexical relationships used in NLP and related tasks.
  • The idea to move from language to image:
From WordNet to ImageNet.
  • The shift happened in three steps (see the sketch after this list):
    • Step 1: build ontological structures based on WordNet;
    • Step 2: populate the categories with thousands of images from the internet;
    • Step 3: clean bad results manually. By cleaning the errors you ensure the dataset is accurate.
From WordNet to ImageNet: three steps.
  • There were three attempts to populate, train, and test the dataset. The first two failed; the third succeeded thanks to a technology that had just become available: Amazon Mechanical Turk, a crowdsourcing platform. ImageNet had the help of 49k workers from 167 countries (2007-2010).
  • After three years, ImageNet went live in 2009 (50M images organized into 10K concept categories)
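As a small illustration of step 1 above, the sketch below uses WordNet’s noun hierarchy (via NLTK, assuming the `wordnet` corpus has been downloaded) to enumerate candidate categories under a root concept. Steps 2 and 3 (harvesting candidate images and crowdsourced cleaning) are only indicated as comments, since a few lines cannot reproduce them.

```python
# Sketch of step 1 only, under the assumption that NLTK and its "wordnet"
# corpus are installed (nltk.download("wordnet")). Steps 2 and 3 (harvesting
# candidate images and crowdsourced cleaning) are indicated as comments only.
from nltk.corpus import wordnet as wn

def collect_categories(root_name, max_depth=2):
    """Walk the hyponym tree below a root synset, e.g. 'dog.n.01'."""
    root = wn.synset(root_name)
    categories, frontier = [], [(root, 0)]
    while frontier:
        synset, depth = frontier.pop()
        categories.append(synset)
        if depth < max_depth:
            frontier.extend((h, depth + 1) for h in synset.hyponyms())
    return categories

for synset in collect_categories("dog.n.01", max_depth=1):
    # Step 2 would query image search engines with these lemma names;
    # step 3 would have human annotators discard off-category results.
    print(synset.name(), synset.lemma_names()[:3])
```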
4) What did they do right?
  • Based on ML needs, ImageNet targeted scale:
ImageNet: large-scale visual data
  • In addition, the dataset prioritized:
    • image quality (high resolution, to better replicate human visual acuity);
    • accurate annotations (to create a benchmarking dataset and advance the state of machine perception);
    • being free of charge (to ensure immediate application and a sense of community -> democratization)
  • Emphasis on Community: ILSVRC challenge is launched in 2009;
  • ILSVRC was inspired by PASCAL VOC (Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes), which ran from 2005 to 2012.
  • Participation and performance: the number of entries increased, top-5 classification error went down, and the average precision for object detection went up (a short sketch of how top-5 error is computed follows the chart):
Participation and performance at ILSVRC (2010-2017)
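For readers unfamiliar with the metric in the chart above, here is a small, self-contained illustration (not ILSVRC code) of how top-5 classification error is computed: a prediction counts as correct if the true label appears among the model’s five highest-scoring classes.

```python
# Self-contained illustration of top-5 error on random scores (not ILSVRC code).
import numpy as np

def top5_error(scores, labels):
    """scores: (n_images, n_classes) class scores; labels: (n_images,) true classes."""
    top5 = np.argsort(scores, axis=1)[:, -5:]        # indices of the 5 best classes
    hits = np.any(top5 == labels[:, None], axis=1)   # is the true label among them?
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.random((100, 1000))                     # 100 images, 1000 classes
labels = rng.integers(0, 1000, size=100)
print(top5_error(scores, labels))                    # close to 0.995 for random guessing
```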
5) Where has ImageNet invested its efforts, and where is it still investing?
  • Lack of detail: originally, just one category was annotated per image. Object detection made it possible to recognize more than one class per image (through bounding boxes);
  • Hierarchical annotation:
Confusion matrix and sub-matrices of classifying the 7404 leaf categories in ImageNet7K, ordered by a depth-first traversal of the WordNet hierarchy (J. Deng, A. Berg & L. Fei-Fei, ECCV, 2010) (4)
  • Fine-grained recognition: recognize similar objects (class of cars, for example):
Fine-Grained Recognition (Gebru, Krause, Deng, Fei-Fei, CHI 2017)
6) Expected outcomes
  • ImageNet became a benchmark
  • It meant a breakthrough in object recognition
  • Machine learning advanced and changed dramatically
7) Unexpected outcomes
  • Neural nets became popular in academic research again
  • Together with the increasing availability of accurate datasets and high-performance GPUs, they drove a deep learning revolution:
  • Maximize specificity in ontological structures:
Maximizing specificity (Deng, Krause, Berg & Fei-Fei, CVPR 2012)
  • Still, relatively few works use ontological structures;
  • How humans compare with machines:
How humans and machines compare (Andrej Karpathy, 2014)
8) What lies ahead
  • Moving from object recognition to human-level understanding (from perception to cognition):
More than recognizing objects, AI will allow scene understanding, that is, reasoning about the relations between people, actions, and artifacts in an image.
  • That’s the concept behind Microsoft COCO (Common Objects in Context) (5), a “dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding”;
  • More recently there is the Visual Genome (6): a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to language (a sketch of this kind of scene-graph structure appears after this list):
    • Specs: 108,249 images (COCO images); 4.2M image descriptions; 1.8M Visual QA (7W); 1.4M objects, 75.7K object classes; 1.5M relationships, 40.5K relationship classes; 1.7M attributes, 40.5K attribute classes; vision-and-language correspondences; everything mapped to WordNet synsets
    • Exploratory interface:
The interface allows searching for images and selecting different image attributes.
  • The Visual Genome dataset has since been used to advance the state of the art in CV:
    • Paragraph generation;
    • Relationship prediction;
    • Image retrieval with scene graph;
    • visual question answering
  • The future of vision intelligence relies upon the integration of perception, understanding, and action;
  • From now on, ImageNet ILSVRC challenge will be organized by Kaggle, a data science community that organizes competitions and makes datasets available.
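To make the structure of such datasets concrete, here is a minimal, hypothetical sketch (not the actual Visual Genome schema) of a scene graph: objects with attributes and bounding boxes, connected by subject-predicate-object relationships grounded in a single image.

```python
# Hypothetical, minimal scene-graph structures; not the Visual Genome schema.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str                       # e.g. "dog", ideally mapped to a WordNet synset
    bbox: tuple                     # (x, y, width, height) in image coordinates
    attributes: list = field(default_factory=list)

@dataclass
class Relationship:
    subject: SceneObject
    predicate: str                  # e.g. "riding", "next to"
    obj: SceneObject

dog = SceneObject("dog", (10, 20, 80, 60), ["brown"])
skateboard = SceneObject("skateboard", (15, 70, 70, 20))
scene_graph = [Relationship(dog, "riding", skateboard)]
print(scene_graph[0].subject.name, scene_graph[0].predicate, scene_graph[0].obj.name)
```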
References 

(1) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database. IEEE Computer Vision and Pattern Recognition (CVPR), 2009. 

(2) Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” International Journal of Computer Vision 115, no. 3 (December 2015): 211–52. doi:10.1007/s11263-015-0816-y.

(3) Miller, George A. “WordNet: A Lexical Database for English.” Communications of the ACM 38, no. 11 (1995): 39–41.

(4) Deng, Jia, Alexander C. Berg, Kai Li, and Li Fei-Fei. “What Does Classifying More than 10,000 Image Categories Tell Us?” In European Conference on Computer Vision, 71–84. Springer, 2010. https://link.springer.com/chapter/10.1007/978-3-642-15555-0_6.

(5) Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. “Microsoft COCO: Common Objects in Context.” arXiv:1405.0312 [Cs], May 1, 2014. http://arxiv.org/abs/1405.0312.

(6) Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. “Visual Genome.” Accessed September 27, 2017. https://pdfs.semanticscholar.org/fdc2/d05c9ee932fa19df3edb9922b4f0406538a4.pdf.

Your face in 3D

Reconstructing a 3D model of a face is a fundamental Computer Vision problem that usually requires multiple images. A recent publication presents an artificial intelligence approach to this problem, and it does an impressive job!

In this work, the authors train a Convolutional Neural Network (CNN) on an appropriate dataset consisting of 2D images and 3D facial models or scans. See more information at their project website.
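The core idea is direct volumetric regression: a CNN maps a single 2D image to a 3D occupancy volume. The sketch below is a deliberately tiny, hypothetical stand-in for that idea (the architecture, sizes, and loss setup are assumptions, not the authors’ network), meant only to show the input/output shapes involved.

```python
# Toy stand-in for direct volumetric regression (architecture and sizes invented):
# one 2D face image in, a low-resolution voxel occupancy volume out,
# trained with a per-voxel binary cross-entropy loss.
import torch
import torch.nn as nn

class TinyVolumetricRegressor(nn.Module):
    def __init__(self, vol=32):
        super().__init__()
        self.vol = vol
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.head = nn.Linear(64 * 4 * 4, vol ** 3)

    def forward(self, images):
        logits = self.head(self.features(images))
        return logits.view(-1, self.vol, self.vol, self.vol)  # voxel occupancy logits

model = TinyVolumetricRegressor()
pred = model(torch.randn(2, 3, 128, 128))                      # two input face images
target = torch.randint(0, 2, pred.shape).float()               # fake ground-truth volumes
loss = nn.BCEWithLogitsLoss()(pred, target)
print(pred.shape, float(loss))
```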

Try their online demo!
Reference: Jackson, Aaron S., Adrian Bulat, Vasileios Argyriou, and Georgios Tzimiropoulos. “Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression.” arXiv:1703.07834 [Cs], March 22, 2017. http://arxiv.org/abs/1703.07834.

The problem of gender bias in the depiction of activities such as cooking and sports in images

The challenge is teaching machines to understand the world without reproducing prejudices. Researchers from the University of Virginia have found that intelligent systems link the action of cooking in images much more strongly to women than to men.

Gender-bias test with artificial intelligence for the action “cooking”: women are more strongly associated with it, even when the image shows a man.

Just as search engines – of which Google is the prime example – do not operate with absolute neutrality, free of any bias or prejudice, machines equipped with artificial intelligence and trained to identify and categorize what they see in photos do not work in a neutral way either.
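As a toy illustration (not the authors’ code, and with made-up numbers), the kind of statistic behind such findings is a simple co-occurrence ratio: how often an activity label is paired with a female agent in the annotations versus in a model’s predictions. When the ratio is higher in the predictions than in the training data, the model has amplified the bias.

```python
# Toy illustration with made-up numbers: the share of "cooking" annotations
# whose agent is labelled "woman", in training data vs. model predictions.
from collections import Counter

def share_woman(annotations, activity="cooking"):
    counts = Counter(gender for act, gender in annotations if act == activity)
    total = counts["woman"] + counts["man"]
    return counts["woman"] / total if total else float("nan")

training_set = [("cooking", "woman")] * 60 + [("cooking", "man")] * 40   # invented counts
model_output = [("cooking", "woman")] * 80 + [("cooking", "man")] * 20   # invented counts
print(share_woman(training_set), share_woman(model_output))
# A larger share in the predictions than in the training data is bias amplification.
```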

Article on Wired.

Article on Nexo (Portuguese)

Reference: Zhao, Jieyu, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. “Men Also Like Shopping: Reducing Gender Bias Amplification Using Corpus-Level Constraints.” arXiv:1707.09457 [Cs, Stat], July 28, 2017. http://arxiv.org/abs/1707.09457.

A Computer Vision and ML approach to understand urban changes

By comparing 1.6 million pairs of photos taken seven years apart, researchers from MIT’s Collective Learning Group have used a new computer vision system to quantify the physical improvement or deterioration of neighborhoods in five American cities, in an attempt to identify factors that predict urban change.

A large positive Streetchange value is typically indicative of major new construction (top row). A large negative Streetchange value is typically indicative of abandoned or demolished housing (bottom row).

The project is called Streetchange. An article introducing the project can be found here.
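Schematically, the idea reduces to scoring each street-view image with a trained perception model and taking the difference over time for the same location. The sketch below is an assumption-laden stand-in (a trivial brightness statistic replaces the trained model) meant only to show how a Streetchange value is formed.

```python
# Schematic only: a trivial brightness statistic stands in for the trained
# perception model that scores each street-view image.
import numpy as np

def score_image(image):
    """Stand-in for a trained scoring model; expects an (H, W, 3) array."""
    return float(np.asarray(image, dtype=float).mean())

def streetchange(photo_earlier, photo_later):
    # Positive: likely improvement (e.g. new construction);
    # negative: likely deterioration (e.g. abandoned or demolished housing).
    return score_image(photo_later) - score_image(photo_earlier)

rng = np.random.default_rng(0)
print(streetchange(rng.random((224, 224, 3)), rng.random((224, 224, 3))))
```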

Reference:Naik, Nikhil, Scott Duke Kominers, Ramesh Raskar, Edward L. Glaeser, and César A. Hidalgo. “Computer Vision Uncovers Predictors of Physical Urban Change.” Proceedings of the National Academy of Sciences 114, no. 29 (July 18, 2017): 7571–76. doi:10.1073/pnas.1619003114.

DH2017 – Computer Vision in DH workshop (Papers – First Block)

Seven papers were selected by a review committee, and each author had 15 minutes to present during the workshop. The papers were divided into three thematic blocks:

First block: Research results using computer vision
Chair: Mark Williams (Dartmouth College)

1) Extracting Meaningful Data from Decomposing Bodies (Alison Langmead, Paul Rodriguez, Sandeep Puthanveetil Satheesan, and Alan Craig)

Abstract
Slides
Full Paper

Each card used a pre-established set of eleven anthropometrical measurements (such as height, length of left foot, and width of the skull) as an index for other identifying information about each individual (such as the crime committed, their nationality, and a pair of photographs).

This presentation is about Decomposing Bodies, a large-scale, lab-based, digital humanities project housed in the Visual Media Workshop at the University of Pittsburgh that is examining the system of criminal identification introduced in France in the late 19th century by Alphonse Bertillon.

  • Data: system of criminal identification cards for American prisoners from Ohio.
  • Tool: OpenFace. Free and open-source face recognition with deep neural networks.
  • Goal: An end-to-end system for extracting handwritten text and numbers from scanned Bertillon cards in a semi-automated fashion and also the ability to browse through the original data and generated metadata using a web interface.
  • Character recognition: MNIST database (a minimal digit-classification sketch follows this list)
  • “Mechanical Turk: we need to talk about it”: consider Mechanical Turk when the data are in the public domain and the task is easy.
  • Findings: humans deal very well with understanding discrepancies. We should not ask the computer to find these discrepancies for us; instead, we should build visualizations that allow us to visually compare images and identify the similarities and discrepancies.
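For the character-recognition step mentioned above, here is a minimal sketch using scikit-learn’s small built-in digits dataset as a stand-in for MNIST. The actual Decomposing Bodies pipeline is more involved (the handwritten fields must first be located and segmented on each card); this only illustrates the classification step.

```python
# Digit classification on scikit-learn's built-in digits dataset,
# as a small stand-in for MNIST-style character recognition.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = SVC(gamma=0.001).fit(X_train, y_train)   # simple classifier on raw pixel values
print("test accuracy:", clf.score(X_test, y_test))
```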

2) Distant Viewing TV (Taylor Arnold and Lauren Tilton, University of Richmond)

Abstract
Slides

Distant Viewing TV applies computational methods to the study of television series, utilizing and developing cutting-edge techniques in computer vision to analyze moving image culture on a large scale.

Screenshots of analysis of Bewitched
  • Code on Github
  • Both presenters are authors of Humanities Data in R
  • The project builds on work with libraries for low-level features (dlib, cvv, and OpenCV) plus many papers that attempt to identify mid-level features. Still:
    • code often nonexistent;
    • a prototype is not a library;
    • not generalizable;
    • no interoperability
  • Abstract features, such as genre and emotion, are new territory
Feature taxonomy
  • Pilot study: Bewitched (series)
  • Goal: measure character presence and position in the scene
  • Algorithm for shot detection (see the sketch after this list)
  • Algorithm for face detection
  •  Video example
  • Next steps:
    • Audio features
    • Build a formal testing set
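Shot detection can be done in several ways, and the presenters did not detail theirs, so the sketch below shows one common heuristic under that caveat: flag a shot boundary whenever the colour histogram of consecutive frames changes sharply. The video path and threshold are placeholders.

```python
# One common shot-detection heuristic (not necessarily the project's algorithm):
# compare HSV colour histograms of consecutive frames and flag abrupt changes.
# "episode.mp4" and the threshold are placeholders.
import cv2

def detect_shots(path, threshold=0.6):
    cap = cv2.VideoCapture(path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < threshold:           # abrupt change -> a new shot begins here
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries

print(detect_shots("episode.mp4"))
```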

3) Match, compare, classify, annotate: computer vision tools for the modern humanist (Giles Bergel)

Abstract
Slides
The Printing Machine (Giles Bergel research blog)

This presentation related the University of Oxford’s Visual Geometry Group’s experience in making images computationally addressable for humanities research.

The Visual Geometry Group has built a number of systems for humanists, variously implementing (i) visual search, in which an image is made retrievable; (ii) comparison, which assists the discovery of similarity and difference; (iii) classification, which applies a descriptive vocabulary to images; and (iv) annotation, in which images are further described for both computational and offline analysis

a) Main project: Seebibyte

  • Idea: Visual Search for the Era of Big Data is a large research project based in the Department of Engineering Science, University of Oxford. It is funded by the EPSRC (Engineering and Physical Sciences Research Council) and runs from 2015 to 2020.
  • Objectives: to carry out fundamental research to develop next-generation computer vision methods able to analyse, describe, and search image and video content with human-like capabilities, and to transfer these methods to industry and to other academic disciplines (such as Archaeology, Art, Geology, Medicine, Plant Sciences, and Zoology)
  • Demo: BBC News Search (Visual Search of BBC News)

Tool: VGG Image Classification (VIC) Engine

This is a technical demo of the large-scale on-the-fly web search technologies which are under development in the Oxford University Visual Geometry Group, using data provided by BBC R&D comprising over five years of prime-time news broadcasts from six channels. The demo consists of three different components, which can be used to query the dataset on-the-fly for three different query types: object search, image search and people search.

An item of interest can be specified at run time by a text query, and a discriminative classifier for that item is then learnt on-the-fly using images downloaded from Google Image search.
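A simplified sketch of this on-the-fly idea (not VGG’s implementation): features of images returned for the text query act as positives, a fixed pool of generic images acts as negatives, a linear classifier is trained in seconds, and the whole collection is then ranked by its scores. The feature arrays here are assumed to be precomputed CNN descriptors.

```python
# Simplified on-the-fly ranking: positives from the text query, a fixed negative
# pool, a fast linear classifier, then scores over the whole collection.
# The feature arrays are assumed to be precomputed CNN descriptors.
import numpy as np
from sklearn.svm import LinearSVC

def on_the_fly_ranker(query_features, negative_pool, collection_features):
    X = np.vstack([query_features, negative_pool])
    y = np.concatenate([np.ones(len(query_features)), np.zeros(len(negative_pool))])
    clf = LinearSVC(C=1.0).fit(X, y)                      # trained in seconds
    scores = clf.decision_function(collection_features)   # score every collection item
    return np.argsort(-scores)                            # best matches first

rng = np.random.default_rng(0)
ranking = on_the_fly_ranker(rng.random((20, 512)),        # query positives
                            rng.random((200, 512)),       # generic negatives
                            rng.random((1000, 512)))      # the searched collection
print(ranking[:10])
```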

Approach: image classification through machine learning.
Tool: VGG Image Classification Engine (VIC)

The objective of this research is to find objects in paintings by learning classifiers from photographs on the internet. A live demo allows a user to search for an object of their choosing (such as “baby”, “bird”, or “dog”) in a dataset of over 200,000 paintings, in a matter of seconds.

It allows computers to recognize objects in images; what is distinctive about this work is that it also recovers the 2D outline of the object. Currently, the model has been trained to recognize 20 classes. The demo allows users to test the algorithm on their own images.

b) Other projects

Approach: Image searching
Tool: VGG Image Search Engine (VISE)

Approach: Image annotation
Tool: VGG Image Annotator (VIA)

 

DH2017 – Computer Vision in DH workshop (Keynote)

Robots Reading Vogue Project

A keynote by Lindsay King & Peter Leonard (Yale University) on “Processing Pixels: Towards Visual Culture Computation”.

SLIDES HERE

Abstract: This talk will focus on an array of algorithmic image analysis techniques, from simple to cutting-edge, on materials ranging from 19th century photography to 20th century fashion magazines. We’ll consider colormetrics, hue extraction, facial detection, and neural network-based visual similarity. We’ll also consider the opportunities and challenges of obtaining and working with large-scale image collections.

Project: Robots Reading Vogue, at the Digital Humanities Lab, Yale University Library

1) The project:

  • 121 years of Vogue (2,700 covers, 400,000 pages, 6 TB of data). First experiments: n-grams, topic modeling.
  • Humans are better at spotting patterns in images with their own eyes (“distant viewing”) than at “distant reading” large amounts of text
  • A simple interface laying out covers by month and year reveals Vogue’s seasonal patterns
  • The interface is not technically difficult to implement
  • It does not use computer vision for analysis

2) Image analysis in RRV (sorting covers by color to enable browsing)

    • Media visualization (Manovich) showing saturation and hue by month. Result: differences by season of the year. Tool used: ImagePlot (a sketch of the per-cover hue/saturation measurement appears at the end of this section)
    • “The average color problem”. Solution:
    • Slice histograms: the visualization Peter showed.

The slice histograms give us a zoomed-out view unlike any other visualizations we’ve tried. We think of them as “visual fingerprints” that capture a macroscopic view of how the covers of Vogue changed through time.
  • “Face detection is kind of a hot topic people talk about, but I think it is only of use when it is combined with other techniques” – see, e.g., the face detection experiments combined with geography and composition below.
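As promised above, here is a small sketch of the kind of colormetric measurement behind the hue/saturation visualizations: mean hue and saturation per cover, which the project then groups by publication month. A synthetic image stands in for a scanned cover so the snippet runs on its own.

```python
# Mean hue and saturation per image; in the real pipeline each scanned cover
# would be loaded with Image.open(path) and the values grouped by month.
import numpy as np
from PIL import Image

def hue_saturation(img):
    """Mean hue and saturation of a PIL image, on the 0-255 scale PIL uses."""
    hsv = np.asarray(img.convert("HSV"), dtype=float)
    return hsv[..., 0].mean(), hsv[..., 1].mean()

demo_cover = Image.fromarray((np.random.rand(64, 64, 3) * 255).astype("uint8"))
print(hue_saturation(demo_cover))
```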

3) Experiment: Face Detection + Geography

  •  Photogrammar
Face Detection + Geography
  • Code on Github
  • Idea: place each image as a thumbnail on a map
  • Face Detection + composition
Face Detection + composition

4) Visual Similarity

  • What if we could search for pictures that are visually similar to a given image?
  • Neural-networks approach (see the sketch below)
  • Demo of the Visual Similarity experiment:
In the main interface, you select an image and it shows its closest neighbors.
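A compact sketch of the usual recipe behind such demos: embed every image with a pretrained CNN, then return the images whose embeddings lie closest to the query’s. Feature extraction is abbreviated to a random placeholder array here; in practice each row would be a CNN descriptor of one cover.

```python
# Nearest-neighbour search over image embeddings; a random array stands in
# for per-image CNN features.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
features = rng.random((2700, 512))        # one row per cover, e.g. CNN descriptors

index = NearestNeighbors(n_neighbors=6, metric="cosine").fit(features)
distances, neighbors = index.kneighbors(features[42:43])
print(neighbors[0][1:])                   # the 5 covers most similar to cover 42
```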

Other related works on Visual Similarities:

  • John Resig’s Ukiyo-e project (Japanese woodblock prints). Article: Resig, John. “Aggregating and Analyzing Digitized Japanese Woodblock Prints.” Japanese Association of Digital Humanities conference, 2013.
  • TinEye MatchEngine, used by John Resig (finds duplicate, modified, and even derivative images in an image collection).
  • Carl Stahmer – Arch Vision (Early English Broadside / Ballad Impression Archive)
  • Article: Stahmer, Carl. (2014). “Arch-V: A platform for image-based search and retrieval of digital archives.” Digital Humanities 2014: Conference Abstracts
  • ARCHIVE-VISION Github code here
  • Peter refers to the paper Benoit presented in Krakow.

5) Final thoughts and next steps

  • Towards Visual Cultures Computation
  • NNs are “indescribable”… but we can dig in and look at the pixels that contribute to classifications: http://cs231n.github.io/understanding-cnn/
  • The Digital Humanities Lab at Yale University Library is currently working with an image dataset from the Yale library, using a deep learning approach to detect visual similarities.
  • This project is called Neural Neighbors and there is a live demo of neural-network visual similarity on 19th-century photos
Neural Neighbors seeks to show visual similarities in 80,000 19th Century photographs
  • The idea is to combine signal from pixels with signal from text
  • Question: how to organize this logistically?
  • Consider intrinsic metadata of available collections
  • Approaches to handling copyright licensing restrictions (perpetual license and transformative use)
  • Increase the number of open image collections available: museums, governments collections, social media
  • Computer science departments working on computer vision with training datasets.

 

Designing ML-driven products

The People + AI Research Initiative (PAIR), launched on 10th July 2017 by Google Brain Team, brings together researchers across Google to study and redesign the ways people interact with AI systems.

The article “Human-Centered Machine Learning” by Jess Holbrook¹, addresses how ML is causing UX designers to rethink, restructure, displace, and consider new possibilities for every product or service they build.

Both texts made me think about the image search and comparison engine I am proposing from a user-centered point of view. I can take the following user needs identified by Martin Wattenberg and Fernanda Viégas and try to apply them to the product I am planning to implement and evaluate:

  • Engineers and researchers: AI is built by people. How might we make it easier for engineers to build and understand machine learning systems? What educational materials and practical tools do they need?
  • Domain experts: How can AI aid and augment professionals in their work? How might we support doctors, technicians, designers, farmers, and musicians as they increasingly use AI?
  • Everyday users: How might we ensure machine learning is inclusive, so everyone can benefit from breakthroughs in AI? Can design thinking open up entirely new AI applications? Can we democratize the technology behind AI?

In my opinion, my research aims to address the needs of “domain experts” (e.g., designers and other professionals interested in visual discovery) and of everyday users. But how should this image search and comparison engine be designed with an ML-driven approach, or what Jess Holbrook calls “Human-Centered Machine Learning”? His text lists seven steps for staying focused on the user when designing with ML. However, I want to highlight a distinction between what I see as a fully ML-driven product (in the way Google builds them) and what I understand to be a product that uses an ML approach in its conception but not in its entirety (that is, the engine proposed in my research).

A fully ML-driven product results in an interface that dynamically responds to user input. That is, the pre-trained model performs tasks during user interaction and the interface presents the desired output for each input. Or even more: the model can be retrained on the user’s data during interaction and the interface will dynamically show the results.

In my research, on the other hand, the ML approach will only be used during the image classification phase, which does not involve the final user. After we collect all the images from Twitter (or Instagram), these data will be categorized by the Google Vision API, which is driven by ML algorithms. The results of Google’s classification will then be selected and used to organize the images in a multimedia interface. Finally, the user will be able to search for images through text queries or by selecting filters based on the ML image classification. During user interaction, however, no ML tasks are being performed.
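For reference, the classification step could look roughly like the sketch below, which calls the Google Cloud Vision API’s label detection through a recent version of the google-cloud-vision Python client. The image path is a placeholder and valid API credentials are assumed; the returned labels would then feed the interface’s search filters.

```python
# Label detection with the Google Cloud Vision API (google-cloud-vision >= 2.x).
# Assumes valid API credentials; "tweet_image.jpg" is a placeholder path.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("tweet_image.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)
labels = {label.description: label.score for label in response.label_annotations}
print(labels)  # label/score pairs that can later drive the interface's search filters
```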

 

1 UX Manager and UX Researcher in the Research and Machine Intelligence group at Google

StreetStyle: Exploring world-wide clothing styles from millions of photos

Each day billions of photographs are uploaded to photo-sharing services and social media platforms. These images are packed with information about how people live around the world. In this paper we exploit this rich trove of data to understand fashion and style trends worldwide. We present a framework for visual discovery at scale, analyzing clothing and fashion across millions of images of people around the world and spanning several years. We introduce a large-scale dataset of photos of people annotated with clothing attributes, and use this dataset to train attribute classifiers via deep learning. We also present a method for discovering visually consistent style clusters that capture useful visual correlations in this massive dataset. Using these tools, we analyze millions of photos to derive visual insight, producing a first-of-its-kind analysis of global and per-city fashion choices and spatio-temporal trends.

Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:1706.01869 [cs.CV]
(or arXiv:1706.01869v1 [cs.CV] for this version)
 Source: https://arxiv.org/abs/1706.01869v1