Data mining with historical documents

The last seminar held by the Vision and Graphics Laboratory was about data mining with historical documents. Marcelo Ribeiro, a master's student at the Applied Mathematics School of the Getúlio Vargas Foundation (EMAp/FGV), presented the results obtained by applying topic modeling and natural language processing to the analysis of historical documents. This work was previously presented at the first International Digital Humanities Conference held in Brazil (HDRIO2018), with Renato Rocha Souza (professor and researcher at EMAp/FGV) and Alexandre Moreli (professor and researcher at USP) as co-authors.

The database used is part of the CPDOC-FGV collection and essentially comprises historical documents from the 1970s belonging to Antonio Azeredo da Silveira, former Minister of Foreign Affairs of Brazil.

The documents:

• +10 thousand documents
• +66 thousand pages
• +14 million tokens / words (whether found in dictionaries or not)
• 5 languages, mainly Portuguese

• Physical documents
• Images (.tif and .jpg)
• Texts (.txt)

The presentation addressed the steps of the project, from document digitization to the integration of results into the History-Lab platform.

The images below refer to the explanation of the OCR (Optical Character Recognition) phase and the topic modeling phase:

Presentation slides (in Portuguese) can be accessed here. This initiative is part of the History Lab project, organized by Columbia University, which uses data science methods to investigate history.

Webinar: ImageNet – Where have we been? Where are we going?

ACM Learning Webinar
ImageNet: Where have we been? Where are we going?
Speaker: Fei-Fei Li
Chief Scientist of AI/ML at Google Cloud; Associate Professor at Stanford, Director of Stanford A.I. Lab


Webinar abstract: It took nature and evolution more than 500 million years to develop a powerful visual system in humans. The journey for AI and computer vision is about half of a century. In this talk, Dr. Li will briefly discuss the key ideas and the cutting-edge advances in the quest for visual intelligence in computers, focusing on work done to develop ImageNet over the years.


Some highlights of this webinar:

1) The impact of ImageNet on AI/ML research:
  • First, what’s ImageNet? It’s an image database, a “… large-scale ontology of images built upon the backbone of the WordNet structure”;
  • The article “ImageNet: A Large-Scale Hierarchical Image Database” (1) had ~4,386 citations on Google Scholar at the time of writing;
  • The dataset gave rise to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (2), a benchmark in image classification and object detection that has run annually since 2010;
  • Many ImageNet Challenge contestants became startups (e.g. Clarifai; VizSense);
  • ImageNet became a key driving force for the adoption of deep learning and helped to spread the culture of building structured datasets for specific domains:
Annotated datasets for specific domains.
  • Kaggle: a platform for predictive modeling and analytics competitions in which companies and researchers post data and statisticians and data miners compete to produce the best models for predicting and describing the data

“Datasets – not algorithms – might be the key limiting factor to development of human-level artificial intelligence.” (Alexander Wissner-Gross, 2016)

2) The background of ImageNet
  • The beginning: Publication about ImageNet in CVPR (2009);
  • There are a lot of previous datasets that should be acknowledged:
Previous image datasets.
  • The reason why ImageNet became so popular is that this dataset has the right characteristics for approaching Computer Vision (CV) tasks with Machine Learning (ML);
  • By 2005, the marriage of ML and CV became a trend in the scientific community;
  • There was a shift in the way ML was applied to visual recognition tasks: from a modeling-oriented approach to one that relies on having lots of data.
  • This shift was partly enabled by the rapid growth of internet data, which created the opportunity to collect visual data at large scale.
3) From WordNet to ImageNet
  • ImageNet was built upon the backbone of WordNet, a tremendous dataset that enabled work in Natural Language Processing (NLP) and related tasks.
  • What’s WordNet? It’s a large lexical database of English. The original paper (3) by George Miller et al. has been cited over 5k times. The database organizes over 150k words into 117k categories. It establishes ontological and lexical relationships used in NLP and related tasks.
  • The idea to move from language to image:
From WordNet to ImageNet.
  • A three-step shift:
    • Step 1: build ontological structures based on WordNet;
    • Step 2: populate categories with thousands of images from the internet;
    • Step 3: clean bad results manually. Removing the errors keeps the dataset accurate.
From WordNet to ImageNet: three steps.
  • There were three attempts to populate, train and test the dataset. The first two failed. The third succeeded thanks to a new technology that became available at that time: Amazon Mechanical Turk, a crowdsourcing platform. ImageNet had the help of 49k workers from 167 countries (2007-2010).
  • After three years, ImageNet went live in 2009 (50M images organized by 10K concept categories)
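The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual ImageNet pipeline: the toy ontology, the image file names, the worker votes, and the simple majority rule in the hypothetical `clean_candidates` function are all invented for the example.

```python
# Step 1 (assumed toy ontology): category -> parent, in the spirit of WordNet.
ONTOLOGY = {"husky": "dog", "beagle": "dog", "dog": "animal"}

# Step 2 (hypothetical data): candidate images collected per category, each
# with yes/no votes from crowd workers ("does this image show the category?").
CANDIDATES = {
    "husky": {
        "img_001.jpg": [True, True, True],
        "img_002.jpg": [False, True, False],
        "img_003.jpg": [True, False, True],
    },
    "beagle": {
        "img_101.jpg": [False, False, False],
        "img_102.jpg": [True, True, False],
    },
}


def clean_candidates(votes_by_image, min_agreement=0.5):
    """Step 3: keep images whose share of 'yes' votes exceeds min_agreement.

    A plain majority vote is an assumption of this sketch.
    """
    kept = []
    for image, votes in votes_by_image.items():
        if votes and sum(votes) / len(votes) > min_agreement:
            kept.append(image)
    return sorted(kept)


def build_dataset(ontology, candidates):
    """Return {category: cleaned image list} for categories in the ontology."""
    return {
        category: clean_candidates(images)
        for category, images in candidates.items()
        if category in ontology
    }


dataset = build_dataset(ONTOLOGY, CANDIDATES)
print(dataset)
```

Raising `min_agreement` trades dataset size for label accuracy, which is the tension the manual cleaning step has to manage.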
4) What did they do right?
  • Based on ML needs, ImageNet targeted scale:
ImageNet: large-scale visual data
  • Besides, the database cared about:
    • image quality (high resolution to better replicate human visual acuity);
    • accurate annotations (to create a benchmarking dataset and advance the state of machine perception);
    • free of charge (to ensure immediate application and a sense of community -> democratization)
  • Emphasis on community: the ILSVRC challenge was launched in 2010;
  • ILSVRC was inspired by the PASCAL VOC challenge (PASCAL: Pattern Analysis, Statistical Modelling, and Computational Learning; VOC: Visual Object Classes), which ran from 2005 to 2012.
  • Participation and performance: the number of entries increased; classification errors (top-5) went down; the average precision for object detection went up:
Participation and performance at ILSVRC (2010-2017)
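The top-5 classification error mentioned above counts a prediction as correct when the true label appears among the model's five highest-ranked guesses. A minimal sketch of the metric follows; the function name `top_k_error` and the sample predictions are illustrative assumptions, not part of any ILSVRC tooling.

```python
def top_k_error(ranked_predictions, true_labels, k=5):
    """Fraction of examples whose true label is NOT among the first k guesses.

    ranked_predictions: one list of labels per example, best guess first.
    true_labels: the ground-truth label of each example.
    """
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("at least one example is required")
    misses = sum(
        1
        for guesses, truth in zip(ranked_predictions, true_labels)
        if truth not in guesses[:k]
    )
    return misses / len(true_labels)


# Hypothetical ranked outputs for four images (best guess first).
predictions = [
    ["cat", "dog", "fox", "wolf", "lynx", "bear"],   # true label at rank 1
    ["car", "bus", "van", "truck", "tram", "bike"],  # true label at rank 6
    ["oak", "elm", "ash", "fir", "yew", "pine"],     # true label at rank 5
    ["cup", "mug", "bowl", "jar", "pot", "pan"],     # true label not listed
]
labels = ["cat", "bike", "yew", "plate"]

top5 = top_k_error(predictions, labels, k=5)
top1 = top_k_error(predictions, labels, k=1)
print(f"top-5 error: {top5:.2f}, top-1 error: {top1:.2f}")
```

Top-5 error is always less than or equal to top-1 error on the same predictions, since every top-1 hit is also a top-5 hit.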
5) Where has ImageNet invested, and where is it still investing, effort?
  • Lack of detail: just one category was annotated per image. Object detection made it possible to recognize more than one class per image (through bounding boxes);
  • Hierarchical annotation:
Confusion matrix and sub-matrices of classifying the 7404 leaf categories in ImageNet7K, ordered by a depth-first traversal of the WordNet hierarchy (J. Deng, A. Berg & L. Fei-Fei, ECCV, 2010) (4)
  • Fine-grained recognition: recognizing similar objects (different classes of cars, for example):
Fine-Grained Recognition (Gebru, Krause, Deng, Fei-Fei, CHI 2017)
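The confusion matrix in the hierarchical annotation item above is ordered by a depth-first traversal of the WordNet hierarchy, which places related leaf categories next to each other. The sketch below shows that ordering on a tiny hierarchy; the `HIERARCHY` tree and the function `leaves_depth_first` are hypothetical and stand in for the real WordNet structure.

```python
# Hypothetical hierarchy: parent -> ordered list of children.
HIERARCHY = {
    "entity": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "dog": ["husky", "beagle"],
    "vehicle": ["car", "bus"],
}


def leaves_depth_first(tree, root):
    """Return the leaf categories under root in depth-first order."""
    children = tree.get(root, [])
    if not children:
        return [root]  # a node without children is a leaf category
    ordered = []
    for child in children:
        ordered.extend(leaves_depth_first(tree, child))
    return ordered


leaf_order = leaves_depth_first(HIERARCHY, "entity")
# Row/column index that each leaf would get in a confusion matrix.
matrix_index = {leaf: i for i, leaf in enumerate(leaf_order)}
print(leaf_order)
```

With this ordering, confusions between sibling categories (the two dog breeds here) fall close to the diagonal, so they show up as blocks in the matrix.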
6) Expected outcomes
  • ImageNet became a benchmark
  • It meant a breakthrough in object recognition
  • Machine learning advanced and changed dramatically
7) Unexpected outcomes
  • Neural nets became popular in academic research again
  • Together with the growth of accurate, available datasets and high-performance GPUs, they drove a deep learning revolution;
  • Maximize specificity in ontological structures:
Maximizing specificity (Deng, Krause, Berg & Fei-Fei, CVPR 2012)
  • Still, relatively few works use ontological structures;
  • Comparing humans and machines:
How humans and machines compare (Andrej Karpathy, 2014)
8) What lies ahead
  • Moving from object recognition to human-level understanding (from perception to cognition):
This means that, beyond recognizing objects, AI will allow scene understanding, that is, understanding the relations between people, actions and artifacts in an image.
  • That’s the concept behind Microsoft COCO (Common Objects in Context) (5), a “dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding”;
  • More recently there is the Visual Genome (6), a dataset, a knowledge base, an ongoing effort to connect structural image concepts to language:
    • Specs:
      • 108,249 images (COCO images)
      • 4.2M image descriptions
      • 1.8M Visual QA (7W)
      • 1.4M objects, 75.7K obj. classes
      • 1.5M relationships, 40.5K rel. classes
      • 1.7M attributes, 40.5K attr. classes
      • Vision and language correspondences
      • Everything mapped to WordNet Synset
    • Exploratory interface:
The interface allows searching for images and selecting different image attributes.
  • The Visual Genome dataset was further used to advance the state of the art in CV:
    • Paragraph generation;
    • Relationship prediction;
    • Image retrieval with scene graph;
    • Visual question answering.
  • The future of vision intelligence relies upon the integration of perception, understanding, and action;
  • From now on, the ImageNet ILSVRC challenge will be organized by Kaggle, a data science community that organizes competitions and makes datasets available.

(1) Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “ImageNet: A Large-Scale Hierarchical Image Database.” In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

(2) Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. “ImageNet Large Scale Visual Recognition Challenge.” International Journal of Computer Vision 115, no. 3 (December 2015): 211–52. doi:10.1007/s11263-015-0816-y.

(3) Miller, George A. “WordNet: A Lexical Database for English.” Communications of the ACM 38, no. 11 (1995): 39–41.

(4) Deng, Jia, Alexander C. Berg, Kai Li, and Li Fei-Fei. “What Does Classifying More than 10,000 Image Categories Tell Us?” In European Conference on Computer Vision, 71–84. Springer, 2010.

(5) Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. “Microsoft COCO: Common Objects in Context.” arXiv:1405.0312 [Cs], May 1, 2014.

(6) Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. “Visual Genome.” Accessed September 27, 2017.

DH2017 – Computer Vision in DH workshop (lightning talks part 1)

To facilitate the exchange of current ongoing work, projects or plans, the workshop allowed participants to give very short lightning talks and project pitches of max 5 minutes.

Part 1
Chair: Martijn Kleppe (National Library of the Netherlands)

1. How can Caffe be used to segment historical images into different categories?
Thomas Smits (Radboud University)

Number of images by identified categories.
  • Challenge: how to attack the “unknown” category and make data more discoverable?

2. The Analysis Of Colors By Means Of Contrasts In Movies 
Niels Walkowski (BBAW / KU Leuven)

  • Slides 
  • Cinemetrics, Colour Analysis & Digital Humanities:
    • Brodbeck (2011) “Cinemetrics”: the project is about measuring and visualizing movie data, in order to reveal the characteristics of films and to create a visual “fingerprint” for them. Information such as the editing structure, color, speech or motion is extracted, analyzed and transformed into graphic representations so that movies can be seen as a whole and easily interpreted or compared side by side.

      Film Data Visualization
    • Burghardt (2016) “Movieanalyzer”
Movieanalyzer (2016)

3. New project announcement INSIGHT: Intelligent Neural Networks as Integrated Heritage Tools
Mike Kestemont (Universiteit Antwerpen)

  • Slides
  • Data from two museums: the Royal Museums of Fine Arts of Belgium and the Royal Museums of Art and History;
  • Research opportunity: how can multimodal representation learning (NLP + Vision) help to organize and explore this data?
  • Knowledge transfer approach:
    • Large players in the field have massive datasets;
    • How easily can we transfer knowledge from large to small collections? E.g. automatic dating or object description;
  • Partner up: the Departments of Literature and Linguistics (Faculty of Arts and Philosophy) of the University of Antwerp and the Montefiore Institute (Faculty of Applied Sciences) of the University of Liège are seeking to fill two full-time (100%) vacancies for Doctoral Grants in the area of machine/deep learning, language technology, and/or computer vision for enriching heritage collections. More information.

4. Introduction of CODH computer vision and machine learning datasets such as old Japanese books and characters
Asanobu KITAMOTO (CODH - National Institute of Informatics)

  • Slides;
  • Center for Open Data in the Humanities (CODH);
  • It’s a research center in Tokyo, Japan, officially launched on April 1, 2017;
  • Scope: (1) humanities research using information technology and (2) other fields of research using humanities data.
  • Released datasets:
    • Dataset of Pre-Modern Japanese Text (PMJT): image and text data of pre-modern Japanese texts owned by the National Institute of Japanese Literature, released as open data. In addition, some texts have description, transcription, and tagging data.

      Pre-Modern Japanese Text Dataset: currently 701 books
    • PMJT Character Shapes;
    • IIIF Curation Viewer

      Curation Viewer
  • CODH is looking for a project researcher who is interested in applying computer vision to humanities data. Contact:

5. Introduction to the new South African Centre for Digital Language Resources (SADiLaR)
Juan Steyn

  • Slides;
  • SADiLaR is a new research infrastructure set up by the Department of Science and Technology (DST) forming part of the new South African Research Infrastructure Roadmap (SARIR).
  • Officially launched in October 2016;
  • SADiLaR runs two programs:
    • A digitisation program, which entails the systematic creation of relevant digital text, speech and multi-modal resources related to all official languages of South Africa, as well as the development of appropriate natural language processing software tools for research and development purposes;
    • A Digital Humanities program, which facilitates research capacity building by promoting and supporting the use of digital data and innovative methodological approaches within the Humanities and Social Sciences. (See