Matthias Arnold, Lena Hessel
Transforming data silos into knowledge: Early Chinese Periodicals Online (ECPO)
This paper introduces the project “Early Chinese Periodicals Online (ECPO)”. ECPO joins several important digital collections of the early Chinese press and puts them into a single overarching framework. In a first phase, several databases on early women’s periodicals and entertainment publishing were created: “Chinese Women’s Magazines in the Late Qing and Early Republican Period” (WoMag), “Chinese Entertainment Newspapers” (Xiaobao), and databases hosted at the Academia Sinica in Taiwan. These systems approach the material in two ways: in the intensive approach we record all articles, images, advertisements, and related agents and assign them to a complete set of scanned pages, while in the extensive approach we record the main characteristic features of publications.
ECPO has begun to join these various materials in a second, ongoing phase of the project. Today, ECPO provides open access to 267 publications comprising over 280,000 pages of print. A key aim is to make entire issues available, front to back, including illustrations, advertisements, and even blank pages. For 138 publications we also provide descriptions of individual items in Chinese with Pinyin transcription. These records also contain genre and column information, basic content analysis, and the names and roles of agents associated with an item.
Our new cross-database agent service allows us to manage the approximately 47,000 names recorded in WoMag and ECPO: we a) merge identical names across databases, b) identify agents and assign names to them, and c) link agent records to authority data (GND, VIAF, Wikidata, Baidu, DBpedia). Besides creating a curated list of agents occurring in the publications, we also aim to add missing persons to authority files such as the GND.
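The three steps of the agent service can be illustrated with a minimal sketch. It is not the ECPO implementation: the class names `NameRecord` and `Agent`, the functions `normalize`, `merge_identical_names`, `assign_names`, and `link_authority`, and the whitespace-based normalization rule are all assumptions made for illustration, and the sample names and the identifier are placeholders rather than ECPO data.

```python
import unicodedata
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NameRecord:
    """A name string as recorded in one database (hypothetical structure)."""
    source: str  # e.g. "WoMag" or "ECPO"
    name: str


@dataclass
class Agent:
    """A person or organization to which one or more names are assigned."""
    agent_id: int
    names: list = field(default_factory=list)
    authority_links: dict = field(default_factory=dict)


def normalize(name: str) -> str:
    """Assumed matching rule: NFKC-normalize and drop all whitespace."""
    return "".join(unicodedata.normalize("NFKC", name).split())


def merge_identical_names(records):
    """Step a): group records from all databases by normalized name."""
    merged = defaultdict(list)
    for record in records:
        merged[normalize(record.name)].append(record)
    return dict(merged)


def assign_names(agent, merged, names):
    """Step b): attach merged names (e.g. a real name and a pen name) to one agent."""
    for name in names:
        if name in merged and name not in agent.names:
            agent.names.append(name)


def link_authority(agent, authority, identifier):
    """Step c): record an identifier from an authority file such as GND or VIAF."""
    agent.authority_links[authority] = identifier


records = [
    NameRecord("WoMag", "秋瑾"),
    NameRecord("ECPO", "秋　瑾"),      # same name, written with a full-width space
    NameRecord("ECPO", "鑑湖女俠"),    # pen name of the same person
]
merged = merge_identical_names(records)
agent = Agent(agent_id=1)
assign_names(agent, merged, ["秋瑾", "鑑湖女俠"])
link_authority(agent, "Wikidata", "PLACEHOLDER-ID")  # placeholder, not a real identifier
```

In this sketch, step a) is purely mechanical, whereas step b) stands in for an editorial decision: deciding that a pen name and a real name belong to the same agent is not something string matching can settle.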
One crucial aspect of ECPO is full-text capability. Unfortunately, OCR software cannot be used out of the box, for a number of reasons: document analysis fails to recognize complex newspaper layouts, character recognition fails when it encounters emphasis marks next to characters, and recognized passages still have to be grouped in the correct semantic order.
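The last of these problems, ordering recognized passages, can be sketched for the simplest possible case. The example assumes that text regions are vertical columns read from right to left and from top to bottom within a column, which is an assumption about the page layout and not something this sketch detects. The `Region` class, the `reading_order` function, and the `column_tolerance` parameter are hypothetical and do not describe the ECPO workflow; real newspaper pages with nested articles, headlines, and advertisements need considerably more than this.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A recognized text region with its bounding box in pixels (hypothetical)."""
    text: str
    right: float  # x coordinate of the right edge
    top: float    # y coordinate of the top edge


def reading_order(regions, column_tolerance=10.0):
    """Order regions column by column, right to left, top to bottom.

    Regions whose right edges lie within `column_tolerance` pixels of the
    first region in a column are treated as belonging to that column.
    """
    columns = []
    for region in sorted(regions, key=lambda r: r.right, reverse=True):
        if columns and abs(columns[-1][0].right - region.right) <= column_tolerance:
            columns[-1].append(region)
        else:
            columns.append([region])

    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda r: r.top))
    return ordered


page = [
    Region("C", right=200, top=10),
    Region("B", right=298, top=200),
    Region("D", right=203, top=150),
    Region("A", right=300, top=10),
]
ordered_texts = [region.text for region in reading_order(page)]
```

Grouping by a tolerance rather than sorting on exact coordinates keeps slightly misaligned regions of one column together, which is the main reason not to use a single sort key here.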