Text Map Text
Bill Pascoe, Jack Newley, Dan Price, 2018
Text Map Text is an initiative of the Centre For 21st Century Humanities, part of a stream of development under the interdisciplinary theme of digital mapping. At present we have a functional prototype that demonstrates feasibility, and we have established benchmarks against which to improve performance and add features. The case studies are the Gold Fields project and the Castieau Diaries project.
This will be useful for a wide range of Humanities needs, such as plotting journey narratives and exploring world views, and changes in them, through texts. Such a system not only provides a map of places visited but also shows the prominence of various places within the world view represented by the text, or changes in that world view.
- Gold Fields News courtesy of Peter Crabb and Alexis Antonia
- Castieau Diaries courtesy of Prof Mark Finnane
The following feasibility prototypes were created manually to set benchmarks for comparison, and are incomplete.
- In Search Of De Vergulde Draeck
- A Voyage To The Great South Land By Order Of The Dutch East India Company
- Life And Travels Of I-Tsing
These case studies demonstrate the core functionality of automatically:
- processing a corpus of texts
- finding places in the texts
- plotting all places found on a map
- linking the places on the map to occurrences in the texts
- linking the occurrences in the texts to points on the map
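The steps above can be sketched in a few lines. This is a minimal illustration only, not the prototype's implementation: it assumes place finding by exact name lookup in a gazetteer, and the `GAZETTEER` entries (with approximate coordinates), the `corpus` texts and the function name `find_places` are all hypothetical.

```python
import re

# Hypothetical toy gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "Sofala": (-33.08, 149.69),
    "Bathurst": (-33.42, 149.58),
    "Sydney": (-33.87, 151.21),
}

def find_places(texts, gazetteer):
    """Return {place name: [(text id, character offset), ...]} for every match."""
    # Try longer names first so multi-word names are not shadowed by shorter ones.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    index = {}
    for text_id, text in texts.items():
        for match in pattern.finditer(text):
            index.setdefault(match.group(1), []).append((text_id, match.start()))
    return index

# Hypothetical two-text corpus.
corpus = {
    "letter1": "We left Sydney for Bathurst and reached Sofala.",
    "letter2": "Gold was found near Sofala.",
}

# Map -> text: each place links to all of its occurrences.
index = find_places(corpus, GAZETTEER)

# Text -> map: each occurrence links back to a point on the map.
markers = {name: GAZETTEER[name] for name in index}
```

An exact lookup like this would reproduce the false positives discussed in the metrics (for example 'Sydney' in 'Sydney Morning Herald'), which is why named entity recognition and disambiguation are treated as separate components.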
Planned improvements include:
- upload plain text corpora
- upload marked up corpora
- edit places in map (CRUD)
- edit places in text (CRUD)
- mark dates of texts and include time tools in the user interface
- tag places with dates
- select from available gazetteers
- upload gazetteer to user
- report statistics (e.g. how many places found; frequency of places per word count; success rate as the difference between places detected and places manually edited)
- mass process 20 GB+ and 1 TB+ corpora on HPC
Metrics were obtained from the Gold Fields case study.
Please note that while a roughly 50% success rate may seem low, it demonstrates feasibility, and in the case of manually editing texts it would mean saving 50% of someone's time in a laborious and repetitive task. It also means this is an area where more research in clearly identified areas can lead to substantial improvement.
The success (error rate) of the automated processing is measured in terms of the frequencies of:
- failure to identify a place
- identification of a non-place as a place (false positive)
- identification of a place in the wrong location
- the aggregate of the above, for total errors
The sample gave the following counts:
- 21,020 total words in the whole sample
- 297 actual places mentioned in the text (places identified + places not identified - false positives)
- 156 places identified
- 151 places not identified (of these, 86 were unique places in the text, and 65 were additional mentions of the same place name in one text)
- 10 false positives: places identified that were not intended to denote places (e.g. 'Sydney' in 'Sydney Morning Herald', or 'California' in 'California hats', a style of hat)
- 25 places identified in the wrong location (e.g. Sofala located in Mozambique)
- 186 errors in total
From these counts:
- Approx. 14 places per 1000 words.
- Of these about 6 places per 1000 words were correctly identified.
- About 40% of places were correctly identified.
- 6% of places identified were false positives.
- 16% of places identified were in the wrong location.
- 50% of places in the text were not identified at all (correctly or incorrectly).
- The 'place-density' of the texts varied greatly, from 0.0007 to 0.017 places per word, or roughly from less than 1 place per 1000 words to 17 places per 1000 words.
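The derived figures can be reproduced from the raw counts as follows. This is a sketch under one assumption that the text does not state outright: that "correctly identified" means places identified minus false positives minus places in the wrong location. That reading matches the reported 40% and 6 places per 1000 words. The variable names are illustrative.

```python
# Raw counts reported for the Gold Fields sample.
total_words = 21020
identified = 156
not_identified = 151
false_positives = 10
wrong_location = 25

# Actual places = places identified + places not identified - false positives.
actual_places = identified + not_identified - false_positives

# Assumption: a "correct" identification is neither a false positive
# nor placed in the wrong location.
correct = identified - false_positives - wrong_location

total_errors = not_identified + false_positives + wrong_location

metrics = {
    "places_per_1000_words": 1000 * actual_places / total_words,
    "correct_per_1000_words": 1000 * correct / total_words,
    "share_correct": correct / actual_places,
    "share_false_positive": false_positives / identified,
    "share_wrong_location": wrong_location / identified,
    "share_not_identified": not_identified / actual_places,
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the false positive and wrong location rates are taken over places identified (156), while the correct and not identified rates are taken over actual places in the text (297).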
This presents opportunities for future development and enhancement in each part of the overall architecture of such a system, with these benchmarks making it possible to demonstrate and quantify improvement. The components of a system such as this that can be explored for improvement include:
- named entity recognition (NER) of places
- heuristics for deciding which place is meant among many with the same name
- gazetteer to find and locate places (in particular historical place names required for Humanities)
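As an illustration of the second component, one simple heuristic is to prefer the candidate closest to a known focus of the corpus. The sketch below is not the prototype's method. It assumes a gazetteer that returns several candidates per name, uses approximate coordinates, and takes Bathurst as a hypothetical focus point for the Gold Fields texts. The names `CANDIDATES`, `disambiguate` and `FOCUS` are invented for the example.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

# Hypothetical gazetteer entries sharing one name (coordinates approximate).
CANDIDATES = {
    "Sofala": [
        ("Sofala, Mozambique", (-19.0, 34.75)),
        ("Sofala, New South Wales", (-33.08, 149.69)),
    ],
}

def disambiguate(name, candidates, focus):
    """Pick the candidate for `name` nearest to the corpus focus point."""
    return min(candidates[name], key=lambda c: haversine_km(c[1], focus))

# Assumed focus for the Gold Fields texts: Bathurst, NSW (approximate).
FOCUS = (-33.42, 149.58)

best = disambiguate("Sofala", CANDIDATES, FOCUS)
print(best[0])
```

A fixed focus point is a crude choice. The centroid of the unambiguous places already found in the same text would be a natural alternative, at the cost of a second pass over the text.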