Crowdsourcing the Character of a Place: Character-Level Convolutional Networks for Multilingual Geographic Text Classification

Publication Type:

Journal Article

Source:

Transactions in GIS (2018)

URL:

http://www.grantmckenzie.com/academics/CharacterOfPlace_2017.pdf

Keywords:

convolutional neural networks, crowdsourcing, geographic information retrieval, geoparsing, text classification, user-generated content

Abstract:

This article presents a new character-level convolutional neural network model that can classify multilingual text written using any character set that can be encoded with UTF-8, a standard and widely used 8-bit character encoding. For geographic classification of text, we demonstrate that this approach is competitive with state-of-the-art word-based text classification methods. The model was tested on four crowdsourced data sets made up of Wikipedia articles, online travel blogs, Geonames toponyms, and Twitter posts. Unlike word-based methods, which require data cleaning and pre-processing, the proposed model works for any language without modification and with classification accuracy comparable to existing methods. Using a synthetic data set with introduced character-level errors, we show it is more robust to noise than word-level classification algorithms. The results indicate that UTF-8 character-level convolutional neural networks are a promising technique for georeferencing noisy text, such as found in colloquial social media posts and texts scanned with optical character recognition. However, word-based methods currently require less computation time to train, so are currently preferable for classifying well-formatted and cleaned texts in single languages.
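
As a rough illustration of why a UTF-8 character-level input needs no language-specific pre-processing, the sketch below turns text in any script into a fixed-length sequence of byte values. UTF-8 represents each code point as one to four 8-bit bytes, so every input symbol falls in the range 0 to 255 regardless of language. This is a minimal sketch and not the article's implementation: the function name `utf8_byte_ids`, the sequence length, and the use of 0 as a padding value are all assumptions made for the example.

```python
from typing import List


def utf8_byte_ids(text: str, max_len: int = 32, pad_id: int = 0) -> List[int]:
    """Encode text as UTF-8 byte values, truncated or right-padded to max_len.

    Hypothetical helper: max_len and pad_id are illustrative choices,
    not values taken from the article.
    """
    # Each byte is an integer in 0..255, whatever script the text uses.
    ids = list(text.encode("utf-8"))[:max_len]
    # Right-pad short inputs so every sequence has the same length.
    return ids + [pad_id] * (max_len - len(ids))


if __name__ == "__main__":
    # Latin text with a diacritic and Japanese text share one input alphabet.
    for sample in ["Zürich", "東京"]:
        print(sample, utf8_byte_ids(sample, max_len=8))
```

Note that truncating at a byte boundary can cut a multi-byte character in half; a model operating on raw bytes would simply see the partial sequence. The resulting integer sequences could then be one-hot encoded or embedded before being passed to convolutional layers.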