Multinational Address Parsing: A Zero-Shot Evaluation
Abstract
Address parsing consists of identifying the segments that make up an address, such as a street name or a postal code. Because of its importance for tasks like record linkage, address parsing has been approached with many techniques, the latest relying on neural networks. While these models yield notable results, previous work on neural networks has only focused to parse addresses from a single source country. We propose in this paper an approach in which we employ subword embeddings and a Recurrent Neural Network architecture to build a single model capable of learning to parse addresses from multiple countries at the same time while taking into account the difference in languages and address formatting systems. The proposed method achieves an average accuracy (token-wise) of 99 % on the test set of the countries used as the source dataset with no pre-processing nor post-processing being required. We explore the possibility of transferring the address parsing knowledge acquired by training on some countries' addresses to others with no further training in a zero-shot transfer learning setting. We also experiment using an attention mechanism and a domain adversarial training algorithm in the same zero-shot transfer setting to improve performance. Both methods yield state-of-the-art performance for the majority of the tested countries while giving good results on the remaining countries. We also explore the effect of incomplete addresses on our best model, and we evaluate the impact of using incomplete addresses during training. In addition, we propose an open-source Python implementation of some of our trained models (https://github.com/GRAAL-Research/deepparse).The submitting author warrants that the submission is original and that she/he is the author of the submission together with the named co-authors; to the extend the submission incorporates text passages, figures, data or other material from the work of others, the submitting author has obtained any necessary permission.
Articles in this journal are published under the Creative Commons Attribution Licence (CC-BY). This is to get more legal certainty about what readers can do with published articles, and thus a wider dissemination and archiving, which in turn makes publishing with this journal more valuable for you, the authors.
In order for iJIST to publish and disseminate research articles, we need publishing rights. This is determined by a publishing agreement between the author and iJIST.
By submitting an article the author grants to this journal the non-exclusive right to publish it. The author retains the copyright and the publishing rights for his article without any restrictions.
Privacy Statement
The names and email addresses entered in this journal site will be used exclusively for the stated purposes of this journal and will not be made available for any other purpose or to any other party.