Monasterio Leonardo
Department of Regional, Urban and Environmental Studies, Institute for Applied Economic Research, Brasília, DF, Brazil.
Graduate School of Economics, Universidade Católica de Brasília, Brasília, DF, Brazil.
PLoS One. 2017 May 8;12(5):e0176890. doi: 10.1371/journal.pone.0176890. eCollection 2017.
This paper presents a method for classifying the ancestry of Brazilian surnames based on historical sources. The information obtained forms the basis for applying fuzzy matching and machine learning classification algorithms to assign more than 46 million workers to 5 categories: Iberian, Italian, Japanese, German and East European. The vast majority (96.7%) of the single surnames were identified using fuzzy matching, and the rest using a method proposed by Cavnar and Trenkle (1994). A comparison of the results with data on foreigners in the 1920 Census and with the geographic distribution of non-Iberian surnames supports the accuracy of the procedure. The study shows that surname ancestry is associated with significant differences in wages and schooling.
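The two-stage idea described in the abstract (fuzzy matching first, then an n-gram classifier in the style of Cavnar and Trenkle for the remainder) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the reference lists, the `cutoff` value, the profile size, and all function names below are hypothetical, and the standard-library `difflib` similarity stands in for whatever fuzzy matcher the author used. Only three of the five categories are included to keep the toy data short.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical toy reference lists; the paper's historical sources are not reproduced here.
REFERENCE = {
    "Iberian": ["silva", "santos", "oliveira", "pereira", "ferreira"],
    "Italian": ["rossi", "bianchi", "romano", "colombo", "ricci"],
    "German": ["schmidt", "schneider", "fischer", "weber", "becker"],
}
TOP_K = 300  # assumed profile length; also used as the penalty for missing n-grams


def ngrams(word, n_max=3):
    """Return all character n-grams (n = 1..n_max) of a padded, lowercased word."""
    padded = f"_{word.lower()}_"
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


def profile(words, top_k=TOP_K):
    """Map each of the top_k most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter(g for w in words for g in ngrams(w))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top_k))}


PROFILES = {cat: profile(names) for cat, names in REFERENCE.items()}
LOOKUP = {name: cat for cat, names in REFERENCE.items() for name in names}


def out_of_place(doc_profile, cat_profile, max_penalty=TOP_K):
    """Sum of rank differences; n-grams absent from the category get max_penalty."""
    return sum(abs(rank - cat_profile[g]) if g in cat_profile else max_penalty
               for g, rank in doc_profile.items())


def classify(surname, cutoff=0.85):
    """Try fuzzy matching against the reference list, else fall back to n-gram profiles."""
    name = surname.lower()
    match = get_close_matches(name, list(LOOKUP), n=1, cutoff=cutoff)
    if match:
        return LOOKUP[match[0]], "fuzzy"
    doc = profile([name])
    best = min(PROFILES, key=lambda cat: out_of_place(doc, PROFILES[cat]))
    return best, "ngram"
```

Calling `classify("Silva")` takes the fuzzy-matching branch because the name appears in the toy reference list, while a surname with no close match falls through to the rank-profile comparison. With reference lists this small the n-gram fallback is only illustrative; realistic profiles would need far more names per category.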