Am J Epidemiol. 2014 Aug 1;180(3):325-9. doi: 10.1093/aje/kwu129. Epub 2014 Jun 18.
Correct identification of ethnicity is central to many epidemiologic analyses. Unfortunately, ethnicity data are often missing. Successful classification typically relies on large databases (n > 500,000 names) of known name-ethnicity associations. We propose an alternative naïve Bayesian strategy that uses substrings of full names. Name and ethnicity data for Malays, Indians, and Chinese were provided by a health and demographic surveillance site operating in Malaysia from 2011 to 2013. The data comprised a training data set (n = 10,104) and a test data set (n = 9,992). Names were split into contiguous 3-letter substrings, which formed the basis for the Bayesian analysis. Performance was evaluated on both data sets using Cohen's κ and measures of sensitivity and specificity. Classification performance differed little between the training and test data (κ = 0.93 and 0.94, respectively). For the test data, the sensitivity values for the Malay, Indian, and Chinese names were 0.997, 0.855, and 0.932, respectively, and the specificity values were 0.907, 0.998, and 0.997, respectively. A naïve Bayesian strategy for the classification of ethnicity is promising. It performs at least as well as more sophisticated approaches. The possible application to smaller data sets is particularly appealing. Further research examining other substring lengths and other ethnic groups is warranted.
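The approach described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the smoothing scheme, the handling of spaces and letter case, or the software used, so the Laplace smoothing constant `alpha`, the lowercasing, the class name `TrigramNaiveBayes`, and the helper functions are all assumptions made for illustration. The names in the toy data are invented examples, not records from the study.

```python
import math
from collections import Counter, defaultdict


def trigrams(name):
    """Contiguous 3-character substrings of a name (lowercased; spaces kept, an assumption)."""
    s = name.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


class TrigramNaiveBayes:
    """Multinomial naive Bayes over name trigrams (illustrative sketch only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant (assumed)
        self.class_counts = Counter()            # number of training names per class
        self.gram_counts = defaultdict(Counter)  # trigram counts per class
        self.vocab = set()                       # all trigrams seen in training

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.class_counts[label] += 1
            for g in trigrams(name):
                self.gram_counts[label][g] += 1
                self.vocab.add(g)
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        v = len(self.vocab)
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.gram_counts[label].values()) + self.alpha * v
            score = math.log(n / total)          # log prior
            for g in trigrams(name):             # add smoothed log likelihoods
                score += math.log((self.gram_counts[label][g] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


def cohen_kappa(truth, pred):
    """Cohen's kappa: agreement between truth and prediction, corrected for chance."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, pred)) / n
    t_counts, p_counts = Counter(truth), Counter(pred)
    expected = sum(t_counts[c] * p_counts[c] for c in t_counts) / (n * n)
    return (observed - expected) / (1 - expected)


def sensitivity_specificity(truth, pred, cls):
    """One-vs-rest sensitivity and specificity for a single class."""
    pairs = list(zip(truth, pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: invented names for illustration, not taken from the study.
toy_names = ["ahmad bin abdullah", "siti binti ismail", "tan wei ming",
             "lim mei ling", "raj kumar", "priya devi"]
toy_labels = ["Malay", "Malay", "Chinese", "Chinese", "Indian", "Indian"]

model = TrigramNaiveBayes().fit(toy_names, toy_labels)
print(model.predict("nur binti ahmad"))
```

With only six training names the toy model says nothing about real-world accuracy; the point is the shape of the computation. Each class accumulates trigram counts, a new name is scored by summing smoothed log likelihoods over its trigrams, and per-class sensitivity and specificity are computed one-vs-rest alongside κ.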