A "Roziah" by any other name: a simple Bayesian method for determining ethnicity from names.

Publication information

Am J Epidemiol. 2014 Aug 1;180(3):325-9. doi: 10.1093/aje/kwu129. Epub 2014 Jun 18.

Abstract

Correct identification of ethnicity is central to many epidemiologic analyses. Unfortunately, ethnicity data are often missing. Successful classification typically relies on large databases (n > 500,000 names) of known name-ethnicity associations. We propose an alternative naïve Bayesian strategy that uses substrings of full names. Name and ethnicity data for Malays, Indians, and Chinese were provided by a health and demographic surveillance site operating in Malaysia from 2011-2013. The data comprised a training data set (n = 10,104) and a test data set (n = 9,992). Names were spliced into contiguous 3-letter substrings, and these were used as the basis for the Bayesian analysis. Performance was evaluated on both data sets using Cohen's κ and measures of sensitivity and specificity. There was little difference between the classification performance in the training and test data (κ = 0.93 and 0.94, respectively). For the test data, the sensitivity values for the Malay, Indian, and Chinese names were 0.997, 0.855, and 0.932, respectively, and the specificity values were 0.907, 0.998, and 0.997, respectively. A naïve Bayesian strategy for the classification of ethnicity is promising. It performs at least as well as more sophisticated approaches. The possible application to smaller data sets is particularly appealing. Further research examining other substring lengths and other ethnic groups is warranted.
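The substring approach described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how spaces, priors, or unseen substrings were handled, so the per-word splitting, the Laplace smoothing constant `alpha`, the class name `TrigramNaiveBayes`, and the handful of made-up training names are all assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict


def trigrams(name):
    """Contiguous 3-letter substrings of each word in a name (lowercased).

    Assumption: substrings do not span spaces between name parts.
    """
    grams = []
    for word in name.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        grams.extend(letters[i:i + 3] for i in range(len(letters) - 2))
    return grams


class TrigramNaiveBayes:
    """Hypothetical multinomial naive Bayes classifier over name trigrams."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing (assumption)
        self.class_counts = Counter()            # number of names per label
        self.gram_counts = defaultdict(Counter)  # trigram counts per label
        self.vocab = set()                       # all trigrams seen in training

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.class_counts[label] += 1
            for gram in trigrams(name):
                self.gram_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def log_scores(self, name):
        """Unnormalised log posterior for each label: log prior + sum of log likelihoods."""
        total_names = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_names in self.class_counts.items():
            denom = sum(self.gram_counts[label].values()) + self.alpha * vocab_size
            score = math.log(n_names / total_names)
            for gram in trigrams(name):
                score += math.log((self.gram_counts[label][gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, name):
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


# Made-up illustrative names, not data from the study.
train_names = [
    "Roziah binti Ahmad", "Nurul Huda binti Ismail", "Mohd Faizal bin Abdullah",
    "Rajesh Kumar", "Priya Subramaniam", "Anand Krishnan",
    "Tan Wei Ling", "Lim Chee Keong", "Wong Mei Yee",
]
train_labels = ["Malay"] * 3 + ["Indian"] * 3 + ["Chinese"] * 3

model = TrigramNaiveBayes().fit(train_names, train_labels)
print(model.predict("Siti binti Ahmad"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a single unseen substring from zeroing out a class. With a real training set, agreement between predicted and recorded ethnicity could then be summarised with Cohen's κ, sensitivity, and specificity, as the abstract reports.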

Similar articles

Ethnic and other factors affecting birthweight in Singapore.

Int J Gynaecol Obstet. 1989 Aug;29(4):289-95. doi: 10.1016/0020-7292(89)90351-2.

STR data for the AmpFlSTR Profiler loci from the three main ethnic population groups (Malay, Chinese and Indian) in Malaysia.

Forensic Sci Int. 2001 Jun 1;119(1):109-12. doi: 10.1016/s0379-0738(00)00386-8.

Ethnic differences in bone mineral density among midlife women in a multi-ethnic Southeast Asian cohort.

Arch Osteoporos. 2019 Jul 19;14(1):80. doi: 10.1007/s11657-019-0631-0.

Ethnicity modifies the association between diabetes mellitus and ischaemic heart disease in Chinese, Malays and Asian Indians living in Singapore.

Diabetologia. 2006 Dec;49(12):2866-73. doi: 10.1007/s00125-006-0469-z. Epub 2006 Oct 5.

Ethnicity-specific prevalences of refractive errors vary in Asian children in neighbouring Malaysia and Singapore.

Br J Ophthalmol. 2006 Oct;90(10):1230-5. doi: 10.1136/bjo.2006.093450. Epub 2006 Jun 29.

Ethnic differences of intraocular pressure and central corneal thickness: the Singapore Epidemiology of Eye Diseases study.

Ophthalmology. 2014 Oct;121(10):2013-22. doi: 10.1016/j.ophtha.2014.04.041. Epub 2014 Jun 18.

Validation and utility of a computerized South Asian names and group recognition algorithm in ascertaining South Asian ethnicity in the national renal registry.

QJM. 2009 Dec;102(12):865-72. doi: 10.1093/qjmed/hcp142. Epub 2009 Oct 14.

A machine learning approach to predict ethnicity using personal name and census location in Canada.

PLoS One. 2020 Nov 18;15(11):e0241239. doi: 10.1371/journal.pone.0241239. eCollection 2020.

Can body fat distribution, adiponectin levels and inflammation explain differences in insulin resistance between ethnic Chinese, Malays and Asian Indians?

Int J Obes (Lond). 2012 Aug;36(8):1086-93. doi: 10.1038/ijo.2011.185. Epub 2011 Sep 27.

Cited by

HDSS Profile: The South East Asia Community Observatory Health and Demographic Surveillance System (SEACO HDSS).

Int J Epidemiol. 2017 Oct 1;46(5):1370-1371g. doi: 10.1093/ije/dyx113.

Surnames and ancestry in Brazil.

PLoS One. 2017 May 8;12(5):e0176890. doi: 10.1371/journal.pone.0176890. eCollection 2017.
