Lucía Santamaría, Helena Mihaljević
Amazon Development Center, Berlin, Germany.
University of Applied Sciences, Berlin, Germany.
PeerJ Comput Sci. 2018 Jul 16;4:e156. doi: 10.7717/peerj-cs.156. eCollection 2018.
The increased interest in analyzing and explaining gender inequalities in tech, media, and academia highlights the need for accurate inference methods to predict a person's gender from their name. Several services exist that provide access to large databases of names, often enriched with information from social media profiles, culture-specific rules, and insights from sociolinguistics. We compare and benchmark five name-to-gender inference services by applying them to the classification of a test data set consisting of 7,076 manually labeled names. The compiled names are analyzed and characterized according to their geographical and cultural origin. We define a series of performance metrics to quantify various types of classification errors, as well as a parameter tuning procedure to search for optimal values of the services' free parameters. Finally, we benchmark all services under study in several scenarios, each of which optimizes a particular metric.
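As an illustration of what such error metrics and parameter tuning might look like, the following Python sketch computes a few error rates from true and predicted labels and picks a confidence threshold by grid search. The metric names, the label codes ('f', 'm', 'u' for unassigned), the per-name confidence score, and the functions `error_metrics`, `apply_threshold`, and `tune_threshold` are all assumptions made for this example; the abstract does not specify the metrics or the tuning procedure, and no real service API is used.

```python
from collections import Counter


def error_metrics(true_labels, predicted_labels):
    """Summarize classification errors for a name-to-gender classifier.

    true_labels: 'f' or 'm' (hypothetical codes for the manual labels).
    predicted_labels: 'f', 'm', or 'u' (service could not assign a gender).
    Assumes at least one labeled name is supplied.
    """
    counts = Counter(zip(true_labels, predicted_labels))
    ff, fm, fu = counts[("f", "f")], counts[("f", "m")], counts[("f", "u")]
    mm, mf, mu = counts[("m", "m")], counts[("m", "f")], counts[("m", "u")]
    total = ff + fm + fu + mm + mf + mu
    assigned = ff + fm + mm + mf
    return {
        # Misclassifications and non-assignments both count as errors.
        "error_with_unassigned": (fm + mf + fu + mu) / total,
        # Errors among names that received a gender assignment.
        "error_assigned_only": (fm + mf) / assigned if assigned else 0.0,
        # Share of names left without an assignment.
        "unassigned_rate": (fu + mu) / total,
        # Positive: more men misclassified as women than the reverse.
        "bias": (mf - fm) / assigned if assigned else 0.0,
    }


def apply_threshold(responses, threshold):
    """Turn (gender, confidence) pairs into labels, using 'u' below the threshold."""
    return [gender if conf >= threshold else "u" for gender, conf in responses]


def tune_threshold(true_labels, responses, thresholds, metric="error_with_unassigned"):
    """Grid search: return the threshold that minimizes the chosen metric."""
    return min(
        thresholds,
        key=lambda t: error_metrics(true_labels, apply_threshold(responses, t))[metric],
    )
```

Which metric to minimize depends on the use case: a study that cannot tolerate misclassified names might minimize the error among assigned names and accept more unassigned ones, while a study that needs full coverage would count unassigned names as errors.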