J Med Libr Assoc. 2021 Oct 1;109(4):609-612. doi: 10.5195/jmla.2021.1252.
We recently showed that genderize.io is not a sufficiently powerful gender detection tool due to a large number of nonclassifications. In the present study, we aimed to assess whether the accuracy of inference by genderize.io can be improved by manipulating the first names in the database.
We used a database containing the first names, surnames, and gender of 6,131 physicians practicing in a multicultural country (Switzerland). We uploaded three files to genderize.io: the original CSV file (file #1), the file obtained after removing all diacritic marks, such as accents and cedillas (file #2), and the file obtained after removing all diacritic marks and retaining only the first term of compound first names (file #3). For each file, we computed three performance metrics: proportion of misclassifications (errorCodedWithoutNA), proportion of nonclassifications (naCoded), and proportion of misclassifications and nonclassifications (errorCoded).
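The two name manipulations described above could be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure the authors used: the function names `strip_diacritics` and `first_term` are hypothetical, diacritics are removed through Unicode decomposition, and compound first names are assumed to be delimited by hyphens or whitespace.

```python
import re
import unicodedata


def strip_diacritics(name: str) -> str:
    """Remove combining marks (accents, cedillas, diaereses) from a name.

    Assumption: canonical decomposition (NFD) is enough. Letters with no
    decomposition, such as "ø" or "ß", are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def first_term(name: str) -> str:
    """Keep only the first term of a compound first name.

    Assumption: terms are separated by hyphens and/or whitespace.
    """
    return re.split(r"[\s\-]+", name.strip())[0]


def prepare_file2(name: str) -> str:
    # File #2 in the abstract: diacritic marks removed.
    return strip_diacritics(name)


def prepare_file3(name: str) -> str:
    # File #3 in the abstract: diacritic marks removed, first term only.
    return first_term(strip_diacritics(name))
```

For example, under these assumptions `prepare_file3("François-Xavier")` yields `"Francois"`, while `prepare_file2` would keep the compound form `"Francois-Xavier"`.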
naCoded, which was high for file #1 (16.4%), was reduced after data manipulation (file #2: 11.7%, file #3: 0.4%). As the increase in the number of misclassifications was small, the overall performance of genderize.io (i.e., errorCoded) improved, especially for file #3 (file #1: 17.7%, file #2: 13.0%, and file #3: 2.3%).
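To make the three metrics concrete, here is a minimal sketch of how they might be computed from known and inferred genders. The function name `gender_metrics` and the use of `None` for a nonclassified name are assumptions made for illustration. The denominator for errorCodedWithoutNA (classified names only) is also an assumption, since the abstract does not state it.

```python
from typing import Dict, Optional, Sequence


def gender_metrics(
    truth: Sequence[str], predicted: Sequence[Optional[str]]
) -> Dict[str, float]:
    """Compute errorCoded, errorCodedWithoutNA, and naCoded.

    `predicted` holds None where the tool returned no gender.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    total = len(truth)
    if total == 0:
        raise ValueError("at least one name is required")

    # Nonclassifications: names for which no gender was returned.
    n_na = sum(p is None for p in predicted)
    # Misclassifications: a gender was returned but it is wrong.
    n_wrong = sum(p is not None and p != t for t, p in zip(truth, predicted))
    n_classified = total - n_na

    return {
        # Misclassifications and nonclassifications over all names.
        "errorCoded": (n_wrong + n_na) / total,
        # Misclassifications over classified names only (assumed denominator).
        "errorCodedWithoutNA": (
            n_wrong / n_classified if n_classified else float("nan")
        ),
        # Nonclassifications over all names.
        "naCoded": n_na / total,
    }
```

With four names, one wrong inference, and one nonclassification, this returns errorCoded = 0.5, errorCodedWithoutNA = 1/3, and naCoded = 0.25.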
A relatively simple manipulation of the data improved the accuracy of gender inference by genderize.io. We recommend using genderize.io only with files that were modified in this way.