Kilpatrick Alexander, Ćwiek Aleksandra
International Communication, Nagoya University of Commerce and Business, Nagoya, Aichi, Japan.
Leibniz-Zentrum Allgemeine Sprachwissenschaft, Berlin, Germany.
PeerJ Comput Sci. 2024 Jan 17;10:e1811. doi: 10.7717/peerj-cs.1811. eCollection 2024.
This study investigates the extent to which gender can be inferred from the phonemes that make up given names and words in American English. Two extreme gradient boosted models were constructed to classify words according to gender: one using a list of the most common given names (N∼1,000) in North America, and the other using the Glasgow Norms (N∼5,500), a corpus of nouns, verbs, adjectives, and adverbs, each of which has been assigned a psycholinguistic score reflecting how strongly it is associated with male or female behaviour. Both models report significant findings, but the model constructed using given names achieves greater accuracy despite being trained on a smaller dataset, suggesting that gender is expressed more robustly in given names than in other word classes. Feature importance was examined to determine which features contributed to the decision-making process. Feature importance scores revealed a general pattern across both models but also showed that not all word classes express gender in the same way. Finally, the models were reconstructed and tested on the opposite dataset to determine whether they were useful in classifying opposite samples. The models were less accurate when classifying opposite samples, suggesting that they are better suited to classifying words of the same class.
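As a rough illustration of the kind of input such a classifier might consume, the sketch below turns phoneme transcriptions into per-phoneme count vectors. The abstract does not state how phonemes were encoded, so the count-based encoding, the ARPABET-style symbols, the toy names and labels, and the helper names `build_inventory` and `featurize` are all assumptions made for illustration, not the authors' method. The sketch stops before training; the resulting `X` and `y` could be passed to any gradient boosting library.

```python
from collections import Counter

# Hypothetical toy data: ARPABET-style phoneme lists with illustrative labels.
# These samples are NOT taken from the study's datasets.
TOY_NAMES = [
    (["EH", "M", "AH"], "F"),      # "Emma"
    (["JH", "EY", "K"], "M"),      # "Jake"
    (["AE", "N", "AH"], "F"),      # "Anna"
    (["M", "AA", "R", "K"], "M"),  # "Mark"
]

def build_inventory(samples):
    """Return the sorted list of distinct phonemes seen in the samples."""
    return sorted({p for phonemes, _ in samples for p in phonemes})

def featurize(phonemes, inventory):
    """Map one transcription to a vector of per-phoneme counts (assumed encoding)."""
    counts = Counter(phonemes)
    return [counts.get(p, 0) for p in inventory]

inventory = build_inventory(TOY_NAMES)
# One row per word, one column per phoneme in the inventory.
X = [featurize(phonemes, inventory) for phonemes, _ in TOY_NAMES]
# Binary target: 1 for "F", 0 for "M".
y = [1 if label == "F" else 0 for _, label in TOY_NAMES]
```

With one column per phoneme, a tree-based model's feature importance scores map directly back to individual sounds, which is one plausible way to read the feature importance analysis described above.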