Domínguez-Díaz Adrián, Goyanes Manuel, de-Marcos Luis, Prado-Sánchez Víctor Pablo
Ciencias de la Computación, Universidad de Alcalá, Alcalá de Henares, Spain.
Comunicación, Universidad Carlos III de Madrid, Getafe, Spain.
PeerJ Comput Sci. 2024 Oct 17;10:e2378. doi: 10.7717/peerj-cs.2378. eCollection 2024.
Classifying gender from names is crucial for addressing a wide range of gender-related research questions. Traditionally, this task has been automated by gender detection tools (GDTs), which now face new industry players in the form of conversational bots such as ChatGPT. This paper statistically tests the stability and performance of ChatGPT 3.5 Turbo and ChatGPT 4o for gender detection. It also compares two of the most widely used GDTs (Namsor and Gender-API) with ChatGPT on a dataset of 5,779 records compiled from previous studies, using the most challenging variant of the task: inferring gender from a full name without any additional information. The results show that ChatGPT is very stable, with low standard deviation and tight confidence intervals for the same input, and that its performance varies only slightly when the prompt changes. ChatGPT slightly outperforms the other tools, with an overall accuracy above 96%, although the difference from both GDTs is around 3%. When the probability returned by the GDTs is factored in, the differences narrow and the tools become comparable in terms of inter-coder reliability and error coded. ChatGPT stands out for its low number of non-classifications (0% in most tests), which, combined with the other metrics analyzed, makes it a solid alternative for gender inference. This paper contributes to the literature on gender detection from names by testing the stability and performance of the most widely used state-of-the-art AI tool. The findings suggest that ChatGPT's generative language model provides a robust alternative to traditional gender application programming interfaces (APIs), although GDTs (especially Namsor) should still be considered for research-oriented purposes.
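As a rough illustration of the kind of metrics the abstract mentions (overall accuracy, non-classification rate, and stability across repeated runs on the same input), the following is a minimal sketch rather than the authors' procedure. The function names, the label strings, and the toy data are all hypothetical, and two choices are assumptions not confirmed by the abstract: unclassified names are counted as errors in the accuracy, and the confidence interval uses a normal approximation.

```python
from statistics import mean, stdev

# Assumption: the two labels a tool may return; anything else counts as unclassified.
VALID_LABELS = ("female", "male")


def evaluate(predictions, truths):
    """Return (accuracy over all records, non-classification rate)."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    n = len(truths)
    correct = sum(1 for p, t in zip(predictions, truths) if p == t)
    unclassified = sum(1 for p in predictions if p not in VALID_LABELS)
    return correct / n, unclassified / n


def stability(accuracies):
    """Return (mean, sample standard deviation, 95% CI half-width).

    The half-width uses a normal approximation (1.96 * s / sqrt(k)),
    which is an assumption made here for simplicity.
    """
    m = mean(accuracies)
    s = stdev(accuracies)
    half_width = 1.96 * s / len(accuracies) ** 0.5
    return m, s, half_width


# Toy data invented for illustration; not taken from the paper's dataset.
truths = ["female", "male", "female", "male"]
runs = [
    ["female", "male", "female", "male"],     # all correct
    ["female", "male", "male", "male"],       # one misclassification
    ["female", "male", "female", "unknown"],  # one non-classification
]

accuracies = [evaluate(run, truths)[0] for run in runs]
run_mean, run_sd, run_ci = stability(accuracies)
print(f"accuracy per run: {accuracies}")
print(f"mean={run_mean:.3f} sd={run_sd:.3f} 95% CI half-width={run_ci:.3f}")
```

Repeating the same query several times and summarizing the per-run accuracies in this way is one simple reading of "stability"; a low standard deviation and a narrow interval would correspond to the behaviour the abstract reports for ChatGPT.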