Coles Nicholas A, Perz Bartosz, Behnke Maciej, Eichstaedt Johannes C, Kim Soo Hyung, Vu Tu N, Raman Chirag, Tejada Julian, Huynh Van-Thong, Zhang Guangyi, Cui Tanming, Podder Sharanyak, Chavda Rushi, Pandey Shubham, Upadhyay Arpit, Padilla-Buritica Jorge I, Barrera Causil Carlos J, Ji Linying, Dollack Felix, Kiyokawa Kiyoshi, Liu Huakun, Perusquia-Hernandez Monica, Uchiyama Hideaki, Wei Xin, Cao Houwei, Yang Ziqing, Iancarelli Alessia, McVeigh Kieran, Wang Yiyu, Berwian Isabel M, Chiu Jamie C, Mirea Dan-Mircea, Nook Erik C, Vartiainen Henna I, Whiting Claire, Cho Young Won, Chow Sy-Miin, Fisher Zachary F, Li Yanling, Xiong Xiaoyue, Shen Yuqi, Tagliazucchi Enzo, Bugnon Leandro A, Ospina Raydonal, Bruno Nicolas M, D'Amelio Tomas A, Zamberlan Federico, Mercado Diaz Luis R, Pinzon-Arenas Javier O, Posada-Quintero Hugo F, Bilalpur Maneesh, Hinduja Saurabh, Marmolejo-Ramos Fernando, Canavan Shaun, Jivnani Liza, Saganowski Stanisław
University of Florida, Gainesville, FL, USA.
Wrocław University of Science and Technology, Wrocław, Lower Silesia, Poland.
R Soc Open Sci. 2025 Jun 25;12(6):241778. doi: 10.1098/rsos.241778. eCollection 2025 Jun.
Researchers are increasingly using machine learning to study physiological markers of emotion. We evaluated the promises and limitations of this approach via a big team science competition. Twelve teams competed to predict self-reported affective experiences from a multimodal set of peripheral nervous system measures. Models were trained and tested in multiple ways: with data divided by participant, targeted emotion, induction method, and time. In 100% of tests, teams outperformed baseline models that made random predictions. In 46% of tests, teams also outperformed baseline models that relied on the simple average of ratings from the training datasets. More notably, results uncovered a methodological challenge: multiplicative constraints on generalizability. Inferences about the accuracy and theoretical implications of machine learning efforts depended not only on their architecture, but also on how they were trained, tested, and evaluated. For example, some teams performed better when tested on observations from the same (vs. different) subjects seen during training. Such results could be interpreted as evidence against claims of universality. However, such conclusions would be premature because other teams exhibited the opposite pattern. Taken together, results illustrate how big team science can be leveraged to understand the promises and limitations of machine learning methods in affective science and beyond.
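The evaluation scheme described above (splitting by participant, and comparing against random and training-mean baselines) can be sketched in a minimal toy example. All data, names, and parameters below are hypothetical illustrations, not the study's actual pipeline or measures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 10 subjects x 20 trials, one self-reported
# rating per trial on a 1-9 scale (the real study used multimodal
# peripheral physiology as predictors).
subjects = np.repeat(np.arange(10), 20)
ratings = rng.normal(loc=5.0, scale=1.5, size=subjects.size).clip(1, 9)

def subject_split(subjects, held_out):
    """Split by participant: held-out subjects never appear in training."""
    test = np.isin(subjects, held_out)
    return ~test, test

train_idx, test_idx = subject_split(subjects, held_out=[8, 9])

# Baseline 1: predict the simple average of training-set ratings.
mean_pred = np.full(test_idx.sum(), ratings[train_idx].mean())

# Baseline 2: predict uniformly at random on the rating scale.
rand_pred = rng.uniform(1, 9, size=test_idx.sum())

def rmse(pred):
    return float(np.sqrt(np.mean((pred - ratings[test_idx]) ** 2)))

print(f"mean-baseline RMSE:   {rmse(mean_pred):.2f}")
print(f"random-baseline RMSE: {rmse(rand_pred):.2f}")
```

The mean baseline is the harder benchmark: a team's model only demonstrates that physiology carries predictive signal if it beats this constant prediction, not merely the random one. The other splits the abstract mentions (by targeted emotion, induction method, or time) follow the same pattern with a different grouping variable.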