Department of Economics, University of Copenhagen, 1353 Copenhagen, Denmark.
Center for Social Data Science, University of Copenhagen, 1353 Copenhagen, Denmark.
Proc Natl Acad Sci U S A. 2021 Apr 6;118(14). doi: 10.1073/pnas.2020258118.
Increasingly, human behavior can be monitored through the collection of data from digital devices revealing information on behaviors and locations. In the context of higher education, a growing number of schools and universities collect data on their students with the purpose of assessing or predicting behaviors and academic performance, and the COVID-19-induced move to online education dramatically increases what can be accumulated in this way, raising concerns about students' privacy. We focus on academic performance and ask whether predictive performance for a given dataset can be achieved with less privacy-invasive, but more task-specific, data. We draw on a unique dataset on a large student population containing both highly detailed measures of behavior and personality and high-quality third-party reported individual-level administrative data. We find that models estimated using the big behavioral data are indeed able to accurately predict academic performance out of sample. However, models using only low-dimensional and arguably less privacy-invasive administrative data perform considerably better and, importantly, do not improve when we add the high-resolution, privacy-invasive behavioral data. We argue that combining big behavioral data with "ground truth" administrative registry data can ideally allow the identification of privacy-preserving task-specific features that can be employed instead of current indiscriminate troves of behavioral data, with better privacy and better prediction resulting.