Goldblum Micah, Tsipras Dimitris, Xie Chulin, Chen Xinyun, Schwarzschild Avi, Song Dawn, Madry Aleksander, Li Bo, Goldstein Tom
IEEE Trans Pattern Anal Mach Intell. 2023 Feb;45(2):1563-1580. doi: 10.1109/TPAMI.2022.3162397. Epub 2023 Jan 6.
As machine learning systems grow in scale, so do their training data requirements, forcing practitioners to automate and outsource the curation of training data in order to achieve state-of-the-art performance. The absence of trustworthy human supervision over the data collection process exposes organizations to security vulnerabilities; training data can be manipulated to control and degrade the downstream behaviors of learned models. The goal of this work is to systematically categorize and discuss a wide range of dataset vulnerabilities and exploits, approaches for defending against these threats, and an array of open problems in this space.