McGill University, 845 Sherbrooke St W, Montreal, Quebec H3A 0G4, Canada.
INRIA.
Gigascience. 2021 Sep 28;10(9). doi: 10.1093/gigascience/giab055.
Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can undermine the application of the biomarker to new individuals. Dataset shifts are frequent in biomedical research, e.g., because of recruitment biases. When a dataset shift occurs, standard machine-learning techniques do not suffice to extract and validate biomarkers. This article provides an overview of when and how dataset shifts break machine-learning-extracted biomarkers, as well as detection and correction strategies.