Sung Yang Ho, Kimberly Phua, Limsoon Wong, Wilson Wen Bin Goh
School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore.
Department of Computer Science, National University of Singapore, Singapore 117417, Singapore.
Patterns (N Y). 2020 Nov 13;1(8):100129. doi: 10.1016/j.patter.2020.100129.
We discuss the validation of machine learning models, which is standard practice in determining model efficacy and generalizability. We argue that internal validation approaches, such as cross-validation and bootstrap, cannot guarantee the quality of a machine learning model, owing to potentially biased training data and the complexity of the validation procedure itself. To better evaluate the generalization ability of a learned model, we suggest leveraging independent external data sources as validation datasets, namely external validation. Given the lack of research attention on external validation, and in particular the absence of a well-structured and comprehensive study, we discuss the necessity for external validation and propose two extensions of the external validation approach that may help reveal the true domain-relevant model from a candidate set. Moreover, we suggest a procedure to check whether a set of validation datasets is valid, and introduce statistical reference points for detecting external data problems.
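The gap between internal and external validation described above can be illustrated with a minimal sketch (not from the paper; the toy nearest-centroid classifier, the synthetic Gaussian data, and the `shift` parameter modeling sampling bias are all hypothetical assumptions for illustration): k-fold cross-validation on the training population can look healthy while accuracy on an externally sampled, distribution-shifted population is substantially lower.

```python
# Hypothetical sketch: internal k-fold cross-validation vs. external
# validation on a population with sampling bias (distribution shift).
import random

random.seed(0)

def make_data(n, shift=0.0):
    # Two 1-D Gaussian classes; `shift` moves the class-1 mean to mimic
    # a biased external population relative to the training population.
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        x = random.gauss(label * 2.0 + shift * label, 1.0)
        data.append((x, label))
    return data

def fit_centroids(data):
    # Toy nearest-centroid "model": the mean of each class.
    return {label: sum(x for x, y in data if y == label)
                   / sum(1 for _, y in data if y == label)
            for label in (0, 1)}

def accuracy(model, data):
    # Predict the class whose centroid is nearest to x.
    correct = sum(
        1 for x, y in data
        if min(model, key=lambda c: abs(x - model[c])) == y
    )
    return correct / len(data)

def cross_validate(data, k=5):
    # Internal validation: mean accuracy over k held-out folds of the
    # same (possibly biased) training sample.
    fold = len(data) // k
    scores = []
    for i in range(k):
        test = data[i * fold:(i + 1) * fold]
        train = data[:i * fold] + data[(i + 1) * fold:]
        scores.append(accuracy(fit_centroids(train), test))
    return sum(scores) / k

train = make_data(500)
external = make_data(500, shift=-1.5)  # distribution-shifted external set

model = fit_centroids(train)
print(f"internal CV accuracy: {cross_validate(train):.2f}")
print(f"external accuracy:    {accuracy(model, external):.2f}")
```

Under this setup the cross-validated estimate stays high while the external score drops, which is the failure mode the abstract argues internal validation alone cannot detect.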