Kyoto University Graduate School of Medicine, Laboratory of Molecular Biosciences, 606-8501, E-109 Konoemachi, Sakyo, Kyoto, Japan.
Mol Inform. 2018 Jan;37(1-2). doi: 10.1002/minf.201700127. Epub 2018 Jan 23.
Molecular modeling studies frequently construct classification models for the prediction of two-class entities, such as compound bio(in)activity, chemical property (non)existence, protein (non)interaction, and so forth. The models are evaluated using well-known metrics such as accuracy or true positive rate. However, when these frequently used metrics are applied to retrospective and/or artificially generated prediction datasets, they can overestimate true performance in actual prospective experiments. Here, we systematically consider how metric value surfaces are generated as a consequence of data balance, and propose the computation of an inverse cumulative distribution function taken over a metric surface. The proposed distribution analysis can aid in the selection of metrics when formulating study design. In addition to theoretical analyses, a practical example in chemogenomic virtual screening highlights the care required in metric selection and interpretation.
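The abstract does not spell out how the metric surface or its inverse cumulative distribution is computed, so the following is only a minimal sketch of one plausible reading. It assumes the surface is the set of metric values over every possible confusion matrix for a fixed number of positives and negatives, and that the inverse cumulative distribution at a threshold is the fraction of that surface scoring at or above the threshold. The function names (`metric_surface`, `inverse_cdf`) and the 10:90 class ratio are illustrative assumptions, not taken from the paper.

```python
from typing import Callable, List

Metric = Callable[[int, int, int, int], float]


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all examples that are classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def true_positive_rate(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of positives that are recovered (recall)."""
    return tp / (tp + fn)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of the true positive rate and the true negative rate."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def metric_surface(metric: Metric, n_pos: int, n_neg: int) -> List[float]:
    """Metric value for every possible confusion matrix with fixed class sizes.

    Assumption: each (TP, TN) pair is one point on the surface.
    """
    return [
        metric(tp, n_pos - tp, tn, n_neg - tn)
        for tp in range(n_pos + 1)
        for tn in range(n_neg + 1)
    ]


def inverse_cdf(values: List[float], threshold: float) -> float:
    """Fraction of surface points whose metric value is >= threshold."""
    return sum(v >= threshold for v in values) / len(values)


if __name__ == "__main__":
    # Hypothetical imbalanced dataset: 10 actives, 90 inactives.
    n_pos, n_neg = 10, 90
    for name, metric in [
        ("accuracy", accuracy),
        ("true positive rate", true_positive_rate),
        ("balanced accuracy", balanced_accuracy),
    ]:
        surface = metric_surface(metric, n_pos, n_neg)
        print(name, inverse_cdf(surface, 0.9))
```

Under these assumptions, a metric for which a larger share of the surface clears a given threshold is easier to score well on for that class balance, which is the kind of comparison the abstract suggests can inform metric selection during study design.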