IEEE/ACM Trans Comput Biol Bioinform. 2020 Jul-Aug;17(4):1352-1363. doi: 10.1109/TCBB.2019.2913855. Epub 2019 Apr 30.
In cheminformatics, compound-target binding profiles have been a main source of data for research. For data repositories that provide only positive profiles, a popular assumption is that all unreported profiles are negative. In this paper, we caution readers not to take this assumption for granted, and we present empirical evidence of its ineffectiveness from a machine learning perspective. Our examination is based on a setting where binding profiles are used as features to train predictive models; we show that (1) prediction performance degrades when the assumption fails and (2) explicit recovery of unreported profiles improves prediction performance. In particular, we propose a framework that jointly recovers profiles and learns a predictive model, and show that it achieves further performance improvement. The presented study not only suggests applying matrix recovery methods to recover unreported profiles, but also introduces a new missing-feature problem, which we call Learning with Positive and Unknown Features.
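To make the contrast between "treat unreported as negative" and "recover unreported profiles" concrete, the following is a minimal sketch in Python with NumPy. It is not the framework proposed in the paper, which the abstract does not specify. It swaps in a generic technique, weighted low-rank matrix factorization, in which reported positives are fitted with full weight and unreported entries are fitted toward zero with a small weight. The function names (`recover_scores`, `fill_unreported`), the parameters (`rank`, `unknown_weight`, `reg`), and the toy matrix are all assumptions made for illustration.

```python
import numpy as np


def recover_scores(X, rank=2, unknown_weight=0.1, reg=0.1, n_iter=30, seed=0):
    """Weighted alternating least squares on a 0/1 profile matrix.

    X[i, j] = 1 means a reported positive; 0 means unreported (unknown).
    Unknown entries are fitted toward 0 with a small weight instead of
    being trusted as true negatives. Illustrative sketch only.
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    rng = np.random.default_rng(seed)
    W = np.where(X > 0, 1.0, unknown_weight)   # per-entry confidence
    U = rng.normal(scale=0.1, size=(n, rank))  # compound factors
    V = rng.normal(scale=0.1, size=(m, rank))  # target factors
    ridge = reg * np.eye(rank)
    for _ in range(n_iter):
        # Update each compound factor with target factors held fixed.
        for i in range(n):
            Vw = V * W[i][:, None]
            U[i] = np.linalg.solve(Vw.T @ V + ridge, Vw.T @ X[i])
        # Update each target factor with compound factors held fixed.
        for j in range(m):
            Uw = U * W[:, j][:, None]
            V[j] = np.linalg.solve(Uw.T @ U + ridge, Uw.T @ X[:, j])
    return U @ V.T


def fill_unreported(X, **kwargs):
    """Keep reported positives at 1; replace unknowns with scores in [0, 1]."""
    X = np.asarray(X, dtype=float)
    scores = np.clip(recover_scores(X, **kwargs), 0.0, 1.0)
    return np.where(X > 0, 1.0, scores)


# Toy data (hypothetical): 6 compounds x 5 targets with two binding groups.
X_true = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

# Simulate a repository that fails to report one true positive.
X_observed = X_true.copy()
X_observed[0, 2] = 0.0

# Under the all-negative assumption the features are X_observed as-is;
# with recovery, unknown entries receive graded scores instead of hard zeros.
X_filled = fill_unreported(X_observed)
```

In a setting like the one the abstract describes, `X_filled` rather than `X_observed` would be passed as the feature matrix to a downstream classifier. Whether this helps depends on the data, and the weight given to unknown entries is a tuning choice, not a value taken from the paper.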