An Seungchan, Lee Yeonjin, Gong Junpyo, Hwang Seokyoung, Park In Guk, Cho Jayhyun, Lee Min Ju, Kim Minkyu, Kang Yun Pyo, Noh Minsoo
College of Pharmacy, Natural Products Research Institute, Seoul National University, Seoul, 08826, Republic of Korea.
J Cheminform. 2025 Apr 10;17(1):49. doi: 10.1186/s13321-025-00999-1.
The development of robust artificial intelligence (AI)-driven predictive models relies on high-quality, diverse chemical datasets. However, the scarcity of negative data and a publication bias toward positive results often hinder accurate biological activity prediction. To address this challenge, we introduce InertDB, a comprehensive database comprising 3,205 curated inactive compounds (CICs) identified through rigorous review of over 4.6 million compound records in PubChem. CIC selection prioritized bioassay diversity, determined using natural language processing (NLP)-based clustering metrics, while ensuring minimal biological activity across all evaluated bioassays. Notably, 97.2% of CICs adhere to the Rule of Five, a proportion significantly higher than that of the overall PubChem dataset. To further expand the chemical space, InertDB also features 64,368 generated inactive compounds (GICs) produced using a deep generative AI model trained on the CIC dataset. Compared to conventional approaches such as random sampling or property-matched decoys, InertDB significantly improves predictive AI performance, particularly for phenotypic activity prediction, by providing reliable inactive compound sets.

Scientific contributions

InertDB addresses a critical gap in AI-driven drug discovery by providing a comprehensive repository of biologically inactive compounds, mitigating the scarcity of negative data that limits prediction accuracy and model reliability. By leveraging language model-based bioassay diversity metrics and generative AI, InertDB integrates rigorously curated inactive compounds with an expanded chemical space. InertDB serves as a valuable alternative to random sampling and decoy generation, offering improved training datasets and enhancing the accuracy of phenotypic pharmacological activity prediction.
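
The Rule of Five statistic quoted above can be illustrated with a minimal sketch. It assumes the standard Lipinski thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) and descriptors that have already been computed elsewhere. The abstract does not state the exact criterion or tooling the authors used, so the `Descriptors` class, the function names, and the `max_violations` parameter are hypothetical, and the descriptor values in the usage note are illustrative only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one compound (hypothetical container)."""
    mol_weight: float        # molecular weight in Da
    logp: float              # octanol-water partition coefficient
    h_bond_donors: int       # number of hydrogen-bond donors
    h_bond_acceptors: int    # number of hydrogen-bond acceptors


def ro5_violations(d: Descriptors) -> int:
    """Count how many of the four standard Lipinski thresholds are exceeded."""
    checks = (
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    )
    return sum(checks)


def ro5_fraction(compounds: Sequence[Descriptors], max_violations: int = 0) -> float:
    """Fraction of compounds with at most `max_violations` violations.

    Assumption: max_violations=0 is the strict reading; some analyses
    tolerate one violation, and the abstract does not say which was used.
    """
    if not compounds:
        raise ValueError("compounds must not be empty")
    passing = sum(1 for d in compounds if ro5_violations(d) <= max_violations)
    return passing / len(compounds)
```

For example, `ro5_fraction([Descriptors(180.2, 1.2, 1, 4), Descriptors(1202.6, 3.0, 5, 12)])` counts the first (small, polar) entry as compliant and the second as non-compliant because it exceeds both the weight and acceptor limits.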