Etezadi Fatemeh, Ito Shunichi, Yasui Kosuke, Kado Abdalkader Rodi, Minami Itsunari, Uesugi Motonari, Ganesh Pandian Namasivayam, Nakano Haruko, Nakano Atsushi, Packwood Daniel M
Institute for Integrated Cell-Material Sciences (iCeMS), Kyoto University, Kyoto 606-8501, Japan.
Faculty of Pharmaceutical Sciences, Kyoto University, Kyoto 606-8501, Japan.
J Chem Inf Model. 2024 Dec 9;64(23):8824-8837. doi: 10.1021/acs.jcim.4c01353. Epub 2024 Nov 25.
The discovery of small organic compounds for inducing stem cell differentiation is a time- and resource-intensive process. While data science could, in principle, streamline the discovery of these compounds, novel approaches are required because training data from large numbers of example compounds are difficult to acquire. In this paper, we present the design of a new compound for inducing cardiomyocyte differentiation using simple regression models trained on a data set containing only 80 examples. We introduce decorated shape descriptors, an information-rich molecular feature representation that integrates both molecular shape and hydrophilicity information. Models built on these descriptors outperform ones using standard molecular descriptors based on shape alone. Model overtraining is diagnosed using a new type of sensitivity analysis. Our new compound is designed using a conservative molecular design strategy, and its effectiveness is confirmed through expression profiles of cardiomyocyte-related marker genes measured by real-time polymerase chain reaction experiments on human iPS cell lines. This work demonstrates a viable data-driven strategy for designing new compounds for stem cell differentiation protocols and will be useful in situations where training data are limited.
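To make the small-data setting concrete, the following is a minimal sketch of how a simple regression model might be fitted and checked for overfitting when only 80 examples are available. It is an illustration under stated assumptions, not the authors' method: the data are synthetic, the 10 feature columns are hypothetical stand-ins (the decorated shape descriptors are not reproduced here), ridge regression is an assumed model choice, and leave-one-out cross-validation is a standard check swapped in for the paper's own sensitivity analysis, which the abstract does not specify.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def loo_rmse(X, y, alpha):
    """Leave-one-out RMSE: refit n times, each time predicting the held-out row."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        w, b = fit_ridge(X[keep], y[keep], alpha)
        errors[i] = y[i] - (X[i] @ w + b)
    return float(np.sqrt(np.mean(errors ** 2)))

# Synthetic stand-in data (an assumption): 80 "compounds", each described by
# 10 hypothetical descriptor values, with activity driven by three of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
true_w = np.zeros(10)
true_w[:3] = [1.0, -0.5, 0.25]
y = X @ true_w + 0.1 * rng.normal(size=80)

# Compare regularization strengths by held-out error rather than training error.
scores = {alpha: loo_rmse(X, y, alpha) for alpha in (0.01, 1.0, 100.0)}
best_alpha = min(scores, key=scores.get)
print(scores, best_alpha)
```

With so few examples, scoring each candidate model on held-out compounds rather than on its training fit is one common way to notice overtraining before committing to a design.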