Kosonocky Clayton W, Wilke Claus O, Marcotte Edward M, Ellington Andrew D
Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78705, USA.
Department of Integrative Biology, University of Texas at Austin, Austin, TX 78705, USA.
Digit Discov. 2024 May 7;3(6):1150-1159. doi: 10.1039/d4dd00011k. eCollection 2024 Jun 12.
The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule-function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
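As a rough illustration of the data shape the abstract describes, a molecule-function dataset can be viewed as a list of (molecule, label) pairs; the stated figures imply roughly 631 K / 100 K ≈ 6.3 labels per molecule on average. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the molecule IDs, the labels, and the helper names `build_indexes` and `mean_labels_per_molecule` are hypothetical and are not taken from the CheF dataset.

```python
from collections import defaultdict

# Hypothetical toy pairs (molecule ID, functional label); NOT real CheF data.
pairs = [
    ("mol_A", "antiviral"),
    ("mol_A", "protease inhibitor"),
    ("mol_B", "antiviral"),
    ("mol_C", "herbicide"),
]


def build_indexes(pairs):
    """Build forward (molecule -> labels) and inverted (label -> molecules) maps."""
    labels_by_molecule = defaultdict(set)
    molecules_by_label = defaultdict(set)
    for molecule, label in pairs:
        labels_by_molecule[molecule].add(label)
        molecules_by_label[label].add(molecule)
    return dict(labels_by_molecule), dict(molecules_by_label)


def mean_labels_per_molecule(labels_by_molecule):
    """Average number of distinct labels attached to each molecule."""
    total = sum(len(labels) for labels in labels_by_molecule.values())
    return total / len(labels_by_molecule)


labels_by_molecule, molecules_by_label = build_indexes(pairs)

# Look up every toy molecule carrying a given functional label.
antiviral_molecules = sorted(molecules_by_label["antiviral"])
average_labels = mean_labels_per_molecule(labels_by_molecule)
```

The inverted index is what makes a label-first query ("which molecules are annotated as antiviral?") cheap; a predictive model of the kind the abstract mentions would instead estimate such labels from structure for molecules with no annotations.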