

LLM4THP: a computing tool to identify tumor homing peptides by molecular and sequence representation of large language model based on two-layer ensemble model strategy.

Affiliations

School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou, 213164, China.

The Affiliated Changzhou No. 2 People's Hospital of Nanjing Medical University, Changzhou, 213164, China.

Publication information

Amino Acids. 2024 Oct 15;56(1):62. doi: 10.1007/s00726-024-03422-5.

Abstract

Tumor homing peptides (THPs) bind specifically to tumor cells, which makes them a promising approach for targeted cancer treatment and detection. Despite this potential, identifying THPs by conventional methods is both time-consuming and expensive. To address this, we present LLM4THP, a computational approach that uses large language models (LLMs) to identify THPs quickly and effectively. LLM4THP encodes peptide sequences with two protein LLMs, ESM2 and Prot_T5_XL_UniRef50, which capture complex patterns and relationships within the peptide data. To enrich the peptide representation, we also use intrinsic sequence features: Amino Acid Composition (AAC), Pseudo Amino Acid Composition (PAAC), Amphiphilic Pseudo Amino Acid Composition (APAAC), and Composition, Transition, and Distribution (CTD). In addition, the RDKitDescriptors feature representation converts peptide sequences into molecular objects and computes their chemical properties, which improves THP identification. The LLM4THP ensemble strategy combines these features in a two-layer learning architecture. The first layer consists of LightGBM, XGBoost, Random Forest, and Extremely Randomized Trees, which generate a set of meta results. The second layer applies Logistic Regression to these meta results to classify each sequence as THP or non-THP. Compared with state-of-the-art methods, LLM4THP shows improvements in accuracy, Matthews correlation coefficient, F1 score, area under the curve, and average precision. The source code and dataset are available at https://github.com/abcair/LLM4THP.
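Of the sequence features listed above, Amino Acid Composition is the simplest to illustrate: it is the relative frequency of each of the 20 standard amino acids in a peptide. The following is a minimal sketch of that general definition, not the code from the LLM4THP repository; the function name `amino_acid_composition`, the alphabetical residue ordering, and the choice to skip non-standard residues are all assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids in alphabetical one-letter order.
# The ordering is an assumption; any fixed order gives an equivalent feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return a 20-dimensional AAC vector: the fraction of each standard residue."""
    residues = [aa for aa in sequence.strip().upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues)
    # Counter returns 0 for residues that never occur in the peptide.
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Arbitrary example peptide, used only to show the shape of the output.
    features = amino_acid_composition("CGNKRTRGC")
    for aa, value in zip(AMINO_ACIDS, features):
        if value > 0:
            print(f"{aa}: {value:.3f}")
```

A vector like this could be concatenated with other descriptors before being passed to a classifier; how LLM4THP actually combines its feature sets is specified in the paper and repository, not here.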


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6fdf/11480143/6dc1e419a5cb/726_2024_3422_Fig1_HTML.jpg
