LLM4THP: a computing tool to identify tumor homing peptides by molecular and sequence representation of large language model based on two-layer ensemble model strategy.

Affiliations

School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou, 213164, China.

The Affiliated Changzhou No. 2 People's Hospital of Nanjing Medical University, Changzhou, 213164, China.

Publication information

Amino Acids. 2024 Oct 15;56(1):62. doi: 10.1007/s00726-024-03422-5.

DOI:10.1007/s00726-024-03422-5

PMID:39404804

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11480143/

Abstract

Tumor homing peptides (THPs) bind specifically to tumor cells, which makes them promising for targeted cancer treatment and detection. Identifying THPs by conventional methods, however, is time-consuming and expensive. To address this, we present LLM4THP, a computational approach that uses large language models (LLMs) to identify THPs quickly and effectively. LLM4THP encodes peptide sequences with two protein LLMs, ESM2 and Prot_T5_XL_UniRef50, to capture complex patterns and relationships within the peptide data. To enrich the peptide representation, we also use intrinsic sequence features: Amino Acid Composition (AAC), Pseudo Amino Acid Composition (PAAC), Amphiphilic Pseudo Amino Acid Composition (APAAC), and Composition, Transition, and Distribution (CTD). In addition, the RDKitDescriptors feature representation converts peptide sequences into molecular objects and computes chemical descriptors, which improves THP identification. The LLM4THP ensemble strategy combines these features in a two-layer learning architecture. The first layer consists of LightGBM, XGBoost, Random Forest, and Extremely Randomized Trees, which generate a set of meta results. The second layer applies Logistic Regression to these meta results to classify each sequence as THP or non-THP. Compared with state-of-the-art methods, LLM4THP shows improvements in accuracy, Matthews correlation coefficient, F1 score, area under the curve, and average precision. The source code and dataset are available at https://github.com/abcair/LLM4THP.
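
As a small illustration of the simplest sequence feature named in the abstract, AAC, the following is a minimal Python sketch. It is not taken from the LLM4THP repository. The function name `amino_acid_composition`, the residue ordering, the handling of non-standard residues, and the example peptide string are all assumptions made for illustration. AAC is computed here as the fraction of each of the 20 standard amino acids in the peptide, which yields a fixed-length 20-dimensional vector regardless of peptide length.

```python
from collections import Counter

# The 20 standard amino acids in a fixed (alphabetical) order, so that every
# peptide maps to the same 20 feature columns. The ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard residue in the peptide.

    Non-standard characters are ignored (an illustrative choice, not
    necessarily what LLM4THP does).
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Example peptide string, used only to illustrate the call.
features = amino_acid_composition("CGNKRTRGC")
print(dict(zip(AMINO_ACIDS, features)))
```

A vector like this could be concatenated with the other feature groups (LLM embeddings, PAAC, APAAC, CTD, molecular descriptors) before being passed to the first-layer classifiers. How the original tool combines them is described in the paper and repository, not in this sketch.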

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6fdf/11480143/6dc1e419a5cb/726_2024_3422_Fig1_HTML.jpg

Similar articles

StackTHP: A stacking ensemble model for accurate prediction of tumor-homing peptides in cancer therapy.

Comput Biol Med. 2025 May;189:109958. doi: 10.1016/j.compbiomed.2025.109958. Epub 2025 Mar 5.

A Computational Predictor for Accurate Identification of Tumor Homing Peptides by Integrating Sequential and Deep BiLSTM Features.

Interdiscip Sci. 2024 Jun;16(2):503-518. doi: 10.1007/s12539-024-00628-9. Epub 2024 May 11.

NEPTUNE: A novel computational approach for accurate and large-scale identification of tumor homing peptides.

Comput Biol Med. 2022 Sep;148:105700. doi: 10.1016/j.compbiomed.2022.105700. Epub 2022 Jun 7.

StackTHPred: Identifying Tumor-Homing Peptides through GBDT-Based Feature Selection with Stacking Ensemble Architecture.

Int J Mol Sci. 2023 Jun 19;24(12):10348. doi: 10.3390/ijms241210348.

UMPred-FRL: A New Approach for Accurate Prediction of Umami Peptides Using Feature Representation Learning.

Int J Mol Sci. 2021 Dec 4;22(23):13124. doi: 10.3390/ijms222313124.

Stack-AAgP: Computational prediction and interpretation of anti-angiogenic peptides using a meta-learning framework.

Comput Biol Med. 2024 May;174:108438. doi: 10.1016/j.compbiomed.2024.108438. Epub 2024 Apr 9.

Identification of tumor homing peptides by utilizing hybrid feature representation.

J Biomol Struct Dyn. 2023 May;41(8):3405-3412. doi: 10.1080/07391102.2022.2049368. Epub 2022 Mar 9.

AI4ACEIP: A Computing Tool to Identify Food Peptides with High Inhibitory Activity for ACE by Merged Molecular Representation and Rich Intrinsic Sequence Information Based on an Ensemble Learning Strategy.

J Agric Food Chem. 2024 Nov 13;72(45):25340-25356. doi: 10.1021/acs.jafc.4c05650. Epub 2024 Nov 4.

StackAHTPs: An explainable antihypertensive peptides identifier based on heterogeneous features and stacked learning approach.

IET Syst Biol. 2025 Jan-Dec;19(1):e70002. doi: 10.1049/syb2.70002.
