
The influence of dataset homology and a rigorous evaluation strategy on protein secondary structure prediction.

Affiliations

Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan.

Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan.

Publication information

PLoS One. 2021 Jul 14;16(7):e0254555. doi: 10.1371/journal.pone.0254555. eCollection 2021.

DOI:10.1371/journal.pone.0254555
PMID:34260641
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8279362/
Abstract

Secondary structure prediction (SSP) of proteins has long been an essential technique in structural biology, with many applications. Despite its vital role in many research and industrial fields, SSP has in recent years come to be regarded either as no longer challenging or as too challenging to advance, because the accuracy of state-of-the-art predictors is approaching the theoretical upper limit. Believing that substantial improvements in SSP would advance the many fields that depend on it, we conducted this study, which focuses on three issues that have not yet been noticed or thoroughly examined but may have affected the reliability of previous evaluations of SSP algorithms. All three concern sequence homology between or within the development and evaluation datasets. We therefore designed many different homology layouts of datasets to train and evaluate SSP models. Each experiment was repeated multiple times by random sampling. Conclusions obtained with small experimental datasets were verified on large-scale datasets using state-of-the-art SSP algorithms. Contrary to the long-established assumption, we found that sequence homology between the query datasets used for training, testing, and independent tests has little influence on SSP accuracy. In addition, homology redundancy between or within most datasets causes the accuracy of an SSP algorithm to be overestimated, whereas redundancy within the reference dataset used for extracting predictive features causes the accuracy to be underestimated. Because the overestimating effects are larger than the underestimating effect, the accuracy of some SSP methods may have been overestimated. Based on these findings, we propose a rigorous procedure for developing SSP algorithms and evaluating them reliably, in the hope of bringing substantial improvements to future SSP methods and benefiting all research and application fields that rely on accurate prediction of protein secondary structures.
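The two dataset operations the abstract relies on, removing homology redundancy within a dataset and repeating each experiment by random sampling, can be illustrated with a minimal Python sketch. This is not the authors' procedure: the function names, the toy sequences, and the 0.9 cutoff are hypothetical, and `identity` uses `difflib.SequenceMatcher` as a crude stand-in for an alignment-based sequence identity that a real pipeline would obtain from a dedicated clustering or alignment tool.

```python
import random
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude proxy for pairwise sequence identity (assumption: not a real alignment)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundancy(seqs: dict, cutoff: float) -> list:
    """Greedy filter: keep a sequence only if its identity to every
    already-kept sequence is below `cutoff`. Longer sequences go first,
    ties are broken by name so the result is deterministic."""
    kept = []
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        if all(identity(seqs[name], seqs[k]) < cutoff for k in kept):
            kept.append(name)
    return kept


def repeated_splits(names: list, test_fraction: float, repeats: int, seed: int = 0):
    """Yield `repeats` random (train, test) partitions of `names`."""
    rng = random.Random(seed)
    n_test = max(1, round(len(names) * test_fraction))
    for _ in range(repeats):
        shuffled = list(names)
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy dataset: "a_dup" and "a_var" are near-copies of "a".
toy = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "a_dup": "MKTAYIAKQRQISFVKSHFSRQ",
    "a_var": "MKTAYIAKQRQISFVKSHFSRA",
    "b": "GSSGSSGLVPRGSHMASMTGGQ",
    "c": "WWPPDDEECCNNHHYYFFLLII",
}

non_redundant = remove_redundancy(toy, cutoff=0.9)
for train, test in repeated_splits(non_redundant, test_fraction=1 / 3, repeats=3):
    print("train:", sorted(train), "test:", sorted(test))
```

Filtering before splitting means near-identical sequences cannot inflate the test score, and averaging a metric over the repeated partitions gives a less sampling-dependent estimate than a single split.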


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/c3e9908d591b/pone.0254555.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/2b037a7a2812/pone.0254555.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/e8a52d989643/pone.0254555.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/794fc1f0cab3/pone.0254555.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/3618af475e1d/pone.0254555.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/af417ae66f88/pone.0254555.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/ecd09bd62946/pone.0254555.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/2591cc09e4f9/pone.0254555.g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/f0e43fd3ad02/pone.0254555.g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/a89a00875b6c/pone.0254555.g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/76e72d0de63c/pone.0254555.g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/708f494d2b16/pone.0254555.g012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/fa1d5681f99c/pone.0254555.g013.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5b18/8279362/328bc7f028b5/pone.0254555.g014.jpg

Similar articles

1
The influence of dataset homology and a rigorous evaluation strategy on protein secondary structure prediction.
PLoS One. 2021 Jul 14;16(7):e0254555. doi: 10.1371/journal.pone.0254555. eCollection 2021.
2
Discovering the Ultimate Limits of Protein Secondary Structure Prediction.
Biomolecules. 2021 Nov 3;11(11):1627. doi: 10.3390/biom11111627.
3
A secondary structure-based position-specific scoring matrix applied to the improvement in protein secondary structure prediction.
PLoS One. 2021 Jul 28;16(7):e0255076. doi: 10.1371/journal.pone.0255076. eCollection 2021.
4
A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy.
PLoS One. 2020 Jun 30;15(6):e0235153. doi: 10.1371/journal.pone.0235153. eCollection 2020.
5
Protein Secondary Structure Prediction Based on Data Partition and Semi-Random Subspace Method.
Sci Rep. 2018 Jun 29;8(1):9856. doi: 10.1038/s41598-018-28084-8.
6
Structural class prediction of protein using novel feature extraction method from chaos game representation of predicted secondary structure.
J Theor Biol. 2016 Jul 7;400:1-10. doi: 10.1016/j.jtbi.2016.04.011. Epub 2016 Apr 12.
7
A high-accuracy protein structural class prediction algorithm using predicted secondary structural information.
J Theor Biol. 2010 Dec 7;267(3):272-5. doi: 10.1016/j.jtbi.2010.09.007. Epub 2010 Sep 8.
8
Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training.
Proteins. 2007 Mar 1;66(4):838-45. doi: 10.1002/prot.21298.
9
Variable predictive model based classification algorithm for effective separation of protein structural classes.
Comput Biol Chem. 2008 Aug;32(4):302-6. doi: 10.1016/j.compbiolchem.2008.03.009. Epub 2008 Apr 1.
10
Beyond the Twilight Zone: automated prediction of structural properties of proteins by recursive neural networks and remote homology information.
Proteins. 2009 Oct;77(1):181-90. doi: 10.1002/prot.22429.

Cited by

1
Ubigo-X: Protein ubiquitination site prediction using ensemble learning with image-based feature representation and weighted voting.
Comput Struct Biotechnol J. 2025 Jul 14;27:3137-3146. doi: 10.1016/j.csbj.2025.07.025. eCollection 2025.
2
The constrained-disorder principle defines the functions of systems in nature.
Front Netw Physiol. 2024 Dec 18;4:1361915. doi: 10.3389/fnetp.2024.1361915. eCollection 2024.
3
ANPS: machine learning based server for identification of anti-nutritional proteins in plants.
Funct Integr Genomics. 2024 Oct 25;24(6):201. doi: 10.1007/s10142-024-01474-0.
4
Artificial intelligence and neoantigens: paving the path for precision cancer immunotherapy.
Front Immunol. 2024 May 29;15:1394003. doi: 10.3389/fimmu.2024.1394003. eCollection 2024.
5
Discovering the Ultimate Limits of Protein Secondary Structure Prediction.
Biomolecules. 2021 Nov 3;11(11):1627. doi: 10.3390/biom11111627.
6
CirPred, the first structure modeling and linker design system for circularly permuted proteins.
BMC Bioinformatics. 2021 Oct 12;22(Suppl 10):494. doi: 10.1186/s12859-021-04403-1.
7
A secondary structure-based position-specific scoring matrix applied to the improvement in protein secondary structure prediction.
PLoS One. 2021 Jul 28;16(7):e0255076. doi: 10.1371/journal.pone.0255076. eCollection 2021.

References

1
A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy.
PLoS One. 2020 Jun 30;15(6):e0235153. doi: 10.1371/journal.pone.0235153. eCollection 2020.
2
Deeper Profiles and Cascaded Recurrent and Convolutional Neural Networks for state-of-the-art Protein Secondary Structure Prediction.
Sci Rep. 2019 Aug 26;9(1):12374. doi: 10.1038/s41598-019-48786-x.
3
NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning.
Proteins. 2019 Jun;87(6):520-527. doi: 10.1002/prot.25674. Epub 2019 Mar 9.
4
Single-sequence-based prediction of protein secondary structures and solvent accessibility by deep whole-sequence learning.
J Comput Chem. 2018 Oct 5;39(26):2210-2216. doi: 10.1002/jcc.25534. Epub 2018 Oct 14.
5
Clustering huge protein sequence sets in linear time.
Nat Commun. 2018 Jun 29;9(1):2542. doi: 10.1038/s41467-018-04964-5.
6
MUFOLD-SS: New deep inception-inside-inception networks for protein secondary structure prediction.
Proteins. 2018 May;86(5):592-598. doi: 10.1002/prot.25487. Epub 2018 Mar 12.
7
Critical assessment of methods of protein structure prediction (CASP)-Round XII.
Proteins. 2018 Mar;86 Suppl 1(Suppl 1):7-15. doi: 10.1002/prot.25415. Epub 2017 Dec 15.
8
Sixty-five years of the long march in protein secondary structure prediction: the final stretch?
Brief Bioinform. 2018 May 1;19(3):482-494. doi: 10.1093/bib/bbw129.
9
Sequence-Based Prediction of Protein-Carbohydrate Binding Sites Using Support Vector Machines.
J Chem Inf Model. 2016 Oct 24;56(10):2115-2122. doi: 10.1021/acs.jcim.6b00320. Epub 2016 Sep 22.
10
Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields.
Sci Rep. 2016 Jan 11;6:18962. doi: 10.1038/srep18962.