
SpanSeq: similarity-based sequence data splitting method for improved development and assessment of deep learning projects.

Authors

Ferrer Florensa Alfred, Almagro Armenteros Jose Juan, Nielsen Henrik, Aarestrup Frank Møller, Clausen Philip Thomas Lanken Conradsen

Affiliations

Research Group for Genomic Epidemiology, DTU National Food Institute, Technical University of Denmark, Anker Engelunds Vej 1, 2800 Kongens Lyngby, Denmark.

Informatics and Predictive Sciences Research, Bristol Myers Squibb Company, Calle Isaac Newton 4, 41092 Sevilla, Spain.

Publication information

NAR Genom Bioinform. 2024 Aug 16;6(3):lqae106. doi: 10.1093/nargab/lqae106. eCollection 2024 Sep.

DOI: 10.1093/nargab/lqae106
PMID: 39157582
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11327874/
Abstract

The use of deep learning models in computational biology has increased massively in recent years, and it is expected to continue with the current advances in fields such as Natural Language Processing. These models, although able to draw complex relations between input and target, are also inclined to learn noisy deviations from the pool of data used during their development. In order to assess their performance on unseen data (their capacity to generalize), it is common to split the available data randomly into development (train/validation) and test sets. This procedure, although standard, has been shown to produce dubious assessments of generalization due to the existing similarity between samples in the databases used. In this work, we present SpanSeq, a database partition method for machine learning that can scale to most biological sequences (genes, proteins and genomes) in order to avoid data leakage between sets. We also explore the effect of not restraining similarity between sets by reproducing the development of two state-of-the-art models in bioinformatics, not only confirming the consequences of randomly splitting databases on the model assessment, but expanding those repercussions to the model development. SpanSeq is available at https://github.com/genomicepidemiology/SpanSeq.
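
The idea of similarity-aware splitting described in the abstract can be illustrated with a minimal Python sketch. This is not SpanSeq's implementation or command-line interface: the k-mer Jaccard similarity, the 0.5 threshold, the single-linkage clustering, and the function names (kmers, jaccard, cluster_by_similarity, split_clusters) are all assumptions chosen for illustration. The sketch groups sequences whose similarity exceeds a threshold into clusters, then assigns whole clusters to partitions, so that no two similar sequences end up in different sets.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_similarity(seqs, threshold=0.5, k=3):
    """Single-linkage clustering (illustrative assumption, not SpanSeq's method).

    Any pair of sequences with similarity >= threshold is linked, and
    connected sequences form one cluster. Returns lists of sequence indices.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [kmers(s, k) for s in seqs]
    # All-vs-all comparison: quadratic, fine for a toy example only.
    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_clusters(clusters, n_parts=2):
    """Greedily place each cluster, largest first, in the smallest partition."""
    parts = [[] for _ in range(n_parts)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(parts, key=len).extend(cluster)
    return parts


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA", "GATTACAGAT"]
    groups = cluster_by_similarity(toy)
    print(groups)
    print(split_clusters(groups, n_parts=2))
```

A random split of the same toy data could place "ACGTACGTAC" in the development set and its near-duplicate "ACGTACGTAA" in the test set, which is the kind of leakage the abstract warns about. The all-vs-all comparison here scales quadratically and is only suitable for small examples.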




