SpanSeq: similarity-based sequence data splitting method for improved development and assessment of deep learning projects.

Authors

Ferrer Florensa Alfred, Almagro Armenteros Jose Juan, Nielsen Henrik, Aarestrup Frank Møller, Clausen Philip Thomas Lanken Conradsen

Affiliations

Research Group for Genomic Epidemiology, DTU National Food Institute, Technical University of Denmark, Anker Engelunds Vej 1, 2800 Kongens Lyngby, Denmark.

Informatics and Predictive Sciences Research, Bristol Myers Squibb Company, Calle Isaac Newton 4, 41092 Sevilla, Spain.

Publication information

NAR Genom Bioinform. 2024 Aug 16;6(3):lqae106. doi: 10.1093/nargab/lqae106. eCollection 2024 Sep.

Abstract

The use of deep learning models in computational biology has increased massively in recent years, and it is expected to continue with the current advances in fields such as Natural Language Processing. These models, although able to draw complex relations between input and target, are also inclined to learn noisy deviations from the pool of data used during their development. In order to assess their performance on unseen data (their capacity to generalize), it is common to split the available data randomly into development (train/validation) and test sets. This procedure, although standard, has been shown to produce dubious assessments of generalization due to the existing similarity between samples in the databases used. In this work, we present SpanSeq, a database partition method for machine learning that can scale to most biological sequences (genes, proteins and genomes) in order to avoid data leakage between sets. We also explore the effect of not restraining similarity between sets by reproducing the development of two state-of-the-art models in bioinformatics, not only confirming the consequences of randomly splitting databases on the model assessment, but expanding those repercussions to the model development. SpanSeq is available at https://github.com/genomicepidemiology/SpanSeq.
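The general idea of a similarity-aware split can be illustrated with a toy sketch. This is not SpanSeq's algorithm or interface, neither of which is described in the abstract. It assumes k-mer Jaccard similarity as the similarity measure, single-linkage clustering at a fixed threshold, and a greedy assignment of whole clusters to partitions. All function names, the threshold, and the example sequences are hypothetical.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_similarity(seqs, threshold=0.5, k=3):
    """Single-linkage clustering: link every pair with similarity >= threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    profiles = [kmers(s, k) for s in seqs]
    # All-vs-all comparison: quadratic, acceptable only for a toy example.
    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(profiles[i], profiles[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_clusters(clusters, n_parts=2):
    """Greedily place each cluster (largest first) into the smallest partition."""
    parts = [[] for _ in range(n_parts)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(parts, key=len).extend(cluster)
    return parts


# Hypothetical toy sequences: two near-duplicate pairs and one singleton.
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCTTGG", "TTGGCCTTGA", "GAGAGAGAGA"]
clusters = cluster_by_similarity(seqs, threshold=0.5)
parts = split_clusters(clusters, n_parts=2)
```

Because whole clusters are assigned to a partition, no pair of sequences at or above the threshold ends up on opposite sides of the split, which is the property a random split does not guarantee.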

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba8a/11327874/b101cd16c492/lqae106fig1.jpg
