Exploring the Potential of GANs in Biological Sequence Analysis.

Authors

Murad Taslim, Ali Sarwan, Patterson Murray

Affiliation

Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA.

Publication information

Biology (Basel). 2023 Jun 14;12(6):854. doi: 10.3390/biology12060854.

Abstract

Biological sequence analysis is an essential step toward a deeper understanding of the functions, structures, and behaviors of sequences. It can help identify the characteristics of the associated organisms, such as viruses, and build prevention mechanisms to curb their spread and impact, since viruses are known to cause epidemics that can become global pandemics. Machine learning (ML) technologies provide new tools for analyzing the functions and structures of biological sequences. However, these ML-based methods face challenges from the data imbalance commonly found in biological sequence datasets, which hinders their performance. Various strategies exist to address this issue, such as the SMOTE algorithm, which creates synthetic data, but they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to the data imbalance issue based on generative adversarial networks (GANs), which use the overall data distribution. GANs are used to generate synthetic data that closely resembles real data; these generated data can then be employed to enhance the performance of ML models by eliminating the class imbalance problem in biological sequence analysis. We perform four distinct classification tasks using four different sequence datasets (Influenza A Virus, PALMdb, VDjDB, Host), and our results illustrate that GANs can improve the overall classification performance.
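
To make the remark about "local information" concrete, the following is a minimal sketch of SMOTE-style oversampling. It is not the authors' code, and it simplifies the original algorithm. It assumes each sequence has already been converted to a fixed-length numeric vector (the abstract does not say which embedding is used), and the function name `smote_like_oversample` and the toy data are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.

    Simplified illustration only; `minority` is a list of equal-length
    numeric tuples (an assumed embedding of the sequences).
    """
    rng = random.Random(seed)

    def sq_dist(a, b):
        # Squared Euclidean distance between two vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Only the immediate neighbourhood of `base` is consulted
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + t * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: four 2-D feature vectors (hypothetical data)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority, n_new=5)
```

Every synthetic point here lies on a line segment between two nearby minority samples, which is the sense in which the method is local. A GAN-based approach, as described in the abstract, instead trains a generator against a discriminator so that the generated samples reflect the overall data distribution.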

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6957/10295061/679be71a9bb2/biology-12-00854-g001.jpg
