National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China.
College of Biomedical Engineering, Taiyuan University of Technology, Jinzhong 030600, China.
Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae138.
Following the milestone success of the Human Genome Project, the 'Encyclopedia of DNA Elements (ENCODE)' initiative was launched in 2003 to uncover the many functional elements within the genome. This endeavor coincided with the emergence of numerous novel technologies, which supplied vast amounts of whole-genome sequence and high-throughput data such as ChIP-Seq and RNA-Seq. Extracting biologically meaningful information from these massive datasets has become a critical aspect of many recent studies, particularly in annotating and predicting the functions of unknown genes. The core idea behind genome annotation is to identify genes and other functional elements within the genome sequence and infer their biological functions. Traditional wet-lab methods still require extensive effort for functional verification, while early bioinformatics algorithms and software primarily employed shallow learning techniques, so their capacity for data characterization and feature learning was limited. With the widespread adoption of RNA-Seq technology, researchers in the biological community began to harness machine learning and deep learning approaches for gene structure prediction and functional annotation. In this context, we review both conventional methods and contemporary deep learning frameworks and highlight novel perspectives on the challenges arising during annotation, underscoring the dynamic nature of this evolving scientific landscape.