Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China.
Section of Developmental Genomics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, USA.
Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad177.
The volume of RNA sequencing (RNA-seq) data has increased exponentially, providing numerous new insights into various biological processes. However, owing to significant practical challenges such as data heterogeneity, it remains difficult to ensure the quality of these data when they are integrated. Although some quality control methods have been developed, they rarely consider sample consistency and are susceptible to human factors. Here, we developed MassiveQC, an unsupervised machine learning-based approach, to automatically download and filter large-scale high-throughput data. In addition to the read quality used in other tools, MassiveQC also uses alignment and expression quality as model features. It is also user-friendly, since the cutoff is generated from self-reporting and is applicable to multimodal data. To explore its value, we applied MassiveQC to Drosophila RNA-seq data and generated a comprehensive transcriptome atlas across 28 tissues from embryogenesis to adulthood. We systematically characterized fly gene expression dynamics and found that genes with high expression dynamics were likely to be evolutionarily young and expressed at late developmental stages; they exhibited high nonsynonymous substitution rates and low phenotypic severity, and were involved in simple regulatory programs. We also discovered that gene expression in orthologous organs of human and Drosophila was strongly positively correlated, revealing the great potential of the Drosophila system for studying human development and disease.
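The filtering idea described in the abstract (flagging samples that are inconsistent with the rest, using read, alignment and expression quality together) can be illustrated with a small sketch. This is not MassiveQC's implementation: the abstract does not name the model, so a simple median/MAD outlier rule stands in for the unsupervised learner here. The feature names, the sample values and the fixed cutoff of 3.5 are all hypothetical, chosen only for illustration; MassiveQC itself is described as generating its cutoff automatically.

```python
from statistics import median


def robust_z(values):
    """Robust z-scores: (x - median) / (1.4826 * MAD)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        # All values (or most of them) are identical: nothing stands out.
        return [0.0 for _ in values]
    return [(v - med) / scale for v in values]


def flag_outliers(samples, features, cutoff=3.5):
    """Return sorted IDs of samples that are outliers on any feature.

    `cutoff` is a hand-picked illustrative value, not one taken from MassiveQC.
    """
    flagged = set()
    for feat in features:
        zs = robust_z([s[feat] for s in samples.values()])
        for sample_id, z in zip(samples, zs):
            if abs(z) > cutoff:
                flagged.add(sample_id)
    return sorted(flagged)


# Hypothetical per-sample QC table: one read-level, one alignment-level
# and one expression-level feature.
features = ["mean_read_quality", "pct_aligned", "pct_genes_detected"]
samples = {
    "S1": {"mean_read_quality": 36.1, "pct_aligned": 92.0, "pct_genes_detected": 61.0},
    "S2": {"mean_read_quality": 35.8, "pct_aligned": 90.5, "pct_genes_detected": 59.5},
    "S3": {"mean_read_quality": 36.4, "pct_aligned": 93.1, "pct_genes_detected": 62.2},
    "S4": {"mean_read_quality": 35.9, "pct_aligned": 91.2, "pct_genes_detected": 60.1},
    # Good read quality, but poor alignment and expression quality.
    "S5": {"mean_read_quality": 36.0, "pct_aligned": 41.0, "pct_genes_detected": 22.0},
    "S6": {"mean_read_quality": 36.2, "pct_aligned": 92.4, "pct_genes_detected": 60.8},
}

print(flag_outliers(samples, features))
```

In this toy table, sample S5 passes on read quality alone but is flagged on the alignment and expression features, which is the kind of case that motivates using more than read quality as input.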