Singapore Institute for Clinical Sciences, Agency for Science, Technology and Research (A*STAR), 30 Medical Dr, 117609, Singapore, Republic of Singapore.
Bioinformatics Institute, Agency for Science, Technology and Research (A*STAR), 30 Biopolis St, Matrix, 138671, Singapore, Republic of Singapore.
Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae039.
Cohort studies increasingly collect biosamples for molecular profiling and are observing molecular heterogeneity. High-throughput RNA sequencing provides large datasets capable of reflecting disease mechanisms. Many clustering tools exist to help dissect complex heterogeneous datasets, but selecting the appropriate method and parameters for exploratory clustering analysis of transcriptomic data requires a deep understanding of machine learning and extensive computational experimentation. No existing tools assist with these decisions without requiring prior knowledge of the field. To address this, we have developed Omada, a suite of tools that aims to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning-based functions.
The efficiency of each tool was tested with 7 datasets characterized by different expression signal strengths to capture a wide spectrum of RNA expression datasets. Our toolkit's decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Within datasets with less clear biological distinctions, our tools either formed stable subgroups with different expression profiles and robust clinical associations or revealed signs of problematic data such as biased measurements.
In conclusion, Omada successfully automates the robust unsupervised clustering of transcriptomic data, making advanced analysis accessible and reliable even for those without extensive machine learning expertise. An implementation of Omada is available at http://bioconductor.org/packages/omada/.
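The notion of a "stable partition" used above can be illustrated with a small, self-contained sketch. This is not Omada's implementation (Omada is an R/Bioconductor package, and its internals are not described here); it is a generic subsampling-stability check in plain Python, with k-means and the Rand index swapped in as stand-in techniques. All function names (`make_blobs`, `kmeans`, `rand_index`, `stability`) and the synthetic three-group dataset are hypothetical and exist only for illustration.

```python
import random
from itertools import combinations


def make_blobs(seed=1):
    """Synthetic 2-D data with three well-separated groups (illustrative only)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in centers for _ in range(30)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=25):
    """Lloyd's k-means with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members; leave empty clusters alone.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of sample pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, rounds=20, frac=0.8, seed=0):
    """Mean agreement between a full-data clustering and clusterings of random subsamples."""
    rng = random.Random(seed)
    reference = kmeans(points, k)
    scores = []
    for _ in range(rounds):
        idx = rng.sample(range(len(points)), int(frac * len(points)))
        sub_labels = kmeans([points[i] for i in idx], k)
        scores.append(rand_index([reference[i] for i in idx], sub_labels))
    return sum(scores) / len(scores)


points = make_blobs()
# Stability score for each candidate number of clusters.
scores = {k: stability(points, k) for k in range(2, 6)}
for k, s in scores.items():
    print(f"k={k}: stability={s:.3f}")
```

In a sketch like this, a candidate k whose partitions are reproduced across subsamples scores close to 1, while a k that forces an arbitrary split of a real group tends to score lower. Small values of k can also be highly stable, so a stability score alone is usually combined with other criteria before settling on the number of clusters.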