Machine Learning Approaches Identify Genes Containing Spatial Information From Single-Cell Transcriptomics Data.

Authors

Loher Phillipe, Karathanasis Nestoras

Affiliation

Computational Medicine Center, Thomas Jefferson University, Philadelphia, PA, United States.

Publication

Front Genet. 2021 Feb 1;11:612840. doi: 10.3389/fgene.2020.612840. eCollection 2020.

Abstract

The development of single-cell sequencing technologies has allowed researchers to gain important new knowledge about the expression profiles of genes in thousands of individual cells of a model organism or tissue. A common disadvantage of this technology is the loss of the three-dimensional (3-D) structure of the cells. Consequently, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized the Single-Cell Transcriptomics Challenge, in which we participated, with the aim of addressing the following two problems: (a) to identify the top 60, 40, and 20 genes of the embryo that contain the most spatial information and (b) to reconstruct the 3-D arrangement of the embryo using information from those genes. We developed two independent techniques, based on the least absolute shrinkage and selection operator (Lasso) and on deep neural networks (NNs), which are applied to high-dimensional single-cell sequencing data in order to accurately identify genes that contain spatial information. Our first technique, Lasso.TopX, combines the Lasso with ranking statistics and allows a user to specify the exact number of features they are interested in. The NN approach uses weak supervision for linear regression to accommodate uncertain or probabilistic training labels. We show, individually for both techniques, that we are able to identify a user-defined number of important, stable genes containing the most spatial information. The results from both techniques achieve high performance when reconstructing spatial information in Drosophila melanogaster and also generalize to zebrafish (Danio rerio). Furthermore, we identified novel genes that carry important positional information and were not previously suspected of doing so. We also show how indirect use of the full dataset's information can lead to data leakage and bias the evaluation toward overestimating the model's performance.
Lastly, we discuss the applicability of our approaches to other feature selection problems outside the realm of single-cell sequencing and the importance of being able to handle probabilistic training labels. Our source code and detailed documentation are available at https://github.com/TJU-CMC-Org/SingleCell-DREAM/.
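
The idea of asking a Lasso model for a user-defined number of features can be illustrated with a small sketch. This is not the authors' Lasso.TopX implementation (that code is in the repository linked above); it is a minimal, standard-library-only illustration under stated assumptions: the data are synthetic, the function names (soft_threshold, lasso_coordinate_descent, top_x_features) are hypothetical, and features are ranked simply by absolute coefficient rather than by the ranking statistics the abstract refers to. The sketch fits a Lasso by cyclic coordinate descent, walks the penalty from strong to weak, and stops at the first penalty that leaves at least X non-zero coefficients.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # residual = y - Xw, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant-zero feature carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def top_x_features(X, y, x, lambdas):
    """Return x feature indices from the strongest penalty that keeps >= x features.

    Hypothetical helper: ranking by |coefficient| is an assumption of this sketch.
    """
    ranked = []
    for lam in sorted(lambdas, reverse=True):
        w = lasso_coordinate_descent(X, y, lam)
        selected = [j for j, coef in enumerate(w) if coef != 0.0]
        ranked = sorted(selected, key=lambda j: abs(w[j]), reverse=True)
        if len(ranked) >= x:
            return sorted(ranked[:x])
    return sorted(ranked)  # fewer than x features survived even the weakest penalty


# Synthetic stand-in for expression data: 200 "cells", 12 "genes".
# Only genes 0, 3 and 7 influence the simulated position coordinate.
rng = random.Random(0)
n_cells, n_genes = 200, 12
expr = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_cells)]
position = [
    3.0 * row[0] - 2.0 * row[3] + 1.5 * row[7] + rng.gauss(0.0, 0.5)
    for row in expr
]

selected_genes = top_x_features(
    expr, position, x=3, lambdas=[2.0, 1.0, 0.5, 0.2, 0.1, 0.05]
)
print(selected_genes)
```

Walking the penalty from strong to weak means the first features to enter are the ones most strongly associated with the target, so stopping at the first penalty that yields X features gives a compact selection. In practice one would likely repeat the selection across cross-validation folds, using only training data in each fold, both to check that the chosen genes are stable and to avoid the data-leakage bias the abstract warns about.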

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e25/7902049/d107b9b1e102/fgene-11-612840-g001.jpg
