Machine Learning Approaches Identify Genes Containing Spatial Information From Single-Cell Transcriptomics Data.

Author information

Loher Phillipe, Karathanasis Nestoras

Affiliation

Computational Medicine Center, Thomas Jefferson University, Philadelphia, PA, United States.

Publication information

Front Genet. 2021 Feb 1;11:612840. doi: 10.3389/fgene.2020.612840. eCollection 2020.

DOI:10.3389/fgene.2020.612840

PMID:33633771

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7902049/

Abstract

The development of single-cell sequencing technologies has allowed researchers to gain important new knowledge about the expression profiles of genes in thousands of individual cells of a model organism or tissue. A common disadvantage of this technology is the loss of the three-dimensional (3-D) structure of the cells. Consequently, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized the Single-Cell Transcriptomics Challenge, in which we participated, with the aim of addressing the following two problems: (a) to identify the top 60, 40, and 20 genes of the Drosophila melanogaster embryo that contain the most spatial information and (b) to reconstruct the 3-D arrangement of the embryo using information from those genes. We developed two independent techniques, based on the least absolute shrinkage and selection operator (Lasso) and on deep neural networks (NNs), which are applied to high-dimensional single-cell sequencing data to accurately identify genes that contain spatial information. Our first technique, Lasso.TopX, combines the Lasso with ranking statistics and allows users to specify the number of features they are interested in. The NN approach uses weak supervision for linear regression to accommodate uncertain or probabilistic training labels. We show, for each technique individually, that we can identify a user-defined number of important and stable genes containing the most spatial information. The results from both techniques achieve high performance when reconstructing spatial information in D. melanogaster and also generalize to zebrafish (Danio rerio). Furthermore, we identified novel genes that carry important positional information and were not previously suspected to do so. We also show how the indirect use of the full dataset's information can lead to data leakage and bias the evaluation toward overestimating the model's performance. Lastly, we discuss the applicability of our approaches to other feature selection problems outside the realm of single-cell sequencing and the importance of being able to handle probabilistic training labels. Our source code and detailed documentation are available at https://github.com/TJU-CMC-Org/SingleCell-DREAM/.
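
The idea of selecting a user-defined number of features with the Lasso can be illustrated with a minimal sketch. This is not the authors' Lasso.TopX implementation (that code lives in the repository linked above); it only shows one plausible way to combine a Lasso regularization path with a ranking statistic. The function names, the plain coordinate-descent solver, the 80% subsampling ratio, and the use of selection frequency as the ranking statistic are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Soft-thresholding operator used by Lasso coordinate descent."""
    return np.sign(value) * np.maximum(np.abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iter=200, tol=1e-6, beta=None):
    """Minimize (1/2n)*||y - X @ beta||^2 + lam*||beta||_1."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features) if beta is None else beta.copy()
    col_sq = (X ** 2).sum(axis=0) / n_samples
    residual = y - X @ beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(n_features):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            old = beta[j]
            residual += X[:, j] * old            # remove feature j's contribution
            rho = X[:, j] @ residual / n_samples
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            residual -= X[:, j] * beta[j]        # add the updated contribution back
            max_change = max(max_change, abs(beta[j] - old))
        if max_change < tol:
            break
    return beta


def select_top_x(X, y, top_x, n_lambdas=30):
    """Relax the penalty until at least top_x coefficients are non-zero."""
    std = X.std(axis=0)
    std[std == 0] = 1.0
    X = (X - X.mean(axis=0)) / std
    y = y - y.mean()
    lam_max = np.max(np.abs(X.T @ y)) / len(y)   # smallest penalty giving beta == 0
    beta = np.zeros(X.shape[1])
    for lam in np.geomspace(lam_max, lam_max * 1e-3, n_lambdas):
        beta = lasso_coordinate_descent(X, y, lam, beta=beta)  # warm start
        if np.count_nonzero(beta) >= top_x:
            break
    # Keep the top_x features with the largest absolute coefficients.
    return np.argsort(-np.abs(beta))[:top_x]


def stable_top_x(X, y, top_x, n_repeats=20, seed=0):
    """Rank features by how often they are selected across random subsamples."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_repeats):
        idx = rng.choice(len(y), size=int(0.8 * len(y)), replace=False)
        counts[select_top_x(X[idx], y[idx], top_x)] += 1
    return np.argsort(-counts, kind="stable")[:top_x]


# Synthetic demo: 30 features, of which only columns 3, 7, and 11 drive y.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 30))
y_demo = (3.0 * X_demo[:, 3] - 2.5 * X_demo[:, 7] + 2.0 * X_demo[:, 11]
          + 0.1 * rng.normal(size=200))
print(stable_top_x(X_demo, y_demo, top_x=3, n_repeats=5))
```

Ranking by selection frequency across subsamples is one simple way to favor stable features. To avoid the data leakage mentioned in the abstract, any such selection step would need to run only on the training portion of each evaluation split, never on the full dataset.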
