Qiu Yixuan, Wang Jiebiao, Lei Jing, Roeder Kathryn
Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA.
Bioinformatics. 2021 Oct 11;37(19):3228-3234. doi: 10.1093/bioinformatics/btab257.
Marker genes, defined as genes expressed primarily in a single cell type, can be identified from single-cell transcriptome data; however, such data are not always available for the many applications of marker genes, such as deconvolution of bulk tissue. Marker genes for a given cell type are nonetheless highly correlated in bulk data, because their expression levels depend primarily on the proportion of that cell type in each sample. Therefore, when many tissue samples are analyzed, these marker genes can be identified from the correlation pattern.
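The correlation argument can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a toy mixture of two cell types, and the gene names, expression signatures, noise level, and sample count are all hypothetical values chosen for illustration. Each bulk measurement is modeled as the proportion-weighted sum of per-cell-type expression plus Gaussian noise.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def simulate_bulk(n_samples=200, seed=0):
    """Simulate bulk expression as a mixture of two cell types, A and B.

    All numbers here are made up for illustration.
    """
    rng = random.Random(seed)
    # (expression in cell type A, expression in cell type B)
    signatures = {
        "markerA1": (10.0, 0.2),     # expressed mostly in A
        "markerA2": (8.0, 0.1),      # expressed mostly in A
        "housekeeping": (5.0, 5.0),  # same level in both types
    }
    bulk = {gene: [] for gene in signatures}
    for _ in range(n_samples):
        prop_a = rng.uniform(0.1, 0.9)  # proportion of type A in this sample
        prop_b = 1.0 - prop_a
        for gene, (expr_a, expr_b) in signatures.items():
            noise = rng.gauss(0.0, 0.3)
            bulk[gene].append(prop_a * expr_a + prop_b * expr_b + noise)
    return bulk


bulk = simulate_bulk()
r_markers = pearson(bulk["markerA1"], bulk["markerA2"])
r_background = pearson(bulk["markerA1"], bulk["housekeeping"])
print("marker vs marker:", round(r_markers, 3))
print("marker vs housekeeping:", round(r_background, 3))
```

In this toy setting the two type-A markers track the type-A proportion and are therefore strongly correlated across samples, while the housekeeping gene, whose expression does not depend on the proportion, shows little correlation with either marker.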
To capitalize on this pattern, we develop a new semi-supervised algorithm that detects marker genes by combining published information about likely marker genes with bulk transcriptome data. The algorithm exploits the correlation structure of the bulk data to refine the published marker list, adding or removing genes as needed.
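The idea of refining a published list by adding or removing genes can be sketched with a deliberately simple heuristic. This is not the algorithm implemented in markerpen, whose details are not given in this abstract; it is a hypothetical stand-in that scores each gene by its average correlation with the current marker set and keeps those above a threshold. The function name refine_markers, the threshold, the gene names, and the simulated data are all assumptions made for this example.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def simulate_bulk(n_samples=200, seed=0):
    """Toy bulk data from two cell types; all values are hypothetical."""
    rng = random.Random(seed)
    # (expression in cell type A, expression in cell type B)
    signatures = {
        "A1": (10.0, 0.2), "A2": (8.0, 0.1),
        "A3": (9.0, 0.3), "A4": (7.0, 0.2),  # markers of type A
        "B1": (0.1, 9.0),                     # marker of type B
        "HK1": (5.0, 5.0), "HK2": (3.0, 3.0),  # not cell-type specific
    }
    bulk = {gene: [] for gene in signatures}
    for _ in range(n_samples):
        prop_a = rng.uniform(0.1, 0.9)
        for gene, (expr_a, expr_b) in signatures.items():
            mean = prop_a * expr_a + (1.0 - prop_a) * expr_b
            bulk[gene].append(mean + rng.gauss(0.0, 0.3))
    return bulk


def refine_markers(bulk, seed_markers, threshold=0.5, max_iter=10):
    """Simplified refinement heuristic (an assumption, not the paper's method).

    Repeatedly rebuild the marker set from every gene whose mean correlation
    with the current set (excluding itself) is at least `threshold`.
    """
    current = set(seed_markers)
    for _ in range(max_iter):
        updated = set()
        for gene, expr in bulk.items():
            others = [g for g in current if g != gene]
            if not others:
                continue
            score = sum(pearson(expr, bulk[g]) for g in others) / len(others)
            if score >= threshold:
                updated.add(gene)
        if not updated or updated == current:
            break  # converged, or nothing passed the threshold
        current = updated
    return current


bulk = simulate_bulk()
# Hypothetical "published" list: misses A4 and wrongly includes HK1.
refined = refine_markers(bulk, ["A1", "A2", "A3", "HK1"])
print(sorted(refined))
```

Starting from a seed list that is mostly correct, the heuristic drops the gene that does not co-vary with the rest and picks up the missing gene that does. It relies on the seed list being dominated by true markers, which mirrors the semi-supervised setting described above where prior marker information guides the search.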
We implement this method as the R package markerpen, hosted on CRAN (https://CRAN.R-project.org/package=markerpen).
Supplementary data are available at Bioinformatics online.