A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets.

Author affiliations

nference, One Main Street, Cambridge, MA 02142, USA.

nference Labs, Bengaluru, Karnataka 560017, India.

Publication information

Genes (Basel). 2021 Jun 10;12(6):898. doi: 10.3390/genes12060898.

Abstract

Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of over 20,000 human protein-coding genes and over 500 cell type terms across over 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann-Whitney p = 6.15 × 10^[…], r = 0.24; Cohen's d = 2.6). Building on this, we developed an augmented annotation algorithm (single cell Annotation via Literature Encoding, or scALE) that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 133 clusters from nine datasets of human breast, colon, heart, joint, ovary, prostate, skin, and small intestine tissues. With the optimized settings, the true cellular identity matched the top prediction in 59% of tested clusters and was present among the top five predictions for 91% of clusters. scALE slightly outperformed an existing method for reference-data-driven automated cluster annotation, and we demonstrate that integration of scALE can meaningfully improve the annotations derived from such methods. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of well-studied cell types, such as CLIC6 in retinal pigment epithelial cells and DNASE1L3 in endothelial cells. Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data.
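
The following is a minimal sketch of how literature-derived association scores could be used to rank candidate cell types for a cluster, and how top-1 and top-5 style hit rates like those quoted above can be computed. It is not the scALE algorithm, whose scoring and optimized settings are not described in this abstract. The scoring rule (mean association over a cluster's genes), the function names `rank_cell_types` and `top_k_accuracy`, and every number in `TOY_GCA` are assumptions made up for illustration.

```python
from statistics import mean

# Hypothetical toy association scores (gene -> cell type -> score).
# These values are invented for illustration and are NOT from the paper.
TOY_GCA = {
    "PECAM1": {"endothelial cell": 0.9, "T cell": 0.1, "fibroblast": 0.2},
    "VWF":    {"endothelial cell": 0.8, "T cell": 0.0, "fibroblast": 0.1},
    "CD3E":   {"endothelial cell": 0.0, "T cell": 0.9, "fibroblast": 0.0},
    "COL1A1": {"endothelial cell": 0.1, "T cell": 0.0, "fibroblast": 0.9},
}


def rank_cell_types(cluster_genes, gca):
    """Rank cell types by mean association score over the cluster's genes.

    Assumed scoring rule: a simple average. Genes missing from `gca` are ignored.
    """
    genes = [g for g in cluster_genes if g in gca]
    if not genes:
        return []
    cell_types = sorted({ct for g in genes for ct in gca[g]})
    scores = {ct: mean(gca[g].get(ct, 0.0) for g in genes) for ct in cell_types}
    # Highest mean score first; ties keep alphabetical order (sort is stable).
    return sorted(cell_types, key=lambda ct: scores[ct], reverse=True)


def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of clusters whose true label appears in the first k predictions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Three hypothetical clusters, each described by its top differentially expressed genes.
clusters = [["PECAM1", "VWF"], ["CD3E"], ["COL1A1", "PECAM1"]]
truths = ["endothelial cell", "T cell", "endothelial cell"]

predictions = [rank_cell_types(genes, TOY_GCA) for genes in clusters]
top1 = top_k_accuracy(predictions, truths, k=1)
top2 = top_k_accuracy(predictions, truths, k=2)
```

In this toy setup the third cluster is deliberately ambiguous, so its true label is missed at rank one but recovered at rank two. That is the same kind of gap as between the 59% top-prediction and 91% top-five figures reported above.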


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/38a6/8229796/ea0aa9c1d260/genes-12-00898-g001.jpg
