Center for Complex Biological Systems, University of California Irvine, Irvine, CA, 92697, USA.
Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran.
Commun Biol. 2022 Jun 7;5(1):556. doi: 10.1038/s42003-022-03528-0.
Non-coding RNAs (ncRNAs) make up a large portion of the mammalian genome, but their biological functions in cancer are poorly characterized. In this study, using a newly developed tool, SomaGene, we analyze de novo somatic point mutations from the International Cancer Genome Consortium (ICGC) whole-genome sequencing data of 1,855 breast cancer samples. We identify 1,030 candidate ncRNAs that are significantly and explicitly mutated in breast cancer samples. By integrating data from the ENCODE regulatory features and the FANTOM5 expression atlas, we show that the candidate ncRNAs are significantly enriched for active chromatin histone marks (1.9-fold), CTCF binding sites (2.45-fold), DNase accessibility (1.76-fold), HMM-predicted enhancers (2.26-fold) and eQTL polymorphisms (1.77-fold). Importantly, we show that the 1,030 ncRNAs contain a much higher level (3.64-fold) of breast cancer-associated genome-wide association study (GWAS) single nucleotide polymorphisms (SNPs) than the genome-wide expectation. No such enrichment is seen with GWAS SNPs from other cancers. Using Hi-C data from breast-related cell lines, we then show that 82% of our candidate ncRNAs (1.9-fold) interact significantly with the promoters of protein-coding genes, including previously known cancer-associated genes, suggesting a critical role for the candidate ncRNA genes in the activation of essential regulators of development and differentiation in breast cancer. We provide an extensive web-based resource (https://www.ihealthe.unsw.edu.au/research) to share our results with the research community. Our list of breast cancer-specific ncRNA genes has the potential to improve understanding of the underlying genetic causes of breast cancer. Lastly, the tool developed in this study can be used to analyze somatic mutations in all cancers.
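The fold values quoted in the abstract compare an observed rate against a genome-wide expectation. The abstract does not state how SomaGene computes them, so the following is only a minimal sketch of one common way to obtain such a ratio, assuming that enrichment is the fraction of candidate regions overlapping a feature divided by the same fraction for a background set. The names `Interval`, `overlaps`, `fraction_overlapping` and `fold_enrichment`, and the toy coordinates, are hypothetical and are not taken from the study.

```python
from typing import List, Tuple

# Hypothetical representation: (chromosome, start, end), half-open coordinates.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_overlapping(regions: List[Interval], features: List[Interval]) -> float:
    """Fraction of regions that overlap at least one feature."""
    if not regions:
        raise ValueError("regions must not be empty")
    hits = sum(1 for r in regions if any(overlaps(r, f) for f in features))
    return hits / len(regions)


def fold_enrichment(
    candidates: List[Interval],
    background: List[Interval],
    features: List[Interval],
) -> float:
    """Observed overlap fraction in candidates divided by the background fraction.

    This is an assumed definition for illustration, not the study's method.
    """
    expected = fraction_overlapping(background, features)
    if expected == 0:
        raise ValueError("no background region overlaps the feature set")
    return fraction_overlapping(candidates, features) / expected


# Toy data: both candidates overlap a feature, but only half of the background does.
candidates = [("chr1", 100, 200), ("chr1", 500, 600)]
background = candidates + [("chr1", 1000, 1100), ("chr2", 100, 200)]
features = [("chr1", 150, 160), ("chr1", 550, 560)]

print(fold_enrichment(candidates, background, features))
```

A ratio above 1 means the candidate regions overlap the feature more often than the background does. A real analysis would also need a significance test and an indexed interval structure in place of this quadratic scan.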