School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, 710048, People's Republic of China.
Shaanxi Engineering Research Center of Medical and Health Big Data, School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, 710048, People's Republic of China.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):82. doi: 10.1186/s12859-020-3349-5.
Genomic micro-satellites are genomic regions that consist of short, repetitive DNA motifs. Estimating the length distribution and state of a micro-satellite region is an important computational step in cancer sequencing data pipelines, and it is expected to facilitate downstream analysis and clinical decision support. Although several state-of-the-art approaches have been proposed to identify micro-satellite instability (MSI) events, they are limited in handling regions longer than one read length. Moreover, to the best of our knowledge, all of these approaches implicitly assume that the tumor purity of the sequenced samples is sufficiently high. This assumption is inconsistent with reality: in a sample of low purity, reads from normal cells dilute the tumor signal in the inferred length distribution, which introduces false positive errors.
In this article, we propose a computational approach, named ELMSI, which detects MSI events from next-generation sequencing data. ELMSI estimates the length distributions and states of micro-satellite regions from a mixed tumor sample paired with a control sample. It first estimates the purity of the tumor sample from the read counts at filtered SNV loci. The algorithm then identifies the length distributions and states of short micro-satellites by adding a Maximum Likelihood Estimation (MLE) step to the existing algorithm. After that, ELMSI infers the length distributions of long micro-satellites by combining a simplified Expectation Maximization (EM) algorithm with the central limit theorem, and uses statistical tests to output the states of these micro-satellites. In our experiments, ELMSI was able to handle micro-satellites with lengths ranging from shorter than one read length up to 10 kbp.
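The purity step and its effect on a length distribution can be illustrated with a minimal Python sketch. It is not the ELMSI implementation and does not reproduce its MLE or EM steps. It assumes, for illustration only, that every filtered SNV is clonal, heterozygous and located in a diploid region, so that the expected variant allele fraction equals half the purity, and that the observed length distribution is a linear mixture of the tumor and control distributions. The function names `estimate_purity` and `unmix_length_distribution` are hypothetical.

```python
from typing import Dict, Sequence


def estimate_purity(alt_counts: Sequence[int], depths: Sequence[int]) -> float:
    """Pooled binomial MLE of tumor purity from SNV read counts.

    Assumption (not from the paper): each SNV is clonal, heterozygous and
    diploid, so alt reads ~ Binomial(depth, purity / 2).
    """
    total_alt = sum(alt_counts)
    total_depth = sum(depths)
    if total_depth == 0:
        raise ValueError("no reads cover the SNV loci")
    # The MLE of the shared allele fraction is total_alt / total_depth;
    # purity is twice that, capped at 1.
    return min(1.0, 2.0 * total_alt / total_depth)


def unmix_length_distribution(
    mixed: Dict[int, float], normal: Dict[int, float], purity: float
) -> Dict[int, float]:
    """Recover a tumor length distribution from a mixed sample.

    Assumed model: mixed = purity * tumor + (1 - purity) * normal.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    lengths = set(mixed) | set(normal)
    # Subtract the normal-cell contribution, then clip negative values
    # caused by sampling noise.
    clipped = {
        length: max(
            0.0,
            (mixed.get(length, 0.0) - (1.0 - purity) * normal.get(length, 0.0))
            / purity,
        )
        for length in lengths
    }
    total = sum(clipped.values())
    if total == 0.0:
        raise ValueError("no tumor signal left after removing the normal component")
    # Renormalize so the result is a probability distribution.
    return {length: value / total for length, value in clipped.items()}


if __name__ == "__main__":
    purity = estimate_purity([30, 20, 25], [100, 100, 100])
    mixed = {10: 0.75, 8: 0.25}   # observed in the impure tumor sample
    normal = {10: 1.0}            # observed in the paired control
    print(purity, unmix_length_distribution(mixed, normal, purity))
```

In this toy input, a quarter of the reads in the mixed sample show the shortened allele, whereas after correcting for a purity of 0.5 the tumor component carries it in half of its reads. This is the dilution effect that a high-purity assumption would miss.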
To verify the reliability of our algorithm, we first compared its ability to classify shorter micro-satellites in mixed samples with that of the existing algorithm MSIsensor. We also varied the number of micro-satellite regions, the read length and the sequencing coverage to separately test the performance of ELMSI in estimating longer micro-satellites from mixed samples. ELMSI performed well on mixed samples, and is therefore valuable for improving the recognition of micro-satellite regions and for supporting clinical decision-making. The source code has been uploaded and is maintained at https://github.com/YixuanWang1120/ELMSI for academic use only.