Estimating Phred scores of Illumina base calls by logistic regression and sparse modeling.

Authors

Zhang Sheng, Wang Bo, Wan Lin, Li Lei M

Affiliations

National Center of Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China.

University of Chinese Academy of Sciences, Beijing, 100049, China.

Publication information

BMC Bioinformatics. 2017 Jul 11;18(1):335. doi: 10.1186/s12859-017-1743-4.

DOI:10.1186/s12859-017-1743-4
PMID:28697757
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5504792/
Abstract

BACKGROUND

Phred quality scores are essential for downstream DNA analyses such as SNP detection and DNA assembly, so a valid model to define them is indispensable for any base-calling software. Recently, we developed the base-caller 3Dec for Illumina sequencing platforms, which reduces base-calling errors by 44-69% compared with existing base-callers. However, the model used to predict its quality scores has not yet been fully investigated.
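For readers unfamiliar with the scale, a Phred quality score Q is conventionally related to the estimated probability p that a base call is wrong by Q = -10 log10(p). The following is a minimal sketch of that conversion; the function names are illustrative and are not taken from 3Dec or any Illumina software.

```python
import math


def phred_from_error_prob(p: float) -> float:
    """Convert an estimated error probability to a Phred score, Q = -10*log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def error_prob_from_phred(q: float) -> float:
    """Invert the Phred transform: p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


# Print the error probability implied by a few typical scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 error in {round(1 / error_prob_from_phred(q))} calls")
```

Each step of 10 on this scale corresponds to a tenfold change in the estimated error probability.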

RESULTS

In this study, we used logistic regression models to estimate quality scores from predictive features, which include different aspects of the sequencing signals as well as local DNA content. Sparse models were then obtained by three methods: backward deletion with either AIC or BIC, and the L1 regularization learning method. The L1-regularized model was then compared with the Illumina scoring method.
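The general idea of an L1-regularized logistic model for quality scoring can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the three features and the toy data are hypothetical, and the fit uses plain proximal gradient descent (soft-thresholding) rather than whatever solver the paper used. The model predicts the probability that a call is wrong, which is then mapped to the Phred scale.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w: float, t: float) -> float:
    """Proximal operator of t*|w|: shrink toward zero and clip at zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.02, step=1.0, n_iter=300):
    """Minimize mean logistic loss + lam*||w||_1 (intercept not penalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            r = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += r
            for j in range(d):
                grad_w[j] += r * xi[j]
        b -= step * grad_b / n
        # Gradient step on the smooth loss, then soft-threshold for the L1 term.
        w = [soft_threshold(wj - step * gj / n, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


def predicted_phred(w, b, x) -> float:
    """Map the predicted miscall probability to a Phred score."""
    p_error = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
    return -10.0 * math.log10(p_error)


def make_toy_data(n=500, seed=0):
    """Hypothetical data: feature 0 drives miscalls, features 1-2 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(3)]
        p = sigmoid(-1.0 + 3.0 * x[0])
        X.append(x)
        y.append(1 if rng.random() < p else 0)  # 1 = miscalled base
    return X, y


X, y = make_toy_data()
w, b = fit_l1_logistic(X, y)
print("weights:", [round(v, 3) for v in w], "intercept:", round(b, 3))
print("Q, clean signal:", round(predicted_phred(w, b, [-1.0, 0.0, 0.0]), 1))
print("Q, ambiguous signal:", round(predicted_phred(w, b, [1.0, 0.0, 0.0]), 1))
```

The penalty weight `lam` controls sparsity: larger values push more coefficients to exactly zero, which is how the L1 approach selects features, whereas backward deletion removes them one at a time by comparing AIC or BIC.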

CONCLUSIONS

Compared with the Illumina scoring method, the L1-regularized logistic regression improves the empirical discrimination power by as much as 14% and 25%, respectively, for two kinds of preprocessed sequencing signals. That is, the L1 method identifies more base calls of high fidelity. Computationally, the L1 method can handle large datasets and is efficient enough for daily sequencing. Meanwhile, the logistic model resulting from BIC is more interpretable. The modeling suggested that the most prominent quenching pattern in the current Illumina chemistry occurs at the dinucleotide "GT". In addition, nucleotides were more likely to be miscalled as the previous base if the preceding base was not "G". This suggests that the phasing effect of bases after "G" differs somewhat from that after other nucleotide types.
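The abstract does not define "empirical discrimination power". One definition used in the Phred literature is the largest fraction of base calls that can be selected by thresholding the quality score while keeping the observed error rate of the selected set at or below a target. The sketch below assumes that definition; whether it matches the paper's exact computation is an assumption, and the scores and error labels are made-up toy values.

```python
def discrimination_power(scores, is_error, max_error_rate):
    """Largest fraction of calls with score >= some threshold whose
    observed error rate does not exceed max_error_rate."""
    paired = sorted(zip(scores, is_error), key=lambda t: -t[0])
    n = len(paired)
    best = 0
    errors = 0
    for i, (q, err) in enumerate(paired, start=1):
        errors += int(err)
        # Only evaluate where the score changes, so tied calls stay together.
        at_boundary = i == n or paired[i][0] != q
        if at_boundary and errors / i <= max_error_rate:
            best = i
    return best / n


# Toy example: eight base calls with their scores and true error flags.
scores = [40, 40, 35, 30, 30, 20, 10, 10]
is_error = [0, 0, 0, 0, 1, 0, 1, 1]
for rate in (0.0, 0.2, 0.5):
    print(rate, discrimination_power(scores, is_error, rate))
```

Under this definition, a scoring method with higher discrimination power at a given error rate retains more base calls at that level of accuracy, which is the sense in which the abstract says the L1 method "identifies more base calls of high fidelity".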

Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e2ac/5504792/dfd07d087654/12859_2017_1743_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e2ac/5504792/fe014184d7f3/12859_2017_1743_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e2ac/5504792/5bbe32165a4d/12859_2017_1743_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e2ac/5504792/cc2ea0c716a0/12859_2017_1743_Fig4_HTML.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e2ac/5504792/f4ec95cdf268/12859_2017_1743_Fig5_HTML.jpg
Figure 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e2ac/5504792/fd806e5d27b4/12859_2017_1743_Fig6_HTML.jpg
Figure 7: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e2ac/5504792/7900a254a327/12859_2017_1743_Fig7_HTML.jpg
