更好的用于单体型组装的 ILP 模型。

Better ILP models for haplotype assembly.

机构信息

Department of Applied Mathematics, Faculty of Mathematical Sciences, University of Guilan, Rasht, 41938-33697, Iran.

Division of Information System Design, Tokyo Denki University, Saitama, 350-0394, Japan.

出版信息

BMC Bioinformatics. 2018 Feb 19;19(Suppl 1):52. doi: 10.1186/s12859-018-2012-x.

DOI:10.1186/s12859-018-2012-x

PMID:29504891

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC6389102/

Abstract

BACKGROUND

The haplotype assembly problem for diploid is to find a pair of haplotypes from a given set of aligned Single Nucleotide Polymorphism (SNP) fragments (reads). It has many applications in association studies, drug design, and genetic research. Since this problem is computationally hard, both heuristic and exact algorithms have been designed for it. Although exact algorithms are much slower, they are still of great interest because they usually output significantly better solutions than heuristic algorithms in terms of popular measures such as the Minimum Error Correction (MEC) score, the number of switch errors, and the QAN50 score. Exact algorithms are also valuable because they can be used to witness how good a heuristic algorithm is. The best known exact algorithm is based on integer linear programming (ILP) and it is known that ILP can also be used to improve the output quality of every heuristic algorithm with a little decline in speed. Therefore, faster ILP models for the problem are highly demanded.

RESULTS

As in previous studies, we consider not only the general case of the problem but also its all-heterozygous case where we assume that if a column of the input read matrix contains at least one 0 and one 1, then it corresponds to a heterozygous SNP site. For both cases, we design new ILP models for the haplotype assembly problem which aim at minimizing the MEC score. The new models are theoretically better because they contain significantly fewer constraints. More importantly, our experimental results show that for both simulated and real datasets, the new model for the all-heterozygous (respectively, general) case can usually be solved via CPLEX (an ILP solver) at least 5 times (respectively, twice) faster than the previous bests. Indeed, the running time can sometimes be 41 times better.

CONCLUSIONS

This paper proposes a new ILP model for the haplotype assembly problem and its all-heterozygous case, respectively. Experiments with both real and simulated datasets show that the new models can be solved within much shorter time by CPLEX than the previous bests. We believe that the models can be used to improve heuristic algorithms as well.

摘要

背景

对于二倍体，单体型组装问题是指从给定的对齐单核苷酸多态性（SNP）片段（reads）中找到一对单体型。它在关联研究、药物设计和遗传研究中有许多应用。由于这个问题在计算上很难，因此已经设计了启发式和精确算法来解决它。虽然精确算法要慢得多，但它们仍然很有意义，因为它们通常在流行的度量标准（如最小纠错分数（MEC）、开关错误数和 QAN50 分数）方面输出明显更好的解决方案，而启发式算法则不能。精确算法也很有价值，因为它们可以用来证明启发式算法有多好。最著名的精确算法是基于整数线性规划（ILP）的，并且已知 ILP 也可以用于提高每个启发式算法的输出质量，而速度略有下降。因此，对于该问题，更快的 ILP 模型是非常需要的。

结果

与以前的研究一样，我们不仅考虑了问题的一般情况，还考虑了其全杂合情况，即我们假设如果输入读矩阵的某一列至少包含一个 0 和一个 1，则它对应于杂合 SNP 位点。对于这两种情况，我们为单体型组装问题设计了新的 ILP 模型，旨在最小化 MEC 分数。新模型在理论上更好，因为它们包含的约束明显更少。更重要的是，我们的实验结果表明，对于模拟和真实数据集，全杂合（分别为一般）情况下的新模型通常可以通过 CPLEX（ILP 求解器）求解，速度至少快 5 倍（分别为 2 倍）。实际上，运行时间有时可以快 41 倍。

结论

本文分别为单体型组装问题及其全杂合情况提出了一个新的 ILP 模型。使用真实和模拟数据集的实验表明，新模型可以通过 CPLEX 在更短的时间内解决，比以前的最佳模型快得多。我们相信这些模型也可以用于改进启发式算法。

相似文献

Better ILP models for haplotype assembly.更好的用于单体型组装的 ILP 模型。

BMC Bioinformatics. 2018 Feb 19;19(Suppl 1):52. doi: 10.1186/s12859-018-2012-x.

Better ILP-Based Approaches to Haplotype Assembly.基于整数线性规划的单倍型组装的更好方法。

J Comput Biol. 2016 Jul;23(7):537-52. doi: 10.1089/cmb.2015.0035.

Exact algorithms for haplotype assembly from whole-genome sequence data.全基因组序列数据中单体型组装的精确算法。

Bioinformatics. 2013 Aug 15;29(16):1938-45. doi: 10.1093/bioinformatics/btt349. Epub 2013 Jun 18.

Computing the minimum recombinant haplotype configuration from incomplete genotype data on a pedigree by integer linear programming.通过整数线性规划从家系的不完整基因型数据计算最小重组单倍型构型。

J Comput Biol. 2005 Jul-Aug;12(6):719-39. doi: 10.1089/cmb.2005.12.719.

Haplotype reconstruction from SNP fragments by minimum error correction.通过最小错误校正从单核苷酸多态性（SNP）片段进行单倍型重建。

Bioinformatics. 2005 May 15;21(10):2456-62. doi: 10.1093/bioinformatics/bti352. Epub 2005 Feb 24.

A highly accurate heuristic algorithm for the haplotype assembly problem.一种用于单体型组装问题的高度精确启发式算法。

BMC Genomics. 2013;14 Suppl 2(Suppl 2):S2. doi: 10.1186/1471-2164-14-S2-S2. Epub 2013 Feb 15.

GenHap: a novel computational method based on genetic algorithms for haplotype assembly.GenHap：一种基于遗传算法的新型单倍型组装计算方法。

BMC Bioinformatics. 2019 Apr 18;20(Suppl 4):172. doi: 10.1186/s12859-019-2691-y.

On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes.关于二倍体和多倍体基因组中单倍型组装的最小错误校正问题

J Comput Biol. 2016 Sep;23(9):718-36. doi: 10.1089/cmb.2015.0220. Epub 2016 Jun 9.

Haplotype assembly from aligned weighted SNP fragments.基于比对加权单核苷酸多态性片段的单倍型组装

Comput Biol Chem. 2005 Aug;29(4):281-7. doi: 10.1016/j.compbiolchem.2005.05.001.

Inferring haplotypes from genotypes on a pedigree with mutations, genotyping errors and missing alleles.在存在突变、基因分型错误和等位基因缺失的家系中从基因型推断单倍型。

J Bioinform Comput Biol. 2011 Apr;9(2):339-65. doi: 10.1142/s0219720011005549.

引用本文的文献

Haplotype assembly of autotetraploid potato using integer linear programing.利用整数线性规划进行同源四倍体马铃薯的单体型组装。

Bioinformatics. 2019 Sep 15;35(18):3279-3286. doi: 10.1093/bioinformatics/btz060.

本文引用的文献

Better ILP-Based Approaches to Haplotype Assembly.基于整数线性规划的单倍型组装的更好方法。

J Comput Biol. 2016 Jul;23(7):537-52. doi: 10.1089/cmb.2015.0035.

Joint haplotype assembly and genotype calling via sequential Monte Carlo algorithm.通过序贯蒙特卡罗算法进行联合单倍型组装和基因型分型

BMC Bioinformatics. 2015 Jul 16;16:223. doi: 10.1186/s12859-015-0651-8.

De novo assembly of a haplotype-resolved human genome.从头组装具有单倍型分辨率的人类基因组。

Nat Biotechnol. 2015 Jun;33(6):617-22. doi: 10.1038/nbt.3200. Epub 2015 May 25.

SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming.SDhaP：通过半定规划进行二倍体和多倍体的单倍型组装。

BMC Genomics. 2015 Apr 3;16(1):260. doi: 10.1186/s12864-015-1408-5.

Maximum likelihood model based on minor allele frequencies and weighted Max-SAT formulation for haplotype assembly.基于次要等位基因频率的最大似然模型和用于单倍型组装的加权最大可满足性公式

J Theor Biol. 2014 Jun 7;350:49-56. doi: 10.1016/j.jtbi.2014.01.036. Epub 2014 Jan 31.

A haplotype map of genomic variations and genome-wide association studies of agronomic traits in foxtail millet (Setaria italica).一个基因组变异的单体型图谱和谷子（Setaria italica）农艺性状的全基因组关联研究。

Nat Genet. 2013 Aug;45(8):957-61. doi: 10.1038/ng.2673. Epub 2013 Jun 23.

Exact algorithms for haplotype assembly from whole-genome sequence data.全基因组序列数据中单体型组装的精确算法。

Bioinformatics. 2013 Aug 15;29(16):1938-45. doi: 10.1093/bioinformatics/btt349. Epub 2013 Jun 18.

Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques.基于 fosmid 的 HapMap 三亲子个体全基因组单体型分析：单一个体单体型分析技术的评估。

Nucleic Acids Res. 2012 Mar;40(5):2041-53. doi: 10.1093/nar/gkr1042. Epub 2011 Nov 18.

A comprehensively molecular haplotype-resolved genome of a European individual.一个欧洲个体的全面分子单体型解析基因组。

Genome Res. 2011 Oct;21(10):1672-85. doi: 10.1101/gr.125047.111. Epub 2011 Aug 3.

Maternal plasma DNA sequencing reveals the genome-wide genetic and mutational profile of the fetus.母体外周血游离 DNA 测序揭示胎儿全基因组遗传和突变特征。

Sci Transl Med. 2010 Dec 8;2(61):61ra91. doi: 10.1126/scitranslmed.3001720.

文献检索

告别复杂PubMed语法，用中文像聊天一样搜索，搜遍4000万医学文献。AI智能推荐，让科研检索更轻松。

立即免费搜索

文件翻译

保留排版，准确专业，支持PDF/Word/PPT等文件格式，支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述，25分钟生成高质量综述，智能提取关键信息，辅助科研写作。

立即免费体验