IEEE/ACM Trans Comput Biol Bioinform. 2018 Jul-Aug;15(4):1231-1238. doi: 10.1109/TCBB.2015.2509997. Epub 2015 Dec 17.
Rapid progress in the fields of phylogenomics and population genomics has driven increases in both the size of multi-genomic datasets and the number and complexity of genome-wide analyses. We present the Multisample Variant Format (MVF), specifically designed to store multiple sequence alignments for phylogenomics and population genomic analysis. The signature feature of MVF is a distinctive encoding of aligned sites with specific biological information content (e.g., invariant, low-coverage). This biological pattern-based encoding of sequence data allows for rapid filtering and quality control of data and speeds up computation for many analyses. Similar to other modern formats, MVF has a simple data structure and a flexible header structure to accommodate project metadata, allowing it to serve also as an effective data publication and sharing format. We also propose several variants of the MVF format to accommodate protein and codon alignments, quality scores, and a mix of de novo and reference-aligned data. Using the MVFtools package, files in other common sequence formats can be converted to MVF. MVFtools completes tasks ranging from simple transformation and filtering operations to complex genome-wide visualizations in only a few minutes, even on large datasets. In addition to presenting MVF and MVFtools, we discuss the broader concept of using biological principles and patterns to inform sequence data encoding, and its application both in MVF and in other existing data formats.
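The idea of encoding aligned sites by their biological pattern can be illustrated with a minimal Python sketch. It is not the MVF specification or the MVFtools API: the category names, the `MISSING` character set, the `min_called` threshold, and the single-character shorthand for invariant sites are all assumptions made for illustration. Each alignment column (one character per sample) is labeled by pattern, and invariant columns are stored as a single character, so filtering can act on the label or the encoded length instead of re-scanning every sample.

```python
# Hypothetical pattern-based site encoding; NOT the actual MVF specification.
MISSING = {"-", "N", "X"}  # assumption: characters treated as missing data


def classify_site(column: str, min_called: int = 2) -> str:
    """Label one alignment column (one character per sample) by its pattern."""
    column = column.upper()
    called = [base for base in column if base not in MISSING]
    if len(called) < min_called:
        return "low_coverage"
    if len(set(called)) == 1:
        # Every called sample carries the same base.
        if len(called) == len(column):
            return "invariant"
        return "invariant_with_missing"
    return "variable"


def encode_site(column: str) -> str:
    """Store invariant columns as one character; keep other columns in full."""
    column = column.upper()
    if classify_site(column) == "invariant":
        return column[0]
    return column


def decode_site(encoded: str, n_samples: int) -> str:
    """Expand an encoded column back to one character per sample."""
    if len(encoded) == 1:
        return encoded * n_samples
    return encoded


def variable_sites(columns):
    """Example filter: keep only the columns labeled as variable."""
    return [col for col in columns if classify_site(col) == "variable"]


if __name__ == "__main__":
    alignment_columns = ["AAAA", "ACAA", "A--N", "GGGG", "TT-T"]
    for col in alignment_columns:
        print(col, classify_site(col), encode_site(col))
```

In this sketch, a dataset dominated by invariant sites shrinks to roughly one character per site, and a quality-control pass such as `variable_sites` only needs the per-site label. The real format defines its own encodings and header, which should be taken from the MVFtools documentation.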