Search for SINE repeats in the rice genome using correlation-based position weight matrices.

Affiliation

Research Center of Biotechnology of the Russian Academy of Sciences, 60 let Oktjabrja pr-t, 7, bld. 1, Moscow, Russia.

Publication information

BMC Bioinformatics. 2021 Feb 2;22(1):42. doi: 10.1186/s12859-021-03977-0.

DOI:10.1186/s12859-021-03977-0

PMID:33530928

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7852121/

Abstract

BACKGROUND

Transposable elements (TEs) constitute a significant part of eukaryotic genomes. Short interspersed nuclear elements (SINEs) are non-autonomous TEs, which are widely represented in mammalian genomes and also found in plants. After insertion in a new position in the genome, TEs quickly accumulate mutations, which complicate their identification and annotation by modern bioinformatics methods. In this study, we searched for highly divergent SINE copies in the genome of rice (Oryza sativa subsp. japonica) using the Highly Divergent Repeat Search Method (HDRSM).

RESULTS

The HDRSM considers correlations of neighboring symbols to construct a position weight matrix (PWM) for a SINE family, which is then used to search for new copies. To evaluate the accuracy of the method and compare it with the RepeatMasker program, we generated a set of SINE copies containing nucleotide substitutions and indels and inserted them into an artificial chromosome for analysis. The HDRSM showed better results both in the number of inserted repeats identified and in the accuracy of determining their boundaries. A search for copies of 39 SINE families in the rice genome produced 14,030 hits; among them, 5,704 were not detected by RepeatMasker.
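To make the idea of a PWM that accounts for neighboring symbols more concrete, the following is a minimal sketch, not the HDRSM itself. It assumes a gapless, equal-length alignment of family members, a uniform background over the 16 dinucleotides, and a simple pseudocount; the function names (`build_pair_pwm`, `score_window`, `best_hit`) are hypothetical and chosen only for illustration. The abstract does not describe how the HDRSM handles indels or assesses statistical significance, so neither is modeled here.

```python
import math
from itertools import product

ALPHABET = "ACGT"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 16 dinucleotides


def build_pair_pwm(aligned, pseudocount=1.0):
    """Log-odds weights for each adjacent nucleotide pair at each position.

    Assumption: `aligned` holds gapless sequences of identical length and
    the background is uniform (1/16 per pair).
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(PAIRS)
    pwm = []
    for i in range(length - 1):
        counts = {p: pseudocount for p in PAIRS}
        for s in aligned:
            pair = s[i:i + 2]
            if pair in counts:  # skip pairs with ambiguous symbols such as N
                counts[pair] += 1
        total = sum(counts.values())
        pwm.append({p: math.log2(counts[p] / total / background) for p in PAIRS})
    return pwm


def score_window(pwm, window):
    """Sum the pair weights along a window of length len(pwm) + 1."""
    return sum(col.get(window[i:i + 2], 0.0) for i, col in enumerate(pwm))


def best_hit(pwm, sequence):
    """Return (start, score) of the highest-scoring window in `sequence`."""
    width = len(pwm) + 1
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    starts = range(len(sequence) - width + 1)
    best = max(starts, key=lambda s: score_window(pwm, sequence[s:s + width]))
    return best, score_window(pwm, sequence[best:best + width])


# Toy family of four aligned copies, one of which carries a substitution.
family = ["ACGTAC", "ACGTAC", "ACGAAC", "ACGTAC"]
pwm = build_pair_pwm(family)
start, score = best_hit(pwm, "TTTTACGTACTTTT")
```

Scoring pairs rather than single bases lets the matrix reward combinations of adjacent nucleotides that co-occur in the family, which is the general intuition behind using neighbor correlations; a real search would additionally need gap handling and a significance estimate.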

CONCLUSIONS

The HDRSM could find divergent SINE copies, correctly determine their boundaries, and offer a high level of statistical significance. We also found that RepeatMasker is able to find relatively short copies of the SINE families with a higher level of similarity, while HDRSM is able to find more diverged copies. To obtain a comprehensive profile of SINE distribution in the genome, combined application of the HDRSM and RepeatMasker is recommended.
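As an illustration of how the output of two tools might be combined, the sketch below compares two hit lists by coordinate overlap. It is an assumption-laden example rather than the procedure used in the study: hits are represented as hypothetical `(chromosome, start, end)` tuples with half-open coordinates, "not detected" is taken to mean "overlaps no hit from the other tool by even one base", and the coordinates are invented for the example.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if half-open intervals (start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def unique_to_first(first, second):
    """Hits in `first` that overlap no hit in `second` on the same chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in second:
        by_chrom[chrom].append((start, end))
    return [
        (chrom, start, end)
        for chrom, start, end in first
        if not any(overlaps((start, end), other) for other in by_chrom[chrom])
    ]


def merged_union(first, second):
    """Union of both hit lists, merging overlapping or touching intervals."""
    merged = []
    for chrom, start, end in sorted(list(first) + list(second)):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Invented coordinates, for illustration only.
hdrsm_hits = [("chr1", 100, 250), ("chr1", 400, 520), ("chr2", 10, 90)]
repeatmasker_hits = [("chr1", 120, 200), ("chr2", 300, 380)]

only_hdrsm = unique_to_first(hdrsm_hits, repeatmasker_hits)
combined = merged_union(hdrsm_hits, repeatmasker_hits)
```

The quadratic overlap check is adequate for small examples; for genome-scale hit lists a sorted sweep or an interval tree would be a more practical choice, and a stricter overlap threshold (for example, a minimum fraction of the hit length) may be more appropriate than a single shared base.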
