Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan.
Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Chiba, Japan.
Nucleic Acids Res. 2021 Apr 6;49(6):3139-3155. doi: 10.1093/nar/gkab139.
Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, no strategy yet exists for classifying non-occurring sequences as potentially malicious or benign. In this work, using Markovian models with multiple-testing correction, we identify significant absent oligomers, that is, oligomers that are statistically expected to occur. Their absence therefore suggests negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant MAWs of mammalian genomes are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than a quarter of human proteins are one substitution away from containing a significant MAW, and the majority of these replacements are predicted to be harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes.
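To make the definition concrete, the following is a minimal sketch of how MAWs of a fixed length could be enumerated and scored. It is not the authors' pipeline. It assumes a single strand, a word length of at least 3 for scoring, the standard maximal-order Markov estimator for the expected count, a Poisson approximation for the probability of observing zero occurrences, and a Bonferroni correction over the MAWs tested. The function names (`kmer_counts`, `minimal_absent_words`, `expected_count`, `significant_maws`) and the `alpha` threshold are hypothetical choices made for illustration.

```python
from collections import Counter
from math import exp


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def minimal_absent_words(seq, k, alphabet="ACGT"):
    """Words of length k (k >= 2) that are absent from seq while their
    longest proper prefix and longest proper suffix are both present."""
    present_k = set(kmer_counts(seq, k))
    present_km1 = set(kmer_counts(seq, k - 1))
    maws = []
    for prefix in present_km1:
        for letter in alphabet:
            word = prefix + letter
            if word not in present_k and word[1:] in present_km1:
                maws.append(word)
    return sorted(maws)


def expected_count(seq, word):
    """Expected occurrences of word (length >= 3) under a Markov model of
    order len(word) - 2: N(prefix) * N(suffix) / N(interior).
    Assumption: this common estimator stands in for the paper's model."""
    k = len(word)
    n_prefix = kmer_counts(seq, k - 1)[word[:-1]]
    n_suffix = kmer_counts(seq, k - 1)[word[1:]]
    n_interior = kmer_counts(seq, k - 2)[word[1:-1]]
    return n_prefix * n_suffix / n_interior


def significant_maws(seq, k, alphabet="ACGT", alpha=0.05):
    """MAWs whose absence is surprising after Bonferroni correction.
    Assumption: P(zero occurrences) is approximated by exp(-expected)."""
    maws = minimal_absent_words(seq, k, alphabet)
    results = []
    for word in maws:
        expected = expected_count(seq, word)
        p_value = exp(-expected)
        if p_value * len(maws) <= alpha:
            results.append((word, expected, p_value))
    return results
```

On the toy sequence `ABAAB` over the alphabet `AB`, `minimal_absent_words("ABAAB", 3, "AB")` returns `["AAA", "BAB"]`. Neither is significant, because the expected counts (1/3 and 2/3) are far too small for an absence to be surprising. On genome-scale input, a suffix array or similar index would be needed in place of the k-mer dictionaries used here.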