WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences.

Author information

Glidden-Handgis George, Wheeler Travis J

Affiliations

R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ 85721, United States.

Publication information

Bioinform Adv. 2024 Apr 8;4(1):vbae052. doi: 10.1093/bioadv/vbae052. eCollection 2024.

Abstract

BACKGROUND

Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match's score by chance. E-values accurately predict false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often develop benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats, while disrupting the original sequence's functional properties. However, we and others have observed that sequences appear to produce high scoring alignments to their reversals with surprising frequency, leading to overstatement of false match risk that may negatively affect downstream analysis.
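As a rough illustration of why reversed sequences are attractive decoys, the sketch below builds a reversed decoy and a shuffled decoy from one sequence and checks what each preserves. The function names (`reversed_decoy`, `shuffled_decoy`, `longest_run`) and the example sequence are hypothetical, chosen for this illustration only; they do not come from the paper or from any particular tool.

```python
import random
from collections import Counter


def reversed_decoy(seq: str) -> str:
    """Reverse the sequence: letter counts and local repeats are preserved."""
    return seq[::-1]


def shuffled_decoy(seq: str, seed: int = 0) -> str:
    """Shuffle the letters: counts are preserved, local structure is not."""
    letters = list(seq)
    random.Random(seed).shuffle(letters)
    return "".join(letters)


def longest_run(seq: str) -> int:
    """Length of the longest run of one repeated letter (a crude repeat measure)."""
    best = run = 0
    prev = None
    for ch in seq:
        run = run + 1 if ch == prev else 1
        best = max(best, run)
        prev = ch
    return best


if __name__ == "__main__":
    seq = "MKTAAAAAALLQQQQSPEG"  # made-up sequence with a low-complexity stretch
    rev = reversed_decoy(seq)
    shuf = shuffled_decoy(seq)
    print(Counter(rev) == Counter(seq), Counter(shuf) == Counter(seq))
    print(longest_run(seq), longest_run(rev), longest_run(shuf))
```

Both decoys keep the letter composition, but only the reversal is guaranteed to keep the low-complexity run intact, which is the property the abstract describes as making reversed decoys seem realistic.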

RESULTS

We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that (even randomly shuffled) sequences contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. Though the expected palindrome length is only slightly larger than the expected LCS, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to greatly increased frequency of high-scoring alignments to reversed sequences.
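The comparison described above can be explored with a small simulation. This is a minimal sketch and not the authors' method: it uses exact palindromes and exact common substrings rather than scored alignments or approximate palindromes, and the sequence length, trial count, alphabet, and function names are all assumptions made for illustration.

```python
import random


def longest_palindrome_length(s: str) -> int:
    """Length of the longest palindromic substring (expand around each center)."""
    best = 0
    n = len(s)
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2  # odd centers sit between two letters
        while left >= 0 and right < n and s[left] == s[right]:
            left -= 1
            right += 1
        best = max(best, right - left - 1)
    return best


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, other in enumerate(b, start=1):
            if ch == other:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def compare(length=300, trials=50, alphabet="ACDEFGHIKLMNPQRSTVWY", seed=0):
    """Mean longest palindrome in S vs. mean LCS between S and a shuffle of S."""
    rng = random.Random(seed)
    pal_total = lcs_total = 0
    for _ in range(trials):
        s = [rng.choice(alphabet) for _ in range(length)]
        shuffled = s[:]
        rng.shuffle(shuffled)  # permuted variant with the same letter counts
        pal_total += longest_palindrome_length("".join(s))
        lcs_total += longest_common_substring_length("".join(s), "".join(shuffled))
    return pal_total / trials, lcs_total / trials


if __name__ == "__main__":
    mean_pal, mean_lcs = compare()
    print(f"mean longest palindrome: {mean_pal:.2f}")
    print(f"mean longest common substring: {mean_lcs:.2f}")
```

Any exact palindrome inside S also appears unchanged in the reversal of S, so the longest palindrome is a lower bound on the exact match that S shares with its reversal. The values printed depend on the chosen parameters, and a reader wanting to reproduce the paper's score distributions would need to substitute a real alignment scorer.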

IMPACT

Overestimates of false match risk can motivate unnecessarily high score thresholds, leading to potentially reduced true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.
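The reporting convention mentioned above, sensitivity up to the score of the first matched decoy, can be sketched as follows. The function name `recall_before_first_decoy`, the `(score, is_decoy)` hit representation, and the scores are hypothetical and serve only to show how a single high-scoring decoy changes the reported value.

```python
def recall_before_first_decoy(hits, total_true):
    """Fraction of true matches scoring strictly above the best decoy.

    hits: list of (score, is_decoy) pairs; total_true: number of true
    matches in the benchmark (an assumed, known quantity).
    """
    decoy_scores = [score for score, is_decoy in hits if is_decoy]
    threshold = max(decoy_scores) if decoy_scores else float("-inf")
    found = sum(1 for score, is_decoy in hits
                if not is_decoy and score > threshold)
    return found / total_true


if __name__ == "__main__":
    # Made-up scores: four true matches and two decoy matches.
    hits = [(90, False), (80, False), (75, True), (70, False), (40, True)]
    print(recall_before_first_decoy(hits, total_true=4))
    # One additional high-scoring decoy lowers the reported sensitivity.
    print(recall_before_first_decoy(hits + [(85, True)], total_true=4))
```

Because the measure is cut off by the single best-scoring decoy, a decoy set that tends to produce inflated scores can compress the reported sensitivity of every tool toward the same low value.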

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/66c5/11099658/493f02e91d99/vbae052f1.jpg