Learn G H, Korber B T, Foley B, Hahn B H, Wolinsky S M, Mullins J I
Department of Microbiology, University of Washington, Seattle 98195-7740, USA.
J Virol. 1996 Aug;70(8):5720-30. doi: 10.1128/JVI.70.8.5720-5730.1996.
Human immunodeficiency virus type 1 (HIV-1) sequences are accumulating in the literature at a rapid pace. For this ever-expanding resource to be maximally useful, researchers must maintain a high level of quality assurance in experimental design, conduct, and analysis. Here we present detailed analyses of problematic sets of HIV-1 sequences in the database that include sequence anomalies suggestive of mislabeling or sample contamination. These data are examined in the context of currently available HIV-1 sequence information to provide an example of how to identify potentially flawed data. Indicators of potential problems are (i) nearly identical sequences that are reported as derived from unlinked individuals and that are markedly distinct from other sequences from the putative source or (ii) sequences that are nearly identical to those of laboratory strains. We outline methods that researchers can use to perform preliminary laboratory and computational analyses that could help identify problematic data and thus help ensure the integrity of sequence databases.
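As an illustration of the computational side of such a screen, the two indicators can be sketched as a pairwise-distance check. This is a minimal sketch and not the procedure used in the paper: the function names `p_distance` and `flag_suspects`, the input layout, and the 1% distance threshold are all assumptions chosen for the example, and the sequences are assumed to be already aligned to a common length. The sketch covers only the near-identity part of indicator (i); it does not test whether a flagged sequence is also markedly distinct from other sequences from its putative source.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap ('-') or an 'N' are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def flag_suspects(samples, lab_strains, threshold=0.01):
    """Flag sequence pairs that match either indicator.

    samples:     dict of name -> (subject_id, aligned_sequence)
    lab_strains: dict of name -> aligned_sequence
    threshold:   hypothetical cutoff for "nearly identical" (assumption)
    """
    flags = []
    # Indicator (i): nearly identical sequences from different subjects.
    for (n1, (s1, q1)), (n2, (s2, q2)) in combinations(samples.items(), 2):
        if s1 != s2 and p_distance(q1, q2) <= threshold:
            flags.append(("cross-subject", n1, n2))
    # Indicator (ii): sequences nearly identical to a laboratory strain.
    for name, (_, seq) in samples.items():
        for lab_name, lab_seq in lab_strains.items():
            if p_distance(seq, lab_seq) <= threshold:
                flags.append(("lab-strain", name, lab_name))
    return flags
```

A flag from a screen like this is only a prompt for closer inspection, for example phylogenetic analysis and review of laboratory records; it does not by itself establish contamination or mislabeling, and an appropriate threshold would depend on the genomic region and the epidemiological context.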