Institute of Information Science, Academia Sinica, Nankang, Taipei 11529, Taiwan.
J Proteomics. 2021 Jan 16;231:104021. doi: 10.1016/j.jprot.2020.104021. Epub 2020 Oct 24.
Concatenated target-decoy database searches are commonly used in proteogenomic research to identify variant peptides. Currently, both protein-based and peptide-based sequence databases are used to store variant sequences for these searches. A protein-based database stores full-length wild-type protein sequences in which the original amino acids are replaced according to the given variant events, whereas a peptide-based database retains only the in silico digested peptides that contain the variants. However, how various decoy generation methods perform on a peptide-based variant sequence database, compared with a protein-based one, remains unclear. In this paper, we thoroughly compare target-decoy databases constructed from these two database types coupled with various decoy generation methods for proteogenomic analyses. The results show that, for the protein-based variant sequence database, the reverse and pseudo reverse methods achieve similar performance in variant peptide identification. For the peptide-based database, by contrast, the pseudo reverse method is more suitable than the widely used reverse method: it identified 6% more variant peptide-spectrum matches (PSMs) in a HEK293 cell line data set. SIGNIFICANCE: In our survey of publications on proteogenomic studies, 57% of the studies construct their target-decoy database from a peptide-based variant sequence database coupled with the reverse method for decoy generation. However, our results show that, with a peptide-based variant sequence database, it is better to generate decoy sequences with the pseudo reverse method; otherwise, fewer variant peptides are identified.
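To make the two decoy generation methods concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the common definitions: the reverse method reverses the entire sequence, while the pseudo reverse method reverses a tryptic peptide but keeps its C-terminal residue (typically K or R) in place. The function names and the example peptide are hypothetical and chosen only for illustration.

```python
def reverse_decoy(sequence: str) -> str:
    """Reverse method (assumed definition): reverse the whole sequence."""
    return sequence[::-1]


def pseudo_reverse_decoy(peptide: str) -> str:
    """Pseudo reverse method (assumed definition): reverse every residue
    except the C-terminal one, which stays in place."""
    if len(peptide) < 2:
        return peptide
    return peptide[:-1][::-1] + peptide[-1]


# Hypothetical tryptic peptide ending in K.
peptide = "LVNELTEFAK"
print(reverse_decoy(peptide))         # KAFETLENVL
print(pseudo_reverse_decoy(peptide))  # AFETLENVLK
```

In this sketch, the reversed decoy of a tryptic peptide begins with K, whereas the pseudo reversed decoy still ends with K and so keeps the same length and tryptic C-terminus as its target.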