Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan 411105, China.
Advanced Analytics Institute, University of Technology Sydney, Sydney, NSW 2007, Australia.
Bioinformatics. 2021 May 5;37(6):750-758. doi: 10.1093/bioinformatics/btaa887.
Infection with strains of different subtypes, followed by crossover reading between the two genomic RNA strands by reverse transcriptase in the host cell, is the main cause of the vast sequence diversity of HIV-1. Such inter-subtype genomic recombinants can become circulating recombinant forms (CRFs) once they are widely transmitted in a population. Predicting the complete set of subtype sources of a CRF strain is a complicated machine learning problem. It is also difficult to determine whether a strain represents an emerging new subtype and, if so, how to accurately identify the new components of its genetic source.
We introduce a multi-label learning algorithm that predicts the complete set of subtype sources of a CRF sequence as well as its chronological number. The prediction is strengthened by voting among several multi-label learning methods to avoid biased decisions. Both frequency and position features are extracted from the sequences to capture the signature patterns of pure subtypes and CRFs. The method was applied to 7185 HIV-1 sequences, comprising 5530 pure subtype sequences and 1655 CRF sequences. The results demonstrate that the method achieves very high accuracy (reaching 99%) in predicting the complete label set of HIV-1 recombinant forms. The few wrong predictions are in fact incomplete predictions that are very close to the complete set of genuine labels.
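The ideas of frequency and position features, label voting, and complete versus incomplete predictions can be illustrated with a minimal Python sketch. It is not the published implementation: the abstract does not specify the features or the voting rule, so the k-mer encoding, the majority threshold, and the names `kmer_features`, `vote_labels`, and `is_incomplete` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_features(seq, k=2):
    """Return (frequencies, mean relative positions) over all 4**k k-mers.

    Assumed encoding: the abstract only says that frequency and position
    features are extracted, not how they are defined.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter()
    position_sums = Counter()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):  # skip windows with ambiguous bases
            counts[kmer] += 1
            position_sums[kmer] += i
            total += 1
    span = max(len(seq) - k, 1)  # last valid start index, used to scale to [0, 1]
    freqs = [counts[m] / total if total else 0.0 for m in kmers]
    positions = [
        position_sums[m] / counts[m] / span if counts[m] else 0.0 for m in kmers
    ]
    return freqs, positions


def vote_labels(predicted_sets, min_share=0.5):
    """Keep every label predicted by more than min_share of the classifiers.

    Assumed rule: a simple per-label majority vote over the label sets
    returned by several (hypothetical) multi-label classifiers.
    """
    votes = Counter(label for labels in predicted_sets for label in set(labels))
    needed = min_share * len(predicted_sets)
    return {label for label, n in votes.items() if n > needed}


def is_incomplete(predicted, truth):
    """True when the prediction is a proper subset of the genuine label set."""
    return set(predicted) < set(truth)


if __name__ == "__main__":
    freqs, positions = kmer_features("ACGTACGTTTGA", k=2)
    print(len(freqs), len(positions))

    # Three hypothetical classifiers voting on the subtype sources of one CRF.
    consensus = vote_labels([{"B", "C"}, {"B"}, {"B", "F"}])
    print(consensus, is_incomplete(consensus, {"B", "C"}))
```

Under this scheme a prediction counts as correct only when the voted label set equals the genuine set exactly, while a proper subset of the genuine set is an incomplete prediction of the kind the abstract describes.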
https://github.com/Runbin-tang/The-source-of-HIV-CRFs-prediction.
Supplementary data are available at Bioinformatics online.