Centre for Bioinformatics, School of Data Sciences, Perdana University, Damansara Heights, Kuala Lumpur, 50490, Malaysia.
Department of Biochemistry, Faculty of Science, Kaduna State University, Kaduna, 800211, Nigeria.
BMC Genomics. 2021 Sep 28;22(Suppl 3):700. doi: 10.1186/s12864-021-07657-4.
Biology has entered the era of big data with the advent of high-throughput omics technologies. Biological databases provide public access to petabytes of data and information, facilitating knowledge discovery. Over the years, the number of pathogen sequence records has increased greatly, given the relatively small genome size of pathogens and their important role as infectious and symbiotic agents. Humans are host to numerous pathogens, such as viruses, many of which cause diseases with high mortality and morbidity. The interaction between pathogens and humans over evolutionary history has resulted in the sharing of sequences, with important biological and evolutionary implications.
This study describes a large-scale, systematic bioinformatics approach for the identification and characterization of sequences shared between host and pathogen. An application of the approach is demonstrated through identification and characterization of the Flaviviridae-human share-ome. A total of 2430 nonamers, each shared with 100% identity, constituted the Flaviviridae-human share-ome. Although the share-ome represented a small fraction of the repertoire of Flaviviridae (~0.12%) and human (~0.013%) non-redundant nonamers, the 2430 shared nonamers mapped to 16,946 Flaviviridae and 7506 human non-redundant protein sequences. The shared nonamer sequences mapped to 125 species of Flaviviridae, including several of unclassified genus. The majority (~68%) of the shared sequences mapped to Hepacivirus C species; West Nile, dengue and Zika viruses of the Flavivirus genus accounted for ~11%, ~7%, and ~3%, respectively, of the Flaviviridae protein sequences (16,946) mapped by the share-ome. Further characterization of the share-ome provided important structural-functional insights into Flaviviridae-human interactions.
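The core matching step described above, finding nonamers shared at 100% identity and mapping them back to their source proteins, can be illustrated with a minimal Python sketch. This is not the study's actual pipeline: the function names (`kmers`, `kmer_index`, `share_ome`) and the input shape (a dictionary of protein ID to amino-acid sequence) are assumptions made for illustration, and the sketch omits steps a real workflow would need, such as database retrieval, redundancy removal and handling of ambiguous residues.

```python
from typing import Dict, Set, Tuple

K = 9  # nonamer length, as used in the abstract


def kmers(seq: str, k: int = K) -> Set[str]:
    """Return the set of all overlapping k-mers in a protein sequence."""
    seq = seq.upper()
    # Sequences shorter than k yield an empty set.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_index(proteins: Dict[str, str], k: int = K) -> Dict[str, Set[str]]:
    """Map each k-mer to the IDs of the proteins that contain it."""
    index: Dict[str, Set[str]] = {}
    for protein_id, seq in proteins.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(protein_id)
    return index


def share_ome(
    pathogen: Dict[str, str], host: Dict[str, str], k: int = K
) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Find k-mers present in both sets with exact (100%) identity.

    Returns a mapping of shared k-mer -> (pathogen protein IDs, host protein IDs).
    """
    pathogen_index = kmer_index(pathogen, k)
    host_index = kmer_index(host, k)
    shared = pathogen_index.keys() & host_index.keys()
    return {kmer: (pathogen_index[kmer], host_index[kmer]) for kmer in shared}


if __name__ == "__main__":
    # Toy, made-up sequences; not real Flaviviridae or human proteins.
    toy_pathogen = {"V1": "MKTAYIAKQRQISFVK", "V2": "GGGGAYIAKQRQISLL"}
    toy_host = {"H1": "PPAYIAKQRQISPP"}
    for kmer, (p_ids, h_ids) in sorted(share_ome(toy_pathogen, toy_host).items()):
        print(kmer, sorted(p_ids), sorted(h_ids))
```

Indexing each nonamer to the set of proteins that contain it is what lets a small number of shared nonamers map to a much larger number of protein sequences, as in the counts reported above.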
Mapping of the host-pathogen share-ome has important implications for the design of vaccines and drugs, diagnostics, disease surveillance and the discovery of unknown, potential host-pathogen interactions. The generic workflow presented herein is potentially applicable to a variety of pathogens, such as those of viral, bacterial or parasitic origin.