Biscarini F, Cozzi P, Orozco-Ter Wengel P
CNR-IBBA, Via Bassini 15, 20133, Milan, Italy.
School of Medicine, Cardiff University, Heath Park, CF14 4XN, Cardiff, UK.
Anim Genet. 2018 Jun;49(3):147-158. doi: 10.1111/age.12655. Epub 2018 Apr 6.
The 'omics revolution has made a large amount of sequence data available to researchers and industry. This has had a profound impact on the field of bioinformatics, stimulating unprecedented advances in the discipline. This impact is usually viewed from the perspective of human 'omics, in particular human genomics. Plant and animal genomics, however, have also been deeply influenced by next-generation sequencing technologies, with several genomics applications now popular among researchers and the breeding industry. Genomics tends to generate huge amounts of data, and genomic sequence data account for an increasing proportion of big data in the biological sciences, due largely to decreasing sequencing and genotyping costs and to large-scale sequencing and resequencing projects. The analysis of big data poses a challenge to scientists: data gathering currently outpaces data processing and analysis, and the associated computational burden is increasingly taxing, making even simple manipulation, visualization and transfer of data cumbersome. The time consumed by processing and analysing huge data sets may come at the expense of data quality assessment and critical interpretation. Additionally, when analysing large amounts of data, something is likely to go awry (the software may crash or stop), and tracking down the error can be very frustrating. We herein review the most relevant issues related to tackling these challenges and problems, from the perspective of animal genomics, and provide researchers who lack extensive computing experience with guidelines that will help when processing large genomic data sets.