Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil.
Genome Science and Technology Program, Faculty of Science, The University of British Columbia, Vancouver, Canada.
Big Data. 2022 Aug;10(4):279-297. doi: 10.1089/big.2020.0383. Epub 2022 Apr 7.
The amount of available data is growing continuously, a phenomenon that has given rise to the concept of big data. The key technologies associated with big data are cloud computing (infrastructure) and Not Only SQL (NoSQL; data storage). For data analysis, machine learning algorithms such as decision trees, support vector machines, artificial neural networks, and clustering techniques have shown promising results. In biology, big data has many applications owing to the large number of biological databases available. Some limitations of biological big data stem from inherent features of the data, such as high complexity and heterogeneity, since biological systems yield information ranging from the atomic level to interactions between organisms or with their environment. These characteristics make most bioinformatics-based applications difficult to build, configure, and maintain. Although the rise of big data is relatively recent, it has contributed to a better understanding of the underlying mechanisms of life. The main goal of this article is to provide a concise and reliable survey of the application of big data-related technologies in biology. To that end, fundamental concepts of information technology, including storage resources, analysis, and data sharing, are described along with their relation to biological data.