Fracchia Charles
BioBright, Boston, MA, USA.
Methods Mol Biol. 2021;2190:317-336. doi: 10.1007/978-1-0716-0826-5_16.
Recently, digitization of biomedical processes has accelerated, in no small part due to the use of machine learning techniques, which require large amounts of labeled data. This chapter focuses on the prerequisite steps to the training of any algorithm: data collection and labeling. In particular, we tackle how data collection can be set up with scalability and security to avoid costly and delaying bottlenecks. Unprecedented amounts of data are now available to companies and academics, but digital tools in the biomedical field encounter a problem of scale, since high-throughput workflows such as high-content imaging and sequencing can create several terabytes per day. Consequently, data transport, aggregation, and processing are challenging.
A second challenge is the maintenance of data security. Biomedical data can be personally identifiable, may constitute important trade secrets, and can be expensive to produce. Furthermore, human biomedical data is often immutable, as is the case with genetic information. These factors make securing this type of data imperative and urgent. Here we address best practices to achieve security, with a focus on practicality and scalability. We also address the difficulty of obtaining usable, rich metadata from the collected data, which is a major challenge in the biomedical field because of the use of fragmented and proprietary formats. We detail tools and strategies for extracting metadata from biomedical scientific file formats and how this underutilized metadata plays a key role in creating labeled data for use in the training of neural networks.