Boerman Jacquelyn P, Brito Luiz F, Montes Maria E, Maskal Jacob M, Doucette Jarrod, Kalbaugh Kirby
Department of Animal Sciences, Purdue University, West Lafayette, IN 47907.
Agriculture Data Services, Purdue University, West Lafayette, IN 47907.
JDS Commun. 2025 Mar 12;6(3):339-344. doi: 10.3168/jdsc.2024-0723. eCollection 2025 May.
Large-scale data generation on dairy cattle farms is expected to continue increasing because of larger numbers of animals per farm and the adoption of on-farm sensors and technologies that generate additional information on individual animals at greater frequency. Siloed data and information, lacking interoperability, prevent end users from combining data from multiple sources and drawing more meaningful conclusions from the data generated on farm. In response to these data challenges, the objective of this technical note is to describe the design, and document the development, of a data ecosystem that automatically collects data, performs quality control, and integrates data from disparate sources used on experimental and commercial dairy farms. Integrated data can be queried to answer specific questions or to generate timed reports that provide more insight than any single data source can provide. Our objective was to develop a collaborative research data infrastructure that enables comprehensive data accessibility through an integrated computational ecosystem comprising the open-source technologies JupyterHub, Python, and Apache Spark. This shared, curated environment facilitates extensive dataset consumption, empowering users to leverage distributed computing resources and parallel processing capabilities for multi-dataset analysis and integration. Before the farm data are made accessible to users, they undergo a rigorous multistage preprocessing protocol designed to mitigate potential data integrity challenges. These data curation steps systematically address complex variability across sources, including vendor-specific software modifications, intermittent data retrieval disruptions, and farm-level operational contingencies. By employing data cleaning, transformation, and validation methodologies, the infrastructure ensures robust data standardization and quality assurance.
The integration of datasets from different data sources is paramount for improving dairy cattle welfare and production efficiency, which are complex management and breeding goals influenced by a multitude of traits that can be measured by different sensors. We identified areas of research and further development needed in the field of dairy data science (e.g., data editing and quality control procedures, references and standards for novel sensor-based variables, and validation of obtained data across sensors), a field that is expected to continue playing a major role in the sustainability of the dairy industry.
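
The quality control and integration steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses plain Python rather than Apache Spark so that it runs anywhere, and every field name (`cow_id`, `day`, `milk_kg`, `rumination_min`), every record, and every plausibility threshold is a hypothetical assumption chosen only for illustration.

```python
from datetime import date

# Hypothetical records from two farm data sources (all names and values are assumptions).
milk_records = [
    {"cow_id": "1001", "day": date(2024, 5, 1), "milk_kg": 38.2},
    {"cow_id": "1002", "day": date(2024, 5, 1), "milk_kg": -4.0},  # implausible negative yield
    {"cow_id": None, "day": date(2024, 5, 1), "milk_kg": 35.0},    # missing animal ID
    {"cow_id": "1003", "day": date(2024, 5, 1), "milk_kg": 41.7},
]
activity_records = [
    {"cow_id": "1001", "day": date(2024, 5, 1), "rumination_min": 512},
    {"cow_id": "1003", "day": date(2024, 5, 1), "rumination_min": 1900},  # exceeds 1440 min/day
    {"cow_id": "1004", "day": date(2024, 5, 1), "rumination_min": 480},
]


def quality_control(records, field, low, high):
    """Keep records that have an animal ID and a value of `field` within [low, high]."""
    kept = []
    for rec in records:
        value = rec.get(field)
        if rec.get("cow_id") is None or value is None:
            continue  # cannot be linked to an animal, or nothing was measured
        if low <= value <= high:
            kept.append(rec)
    return kept


def integrate(left, right):
    """Inner-join two record lists on the (cow_id, day) key."""
    index = {(r["cow_id"], r["day"]): r for r in right}
    merged = []
    for rec in left:
        key = (rec["cow_id"], rec["day"])
        if key in index:
            merged.append({**rec, **index[key]})
    return merged


# Illustrative thresholds only; real limits would come from documented QC procedures.
clean_milk = quality_control(milk_records, "milk_kg", 0.0, 100.0)
clean_activity = quality_control(activity_records, "rumination_min", 0, 1440)
integrated = integrate(clean_milk, clean_activity)

for row in integrated:
    print(row["cow_id"], row["day"].isoformat(), row["milk_kg"], row["rumination_min"])
```

Only animals with valid records in both sources survive the join, which is why quality control is applied to each source before integration. In a distributed setting the same two steps would typically map onto a filter followed by a join on the shared animal and date keys.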