Xodabande Ismail, Atai Mahmood Reza, Hashemi Mohammad R
Department of Foreign Languages, Kharazmi University, Tehran, Iran.
MethodsX. 2025 Jan 11;14:103168. doi: 10.1016/j.mex.2025.103168. eCollection 2025 Jun.
This article introduces a protocol for analyzing large corpora for vocabulary profiling, intended to support corpus-based studies of academic discourse. Because academic corpora are large and complex, the protocol combines corpus compilation techniques with lexical analysis tools to identify and categorize vocabulary suitable for academic use. The study details the systematic compilation of a large corpus of academic texts and describes the adaptations made to corpus linguistics tools so that they could efficiently handle a corpus of 278 million running words. Validation of the resulting mid-frequency word list demonstrated its strong relevance to chemistry, with 6.4% coverage in chemistry research articles and 2.5-3% coverage in related disciplines such as biology and life sciences. Coverage was much lower in general corpora, highlighting the list's specialized nature. The methodology provides a framework for academic vocabulary profiling and offers scalable solutions for educators and researchers dealing with extensive text datasets. The findings contribute to vocabulary research in chemistry and related fields, and the resulting vocabulary lists have practical implications for designing curricula and educational resources that make academic English instruction in specialized settings more precise and effective.
• Developed a scalable protocol for analyzing large text data for vocabulary profiling.
• Applied advanced lexical analysis to a 278-million-word academic corpus.
• The mid-frequency vocabulary list produced offers pedagogical value in academic discourse.
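The coverage figures reported above (for example, 6.4% in chemistry research articles) express the share of running words in a corpus that belong to a given word list. The abstract does not name the tools or the counting unit used, so the following is only a minimal sketch of that calculation under stated assumptions: the tokenizer, the function names `tokenize` and `coverage`, and the toy sentence and word list are all hypothetical, and matching is done on lowercased word forms rather than on lemmas or word families.

```python
import re
from collections import Counter

# Assumption: a simple regex tokenizer; the study's actual tokenization
# rules are not given in the abstract.
TOKEN_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens (running words)."""
    return TOKEN_RE.findall(text.lower())


def coverage(tokens, word_list):
    """Return the percentage of tokens that appear in word_list."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Sum the frequencies of every token type found in the word list.
    covered = sum(n for word, n in counts.items() if word in word_list)
    return 100.0 * covered / len(tokens)


# Toy data for illustration only; not taken from the study.
sample = "The catalyst was added to the solvent and the solvent was heated."
toy_list = {"catalyst", "solvent"}
print(f"{coverage(tokenize(sample), toy_list):.1f}%")  # 3 of 12 tokens -> 25.0%
```

For a corpus of hundreds of millions of words, the same calculation could be run file by file, accumulating the covered and total token counts instead of holding all tokens in memory.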