Dataset for legal question answering system in the Indian judiciary context

K Veningston, Mishra Apratim

Department of Computer Science and Engineering, National Institute of Technology Srinagar, Jammu and Kashmir 190006, India.

Data Brief. 2025 May 12;60:111647. doi: 10.1016/j.dib.2025.111647. eCollection 2025 Jun.

Legal documents, such as court judgments and statutes, are vital for understanding judicial decisions, legal principles, and procedural details. However, these documents are often dense, complex, and voluminous, making it hard for lawyers, researchers, and citizens to locate and retrieve relevant information quickly. The Legal Question Answering (LQA) [1] task involves developing systems that can automatically answer legal questions based on relevant legal documents centred on the and , preferably from delivered judgments that are considered public property of the nation. The need for specialized LQA datasets is particularly pressing in countries like India, where legal texts follow distinct judicial structures, specialized terminology, and procedural intricacies [2]. Because no suitable dataset exists for an LQA system [3], this paper presents a comprehensive dataset designed for LQA in the Indian judiciary context that facilitates efficient legal information retrieval. The dataset, available on Mendeley Data [4], comprises 10,000 question-answer (QA) pairs derived from 1256 Indian Supreme Court judgments (538 criminal and 718 civil cases) across various legal domains. Each QA pair is derived from a detailed judgment of the Apex court (i.e. the Supreme Court of India), with the question framed to capture essential legal issues, principles, or facts, and the answer extracted directly from the text. The dataset covers a balanced mix of legal topics in criminal and civil law, such as constitutional matters, property disputes, criminal offences, procedural matters, family disputes, employment matters, financial and taxation issues, and public welfare concerns. It also includes metadata such as case name and judgment date. This dataset supports the development of AI-driven LQA systems that improve access to precise legal information and help legal professionals and ordinary citizens navigate India's complex legal system.
To evaluate the dataset's effectiveness for legal question-answering tasks, the "meta-llama/Llama-2-7b-hf" model [5] is fine-tuned on the IndicLegalQA dataset using Parameter-Efficient Fine-Tuning (PEFT), specifically the Low-Rank Adaptation (LoRA) technique [[6], [7], [8]]. The fine-tuned model is evaluated using Sentence-BERT (SBERT) [9] with the "paraphrase-MiniLM-L6-v2" embedding model. Cosine similarity between the embeddings of the actual and generated answers measures how well the model captures the nuances of legal language. The evaluation supports the dataset's suitability for real-world legal applications and its value as a resource for improving AI-driven legal information retrieval systems.
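The similarity-based evaluation described in the abstract can be illustrated with a minimal sketch. It assumes the actual and generated answers have already been converted to embedding vectors (in the paper, by the SBERT "paraphrase-MiniLM-L6-v2" model); the toy three-dimensional vectors and the function names `cosine_similarity` and `mean_similarity` are hypothetical and chosen only for illustration. Averaging the per-pair scores is also an assumption, since the abstract does not say how scores are aggregated.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity(
    reference_vecs: Sequence[Sequence[float]],
    generated_vecs: Sequence[Sequence[float]],
) -> float:
    """Average cosine similarity over paired reference/generated embeddings.

    Averaging is an assumption of this sketch, not a detail from the paper.
    """
    if len(reference_vecs) != len(generated_vecs):
        raise ValueError("need one generated answer per reference answer")
    if not reference_vecs:
        raise ValueError("need at least one answer pair")
    scores = [
        cosine_similarity(r, g) for r, g in zip(reference_vecs, generated_vecs)
    ]
    return sum(scores) / len(scores)


# Toy stand-ins for answer embeddings (real SBERT vectors are much longer).
reference = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
generated = [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# First pair points in the same direction (score 1.0),
# second pair is orthogonal (score 0.0).
print(mean_similarity(reference, generated))
```

A score near 1 means the generated answer's embedding points in nearly the same direction as the reference answer's, while a score near 0 means the two are unrelated in embedding space. Vector length does not affect the score, which is why the first toy pair scores 1.0 despite the differing magnitudes.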

References cited in this article

Overview and Discussion of the Competition on Legal Information, Extraction/Entailment (COLIEE) 2023.

Rev Socionetwork Strateg. 2024;18(1):27-47. doi: 10.1007/s12626-023-00152-0. Epub 2024 Jan 12.
