LLM4DDC: Adopting Large Language Models for Research Data Classification Using Dewey Decimal Classification

Gautam Kishore Shahi; Renat Shigapov; Oliver Hummel

doi:10.11588/heibooks.1652

Suggested citation

Kishore Shahi, Gautam, Shigapov, Renat and Hummel, Oliver: LLM4DDC: Adopting Large Language Models for Research Data Classification Using Dewey Decimal Classification, in Heuveline, Vincent et al. (eds.): E-Science-Tage 2025: Research Data Management: Challenges in a Changing World, Heidelberg: heiBOOKS, 2025, pp. 476–484. https://doi.org/10.11588/heibooks.1652.c23948

License (chapter)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license.

Identifiers (book)

https://doi.org/10.11588/heibooks.1652

ISBN 978-3-911056-51-9 (PDF)

ISBN 978-3-911056-52-6 (Softcover)

Published

5 November 2025

Abstract: Classifying research data in institutional repositories is time-consuming and challenging. While the Dewey Decimal Classification (DDC) system is widely used for subject classification of texts, its application to research data metadata has so far been limited. This study explores the use of large language models (LLMs) and small language models (SLMs) for automatically classifying research data into DDC classes. We use sample data from an existing dataset compiled from different institutions, mainly in Germany. We apply prompt engineering to LLMs and fine-tuning to SLMs, with RoBERTa as a baseline. Our results show that LLMs with prompt engineering currently cannot classify research data metadata into DDC classes as well as fine-tuned SLMs. To foster adoption, we openly release our models, code, and datasets on GitHub for integration into research data infrastructures.

Keywords: Dewey Decimal Classification, Research Data, Large Language Model, Automated Classification