Development of a high-performance question-answering system in the Kazakh language using external knowledge sources in specialized fields

Project Manager: Shomanov A.S.

Funding source: GF for Young Scientists MSHE RK

Goal: Increasing the performance of question-answering models in the Kazakh language and reducing the cost of their development by optimizing and fine-tuning large pre-trained multilingual models

Years of implementation: 2024–2026

Partners: Plasma Science LLP

Funding amount: 89,979,146.58 tenge

Project Abstract

In recent years, the field of Natural Language Processing (NLP) has made significant progress, which has become evident not only to experts but also to the general public thanks to question-answering systems and chatbots. These innovative developments, the most notable of which is GPT, have become a hallmark of NLP, demonstrating their practical value to millions of users around the world.

However, implementing similar systems for low-resource languages such as Kazakh remains a challenge, primarily because of the lack of resources and the high costs, especially those associated with high-performance clusters of graphics processing units (GPUs). This project aims to develop a high-performance question-answering system in the Kazakh language that operates on external knowledge sources in a specialized domain.

Under the commonly used classification, a question-answering system can be open-domain (using external and internal knowledge) or closed-domain (designed for a specialized, non-general area of knowledge). In this work, the question-answering system will be based on a Kazakh-language question-answering model built on the transformer architecture.

The goal

The goal of this project is to increase the performance of question-answering models in the Kazakh language and reduce the cost of their development by optimizing and fine-tuning large pre-trained multilingual models that technology giants such as Google, Microsoft, OpenAI, Meta, and others created with almost unlimited resources.

The project tasks

To achieve the project goal, three main tasks must be solved, each of which is divided into three subtasks. So far, pre-trained models for question-answering systems in the Kazakh language have been prepared, and one of them (T5-Kazakh-QA) has been published on the HuggingFace platform. The technology readiness level is assessed as TRL 2; TRL 3 is expected upon completion.

Task 1 – develop an economical and productive question-answering model in the Kazakh language.
Task 2 – develop a model for semantic classification of the contexts of the questions asked.
Task 3 – integrate the developed models and create a prototype of a question-answering system in the Kazakh language.
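
How the three tasks might fit together can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the `CONTEXTS` dictionary, the bag-of-words `embed` function, and `dummy_qa_model` are all hypothetical stand-ins. A real system would use a semantic embedding model for Task 2 and a fine-tuned transformer model for Task 1, and the example texts are in English only for readability.

```python
import math
import re
from collections import Counter

# Hypothetical contexts standing in for external knowledge sources.
CONTEXTS = {
    "geography": "Astana is the capital of Kazakhstan.",
    "chemistry": "Water consists of hydrogen and oxygen.",
}


def embed(text):
    """Toy bag-of-words vector; an assumption replacing a semantic embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_context(question):
    """Stand-in for Task 2: return the label of the context closest to the question."""
    q = embed(question)
    return max(CONTEXTS, key=lambda label: cosine(q, embed(CONTEXTS[label])))


def answer(question, qa_model):
    """Stand-in for Task 3: select a context, then pass it to the QA model."""
    label = classify_context(question)
    return label, qa_model(question, CONTEXTS[label])


def dummy_qa_model(question, context):
    """Placeholder for the Task 1 model: returns the first word of the context."""
    return context.split()[0]


label, reply = answer("What is the capital of Kazakhstan?", dummy_qa_model)
print(label, reply)
```

The design choice illustrated here is only the separation of concerns: the context classifier narrows the knowledge source first, so the question-answering model sees a single relevant passage rather than the whole collection.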

Stages of the project implementation

1 Development of an economical and productive question-answering model in the Kazakh language

2 Development of a semantic classification model for the contexts of the questions being asked

3 Development of a prototype of a question-answering system in the Kazakh language

The main results

The main results of this project will include:

1) a new cost-effective and productive question-answer model in the Kazakh language;

2) a new language-invariant model for the semantic classification of contexts of questions asked;

3) a prototype of an intelligent question-answer system in the Kazakh language.

SHOMANOV ADAI

Project principal

KAYRATULY BAUYRZHAN

Researcher

SHAKENOV ZHASULAN

Researcher

KADYRBEK NURGALI

Senior Researcher

TLEUBAYEVA ARAILYM

Senior Researcher

MANSUROVA AIGERIM

Junior researcher

MAKHAMBETOVA ZHANSAYA

Junior researcher

Expected project results

A question-answering model in the Kazakh language will be developed and published on the HuggingFace portal. A comparative analysis of multilingual question-answering models will be conducted, and the features of their adaptation will be studied. Methods for optimizing the parameters of selected multilingual models for adaptation to the Kazakh language will be explored and developed. One article will be published in a peer-reviewed domestic journal recommended by the Committee for Quality Assurance in the Sphere of Higher Education and Science of the Republic of Kazakhstan.

Methods for fine-tuning selected multilingual models for adaptation to the Kazakh language will be studied and developed. Methods for semantic classification of contexts, integrating selected semantic classification algorithms with semantic embedding models, will be explored and developed. One article will be published in a peer-reviewed scientific journal indexed in the Science Citation Index Expanded database of Web of Science and/or with a CiteScore percentile of at least 50 in the Scopus database.

A prototype of a Kazakh-language question-answering system will be developed, and a web interface for connecting to the system will be created. One article will be published in a peer-reviewed scientific journal indexed in the Science Citation Index Expanded database of Web of Science and/or with a CiteScore percentile of at least 50 in the Scopus database.

Publications

Tleubayeva, A., & Shomanov, A. (2024). Comparative analysis of multilingual QA models and their adaptation to the Kazakh language. Scientific Journal of Astana IT University, 19, 89–97. https://doi.org/10.37943/19WHRK2878

Results of 2024:

Kazakh-language models based on RoBERTa, such as roberta-kaz-large and roberta-large-kazqad, were developed and optimized, and they showed high performance and accuracy in question-answering and ranking tasks.
The novelty of the work lies in the effective application of modern learning and optimization methods to the underrepresented Kazakh language, as well as in the creation of specialized datasets and models that help improve the quality of natural language processing for this language.
The following models were also published within the framework of the project:

1) Llama question-answering model in Kazakh.

2) GPT-j-3.4 model in Kazakh.

3) RoBERTa-Kaz-Large question-answering model for Kazakh.

4) Llama model in Kazakh.
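
The results above refer to performance and accuracy on question-answering tasks without naming the metrics used. Purely as an illustration, and assuming SQuAD-style scoring (an assumption, not something the project states), exact match and token-level F1 between a predicted and a reference answer could be computed as follows. The function names and the simple whitespace tokenization are hypothetical choices made for this sketch.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace; a simplification of typical answer normalization."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized answers are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between the two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Astana", "astana"))
print(token_f1("the city of Astana", "Astana"))
```

Token-level F1 gives partial credit when a prediction contains the reference answer plus extra words, which exact match does not.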