Development of a high-performance question-answering system in the Kazakh language using external knowledge sources in specialized fields

Project Manager: Shomanov A.S.

Funding source: GF for Young Scientists MSHE RK

Goal: Increasing the performance of question-answering models in the Kazakh language and reducing the cost of their development by optimizing and fine-tuning large pre-trained multilingual models

Years of implementation: 2024–2026

Partners: Plasma Science LLP

Funding amount: 89,979,146.58 tenge

Project Abstract

In recent years, the field of Natural Language Processing (NLP) has made significant progress, which has become evident not only to experts but also to the general public thanks to question-answering systems and chatbots. These innovative developments, the most notable of which is GPT, have become a hallmark of NLP, demonstrating their practical value to millions of users around the world.

However, implementing similar systems for low-resource languages such as Kazakh remains a challenge, primarily because of the lack of language resources and the high costs involved, especially the cost of high-performance clusters of graphics processing units (GPUs). This project aims to develop a high-performance question-answering system in the Kazakh language that operates on external knowledge sources in a specialized domain.

Under the commonly used classification, a question-answering system can be either open-domain (drawing on external and internal knowledge) or closed-domain (designed for a specialized, non-general area of knowledge). In this work, the system will be built on a transformer-based question-answering model for the Kazakh language.

The goal

The goal of this project is to increase the performance of question-answering models in the Kazakh language and reduce the cost of their development by optimizing and fine-tuning large pre-trained multilingual models that were created with almost unlimited resources by technology giants such as Google, Microsoft, OpenAI, Meta and others.

The project tasks

To achieve the project goal, three main tasks must be solved, each of which is in turn divided into three subtasks. So far, pre-trained models for question-answering systems in the Kazakh language have been prepared, and one of them (T5-Kazakh-QA) has been published on the HuggingFace platform. The current maturity level is assessed as TRL 2; TRL 3 is expected upon completion.

  1. Task 1 – to develop an economical and productive question-answer model in the Kazakh language.
  2. Task 2 – to develop a model for semantic classification of contexts of asked questions.
  3. Task 3 – to integrate the developed models and create a prototype of a question-answer system in the Kazakh language.
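
How the three tasks could fit together can be illustrated with a minimal sketch. It is not the project's implementation: the function names (`select_context`, `answer`), the `qa_model` callable, and the sample sentences are all hypothetical, and a simple token-overlap cosine similarity stands in for the semantic classification model of Task 2. The idea shown is only the routing step: pick the context most relevant to the question, then pass the question and that context to a question-answering model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-word characters (\w is Unicode-aware,
    # so Kazakh Cyrillic letters are kept).
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    # Cosine similarity between two token-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_context(question, contexts):
    # Hypothetical stand-in for the Task 2 model: choose the context
    # whose token counts are most similar to the question's.
    q = Counter(tokenize(question))
    scores = [cosine(q, Counter(tokenize(c))) for c in contexts]
    best = max(range(len(contexts)), key=scores.__getitem__)
    return best, scores[best]


def answer(question, contexts, qa_model):
    # Task 3 integration sketch: route the question to one context,
    # then call the QA model (any callable taking question and context).
    idx, score = select_context(question, contexts)
    return qa_model(question, contexts[idx]), idx, score
```

In a real system, `qa_model` would wrap a fine-tuned Kazakh question-answering model and the similarity function would use semantic embeddings rather than raw token counts; both are left abstract here on purpose.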

Stages of the project implementation

1 Development of an economical and productive question-answering model in the Kazakh language

2 Development of a semantic classification model for the contexts of the questions being asked

3 Development of a prototype of a question-answering system in the Kazakh language

The main results

The main results of this project will include:

1) a new cost-effective and productive question-answer model in the Kazakh language;

2) a new language-invariant model for the semantic classification of contexts of questions asked;

3) a prototype of an intelligent question-answer system in the Kazakh language.

SHOMANOV ADAI

Project principal

KAYRATULY BAUYRZHAN

Researcher

SHAKENOV ZHASULAN

Researcher

KADYRBEK NURGALI

Senior Researcher

TLEUBAYEVA ARAILYM

Senior Researcher

MANSUROVA AIGERIM

Junior Researcher

MAKHAMBETOVA ZHANSAYA

Junior Researcher

Expected project results

  • A question-answering model in the Kazakh language will be developed and published on the HuggingFace portal.
  • A comparative analysis of multilingual question-answering models will be conducted, and the features of their adaptation will be studied.
  • Methods for optimizing the parameters of selected multilingual models for adaptation to the Kazakh language will be explored and developed.
  • One article will be published in a peer-reviewed domestic journal recommended by the Committee for Quality Assurance in the Sphere of Higher Education and Science of the Republic of Kazakhstan.
  • Methods for fine-tuning selected multilingual models for adaptation to the Kazakh language will be studied and developed.
  • Methods for semantic classification of contexts, integrating selected semantic classification algorithms with semantic embedding models, will be explored and developed.
  • One article will be published in a peer-reviewed scientific journal indexed in the Science Citation Index Expanded database of Web of Science and/or with a CiteScore percentile of at least 50 in the Scopus database.
  • A prototype of a Kazakh-language question-answering system will be developed, and a web interface for connecting to the system will be created.
  • One further article will be published in a peer-reviewed scientific journal indexed in the Science Citation Index Expanded database of Web of Science and/or with a CiteScore percentile of at least 50 in the Scopus database.

Publications

Tleubayeva, A., & Shomanov, A. (2024). COMPARATIVE ANALYSIS OF MULTILINGUAL QA MODELS AND THEIR ADAPTATION TO THE KAZAKH LANGUAGE. Scientific Journal of Astana IT University, 19, 89–97. https://doi.org/10.37943/19WHRK2878

Results of 2024:

  • Kazakh models based on RoBERTa, such as roberta-kaz-large and roberta-large-kazqad, were developed and optimized; they showed high performance and accuracy in question-answering and ranking tasks.
  • The novelty of the work lies in the effective application of modern learning and optimization methods to the underrepresented Kazakh language, as well as in the creation of specialized datasets and models that help improve the quality of natural language processing for this language.
  • The following models were also published within the framework of the project:

1) Llama question-answering model in Kazakh.

2) GPT-j-3.4 model in Kazakh.

3) RoBERTa-Kaz-Large question-answering model for Kazakh.

4) Llama model in Kazakh.
