Project Manager: Shomanov A.S.
Funding source: GF for Young Scientists MSHE RK
Goal: Increasing the performance of question-answering models in the Kazakh language and reducing the cost of their development by optimizing and fine-tuning large pre-trained multilingual models
Years of implementation: 2024–2026
Partners: Plasma Science LLP
Funding amount: 89,979,146.58 tenge
In recent years, Natural Language Processing (NLP) has made significant progress that is visible not only to experts but also to the general public, thanks to question-answering systems and chatbots. These developments, the most notable of which is GPT, have become a hallmark of NLP and have demonstrated its practical value to millions of users around the world.
However, building similar systems for low-resource languages such as Kazakh remains a challenge, primarily because of scarce language resources and high costs, especially the cost of high-performance graphics processing unit (GPU) clusters. This project aims to develop a high-performance question-answering system in the Kazakh language that operates on external knowledge sources in a specialized domain.
According to the commonly used classification, a question-answering system can be open-domain (using external and internal knowledge) or closed-domain (designed for a specialized, non-general area of knowledge). In this work, the system will be based on a Kazakh-language question-answering model built on the transformer architecture.
The goal of this project is to increase the performance of question-answering models in the Kazakh language and to reduce the cost of their development by optimizing and fine-tuning large pre-trained multilingual models. Such models are built with almost unlimited resources by technology giants such as Google, Microsoft, OpenAI, Meta and others.
The project tasks
To achieve the project goal, three main tasks must be solved, each of which is divided into three subtasks. So far, pre-trained models for question-answering systems in the Kazakh language have been prepared, and one of them (T5-Kazakh-QA) has been published on the HuggingFace platform. The technology readiness level is assessed as TRL 2; TRL 3 is expected upon completion.
1) Development of an economical and productive question-answering model in the Kazakh language
2) Development of a semantic classification model for the contexts of the questions being asked
3) Development of a prototype of a question-answering system in the Kazakh language
The expected results of this project will include:
1) a new cost-effective and productive question-answering model in the Kazakh language;
2) a new language-invariant model for the semantic classification of the contexts of the questions asked;
3) a prototype of an intelligent question-answering system in the Kazakh language.
A question-answering model in the Kazakh language will be developed and published on the HuggingFace portal. A comparative analysis of multilingual question-answering models will be conducted, and the features of their adaptation will be studied. Methods for optimizing the parameters of selected multilingual models for adaptation to the Kazakh language will be explored and developed. One article will be published in a peer-reviewed domestic journal recommended by the Committee for Quality Assurance in the Sphere of Higher Education and Science of the Republic of Kazakhstan.
Methods for fine-tuning selected multilingual models for adaptation to the Kazakh language will be studied and developed. Methods for semantic classification of contexts, integrating selected semantic classification algorithms with semantic embedding models, will be explored and developed. One article will be published in a peer-reviewed scientific journal indexed in the Science Citation Index Expanded database of Web of Science and/or with a CiteScore percentile of at least 50 in the Scopus database.
A prototype of a Kazakh-language question-answering system will be developed, and a web interface for connecting to the system will be created. One article will be published in a peer-reviewed scientific journal indexed in the Science Citation Index Expanded database of Web of Science and/or with a CiteScore percentile of at least 50 in the Scopus database.
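The idea of combining a semantic embedding model with a classification algorithm for question contexts can be illustrated with a minimal sketch. It is not the project's method: it assumes a nearest-centroid classifier with cosine similarity as a stand-in for the classification algorithm, and uses hand-written three-dimensional toy vectors in place of real embeddings. The function names and the domain labels ("medicine", "law") are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fit_centroids(embeddings, labels):
    """Average the embeddings of each class to get one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(embedding, centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))


# Toy vectors standing in for context embeddings (assumption: in practice
# these would come from a multilingual semantic embedding model).
train_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
train_labels = ["medicine", "medicine", "law", "law"]  # hypothetical domains

centroids = fit_centroids(train_vectors, train_labels)
predicted = classify([0.7, 0.3, 0.0], centroids)
print(predicted)
```

Because the classifier only sees vectors, the same code applies regardless of the language of the underlying text, provided the embedding model maps texts from different languages into a shared space.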
Publications
Tleubayeva, A., & Shomanov, A. (2024). Comparative analysis of multilingual QA models and their adaptation to the Kazakh language. Scientific Journal of Astana IT University, 19, 89–97. https://doi.org/10.37943/19WHRK2878
1) Llama question-answering model in Kazakh.
2) GPT-j-3.4 model in Kazakh.
3) RoBERTa-Kaz-Large question-answering model for Kazakh.
4) Llama model in Kazakh.