“Unsupervised term mining: a suite of models and datasets for high-tech domains and low-resource languages”

Project manager: Nugumanova Aliya Bagdatovna, PhD, Director of the Research Center “Big Data & Blockchain Technologies”

Source of funding: GF MSHE RK

Project goal: to ensure that supervised term extraction models remain competitive when trained on automatically generated training data.

Partners: Scientific and production company “Plasmascience”

Years of implementation: 2023–2025

Amount of funding: 94 625 358,5 tenge

Relevance of the project:

Modern supervised transformer models are the standard in natural language processing tasks, including term extraction. However, their use requires a large amount of labeled data, which is a serious problem for high-tech domains (e.g., materials science, blockchain) and low-resource languages, such as Kazakh. Manual annotation of such data is labor-intensive, costly, and requires the involvement of experts.

The project is relevant because it offers an approach to the automatic generation of training data, which makes it possible to overcome the shortage of manually labeled data. For the first time, the dependence of term extraction performance on the properties of the source corpus will be studied, and criteria for selecting high-quality texts will be developed. Also for the first time, an unsupervised annotator based on non-negative matrix factorization (NMF) will be created that takes into account the semantic coherence of terms using embeddings.

The project is aimed at creating sustainable solutions for term extraction in English and Kazakh, and the results will be validated by constructing knowledge graphs. This will facilitate the development of intelligent systems that support automatic text analysis in specialized areas and will expand the linguistic and technological coverage of modern NLP models.

Project objectives:

For 2023

– Develop efficient unsupervised annotators UA1 and UA2.

– Evaluate the performance of UA1 and UA2 annotators on the ACTER and ACL RD-TEC 2.0 datasets.

– Study the dependence of term extraction performance on corpus characteristics. Develop an efficient text corpus optimizer.

For 2024

– Create text corpora in the Materials Science and Blockchain domains in English and Kazakh.

– Automatically generate the Matcha dataset for the Materials Science and Blockchain domains in English and Kazakh. Evaluate the performance of UA1 and UA2 annotators on the test subset of the Matcha dataset.

For 2025

– Fine-tune the BERT transformer models on the training subset of the Matcha dataset for English and Kazakh.

– Evaluate the performance of the Matcha-BERT supervised term extraction models for English and Kazakh.

– Conduct a summary performance analysis of the developed unsupervised term extraction models and of the supervised term extraction models before and after fine-tuning. Implement knowledge graph creation cases in the Materials Science and Blockchain domains.
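
Before a token classifier such as BERT can be fine-tuned on automatically annotated text, the term annotations usually have to be converted into per-token labels. The following is a minimal sketch of that step, assuming a BIO labeling scheme, whitespace tokenization, and greedy longest-match lookup; the project description does not state which scheme the Matcha dataset uses, so `bio_tags` and the `B-TERM`/`I-TERM` labels are hypothetical names chosen for illustration.

```python
def bio_tags(tokens, terms):
    """Label tokens with B-TERM / I-TERM / O using greedy longest-match lookup."""
    # Try longer terms first so "blockchain ledger" wins over "blockchain".
    term_seqs = sorted((t.lower().split() for t in terms), key=len, reverse=True)
    lowered = [tok.lower() for tok in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for seq in term_seqs:
            n = len(seq)
            if n and lowered[i:i + n] == seq:
                tags[i] = "B-TERM"
                for j in range(i + 1, i + n):
                    tags[j] = "I-TERM"
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return tags


# Illustrative sentence and term list (not taken from the Matcha dataset).
tokens = "Proof of work secures the blockchain ledger".split()
terms = ["proof of work", "blockchain", "blockchain ledger"]
for token, tag in zip(tokens, bio_tags(tokens, terms)):
    print(f"{token}\t{tag}")
```

In practice the tags would also need to be aligned with the subword tokens produced by the model's tokenizer, which this sketch leaves out.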

Expected results:

  1. Scientific and technical products prepared as a result of the project implementation:

– a new efficient unsupervised term annotator;

– a new efficient text corpus optimizer;

– automatically generated (annotated) training datasets in the Materials Science and Blockchain domains in English and Kazakh;

– supervised term extraction models in the Materials Science and Blockchain domains in English and Kazakh;

– cases on developing knowledge graphs in the Materials Science and Blockchain domains.

  2. Scientific publications:

– 1 article or review will be published in a peer-reviewed foreign or domestic publication recommended by KOKSNVO.

– 3 articles will be published in peer-reviewed scientific publications indexed in the Science Citation Index Expanded of the Web of Science database and (or) having a CiteScore percentile in the Scopus database of at least 35.

Project Results

• An electronic structure of the GIS database was developed on the PostgreSQL platform.
• Information on dams and reservoirs of the Republic of Kazakhstan was collected and entered into the database from open sources.
• One article was published in a peer-reviewed domestic journal recommended by the Committee for Quality Assurance in Science and Higher Education (KOKSNVO).
• A graphical interface with cartographic information on existing dams was developed.
• A module based on neural network recognition of dams using internet maps was developed to enable user identification of these structures.
• A module was developed for the input, systematization, and storage of documentation related to geophysical monitoring and UAV (unmanned aerial vehicle) surveys.
• Project participants presented a report at the international conference “2024 International Conference on Information Science and Communications Technologies (ICISCT)” (November 2024, Seoul, Republic of Korea).

Project Team:

Full name, education, degree, academic title | Role in the project | H-index, Researcher ID, ORCID, Scopus Author ID

Nugumanova Aliya Bagdatovna, PhD in the specialty “Information Systems”

Scientific Director. Leading Researcher

H-index: 6.  

Researcher ID L-9616-2015.

ORCID 0000-0001-5522-4421.  

Scopus Author ID 55864815200. 

Bayburin Erzhan Mukhametkalievich

Senior Researcher

H-index: 4. ORCID: 0000-0002-1583-9912. Scopus Author ID: 56111999400. Researcher ID: –  

Alimzhanov Ermek Serikovich, Master's degree

Senior Researcher

H-index: 2

Scopus Author ID: 57191433356. ORCID: https://orcid.org/0000-0002-8758-2220

Alzhanov Almas Mirzhanovich, doctoral student

Researcher

H-index: 1

ORCID 0009-0007-8083-2366. 

Scopus Author ID 58859587600. 

Mansurova Aigerim Kanatkyzy

Researcher

H-index: 1

ORCID  0009-0003-1978-9574
Scopus ID: 58614576700

Mansurova Aiganym Kanatkyzy

Researcher

H-index Scopus: 1

ORCID 0009-0007-9076-0722

Scopus ID: 59233698800

Kalykulova Aliya Maratovna

Researcher

ORCID 0009-0006-5641-3797

Results achieved:

 

  1. Scientific and technical products prepared as a result of the project implementation:
  • Effective unsupervised annotators UA1 and UA2 were developed.
  • Performance estimates of the annotators UA1 and UA2 on the ACTER and ACL RD-TEC 2.0 datasets were obtained.
  • An effective text corpus optimizer was developed.
  • The Matcha dataset was created in the Materials Science and Blockchain domains in English and Kazakh.
  • The performance of the annotators UA1 and UA2 on the test subset of the Matcha dataset was estimated.
  • A new term extraction method T-Extractor was developed.
  2. Scientific publications:

 

  1. Semantic Non-Negative Matrix Factorization for Term Extraction

 

This study introduces an unsupervised term extraction approach that combines non-negative matrix factorization (NMF) with word embeddings. Inspired by a pioneering semantic NMF method that employs regularization to jointly optimize document–word and word–word matrix factorizations for document clustering, we adapt this strategy for term extraction. Typically, a word–word matrix representing semantic relationships between words is constructed using cosine similarities between word embeddings. However, it has been established that transformer encoder embeddings tend to reside within a narrow cone, leading to consistently high cosine similarities between words. To address this issue, we replace the conventional word–word matrix with a word–seed submatrix, restricting columns to ‘domain seeds’—specific words that encapsulate the essential semantic features of the domain. Therefore, we propose a modified NMF framework that jointly factorizes the document–word and word–seed matrices, producing more precise encoding vectors for words, which we utilize to extract high-relevancy topic-related terms. Our modification significantly improves term extraction effectiveness, marking the first implementation of semantically enhanced NMF, designed specifically for the task of term extraction. Comparative experiments demonstrate that our method outperforms both traditional NMF and advanced transformer-based methods such as KeyBERT and BERTopic. To support further research and application, we compile and manually annotate two new datasets, each containing 1000 sentences, from the ‘Geography and History’ and ‘National Heroes’ domains. These datasets are useful for both term extraction and document classification tasks. All related code and datasets are freely available.

Nugumanova A. et al. Semantic Non-Negative Matrix Factorization for Term Extraction // Big Data and Cognitive Computing. – 2024. – Vol. 8. – No. 7. – P. 72. doi.org/10.3390/bdcc8070072.
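
The joint factorization described in this abstract can be illustrated with a small sketch. It assumes a squared-error objective of the form ‖A − WH‖² + λ‖S − WQ‖², where A is a word–document matrix, S is a word–seed similarity matrix, and the word encoding matrix W is shared between the two factorizations; it uses standard multiplicative updates. This is a generic coupled-NMF formulation written in plain Python for readability, not necessarily the exact objective or regularization used in the paper, and all matrix values and names (`joint_nmf`, `lam`) are made up for illustration.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def sq_error(a, b):
    """Squared Frobenius distance between two matrices of equal shape."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def joint_loss(A, S, W, H, Q, lam):
    return sq_error(A, matmul(W, H)) + lam * sq_error(S, matmul(W, Q))


def mult_update(M, num, den, eps=1e-9):
    """Element-wise multiplicative update; entries stay non-negative."""
    return [[m * n / (d + eps) for m, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(M, num, den)]


def weighted_sum(a, b, lam):
    return [[x + lam * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def joint_nmf(A, S, k, lam=1.0, iters=200, seed=0):
    """Approximate A (words x docs) by W H and S (words x seeds) by W Q, sharing W."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

    W, H, Q = rand(len(A), k), rand(k, len(A[0])), rand(k, len(S[0]))
    for _ in range(iters):
        Wt = transpose(W)
        WtW = matmul(Wt, W)
        H = mult_update(H, matmul(Wt, A), matmul(WtW, H))
        Q = mult_update(Q, matmul(Wt, S), matmul(WtW, Q))
        Ht, Qt = transpose(H), transpose(Q)
        num = weighted_sum(matmul(A, Ht), matmul(S, Qt), lam)
        gram = weighted_sum(matmul(H, Ht), matmul(Q, Qt), lam)
        W = mult_update(W, num, matmul(W, gram))
    return W, H, Q


# Toy data (hypothetical): 4 words, 3 documents, 2 domain seeds.
words = ["blockchain", "ledger", "alloy", "annealing"]
A = [[2, 1, 0], [1, 2, 0], [0, 0, 3], [0, 0, 2]]           # word-document counts
S = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]       # word-seed similarities

W, H, Q = joint_nmf(A, S, k=2)
for word, encoding in zip(words, W):
    print(word, [round(v, 2) for v in encoding])
```

Each row of `W` is the encoding vector of one word; ranking words by their weight within a component is one simple way to pick candidate terms for the corresponding topic.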

 

  2. QA-RAG: Exploring LLM reliance on external knowledge

 

Large language models (LLMs) can store factual knowledge within their parameters and have achieved superior results in question-answering tasks. However, challenges persist in providing provenance for their decisions and keeping their knowledge up to date. Some approaches aim to address these challenges by combining external knowledge with parametric memory. In contrast, our proposed QA-RAG solution relies solely on the data stored within an external knowledge base, specifically a dense vector index database. In this paper, we compare RAG configurations using two LLMs—Llama 2 7b and 13b—systematically examining their performance in three key RAG capabilities: noise robustness, knowledge gap detection, and external truth integration. The evaluation reveals that while our approach achieves an accuracy of 83.3%, showcasing its effectiveness across all baselines, the model still struggles significantly in terms of external truth integration. These findings suggest that considerable work is still required to fully leverage RAG in question-answering tasks.

Mansurova A., Mansurova A., Nugumanova A. QA-RAG: Exploring LLM reliance on external knowledge // Big Data and Cognitive Computing. – 2024. – Vol. 8. – No. 9. – P. 115. doi.org/10.3390/bdcc8090115.
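
To make the retrieval step behind such a setup concrete, here is a minimal sketch of answering from an external dense vector index only. The three-dimensional vectors are hand-written stand-ins for real sentence embeddings, and the function names, the similarity threshold, and the prompt wording are assumptions made for illustration; they are not taken from the QA-RAG paper.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_vec, index, k=2, min_score=0.5):
    """Return up to k (score, passage) pairs whose similarity reaches min_score."""
    scored = sorted(((cosine(query_vec, vec), text) for text, vec in index), reverse=True)
    return [(score, text) for score, text in scored[:k] if score >= min_score]


def build_prompt(question, passages):
    """Build a context-only prompt, or None when nothing relevant was retrieved."""
    if not passages:
        return None  # knowledge gap: the index holds no supporting passage
    context = "\n".join(f"- {text}" for _, text in passages)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {question}"


# Toy index with hypothetical embeddings (a real system would use an encoder model).
INDEX = [
    ("A blockchain is an append-only ledger shared among nodes.", [0.9, 0.1, 0.0]),
    ("Proof of work requires miners to solve a hash puzzle.", [0.2, 0.9, 0.1]),
    ("Perovskites are materials with the ABX3 crystal structure.", [0.0, 0.1, 0.95]),
]

hits = retrieve([0.1, 0.95, 0.05], INDEX)
print(build_prompt("How does proof of work operate?", hits))
```

Returning `None` when no passage clears the threshold is one simple way to surface a knowledge gap to the caller instead of letting the model answer from its parametric memory.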

 

  3. Development of a question answering chatbot for blockchain domain.

 

Large Language Models (LLMs), such as ChatGPT, have transformed the field of natural language processing with their capacity for language comprehension and generation of human-like, fluent responses for many downstream tasks. Despite their impressive capabilities, they often fall short in domain-specific and knowledge-intensive tasks due to a lack of access to relevant data. Moreover, most state-of-the-art LLMs lack transparency as they are often accessible only through APIs. Furthermore, their application in critical real-world scenarios is hindered by their proclivity to produce hallucinated information and inability to leverage external knowledge sources. To address these limitations, we propose an innovative system that enhances LLMs by integrating them with an external knowledge management module. The system allows LLMs to utilize data stored in vector databases, providing them with relevant information for their responses. Additionally, it enables them to retrieve information from the Internet, further broadening their knowledge base. The research approach circumvents the need to retrain LLMs, which can be a resource-intensive process. Instead, it focuses on making more efficient use of existing models. Preliminary results indicate that the system holds promise for improving the performance of LLMs in domain-specific and knowledge-intensive tasks. By equipping LLMs with real-time access to external data, it is possible to harness their language generation capabilities more effectively, without the need to continually strive for larger models.

Mansurova A., Nugumanova A., Makhambetova Z. Development of a question answering chatbot for blockchain domain // Scientific Journal of Astana IT University. – 2023. – P. 27–40. doi.org/10.37943/15XNDZ6667.

 

  4. Evaluation Of IBM’s Proposed Term Extraction Approach On The ACTER Corpus

Automated term extraction calls for more efficient and precise methods. IBM researchers have proposed an unsupervised annotator aimed at extracting highly technical domain terms. This approach utilizes sentence encoders and analysis of morphological signals, term-to-topic relationships, and similarities within terms. In this paper, we reimplement the method proposed by IBM from scratch and test it on the ACTER dataset. Additionally, our experiments include an analysis of incorrectly extracted n-grams that may adversely affect the quality of the unsupervised annotator. On the ACL RD-TEC 2.0 dataset, our reimplementation achieves an F1-score of 44.8%, which is 5.15% lower than the IBM approach. On the ACTER dataset, our results are similar to those of other advanced methods evaluated on this dataset.

Kalykulova A., Kairatuly B., Rakhymbek K., Kyzyrkanov A., Nugumanova A. Evaluation Of IBM’s Proposed Term Extraction Approach On The ACTER Corpus // IX International Scientific Conference “Computer Science and Applied Mathematics”. – Almaty: Institute of Information and Computational Technologies CS MSHE RK, 2024. – P. 597–604. https://conf.iict.kz/wp-content/uploads/2025/01/collection_CSAM_IX_2024.pdf
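
For readers unfamiliar with how an F1-score is obtained for term extraction, the sketch below compares a set of extracted terms with a gold-standard set. It assumes exact matching after lowercasing and trimming, which may differ from the matching protocol used in the paper; the term lists are invented for illustration.

```python
def term_prf(predicted, gold):
    """Return (precision, recall, F1) for extracted terms against gold terms."""
    pred = {t.lower().strip() for t in predicted}
    ref = {t.lower().strip() for t in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotator output and gold annotations.
predicted = ["Blockchain", "smart contract", "the network", "hash function"]
gold = ["blockchain", "smart contract", "hash function", "consensus mechanism", "proof of work"]

p, r, f1 = term_prf(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.75 recall=0.60 f1=0.67
```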

 

  5. Evaluation and comparison of the quality of word embeddings

 

This article explores approaches to evaluating embedding quality, which can be divided into intrinsic and extrinsic methods. Intrinsic methods assess representations independently of specific tasks, while extrinsic methods rely on NLP tasks for evaluation. Particular attention is given to the evaluation of semantic similarity using datasets such as WordSim-353, SimLex-999, and SimVerb-3500. Pretrained models FastText and SentenceBERT were used for the evaluation. The results show that FastText models demonstrate high correlation scores and outperform SentenceBERT in representing individual words. Although SentenceBERT offers advantages in tasks like semantic similarity search and clustering, it is less effective for individual word representation. The choice of model should be based on empirical evidence and the specific requirements of the task.

Alzhanov A. M., Rakhymbek K. K. Evaluation and comparison of the quality of word embeddings // Problems of Optimization of Complex Systems: Proceedings of the XX International Asian School-Seminar. – Almaty, 2024. – P. 211–215. (In Russian.) https://conf.iict.kz/wp-content/uploads/2024/09/opcs_material_2024.pdf.
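
Intrinsic evaluation on word-similarity datasets is commonly done by correlating human similarity ratings with the cosine similarity of the corresponding embeddings. The sketch below shows that procedure with Spearman rank correlation implemented from scratch; the word pairs, ratings, and two-dimensional vectors are invented for illustration and are not drawn from WordSim-353, SimLex-999, SimVerb-3500, or any pretrained model.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread = math.sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return cov / spread if spread else 0.0


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(pairs, vectors):
    """Correlate human ratings with model cosine similarities over word pairs."""
    human = [score for _, _, score in pairs]
    model = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in pairs]
    return spearman(human, model)


# Hypothetical ratings and embeddings.
pairs = [("car", "automobile", 9.0), ("car", "road", 6.0), ("car", "banana", 1.0)]
vectors = {"car": [1.0, 0.0], "automobile": [0.9, 0.1], "road": [0.6, 0.5], "banana": [0.0, 1.0]}
print(f"Spearman correlation: {evaluate(pairs, vectors):.3f}")
```

A higher correlation means the model orders word pairs more like human annotators do, which is the sense in which one embedding model "outperforms" another in this kind of evaluation.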
