Project Manager: Doctor of Technical Sciences, Biloshchitskaya Svitlana Vasilievna
Source of funding: GF MSHE OF RK
Goal: To detect and prevent plagiarism and the unauthorized use of intellectual property by improving the existing system for determining the degree of uniqueness of scientific papers. This is to be achieved by creating a system for identifying text borrowings, based on the developed combined methods and models for searching for partial duplicates and taking the Kazakh language into account.
Partners: PVLHOST LLP, Republic of Kazakhstan, Pavlodar, Mairy St., 29-101
Years of implementation: 2024-2026
Amount of funding: 97,752,196 tenge
9.1 Develop a conceptual model of software for accumulating and processing information about scientific papers and scientific researchers.
9.2 Develop a structural model of an information system based on a microservice architecture.
9.3 Create an architectural database for the system for monitoring academic and scientific works.
9.4 Develop a structural model for constructing data management methods, methods for exporting and transferring data, and information backup.
9.5 Create a visualization module that presents information on matches in text arrays, taking into account three languages: Kazakh, English, and Russian.
9.6 Test the developed experimental information system for identifying text borrowings.
– at least 2 (two) articles or reviews in a peer-reviewed scientific publication indexed in the Social Science Citation Index, Arts and Humanities Citation Index, and (or) the Web of Science database and (or) having a CiteScore percentile in the Scopus database of at least 35 (thirty-five);
– at least 4 (four) articles and (or) reviews in peer-reviewed foreign and (or) domestic publications recommended by the Committee for Quality Assurance in Science and Higher Education of the Ministry of Science and Higher Education of the Republic of Kazakhstan.
Full name
Role in the project and nature of work performed
Scopus Author ID, Hirsch Index, ResearcherID, ORCID
Biloshchitskaya Svitlana Vasilievna
Doctor of Technical Sciences (Information Technology)
Scientific Manager
Project management, implementation of all stages according to the project schedule and ensuring the required results.
Scopus Author ID 57194208505
h=14
https://www.scopus.com/authid/detail.uri?authorId=57194208505
Researcher ID AAR-7542-2020
ORCID 0000-0002-0856-5474
Toksanov Sapar Nurakhmetovich
PhD in Information Systems
Leading Researcher
Development of alternative models and methods for searching for near-duplicates based on N-gram analysis of text data.
Development of an information system for identifying text borrowings based on combined methods and models for searching for near-duplicates:
Development of a conceptual model of software for accumulating and processing information about scientific papers and scientific researchers.
Development of a structural model of an information system based on a microservice architecture.
Scopus Author ID 57222154960
h=5
https://www.scopus.com/authid/detail.uri?authorId=57222154960
Researcher ID AAH-7150-2019
ORCID 0000-0002-2915-9619
Kuchansky Alexander Yuryevich
Doctor of Technical Sciences (Information Technology)
Leading Researcher
Develop models and methods for identifying near-duplicates in the content of electronic documents in the text part of documents and images, on the basis of which it is possible to develop an information system for detecting plagiarism. The developed methods must ensure the detection of near-duplicates with significant modifications of documents.
Develop methods for preparing content elements that must neutralize the impact of using methods for hiding plagiarism. These methods must bring the structure of an electronic document to a standard general form, the same for all file types.
Scopus Author ID 57190488151
h=19
https://www.scopus.com/authid/detail.uri?authorId=57190488151
Researcher ID AAF-1964-2019
ORCID 0000-0003-1277-8031
Mukhataev Aidos Agdarbekovich
Candidate of Pedagogical Sciences
Senior Researcher
Improvement of existing methods for identifying near-duplicates to the verifiable features of electronic documents, taking into account the language component.
Scopus Author ID 57210173007
h=6
https://www.scopus.com/authid/detail.uri?authorId=57210173007
Researcher ID AAI-7490-2021
ORCID 0000-0002-8667-3200
Andrashko Yuri Vasilievich
Candidate of Technical Sciences (Information Technology)
Researcher
Develop methods for optimizing the processes involved in searching for near-duplicates in documents in order to minimize search time.
Development of alternative models and methods for searching for near-duplicates based on N-gram analysis of text data.
Improvement of methods for indexing, canonization, and comparison of text information written in the Kazakh language.
Improvement of existing methods for identifying near-duplicates to the verifiable features of electronic documents, taking into account the language component.
Scopus Author ID 57194702818
h=16
https://www.scopus.com/authid/detail.uri?authorId=57194702818
Researcher ID F-6021-2019
ORCID 0000-0003-2306-8377
Sharipova Saltanat Erkinovna
PhD (Systems Engineering)
Researcher
Creation of an architectural database for a system for monitoring academic and scientific works.
Development of a structural model for constructing data management methods, methods for exporting and transferring data, and information backup.
Creation of a visualization module to ensure the submission of information on matches of text arrays, taking into account three languages: Kazakh, English, and Russian.
Scopus Author ID 57884433800
h=3
https://www.scopus.com/authid/detail.uri?authorId=57884433800
Researcher ID KVH-2721-2024
ORCID 0000-0001-7267-3261
Tleubaeva Arailym Orynbaykyzy
PhD student in the educational program “Computer Science” at Astana IT University
Researcher
Development of an information system for identifying text borrowings based on combined methods and models for searching for near-duplicates:
Development of a structural model of an information system based on microservice architecture.
Creation of an architectural database for a system for monitoring academic and scientific works.
Development of a structural model for constructing data management methods, data export and transmission methods, and information backup.
Scopus Author ID 58613980300
h=1
https://www.scopus.com/authid/detail.uri?authorId=58613980300
Researcher ID HHM-3840-2022
ORCID 0000-0001-9560-9756
Results of 2024:
Task code, stage | Name of the works under the Agreement and the main stages of its implementation | Result |
1 |
Analysis of existing scientific developments and application software that allows finding near-duplicates in text electronic documents in three languages (Kazakh, English and Russian) |
A review of popular software libraries and platforms for searching for partial duplicates in texts was conducted, including the following tools:
As part of the analysis, these software solutions for finding partial duplicates in texts were studied in detail, with a special focus on their application to the Kazakh language. Each offers a different approach to comparing and analyzing text data, but support for the Kazakh language, with its complex morphology and syntax, varies greatly between the tools. The analysis showed that additional configuration is required to work correctly with Kazakh, especially in systems that by default support only Latin-script languages. Testing on Kazakh texts showed that the most successful solution for searching for partial duplicates in this language is Elasticsearch with additional configuration and a morphological analyzer for the Kazakh language. FuzzyWuzzy proved less effective at processing texts with non-Latin characters, and TextRazor does not support the Kazakh language to the required extent, which makes it unsuitable for this project. As a result, further work will study Elasticsearch configured for the morphology and tokenization of Kazakh words. This solution provides the necessary accuracy and flexibility for working with large volumes of data and allows the integration of specialized algorithms and tools for the Kazakh language.
Monitoring and verification technologies that use standard search algorithms were analyzed; they provide a fairly fast and effective search for complete or partial duplicates and guarantee correct text processing. It was found, however, that such systems cannot check images and mathematical formulas, which raises problems that standard procedures cannot solve. Particular difficulties arise when analyzing mathematical formulas in the compared texts to identify similarities between them. In a text, a formula can be given as a picture or as a graphic object created with a formula editor, that is, a computer program designed to create and edit mathematical and other formulas. Formula editors are based on the following technologies:
Comparison of formulas using templates. Searching by formulas is much more complicated than searching by text (for example, the formulas x² and a² are identical by template but differ in variable names).
Comparison of formulas using converters. It has been established that a promising direction for the automated analysis of mathematical formulas is the creation of converters that translate formulas from different formats (TeX, Equation, MathType) into the canonical XML/MathML format. |
|
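The template idea from the result above (x² and a² match by template although the variable names differ) can be illustrated with a minimal Python sketch. It is not the project's method: it assumes formulas have already been converted to a plain TeX-like string, and the helper names `formula_template` and `same_template` are hypothetical. Every run of Latin letters is treated as a variable, so function names such as `sin` would also be renamed, which a real converter to MathML would have to handle properly.

```python
import re


def formula_template(formula: str) -> str:
    """Return the formula with variables replaced by positional placeholders.

    Assumption: the input is a plain TeX-like string such as "x^2 + y".
    Each distinct run of Latin letters gets a placeholder (v0, v1, ...)
    in order of first appearance.
    """
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]

    compact = re.sub(r"\s+", "", formula)  # ignore spacing differences
    return re.sub(r"[A-Za-z]+", rename, compact)


def same_template(first: str, second: str) -> bool:
    """True when two formulas differ only in the names of their variables."""
    return formula_template(first) == formula_template(second)


if __name__ == "__main__":
    print(formula_template("x^2"))              # v0^2
    print(same_template("x^2", "a^2"))          # True
    print(same_template("x^2 + y", "a^2 + a"))  # False
```

Positional placeholders keep the distinction between a formula that reuses one variable and a formula that uses two different ones, which a plain "replace every letter with the same symbol" rule would lose.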
2 |
Analysis of methods for concealing borrowings in documents that allow changing the structure of the document content without changing its meaning. Identification of the structural changes that content components may undergo
|
|
|
3 |
Models and methods for identifying near-duplicates in the content of electronic documents in the text part of documents and images, on the basis of which it is possible to develop an information system for detecting borrowings. The developed methods should ensure the detection of near-duplicates with significant modifications of documents
|
1.1. N-gram-based methods.
1.2. Models based on vector representation of text.
1.3. Shingling (splitting text into substrings of fixed length).
1.4. Deep learning-based methods.
Among the methods considered, methods using N-grams, shingles, and deep learning models that provide a high level of accuracy when comparing documents are highlighted as promising for further development.
2. An analysis was conducted and a concept was presented for a hybrid method for detecting partial duplicates in tables. The method is expected to identify similarities in the text and numeric data of tables separately and then combine the results. For text data, sequences of words in canonical form are formed, from which bit sequences are constructed using locality-sensitive hashing; similarity is then calculated from the Hamming distance against a given threshold value. Similarity between the numeric data of tables is identified using the nearest-neighbor method with specified distance metrics. The method identifies partial duplicates in the input table data by comparison with a set of tables selected from scientific publications, diploma theses, and dissertations. |
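The text branch of the hybrid method (word sequences, bit sequences from locality-sensitive hashing, Hamming distance with a threshold) can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: SimHash is used here as one common locality-sensitive hashing scheme, lowercasing stands in for the canonical word form (no Kazakh morphological analysis is performed), and the function names, the shingle size `k=3`, and the threshold of 16 bits are all hypothetical choices.

```python
import hashlib


def word_shingles(text: str, k: int = 3) -> set:
    """Split text into overlapping sequences of k lowercased words."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(shingles: set, bits: int = 64) -> int:
    """Build a SimHash fingerprint: similar shingle sets give close bit strings."""
    counts = [0] * bits
    for shingle in shingles:
        digest = hashlib.sha1(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each shingle votes +1 or -1 for every bit position.
            counts[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(first: str, second: str, threshold: int = 16) -> bool:
    """Compare two texts by the Hamming distance of their fingerprints."""
    distance = hamming(simhash(word_shingles(first)), simhash(word_shingles(second)))
    return distance <= threshold


if __name__ == "__main__":
    cell = "Қазақ тілі мәтіндерін өңдеу әдістері"
    print(is_near_duplicate(cell, cell))  # True: identical text, distance 0
```

Because the shingles are encoded as UTF-8 before hashing, the sketch works the same way for Kazakh, Russian, and English text. In practice the threshold would have to be tuned on real table data.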