Development of a system for identifying text borrowings based on combined methods and models for searching for incomplete duplicates, taking into account the Kazakh language

Project Manager: Doctor of Technical Sciences, Biloshchitskaya Svitlana Vasilievna

Source of funding: grant funding (GF) of the Ministry of Science and Higher Education of the Republic of Kazakhstan (MSHE RK)

Goal: To detect and prevent plagiarism and the unauthorized use of intellectual property by improving the existing system for determining the degree of uniqueness of scientific papers. To this end, a system for identifying text borrowings will be created, based on the developed combined methods and models for searching for partial duplicates and taking into account the Kazakh language.

Partners: PVLHOST LLP, Republic of Kazakhstan, Pavlodar, Mairy St., 29-101

Years of implementation: 2024-2026

Amount of funding: 97,752,196 tenge

Project objectives:

1. Analyze existing scientific developments and application software for finding near-duplicates in electronic text documents in three languages (Kazakh, English, and Russian).
2. Analyze methods for concealing borrowings that change the structure of a document's content without changing its meaning. Identify the structural changes that content components may undergo.
3. Develop models and methods for identifying near-duplicates in the text and images of electronic documents, on the basis of which an information system for detecting borrowings can be developed. The developed methods must detect near-duplicates even when documents have been significantly modified.
4. Develop methods for preparing content elements that neutralize the effect of methods for concealing borrowings. These methods must bring the structure of an electronic document to a standard general form that is the same for all file types.
5. Develop methods for optimizing the search for near-duplicates in documents in order to minimize search time.
6. Develop alternative models and methods for searching for near-duplicates based on N-gram analysis of text data.
7. Improve methods for indexing, canonization, and comparison of text written in the Kazakh language.
8. Improve existing methods for identifying near-duplicates based on the verifiable features of electronic documents, taking into account the language component.
9. Develop an information system for identifying text borrowings based on combined methods and models for searching for near-duplicates:

9.1 Develop a conceptual model of software for accumulating and processing information about scientific papers and scientific researchers.

9.2 Develop a structural model of an information system based on a microservice architecture.

9.3 Create an architectural database for the system for monitoring academic and scientific works.

9.4 Develop a structural model for constructing data management methods, methods for exporting and transferring data, and information backup.

9.5 Create a visualization module that presents information on matches in text arrays in three languages: Kazakh, English, and Russian.

9.6 Test the developed experimental information system for identifying text borrowings.

Over the entire project period, the results of the scientific project will be published as scientific articles, namely:

– at least 2 (two) articles or reviews in a peer-reviewed scientific publication indexed in the Social Science Citation Index, Arts and Humanities Citation Index, and (or) the Web of Science database and (or) having a CiteScore percentile in the Scopus database of at least 35 (thirty-five);

– at least 4 (four) articles and (or) reviews in peer-reviewed foreign and (or) domestic publications recommended by the Committee for Quality Assurance in Science and Higher Education (KOKSNVO).

Based on the results of the work, it is planned to obtain 1 (one) copyright certificate for the source code of specialized software for searching for near-duplicates, taking into account the Kazakh language. The members of the research group will be registered as authors of intellectual property.

Expected results:

Models and methods for identifying near-duplicates in the content of electronic documents in the text part of documents and images.
Methods for preparing content elements that should neutralize the impact of using methods to hide borrowings.
Alternative models and methods for searching for near-duplicates based on N-gram analysis of text data.

Members of the research team:

Full name

Role in the project and nature of work performed

Scopus Author ID, Hirsch Index, ResearcherID, ORCID

Biloshchitskaya Svitlana Vasilievna

Doctor of Technical Sciences (Information Technology)

Scientific Manager

Project management, implementation of all stages according to the project schedule and ensuring the required results.

Scopus Author ID 57194208505

h=14

https://www.scopus.com/authid/detail.uri?authorId=57194208505

Researcher ID AAR-7542-2020

ORCID 0000-0002-0856-5474

Toksanov Sapar Nurakhmetovich

PhD in Information Systems

Leading Researcher

Development of alternative models and methods for searching for near-duplicates based on N-gram analysis of text data.

Development of an information system for identifying text borrowings based on combined methods and models for searching for near-duplicates:

Development of a conceptual model of software for accumulating and processing information about scientific papers and scientific researchers.

Development of a structural model of an information system based on a microservice architecture.

Scopus Author ID 57222154960

h=5

https://www.scopus.com/authid/detail.uri?authorId=57222154960

Researcher ID AAH-7150-2019

ORCID 0000-0002-2915-9619

Kuchansky Alexander Yuryevich

Doctor of Technical Sciences (Information Technology)

Leading Researcher

Develop models and methods for identifying near-duplicates in the content of electronic documents in the text part of documents and images, on the basis of which it is possible to develop an information system for detecting plagiarism. The developed methods must ensure the detection of near-duplicates with significant modifications of documents.

Develop methods for preparing content elements that must neutralize the impact of using methods for hiding plagiarism. These methods must bring the structure of an electronic document to a standard general form, the same for all file types.

Scopus Author ID 57190488151

h=19

https://www.scopus.com/authid/detail.uri?authorId=57190488151

Researcher ID AAF-1964-2019

ORCID 0000-0003-1277-8031

Mukhataev Aidos Agdarbekovich

Candidate of Pedagogical Sciences

Senior Researcher

Improvement of existing methods for identifying near-duplicates to the verifiable features of electronic documents, taking into account the language component.

Scopus Author ID 57210173007

h=6

https://www.scopus.com/authid/detail.uri?authorId=57210173007

Researcher ID AAI-7490-2021

ORCID 0000-0002-8667-3200

Andrashko Yuri Vasilievich

Candidate of Technical Sciences (Information Technology)

Researcher

Develop methods for process optimization that involve searching for near-duplicates in documents in order to minimize search time.

Development of alternative models and methods for searching for near-duplicates based on N-gram analysis of text data.

Improvement of methods for indexing, canonization, and comparison of text information written in the Kazakh language.

Improvement of existing methods for identifying near-duplicates to the verifiable features of electronic documents, taking into account the language component.

Scopus Author ID 57194702818

h=16

https://www.scopus.com/authid/detail.uri?authorId=57194702818

Researcher ID F-6021-2019

ORCID 0000-0003-2306-8377

Sharipova Saltanat Erkinovna

PhD (Systems Engineering)

Researcher

Creation of an architectural database for a system for monitoring academic and scientific works.

Development of a structural model for constructing data management methods, methods for exporting and transferring data, and information backup.

Creation of a visualization module to ensure the submission of information on matches of text arrays, taking into account three languages: Kazakh, English, and Russian.

Scopus Author ID 57884433800

h=3

https://www.scopus.com/authid/detail.uri?authorId=57884433800

Researcher ID KVH-2721-2024

ORCID 0000-0001-7267-3261

Tleubaeva Arailym Orynbaykyzy

PhD student in the "Computer Science" educational program at Astana IT University

Researcher

Development of an information system for identifying text borrowings based on combined methods and models for searching for near-duplicates:

Development of a structural model of an information system based on microservice architecture.

Creation of an architectural database for a system for monitoring academic and scientific works.

Development of a structural model for constructing data management methods, data export and transmission methods, and information backup.

Scopus Author ID 58613980300

h=1

https://www.scopus.com/authid/detail.uri?authorId=58613980300

Researcher ID HHM-3840-2022

ORCID 0000-0001-9560-9756

Results of 2024:

Task code, stage

Name of the works under the Agreement and the main stages of its implementation

Result

Analysis of existing scientific developments and application software that allows finding near-duplicates in text electronic documents in three languages (Kazakh, English and Russian)

A review of popular software libraries and platforms for searching for partial duplicates in texts was conducted, including the following tools:

Apache Lucene / Elasticsearch – a full-text search system with similarity search capabilities (fuzzy search).
FuzzyWuzzy – a Python library for string similarity matching using the Levenshtein library.
SimString – a library for quickly searching for strings similar to a given string.
TextRazor – an API for text analysis and finding similar fragments with support for many languages.

As part of the analysis, these software solutions for finding partial duplicates in texts were studied in detail, with a special focus on their applicability to the Kazakh language. Each offers a different approach to comparing and analyzing text data, but support for Kazakh, with its complex morphology and syntax, varies greatly between the tools. The analysis showed that additional configuration is required to work correctly with the Kazakh language, especially in systems that support only Latin-script languages by default.

Testing on Kazakh texts showed that the most successful solution for searching for partial duplicates in this language is Elasticsearch with additional settings and a morphological analyzer for the Kazakh language. FuzzyWuzzy turned out to be less effective in processing texts with non-Latin characters, and TextRazor does not support the Kazakh language to the required extent, which makes it unsuitable for use in this project.

As a result, to search successfully for partial duplicates in Kazakh texts, further work will study Elasticsearch configured for morphological processing and tokenization of Kazakh words. This software solution provides the accuracy and flexibility needed to work with large volumes of data and to integrate specialized algorithms and tools for the Kazakh language.
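The effect of Kazakh morphology on character-level fuzzy matching can be illustrated with a minimal sketch. It is a plain Levenshtein distance written from scratch, not the FuzzyWuzzy or Elasticsearch implementation, and the function names `levenshtein` and `similarity` and the word pair are assumptions chosen for illustration. Adding a possessive suffix to one word already lowers the similarity noticeably, which is one reason a morphological analyzer is useful before comparison.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Illustrative pair: a Kazakh noun and its suffixed form.
print(levenshtein("кітап", "кітабы"))           # 2
print(round(similarity("кітап", "кітабы"), 2))  # 0.67
```

Because Python strings are sequences of code points, Cyrillic letters are handled the same way as Latin ones here; the drop in similarity comes from the morphology, not from the script.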

An analysis was conducted of monitoring and verification technologies that use standard search algorithms. These algorithms provide a fairly fast and effective search for complete or partial duplicates and guarantee correct text processing, but such systems cannot check images and mathematical formulas, which gives rise to problems that standard procedures cannot solve. Particular difficulties arise when analyzing mathematical formulas in the compared texts in order to identify similarities between them. In a text, a formula can appear as a picture or as a graphic object created with a formula editor, that is, a computer program designed to create and edit mathematical and other formulas. Formula editors are based on the following technologies:

use of a special markup language, such as TeX or MathML: the LaTeX editor, Math for the OpenOffice editor;
creation of formulas using a graphical interface: KFormula, MathType, WIRIS Editor, MathCast;
built-in components: Math Expression Editor Light;
symbolic calculations: Mathematica.

Two approaches to comparing formulas were considered: comparison using templates and comparison using converters. Searching by formulas is much more complicated than searching by text: for example, the formulas x² and a² are identical by template but differ in variable names.
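Template comparison can be sketched as renaming variables to positional placeholders before comparing, so that formulas differing only in variable names map to the same string. This is a minimal illustration and not the project's method: it assumes formulas are written in TeX-like notation, that every variable is a single Latin letter, and that TeX commands start with a backslash. The names `TOKEN` and `to_template` are hypothetical.

```python
import re

# Assumption: a variable is a single Latin letter; TeX commands such as
# \frac or \sin start with a backslash and are kept unchanged.
TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]")


def to_template(formula: str) -> str:
    """Rename variables to <1>, <2>, ... in order of first appearance."""
    names: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith("\\"):
            return token  # TeX command, not a variable
        if token not in names:
            names[token] = f"<{len(names) + 1}>"
        return names[token]

    compact = re.sub(r"\s+", "", formula)  # ignore spacing differences
    return TOKEN.sub(rename, compact)


print(to_template("x^2"))                             # <1>^2
print(to_template("a^2"))                             # <1>^2
print(to_template("x^2 + y") == to_template("a^2 + a"))  # False
```

The last line shows why placeholders are numbered: a formula with two distinct variables does not collapse onto one that repeats the same variable.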

It has been established that a promising direction for the automated analysis of mathematical formulas is the creation of converters that translate formulas from different formats (TeX, Equation, MathType) into the canonical XML/MathML format.

Analysis of methods for concealing borrowings in documents that change the structure of the document content without changing its meaning. Identification of structural changes that content components may undergo

1. The analysis revealed that detecting concealed borrowings in digital text documents has become a challenging task for plagiarism detection systems. To conceal borrowings, authors modify content in various ways, ranging from simple word substitutions to more complex structural modifications. This report discusses the key methods for concealing borrowings, structural content changes, and the models and methods for detecting partial duplicates that are used to detect borrowings even after significant text modifications.

2. The main methods for hiding borrowings in documents are defined:

2.1. Semantic changes: replacing keywords with synonyms, paraphrasing, replacing terms or phrases with simpler or more complex constructions.

2.2. Structural changes: changing the order of sentences or paragraphs, splitting or combining sentences and paragraphs, changing the text style (for example, from active to passive voice).

2.3. Changes in formatting: changing the font, adding or removing headings and subheadings.

2.4. Structural changes in content: rearranging paragraphs and sentences, adapting grammatical constructions, changing data presentation formats, for example, changing a list to a paragraph containing the same meaning.

3. An analysis of the main methods of manipulating documents aimed at complicating the process of detecting borrowings was conducted:

3.1. Paraphrasing – changing the wording of the text while maintaining the main meaning (replacing words with synonyms; changing the word order; using other grammatical constructions; converting active sentences into passive ones and vice versa).

3.2. Changing the order of sentences and paragraphs – reorganizing the structure of the text by rearranging sentences or paragraphs.

3.3. Replacing parts of the text with graphs or tables – a significant change in the structure of the document.

3.4. Intentional misuse of quotation – quoting with modifications or using incorrect references (breaking quotations into several parts; changing the attribution of quotations; mixing quotation with paraphrasing).

3.5. Translation from one language to another with subsequent paraphrasing.

3.6. Text compression or expansion techniques – shortening or adding explanations, examples, data and other details.

3.7. Fragmenting content into parts.

3.8. Using metaphors and analogies – using figurative expressions of thoughts to convey information.
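Of the manipulations listed above, reordering sentences and paragraphs (3.2) can be neutralized by comparing texts as unordered sets of normalized sentences. The sketch below is only an illustration under simple assumptions: sentences end with ".", "!" or "?", and normalization is limited to lowercasing and stripping punctuation. The function names are hypothetical, and the approach does nothing against paraphrasing or translation.

```python
import re


def sentence_set(text: str) -> set[str]:
    """Split text into sentences and normalize case, punctuation and spacing."""
    normalized = set()
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"\w+", sentence.lower())  # \w is Unicode-aware
        if words:
            normalized.add(" ".join(words))
    return normalized


def order_free_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two sentence sets; sentence order is ignored."""
    sa, sb = sentence_set(a), sentence_set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = "Plagiarism is a problem. Detection systems help. Authors hide borrowings."
shuffled = "Authors hide borrowings. Plagiarism is a problem. Detection systems help."
print(order_free_similarity(original, shuffled))  # 1.0
```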

Models and methods for identifying near-duplicates in the content of electronic documents in the text part of documents and images, on the basis of which it is possible to develop an information system for detecting borrowings. The developed methods should ensure the detection of near-duplicates with significant modifications of documents

1. Classification of models and methods for detecting partial duplicates:

1.1. N-gram-based methods.

1.2. Models based on vector representation of text.

1.3. Shingling (splitting text into substrings of fixed length).

1.4. Deep learning-based methods.

Among the methods considered, methods using N-grams, shingles, and deep learning models that provide a high level of accuracy when comparing documents are highlighted as promising for further development.
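As an illustration of the N-gram and shingling approaches, the following minimal sketch builds word-level shingles and compares two texts with the Jaccard coefficient. The shingle length of three words, the choice of Jaccard as the measure, and the function names are assumptions made for the example, not parameters taken from the project.

```python
import re


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping word n-grams (shingles) of a text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles common to both sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


source = "the quick brown fox jumps over the lazy dog"
copy = "the quick brown fox leaps over the lazy dog"
print(jaccard(shingles(source), shingles(copy)))  # 0.4
```

Replacing a single word changes every shingle that contains it, so one substitution in a nine-word sentence already removes three of the seven shingles on each side. Shorter shingles are more tolerant of such edits but produce more accidental matches.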

2. An analysis was conducted and a concept was presented for a hybrid method for detecting partial duplicates in tables. The method is expected to identify similarities in the text and numeric data of tables separately and then combine the results. For text data, sequences of words in canonical form are formed, from which bit sequences are constructed using locality-sensitive hashing; similarity is then calculated from the Hamming distance with a given threshold value. Similarity between the numeric data of tables is identified using the nearest-neighbor method with specified distance metrics. The method makes it possible to identify incomplete duplicates in the input table data by comparison with a set of tables selected from scientific publications, diploma theses, and dissertations.
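The two halves of the hybrid method can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: SimHash is used as one possible locality-sensitive hashing scheme, the fingerprint length of 64 bits and the Hamming threshold of 10 are arbitrary, Euclidean distance stands in for the unspecified metric, and all function names are hypothetical. How the text and numeric results are combined is not shown.

```python
import hashlib
import math

BITS = 64  # assumed fingerprint length


def simhash(words: list[str]) -> int:
    """SimHash fingerprint of a list of canonical word forms."""
    weights = [0] * BITS
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def text_cells_similar(a: list[str], b: list[str], threshold: int = 10) -> bool:
    """Text data match if fingerprints differ in at most `threshold` bits."""
    return hamming(simhash(a), simhash(b)) <= threshold


def nearest_row(row: list[float], candidates: list[list[float]]) -> tuple[int, float]:
    """Index of, and Euclidean distance to, the closest candidate row."""
    distances = [math.dist(row, c) for c in candidates]
    best = min(range(len(candidates)), key=distances.__getitem__)
    return best, distances[best]


print(nearest_row([1.0, 2.0], [[10.0, 10.0], [1.0, 2.5], [0.0, 0.0]]))  # (1, 0.5)
```

In practice the threshold and the distance metric would need to be tuned on real tables; the values above only make the sketch runnable.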