Document Plagiarism Detection Using a New Concept Similarity in Formal Concept Analysis

This paper proposes an algorithm for document plagiarism detection using the provided incremental knowledge construction with formal concept analysis (FCA). The incremental knowledge construction is presented to support document matching between the source document in storage and the suspect document. Thus, a new concept similarity measure is also proposed for retrieving formal concepts in the knowledge construction. The presented concept similarity employs appearance frequencies in the obtained knowledge construction. Our approach can be applied to retrieve relevant information because the obtained structure uses FCA in concept form that is definable by a conjunction of properties. This measure is mathematically proven to be a formal similarity metric. The performance of the proposed similarity measure is demonstrated in document plagiarism detection. Moreover, this paper provides an algorithm to build the information structure for document plagiarism detection. Thai text test collections are used for performance evaluation of the implemented web application.


Introduction
Recently, plagiarism has increased because of easy access to data on the World Wide Web. For this reason, producing a written document can be easy and quick [1][2][3][4]. However, plagiarism or copying in a different style is a problem in education, research, publications, and other contexts. Software for detecting such problems has been mostly developed based on text string comparisons [1,2,5,6]. Grouping prior documents based on their similarity has been demonstrated to reduce the search time. Formal concept analysis (FCA) is widely used to identify groups of objects sharing common attributes [7][8][9][10]. This work focuses on using FCA to group documents.
FCA is a popular approach for knowledge representation and data analysis in many applications and has become popular in information and document retrieval [11][12][13][14][15]. Document retrieval is used to retrieve the plagiarism candidate documents, so the suspect document is the query, while the stored source documents are retrieved [2,16,17]. FCA is one approach for grouping documents in a hierarchy that supports browsing. It automatically provides generalization and specialization relationships among the formal concepts for documents represented in a concept lattice [10][11][12][13][14][15][16][17][18]. Thus, this work applied FCA to detect document plagiarism. Moreover, this method provides related documents or groups of documents to the user. However, the application requires a similarity measure to retrieve source documents or to identify groups of similar documents in a concept hierarchy. Thus, the choice of the concept similarity measure is a challenging problem for identifying different concepts that are semantically close.
Many measures have been proposed for concept similarity based on set theory in binary weighting form (e.g., [19][20][21][22][23][24][25]). However, the weights determined from all content can be used to improve the precision of concept retrieval. Formica [12,13,26] used information in this manner. Later, concept similarity measures were developed with flexibility to adapt to user preferences (e.g., [24,25,27,28]). The previous studies mostly used only intensions, instead of both intensions and extensions of the formal concept. In addition, the appearance frequency of formal concepts could be useful for improved concept retrieval.
Thus, this paper proposes such similarity measures for formal concepts that use the above ideas. Later, mathematical proof is provided that a formal similarity metric has been defined. Concept similarity of FCA has gained importance from its application to plagiarism detection, which has to assess the similarity between formal concepts to find relevant information. We present and investigate a candidate algorithm to support plagiarism detection with the proposed concept similarity measures. Finally, plagiarism detection test cases are evaluated from collected Thai text data.
This report is organized as follows. Section 2 provides details of the formal concept analysis, Section 3 discusses related prior work, Section 4 presents the research methodology and the proposed system, Section 5 has results and discussion, and Section 6 is the conclusion.

Formal Concept Analysis
Formal concept analysis (FCA) is applied in many fields for data analysis and knowledge representation [9,10,14]. In this section, the basic definitions are presented to understand the useful notions of FCA, following the book [29]. We start with a formal context with the following definitions.
A formal context $K = (G, M, I)$ consists of two sets $G$ and $M$ and a relation $I$ between $G$ and $M$. The elements of $G$ are called the objects, and the elements of $M$ are called the attributes of the context. $(g, m) \in I$ expresses that an object $g$ is in relation $I$ with an attribute $m$ and is read as "the object $g$ has the attribute $m$," also denoted by $gIm$. For a set $A \subseteq G$ of objects, $A'$ is defined as $A' := \{m \in M \mid gIm \text{ for all } g \in A\}$. Correspondingly, for a set $B \subseteq M$ of attributes, $B'$ is defined as $B' := \{g \in G \mid gIm \text{ for all } m \in B\}$. A formal concept of the formal context $(G, M, I)$ is a pair $(A, B)$ with $A \subseteq G$, $B \subseteq M$, $A' = B$, and $B' = A$. We call $A$ the extent and $B$ the intent of the formal concept $(A, B)$. $\mathfrak{B}(G, M, I)$ denotes the set of all formal concepts of the formal context $(G, M, I)$.
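The two derivation operators and the definition of a formal concept can be illustrated with a minimal Python sketch. The toy context, the names `CONTEXT`, `common_attributes`, `common_objects`, and `formal_concepts` are assumptions made for illustration, and the brute-force enumeration is only suitable for very small contexts.

```python
from itertools import combinations

# Toy formal context (hypothetical data): objects are documents, attributes are keywords.
CONTEXT = {
    "d1": {"fca", "lattice"},
    "d2": {"fca", "plagiarism"},
    "d3": {"fca", "lattice", "plagiarism"},
}
ATTRIBUTES = set().union(*CONTEXT.values())


def common_attributes(objects):
    """A' : attributes shared by every object in A (all of M when A is empty)."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return result


def common_objects(attributes):
    """B' : objects that have every attribute in B."""
    return {g for g, attrs in CONTEXT.items() if set(attributes) <= attrs}


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every subset of objects."""
    found = set()
    objects = sorted(CONTEXT)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset)   # A'
            extent = common_objects(intent)      # A''
            found.add((frozenset(extent), frozenset(intent)))
    return found
```

For this toy context, the pair ({d1, d3}, {fca, lattice}) is one of the formal concepts: both documents share exactly these keywords, and no other document has both of them.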
The above definitions describe a group of documents and their shared keywords. In practice, they are used to identify groups of source documents sharing common keywords. FCA is, as such, applicable to a formal context that contains only binary values, 0 or 1. However, a typical database holds collected data that are not restricted to binary values. Since the database holds a finite set of objects and their attributes, the set of attribute values is also finite; such data form a many-valued context. A many-valued context $(G, M, W, I)$ consists of sets $G$, $M$, and $W$ and a ternary relation $I$ between $G$, $M$, and $W$ (i.e., $I \subseteq G \times M \times W$) for which $(g, m, w) \in I$ and $(g, m, v) \in I$ always imply $w = v$. The elements of $G$ are called objects, those of $M$ (many-valued) attributes, and those of $W$ attribute values. If $W$ has $n$ distinct elements, the context is called an $n$-valued context. The condition in the above definition states that there is at most one attribute value for an object and an attribute, so we again have an information matrix with single entries for object rows and attribute columns. We read $(g, m, w) \in I$ as "the attribute $m$ has the value $w$ for the object $g$" and write $m(g) = w$. To obtain a concept lattice from a many-valued context, it has to be transformed into a formal context. The transformation can be done with conceptual scales. In practice, each many-valued attribute is represented by a collection of binary attributes.
A scale for the attribute $m$ of a many-valued context is a (one-valued) context $S_m := (G_m, M_m, I_m)$ with $m(G) \subseteq G_m$. The objects of a scale are called scale values, and the attributes are called scale attributes.
The scales of each context are joined to make a one-valued context (formal context), for which the simplest method is called plain scaling. In plain scaling, the derived formal context is obtained from a many-valued context $(G, M, W, I)$ and the scale contexts $S_m$, $m \in M$, where the attribute set of $S_m$ is replaced by $\dot{M}_m := \{m\} \times M_m$. Thus, the new formal context $(G, N, J)$ with $N := \bigcup_{m \in M} \dot{M}_m$ is derived from the many-valued context by plain scaling, where $gJ(m, n) :\Leftrightarrow m(g) = w$ and $wI_m n$.
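As a rough illustration of plain scaling, the following Python sketch derives a one-valued context from a small many-valued one. The data, the nominal scales, and the names `MANY_VALUED`, `SCALES`, and `plain_scaling` are hypothetical and chosen only to mirror the rule $gJ(m, n) :\Leftrightarrow m(g) = w$ and $wI_m n$.

```python
# Many-valued context (hypothetical data): each object has one value per attribute.
MANY_VALUED = {
    "d1": {"length": "short", "language": "thai"},
    "d2": {"length": "long", "language": "thai"},
}

# One scale per attribute: scale value w -> set of scale attributes n with w I_m n.
# These are nominal scales, where each value has exactly one scale attribute.
SCALES = {
    "length": {"short": {"short"}, "long": {"long"}},
    "language": {"thai": {"thai"}, "english": {"english"}},
}


def plain_scaling(many_valued, scales):
    """Derive the one-valued context: g J (m, n) iff m(g) = w and w I_m n."""
    derived = {}
    for g, row in many_valued.items():
        derived[g] = {
            (m, n)              # derived attribute is the pair (m, n)
            for m, w in row.items()
            for n in scales[m][w]
        }
    return derived
```

Each derived attribute is a pair such as `("length", "short")`, which keeps scale attributes of different many-valued attributes apart.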
We will later use this plain scaling approach to transform a many-valued context into a formal context so that FCA becomes applicable. Afterwards, the formal concept form is generated to obtain the knowledge structure, and to use this knowledge, the concept similarity measure is applied. Many similarity measures have been developed for use in retrieving formal concepts, surveyed in [30]. Lengnink [31] proposed similarity measures by using the averages of fractional overlaps of objects and attributes, relative to all objects (attributes) in the concepts compared as a local measure and relative to all objects (attributes) available overall as a global measure. Saquer and Deogun [22] and Dau et al. [32,33] applied rough sets for concept approximation to improve the information retrieval with guiding query refinement. They use symmetric differences between the objects (attributes), similarly as the previous definitions using overlap. Therefore, we actually have distance measures instead of similarity measures, with zero distance given for identical concepts (instead of similarity one). Next, Formica [12,13] improved the concept similarity by using the attribute intent with the attribute extent. The researchers used approximate extensions for information content. Moreover, they inserted weight parameters that can be adjusted by the user to tune the method. In addition, Wang and Liu [24] applied rough sets to evaluate a formal concept of the interval between upper neighbors and lower neighbors. However, only intension of formal concepts was applied on determining their similarity based on Tversky's model [27] instead of both intension and extension [34,35]. Alqadah [19] applied set theory using intension in formal concepts to propose similarity measures. In summary, the challenge of concept similarity measures can be considered for intent and extent attributes. 
Journal of Applied Mathematics However, the previous works have not considered merging intent and extent attributes. The studies have used only intent or only extent attributes to compute the similarity weights. Inspired by these reviewed similarity measures or indices, this work focuses on using intent and extent in all formal concepts.

Related Works
Plagiarism detection is divided into two approaches, namely, external and intrinsic plagiarism detection. External plagiarism detection identifies the source documents by using a database, while intrinsic plagiarism detection has no source text available and must handle, for example, plagiarized text that has been converted with the use of synonyms. This paper is aimed at detecting external plagiarism by using a database. Computer-assisted plagiarism detection is an information retrieval (IR) task supported by specialized IR systems, which are referred to as plagiarism detection systems or document similarity detection systems. Thus, this section reviews plagiarism detection systems based on IR, and FCA for IR, as applied in this work. Many studies have applied semantic role labeling (SRL) [1,3,35,36,37]. Abdi et al. [1] present an external plagiarism detection system that employs a combination of SRL with semantic and syntactic information. The semantic role labeling technique is used there to handle active-to-passive transformations and vice versa. Their method is able to detect different types of plagiarism, such as exact verbatim copying, paraphrasing, transformation of sentences, or changing word structure. Osman et al. [35,36] applied SRL to extract arguments from the sentences and then compared the arguments to detect the plagiarized part of the text. Paul and Jamal [37] also improved SRL for the ranking of sentences to identify direct copy-paste, active-passive transformation, and synonym conversions with faster execution times. Moreover, machine learning of both supervised and unsupervised types has been applied to detect document plagiarism [16,38]. Vani and Gupta [3,38,39] studied and compared different methods of document categorization for external plagiarism detection. They applied the K-means algorithm and the general N-gram. K-means gave promising results when dealing with highly obfuscated data.
Rahman and Chow [16] proposed a new document representation to enhance the classification accuracy using a new hybrid neural network model to handle the document representation. They represent the document in a tree structure that has a superior ability to encode document characteristics.
IR has been applied to enhance the performance of plagiarism detection [4,39,40,41,42]. Ekbal et al. [40] propose a technique based on textual similarity for external plagiarism detection by using a vector space model, which is one technique in IR to compare source and suspect documents. The results show encouraging performance with a benchmark setup, but not with language translation. Ahuja et al. [4] use the Dice measure as a similarity measure for finding the semantic resemblances between pairs of sentences. Their method also uses linguistic features like path similarity, a depth estimation measure, to compute the resemblance between pairs of words, and these features are combined by assigning different weights to them. It is capable of identifying cases of restructuring, paraphrasing, verbatim copying, and synonymized plagiarism. Moreover, the vector space model (VSM) was applied in [39,41,42] to improve recall performance. These studies represent suspected and source documents as vectors using VSM and TF-ISF weighting. However, the conceptual IR system in this work is aimed at addressing the limitations of classical keyword systems and at identifying the conceptual associations and links between documents. Thus, FCA can be used to complement an IR system in order to obtain the document relationships. Hierarchical order visualization of formal concepts in the concept lattice structure is an important concern for practical applications of FCA [43]. In addition, Kumar et al. [44] discussed the use of FCA for results in LSI and SVM. The authors applied FCA to discover dependencies in the data for clustering documents [45][46][47].
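To make the two classical measures mentioned above concrete, the following Python sketch computes the Dice coefficient on word sets and the cosine similarity on raw term-frequency vectors. It is a generic illustration, not the implementation of any cited system; the function names are assumptions, and the TF-ISF weighting used in the cited studies is deliberately left out.

```python
import math
from collections import Counter


def dice(a, b):
    """Dice coefficient of two word sets: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def cosine(a_tokens, b_tokens):
    """Cosine similarity of raw term-frequency vectors (no TF-ISF weighting)."""
    va, vb = Counter(a_tokens), Counter(b_tokens)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0
```

Both functions compare the keywords themselves and ignore how the documents are grouped, which is the limitation that the concept-based approach in this work is meant to address.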
IR is concerned with selecting appropriate information from an information collection. Traditionally, the process is begun by submitting a query, matching the query with information collection, seeing the ranked information, and submitting a newly revised query, until the target information is found or the user quits [48]. FCA has been successfully applied to enhance the efficiency and effectiveness of each task in this process. Mostly, an information collection is analysed with FCA to form a hierarchy, and retrieving information from such structures requires suitable methods. Next, we briefly review interesting work on FCA for IR. A retrieval task is composed of three natural subtasks: query, matching, and ranking. The matching is based on a similarity measure of the kind that was reviewed in the previous section. Query refinement allows users to recover from situations where the returned solution set is too large or too small. By the use of related keywords (attributes) in a concept lattice, the retrieval process performed on the initial query can also retrieve further relevant keywords. For this reason, concept lattice techniques have been developed for query refinement to improve web search engines (e.g., [20,21,28,43,44,48,49]). Nafkha et al. [15] applied FCA to retrieve solutions by using the cooccurrence of documents inside formal concepts. Qadi et al. [11] applied FCA in both refining the query and in ranking the solutions. An ontology for image processing was used in this retrieval process. They ranked the solutions by counting the number of documents in the retrieved concepts.
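The query refinement idea can be sketched in Python as follows: the keywords shared by all documents that match a query (the closure of the query in FCA terms) are offered as further relevant keywords. The index `DOCS` and the function names are hypothetical, and the cited systems may refine queries differently.

```python
# Hypothetical index: document -> keywords.
DOCS = {
    "d1": {"fca", "lattice", "retrieval"},
    "d2": {"fca", "lattice"},
    "d3": {"plagiarism", "retrieval"},
}


def matching_documents(query):
    """Documents that contain every query keyword (the extent of the query)."""
    return {d for d, words in DOCS.items() if set(query) <= words}


def refine(query):
    """Keywords shared by all matching documents, minus the query itself."""
    hits = matching_documents(query)
    if not hits:
        return set()
    shared = set.intersection(*(DOCS[d] for d in hits))
    return shared - set(query)
```

Here a query for `fca` suggests `lattice`, because every document that mentions `fca` also mentions `lattice`.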

The Proposed Document Plagiarism Detection Approach
4.1. System Overview. The document plagiarism detection using FCA is aimed at detecting good matches between the source document in storage and a suspect document. In this section, we discuss the proposed system shown in Figure 1.
The source documents are subjected to text operations such as word segmentation and stopword removal to extract keywords. We applied the Thai segmentation library [50] to obtain keywords (or words in general) from the source documents. The set of extracted words provides the attributes of the formal context, and the source documents provide the objects of the formal context. Afterwards, the formal context is processed into a concept lattice to retrieve the relevant documents in document plagiarism detection. Likewise, the suspect document is subjected to the text operations so that it can be matched against and retrieved from the concept lattice.
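The step from documents to a formal context can be sketched in Python as below. This is only an illustration under stated assumptions: whitespace splitting stands in for the Thai segmentation library [50], the stopword list is hypothetical, and the function names are not taken from the implemented system.

```python
STOPWORDS = {"the", "a", "of", "and", "is"}  # hypothetical stopword list


def extract_keywords(text):
    """Lowercase, split on whitespace, and drop stopwords.

    Stand-in for a real word segmenter: Thai text does not separate words
    with spaces, so a segmentation library is needed there instead.
    """
    return {w for w in text.lower().split() if w not in STOPWORDS}


def build_context(documents):
    """Return (objects, attributes, incidence) for a document collection."""
    incidence = {name: extract_keywords(text) for name, text in documents.items()}
    attributes = sorted(set().union(*incidence.values()))
    return sorted(incidence), attributes, incidence


def as_cross_table(documents):
    """Binary matrix with one row per document and one column per keyword."""
    objects, attributes, incidence = build_context(documents)
    return [[int(a in incidence[g]) for a in attributes] for g in objects]
```

The resulting cross table is the binary formal context from which the concept lattice is computed.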
The concepts in the lattice are used to index source documents. This structure is incrementally and automatically rebuilt when new cases are added or existing cases are updated. New source documents are collected and prepared with the text operations. Next, their keywords form a new concept, which becomes a new node in the lattice structure. To find the new node and its position in the updated concept lattice simultaneously, we applied the algorithm in [18], which inserts the node at its position and keeps the knowledge structure scalable. The updated lattice is then used to retrieve concepts similar to the suspect document, in sublattice form, by using the new concept similarity proposed in the next section.
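The idea of incremental construction can be illustrated with a naive Python sketch that updates only the set of concept intents when one document is inserted. It is not the algorithm of [18]: it does not maintain the lattice order or node positions, and it assumes that the full attribute set is known in advance. The function names are hypothetical.

```python
def add_object(intents, new_attrs):
    """Update the set of concept intents after one new object is inserted.

    Intents are closed under intersection, so the new family is the old one
    plus the intersection of each old intent with the new object's attributes.
    """
    new_attrs = frozenset(new_attrs)
    updated = set(intents)
    for intent in intents:
        updated.add(intent & new_attrs)
    return updated


def build_incrementally(context, all_attributes):
    """Insert documents one at a time, starting from the empty collection."""
    # With no objects, the only concept has an empty extent and intent M.
    intents = {frozenset(all_attributes)}
    for attrs in context.values():
        intents = add_object(intents, attrs)
    return intents
```

Because the full attribute set is among the starting intents, intersecting with it also adds the new document's own keyword set, so no separate case is needed for it.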

A New Concept Similarity.
We propose two concept similarity measures that not only are applicable within a concept lattice but also give similarity values between any existing formal concept and a tentative concept formed from available objects and attributes, which need not be an element of the lattice. These similarities use both object extent and attribute intent based on their appearance frequencies in the concept of the lattice. The proposed method allows ranking by the similarity values.
Both new similarity measures are introduced based on extension and intension. The first measure weighs the objects and attributes equally. The second weighs them based on existing concepts in the lattice. In this section, we define the building blocks used in our approach.
We define $C_p$ as a formal concept represented by a pair $(E_p, I_p)$ in the set of formal concepts $\mathfrak{B}(G, M, I)$, where $E_p \subseteq G$ and $I_p \subseteq M$ are the extent and intent of the formal concept, respectively. A new formal concept is defined as $C_N = (E_N, I_N)$, where $E_N$ is the set of retrieved object(s) and $I_N$ is the set of new attributes provided by the suspect document. Thus, a new concept similarity measure between a formal concept in $\mathfrak{B}(G, M, I)$ and the new formal concept is defined as $\mathrm{sim}(C_N, C_p)$. The proposed concept similarity measures are based on an appearance frequency of formal concepts denoted by $f(C_N, C_p)$ according to (3) and (4). The closer $\mathrm{sim}(C_N, C_p)$ is to 1, the greater the similarity of $C_N$ and $C_p$.
Given a formal concept $C_p = (E_p, I_p)$ and a new formal concept $C_N = (E_N, I_N)$ in a formal context $(G, M, I)$, the concept similarity that weighs objects and attributes equally is defined in equation (3). When the objects are used to weigh existing attributes, the concept similarity is defined in equation (4), where $f(C_N, C_p)_{meet}$ is the frequency of objects in the formal concepts of $\mathfrak{B}(G, M, I)$ for which $I_N \cap I_p \neq \emptyset$, and $f(C_N, C_p)_{join}$ is the total number of objects in the formal concepts of $\mathfrak{B}(G, M, I)$.
In equations (3) and (4), the frequency of objects is applied because the concept lattice is derived from the formal concepts. A high number of formal concepts indicates general knowledge, which appears in the upper part of the concept lattice. Thus, $f(C_N, C_p)_{meet}$ and $f(C_N, C_p)_{join}$ are applied in this work. We apply these similarity measures for document plagiarism detection and provide mathematical proof of having a formal similarity metric in Theorem 1 [18].
Theorem 1. $\mathrm{sim}(C_N, C_p)$ is the degree of similarity between the formal concept $C_p$ and the formal concept $C_N$ in the concept lattice $\mathfrak{B}(G, M, I)$ if $\mathrm{sim}(C_N, C_p)$ satisfies the conditions given in [15].
The proposed similarity measures are applied in the system for document plagiarism detection. Normally, the similarity measure between a source document ($D_i$) and a suspect document ($Q$) is denoted by $\mathrm{sim}(Q, D_i)$. Let $D_i$ and $Q$ be sets of keywords ($wd$), where $D_i = \{wd_{d1}, wd_{d2}, \ldots, wd_{dn}\}$ and $n$ is the total number of keywords of source document $i$. Similarly, $Q = \{wd_1, wd_2, \ldots, wd_m\}$, where $m$ is the total number of keywords of the suspect document. Given a formal concept $C_p = (E_p, I_p)$ and a new formal concept $C_N = (E_N, I_N)$ in a formal context $K := (G, M, I)$, every element $D_i$ of $E_p$ is a source document, and the suspect document $Q$ forms the new formal concept, where $I_N$ represents the set of keywords of the suspect document. Definitions (3) and (4) are applied to document plagiarism detection in equations (5) and (6), where, for any source document $D_i$ and any suspect document $Q$, $f(Q, D_i)_{meet}$ is the frequency of source document $D_i$ in the formal concepts of $\mathfrak{B}(G, M, I)$ for which $Q \cap D_i \neq \emptyset$, and $f(Q, D_i)_{join}$ is the total number of formal concepts that contain source document $D_i$, that is, the frequency of source document $D_i$ in the formal concepts of $\mathfrak{B}(G, M, I)$.
From equations (5) and (6), we obtain equation (7). The proposed concept similarity measure for document plagiarism detection is mathematically proved to be a formal similarity metric following Theorem 1. Namely, our concept similarity measure is the degree of similarity according to Theorem 2.
Theorem 2. $\mathrm{sim}(C_N, C_p) = \max\{\mathrm{sim}(Q, D_i) \mid D_i \in E_p\}$ is the degree of similarity between the formal concept $C_p$ and the formal concept $C_N$.
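The building blocks named in this section can be sketched in Python as follows. The sketch rests on loudly stated assumptions: the exact form of the document-level similarity is not reproduced here, so `sim_document` combines the frequency ratio and the keyword overlap as a simple product purely for illustration, and `f_meet` follows one possible reading of the definition (concepts that contain the document and whose intent shares a keyword with the suspect document). Only the maximum in `sim_concept` is taken directly from Theorem 2. All function names are hypothetical.

```python
def jaccard(q, d):
    """|Q ∩ D| / |Q ∪ D|, taken as 0 when both sets are empty."""
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def f_join(doc, concepts):
    """Number of formal concepts whose extent contains the document."""
    return sum(1 for extent, _ in concepts if doc in extent)


def f_meet(query, doc, concepts):
    """ASSUMED reading: concepts containing the document whose intent
    shares at least one keyword with the query."""
    return sum(1 for extent, intent in concepts if doc in extent and intent & query)


def sim_document(query, doc, keywords, concepts):
    """ASSUMED combination: frequency ratio times keyword overlap."""
    total = f_join(doc, concepts)
    if total == 0:
        return 0.0
    return f_meet(query, doc, concepts) / total * jaccard(query, keywords[doc])


def sim_concept(query, concept, keywords, concepts):
    """Theorem 2: the maximum document similarity over the concept's extent."""
    extent, _ = concept
    return max((sim_document(query, d, keywords, concepts) for d in extent), default=0.0)
```

Under these assumptions both factors lie in $[0, 1]$, so the product does as well, which agrees with the first property shown in the proof that follows.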

Proof.
(1) We prove that $0 \le \mathrm{sim}(C_N, C_p) \le 1$. To prove this, we first consider that any $D_i$ is a source document in the formal concept $C_p$ and $Q = I_N$. From the definitions of $f(Q, D_i)_{meet}$ and $f(Q, D_i)_{join}$, it is obvious that […]. Then, we have $|Q \cap D_i| / |Q \cup D_i| \le 1$, where $Q \cup D_i \neq \emptyset$. Now, we get the following result: […]. From equation (14), we also get that $0 \le \mathrm{sim}(Q, D_i) \le 1$.
(2) Let $C_N = C_p$; now, we have $E_N = E_p$ and $I_N = I_p$. Since the suspect document $Q$ is $I_N$, $Q = I_p$. This implies that $Q \cap D_i \neq \emptyset$ for all documents $D_i$ in the formal concept $C_p$. Hence, by the definitions of $f(Q, D_i)_{meet}$ and $f(Q, D_i)_{join}$, we get […]. Thus, $\mathrm{sim}(C_N, C_p) = 1$.
(3) Since $C_N$ is a new formal concept which needs to be assigned similarity with the given formal concept $C_p$, it is obvious that $\mathrm{sim}(C_N, C_p) = \mathrm{sim}(C_p, C_N)$.
Firstly, we show that $\mathrm{sim}(C_N, C_p) = \mathrm{sim}(C_N, C_O)$. It is clear by careful inspection that for any source document $D_k$ in $E_O \setminus E_N$, we get that $f(I_N, D_k)_{meet} = 0$ and $|I_N \cap D_k| = 0$. Now, we have […]. Hence, we can conclude that for any […]. Similarly, we have $\mathrm{sim}(C_N, C_p) = \mathrm{sim}(C_N, C_N)$. Hence, we get $\mathrm{sim}(C_N, C_p) = \mathrm{sim}(C_N, C_O)$.
Next, we show that sim […] numbers of formal concepts which contain source document $D_i$. Since $I_N \subseteq I_p$, then $I_N \cap D_i \subseteq I_p \cap D_i$. This implies that if $I_N \cap D_i$ is empty, then $I_p \cap D_i$ may be not empty. So, we get that $f(I_N, D_i)_{meet} \le f(I_p, D_i)_{meet}$. This leads to […]. It is clear by careful inspection that for any source document […]. By (14) and (15), we get […]. Consider the following result: […].
Similarly, we can prove $\mathrm{sim}(Q, D_i)$ from equation (7) with the above 1-4.
In summary, the proposed concept similarity can be applied to retrieve source documents from knowledge storage in the concept lattice form, and this is demonstrated both empirically and theoretically in the next section.

Algorithm for Building Knowledge Base and Performance Evaluation.
In this section, we evaluate the implemented system for document plagiarism detection using the proposed algorithm and use it to retrieve source documents. We provide an algorithm for building a knowledge base in formal concept form and retrieve source documents when the user inputs a suspect document. Algorithm 1 generates a set of formal concepts that consist of two parts, i.e., extent and intent. The result from this algorithm is used as a knowledge base for retrieving source documents when the user inputs a suspect document. Algorithm 2 next matches the suspect document within the set of all formal concepts to retrieve a group of source document(s) relevant to the suspect document, represented by the retrieved formal concept.
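The matching step can be sketched in Python as a scan over all formal concepts in the knowledge base. This is not Algorithm 2 itself: the Jaccard index on intents is used here as a simple stand-in for the proposed similarity measure, and the function name `retrieve` and the `(extent, intent)` pair representation are assumptions made for illustration.

```python
def retrieve(suspect_keywords, concepts):
    """Return the (extent, intent) pair whose intent best overlaps the
    suspect keywords, together with its score.

    Overlap is scored with the Jaccard index as a stand-in for the paper's
    similarity measure; concepts with an empty extent are skipped because
    they index no source document.
    """
    best, best_score = None, 0.0
    for extent, intent in concepts:
        if not extent:
            continue
        union = intent | suspect_keywords
        score = len(intent & suspect_keywords) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = (extent, intent), score
    return best, best_score
```

The extent of the returned concept is the group of source documents reported to the user, and a result of `(None, 0.0)` means that no stored concept shares a keyword with the suspect document.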

Implementation and Results
The proposed system was implemented with web applications as shown in Figure 2. This workflow demonstrates the process by the user. Firstly, the user has three ways to input the suspect document, namely, stored document, user's documents, or document from the internet. Next, the system provides a document database to support the comparison between the suspect document and prior source documents, using the FCA module mentioned in Section 4. This module is enabled in the back end of the web application. If the user would like to check with their documents, they select the provided option to compare the document similarity. Moreover, the user can check their suspect document with a document from the internet, for which they will get a URL (Uniform Resource Locator) for results on the suspect document on a website. Finally, the suspect document will be stored in the database to check in the future. We developed our system as an online website. PHP language was used to implement the system, while a MySQL database is used for details of the documents. An example of the application is provided in Figure 3. After the user selects various options, the result will show the document similarity, for example, Figure 4. If the user would like to see the details of plagiarism, they can click the provided link in Figure 5.
In this work, we designed an experiment to evaluate the provided document plagiarism detection. We prepared documents with copied text to various extents, namely, 100%, 80%, 50%, 30%, and 0% of copying. Each level of copying was designed with 5 general text files derived from news or academic publications, with sizes of 200 kB, 400 kB, 800 kB, 1200 kB, or 1600 kB. Each file was tested 10 times with the proposed approach. The results are shown in Table 1, which reports an overall plagiarism detection accuracy of 94.01% for the proposed system. Each level of copied text shows the similarity between the source file and the provided suspect files (or documents). If the suspect document is completely copied, the proposed method detects 100% plagiarism. However, even if no copying occurred, the system still detects a few percent of plagiarism, because some frequent words appear in both documents.

Conclusions
This paper proposed an algorithm for detecting document plagiarism by using formal concept analysis (FCA) with the presented concept similarity candidate to retrieve relevant source documents. The proposed similarity measures employ concept approximation using frequency of the formal concepts and were mathematically proven to be formal similarity metrics. The source documents were processed and retrieved with the proposed algorithm to demonstrate performance of the proposed similarity measure in document plagiarism detection by implemented web applications. This work proposes 3 formats to prevent plagiarism: (1) to detect among documents inside the document collection, (2) to detect between the suspect document and source documents, and (3) to detect between the suspect document and other documents from the Internet. The proposed 3 formats in the system were implemented in PHP language with the MySQL database. Moreover, in the last format, the presented system applies services by Google. The proposed system was demonstrated to be efficient and effective with a case study of news and academic documents. The experiments were evaluated from two aspects: efficiency tests by type of document and an effectiveness test regarding correctness. The results show that (1) the proposed system can detect document types .docx, .pdf, and .txt as designed and (2) the proposed system can detect plagiarized documents with an average accuracy of 94.01%.

Data Availability
There are no data.

Conflicts of Interest
The authors declare that they have no conflicts of interest.