WASTK: A Weighted Abstract Syntax Tree Kernel Method for Source Code Plagiarism Detection

In this paper, we introduce a source code plagiarism detection method, named WASTK (Weighted Abstract Syntax Tree Kernel), for computer science education. Unlike other plagiarism detection methods, WASTK takes into account aspects other than the raw similarity between programs. WASTK first converts the source code of a program into an abstract syntax tree and then obtains the similarity by calculating the tree kernel of two abstract syntax trees. To avoid misjudgment caused by trivial code snippets or frameworks given by instructors, an idea similar to TF-IDF (Term Frequency-Inverse Document Frequency) from information retrieval is applied: each node in an abstract syntax tree is assigned a weight by TF-IDF. WASTK is evaluated on different datasets and achieves higher Jaccard scores than other popular methods such as Sim and JPlag.


Introduction
Due to the advancement of the Internet, source code plagiarism has become a serious issue in computer science education [1]. Some students copy source code from their classmates, or search for similar source code on the Internet, and submit it as their own assignments without modification. Plagiarism seriously diminishes the quality of education. Students are deprived of the ability to innovate and think independently, and plagiarism may also foster academic dishonesty.
As in traditional offline computer science education, online education suffers from plagiarism, too. Moreover, it is even harder for online education platforms to counter source code plagiarism, because the number of student submissions is much greater than in the traditional offline case. Therefore, detecting source code plagiarism becomes a heavy part of the responsible instructors' daily workload [2].
In order to detect source code plagiarism automatically, three problems have already been considered by researchers in this field [3][4][5].
Problem 1. Computer programs are structured and hierarchical. Finding a reasonable method to measure the similarity between a pair of programs requires careful treatment.
Problem 2. The modifications made to programs for the purpose of plagiarism follow a small number of fixed patterns. Common distortions, for example, comment alteration, whitespace padding, identifier renaming, code reordering, and rewriting of algebraic expressions, can be found with high confidence.
Problem 3. Comparing every pair of programs takes a long time and is inefficient.
Additionally, some other facts need to be considered.
Problem 4. Instructors may provide frameworks of programming assignments for students to start with. The provided frameworks contribute a lot to the similarity between each pair of programs.
Problem 5. Solutions to simple problems may be alike. For example, a hundred students, without any communication, may produce very similar programs for the task of calculating the sum of two variables.
To solve these two problems, in this paper, two methods for source code plagiarism detection are presented, named ASTK (Abstract Syntax Tree Kernel) and WASTK (Weighted ASTK). Since computer programs are structured, in this work they are represented as abstract syntax trees [6]. In ASTK, a method called tree kernel [7] is used for measuring the similarity between a pair of programs. Additionally, unlike traditional tree kernel methods, WASTK gives a weight to every tree node in order to highlight the significance of nontrivial parts and to reduce the impact of short sample source code and source code provided by instructors. Inspired by the idea of TF-IDF [8], lower weights are given to common code and to code provided by the framework of coding assignments, and higher weights are assigned to the distinctive parts of the code.
The rest of this paper is organized as follows. Section 2 introduces related previous work. Section 3 illustrates the methods of ASTK and WASTK. We show the experiment results and corresponding analyses in Section 4. Finally, in Section 5, we conclude this work and discuss possible future work.

Related Work
Several plagiarism detection methods exist. According to the categories identified by Mozgovoy in 2006 [9], there are mainly fingerprint-based and content-comparison-based approaches. Content comparison techniques have three subcategories: string-matching algorithms, parameterized matching algorithms, and parse tree comparison algorithms.
MOSS (Measure of Software Similarity), proposed at the University of California, Berkeley in 2003 [10], is a fingerprint-based method that divides each program into $k$-grams, where a $k$-gram is a contiguous substring of length $k$. The possibility of plagiarism is determined according to the number of overlapping fingerprints generated by hashing each $k$-gram.
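As a rough illustration of the fingerprinting idea, the following Python sketch hashes every $k$-gram of two strings and measures how many hashes they share. It is not MOSS itself, which additionally selects only a subset of the hashes; the function names, the default k = 5, the whitespace stripping, and the use of SHA-1 are all assumptions made for this example.

```python
import hashlib


def kgram_fingerprints(text: str, k: int = 5) -> set[str]:
    """Return the set of hashes of all contiguous k-grams in text."""
    # Assumption: whitespace is dropped so that padding does not change the k-grams.
    normalized = "".join(text.split())
    return {
        hashlib.sha1(normalized[i:i + k].encode("utf-8")).hexdigest()
        for i in range(len(normalized) - k + 1)
    }


def fingerprint_overlap(a: str, b: str, k: int = 5) -> float:
    """Share of fingerprints common to a and b (size of intersection over union)."""
    fa, fb = kgram_fingerprints(a, k), kgram_fingerprints(b, k)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because only whitespace is normalized here, renaming an identifier changes every $k$-gram that overlaps it, which is one reason structure-based comparison is attractive.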
There are some well-known string-matching algorithms, including Plague, YAP3 (the third version of Yet Another Plague), JPlag, and FDPS (Fast Plagiarism Detection System) [11]. These methods all first convert programs into corresponding token sequences. They then compare the token sequences generated from different programs and use the resulting similarity as evidence of plagiarism. Plague, proposed by Whale in 1988, is the first one converting a file into a token sequence and using a string-matching technique for comparison [12]. YAP3, proposed by Wise in 1996, performs better than Plague due to a faster converting method using Running Karp-Rabin Greedy String Tiling (RKR-GST) [13]. JPlag, proposed by Malpohl in 2006, runs even faster than YAP3 by defining a minimum-matching length [14]. FDPS, another algorithm similar to YAP3, proposed by Mozgovoy et al. in 2007, pays more attention to efficiency [15]. It uses an indexed data structure to store matches, which supports faster searching.
Dup, a tool proposed by Baker in 1995, is a parameterized matching algorithm. It first uses a lexical analyzer to scan a program. Then it replaces all identifiers and constants with the same symbol and outputs the transformed program together with a list of parameter candidates. The similarity is determined according to $p$-matches between two transformed files, where $p$ stands for parameter [16].
Sim and Brass are based on parse tree comparison algorithms. Sim, proposed by Gitchell and Tran in 1999, gets a token sequence from a lexical analyzer and calculates the sequence alignment as the similarity [17]. In order not to be influenced by identifiers or statement order, in 2014, Kikuchi et al. proposed a modified method whose tokens combine syntactic elements with lexical elements and do not include identifier names or literal values [4]. Brass, proposed by Belkhouche et al. in 2004, is an application of parse tree comparison algorithms. It represents each file as a binary tree and a symbol table (the data dictionary) containing the variables and data structures used in the file [18].

Proposed Approach
Definition 1 (abstract syntax tree). An abstract syntax tree is an abstract syntactic structure of a piece of source code in the form of a tree. Figure 1 shows an example of an abstract syntax tree created from short sample code. The left part is the source code in two lines and the right part is its corresponding abstract syntax tree after lexical and syntax analysis. Even a short piece of source code has a large abstract syntax tree, and only the leaf nodes carry the symbols of the source code. An in-order traversal of all the leaf nodes recovers the source code.
Definition 2 (tree kernel). Tree kernel is a method for comparing tree structures and measuring the similarity between two items in Natural Language Processing (NLP), first proposed by Collins and Duffy in 2001 [7].
Definition 3 (TF-IDF). In information retrieval, TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection of documents [8].
TF-IDF includes two parts: TF is the abbreviation for "Term Frequency" and IDF is the abbreviation for "Inverse Document Frequency." $\mathrm{TF}(t, d)$ denotes the frequency of word $t$ in document $d$. Similarly, $\mathrm{DF}(t, D)$ is the frequency of documents containing word $t$ in the set of all documents $D$. $\mathrm{IDF}(t, D)$ is the reciprocal of $\mathrm{DF}(t, D)$. TF focuses on the importance of a word within a document, while IDF concerns how widespread the word is across all documents. TF-IDF can help to compute the importance of words for solving Problems 4 and 5.
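A minimal Python sketch of Definition 3 is given below, assuming each document is a list of word tokens. The function names are hypothetical. IDF is taken literally as the reciprocal of the document frequency, as stated above; many information retrieval systems use a logarithm of this ratio instead.

```python
from collections import Counter


def term_frequency(word, document):
    """Share of the tokens in `document` that are equal to `word`."""
    if not document:
        return 0.0
    return Counter(document)[word] / len(document)


def inverse_document_frequency(word, documents):
    """Reciprocal of the share of documents that contain `word`."""
    containing = sum(1 for doc in documents if word in doc)
    if containing == 0:
        return 0.0  # assumption: a word that never occurs gets no weight
    return len(documents) / containing


def tf_idf(word, document, documents):
    return term_frequency(word, document) * inverse_document_frequency(word, documents)
```

With three documents that all contain `scanf` but only one that contains `while`, `scanf` receives an IDF of 1 and `while` an IDF of 3, so the rarer word is weighted more heavily.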
The process of WASTK is shown in Figure 2. The proposed method first transforms programs into abstract syntax trees. Inspired by the idea of TF-IDF, weights are then calculated for every subtree: expressions with high frequency in one document but low frequency in other documents receive higher weights, and expressions that appear in all documents receive lower weights. Finally, the tree kernel is applied with the calculated weights on nodes to determine whether plagiarism has happened. The details of each step are described in the rest of this section. After parsing, some adjustments are applied to each abstract syntax tree. As mentioned in Problem 2, renaming variables and changing the size declarations of arrays are common ways to plagiarize. To solve this problem, all variable tokens are replaced with a unified token. The tokens for size declarations of arrays and the indices of array elements are unified as well.
Because rewriting an expression into a different but equivalent expression is not trivial, plagiarism by modifying expressions is rare. The structural information of the subtrees that correspond to expressions is therefore discarded, and each such subtree is replaced by the string obtained from the in-order traversal of its leaf nodes. Besides, a simplified abstract syntax tree reduces the running time of ASTK and WASTK, which helps to solve Problem 3.
After adjusting, the abstract syntax tree in Figure 1 is transformed into the new tree shown in Figure 3. All variable names are replaced with "temp." Because the node "exp" is a type of expression, it becomes a leaf node that represents the in-order traversal of the leaf nodes of the former subtree rooted at "exp"; that is to say, the dotted portion in Figure 3 is cut out from the original abstract syntax tree.
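The two adjustments described above (unifying variable names and collapsing expression subtrees into leaves) can be sketched in Python as follows. The `Node` class, the node kinds "V_ID" and "exp", and the helper names are assumptions made for illustration, loosely modeled on the labels in the figures; the unification of array sizes and indices is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                   # grammar symbol, e.g. "exp" or "V_ID"
    children: list["Node"] = field(default_factory=list)
    text: str = ""                              # token text, only used by leaves


def leaf_text(node: Node) -> str:
    """Concatenate the token text of all leaves, left to right."""
    if not node.children:
        return node.text
    return "".join(leaf_text(child) for child in node.children)


def adjust(node: Node) -> Node:
    """Return a simplified copy of the tree rooted at `node`."""
    if not node.children:
        # Every variable name becomes the same unified token.
        return Node(node.kind, [], "temp" if node.kind == "V_ID" else node.text)
    children = [adjust(child) for child in node.children]
    if node.kind == "exp":
        # Drop the expression's internal structure; keep only its token string.
        return Node("exp", [], "".join(leaf_text(child) for child in children))
    return Node(node.kind, children)
```

For a statement such as `a+b;`, the "exp" subtree becomes a single leaf with the text `temp+temp`, so renaming `a` and `b` no longer changes the tree.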

Determine Node Weights.
In ASTK, the weights on all nodes are equal to 1. However, in WASTK, the weights of nodes depend on TF-IDF. This is the only difference between ASTK and WASTK. In WASTK, the weight on an abstract syntax tree node reflects the significance of the corresponding subtree, that is, of the corresponding fragment of code. Taking Problems 4 and 5 into consideration, lower weights are given to the nodes that represent common expressions appearing in many other programs.
Some types of symbols and expressions appear frequently in programs. They carry little semantic meaning and can be treated as stop words. A list of stop words is shown in Table 1.
Let $T$ denote an abstract syntax tree. For each subtree $s$ of $T$, there is a weight $w_{T,s}$. The in-order traversal of $s$ is denoted as $\mathrm{word}_s$. If $\mathrm{word}_s$ is treated as a stop word, the weight of $s$ is set to 0, that is, $w_{T,s} = 0$. For example, the nodes in Figure 3 whose type is "T INT," "V ID," or "SEMICOLON" have weight 0. On the contrary, if $\mathrm{word}_s$ is not a stop word, the weight $w_{T,s}$ is computed as follows:
$$w_{T,s} = \mathrm{TF}(s, T) \cdot \mathrm{IDF}(s, D). \quad (1)$$
By Definition 3, $\mathrm{TF}(s, T)$ and $\mathrm{IDF}(s, D)$ can be calculated as follows:
$$\mathrm{TF}(s, T) = \frac{\mathrm{cnt}(s, T)}{N(T)}, \qquad \mathrm{IDF}(s, D) = \frac{|D|}{M(s)},$$
where $\mathrm{cnt}(s, T)$ is the number of appearances of subtree $s \in S_T$, with $S_T$ the set of subtrees of $T$. $N(T)$ is the number of subtrees in $T$ and $M(s)$ is the number of abstract syntax trees that contain $s$. $|D|$ is the number of abstract syntax trees generated from the programs related to a specific assignment.
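The weighting rule can be sketched in Python as follows, assuming every abstract syntax tree has already been reduced to the list of in-order traversal strings of its subtrees. The stop-word set is a hypothetical stand-in for Table 1, the function name is an assumption, and IDF is taken as the reciprocal of the document frequency as in Definition 3.

```python
from collections import Counter

# Hypothetical stand-in for the stop words of Table 1.
STOP_WORDS = {"int", "temp", ";"}


def node_weights(subtree_words, corpus):
    """Map each distinct subtree string of one tree to its weight.

    subtree_words: traversal strings of all subtrees of the tree being weighted
    corpus: one such list per tree of the assignment (the tree itself included)
    """
    counts = Counter(subtree_words)
    n_subtrees = len(subtree_words)              # number of subtrees in this tree
    weights = {}
    for word, cnt in counts.items():
        if word in STOP_WORDS:
            weights[word] = 0.0                  # stop words carry no weight
            continue
        # Number of trees containing this subtree; at least 1 to avoid division by zero.
        containing = max(1, sum(1 for other in corpus if word in other))
        tf = cnt / n_subtrees
        idf = len(corpus) / containing
        weights[word] = tf * idf
    return weights
```

In a two-program corpus where both programs contain `scanf(temp)` but only one contains `temp+temp`, the shared call receives half the weight of the distinctive expression, which is the intended effect for framework code.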

Calculate Tree Kernel and Similarity.
By Definition 2, the similarity between two abstract syntax trees $T_1$ and $T_2$ can be measured by a tree kernel, denoted by $K(T_1, T_2)$ and defined over the sets of subtrees $S_{T_1}$ and $S_{T_2}$, where $S_T$ denotes the set of all subtrees in $T$. Calculating $K(T_1, T_2)$ directly requires enumerating each subtree $s_i$ in both $T_1$ and $T_2$ and then calculating $\mathrm{cnt}(s_i, T_1)$ and $\mathrm{cnt}(s_i, T_2)$, respectively. This is inefficient, so a recursive method is applied instead, and $K(T_1, T_2)$ is calculated according to the following cases.
(1) If the roots of $T_1$ and $T_2$ are both leaf nodes of the "exp" type, the value depends on the decay factor $\lambda_{tree}$ and on $\mathrm{dist}(\mathrm{word}_{T_1}, \mathrm{word}_{T_2})$. Here $\lambda_{tree}$ is a decay factor that keeps changes near the root from producing too much influence [6]. As the height increases, the kernel value of the subtree is penalized by $(\lambda_{tree})^{size}$, where $size$ is the height of the subtree. $\mathrm{word}_s$ is the string given by the in-order traversal of subtree $s$, as defined in Section 3.2. $\mathrm{dist}(a, b)$ is defined from $\mathrm{lev}_{a,b}$, the edit distance between strings $a$ and $b$, and from $|a|$, the length of string $a$. Unlike in the traditional tree kernel, the similarity between two different expressions is not equal to 0; it is determined by the edit distance between the two expressions. With this definition of $\mathrm{dist}(a, b)$, the expression-level modifications mentioned in Problem 2 are discovered by ASTK and WASTK.
(2) If the root of $T_1$ is different from the root of $T_2$, no common subtree is rooted at the two roots, and this is a base case of the recursion.
(3) If the roots of $T_1$ and $T_2$ are both leaf nodes, the recursion terminates. Otherwise, the roots are the same and the value is computed from the child subtrees, where $\mathrm{ns}(T)$ is the number of subtrees rooted at the children of the root of $T$ and $\mathrm{st}(T, i)$ is the subtree rooted at the $i$th child of the root of $T$. Because the root of $T_1$ is the same as the root of $T_2$, $\mathrm{ns}(T_1) = \mathrm{ns}(T_2)$.
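The edit-distance comparison of two expression strings can be sketched in Python as follows. The Levenshtein routine is the standard dynamic program; the normalization `1 - lev / max(|a|, |b|)` is only one plausible choice, assumed here for illustration, and need not coincide with the definition of dist used by ASTK and WASTK. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                 # delete ca
                current[j - 1] + 1,              # insert cb
                previous[j - 1] + (ca != cb),    # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def expression_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for maximally different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Under this assumed normalization, `temp+temp` and `temp-temp` differ in one of nine characters and score about 0.89, so a small change to an expression lowers the match only slightly instead of reducing it to zero.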
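After computing the tree kernel $K(T_1, T_2)$ between $T_1$ and $T_2$, a normalization is needed. The method of normalization is as follows:
$$K'(T_1, T_2) = \frac{K(T_1, T_2)}{\sqrt{K(T_1, T_1) \cdot K(T_2, T_2)}},$$
where $K'(T_1, T_2)$ is the cosine similarity of $h(T_1)$ and $h(T_2)$, the feature vectors of $T_1$ and $T_2$ under the kernel. In ASTK and WASTK, $K'(T_1, T_2)$ is the similarity of two pieces of source code.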
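The normalization step can be sketched in Python as follows. To keep the example self-contained, a plain counting kernel (the dot product of subtree-count vectors) stands in for the recursive, weighted kernel; that substitution and all names are assumptions made for illustration.

```python
import math
from collections import Counter


def count_kernel(subtrees_1, subtrees_2):
    """Dot product of the subtree-count vectors of two trees."""
    c1, c2 = Counter(subtrees_1), Counter(subtrees_2)
    return sum(c1[s] * c2[s] for s in c1.keys() & c2.keys())


def normalized_similarity(subtrees_1, subtrees_2, kernel=count_kernel):
    """Cosine-style normalization of a kernel value."""
    k11 = kernel(subtrees_1, subtrees_1)
    k22 = kernel(subtrees_2, subtrees_2)
    if k11 == 0 or k22 == 0:
        return 0.0  # assumption: an empty tree is similar to nothing
    return kernel(subtrees_1, subtrees_2) / math.sqrt(k11 * k22)
```

Normalizing this way removes the effect of program size: a tree compared with itself always scores 1, however many subtrees it has.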

Datasets.
There are two groups of data. The programs in each group are generated from 10 independent submissions of programs for the same assignment by applying 40 different generators. Table 2 shows the sizes of the datasets. Threefold cross-validation has been adopted for the experiment. The dataset has been randomly split into 3 parts. Each time, one part is used as a testing set and the other two parts are used as a training set.
The 10 independent submissions of programs for each assignment are randomly picked from the solutions submitted for "Problem I" and "Problem J" by students who took the final exam of the C programming course at Harbin University of Science and Technology. These pieces of code are very short, as the average number of lines is no more than 20. In particular, a programming framework is provided in "Problem J." The 40 generators are designed by 40 players who attended the Jisuan-zhidao programming contest (https://nanti.jisuanke.com/contest). A generator reads original programs as the input and returns the plagiarized programs as results. Only the generated programs that are functionally equivalent to the original ones are used.
All valid generated programs are used in the datasets and labeled. The programs generated from the same program are labeled as having the same origin. Any pair of programs that are labeled as having the same origin is treated as plagiarism.

Evaluation of Proposed Methods.
After determining the similarity score of a pair of programs, it still remains to decide whether these two programs are plagiarized or not. Therefore, a threshold $\theta$ on the similarity score is used. The threshold $\theta$ is determined by setting it to each value in $\Theta = \{0.01k \mid 0 < k < 100\}$, with $k$ an integer, and using a training set to find which value gives the best outcome. A pair with a similarity score higher than $\theta$ is treated as plagiarism.
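The threshold search can be sketched in Python as follows, assuming the training set is a list of (similarity score, is-plagiarism label) pairs and that the Jaccard score is the selection criterion. The function name and the tie-breaking rule (the smallest best threshold wins) are assumptions of this example.

```python
def best_threshold(scored_pairs):
    """Return (threshold, Jaccard score) of the best candidate threshold.

    scored_pairs: list of (similarity, is_plagiarism) for pairs of programs.
    """
    best_theta, best_score = None, -1.0
    for k in range(1, 100):                      # candidates 0.01, 0.02, ..., 0.99
        theta = k / 100
        tp = sum(1 for score, label in scored_pairs if score > theta and label)
        fp = sum(1 for score, label in scored_pairs if score > theta and not label)
        fn = sum(1 for score, label in scored_pairs if score <= theta and label)
        total = tp + fp + fn
        jaccard = tp / total if total else 0.0
        if jaccard > best_score:                 # strict: keep the first best candidate
            best_theta, best_score = theta, jaccard
    return best_theta, best_score
```

The search is a plain scan over 99 candidates, so its cost is negligible next to computing the pairwise similarities themselves.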
In this study, precision is the number of truly plagiarized programs that are marked as plagiarism over the number of all programs that are marked as plagiarism. Recall is the number of plagiarized programs that are found over the number of all plagiarized programs.
Since we assume that the positive case of plagiarism is rare and the negative case is common, in addition to precision and recall, the Jaccard score is chosen as a reasonable measure for evaluation. The Jaccard score is calculated as follows:
$$\mathrm{Jaccard} = \frac{TP}{TP + FP + FN},$$
where $TP$ is the number of plagiarized programs that are marked as plagiarism, $FP$ is the number of nonplagiarized programs that are marked as plagiarism, and $FN$ is the number of plagiarized programs that are not marked as plagiarism.
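The three measures can be computed as in the following Python sketch, assuming the predictions and the ground truth are given as parallel lists of booleans. The function name and the convention of returning 0.0 for an empty denominator are assumptions of this example.

```python
def evaluate(predicted, actual):
    """Precision, recall, and Jaccard score for boolean plagiarism labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "jaccard": ratio(tp, tp + fp + fn),
    }
```

True negatives appear in none of the three formulas, which is why these measures remain informative when nonplagiarized cases vastly outnumber plagiarized ones.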

Results.
Our experiments select Sim and JPlag to compare with ASTK and WASTK. In a 2016 survey, Deokate and Hanchate mentioned four popular plagiarism detection tools: JPlag, Sim, MOSS, and Plaggie [19]. However, MOSS runs on a web server, and its results contain no more than 250 highly similar pairs of code. Plaggie is similar to JPlag but it only supports Java [20]. As a result, we select Sim and JPlag as comparison tools, with Sim representing the parse tree algorithms and JPlag representing the string-matching algorithms.
In all experiments, the decay factor $\lambda_{tree}$ is set to 0.1 empirically. Table 3 shows the thresholds $\theta$ for WASTK, ASTK, Sim, and JPlag, which are determined by using the training data. The performance results are judged according to the Jaccard score. The average value is used as the threshold for each model in the experiments.
The results of applying the different models to the testing data of each fold are shown in Table 4, and the corresponding comparison of these four tools is illustrated in Figure 4.
According to the results, the precision values of ASTK and WASTK are much higher than those of Sim and JPlag. The recall values of ASTK and WASTK are much higher than that of JPlag but slightly lower than that of Sim. The Jaccard score, as the overall evaluation measure, shows the advantage of ASTK and WASTK: their Jaccard scores are much greater than those of Sim and JPlag.
The added weights in WASTK reduce precision but increase recall. The Jaccard score of WASTK is higher than that of ASTK, showing that the TF-IDF weighting is worthwhile.

An Online Example.
WASTK has been applied to an online exam held on Jisuanke (https://www.jisuanke.com/) for code plagiarism detection, and 290 accepted pieces of source code were collected for one problem in this exam. This problem takes only two numbers as input and outputs whether the first number is divisible by the second. The average number of lines of these pieces of code is 11.
The following shows two pieces of code selected for analysis.
Left:
#include"stdio.h"
int main()
{
int M,N;
scanf("%d",&M);
scanf("%d",&N);
if(M%N==0)
printf("YES\n");
else
...
Right:
#include<stdio.h>
int main()
{int a,b;
scanf("%d",&a);
scanf("%d",&b);
if(a%b==0)
printf("YES");
else
printf("NO");
...
WASTK gives these two pieces of code a similarity of 0.694, while Sim gives 0.93. It is easy to see that the structures of the two pieces of code are almost the same. The variable names in them are different. The ways of including the header file are also different. There is a line "return 0;" at the end of the left one, but it is not found in the right one. Moreover, the two pieces of code place line breaks differently at the beginning and the end of the function "main." For such short pieces of code for simple problems, an ordinary detection tool like Sim reports a higher similarity, which may lead to a wrong judgment. WASTK, however, obtains a more accurate result by capturing the unique features of each piece of code.

Conclusions
In this paper, a method and its enhancement for detecting source code plagiarism are proposed. This method not only considers the similarity between two pieces of code but also takes the context of programming into consideration. WASTK transforms string-based programs into abstract syntax trees. The nodes of the trees are weighted according to both the structural similarity between a pair of programs and the common structures of all programs in the dataset.
According to the results of the experiments, ASTK and WASTK achieve much higher Jaccard scores than other popular methods on the same datasets. ASTK and WASTK are both based on the structure of programs instead of text like JPlag or token sequences like Sim. Besides, the improved tree kernel considers the similarity between corresponding expressions of two subtrees, which is helpful for detecting plagiarism with minor changes. Additionally, WASTK adds weights to ASTK by applying TF-IDF, which increases the recall and the Jaccard score. When pieces of code have common frameworks or are based on similar solutions, TF-IDF assigns lower weights to their nodes to avoid misjudgments.
WASTK can help instructors to detect whether plagiarism exists in the assignments of students in both online and offline computer science education. It can also be applied to online programming contests to detect plagiarism. When programs contain a common framework or the solutions to problems are simple, WASTK shows a better performance. However, there is still room for improvement. The current method employs the tree kernel, a symmetric similarity measure, which may mislead the judgment. Moreover, efficiency is still a problem, since the comparison has to be made between each pair of programs.

Figure 1: A piece of source code and its corresponding abstract syntax tree.

3.1. Adjust the Structure of Abstract Syntax Tree.
To solve Problem 1, programs are parsed into abstract syntax trees for further understanding the semantics.

Figure 3: A new abstract syntax tree after adjustments.

Figure 4: Comparison of four tools.

Table 1: A list of stop words.

Table 2: The sizes of datasets.

Table 4: The results of WASTK, ASTK, Sim, and JPlag on the testing data of each fold.