The Use of Stemming in the Arabic Text and Its Impact on the Accuracy of Classification

Department of Computer Information Systems, Al-Balqa Applied University, Al-Salt, Jordan; Faculty of Artificial Intelligence, Al-Balqa Applied University, Al-Salt, Jordan; Cybersecurity Department, Science and IT College, Irbid National University, Irbid, Jordan; Department of Computer Science, Al-Balqa Applied University, Al-Salt, Jordan; Department of Information Science, College of Computer and Information Systems, Umm Al-Qura University, Makkah, Saudi Arabia


Introduction
Machine learning (ML) is a branch of artificial intelligence (AI) research [1] that aims to develop practically relevant, multipurpose algorithms based on a small amount of data. ML approaches differ from general AI in the patterns they discover in data and in the way the data is used. ML has many applications, such as fraud detection, weather forecasting, and patient diagnosis. The two major forms of ML are supervised and unsupervised learning. Here we consider the former, which involves generating a mapping from labeled training data to output predictions or classes. This process is known as classification and is the core aspect of supervised ML.
Classification involves determining output values, known as classes or labels, from input objects. This mapping is known as a model or classifier. The input objects, also called examples, instances, or tuples, are the objects to be categorized. According to [2], an ML classification technique starts from a set of instances together with their known labels, obtained by manually tagging a group of instances. This group of labeled instances is known as the training set. The classifier uses the training set to generate a model that maps each instance to its label. The trained model can then be used to label, or classify, new, unknown instances. In the current study, which focuses on the classification of Arabic text, the instances are carefully chosen from a prelabeled pool of instances by employing enhanced Arabic classifiers.
There are many situations in which unlabeled documents are both plentiful and cheap, whereas labeling them is costly and time-consuming. For example, a huge number of documents can be obtained at essentially no cost, whereas human annotators must be paid considerable sums to assign subject categories to these documents, whether they are in Arabic or not. Similarly, video data is easy to collect, but it is very difficult to obtain good semantic content labels for it. Likewise, it is easy to obtain a wide range of compounds that may be useful for treating a disease, but it is very expensive to run the biochemical tests needed to see which ones really work. These three examples are essentially classification problems.
Several algorithms have been implemented to solve the text classification (TC) problem. Many studies in this field have focused on English text, whereas little research has been done on Arabic script. Arabic text differs from English text in its morphological structure, which makes the preprocessing of Arabic text more challenging. The aim of this study is to evaluate the performance of an Arabic text classification system using three distinct categorization methods, namely, the decision tree (DT), Naïve Bayes (NB) (parametric-based), and K-nearest neighbor (KNN) (example-based) classifiers. In order to find the best combination of weighting scheme and technique, various weighting schemes were adopted in the first two methods.
In the following, Section 2 discusses text classification. We then present the motivation and objectives of this work in Section 3. An overview of the related works and the three classifiers considered in this study is provided in Section 4. Section 5 introduces the framework of the proposed Arabic text classifier. Section 6 describes the experiment and Section 7 presents the document representation. Section 8 presents the results, and Section 9 contains the conclusion and details of future work.

Text Classification
Text classification is a supervised machine learning task that requires prelabeled documents for training. It aims to assign new documents to categories based on the learned criteria [3]. Text-based applications and TC are particularly important in natural language processing (NLP), not least because of the recent increase in the volume of available text data. One example of an area in which TC and NLP are needed is filtering [4], a process that screens a user's inbound documents to identify those that are unwanted or unsolicited. Another is sentiment analysis [5], which seeks to identify the general feelings expressed in a document in order to measure, for example, customer satisfaction.
The problems encountered in TC can be addressed by applying supervised learning algorithms to train classification models on a set of labeled examples of the problem in question. These models can then be used to identify the class of an unlabeled document [2,6-10].
There are two phases in the TC approach: training and testing. The training phase involves building a classifier from a group of the collected documents (called the training set), allocating a subset of the training set to each category, and processing these subsets with several NLP techniques. The aim of this processing is to extract from the training set the features that will represent each category. The remainder of the collected documents forms the test set, which is used in the testing phase to evaluate the classifier's ability to assign previously unseen documents to the correct categories. Performance is assessed by comparing the categories selected by the classifier with the predefined categories of the documents [3].
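The two-phase protocol can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 25% test fraction, the function names, and the trivial majority-label "classifier" are hypothetical placeholders and are not taken from the study.

```python
import random
from collections import Counter

def split_train_test(labeled_docs, test_fraction=0.25, seed=0):
    """Shuffle (text, label) pairs and split them into a training and a test set.

    The test fraction is an illustrative assumption, not a value from the study.
    """
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    n_test = int(len(docs) * test_fraction)
    return docs[n_test:], docs[:n_test]

def train_majority(train_set):
    """Placeholder training step: remember the most frequent training label."""
    majority_label = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda text: majority_label

def evaluate(classifier, test_set):
    """Testing step: fraction of unseen documents assigned to their predefined category."""
    correct = sum(1 for text, label in test_set if classifier(text) == label)
    return correct / len(test_set)
```

Any real classifier (KNN, NB, or DT) would replace `train_majority`; the split and the comparison of predicted with predefined categories stay the same.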
A TC system generally consists of the following parts:
(i) Text preprocessing, which converts the text into a set of dimensions (features) that can be processed by classifiers.
(ii) Dimensionality reduction, which decreases the number of features to enhance the efficiency of classification algorithms. This can be done using methods such as feature selection and dimension reduction [8,9,11,12].
(iii) Classifier training, which is the process of building an autonomous classifier using supervised learning frameworks [2].
(iv) Prediction, which is the process of using a trained classifier to generate labels for new documents [2].
It has been indicated in [13] that texts can be represented as a set of features by employing two representation methods, namely, the bag of words (BOW) and the n-gram. The former uses individual words or phrases as features, while the latter uses sequences of n consecutive words or characters. Past studies [14,15] have pointed out that creating an accurate TC system requires effective handling of a large number of features (which may number in the tens of thousands). Hence, information retrieval (IR) techniques such as stemming and stop-word elimination have been used to reduce the dimensionality of the feature space.
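To make the two representations concrete, the following minimal sketch builds a bag of words and word-level and character-level n-grams from whitespace-separated text. The function names are hypothetical and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(text):
    """Unordered word counts: word order is discarded."""
    return Counter(text.split())

def word_ngrams(text, n):
    """All sequences of n consecutive words."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    """All sequences of n consecutive characters of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

For example, `word_ngrams("a b c", 2)` yields the pairs `("a", "b")` and `("b", "c")`, whereas `bag_of_words("a b a")` only records that "a" occurs twice and "b" once.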

Motivation and Objectives
The importance of classification technologies has increased due to the need to automatically classify the huge amounts of diverse text-based information available on the Internet and in electronic form in many languages, including Arabic. Hence, several studies initially focused on addressing the challenges associated with standard Arabic document classifiers [6,7,9,16], which in turn encouraged further studies on enhancing the performance of Arabic document classifiers.
This research continues because most Arabic classifiers are unable to deal accurately with the vast quantities of available Arabic documents. This is considered the major problem in the classification of Arabic texts.
One of the main obstacles facing researchers working on text classification for Arabic documents is the failure of the available classifiers to deal with stemming, a factor that may affect other processes in a document classification system. To address this issue, an algorithm is employed to define the stemming rule, which depends on processing the grammatical components of an utterance to resolve morphological and syntactic complexity.
The major TC problem is related to the enormous number of features extracted from the text (which can reach hundreds or thousands). As a result, the time required to substitute a term with its possible concepts may increase, and the high dimensionality of the feature space may reduce classifier performance. The number of features can be reduced by extracting the essential semantics from texts [17,18]. Therefore, in order to reduce the feature size of Arabic text, this study evaluates three classifiers with and without stemming [19]. It is hoped that the outcome of this research will contribute to the improved tracking and detection of new documents and their assignment to the relevant categories and, consequently, to the improved performance of Arabic classifiers. In sum, this study attempts to answer the following research question: What is the effect of classification techniques on Arabic documents with and without the use of a stemmer?

Related Works
Text classification refers to assigning predefined categories to texts depending on the content of the documents. Text classification is important for natural language processing and other applications of textual knowledge, and its importance has grown with the recent increase in the volume of available text data. The problems of text classification can be overcome by applying supervised learning algorithms to train classification models on a set of examples of the problem in question that carry the correct classifications (labels). These models can then be used to predict the labels of unlabeled documents [12,20-23]. The components from which a text classification system may be built were outlined in Section 2.
Supervised algorithms assume that the structure of the categories is known in advance, and they require a set of labeled documents in order to map documents to the prespecified classes. However, as mentioned above, for a huge dataset it is difficult to mark the true label and class of every document in the training set. This section therefore reviews the most commonly used classification algorithms, namely, KNN, NB, and DT.

K-Nearest Neighbor Algorithm (KNN) Classifier.
KNN is a popular example-based (instance-based) classifier that has been effective in several text categorization tasks. The algorithm consists of two basic steps. First, the k nearest neighbors of the test document are found among the given training documents [24]. Second, the category of the test document is determined from the category labels of these neighbors. The conventional approach assigns to the test document the most common category label among its k nearest neighbors. The conventional KNN is the basis of the extended weighted KNN, in which the contribution of each neighbor is weighted by its proximity to the test document. The similarities of the neighboring documents in each class are then summed to obtain the class score; i.e., the score of class $c_j$ for document $x$ is

$$\mathrm{score}(x, c_j) = \sum_{d_i \in N(x)} \cos(x, d_i)\, y(d_i, c_j),$$

where $d_i$ is a training document, $N(x)$ is the set of the $k$ training documents nearest to $x$, $\cos(x, d_i)$ is the cosine similarity between $x$ and $d_i$, and $y(d_i, c_j)$ is a function whose value is 1 if $d_i$ belongs to class $c_j$ and 0 otherwise. The test document $x$ is assigned to the class with the highest score.
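The weighted KNN scoring described above can be sketched as follows. Documents are represented here as sparse term-weight dictionaries, which is an assumption made for illustration; the function names are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_classify(x, training, k):
    """Assign x to the class with the highest summed similarity over its k nearest neighbors.

    training is a list of (vector, label) pairs.
    """
    # Step 1: find the k training documents most similar to x.
    neighbors = sorted(training, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    # Step 2: sum the similarities per class (the class score) and take the best class.
    scores = defaultdict(float)
    for vector, label in neighbors:
        scores[label] += cosine(x, vector)
    return max(scores, key=scores.get)
```

Replacing the summed similarity with a simple count of neighbors per class would give the conventional (unweighted) majority-vote KNN.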

Naïve Bayes (NB) Classifier.
The NB classifier is a simple probabilistic classifier based on Bayes' theorem, which estimates the likelihood of the classes assigned to a test document using the joint probabilities of the terms and classes of that document. The naïve aspect of the classifier originates from its assumption that all terms are conditionally independent of one another given the category. Based on this independence assumption, the parameters of each term can be learned separately, which makes the computation simpler than for non-naïve Bayes classifiers. In other words, an NB classifier assumes that the presence or absence of a particular feature of a category is unrelated to the presence or absence of any other feature. The classifier can be expressed as follows:

$$P(C_i \mid d) = \frac{P(C_i)\, P(d \mid C_i)}{P(d)},$$

where $P(C_i \mid d)$ is the posterior probability of class $C_i$ given a new instance $d$ and $P(C_i)$ is the prior probability of class $C_i$, which can be computed by

$$P(C_i) = \frac{N_i}{N},$$

where $N_i$ is the number of training samples associated with class $C_i$, $N$ is the total number of training samples, $P(d \mid C_i)$ is the likelihood of a sample $d$ given class $C_i$, and $P(d)$ is the probability of sample $d$.
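As an illustration of how these quantities can be estimated, the sketch below implements a small multinomial NB classifier. The multinomial term model and the Laplace (add-one) smoothing are assumptions added here so that unseen terms do not produce zero probabilities; the text above does not specify them. Because $P(d)$ is the same for every class, it is omitted when comparing classes.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns class priors, per-class term counts, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    # Prior P(C_i) = N_i / N, with N the total number of training documents.
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, term_counts, vocab

def predict_nb(tokens, priors, term_counts, vocab):
    """Return the class maximizing log P(C_i) + sum over terms of log P(t | C_i)."""
    best_label, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(term_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            # Laplace smoothing (assumption): add one to every term count.
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label
```

Working in log space avoids numerical underflow when many small term probabilities are multiplied.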

Decision Tree (DT) Classifier.
The DT is a commonly used inductive learning method that is characterized by its robustness to noisy data and its ability to learn detailed expressions, which makes it suitable for document classification [25]. This algorithm employs a "divide and conquer" approach, in which it divides complex decisions into several simpler ones.
In the learning stage, the DT is built from a set of labeled training examples, each represented as a record of feature values and a class label. Because the space of possible trees is large, decision tree learning is typically a top-down, recursive, greedy process that starts with an empty tree and the entire training set. The feature that carries the most information and yields the best partition of the training data is chosen as the splitting feature for the root, and the training data is then divided into disjoint subgroups according to the values of the splitting feature. The algorithm is applied recursively to every subgroup until all the instances in each subgroup belong to the same class [3].
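The greedy choice of a splitting feature can be sketched with information gain, the criterion used by ID3-style learners. Using information gain is an assumption made for illustration; the text above only states that the most informative feature is chosen. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by partitioning the rows on one feature.

    rows is a list of {feature: value} dicts aligned with labels.
    """
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_split(rows, labels, features):
    """Greedy step: pick the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))
```

A full tree learner would call `best_split` on the whole training set, partition the data on the chosen feature, and recurse on each subgroup until its labels are all identical.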

Framework of Arabic Text Classifier
When answering a user's request, the TC system is required to do the following: classify the intended document, classify it swiftly, meet the user's requirements, and achieve optimum classification efficacy [26,27].
Thus, the objective of the Arabic TC (ATC) framework presented in this study is to raise the efficiency of the ATC system by taking into account the semantic relationships and the complexity of Arabic terms.
The ATC framework depends on the following stages: preprocessing, extraction, representation, application of classifiers, and evaluation. The framework also takes into account the important issues shown in Figure 1. The first step of the ATC system is the preprocessing phase, which is an important step for document representation. It involves the initial processing of the text to choose the appropriate terms to be indexed. During the preprocessing phase, several operations are performed, such as stemming, stop-word elimination, tokenization, and normalization.
The main contribution of this study is to build an automatic Arabic text classifier that classifies documents based on morphological knowledge representation by utilizing a light stemmer. The general procedure followed in this method is shown in Figure 2. Figure 3 shows the different stages of the ATC framework, which are discussed in detail in Section 6, "Experiment."

Experiment
Arabic-language classification is a supervised learning process; three supervised ML algorithms were used in this experiment, namely, the KNN, NB, and DT classifiers [28]. In order to enhance the accuracy of the Arabic classifier, the Arabic Light10 stemmer was employed and tested. In this section, the steps shown earlier in the Arabic text classifier framework are presented and tested.

Dataset.
We used a dataset consisting of 800 documents classified into four classes. These documents were extracted from the relevant documents for four queries (i.e., each query represents a class) from an Arabic Newswire dataset that was used in recent TREC experiments [29]. Figure 4 shows a sample document from the dataset.

Preprocessing.
The aim of the preprocessing phase is to filter out nonsignificant data, such as tags (i.e., <DOC>, <DOCNO>, <DOCTYPE>, <DATE_TIME>, <BODY>, <TEXT>, <END_TIME>), from a document. In this step, the document must be converted into a format suitable for the representation process so that learning algorithms can be applied. Unnecessary characters such as punctuation and special markers are then removed. Thus, three commonly identified tasks need to be carried out in this step: tokenization and normalization, stop-word removal (in order to reduce the dimension of the feature space), and stemming and lemmatization. Based on the review of these tasks in previous studies, the following sections provide a brief description of each.

Tokenization and Normalization of Data.
According to [31], text documents are usually converted into a form suitable for analysis by a machine learning algorithm. The text is divided into separate units by using either spaces or special symbols, so that every word in the text is represented as a single unit. This procedure is called tokenization. For instance, an Arabic sentence can be tokenized on white space into a list of tokens (words). The other task, normalization, is performed before stemming and is particularly useful for Arabic script, because normalization of Arabic text reduces the various shapes of a character to a single uniform shape.
This is illustrated by the following examples:
(i) Substitute آ, إ, and أ by ا
(ii) Substitute the final ة by ه

Figure 1: Issues considered in the ATC framework. Issue 1: investigate the stop-word list in the preprocessing stage [28]. Issue 2: investigate the light stemmer in the stemming stage [29]. Issue 3: extract and represent the terms.

Figure 2: General procedure of the proposed method. Step 1: text normalization. Step 2: tokenization of the text. Step 3: stop-word removal. Step 4: use of the full word or the stem word by using the light stemmer. Step 5: extraction of the terms as a BOW for representation using the TFIDF technique. Step 6: training of the three classifiers. Step 7: testing of the three classifiers. Step 8: evaluation of the three classifiers with and without stemming after the training and testing phases.

Stop-words are words that occur frequently in documents but give no hint about the content of the document in which they appear. Stop-words are removed before text is submitted to an ATC system in order to reduce both processing time and cost. To this end, a list of stop-words is created and then applied to the indexed terms so that matching terms are eliminated. However, there is no prominent standard stop-word list for ATC systems. Consequently, the stop-word list used in [32] was also used in this experiment. Table 1 provides some examples of Arabic stop-words.
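The normalization, tokenization, and stop-word removal steps described above can be sketched as follows. The stop-word set shown is a hypothetical four-word sample for illustration only; the study uses the list of [32]. Whitespace tokenization and the function names are likewise assumptions.

```python
# Map the alef variants to a bare alef (normalization rule (i)).
ALEF_VARIANTS = str.maketrans({"آ": "ا", "إ": "ا", "أ": "ا"})

# Hypothetical sample entries, stored in normalized form; not the list used in the study.
STOP_WORDS = {"في", "من", "على", "الى"}

def normalize(text):
    """Unify alef variants and replace a word-final teh marbuta with heh (rule (ii))."""
    words = text.translate(ALEF_VARIANTS).split()
    words = [w[:-1] + "ه" if w.endswith("ة") else w for w in words]
    return " ".join(words)

def tokenize(text):
    """Split the text into tokens on white space."""
    return text.split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]
```

Because normalization runs first, the stop-word list must itself be stored in normalized form; otherwise entries containing alef variants would never match.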

Stemming Text.
The text stemming process reduces the various inflectional and derivational forms of a word to a common form called the stem [32]. For instance, the terms "work," "works," "working," "worked," and "worker" are all derived from the stem "work." Table 2 shows an example of different Arabic words derived from the same root. The root of a word is obtained by removing some or all of the affixes attached to it. In the ATC system, terms that share the same stem or root are grouped together, which effectively raises the number of documents matched to the user query. Furthermore, the ATC performance improves overall because stemming reduces the dictionary size [33].
In this paper, for stemming we followed the same steps as in [33], using the Light10 stemmer, as follows:
(1) Remove "و" ("and") for Light2, Light3, Light8, and Light10 if the remainder of the word is three or more characters long.
(2) Remove any of the definite articles if this leaves two or more letters.
(3) Go through the list of suffixes once, in order from right to left, removing any suffix found at the end of the word if this leaves two or more letters.
Table 3 shows the list of strings that are removed. Note that the prefixes shown in the table are the conjunction and the definite articles; strings that would be considered actual Arabic prefixes are not removed by the Light10 stemmer. Table 4 shows an example of affixes in an Arabic word.
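The three steps above can be sketched as a small function. The affix lists below are an assumption: they follow the commonly cited Light10 lists and should be checked against Table 3 before use. The function name is hypothetical, and this is an illustration of the rules rather than the implementation used in the experiment.

```python
# Assumed affix lists (verify against Table 3); longer articles are tried first.
DEFINITE_ARTICLES = ["وال", "بال", "كال", "فال", "ال", "لل"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]

def light_stem(word):
    """Apply the three light-stemming steps to a single word."""
    # Step 1: remove the conjunction waw if at least three characters remain.
    if word.startswith("و") and len(word) - 1 >= 3:
        word = word[1:]
    # Step 2: remove one definite article if at least two characters remain.
    for article in DEFINITE_ARTICLES:
        if word.startswith(article) and len(word) - len(article) >= 2:
            word = word[len(article):]
            break
    # Step 3: go through the suffix list once, removing each suffix found at the
    # end of the word if at least two characters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[:-len(suffix)]
    return word
```

The length checks keep very short words intact, which is what distinguishes light stemming from aggressive root extraction.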

Document Representation
Each document in the study dataset was represented by a vector with the terms as attributes and the TFIDF weight of each term as the attribute value [34]. TFIDF is a statistical way of determining the relevance of a word to a document in a corpus. It is the most commonly used term-weighting method because it accounts for both the frequency of a term within a document and its spread across the corpus. With this weighting scheme, the weight of term i in document d is proportional to the number of times the term appears in the document, the term frequency (TF), and inversely related to the number of documents in the corpus in which the term appears, through the inverse document frequency (IDF). The TFIDF weighting method thus discounts the number of occurrences of a term in a document when the term appears in most of the documents and is therefore assumed to possess little discriminating power. A standard form of this weighting is

$$w_{i,d} = \mathrm{tf}_{i,d} \times \log\frac{N}{\mathrm{df}_i},$$

where $\mathrm{tf}_{i,d}$ is the number of occurrences of term $i$ in document $d$, $\mathrm{df}_i$ is the number of documents containing term $i$, and $N$ is the total number of documents in the corpus.
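A minimal sketch of TFIDF weighting is given below. It assumes the common form tf × log(N / df) with raw term counts and a natural logarithm; the exact variant used in the study (logarithm base, smoothing, normalization) is not stated, so these choices are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)  # term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors
```

A term that occurs in every document receives a weight of zero, which reflects the idea that such a term has little discriminating power.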

Construction of the Three Classifiers.
In this experiment, the documents of the Arabic dataset were categorized using the KNN, NB, and DT classifiers in two forms: the full word (without stemming) and the stem word (the full word stemmed by the Light10 stemmer).

Evaluation and Comparison of Classification Quality.
Two measures are mainly used to evaluate the quality of the output of a classifier, namely, the F-measure and accuracy [35]. In classification problems, the evaluation is generally represented in the form of a confusion matrix. The matrix contains the numbers of instances that are correctly and wrongly classified for each class.
In practice, the most widely used evaluation metric is the accuracy (ACC) rate. It represents the classifier's effectiveness as the proportion of instances that the classifier predicted correctly. The classifier accuracy is calculated as

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
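Both measures can be computed directly from the confusion-matrix counts, as in the sketch below. The F-measure is taken here to be the balanced F1 score (the harmonic mean of precision and recall), which is an assumption since no formula for it is given; the function names are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def f_measure(tp, fp, fn):
    """Balanced F-measure (F1), assumed here: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For a multiclass problem such as the four-class dataset used here, these counts would be computed per class (one class versus the rest) and then averaged.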

Results
A comparison of the three classifiers was conducted with respect to accuracy and the number of features selected, with and without the use of stemming in the preprocessing phase. Tables 5 and 6 show the results for the three classifiers with and without the stemmer, respectively. The tables show that, without a stemmer, DT outperformed NB and KNN, achieving 90% accuracy as compared to 33.83% and 26.11%, respectively. When a stemmer was included in the preprocessing phase, all three classifiers improved their performance, and again DT produced the best result with 93%, as compared to NB with 35% and KNN with 26.36%. Thus, the use of a stemmer improved the accuracy of all three classifiers. Furthermore, the tables show that the use of a stemmer reduced the number of features used by the classifiers by around 50%. Figure 5 provides a graphical illustration of the results, from which we can conclude that the number of features affects the performance of NB and KNN. With all features, KNN achieved an accuracy of 26.12%, and with the stemmer its performance remained unsatisfactory, with an accuracy of 26.36%. For NB, the stemmer improved accuracy by around 1.8%, but DT still performed better. We can conclude that DT handles a huge number of features better than NB and KNN. The results show that the decision tree with the light stemmer achieved the best classification accuracy, 93%.

Table 3: List of prefixes and suffixes eliminated by Light10.

Conclusion and Future Work
In this paper, prior to developing our proposed method, we reviewed several previous studies that improved our understanding of the study problem, namely, the classification of Arabic text, and of potential solutions. Given the vast and growing amount of information in Arabic that is available online, the main aim of this study was to save the effort and cost incurred by both users and developers in searching for and using such data. In this work, we addressed a weakness of classifiers previously used for TC, such as KNN, NB, and DT: their poor performance when handling a huge number of features. Based on our experimental outcomes, we find that DT with a stemmer can improve efficiency and outperform the other classifiers compared in this work. However, the dimensionality of the terms without light stemming remains the primary weakness in the preprocessing phase, and feature selection is needed as future work to cope with the huge number of terms. As further future work, we propose improving the text classifier with deep reinforcement Q-learning combined with our proposals. We also recommend the use of other classification criteria not used in this work.

Data Availability
The data are available at https://catalog.ldc.upenn.edu/LDC2001T55 and are not free to access.