Validation of Text Data Preprocessing Using a Neural Network Model

Many artificial intelligence studies focus on designing new neural network models or optimizing hyperparameters to improve model accuracy. To develop a reliable model, appropriate data are required, and data preprocessing is an essential part of acquiring the data. Although various studies regard data preprocessing as part of the data exploration process, those studies lack awareness about the need for separate technologies and solutions for preprocessing. Therefore, this study evaluated combinations of preprocessing types in a text-processing neural network model. Better performance was observed when two preprocessing types were used than when three or more preprocessing types were used for data purification. More specifically, using lemmatization and punctuation splitting together, lemmatization and lowering together, and lowering and punctuation splitting together showed positive effects on accuracy. This study is significant because the results allow better decisions to be made about the selection of the preprocessing types in various research fields, including neural network research.


Introduction
Recently, attempts have been made to increase work efficiency through studies using similarities between sentences. Examples include document classification, plagiarism detection, document summarization, paraphrasing, and automatic question-and-answer systems using a similarity measurement model between sentences [1].
Studying the similarity between sentences requires a deep understanding of the semantic and structural information of the language. Previous studies have extracted and used features in sentences [2], but the feature-extraction process was complicated, and the performance was irregular depending on the extracted features. Therefore, attempts have been made to learn a language model that computes probability distributions without extracting features. The linguistic model, which was based on statistical theory, used the conditional probability of a single word (unigram) or a sequence of multiple words (n-gram). In addition, a method has been proposed that combines a word-embedding method, in which information about the meaning or structure of a word is expressed as a real-valued multidimensional vector, with a deep belief network structure that uses a prelearning method [3].
To improve the prediction accuracy of a high-performance neural-network-based sentence model or a natural-language-based study, confidence in the data should be the highest priority. The datasets in public databases are already purified. Data for research studies, in contrast, must be processed through a filtering step, in which the researchers themselves conduct the preprocessing. Therefore, it is necessary to investigate the data preprocessing features that should be selected for machine learning [4] as well as the effects of various preprocessing tasks on the performance of classification models [5][6][7].
Data integration, refinement, reduction, discretization, feature selection, and data conversion can be used for data preprocessing. Recently, studies have been conducted in which various evaluation methods or preprocessing steps are performed automatically to select appropriate data features [8]. However, most studies on machine learning to date do not include data preprocessing [9][10][11][12][13][14]. Furthermore, even in the studies where preprocessing was mentioned, only some parts of various processes, such as word normalization and elimination, were presented [15,16]. The purpose of this study is to analyze the effect of text data preprocessing on the sentence model. Whereas previous studies aimed to improve the performance of the model through preprocessing, this study focuses on the effect of combinations of data preprocessing types on performance. Various preprocessing methods were compared and analyzed, including the settings used for each preprocessing technique and the performance obtained when learning is performed according to the order of complexity of sentences.
The Materials and Methods section describes the different types of text data preprocessing. It also describes the research methods used and explains the sentence model, preprocessing type, and dataset. In the Results and Discussion section, the results of using combinations of different preprocessing types are discussed, and finally, the conclusions of this study are presented in the last section.

Text Data Preprocessing.
The quality of the data plays an important role in the performance of the algorithm. If data are not preprocessed, the algorithm may behave unexpectedly due to inconsistent data, and performance may be affected.
Existing data preprocessing studies have been mainly conducted in the field of data mining.
There have been studies that process web data to format them into an analytical form. These studies treated data preprocessing as one step in preparing data for analysis and did not explain its effect on the algorithm [17][18][19]. There are also studies that analyzed the effect of data preprocessing on predictive ability, limited to numerical data in neural network models [4,20]. Feature selection, outlier data removal, dimension reduction, etc., were conducted; however, it is difficult to understand their effects on text data. The sentence model uses word-based text data that include plural words, special characters, and numerals. Therefore, preprocessing for analysis is divided into transformation, in which the original form is transformed to a word-based form, and elimination, in which the words that are considered unnecessary for semantic interpretation are eliminated. The text preprocessing techniques are shown in Table 1.
There are three types of normalization: lowering, which converts uppercase letters to lowercase letters, stemming, and lemmatization.
Stemming, which is a normalization technique that reduces the complexity of data, removes affixes and separates stems from words with modified word forms. Lemmatization is a technique that converts words used in various forms into dictionary forms [21,22]. Table 2 compares stemming and lemmatization techniques.
In stemming, words with different roots are mapped to the same stem. Therefore, it is mainly used in search engines. Lemmatization extracts the original form of a word as the word is converted to a basic form. Therefore, lemmatization does not change the meaning of words [23][24][25].
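The difference between the two techniques can be illustrated with a deliberately small sketch. The suffix list and the lemma dictionary below are toy assumptions made for illustration only; they are not the stemmer or lemmatizer used in this study, and a real system would rely on a full algorithm and lexicon.

```python
# Toy illustration only: SUFFIXES and LEMMAS are hypothetical, hand-picked
# examples, not the resources used in the study.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"studies": "study", "wolves": "wolf", "was": "be"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary of base forms; unknown words pass through."""
    return LEMMAS.get(word, word)


words = ["studies", "wolves", "was"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
```

In this sketch the stems ("studi", "wolv") are not necessarily dictionary words, whereas the lemmas ("study", "wolf", "be") are, which mirrors the distinction drawn above.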
As an example of punctuation processing, removing "-" from the word "brute-force" to obtain the two words "brute" and "force" is called splitting. Obtaining the single word "bruteforce" instead is called merging. Splitting is the same as tokenization, which divides sentences into words.
Elimination assumes that not all words that make up a sentence, paragraph, or document have the same significance. In other words, according to this method, a word with a low frequency of occurrence, or a word with a high frequency of occurrence in a document but with low semantic information, such as a stop word, a one-syllable character, or a special character, is deleted.
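The transformation and elimination steps described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the regular expression that defines "punctuation," and the small stop-word set are illustrative choices, not the implementation used in this study.

```python
import re

# Assumption: any character that is neither a word character nor whitespace
# counts as punctuation or a special character.
PUNCT = re.compile(r"[^\w\s]")

# Assumption: a tiny illustrative stop-word list.
STOP_WORDS = {"a", "an", "the", "of"}


def lower(tokens):
    """Lowering: convert uppercase letters to lowercase."""
    return [t.lower() for t in tokens]


def split_punctuation(tokens):
    """Splitting: 'brute-force' becomes 'brute' and 'force'."""
    out = []
    for token in tokens:
        out.extend(part for part in PUNCT.split(token) if part)
    return out


def merge_punctuation(tokens):
    """Merging: 'brute-force' becomes 'bruteforce'."""
    merged = (PUNCT.sub("", token) for token in tokens)
    return [token for token in merged if token]


def eliminate(tokens, stop_words=STOP_WORDS):
    """Elimination: drop words assumed to carry little semantic information."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "A Brute-force search".split()
split_result = split_punctuation(lower(tokens))
merge_result = merge_punctuation(lower(tokens))
kept = eliminate(split_result)
```

Note that splitting lengthens the token sequence while merging and elimination shorten it, which is the property discussed later in the results.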

Research Methods.
This study was conducted to analyze the effects of data preprocessing on sentence models and not to examine fine-tuning or performance improvement. This section describes the setting of the preprocessing study, the structure of the sentence model, and the datasets used in the study. The procedure used in this study is shown in Figure 1.
The preprocessing types examined in this study, which aims to analyze the performance of combining types of text data preprocessing in a sentence model, fall into two groups: typical preprocessing types and preprocessing types developed according to the needs of the study. In the typical data preprocessing step, lowering, lemmatization, punctuation splitting and merging, and special character elimination were used, chosen with regard to the preservation and accuracy of the part-of-speech information. Splitting and merging, which transform words based on punctuation, are included among these.
This study considered that preprocessing steps such as normalization and punctuation processing could damage the meaning of a sentence by modifying it. For example, technical terms can be important in a paragraph or sentence and can help readers to understand the meaning. Because terminology can consist of a single word or multiple words, the segmentation of all words may not reflect the essential meaning of the terminology. To analyze the method of using technical terms in the sentence model, this study developed a module that can identify technical terms composed of multiple words and process each of them as a single word. In addition, an entropy-based sorting module was developed to check the effect of sentence complexity on accuracy.
This study applied various combinations of typical preprocessing types, and the techniques developed in this study were analyzed separately. Table 3 shows the preprocessing types used in the analysis.
The accuracy of the model was measured five times for each of the 25 preprocessing types, for a total of 125 measurements.

Preprocessing Techniques.
The preprocessing techniques developed in this study are the sorting module, which sorts the sentences according to their order of complexity, and the terminology-identification module. The details are described as follows.

Complexity Sorting Module.
The characteristics of the data used in machine learning are an important factor in determining the efficiency of learning [26]. Documents are composed of various sentences, and each sentence has a different length, parts of speech, and complexity. Determining entropy for information complexity is a general-purpose technology that is commonly used for signal or video compression. The entropy of a sentence is calculated according to the distribution of syllables, and the calculated entropy is used to define the sentence complexity. This study also analyzed the relationship between the complexity, which is a characteristic of a sentence, and the accuracy of the model. In other words, based on the entropy, a sorting module that orders sentences according to their complexity was developed and verified.
Information entropy is the expected value (average) of the information in data, and when the expected value is high, the data can be said to contain "much information." In other words, "much information" in a sentence indicates that the sentence is complicated on the surface. For a random variable X with probability distribution P(X), the information entropy, H(X), is defined as follows:

$$H(X) = -\sum_{i} P(x_i) \log P(x_i) \tag{1}$$

where the sum runs over the possible values x_i of X. The operation algorithm of the complexity sorting module is shown in Figure 2.
In Figure 2, D_m indicates the set of sentences in the corpus. The module reads each sentence from D_m and counts its characters according to their ASCII code values. In other words, it calculates the number of occurrences of each ASCII code in a sentence. Then, the entropy is calculated based on the ASCII code counts of the sentence. Finally, the module returns the sentences sorted in ascending or descending order of the calculated entropy.
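The steps of the complexity sorting module can be sketched as follows. This is a minimal sketch, not the module developed in this study: the function names are illustrative, and the base-2 logarithm is an assumption (the choice of base rescales the entropy but does not change the resulting sentence order).

```python
import math
from collections import Counter


def sentence_entropy(sentence):
    """Shannon entropy of the character distribution of one sentence.

    Assumption: base-2 logarithm, so the result is in bits per character.
    """
    total = len(sentence)
    if total == 0:
        return 0.0
    counts = Counter(sentence)  # occurrences of each character code
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sort_by_complexity(sentences, descending=False):
    """Return the sentences ordered by entropy (ascending by default)."""
    return sorted(sentences, key=sentence_entropy, reverse=descending)


corpus = ["abcd", "aaaa", "aabb"]
ascending = sort_by_complexity(corpus)
descending = sort_by_complexity(corpus, descending=True)
```

A sentence made of a single repeated character has entropy 0, while a sentence whose characters are all distinct has the maximum entropy for its length, so the sort places repetitive sentences first in ascending order.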

Module to Identify Technical Terms Composed of Complex Nouns.
Technical term identification is a time-consuming and costly task whose methods can be divided into statistical and rule-based methods. Statistical methods can have high portability because they are not affected by domain restrictions [27]. However, the low accuracy of the identified terms and the inclusion of noise pose difficulties in semantic interpretation. The rule-based method analyzes many terms and processes them through morphemes such as prefixes and suffixes. Although this method can have low portability because the rules are manually defined and supplemented for each specific field, the accuracy of the identified terms can be high.
In this study, an algorithm that applies the rule-based method to achieve high accuracy was developed. To extract the rules, 1,540 morphemes in the technical term corpus published by the Japan Information Processing Society in 2018 were analyzed. The analysis found 82 part-of-speech compositions among the technical terms, which consisted of single words or combinations of morphemes.
The technical term identification algorithm we developed is shown in Figure 3.
First, morphemes are analyzed for each word in the sentence. For this analysis, NLTK's pos_tag function was used. Second, technical terms are identified from the learning data using the extracted rules. The most common part-of-speech compositions were singular noun (485 terms), adjective + singular noun (396 terms), and singular noun + singular noun (257 terms). Third, a search engine (Wikipedia API) is used to verify the identified technical terms. When a search result for a technical term exists in the search engine, the sentence is rewritten with the term marked as identified. For example, in the sentence "People in a car race," "car race" is identified, and the sentence is converted into "People in a car_race." Fourth, the sentence in which technical terms have been processed is newly stored in Transformed-D. Learning of the sentence model was conducted using the dataset in Transformed-D.
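The second to fourth steps can be sketched as below. This is a sketch under stated assumptions rather than the algorithm of Figure 3: the input is assumed to be already tagged with Penn Treebank tags (the tag set that NLTK's pos_tag returns), only three multiword patterns are listed instead of the full rule set, and `is_known_term` is a hypothetical callback standing in for the Wikipedia lookup.

```python
# Assumption: a small subset of part-of-speech patterns, longest first.
# JJ = adjective, NN = singular noun (Penn Treebank tags).
TERM_PATTERNS = [("JJ", "NN", "NN"), ("JJ", "NN"), ("NN", "NN")]


def join_terms(tagged, is_known_term):
    """Rewrite a tagged sentence, joining verified multiword terms with '_'.

    tagged        : list of (word, tag) pairs
    is_known_term : hypothetical verifier, returns True if the candidate
                    term has a search result
    """
    out = []
    i = 0
    while i < len(tagged):
        for pattern in TERM_PATTERNS:
            window = tagged[i : i + len(pattern)]
            if tuple(tag for _, tag in window) != pattern:
                continue
            words = [word for word, _ in window]
            if is_known_term(" ".join(words)):
                out.append("_".join(words))
                i += len(pattern)
                break
        else:
            # No verified pattern starts here: keep the word unchanged.
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)


# Hand-tagged example; in practice the tags would come from a POS tagger.
tagged_sentence = [
    ("People", "NNS"), ("in", "IN"), ("a", "DT"), ("car", "NN"), ("race", "NN"),
]
known_terms = {"car race"}  # stand-in for search results
transformed = join_terms(tagged_sentence, lambda term: term in known_terms)
```

Separating the verifier from the pattern matching keeps the sketch testable without network access; the rewritten sentences would then be collected as the transformed dataset.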

Sentence Model.
The model used to measure the similarity between sentences consists of an encoder/decoder method, which can process two sentences. Siamese networks include two identical subnetwork components to handle each of the two inputs. In other words, Siamese networks can be used as a method to measure the similarity between two sentences. This section describes the structure and performance of the convolutional neural network (CNN)-based Siamese sentence model used in the study.

Structure of the Sentence Model.
To measure the similarity between sentences, the CNN model was implemented based on Siamese networks [9]. The model developed by Kim et al. [9] for emotion analysis and question classification is a CNN structure using one layer, but the model proposed in this study is composed of two layers. In addition, hyperparameters for filter size and feature map size were tuned. The proposed model is shown in Figure 4.
In Figure 4, n is the number of words in the sentence, k is the dimension of the word vector, and h is the filter window size. Based on the CNN, sentences X(i) and X(j) are sequentially processed through the convolutional, pooling, and fully connected layers, and feature vectors are produced as outputs. For hyperparameters, the filter sizes were set to 2, 3, and 4, each with 50 feature maps. The dropout was set to 0.5. To compare two sentences, X(i) and X(j), both are encoded by the same neural network structure, and the distance between the resulting feature vectors is computed.
If the characteristics of the two sentences are well expressed, the distance between the vectors of similar sentences is small. Otherwise, the distance is large. In this study, the Manhattan distance, which has shown excellent performance as a similarity function, was applied (Jonas Mueller, 2016). This study was implemented using Python 3.6.8 in a Linux 16.04 operating system environment.
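The Manhattan distance between two feature vectors can be sketched as follows. The function names are illustrative. The exponential mapping of the distance to a similarity score in (0, 1] follows the formulation commonly used with Siamese networks in the cited work by Mueller; whether this study used exactly that mapping is an assumption.

```python
import math


def manhattan_distance(u, v):
    """L1 distance: the sum of absolute coordinate differences."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def manhattan_similarity(u, v):
    """Assumed similarity mapping exp(-L1): 1.0 for identical vectors,
    approaching 0 as the distance grows."""
    return math.exp(-manhattan_distance(u, v))


# Hypothetical encoder outputs for two sentences.
feature_i = [0.5, 1.0, 0.0]
feature_j = [0.5, 0.0, 1.0]
distance = manhattan_distance(feature_i, feature_j)
similarity = manhattan_similarity(feature_i, feature_j)
```

With this mapping, a small distance between the encodings yields a similarity close to 1, matching the statement that well-expressed similar sentences lie close together.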

Comparison of the Performance of Sentence Models.
In the present study, a CNN-based model (Figure 4), a Siamese LSTM model [28], and a transformer model [29] were implemented for model selection. For the parameters of each model, the number of epochs was set to 10, the batch size to 512, and the learning rate to 0.001; the learning time and accuracy are shown in Table 4.
To evaluate how the performance was affected using combinations of various types of preprocessing, a study was conducted using a CNN-based model, which had low accuracy but the fastest learning time.

Dataset.
For model training, the Stanford Natural Language Inference (SNLI) corpus was used. The SNLI corpus is a dataset that expresses the logical relationship between two sentences, and it is used in research on various kinds of inference [30,31]. The corpus consists of 367,369 instances, of which 257,158 (70%) form the training set, 36,737 (10%) the validation set, and 73,474 (20%) the testing set. The structure of the SNLI corpus is shown in Table 5.
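The reported split sizes are consistent with a 70/10/20 partition of the 367,369 instances. The following sketch reproduces the arithmetic; the rounding rule (round the training and validation sizes, assign the remainder to testing) is an assumption, since the text does not state how the partition was computed.

```python
def split_sizes(total, train_frac=0.7, val_frac=0.1):
    """Return (train, validation, test) sizes for a dataset of `total` items.

    Assumption: train and validation sizes are rounded to the nearest
    integer, and the test set receives whatever remains.
    """
    n_train = round(total * train_frac)
    n_val = round(total * val_frac)
    n_test = total - n_train - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(367_369)
```

Under this assumption the three sizes match the counts given above and sum to the full corpus size.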
This structure includes two sentences and a corresponding label that is based on the similarity between the sentences. For example, "A person on a horse jumps over a broken down airplane" and "A person is at a diner, ordering an omelette" in the first line are not semantically/logically equivalent, and therefore the label is 0 (False). Table 6 shows the results when the models with different combinations of preprocessing types were sorted by accuracy.

Results and Discussion
The variance of the average value of the results by the preprocessing type ranged from a minimum of 0.05 to a maximum of 0.34. Based on analysis of the results, we determined that a combination of two preprocessing techniques showed good performance. If only two preprocessing techniques can be used, lemmatization and punctuation splitting (No. 1) are good candidates.
This combination showed the highest accuracy with 79.09%, which is 0.73% higher than the accuracy without preprocessing (No. 21). Furthermore, it was found that the use of the normalization techniques lemmatization and lowering together (No. 2) or the use of lowering with punctuation splitting or merging (Nos. 3 and 4) also increased accuracy. It should be noted that each of these combinations includes at least one of lemmatization, lowering, and punctuation splitting. Lemmatization and lowering are techniques that can improve accuracy by normalizing different words. If only one technique was used (Nos. 9, 10, 11, 15, and 18), then the accuracy was higher than that without preprocessing. Among these, the accuracies associated with lemmatization (No. 9), lowering (No. 10), and punctuation splitting (No. 11) were similar and ranged from 78.842% to 78.876%. This implies that splitting a word into two words based on punctuation, such as during normalization or punctuation splitting, can extend the length of a sentence and have a positive effect on accuracy.
Characteristic features include the use of special character elimination and punctuation merging. When special character elimination was combined with lemmatization and lowering, the accuracy was increased (Nos. 5, 7, and 8). However, if lemmatization or lowering was used separately, the accuracy decreased (Nos. 12, 13, 16, 19, 20, and 22). In addition, except for the case where it was combined with lowering (No. 3), the accuracy decreased when punctuation merging was used.
This contrasts with the fact that [...] The accuracy was lower than that without preprocessing by 5.54%, and it was also relatively low compared with other preprocessing combinations. As suggested by the algorithm, there are many technical terms that contain two or more words, such as in the forms (adjective-singular noun), (singular noun-singular noun), and (adjective-singular noun-singular noun). As a result, the meanings of these terms cannot be correctly interpreted. In other words, because the technical terms consisting of two or more segments were processed as one word and the length of the sentence was shortened, the accuracy decreased.
When sentences were sorted according to their entropy complexity (Nos. 23 and 24), the accuracies were 78.624% and 78.286% for the descending and ascending order, respectively.
This represents a difference of +0.322% and −0.016% compared with data that were not preprocessed. Therefore, ordering sentences according to their complexity may not affect the accuracy.

Conclusions
In neural network research, data, algorithms, and parallel hardware are essential elements. Even with good algorithms and high-performance hardware, studies cannot be conducted if the quality of data is low or no data are available. Despite its importance, many existing neural network studies do not provide any information about data preprocessing. This study analyzed the effect of text data preprocessing on sentence models. To this end, experiments were conducted to evaluate combinations of typical data preprocessing types. Furthermore, the effects of two new techniques on the accuracy of the model were analyzed: preprocessing of technical terms composed of compound words and determining the learning order based on data complexity.
Based on the results of this study, the following conclusions can be drawn. First, when only two preprocessing techniques are used, we recommend using lemmatization and punctuation splitting, lemmatization and lowering, or lowering and punctuation splitting. Second, when only one preprocessing technique is used, it is better to use lemmatization, lowering, or punctuation splitting. Third, to improve accuracy, it is generally not recommended to use a preprocessing type that shortens the lengths of sentences. Fourth, the use of special character elimination and normalization techniques does not contribute to improving the accuracy. Fifth, setting the learning order according to sentence complexity does not contribute to improving the accuracy.
Building predictive or sentence models from refined data can help improve the performance of the model. The accuracy results for text data preprocessing in this study suggest that certain combinations of preprocessing types could improve performance when various models are established. Consequently, this study is significant in that it allows better decision-making about which preprocessing type should be selected according to the purpose of the study or the type of model being constructed.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.