Improved Feature Weight Algorithm and Its Application to Text Classification

Text preprocessing is one of the key problems in pattern recognition and plays an important role in the process of text classification. Text preprocessing has two pivotal steps: feature selection and feature weighting. The preprocessing results directly affect the classifiers' accuracy and performance. Therefore, choosing appropriate algorithms for feature selection and feature weighting can greatly improve the performance of classifiers. Based on Gini Index theory, this paper proposes an Improved Gini Index algorithm, which constructs a new feature selection and feature weighting function. The experimental results show that this algorithm can improve the classifiers' performance effectively. The algorithm is also applied to a sensitive information identification system, where it achieves good results: its precision and recall are higher than those of traditional algorithms, and it can identify sensitive information on the Internet effectively.


Introduction
The information in the real world is always in disorder. Usually, we need to classify the disordered information for our cognition and learning. In the field of information processing, text is the most basic form of expressing information, such as news, websites, and online chat messages. Therefore, text classification is becoming an important task. Text classification (TC) [1] is the problem of automatically assigning predefined categories to free text documents. The Vector Space Model (VSM) [2,3] is widely used to express text information. That is, a text $d$ can be expressed as $V(d) = \{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \langle t_3, w_3\rangle, \ldots, \langle t_n, w_n\rangle\}$, where $t_1, t_2, \ldots, t_n$ are the features of the text and $w_1, w_2, \ldots, w_n$ are the weights of the features. Even a moderate-sized text collection consists of tens or hundreds of thousands of feature terms. This is prohibitively high for many machine learning algorithms, and only a few neural network algorithms can handle such a large number of input features. However, many of these features are redundant, which reduces the precision of TC algorithms. Hence, a central difficulty in TC research is how to reduce the dimensionality of the feature space.
Feature selection (FS) [4,5] finds the minimum feature subset that can represent the original text; that is, FS can be used to reduce the redundancy of features, improve the comprehensibility of models, and identify the hidden structures in a high-dimensional feature space. FS is a key step in the process of TC, since it prepares the input for classification algorithms, and the precision of FS has a direct impact on the performance of classification. The common methods of FS often use machine learning algorithms [6], such as Information Gain, Expected Cross Entropy, Mutual Information, Odds Ratio, and CHI. Each algorithm has its own advantages and disadvantages. On English data sets, Yang and Pedersen's experiments showed that the Information Gain, Expected Cross Entropy, and CHI algorithms achieve better performance [7]. On Chinese data sets, the experiments of Liqing et al. showed that the CHI and Information Gain algorithms achieve the best performance and the Mutual Information algorithm the worst [8]. Many researchers have applied other theories or algorithms to FS; references [9,10] adopted the Gini Index for FS and achieved good performance. The Gini Index is a measure of the impurity of a set, which describes the importance of features in classification. It is widely used for splitting attributes in decision tree algorithms [11]. That is to say, the Gini Index can identify the importance of features.
On the other hand, feature weighting (FW) is another important aspect of FS. TF-IDF [12] is a popular FW algorithm. However, TF-IDF is not well suited to FW in TC. The basic idea of TF-IDF is that a feature term is more important if it has a higher frequency in a text, known as Term Frequency (TF), and less important if it appears in many different text documents in the training set, known as Inverse Document Frequency (IDF). In TC, feature frequency is one of the most important aspects regardless of whether the feature appears in one or more texts. Many researchers have improved the TF-IDF algorithm by replacing the IDF part with FS algorithms, such as TF-IG and TF-CHI [13][14][15]. This paper focuses on an Improved Gini Index algorithm and its application and proposes an FW algorithm, known as TF-Gini, that uses the Improved Gini Index in place of the IDF part of TF-IDF. The experimental results show that TF-Gini is a promising algorithm. To test the performance of the Improved Gini Index and TF-Gini algorithms, we use kNN [16,17], fkNN [18], and SVM [19,20] as the benchmark TC algorithms.
As an application of the TF-Gini algorithm, this paper designs and implements a sensitive information monitoring system. Sensitive information on the Internet is a phenomenon that information experts are eager to investigate. Since Internet applications are developing very fast, many new words and phrases emerge on the Internet every day, and these pose challenges to sensitive information monitoring. At present, most network information monitoring and filtering algorithms are inflexible and inaccurate. Building on the previous research results, we use TF-Gini as the core algorithm and design and implement a system for sensitive information monitoring. This system can update sensitive words and phrases in time and performs better at monitoring Internet information.
(1) Words Segmentation. In English and other Western languages there is a space delimiter between words. In Chinese and some other Asian languages there are no space delimiters between words. Chinese word segmentation is the basis for all Chinese text processing. There are many Chinese word segmentation algorithms [21,22], such as ICTCLAS (Institute of Computing Technology, Chinese Academy of Science). In this paper, we use ICTCLAS for Chinese word segmentation.

Text Classification
(2) Stop Words or Stemming. A text document always contains hundreds or thousands of words. Many of these words appear with very high frequency but are useless for classification, such as "the," "a," and other articles and adverbs. These words are called Stop Words [23]. Stop Words are language-specific function words that occur frequently but carry no information. The first step in text processing is to remove these Stop Words.
Stemming techniques are used to find the root or stem of a word. Stemming converts words to their stems, which incorporates a great deal of language-dependent linguistic knowledge. The hypothesis behind stemming is that words with the same stem or root mostly describe the same or relatively close concepts in the document. Hence, words can be conflated by using stems. For example, the words "user," "users," "used," and "using" can all be stemmed to the word "USE." In Chinese text classification, we only need to remove Stop Words because there is no concept of stemming in Chinese.
(3) VSM (Vector Space Model). VSM is very popular in natural language processing, especially for computing the similarity between documents. VSM maps a document to a vector. For example, if we view each feature word as a dimension and its word frequency as the value, the document can be represented as an $n$-dimensional vector; that is, $d = \{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots, \langle t_n, w_n\rangle\}$, where $t_i$ ($1 \le i \le n$) is the $i$th feature word and $w_i$ is its weight.
(5) Classification and Building Classification Model. After the above steps, all the training set texts have been mapped to the VSM. This step applies the classification algorithms to the VSM and builds a classification model that can assign an appropriate category to new input samples.
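The mapping in step (3) can be illustrated with a minimal sketch that turns a tokenized document into feature and weight pairs, using raw term frequency as the weight. The function name `to_vsm` and the toy tokens and vocabulary are hypothetical and serve only as an illustration.

```python
from collections import Counter


def to_vsm(tokens, vocabulary):
    """Map a tokenized document to (feature, weight) pairs.

    The weight is the raw term frequency; features absent from the
    document get weight 0.
    """
    counts = Counter(tokens)
    return [(term, counts[term]) for term in vocabulary]


# Hypothetical toy document (after segmentation and stop-word removal).
tokens = ["price", "oil", "price", "market"]
vocabulary = ["market", "oil", "price", "trade"]
print(to_vsm(tokens, vocabulary))
# [('market', 1), ('oil', 1), ('price', 2), ('trade', 0)]
```

In practice the raw frequency would be replaced by a feature weighting function.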

The Classification Process.
The steps of word segmentation, Stop Words or stemming, and VSM are similar to those of the training process. The classification process assigns an appropriate category to new samples. When a new sample arrives, the classification model calculates the most likely class label and assigns it to the sample. Results evaluation is a very important step in the text classification process: it indicates whether the algorithm performs well.

Feature Selection.
Although the Stop Words and stemming steps remove some useless words, the VSM still contains many insignificant words, which can even have a negative effect on classification. It is difficult for many classification algorithms to handle high-dimensional data sets. Hence, we need to reduce the dimensionality of the feature space and improve the classifiers' performance. FS uses machine learning algorithms to choose appropriate words that can represent the original text, which reduces the dimensions of the feature space.
The definition of FS is as follows.
Select $m$ features from the original $n$ features ($m \le n$). The $m$ features represent the contents of the text more concisely and efficiently. The commonly used FS algorithms are described as follows.

Information Gain (IG).
IG is widely used in the machine learning field as a criterion of the importance of a feature to a class. The value of IG equals the difference in information entropy before and after the appearance of a feature in a class. The greater the IG, the more information the feature contains and the more important it is in TC. Therefore, we can select the best feature subset according to the features' IG. The computing formula is defined in terms of the following quantities: $t$ is the feature word, $P(t)$ is the probability that a text in the training set contains $t$, $P(\bar{t})$ is the probability that a text in the training set does not contain $t$, $P(c_i)$ is the probability of class $c_i$ in the training set, $P(t \mid c_i)$ is the probability that a text in $c_i$ contains $t$, $P(\bar{t} \mid c_i)$ is the probability that a text in $c_i$ does not contain $t$, and $|C|$ is the total number of classes in the training set.
The shortcoming of the IG algorithm is that it takes into account features that do not appear. Although in some circumstances nonappearing features can contribute to classification, their contribution is far less than that of the features appearing in the training set. Moreover, if the training set is unbalanced and $P(\bar{t}) \gg P(t)$, the value of IG will be dominated by $P(\bar{t}) \sum_{i=1}^{|C|} P(c_i \mid \bar{t}) \log P(\bar{t} \mid c_i)$, and the performance of IG can be significantly reduced.
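As a rough illustration, the sketch below computes the textbook form of information gain, that is, the class entropy minus the class entropy conditioned on the presence or absence of a term, from per-class document counts. The function name `information_gain` and the count-based inputs `n_tc` and `n_c` are assumptions of this sketch, and the textbook form may differ in detail from the variant discussed here.

```python
import math


def information_gain(n_tc, n_c):
    """Textbook information gain of a term.

    n_tc[i]: number of documents in class i that contain the term.
    n_c[i]:  total number of documents in class i.
    """
    n = sum(n_c)
    n_t = sum(n_tc)          # documents containing the term
    n_not = n - n_t          # documents not containing the term

    def plogp(p):
        return p * math.log2(p) if p > 0 else 0.0

    class_entropy = -sum(plogp(c / n) for c in n_c)
    conditional = 0.0
    if n_t > 0:
        conditional += (n_t / n) * -sum(plogp(a / n_t) for a in n_tc)
    if n_not > 0:
        conditional += (n_not / n) * -sum(
            plogp((c - a) / n_not) for a, c in zip(n_tc, n_c)
        )
    return class_entropy - conditional


# A term that appears only in the first of two balanced classes.
print(information_gain([10, 0], [10, 10]))  # 1.0
```

A term distributed identically over the classes, such as `information_gain([5, 5], [10, 10])`, scores zero under this definition.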

Expected Cross Entropy (ECE).
ECE is also known as the KL distance, and it reflects the distance between the probability of the theme class and the probability of the theme class under the condition of a specific feature. The computing formula is defined in terms of the following quantities: $t$ is the feature word, $P(t)$ is the probability that a text in the training set contains $t$, $P(c_i)$ is the probability of class $c_i$ in the training set, $P(t \mid c_i)$ is the probability that a text in $c_i$ contains $t$, and $|C|$ is the total number of classes in the training set. ECE has been widely used in feature selection for TC and achieves good performance. Compared with the IG algorithm, ECE does not consider the case in which a feature does not appear, which reduces the interference of rare features and improves classification performance. However, it also has some limitations.
(2) If $P(t \mid c_i) < P(c_i)$, then $\log(P(t \mid c_i)/P(c_i)) < 0$. The smaller $P(t \mid c_i)$ is and the larger $P(c_i)$ is, the smaller the logarithmic value, which indicates that feature $t$ has a weaker association with class $c_i$.
(3) When $P(t \mid c_i) = 0$, there is no association between feature $t$ and class $c_i$. In this case, the logarithmic expression is undefined. Introducing a small parameter, for example, $\varepsilon = 0.0001$, so that the term becomes $\log((P(t \mid c_i) + \varepsilon)/P(c_i))$, makes the logarithm well defined.
In summary, we combine (2) and (3) in the ECE formula. If feature $t$ exists in only one class, the value of the information entropy is zero; that is, $\mathrm{IE}(t) = 0$. Hence we introduce a small parameter in the denominator as a regulator.

Mutual Information (MI).
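In TC, MI expresses the association between feature words and classes. The MI between $t$ and $c_i$ is defined as $\mathrm{MI}(t, c_i) = \log \frac{P(t, c_i)}{P(t)\,P(c_i)}$. Over all classes, it can be combined as the weighted average $\mathrm{MI}_{\mathrm{avg}}(t) = \sum_{i=1}^{|C|} P(c_i)\,\mathrm{MI}(t, c_i)$ or the maximum $\mathrm{MI}_{\max}(t) = \max_{1 \le i \le |C|} \mathrm{MI}(t, c_i)$, where $P(c_i)$ is the probability of class $c_i$ in the training set and $|C|$ is the total number of classes in the training set. Generally, $\mathrm{MI}_{\max}(t)$ is commonly used.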
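A minimal sketch of the per-class MI score and its maximum over classes, assuming the standard pointwise definition $\log(P(t, c_i)/(P(t)P(c_i)))$ estimated from document counts. The function names and the count-based inputs are hypothetical; a class in which the term never occurs is given a score of negative infinity, which is a choice of this sketch.

```python
import math


def mutual_information(n_tc, n_c):
    """Per-class MI scores for a term.

    n_tc[i]: number of documents in class i that contain the term.
    n_c[i]:  total number of documents in class i.
    """
    n = sum(n_c)
    p_t = sum(n_tc) / n
    scores = []
    for a, c in zip(n_tc, n_c):
        p_tc = a / n      # P(t, c_i)
        p_c = c / n       # P(c_i)
        # log(0) is undefined, so a term absent from a class scores -inf.
        scores.append(math.log(p_tc / (p_t * p_c)) if p_tc > 0 else float("-inf"))
    return scores


def mi_max(n_tc, n_c):
    """Maximum MI of the term over all classes."""
    return max(mutual_information(n_tc, n_c))


print(mi_max([5, 5], [10, 10]))   # 0.0, the term is independent of the class
print(mi_max([10, 0], [10, 10]))  # log(2), the term marks the first class
```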

Odds Ratio (OR).
The computing formula can be described as follows: $\mathrm{OR}(t, c_{\mathrm{pos}}) = \log \frac{P(t \mid c_{\mathrm{pos}})\,(1 - P(t \mid c_{\mathrm{neg}}))}{P(t \mid c_{\mathrm{neg}})\,(1 - P(t \mid c_{\mathrm{pos}}))}$, where $t$ is the feature word, $c_{\mathrm{pos}}$ represents the positive class and $c_{\mathrm{neg}}$ the negative class, $P(t \mid c_{\mathrm{pos}})$ is the probability that a text in class $c_{\mathrm{pos}}$ contains $t$, and $P(t \mid c_{\mathrm{neg}})$ is the conditional probability that a text in any class other than $c_{\mathrm{pos}}$ contains $t$. The OR metric measures membership and nonmembership of a specific class with its numerator and denominator, respectively. Therefore, the numerator must be maximized and the denominator minimized to obtain the highest score. The formula is a one-sided metric because the logarithm produces negative scores when the value of the fraction is between 0 and 1; features with negative values point to the negative class. Thus, if we only want to identify the positive class and do not care about the negative class, Odds Ratio has a great advantage. Odds Ratio is suitable for binary classifiers.

$\chi^2$ (CHI).
The $\chi^2$ statistic measures the relevance between the feature $t$ and the class $c_i$. The higher the score, the stronger the relevance between the feature $t$ and the class $c_i$, which means that the feature has a greater contribution to the class. The computing formula can be described as follows: $\chi^2(t, c_i) = \frac{|D|\,(A_1 A_4 - A_2 A_3)^2}{(A_1 + A_3)(A_2 + A_4)(A_1 + A_2)(A_3 + A_4)}$, where $t$ is the feature word, $A_1$ is the number of texts that contain $t$ and belong to class $c_i$, $A_2$ is the number of texts that contain $t$ but do not belong to class $c_i$, $A_3$ is the number of texts that belong to class $c_i$ but do not contain $t$, $A_4$ is the number of texts that neither contain $t$ nor belong to class $c_i$, and $|D|$ is the total number of text documents in the training set. When the feature $t$ and class $c_i$ are independent, $\chi^2(t, c_i) = 0$, which means that the feature $t$ carries no identification information. We calculate the values for each category and then use formula (9) to compute the value for the entire training set.
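The per-class statistic can be sketched directly from the four contingency counts. The sketch below uses the standard $2 \times 2$ form of $\chi^2$, with the total number of documents taken as the sum of the four counts; the function name `chi_square` is hypothetical.

```python
def chi_square(a1, a2, a3, a4):
    """Chi-square score of a term for one class from a 2x2 contingency table.

    a1: documents containing the term and in the class
    a2: documents containing the term but not in the class
    a3: documents in the class but not containing the term
    a4: documents neither containing the term nor in the class
    """
    n = a1 + a2 + a3 + a4
    denom = (a1 + a3) * (a2 + a4) * (a1 + a2) * (a3 + a4)
    if denom == 0:
        return 0.0  # degenerate table: the term or the class never varies
    return n * (a1 * a4 - a2 * a3) ** 2 / denom


print(chi_square(5, 5, 5, 5))     # 0.0, term and class are independent
print(chi_square(10, 0, 0, 10))   # 20.0, term perfectly identifies the class
```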

The Gini Feature Selection Algorithm
3.1. The Improved Gini Index Algorithm

Gini Index (GI).
The GI is an impurity-based attribute splitting method. It is widely used in the CART algorithm [24], the SLIQ algorithm [25], and the SPRINT algorithm [26]. It chooses the splitting attribute and obtains very good classification accuracy. The GI algorithm can be described as follows.
Suppose that $S$ is a set of $s$ samples, and these samples have $m$ different classes ($c_i$, $i = 1, 2, 3, \ldots, m$). According to the differences of classes, we can divide $S$ into $m$ subsets ($S_i$, $i = 1, 2, 3, \ldots, m$). Suppose that $S_i$ is the sample set that belongs to class $c_i$ and that $s_i$ is the number of samples in $S_i$; then the GI of set $S$ is $\mathrm{Gini}(S) = 1 - \sum_{i=1}^{m} P_i^2$, where $P_i$ is the probability that a sample in $S$ belongs to $c_i$, calculated as $s_i/s$. When $\mathrm{Gini}(S)$ is 0, its minimum, all the members in the set belong to the same class; that is, the maximum useful information can be obtained. When the samples in the set are distributed equally over the classes, $\mathrm{Gini}(S)$ is at its maximum; that is, the minimum useful information can be obtained.
For an attribute $A$ with $k$ distinct values, $S$ is partitioned into $k$ subsets $\{S_1, S_2, \ldots, S_k\}$. The GI of $S$ with respect to the attribute $A$ is defined as $\mathrm{Gini}_A(S) = \sum_{j=1}^{k} \frac{|S_j|}{|S|}\,\mathrm{Gini}(S_j)$. The main idea of the GI algorithm is that the attribute with the minimum value of GI is the best attribute to split on.
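The two quantities can be sketched as follows, using the standard impurity form one minus the sum of squared class proportions and its size-weighted average over the subsets induced by an attribute. The function names `gini` and `gini_split` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(values, labels):
    """Gini index of a set with respect to an attribute.

    values[i] is the attribute value of sample i, labels[i] its class.
    The result is the size-weighted impurity of the induced subsets.
    """
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(group) / n * gini(group) for group in groups.values())


labels = ["a", "a", "b", "b"]
print(gini(labels))                              # 0.5
print(gini_split(["x", "x", "y", "y"], labels))  # 0.0, a perfect split
print(gini_split(["x", "y", "x", "y"], labels))  # 0.5, an uninformative split
```

The attribute with the smaller `gini_split` value would be preferred for splitting.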

The Improved Gini Index (IGI) Algorithm.
The original form of the GI algorithm measures the impurity of attributes with respect to classification. The smaller the impurity is, the better the attribute is. However, many studies on GI theory have employed a measure of purity instead: the larger the value of the purity is, the better the attribute is. Purity is more suitable for text classification. The formula is as follows: $\mathrm{Gini}(t) = \sum_{i=1}^{|C|} P(t \mid c_i)^2$. This formula measures the purity of attributes with respect to categorization. However, TC usually emphasizes high-frequency words because they contribute more to judging the class of texts. When the distribution of the training set is highly unbalanced, lower-frequency feature words still contribute to judging the class of texts, although far less than the high-frequency feature words. Therefore, we define the IGI algorithm as $\mathrm{IGI}(t) = \sum_{i=1}^{|C|} P(t \mid c_i)^2\,P(c_i \mid t)^2$, where $t$ is the feature word, $|C|$ is the total number of classes in the training set, $P(t \mid c_i)$ is the probability that a text in class $c_i$ contains $t$, and $P(c_i \mid t)$ is the posterior probability that a text containing $t$ belongs to class $c_i$. This formula overcomes the shortcoming of the original GI, which considers only the features' conditional probability, by combining the conditional probability with the posterior probability. Hence, the IGI algorithm can suppress the effect of an unbalanced training set.
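A minimal sketch of how the conditional and posterior probabilities might be combined, assuming the combination is the sum over classes of the product of the squared conditional probability and the squared posterior probability. This product form, the function name `improved_gini`, and the count-based inputs are assumptions of the sketch rather than a definitive implementation.

```python
def improved_gini(n_tc, n_c):
    """Purity score combining P(t|c_i) and P(c_i|t) (assumed product form).

    n_tc[i]: number of documents in class i that contain the term.
    n_c[i]:  total number of documents in class i.
    """
    n_t = sum(n_tc)
    if n_t == 0:
        return 0.0
    score = 0.0
    for a, c in zip(n_tc, n_c):
        p_t_given_c = a / c if c else 0.0   # conditional probability
        p_c_given_t = a / n_t               # posterior probability
        score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return score


print(improved_gini([10, 0], [10, 10]))   # 1.0, term occurs in one class only
print(improved_gini([5, 5], [10, 10]))    # 0.125, term spread evenly
print(improved_gini([0, 10], [100, 10]))  # 1.0, small class in an unbalanced set
```

The last call suggests why the posterior term helps on unbalanced data: a term that fully identifies a small class still receives the maximum score.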

TF-Gini Algorithm.
The purpose of FS is to choose the most representative features of the original text. In fact, features differ in their distinguishability: some features can effectively distinguish a text from others, but other features cannot. Therefore, we use weights to express the features' distinguishability. Features with higher distinguishability get higher weights.
The TF-IDF algorithm is a classical algorithm for calculating features' weights. For a word $t$ in text $d$, the weight of $t$ in $d$ is $\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \cdot \log\frac{|D|}{\mathrm{DF}(t)}$, where $\mathrm{TF}(t, d)$ is the frequency of $t$ in text $d$, $|D|$ is the number of text documents in the training set, and $\mathrm{DF}(t)$ is the total number of texts containing $t$ in the training set.
The TF-IDF algorithm is not suitable for TC, mainly because of the shortcoming of its IDF part. TF-IDF considers a word that appears frequently across different texts in the training set to be unimportant. This consideration is not appropriate for TC. Therefore, we use the purity Gini to replace the IDF part of the TF-IDF formula. The purity Gini is as follows: $\mathrm{GiniTxt}(t) = \sum_{i=1}^{|C|} P(t \mid c_i)^{\delta}$, where $t$ is the feature word, $\delta$ is a nonzero value, $P(t \mid c_i)$ is the probability that a text in class $c_i$ contains $t$, and $|C|$ is the total number of classes in the training set. When $\delta = 2$, the formula is the original Gini formula, $\sum_{i=1}^{|C|} P(t \mid c_i)^2$. However, in TC, the experimental results indicate that the classification performance is best when $\delta = -1/2$. Therefore, the new feature weighting algorithm, known as TF-Gini, can be represented as
$\text{TF-Gini} = \mathrm{TF} \cdot \mathrm{GiniTxt}(t)$. (18)
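The contrast between the two weighting schemes can be sketched as below, assuming the standard logarithmic IDF and a purity Gini of the form "sum of class-conditional probabilities raised to a nonzero exponent". The function names and the parameter name `delta` are hypothetical, and skipping zero probabilities is a choice of this sketch so that a negative exponent does not divide by zero.

```python
import math


def tf_idf(tf, n_docs, df):
    """Classical TF-IDF weight with a logarithmic IDF."""
    return tf * math.log(n_docs / df)


def purity_gini(p_t_given_c, delta=2.0):
    """Sum of P(t|c_i) ** delta over classes, skipping zero probabilities."""
    return sum(p ** delta for p in p_t_given_c if p > 0)


def tf_gini(tf, p_t_given_c, delta=2.0):
    """TF weight with the IDF part replaced by the purity Gini."""
    return tf * purity_gini(p_t_given_c, delta)


# A term that occurs in every training document gets zero TF-IDF weight,
# even though its frequency may still matter for classification.
print(tf_idf(3, 100, 100))          # 0.0
# A term that occurs in all documents of one class and in no other class.
print(tf_gini(3, [1.0, 0.0]))       # 3.0
print(tf_gini(3, [0.5, 0.5]))       # 1.5
```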

Experimental and Analysis
(2) The fkNN Classifier. The kNN classifier is a lazy classification algorithm. It does not need a training phase. The classifier's performance is not ideal, especially when the distribution of the training set is unbalanced. Therefore, we improve the algorithm by using a fuzzy logic inference system. Its decision rule is described in terms of the training texts $X_i$ ($i = 1, 2, 3, \ldots$).
(3) The SVM Classifier. Support vector machine (SVM) is a promising classification algorithm based on statistical learning theory. SVM is highly accurate, owing to its ability to model complex nonlinear decision boundaries. It uses a nonlinear mapping to transform the original set into a higher dimension. Within this new dimension, it searches for the linear optimal separating hyperplane, that is, the decision boundary. With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. Originally, SVM solved the problem of two-class classification; the multiclass problem can be addressed using multiple binary classifications. Suppose that in the feature space a set of training vectors $x_i$ ($i = 1, 2, 3, \ldots$) belong to different classes $y_i \in \{-1, 1\}$. We wish to separate this training set with a hyperplane: $w \cdot x + b = 0$. Obviously, there are an infinite number of hyperplanes that are able to partition the data set. However, according to SVM theory, there is only one optimal hyperplane, which lies halfway within the maximal margin, defined as the sum of the distances from the hyperplane to the closest training sample of each class. As shown in Figure 2, the solid line represents the optimal separating hyperplane.
One of the problems of SVM is to find the maximum-margin separating hyperplane, that is, the one with the maximum distance to the nearest training samples. In Figure 2, $H$ is the optimal hyperplane; $H_1$ and $H_2$ are the hyperplanes parallel to $H$. The task of SVM is to maximize the margin between $H_1$ and $H_2$. The maximum margin can be written as follows: $\mathrm{margin} = \frac{2}{\|w\|}$. Therefore, from Figure 2, the process of solving SVM is to maximize the margin between $H_1$ and $H_2$, $\max \frac{2}{\|w\|}$, which is the same as minimizing $\|w\|^2/2$. Therefore, it becomes the following optimization problem: $\min \frac{1}{2}\|w\|^2$ s.t. $y_i(w \cdot x_i + b) \ge 1$. The points that satisfy $y_i(w \cdot x_i + b) = 1$, namely, the points on $H_1$ and $H_2$ in Figure 2, are called support vectors. In fact, many samples are not classified correctly by the hyperplane. Hence, we introduce slack variables $\xi_i \ge 0$. Then, the above formula becomes $\min \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$ s.t. $y_i(w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, where $C$ is a positive constant that balances the empirical error and the confidence and $n$ is the number of training samples.

Performance Evaluation.
A confusion matrix [27], also known as a contingency table or an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning algorithm. It is a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives. Table 2 shows the confusion matrix. True positive (TP) is the number of correct predictions that an instance is positive. False positive (FP) is the number of incorrect predictions that an instance is positive. False negative (FN) is the number of incorrect predictions that an instance is negative. True negative (TN) is the number of correct predictions that an instance is negative.
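The four counts, and the precision, recall, and F1 measures that this section builds on them, can be sketched as follows. The function names are hypothetical, and returning 0.0 for an undefined ratio is a convention of this sketch; the macroaverage and microaverage are shown for precision only.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) for one class treated as positive."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean F1."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_micro_precision(per_class):
    """per_class: one (TP, FP, FN) tuple per class."""
    # Macroaverage: mean of the per-class precisions.
    macro = sum(precision_recall_f1(*c)[0] for c in per_class) / len(per_class)
    # Microaverage: precision of the pooled counts.
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    micro = tp / (tp + fp) if tp + fp else 0.0
    return macro, micro


print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1], positive=1))
# (2, 1, 1, 1)
```

With one large accurate class and one small inaccurate class, for example `[(8, 2, 0), (1, 1, 0)]`, the microaverage is pulled toward the large class while the macroaverage treats both classes equally.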
Precision is the proportion of the predicted positive cases that were correct, as calculated using the equation $\mathrm{Precision} = \frac{TP}{TP + FP}$. Recall is the proportion of the actual positive cases that were correctly identified, as calculated using the equation $\mathrm{Recall} = \frac{TP}{TP + FN}$. Ideally, both precision and recall are high. However, the two measures tend to conflict. For example, if only one instance is predicted as positive and that prediction is correct, the precision is 100% but the recall is very low, because TP equals 1, FP equals 0, and FN is huge. On the other hand, if every instance is predicted as positive (i.e., FN equals 0), the recall is 100% but the precision is low because FP is large. $F_1$ [28] is the harmonic mean of precision and recall. If $F_1$ is high, the algorithm and experimental methods are ideal. $F_1$ measures the performance of a classifier by combining the above two measures and was proposed by Van Rijsbergen [29] in 1979. It can be described as follows: $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$. Precision and recall only evaluate the performance of classification algorithms in a certain category. It is often necessary to evaluate classification performance in all the categories. When dealing with multiple categories there are two possible ways of averaging these measures, namely, macroaverage and microaverage. The macroaverage weights all classes equally, regardless of how many documents belong to each. The microaverage weights all documents equally, thus favoring performance on common classes. In the formulas of macroaverage and microaverage, $|C|$ is the number of classes in the training set.
3.3.4. Experimental Results and Analysis. Figures 3 and 4 show the results on the English data set. The performance of the classical GI is the lowest, and that of the IGI is the highest. Therefore, the IGI algorithm helps improve the classifiers' performance.
From Figures 5 and 6, it can be seen that, in Chinese data sets, the classical GI still has poor performance.The best performance belongs to ECE, and then the second is IG.The performance of IGI is not the best.Since the Chinese language is different from the English language, at the processing of Stop Words, the results of Chinese word segmentation algorithm will influence the classification algorithm's accuracy.Meanwhile, the current experiments do not consider the feature words' weight.The following TF-Gini algorithm has a good solution to this problem.Although the IGI algorithm does not have the best performance in the Chinese data set, its performance is very close to the highest.Hence, the IGI feature selection algorithm is suitable and acceptable in the classification algorithm.It promotes the classifiers' performance effectively.From Figures 7 and 8, we can see that the TF-Gini weights algorithm's performance is good and acceptable for the English data set; its performance is very close to the highest.From Figures 9 and 10, we can see that on the Chinese data set the TF-Gini algorithm overcomes the deficiencies of the IGI algorithm and gets the best performance.These results show that the TF-Gini algorithm is a very promising feature weight algorithm.

Sensitive Information Identification System Based on TF-Gini Algorithm
... will produce extremely adverse impacts and serious consequences.
The commonly used methods of Internet information filtering and monitoring are based on string matching. There are a large number of synonyms and ambiguous words in natural texts. Meanwhile, many new words and expressions appear on the Internet every day. Therefore, traditional sensitive information filtering and monitoring are not effective. In this paper, we propose a new sensitive information filtering system based on machine learning. The TF-Gini text feature selection algorithm is used in this system.
(1) IKanalyzer uses a multiprocessor module and supports English letters, Chinese vocabulary, and individual words processing.
(2) It optimizes the dictionary storage and supports the user dictionary definition.

The Feature Selection.
At the stage of feature selection, the TF-Gini algorithm performs well on Chinese data sets. Feature word selection is a very important preparation for classification. In this step, we use the TF-Gini algorithm as the core algorithm of feature selection.

The Text Classification.
We use the Naïve Bayes algorithm [30] as the main classifier. To verify the performance of the Naïve Bayes algorithm used in this paper, we compare it with the kNN algorithm and the improved Naïve Bayes [31] algorithm.
The kNN algorithm is a typical algorithm in text classification and does not need a training phase. To calculate the similarity between the test sample $X$ and the training samples $X_i$, we adopt the cosine similarity. It can be described as follows: $\mathrm{sim}(X, X_i) = \frac{\sum_{j=1}^{|T|} w_{ij}\,w_j}{\sqrt{\sum_{j=1}^{|T|} w_{ij}^2}\,\sqrt{\sum_{j=1}^{|T|} w_j^2}}$, where $w_{ij}$ is the $j$th feature word's weight in $X_i$, $w_j$ is the $j$th feature word's weight in sample $X$, and $|T|$ is the dimension of the feature space. The Naïve Bayes classifier is based on Bayes' theorem. Assume $m$ classes $c_1, c_2, \ldots, c_m$ and a given unknown text $X$ with no class label. The classification algorithm predicts that $X$ belongs to the class with the highest posterior probability given $X$. It can be described as follows: $P(c_i \mid X) = \frac{P(X \mid c_i)\,P(c_i)}{P(X)}$. Naïve Bayes classification is based on a simple assumption: given the class label of a sample, the attributes are conditionally independent. Since the probability $P(X)$ is constant, the posterior is proportional to $P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)$, where $x_j$ is the $j$th feature word's weight of text $X$ in class $c_i$ and $n$ is the number of features in $X$.
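A minimal sketch of a Naïve Bayes text classifier under the conditional independence assumption. It uses token counts with add-one (Laplace) smoothing and log probabilities, which are choices of this sketch rather than details taken from the system described here; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, label). Returns the counts needed for prediction."""
    class_docs = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, token_counts, vocab


def classify_nb(tokens, model):
    """Return the class with the highest (log) posterior for the tokens."""
    class_docs, token_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_docs.items():
        total_tokens = sum(token_counts[label].values())
        score = math.log(n_docs / total_docs)          # log prior
        for tok in tokens:
            # Add-one smoothing avoids log(0) for unseen tokens.
            likelihood = (token_counts[label][tok] + 1) / (total_tokens + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb([
    (["alpha", "beta"], "sensitive"),
    (["gamma", "delta"], "nonsensitive"),
])
print(classify_nb(["beta"], model))   # sensitive
```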

Experimental Results and Analysis.
To verify the performance of the system, we run two experiments. We use tenfold cross validation and threefold cross validation on different data sets.
Experiment 1. The Chinese data set has 1,000 articles, including 500 sensitive texts and 500 nonsensitive texts. The 1,000 texts are randomly divided into 10 portions. One portion is selected as the test set and the other 9 portions are used as the training set. The process is repeated 10 times, and the final result is the average of the 10 classification results. Table 3 shows the result of Experiment 1.
Experiment 2. The Chinese data set has 1,500 articles, including 750 sensitive texts and 750 nonsensitive texts. All the texts are randomly divided into 3 parts. One part is used as the test set and the others as the training set. The final result is the average of the 3 classification results. Table 4 shows the result of Experiment 2.
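The cross-validation protocol of both experiments can be sketched as a generator of train and test index splits. The function name `k_fold_splits` and the fixed random seed are assumptions of this sketch; the classifier and the averaging of the per-fold results are left out.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # random division of the texts
    folds = [indices[i::k] for i in range(k)]     # k portions of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Tenfold split of 1,000 texts: 900 for training and 100 for testing per round.
for train, test in k_fold_splits(1000, 10):
    assert len(train) == 900 and len(test) == 100
```

Calling `k_fold_splits(1500, 3)` gives the threefold variant, with 1,000 training and 500 testing texts per round.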
From the results of Experiments 1 and 2, we can see that all three classifiers perform well, and the Naïve Bayes and Improved Naïve Bayes algorithms perform better. That is, the sensitive information system can satisfy the demands of Internet sensitive information recognition.

Conclusion
Feature selection is an important aspect of text preprocessing. Its performance directly affects the classifiers' accuracy. In this paper, we employ a feature selection algorithm, the IGI algorithm, and compare its performance with that of the IG, CHI, MI, and OR algorithms. The experimental results show that the IGI algorithm has the best performance on the English data sets and is very close to the best performance on the Chinese data sets. Therefore, the IGI algorithm is a promising algorithm for text preprocessing.
Feature weighting is another important aspect of text preprocessing, since the importance of a feature differs across categories. We assign weights to the feature words: the more important the feature word is, the larger its weight. Following the TF-IDF theory, we construct a feature weighting algorithm, the TF-Gini algorithm, by replacing the IDF part of the TF-IDF algorithm with the IGI algorithm. To test the performance of TF-Gini, we also construct TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR algorithms in the same way. The experimental results show that TF-Gini performs best on the Chinese data sets and is very close to the best performance on the English data sets. Hence, the TF-Gini algorithm can improve the classifiers' performance, especially on Chinese data sets.
This paper also introduces a sensitive information identification system, which can monitor and filter sensitive information on the Internet. Considering the performance of TF-Gini on Chinese data sets, we choose it as the text preprocessing algorithm in the system. The core classifier is the Naïve Bayes classifier. The experimental results show that the system achieves good performance.

2.1. Text Classification Process. The text classification process is shown in Figure 1. The left part is the training process, and the right part is the classification process. The function of each part can be described as follows.
2.1.1. The Training Process. The purpose of the training process is to build a classification model. New samples are assigned to the appropriate category by using this model.

Figures 7, 8, 9, and 10 show the results for the English and Chinese data sets. To determine which algorithm's performance is the best, we use different TF weights algorithms, such as TF-IDF and TF-Gini.
(1) If $P(t \mid c_i)$ and $P(c_i)$ are close to each other, then $\log(P(t \mid c_i)/P(c_i))$ is close to zero. When $P(t \mid c_i)$ is large, this indicates that feature $t$ has a strong association with class $c_i$ and should be retained. However, when $\log(P(t \mid c_i)/P(c_i))$ is close to zero, feature $t$ will be removed. In this case, we incorporate information entropy into the ECE algorithm.
In the MI formula, $t$ is the feature word, $P(t, c_i)$ is the probability that a text contains $t$ and belongs to class $c_i$, $P(t)$ is the probability that a text in the training set contains $t$, and $P(c_i)$ is the probability of class $c_i$ in the training set. It is necessary to calculate the MI between features and each category in the training set. Two formulas can be used to calculate the MI over all categories.
We choose ten big categories among them. The Chinese data set is the Fudan University Chinese text data set (19,637 texts). We use 3,522 pieces as the training texts and 2,148 pieces as the testing texts.
3.3.1. Data Sets Preparation. In this paper, we choose English and Chinese data sets to verify the performance of the FS and FW algorithms. The English data set which we choose is Reuters-21578.
$\mu_j(X) = \sum_{i=1}^{k} \mu_j(X_i)\,\mathrm{sim}(X, X_i)$, (19)
where $X_i$ ($i = 1, 2, 3, \ldots$) is the $i$th text in the training set, whose category is known, and $X$ is a new input sample with unknown category. $\mu_j(X_i)$ is equal to 1 if sample $X_i$ belongs to class $j$; otherwise it is 0. $\mathrm{sim}(X, X_i)$ indicates the similarity between the new input sample and the training text. The decision rule is as follows: if $\mu_j(X) = \max_i \mu_i(X)$, then $X \in c_j$.
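The decision rule above can be sketched as a similarity-weighted vote among the nearest neighbours. The sketch assumes cosine similarity as the similarity function and sums over the `k` most similar training texts; the function names and the toy vectors are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(x, training, k):
    """training: list of (vector, label). Returns the predicted label of x."""
    # Keep the k training texts most similar to x.
    neighbours = sorted(training, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    # Each neighbour adds its similarity to the score of its own class.
    scores = defaultdict(float)
    for vector, label in neighbours:
        scores[label] += cosine(x, vector)
    return max(scores, key=scores.get)


training = [([1.0, 0.0], "a"), ([0.9, 0.1], "a"), ([0.0, 1.0], "b")]
print(knn_classify([1.0, 0.1], training, k=2))   # a
print(knn_classify([0.0, 1.0], training, k=1))   # b
```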
Here, $X_i$ ($i = 1, 2, 3, \ldots$) is the $i$th text in the training set and $\mu_j(X_i)\,\mathrm{sim}(X, X_i)$ is the membership value of sample $X$ in class $j$. $\mu_j(X_i)$ is equal to 1 if sample $X_i$ belongs to class $j$; otherwise it is 0. From the formula, it can be seen that fkNN is based on the kNN algorithm and adds a distance weight to the kNN formula. The parameter $b$ adjusts the degree of the weight, $b \in [0, 1]$. The fkNN decision rule is as follows: if $\mu_j(X) = \max_i \mu_i(X)$, then $X \in c_j$.
Mathematical Problems in Engineering

Table 3: The result of Experiment 1.