Erroneous input queries are common in search engines. This paper uses web logs as the training set for query error checking. Through the
Recently, with the development of search engine technology, abundant web logs have been produced. These resources are naturally applied to query error checking and correction. Error checking is the crucial step in query correction. Chinese and English belong to different language families, and their queries differ in query length, spelling errors, and so forth. The average query length is 1.85 words in Chinese and 2.35 words in English [
In this paper the query logs of SOGOU (
Query log format of SOGOU.
Continued time | User ID | Query string | Page rank | Page number | URL |
---|---|---|---|---|---|
00:00:03 | 9717831746543397 | 奥运 (Olympics) | 7 | 3 | |
00:00:03 | 7954902679225404 | 芜湖旅游 (Wuhu Tourism) | 7 | 3 | |
The web log contains abundant resources for error checking. We propose a query error checking model built by mining the query log after its noise has been removed.
Query error checking has received great attention in recent years, and the models involve increasingly complex language problems [
There are two technical branches of query error checking: statistical methods and rule-based methods [
Queries are entered into the input box of the search engine. They may need corrections or adjustments to match the user's intent. When results are returned, the user clicks the best ones, and these clicks are also recorded in the log file. All of these clues are saved as query logs, so a query log is a record of the user's operations. Because users' operating habits differ, the log records are irregular. Before the logs can be used as experimental material, they must be preprocessed to remove this noise, which provides favorable conditions for the experiments.
We use
Query error checking model learning and its application.
In the query error checking model, the most important step is to calculate the cooccurrence frequency of words in context. Given the prior distribution of words in context, the word that follows each position of a query can be predicted from this prior knowledge, and the prediction can be used to check the query. For example, the query “大安门” (an erroneous form of “Tiananmen Square”, whose correct Chinese form is 天安门) is an error query that needs correction. The error checking method predicts the next word from the corpus and chooses a cue word for it. The formal description is given below.
Give a query string
This model depends on context, such as the fact that
If there are
It is a kind of Markov model for query checking;
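The context-dependent prediction described above can be illustrated with a minimal sketch. It assumes a first-order (bigram) model estimated by plain maximum likelihood from segmented queries. The function names and the toy log are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

def train_bigrams(segmented_queries):
    """Count how often each word follows each context word."""
    following = defaultdict(Counter)
    for words in segmented_queries:
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def next_word_probability(following, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 for unseen contexts."""
    if prev not in following:
        return 0.0
    total = sum(following[prev].values())
    return following[prev][nxt] / total

def predict_next(following, prev):
    """Most frequent follower of prev, or None if prev never appears as context."""
    if prev not in following:
        return None
    return following[prev].most_common(1)[0][0]

# Toy, hand-made log (hypothetical data, already segmented into words).
toy_log = [
    ["天安门", "广场"],
    ["天安门", "广场"],
    ["天安门", "图片"],
    ["芜湖", "旅游"],
]
model = train_bigrams(toy_log)
cue_word = predict_next(model, "天安门")   # most likely follower of the context
unseen = predict_next(model, "大安门")     # the erroneous form has no statistics
```

In this sketch an erroneous form such as 大安门 never appears as a context in the toy log, so the model has no follower statistics for it. That absence is the kind of signal a cooccurrence-based checker can exploit.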
Relation between parameter coverage and size of data set.
From Figure
The
There are many methods for data smoothing. In the experiment, we apply absolute discounting smoothing to the experimental data [
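As an illustration of absolute discounting, the sketch below implements the common interpolated variant with a unigram fallback. The paper does not state which variant or discount value it uses, so the discount of 0.75, the unigram fallback, and all names here are assumptions.

```python
from collections import Counter, defaultdict

def absolute_discount_model(segmented_queries, discount=0.75):
    """Return (p, vocabulary), where p(nxt, prev) is a smoothed bigram probability."""
    unigrams = Counter()
    following = defaultdict(Counter)
    for words in segmented_queries:
        unigrams.update(words)
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    total_words = sum(unigrams.values())

    def unigram_p(word):
        return unigrams[word] / total_words

    def p(nxt, prev):
        followers = following.get(prev)
        if not followers:
            # Context never seen with a follower: fall back to the unigram estimate.
            return unigram_p(nxt)
        context_count = sum(followers.values())
        # Subtract a fixed discount from every observed bigram count.
        discounted = max(followers[nxt] - discount, 0.0) / context_count
        # Probability mass freed by discounting, redistributed via unigrams.
        reserved = discount * len(followers) / context_count
        return discounted + reserved * unigram_p(nxt)

    return p, set(unigrams)

# Toy, hand-made log (hypothetical data, already segmented into words).
toy_log = [
    ["天安门", "广场"],
    ["天安门", "广场"],
    ["天安门", "图片"],
    ["芜湖", "旅游"],
]
p, vocabulary = absolute_discount_model(toy_log)
total_mass = sum(p(word, "天安门") for word in vocabulary)
```

The point of the smoothing is that a pair never observed in the log, such as 旅游 following 天安门 in the toy data, still receives a small nonzero probability. The probabilities for any seen context still sum to one over the vocabulary.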
The experiment starts from about 2,900,000 query log entries, from which noise is removed. After data cleaning, about 440,000 query entries remain without duplication; the compression ratio is 15.3%.
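The cleaning step can be sketched as follows. The paper does not list its noise criteria, so this example assumes that noise means empty strings and exact duplicates after trimming whitespace. The function names and sample queries are hypothetical.

```python
def deduplicate_queries(raw_queries):
    """Drop empty strings and exact duplicates, preserving first-seen order."""
    seen = set()
    unique = []
    for query in raw_queries:
        cleaned = query.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique

def compression_ratio(raw_count, unique_count):
    """Share of raw entries that survive cleaning."""
    return unique_count / raw_count

# Toy input (hypothetical): five raw entries, two distinct non-empty queries.
raw = ["奥运", "芜湖旅游", "奥运 ", "", "奥运"]
unique = deduplicate_queries(raw)
ratio = compression_ratio(len(raw), len(unique))
```

With the rounded counts quoted above, `compression_ratio(2_900_000, 440_000)` evaluates to roughly 15%, the same order as the reported figure.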
According to the proposed method, we use the query log for the following experiment. First, we choose 10 consecutive days of data and label the queries manually. Second, we randomly select three consecutive days of data as the training set and extract 2,000 correct queries and 2,000 error queries, which together form a test set of 4,000 queries. Finally, we segment the queries and train the bigram model with data smoothing, because the coverage of bigrams is good, and then obtain its parameters. We define the Word Cooccurring Distributes (WCD) as (
Figures
Distribution for two words in queries.
Distribution for three words in queries.
Distribution for four words in queries.
Distribution for five words in queries.
The figures above show that WCD_CQ lies above WCD_EQ to a certain degree, with a relatively clear threshold between correct queries and error queries. When the threshold is kept at an appropriate level, correct and error queries can be distinguished, so this threshold is significant for separating them. This indicates that correct queries are used more frequently than error queries.
Occasionally, error queries score above the threshold. We checked these error queries and found that their error terms are frequently used in the web log. This is common because users sometimes do not pay attention to spelling. A general table can be established for such frequent error queries, assigning them different thresholds.
From the figures above, we also conclude that, for the same number of query words, WCD_CQ tends to be greater than WCD_EQ in most cases, except for new net words that emerge rapidly.
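The threshold-based separation discussed above can be sketched as below. The sketch treats each query's WCD value as an already computed number and picks the threshold that maximizes accuracy on labelled scores. The selection rule, the function names, and the toy scores are assumptions for illustration, not the paper's procedure.

```python
def classify_by_threshold(score, threshold):
    """Label a query 'correct' if its score reaches the threshold, else 'error'."""
    return "correct" if score >= threshold else "error"

def best_threshold(correct_scores, error_scores):
    """Pick the observed score that maximizes accuracy on the labelled data."""
    candidates = sorted(set(correct_scores) | set(error_scores))
    total = len(correct_scores) + len(error_scores)

    def accuracy(threshold):
        hits = sum(s >= threshold for s in correct_scores)
        hits += sum(s < threshold for s in error_scores)
        return hits / total

    return max(candidates, key=accuracy)

# Toy scores (hypothetical): correct queries tend to score higher.
correct_scores = [0.30, 0.25, 0.40]
error_scores = [0.05, 0.10, 0.02]
threshold = best_threshold(correct_scores, error_scores)
label_high = classify_by_threshold(0.30, threshold)
label_low = classify_by_threshold(0.10, threshold)
```

A frequent error query that scores above the global threshold would be misclassified by this rule, which is why a separate table of per-query thresholds is suggested in the text.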
Another phenomenon is that as the number of Chinese characters increases, the distribution values decrease. The relation between the number of characters and accuracy is given in Table
Accuracy for correct queries and error queries by number of Chinese characters.
Chinese character number | 2 | 3 | 4 | 5 |
---|---|---|---|---|
Accuracy of correct queries | 84.97% | 83.57% | 68.39% | 60.23% |
Accuracy of error queries | 89.25% | 81.55% | 85.57% | 95.00% |
Here the meaning for measure is shown in Table
Confusion matrix for the measure.
 | Judged as right | Judged as wrong |
---|---|---|
Correct query | | |
Error query | | |
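The sketch below shows how per-class accuracy could be derived from the four cells of such a table. It assumes that the accuracy of correct queries is the share of correct queries judged as right, and that the accuracy of error queries is the share of error queries judged as wrong. This reading, the label strings, and the toy data are assumptions.

```python
from collections import Counter

def confusion_counts(gold_labels, judged_labels):
    """Count (gold, judged) pairs; labels are 'correct' or 'error'."""
    return Counter(zip(gold_labels, judged_labels))

def per_class_accuracy(counts, label):
    """Fraction of queries with gold label `label` that were judged as `label`."""
    total = sum(n for (gold, _), n in counts.items() if gold == label)
    return counts[(label, label)] / total if total else 0.0

# Toy judgments (hypothetical): four correct queries, four error queries.
gold = ["correct"] * 4 + ["error"] * 4
judged = ["correct", "correct", "correct", "error",
          "error", "error", "error", "error"]
counts = confusion_counts(gold, judged)
accuracy_correct = per_class_accuracy(counts, "correct")
accuracy_error = per_class_accuracy(counts, "error")
```

Reporting the two classes separately, as the tables in this section do, makes it visible when a threshold favors one class at the expense of the other.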
From the experiments we can draw the following conclusion. The number of Chinese characters in a query has a great influence on the result. As the number increases, its effect on the thresholds decreases and the range of the thresholds gradually narrows, which makes it difficult to distinguish the correct words.
The results of the above groups of experiments are consistent with our expectations. However, as the number increases, the accuracy and the discriminative power of this feature drop. This needs further investigation.
Besides the number of Chinese characters, which can affect the discrimination between correct and error queries, the number of Chinese words is also an important feature. We analyze the correct queries and error queries separately, as shown in Table
Improved results for correct queries and error queries by number of words.
Word number | 2 | 3 | 4 | 5 |
---|---|---|---|---|
Correct queries | 95.95% | 85.92% | 78.16% | 71.59% |
Error queries | 86.02% | 79.13% | 84.08% | 84.17% |
When the number of Chinese words is added to the model as a feature, the results improve significantly. Table
In this paper, we propose an error checking model
Although this method achieves the anticipated effect, its performance decreases when a query contains more than 6 words. Future work will continue to improve the error checking method.
The authors declare that there is no conflict of interests regarding the publication of this paper.
The authors are grateful to the reviewers for reviewing this paper. This work is supported by the National Science Foundation of China (Grant no. 61103112), Social Science Foundation of Beijing (Grant no. 13SHC031), Ministry of Education of China (Project of Humanities and Social Sciences, Grant no. 13YJC740055), and Beijing Young Talent Plan (Grant no. CIT&TCD201404005).