Automatic Task Classification via Support Vector Machine and Crowdsourcing

Automatic task classification is a core part of personal assistant systems that are widely used in mobile devices such as smartphones and tablets. Even though many industry leaders provide their own personal assistant services, their proprietary internals and implementations are not well known to the public. In this work, we show through real implementation and evaluation that automatic task classification can be implemented for mobile devices by using the support vector machine algorithm and crowdsourcing. To train our task classifier, we collected our training data set via crowdsourcing using the Amazon Mechanical Turk platform. Our classifier can classify a short English sentence into one of thirty-two predefined tasks that are frequently requested while using personal mobile devices. Evaluation results show that the prediction accuracy of our classifier is high, ranging from 82% to 99%. By using a large amount of crowdsourced data, we also illustrate the relationship between training data size and the prediction accuracy of our task classifier.


Introduction
Artificial intelligence and machine learning have received much attention in our information technology era, and we are observing more and more of their applications in our daily lives. In particular, many industry leaders have developed and introduced top-notch applications based on artificial intelligence [1][2][3][4][5]. These applications include personalized content recommendations and personal assistant services [3][4][5][6][7].
Many advanced personal assistant services heavily depend on natural language understanding (NLU) for human-computer interactions [8][9][10]. There are also many systems that are based on touch-driven interactions [11,12]. Nowadays, many machines can interact with humans with a certain level of intelligence, and at their core are artificial intelligence algorithms and natural language processing.
There are many unsolved problems in natural language understanding, and the problem of automatically classifying a given natural language input into a suitable task or category is one of them. Many researchers and industry leaders have suggested various algorithms and approaches to tackle the problem [8,9,13-20]. These research and development activities later resulted in various personal assistant services such as Apple's Siri, Google's Google Now, and Amazon's Alexa.
However, the personal assistant services that are provided by the industry leaders are proprietary, and their internals and implementations are not well known to the public. As they have been continuously updated and improved over the past several years, we believe that their implementations are highly sophisticated and complicated combinations of many different algorithms and state-of-the-art technologies. Therefore, we asked ourselves the following question: "Is it possible to implement a personal assistant system that is simple enough to be built by applying a well-known machine learning algorithm and personally crowdsourced data?" By answering this question, we hope that our work motivates many researchers and small industries to build their own intelligent systems in their particular domains.
With this motivation in mind, in this paper, we introduce our own implementation of an automatic task classification system, which is based on a classical machine learning algorithm and crowdsourcing. Many different classification algorithms have been proposed and introduced to the artificial intelligence and machine learning community. To implement our task classification module, we used the support vector machine (SVM), a popular classification algorithm. In particular, we used the LibShortText library [21], which is an extension of Liblinear [22], a library implementing linear support vector machine algorithms.
Using our implementation, we show that the support vector machine algorithm can be successfully used for building personal assistant services, in particular, task classifiers for mobile devices. This task classifier can take a natural language text input and classify the input text into an implied task category among many predefined tasks. Therefore, it can understand humans' natural language commands and execute the intended task accordingly on behalf of the user.
Even though Apple, Google, and Amazon are not disclosing the internal architecture or algorithms that were used to implement their own personal assistant services [3], it is believed that they are making use of a large amount of data that they have collected from various sources to implement their systems. In order to train our classification module, we also collected our own training data, and we describe how we collected our data via crowdsourcing.
By using a large amount of collected training data, we investigate and present the relationship between the task classification accuracy of our classifier and the training data size. We verify that the more training data we use, the better the prediction accuracy we can get, but the rate of improvement drops.
This paper is organized as follows. Section 2 introduces a couple of commercial personal assistant systems and the support vector machine algorithm. Then, an open-source library implementing the support vector machine is briefly introduced. Our classifier uses the library to build a task classifier model. In Section 3, we describe our classifier and the library on which the classifier is built. Section 4 describes our crowdsourcing procedure for collecting training data that are used to train our classifier model. Section 5 shows that our classifier can classify short English texts into implied task categories. In particular, precision and recall values are presented for each task. The relationship between prediction accuracy and training data size is also investigated. In Section 6, we propose an overall architecture of a possible implementation of a personal assistant system, which is based on our task classifier. Finally, Section 7 concludes the paper.

Background and Prior Work
Before we propose our task classifier, we introduce some prior work on natural language processing and a couple of personal assistant systems. Then, a brief background on the support vector machine algorithm is introduced.

Natural Language Processing.
Natural language processing is a fairly large research area and has a long history in the computer science community. The following are a few prior works in the field.
In 1972, Winograd tried to implement a computer system that can interact with human beings in English [14]. Kuhn and De Mori tried a new data structure, the semantic classification tree, to implement a building block for robust matchers for NLU tasks [9]. Manning and Schütze describe statistical natural language processing in their book [8]. Yi et al. presented a sentiment analyzer that extracts sentiment about a subject using natural language processing techniques [15]. Collobert et al. proposed a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling [16].
Task or category classification has much prior work, too. In 1994, Cavnar and Trenkle proposed a text classification approach based on N-grams [17]. Yang and Pedersen performed a comparative study on feature selection methods in text categorization [18]. Text categorization via SVM was studied by Joachims in 1998 [19]. Yang and Liu performed a study on five different methods for text classification [23]. Sebastiani summarized many different approaches for text classification based on machine learning [24]. Pang et al. performed a study on sentiment analysis (positive or negative) by employing three different machine learning algorithms: naive Bayes, maximum entropy classification, and support vector machine [25]. Tong and Koller proposed support vector machine active learning for text classification applications [20]. Leopold and Kindermann showed that term-frequency transformations have a larger impact on the performance of SVM for text classification than the kernel functions [26]. Genkin et al. proposed a new approach based on logistic regression that can handle high-dimensional data such as natural language text [27]. Lan et al. proposed a new term weighting method for text classification [28].

Personal Assistant Systems.
One of the most innovative and well-known automatic task classification systems for natural language input sentences is Siri, a personal assistant system developed by Apple, Inc. [29]. Soon after Siri was introduced, Google also started providing its own similar service, called Google Now [30,31].
These two systems can understand humans' natural language commands, which means that they can classify a given natural language input into the command implied by the input text and perform the predicted task accordingly. For example, these systems understand a voice input command such as "Call John," and on behalf of the user, they automatically perform the user's intended task, which is a "Call" task.
However, Apple and Google have not disclosed how these systems are implemented. Therefore, we designed and implemented our own automatic task classifier based on a widely used classification algorithm, the support vector machine.

Mobile Information Systems

The Support Vector Machine.
There are many classification algorithms that are well known to the machine learning community. In particular, deep learning [32][33][34] is a very active topic these days. If designed and trained carefully, deep learning algorithms usually outperform (in terms of classification accuracy) most of the previously known classification algorithms such as the support vector machine, random forests, and naive Bayes in many domains. However, in order to train a competitive deep learning network, a very large amount of training data is usually required. Furthermore, deep neural networks are often regarded as black boxes because it is hard to understand how the networks classify test instances into correct categories through the deep network layers.
In contrast, the support vector machine is seen to be more interpretable than deep neural networks, and it has been used in many classification problems. Even though no single algorithm outperforms all other classification algorithms across all domains, the support vector machine has been widely used due to its relatively strong performance over many different areas [35]. Considering the goal of this work, we decided to use the support vector machine algorithm rather than more contemporary but complicated deep learning algorithms.
The support vector machine algorithm was conceptually invented in 1963 by Vapnik and Chervonenkis [36]. In 1992, Boser et al. proposed a way to create nonlinear classifiers via the kernel trick [37]. A couple of years later, Cortes and Vapnik introduced the concept of the soft margin [38]. Since its introduction, SVM has seen many applications such as handwritten character recognition.
There are many implementations of SVM with different optimization algorithms. Fan et al. implemented an open-source library for large-scale linear classification, named Liblinear [22]. Liblinear supports logistic regression and linear support vector machines. Another open-source library for short text classification and analysis, called LibShortText, was implemented [21]. LibShortText is an extension of Liblinear, and it can train a classification model with a given training data set consisting of short natural language texts with labels.
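For reference, the linear soft-margin support vector machine that libraries of this kind solve can be written as the optimization problem below. This is the standard two-class formulation with hinge loss, not a statement of the exact configuration used in this work; the symbols $w$ (weight vector), $C$ (penalty parameter), $l$ (number of training instances), $x_i$ (feature vector), and $y_i \in \{-1, +1\}$ (label) are notation introduced here for illustration. Multi-class problems such as ours are typically reduced to several such two-class problems (for example, one-versus-rest).

```latex
\min_{w} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \max\left(0,\; 1 - y_i \, w^{T} x_i\right)
```

A larger $C$ penalizes margin violations more heavily, while a smaller $C$ favors a wider margin at the cost of more training errors.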

Automatic Task Classifier
Our main idea is that a practical task classifier, a core part of personal assistant systems, can be implemented to reach sufficient accuracy by using a classical classification algorithm and basic natural language processing techniques. As we have briefly introduced in Section 2, there exists an open-source library for text classification based on the support vector machine. Therefore, we adopted this library to design our task classifier instead of reinventing the wheel.

Predefined Tasks.
We implemented our own automatic task classifier that can classify a given natural language input text into the most appropriate task among thirty-two predefined tasks. The thirty-two distinct tasks that we have used are shown in Table 1. We picked these tasks based on our observation that they span the tasks that users most frequently command their mobile devices, such as smartphones or tablets, to perform. However, these thirty-two tasks are not meant to be hardcoded; users can define any task list of their own interest.
For example, the thirty-two tasks contain the "Call" task, and our task classifier can classify an input text "Call John." In other words, our classifier will automatically recognize the user's intended task, which is the "Call" task, and the personal assistant system will command the mobile device to search its contacts list to find "John" and finally place a call to him. Even though we defined our own thirty-two tasks targeting mobile devices, different use cases can define their own task lists. For instance, navigation systems may have totally different tasks such as "Search Location" or "Cancel Navigation."

Training Task Classifier.
In order to implement our automatic task classifier, we exploit LibShortText [21], an open-source library implementing a short-text classifier. This library is very well implemented and provides the capability to change the parameters of the support vector machine algorithm or the natural language processing. As the authors of LibShortText claimed, our preliminary experimentation has shown that the default parameters of the library result in very good classification accuracies for our purpose, if not the best, so we mostly followed their recommendations as is.
In addition, in an attempt to enhance the accuracy further, we designed a preprocessing step that replaces some words with more general categories. For example, if the given sentence is "I want to have a sushi," then the word "sushi" is replaced with "categoryFood." Therefore, the final sentence in this case becomes "I want to have a categoryFood." We experimented with this idea of word replacement because replacing more specific words with more general terms may increase classification accuracy by decreasing the dimension of the input feature space (the space of N-grams). In order to implement the word-replacement preprocessing, we first created a dictionary that maps some words to more general categories. For example, "sushi" and "pizza" are mapped to the "categoryFood" category. After we generated the dictionary, we applied the preprocessing step to the training data set. In other words, all the training data were converted to sentences where specific words are replaced with the corresponding category titles. Of course, when the task classifier classifies a test input sentence, the same preprocessing step should be performed on the test sentence before the classification process kicks in.
To our disappointment, however, the experimental results were not promising; the classification accuracy with the preprocessing was not higher than the accuracy without that step. We believe there may be several reasons for this. First of all, the classification accuracy of the support vector machine is already very high without the preprocessing, which makes it very hard to increase the accuracy further. Second, the feature space dimension is not reduced enough to affect the classification accuracy.
After we confirmed the experimental results, we chose to stay with the recommended setting of LibShortText without the preprocessing. We also want to mention that the preprocessing takes computation time, which is another reason why we chose not to apply it.
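The word-replacement preprocessing that we tested can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our actual implementation: the function name `replace_with_categories` and the dictionary `CATEGORY_DICT` are hypothetical, and only the two "categoryFood" entries come from the example in the text.

```python
import re

# Hypothetical dictionary mapping specific words to general category tokens.
# Only the "sushi"/"pizza" -> "categoryFood" entries come from the text.
CATEGORY_DICT = {
    "sushi": "categoryFood",
    "pizza": "categoryFood",
}


def replace_with_categories(sentence, mapping=CATEGORY_DICT):
    """Replace every word found in `mapping` (case-insensitively) with its category."""
    def substitute(match):
        word = match.group(0)
        # Words that are not in the dictionary are returned unchanged.
        return mapping.get(word.lower(), word)

    # \w+ matches runs of letters, digits, and underscores; punctuation is untouched.
    return re.sub(r"\w+", substitute, sentence)
```

The same function would have to be applied to both the training sentences and every test sentence, so that the classifier sees category tokens consistently at training and prediction time.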
The LibShortText library requires a training data set to train a support vector machine, so we used a popular crowdsourcing platform to collect our training data set. In order to achieve high classification accuracy for general natural language text inputs, we need a large amount of data. The data collection process is described in the next section.

Data Collection via Crowdsourcing
This section describes how we collected our own training data set for our task classifier training.

Amazon Mechanical Turk.
Crowdsourcing has been a powerful way to obtain human intelligence services, ideas, or content by soliciting contributions from a large group of people, especially from online communities [39]. A well-known online survey platform, SurveyMonkey, is a good example of the many services that can be used for collecting data via crowdsourcing. Many people now use SurveyMonkey on a small scale for personal, academic, or industrial purposes.
Amazon.com, Inc. also provides a popular commercial crowdsourcing platform called Amazon Mechanical Turk (MTurk). MTurk provides an easy-to-use system for collecting large data sets via crowdsourcing.
There have been many research results about the quality of data collected by MTurk. Buhrmester et al. described and evaluated the potential contributions of MTurk to psychology and other social sciences [40]. Paolacci et al. addressed potential concerns about the quality of data collected through MTurk by presenting new demographic data about the Mechanical Turk subject population, reviewing the strengths of MTurk relative to other online and offline methods of recruiting subjects, and comparing the magnitude of effects obtained using Mechanical Turk and traditional subject pools [41]. Kittur et al. also performed a study on the validity of the MTurk platform [42]. Many of them indicate that MTurk is a good crowdsourcing platform for collecting high-quality data at low monetary cost.
We collected our own training data set using MTurk to train our task classification engine.MTurk enables crowdsourcing requesters to upload their questionnaires.Once uploaded, MTurk publishes the questionnaires to the MTurk open marketplace so that many MTurk workers can answer the uploaded questionnaires and get paid by the requester.

Our Data Collection Process.
In order to collect English sentences for commanding any of the predefined thirty-two tasks, we created a questionnaire that asks a worker to fill out his/her own example sentence for each task. For example, for the task of "Restaurant Search," one person can provide an example sentence such as "Find me a good Italian restaurant nearby," and another person can provide a different sentence such as "best Korean food in San Francisco." Once a worker finished filling out an example sentence for each of the thirty-two tasks, we compensated the worker for their contribution.
Through the aforementioned MTurk crowdsourcing process, we were able to collect 65,890 sentences for the thirty-two tasks. Each task has at least 2,000 sentences. All these sentences are human generated, so the data set has high quality and variety. For example, a worker provided an example sentence, "I am hungry," for the task of "Restaurant Search," whereas many other workers just provided names of various cuisines such as "Pizza" or "Sushi."

Prediction Accuracy
In order to train our classifier and evaluate the classification accuracy, we applied the tenfold cross-validation approach using the 65,890 sentences for the thirty-two tasks collected via crowdsourcing.While applying tenfold cross validation, we measured the precision and recall values for each task.

Precision and Recall.
While performing the tenfold cross validation, the precision and recall values for each task can be computed as follows for each fold.
Suppose that we randomly partition the whole collected data set $T$ into ten equal folds. We partitioned $T$ so that each fold has a roughly equal number of data points for each task. To compute the precision and recall values corresponding to a fold $P_i$, $i = 1, 2, \ldots, 10$, we form a training data set consisting of all the data except those belonging to $P_i$. In other words, if we call the training data set for the fold $P_i$ as $T_i$, we have
$$T_i = T \setminus P_i = \bigcup_{k \neq i} P_k.$$
By using $T_i$ as a training data set, we train a task classifier model $f_i$. We then measure the prediction accuracy of the model $f_i$ using the fold $P_i$ as a validation set.
For each pair of task $t_j$, $j = 1, 2, \ldots, 32$, and fold $P_i$, $i = 1, 2, \ldots, 10$, the precision and recall values are computed by the following equations:
$$\mathrm{precision}_{ij} = \frac{|\{d \in P_i : f_i(d) = t_j \text{ and } t(d) = t_j\}|}{|\{d \in P_i : f_i(d) = t_j\}|}, \qquad \mathrm{recall}_{ij} = \frac{|\{d \in P_i : f_i(d) = t_j \text{ and } t(d) = t_j\}|}{|\{d \in P_i : t(d) = t_j\}|},$$
where $t(d)$ and $f_i(d)$ are the original and the predicted task labels of the data sentence $d$, respectively.
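The per-fold procedure described above can be sketched as follows. This is an illustrative outline rather than our evaluation code: the helper names (`stratified_folds`, `training_set`, `precision_recall`) are hypothetical, the data are assumed to be (sentence, task) pairs, and `predict` stands for any trained classifier model, passed in as a function from a sentence to a task label.

```python
import random
from collections import defaultdict


def stratified_folds(data, k=10, seed=0):
    """Split (sentence, task) pairs into k folds with roughly equal task counts."""
    by_task = defaultdict(list)
    for item in data:
        by_task[item[1]].append(item)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for task in sorted(by_task):
        items = by_task[task]
        rng.shuffle(items)
        for idx, item in enumerate(items):
            folds[idx % k].append(item)  # deal each task's items round-robin
    return folds


def training_set(folds, i):
    """Training data for fold i: all data except those belonging to fold i."""
    return [item for j, fold in enumerate(folds) if j != i for item in fold]


def precision_recall(fold, predict, task):
    """Precision and recall of `predict` for one task on one validation fold."""
    true_pos = sum(1 for s, t in fold if predict(s) == task and t == task)
    predicted = sum(1 for s, t in fold if predict(s) == task)
    actual = sum(1 for s, t in fold if t == task)
    # By convention in this sketch, an undefined ratio is reported as 0.0.
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall
```

For each fold index, one would train a model on `training_set(folds, i)`, evaluate it on `folds[i]` for every task, and then aggregate the per-fold values.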

Measured Prediction Accuracy.
The tenfold cross-validation results are shown in Table 2. As shown in Table 2, the precision and recall values are very high, ranging from 82% up to 99%. Even though there are some misclassifications between similar tasks, the overall prediction accuracy is promising. Therefore, our task classifier can be employed to build practical personal assistant services.
We can also observe that only a few groups of similar tasks have relatively high misclassification ratios. For example, the three tasks of "Map," "Restaurant," and "Shopping" search were mostly confused with each other. The "Music" search and "Music player" app launch tasks were also mostly confused with each other. The "Apps" search and "Game" app launch tasks were mostly confused with each other, and the "Wikipedia" and "Information" search tasks were also confused with each other. Finally, the five chatting tasks were confused among themselves.
In order to increase the accuracy of our task classifier further, we may build a second layer of classification, which is applied to each group of similar tasks and refines the prediction to a more accurate task inside the group. The second layer need not use the support vector machine as the first layer does; it may use rule-based classifiers or any other algorithm.
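As a rough illustration of this two-layer idea, the sketch below routes the first-layer prediction to a group-specific second-layer classifier only when the predicted task belongs to a group of easily confused tasks. Everything in it is hypothetical: the group table, the keyword rule, and the function names are assumptions made for illustration, and no such second layer was implemented or evaluated in this work.

```python
# Hypothetical mapping from a task to its group of easily confused tasks.
CONFUSABLE_GROUPS = {
    "Music": "music",
    "Music player": "music",
}


def second_layer_music(sentence):
    """Toy rule: launch-like verbs suggest opening the player app."""
    words = set(sentence.lower().split())
    return "Music player" if words & {"open", "launch", "start"} else "Music"


# One second-layer classifier per group; each may be rule-based or learned.
SECOND_LAYER = {"music": second_layer_music}


def classify_two_layer(sentence, first_layer):
    """Apply the first-layer classifier, then refine inside a confusable group."""
    task = first_layer(sentence)
    group = CONFUSABLE_GROUPS.get(task)
    if group is None:
        return task  # the task is not in any confusable group
    return SECOND_LAYER[group](sentence)
```

Because the second layer only sees sentences already narrowed down to a small group, it can rely on cues that would be too weak to separate all thirty-two tasks at once.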

Accuracy and Training Data Size.
We performed an experiment to investigate the relation between training data size and classification accuracy. It is expected that the more data we use for training, the higher the accuracy we can get. We wanted to confirm this expectation and to get the precise relation between the classifier performance and the training data size. For this experiment, we randomly chose 20% of all the collected data points as the test set. Then, we used the remaining 80% of the data set as a training data pool, so that we can sample a certain amount of data points randomly from the pool.
More specifically, suppose that $S$ is the randomly sampled test set whose size is 20% of the collected data set $T$. Then, the remaining 80% of the collected data is used as the training data pool $P$, from which we sample different sizes of training data. Therefore, we have
$$T = S \cup P, \qquad S \cap P = \emptyset.$$
In particular, we tested with ten different sizes of training data: 10%, 20%, 30%, ..., 100% of the training data pool $P$. For each training data size $(i/10)|P|$, $i = 1, 2, \ldots, 10$, we repeated random sampling from the pool $P$ ten times to train ten different classifier models with the same training data size. In other words, for each $i$, we train ten different task classifiers $f_{ij}$, $j = 1, 2, \ldots, 10$, by sampling training data of the same size from the pool $P$ ten times.
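Then, using each classifier $f_{ij}$, we measured the classification accuracy $\mathrm{acc}_{ij}$ as
$$\mathrm{acc}_{ij} = \frac{|\{d \in S : f_{ij}(d) = t(d)\}|}{|S|},$$
where $f_{ij}(d)$ and $t(d)$ are the predicted and original task of the input text $d$, respectively. Therefore, we have ten different accuracy values for each training data size. The results of the experiment are shown in Figure 1, where a boxplot is generated from the ten accuracy values for each training data size. The figure shows that the classifier performance increases as the training data size increases, but the rate of increase drops. This result is reasonable and verifies our original hypothesis.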
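The sampling experiment above can be outlined in code as follows. This is a sketch under assumptions, not the script used for our experiment: `learning_curve` and `accuracy` are hypothetical names, `train` is any function that maps a list of (sentence, task) pairs to a prediction function, and `train_majority` is a deliberately trivial placeholder trainer that only makes the sketch runnable.

```python
import random
from collections import Counter


def accuracy(predict, test_set):
    """Fraction of test sentences whose predicted task equals the original task."""
    correct = sum(1 for sentence, task in test_set if predict(sentence) == task)
    return correct / len(test_set)


def learning_curve(data, train, repeats=10, steps=10, seed=0):
    """Return {i: list of accuracies} for training sizes (i/steps) of the pool."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 5  # 20% of the data is held out as the test set
    test_set, pool = shuffled[:n_test], shuffled[n_test:]
    results = {}
    for i in range(1, steps + 1):
        size = len(pool) * i // steps
        # Sample the same size `repeats` times and train one model per sample.
        results[i] = [accuracy(train(rng.sample(pool, size)), test_set)
                      for _ in range(repeats)]
    return results


def train_majority(training_data):
    """Placeholder trainer: always predicts the most frequent training task."""
    most_common = Counter(task for _, task in training_data).most_common(1)[0][0]
    return lambda sentence: most_common
```

The ten accuracy values stored for each `i` correspond to one box in a boxplot of accuracy against training data size.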

Personal Assistant Systems
So far, we have described our proposed task classifier that is based on the support vector machine algorithm. The proposed task classifier may be used to build a personal assistant system. Figure 2 shows the overall architecture of a possible implementation of a personal assistant system. A user's voice command is first transmitted to a server, which is running the speech-to-text converter and the task classifier.
For the speech-to-text converter, any suitable converter may be used; for example, we may use the Sphinx speech recognition system [43][44][45].
The received speech at the server is converted to text, and the converted text is given to the task classifier, a core part of the overall system. The task classifier classifies the given text into the most probable task, and the predicted task is transmitted back to the user's mobile device. Then, the task launcher of the personal assistant system performs the predicted task accordingly on behalf of the user.
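The flow through this architecture can be sketched as a single function in which each component is passed in as a callable. This is only a schematic under assumptions: `handle_command` and its parameter names are hypothetical, the network transfer between the mobile device and the server is omitted, and the speech-to-text converter, the classifier, and the launchers are stand-ins to be supplied by an actual system.

```python
from typing import Callable, Dict


def handle_command(audio: bytes,
                   speech_to_text: Callable[[bytes], str],
                   classify: Callable[[str], str],
                   launchers: Dict[str, Callable[[str], str]]) -> str:
    """Convert speech to text, classify the text, and launch the predicted task."""
    text = speech_to_text(audio)    # server side: speech-to-text converter
    task = classify(text)           # server side: task classifier
    launcher = launchers.get(task)  # device side: task launcher lookup
    if launcher is None:
        return f"No launcher registered for task '{task}'"
    return launcher(text)
```

Keeping the three components behind plain function interfaces means that the converter or the classifier can be replaced without touching the task launchers on the device.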

Since the proposed task classifier is built upon the linear support vector machine, the computation time for classifying an input typically ranged from a few milliseconds to a few tens of milliseconds. Of course, more powerful servers can reduce the computation time further.

Conclusion
We presented a method to implement personal assistant services that can understand humans' natural language commands. Even though there already exist such services including Siri, Google Now, and Alexa, their internal technologies have not been disclosed and are not well known to the public. Therefore, we investigated whether it is possible to build a task classifier, a core part of personal assistant services, using a well-known machine learning algorithm. Our implementation is based on the support vector machine, a classification algorithm that is widely used in many domains.
To train our support vector machine with sufficient data, we collected our own training data set by using a popular crowdsourcing platform, Amazon Mechanical Turk. We predefined thirty-two tasks that are frequently commanded while using mobile devices and collected natural language sentences for each task by using the crowdsourcing platform. Through this process, we were able to collect 65,890 natural language sentences in total.
We tested our task classifier performance with the tenfold cross-validation approach. The evaluation results show that the precision and recall values of our classifier are very high. This result indicates that our simple approach can be employed to implement practical personal assistant services.
The relationship between training data size and classifier performance was also investigated. We confirmed that the performance improves as we use more training data, but the rate of improvement drops as we increase the training data size. All of these observations are reasonable, and we hope that our work can motivate and serve as a reference for researchers and industries who are trying to build their own personal assistant systems.


Figure 1: Prediction accuracy increases as we use more training data, but the increase rate drops.

Figure 2: Overall architecture of a possible implementation of a personal assistant system.

Table 1: The predefined thirty-two tasks for mobile devices.

Table 2: The precision and recall values of each task as well as the top 3 misclassifications.