Automatic task classification is a core part of personal assistant systems that are widely used in mobile devices such as smartphones and tablets. Even though many industry leaders provide their own personal assistant services, their proprietary internals and implementations are not well known to the public. In this work, we show through a real implementation and evaluation that automatic task classification for mobile devices can be implemented by using the support vector machine algorithm and crowdsourcing. To train our task classifier, we collected our training data set via crowdsourcing on the Amazon Mechanical Turk platform. Our classifier can classify a short English sentence into one of thirty-two predefined tasks that are frequently requested while using personal mobile devices. Evaluation results show high prediction accuracy, ranging from 82% to 99%. By using a large amount of crowdsourced data, we also illustrate the relationship between training data size and the prediction accuracy of our task classifier.
Artificial intelligence and machine learning have received much attention in the information technology era, and we are seeing more applications of them in our daily lives than ever before. In particular, many industry leaders have developed and introduced top-notch applications based on artificial intelligence [
Many advanced personal assistant services heavily depend on natural language understanding (NLU) for human-computer interactions [
There are many unsolved problems of natural language understanding, and the problem of automatically classifying a given natural language input into a suitable task or category is one of them. Many researchers and industry leaders have suggested various algorithms and approaches to tackle the problem [
However, the personal assistant services that are provided by the industry leaders are proprietary, and their internals and implementations are not well known to the public. As they have been continuously updated and improved over the past several years, we believe that their implementations are highly sophisticated and complicated combinations of many different algorithms and the state-of-the-art technologies. Therefore, we asked ourselves the following question: “Is it possible to implement a personal assistant system that is simple enough to be built by applying a well-known machine learning algorithm and personally crowdsourced data?” By answering this question, we hope that our work motivates many researchers and small industries to build their own intelligent systems in their particular domains.
With this motivation in mind, in this paper we introduce our own implementation of an automatic task classification system, which is based on a classical machine learning algorithm and crowdsourcing. Many different classification algorithms have been proposed in the artificial intelligence and machine learning community. To implement our task classification module, we used the support vector machine (SVM), a popular classification algorithm. In particular, we used the LibShortText library [
Using our implementation, we show that the support vector machine algorithm can be successfully used for building personal assistant services, in particular, task classifiers for mobile devices. This task classifier can take a natural language text input and classify the input text into an implied task category among many predefined tasks. Therefore, it can understand humans’ natural language command and execute the intended task accordingly on behalf of the user.
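Although our system is built on LibShortText, the overall pipeline can be illustrated with a small self-contained sketch: tokenize each sentence into unigram and bigram features and train a linear multiclass model over them. The sketch below is purely illustrative, not the authors' code: it substitutes a simple perceptron update for the SVM solver, and the tiny training set stands in for the crowdsourced data.

```python
from collections import defaultdict

def featurize(sentence):
    """Unigram + bigram bag-of-words features."""
    words = sentence.lower().split()
    feats = defaultdict(int)
    for w in words:
        feats[w] += 1
    for a, b in zip(words, words[1:]):
        feats[(a, b)] += 1
    return feats

class LinearTaskClassifier:
    """One linear weight vector per task. Trained here with a perceptron
    update as a stand-in for the SVM optimizer used by LibShortText."""
    def __init__(self, tasks):
        self.weights = {t: defaultdict(float) for t in tasks}

    def score(self, task, feats):
        w = self.weights[task]
        return sum(w[f] * v for f, v in feats.items())

    def predict(self, sentence):
        feats = featurize(sentence)
        return max(self.weights, key=lambda t: self.score(t, feats))

    def train(self, data, epochs=10):
        for _ in range(epochs):
            for sentence, task in data:
                feats = featurize(sentence)
                guess = self.predict(sentence)
                if guess != task:  # update weights only on mistakes
                    for f, v in feats.items():
                        self.weights[task][f] += v
                        self.weights[guess][f] -= v

# Hypothetical training sentences in the spirit of the crowdsourced data.
data = [
    ("call john", "Call"), ("please call my mom", "Call"),
    ("find me a good italian restaurant", "Restaurant"),
    ("best korean food in san francisco", "Restaurant"),
    ("set an alarm for 7 am", "Alarm"), ("wake me up at six", "Alarm"),
]
clf = LinearTaskClassifier(["Call", "Restaurant", "Alarm"])
clf.train(data)
```

On this toy data the classifier separates the three tasks after a couple of epochs; the real system trains an SVM over tens of thousands of sentences and thirty-two tasks.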
Even though Apple, Google, and Amazon are not disclosing the internal architecture or algorithms that were used to implement their own personal assistant services [
By using a large amount of collected training data, we investigate and present the relationship between the task classification accuracy of our classifier and the training data size. We verify that the more training data we use, the better the prediction accuracy we obtain, although the rate of improvement diminishes.
This paper is organized as follows. Section
Before we propose our task classifier, we introduce some prior work on natural language processing and a couple of personal assistant systems. Then, a brief background on the support vector machine algorithm is introduced.
Natural language processing is a fairly large research area and has a long history in the computer science community. The following are a few examples of prior work in the field.
In 1972, Winograd tried to implement a computer system that can interact with human beings in English [
There is also much prior work on task or category classification. In 1994, Cavnar and Trenkle proposed a text classification approach based on N-gram [
One of the innovative and well-known automatic task classification systems for a natural language input sentence would be Siri, a personal assistant system developed by Apple, Inc. [
These two systems can understand humans’ natural language commands; that is, they can classify a given natural language input into the command implied by the input text and perform the predicted task accordingly. For example, these systems understand a voice command such as “Call John,” and, on behalf of the user, they automatically perform the user’s intended task, which is a “Call” task.
However, Apple and Google have not disclosed how these systems are implemented. Therefore, we designed and implemented our own automatic task classifier based on a widely used classification algorithm, the support vector machine.
There are many classification algorithms that are well known to the machine learning community. In particular, Deep Learning [
In contrast, the support vector machine is generally more interpretable than deep neural networks and has long been used for classification problems. Even though no single algorithm outperforms every other classification algorithm across all domains, the support vector machine has been widely used due to its strong performance in many different areas [
The support vector machine algorithm was conceptually introduced in 1963 by Vapnik and Chervonenkis [
There are many implementations of SVM with different optimization algorithms. Fan et al. implemented an open-source library for large-scale linear classification named Liblinear [
Our main idea is that a practical task classifier, a core part of personal assistant systems, can be implemented to reach a sufficient accuracy by using a classical classification algorithm and basic natural language processing techniques. As we have briefly introduced in Section
We implemented our own automatic task classifier that can classify a given natural language input text to the most appropriate task among the thirty-two predefined tasks. The thirty-two distinct tasks that we have used are shown in Table
The predefined thirty-two tasks for mobile devices.
Search | Search | App | Chatting
---|---|---|---
Transportation | Map | Call | Greeting
Travel | Music | | Praise
Movie | Photo | Camera | Dispraise
Book | News | Check schedule | Boredom
Game | Apps | Schedule | Love
Restaurant | Recipe | Memo | —
Shopping | Weather | Timer | —
Hospital | — | Alarm | —
Wikipedia | — | Music player | —
Information | — | SNS | —
For example, the thirty-two tasks contain the “Call” task, and our task classifier can classify an input text such as “Call John.” In other words, our classifier automatically recognizes the user’s intended task, the “Call” task, and the personal assistant system then commands the mobile device to search its contact list for “John” and place a call to him. Even though we defined our own thirty-two tasks targeting mobile devices, other use cases can define their own task lists. For instance, navigation systems may have entirely different tasks such as “Search Location” or “Cancel Navigation.”
In order to implement our automatic task classifier, we exploit LibShortText [
In addition, in an attempt to further enhance the accuracy, we designed a preprocessing step, which replaces certain words with more general category tokens. For example, if the given sentence is “I want to have a sushi,” the word “sushi” is replaced with “categoryFood,” so the final sentence becomes “I want to have a categoryFood.”
We experimented with the idea of word replacement because replacing more specific words with more general terms may increase classification accuracies by decreasing the dimension of the input feature space (the space of N-grams). In order to implement the preprocessing of word replacement, we first created a dictionary, which maps some words to more general categories. For example, “sushi” and “pizza” are mapped to the “categoryFood” category. After we generated the dictionary, we applied the preprocessing step to the training data set. In other words, all the training data sentences were transformed to sentences where specific words are replaced with corresponding category titles. Of course, when the task classifier classifies a test input sentence, the same preprocessing step should be performed on the test sentence before the classification process kicks in.
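A minimal sketch of this word-replacement preprocessing, using a hypothetical three-entry dictionary (the dictionary we actually built is larger), is shown below. The same function must be applied to both training and test sentences.

```python
# Hypothetical category dictionary; entries are assumptions for illustration.
CATEGORY_DICT = {
    "sushi": "categoryFood",
    "pizza": "categoryFood",
    "john": "categoryName",
}

def preprocess(sentence):
    """Replace specific words with their category token before featurization."""
    words = []
    for w in sentence.split():
        # Strip trailing punctuation so "sushi," still matches (a simplification).
        key = w.lower().strip(".,!?")
        words.append(CATEGORY_DICT.get(key, w))
    return " ".join(words)

print(preprocess("I want to have a sushi"))  # I want to have a categoryFood
```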
To our disappointment, however, the experimental results were not promising; the classification accuracy with the preprocessing was no higher than without it. We believe there are several possible reasons. First, the classification accuracy of the support vector machine is already very high without the preprocessing, which makes it hard to increase the accuracy further. Second, the reduction in the feature space dimension was not large enough to affect the classification accuracy.
After we confirmed these experimental results, we chose to stay with the recommended setting of LibShortText without the preprocessing. We also note that the preprocessing takes additional computation time, which is another reason we chose not to apply it.
The LibShortText library requires a training data set to train a support vector machine, so we used a popular crowdsourcing platform to collect our training data set. In order to achieve high classification accuracy for general natural language text inputs, we need a large amount of data. The data collection process is described in the next section.
This section describes how we collected our own training data set for our task classifier training.
Crowdsourcing has been a powerful way to obtain needed services, ideas, or content by soliciting contributions from a large group of people, especially from online communities [
Amazon.com, Inc. also provides a popular commercial crowdsourcing platform called Amazon Mechanical Turk (MTurk). MTurk provides an easy-to-use system for collecting large data sets via crowdsourcing. There have been many studies on the quality of data collected via MTurk. Buhrmester et al. described and evaluated the potential contributions of MTurk to psychology and other social sciences [
We collected our own training data set using MTurk to train our task classification engine. MTurk enables crowdsourcing requesters to upload their questionnaires. Once uploaded, MTurk publishes the questionnaires to the MTurk open marketplace so that many MTurk workers can answer the uploaded questionnaires and get paid by the requester.
In order to collect English sentences commanding each of the predefined thirty-two tasks, we created a questionnaire that asks a worker to fill out his/her own example sentence for each task. For example, for the task of “Restaurant Search,” one person can provide an example sentence such as “Find me a good Italian restaurant nearby,” and another person can provide a different sentence such as “best Korean food in San Francisco.” Once a worker finished filling out an example sentence for each of the thirty-two tasks, we compensated the worker for the contribution.
Through the aforementioned MTurk crowdsourcing process, we were able to collect 65,890 sentences for the thirty-two tasks. Each task has at least 2,000 sentences. All these sentences are human generated, so the data set has high quality and variety. For example, a worker provided an example sentence, “I am hungry” for the task of “Restaurant Search,” whereas many other workers just provided names of various cuisines such as “Pizza” or “Sushi.”
In order to train our classifier and evaluate the classification accuracy, we applied the tenfold cross-validation approach using the 65,890 sentences for the thirty-two tasks collected via crowdsourcing. While applying tenfold cross validation, we measured the precision and recall values for each task.
During the tenfold cross-validation, the precision and recall values for each task are computed as follows.
Suppose that we randomly partition all the collected data set into ten equal-sized folds. For each fold, we train the classifier on the other nine folds and classify every sentence in the held-out fold, accumulating confusion counts C(i, j): the number of sentences whose true task is i but that are classified as task j.

By using these counts, the precision and recall of each task i are computed as precision(i) = C(i, i) / Σ_k C(k, i) and recall(i) = C(i, i) / Σ_k C(i, k); that is, precision is the fraction of sentences classified as task i that truly belong to task i, and recall is the fraction of sentences of task i that are classified correctly.

For each pair of tasks i and j with i ≠ j, we also compute the misclassification ratio C(i, j) / Σ_k C(i, k), which we use to report the top three misclassifications of each task.
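The per-task precision and recall bookkeeping can be sketched as follows, accumulating (true task, predicted task) pairs over all test folds; the toy pairs below are illustrative, not our actual data.

```python
from collections import Counter

def per_task_precision_recall(pairs):
    """pairs: iterable of (true_task, predicted_task) over all test folds.
    Precision of task t = correct predictions of t / all predictions of t;
    recall of task t = correct predictions of t / all sentences labeled t."""
    confusion = Counter(pairs)  # (true, pred) -> count
    tasks = {t for tp in confusion for t in tp}
    result = {}
    for t in tasks:
        tp = confusion[(t, t)]
        pred_t = sum(c for (true, pred), c in confusion.items() if pred == t)
        true_t = sum(c for (true, pred), c in confusion.items() if true == t)
        precision = tp / pred_t if pred_t else 0.0
        recall = tp / true_t if true_t else 0.0
        result[t] = (precision, recall)
    return result

# Toy example: 3 "Call" sentences, 2 "Map" sentences.
pairs = [("Call", "Call"), ("Call", "Call"), ("Call", "Map"),
         ("Map", "Map"), ("Map", "Map")]
scores = per_task_precision_recall(pairs)
# "Call": precision 2/2, recall 2/3; "Map": precision 2/3, recall 2/2.
```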
The tenfold cross-validation results are shown in Table
The precision and recall values of each task as well as the top 3 misclassifications.
Task name | Precision (%) | Recall (%) | Top 3 misclassifications
---|---|---|---
Transportation | 91.48 | 94.33 | Travel (1.61%), Map (1.46%), Restaurant (0.76%)
Map | 83.87 | 82.76 | Restaurant (5.53%), Shopping (4.49%), Transportation (3.45%)
Travel | 93.64 | 94.44 | Transportation (0.85%), Wikipedia (0.81%), Restaurant (0.62%)
Music | 88.98 | 86.25 | Music player (6.33%), Wikipedia (3.07%), Book (1.56%)
Movie | 89.74 | 93.10 | Book (1.28%), Information (1.18%), Music (1.09%)
Photo | 95.86 | 95.13 | Information (0.99%), Camera (0.90%), Wikipedia (0.61%)
Book | 91.21 | 91.51 | Music (1.52%), Wikipedia (1.47%), Shopping (1.09%)
News | 93.08 | 94.14 | Wikipedia (1.37%), Information (0.85%), Book (0.61%)
Game | 88.23 | 89.07 | Apps (6.48%), Music player (0.71%), Movie (0.47%)
Apps | 90.02 | 88.92 | Game (6.92%), Map (0.47%), Movie (0.43%)
Restaurant | 86.54 | 88.35 | Map (4.31%), Shopping (2.84%), Transportation (1.14%)
Recipe | 97.14 | 98.15 | Restaurant (0.43%), Information (0.24%), Map (0.14%)
Shopping | 86.25 | 88.75 | Map (3.99%), Restaurant (2.71%), Hospital (1.04%)
Weather | 97.90 | 99.20 | News (0.14%), Information (0.14%), Check schedule (0.14%)
Hospital | 95.55 | 94.92 | Shopping (1.28%), Map (1.05%), Restaurant (0.81%)
Wikipedia | 83.05 | 78.63 | Information (7.98%), News (3.32%), Music (1.85%)
Information | 82.51 | 83.88 | Wikipedia (5.61%), Music (1.24%), Book (1.14%)
Call | 98.57 | 98.34 | Greeting (0.48%), Love (0.24%), Movie (0.14%)
 | 99.38 | 99.19 | Call (0.14%), Dispraise (0.14%), Memo (0.10%)
Camera | 98.16 | 98.72 | Photo (0.24%), Apps (0.19%), Movie (0.14%)
Check schedule | 90.60 | 92.97 | Schedule (4.66%), Boredom (0.48%), Memo (0.38%)
Schedule | 93.43 | 91.35 | Check schedule (6.51%), Memo (0.90%), Timer (0.14%)
Memo | 97.57 | 97.29 | Schedule (0.62%), Timer (0.29%), (0.24%)
Timer | 97.59 | 98.24 | Alarm (0.76%), Memo (0.24%), Information (0.14%)
Alarm | 98.38 | 98.38 | Timer (0.90%), Greeting (0.24%), Movie (0.14%)
Music player | 91.23 | 96.28 | Music (2.62%), Timer (0.14%), Love (0.14%)
SNS | 98.69 | 96.36 | News (1.09%), Movie (0.57%), Wikipedia (0.43%)
Greeting | 93.87 | 92.14 | Dispraise (1.51%), Check schedule (1.11%), Praise (1.00%)
Praise | 89.12 | 89.52 | Dispraise (2.79%), Love (1.95%), Restaurant (1.06%)
Dispraise | 84.00 | 83.48 | Boredom (3.29%), Praise (3.07%), Shopping (1.79%)
Boredom | 89.76 | 84.14 | Dispraise (4.80%), Game (2.79%), Greeting (1.45%)
Love | 92.40 | 88.27 | Praise (4.41%), Dispraise (2.12%), Boredom (1.68%)
Accuracy of our classifier (in terms of precision and recall) is promisingly high, with precision ranging from 82.51% to 99.38% and recall from 78.63% to 99.20% across the thirty-two tasks.
We can also observe that only a few groups of similar tasks have relatively high misclassification ratios. For example, the three tasks of “Map,” “Restaurant,” and “Shopping” search were mostly confused with each other. The “Music” search and “Music player” app launch tasks were also mostly confused with each other. The “Apps” search and “Game” app launch tasks were mostly confused with each other, and the “Wikipedia” and “Information” search tasks were also confused with each other. Finally, the five chatting tasks were confused among themselves.
In order to further increase the accuracy of our task classifier, we could add a second classification layer that is applied within each group of similar tasks to select the correct task inside the group. This second layer need not use the support vector machine like the first layer; it may use rule-based classifiers or any other algorithm.
We performed an experiment to investigate the relation between training data size and classification accuracy. It is expected that the more data we use for training, the higher the accuracy we can achieve. We wanted to confirm this expectation and to quantify the relation between classifier performance and training data size. For this experiment, we randomly chose
More specifically, suppose that
In particular, we tested with ten different sizes of training data:
The results of the experiment are shown in Figure
Prediction accuracy increases as we use more training data, but the increase rate drops.
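The protocol of this experiment can be sketched as follows: train on nested random subsets of increasing size and evaluate each model on a fixed held-out test set. The toy word-overlap classifier and the sentences below are hypothetical stand-ins for the SVM and the crowdsourced data.

```python
import random

def learning_curve(pool, test_set, sizes, train_fn, accuracy_fn, seed=0):
    """For each training size n, train on the first n sentences of one fixed
    shuffle of the pool (so each larger training set contains the smaller one)
    and measure accuracy on the held-out test set."""
    rng = random.Random(seed)
    shuffled = pool[:]
    rng.shuffle(shuffled)
    results = []
    for n in sizes:
        model = train_fn(shuffled[:n])
        results.append((n, accuracy_fn(model, test_set)))
    return results

def train_vocab(data):
    """Toy stand-in for SVM training: remember each task's word set."""
    vocab = {}
    for sentence, task in data:
        vocab.setdefault(task, set()).update(sentence.lower().split())
    return vocab

def vocab_accuracy(vocab, test):
    def predict(s):
        words = set(s.lower().split())
        return max(vocab, key=lambda t: len(vocab[t] & words))
    return sum(predict(s) == t for s, t in test) / len(test)

# Hypothetical pool and held-out test set.
pool = [("call john", "Call"), ("call my mom", "Call"),
        ("set an alarm", "Alarm"), ("alarm at seven", "Alarm"),
        ("call the office", "Call"), ("wake me with an alarm", "Alarm")]
test_set = [("call dad", "Call"), ("alarm for six", "Alarm")]
results = learning_curve(pool, test_set, [2, 6], train_vocab, vocab_accuracy)
```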
So far, we have described our proposed task classifier that is based on the support vector machine algorithm. The proposed task classifier may be used to build a personal assistant system. Figure
A proposed architecture of a personal assistant system.
For the speech-to-text converter, any suitable converter may be used; for example, we may use the Sphinx speech recognition system [
The received speech at the server is converted to text, and the converted text is given to the task classifier, a core part of the overall system. The task classifier classifies the given text into the most probable task, and the predicted task is transmitted back to the user’s mobile device. Then, the task launcher of the personal assistant system performs the predicted task accordingly on behalf of the user.
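The device-side flow described above (receive the predicted task from the server, then dispatch it to a task launcher) can be sketched as follows; the handler functions and the stub classifier are hypothetical names for illustration, not part of our implementation.

```python
# Hypothetical task handlers: one per predicted task label.
def launch_call(text):
    # A real system would look up the contact in the input and place the call.
    return "calling contact mentioned in: " + text

def launch_alarm(text):
    return "setting alarm from: " + text

TASK_HANDLERS = {
    "Call": launch_call,
    "Alarm": launch_alarm,
}

def handle_command(text, classify):
    """End-to-end flow on the device: classify the (speech-to-text) input,
    then run the launcher for the predicted task."""
    task = classify(text)
    handler = TASK_HANDLERS.get(task)
    if handler is None:
        return "unsupported task: " + task
    return handler(text)

# Stub classifier standing in for the SVM task classifier on the server.
result = handle_command("call john", lambda text: "Call")
```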
Since the proposed task classifier is built upon a linear support vector machine, the classification computation typically takes from a few milliseconds to a few tens of milliseconds. With more powerful servers, the computation time can be reduced further.
We presented a method to implement personal assistant services that can understand humans’ natural language commands. Even though such services already exist, including Siri, Google Now, and Alexa, their internal technologies have not been disclosed and are not well known to the public. Therefore, we investigated whether it is possible to build a task classifier, a core part of personal assistant services, using a well-known machine learning algorithm. Our implementation is based on the support vector machine, a classification algorithm widely used in many domains.
To train our support vector machine with sufficient data, we collected our own training data set by using a popular crowdsourcing platform, the Amazon Mechanical Turk. We predefined thirty-two tasks that are frequently commanded while using mobile devices and collected natural language sentences for each task by using the crowdsourcing platform. Through this process, we were able to collect 65,890 natural language sentences in total.
We tested our task classifier performance with the tenfold cross-validation approach. The evaluation results show that the precision and recall values of our classifier are very high. This result indicates that our simple approach can be employed to implement practical personal assistant services.
The relationship between training data size and classifier performance was also investigated. We confirmed that performance improves as we use more training data, but that the rate of improvement drops as the training data size grows. All of these observations are reasonable, and we hope that our work can motivate and serve as a reference for researchers and companies trying to build their own personal assistant systems.
The authors declare that there are no conflicts of interest regarding the publication of this article.
This work was supported by 2017 Hongik University Research Fund, in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science, ICT & Future Planning (MSIP)) (no. 2017R1C1B5015901) and also in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1D1A1B03031348).