A Bibliometric Review of Natural Language Processing Empowered Mobile Computing

1 College of Economics, Jinan University, Guangzhou, China
2 School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China
3 School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
4 Department of Chinese Language and Literature, University of Macau, Macau SAR, China
5 School of Computer, South China Normal University, Guangzhou, China
6 Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, China


Introduction
With the development of mobile devices and advances in wireless communication technologies, mobile computing is becoming an increasingly important paradigm in today's world of networked computing systems [1]. Mobile computing enables a computer to be used normally while in motion. Based on perceived situational information in personal and ubiquitous environments, mobile computing provides services automatically. With the rapid growth in the use of mobile devices, far-reaching and diverse information is being produced rapidly and distributed instantly in digitized format [2]. A large amount of valuable information exists in unstructured texts, such as web pages, short messages, and Twitter/WeChat messages, and is in great need of processing. Natural Language Processing (NLP) focuses on the interactions between computers and natural language texts. NLP is capable of providing a computer program with the ability to process and understand unstructured texts. By automatically analyzing the meaning of user content to take appropriate actions, NLP can make applications smarter in the mobile environment.
The NLP empowered mobile computing research field has attracted growing interest from the scientific community, with the number of publications indexed in Web of Science (WoS) rising from 12 in 2000 to 55 in 2016. Some representative examples are as follows. Chen et al. [3] applied the technique of multitask learning using deep neural networks to Mandarin-English code-mixing recognition. Three schemes of auxiliary tasks were proposed to introduce the language information to networks and to improve the prediction of language switching for the primary task of senone classification. The proposed schemes enhanced the recognition on both languages and reduced the relative overall error rates by 4.4% on average when dealing with a real-world Mandarin-English corpus in mobile voice search. Ilayaraja et al. [4] presented a weighted association rule mining prefetching technique to determine the secondary service item, considering the access frequency of services, the semantic distance among successive query requests, and the spatial distance between service instances and user context (e.g., position, service type, and query request time). Wong et al. [5] analyzed students' vocabulary usage with a corpus analysis tool to identify and unpack the contextual conditions in which a mobile- and cloud-assisted Chinese language learning environment promoted key learning outcomes. Räsänen and Saarinen [6] proposed a method based on sparse hyperdimensional coding of sequence structures for sequence prediction. Their experiments suggested that the method was capable of capturing the relevant variable-order structure from the sequences. An NLP based tool, MOTTE, was developed by Puppala et al. [7] for automatically extracting and structuring data in pathology reports to support clinical solution applications. With the aim of screening information on human immunodeficiency virus/acquired immune deficiency syndrome, Adesina et al. [8] designed a monolingual short message service based system for the retrieval of frequently asked questions.
Bibliometric analysis is defined as the use of statistical methods for evaluating scholarly publications within a certain field from an objective and quantitative perspective [9]. Benefits of bibliometric analysis include (1) organizing information in a specific thematic field [10], (2) evaluating scientific developments in knowledge of a specific subject and assessing the scientific quality [11], (3) determining the impact of research funding, (4) comparing research performance across different affiliations and documenting changes in the research workforce, and (5) identifying emerging areas of research focus and predicting future research success [12]. For researchers, especially newcomers, bibliometric analysis can assist in (1) better selecting potential research topics, (2) demonstrating the value and impact of their relevant works, (3) recognizing appropriate academic researchers with whom to seek research collaboration, and (4) keeping abreast of new research status and new technological changes [13].
Bibliometric analysis has been widely applied in various fields to measure the quality and productivity of academic output and has proven effective in long-term practice. Relevant studies have mainly focused on revealing statistical characteristics of publications, exploring collaboration relationships, and uncovering research themes and their evolution. Some examples are as follows. Geng et al. [14] conducted a bibliometric survey of the research field of residential energy and greenhouse gas emissions for the purpose of uncovering research status. In their work, citation analysis was used to assess the influence of journals, countries, and authors, while network analysis was performed to evaluate the relationships among countries, authors, and keywords. Based on 117,340 obesity-related research publications indexed in the Scopus database and published during 1993-2012, Khan et al. [15] reported research trends and collaboration patterns in the field. Roig-Tierno et al. [16] conducted a bibliometric analysis of research publications applying qualitative comparative analysis (QCA). Their study revealed the differences in quantitative terms of the three variants of QCA. Albort-Morant and Ribeiro-Soriano [17] focused on the research development of business incubators. They sorted 445 publications from WoS according to bibliographic indicators such as research area and year of publication. Their study revealed the lack of publications on business incubators and highlighted the fragmented nature of research themes. Merigó and Yang [18] aimed at identifying relevant studies and the newest trends in the field of operations research and management science. The analysis involved some influential journals, the two hundred most cited publications, and productive and influential authors. Zhang et al. [19] quantitatively and qualitatively evaluated carbon tax related literature from 1989 to 2014 using bibliometric analysis. Their study demonstrated that the USA was the leading country and that Vrije University Amsterdam, Massachusetts Institute of Technology, and Stanford University were the most productive affiliations in the research field. Randhawa et al. [20] conducted a systematic review of publications in the open innovation (OI) research area using bibliometrics, cocitation analysis, and text mining. Three distinct areas within OI research were identified, i.e., firm-centric aspects of OI, management of OI networks, and the role of users and communities in OI. In order to discover worldwide trends in the research field of drying brick/tile, Yataganbaba and Kurtbaş [21] analyzed relevant patents in terms of, e.g., publication number, authorship and ownership, and international collaboration patterns. Merigó et al. [10] explored the research development trends in fuzzy sciences. Similar works have also been conducted in other fields, e.g., natural language processing [22], neuroimaging [23], and diabetes [24].
To the best of our knowledge, there is currently no scientific review of the NLP empowered mobile computing research field. Thus, in this study, we conduct a bibliometric analysis of publications retrieved from WoS for the years 2000-2016 to explore the research status of the field. The main objective is to address the following issues: (1) investigating publication statistical characteristics and publication collaborations, (2) exploring publication geographical distributions, (3) visualizing scientific collaboration relationships, and (4) revealing current hot research topics and changes in research topics.
The rest of the paper is organized as follows. Section 2 introduces methods and materials. Bibliometric analysis results on the retrieved research publications are reported in Section 3. Findings and discussion are presented in Section 4, while Section 5 summarizes the work.

Methods and Materials
Five different methods are applied to analyze research publications in the NLP empowered mobile computing field retrieved from WoS. The details of the methods are described in Section 2.1 and the publication data are introduced in Section 2.2.

Descriptive Statistics Method.
Descriptive statistics are brief descriptive coefficients that summarize a collection of information, which can be either a representation of the entire population or a sample. Descriptive statistics are commonly divided into measures of central tendency and measures of variability. Measures of central tendency usually include the mean, median, and mode, while measures of variability generally include the standard deviation, minimum and maximum values, kurtosis, and skewness. Both kinds of measures use graphs, tables, and general discussion to describe data simply. This condenses large amounts of data in a sensible way by presenting quantitative descriptions in a manageable form, helping users understand the meaning of the data being analyzed.
In this study, the descriptive statistics method was applied to acquire characteristics of the retrieved publications, including publication distribution by year, most influential publications, productive journals, authors, affiliations, and countries/regions, as well as coauthor, coaffiliation, and cocountry/region publication distributions and topic distribution by year.
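As an illustration of the measures listed above, the following minimal sketch computes them with the Python standard library. The function name `describe` and the sample page counts are hypothetical and are not taken from the study's data; skewness and kurtosis use the population (moment-based) definitions, which is one of several conventions.

```python
import statistics

def describe(values):
    """Return common measures of central tendency and variability."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    # Moment-based skewness and excess kurtosis (population versions).
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "std": sd,
        "skewness": m3 / sd ** 3 if sd else 0.0,
        "kurtosis": m4 / sd ** 4 - 3 if sd else 0.0,
    }

# Hypothetical page counts for a handful of publications (not the study's data).
pages = [8, 12, 12, 15, 20, 31]
summary = describe(pages)
```

A positive skewness here would indicate a few unusually long publications pulling the mean above the median.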

Geographic Visualization Method.
Geographic visualization, or geovisualization, is a set of tools and techniques supporting the analysis of geospatial or spatial data, emphasizing knowledge construction over knowledge storage or information transmission. By combining technologies such as image processing, simulation, and virtual reality, computers can help present information in a way that allows patterns to be found. Geovisualization can be applied to all stages of problem-solving in geographical analysis, from the development of initial hypotheses to knowledge discovery, analysis, presentation, and evaluation. According to Tobler's First Law of Geography [25], everything is related to everything else, but near things are more related than distant things. Through geovisualization, we can use location as the key index variable and obtain related information that was previously undiscovered. Locations or extents in Earth space-time may be recorded as the date/time of occurrence together with X, Y, and Z coordinates representing longitude, latitude, and elevation, respectively.
In this study, we applied geographic visualization analysis to explore the geographical distributions of publications at the country/region level.

Social Network Analysis Method.
Social network analysis is a process of investigating social structures using networks and graph theory [26]. It focuses on relationship structures, ranging from casual acquaintance to close bonds. Network structures are characterized in terms of nodes (items, individuals, or things within the network) and the edges or links (relationships or interactions) connecting the nodes. Studies using social network analysis have been undertaken in different areas, e.g., collaboration graphs [27], social media networks [28], and disease transmission [29]. These networks are often visualized through sociograms in which nodes are represented as points and edges are represented as lines. Social network analysis can help identify the individuals, teams, and units who play central roles, leverage peer support, and strengthen the efficiency and effectiveness of existing channels [30].
In this study, we applied social network analysis to explore the cooperation relationships for specific countries/regions, affiliations, and authors in the NLP empowered mobile computing research field. The cooperation among countries/regions, affiliations, and authors was visualized using interactive force-directed networks. In the networks, nodes represented specific countries/regions, affiliations, or authors, and lines indicated cooperation. The size of a node represented the publication number of a specific country/region, affiliation, or author. The width of a line reflected the cooperation frequency between two countries/regions, affiliations, or authors. The color indicated the continent of a country/region, or the country/region of an affiliation or author. Users could explore the cooperation relationships for specific countries/regions, affiliations, or authors by dynamically dragging the nodes.
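The node sizes and line widths described above can be derived from per-publication lists of countries/regions (or affiliations, or authors). The sketch below shows one way to do so; the function `cooperation_network` and the sample records are hypothetical, and the force-directed rendering itself is outside its scope.

```python
from collections import Counter
from itertools import combinations

def cooperation_network(publications):
    """Build node sizes and edge widths from per-publication country lists."""
    node_size = Counter()   # publications per country/region
    edge_width = Counter()  # co-publications per unordered pair
    for countries in publications:
        unique = sorted(set(countries))  # count each country once per paper
        node_size.update(unique)
        edge_width.update(combinations(unique, 2))
    return node_size, edge_width

# Hypothetical records: each entry lists the countries/regions on one paper.
records = [
    ["USA", "China"],
    ["USA"],
    ["China", "USA", "England"],
    ["South Korea"],
]
nodes, edges = cooperation_network(records)
```

Sorting the names before pairing keeps each edge key in a canonical order, so ("China", "USA") and ("USA", "China") are counted as the same link.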

Latent Dirichlet Allocation Method.
Latent Dirichlet allocation (LDA), proposed by Blei [31], is a generative probabilistic model. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words, and topics are assumed to be uncorrelated.
LDA formally defines the following terms: (1) A word is defined as an item from a vocabulary indexed by $\{1, \ldots, V\}$.
LDA assumes the following generation process: (1) The term distribution $\beta$, which contains the probability of a word occurring in a given topic, is determined by $\beta \sim \mathrm{Dirichlet}(\delta)$.
(2) The proportions $\theta$ of the topic distribution for a document $d$ are determined by $\theta \sim \mathrm{Dirichlet}(\alpha)$.
(3) For each word $w_i$ in the document $d$, a topic is chosen by the distribution $z_i \sim \mathrm{Multinomial}(\theta)$ and a word is chosen from a multinomial probability distribution conditioned on the topic $z_i$: $p(w_i \mid z_i, \beta)$.
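The three-step generation process can be sketched as a small simulation. This is an illustrative sketch only: the function names, the symmetric priors, and the toy vocabulary are assumptions, and Dirichlet draws are obtained by normalizing gamma variates from the standard library.

```python
import random

def dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(n_docs, doc_len, vocab, k, alpha, delta, seed=0):
    """Simulate documents from the LDA generative process with k topics."""
    rng = random.Random(seed)
    V = len(vocab)
    # Step 1: one term distribution per topic, beta ~ Dirichlet(delta).
    beta = [dirichlet([delta] * V, rng) for _ in range(k)]
    corpus = []
    for _ in range(n_docs):
        # Step 2: topic proportions for this document, theta ~ Dirichlet(alpha).
        theta = dirichlet([alpha] * k, rng)
        doc = []
        for _ in range(doc_len):
            # Step 3: choose a topic z_i, then a word w_i given that topic.
            z = rng.choices(range(k), weights=theta)[0]
            w = rng.choices(vocab, weights=beta[z])[0]
            doc.append(w)
        corpus.append(doc)
    return beta, corpus

# Toy vocabulary and hyperparameters, chosen only for illustration.
toy_vocab = ["agent", "image", "privacy", "health"]
beta, corpus = generate_corpus(3, 5, toy_vocab, k=2, alpha=0.5, delta=0.5, seed=1)
```

Fitting an LDA model reverses this process: given only the observed words, it estimates the topic term distributions and the per-document topic proportions.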

Wireless Communications and Mobile Computing
Gibbs sampling defines a Markov chain in the space of possible variable assignments such that the stationary distribution of the Markov chain is the joint distribution over variables. Thus, it is a Markov Chain Monte Carlo method [32]. Its aim is to construct a Markov chain that converges to the target probability distribution in the high-dimensional model and then to extract the sample distribution closest to that target. The log-likelihood for Gibbs sampling can be obtained through (2). The perplexity, as shown in (3), is often used to evaluate the models on held-out data and is equivalent to the inverse of the geometric mean per-word likelihood. The lower the perplexity, the better the model.
$$\mathrm{Perplexity}(w) = \exp\left\{-\frac{\log p(w)}{\sum_{d=1}^{D}\sum_{j=1}^{V} n^{(jd)}}\right\} \tag{3}$$
In (3), $n^{(jd)}$ denotes how often the $j$th term occurs in the $d$th document, $D$ is the number of documents, and $V$ is the vocabulary size. If the model is fitted through Gibbs sampling, the likelihood can be determined for the perplexity using
$$\log p(w) = \sum_{d=1}^{D}\sum_{j=1}^{V} n^{(jd)} \log\left[\sum_{K=1}^{k} \theta_K^{(d)} \beta_K^{(j)}\right], \tag{4}$$
where $k$ is the number of topics. Additionally, estimation using Gibbs sampling requires specification of values for the parameters of the prior distributions.
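To make the perplexity computation concrete, the following sketch assumes the commonly used form in which the log-likelihood is the count-weighted sum of the log of the topic-mixture word probabilities, divided by the total number of tokens. The function name and the argument layout (`counts[d][j]`, `theta[d][K]`, `beta[K][j]`) are illustrative assumptions rather than the study's implementation.

```python
import math

def perplexity(counts, theta, beta):
    """Perplexity of a fitted topic model on a document-term count matrix.

    counts[d][j]: occurrences of term j in document d
    theta[d][K]:  proportion of topic K in document d
    beta[K][j]:   probability of term j under topic K
    """
    log_lik = 0.0
    n_tokens = 0
    for d, row in enumerate(counts):
        for j, n_jd in enumerate(row):
            if n_jd == 0:
                continue  # absent terms contribute nothing
            # Word probability is a mixture over topics.
            p = sum(theta[d][K] * beta[K][j] for K in range(len(beta)))
            log_lik += n_jd * math.log(p)
            n_tokens += n_jd
    return math.exp(-log_lik / n_tokens)

# Sanity check with a model that is uniform over a 4-term vocabulary.
toy_counts = [[1, 2, 0, 1], [3, 0, 0, 1]]
toy_theta = [[0.5, 0.5], [0.2, 0.8]]
toy_beta = [[0.25] * 4, [0.25] * 4]
uniform_perplexity = perplexity(toy_counts, toy_theta, toy_beta)
```

For a model that assigns every term the same probability, the perplexity equals the vocabulary size, which gives a convenient upper reference point when comparing fitted models.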
In this study, topic discovery and distribution were analyzed using LDA models with the following steps: (1) We assigned the weights of segmented author keywords and Keywords Plus, publication title, and abstract as 0.4, 0.4 and 0.2, respectively, as determined in our former experiment [13].
(2) Term Frequency-Inverse Document Frequency (TF-IDF) values were used to filter out unimportant terms. As one of the most popular term-weighting schemes, TF-IDF increases proportionally to the number of times a term appears in a publication but is offset by the frequency of the term in the whole collection of publications. We calculated the TF-IDF values of all terms to sort the terms. By manually examining these ranked terms, we empirically defined a threshold of 0.1. Only the terms with a TF-IDF value greater than the threshold were kept for further analysis.
(3) Through sampling, 16 different topic numbers were set for $k$ (2 : 10, 15, 20, 40, 50, 80, 150, 250). For each topic number, 10-fold cross-validation was used to evaluate model performance. Specifically, the dataset was split into 10 test datasets to conduct multiple runs. The perplexity criterion was used to select the optimal topic number. $\alpha$ for Gibbs sampling was initialized as the mean of the $\alpha$ values obtained from model fitting using variational expectation-maximization (VEM) with the optimal topic number.
(4) With the initialized $\alpha$ and the optimal topic number, we adopted Gibbs sampling and the VEM method to estimate the LDA model.
(5) By matching the topics detected by VEM and Gibbs sampling based on Hellinger distance, the best matches with the smallest distance could be identified. Hellinger distance is calculated as (5), in which $P = (p_1, \ldots, p_V)$ and $Q = (q_1, \ldots, q_V)$ denote two probability measures.
$$H(P, Q) = \frac{1}{\sqrt{2}}\sqrt{\sum_{i=1}^{V}\left(\sqrt{p_i} - \sqrt{q_i}\right)^{2}} \tag{5}$$
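The topic-matching step can be sketched as follows, assuming the standard Hellinger distance for discrete distributions (normalized to lie between 0 and 1). The function names and the toy term distributions are hypothetical; the matching shown is a simple nearest-neighbor search and does not enforce a one-to-one assignment.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def best_matches(topics_a, topics_b):
    """For each topic in topics_a, return (index, distance) of its closest topic in topics_b."""
    matches = []
    for p in topics_a:
        dists = [hellinger(p, q) for q in topics_b]
        j = min(range(len(dists)), key=dists.__getitem__)
        matches.append((j, dists[j]))
    return matches

# Hypothetical term distributions of two topics from each estimation method.
vem_topics = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
gibbs_topics = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
matches = best_matches(vem_topics, gibbs_topics)
```

Sorting the matched pairs by distance then gives a ranking from the best matching topic to the worst.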

Affinity Propagation Clustering Method.
The Affinity Propagation (AP) algorithm was proposed by Frey and Dueck [33]. It is a technique for data clustering based on message passing. AP does not require a predefined number of clusters. It identifies cluster centers, or exemplars, as representative members of clusters. Initially, all nodes are considered as potential exemplars. A "preference" is used to reflect how likely a node is to be chosen as an exemplar. If no prior knowledge is available, all nodes are assigned the same preference value. AP has been shown to be more efficient and effective in cluster identification than traditional clustering methods, e.g., k-means [34]. The AP algorithm takes $s(i, k)$ as a similarity function reflecting the fitness of data point $k$ to be the exemplar of data point $i$. The aim of AP is to maximize the similarity $s(i, k)$ between every data point $i$ and its chosen exemplar $k$. Each node $k$ also has a self-similarity $s(k, k)$. Individual data points initialized with a larger self-similarity are more likely to become exemplars. All data points are equally likely to be exemplars when they are initialized with the same constant self-similarity. The number of clusters produced increases or decreases with this common self-similarity input.
There are two types of messages in this technique. The responsibility $r(i, k)$ is directed from $i$ to candidate exemplar $k$. It indicates how well suited $k$ is to be $i$'s exemplar, taking into consideration competing potential exemplars. The availability $a(i, k)$ is sent from candidate exemplar $k$ back to $i$. It indicates $k$'s desire to be an exemplar for $i$ based on supporting feedback from other data points. Both the self-responsibility $r(k, k)$ and the self-availability $a(k, k)$ can reflect accumulated evidence that $k$ is an exemplar. The update formulas for responsibility and availability are as follows:
$$r(i, k) \leftarrow s(i, k) - \max_{k' \neq k}\left\{a(i, k') + s(i, k')\right\},$$
$$a(i, k) \leftarrow \min\Big\{0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0, r(i', k)\}\Big\}, \quad i \neq k,$$
$$a(k, k) \leftarrow \sum_{i' \neq k} \max\{0, r(i', k)\}.$$
Message updates for responsibility and availability are damped as $m_{\mathrm{new}} = \lambda\, m_{\mathrm{old}} + (1 - \lambda)\, m_{\mathrm{new}}$, where $\lambda$ is a weighting factor between 0 and 1. In AP, the clustering is complete when the messages converge. The AP algorithm is also able to determine when a specific data point has converged to cluster head status in its given cluster. A point becomes the cluster head when its self-responsibility plus self-availability becomes positive. Upon convergence, each node $i$'s cluster head can be calculated using $\arg\max_{k}\{a(i, k) + r(i, k)\}$.
In our study, on the basis of the term-topic posterior probability matrix, we applied the AP clustering method for the cluster analysis of the topics identified by the LDA method.
Table 1: The query used to retrieve research publications in the NLP empowered mobile computing field from WoS.
TS=(("natural language processing" OR "NLP" OR "semantic analysis" OR "bag of words" OR "word sense disambiguation" OR "named entity recognition" OR "NER" OR "sentiment analysis" OR "information extraction" "tokenization" OR "stemming" OR "lemmatization" OR "corpus" OR "stop words" OR "parts-of-speech" OR "language modeling" OR "n-grams" OR "syntactic analysis" OR "information retrieval" OR "language model") AND ("mobile computing" OR "mobile" OR "smart device" OR "smartphone" OR "cellphone" OR "telephony device" OR "Cellular network" OR "Android" OR "iOS" OR "phone"))
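The message-passing scheme of affinity propagation can be illustrated with a compact sketch. It assumes the standard responsibility and availability updates from the original AP formulation together with the damping rule described in the text; the function name, the fixed iteration count in place of a convergence check, and the toy one-dimensional data are all assumptions made for illustration.

```python
def affinity_propagation(S, damping=0.5, iterations=200):
    """Cluster points given a similarity matrix S (S[k][k] holds the preference)."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            for k in range(n):
                best_other = max(A[i][kk] + S[i][kk] for kk in range(n) if kk != k)
                new = S[i][k] - best_other
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k) for i' not in {i,k})
        # a(k,k) = sum of positive r(i',k) for i' != k
        for k in range(n):
            pos = [max(0.0, R[ii][k]) for ii in range(n)]
            total = sum(pos) - pos[k]  # positive support from points other than k
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # Each point's exemplar maximizes a(i,k) + r(i,k).
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]

# Toy data: two well-separated groups on a line (illustrative only).
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
preference = -20.0  # assumed common self-similarity for every point
S = [[preference if i == k else -(a - b) ** 2 for k, b in enumerate(points)]
     for i, a in enumerate(points)]
labels = affinity_propagation(S)
```

The returned list gives, for each point, the index of its exemplar, so points sharing an exemplar form one cluster. Lowering the preference tends to yield fewer clusters, and raising it tends to yield more.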

Materials.
Web of Science, as one of the most authoritative citation databases, was used as the data source for retrieving research publications in the NLP empowered mobile computing field. First, a list of keywords related to "natural language processing" and "mobile computing" was determined by a domain expert. With "Science Citation Index Expanded" and "Social Sciences Citation Index" as indexes, the publications used in this study were identified using the specific query in Table 1. In total, 716 publications of the "article" type published during the years 2000-2016 were obtained. Citations counted up to September 8th, 2017 were considered for each publication.
The raw data of the 716 publications were downloaded as plain text. Key elements including title, author, journal, publication date, subject category, language, funding, author keywords, Keywords Plus, abstract, and author address, as well as the numbers of citations, pages, and references, were extracted. To ensure that the publications were closely related to the research field, a domain expert manually verified each one. Eventually, 471 publications were identified as relevant for analysis. Further, the corresponding affiliations and countries/regions were identified from the author address information. Key terms were extracted from author keywords, Keywords Plus, title, and abstract.
The statistical characteristics of the publications are shown in Table 2. The average page number of the publications is 15.66 and the average reference number is 33.29. There are 48 subject categories included, of which the top 3 are computer science (38.76%), engineering (16.27%), and telecommunications (10.98%).
The distribution characteristics of the 471 publications are shown in Figure 1. Figure 1(a) shows the distributions of the numbers of countries/regions, affiliations, authors, and funds. Figure 1(b) shows the distributions of the numbers of keywords, pages, and references. The distribution of the number of title characters is shown in Figure 1(c). Figure 1(d) illustrates the distribution of the number of abstract characters.

Publication with Year.
The total publications, total citations, average number of citations per publication, and number of annual citations are shown in Figure 2. The results show that research in the NLP empowered mobile computing field exhibits an overall upward trend with fluctuations (from 12 publications in 2000 to 55 publications in 2016). The publication number has increased steadily since 2010. Based on the data for the years 2010-2016, we developed a regression model by setting the independent variables as time/1000 and (time/1000)^2. The estimated regression model is calculated as = 6.7143 * 10 …
In order to better measure the overall scientific importance of these 11 journals, 5 assessment indicators acquired from Scientific Journal Rankings were used, including Impact Factor (IF), SCImago Journal Rank (SJR), 5-Year IF, Source Normalized Impact per Paper (SNIP), and CiteScore. IF is a measure reflecting the yearly average number of citations to recent publications in a journal. It is the primary and most widely used indicator for assessing a journal's significance. SJR is a measure of the scientific influence of scholarly journals. It accounts for both the number of citations received by a journal and the importance or prestige of the journals from which such citations come. 5-Year IF is calculated by dividing the number of citations in a given year to publications from the previous five years by the number of publications published in that journal in those five years. SNIP is defined as the ratio of the journal's citation count per publication to the citation potential in its subject field. The CiteScore index, launched by Elsevier in December 2016, is calculated as the ratio of the total citations received in a given year by all publications published in a given journal in the three previous years to the number of publications published in the journal in those three years.
Therefore, the 11 productive journals were compared using their IF, SJR, 5-Year IF, SNIP, and CiteScore for the year 2016, as shown in Figure 3.

Most Influential Publications.
The number of citations reflects the popularity and influence of a publication in the scientific community [10]. Thus, we used total citations as a measure of influence. There are 69 publications with at least 20 citations and 129 publications with at least 10 citations. The top 15 most influential publications are listed in Table 4. The publication by Miao et al. [35] in 2010 (376 citations) is the most influential one, followed by [36] published by MacKenzie and Soukoreff in 2002 (172 citations) and [37] …
… 100 most influential publications. It is noted that publications from Singapore have the highest ACP, which suggests the high quality of these publications. For most of the top 15 productive countries/regions, the international collaboration rates are around 30%, except for Greece with 0% and Australia with 61.11%. The USA is the closest collaborator for 9 of the 15 countries/regions. The ACP of internationally collaborated publications is much higher than that of noninternationally collaborated publications for countries/regions such as China, Japan, Italy, France, Spain, and Singapore. This potentially indicates that international collaboration can improve the quality of their publications.
Since the publications are mainly distributed in the USA, China, England, and South Korea, we further explored the annual publication distributions for these 4 countries, as shown in Figure 5. The numbers of publications for the USA and China show an overall upward trend with fluctuations. For the USA, the number increases from 2 in 2000 to 9 in 2007 but falls to 2 in 2010. After that, the upward trend becomes more pronounced. The situation for China after 2010 closely resembles that for the USA, indicating a surge of NLP empowered mobile computing research in these two countries since 2010. For England and South Korea, the number of publications fluctuates without a marked increase over the years.
3.6. Cooperation Relationship.
Figure 6 shows the trends in the number and the percentage of internationally collaborative publications. We found that internationally collaborative publications increase during the years 2000-2016. The percentage of international collaborations increases from 8.33% in 2000 to 32.73% in 2016. This indicates that international collaborations in the NLP empowered mobile computing research field have become increasingly important.
Figures 7 and 8 present cooperation at the institutional level and at the author level, respectively. Cooperation between different institutions is becoming more and more frequent. The percentage of institution-collaborative publications increases from 16.67% in 2000 to 58.18% in 2016. More than 90% of the publications have been multiauthored since 2011. It is worth noting that the percentage reaches 100% in 2015.
Furthermore, the cooperation relations for specific countries/regions, affiliations, and authors were visualized with social network analysis. A cooperation network for 48 countries/regions is shown in Figure 9. Of these, 17 come from Asia (represented as orange nodes), 3 from North America (represented as blue nodes), 22 from Europe (represented as green nodes), 3 from Africa (represented as purple nodes), 2 from South America (represented as brown nodes), and 1 from Oceania (represented as a red node). There are 141 affiliations with at least 2 publications, and cooperation exists among 91 of them. Figure 10 shows a cooperation network of the 91 affiliations. Of the 91 affiliations, 23 are from the USA and 14 from China. As for cooperation at the author level, there are 98 authors with at least 2 publications. Among them, 65 authors are involved in cooperation. We created a cooperation network of the 65 authors, as shown in Figure 11.
Figure 9: Cooperation network of 48 countries/regions (node colors represent different continents, e.g., orange for Asia, blue for North America, green for Europe, red for Oceania, purple for Africa, and brown for South America). The network can be accessed via the link (http://www.zhukun.org/haoty/resources.asp?id=NLPEMC cocountry).

Topic Discovery and Distribution.
After setting the TF-IDF value threshold to 0.1, the remaining terms were ranked by frequency. Table 8 lists the top 20 most frequent terms, of which the top 5 are "Agent" (369), "Image" (215), "Sentiment" (128), "Dialogue" (83), and "Health" (81). Figure 12 presents the perplexities of models fitted using Gibbs sampling with different numbers of topics. The result suggests that the optimal topic number is between 40 and 80. Hence, we set the topic number as 40. $\alpha$ was set to the mean value 0.01101332 obtained in the cross-validation fitted using VEM. Using these parameters, we estimated the LDA model using Gibbs sampling. Through semantic analysis of the representative terms in each topic, together with a review of the corresponding publications, we assigned a potential theme to each topic. The order of topics is determined based on Hellinger distance. Specifically, Topic 36 is the best matching topic and Topic 11 ranks 2nd, while Topic 37 is the worst matching one. Due to space limitations, Table 9 displays only the top 10 best matching topics with their most frequent terms. Each publication was assigned to the most likely topic, i.e., the one with the highest posterior probability. Integrating topic proportions over all the publications, we obtained a topic distribution. The 4 most frequent research topics are Topic 36 (6.38%), Topic 4 (4.26%), Topic 11 (3.83%), and Topic 17 (3.83%), while the 4 least frequent research topics are Topic 26 (1.49%), Topic 23 (1.28%), Topic 10 (1.06%), and Topic 20 (1.06%).
Figure 10: Cooperation network of 91 affiliations (node colors represent different countries/regions, e.g., red for the USA, pink for South Korea, and purple for Australia). The network can be accessed via the link (http://www.zhukun.org/haoty/resources.asp?id=NLPEMC coaffiliation).
We applied AP clustering to the 40 topics. One way of measuring topic similarity is based on term-level similarity, under the hypothesis that topics may contain the same terms. The clustering result based on the term-topic posterior probability matrix is shown in Figure 13, where the 40 topics are categorized into 8 groups.
Identifying emerging research topics can provide valuable insights into the development of the research field. Likewise, identifying fading research topics can help in understanding the evolution of hot spots [40]. We then explored the annual publication proportions of the 40 research topics, as shown in Figure 14. We used the Mann-Kendall test [41], a nonparametric trend test, to examine whether increasing or decreasing trends exist in the 40 topics. Test results show that 12 topics, including Topic 1, Topic 4, Topic 7, Topic 10, Topic 14, Topic 18, Topic 20, Topic 26, Topic 29, Topic 32, Topic 33, and Topic 39, present a statistically significant increasing trend, while Topic 36 presents a statistically significant decreasing trend, all at the two-sided 0.05 significance level.
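The trend test on annual topic proportions can be sketched as below. This is a minimal version of the Mann-Kendall test that uses the normal approximation with continuity correction and omits the correction for tied values; the function name and the example series are hypothetical and do not reproduce the study's results.

```python
import math

def mann_kendall(series):
    """Two-sided Mann-Kendall trend test (normal approximation, no tie correction)."""
    n = len(series)
    # S is the number of increasing pairs minus the number of decreasing pairs.
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return s, z, p_value

# Hypothetical annual proportions of one topic over 17 years (2000-2016).
proportions = [0.01 * (year + 1) for year in range(17)]
s_stat, z_stat, p_val = mann_kendall(proportions)
```

A topic would be flagged as increasing when the statistic is positive and the p-value falls below 0.05, and as decreasing when the statistic is negative at the same level.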

Discussions
This study provides a most up-to-date bibliometric analysis on the publications in WoS during the years 2000-2016 in the NLP empowered mobile computing research field.Some interesting findings are discussed below.The annual number of the publication distribution shows a significant growth trend, from 12 publications in 2000 to 55 publications in 2016.This indicates a growing interest in the research field.
The literature characteristics analysis shows that the 417 publications are widely dispersed throughout 287 journals.Through geographic visualization analysis, 60 countries/regions have participated in the publications.The top 15 productive countries/regions are developed countries/regions, except for China.As the top 2, the USA and China have shown a significant growth in the numbers of scientific publications since 2010.These numbers are predicted to continue to increase in the coming years.This partially reflects the need of the development of NLP techniques in solving mobile computing issues.
Scientific collaboration analysis shows that there is significant growth in international collaborations, institution collaborations, and author collaborations. Through social network analysis, we found that researchers tend to collaborate with others within the same country or area, with institutions under similar administration, or with a neighboring country or area. However, some research institutions might have separate administration arrangements from their associated universities or hospitals, and a researcher might be affiliated with multiple institutions.
Most topics identified using the LDA method are recognizable, as they relate to major issues in the research field. Due to space constraints, we only provide interpretations of some representative topics here.
Topic 36 and Topic 11 contain words such as "Agent", "Mobile-agent", "Multi-agent", "Itinerary", "Migration", "Protocol", and "Truncation". Thus, Topic 36 and Topic 11 pertain to mobile agent computing. As an emerging and exciting paradigm for mobile computing applications [42], mobile agents not only support mobile computers and disconnected operations but also provide an efficient, convenient, and robust programming paradigm for implementing distributed applications. The use of mobile agents can bring significant benefits, e.g., reducing network traffic, overcoming network latency, and enabling seamless system integration. Therefore, mobile agents are well suited to the domain of mobile computing.
Topic 32 concerns mobile privacy and security. Words in this topic include "Privacy", "Private", "Secure", "Encryption", "Privacy-preserving", "Password", and "Cryptosystem". As pointed out by Mollah et al. [43], security and privacy challenges have arisen with the development of mobile cloud computing, which aims to relieve the constraints of resource-limited mobile devices in the mobile computing area. Several studies center on mobile privacy. For example, Xi et al. [44] applied Private Information Retrieval techniques to find the shortest path between an origin and a destination in location privacy settings, without the risk of disclosing the users' privacy.
Topic 1 concerns mobile computing on image and syllable events. It includes words such as "Image", "Syllable", "Reranking", "Content-based", "Composite Phoneme", "Simple Phonemes", and "Modern Orthography". Image search on mobile devices remains challenging [45], and many researchers are seeking ways to address it. For example, Cai et al. [46] presented a new geometric reranking algorithm, designed for small vocabularies in such scenarios, based on the Bag-of-Words model for image retrieval. Mobile computing on syllable events is another focus. A representative work is that of Eddington and Elzinga [47], who conducted a quantitative analysis of the phonetic context of word-internal flapping, paying particular attention to stress placement, following phone, and syllabification.
Topic 4 mainly focuses on mobile social media events. Words such as "Twitter", "Sentiment", "Tweet", "Emojis", "Micro-blog", "Opinion", "Public", and "Emotion" can be found within this topic. With the rapid development of social networks, the spread and evolution of information is facilitated by the popularity of wireless communication, especially social media platforms on mobile terminals [48]. Researchers are increasingly paying attention to this area. For example, based on 100 million messages collected from Twitter, Wang et al. [49] presented a hybrid model for sentimental entity.
Based on topic distributions, we found that mobile agent computing, mobile social media computing, and sound related event computing are the 3 most frequent research themes. From Figure 14 and the Mann-Kendall test results, we found that some research themes present a statistically significant increasing trend, e.g., image and syllable related events, mobile social media computing, and health related events, while research on mobile agent computing presents a statistically significant decreasing trend.
In the thematic analysis, the optimal number of topics was selected as 40 using a statistical measure of how well the model fits the data. However, mechanical reliance on statistical measures might lead to the selection of a less meaningful topic model [50]. Hence, we manually checked the robustness of the results by confirming the identified topics through a qualitative assessment based on prior knowledge. For each topic, we checked the semantic coherence of its high-frequency terms and examined the contents of publications with a high proportion of this topic.
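Selecting the number of topics by a fit measure can be sketched as follows, assuming the measure is held-out perplexity averaged over cross-validation folds and that the candidate with the lowest mean is chosen. Both the selection rule and the toy numbers are assumptions made for illustration, not results from this study.

```python
import math


def perplexity(log_likelihood, n_tokens):
    """Perplexity of held-out text: exp(-log-likelihood per token)."""
    return math.exp(-log_likelihood / n_tokens)


def select_num_topics(fold_perplexities):
    """Return the candidate topic number with the lowest mean perplexity.

    fold_perplexities maps each candidate number of topics to the list of
    held-out perplexities obtained on the cross-validation folds.
    """
    mean_by_k = {k: sum(v) / len(v) for k, v in fold_perplexities.items()}
    best_k = min(mean_by_k, key=mean_by_k.get)
    return best_k, mean_by_k


# Hypothetical held-out perplexities for three candidates over three folds.
toy_folds = {
    20: [412.0, 405.5, 418.3],
    40: [371.2, 366.8, 374.9],
    60: [380.4, 377.1, 385.0],
}
best_k, means = select_num_topics(toy_folds)
```

Lower perplexity means the model assigns higher probability to unseen text, but, as noted above, the statistically best model is not always the most interpretable one, which is why a qualitative check remains useful.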
Through the AP clustering analysis of the 40 topics, 8 clusters were identified, i.e., mobile agent computing, mobile social media computing, image and syllable related events, context-aware computing, sound related events, mobile location computing, health related events, and other events. The results of the AP clustering analysis are on the whole sensible and easy to understand. However, we found that the 8 categories vary considerably in the number of topics they contain. One possible reason is the choice of clustering method. We therefore also applied a hierarchical clustering method with the number of categories set to 8. The result was similar to that of AP clustering. Another possible reason is the sample size, since the number of relevant publications in WoS is limited.
This study is the first to thoroughly explore the research status of the NLP empowered mobile computing research field from a statistical perspective. The study provides a comprehensive overview and an intellectual structure of the field from 2000 to 2016. The findings can potentially help researchers, especially newcomers, systematically understand the development of the field, learn the most influential journals, recognize potential academic collaborators, and trace research hotspots.
For future work, there are several directions. First, more comprehensive data should be included. Though WoS is a widely applied repository for bibliometric analysis due to its high authority, some relevant conference proceedings have not yet been indexed in WoS. Second, we intend to employ different data clustering methods and compare their results for deeper cluster analysis.

Conclusions
We conducted a bibliometric analysis of natural language processing empowered mobile computing research publications from Web of Science published during the years 2000-2016. The literature characteristics were uncovered using a descriptive statistics method. Geographical publication distribution was explored using a geographic visualization method. By applying a social network analysis method, cooperation relationships among countries/regions, affiliations, and authors were displayed. Finally, topic discovery and distribution were presented using an LDA method and an AP clustering method. We believe the analysis can help researchers more systematically comprehend the collaboration patterns, the distribution of scholarly resources, and the research hot spots in the field.

Figure 3: Comparisons of IF, SJR, 5-Year IF, SNIP, and CiteScore for the top 11 productive journals for year 2016.
by Strayer and Drews in 2007 (148 citations). We further consider the number of annual citations of the 15 publications. The top 3 publications measured by this indicator are [38] by Cao et al.

Figure 4: Geographical distributions of the NLP empowered mobile computing research publications.

Figure 5: Publication distributions by year for the top 4 countries/regions.

Figure 7: Institution-collaborative publication distribution by year.

Figure 8: Author-collaborative publication distribution by year.

Figure 11: Cooperation network of 65 authors (node colors represent different countries/regions, e.g., orange for South Korea, red for the USA, purple for Australia, green for China, and brown for Italy). The network can be accessed via the link (http://www.zhukun.org/haoty/resources.asp?id=NLPEMC coauthor).

Figure 12: (a) Estimated α value for the models fitted using VEM. (b) Perplexities of the test data for the models fitted using Gibbs sampling. Each line corresponds to one of the folds in the 10-fold cross-validation.

Figure 13: The visualized result of hierarchical clustering based on term-topic posterior probability matrix.

Figure 14: The trends of the 40 research topics during 2000-2016 (x-coordinate as year, y-coordinate as proportion %).
3 − 1.34777 × 10⁴. The adjusted goodness-of-fit R² of the model is 0.9468. With the regression model, the publication number in 2017 is predicted as 65, while the actual number of publications on WoS in 2017 is 66. The trend of citations does not keep step with the publication number, and extreme values appear in 2002 (431), 2007 (503), and 2010 (490). The average number of citations per publication is calculated as total citations/total publications. The top 11 contributing journals in the research field are presented in Table 3. These journals contribute about 21% of the total publications and 29.20% of the total citations. The most productive 3 are IEEE/ACM Transactions on Audio Speech and Language Processing (25 3.2. Productive Journals.

Table 2: The statistical characteristics of the 471 publications.

Table 3: Top 11 contributing journals in the NLP empowered mobile computing research field.
Notice. The journal IEEE Transactions on Audio Speech and Language Processing was renamed IEEE/ACM Transactions on Audio, Speech, and Language Processing in 2013, and the journal IEEE Transactions on Speech and Audio Processing ceased publication in 2005; its current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing. Therefore, publications from these 2 journals were combined as published by IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Abbreviations. SC: subject categories only with NLP empowered mobile computing research (A: acoustics; E: engineering; CS: computer science; OR&MS: operations research and management science; T: telecommunications); TP: total publications; %: percentage of the publications; TC: total citations; ACP: average number of citations per publication, calculated as TC/TP; H: H-index; ≥10: number of publications with citations ≥10; T100: number of publications in the top 100 most influential publications.
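The indicators defined above can be computed from a list of per-publication citation counts, as in the minimal sketch below. The function names and the toy citation counts are hypothetical and do not correspond to any journal in Table 3.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def journal_indicators(citations):
    """TP, TC, ACP (= TC/TP), H-index, and the >=10 count for one journal."""
    tp = len(citations)
    tc = sum(citations)
    return {
        "TP": tp,
        "TC": tc,
        "ACP": tc / tp if tp else 0.0,
        "H": h_index(citations),
        ">=10": sum(1 for c in citations if c >= 10),
    }


# Hypothetical citation counts for the publications of one journal.
toy_citations = [25, 12, 10, 4, 3, 0]
indicators = journal_indicators(toy_citations)
```

For the toy list this gives 6 publications, 54 citations, an ACP of 9.0, an H-index of 4, and 3 publications with at least 10 citations.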
As for IF, SJR, and CiteScore, the top 3 are Information Sciences (IF 4.832, SJR 1.91, and CiteScore 5.37), Expert Systems with Applications (IF 3.928, SJR 1.433, and CiteScore 4.7), and IEEE/ACM Transactions on Audio Speech and Language Processing (IF 2.491, SJR 0.813, and CiteScore 3.5). As for 5-Year IF, the top 3 are Information Sciences (5-Year IF 4.731), Expert Systems with Applications (5-Year IF 3.526), and Personal and Ubiquitous Computing (5-Year IF 2.512). As for SNIP score, the top 3 are IEEE/ACM Transactions on Audio Speech and Language Processing (SNIP 3.143), Information Sciences (SNIP 2.537), and Expert Systems with Applications (SNIP 2.492).

Table 4: Top 15 most influential publications in the NLP empowered mobile computing research field.
As for the ranking based on the total citations, the top 3 are Georgia Institute of Technology from the USA (550 citations and 110 ACP), Microsoft Research Asia from China (115 citations and 16.43 ACP), and National Cheng Kung University from Taiwan (62 citations and 12.4 ACP). Ranking based on the ACP indicator yields the same result.

Table 5: The most productive authors in the NLP empowered mobile computing research field.
Abbreviations. CA: Canada; USA: the USA; UK: England; CN: China; KR: South Korea; GR: Greece; IT: Italy; JP: Japan; SG: Singapore; TP: total publications; TC: total citations; ACP: average number of citations per publication; H: H-index; T100: number of publications in the top 100 highly cited publications; : number of publications with funding; FP: number of publications as first author; LP: number of publications as last author; CP: number of collaborated publications.

Table 6: The most productive affiliations in the NLP empowered mobile computing research field.

Table 7: The most productive countries/regions in the NLP empowered mobile computing research field.

Table 8: Top 20 most frequent terms.