A Method for Identifying Japanese Shop and Company Names by Spatiotemporal Cleaning of Eccentrically Located Frequently Appearing Words

We have developed a method for spatiotemporally integrating databases of shop and company information, such as digital telephone directories, in order to monitor dynamic urban transformations in detail. To realize this, an additional method is necessary to verify the identicalness of different instances of Japanese shop and company names that might contain fluctuations of description. In this paper, we discuss a method that utilizes an n-gram model for comparing and identifying Japanese words. The processing accuracy was improved by developing various kinds of libraries for frequently appearing words and using these libraries to clean shop and company names. In addition, the accuracy was greatly improved by a novel step: the detection of frequently appearing words that appear eccentrically across both space and time. By utilizing natural language processing (NLP), our method incorporates a novel technique for the advanced processing of spatial and temporal data.


Introduction
Spatiotemporal changes of shop and company locations have a major effect on the vitality and attraction of urban space. It is a significant challenge to monitor these changes quantitatively and in as detailed a manner as possible, for use in various fields including urban engineering, geography, and economics. However, it is difficult to comprehensively monitor urban spaces, because much general regional and statistical information (e.g., the population census, commercial statistics) is compiled by separate administrative or city block units.
On the other hand, detailed information on shop and company locations and names can be collected using telephone directories and web information. Fortunately, this is possible in Japan, because of the availability of digital telephone directories and detailed digital maps which can monitor almost all residents and tenants in a given building.
The yearly continuations and changes in tenants or residents can be monitored for a certain location, and we can integrate these data across multiple years. The same can be done for shop and company locations over multiple years, by measuring changes in shop and company names. However, this measurement is not easy because of name fluctuations between two different years or between different kinds of data. Therefore, to resolve this challenge, we have been developing a dataset that can monitor the time-series changes of each shop and company, together with a system that can produce such data [1,2]. This paper focuses on a particular method of name identification pertinent to shop and company names, that is, an identification method for Japanese words.

Previous Studies: About Spatial and Temporal Data Developments
There have been many previous studies that have attempted to monitor changes in urban spaces using time-series information of shops, companies, and buildings.

Advances in Artificial Intelligence
For example, the locations of open and closed shops were extracted using digital maps and the results of field surveys by Ato et al. [3], and time-series changes of building locations were extracted and applications were developed using digital residential maps by Ai et al. [4]. However, the processing methods used in almost all of these previous studies are inappropriate for large quantities of data (e.g., the whole area of one city or prefecture) because the researchers developed their time-series data by manual, time-intensive processing.
On the other hand, Ito and Magaribuchi have developed a completely automated method of spatiotemporal integration of digital residential maps, which is capable of processing large volumes of data [5]. However, this method has been applied to only one specific kind of digital map. In addition, the method encounters difficulties when dealing with problems of so-called noise word cleaning and local frequently appearing words (to be described below), because the focus of that study was only Tokyo's 23 wards. As a result, we consider that there are limitations in applying this method over a broad area.

Previous Studies: About Language Processing
For this study, a method to recognize named entities (i.e., compound nouns) is necessary. There have been many previous studies that worked in various ways to develop such a method. Florian et al. presented a statistical language-independent framework for identifying and tracking named, nominal, and pronominal references to entities within unrestricted text documents, then chaining them into clusters corresponding to each logical entity present in the text [6]. Tri Tran et al. applied a support-vector-machine (SVM)-based NER model to the Vietnamese language [7]. Tjong Kim Sang and Meulder processed named entity data from English and German using sixteen different kinds of systems to recognize the entities' identities, and they obtained the best result using a combined learning system that applied Maximum Entropy to each language [8]. In addition, there have been many studies that have attempted to recognize named entities in other languages [9][10][11][12][13].
There have also been many previous studies focusing on the processing of Japanese words. Sato et al. developed a method to predict the authors of a text based on frequencies of word usage within it [14]. Kawakami and Suzuki presented a method to calculate word similarities in random texts using a decision list [15]. Mishina et al. evaluated word similarities using n-grams [16]. Similarly, we use n-grams in this study in order to recognize and identify shop and company names.
However, it has been difficult for previous methods to deal with local frequently appearing words (LFAW). Our approach to managing this problem is introduced in Section 3.5 in detail.

About This Paper
This paper and our system focus on Japanese language processing. Our system monitors the time-series changes of each shop and company by spatially integrating datasets from two different years that contain names and locations (that is, address, longitude, and latitude) and then verifying whether the names are identical. There are two remarkable and novel points our paper introduces. The first is that it utilizes natural language processing (NLP) for the advanced processing of spatial and temporal data. There are few studies that have processed data using NLP in the field of spatial information science. Some studies in this field have partly utilized NLP [17,18]; however, there are no studies that have utilized NLP for the processing of spatial and temporal data to the same extent as our study. In Japan, Ito and Magaribuchi [5] have accomplished a similar trial to our study, as detailed in Section 1.1. However, their method has been applied only in central Tokyo. Our study is the first to develop a spatiotemporal dataset covering all of Japan. The second is that our method can deal with LFAW. It is a novel method that recognizes so-called pure shop and company names ("pure" refers to the elements of a character string that identify a tenant uniquely), detects frequently appearing words that are eccentrically located spatially and temporally (LFAW), and cleans shop and company names of them.
In Japan, there are many kinds of data that contain name and location information. One of the largest and most complete datasets for shops and companies all over Japan comprises residential and tenant information from digital residential maps (Zenrin CO., Ltd.; in Japanese, "Zyutaku-Chizu") and digital telephone directories (e.g., "Town Page Database" by NTT Business Information Service, Inc. and "Telepoint Data" by Zenrin CO., Ltd.). Other sources include data on individual companies and enterprises, for example, the quarterly journal of companies and enterprises of Japan, or shop information on the Web collectable through API services. Therefore, our method can be adapted across various fields of data development.
Our system should be able to process data in various regions and times. Therefore, the test data used in the development of our system was the residential and tenant information in the digital residential maps and telephone directories.

Characteristic Features of Japanese
Japanese is a language used throughout Japan. There are about 130 million speakers in the world, mainly in East Asia [19].
One of the main characteristics of Japanese is its use of three kinds of characters: hiragana, katakana, and kanji. Hiragana and katakana are phonograms, while kanji are ideograms. The origin of kanji is Chinese characters, while hiragana and katakana are unique characters originating in Japan. Kanji are mainly used to write nouns, roots of verbs and adjectives, and personal names of Japanese and Chinese people. Kanji can be described by hiragana and katakana because the pronunciations of kanji can be written phonetically, as seen in Table 1. In addition, there are multiple pronunciations in kanji, unlike Chinese characters (Table 2). In such cases, the pronunciation of a given kanji is decided by the context of the surrounding words and texts.
A notable characteristic of written Japanese is that it does not have blanks between single words. Because of this, it is difficult to divide one text into its component single words without an adequate understanding of word meaning or class (Table 3). This is a common characteristic of major languages in East Asia. Klein et al. have also pointed this out and tried to recognize Chinese named entities using character sequences [20].
In addition, one of the interesting features of Japanese is that it typically writes loanwords with similarly pronounced katakana (Table 4). Chinese also has this same feature. When it is difficult to write loanwords with katakana because of inadequate or incompatible phonemic inventories, loanwords are sometimes written in the original languages and script [21].

Difficulty in Verifying the Identity of Shop and Company Names
It is not easy to verify that two Japanese words are identical or even similar because of the above features of Japanese. For example, character strings tend to be longer than in English or French, because Japanese is written without blanks between single words. In addition, there are many fluctuations of description, because Japanese uses three kinds of characters and changes word order frequently.
Moreover, one kind of character string that appears frequently in shop and company names is branch names. These strings become noise words, making name identification difficult if a shop and company name contains long geographic or building names (Table 5). We realize the necessity of solving these kinds of problems if we wish to verify the identicalness of different instances of shop and company names adequately.

Development
We identify and verify the time-series changes of shops and companies between two different years based on location (i.e., address, longitude, and latitude) and name information (i.e., shop, company, or building names). Then, we can assess the kind of time-series change (that is, continuation, change, emergence, or demise) of each shop or company between the two years, for monitoring purposes.

Input Data.
The input data consisted of name and location information separated by commas (e.g., in csv or txt format) containing an address at minimum. When more specific information is provided in the source data (building names, floors, and room numbers), our system can integrate input data more accurately than without it. Figure 1 shows an image of some sample input data and their resultant output. Figure 2 shows the processing flow of our method and all potential results of time-series integration. At first, new and old data are integrated spatially for each shop and company unit. Shops and companies found at the same location are integrated into a set, and after subsequent time-series integration, they are labeled either "continuation" or "change." Shops and companies that exist only in the new data are labeled "emergence," and those that exist only in the old data are labeled "demise."
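The four labels described above can be illustrated with a minimal Python sketch. It is not the implementation of our system: it assumes, for simplicity, that each location key (for example, address plus building, floor, and room) holds at most one tenant per year, and the function name `classify_time_series` and the `same_tenant` callback are hypothetical names introduced only for this illustration.

```python
def classify_time_series(old, new, same_tenant):
    """Label each location with a time-series category.

    old, new    : dicts mapping a location key to a tenant name
                  (assumption: at most one tenant per key and year).
    same_tenant : callable (old_name, new_name) -> bool that decides
                  whether two names refer to the same tenant
                  (hypothetical; any name-identification test fits here).
    """
    labels = {}
    for loc in old.keys() | new.keys():
        if loc not in old:
            labels[loc] = "emergence"     # exists only in the new data
        elif loc not in new:
            labels[loc] = "demise"        # exists only in the old data
        elif same_tenant(old[loc], new[loc]):
            labels[loc] = "continuation"  # same place, same tenant
        else:
            labels[loc] = "change"        # same place, different tenant
    return labels
```

In this sketch, exact string equality could be passed as `same_tenant` for a first test, although fluctuations of description make a similarity-based test preferable in practice.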

Processing Flow.
In this paper, we introduce a method to verify whether or not two spatially integrated names refer to the same tenant, and then to decide whether the time-series change is best classified as "continuation" or "change." The details of spatial integration have been described in our previous studies [1,2].
In this study, the time-series changes of each shop and company were monitored based on the "name" changes within the same location. In other words, our method monitors time-series changes of buildings and their floors and rooms. Therefore, our method will not identify transfers of shop or company ownership, whether by merger or acquisition. However, we expect that interested parties will be able to track changes of ownership at the same "continuation" location by using other data or statistics more relevant to company mergers and acquisitions, such as the Japan Company Handbook (Toyo Keizai Inc.).

Verification of Name Identification.
It is not easy to verify that a new name and an old name refer to the same company at a given location, because simply determining whether the two names are exactly the same returns inadequate results. There are subtle fluctuations of description between the names in new data versus old data, even though they may actually be the same shops or companies. Table 6 shows some examples of fluctuations of description between old and new names of the same business. The shops and companies listed in Table 6 were taken from the 2000 and 2005 Tokyo telephone directories. Each shop or company is located at the same location in 2000 and 2005, and this fact can be verified by manual checking. However, each name is subtly different.
In order to solve this problem, we must meaningfully quantify similarities between the words of the shop and company names. In this study, this word quantification has been realized by the "n-gram." The n-gram is one method of natural language processing that can quantify the degree of similarity between two different words [22]. The method has been attracting attention in fields as diverse as literature, linguistics, and computer science [23][24][25].
We use the bigram (2-gram) to calculate name similarity in this study. The bigram extracts string blocks constructed of 2 characters from new and old names and then compares them. This method can resolve the problem of fluctuations of description. Figure 3 depicts the bigram calculation method as applied to our word similarity problem. The name similarity between word $i$ and word $j$ is denoted $S_{ij}(n)$, where $m_i(n)$ is the number of string blocks extracted from word $i$, $m_j(n)$ is the number of string blocks extracted from word $j$, and $n_{ij}(n)$ and $n_{ji}(n)$ are the number of string blocks within $m_i(n)$ matching within $m_j(n)$, and vice versa, respectively.
It was necessary to designate a minimum threshold for the similarity metric $S_{ij}(n)$ experimentally. First, 3,000 shops and companies were randomly extracted from the 2005 Tokyo telephone directory and then integrated with the shops and companies in their respective spaces from the 2000 directory. Next, the name similarities between shops and offices in the integrated dataset were calculated using the method above, and a comprehensive range of threshold values for $S_{ij}(n)$ was tested. We then compared these automated results with results obtained via manual processing that were verified as correct. Figure 4 shows the results of this comparison. Accordingly, a value of about 0.4 was determined to be optimal for the threshold of $S_{ij}(n)$. Integrated pairs whose similarity exceeds the threshold are considered to match. As a result, we set the default threshold of $S_{ij}(n)$ to 0.4 for our system.
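A minimal Python sketch of a bigram similarity of this kind follows. The combining formula is an assumption made for illustration, since it is not spelled out in the text here: the sketch uses the symmetric ratio $(n_{ij}(n) + n_{ji}(n)) / (m_i(n) + m_j(n))$, which is consistent with the symbol definitions but may differ from the formula actually used. The function names and the treatment of a similarity exactly equal to the threshold are likewise hypothetical.

```python
def ngrams(word, n=2):
    """Return all contiguous character blocks of length n."""
    return [word[k:k + n] for k in range(len(word) - n + 1)]


def name_similarity(word_i, word_j, n=2):
    """Assumed similarity: (n_ij + n_ji) / (m_i + m_j)."""
    blocks_i = ngrams(word_i, n)
    blocks_j = ngrams(word_j, n)
    if not blocks_i or not blocks_j:
        return 0.0  # a name shorter than n characters yields no blocks
    set_i, set_j = set(blocks_i), set(blocks_j)
    n_ij = sum(1 for b in blocks_i if b in set_j)  # blocks of i found in j
    n_ji = sum(1 for b in blocks_j if b in set_i)  # blocks of j found in i
    return (n_ij + n_ji) / (len(blocks_i) + len(blocks_j))


DEFAULT_THRESHOLD = 0.4  # default value reported in the text


def is_same_name(name_a, name_b, threshold=DEFAULT_THRESHOLD):
    # Assumption: a similarity equal to the threshold counts as a match.
    return name_similarity(name_a, name_b) >= threshold
```

Under this assumed formula, identical names score 1.0 and names sharing no bigram score 0.0, so the 0.4 threshold tolerates a moderate amount of fluctuation of description.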

Removal of Noise Words.
Shop and company names often contain frequently appearing words (FAWs), geographic names, and station names. Because of the confounding and pseudosimilar effects of these words and names, appropriate verification that similar names refer to the same tenant is difficult to achieve. Sagara and Kitsuregawa have also pointed out this difficulty in recognizing pure shop and company names using computers [26]. A method that can remove these so-called "noise words" from name information is necessary. We solved this problem by creating dictionaries of noise words and using them to remove noise words from shop and company names prior to n-gram analysis. The FAW dictionary in this study was developed by applying an automated system of Japanese morphological analysis called "Chasen" [27] to tenant names that had been extracted from the 2005 residential maps and telephone directory covering the South Kanto region. Tenant names were divided into parts of speech by Chasen, and these data were combined with manually culled FAWs to develop our library. Table 7 [...] using railroad timetables of Japan. Table 8 shows some examples of noise words, and Table 9 shows the number of words present in each library.
Only those character strings that contain the geographic and station name structures depicted in Figure 5 are removed from shop and company names. The processing in Figure 5 is necessary because it decreases the risk that character strings which have no relation to geographic and station names might be removed from shop and company names. This risk increases if noise words are removed by matching geographic and station names alone. We consider one geographic name, "Nakagawa (中川)", as an example. This name is common in Adachi ward in Tokyo Prefecture, where it refers to a geographic name.
However, it is strongly expected that there will also be many shops and companies whose names contain "Nakagawa" even outside of the Nakagawa area, because it is a very popular family name in Japan: in fact, in the 2005 telephone directory, there are 311 shops and companies that do. Almost all of the shops and companies extracted were this nongeographical Nakagawa (Figure 6). Table 10 shows some examples of shop and company names containing an instance of Nakagawa unrelated to its geographic and station usages.
On the other hand, using the removal procedure in Figure 5 diminishes the risk of removing character strings that are unrelated to geographic and station names, that is, strings that should be retained for the n-gram similarity metric. Figure 7 shows the results of a search for shops and companies containing "Nakagawa" in their names using this rule. Two shops were found, and Table 11 shows that the "Nakagawa" in their names refers to the geographic Nakagawa.
Finally, pure shop and company names remain after removing the various kinds of noise words through the above processing. This process is demonstrated in Figure 8.
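The dictionary-based cleaning can be sketched in Python as follows. This is an illustration under stated assumptions, not the system's implementation: the dictionary contents are tiny hypothetical samples, and the structural rule is assumed to be "a geographic or station name immediately followed by a branch-type suffix," which is only a guess at the kind of structure depicted in Figure 5.

```python
import re

# Hypothetical sample dictionaries; the real libraries are far larger.
FAW_DICTIONARY = ["株式会社", "有限会社"]   # e.g., "Co., Ltd." style words
GEO_STATION_NAMES = ["中川", "新宿"]        # geographic and station names
BRANCH_SUFFIXES = ["支店", "店"]            # assumed branch-type endings


def _alternation(words):
    # Try longer strings first so that "支店" wins over "店".
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(re.escape(w) for w in ordered)


# Assumed structure: <geographic/station name><branch suffix>.
GEO_PATTERN = re.compile(
    "(?:" + _alternation(GEO_STATION_NAMES) + ")"
    "(?:" + _alternation(BRANCH_SUFFIXES) + ")"
)


def remove_noise_words(name):
    """Strip FAWs, then geographic names used in a branch-like structure."""
    for faw in sorted(FAW_DICTIONARY, key=len, reverse=True):
        name = name.replace(faw, "")
    # A bare "中川" (e.g., a family name) is left untouched by this rule.
    return GEO_PATTERN.sub("", name).strip()
```

With these sample dictionaries, a name in which "中川" is followed by "店" loses that branch string, while a name in which "中川" is followed by other characters keeps it, which mirrors the Nakagawa example above.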

Removal of Local Frequently Appearing Words.
Nonetheless, there are cases that cannot be processed well, even when all of the above methods and libraries are incorporated. This is because there are frequently appearing words that are eccentrically located both spatially and temporally. We refer to such words as "Local Frequently Appearing Words (LFAW)" in this study.
We explain LFAWs using three examples, depicted in Figure 9. "Shinjuku Nishiguchi" (the western exit of Shinjuku terminal) and "Yaesu-guchi" (Yaesu exit) are not geographic names. In addition, "Yaesu-guchi" is not a station name. However, there are many shops and companies whose names contain these character strings, because they are located in the western area of Shinjuku terminal or the eastern area of Tokyo terminal, respectively. These are examples of FAWs that are eccentrically located in space: that is, they are concentrated only around a particular area. Figure 10 shows the locations of shops and companies whose names contain "Shinjuku Nishiguchi-ten" (Shinjuku terminal western exit shop) taken from the 2005 telephone directory. There are many data points in the western area outside Shinjuku terminal that fit this category.
Figure 7: Search results of shops and companies containing "Nakagawa" as a geographic name.
On the other hand, "Roppongi Hills" (one of the largest and most famous skyscraper complexes in Tokyo and Japan) in Figure 9 is an example of an FAW that is eccentrically located not only spatially but also temporally. There are almost no shop or company names from before 2003 containing "Roppongi Hills," because Roppongi Hills only opened in that year. There are 105 shops and companies in 2005 and 141 in 2009 that contain "Roppongi Hills" in their names; however, there were no such shops in the 2000 Tokyo telephone directory. Thus, a method to remove these kinds of LFAWs was necessary.
We constructed grids measuring one millidegree square along longitude and latitude, and all source data were allocated to this grid. Frequently appearing character strings were searched for in the shop and company names within each grid using the n-gram method. For each grid, n-grams created strings measuring from n = 4 to n = 9 based on the shop and company names within both the targeted grid itself and the neighboring 8 grids on all sides. It was necessary to search in the neighboring grids as well, so that shops or companies located near the grid borders could be incorporated into the LFAW identification process. Finally, the identified LFAWs were removed from the shop and company names in each grid. For our purposes, LFAWs are only 4- to 9-gram-constructed strings that appear multiple times and whose endings comprise "店/ten" (shop), "支社/sisya" (branch), "営業所/eigyosyo" (office), "東口/higashiguchi" (eastern exit), "西口/nishiguchi" (western exit), "南口/minamiguchi" (southern exit), or "北口/kitaguchi" (northern exit). Also, long LFAWs are removed earlier than short LFAWs. In other words, those LFAWs created by the 9-gram are removed first, those created by the 8-gram next, and so on through the 4-gram.
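The grid-based LFAW search can be sketched in Python as follows. This is a simplified illustration rather than our system's code: "appear multiple times" is assumed to mean "appears in at least two names within the cell and its 8 neighbors," each string is counted once per name, and all function and constant names are hypothetical.

```python
import math
from collections import Counter, defaultdict

LFAW_ENDINGS = ("店", "支社", "営業所", "東口", "西口", "南口", "北口")
CELL_SIZE_DEG = 0.001  # one millidegree


def cell_of(lon, lat):
    """Map a coordinate to its integer grid cell index."""
    return (math.floor(lon / CELL_SIZE_DEG), math.floor(lat / CELL_SIZE_DEG))


def candidate_strings(name):
    """All 4- to 9-grams of a name that end with an LFAW ending."""
    found = set()
    for n in range(4, 10):
        for k in range(len(name) - n + 1):
            gram = name[k:k + n]
            if gram.endswith(LFAW_ENDINGS):
                found.add(gram)
    return found


def find_lfaws(records, min_count=2):
    """records: iterable of (name, lon, lat).

    Returns {cell: [lfaw, ...]} with each list sorted longest first.
    """
    names_by_cell = defaultdict(list)
    for name, lon, lat in records:
        names_by_cell[cell_of(lon, lat)].append(name)

    lfaws_by_cell = {}
    for cx, cy in names_by_cell:
        counts = Counter()
        # Search the target cell and its 8 neighbors.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for name in names_by_cell.get((cx + dx, cy + dy), []):
                    counts.update(candidate_strings(name))
        frequent = [g for g, c in counts.items() if c >= min_count]
        lfaws_by_cell[(cx, cy)] = sorted(frequent, key=len, reverse=True)
    return lfaws_by_cell


def remove_lfaws(name, lfaws):
    """Remove LFAWs from a name, longest strings first."""
    for lfaw in lfaws:
        name = name.replace(lfaw, "")
    return name
```

One limitation of this sketch is that two branches of the same chain in neighboring cells would have most of their shared name flagged as an LFAW; a fuller implementation would need additional safeguards against that case.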
To this effect, Figure 11 shows the results from these example LFAWs and their removal in one grid west of Shinjuku terminal. Almost all of the shop and company names were processed adequately. In addition, even when part of a pure name is erroneously removed, the effects can be largely ignored because the n-gram can still calculate name similarity effectively, as in Figure 3.
Tables 12 and 13 show some example test results of our method for removing noise words. For telephone directory data (Table 12), there are 654 shops and companies from which noise words should be removed, out of 1000 extracted by random sampling. Overall, 92.4% of the 1000 samples were processed successfully, that is, their noise words were removed without damaging the character strings of pure names. For web information (Table 13), the fluctuation of shop and company names seems larger than in the telephone directory, with 79.4% of the locations having their noise words removed successfully. In addition, the FAW dictionary exerts the largest discriminative effect within each test.

Table 10: Examples of shop and company names containing "Nakagawa" unrelated to its geographic and station usages.
Names (in English)
Nakagawa Printing Press Co., Ltd.
Nakagawa Metal Co., Ltd.
Nakagawa Dental Clinic
Nakagawa-ya Curry Udon

Table 11: Shop names of search results from Figure 7.
Names (in English)
Aichiya Nakagawa shop
Nishizawa pharmacy, Nakagawa branch

Table 12: Processing accuracy of removal of noise words (telephone directory data; 1000 samples extracted randomly).
Number of samples: 1000
Is it necessary to remove noise words from names, as determined by a manual check? Yes: 654; No: 346
Can we get the same result as manual processing using the FAW dictionary? Yes: 513; No: 141
Can we get the same result as manual processing using the dictionary of geographic names and station names? Yes: 70; No: 71
Can we get the same result as manual processing after LFAW removal? Yes: 11; No: 60
Do pure names remain after all noise word removal processing? Yes: 330; No: 16
Number of data processed successfully: 513, 70, 11, 0, 330, 0 (sum total: 924)
Processing accuracy (%): 92.40

Table 13: Processing accuracy of removal of noise words (data consist of 1000 samples extracted randomly from web data using the Hot Pepper API from within Tokyo prefecture).
Number of samples: 1000
Is it necessary to remove noise words from names, as determined by a manual check? Yes: 545; No: 455
Can we get the same result as manual processing using the FAW dictionary? Yes: 67; No: 478
Can we get the same result as manual processing using the dictionary of geographic names and station names? Yes: 237; No: 241
Can we get the same result as manual processing after LFAW removal? Yes: 81; No: 160
Do pure names remain after all noise word removal processing?
Processing accuracy (%): 79.40
"Hot Pepper" is a famous free coupon magazine in Japan, produced by Recruit Co., Ltd. Using the Hot Pepper API, we can collect information about many kinds of shops, companies, restaurants, and so forth.
There are cases where noise words still partly remain in shop and company names or where important character strings are erroneously removed after the LFAW processing. These are indicated by the blue numbers in Table 12 (76 shops and companies: 7.6%) and Table 13 (206 shops and companies: 20.6%) and the blue character strings in the "Name after removal of LFAW" row within Figure 11. However, these effects are negligibly small, because n-gram processing can nonetheless verify name similarity despite the incompleteness of the pure character string. Tables 15, 16, and 17 in Section 4 will demonstrate that our method can process data with sufficiently high accuracy.
In the tables, "Z" in the number of data denotes residential map tenant data, and "T" denotes telephone directory data. "Co": continuation, "Ch": change, "Em": emergence, and "De": demise. "FSI" means failure of spatial integration.

Processing Accuracy
So far, we have developed a method for removing various kinds of noise words from shop and company names, and one for verifying that differing names may refer to the same tenant, by calculating the name similarity. In this section, the processing accuracy of our system achieved by these methods is discussed. We compared the results of time-series data produced by our system with results created manually (and verified for correctness) in some sample areas in the South Kanto region of Japan. Input data for this verification of identity were taken from the telephone directories and digital residential maps as described in Section 1.3. Table 14 shows our sample areas: two each of urban and rural areas.
First, telephone directory and residential map tenant data from 2000 were integrated spatiotemporally with the same data from 2005 over the whole South Kanto region. Then, sample data were extracted from the results of time-series integration. Finally, these automated results were compared with results manually obtained and verified as correct. Tables 15 and 16 show the processing accuracy achieved by our system's time-series integration. Each table begins with the manually verified total of all the time-series changes observed in each of the sample areas. The system accomplished a processing accuracy of 94.36% (820/869) in integrating the old and new residential map tenant data spatiotemporally (Table 15), and one of 95.22% (478/502) in integrating the old and new telephone directory data (Table 16). The sum totals here differ from the respective sum totals in Table 14 because the "Demise" results are counted rather than subtracted out. That is, the sum total in Table 15 excluding "Demise" data is the same as the number of tenants in the 2005 residential maps. The most salient point from Tables 15 and 16 is the high accuracy achieved for continuation and change results. We could not have acquired such high values without both accurate spatial integration and a robust method for identifying differing names as referring to the same tenant. These results demonstrate that the Japanese language processing methodology introduced in this paper is effective for the realization of time-series integration.
In addition, Table 17 shows the processing accuracy for the residential map tenant data in each sample area. For example, in the Continuation column of "Kabukicho," "79 (87)" means that 87 data points were judged as "Continuation" through manual analysis, and 79 out of those 87 were assigned the same time-series category by our system. Processing accuracies in urban areas were slightly lower than in rural areas, because of the high density of shops and companies and their frequent relocation. Processing accuracies in rural areas were almost 100%.
Tables 15, 16, and 17 show that the results of our system differ from manually created time-series data in about 5% of cases. Given this rate, the processing accuracy of our system is practical and robust, considering the inevitable human error and the large amounts of labor and time necessary when performing such work manually. The high processing accuracy observed with our system was achieved not only through accurate spatiotemporal integration; the method we developed, which identifies two different names as referring to the same tenant by calculating word similarity, was also essential. The results detailed in this section demonstrate that the method for name identification discussed in this paper performs at a reliable level.

Examples of Data Graphics Developed Using Our System
In this section, we will briefly discuss some examples and applications of detailed time-series datasets that can be developed by our system. Figure 12 shows a 3D time-series map of tenant changes around Shinjuku terminal between 2000 and 2005. This map was constructed by integrating the residential map time-series datasets with the respective telephone directory ones for the years 2000 and 2005. It is possible to find buildings that were newly built between 2000 and 2005 by searching for buildings where all tenants are categorized as "Emergence," and conversely, to find vacant sites or sites under construction by searching for buildings where all tenants are categorized as "Demise." In addition, we can easily see that many of the "Change" tenants are (in 2005) located on lower floors around Shinjuku terminal. Figure 13 shows a grid map (500 m square) of "Change" rates based on the results of time-series integration of residential map tenant data from 2003 and 2008 from all over Japan. The rate is calculated as the number of "Change" tenants divided by the total number of tenants. It is readily apparent from Figure 14 that grids with high "Change" rates are located in urban areas: this may be expected, since competition among shops and companies is usually intense in such areas. In addition, it is interesting to be able to monitor the variability of the "Change" rate across many different areas in the same city. This is the first instance where such a detailed time-series dataset with such homogeneous resolution over so broad an area has been realized in Japan. This kind of data can make a valuable contribution to solving the problems encountered in previous studies, as introduced in Section 1.
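The "Change" rate per grid cell described above can be computed with a few lines of Python. The sketch below is illustrative only: the cell identifiers and the function name are hypothetical, and it assumes that every labeled tenant passed in counts toward the denominator, since the text does not specify which categories are included in the total.

```python
from collections import Counter


def change_rate_by_cell(labeled_tenants):
    """labeled_tenants: iterable of (cell_id, label) pairs.

    Returns {cell_id: number of "change" tenants / total tenants}.
    """
    totals = Counter()
    changes = Counter()
    for cell, label in labeled_tenants:
        totals[cell] += 1
        if label == "change":
            changes[cell] += 1
    return {cell: changes[cell] / totals[cell] for cell in totals}
```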

Conclusion
In this paper, we discussed a method for identifying Japanese names by quantitatively analyzing their true, "pure" similarities while ignoring pseudosimilar "noise words" within them. The most remarkable achievement of this study was its removal of LFAW that are eccentrically located both spatially and temporally by an n-gram-adapted methodology. This novel approach integrates knowledge bases from both linguistics and spatial information science. In addition, we conjecture that this study foreshadows a growing demand for natural language processing in the fields of spatial information science and geography.
There are some future challenges to improve the identification of Japanese words. One challenge is to develop an environment that can convert effortlessly between kanji, hiragana, and katakana. Mutual conversion between hiragana and katakana is very easy because both sets of characters comprise the same set of phonograms. However, it seems difficult to convert kanji directly into hiragana or katakana because kanji are ideograms. In addition, almost all kanji in Japanese have multiple kinds of pronunciation. The development of a method to accurately and robustly convert kanji into hiragana or katakana is one of the most important tasks facing our research. Another important challenge is that of converting loanwords into katakana. We have already realized a simplified system that can do this. However, the processing accuracy of this system is inadequate, with this system converting only some English and French words into katakana precisely. Both are very difficult challenges, yet nonetheless very interesting and exciting directions for future research.
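As a small illustration of why mutual conversion between hiragana and katakana is easy, the following Python sketch relies on the fact that the Unicode katakana block mirrors the hiragana block at a fixed offset of 0x60 code points. It is not part of our system; the function names are hypothetical, and characters outside the basic ranges (such as the prolonged sound mark "ー" or kanji) are simply passed through unchanged.

```python
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096  # ぁ .. ゖ
KANA_OFFSET = 0x60  # katakana code point = hiragana code point + 0x60


def hiragana_to_katakana(text):
    """Shift every basic hiragana character to its katakana counterpart."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST
        else ch
        for ch in text
    )


def katakana_to_hiragana(text):
    """Shift every basic katakana character to its hiragana counterpart."""
    first = HIRAGANA_FIRST + KANA_OFFSET  # ァ
    last = HIRAGANA_LAST + KANA_OFFSET    # ヶ
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if first <= ord(ch) <= last else ch
        for ch in text
    )
```

No comparable table lookup exists for kanji, whose readings depend on context, which is why that conversion remains a research challenge.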