We have developed a method for spatiotemporally integrating databases of shop and company information, such as digital telephone directories, in order to monitor dynamic urban transformations in detail. To realize this, an additional method is necessary to verify whether different instances of Japanese shop and company names, which might contain fluctuations of description, are identical. In this paper, we discuss a method that utilizes an
Spatiotemporal changes of shop and company locations have a major effect on the vitality and attraction of urban space. It is a significant challenge to monitor these changes quantitatively and in as detailed a manner as possible, for use in various fields including urban engineering, geography, and economics. However, it is difficult to comprehensively monitor urban spaces, because much general regional and statistical information (e.g., the population census, commercial statistics) is compiled by separate administrative or city block units.
On the other hand, detailed information on shop and company locations and names can be collected using telephone directories and web information. Fortunately, this is possible in Japan, because of the availability of digital telephone directories and detailed digital maps which can monitor almost all residents and tenants in a given building.
The yearly continuations and changes in tenants or residents can be monitored for a certain location, and we can integrate these data across multiple years. The same can be done for shop and company locations over multiple years, by measuring changes in shop and company names. However, this measurement is not easy because names fluctuate between two different years or between different kinds of data. Therefore, we have been developing a dataset that can monitor the time-series changes of each shop and company, and a system that can develop such data, so as to resolve this challenge [
There have been many previous studies that have attempted to monitor changes in urban spaces using time-series information of shops, companies, and buildings. For example, the locations of open and closed shops were extracted using digital maps and the results of field surveys by Ato et al. [
On the other hand, Ito and Magaribuchi have developed a completely automated method of spatiotemporal integration of digital residential maps, which is capable of processing large volumes of data [
For this study, a method to recognize named entities (i.e., compound nouns) is necessary. Many previous studies have worked in various ways to develop such a method. Florian et al. presented a statistical language-independent framework for identifying and tracking named, nominal, and pronominal references to entities within unrestricted text documents, then chaining them into clusters corresponding to each logical entity present in the text [
There have also been many previous studies focusing on the processing of Japanese words. Sato et al. developed a method to predict the authors of a text based on frequencies of word usage within it [
However, it has been difficult for previous methods to deal with local frequently appearing words (LFAW). Our approach to managing this problem is introduced in Section
This paper and our system focus on Japanese language processing. Our system monitors the time-series changes of each shop and company by spatially integrating their names and locations (i.e., address, longitude, and latitude) across two years into a single dataset and then measuring whether their names are identical. In this paper, we focus on how to measure the identity of Japanese words.
There are two remarkable and novel points our paper introduces. The first is that it utilizes natural language processing (NLP) for the advanced processing of spatial and temporal data. There are few studies that have processed data using NLP in the field of spatial information science. Some studies in this field have partly utilized NLP [
In Japan, there are many kinds of data that contain name and location information. One of the largest and most complete datasets for shops and companies all over Japan comprises residential and tenant information from digital residential maps (Zenrin CO., Ltd.; in Japanese, “Zyutaku-Chizu”) and digital telephone directories (e.g., “Town Page Database” by NTT Business Information Service, Inc. and “Telepoint Data” by Zenrin CO., Ltd.). Other data cover individual companies and enterprises, for example, the quarterly journal of companies and enterprises of Japan, or shop information on the Web collectable by API services. Therefore, our method can be adapted across various fields of data development.
Our system should be able to process data in various regions and times. Therefore, the test data used in the development of our system was the residential and tenant information in the digital residential maps and telephone directory mentioned above, because these data can cover all of Japan with a homogeneous resolution.
One of the main characteristics of Japanese is its use of three kinds of characters: hiragana, katakana, and kanji. Hiragana and katakana are phonograms, while kanji are ideograms. The origin of kanji is Chinese characters, while hiragana and katakana are unique characters originating in Japan. Kanji are mainly used to write nouns, roots of verbs and adjectives, and personal names of Japanese and Chinese people. Kanji can be described by hiragana and katakana because the pronunciations of kanji can be written phonetically, as seen in Table
Example of description of Kanji by Hiragana.
Described by Kanji | Described by Hiragana/Katakana | |
---|---|---|
Japanese characters | ||
Pronunciations | Nihon | Ni Ho Nn |
Meaning in English | Japan | Japan |
Pronunciations by Chinese and Japanese of the same characters.
Chinese/Japanese character | |||
---|---|---|---|
Pronunciations in Chinese | Zhōng | Shān | Bĕn |
Pronunciations in Japanese | Naka, Chū | Yama | Hon |
A notable characteristic of written Japanese is that it does not have blanks between single words. Because of this, it is difficult to divide one text into its component single words without an adequate understanding of word meaning or class (Table
Example of Japanese without blanks.
Text | Translation from English to Japanese | |
---|---|---|
Japanese | ||
English | Tokyo Stock Exchange | Tokyo = |
Stock = | ||
Exchange = |
In addition, one of the interesting features of Japanese is that it typically writes loanwords with similarly pronounced katakana (Table
Examples of description of loanwords by katakana.
Example 1 | Example 2 | Example 3 | ||
---|---|---|---|---|
Words | Loanwords | Notebook (En) | Baumkuchen (De) | Château (Fr) |
Japanese | ||||
Pronunciations | Loanwords | nóutbuk | ||
Japanese | nōtobukku | Bāmukūhen | Shatō |
It is not easy to verify that two Japanese words are identical or even similar because of the above features of Japanese. For example, character strings tend to be longer than in English or French, because Japanese is written without blanks between single words. In addition, there are many fluctuations of description, because Japanese uses three kinds of characters and changes word order frequently.
Moreover, one kind of character string that appears frequently in shop and company names is branch names. These strings become noise words, making name identification difficult if a shop and company name contains long geographic or building names (Table
Examples of noise words in shop and company names.
Example 1 | Example 2 | |
---|---|---|
Shop and company names | ||
Japanese | ||
English | McDonalds Shimokitazawa shop | Starbucks Coffee Roppongi Hills shop |
Noise words | ||
Japanese | ||
English | Shimokitazawa shop | Roppongi Hills shop |
Kind of noise words | Station/geographic name | Building name |
We identify and verify the time-series changes of shops and companies between two different years based on location (i.e., address, longitude, and latitude) and name information (i.e., shop, company, or building names). Then, for monitoring purposes, we can assess the kind of time-series change (that is, continuation, change, emergence, or demise) of each shop or company between the two years.
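The four kinds of time-series change can be illustrated with a small sketch. It assumes that spatial integration has already paired the old and new names found at one location, with `None` standing for an absent record; the function name `classify_change`, the pluggable `similarity` argument, and the threshold value of 0.5 are all hypothetical and are not taken from our system.

```python
from typing import Callable, Optional


def classify_change(
    old_name: Optional[str],
    new_name: Optional[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical value, for illustration only
) -> str:
    """Label the change at one location between two years."""
    if old_name is None and new_name is None:
        raise ValueError("at least one of the two names must be present")
    if old_name is None:
        return "emergence"      # nothing there before, a tenant now
    if new_name is None:
        return "demise"         # a tenant before, nothing there now
    if similarity(old_name, new_name) >= threshold:
        return "continuation"   # names are judged to be the same tenant
    return "change"             # a different tenant at the same location


def exact_match(a: str, b: str) -> float:
    """Trivial stand-in similarity: 1.0 only for identical strings."""
    return 1.0 if a == b else 0.0


print(classify_change(None, "Nakagawa Dental Clinic", exact_match))
print(classify_change("Nakagawa Dental Clinic", "Nakagawa Dental Clinic", exact_match))
```

Passing the similarity measure as an argument keeps the four-way decision separate from the question of how two names are compared.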
The input data consisted of name and location information separated by commas (e.g., in csv or txt format) containing an address at minimum. When more specific information is provided in the source data—building names, floors, and room numbers—our system can integrate input data more accurately than without. Figure
Image of sample input data and resultant output (time-series integration).
Figure
Processing flow of time-series integration by our method.
In this paper, we introduce a method to verify whether or not two spatially integrated names refer to the same tenant, and then to decide whether the time-series change is best classified as “continuation” or “change.” The details of spatial integration have been described in our previous studies [
In this study, the time-series changes of each shop and company were monitored based on the “name” changes within the same location. In other words, our method monitors time-series changes of buildings and their floors and rooms. Therefore, our method will not identify transfers of shop or company ownership, whether by merger or acquisition. However, we expect that interested parties will be able to track changes of ownership at the same “continuation” location by using other data or statistics more relevant to company mergers and acquisitions, such as the Japan Company Handbook (Toyo Keizai Inc.).
It is not easy to verify that a new name and old name refer to the same company at a given location, because simply determining whether each name is exactly the same returns inadequate results. There are subtle fluctuations of description between the names in new data versus old data, even though they may actually be the same shops or companies. Table
Examples of fluctuations of description between old and new names.
Name (in 2005) | Name (in 2000) | Address |
---|---|---|
In order to solve this problem, we must meaningfully quantify similarities between the words of the shop and company names.
In this study, this word quantification has been realized by the “
We use the bigram (2-gram) to calculate name similarity in this study. The bigram extracts string blocks constructed of 2 characters from new and old names and then compares them. This method can resolve the problem of fluctuations of description. Figure
Calculation method for word similarity using a bigram.
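A character-bigram comparison of this kind can be sketched as follows. The exact formula used in our system is the one given in the figure; the Dice-style ratio below (twice the number of shared bigrams divided by the total number of bigrams in both names) is an assumption chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter


def bigrams(name: str) -> Counter:
    """Count every block of 2 consecutive characters in the name."""
    return Counter(name[i:i + 2] for i in range(len(name) - 1))


def bigram_similarity(old_name: str, new_name: str) -> float:
    """Dice-style overlap of character bigrams, in the range 0.0 to 1.0."""
    old_counts, new_counts = bigrams(old_name), bigrams(new_name)
    total = sum(old_counts.values()) + sum(new_counts.values())
    if total == 0:
        # Names shorter than 2 characters have no bigrams; fall back to equality.
        return 1.0 if old_name == new_name else 0.0
    shared = sum((old_counts & new_counts).values())  # multiset intersection
    return 2 * shared / total


# "Starbucks Coffee" versus "Starbucks", both written in katakana.
print(bigram_similarity("スターバックスコーヒー", "スターバックス"))
```

Because the comparison works on overlapping character blocks rather than on whole words, it needs no word segmentation, which suits text written without blanks between words.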
It was necessary to designate a minimum threshold for the similarity metric
Distributions of bigram values in the case of accord and discord.
Shop and company names may often contain frequently appearing words (FAWs), geographic names, and station names. Because of the confounding and pseudosimilar effects of these words and names, appropriate verification that similar names refer to the same tenant is difficult to achieve. Sagara and Kitsuregawa have also pointed out this difficulty in recognizing pure shop and company names using computers [
We solved this problem by creating dictionaries of noise words and using them to remove noise words from shop and company names prior to
Appearance frequencies of FAW in tenant names in 2005.
FAW in Japanese | FAW in English | Appearance frequency | Ratio of appearance (%) |
---|---|---|---|
Co. | 532007 | 13.88 | |
Ltd. | 322517 | 8.42 | |
Corporation | 151421 | 3.95 | |
Limited company | 69205 | 1.81 | |
Center | 54179 | 1.41 | |
Office | 39617 | 1.03 | |
Beauty salon | 32534 | 0.85 | |
Business office | 28489 | 0.74 | |
Building | 27510 | 0.72 | |
Clinic | 19028 | 0.50 | |
Tokyo | 18702 | 0.49 | |
Parking | 17703 | 0.46 | |
Heights | 17679 | 0.46 | |
Service | 17542 | 0.46 | |
Cleaning | 17432 | 0.45 | |
Snack bar | 14308 | 0.37 | |
Cooperative | 13716 | 0.36 |
Tenants in the 2005 residential maps: 3,141,434
Tenants in the 2005 telephone directory: 690,183
Total tenants: 3,831,617.
Examples of noise words in shop and company names.
Names (Japanese/English) | Noise words | Kind of noise word |
---|---|---|
FAW | ||
Geographic name | ||
Station name | ||
FAW | ||
Geographic name | ||
FAW | ||
Station name | ||
FAW | ||
Geographic name |
The number of words in each noise word dictionary.
Regions of Japan | Number of geographic names | Number of station names | Number of FAW |
---|---|---|---|
Hokkaido | 48570 | 1519 | 963 |
Tohoku | 135436 | 1314 | |
North Kanto | 12224 | 678 | |
South Kanto | 35869 | 2625 | |
Koshinetsu | 16677 | 785 | |
Hokuriku | 14248 | 721 | |
Tokai | 76361 | 2008 | |
Kinki | 66681 | 2290 | |
Chugoku | 21554 | 1081 | |
Shikoku | 17303 | 580 | |
Kyusyu | 28868 | 1717 | |
Okinawa | 1835 | 19 |
Each region contains the following prefectures.
Only those character strings that contain the geographic and station name structures depicted in Figure
Examples of shop and company names containing Nakagawa unrelated by geography or station name.
Names | |
In Japanese | In English |
Nakagawa Printing Press Co., Ltd. | |
Nakagawa Metal Co., Ltd. | |
Nakagawa Dental Clinic | |
Nakagawa-ya Curry Udon |
Character strings removed from shop and company names.
Locations of shops and companies in the 2005 telephone directory containing “Nakagawa (
On the other hand, using the removal procedure in Figure
Shop names of search results from Figure
Names | |
In Japanese | In English |
Aichiya Nakagawa shop | |
Nishizawa pharmacy, Nakagawa branch |
Search results of shops and companies containing “Nakagawa” as a geographic name.
Eventually, pure shop and company names remain after removing the various kinds of noise words through the above processing. This process is demonstrated in Figure
Accuracy improvement in bigram processing achieved by the removal of noise words.
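The dictionary-based removal described above can be sketched as follows. The sketch assumes that a geographic or station name is treated as noise only when it is directly followed by a branch-type suffix, which is how the Nakagawa examples behave; the dictionary excerpts, the suffix list, and the Japanese renderings of the example names are illustrative assumptions rather than the contents of our dictionaries.

```python
import re

# Hypothetical dictionary excerpts, for illustration only.
FAW = ["株式会社", "有限会社"]      # "Co., Ltd.", "Limited company"
GEO_NAMES = ["中川", "下北沢"]      # "Nakagawa", "Shimokitazawa"
BRANCH_SUFFIXES = ["支店", "店"]    # "branch", "shop" (longer suffix first)


def remove_noise_words(name: str) -> str:
    """Strip FAWs and '<place name><branch suffix>' strings from a name."""
    for word in FAW:
        name = name.replace(word, "")
    suffix = "|".join(re.escape(s) for s in BRANCH_SUFFIXES)
    for place in GEO_NAMES:
        # The place name is removed only together with a branch suffix,
        # so a family name such as Nakagawa in a company name survives.
        name = re.sub(re.escape(place) + "(?:" + suffix + ")", "", name)
    return name


print(remove_noise_words("愛知屋中川店"))      # Aichiya Nakagawa shop
print(remove_noise_words("中川印刷株式会社"))  # Nakagawa Printing Co., Ltd.
```

Requiring the suffix is what keeps "Nakagawa" in a name such as Nakagawa Printing while still removing it from a branch name.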
Nonetheless, there are cases that cannot be processed well, even when all of the above methods and libraries are incorporated. This is because there are frequently appearing words that are eccentrically located both spatially and temporally. We refer to such words as “Local Frequently Appearing Words (LFAW)” in this study.
We explain LFAWs using three examples, depicted in Figure
Image of LFAWs.
Data distribution of locations containing “Shinjuku Nishiguchi-ten” (“Nishiguchi-ten” means western exit shop/branch).
On the other hand, “Roppongi Hills” (one of the largest and most famous skyscraper complexes in Tokyo and Japan) in Figure
We constructed grids of millidegree-square cells along longitude and latitude, and all source data were allocated to this grid. Frequently appearing character strings were searched for in the shop and company names within each grid using the
To this effect, Figure
Examples of LFAWs and their removal from shop and company names in one grid covering an area west of Shinjuku terminal (taken from the 2005 telephone directory).
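The grid-based search for LFAWs can be illustrated with a simplified sketch. It assumes a cell size of 0.001 degree and looks only for name endings shared by several tenants in one cell; the actual string search in our system is not reproduced here, and the function names, the thresholds `min_len` and `min_count`, and the sample records are hypothetical.

```python
import math
from collections import Counter, defaultdict

CELL_SIZE_DEG = 0.001  # assumed cell size: one millidegree


def cell_of(lon: float, lat: float) -> tuple:
    """Return the integer grid index of the cell containing a point."""
    return (math.floor(lon / CELL_SIZE_DEG), math.floor(lat / CELL_SIZE_DEG))


def local_frequent_suffixes(records, min_len=3, min_count=3):
    """Map each cell to the longest name ending shared by min_count names."""
    names_by_cell = defaultdict(list)
    for name, lon, lat in records:
        names_by_cell[cell_of(lon, lat)].append(name)

    result = {}
    for cell, names in names_by_cell.items():
        counts = Counter()
        for name in names:
            # Proper suffixes only, so a whole name is never a candidate.
            for k in range(min_len, len(name)):
                counts[name[-k:]] += 1
        frequent = [s for s, c in counts.items() if c >= min_count]
        if frequent:
            result[cell] = max(frequent, key=len)
    return result


# Invented sample records (name, longitude, latitude).
records = [
    ("カメラ屋新宿西口店", 139.6995, 35.6905),
    ("薬局新宿西口店", 139.6998, 35.6907),
    ("書店新宿西口店", 139.6996, 35.6906),
    ("喫茶店", 139.7050, 35.6905),
]
print(local_frequent_suffixes(records))
```

Counting within each cell rather than over the whole dataset is what lets a string such as "Shinjuku Nishiguchi-ten" stand out locally even though it is rare nationwide.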
Tables
Processing accuracy of removal of noise words (Data consists of 1000 samples extracted randomly from the 2005 Tokyo prefecture telephone directory).
Number of samples | 1000 | ||||||
Is it necessary to remove noise words from names, as determined by a manual check? | Yes: 654 | No: 346 | |||||
Can we get the same result as manual processing using the FAW dictionary? | No: 141 | ||||||
Can we get the same result as manual processing using the dictionary of geographic names and station names? | No: 71 | ||||||
Can we get the same result as manual processing after LFAW removal? | |||||||
Do pure names remain after all noise word removal processing? | |||||||
Number of data processed successfully | 513 | 70 | 11 | 0 | 330 | 0 | 924 |
Processing accuracy (%) | 92.40 |
Processing accuracy of removal of noise words (Data consists of 1000 samples extracted randomly from web data using the Hot Pepper API from within Tokyo prefecture).
Number of samples | 1000 | ||||||
Is it necessary to remove noise words from names, as determined by a manual check? | Yes: 545 | No: 455 | |||||
Can we get the same result as manual processing using the FAW dictionary? | No: 478 | ||||||
Can we get the same result as manual processing using the dictionary of geographic names and station names? | No: 241 | ||||||
Can we get the same result as manual processing after LFAW removal? | |||||||
Do pure names remain after all noise word removal processing? | |||||||
Number of data processed successfully | 67 | 237 | 81 | 0 | 409 | 0 | 794 |
Processing accuracy (%) | 79.40 |
“Hot Pepper” is a famous free coupon magazine in Japan, produced by Recruit Co., Ltd. Using the Hot Pepper API, we can collect information about many kinds of shops, companies, restaurants, and so forth.
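The accuracy figures in the two tables above follow from the counts of successfully processed samples in each branch of the check. The short sketch below reproduces that arithmetic from the "Number of data processed successfully" rows; the function and variable names are hypothetical.

```python
def accuracy_percent(successes, total):
    """Share of samples processed successfully, as a percentage."""
    return 100 * sum(successes) / total


# Counts taken from the "Number of data processed successfully" rows.
telephone_directory = [513, 70, 11, 0, 330, 0]
hot_pepper_web_data = [67, 237, 81, 0, 409, 0]

print(round(accuracy_percent(telephone_directory, 1000), 2))
print(round(accuracy_percent(hot_pepper_web_data, 1000), 2))
```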
There are cases where noise words still remain partly in shop and company names or where important character strings are erroneously removed after the LFAW processing. These are indicated by the blue numbers in Table
So far, we have developed a method for removing various kinds of noise words from shop and company names, and one for verifying that differing names may refer to the same tenant, by calculating the name similarity. In this section, the processing accuracy of our system achieved by these methods is discussed.
We compared the results of time-series data produced by our system with results created manually (and verified for correctness) in some sample areas in the South Kanto region of Japan. Input data for this verification of identity were taken from the telephone directories and digital residential maps as described in Section
Sample areas and their numbers of data.
Location of sample area | Area type | 2005 (Z) | 2005 (T) | 2000 (Z) | 2000 (T) |
---|---|---|---|---|---|
Part of Kabukicho, Shinjuku-ku, Tokyo | Bustling shopping area in city center | 191 | 88 | 210 | 94 |
Nodai dori shopping street, Setagaya-ku, Tokyo | Old shopping street around train station | 276 | 130 | 247 | 141 |
Vicinity of Amatsu port, Kamogawa city, Chiba | Port town | 121 | 82 | 136 | 94 |
Nakanogo and Kashidate districts, Hachijo island, Tokyo | Settlements in an isolated island | 140 | 105 | 144 | 121 |
Sum total | | 728 | 405 | 737 | 450 |
“Z” in number of data denotes residential map tenant data.
“T” in number of data denotes telephone directory data.
Processing accuracy of time-series integration for residential map tenant data.
System results | ||||||||
Sum total | Co | Ch | Em | De | FSI | Accuracy (%) | ||
Manual results | Continuation | 4 | 22 | 0 | 2 | |||
Change | 4 | 8 | 0 | 0 | ||||
Emergence | 4 | 2 | 0 | 0 | ||||
Demise | 8 | 3 | 0 | 0 | ||||
Sum total |
“Co”: continuation, “Ch”: change, “Em”: emergence, and “De”: demise
“FSI” means failure of spatial integration.
Processing accuracy of time-series integration for telephone directory data.
System results | ||||||||
Sum total | Co | Ch | Em | De | Sim | Accuracy (%) | ||
Manual results | Continuation | 4 | 10 | 0 | 0 | |||
Change | 1 | 2 | 0 | 0 | ||||
Emergence | 0 | 3 | 0 | 0 | ||||
Demise | 2 | 2 | 0 | 0 | ||||
Sum total |
Comparison of processing accuracy in each sample area.
Sample area | Number of data in 2005 | Number of data in 2000 | Co | Ch | Em | De | FSI | Accuracy (%) |
---|---|---|---|---|---|---|---|---|
Part of Kabukicho, Shinjuku-ku, Tokyo | 191 | 210 | 79(87) | 67(74) | 28(30) | 44(49) | 2 | |
Nodai-dori shopping street, Setagaya-ku, Tokyo | 276 | 247 | 137(154) | 48(51) | 67(71) | 38(42) | 0 | |
Vicinity of Amatsu port, Kamogawa city, Chiba | 121 | 136 | 116(116) | 0(1) | 4(4) | 19(19) | 0 | |
Nakanogo and Kashidate districts, Hachijo island, Tokyo | 140 | 144 | 106(110) | 2(3) | 27(27) | 29(31) | 0 |
First, telephone directory and residential map tenant data from 2000 were integrated spatiotemporally with the same data from 2005 over the whole South Kanto region. Then, sample data were extracted from the results of time-series integration. Finally, these automated results were compared with results manually obtained and verified as correct.
Tables
In addition, Table
It has been shown in Tables
The high processing accuracy observed with our system was achieved not only through accurate spatiotemporal integration; the method we developed, which identifies two different names as referring to the same tenant by calculating word similarity, was also essential. The results detailed in this section demonstrate that the name identification method discussed in this paper performs at a reliable level.
In this section, we will briefly discuss some examples and applications of detailed time-series datasets that can be developed by our system.
Figure
Time-series 3D map of tenants around Shinjuku terminal.
Figure
Grid map of rate of tenant change all over Japan (1 km square grid).
Grid map of rate of tenant change in parts of Japan (500 m square grid).
This is the first time that such a detailed time-series dataset with such homogeneous resolution over so broad an area has been realized in Japan. This kind of data can make a valuable contribution to solving the problems encountered in previous studies, as introduced in Section
In this paper, we discussed a method for identifying Japanese names, by quantitatively analyzing their true, “pure” similarities while ignoring pseudosimilar “noise words” within them. The most remarkable achievement of this study was its removal of LFAW, which are eccentrically located both spatially and temporally, by an
There are some future challenges to improve the identification of Japanese words. One challenge is to develop an environment that can convert effortlessly between kanji, hiragana, and katakana. Mutual conversion between hiragana and katakana is very easy because both sets of characters comprise the same set of phonograms. However, it seems difficult to convert kanji directly into hiragana or katakana because kanji are ideograms. In addition, almost all kanji in Japanese have multiple kinds of pronunciation. The development of a method to accurately and robustly convert kanji into hiragana or katakana is one of the most important tasks facing our research. Another important challenge is that of converting loanwords into katakana. We have already realized a simplified system that can do this. However, the processing accuracy of this system is inadequate, with this system converting only some English and French words into katakana precisely. Both are very difficult challenges, yet nonetheless very interesting and exciting directions for future research.
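The ease of mutual conversion between hiragana and katakana mentioned above can be shown with a brief sketch. It relies on the fact that the two scripts occupy parallel Unicode blocks separated by a fixed offset of 0x60; the function names are hypothetical, and kanji and other characters are deliberately left untouched.

```python
KANA_OFFSET = 0x60                        # distance between the two blocks
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096
KATAKANA_FIRST, KATAKANA_LAST = 0x30A1, 0x30F6


def hiragana_to_katakana(text: str) -> str:
    """Shift every hiragana character to its katakana counterpart."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST else ch
        for ch in text
    )


def katakana_to_hiragana(text: str) -> str:
    """Shift every katakana character to its hiragana counterpart."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST else ch
        for ch in text
    )


print(hiragana_to_katakana("にほん"))   # "Nihon" written in hiragana
print(katakana_to_hiragana("ニホン"))   # "Nihon" written in katakana
```

No comparable arithmetic exists for kanji, which is why converting kanji into phonograms requires reading dictionaries and context.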
The authors were given the digital telephone directory by ZENRIN CO., LTD (Telepoint Pack!) and NTT Business Information Service, Inc. (Town Page Database) and the digital residential maps by ZENRIN CO., LTD (Zmap TOWN II). Publication of this paper was supported by Earth Observation Data Integration and Fusion Research Institute (EDITORIA). They would like to thank ZENRIN CO., LTD, NTT Business Information Service, Inc., and EDITORIA for their contribution.