Knowledge graphs (KGs) are a widely used kind of semantic network. A KG provides an effective way to describe semantic entities and their relationships by extending an ontology at the entity level. This article focuses on applying KGs in the traditional geological field and proposes a novel method for constructing one. Building on natural language processing (NLP) and data mining (DM) algorithms, we analyze the key technologies for designing a KG for geological data, including geological knowledge extraction and semantic association. By extracting a geological ontology from a large number of geological documents and linked open data, we achieve semantic interconnection, design a KG framework for geological data, construct a KG application system for geological data, and support dynamic updating of the geological information. Specifically, an unsupervised learning method using linked open data is incorporated into geological document preprocessing, which ultimately generates a geological domain vocabulary. Furthermore, several application cases in the KG system are provided to show the effectiveness and efficiency of the proposed learning approach.
Geological data are the varied data and information accumulated in geological research and practical activities. They come in a wide variety of types, including geological documents, geological books, geological information and journals, physical specimens, and electronic file data [
With economic and social development, geological data sharing services have become an important measure of the level of social and business management in the field of geological survey, and they are significant in ensuring the sustainable development of geological work. Geological data are characterized by increasing volume, complex types, and long response times. For geological application problems, intelligent analysis and deep mining of geological data could reduce repetitive work and the risk of geological survey [
In recent years, knowledge services based on knowledge graph (KG) technology and semantic web search technology have become a research hot spot in information services. It is in this context that the KG has emerged [
In this article, KG construction technology is applied in geology to implement intelligent analysis and deep mining of geological data. Through an unsupervised knowledge learning method for open data sources, we not only achieve a self-learning process for a set of documents but also form a geology glossary and complete the construction of the KG. Research along these lines is valuable for promoting geological information and social services and for realizing intelligent geological survey.
The contributions of this article are as follows:
The rest of this article is organized as follows. Section
In recent years, with growing demand for geological data from production units and the public, geological data services face the dual demands of “digitization” and “socialization.” It is necessary to improve the contents and methods of geological data services so that government departments can adapt to these developments and move from archival results to service products [
The demands of maintaining geological data grow as the types and quantity of data accumulate over the long term. The data include various types of electronic files, such as documents, maps, databases (map databases, spatial databases, and attribute databases), pictures, charts, videos, and audio, which may be structured, semistructured, or unstructured. For technical reasons, this storage mode makes querying, statistics, updates, and other operations on the data inefficient, and it hinders applications such as checking, querying, and mining, which leads to low data service capability. Hence, it is important to explore how to apply the concepts and technology of big data to organize massive geological data effectively and provide the corresponding services [
Generally, diversified fragmentation is one of the most striking features of complex unstructured geological data. Data analysis and mining for such data mainly involve three tasks: establishing a content index library, search, and clustering-based recommendation [
KG is also known as science knowledge graph, knowledge domain visualization, and knowledge domains map. It is a series of various graphs that show the development process of scientific knowledge and the structure relation [
Most work on KGs originated from the Google Knowledge Graph. A KG is essentially a semantic network: the nodes represent entities or concepts, and the edges represent a variety of semantic relations between them. Moreover, KGs are motivated by a series of practical applications, including semantic search, machine question answering, information retrieval, electronic reading, and online learning. Other companies, such as Baidu and Sogou, have since launched their own KGs.
Researchers have developed many applications around KG, illustrating different perspectives in the process. For example, a visual analysis of Chinese science literature showed the time sequence distribution, journal distribution, and author distribution of scientific literature during the past 30 years [
In addition to the above applications, many scholars have also studied KG itself. Hook showed that KG has four purposes (i.e., discovery, understanding, communication, and education) and six aspects of application (i.e., microcosmic display of specific areas, macroscopic visualization of subject, assisting in the education course teaching, saving document knowledge in coordination, facilitating the use of digital library, and displaying knowledge dissemination) [
KG applications have increased rapidly in recent years; they cover several disciplines of natural science and social science and show a tendency to spread into other disciplines. Drawing and mining KGs have developed into a highly mature methodology. However, the capabilities of KG have not been fully exploited, and its application still needs to be strengthened. So far, little attention has been paid to the geological data field. Hence, it is necessary and important to consider this particular domain.
The construction of a KG for geological data consists of two logical components: knowledge extraction and knowledge management. The former learns the corresponding geological knowledge through unsupervised processing in five steps: word segmentation, frequency statistics, web crawling, keyword extraction, and relation extraction. The latter is composed of two parts: knowledge graph storage and retrieval. The specific processes are shown in Figure
The logical structure of knowledge graph construction towards geological data.
Knowledge extraction is a key step in the construction of a knowledge graph, as well as in the processing of geological documents. In this article, knowledge extraction uses an unsupervised knowledge learning method based on an open data source, and the geological domain vocabulary and knowledge graph are formed through automatic learning from a large number of geological documents. The flow of knowledge extraction is shown in Figure
The flow of knowledge extraction.
Knowledge extraction has three major steps, including data sources analysis, entity/concept extraction, and relation extraction.
Although encyclopedia content takes the form of web pages, it still contains a lot of structured information. Since every encyclopedia has its own classification system, category labels are used to organize the large number of entries. In general, each entry has a category label that indicates its type, and most entries have multiple labels. For example, the category labels of “Steve Jobs” in Wikipedia include “20th-century American business people,” “American billionaires,” “American computer business people,” and many others.
This article mainly focuses on Chinese information in Internet encyclopedias. Wikipedia is considered the Internet’s largest and most popular general reference work. However, the Chinese content in Wikipedia has shortcomings: the total number of entries is insufficient, the articles are relatively short, and some of them are translated directly from other languages and thus lack idiomatic Chinese expression. Consequently, we use Baike.com instead of Wikipedia as the data source for the web crawler in this article.
Entity/concept extraction mainly starts from these two data sources. We could filter out entities or concepts of geology directly by combining the information after text processing with category labels of Baike.com. Therefore, the entity/concept extraction includes four bottom-up steps: word segmentation, frequency statistics, web crawler, and keywords extraction.
HanLP is used for word segmentation, stop word filtering, and frequency statistics. Motivated by the TextRank algorithm, the word segmentation procedure in this article is as follows. First, we use the HanLP standard tokenizer to split documents into words tagged with their parts of speech. Second, a custom dictionary and an extended stop word list are designed. Finally, following the TextRank approach, we filter out words with little relevance to the retrieval content and retain only the designated parts of speech. We also filter out stop words, which achieves the effect of keyword extraction.
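The stop word filtering and frequency statistics steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tokens have already been produced by a segmenter (the HanLP call is omitted), and the class and method names `TermFrequency`, `count`, and `ranked` are hypothetical. Part-of-speech filtering is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: frequency statistics over already-segmented tokens.
class TermFrequency {
    // Counts how often each token occurs, skipping empty tokens and stop words.
    static Map<String, Integer> count(List<String> tokens, Set<String> stopWords) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : tokens) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue;
            }
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    // Returns the terms ordered by descending frequency; ties are broken alphabetically.
    static List<String> ranked(Map<String, Integer> freq) {
        List<String> terms = new ArrayList<>(freq.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(freq.get(b), freq.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return terms;
    }
}
```

In a full pipeline, the token list would come from the segmenter and the stop word set from the extended stop list described above.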
For the web crawler, we crawl the category labels of entries in the Internet encyclopedia with the automation tool Selenium, which can open the HtmlUnit browser, search for entries, and access the category label information programmatically. Specifically, the method for the online encyclopedia crawler is as follows. When we want to get information about a word “
In terms of keyword extraction, the geological dictionary and the category labels let us determine whether each word in the segmentation results is a geological keyword. From the statistical characteristics of Wikipedia category labels, we extract a set of keywords: geography, mining, marine, rock, hydrology, environment, natural disasters, biology, city, air, oil, roads, plants, energy, metallurgy, and civil. We put all crawled category labels into a map collection. By calling the containsKey method of the map, we determine whether the collection contains any of the keywords; if it does, the object is defined as a geological entity.
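The category-label check described above can be sketched as follows. This is an illustration under stated assumptions: the class name `GeoEntityFilter` and the method `isGeological` are hypothetical, the keywords are written in English as they appear in the text (the crawled labels would be Chinese in practice), and matching is exact because the text relies on `containsKey`.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: decides whether a term is a geological entity from its category labels.
class GeoEntityFilter {
    // Keyword list taken from the text.
    static final List<String> KEYWORDS = Arrays.asList(
        "geography", "mining", "marine", "rock", "hydrology", "environment",
        "natural disasters", "biology", "city", "air", "oil", "roads",
        "plants", "energy", "metallurgy", "civil");

    static boolean isGeological(Collection<String> categoryLabels) {
        // Put all crawled labels into a map, as the text describes.
        Map<String, Boolean> labelMap = new HashMap<>();
        for (String label : categoryLabels) {
            labelMap.put(label, Boolean.TRUE);
        }
        // The term is geological if any keyword is present as a label (exact match).
        for (String keyword : KEYWORDS) {
            if (labelMap.containsKey(keyword)) {
                return true;
            }
        }
        return false;
    }
}
```

Exact matching is strict: a label that merely contains a keyword as a substring would not match in this sketch, so a real system might need a looser comparison.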
Relation extraction combines nontaxonomic relation extraction, based on association rule analysis from data mining, with category relations from the Internet encyclopedia. The correlation between two geological terms is acquired by association rule analysis, and the category relationship of terms is acquired by crawling the Internet encyclopedia.
The basic principle of association rule analysis is that if two concepts or entities frequently appear in the same unit (e.g., a document, a paragraph, or a sentence), some relationship is likely to exist between them. We are concerned not with the specific semantic relation between two concepts but with their degree of correlation. Hence, we judge the degree of correlation between two concepts through cooccurrence analysis within documents. As the number of processed documents increases, two concepts that frequently appear together receive a higher degree of correlation. This method is also motivated by the process of human reading and learning. However, it is suitable only for large numbers of documents; when the number of documents is small, it performs poorly.
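Document-level cooccurrence counting can be sketched as follows. This is a simplified illustration, not the authors' code: each document is represented as the set of geological terms it contains, the class name `CooccurrenceCounter` is hypothetical, and the pair key assumes that terms never contain the `|` character.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: counts in how many documents each pair of terms appears together.
class CooccurrenceCounter {
    // Builds an order-independent key for a pair of terms (assumes no term contains '|').
    static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    // Each element of documents is the set of geological terms found in one document.
    static Map<String, Integer> count(List<Set<String>> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> document : documents) {
            List<String> terms = new ArrayList<>(document);
            for (int i = 0; i < terms.size(); i++) {
                for (int j = i + 1; j < terms.size(); j++) {
                    counts.merge(pairKey(terms.get(i), terms.get(j)), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

The counts grow as more documents are processed, which matches the observation that the correlation estimates become more reliable with larger corpora.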
Meanwhile, the purpose of crawling Internet encyclopedia is to obtain relationships between concepts and entities by making use of the open data source in the online encyclopedia. As mentioned above, here we mainly consider the category relationships.
Using the above two methods, the rule of our relation extraction is as follows. In terms of correlated degree, we set a relational degree
Knowledge management considers how to present the knowledge acquired through the above steps in a visualized way. The main technical methods are database storage and retrieval.
Considering the actual needs of the geological field, the system uses MySQL as the background database. MySQL is one of the most widely used relational database management systems for web applications; it has a small footprint, fast storage and retrieval, and low cost.
In our system, the entities and relationships acquired by processing geological documents are stored in a dedicated database. Background database operations (create, read, update, and delete) are performed through JDBC. There are five tables in our database. Table “articles” stores the information about the documents processed, including ID, name, added time, and local storage path of documents. Table “words” stores the information about the words filtered out from the results of segmentation, including ID, content, frequency, and category labels of words. Table “re_words_words” stores the correlation information between two geological terms.
The attributes of those tables in our background database are in Table
The attributes of tables in background database.
Table name | Attribute 1 | Attribute 2 | Attribute 3 | Attribute 4 | Attribute 5 |
---|---|---|---|---|---|
“articles” | ID | Content | Date | Path | — |
“words” | ID | Content | Frequency | Label | — |
“re_words_articles” | ID | ID1 | ID2 | Frequency | — |
“re_words_words” | ID | ID1 | ID2 | ID3 | Count |
“dictionary” | Name | Label | — | — | — |
Users can carry out retrieval only after the knowledge extracted from documents has been stored in our database. Under the browser/server (B/S) model, the browser sends a POST request to the back-end server after the user inputs the search words. The back-end server responds to the request by reading the submitted words and the number of nodes to be rendered (20 by default). The retrieved words are set as the key nodes and looked up in our database, and the results are returned to the browser. The returned contents include the ID, content, and category labels of each node and the IDs of correlated documents.
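The selection of nodes to render for a queried word can be sketched as follows. This is a minimal in-memory illustration with no database access; the class name `GraphRetrieval` and the method `topNeighbors` are hypothetical, and ranking neighbors by cooccurrence count is an assumption, since the text states only the default of 20 nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: picks the most strongly correlated neighbors of a queried term.
class GraphRetrieval {
    // Default number of nodes to render, as stated in the text.
    static final int DEFAULT_NODE_COUNT = 20;

    // correlations maps each neighbor term to its cooccurrence count with the queried term.
    static List<String> topNeighbors(Map<String, Integer> correlations, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(correlations.entrySet());
        // Sort by descending count (assumed ranking); ties are broken alphabetically.
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

In the described system, the correlation map would be loaded from the background database for the submitted word before this selection is applied.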
The backstage management system of the KG is designed to facilitate document processing and database operations for users. It mainly includes a login page, a geological document processing page, and an expert intervention page.
Two login modes are available when users open the login page by entering the URL in the browser. Users who log in as an administrator enter the geological document processing page; users who log in as an expert enter the expert intervention page. The browser submits a form containing the name, password, and login mode. The server then checks the user's authorization, and the user enters the relevant page once verified.
On the geological document processing page, users input the document name and storage path, and the background module gets the submitted form data when the “submit” button is clicked. If all of the input data is valid, the background module enters the document processing stage and stores the results in the background database. On the expert intervention page, experts have the right to add and delete correlations between two words. For example, when adding a correlation, the expert enters the two words in the input box and clicks the “submit” button. The browser submits these two words to the background module, which checks whether a correlation already exists between them. If it does not, the background module adds a correlation labeled “expert-defined.”
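The add and delete logic of expert intervention can be sketched as follows. This is an in-memory illustration only: the class name `ExpertCorrelations` and its methods are hypothetical, and the described system stores correlations in the background database rather than in a map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: tracks correlations between word pairs and where each one came from.
class ExpertCorrelations {
    // Maps an order-independent pair key to the origin of the correlation.
    private final Map<String, String> origins = new HashMap<>();

    // Assumes that words never contain the '|' character.
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    // Adds an "expert-defined" correlation only if none exists yet; returns true if added.
    boolean add(String a, String b) {
        return origins.putIfAbsent(key(a, b), "expert-defined") == null;
    }

    // Deletes the correlation between the two words; returns true if one existed.
    boolean delete(String a, String b) {
        return origins.remove(key(a, b)) != null;
    }

    // Returns the origin label of the correlation, or null if there is none.
    String origin(String a, String b) {
        return origins.get(key(a, b));
    }
}
```

Because the key is order-independent, submitting the same two words in reverse order does not create a duplicate correlation.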
The prototype KG system for geological big data is designed and implemented using the B/S architecture and the HTTP protocol; it involves natural language processing (NLP), data mining, web application development, and other related technologies. The key technologies and solutions involved in system development are described as follows.
HanLP is a Java toolkit composed of a series of models and algorithms, whose target is to promote the application of NLP in the production environment. HanLP supports Chinese word segmentation. Its functions include
With the above composition diagram, we could calculate the weight of each word node. Then, the iterative formula in TextRank algorithm is as follows:
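For reference, the standard TextRank update for a weighted word graph, as published by Mihalcea and Tarau, can be written as below. The notation is an assumption and may differ from the symbols used by the authors.

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)
```

Here $WS(V_i)$ is the weight of word node $V_i$, $In(V_i)$ is the set of nodes pointing to $V_i$, $Out(V_j)$ is the set of nodes that $V_j$ points to, $w_{ji}$ is the weight of the edge from $V_j$ to $V_i$, and $d$ is the damping factor, commonly set to 0.85. The update is iterated until the node weights converge.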
User defined dictionary.
We add to the custom dictionaries a large number of words that help the word segmentation of geological documents. Here, the “CustomDictionary” includes 21,742 geological words, the “OrganizationDictionary” includes 31,926 institutional nouns, the “ChinesePlaceDictionary” includes 90,558 place names, the “PeopleNameDictionary” includes 50,192 personal names, and the “ModernChineseDictionary” includes 207,964 additional modern Chinese words. Among them, “CustomDictionary” is a global user-defined dictionary: entries can be added or deleted at any time, and changes affect all word segmentation.
On the basis of our analysis mentioned above, it is effective and efficient to integrate the online encyclopedia crawler technology into the processing flow of geological documents, which need to get the category labels of words obtained by word segmentation in Wikipedia. The mainstream method of crawler is implemented using URL address webs which can be obtained through depth or breadth first search strategy. Here, the web site that we need to crawl is fixed (i.e.,
Selenium is a browser automation tool mainly applied to the automated testing of web applications; it also supports automating web-based administration tasks. With the Selenium IDE plug-in embedded in the browser, simple browser operations can be recorded and played back.
It should be noted that Selenium provides a rapid and convenient way to crawl a fixed web site. Here, we use Selenium to control HtmlUnit, a headless browser written in Java, for automated crawling. The specific process mainly includes opening the HtmlUnit browser, reading a search word “
Implementation details of Internet encyclopedia crawler are as follows: Open HtmlUnit browser: Open the interface of search word “ Locate the label element:
A Java Servlet is a Java program that extends the capabilities of a server. Although Servlets could respond to any type of request, they are most commonly used to implement applications hosted on web servers. Such web Servlets are the Java counterpart to other dynamic web content technologies, such as PHP and ASP.NET.
A Servlet is a Java class in Java EE that conforms to the Java Servlet API, a standard for implementing Java classes that respond to requests. Servlets could communicate over any client-server protocol, but they are most often used with HTTP, so “Servlet” is often used as shorthand for “HTTP Servlet.” A software developer may use a Servlet to add dynamic content to a web server using the Java platform. The generated content is commonly HTML but may be other data such as XML. Servlets could maintain state in session variables across many server transactions by using HTTP cookies or URL rewriting.
Servlets could be generated automatically from JavaServer Pages (JSP) by the JSP compiler. Architecturally, JSP could be viewed as a high-level abstraction of Java Servlets. It allows Java code and certain predefined actions to be interleaved with static web markup content, such as HTML, with the resulting page being compiled and executed on the server to deliver a document. JSPs are translated into Servlets at runtime, and each JSP Servlet is cached and reused until the original JSP is modified.
Servlets can complete the following tasks: The web container initializes the Servlet instance; then the Servlet instance could read data that has been provided in the HTTP request. The Servlet instance could create and return a dynamic response page to the client. The Servlet instance could access server resources, such as files and database. The Servlet instance could prepare dynamic data for the JSP and create a response page with JSP.
In this article, the Servlets and their key functions that we design under com.servlet package are shown in Table
The Servlets and their key functions.
Servlet name | Key functions |
---|---|
“Myservlet.java” | It is used in the retrieval of KG, and it gets the form data submitted by the user and retrieves them. |
“Myservlet2.java” | It is used in the second retrieval. When clicking on some word in the page, the user can get the graph of this word. |
“LoginServlet.java” | It is used in the login function of the backstage management system of KG, and it gets the form data submitted by the user and enters the response page. |
“AddServlet.java” | It is used while adding a relationship in expert intervention page. |
“DelServlet.java” | It is used while deleting a relationship in expert intervention page. |
“CoreServlet.java” | It is used while showing the intermediate processing for geological documents. |
In summary, the software platforms and development environments of our system are as follows. The operating system is Windows 7. The programming language is Java. The programming environment is MyEclipse 10. The web development environment is Tomcat + Servlet + JSP. The web crawler environment is Selenium + HtmlUnit.
Processing for a single geological document is as shown in Figure
Processing for geological documents.
The document is processed using the similar method in [
Some results of segmentation in our KG system are shown in Figure Some results of segmentation by NLPIR systems of Beijing Institute of Technology [
Some results of segmentation.
The updated version of Figure
Some results of segmentation in NLPIR system.
The updated version of Figure
According to the process in [
The results of word frequency statistics are as shown in Figure
The results of word frequency statistics.
The updated version of Figure
Figure
The results of keywords extraction.
The results of category labels crawled from the Internet encyclopedia (
The results of Internet encyclopedia crawler.
The updated version of Figure
The specific process of retrieval in KG is shown in Figure
The specific process of retrieval.
After processing one geological document, the results of retrieving “ After processing 100 geological documents, the results of retrieving “
An experimental result after processing one geological document.
An experimental result after processing 100 geological documents.
Figures
From the comparison of two retrieval processing stages, we could see that the results of KG have been improved with a growing number of documents processed. When the number of processed documents is 1, the retrieval results have little relevance with the retrieved word. However, when the number is 100, we could get entities that have a very close relationship with “
In addition, we could get the following information from the above results. The top 20 geological terminologies associated with “ The category labels for every geological terminology. The ID of documents in which both words appear.
Furthermore, some complicated phrases and sentences can also be processed correctly. For example, when inputting “
An experimental result after searching several key geological words.
Similarly, we could get the following information from the above results. We can get the top 20 geological terminologies associated with “ We can get the category labels for every geological terminology in KG. We can get the ID of documents in which both words appear. While retrieving the two words, we can get the documents in which both words appear, and it achieves mining of implicit related documents. In addition, we can see the following:
In terms of “ In terms of “
Geological professionals know that “
When geological documents are processed, new geological terms and their category labels are obtained by the web crawler and added to our expanded geological domain dictionary.
In our experiments, the geological domain dictionary originally contained 11,062 words; after processing 100 documents, it contained 13,227 words. Some entries of the geological domain dictionary are shown in Figure
The geological domain dictionary.
This article proposes a novel approach to constructing a KG for geological data. The proposed approach uses an unsupervised learning method with linked open data to process geological documents and extract knowledge directly. Through this approach, we achieve an effective self-learning process for documents, form a geology glossary, and complete the construction of the KG based on document processing and dictionary expansion. Furthermore, we design a KG application system based on the B/S model. Finally, tests on a large number of geological documents are conducted, and satisfactory results are obtained. In future work, the knowledge extraction approach will be further improved, taking the features of geological data into account, to obtain more accurate entities and relations.
The authors declare that there is no conflict of interests regarding the publication of this article.
This work was partially supported by the Special Funds Project for Scientific Research of Public Welfare Industry from the Ministry of Land and Resources of China under Grant 201511079 and the National Key Technologies R&D Program of China under Grant 2015BAK38B01.