Concept-centered semantic maps were created based on a text-mining analysis of PubMed using the BiblioEngine_v2018 software. The objects (“concepts”) of a semantic map can be MeSH-terms or other terms (names of proteins, diseases, chemical compounds, etc.) structured in the form of controlled vocabularies. The edges between the two objects were automatically calculated based on the index of semantic similarity, which is proportional to the number of publications related to both objects simultaneously. On the one hand, an individual semantic map created based on the already published papers allows us to trace scientific inquiry. On the other hand, a prospective analysis based on the study of PubMed search history enables us to determine the possible directions for future research.
Today, the number of papers and citations is considered the main indicator of scientific output [
MeSH indexing terms (the main headings and subheadings) serve as a rich resource for extracting a broad range of domain knowledge [
In this study, we developed an idea of combined MeSH-based profiles of the published and recently viewed papers in the form of a MeSH-centered semantic network. Past research in this field including an analysis of MeSH indexing patterns consists of Unified Medical Language System [
An example of the use of automated text processing is the approach implemented in the BiblioEngine_v2018 software, which is based on the analysis of the summaries of scientific publications on medical and biological topics from the PubMed/MEDLINE database. Automated analysis and comparison of the MeSH-terms associated with the papers enable us to form groups of relevant literature according to user-defined criteria and to highlight the key concepts within these groups, expressed as relationships between MeSH-terms. This approach was applied to the automated analysis of scientific papers on the molecular mechanisms of the onset of Alzheimer’s disease, the investigation of drug transport systems, and the discovery of natural compounds with therapeutic properties [
In this work, concept-centered semantic maps were created based on a text-mining analysis of PubMed using the BiblioEngine_v2018 software. The edges between the two objects (
The proposed approach is demonstrated by creating an individual cognitive map of Professor Alexander Archakov, a biochemist and one of the most famous Russian scientists in the life sciences. Archakov A.I. is the founder of a scientific school in the field of molecular organization studies. His interests focus on the functioning of the oxygenase cytochrome P450-containing system and on the molecular mechanisms, structure, and function of membranes and biological oxidation. As a pioneer in the development of proteomics in Russia, Archakov A.I. headed the Chromosome 18 Team in the international Human Proteome Project [
To download a list of relevant literature published in PubMed, we used the query [“
To quantify the semantic relationships between the nodes of the semantic map (MeSH-terms here), the BiblioEngine software was used [
To assess how the scientific interests have changed over time, all the papers published were divided into groups according to the release date (starting from the year 1970). Thus, we have formed five groups including the articles published in 1970-1980, 1980-1990, 1990-2000, 2000-2010, and 2010-2017, respectively. The time intervals were chosen in such a way that, on the one hand, the selections of papers were comparable in volume, and, on the other, they covered a period that was significant for the research. For each time interval, a list of PMIDs and the corresponding MeSH was formed.
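The grouping step described above can be sketched in a few lines of Python. This is only an illustration, not the BiblioEngine implementation: the input format (tuples of PMID, publication year, and MeSH-terms) and the function names are hypothetical, and because the listed intervals share their boundary years, the sketch assumes half-open intervals in which a boundary year belongs to the later interval.

```python
from collections import defaultdict

# Assumed half-open intervals [start, end); the last one ends at 2018 so that
# papers from 2017 are included. The source lists overlapping boundary years,
# so this assignment rule is an assumption.
INTERVALS = [(1970, 1980), (1980, 1990), (1990, 2000), (2000, 2010), (2010, 2018)]


def interval_for(year):
    """Return the (start, end) interval containing the year, or None."""
    for start, end in INTERVALS:
        if start <= year < end:
            return (start, end)
    return None


def group_by_interval(papers):
    """Group papers into time intervals.

    papers: iterable of (pmid, year, mesh_terms) tuples (hypothetical format).
    Returns {interval: {pmid: [mesh_terms]}}.
    """
    groups = defaultdict(dict)
    for pmid, year, mesh_terms in papers:
        key = interval_for(year)
        if key is not None:  # papers outside all intervals are skipped
            groups[key][pmid] = list(mesh_terms)
    return dict(groups)
```

Each resulting group holds the PMIDs of one interval together with their MeSH-terms, which is the per-interval list that the later analysis steps work on.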
The semantic concept-centered map was created based on the calculated matrix of semantic similarity between the nodes (MeSH-terms) and then visualized in the Cytoscape v.3.6.1 program [
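To make the notion of a similarity matrix concrete, the following Python sketch computes a pairwise score from co-occurrence of MeSH-terms in the same papers. The text states only that the index is proportional to the number of publications related to both terms; the Jaccard-style normalization used here (shared papers divided by papers mentioning either term) is an assumption chosen for illustration and is not claimed to be the formula used by BiblioEngine. The input format and function names are likewise hypothetical.

```python
from itertools import combinations


def build_term_index(papers):
    """Map each MeSH-term to the set of PMIDs annotated with it.

    papers: {pmid: [mesh_terms]} (hypothetical input format).
    """
    index = {}
    for pmid, terms in papers.items():
        for term in terms:
            index.setdefault(term, set()).add(pmid)
    return index


def similarity_matrix(papers):
    """Return {(term_a, term_b): score} for term pairs sharing at least one paper.

    Assumed score: |papers with both terms| / |papers with either term|.
    """
    index = build_term_index(papers)
    sim = {}
    for a, b in combinations(sorted(index), 2):
        shared = len(index[a] & index[b])
        if shared:
            sim[(a, b)] = shared / len(index[a] | index[b])
    return sim
```

The resulting pairs can be written out as a source/target/weight table, a format that Cytoscape can import as a network.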
The nodes formed a separate cluster if the measure of semantic similarity between the nodes was in the range of 0.75 to 1. For each time interval, the MeSH-centered clusters were visualized as a semantic network. Additional information for annotation of the semantic network was obtained from GoPubMed system [
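One way to read the clustering rule above is as a graph operation: keep only the edges whose similarity lies between 0.75 and 1, then treat each connected group of nodes as a cluster. The Python sketch below follows that reading. It is an assumption, since the text gives the threshold but not the grouping algorithm, and the input format (a dictionary of scores keyed by term pairs) is hypothetical.

```python
def clusters_from_similarity(sim, threshold=0.75):
    """Group terms into clusters of nodes linked by high-similarity edges.

    sim: {(term_a, term_b): score} (hypothetical input format).
    Returns a list of sets; terms without any qualifying edge are omitted.
    """
    # Build an undirected adjacency map from edges within [threshold, 1].
    adjacency = {}
    for (a, b), score in sim.items():
        if threshold <= score <= 1.0:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    # Collect connected components with an iterative depth-first search.
    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters
```

Under this reading, two terms can end up in the same cluster through a chain of strong links even if their direct similarity is below the threshold; a stricter interpretation would require every pair in a cluster to exceed it.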
For the creation of the MeSH-centered semantic network, based on search history, BioKnol [
Over the period from 1968 to 2017, Alexander Archakov authored more than 600 publications, of which 424 papers are indexed in the PubMed system. This number is sufficient for demonstrating the
The most cited articles (citation data obtained from Scopus) were published in journals such as Proteomics, Biosensors & Bioelectronics, Journal of Proteome Research, Biochemical and Biophysical Research Communications, and Biochemistry and Molecular Biology International, according to GoPubMed annotations.
The subjects of all published articles were presented in the form of a MeSH-terms cloud (see Figure
The cloud of MeSH-terms associated with the publications of Alexander Archakov in PubMed. The font size is proportional to the frequency of MeSH-term occurrence in the published articles.
For a retrospective analysis of changes in the priorities of the scientific school founded by Alexander Archakov, we analyzed the frequency of MeSH-terms occurrence for the scientific papers published between 1970 and 2010 (see Figure
A fragment of the cognitive map created based on the analysis of scientific papers of Professor Alexander Archakov. The nodes are MeSH-terms associated with the papers. The edges between the two objects were automatically calculated based on the index of semantic similarity, which is proportional to the number of publications related to both objects simultaneously. (a) The period from 1970 to 2010. The scientific papers of each decade are presented on A1-A4 fragments. The MeSH-terms common to the A1-A4 fragments are marked with (
During the first decade under consideration (1970-1980), the specificity of the work performed is related to the study of biological membranes and electron transport, which is specified by the presence of terms such as “
The last decade of the previous century (since 1991) signaled the beginning of active bioinformatics research on the structure and function of proteins of the cytochromes P450 superfamily. This period is specified by the presence of decade-specific MeSH-terms such as “
After 2010, more than 40 articles led by Alexander Archakov were published, the main MeSH-terms of which are presented in Figure
Thus, the first cluster (B1) reflects studies in the field of bioinformatics, the development of information systems for storage of proteomic data (“
The department of proteomic research was established in 2001 and has grown significantly over the past ten years, during which Russia took part in the international Human Proteome Project [
The work performed under the leadership of Alexander Archakov resulted in an increased knowledge in the field of molecular functioning of living systems that is necessary for improving the methods of diagnosis and treatment of diseases and potentially cost-effective for the use in medical practice. Thus, the other two fragments, B4 and B5, are connected with personalized medicine. The B4 fragment specifies nanotechnology studies, namely, the development of new drugs (see MeSH “
Citation is one of the knowledge-intensive indicators that indirectly enables us to assess the relevance of published work. It is believed that citation rates in rapidly developing scientific fields are higher than in others. When analyzing publication activity and bibliometric indicators, the citation statistics of the published works stand out; according to Google Scholar (
The maximum number of citations (2552) was received by the book
The most cited papers in journals are related to proteomics and participation in the international projects [
MeSH-terms associated with the 20 most cited publications of Alexander Archakov. The font size is proportional to the frequency of MeSH-term occurrence. The distance between the nodes is random.
The scientist’s cognitive map constructed based on the
Therefore, two sources of MeSH-terms were considered. On the one hand, there were the frequencies of MeSH-terms occurring in the published articles written by the author (or coauthors). On the other hand, there were the frequencies of MeSH-terms characterizing the scientific papers read by the same author. The difference between the MeSH profiles of written/published and read papers is useful for capturing new trends in personal research interests. For instance, in Professor Archakov’s sample for the period from 2014 to 2017, the terms “Proteomics”, “Enzyme Activation”, and “Membrane Proteins” were significantly more prevalent in the published articles than in the articles read. This observation reflects the field of scientific interests and the process of knowledge actualization in this area (see Figure
A fragment of the cognitive map constructed based on an analysis of the frequencies of MeSH-terms for the scientific papers viewed by the scientist in 2012-2013. In the foreground, there is a fragment of the MeSH network associated both with the articles viewed and with the author’s articles published in the latest years (2014-2017).
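The comparison of the published and viewed MeSH profiles can be illustrated with a short Python sketch. It assumes that a profile is the share of papers annotated with each MeSH-term and that the two profiles are compared by simple subtraction; the text does not specify how the difference was computed, so both choices, as well as the function names and input format, are hypothetical.

```python
from collections import Counter


def mesh_profile(papers):
    """Share of papers annotated with each MeSH-term.

    papers: list of MeSH-term lists, one list per paper (hypothetical format).
    """
    total = len(papers)
    if total == 0:
        return {}
    # set(terms) ensures a term is counted once per paper.
    counts = Counter(term for terms in papers for term in set(terms))
    return {term: n / total for term, n in counts.items()}


def profile_difference(published, viewed):
    """Rank terms by (share in published papers) minus (share in viewed papers).

    Positive values mark terms that prevail in published papers; negative
    values mark terms that prevail in viewed papers.
    """
    pub, view = mesh_profile(published), mesh_profile(viewed)
    terms = set(pub) | set(view)
    diff = {t: pub.get(t, 0.0) - view.get(t, 0.0) for t in terms}
    return sorted(diff.items(), key=lambda kv: kv[1], reverse=True)
```

Terms at the top of the ranking prevail in the published papers, while terms at the bottom appear mainly in the viewed papers and may point to emerging interests.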
MeSH-terms that appeared more frequently in published papers include the terms “
Figure
From our point of view, the retrospective part of an individual semantic map is less subject to temporal changes. This mainly depends on the age of the researcher: the younger the scientist, the more changes can be expected on the retrospective map, including radical changes in its structure, such as the emergence of a new cluster of terms when a researcher changes their place of work or field of study. The prospective part of a map is more susceptible to changes since it depends on the cognitive interest of the scientist. The higher a person's cognitive activity, the greater the number of scientific directions (and consequently MeSH-terms) that can be reflected in a personal semantic map. The predictive power of a MeSH-centered semantic map depends on the hierarchy level of its nodes, the MeSH-terms. The more (or less) detailed the level of the MeSH-term hierarchy on a prospective semantic map, the more (or less) precisely one can infer the specificity of the scientific papers.
As an example of the BiblioEngine software capabilities, we shall consider the creation of a semantic map in the field of scientific developments of John Craig Venter (JC Venter), who had a significant impact on the development of postgenomic technologies and molecular biology. John Craig Venter is a biologist, businessman, and a cofounder of the Institute for Genomic Research and J.
Figure
A fragment of the MeSH-centered semantic map created based on the analysis of John Craig Venter’s publications. The nodes are MeSH-terms associated with the articles.
Another subgraph consisted of the terms “
The approach suggested can also be illustrated by constructing the semantic networks for proteins mentioned in the articles published in
Protein-centered semantic map created based on the analysis of papers from the journal “Nature”: (a) 2008-2011; (b) 2016-2018.
Used together, BiblioEngine and BioKnol allow users to form concept-centered semantic networks (maps) that organize the knowledge available in PubMed in real time. The networks represent relationships between various objects: genes (proteins), MeSH-terms, chemical compounds, names, terms from expert-compiled dictionaries, names that identify the scientific output of an individual or a group, etc.
Semantic networks were visualized using Cytoscape according to the similarity matrix, and the distance between the nodes (concepts) was correlated with the normalized number of popular scientific articles. Relevant fragments of a network, as well as a list of PMIDs for each relationship detected, can be provided to an expert for further analysis.
This work shows the possibility of using the BiblioEngine software package combined with the BioKnol social network for an automated text-mining analysis of scientific literature and the creation of a personal cognitive map. Using the scientific publications of Alexander Archakov as an example, a semantic map of key MeSH-based concepts was constructed, allowing us to trace the main scientific directions of his work. The practical use of the data obtained is directly related to the possibility of predicting the scientific output of an individual or a group of researchers. One of the proposed methodical solutions is to use the results of a text analysis of the articles viewed, since this indicator is one of the most important factors in a scientist's search for new research directions. This study demonstrates a correlation between viewed articles, already published articles, and those that will be published in the future. The tested methodological approaches can be applied to other authors and are of practical significance for developing modern approaches to the evaluation of scientific output.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.
The work was done within the framework of the State Academies of Sciences Fundamental Scientific Research Program for 2013-2020.
Supplementary Table
Supplementary Note 1: BiblioEngine Toolkit Manual.