Medical entity recognition, a basic task in the language processing of clinical data, has been extensively studied in analyzing admission notes in alphabetic languages such as English. However, much less work has been done on nonstructural texts that are written in Chinese, or in the setting of differentiation of Chinese drug names between traditional Chinese medicine and Western medicine. Here, we propose a novel cascade-type Chinese medication entity recognition approach that aims at integrating the sentence category classifier from a support vector machine and the conditional random field-based medication entity recognition. We hypothesized that this approach could avoid the side effects of abundant negative samples and improve the performance of the named entity recognition from admission notes written in Chinese. Therefore, we applied this approach to a test set of 324 Chinese-written admission notes with manual annotation by medical experts. Our data demonstrated that this approach had a score of 94.2% in precision, 92.8% in recall, and 93.5% in F-measure for the recognition of traditional Chinese medicine drug names and 91.2% in precision, 92.6% in recall, and 91.7% F-measure for the recognition of Western medicine drug names. The differences in F-measure were significant compared with those in the baseline systems.
The existing natural languages are mainly divided into alphabetic and logographic languages. Logographic languages contain written characters representing words or phrases, and the best known logographic language is Chinese, while alphabetic languages are made of written characters representing sounds or sound combinations rather than concepts [ Is there a generalized approach for efficient MER for nonstructural EHRs written in a logographic language? Is it feasible to selectively extract concrete medical features from free narrative clinical texts, so that supervised algorithms can fulfill MER tasks in clinical ANs written in a logographic language? Among some supervised algorithms [e.g., support vector machine (SVM), conditional random field (CRF) with specific characteristic types] that are widely applied in NER, which of them are applicable for MER in ANs written in a logographic language?
Here, we proposed a new cascade-type MER approach called cascade-type Chinese medication entity recognition (CCMER) for processing admission notes in Chinese. We aimed to test CCMER in clinical texts in Chinese (a logographic language) and investigate the roles of 4 feature types based on supervised algorithms [
Results showed that the performance was significantly improved (
Across the world, key information search in abundant nonstructural ANs through computer-based resolution was used to assist better clinical decisions as early as 1967. Depending on the purposes and contexts, key information search can be realized either by information retrieval (IR) at document level (recognition and identification of relevant AN documents from massive free text-formatted ANs) [
The MER of early MLP systems usually adopted rule-based methods [
However, most of the studies described above are limited to clinical ANs written in alphabetic languages. Studies for NER in logographic languages are relatively few. With the wide application of EHRs in China, it is urgent to test the reliability of the secondary use of abundant Chinese clinical ANs. So far, little effort has been devoted to NER in ANs written in logographic languages, especially in Chinese. As reported, several ML-based algorithms including CRF, SVM, and maximum entropy have been used to identify symptom descriptions and pathogenic mechanisms from TCM EMRs [
Here, based on document-level IR of ANs [
Based on the standard writing formats of WM and TCM prescriptions in medical orders [
Drug name entities and definitions of medication events.
Semantic type | Definition |
---|---|
Drug name entity | Names of TCM or WM drugs used in clinical treatment, including single drugs or drug combinations, general names of drugs, and TCM-specific decoctions and paste formula |
Name of WM drug | Drugs used in WM, including chemical name, trade name, common name, and prescription name |
Dose of WM | Dosage of WM for each patient each time as per doctor’s advice |
WM use method | Methods to use WM by a patient as per the doctor’s advice |
WM use frequency | Time interval for use of a single dose of WM as per the doctor’s advice |
Name of TCM drug | Different types of TCM products made from TCM materials or with TCM materials as raw materials, in TCM or TCM-WM combined treatment |
Dose of TCM | A TCM-exclusive concept, including 3 cases: potions, tablets, and pills. Weight of TCM potions and number of TCM tablets or pills following the doctor’s advice |
TCM drug form | A TCM-exclusive concept or a TCM application mode in adaptation to requirements of treatment or prevention following the doctor’s advice |
TCM overall dose | A TCM-exclusive concept or the total number of medications following the doctor’s advice |
TCM use frequency | Time interval for use of a single dose of TCM following the doctor’s advice |
TCM use method | Methods to use TCM by a patient following the doctor’s advice |
TCM use requirements | A TCM-exclusive concept or the conditions met by a patient to use TCM following the doctor’s advice |
In the actual clinical environment, medical orders regarding TCM or WM may contain only WM and TCM drug named entities without any other descriptive terms of the related drug use events.
To improve the reliability of manually annotated drug names, we established detailed annotation rules based on the above basic definitions. We subdivided medications into Western medication and traditional Chinese medication and further improved TCM-related annotation rules. The name annotation rules of core drugs are listed in Table
Name annotation rules of core drugs.
Number | Rule description |
---|---|
1 | Drug information should be recorded in the ANs, including the names of disease-treating or symptom-relieving drugs (e.g., TCM, WM, and biological agents). Drug name defined here describes one drug, one drug combination, or one medical product |
2 | The modifiers indicating a change of drug use or a patient’s drug use duration should not be included in the annotated drug name |
3 | A drug name entity is annotated as one phrase. For instance, the drug “nifedipine GITS” would usually be annotated as two phrases (nifedipine and GITS), whereas here the whole drug name is annotated as a single drug name entity |
4 | For TCM, Chinese characters indicating drug forms, such as “丸” (pill), “粉” (powder), and “汤” (decoction), cannot be annotated as single characters, because they are usually placed as the last characters within certain TCM drug names, and thus, the drug names should be annotated as a whole. For example, in the TCM drug name “大青龙汤” (da qing long decoction), “da qing long” is the pinyin of the Chinese characters “大青龙,” while the Chinese character “汤” is a drug form meaning “decoction” |
5 | The explicit negative modifiers around the drug names are not included in the annotated drug name entity |
6 | When Chinese drug name and the corresponding English name coexist in one short description without other words between them, they are jointly annotated as one drug name entity |
7 | When Chinese drug name and the corresponding English name coexist in one short description with simple symbols such as “/” or “-” between them, they are jointly annotated as one drug name entity |
8 | We also have seen parallel construction or ellipsis construction in some drug names. If two drug names are connected by one conjunction, the two drug names should be annotated as two separate drug entities |
9 | In some situations, certain words and punctuations in a drug name entity are ignored. Then, the following rules are used: |
10 | If two or more valid drug names end with the same characters and are combined together, then, the last drug name with the ending characters is taken as one complete drug name. For instance, in a description of two drug names ofloxacin and vitamin C injection, vitamin C injection is recognized as one complete drug name entity |
11 | Drug names usually contain figures, letters, and other symbols. Since these symbols represent drug-related information (e.g., (), <>), they are included in one drug name entity |
12 | When the TCM name and the description of the producing area coexist in the drug name, the information of the producing area is ignored. For instance, in Chuan Bei Mu (Zhejiang), Zhejiang is ignored |
13 | Specification may follow a drug name that does not belong to a drug name entity and may not need separate annotation. For example, in the drug name Cold Clear Capsule (a capsule with 24 mg of paracetamol), the specification is in brackets |
14 | Maximum annotation length of drug name entities should be set and followed, except when such a limitation of annotation length destroys the validity of the grammar structure. In particular, when modifiers of a drug name contain special information about a brand and pattern and form an agglutinate structure within the drug name, these modifiers should be included in the drug name entity. For example, in the drug name “苗泰小儿柴桂退烧颗粒” (pinyin translation is “miao tai xiao er chai gui tui shao ke li”), “苗泰” (pinyin: “miao tai”) is a drug brand and should not be excluded |
In total, 1000 of the 120,000 CEHRTG-based ANs recorded in SAHZU between January 2011 and December 2012 were extracted randomly. After deletion of incomplete ANs, 972 ANs were kept and then anonymized before manual annotation: private information including ID, name, sex, age, and reception department was deleted. Then, two native Chinese-speaking nurses annotated the names of clinical WM and TCM drugs according to the predefined guideline. To measure the interannotator agreement (IAA), the nurses independently annotated the same 100 ANs, and a senior doctor evaluated the annotations and arbitrated disagreements. Potential issues were identified and the predefined annotation guideline was modified when necessary. The most difficult step for concept annotation of Chinese drug names is the determination of the boundary of an expression. Using the modified annotation guideline, the nurses manually annotated drug names in the remaining 872 admission notes. Thus, 100 notes were annotated by both nurses, and the remaining 872 ANs were evenly divided between the two nurses.
Then, the total 972 ANs were divided into 2 sets: 2/3 (648 ANs) as a training set and 1/3 (324 ANs) as a test set. The feature set for a classifier’s optimal performance was determined by 10-fold cross-validation using the training set. The performance was then assessed using the test set. The statistics of used data are listed in Table
Dataset scales used in this study.
Dataset name | Number of ANs | Number of sentences | Number of sentences mentioning drug name | Number of annotated WM drug name entities | Number of annotated TCM drug name entities | Number of annotated drug name entities |
---|---|---|---|---|---|---|
Training set | 648 | 40,649 | 1665 | 1322 | 487 | 1809 |
Test set | 324 | 20,397 | 716 | 581 | 209 | 790 |
Total | 972 | 61,046 | 2381 | 1903 | 696 | 2599 |
The 972 ANs contain 61,046 sentences, with 2381 mentioning drug names. In total, there are 2599 drug named entities, including 1903 WM ones (73.2%) and 696 TCM ones (26.8%). Based on the 100 ANs annotated by both nurses, the IAA measured by the kappa statistic was 0.968, which indicates that the annotation was reliable.
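As an illustration of how an agreement value of this kind can be computed, the following is a minimal sketch of Cohen's kappa for two annotators. The unit of comparison (here, one label per character) and the toy label lists are assumptions made for the example; the text does not state the unit the authors used.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical character-level labels from two annotators.
nurse_1 = ["B", "I", "O", "O", "B", "O", "O", "O"]
nurse_2 = ["B", "I", "O", "O", "O", "O", "O", "O"]
print(round(cohen_kappa(nurse_1, nurse_2), 3))
```

Kappa corrects the raw agreement rate for agreement expected by chance, which matters here because the "O" label dominates.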
CCMER is a new-pipelined cascade-type framework scheme. It locates hot-text chunks based on sentence classification and recognizes drug named entities from candidate target sentences via a supervised algorithm for sequence annotation. The system structure (Figure
High-level architecture for the CCMER.
The preprocessing module directly runs on the original document set. Because of specific writing habits, ANs with TCM-WM combined treatment contain abundant figures, Chinese characters, and English words. So such AN documents should be normalized first so as to reconstruct the inputted sentences, which can then be processed by the standardized natural language processing (NLP) tool. For instance, numbers and Greek letters are normalized to 9 and @, respectively. Then, with the character-to-pinyin function on Microsoft Office Word 2011, we transformed all Chinese characters into pinyin. Finally, the sentence splitter from the adjusted ICTCLAS system [
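The normalization step can be sketched as follows. This is a minimal illustration only: it assumes that each run of ASCII digits (with an optional decimal part) collapses to a single "9" and that each Greek letter becomes "@". The text does not say whether the authors normalize digit by digit or run by run.

```python
import re

# Assumption: a whole number such as "0.5" or "300" becomes one "9".
DIGIT_RUN = re.compile(r"[0-9]+(?:\.[0-9]+)?")
# Basic Greek capital and small letters.
GREEK = re.compile(r"[\u0391-\u03A9\u03B1-\u03C9]")

def normalize(text):
    """Replace numbers with '9' and Greek letters with '@'."""
    text = DIGIT_RUN.sub("9", text)
    return GREEK.sub("@", text)

print(normalize("左氧氟沙星 0.5 g, α干扰素 300 万单位"))
```

Collapsing all numbers to one symbol lets later features such as HasNum9 fire on any dose, regardless of its value.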
Though CRF-based recognizers have achieved outstanding performances in different sequence annotation tasks [
To delete such confusing data from the dataset, we used an SVM-based [
When the NLP system is trained by a text set composed of characters, it usually describes texts by selecting phrase characteristics first. The occurrence of a phrase is considered the most important phrase characteristics in the bag-of-words model. However, accurate segmentation of Chinese words or phrases is not easy, especially in clinical texts [
In this feature extraction function, each sentence is described as a vector of features.
For a feature vector
Here, we use a 6-dimensional feature vector, comprising features based on clinical knowledge, statistics, and linguistics. It is specifically defined as follows:
SF1: The number of formal symbols (e.g., “<,” “>,” “(,” and “),” commonly used in clinical medication rules) contained in the current sentence. If SF1 = 0, the sentence does not contain such descriptive symbols.
SF2: Do the Chinese drug name terms contained in the Chinese drug name dictionaries appear in the current sentence?
SF3: Does the pinyin corresponding to the drug name terms contained in the drug name dictionaries appear in the current sentence? The idea behind this is that “An inputted Chinese character string is mapped into a voice coding sequence, which is the pinyin pronunciation or a rough approximation of the inputted string,” since the Chinese translation of English WM drug names is mainly based on transliteration. In actual clinical EHR writing, owing to the wide use of the Chinese pinyin input method, when the Chinese characters of a drug name in an AN contain writing or printing errors, the corresponding pinyin spelling may still be correct. Different Chinese character strings with the same pronunciation are therefore mapped to the same voice-encoded string, since they have the same pinyin spelling. This approach is essentially a form of feature clustering and can be used to correct many writing and printing errors.
SF4 and SF5: The features SF4 and SF5 are represented by the sum of frequency-weighted values of statistical word features (WFs) in the current sentence. An
SF6: Does the predefined drug use event phrase collocation template appear in the current sentence? Through observation of drug use events, we defined some phrase modes commonly used in drug use events, such as
[NUMBER + DOSEPATTERN];
[MODEPATTERN];
[FREQUENCYPATTERN];
[MODEPATTERN + CANDIDATE MEDICATION NAME];
[CANDIDATE MEDICATION NAME + FREQUENCYPATTERN].
The collocation of these phrases is a favorable indicator of drug use events.
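The pinyin-based clustering described for the third feature can be illustrated with a small sketch. The character-to-pinyin table below is a hypothetical toy mapping with tones ignored, and the function names are introduced only for this example; the authors' own conversion tool and dictionaries are not reproduced here.

```python
# Toy mapping, for illustration only; "林" and "淋" share the syllable "lin".
TOY_PINYIN = {"阿": "a", "莫": "mo", "西": "xi", "林": "lin", "淋": "lin"}

def to_syllables(text, table=TOY_PINYIN):
    """Map each character to its pinyin; unknown characters stay unchanged."""
    return [table.get(ch, ch) for ch in text]

def count_pinyin_hits(sentence, drug_names, table=TOY_PINYIN):
    """Count dictionary names whose pinyin sequence occurs in the sentence."""
    sent = to_syllables(sentence, table)
    hits = 0
    for name in drug_names:
        key = to_syllables(name, table)
        for i in range(len(sent) - len(key) + 1):
            if sent[i:i + len(key)] == key:
                hits += 1
    return hits

dictionary = ["阿莫西林"]            # amoxicillin
sentence = "予阿莫西淋口服"          # same sound, one wrong character
print("阿莫西林" in sentence)        # exact character match fails
print(count_pinyin_hits(sentence, dictionary))
```

Matching on syllable sequences rather than on a joined string avoids accidental hits that span syllable boundaries.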
The AN of one patient contains “Anti-inflammation treatment with provision of (Ofloxacin) Levofloxacin tablets Q.D 0.5 g for 7 continuous days in a local clinic,” which was transformed by a feature transform function into a 6-dimension feature vector:
< 2, 2, 2, 0.045, 0, 4 > ➔ MEDICATION.
The first feature indicates that the number of special characters is 2.
The second feature indicates that the candidate drug name terms appear 2 times in the current sentence.
The third feature indicates that the pinyin description corresponding to the candidate drug name terms appears 2 times in the current sentence.
The fourth and fifth features indicate that positive feature words, but not negative feature words, appear in the current sentence, with a frequency weight of positive feature words equal to 0.045.
The sixth feature indicates that the phrase templates commonly used in <MEDICATION> appear 4 times in the current sentence.
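A rough sketch of how three of these counts (SF1, SF2, and SF6) might be computed for the example sentence is shown below. The dictionary entries and the two regular-expression templates are hypothetical stand-ins, so the SF6 count will not necessarily match the value in the vector above, which depends on the authors' own templates.

```python
import re

FORMAL_SYMBOLS = set("<>()")
# Hypothetical dictionary entries and templates, for illustration only.
DRUG_TERMS = ["Ofloxacin", "Levofloxacin tablets"]
EVENT_PATTERNS = [
    re.compile(r"\b\d+(?:\.\d+)?\s*(?:g|mg|ml)\b"),   # [NUMBER + DOSEPATTERN]
    re.compile(r"\b(?:Q\.D|B\.I\.D|T\.I\.D)\b"),      # [FREQUENCYPATTERN]
]

def sentence_features(sentence):
    """Return the counts (SF1, SF2, SF6) for one sentence."""
    # SF1: number of formal symbols in the sentence.
    sf1 = sum(ch in FORMAL_SYMBOLS for ch in sentence)
    # SF2: whole-word, case-sensitive occurrences of dictionary terms.
    sf2 = sum(len(re.findall(r"\b" + re.escape(term) + r"\b", sentence))
              for term in DRUG_TERMS)
    # SF6: number of matches of the drug use event templates.
    sf6 = sum(len(p.findall(sentence)) for p in EVENT_PATTERNS)
    return sf1, sf2, sf6

SENTENCE = ("Anti-inflammation treatment with provision of (Ofloxacin) "
            "Levofloxacin tablets Q.D 0.5 g for 7 continuous days in a local clinic")
print(sentence_features(SENTENCE))
```

The word-boundary match keeps "Ofloxacin" from being counted a second time inside "Levofloxacin".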
Then, we used the classifier, a stochastic gradient descent module [
Chinese ANs are special in some ways. For instance, a majority of Chinese names of WM drugs are actually the transliteration from foreign words, so an extra disambiguation of word segmentation is needed, which is much more complicated than the automatic word segmentation of general Chinese named entities. Existing common tools and methods for Chinese word segmentation thus cannot be directly applied to word segmentation of clinical ANs, which necessitates the customization for the medical field. Therefore, we did not conduct word segmentation and part-of-speech analysis commonly used in text processing. We used selected Chinese characters as the basic annotation units instead, because Chinese characters are the most basic sentence-composing units and also contain semantics.
The annotation of drug named entities can be transformed into a sequence annotation task:
The objective is to construct an annotator
The actual annotation task is finished by the CRF-based supervised ML sequence annotator, which was trained by an annotated corpus, while the training set is composed of a sequence pair (
In this module, we use a feature set containing 6 types of features (see Table
List of various features for the drug name recognizer.
Feature set | Features | Description |
---|---|---|
F1-1 | CWS = 1 | The 1-gram, 2-gram, and 3-gram of the character text at CWS = 1 |
F1-2 | CWS = 2 | The 1-gram, 2-gram, and 3-gram of the character text at CWS = 2 |
F1-3 | CWS = 3 | The 1-gram, 2-gram, and 3-gram of the character text at CWS = 3 |
F1-4 | CWS = 1 | The 1-gram, 2-gram, and 3-gram of the pinyin corresponding to the current character at CWS = 1 |
F1-5 | CWS = 2 | The 1-gram, 2-gram, and 3-gram of the pinyin corresponding to the current character at CWS = 2 |
F1-6 | CWS = 3 | The 1-gram, 2-gram, and 3-gram of the pinyin corresponding to the current character at CWS = 3 |
F2-1 | InDictTCM | Are the current character and the surrounding characters contained in the TCM dictionary? |
F2-2 | InDictTCMPinyin | Are the pinyins corresponding to the current character and the surrounding characters contained in the TCM dictionary? |
F2-3 | InDictWM | Are the current character and the surrounding characters contained in the WM dictionary? |
F2-4 | InDictWMPinyin | Are the pinyins corresponding to the current character and the surrounding characters contained in the WM dictionary? |
F3-1 | CurC | Do the current character and subsequent characters contain the TCM dosage unit? |
F3-2 | CurC | Do the current character and subsequent characters contain the WM dosage unit? |
F3-3 | PreC | Do the characters before the current character contain the TCM dosage unit? |
F3-4 | PreC | Do the characters before the current character contain the WM dosage unit? |
F3-5 | CurC | Do the current character and subsequent characters contain the TCM usage term? |
F3-6 | CurC | Do the current character and subsequent characters contain the WM usage term? |
F3-7 | PreC | Do the characters before the current character contain the TCM usage term? |
F3-8 | PreC | Do the characters before the current character contain the WM usage term? |
F3-9 | CurC | Do the current character and subsequent characters contain the TCM drug form unit? |
F3-10 | CurC | Do the current character and subsequent characters contain the WM drug form unit? |
F3-11 | PreC | Do the characters before the current character contain the TCM drug form unit? |
F3-12 | PreC | Do the characters before the current character contain the WM drug form unit? |
F3-13 | CurC | Do the current character and subsequent characters contain the TCM frequency description? |
F3-14 | CurC | Do the current character and subsequent characters contain the WM frequency description? |
F3-15 | PreC | Do the characters before the current character contain the TCM frequency description? |
F3-16 | PreC | Do the characters before the current character contain the WM frequency description? |
F4-1 | HasNum9 | Do the current character and the surrounding characters include the figure “9”? |
F4-2 | HasToken@ | Do the current character and the surrounding characters include the symbol “@”? |
F4-3 | HasEnglishAlphabets | Do the current character and the surrounding characters include English letters? |
F4-4 | HasTime | Do the current character and the surrounding characters contain a time description such as hour, week, date, or year? |
F5 | InListSectionName | Does the name of the AN section involving the current character and the surrounding characters appear in the predefined section list? |
F6 | Class | These three features indicate the type labels of the 3 characters before the current character |
F1 is composed of 2 dimensions: texts and pronunciation. The ANs with TCM-WM combined treatment contain relatively simple short narrations (e.g., short subsentences) and abbreviations of technical terms. Usually, the average length of a Chinese phrase is 2 Chinese characters [
F2 has two dimensions including texts and pronunciation, namely, the Chinese characters in the current context window and the corresponding pinyin appearing in the drug name dictionary. This simple dictionary lookup approach uses the forward maximum match algorithm to search the drug name dictionary (defined in section “drug name dictionaries and lists of relevant terms”).
F3 indicates whether the Chinese characters or letters in the current context window match the terms in the lists related to drug named entities. It also uses the forward maximum match algorithm to search the term lists (defined in section “drug name dictionaries and lists of relevant terms”).
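The forward maximum match lookup used for the dictionary and term-list features can be sketched as below. This is a generic version of the algorithm with a hypothetical toy dictionary, not the authors' implementation.

```python
def forward_maximum_match(text, dictionary, max_len=None):
    """Greedy left-to-right lookup; returns (start, end, term) with end exclusive."""
    if max_len is None:
        max_len = max((len(term) for term in dictionary), default=0)
    matches, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink by one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                matches.append((i, i + size, candidate))
                i += size
                break
        else:
            i += 1  # no dictionary term starts at this position
    return matches

# Hypothetical toy dictionary; "青龙" is a shorter entry nested in "大青龙汤".
toy_dict = {"大青龙汤", "青龙", "阿莫西林"}
print(forward_maximum_match("予大青龙汤及阿莫西林", toy_dict))
```

Because the longest candidate is tried first, the nested shorter entry is never emitted once the full drug name has matched.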
Other relevant mode features were contained in the
F5 indicates whether the name of the section that contains the current
F6 is the annotation category of the first 3 characters before the current character.
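To make the character-level features more concrete, the following sketch builds the n-gram features and three of the mode features for one character. The symmetric window, the feature key names, and the function itself are assumptions for illustration; the exact feature templates are not given in the text.

```python
import re

def char_features(chars, i, cws=3):
    """Feature dict for chars[i] using `cws` characters on each side (assumed)."""
    lo, hi = max(0, i - cws), min(len(chars), i + cws + 1)
    window = chars[lo:hi]
    feats = {}
    for n in (1, 2, 3):
        for j in range(len(window) - n + 1):
            offset = lo + j - i  # start of the n-gram relative to chars[i]
            feats[f"{n}gram[{offset}]"] = "".join(window[j:j + n])
    text = "".join(window)
    feats["HasNum9"] = "9" in text
    feats["HasToken@"] = "@" in text
    feats["HasEnglishAlphabets"] = re.search(r"[A-Za-z]", text) is not None
    return feats

# Input is assumed to be normalized already (numbers replaced by "9").
sentence = list("予阿莫西林9g口服")
print(char_features(sentence, 2, cws=1))
```

Dictionaries of this shape can be handed to most CRF toolkits as per-token feature maps, one per character in the sentence.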
In the annotated narrative text, the annotation “BIO” is resolved as follows: “B” indicates that the character is at the beginning of the drug named entity, “I” shows the character to be in the middle or at the end of the drug named entity, and “O” indicates that the character does not belong to the drug named entity. To guarantee the consistency of character labels and the integrity of name recognition, we also used some simple heuristic rules (see Table
Rules used in the postprocessing module.
Number | Description of postprocessing rules |
---|---|
1 | If the label “O” is followed by the label “I,” then, “I” is forcefully resolved to the same-type label “B” |
2 | If “B” is followed by a different-type label “I,” then, “I” is forcefully resolved to “B,” such as B-WM I-TCM ➔ B-WM B-WM |
3 | In Chinese ANs, the end of a drug name is rarely followed by another completely different therapeutic drug. In this case, we established the following rules, such as B-WM I-WM B-WM I-WM ➔ B-WM I-WM I-WM I-WM |
4 | If a drug name entity only contains “),” but not “(,” the starting position of the current drug name entity is moved ahead, while the label “B” is repositioned at the position of “(” |
5 | If “)” is annotated as label “O” and it immediately follows the end of the Chinese characters of the recognized drug name, then, this field end is expanded by one character to involve “)”; otherwise, the starting position of the field of the drug name is adjusted to be discarded “(” |
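Two of these rules lend themselves to a short sketch. The code below applies simplified readings of rules 1 and 3 and then converts the repaired labels to entity spans; rules 2, 4, and 5 are deliberately left out, and the function names are introduced only for this example.

```python
def repair_bio(labels):
    """Apply simplified rules 1 and 3 to labels such as 'B-WM', 'I-TCM', 'O'."""
    fixed = []
    for label in labels:
        prev = fixed[-1] if fixed else "O"
        if label.startswith("I-") and prev == "O":
            label = "B-" + label[2:]   # rule 1: "O" followed by "I" becomes "B"
        elif label.startswith("B-") and prev != "O" and prev[2:] == label[2:]:
            label = "I-" + label[2:]   # rule 3: merge adjacent same-type names
        fixed.append(label)
    return fixed

def bio_to_spans(labels):
    """Return (start, end, type) spans, end exclusive, from a label list."""
    spans, start, kind = [], None, None
    for i, label in enumerate(labels + ["O"]):   # sentinel closes the last span
        continues = label.startswith("I-") and kind == label[2:]
        if start is not None and not continues:
            spans.append((start, i, kind))
            start, kind = None, None
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start, kind = i, label[2:]
    return spans

raw = ["O", "I-WM", "I-WM", "B-WM", "I-WM", "O"]
print(repair_bio(raw))
print(bio_to_spans(repair_bio(raw)))
```

Processing the labels left to right against the already repaired prefix means a fix at one position is visible to the next.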
First, we constructed baseline system 1, which used the maximum matching algorithm [
Regarding the characteristics and difficulty in MER recognition in TCM-WM combined Chinese ANs, we used 1 type of soft evaluation indicators [
Rules used in evaluation.
Score | Rule description |
---|---|
1 | Medication entity is accurately detected, and divisions of class and boundary are both correct |
0.8 | Only one error is detected at the start position of the ME boundary |
0.6 | Only one error is detected at the end position of the ME boundary |
0.4 | Two errors are detected at the start and end positions of ME boundaries, respectively |
0 | ME is not detected, or the detected phrase is not a drug name entity annotated in the gold standard |
It should be noted that in clinical practice, “静/B-WM 滴/I-WM 恩/I-WM 度/I-WM 复/O 查/O 有/O 进/O 展/O” is more reliable than “静/O 滴/O 恩/B-WM 度/I-WM 复/I-WM 查/I-WM 有/O 进/O 展/O,” so an error at the start position of an ME boundary is penalized less (score 0.8) than an error at the end position (score 0.6).
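The scoring rules can be expressed as a small function. This is a sketch under two assumptions that the table does not spell out: a predicted entity with the wrong class scores 0, and a boundary "error" means that the predicted boundary differs from the gold boundary while the two spans still overlap.

```python
def soft_score(gold, pred):
    """Score a predicted (start, end, type) span against gold; end is exclusive."""
    g_start, g_end, g_type = gold
    p_start, p_end, p_type = pred
    overlaps = p_start < g_end and g_start < p_end
    if g_type != p_type or not overlaps:
        return 0.0   # missed entity or wrong class (assumption for wrong class)
    start_ok, end_ok = g_start == p_start, g_end == p_end
    if start_ok and end_ok:
        return 1.0
    if end_ok:
        return 0.8   # only the start boundary is wrong
    if start_ok:
        return 0.6   # only the end boundary is wrong
    return 0.4       # both boundaries are wrong

# The two examples from the text: the gold entity "恩度" spans characters 2-4.
gold = (2, 4, "WM")
print(soft_score(gold, (0, 4, "WM")))   # "静滴恩度": start boundary wrong
print(soft_score(gold, (2, 6, "WM")))   # "恩度复查": end boundary wrong
```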
We first tested the performance of baseline system 1. As shown in Table
Performance of the baseline system 1 based on professional drug dictionaries and the maximum matching algorithm between drug name characters and pinyin.
Entity type | Precision | Recall | F-measure |
---|---|---|---|
TCM drug name | 49.2% | 45.5% | 47.3% |
WM drug name | 54.9% | 49.1% | 51.8% |
All drug names | 53.2% | 48.0% | 50.5% |
Note: to the nearest 0.1%.
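The F-measure values in this table agree with the balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that this standard definition is the one used.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall (in %) of baseline system 1, as reported in the table.
baseline_1 = [("TCM drug name", 49.2, 45.5),
              ("WM drug name", 54.9, 49.1),
              ("All drug names", 53.2, 48.0)]
for name, p, r in baseline_1:
    print(name, round(f_measure(p, r), 1))
```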
The confusion matrix in Table
Confusion matrix of outputs from the filtration module of potential hot-sentence classification.
Classification | Medication | No medication | Total |
---|---|---|---|
Medication | 695 | 21 | 716 |
No medication | 57 | 19,624 | 19,681 |
Total | 752 | 19,645 | 20,397 |
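The classifier's precision and recall follow directly from this matrix. The sketch below assumes that rows are gold-standard classes and columns are classifier outputs (the row total of 716 equals the number of test sentences mentioning drug names), and it derives the two diagonal cells from the reported row totals.

```python
def classifier_metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy for the 'Medication' class."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Off-diagonal cells and row totals as reported in the table.
fn, gold_medication_total = 21, 716
fp, gold_other_total = 57, 19681
# Diagonal cells derived from the row totals.
tp = gold_medication_total - fn
tn = gold_other_total - fp

metrics = classifier_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(name, round(value, 4))
```

With so many negative sentences, accuracy is dominated by the "No medication" class, so precision and recall are the more informative figures here.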
Figure
Precision, recall, and F-measure obtained by CRF with different features under different CWS settings.
With various feature sets, the performances of the CRF-based MER system with the candidate hot-sentence subset are shown in Figure
Precision, recall, and F-measure obtained by CRF with different features under the CWS = 3 setting.
The new approach obviously outperforms baseline system 1 based on professional drug dictionaries. As shown in Figure
Precision, recall, and F-measure obtained by the baseline systems versus the CCMER system.
Moreover, we also conducted experiments, built baseline system 3 (see Table
Performance of the baseline system 3 based on CCMER (not on the use of hot-sentence detection) (feature sets: F1 + F3 + F4 + F5 + F6).
Entity type | Precision | Recall | F-measure |
---|---|---|---|
TCM drug name | 73.4% | 71.3% | 72.3% |
WM drug name | 70.2% | 72.3% | 71.4% |
All drug names | 71.0% | 72.3% | 71.6% |
Note: to the nearest 0.1%.
Here, we manually built a dataset involving 972 annotated ANs containing TCM-WM combined treatment. Based on this, we tested a new approach, CCMER, and investigated its performance under different feature allocations. The performance of CCMER is significantly improved versus that of the baseline system 1, as the F-measures of TCM and WM drug named entity recognitions are increased by 45.8% and 38.2%, respectively. The deletion of abundant irrelevant sentences from the dataset results in largely improved operation efficiency.
The optimal performance occurs with the use of a feature set (F1 + F3 + F4 + F5 + F6), as the F-measure of overall drug name recognition is 41.9% higher than that using the baseline system. This indicates that the feature sets with different dimensions are modestly complementary and also proves that the results from Meystre et al. [
We then preliminarily studied the contributions of single features to the drug named entity annotation. First, the use of small-scale medical drug name dictionaries (F2) does not improve the system performance. This is not surprising because the same type of information was already captured by F1 and certain drug name entries in F2 lacked comprehensive and detailed information about the drugs. Unfortunately, our self-compiled drug name dictionaries are of small scales. The system performance can be further improved if foreign resources such as Chinese version RxNorm [
Moreover, we find hot-sentence detection in AN texts to be a key factor affecting system performance. The hot-sentence detection technique determines the focus areas of the texts and thus filters out a large amount of noise. Removing the filtration module for the classification of potential hot sentences alone would greatly reduce system performance.
Meanwhile, we found during TCM drug name recognition that TCM is subdivided into Chinese medicinal herbs and Chinese patent drugs. The names of Chinese medicinal herbs are usually composed of 2 to 3 (mean 2.57) Chinese characters, while Chinese patent drugs are preparations made from TCM materials through modern pharmaceutical approach/process complying with quality standards. Their named entities combine the characteristics of both Chinese medicinal herbs and WM; thus, the recognition rates of these drug names are very low. For instance, for Heartleaf
Among the annotated results with a soft evaluation score of zero, the major error source is the recognition of general drug terms, such as anticoagulants, antibiotics, compound vitamins, and antihypertensive drugs. These general terms were included in the gold standard because, like specific drug named entities, they may indicate important drug use events. However, owing to a lack of support from fine-grained information sources and medical knowledge, the current system cannot recognize them. This is one direction for future research.
Another common error involves drug names that appear in the test set but not in the training set. One advantage of the supervised ML system is that it can capture drug names in the test set that are absent from the training set; this robustness is attributed to its ability to capture context information. As discussed above, though we annotated 648 ANs as a training set, an annotated dataset at this scale still cannot fully cover the test set. For instance, this system detects “Amoxy” from “Amoxy (Amoxicillin) 0.5 g PO TID” as a drug name, though this drug name is not in the training set. We think that the system learns from the training set through a context mode “<drug name><dosage><drug use approach><frequency>.” On the other hand, the system cannot detect “Amoxy” in “Amoxy tests negative,” because this context mode does not appear in the training set.
Owing to the time pressure of clinical work, doctors often abbreviate drug names; for example, “vitamin A and vitamin C” is often written as “vitamin A, C.” These two drug names share the common beginning characters “维生素 (vitamin),” and their combination is abbreviated to a new, simpler name combination. Such abbreviated descriptions, which omit the shared beginning or ending characters, do not contain “and” or “or.” Thus, unlike compound descriptions in general texts, recognizing the common beginning or ending characters in these abbreviated descriptions yields only the first or last drug name in the combination, while all the other drug names are ignored.
Moreover, standard clinical guidance of treatment based on diagnosis has not been extensively followed in the medical institution where our samples were collected. Doctors in this institution prescribe medication according to previous experience for most diseases. Thus, for the same disease with the same symptoms, doctors may prescribe different drugs, leading to a low frequency of individual drug named entities. Taking transfusion medicines as an example, except for solvents like glucose infusion and normal saline solutions, about 50.3% of the medical solvents that are used for transfusion appear only once, which is one cause of the low recognition rates of WM drug names.
Our approach also has some limitations. First, we only tested the ANs from one data source with one pattern from one medical center. Though CEHRTG is an HL7 CDA R2-based [
Here, we targeted texts written in Chinese, a typical logographic language; attempted MER in nonstructural texts regarding TCM-WM combined treatment; and proposed a new cascade-type approach, CCMER. This approach avoids the side effects due to abundant negative samples and improves the recognition performance of drug named entities in the logographic (Chinese) descriptions. We think that this approach may provide some reference value for MLP of other logographic languages. We also conducted many fine experiments. We found that the
MER: Medication entity recognition
AN: Admission note
TCM: Traditional Chinese medicine
WM: Western medicine
CCMER: A cascade-type Chinese medication entity recognizer
SVM: Support vector machine
CRF: Conditional random field
i2b2: Informatics for Integrating Biology and the Bedside
NTCIR: NII Testbeds and Community for Information access Research
NER: Named entity recognition
EHR: Electronic health record
CEHRTG: Chinese Electronic Medical Record Template Guide
MLP: Medical language processing
ML: Machine learning
IR: Information retrieval
SAHZU: Second Affiliated Hospital Zhejiang University School of Medicine
NLP: Natural language processing
IAA: Interannotator agreement
CWS: Context window size
IE: Information extraction
WF: Word feature
FP: False positive
SF1: In symbol list
SF2: In Chinese drug name dictionaries
SF3: In pinyin dictionaries
SF4: Positive set of statistical word features
SF5: Negative set of statistical word features
SF6: In event patterns
F1: The local context feature
F2: Feature of drug name dictionary
F3: Feature of drug name entity-related terms
F4: Feature of mode
F5: Feature of global AN structure
F6: Feature of category annotation.
The authors alone are responsible for the content and writing of the paper.
The authors report no conflicts of interest.
Jianbo Lei developed the conceptual framework and research protocol for the study. Jun Liang drafted the manuscript and Xiaojun He and Jianbo Lei made major revisions. Xuemei Xian and Meifang Xu annotated the corpus, Sheng Dai evaluated the annotations, and Jun Liang, Jianbo Lei, and Xiaojun He conducted the experiment and data analysis. Jun’yi Xin, Jie Xu, and Jian Yu provided comments and made revisions to the manuscript. All authors read and approved the final manuscript.
The authors are grateful to Professor Wei Zhu, Dr. Ting Chen, and Ms. TingXue Sun for their helpful discussion and suggestions. This work was supported by the Medical and Health Planning Project of Zhejiang Province (no. 2017KY386) and Medical and Health Planning Project of China (no. 2015109528) from the National Health and Family Planning Commission, the Key Project of the National Research Program of China Grant (no. 2015BAH07F01), and the National Natural Science Foundation of China (NSFC) (no. 81171426 and no. 81471756).