Because they depend mostly on voluntarily submitted spontaneous reports, pharmacovigilance studies are hampered by the low quantity and quality of patient data. Our objective is to improve postmarket safety studies by enabling safety analysts to seamlessly access a wide range of EHR sources for collecting deidentified medical data sets of selected patient populations and tracing the reported incidents back to original EHRs. We have developed an ontological framework where EHR sources and target clinical research systems can continue using their own local data models, interfaces, and terminology systems, while structural interoperability and Semantic Interoperability are handled through rule-based reasoning on formal representations of different models and terminology systems maintained in the SALUS Semantic Resource Set. The SALUS Common Information Model at the core of this set acts as the common mediator. We demonstrate the capabilities of our framework through one of the SALUS safety analysis tools, namely, the Case Series Characterization Tool, which has been deployed on top of the regional EHR Data Warehouse of the Lombardy Region, containing about 1 billion records from 16 million patients, and validated by several pharmacovigilance researchers with real-life cases. The results confirm significant improvements in signal detection and evaluation compared to traditional methods, which lack this background information.
All medicinal products are subject to strict testing and assessment of their quality, efficacy, and safety before being authorized. While premarket safety analysis through clinical trials remains vital, there is considerable attention towards improving the reporting and collection of postmarket data to enhance patient safety. After authorization, all medicinal products continue to be observed through pharmacovigilance studies to monitor their safety profiles. Currently, pharmacovigilance activities are mainly based on signal detection studies run on voluntarily submitted spontaneous reports. Although spontaneous reporting remains a cornerstone of pharmacovigilance in the regulatory environment and is indispensable for signal detection, due to examples of drug withdrawals [
The current postmarket drug surveillance process has several bottlenecks, with the first one being underreporting [
For these reasons, there is a clear need for complementary pharmacovigilance activities. Relative to Individual Case Safety Reports (ICSRs), Electronic Health Records (EHRs) cover extended parts of the underlying medical histories, include more complete information on potential risk factors, and are not restricted to patients who have experienced a suspected ADE [
This paper presents the interoperability framework developed in the SALUS (Scalable, Standard based Interoperability Framework for Sustainable Proactive Post Market Safety Studies) project [
Postmarket safety studies cover a wide area where various analyses can be done by following different approaches. Therefore, as one of the first activities in the SALUS project, we have identified the concrete pilot application scenarios to be implemented. We have agreed on six pilot application scenarios, four of which are specific safety analysis methods for different purposes, while the remaining two are focused on semiautomatic ADE notification and reporting.
Our work in this paper first provides the underlying Semantic Interoperability Framework that is used commonly in all six pilot application scenarios. Furthermore, this paper focuses on the implementation and validation of one of the safety analysis methods, namely, the
Recently, a number of investigators have examined potential use cases for secondary use of EHR data in clinical research and patient safety contexts including eligibility determination, clinical trial data collection, adverse event reporting, and conduction of epidemiological studies [
Although reuse of EHRs for safety studies has a great potential, a major barrier is that information systems in patient care and clinical research domains are not interoperable with each other. This is due to the fact that different reference information models (as models of use) such as HL7 RIM [
There are several efforts for addressing this interoperability challenge. Some approaches like OMOP [
Integrating the Healthcare Enterprise (IHE) profiles [
Several other efforts like Artemis [
When it comes to addressing Semantic Interoperability mismatches due to the use of different terminology systems, in some efforts like epSOS [
We believe that addressing syntactic and Semantic Interoperability cannot be separated from each other, since the binding between models of use and models of meaning also has an impact on Semantic Interoperability [
TRANSFoRm project also proposes a unified framework for representing structural and semantic models to address the interoperability problem [
The aim of a case series characterization study is to evaluate the validity of a potential signal, that is, the effect of a specific drug on a specific event. In particular, the safety analysts working at UMC and national/regional pharmacovigilance bodies are trying to find answers to such problems. What differs between the patients having a What is the proportion of male patients in both foreground and background populations, where patients using A signal of
Safety analysts need to access medical data sets of selected foreground and background populations from disparate EHR systems to be able to check whether there are other explanations more likely to cause a specific event (e.g.,
SALUS Framework enables the execution of such use case scenarios through a series of integrated semantic and technical interoperability components, as displayed in Figure
Components of the SALUS architecture involved in case series characterization implementation.
In the upcoming sections, all these components displayed in Figure
SALUS Framework provides the Case Series Characterization Tool (CSCT) as a Web application, which enables the safety analyst to formally define the characteristics of foreground and background populations. It is possible to define eligibility criteria by expressing several different clinical statements, such as conditions, medications, lab results, and procedures, which are retrieved from a common model, SALUS Common Information Model (CIM). Such criteria are represented by selecting coded values from terminology systems; for example, the medical event of interest can be defined by selecting
Eligibility criteria definition interface of the CSCT. In this example, the analyst defines
The tool also enables the safety analyst to configure the statistics to be calculated for grouping and stratifying data sets of the eligible populations, such as age, gender, and common conditions/medications before/after medication/event of interest. The coded data can be configured to be grouped under a preferred terminology system and level in the results, for example, MedDRA High Level Group Terms (HLGT), no matter which specific terminology system is used in the EHR sources. Finally, it is possible to define a number of coded risk factors to be specifically checked on both populations. These represent the possible confounding factors of the selected conditions in the eligibility criteria that need to be checked in the medical summaries of the eligible patients, such as diabetes and obesity.
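The configuration described above can be pictured as a small data structure. The following is only a minimal sketch, not the CSCT's actual model: the class names, the field names, the placeholder codes, and the choice of "drug and event" versus "drug only" for the two populations are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class CodedCriterion:
    """One coded clinical statement used as an eligibility criterion."""
    kind: str    # e.g. "condition", "medication", "lab_result", "procedure"
    system: str  # terminology system the code is taken from
    code: str    # placeholder values below, not real terminology codes


@dataclass
class Population:
    name: str
    criteria: List[CodedCriterion] = field(default_factory=list)


@dataclass
class StudyDefinition:
    foreground: Population
    background: Population
    # Preferred terminology system and level for grouping the results.
    group_conditions_by: str = "MedDRA HLGT"
    # Possible confounders to be checked in both populations.
    risk_factors: List[CodedCriterion] = field(default_factory=list)


# Hypothetical criteria; the system name and codes are placeholders.
drug = CodedCriterion("medication", "EXAMPLE-DRUG-SYSTEM", "DRUG-PLACEHOLDER")
event = CodedCriterion("condition", "MedDRA", "EVENT-PLACEHOLDER")

study = StudyDefinition(
    foreground=Population("drug and event", [drug, event]),
    background=Population("drug only", [drug]),
    risk_factors=[CodedCriterion("condition", "MedDRA", "DIABETES-PLACEHOLDER")],
)
```

Keeping the grouping preference and the risk factors next to the population definitions mirrors the way the analyst configures all three in one place before the query is sent.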
The eligibility criteria need to be passed to disparate EHR sources, and the deidentified medical data sets should be retrieved for the eligible patients. After aggregation, these medical data sets need to be analysed to calculate the statistical information asked by the safety analyst. However, there are several challenges: (i) divergent data models are used to represent EHRs and (ii) several different terminology systems are used to code structured patient data. In our architecture, we address these problems by formalizing the local models of EHR sites and semantically aggregating them using a common model, which we call SALUS Common Information Model (CIM). SALUS CIM is linked with ontological representations of terminology systems; hence, before the statistics are calculated on the aggregated data represented in CIM, terminology reasoning is handled to address not only structural but also semantic mismatches between data sources and the requestor.
There are two EHR sources in SALUS. The first is a Regional Health Data Warehouse (DWH) maintained in the Lombardy Region in Italy, which collects and extracts all data necessary for administrative and statistical purposes from almost all the public healthcare providers. It has been operational since 2002 and covers the medical records of around 16 million patients. This DWH holds around 1 billion records covering hospitalizations, ambulatory events, chronic conditions, drug prescriptions, allergies, vaccinations, and pregnancies. Its main advantage is providing longitudinal data from all public healthcare providers at the primary, secondary, and tertiary levels. In addition, all data in the DWH are structured and coded. In SALUS, we use a copy of the DWH, both to eliminate unnecessary data (e.g., financial) and to avoid affecting the regular operation of the system. This DWH has a monthly update mechanism according to the data flow time process of the regional DWH. The second source is the AGFA ORBIS installation used as the EHR system at Technical University of Dresden (TUD) Hospital, which is the largest hospital structure in Saxony, Germany, with 21 clinics. For use in SALUS, access to a live backup of the operational TST1 database is provided, which includes data of around 950 thousand patients with around 75 million records including 13 million diagnoses, 2 million medications, and 56 million lab results.
In SALUS, we follow a nondisruptive approach and collect EHR data in the local models used by the EHR systems. These can be based on interface standards as in the case of Lombardy DWH, which can provide medical data represented in CCD/Patient Care Coordination (PCC) templates [
Before SALUS, Lombardy Regional Health Infrastructure was already able to produce and exchange patient summary documents complying with CCD/PCC templates within the scope of epSOS project [
The EHR RDF Service gets the data of the eligible patients from TIDSQS in native XML representation of the CCD/PCC templates, after which data formalization takes place. In order to perform comprehensive transformations of XML Schemas (XSD) and XML data to RDF automatically, we have implemented a tool named Ontmalizer [
A simple HL7 CDA observation instance for
A slightly different approach is followed on the TUD side. Instead of data exchange through some content standards, a SPARQL [
SALUS Common Information Model (CIM) ontology forms the core of the SALUS Semantic Resource Set (see Figure
The Semantic Resource Set as the backbone enabling the SALUS Semantic Interoperability Framework.
During the requirements analysis phase, we collected all the clinical data requirements of our six pilot application scenarios, one of which is case series characterization. Although the requirements of our pilot applications were our main driver, we also analysed and took into account content models from other standards and initiatives, in order to provide a common mediator that can interoperate with the well-established state of the art. These include HL7/ASTM CCD and IHE PCC templates, HITSP C32/C83 components [
As a result, we have built a list of Common Data Elements (CDEs) that include elements to be present within a medical summary, such as patient demographics, encounter, condition (problem, diagnosis), allergy, family history, and healthcare provider, and their subelements [
Composed of 211 CDEs, SALUS CIM ontology acts as a mediator among different content models. SALUS CIM ontology not only represents entities that can be presented within a medical summary, but also establishes a link with the terminology system ontologies that are used to code patient data.
SALUS CIM also covers the query model to express eligibility criteria for defining a population of interest. For this purpose, we mainly benefited from the query model of HL7 HQMF and created its semantic representation within the SALUS CIM ontology.
None of the above-mentioned existing models is satisfactory enough in terms of scope to meet the requirements of observational studies on its own. Therefore, we had to develop the SALUS CIM as a harmonization of several well-accepted content models used in the clinical care and observational study domains.
In our architecture, Semantic Interoperability Layer-Data Services (SIL-DSs) for Lombardy and TUD are responsible for converting the medical summaries of the eligible population represented in local ontologies, that is, CDA/CCD Content Entity Model instances received from EHR RDF Service and ORBIS Content Entity Model instances received from TUD SPARQL Endpoint to instances represented in SALUS CIM Ontology. In order to perform this operation, a set of conversion rules in Notation 3 (N3) [
Content Entity Models and conversion rules are part of the SALUS Semantic Resource Set. Whenever a new content model is to be introduced in the SALUS architecture, it is necessary to define the conversion rules from the corresponding (i.e., formalized) entity model to the SALUS CIM Ontology as the common mediator. This is a one-time manual process. Although the CIM has become quite mature after several iterations, it may still not cover a new content model completely. In this case, the CIM is extended without disrupting the existing data elements, so that it covers the new content model to be mapped while preserving the existing conversion rules.
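The mediation idea, one set of conversion rules per source model and all of them targeting the same common model, can be sketched as follows. This is only an illustration in Python: the SALUS rules are written in N3, and every field name and record below is a hypothetical placeholder rather than an actual CDA/CCD, ORBIS, or CIM element.

```python
from typing import Dict

Record = Dict[str, str]

# One rule set per source content model; each maps local field names
# (hypothetical) to field names of a simplified common model (hypothetical).
CONVERSION_RULES: Dict[str, Dict[str, str]] = {
    "ccd": {
        "problemCode": "code",
        "codeSystem": "system",
        "effectiveTimeLow": "start_date",
    },
    "orbis": {
        "diag_code": "code",
        "diag_catalog": "system",
        "diag_date": "start_date",
    },
}


def to_common_model(source_model: str, record: Record) -> Record:
    """Convert one local condition record into the common representation."""
    rules = CONVERSION_RULES[source_model]
    common: Record = {"type": "Condition"}
    for local_field, common_field in rules.items():
        if local_field in record:
            common[common_field] = record[local_field]
    return common


# Placeholder records from two differently structured sources.
from_ccd = to_common_model(
    "ccd",
    {"problemCode": "CODE-1", "codeSystem": "ICD-9-CM", "effectiveTimeLow": "2012-05-01"},
)
from_orbis = to_common_model(
    "orbis",
    {"diag_code": "CODE-2", "diag_catalog": "ICD-10-GM", "diag_date": "2013-02-11"},
)
```

Under this arrangement, supporting a further source model amounts to adding one more entry to `CONVERSION_RULES`; the existing entries and the consumers of the common representation are left untouched.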
This section depicts the complete transformation and mediation cycle of the query and the results, which is initiated by the CSCT by passing the query parameters to the Safety Analysis Query Manager (SAQM). SAQM is responsible for forwarding the eligibility criteria represented in SALUS CIM Ontology to the registered data sources and getting back the aggregated results again in SALUS CIM. The complete cycle is presented in detail in Figure
The complete transformation and mediation cycle of the eligibility query and the result sets via the SALUS Interoperability Framework.
At this point, all the patient data in SAQM are represented in SALUS CIM; however, the data cannot yet be “understood” uniformly, because they are coded with codes from several different terminology systems.
The first step to overcome the terminology reasoning challenge is the representation of the terminology systems as ontologies within the SALUS Semantic Resource Set. For this, we prefer the well-established Simple Knowledge Organization System (SKOS) [
The next step is formalizing the mapping between terminology systems. We utilize several reliable terminology mapping resources for this purpose, as presented in Table
Terminology mapping resources that are utilized in the SALUS Framework.
Source system | Target system | Number of mappings | Mapping resource |
---|---|---|---|
MedDRA | SNOMED-CT | 10,648 | OntoADR of the PROTECT project, manual improvement of UMLS mapping by PROTECT experts [ |
ICD-9-CM | SNOMED-CT | 16,819 | OMOP Vocabulary, created manually by experts |
ICD-10-CM | SNOMED-CT | 59,122 | OMOP Vocabulary, created manually by experts |
SNOMED-CT | ICD-10 | 27,166 | CrossMap, a collaborative project by IHTSDO and WHO |
ICD-10-GM | ICD-10 | 12,318 | Identical codes in both systems |
ICD-9-CM | SNOMED-CT | 43,086 | BioPortal, manual review by SALUS experts before inclusion |
ICD-10-CM | SNOMED-CT | 45,022 | BioPortal, manual review by SALUS experts before inclusion |
In order to perform terminology reasoning at run time within acceptable durations, it is necessary to carry out, in advance, some inferencing specific to the reasoning requirements; this is known as materialization in the semantic Web domain.
In our case series characterization scenario, the conditions of the patients are provided with several codes at different levels from ICD-9-CM in Lombardy and ICD-10-GM in TUD. However, the safety analyst wants the conditions to be grouped under a different terminology system, namely, MedDRA, and also at a specific level in the MedDRA hierarchy, in this case HLGT. Therefore, we should be able to find either exact or broad correspondences of various source codes from ICD-9-CM and ICD-10-GM to MedDRA HLGT terms. An example for
Some generic and specific codes for representing
In our Semantic Resource Set, we represent the original hierarchical relationships within a terminology system with “skos:broader” property. Regarding the mapping across terminology systems, we have used a number of resources providing the mapping across different terminology systems and formally represented them through RDF properties. IMI PROTECT project created an ontology called OntoADR, which also presented the correspondence between MedDRA and SNOMED-CT codes [ OMOP project [ US NLM provides mapping between SNOMED-CT and ICD-10 to support semiautomated generation of ICD-10 codes from clinical data encoded in SNOMED-CT for reimbursement and statistical purposes. This is a result of CrossMap Project by IHTSDO and WHO [
It should be noted that, in our first attempt, we tried to represent this mapping through the well-established SKOS ontology via its relationships like skos:exactMatch and skos:narrowMatch and used these relationships to infer mapping between ICD-10-GM and MedDRA. However, after manually analysing some of the inferred terminology mappings, we realized that some of them were clinically incorrect. We discovered that most of the errors are due to the transitive and bidirectional nature of SKOS mapping relationships [
By using all these relationships, in this scenario we apply a series of terminology reasoning rules, again implemented on top of EYE, which calculate the full transitive closure of “salus:closeMatch” relationship for all the codes in our Semantic Resource Set. A part of the result for the haemorrhage example is provided in Figure
An excerpt from the result of “skos:closeMatch” relationship transitive closure calculation.
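The materialization step can be illustrated with a small transitive closure computation. This is a sketch under stated assumptions, not the EYE rules used in SALUS: hierarchy and mapping links are both treated as edges of a single directed relation, and all codes are placeholders rather than real ICD, SNOMED-CT, or MedDRA codes.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def transitive_closure(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """For every source code, collect all codes reachable over directed edges."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)

    closure: Dict[str, Set[str]] = {}
    for start in list(graph):
        seen: Set[str] = set()
        queue = deque(graph[start])
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            # .get avoids creating empty entries for leaf nodes.
            queue.extend(graph.get(node, ()))
        closure[start] = seen
    return closure


# Placeholder edges: hierarchy within a system and mappings across systems.
EDGES = [
    ("ICD9:SPECIFIC", "ICD9:GENERIC"),   # hierarchy (broader)
    ("ICD9:GENERIC", "SNOMED:CONCEPT"),  # cross-terminology mapping
    ("SNOMED:CONCEPT", "MEDDRA:PT"),     # cross-terminology mapping
    ("MEDDRA:PT", "MEDDRA:HLT"),         # hierarchy (broader)
    ("MEDDRA:HLT", "MEDDRA:HLGT"),       # hierarchy (broader)
]

CLOSURE = transitive_closure(EDGES)


def targets_in(code: str, prefix: str) -> Set[str]:
    """Precomputed lookup, restricted to one target system or level by prefix."""
    return {c for c in CLOSURE.get(code, set()) if c.startswith(prefix)}
```

Because the closure is computed once ahead of time, a run-time request such as `targets_in("ICD9:SPECIFIC", "MEDDRA:HLGT")` becomes a dictionary lookup. Keeping the edges directed in this sketch is a deliberate simplification; it avoids the reverse inferences that a bidirectional relation would introduce.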
These materialized results are provided to the Terminology Reasoning Service. At run time, Terminology Reasoning Service is used to enrich the coded information in retrieved population data, such as problem and active ingredient, with the codes from the terminology systems preferred by the safety analyst. This materialized mapping information is also used while querying the EHRs, for query expansion. In Lombardy DWH, for example, the original query for
The final step is the calculation of the statistics that the analyst asked for. Queries implemented as EYE rules are executed on the patient data enriched as a result of terminology reasoning to extract the common and different characteristics of the foreground and background populations. The results are displayed by the CSCT, as seen in Figure
A part of overall common conditions results displayed by the CSCT.
When the
Upon the previous configuration of the analyst, all the conditions of the background and foreground populations are grouped under MedDRA HLGT terms and presented comparatively. Similarly, the medications are grouped by their active ingredients at the substance level. By analysing all these results, the safety analyst decides in an informed manner whether a specific drug (
The quantity and quality of the information provided by SALUS CSCT to the UMC safety analysts are a significant improvement compared to what they are able to access using traditional methods based on reported ADEs and without access to EHR sources.
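The comparative presentation of grouped conditions can be sketched as follows. This is a simplified illustration, assuming that each patient is reduced to a list of source codes and that the terminology reasoning step has already produced a code-to-group lookup; the function names, group names, and patient data are hypothetical, and the SALUS implementation expresses such queries as EYE rules instead.

```python
from collections import Counter
from typing import Dict, List, Tuple


def group_proportions(
    patients: Dict[str, List[str]], code_to_group: Dict[str, str]
) -> Dict[str, float]:
    """Share of patients having at least one condition in each group."""
    counts: Counter = Counter()
    for codes in patients.values():
        # A set ensures a patient is counted once per group.
        counts.update({code_to_group[c] for c in codes if c in code_to_group})
    total = len(patients)
    return {group: counts[group] / total for group in counts} if total else {}


def compare(
    foreground: Dict[str, List[str]],
    background: Dict[str, List[str]],
    code_to_group: Dict[str, str],
) -> Dict[str, Tuple[float, float]]:
    """Per group: (proportion in foreground, proportion in background)."""
    fg = group_proportions(foreground, code_to_group)
    bg = group_proportions(background, code_to_group)
    return {g: (fg.get(g, 0.0), bg.get(g, 0.0)) for g in sorted(set(fg) | set(bg))}


# Placeholder lookup and populations.
CODE_TO_GROUP = {"SRC-1": "GROUP-A", "SRC-2": "GROUP-A", "SRC-3": "GROUP-B"}
FOREGROUND = {"p1": ["SRC-1", "SRC-2"], "p2": ["SRC-3"]}
BACKGROUND = {"p3": ["SRC-1"], "p4": [], "p5": ["SRC-3"], "p6": []}

TABLE = compare(FOREGROUND, BACKGROUND, CODE_TO_GROUP)
```

In this toy data, `GROUP-A` occurs in half of the foreground patients and a quarter of the background patients, which is the kind of difference an analyst would then examine as a possible alternative explanation.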
CSCT and all related components have been implemented and deployed on top of the SALUS Semantic Interoperability Framework integrated with the central Data Warehouse (DWH) of the Lombardy Region. This regional DWH contains anonymized structured data of about 16 million patients, with more than 10 years of longitudinal data on average. There are around 1 billion medical records grouped as follows: ~550 million ambulatory diagnoses; ~275 million drug prescriptions; ~80 million conditions; ~35 million vaccinations; ~30 million inpatient diagnoses; ~2 million allergies; ~800,000 pregnancy records.
We have followed a progressive deployment approach to effectively address challenges due to technical integration and testing with huge data and started with deploying incrementally on 3 reduced subsets of the original DWH including 40, 100 thousand, and 1 million patients. After ensuring stability and optimum parameters for parallel execution of subqueries to improve the performance, we have deployed on the DWH with 16 million patients.
All deployment activities have taken place within the care zone of the data owners, and remote validators (i.e., pharmacovigilance researchers) in the research zone accessed the SALUS safety analysis tools including CSCT, which are all implemented as Web applications, through secure VPN channels and access credentials. There is no transfer of identified patient data outside the care zone; only anonymized data are accessible. The deidentification process has been carefully built and put in place. All personal information has been anonymized; date of birth has been generalized; date of death and event dates have been randomly and coherently shifted; rare diseases and orphan drugs have been eliminated.
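Two of the deidentification steps named above, generalizing the date of birth and shifting dates "randomly and coherently", can be sketched as below. This is an illustrative assumption about how such a step might look, not the process deployed in Lombardy: the record layout, the generalization to the year of birth, and the `max_shift_days` bound are hypothetical, and the elimination of rare diseases and orphan drugs is not covered.

```python
import random
from datetime import date, timedelta


def deidentify(patient: dict, rng: random.Random, max_shift_days: int = 30) -> dict:
    """Return a copy without direct identifiers and with coherently shifted dates."""
    # One random offset per patient, applied to every date of that patient,
    # so that the intervals between the patient's events are preserved.
    shift = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    death = patient.get("death_date")
    return {
        # Date of birth generalized to the year only (assumed granularity).
        "birth_year": patient["birth_date"].year,
        "death_date": death + shift if death else None,
        "events": [(code, day + shift) for code, day in patient["events"]],
        # Name and other direct identifiers are deliberately not copied.
    }


# Placeholder patient record.
PATIENT = {
    "name": "Example Person",
    "birth_date": date(1950, 6, 15),
    "death_date": None,
    "events": [("CODE-A", date(2010, 1, 10)), ("CODE-B", date(2010, 3, 1))],
}

SAFE = deidentify(PATIENT, random.Random(0))
```

Using a single offset per patient is what keeps the shift coherent: the time between a prescription and a later event is unchanged, so temporal analyses remain meaningful on the shifted data.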
The validation activities for the Lombardy pilot application took place from August 2014 to January 2015 for all SALUS tools with the involvement of several experts from UMC and Lombardy. These activities and results are presented in the following subsections.
In order to assess whether CSCT fulfills the intended use from an end-user point of view, it has been tested and evaluated by real end-users from UMC and Lombardy Regional Pharmacovigilance Centre in the scope of the SALUS project. The SALUS Evaluation and Validation Framework has been developed based on the ISO/IEC 25040 Systems and software engineering
A few queries with their durations of execution on two different DWHs of Lombardy, that is, with 1 million patients and with 16 million patients, are provided in Table
A few CSCT queries and their execution times on two Lombardy DWHs with 1 million and 16 million patients.
Medication | Reaction | Execution time in 1 million patients | Execution time in 16 million patients |
---|---|---|---|
Dabigatran | Upper gastrointestinal haemorrhage | 40 minutes | 0.4 days |
Nifedipine | Acute myocardial infarction | 95 minutes | 1.6 days |
Simvastatin | Rhabdomyolysis | 543 minutes | 6 days |
Ramipril | Pancreatitis | 647 minutes | 7.2 days |
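
Since the two columns of the table use different units, a short calculation helps when comparing them. The sketch below converts the 16 million patient timings to minutes and computes the slowdown factor relative to the 1 million patient subset; the figures are taken from the table, while the helper name is an assumption for illustration.

```python
MINUTES_PER_DAY = 24 * 60

# (minutes on 1 million patients, days on 16 million patients), from the table.
TIMINGS = {
    "Dabigatran": (40, 0.4),
    "Nifedipine": (95, 1.6),
    "Simvastatin": (543, 6.0),
    "Ramipril": (647, 7.2),
}


def slowdown(minutes_small: float, days_large: float) -> float:
    """How many times longer the query takes on the larger DWH."""
    return days_large * MINUTES_PER_DAY / minutes_small


FACTORS = {drug: slowdown(*pair) for drug, pair in TIMINGS.items()}

for drug, factor in FACTORS.items():
    print(f"{drug}: about {factor:.1f} times longer")
```

For these four queries the factor lies between roughly 14 and 24, while the number of patients grows by a factor of 16, which suggests that execution time grows approximately in proportion to the size of the DWH.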
In line with our ISO/IEC SQuaRE compliant evaluation and validation framework, in order to collect and analyse end-user feedback, we have developed online questionnaires addressing different validation characteristics including
CSCT questionnaire based evaluation scores (italic: average; bold: above average; the interval is
Usability | Social acceptance and viability | Quality of work life | Perceived usefulness | Perceived ease of use | User control |
---|---|---|---|---|---|
In the questionnaires, the end-users have agreed on the following aspects. CSCT is an added value to the existing process of research in pharmacovigilance. CSCT makes it easier to define eligibility queries and retrieve eligible patients for foreground and background populations. CSCT is compliant with the existing local, regional, and national processes.
We have also carried out focus group meetings and interviews with the validator end-users. The most prominent positive comments of the CSCT regarded its general user friendliness and ease of use. An average time of 7–10 minutes was required in order to get acquainted with the tools before team members felt confident in how to use them. Other positive aspects that were mentioned included the possibility of selecting different credibility intervals in certain statistical measures and more generally that the tool indeed has the potential to provide useful information in signal detection and validation work.
The major criticism of the CSCT concerned the time it takes to execute the queries, especially when the eligible patient population retrieved as the result of a query is large. This is due to the huge amount of patient records being accessed remotely in real time and the heavy use of standards-based transactions, semantic conversion, and terminology reasoning operations, an explanation that the end-users have accepted as well. This criticism came from UMC experts, who are used to working on top of locally stored data which is converted in advance to formats and terminology systems used in the clinical research domain and hence not subject to several conversions for interoperability, such as the studies done on central data repository of the OMOP initiative [
Further details on end-user validation of CSCT and all other SALUS ADE detection and safety study tools are presented in SALUS D7.2.2 Validation Report for SALUS Pilot Application [
Lombardy Region is planning for a drugs monitoring project for adverse reactions specifically for patients treated with new oral anticoagulants (NOACs). Before initiating this project, Lombardy Regional Pharmacovigilance Centre carried out a preanalysis study with the available data in the Lombardy DWH to investigate the relationships between NOACs and some medical conditions as suspected ADEs (e.g., dabigatran etexilate as the NOAC and upper gastrointestinal haemorrhage as the suspected ADE), by using traditional methods and tools supported with custom-built queries and manual interpretation of data. After deploying SALUS tools on top of the Lombardy DWH, experts from Lombardy Informatics (LISPA; the partner in the SALUS project from Lombardy) decided to repeat the same study by using the CSCT, which provided the opportunity to test CSCT and the underlying SALUS Semantic Interoperability Framework in the field.
This comparative analysis revealed that the results provided by CSCT were identical to those found by the Lombardy Regional Pharmacovigilance Centre through traditional methods, which confirmed the technical correctness of our implementation. The main difference was observed in terms of time and resources spent to complete the studies. Experts at the Lombardy Regional Pharmacovigilance Centre reported that they completed their NOAC study in 1 month using traditional methods, while it took only 2 full days to repeat the same study by using CSCT and the underlying interoperability platform. Experts from the Lombardy Regional Pharmacovigilance Centre were impressed with this significant improvement in time and resource utilization.
The adoption of EHR systems and data exchange among these systems are rapidly increasing due to a number of national and cross-border projects in Europe and Meaningful Use in the US [
Although the main priority of these systems is improving clinical care, we demonstrate that the same systems and interfaces can be exploited for postmarket safety studies as well, with minimum intrusion when necessary, as in the case of our QED extension for population based queries. Our implementation shows that it is possible to carry out these observational studies without developing study specific databases and Data Warehouses, which are costly and hard to maintain.
In the TUD case, we also demonstrate a complementary approach by developing a semantic interface directly on top of the EHR database and formalizing patient data immediately. This approach is of course more capable in the sense that the whole content of the EHR database can be formalized and more complex querying can be done compared to the standard based interfaces for data exchange. However, it necessitates an in-depth knowledge of and interaction with the storage structure of the EHR system, in addition to expertise with semantic Web technologies. Our advantage in SALUS is that AGFA as the developer of the ORBIS system is a core beneficiary of the project, so that we are able to demonstrate both approaches in parallel in integrated scenarios.
One of the biggest challenges in developing semantic Web applications is utilizing a satisfactory reasoning engine that is able to perform in reasonable time and space. In our very early prototype [
The data that we need in SALUS scenarios, such as conditions, procedures, allergies, and medications of the patients, are always available in a structured manner in the Lombardy DWH. On the other hand, we have observed in TUD that some medical details of some patients are only available in free-text patient documents and are missing in structured form. This naturally limits the benefits of our advanced safety study tools. However, analysis of free-text data in EHRs was not within the scope of SALUS as a focused research project.
Last but not least, it is critical to have reliable and explicit mapping between terminology systems to accurately address the Semantic Interoperability challenge between clinical care and clinical research domains. In SALUS, we have analysed several mapping resources and represented the best options in RDF, mostly through SALUS specific properties, and, through reasoning, we have inferred close matches that can be of use to SALUS end-users. It was not always possible to infer stronger and more valuable relationships such as exact match due to missing semantics. Therefore, in order to make the existing mapping reliable and reusable over the semantic Web, it is extremely important that the communities who create the mapping provide it in RDF using standard ontologies such as SKOS to indicate the exact semantics of the mapping relationships.
We have developed a scalable interoperability framework for observational studies and demonstrated in this paper how it is used for case series characterization by the pharmacovigilance researchers. Through our integration, validation, and comparative analysis studies, we have proven that the CSCT and the underlying SALUS Semantic Interoperability Framework have gone beyond simple proof-of-concept prototypes.
Semantically mediating all the patient data and terminology systems in formalized representations allows us to extend the capabilities of our tools via introduction of new rules easily. For example, we are able to insert a new rule to check the existence of diabetes through age, some specific medications (e.g., metformin), and laboratory test results (e.g., glycosylated hemoglobin) when diabetes is not explicitly recorded in the list of diagnoses of a patient.
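The diabetes example can be sketched as a simple rule. This is only an illustration in Python of the kind of rule meant here; the SALUS rules are written for the EYE reasoner, and the way the three inputs are combined, the HbA1c cut-off of 6.5%, the minimum age, and the record layout are all assumptions chosen for the example rather than details taken from the project.

```python
from typing import Dict


def has_diabetes(
    patient: Dict, hba1c_threshold: float = 6.5, min_age: int = 18
) -> bool:
    """Explicit diagnosis, or an inference from age, medication, and lab results."""
    if "diabetes" in patient.get("diagnoses", ()):
        return True  # explicitly recorded, no inference needed
    if patient.get("age", 0) < min_age:
        return False  # assumed age condition for applying the inference
    on_metformin = "metformin" in patient.get("medications", ())
    high_hba1c = any(
        value >= hba1c_threshold for value in patient.get("hba1c_percent", ())
    )
    return on_metformin or high_hba1c


# Placeholder patients.
RECORDED = {"age": 60, "diagnoses": ["diabetes"]}
INFERRED_BY_DRUG = {"age": 55, "diagnoses": [], "medications": ["metformin"]}
INFERRED_BY_LAB = {"age": 47, "diagnoses": [], "hba1c_percent": [5.9, 7.1]}
NOT_DIABETIC = {"age": 47, "diagnoses": [], "hba1c_percent": [5.4]}
```

Because the rule operates on the mediated representation rather than on any source-specific format, it applies to patients from every connected EHR source without per-source changes.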
Scalability is due to our semantic mediation approach; whenever a new source or target content model is to be added, the required mapping to the SALUS CIM is added in linear time, without affecting the existing resources. For example, although not used directly in our pilot sites, recently we have also added ISO/CEN 13606 archetypes as another source model. Furthermore, our decoupled RESTful services allow us to improve the overall performance by multiplying the services for concurrent processing and reasoning.
The SALUS architecture is designed for all kinds of observational studies, not just for case series characterization. In our other pilot application scenarios (e.g., temporal pattern characterization for signal detection), we have additional requirements such as subscribing to population data and mapping population data to OMOP CDM as the target model. We have implemented the necessary supplementary components for meeting these requirements and validated them with the involvement of several end-users, as in the case of CSCT.
As one of the final outcomes of the SALUS project, we have developed a guidance document [
Beyond the project, SALUS partners are now concentrating on the exploitation and marketing of the SALUS Semantic Interoperability Framework and the supporting ADE detection and safety analysis tools. The most concrete efforts are taking place in the pharmacovigilance authorities in Lombardy, Italy, and in Turkey for large-scale deployment and operational use at the regional and national levels.
The authors declare that they have no competing interests.
The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under Grant agreement no. ICT-287800, SALUS Project (Scalable, Standard based Interoperability Framework for Sustainable Proactive Post Market Safety Studies).