Big data analytics (BDA) is important for reducing healthcare costs. However, there are many challenges of data aggregation, maintenance, integration, translation, analysis, and security/privacy. The study objective, to establish an interactive BDA platform with simulated patient data using open-source software technologies, was achieved by constructing a platform framework with the Hadoop Distributed File System (HDFS) using HBase (a key-value NoSQL database). Distributed data structures were generated from benchmarked hospital-specific metadata of nine billion patient records. At optimized iteration, HDFS ingestion of HFiles to HBase store files showed sustained availability over hundreds of iterations; however, completing MapReduce to HBase required a week for one billion (10 TB) and a month for three billion (30 TB) indexed patient records, respectively. Inconsistencies found in MapReduce limited the capacity to generate and replicate data efficiently. Apache Spark and Drill showed high performance with high usability for technical support but poor usability for clinical services. A hospital system based on patient-centric data was challenging to implement in HBase, whereby not all data profiles were fully integrated with the complex patient-to-hospital relationships. However, we recommend using HBase to achieve secured patient data while querying entire hospital volumes in a simplified clinical event model across clinical services.
Large datasets have been in existence, continuously, for hundreds of years, beginning in the Renaissance Era when researchers began to archive measurements, pictures, and documents to discover fundamental truths in nature [
There are many recent studies of BDAs in healthcare defined according to many technologies used, like Hadoop/MapReduce [
Research has focused mainly on the size and complexity of healthcare-related datasets, which includes personal medical records, radiology images, clinical trial data submissions, population data, and human genomic sequences (Table
Big Data applications related to clinical services [
Clinical services | Healthcare applications |
---|---|
R&D | (i) Targeted R&D pipeline in drugs and devices, clinical trial design, and patient recruitment to better match treatments to individual patients, thus reducing trial failures, speeding new treatments to market, following on indications, and discovering adverse effects before products reach the market |
Public health | (i) Targeted vaccines, e.g., choosing the annual influenza strains |
Evidence-based medicine | (i) Combine and analyze a variety of structured and unstructured data (EMRs, financial and operational data, clinical data, and genomic data) to match treatments with outcomes, predict patients at risk for disease or readmission, and provide more efficient care |
Genomic analytics | (i) Make genomic analysis a part of the regular medical care decision process and the growing patient medical record |
Device/remote monitors | (i) Capture and analyze in real-time large volumes of fast-moving data from in-hospital and in-home devices, for safety monitoring and adverse event prediction |
Patient profile analytics | (i) Identify individuals who would benefit from proactive care or lifestyle changes, for example, those patients at risk of developing a specific disease (e.g., diabetes) who would benefit from preventive care |
Certain improvements in clinical care can be achieved only through the analysis of vast quantities of historical data, such as length-of-stay (LOS); choice of elective surgery; benefit or lack of benefit from surgery; frequencies of various complications of surgery; frequencies of other medical complications; degree of patient risk for sepsis, MRSA,
Healthcare and hospital systems need BDA platforms to manage their data and derive value from it. The conceptual framework for a BDA project in healthcare is, in its essential functionality, not totally different from that of conventional systems. Healthcare analytics is defined as a set of computer-based methods, processes, and workflows for transforming raw health data into meaningful insights, new discoveries, and knowledge that can inform more effective decision-making [
The objective was to establish an interactive and dynamic framework with front-end and interfaced applications (i.e., Apache Phoenix, Apache Spark, and Apache Drill) linked to the Hadoop Distributed File System (HDFS) and a backend NoSQL database of HBase to form a platform with Big Data technologies to analyze very large data volumes. By establishing a platform, the challenges of implementing and applying it to healthcare scenarios for clinical services could be validated by users who visualize, query, and interpret the data. The overall purpose was a proof of concept of Big Data capabilities to stakeholders, including physicians, VIHA administrators, and other healthcare practitioners. One working hypothesis was that a NoSQL database created using hospital and patient data in differentiated fields would accurately simulate the patient data. Another hypothesis was that high performance could be achieved with a few nodes optimized at core CPU capacity and, therefore, could be used for clinical services. Lastly, patient data could be secured through the configuration and deployment of the HBase/Hadoop architecture, relying heavily on WestGrid's High Performance Computing (HPC) environment. These hypotheses are related to five specific challenges: data aggregation, maintenance, integration, analysis, and pattern interpretation of value application for healthcare [
Legal and ethical issues are a major consideration in the utilization of large datasets of patient data in healthcare [
In a hospital system, such as the Vancouver Island Health Authority (VIHA), the capacity to record patient data efficiently during the processes of ADT is crucial for timely patient care and enhanced patient-care deliverables. The ADT system is referred to as the source of truth for reporting of hospital operations, from inpatient to outpatient and discharged patients. Among these deliverables are reports of clinical events, diagnoses, and patient encounters linked to diagnoses and treatments. Additionally, in Canadian hospitals, discharge records are subject to data standards set by the Canadian Institute for Health Information (CIHI) and submitted to Canada's national Discharge Abstract Database (DAD) repository. Moreover, ADT reporting is generally conducted through manual data entry to a patient's chart, which is then combined with the Electronic Health Record (EHR) (adding further complications, such as possibly compromising autopopulated data) that might consist of other hospital data in reports to provincial and federal health departments [
Big Data technologies fall into four main categories: high performance computing, data processing, storage, and resource/workflow allocation, like Hadoop/MapReduce [
Big Data technologies using Hadoop with possible applications in healthcare [
Technologies | Clinical utilization |
---|---|
Hadoop Distributed File System (HDFS) | High-capacity, fault-tolerant, and inexpensive storage of very large clinical datasets. |
MapReduce | The programming paradigm has been used for processing clinical Big Data. |
Hadoop | Infrastructure adapted for clinical data processing. |
Spark | Processing/storage of clinical data indirectly. |
Cassandra | Key-value store used indirectly for clinical data. |
HBase | NoSQL database with random access that has been used for clinical data. |
Apache Solr | Document warehouse used indirectly for clinical data. |
Lucene and Blur | Document warehouse not yet in healthcare, but upcoming for free-text query on the Hadoop platform; can be used for clinical data. |
MongoDB | JSON document-oriented database that has been used for clinical data. |
Hive | Data interaction not yet configured for clinical data, but a cross-platform SQL layer is possible. |
Spark SQL | SQL access to Hadoop data; not yet configured for clinical data. |
JSON | Data description and transfer; has been used for clinical data. |
ZooKeeper | Coordination of data flow; has been used for clinical data. |
YARN | Resource allocator of data flow; has been used for clinical data. |
Oozie | A workflow scheduler to manage complex multipart Hadoop jobs; not currently used for clinical data. |
Pig | High-level data flow language for processing batches of data, but not used for clinical data. |
Storm | Streaming ingestions; used for clinical data. |
A distributed computing system can manage hundreds of thousands of computers or systems, each of which is limited in its processing resources (e.g., memory, CPU, and storage). By contrast, a grid computing system makes efficient use of heterogeneous systems with optimal workload management servers, networks, storage, and so forth. Therefore, a grid computing system supports computation across a variety of administrative domains, unlike a traditional distributed computing system. Furthermore, a distributed Hadoop cluster, with its distributed computing nodes and connecting Ethernets, runs jobs controlled by a master. “Hadoop was first developed to fix a scalability issue affecting
Considering the design and implementation of BDA systems for clinical use, the basic premise is to construct a platform capable of compiling diverse clinical data. However, obtaining approval through the Ethics and Research Capacity process at VIHA for the entire patient data of the hospital system was not possible. Secondly, it was not possible to piece together summarized data specific to health outcomes because this data had already been summarized. Thirdly, using real data in the data warehouse at VIHA would require several months to review and to develop the solution to use Big Data technologies. Lastly, performance benchmarking of the platform needed to be determined against the current data query tools and workflow at VIHA, which means that simulation at extremely large volume could prove high performance and usability. Therefore, the study focused on simulation conducted with VIHA's real metadata and exchanged knowledge on how the ADT and DAD could be used in production.
A Hadoop/MapReduce framework was proposed to implement HBDA and analyze emulated patient data over a distributed computing system that is not currently used in acute patient-care settings at VIHA and other health authorities in British Columbia, Canada. The collaboration among UVic, Compute Canada/WestGrid, and VIHA established the framework of the HBDA platform. It comprised innovative technologies like the Hadoop HDFS with MapReduce programming and a NoSQL database. The HBase database construct was complex and had many iterations of development over the past three to four years. HBase is an open-source, distributed key-value (KV) store based on Google's BigTable
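Concretely, the key-value layout that HBase provides can be illustrated with the hbase shell. This is a minimal sketch under assumed names (the 'ENCOUNTERS' table, 'd' column family, and row key format are hypothetical, not the study's actual schema): one patient encounter maps to one indexed row whose fields are qualifiers in a column family.

```bash
# Minimal sketch (hypothetical table, family, and qualifier names): one
# encounter becomes one row key, with DAD-style fields as 'd:' qualifiers.
hbase shell <<'EOF'
create 'ENCOUNTERS', {NAME => 'd', VERSIONS => 1}
put 'ENCOUNTERS', 'enc#000000001', 'd:phn', '9876543210'
put 'ENCOUNTERS', 'enc#000000001', 'd:mrn', '123456789'
put 'ENCOUNTERS', 'enc#000000001', 'd:dx_code', 'E11.9'
get 'ENCOUNTERS', 'enc#000000001'
EOF
```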
The functional platform was tested for performance of data migrations or ingestions of HFiles via Hadoop (HDFS), bulkloads to HBase, and ingestions of HFiles to Apache Spark and Apache Drill. In this study, performance tests were proof-of-concept, using simulated data with the same replicated metadata at very large volume. Furthermore, this study involved six Hermes nodes (each node has 12 Central Processing Unit (CPU) cores). These accounted for only 72 cores out of the overall maximum of 4416 cores available at WestGrid-UVic. There were many configurations and package components to include in the build, such as Apache Phoenix, Apache Spark, and Apache Drill, as well as the Zeppelin and Jupyter Notebook interfaces.
Metadata is information about the data that is established in a system as a structured standard to record and retrieve information accurately. It is the structure of metadata that allows data profiles (i.e., characteristics, sources, and character lengths) to be established in a database. In healthcare, this means data is standardized effectively so that patient records are accurate when retrieved or viewed in an EHR. In the case of VIHA, the metadata of the ADT system allows patient data to be recorded when a patient is admitted to the hospital, assigned to a bed, and provided other medical services. The structure itself is not proprietary to the system and does not contain any real patient data. In meetings with VIHA personnel, the core metadata of ADT/DAD were verified with questions scripted for the three main groups (Box
(i) Focus on the current demographics using standardized metadata of CIHI, hospitalization, and readmission
(ii) CIHI requests hospitals to submit data based on set data collection targets and standards. Much of the used
(iii) ADT is location, medical service and inpatient to discharge, so we can add those columns while diagnosis
(iv) Requested by and regulated by CIHI all metadata associations can be based on the encounter and MRN at
(v) It is the most important system that holds the patient’s non-clinical information. These are based at the
(vi) ADT is collected when the patient is still in the hospital, but DAD data is recorded after the patient leaves the
(vii) DAD contains the clinical information that is collected; ADT is the location, date and time of the visit, and
(viii) Patients are identified using their PHN, MRN and encounter number. Encounter level queries are important
(i) Produce standard reports hourly, daily, weekly, monthly, and yearly with no errors for reporting, the
(ii) ADT is implemented from the vendor as the source of truth and is automated; DAD is an abstraction and utilizes
(iii) Significant relevance to reporting to CIHI can show similar queries in simulation.
(iv) Standardized reporting is available to show similar queries in simulation.
(v) Primary keys are important for data integrity and no errors while linking encounter to patient. Database
(vi) Encounter-level data is important to standard reporting and data integrity. Simulation patient encounters
(vii) Key stores are important to index data because the foundation of the system is based on the patient encounter. Need to
(viii) Important queries need to incorporate as proof of concept with certain fields from hospital systems:
(ix) Combining the columns, we need to be able to perform these basic calculations:
(i) Like key stores, we need dependencies in our database to be representative of the existing system relevant to the
(ii) Certain data elements with standardized metadata are necessary for the data to be accurate. The process
(iii) Integration is not necessary for the system to work but only to query the data ad hoc or correctly, and currently
(iv) Medical Services is not currently utilized in clinical reporting because it is not DAD abstracted, but could be
(v) Transfers are important to ADT and flow of patients in the system as their encounters progress and change.
(vi) Combining columns against encounter rows is already implemented at the hospital level; therefore, ADT and
(vii) Groupings allow the building and construction of the database to add columns progressively based on the encounter.
(viii) Diagnosis is important because it is the health outcome of the hospital. Groupings important as performance
To accomplish these objectives, Island Health’s partial core metadata from ADT/DAD systems was obtained via knowledge transfer in interviews with specific teams working at Royal Jubilee Hospital (RJH) and Nanaimo Regional General Hospital (NRGH). Knowledge transfer with VIHA personnel and current reporting limitations were documented, recorded, and verified in summary after several meeting iterations.
Information from the informatics architecture team comprised the DAD dictionary and the selected data elements. Information on metadata and the frequencies of three core data elements (i.e.,
Metadata was set at over 90 columns and randomized based on data dictionary examples and VIHA interviews. For example, metadata for the diagnostic column was set with standardized metadata of the International Classification of Diseases, version 10, Canadian (ICD-10-CA) codes; the personal health number (PHN) has ten numerical digits, while the patient's medical record number (MRN) for that encounter has nine numerical digits. All data elements and their required fields, as well as primary and dependent keys, were recorded for completed trials of the necessary columns to generate the emulation of aggregated hospital data. The generator included all important data profiles, and dependencies were established through primary keys over selected columns (Table
Use cases and patient encounter scenarios related to metadata of patient visits, with database placement relative to query output.
Case | Clinical database |
---|---|
Uncontrolled type 2 diabetes & complex comorbidities | (i) DAD with diagnosis codes, HBase for IDs |
TB of the lung & uncontrolled DM 2 | (i) DAD and ADT columns with HBase for patient IDs |
A on C renal failure, fracture, heart failure to CCU, and stable DM 2 | (i) DAD and ADT columns with HBase for patient IDs |
Multilocation cancer patient on palliative care | (i) DAD and ADT columns with HBase integrating data together |
1 cardiac with complications | (i) DAD and ADT columns with HBase integrating data together |
1 ER to surgical, fracture, readmitted category for 7 days and some complication after | (i) DAD and ADT columns with HBase integrating data together |
1 simple day-surg. with complication, admitted to inpatient (allergy to medication) | (i) DAD and ADT columns with HBase for patient IDs |
1 cardiac with complications and death | (i) DAD and ADT columns with HBase integrating data together |
1 normal birth with postpartum hemorrhage complication | (i) DAD and ADT columns with HBase integrating data together |
1 HIV/AIDS patient treated for an infection | (i) DAD and ADT columns with HBase for patient IDs |
Strep A infection | (i) DAD and ADT columns with HBase integrating data together |
Cold but negative Strep A, child | (i) DAD and ADT columns with HBase integrating data together |
Adult patient, Strep A positive | (i) DAD and ADT columns with HBase for patient IDs |
Severe pharyngitis | (i) DAD and ADT columns with HBase integrating data together |
Child, moderate pharyngitis, throat culture negative, physical exam | (i) DAD and ADT columns with HBase for patient IDs |
Adult, history of heart disease, positive culture for Strep A | (i) DAD and ADT columns with HBase integrating data together |
Adult, physical exam, moderate pharyngitis, positive for Strep A culture and positive second time, readmitted | (i) DAD and ADT columns with HBase for patient IDs |
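To make the generator concrete, the sketch below shows how rows with the data profiles described above (10-digit PHN, 9-digit MRN, ICD-10-CA-style diagnosis code) might be emitted as CSV; the gen_rows helper, column order, and output file name are hypothetical, not the study's actual generation scripts.

```bash
# Hypothetical generator sketch: emit CSV rows matching the metadata
# profiles described above (not the study's actual generator).
gen_rows() {
  local n=$1
  for ((i = 0; i < n; i++)); do
    phn=$(printf '%010d' $((RANDOM * 32768 + RANDOM)))          # 10-digit PHN
    mrn=$(printf '%09d'  $((RANDOM * 32768 + RANDOM)))          # 9-digit MRN
    dx=$(printf 'E%02d.%d' $((RANDOM % 90)) $((RANDOM % 10)))   # ICD-10-CA-style code
    echo "${phn},${mrn},${dx}"
  done
}
gen_rows 5 > sample_encounters.csv   # small sample; iterations scale to 50 million
```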
At VIHA, health informatics architecture has a direct relation to DAD abstracting, as it is a manual process dependent on
The data warehouse team working with health professionals for clinical reporting can rely on comma-separated value
It is important to note that this study concerns performance testing of ADT/DAD queries on a distributed file system (Apache Hadoop) with a processing (MapReduce) configuration over an emulated NoSQL database (HBase) of patient data. The platform tested totally randomized generated data, replicated in duplicates for every 50 million patient encounters, with replicated groupings, frequencies, dependencies, and so on in the queries. The pipelined process included five stages or phases that coincided with the challenges outlined in Section
All necessary data fields were populated for one million records before replication to form one and three billion records. The recorded workflow provided a guideline to form the NoSQL database, as a large distributed flat file. The patient-specific rows across the columns according to the existing abstraction were further emulated; HBase established a wide range of indexes for each unique row, and each row contained a key value that was linked to the family of qualifiers and primary keys (columns). The HBase operations were specific to family qualifiers at each iteration; therefore, the data was patient-centric combined with certain DAD data (from different sources of metadata) in the rows and columns, such that summary of diagnosis or medical services could be queried.
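A hypothetical reduction of such a schema is sketched below as Phoenix DDL: the primary key becomes the HBase row key, and the remaining columns are stored as family qualifiers. The table and column names are illustrative only, not VIHA's actual DAD.sql.

```bash
# Sketch of a Phoenix DDL (hypothetical names), run via sqlline.py as in
# the Box listings later in this section.
cat > DAD_sketch.sql <<'EOF'
CREATE TABLE IF NOT EXISTS DAD_SKETCH (
  ENCOUNTER_ID BIGINT NOT NULL,    -- becomes the HBase row key
  PHN          CHAR(10),
  MRN          CHAR(9),
  DIAGNOSIS    VARCHAR,            -- ICD-10-CA-style code
  ADMIT_DATE   DATE,
  CONSTRAINT pk PRIMARY KEY (ENCOUNTER_ID)
);
EOF
sqlline.py hermes0090-ib0 DAD_sketch.sql
```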
The emulated data was stored and maintained in the HPC parallel file system (~500 GB) and over the BDA platform under HDFS. The replication factor for HDFS was set to three for fault tolerance. The large volume of datasets was reutilized to test the performance of different use cases or queries conducted by the analytics platform. This required innovation, in an agile team setting, to develop stages in the methodology unique to BDA configurations related to healthcare databases.
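For reference, the three-way replication can be checked and adjusted with standard HDFS commands; the paths below are hypothetical.

```bash
# Sketch: verify the replication factor of three described above
# (cluster-wide it is the dfs.replication property in hdfs-site.xml).
hdfs dfs -setrep -w 3 /data                          # ensure 3 replicas under /data
hdfs dfs -stat 'name=%n replicas=%r' /data/sample_encounters.csv
hdfs fsck /data -files -blocks | head                # block-level replication report
```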
This step was very important because the SQL-like Phoenix queries had to produce the same results as the current production system at VIHA. All results were tested under a specific data size and comparable time for analysis, whether the query was simple or complex. The data results also had to show the exact same columns after the SQL-like queries over the constraint of the family qualifiers (as primary keys). Over a series of tests, certain columns were included or excluded as qualifiers in the SQL code for constraints. Once the results were accurate and were the same as those benchmarked, those qualifiers remained for each of the iterations run via Hadoop, to generate the one billion totals.
In this step, the study conducted a proof-of-concept analysis of task-related use cases specific to clinical reporting. The queries were evaluated based on the performance and accuracy of the BDA framework over the one billion rows. For example, a task-based scenario for the analysis included the following.
The established framework of the platform used WestGrid’s existing security and the privacy of its supercomputing platform while reviewing and identifying regulations for eventually using real patient data over the platform (external to the hospital’s data warehouse). The following method was applied, which included four steps.
HBase creates indexes for each row of data, which cannot be queried with direct access; queries can only be generated when accessing the deployment manager (DM) on the platform. That is, the data cannot be viewed by anyone at any time or for any duration; only queries can show the data, which is HBase-specific and nonrecognizable without Hadoop and HBase running, as well as the correct scripts to view it.
Data replication, executed as a generator over the platform, was carried out in conjunction with business/security analysts to identify the masking or encryption algorithms required, representing optimal techniques to replace the original sensitive data (a minimal sketch follows these steps).
A review was carried out of the related privacy protection regulations and principles, such as HIPAA, the Freedom of Information and Protection of Privacy Act (FIPPA), the Personal Information Protection Act (PIPA), and the use of the public KV stores established in semipermanent databases of HBase distributed by Hadoop.
A test of the replicated dataset was executed by an application process to check whether the resulting masked data could be modified for viewing. A real dataset (a large annual inventory of pharmaceuticals) was tested and verified first, since studies have shown that the distribution of data using Hadoop has many inherent processes that restrict access to running ingestions [
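As an illustration of the masking in step 2 above, a minimal sketch is shown; the salted-hash approach and column order are assumptions for illustration, not the exact algorithms selected with the analysts.

```bash
# Hypothetical masking sketch (assumed column order PHN,MRN,rest): replace
# direct identifiers with truncated salted SHA-256 digests before ingestion.
SALT=$(openssl rand -hex 8)
mask() { printf '%s' "$1$SALT" | sha256sum | cut -c1-16; }
while IFS=, read -r phn mrn rest; do
  echo "$(mask "$phn"),$(mask "$mrn"),$rest"
done < sample_encounters.csv > masked_encounters.csv
```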
In this section, the steps and experiences in implementing the technical framework and application of a BDA platform are described. The established BDA platform will be used to benchmark the performance of end users' querying of current and future reporting of VIHA's clinical data warehouse (i.e., in production, spanning more than 50 years and circa 14 TB). To accomplish this, a Hadoop environment (including the Hadoop HDFS) was installed from source and configured on the WestGrid cluster, and a dynamic Hadoop job was launched.
The construction and build of the framework with HBase (NoSQL) and Hadoop (HDFS) established the BDA platform. This construct coincided with, and was enforced by, the existing architecture of the WestGrid clusters at UVic (secure login via LDAP directory service accounts to deployment database nodes and restricted accounts to dedicated nodes). The platform initially ran with five worker nodes and one master node (each with twelve cores), with plans to increase the (dedicated) nodes to eleven and possibly to 101, as well as to incorporate a nondedicated set of virtual machines on WestGrid's OpenStack cloud.
The queries via Apache Phoenix (version 4.3.0) resided as a thin SQL-like layer on HBase. The pathway to running ingestions and queries from the build of the BDA platform on the existing HPC was as follows.
This pathway was tested iteratively up to three billion records (once generated) to compare the combination of HBase-Phoenix with Phoenix-Spark, an Apache Spark plugin (Apache Phoenix, 2016), under this sequence, after loading the necessary module environments for Hadoop, HBase, and Phoenix and testing initial results linked to the family qualifiers and HBase key-value entries [
Performance was measured with three main processes: HDFS ingestion(s), bulkloads to HBase, and query times via Phoenix. Ingestion time was measured per iteration and in total to achieve the desired number of records, that is, one billion and three billion replicated from 50 million [
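A sketch of how these three timings can be wrapped is shown below. The file and table names are hypothetical, and the bulkload is assumed to go through Phoenix's CsvBulkLoadTool (the Box listing later in this section truncates the client-jar command, so this invocation is an assumption).

```bash
# Sketch of timing the three measured stages (hypothetical names).
time hdfs dfs -put part_050M.csv /data/                  # 1) HDFS ingestion
time hadoop jar /global/software/Hadoop-cluster/phoenix-4.6.0/phoenix-4.6.0-HBase-0.98-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  --table DAD_SKETCH --input /data/part_050M.csv         # 2) bulkload to HBase
time sqlline.py hermes0090-ib0 query01.sql               # 3) query via Phoenix
```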
Apache Spark (version 1.3.0) was also built from source and installed to use on HBase and the Hadoop cluster. The intent was to compare different query tools like Apache Spark and Drill, implemented over the BDA platform, against Apache Phoenix using similar SQL-like queries. The entire software stack used in the platform has at its center HDFS (Figure
Big Data Analytics (BDA) platform designed and constructed as a patient encounter database of a hospital system.
Data profiles, dependencies, and the importance of the metadata for reporting performance were also emulated and verified. Current reporting limitations were recorded to assess whether combinations of the DAD and ADT could be done in one distributed platform running parallel queries. A total of 90 columns were confirmed as important to construct the necessary queries and to combine ADT data with DAD data in the Big Data platform. Additionally, the derived queries were compared with clinical cases, and how they interacted with the performance of the platform was representative of clinical reporting at VIHA.
HBase (NoSQL version 0.98.11) was composed of the main deployment master (DM) and failover master, the
The steps carried out to run Hadoop modules are shown in Box
(1) qsub -I -l walltime=72:00:00,nodes=6:ppn=12,mem=132gb
(2) ll /global/software/Hadoop-cluster/ -ltr
(3) module load Hadoop/2.6.2
(4) setup_start-Hadoop.sh f (f for format; do this only once…)
(5) module load HBase/…
(6) module load phoenix/…
(7) (actually check the ingest.sh script under ~/bel_DAD)
(8) hdfs dfsadmin -report
(9) djps (command displays the JVMs, i.e., Java services running, with PIDs)
(1) module load Hadoop/2.6.2
(2) module load HBase/0.98.16.hdp262
(3) module load phoenix/4.6.0
(4) localFileName="The CSV file containing your data"
(5) hdfs dfs -mkdir /data
(6) hdfs dfs -put "
(7) hdfs dfs -ls /data
(8) sqlline.py hermes0090-ib0 DAD.sql
(9) export HADOOP_CLASSPATH=/global/software/Hadoop-cluster/HBase-0.98.16.1/lib/HBase-
(10) time hadoop jar /global/software/Hadoop-cluster/phoenix-4.6.0/phoenix-4.6.0-HBase-0.98-client.jar
(1) First decide which file to use, then check the correctness of its column names. DADV2.sql (for v2) and
(2) Create the database table using sqlline.py as illustrated above (sqlline.py hermes0090-ib0 DAD.sql)
(3) Make sure all the modules are loaded: module load Hadoop/2.6.2
(4) Generate the rest of the data (we need 10 billion and must monitor the Big Data integer in the database).
(5) Use d_runAll.sh to ingest them all at once (see the sketch below).
(6) If a problem happens (or persists), check the logs in different locations (/global/scratch/dchrimes/ and/or on
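The internals of d_runAll.sh are not shown here; the loop below is a hypothetical sketch of what such an iteration driver plausibly does, reusing the generator and the (assumed) CsvBulkLoadTool invocation from the earlier sketches, with log paths modeled on the Box above.

```bash
#!/bin/bash
# Hypothetical sketch of an iteration driver in the spirit of d_runAll.sh:
# generate, ingest, and bulkload one 50-million-row chunk per iteration.
for i in $(seq 1 20); do
  chunk="dad_part_${i}.csv"
  gen_rows 50000000 > "$chunk"          # generator from the earlier sketch
  hdfs dfs -put "$chunk" /data/         # HDFS ingestion
  time hadoop jar /global/software/Hadoop-cluster/phoenix-4.6.0/phoenix-4.6.0-HBase-0.98-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table DAD_SKETCH --input "/data/${chunk}" \
    >> "/global/scratch/ingest_${i}.log" 2>&1   # bulkload to HBase, logged
  rm -f "$chunk"                        # free local space before next iteration
done
```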
The platform worked as expected after modified configurations of Hadoop’s
The Map part of MapReduce on the platform showed high performance at 3–10 minutes, but the Reduce took 3–12 hours (Figure
Performance (seconds) of 60 ingestions (i.e., 20 replicated 3 times) from Hadoop HDFS to HBase files, MapReduce indexing, and query results. Dashed line is total ingestion time and the dotted line is time to complete the Reducer of MapReduce. The bottom dashed-dot lines are the times to complete Map of MapReduce and the duration (seconds) to run the queries.
To improve the ingestion of the one billion rows and 90 columns to attempt to generate 1–10 billion rows, local hard disks of 40 TB in total were physically installed on the worker nodes. After local disks were installed on five (worker) nodes, a set of shell scripts was used to automate the generation and ingestion of 50 million records at each of the iterations via MapReduce. The maximum achieved was 3 billion due to operational barriers, workflow limitations, and table space because key stores almost tripled the amount of space used for each of the ingestions (Table
Operational experiences, persistent issues, and overall limitations of tested Big Data technologies and components that impacted Big Data Analytics (BDA) platform.
Technology component | Clinical impact to platform |
---|---|
Hadoop Distributed File System (HDFS) | (i) Did not reconfigure more than 6 nodes because it is very difficult to maintain clinical data |
MapReduce | (i) Totally failed ingestion |
HBase | (i) |
ZooKeeper & YARN | (i) Extremely slow performance when ZooKeeper services are not running properly for both, but additional configuration minimized this limitation with few issues for YARN |
Phoenix | (i) To maintain a database schema with current names in a file on the nodes, such that if the files ingested do not match, it will show an error, and to verify that ingested data exists within the metadata of the schema while running queries |
Spark | (i) Slow performance |
Zeppelin | (i) 30-minute delay before running queries, which take the same amount of time as with Jupyter |
Jupyter | (i) Once the Java is established, it has high usability and excellent performance |
Drill | (i) It is extremely fast but has poor usability |
Another finding on the limitations of the Big Data technologies installed on WestGrid's architecture was the ongoing manual intervention (over three to five months) required to constantly fine-tune the performance of bulkloads from MapReduce to HBase. Hadoop ingestions exhibited high performance, taking circa three minutes to complete a task of 258 MB for each 50 million rows. Sometimes HDFS was unbalanced and had to be rerun to rebalance the data across the nodes; likewise, when the 500 GB local disk did not fail over to the installed 2 TB disks, the entire ingestion had to start over because HBase could not reindex it, and its queries were then invalid with no indexes, which drastically slowed performance. There were also findings on optimized performance of the platform. CPU usage needed to be maximized; during mid-May to October 2016 it peaked at 100% but did not remain there because running compaction after each of the ingestions took over 4 hours (Figure
A year of varied iteration and CPU usage (at 100%) on the Hermes89 node, reported from WestGrid, showing variation in the duration of the ingestion of 50 million records over each of the iterations. The graph shows the following: user (in red), system (in green), IOWait time (in blue), and CPU Max (black line).
The deployment of the Hadoop environment on the nodes was carried out behind the backend database scenes via a sequence of setup shell scripts whose configurations the user can then adjust to match the needs of the job and its performance. There were 22 SQL-like query tests for querying reports, instances, and frequencies in the ADT/DAD data over the 50 million to 1–3 billion rows. Ten queries were categorized as simple, while the others were complex; the latter included more than three columns and three primary keys across the 90 possible columns. All queries, simple (linear) and complicated (exponential and recursive), took less than two seconds for one billion rows and almost the same for three billion when the nodes were eventually balanced by Hadoop; however, some queries took more than three seconds but less than four seconds for three billion rows with unbalanced nodes. There were no significant differences between simple and complex query types, with a possible two-second increase when nodes were unbalanced. Caching did not influence the query times. The performance speed, even at one to three billion rows for complex queries, was extremely fast compared to the 50 million rows queried. It did require months of preparation to get to the task of testing the platform with the queries. Health data involved with hospital outcomes and clinical reporting was combined to form a database and distributed over the nodes as one large file, up to 30 TB for HBase. All the pertinent data fields and many more were used.
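The 22 queries themselves are not listed here; the pair below is an illustrative sketch (hypothetical table and column names from the earlier DDL sketch) contrasting a simple linear query with a complex multi-column grouping of the kind described.

```bash
# Illustrative simple vs. complex Phoenix queries (hypothetical schema).
cat > queries_sketch.sql <<'EOF'
-- simple (linear): frequency of one diagnosis code
SELECT COUNT(*) FROM DAD_SKETCH WHERE DIAGNOSIS = 'E11.9';
-- complex: grouping across several columns and keys
SELECT DIAGNOSIS, ADMIT_DATE, COUNT(DISTINCT MRN) AS PATIENTS
FROM DAD_SKETCH
GROUP BY DIAGNOSIS, ADMIT_DATE
ORDER BY PATIENTS DESC;
EOF
sqlline.py hermes0090-ib0 queries_sketch.sql
```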
The results showed that the ingestion of one billion records took circa two hours via Apache Spark. Apache Drill outperformed Spark/Zeppelin and Spark/Jupyter [
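For context, the sketch below shows how a Drill query can reach the same HBase data through Drill's hbase storage plugin; the ZooKeeper host, table, and qualifier names are hypothetical, and CONVERT_FROM decodes HBase's raw bytes into readable strings.

```bash
# Sketch of a Drill query over HBase (hypothetical names), from Drill's sqlline.
bin/sqlline -u jdbc:drill:zk=hermes0090-ib0:2181 <<'EOF'
SELECT CONVERT_FROM(t.d.dx_code, 'UTF8') AS diagnosis,
       COUNT(*) AS encounters
FROM hbase.`ENCOUNTERS` t
GROUP BY CONVERT_FROM(t.d.dx_code, 'UTF8');
EOF
```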
Drill did perform well compared to Spark, but it did not offer any tools or libraries for taking the query results further. That is, Drill proved to have higher performance than Spark but its interface had fewer functionalities. Moreover, algorithms (as simple as correlations between different columns) were time-demanding if not impossible to express as SQL statements. Zeppelin, on the other hand, offered the ability to develop the code, generate the mark-down text, and produce excellent canned graphs to plot the patient data (Figure
Zeppelin interface with Apache Spark with multiple notebooks that can be selected by clinical users.
With Jupyter, more configurations with the data queries were tested. It exhibited similar code to ingest the file (Figure
Spark with Jupyter and SQL-like script to run all queries in sequence and simultaneously.
Drill interface customized using the distributed mode of Drill with local host and running queries over WestGrid and Hadoop.
The ultimate goal of the study was to test the performance of the Big Data computing framework and its technical specifications cross platform against all challenges specific to its application in healthcare. This goal was accomplished by combining ADT and DAD data through ingestions over the Hadoop HDFS and the MapReduce programming framework. High performance over the BDA platform was verified with query times of less than four seconds for 3 billion patient records (regardless of complexity), showing that challenges of aggregation, maintenance, integration, data analysis, and interpretative value can be overcome by BDA platforms.
There are analytical challenges in many Canadian healthcare systems because of separated silos of aggregations. There are complex and unique variables that include “(1) information used; (2) preference of data entry; (3) services on different objects; (4) change of health regulations; (5) different supporting plans or sources; and (6) varying definition of database field names in different database systems” [
The ADT data are very difficult to emulate because they are from Cerner System, which uses a kernel to create alias pools for ~1000 different tables in the database. Simply creating one flat file cannot emulate the complex metadata relationships and does not guarantee that the data are treated uniquely for each encounter row when the encounters can change over time or several are linked to the same patient. However, if the data is extracted from the automated hospital system and it is confirmed that the columns are correct with unique rows, it should be possible to combine it with DAD data with similar unique keys and qualifiers. The complex nature of HBase means that it is difficult to test the robustness of the data in emulations based on real data. Several steps were required to prepare the DAD database alone for statistical rendering before it was sent to CIHI. The actual columns used in this study are the ones used by VIHA to derive the information accurately in a relational database, which ensures the data is in alias pools and not duplicated for any of the encounters. Other research reviews (e.g., [
It was more complicated to validate the simulated data in Spark and Drill with real data. Scott [
Wang et al. [
There are many alternative solutions for Big Data platforms; choice of the best solution depends on the nature of the data and its intended use (e.g., [
HBase also has a dynamic schema that can be uploaded via other Apache applications; therefore, the schema can be changed and tested on the fly. If HBase had not been used, more complex data models would have been needed to map over the Hadoop/MapReduce framework. Another benefit of using HBase is that further configurations can be accomplished for multirow transactions using a comma-separated value
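A minimal sketch of such an on-the-fly schema change, using the hypothetical table from the earlier sketch:

```bash
# Sketch: add an ADT column family to an existing HBase table
# (table/family names hypothetical; 0.98-era shells disable before alter).
hbase shell <<'EOF'
disable 'ENCOUNTERS'
alter 'ENCOUNTERS', {NAME => 'adt', VERSIONS => 1}
enable 'ENCOUNTERS'
describe 'ENCOUNTERS'
EOF
```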
Our study showed that compaction on HBase improved the number of successful runs of ingestion; however, it did not prevent failure of the nodes, a finding that is supported by other studies (e.g., [
In Canada, population health data policy relies on legislative acts for public disclosure of the data accessed externally outside health authority’s boundaries [
An advantage of using Apache Spark or Drill over Phoenix is less reliance on MapReduce, which speeds up performance; however, a major limitation is then that the data is not accurately representative of clinical events and is less encrypted. Therefore, there is a performance trade-off. A further limitation of this study was the linkage between the technologies and representations of the patient data for clinical use; HBase at large volumes did not achieve fully integrated complex hospital relationships. Without complete validation, the technologies cannot be certified by the health authority. More work on using key-value storage for BDA should be considered in simplified clinical event models across many clinical services.
There is a need to further explore the impact of Big Data technologies on the patient data models of hospital systems. Additionally, this study initially set out to test the security and privacy of the interactive and functional BDA platform. However, due to the limitations of MapReduce, it was determined that its Java code would remain as is and that encrypted patient identifiers for personal health number, medical record number, and date of birth would not be added. Tang et al. [
Dr. Dillon Chrimes was the lead technical specialist and wrote the research design and software implementation for publication, with Mr. Hamid Zamani as research assistant.
The authors declare no conflicts of interest.
This work was supported by a competitive research grant at the Vancouver Island Health Authority. Dr. Belaid Moa is thanked for database administration at WestGrid Systems, University of Victoria. Dr. Alex Kuo is thanked for the research framework plan.