Challenges and Opportunities for Exploring Patient-Level Data

The proper exploration of patient-level data will pave the way towards personalised medicine. To better assess the state of the art in this field, we identify the challenges and uncover the opportunities for the exploration of patient-level data through a review of well-known initiatives and projects in this area. These cover a broad array of topics, from genomics and patient registries to rare diseases research, among others. For each, we identified basic goals, involved partners, defined strategies, and key technological and scientific outcomes, establishing the foundation for our analysis framework with four pillars: control, sustainability, technology, and science. Substantial research outcomes have been produced towards the exploration of patient-level data. The potential behind these data will be essential to realise the personalised medicine premise in upcoming years. Hence, relevant stakeholders continually push forward new developments in this domain, bringing novel opportunities that are ripe for exploration. Despite the last decade's translational research advances, personalised medicine is still far from being a reality. The underlying potential of patients' data goes beyond daily clinical practice. There are miscellaneous challenges and opportunities open for the exploration of these data by academia and business stakeholders.


Introduction
The widespread collection of patient-level data represents a critical step towards the realization of personalised medicine [1,2]. These data stem from primary care centres, hospital information systems, clinical trials' cohorts, and administrative platforms. Moreover, they hold a huge potential that goes beyond daily clinical care [3,4].
Yet, along with the miscellaneous opportunities to explore patient-level data, this unparalleled growth of patients' digital metadata brings several challenges [5,6]. Data size, lack of open access, heterogeneity, and the use of primitive technologies are some of the issues researchers face [7]. Nevertheless, exploring the potential behind these data will lead to the discovery of new knowledge, essential to improve the current clinical narrative [8,9].
Although patient-level data from public institutions, such as hospitals or regional/national administration centres, should be easier to access, they are generally locked under primitive technological implementations. This results in closed data silos that hinder scientific and technological evolution. Several large-scale projects already try to commoditize access to these data, whether through policies or through technical standards for data exchange [10].
Pharmaceutical companies are also responsible for a large share of patient-level data [11]. Clinical trials' cohorts generate comprehensive patient datasets whose value for personalised medicine research is immeasurable [12,13]. Despite this, most pharmaceutical data are private [14].
It is important to distinguish private companies' data, which are the basis for internal research and development of new drugs and treatments, from public research datasets, which are fundamental to advance general scientific research. Although pharmaceutical companies are entitled to keep their results private, policies should be put in place to foster the sharing of clinically relevant results into the public domain.
Dealing with this heterogeneous mixture of private and public patient-level data, tools, standards, and projects is in itself a complex research and development challenge [15]. Ultimately, the entropy in this ecosystem is delaying what should be a swift evolution. Hence, we need to evaluate past and on-going initiatives to better assess and plan the personalised medicine research and development roadmap for the upcoming years [16].

BioMed Research International
To this end, we established an evaluation framework to analyse the outcomes of existing initiatives, identifying current challenges and uncovering new opportunities. This framework is based on four key pillars: control, sustainability, technology, and science. We assess several components in each of these areas, generating a comprehensive study: (i) the control section focuses on data ownership and access; (ii) the sustainability topics cover the long-term perspectives for each asset; (iii) on technology we assess the technical outcomes for each project, where they exist; (iv) at the science level we identify the projects' research areas and their key scientific outcomes.
We present this comprehensive review targeting three key objectives. These were to (1) identify the best initiatives dealing with patient-level data, (2) [17,18] to the ongoing exploration of pharmaceutical trials data [19], among others [20].

2.1. Design.
This review covers past and on-going large-scale projects. The evaluation of the selected projects is based on an assessment framework with four key components: control, sustainability, technology, and science. This design allows us to better understand the distribution of the projects' outcomes and to define an initial categorization for each project. We chose topics for matching criteria in each area based on mappings with existing ontologies, namely, Simple Knowledge Organization System (SKOS) [21] and EMBRACE Data and Methods (EDAM) [22].
At the control level we assess several topics, detailed next.
(i) Data ownership: who owns the project data and who decides whether to make data available or not? Available options are community, partner, or project.
(v) Security, privacy, and auditing: how are security, privacy, and auditing issues dealt with within the project? Available options are external, none, or project.
In this review we also assess the selected projects' sustainability, covering the following areas: (i)
Finally, we inspected the key scientific outcomes for each project, evaluating their areas of impact.
(i) Field of research: the fields of research whose results will have direct application to improve patient-level data exploration. These include EHR, epigenomics, genomics, metabolomics, pharmacogenomics, phenomics, proteomics, transcriptomics, and other.
(ii) Area of interest: similarly to the field of research, we identified the technological areas of scientific interest that were studied in the project. Available options are analytics, annotation, data integration, data visualization, ontology, semantic analysis, text-mining, and other.
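As an illustration only, the assessment topics listed above could be captured in a small record that checks each value against its allowed options. The sketch below is not the instrument used in this review: the class name, field names, and the example project are hypothetical, and only the topics whose option lists appear in the text are modelled.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Option lists copied from the text; all identifiers below are hypothetical.
DATA_OWNERSHIP = {"community", "partner", "project"}
SECURITY = {"external", "none", "project"}
FIELDS_OF_RESEARCH = {
    "EHR", "epigenomics", "genomics", "metabolomics", "pharmacogenomics",
    "phenomics", "proteomics", "transcriptomics", "other",
}
AREAS_OF_INTEREST = {
    "analytics", "annotation", "data integration", "data visualization",
    "ontology", "semantic analysis", "text-mining", "other",
}


@dataclass
class ProjectAssessment:
    """One project's answers for the topics described in the text."""
    name: str
    data_ownership: str
    security: str
    fields_of_research: Set[str] = field(default_factory=set)
    areas_of_interest: Set[str] = field(default_factory=set)

    def problems(self) -> List[str]:
        """Return one message per value that is not an allowed option."""
        issues = []
        if self.data_ownership not in DATA_OWNERSHIP:
            issues.append(f"data_ownership: {self.data_ownership}")
        if self.security not in SECURITY:
            issues.append(f"security: {self.security}")
        for value in sorted(self.fields_of_research - FIELDS_OF_RESEARCH):
            issues.append(f"field_of_research: {value}")
        for value in sorted(self.areas_of_interest - AREAS_OF_INTEREST):
            issues.append(f"area_of_interest: {value}")
        return issues


# Invented example project, not one of the reviewed initiatives.
example = ProjectAssessment(
    name="Example project",
    data_ownership="partner",
    security="project",
    fields_of_research={"genomics", "EHR"},
    areas_of_interest={"data integration"},
)
```

Calling `example.problems()` returns an empty list when every answer is one of the listed options, which makes it easy to spot coding mistakes before comparing projects.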

Inclusion and Exclusion Criteria.
We searched for large-scale international projects in literature and general listings. From there, the inclusion criteria for this review were as follows: (i) is on-going or finished after January 1st, 2011; (ii) is sponsored mainly by the NIH, IMI, or the European Commission; (iii) includes partners from both academia and the business sector; (iv) must focus on rare diseases, pharmacy, or have direct patient involvement; (v) must have public published results. For all identified projects, we reviewed titles, funding information, references, and available publications to better assess whether the projects appeared to meet all inclusion criteria. If insufficient information was available to make a confident decision, we contacted key project partners to obtain further details.
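The five inclusion criteria above can be read as a conjunction of simple checks. The following is a minimal sketch of that reading, assuming a hypothetical `Project` record; the field names and the sample projects are invented for illustration and do not come from the funding agencies' databases.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Set

CUTOFF = date(2011, 1, 1)                      # criterion (i)
MAIN_SPONSORS = {"NIH", "IMI", "EC"}           # criterion (ii)
REQUIRED_SECTORS = {"academia", "business"}    # criterion (iii)
FOCUS_AREAS = {"rare diseases", "pharmacy", "patient involvement"}  # criterion (iv)


@dataclass
class Project:
    """Hypothetical project record; end_date is None while on-going."""
    name: str
    end_date: Optional[date]
    main_sponsor: str
    partner_sectors: Set[str]
    focus: Set[str]
    has_public_results: bool


def meets_inclusion_criteria(p: Project) -> bool:
    """Return True only if all five criteria from the text hold."""
    ongoing_or_recent = p.end_date is None or p.end_date > CUTOFF
    return (
        ongoing_or_recent
        and p.main_sponsor in MAIN_SPONSORS
        and REQUIRED_SECTORS <= p.partner_sectors
        and bool(p.focus & FOCUS_AREAS)
        and p.has_public_results          # criterion (v)
    )


# Invented candidates used only to exercise the filter.
candidates = [
    Project("A", None, "EC", {"academia", "business"}, {"rare diseases"}, True),
    Project("B", date(2009, 6, 30), "NIH", {"academia", "business"}, {"pharmacy"}, True),
    Project("C", None, "IMI", {"academia"}, {"pharmacy"}, True),
]
selected = [p.name for p in candidates if meets_inclusion_criteria(p)]
```

In this toy run only project A passes: B finished before the cutoff date and C has no business partner.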

Results
This review provides an overview of the different attempts at improving the exploration of patient-level data. This section details the projects' evaluation according to our framework, including a tabular and visual comparison of their distinct features. From this evaluation we identify the main challenges and opportunities for future research endeavours.

Projects.
Our initial dataset was extracted from the online project databases of three major funding agencies: USA's National Institutes of Health (NIH), European Commission (EC), and the Innovative Medicines Initiative (IMI) [23-25]. After a comprehensive filtering and selection process, 16 projects met our inclusion criteria (Table 1).
At first glance we can quickly assess that the selected projects' domains and goals are heterogeneous, with the access or use of patient-level data being one of the few common threads. There is also an obvious bias towards European projects, as the European Commission continues to be a strong proponent of research, namely, in the life sciences and medical areas.

Feature Comparison.
In this section we explore the projects' evaluation results according to the several pillars of our evaluation framework.

Control.
From Figure 1, which highlights the control pillar, we can conclude that there is real diversity among the assessed projects regarding who controls the data. The notable exception concerns patient involvement (Figure 1(d)). Although patients play a fundamental role in the research workflow, patients and patient advocacy groups are seldom considered as partners. As the other charts in Figure 1 show, data ownership, storage, and distribution are evenly split among partners, the project, and the public domain. However, if we make a more basic categorization between open (public or community) and private (project or partner), the division is steeper.
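The coarser open versus private grouping mentioned above can be sketched as a simple recoding step. The category lists follow the text; the function names and the sample input are hypothetical and do not reproduce the counts behind Figure 1.

```python
from collections import Counter

# Grouping as stated in the text: open = public or community,
# private = project or partner.
OPEN = {"public", "community"}
PRIVATE = {"project", "partner"}


def openness(category: str) -> str:
    """Map a detailed control category to 'open' or 'private'."""
    if category in OPEN:
        return "open"
    if category in PRIVATE:
        return "private"
    raise ValueError(f"unknown category: {category}")


def summarise(categories):
    """Count how many entries fall in each coarse group."""
    return Counter(openness(c) for c in categories)


# Made-up answers for six imaginary projects.
sample = ["partner", "project", "public", "community", "partner", "project"]
counts = summarise(sample)
```

With this invented sample, four entries are recoded as private and two as open, showing how an apparently even split over four categories can become lopsided after regrouping.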

Sustainability.
Our sustainability review suggests better prospects for future data exploration. As Figure 2(b) highlights, the majority of projects already perform, or plan to perform, active data maintenance. This implies that data collected within the project's scope will be stored for future use. Even if access is limited, keeping these data alive opens good prospects for future endeavours. About half of the evaluated projects will continue to provide their results to academia, and some will focus on creating a business to sustain their research work once the project finishes (Figure 2(a)).

Technology.
At the technological level, all evaluated projects have already produced public results. As expected from the heterogeneous project goals, the technical outcomes are varied. Figure 3 highlights the current trend, where services and databases are the focus of produced work. Infrastructure development is also a key area in the selected projects, although it was more relevant for projects started before 2011 (Figure 3(A)). These results are of particular relevance for our review. We can infer that proper effort is already being put towards creating infrastructures for research. Hence, we should move our focus to the better exploration of existing resources, namely, with the creation of additional frameworks, standards, and services.

Science.
As shown in Figure 4, we find the greatest variety of project features at the scientific level. Here, genomics is evidently important. Although the results are biased due to the selected projects' domain, there is a clear influence of genomics, pharmacogenomics, and biobanking at the patient-level domain (EHR). Nevertheless, as shown in Figure 4(a)(B), the miscellaneous omics research fields continue to be of interest and EHR interest is growing. Figure 4(b) also validates the fundamental role of data integration in the various research fields. Nowadays, data integration expertise must be a common commodity for life sciences and medical research projects. More importantly, as Figure 4(b)(B) shows for projects started after 2011, the differences in the fields of analytics, ontologies, text-mining, and semantic analysis are staggering. This reveals the growing significance of semantic web related technologies, as they complement analytics, ontologies, and text-mining features.

Challenges and Opportunities.
With this evaluation we identified several challenges and opportunities. Challenges relate to data discovery, access, acquisition, and ownership. This brings several opportunities to deploy future solutions that fully explore the enormous amounts of patient-level data, using technological paradigms that projects are already supporting.

Challenges.
There is a clear dichotomy regarding data. Patient-level data is a very specific use case for exploration. While large amounts of data are scattered across multiple stakeholders, they are extremely difficult to obtain. The outcome is that, in the end, there are not enough data to generate statistically meaningful conclusions. Hence, we cannot discover or infer new knowledge because there is no access to a minimal amount of patient data. Along with distribution, data heterogeneity arises as a key challenge for exploring patient-level data. As shown in Figure 3, there are already several projects dealing with creating new and improving existing data standards for data sharing. However, these are far from being widely adopted by international stakeholders. Bioinformatics and pharmacogenomics projects also face these challenges [41]. Nevertheless, for these there are already adequate standards for data storage and exchange [42-45].

RD-Connect - http://rd-connect.eu/ RD-Connect will launch an integrated platform connecting databases, registries, biobanks, and clinical bioinformatics for rare diseases research [38].
Sentinel 2008 - http://www.fda.gov/Safety/FDAsSentinelInitiative/default.htm Sentinel is a USA-based electronic system that will transform FDA's ability to track the safety of drugs, biologics, and medical devices [39,40]. This initiative aims to develop and implement a proactive system that will complement existing systems that the FDA has in place to track reports of adverse events linked to the use of its regulated products.

In the same vein, data translation also arises as a complex challenge for researchers. In addition to the obvious sense (translating data between multiple languages [46]), there is the translation of low-level free-text data into structured information [47,48]. Clinicians' reports traditionally include their notes in free text. These notes must be mapped to a shared domain, elevated from simple text to meaningful structured knowledge. Again, the growing relevance of text-mining and semantic web technologies, as highlighted before, is visible.
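To make the idea of elevating free text to structured information more concrete, the sketch below performs a naive dictionary lookup over a clinical note. It is a toy illustration under strong assumptions: the vocabulary, the concept codes, and the note are all invented, and real text-mining pipelines handle negation, abbreviations, and spelling variants that this lookup ignores.

```python
import re

# Invented vocabulary mapping surface terms to made-up concept codes.
VOCABULARY = {
    "hypertension": "CONCEPT:0001",
    "type 2 diabetes": "CONCEPT:0002",
    "asthma": "CONCEPT:0003",
}


def annotate(note: str):
    """Return structured mentions (term, code, offsets) found in a note."""
    lowered = note.lower()
    hits = []
    for term, code in VOCABULARY.items():
        # Word boundaries avoid matching a term inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            hits.append({
                "term": term,
                "code": code,
                "start": match.start(),
                "end": match.end(),
            })
    # Order mentions by their position in the note.
    return sorted(hits, key=lambda hit: hit["start"])


note = "Patient reports asthma since childhood; diagnosed with Type 2 diabetes in 2009."
mentions = annotate(note)
```

Each mention keeps its character offsets, so the structured output can always be traced back to the original sentence in the clinician's report.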
Data discovery, access, and acquisition are typical problems that can be solved by improving existing technologies and by focusing on their widespread adoption. Unlike these, data ownership is a much more complex issue. Dealing with data ownership involves tackling issues related to government policies, stakeholders' interests, and projects' internal guidelines. In an ideal scenario, all patient-level data should be available for research purposes. This should be particularly enforced in publicly funded projects. Yet, this does not happen. As seen in Figure 1, projects' data ownership, storage, and access resort to closed solutions. In most cases, data are privately held or, at most, shared with project partners. Moreover, where data are shared publicly with researchers, access restrictions are in place.

Opportunities.
Great challenges bring great opportunities. From our review, we believe there is room for improving how we explore patient-level data and how we can use them to further improve research and development towards personalised medicine. As Figure 4 highlights, on-going projects are already solving important technological challenges.
There is huge potential behind the combination of data available worldwide. Yet, we need to develop and disseminate new technologies that improve how relevant entities collect, store, and share patient-level data.
As data integration is already commonplace, to obtain real advances in this domain we must see worldwide patient-level data as a whole, and not as single detached data silos. Although we already have the technology to accomplish this, stakeholders must unite efforts to make this holistic view a reality.
At the technical level, opportunities arise that demand the creation of new software and new standards. Likewise, at a policy level, we must improve existing guidelines and policies to better cover data sharing and ownership and ethics issues.
New data management standards should promote better (and easier) ways to access and share data. This will promote knowledge discovery and enable the integration and interoperability among patient-level data silos throughout the world. Likewise, going from patient-level data to summary-level data, and vice versa, should be a simple, straightforward process with the latest text-mining and semantic web tools.
Ideally, new software will empower collaboration and sharing among patients and clinicians. These should promote ease of access to patient information and enhance the communication process among clinicians. Furthermore, new tools are required to enhance data ownership controls, facilitating how patients, clinicians, or researchers express who has access to relevant personal data. More importantly, a combination of policies and guidelines should be put in place to foster the active involvement of patients in clinical care.
Despite the great opportunity for creating new standards and software, these assets alone are not enough to change the current scenario. New policies and guidelines, stemming directly from key worldwide stakeholders, must be disseminated to all interested parties. Moreover, with adequate support from governmental agencies (regional, national, and international), projects and their internal partners will proactively work towards implementing these new guidelines.

Discussion
As this review reveals, there is room for change in the exploration of patient-level data. However, we must take into account that these results are biased and strict. This is an ever-expanding field with many partners, projects, and companies working on this subject.
While we tried to be comprehensive, this review has obvious limitations. Namely, identifying each project's features and technical/scientific outcomes was a complex task. Once the projects finish, little to no effort is put into maintaining an accurate dissemination summary, and project results are rarely assessed a couple of years after each project's conclusion.
patient-level data stemming from electronic patient records. However, as shown in Figure 4(a), the quantity and quality of projects interacting with patient databases focused on genomics data are growing [49]. Furthermore, next generation sequencing technologies streamline the generation of huge patient datasets [50]. In a sense, patient sequencing data are patient-level data. Projects such as 1000 Genomes [51] or Genome of the Netherlands [52] are trying to sequence large numbers of individuals to better understand existing genotype-phenotype relationships and uncover new ones.
In the long term, these data will be included in clinical patient registries. They may even be part of the electronic patient record. At this stage, clinicians will require new tools to adequately exploit the true value behind these data. In summary, this is a whole new field of exploration for personalised medicine and patient-level data research that cannot be ignored [53].

Implications for Future Research.
As detailed in previous sections, the various opportunities highlight the room for improvement in this domain. Assessing how the projects evolved over time, we find that the focus on sharing, dissemination, and patient control is of growing relevance in the field.
The creation of new technical standards and data sharing policies will be fundamental for future research. Moreover, these topics are emerging in current project calls. Thus, they are becoming a stepping-stone for future research and infrastructure initiatives.
Despite the scale of on-going projects, they will not cover every possible topic. Technological developments in analytics tools, text-mining, ontologies, semantic web, data visualisation, integration, and interoperability, originating from distinct areas, must be brought to patient-level exploration.
The semantic web arises as a groundbreaking paradigm to foster the intelligent integration of structured information. Sustained by state-of-the-art standards such as RDF, OWL, SPARQL, and Linked Data, the semantic web promotes better strategies to express, infer, and make knowledge interoperable.
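The core idea behind these standards is that knowledge is expressed as subject-predicate-object statements that can be matched and joined across sources. The sketch below illustrates that idea with plain tuples and wildcard patterns; it is not RDF or SPARQL itself and uses no semantic web library, and every identifier and statement in it is invented.

```python
# Toy statements shaped like RDF triples: (subject, predicate, object).
# All identifiers are made up for illustration.
TRIPLES = [
    ("patient:1", "hasDiagnosis", "disease:A"),
    ("patient:2", "hasDiagnosis", "disease:A"),
    ("patient:2", "hasVariant", "gene:X"),
    ("disease:A", "associatedWith", "gene:X"),
]


def match(pattern, triples=TRIPLES):
    """Yield the triples that fit a pattern; None acts as a wildcard."""
    for triple in triples:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


def patients_with_variant_in_disease_gene(triples=TRIPLES):
    """Join three patterns: a patient has a disease, the disease is
    associated with a gene, and the patient carries a variant in that gene."""
    found = set()
    for patient, _, disease in match((None, "hasDiagnosis", None), triples):
        for _, _, gene in match((disease, "associatedWith", None), triples):
            if any(match((patient, "hasVariant", gene), triples)):
                found.add(patient)
    return sorted(found)


result = patients_with_variant_in_disease_gene()
```

Because clinical statements (diagnoses) and research statements (disease-gene associations) share identifiers, a single join connects them; in this toy dataset only `patient:2` satisfies all three patterns.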
Latest advances in the area cover the research and development of new algorithms to further improve how we collect data, transform data into meaningful knowledge assertions, and publish connected knowledge. To further improve this, we must rely on the latest text-mining technologies. Elevating clinical text data to abstract knowledge or mapping the best matching ontologies to patient datasets require advanced text-mining solutions.
The combination of these strategies (semantic web, text-mining, and ontologies) will pave the way towards interoperable scientific knowledge. These technologies will foster data integration and interoperability, enabling an effortless connection between heterogeneous distributed knowledge obtained from patient-level data. Hence, the foundation of translational research, where multiple technical research areas collide, will be even more meaningful in the future.

Impact.
Although this review had the main goal of covering the scientific results, we cannot ignore additional fundamental questions surrounding large-scale projects.
Hence, we must discuss the privacy policies applied to research-oriented datasets, the creation of businesses sustained by public funding, or the lack of publicly visible project evaluation outcomes.
The general community perceives that there is a huge amount of public funds being poured into research projects in all areas. Still, the outcomes of these projects are not as public as desired. There is an underlying sense of fulfilment in investing in research, especially in fields related to the life sciences, such as rare diseases treatments, pharmaceutical research, or any other relevant omics field: IMI, EC, and NIH are funding science. Figure 1(b) highlights that only a quarter of the studied projects expect to provide their data publicly to the general research audience. Data access restrictions are too common in research. Large investments, with public funds, are being applied to clinical drug trials, patient registries development, and next generation sequencing technologies. Yet, the majority of research outcomes will not be made available to the public. And, despite pharmaceutical companies' financial involvement in IMI projects, the expected profit from these projects will definitely surpass the money invested. Patient-level data obtained with public research funds, which have the potential of being fundamental to create new knowledge, are not available to the research community, as they are closed behind complex privacy policies and never-ending access restrictions.
Likewise, the charts in Figure 2 show that there are several projects whose future sustainability will rely on implementing a profit-oriented business model. Hence, we must ask, again, how can public funds, applied to research projects, be used to create self-sustainable companies? These companies will sell products, software, or data created with research funds stemming from public investment.
Finally, there is great difficulty in finding project details and their respective evaluation results. It is as if the IMI, EC, and NIH project lists were difficult to access and lacked essential project details on purpose. The general audience cannot find out how projects are evaluated, their assessment results and, more importantly, their visible outcomes. Despite having concluded that most project results are private, the projects' evaluation should be public. Furthermore, it should be supported by a clear long-term plan that assesses the proper use of public funds to actually advance research. Finished projects should be evaluated at multiple points in time, not just when the deadline is reached. Evaluating projects 2, 5, or 10 years after their finish date would improve the understanding of how successfully the large sum of invested money was used.
The reality is that IMI, EC, and NIH are funding projects that have the liberty to create for-profit businesses and, more importantly, the liberty to apply public funds to the most diverse research tasks, whether or not they are directly related to the expected project results.

Conclusions
This review provides an overview of different initiatives that try to properly explore patient data. We limited our study to research and development projects in the recent past. We established base criteria to evaluate on-going initiatives. This resulted in the identification of several opportunities for future developments, namely, (1) bringing distributed data together by putting more advanced sharing and integration at clinicians' fingertips; (2) focusing on text-mining and semantic web technologies to create real knowledge from distributed and heterogeneous data; and (3) pressuring stakeholders for stricter project evaluations that will foster a quicker evolution pace. The lack of well-established and widely adopted solutions covering these areas represents a major roadblock for the adequate exploration of patient-level data. However, if future projects consistently adopt these overarching goals, personalised medicine will be one step closer.
More importantly, in addition to the research-specific evaluation outcomes, we must highlight the strange patterns behind large-scale project funding. Although IMI, NIH, and EC provide intensive financial support for research, what we witness is that the money is being used to create for-profit businesses and closed research datasets. Furthermore, funding agencies lack clear evaluation frameworks that properly assess the success of public investment into large-scale research.