Developing Clinical Phenotype Data Collection Standards for Research in Africa

Modern biomedical research is characterised by its high-throughput and interdisciplinary nature. Multiproject and consortium-based collaborations requiring meaningful analysis of multiple heterogeneous phenotypic datasets have become the norm; however, such analysis remains a challenge in many regions across the world. An increasing number of data harmonisation efforts are being undertaken by multistudy collaborations through either prospective standardised phenotype data collection or retrospective phenotype harmonisation. In this regard, the Phenotype Harmonisation Working Group (PHWG) of the Human Heredity and Health in Africa (H3Africa) consortium aimed to facilitate phenotype standardisation by promoting the use of existing data collection standards (hosted by PhenX), adapting existing data collection standards for appropriate use in low- and middle-income regions such as Africa, and developing novel data collection standards where relevant gaps were identified. Ultimately, the PHWG produced 11 data collection kits consisting of 82 protocols, of which 38 were existing protocols, 17 were adapted, and 27 were novel. The data collection kits will facilitate phenotype standardisation and harmonisation not only in Africa but also across the larger research community. In addition, the PHWG aims to feed back adapted and novel protocols to existing reference platforms such as PhenX.


Introduction
Biomedical data are increasingly used in high-throughput and interdisciplinary approaches, with data generation and knowledge discovery occurring at a faster pace than ever before, particularly in the areas of genomics and bioinformatics [1,2]. As a result, there has been an increase in multiproject, consortium-based collaborations and meta-analyses. Such efforts have the potential to expedite knowledge discovery in resource-limited research communities, particularly low- and middle-income countries, where funding for large sample collections is limited [3]. Advances in genomics research have raised the need to match large-scale phenotypic data, which incorporate social, environmental, and clinical factors, with genetic information. However, researchers still struggle to collect complete and valuable data in a standard manner, which hinders data integration and collaborative efforts [4]. Meaningful analysis of multiple heterogeneous phenotypic datasets is difficult, and at times impossible, without standardisation of the data elements prior to collection or retrospective harmonisation of the collected phenotypic data.
Phenotype standardisation involves harmonisation of the way researchers define and collect information about clinical phenotypes and environmental exposures. This is opposed to retrospective phenotype harmonisation, which involves the integration of phenotype data collected or defined in various ways and is often employed in meta-analyses and collaborative research efforts. Phenotype standardisation may be employed for various reasons. For example, to promote operability and the interoperability of data coming into a biomedical database, collaborative and multisite studies need to ensure that their collection methods and dataset formats are as similar as possible to reduce bias and assist with overall study management and quality control. One way to achieve this is by employing ontologies to standardise the definitions for trait and case classification, as previously shown [5]. In doing so, phenotype standardisation can also improve the quality of research outputs, as has been shown in pharmacogenetics and kidney disease research [6][7][8]. More recently, the NIH-funded PhenX (consensus measures for Phenotypes and eXposures) Toolkit (https://www.phenxtoolkit.org/) has emerged as a leading resource for well-established standard protocols for the collection of phenotype and exposure information. This publicly available online catalogue contains standard and recommended protocols to extract the maximum value from data collected for genomics research, although these protocols are also applicable to other fields [9][10][11]. Measures hosted by PhenX have been successfully employed to identify opportunities for cross-study collaboration through established databases such as the database of Genotypes and Phenotypes (dbGaP) and the Cancer Data Standards Registry and Repository (caDSR) [12]. Despite these advances, some protocols within PhenX may lack regional validity since they were largely developed from a Western perspective. For example, protocols related to the concepts of diet, nuclear family, and extended family may need to be adapted since these vary across the world.
The Human Heredity and Health in Africa (H3Africa) consortium was established to drive novel and innovative genomics research in Africa and build capacity on the continent [13]. As part of the consortium, the H3Africa Bioinformatics Network (H3ABioNet) was formed to support these efforts, with a particular focus on the production and sharing of FAIR (Findable, Accessible, Interoperable, and Reusable) data [14]. With the goal of facilitating the standardisation and interstudy sharing of phenotypic data within H3Africa and beyond, these initiatives established the H3Africa Phenotype Harmonisation Working Group (PHWG) in 2014. The PHWG aimed to build a core set of phenotypes to be collected within the consortium and develop protocols to facilitate the collection of these phenotypes. This work was later expanded to cover specific research or disease domains, which were represented in clusters across the consortium [15,16]. Here we describe the development and dissemination of standard phenotype data collection kits for genomics research, which are adapted to ensure their suitability for use in African settings.

Core Phenotypes.
The PHWG oversaw the establishment of a standard set of CORE PHENOTYPES relevant to the study of the diseases or traits explored in H3Africa. Given the diversity of diseases and populations studied within H3Africa (ranging from neonatal respiratory diseases to adult renal diseases to the transmission of zoonotic tuberculosis in rural pastoral populations), the initial set of phenotypes was extremely broad. The PHWG catalogued a set of CORE PHENOTYPES based on the number of H3Africa studies that measured them (each phenotype had to be measured in more than 3 studies). As a result, a set of 26 core phenotypes and 10 discretionary phenotypes was identified. These were selected based on their general interest and applicability in most study populations, relative ease of collection, and whether they were a primary phenotype in one of the H3Africa studies (to facilitate identification of common controls). To formulate these CORE PHENOTYPES, the working group chose to use the PhenX Toolkit to format the selected phenotypes, which initiated the development of the H3Africa Standard Case Report Form (Standard CRF). H3Africa grantees were encouraged, where possible, to measure the CORE PHENOTYPES in each research study so that the community could assess the power of the opportunity to measure a common set of phenotypes across a maximum number of projects. Following the initial development of the Standard CRF, the Working Group (WG) revised it to ensure its application to both paediatric and adult participants.
(3) automated export procedures for seamless data downloads to common statistical packages; and (4) procedures for data integration and interoperability with external sources [17]. PhenX also provides protocols that generate REDCap .xml files, and the software itself lends itself to easy design and sharing of data collection instruments. Before building the kits, some functional elements had to be considered in terms of the variables and project structure within REDCap:
(1) Each domain-specific kit needed to be able to stand alone with the CORE PHENOTYPES or be incorporated with any other combination of kits. To this end, all kits were built in a single REDCap project to eliminate the possibility of duplicate variable names before splitting them into separate domain-specific kits. In addition, because several kits have branching logic based on participants' biological sex and age (collected in the CORE PHENOTYPES), each domain-specific kit was packaged alongside the CORE PHENOTYPES but also made available as separate data collection instruments for import.
(2) Consistent coding for missing data must be applied across data collection kits; therefore, each module has a preset list of missing data codes, which can be modified by users.
(3) Variable naming conventions needed to be useful to the end-user; therefore, variable names relevant to what was being collected were employed to avoid nonsensical indexed naming.
(4) Consistent coding was an absolute necessity, so basic codes for common responses and formats were applied throughout.
(5) Cosmetically, all the forms were formatted with the same style, and the basic REDCap forms were maintained, without the use of external kits. Field embedding was limited to a few fields necessary to accommodate mobile data collectors.
The REDCap instruments for every kit are intended to allow studies to include additional data elements and can be adapted to suit the needs of any study. The only caution for developers is to consider the elements already included as recommended for collection and, when making changes, to consider the branching logic already in place (this is primarily with respect to the display of age or biological sex questions).
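Two of the design considerations above, unique variable names across kits and a consistent set of missing data codes, lend themselves to a simple automated check. The following is a minimal sketch in Python, not the procedure used by the PHWG. It assumes that a data dictionary has been read into a list of records with the hypothetical keys `kit`, `field_name`, and `choices`, that choices follow the REDCap-style `code, label | code, label` layout, and that the missing data codes shown are placeholders (the codes actually used in the kits are not stated here).

```python
from collections import defaultdict

# Placeholder missing-data codes (assumption); substitute the codes used by a given kit.
MISSING_CODES = {"-998": "Not applicable", "-999": "Unknown"}


def find_duplicate_variables(rows):
    """Return {variable name: sorted kits} for names that appear in more than one kit."""
    kits_by_name = defaultdict(set)
    for row in rows:
        kits_by_name[row["field_name"]].add(row["kit"])
    return {name: sorted(kits) for name, kits in kits_by_name.items() if len(kits) > 1}


def parse_choices(choices):
    """Parse a choices string such as '1, Yes | 0, No' into {'1': 'Yes', '0': 'No'}."""
    parsed = {}
    for item in choices.split("|"):
        code, _, label = item.strip().partition(",")
        parsed[code.strip()] = label.strip()
    return parsed


def fields_lacking_missing_codes(rows, missing_codes=MISSING_CODES):
    """List (kit, variable) pairs whose coded choices omit any of the missing-data codes."""
    lacking = []
    for row in rows:
        if not row.get("choices"):
            continue  # free-text or numeric field: no coded choices to check
        if not set(missing_codes) <= set(parse_choices(row["choices"])):
            lacking.append((row["kit"], row["field_name"]))
    return lacking


# Hypothetical data dictionary rows, for illustration only.
rows = [
    {"kit": "core", "field_name": "sex",
     "choices": "1, Male | 2, Female | -998, Not applicable | -999, Unknown"},
    {"kit": "core", "field_name": "age_years", "choices": ""},
    {"kit": "stroke", "field_name": "smoker", "choices": "1, Yes | 0, No"},
    {"kit": "kidney", "field_name": "smoker",
     "choices": "1, Yes | 0, No | -998, Not applicable | -999, Unknown"},
]

print(find_duplicate_variables(rows))
print(fields_lacking_missing_codes(rows))
```

Running such a check before splitting a combined project into separate kits would flag the reused name `smoker` and the one field that lacks the preset missing data codes.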

Ontology Mapping.
Once the data dictionaries were finalised, each variable in the data dictionaries was mapped to a relevant ontology code, where possible, to facilitate interoperability and reproducibility. Ontology mapping was conducted using the Ontology Lookup Service (OLS) and Zooma developed by the European Bioinformatics Institute (EBI). Domain-dedicated and well-maintained ontologies were preferred during this mapping. Ontologies were checked and revised to ensure complete correspondence of mapped variables, and the ontology code was incorporated into the data dictionary file for machine readability.
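As an illustration of how ontology codes might be incorporated into a data dictionary, the sketch below attaches a code to each variable from a curated lookup table and reports the variables that remain unmapped. It is a simplified assumption of the workflow and does not query OLS or Zooma; the record keys, the function names, and the example variables are hypothetical, and the two Human Phenotype Ontology identifiers are included only as examples that should be verified in OLS before use.

```python
import re

# Pattern for a compact ontology identifier of the form PREFIX:LOCAL_ID (assumed convention).
CURIE_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_.]*:[A-Za-z0-9_]+")


def is_curie(code):
    """Return True if the code looks like PREFIX:LOCAL_ID, e.g. 'HP:0000822'."""
    return CURIE_PATTERN.fullmatch(code) is not None


def annotate_with_ontology(rows, mapping):
    """Add an 'ontology_code' entry to each row; collect variables with no valid mapping."""
    annotated, unmapped = [], []
    for row in rows:
        code = mapping.get(row["field_name"], "")
        if code and not is_curie(code):
            code = ""  # discard malformed codes rather than propagate them
        annotated.append({**row, "ontology_code": code})
        if not code:
            unmapped.append(row["field_name"])
    return annotated, unmapped


# Hypothetical data dictionary rows and a manually curated mapping (illustrative only).
dictionary_rows = [
    {"field_name": "hypertension", "field_label": "History of hypertension"},
    {"field_name": "asthma", "field_label": "History of asthma"},
    {"field_name": "household_size", "field_label": "Number of people in household"},
]
curated_mapping = {"hypertension": "HP:0000822", "asthma": "HP:0002099"}

annotated_rows, unmapped_variables = annotate_with_ontology(dictionary_rows, curated_mapping)
print(unmapped_variables)
```

Keeping the unmapped variables in a separate list reflects the "where possible" qualification: variables without a suitable term stay in the data dictionary with an empty code and can be revisited during review.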

Review.
The final set of data collection kits (CORE PHENOTYPES and Domain-specific Kits) underwent multiple rounds of review, both internal and external, to ensure quality and usability. The external review involved distributing a survey to African experts in a particular field to review the selected phenotypes and the collection protocols. The experts included in the survey comprised researchers in leadership positions in a particular field and professionals involved in data collection and management. The survey allowed experts to review each element and associated protocol included in a toolkit, designate its applicability or lack thereof, and make additional recommendations to improve the toolkit. The internal review involved the revision of ontology mappings by a dedicated team, two rounds of quality and consistency checks using the established guidelines, and a review of the structure and organisation of the data dictionaries to ensure technical validity and usability.

Data Collection Kits.
In total, the PHWG produced 11 data collection kits, including the CORE PHENOTYPES and 10 domain-specific kits. All kits contain protocols adapted for use in both paediatric and adult participants. In total, they included 82 protocols, of which 38 were existing protocols (from PhenX or other existing initiatives, e.g., WHO, CIDRI-Africa, and the HIV Data Exchange Initiative), 17 were adapted, and 27 were novel protocols. Adapted protocols are protocols that are based on existing protocols but were edited (e.g., addition/removal of fields or rephrasing/restructuring of protocols) to improve their application in low- and middle-income regions and Africa, in particular. Examples of adapted protocols include those used to capture information on diet and household characteristics. Novel protocols are those for which it was not possible to identify appropriate existing protocols and were thus newly developed, such as asthma, pregnancy history, birth history, and family history protocols. An overview of the established data collection kits is illustrated in Figure 2 and summarised in Table 1.
Global Health, Epidemiology and Genomics
As illustrated in Figure 3, each kit is composed of multiple components, including a REDCap implementation file, a data dictionary to be implemented on the platform of choice, a case report form or data collection document, and a guideline to facilitate the use of the kit.

External Review.
In total, the data collection kits were reviewed by 45 individuals (experts) through a survey, with the majority of respondents coming from South Africa (16), Tunisia (5), and Nigeria (4). In addition, the largest group of respondents was classified as non-H3Africa experts (17), while most of the H3Africa respondents were associated with the Phenotype Harmonisation (15), Cardiovascular Disease, and Rare Diseases H3Africa WGs. Respondents most commonly identified themselves as experts in rare and developmental disorders (13), infectious diseases (13), and family history (12). Figure S1 (A and B) shows the complete breakdown of survey respondents.

External Use.
The CORE PHENOTYPES have been disseminated internally since 2018 and, to our knowledge, have been implemented by at least 9 H3Africa projects, mainly the projects initiated in the second round of H3Africa funding. These include studies involving exome sequencing and genome-wide association and epigenetic studies conducted on study populations from South Africa, Nigeria, Rwanda, Ethiopia, and Botswana and focused on the genetic underpinnings of infectious diseases and cardiovascular, mental, and developmental disorders [18][19][20][21][22][23]. The CORE PHENOTYPES served as a point of reference to facilitate retrospective phenotype harmonisation by the Cardiovascular H3Africa Innovation Resource (CHAIR) [24] and the Cardiometabolic Disorders in African-Ancestry Populations (CARDINAL) initiatives and to generate synthetic phenotype data by the Common Infrastructure for National Cohorts in Europe, Canada, and Africa (CINECA) initiative [25]. Since the majority of the Domain-specific Kits were only finalised and disseminated in 2022, we do not yet have reports on validation for these kits, although, to our knowledge, the Genome Tunisia Collaborative Alliance has used a slightly modified and translated version of the CORE PHENOTYPES and Family History kits.

Discussion
In this report, we have described the efforts of the H3Africa PHWG to establish standard data collection protocols for relevant biomedical and health research in Africa and in resource-limited regions globally. As far as we are aware, this is the first such standardisation effort at the African level. It has been carried out within the framework of the H3Africa projects with a vision to enable African researchers to access and use these protocols for newly developed projects, thus facilitating future data integration, phenotype harmonisation, and meta-analysis efforts, while also building a solid foundation for future collaborative efforts, both within the continent and cross-continentally. We encourage use and feedback throughout the continent to ensure continuous improvement of the existing recommendations.
As indicated, the established protocols not only encourage and promote standard data collection from the ground up but may also be useful in integrating phenotypic data from various sources retrospectively, i.e., harmonising phenotypes. Indeed, the kits are based on commonly collected phenotypes from specific research domains. When developing the protocols, the PHWG sought to find a balance between simplicity and completeness, and to cover various scenarios. The first step in many retrospective phenotype harmonisation efforts is typically to identify overlapping or common phenotypes across various projects and then to develop a data dictionary structured to allow transformation of data from different sources into a harmonised data structure [26]. In so doing, the data collection kits, and specifically standard data dictionaries, can facilitate two of the key steps in phenotype harmonisation.
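The two harmonisation steps described above, identifying phenotypes shared across projects and transforming source data into a common structure, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the method of any cited initiative: the project names, variable names, value codings, and the `min_projects` threshold are all hypothetical.

```python
from collections import Counter


def common_phenotypes(project_phenotypes, min_projects=2):
    """Return phenotypes collected by at least `min_projects` projects, sorted by name."""
    counts = Counter(
        phenotype
        for phenotypes in project_phenotypes.values()
        for phenotype in set(phenotypes)  # count each phenotype once per project
    )
    return sorted(p for p, n in counts.items() if n >= min_projects)


def harmonise(record, rules):
    """Map one source record onto target variables.

    `rules` has the form {target_variable: (source_variable, value_map or None)}.
    Values absent from a value_map become None so that gaps stay visible.
    """
    harmonised = {}
    for target, (source, value_map) in rules.items():
        value = record.get(source)
        if value_map is not None:
            value = value_map.get(value)
        harmonised[target] = value
    return harmonised


# Hypothetical projects and the phenotypes each one collected.
project_phenotypes = {
    "project_a": ["sex", "age", "bmi"],
    "project_b": ["sex", "age", "smoking"],
    "project_c": ["age", "bmi"],
}
shared = common_phenotypes(project_phenotypes, min_projects=2)

# Hypothetical rules mapping project_a's variables onto a standard data dictionary.
project_a_rules = {
    "sex": ("gender", {"M": 1, "F": 2}),
    "age": ("age_yrs", None),
}
harmonised_record = harmonise({"gender": "F", "age_yrs": 34}, project_a_rules)
print(shared, harmonised_record)
```

A standard data dictionary plays the role of the target side of `rules`: each contributing project only needs to describe how its own variables and codings map onto the shared definitions.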
The data collection kits were developed and released in adherence to FAIR principles and also promote FAIR principles in research projects which employ them. The kits were released on multiple platforms, including the H3ABioNet website, GitHub, and ZivaHub/Figshare, making them freely accessible to any interested user and allocating them unique identifiers to facilitate their findability. In addition, the kits provide a REDCap-based XML file which can be implemented directly on REDCap, which is broadly available for academic use, along with an Excel file which facilitates viewing for novice users and enables users to transform the file for the data management platform of their choice. The kits encourage interoperability and reproducibility in that they are standard recommendations based on existing standards and, additionally, employ ontology codes for machine-readable interoperability. They are also released with associated guidelines for reproducibility in practice.
Regarding the assessment of the validity of data collection kits, we had limited opportunity to measure the validity of the domain-specific kits in practice within the scope of H3Africa, given the time of development and final release of these kits. To counter this, we sought to assess and ensure the validity of the kits in other ways; therefore, we encouraged collaboration and worked alongside field experts to develop each of the kits, including, as previously illustrated, the Kidney Disease and Stroke kits [15,16]. We also sought additional external feedback on the elements and data collection protocols included in the kits by surveying field experts. Finally, where existing standard protocols were found and applicable, such as those hosted by PhenX, we encouraged the use of these standards, which have previously undergone broad validity assessments. This was to avoid the development of new but overlapping protocols. Despite these measures, we recognise that the kits may have practical limitations not considered during the development and thus encourage feedback on the kits from the user community. Feedback may be provided through multiple channels, including submitting issues on GitHub and the H3ABioNet Helpdesk [27]. These also serve as an important channel through which we can track implementation of the kits. Although we encourage citation of kits when employed, many projects may not reference data collection forms in their methods section, representing a tracking barrier.
In contrast to the domain-specific kits, the CORE PHENOTYPES have been available to H3Africa for a much longer period. H3Africa grantees were encouraged, where possible, to use the Standard CRF as a guide to collect a common set of phenotypes to be measured in each research participant, although it was recognised that this was not always realistic, particularly because (1) there are costs in terms of time and effort to collect additional phenotypes, (2) recipients are not funded to collect phenotypes other than those in the original grant application, (3) some phenotypes have little or no relevance to certain populations and collection settings, and (4) some grants had already finalised their CRFs or were already recruiting subjects, which limits the possibilities for adding or revising measures. Despite the abovementioned considerations, several H3Africa projects did implement the CORE PHENOTYPES within their research projects and data collection processes, particularly the projects which were funded during H3Africa's second cycle (once the CORE PHENOTYPES were already established and released). Feedback from these projects was extremely positive regarding simplifying the CRF design process, the simplicity of use, and the comprehensive nature of design. As a by-product of collecting the CORE PHENOTYPES, various groups found great collaborative potential with regard to simple retrospective phenotype harmonisation, including the Mental Health and Cardiovascular Disease working groups, the latter of which developed a database for phenotypes harmonised to many of the CORE PHENOTYPES [24].
Although the kits have been successfully implemented by numerous studies, we recognise that existing barriers may prevent the more extensive implementation of the data collection kits. One such barrier is a lack of technical capacity to implement the kits electronically. This is one of the main reasons why the kits also include a CRF which promotes paper-based collection, the other being that research studies in low-income settings may often still rely on paper-based collection. To address such capacity gaps, we also aim to provide training materials exhibiting implementation of the kits on technical platforms. The other foreseeable barrier is that research initiatives based on existing cohorts are often hesitant to switch data collection methods once data collection has previously been conducted in a different manner, as this presents data integration issues downstream. To address this barrier, we also need to promote the kits as tools for retrospective phenotype harmonisation, as they have previously been used by the Mental Health and Cardiovascular Disease working groups.
In the future, the primary efforts of the PHWG will be focused on promoting the use of the developed data collection kits on a broader scale, particularly in Africa. This may be achieved through endorsement by local research and funding bodies but also endorsement from previous users. In addition, the PHWG will be maintaining the data collection kits so that they remain relevant in the future. One of the key goals for the project will be to feed back the adapted and novel protocols developed to PhenX for incorporation into the catalogue. In this manner, a broader set of users may be reached through an established body, which will, ultimately, facilitate the broad goals of the project. In addition, we will also investigate integration and interoperability with existing common data models such as the Observational Medical Outcomes Partnership (OMOP) model to increase broad usability [28].

Conclusion
The PHWG has successfully developed a series of easy-to-use data collection kits that cover a range of biomedical research fields. These kits should facilitate phenotype standardisation and harmonisation efforts on the African continent and in the larger user community. The standards can form the basis for data models for post hoc data harmonisation. In addition to the abovementioned benefits, the data collection kits will also promote FAIR data principles, ultimately enabling data integration and interoperability. Finally, as mentioned previously, several novel data collection protocols were developed during these efforts, where relevant gaps were identified. These, along with protocols which were adapted from already established protocols, will be submitted for (re)incorporation into the PhenX platform.

Figure 1: Development process of core phenotypes and domain-specific kits.

Figure 2: Overview of data collection kits developed by H3Africa. The number of protocols indicated per kit is given in brackets; CVD: cardiovascular disease.

Table 1: Overview of structure of data collection toolkits developed by H3Africa.