On the Heterogeneity of Existing Repositories of Movements Intended for the Evaluation of Fall Detection Systems

Due to the serious impact of falls on the autonomy and health of older people, the investigation of wearable alerting systems for the automatic detection of falls has gained considerable scientific interest in the field of body telemonitoring with wireless sensors. Because of the difficulties of systematically validating these systems in a real application scenario, Fall Detection Systems (FDSs) are typically evaluated by studying their response to datasets containing inertial sensor measurements captured during the execution of labelled nonfall and fall movements. In this context, during the last decade, numerous publicly accessible databases have been released aiming at offering a common benchmarking tool for the validation of the new proposals on FDSs. This work offers a comparative and updated analysis of these existing repositories. For this purpose, the samples contained in the datasets are characterized by different statistics that model diverse aspects of the mobility of the human body in the time interval where the greatest change in the acceleration module is identified. By using one-way analysis of variance (ANOVA) on the series of these features, the comparison shows the significant differences detected between the datasets, even when comparing activities that require a similar degree of physical effort. This heterogeneity, which may result from the great variability of the sensors, experimental users, and testbeds employed to generate the datasets, is relevant because it casts doubt on the validity of the conclusions of many studies on FDSs, since most of the proposals in the literature are only evaluated using a single database.


Introduction
Falls, in particular falls among the elderly, are a major social concern in current societies. The World Health Organization has reported that 646,000 persons die from falls each year worldwide, so they represent the second leading cause of unintentional injury deaths after car accidents [1]. In this respect, it has been shown that a rapid response after a fall can lower the risk of hospitalization by 26% and the death rate by 80% [2]. As a consequence, during the past decade, great research efforts have been devoted to the development of efficient and low-cost technologies for automatic Fall Detection Systems (FDSs).
Falls are generically and ambiguously defined as a loss of balance or accident that causes an individual to rest involuntarily on the ground or other lower level [3]. Most unintentional falls can be easily distinguished from other movements by human visual inspection. However, this task is not so evident when it is carried out by an automatic system. Accordingly, the problem of fall detection has been addressed through different approaches, which can be clustered into two broad strategies: context-aware and wearable systems. Under the first strategy, an FDS can be deployed by placing video cameras and other ambient sensors, such as pressure sensors and microphones, in the vicinity of the user to be monitored. However, in most practical cases, the mobility of the patients can be tracked in a more adaptive and cost-effective way by employing lightweight sensors that can be directly carried on the clothes or as another garment or a piece of jewelry (e.g., as a pendant). The decreasing costs and widespread popularity of electronic wearables, especially those intended for sporting activities, have fostered the adoption of this type of portable solution to investigate and implement FDSs.
Under a wearable FDS, a detection algorithm continuously analyzes the signals captured by the sensors worn by the user to identify any anomalous mobility pattern that can be linked to the occurrence of a fall. As soon as a fall is presumed, the FDS forwards an alerting message (phone call or SMS) to a remote monitoring point (medical premises or a relative of the patient). In the vast majority of wearable architectures, the detection decision is based on the measurements provided by an accelerometer and, in some cases, a gyroscope (integrated in the same Inertial Measurement Unit, IMU), which are attached to a certain part of the user's body. The general goal of an FDS is to simultaneously minimize both the number of falls that remain unnoticed and the generation of false alarms, that is to say, conventional movements or Activities of Daily Living (ADLs) that are misinterpreted as falls. A crucial element in the investigation of a wearable FDS is the procedure by which the detection algorithm is methodically evaluated to check its actual capacity to discriminate ADLs from falls.
In almost all works in the related literature, FDSs are tested against a set of labelled movements that include both ADLs and falls. In order to repeat the analysis while changing the detection techniques and the parameterization of the algorithms, the movements are prerecorded in files that contain the corresponding timestamps and measurements gathered by the inertial sensors. The quality and representativeness of the employed dataset of movements are key aspects to assess the validity of the evaluation. In this regard, it has been estimated that it is necessary to record between 70,000 and 100,000 days to collect about 100 actual falls by continuously monitoring persons aged over 65 [4]. Owing to the obvious practical difficulties of monitoring actual falls experienced by elderly people, the general procedure followed in the literature to evaluate a fall detection algorithm is to use datasets of activity traces that are intentionally created by experimental users. For this purpose, the participants in the experiments normally execute a series of predetermined movements while they carry the corresponding wearable sensors on one or several positions of their bodies. These movements typically incorporate different types of conventional ADLs (sitting, climbing stairs, picking up objects from the floor, etc.) and falls, which are mimicked taking into account different aspects, such as the direction (lateral and backwards) or the cause of the fall (slipping, stumbling, and tripping).
In almost all initial studies on FDSs, a group of volunteers was recruited to generate a specific dataset, which was employed for the evaluation of the proposed architecture. These datasets were rarely released by the authors to enable their use by other researchers to validate new algorithms. To tackle this lack of a benchmarking framework, a nonnegligible number of datasets have recently been produced and made publicly available on the Web to cross compare FDSs with a common reference. The use of normally young and healthy volunteers that emulate falling in a systematic way in a 'controlled' scenario, as surrogates for actual falls of older persons, is still a controversial issue in the field of FDSs. By tracking two groups of persons totaling 16 older people for six months, Kangas et al. conducted a study aiming at comparing the dynamics of real-life falls of older people with those simulated by middle-aged volunteers [5]. From the results, the authors concluded that the features of the acceleration data captured during accidental falls follow a similar pattern to those measured from emulated falls, although some significant differences were detected (for example, in the timing of the different phases of the falls or in the acceleration magnitude measured during the impact against the floor). In a similar study [6], Klenk et al. compared the actual backward falls suffered by four elderly people to those mimicked by 18 young individuals. Results seem to indicate that the 'compensation' strategies followed by the subjects during unintentional falls to avoid the damage of the impact introduce relevant differences (e.g., jerkier movements with higher changes in the acceleration) with respect to emulated falls.
Besides, Bagalà et al. [7] have shown that the efficacy of certain algorithms successfully tested against datasets of emulated falls may notably decrease when they are evaluated with traces captured in a real scenario. In other works, such as that by Sucerquia et al., the ability of the proposed FDS to avoid false alarms is evaluated by monitoring elderly people that carry the wearable detection system during their daily routines. In these cases, the sensitivity of the detector cannot be computed unless a real fall occurs during the monitoring period. A similar strategy is described by Aziz et al. in [8]. These authors report that the performance of an FDS based on a Support Vector Machine classifier deteriorates in terms of false alarms when it is employed by a community of 19 older adults. In this scenario, 2 out of 10 actual falls suffered by the participants were not identified by the system.
In any case, these studies are based on the analysis of a very small number of real falls. In fact, to the best of our knowledge, the repository provided by the FARSEEING European project [9] is the only dataset that provides inertial measurements of real-world falls of elderly patients, although again the number of samples that are publicly available, only 22, is quite limited. Thus, this work mainly focuses on those datasets grounded on emulated falls and ADLs (although in some cases, ADLs were captured not by the execution of predetermined activities in a laboratory but by monitoring the participants during their daily routines).
On the other hand, although the use of public and well-known datasets is gaining increasing acceptance in the literature, most studies base their validation on the use of just one or, at most, two repositories. So, a question arises about the correctness of extrapolating the results obtained with a particular dataset when another repository is considered.
The goal of this study is to recap and compare the characteristics of the existing public repositories of inertial measurements intended for the assessment of FDSs. The paper is organized as follows. Section 2 reviews the available datasets, summarizing their basic properties and the testbeds (employed sensors, characteristics of the experimental users, and typology of the movements) that were deployed to generate the data. The section also describes the criteria to select the datasets to be compared. Section 3 presents the statistical features employed to characterize the mobility of the traces of the datasets, while Section 4 compares the datasets by showing the results of the analysis of variance (ANOVA) of these characteristics. The main conclusions are summarized in Section 5.

Revision and Selection of Public Datasets
As mentioned above, a key problem for the development of an automatic fall detection architecture is the need for trustworthy repositories that can be employed to thoroughly evaluate the accuracy of the detection decisions, i.e., the capacity of the system to correctly identify ADLs and falls by simultaneously avoiding false alarms and undetected falls. Table 1 presents a comprehensive list of the authors, references, institutions, and year of publication of the existing datasets intended for the study of wearable systems. All these datasets comprise the measurements collected by the inertial sensors worn by the selected volunteers during their daily life or while performing a preconfigured set of movements in a controlled testbed. In this review, we do not include those available databases of inertial measurements (such as those presented in [10] or [11]) that are envisioned for other types of HAR (Human Activity Recognition) systems but do not incorporate falls among the represented activities.
In the case of Context-Aware Systems (CAS), different research groups have also published datasets containing the measurements captured by fixed video cameras, motion and depth sensors (such as Kinect), and/or other ambient sensors (vibration detectors, pressure, infrared, and Doppler sensors, and near-field imaging systems), while a set of volunteers emulate falls and ADLs in a predefined testbed. Among these databases, we can mention the following: CIRL Fall Recognition [12], Le2i FDD [13], SDUFall [14], EDF&OCCU [15], eHomeSeniors [16], Multiple Camera Fall [17], KUL High-Quality Fall Simulation [18], UTA [19], FUKinect-Fall [20], and MEBIOMEC [21] datasets, as well as the infrared video clips described by Mastoraky and Makris in [22] and the sequences provided by Adhikari et al. in [23]. These datasets are out of the scope of this paper, although we do consider those databases, such as UR Fall or UP Fall, which were conceived to test hybrid CAS-type and wearable FDSs, i.e., systems that make their detection decision from the joint analysis of video images (and/or magnitudes collected by environmental sensors) and measurements from inertial sensors carried by the users. The number of samples, the considered typologies of the emulated ADLs and falls, and the duration of the traces (i.e., the duration of the recorded movements), as well as the basic characteristics of the participants (number, gender, weight, and age range) of each dataset, are enumerated in Table 2. Table 2 illustrates the great heterogeneity of criteria used to define the experimental framework where the samples were captured, both with regard to the selection of the test subjects and the number and type of simulated movements. In some repositories, such as tFall, the ADLs were not emulated (scheduled and executed in a laboratory) but obtained by tracking the real-life movements of the subjects during a certain period of time.
As expected, in most cases, the movements were exclusively carried out by volunteers under the age of 60. In the few testbeds in which older subjects participated, almost none of the older participants simulated any fall, so their samples are limited to examples of ADLs. Table 3 summarizes, in turn, the type and basic properties (sampling rate and range) of the sensors employed to generate the repositories. The table also indicates the body position on which the inertial sensors were located or attached during the experiments. As can be observed from the table, although there are cases where up to seven sensing positions have been considered, most datasets include just a single measuring point. In all cases, the sensor embeds, at least, an accelerometer and, less often, a gyroscope, a magnetometer, and/or an orientation sensor. In any case, the table again shows the variability of the characteristics of the sensors (e.g., with sampling rates ranging from 10 to 200 Hz) and of the body locations considered to collect the measurements in the different testbeds.
In the recent literature about FDSs, the use of some of these public datasets as benchmarking tools is becoming more and more common. However, in most studies, just one or, at most, two repositories are utilized to evaluate the effectiveness of the proposed detection algorithm. Khojasteh et al. [24] employed four datasets, although two of them (DaLiac [25] and Epilepsy [26] databases) do not encompass falls, which only allows assessing the capability of the system to avoid misinterpreting ADLs as falls. As a consequence, the conclusions of most works are mainly based on the results obtained when the proposed system is tested against a very particular set of samples.
Given the huge diversity of the experimental setups in which the datasets were generated, it is legitimate to question whether the conclusions achieved with a certain repository can be extrapolated to scenarios with a different typology of subjects, movements (simulated or not), or to a different parameterization of the inertial sensors.
In this context, Medrano et al. utilized three repositories (tFall, DLR, and MobiFall) in [27] to show that the effectiveness of an FDS based on a supervised machine learning strategy remarkably diminishes when the discrimination algorithms are tested against a database different from that utilized for training. In a more recent work [28], we concluded that even when the algorithm is trained and tested with traces of the same datasets and users, the quality metrics of the classification process may differ notably. In particular, we analyzed the performance of a deep learning classifier (a convolutional neural network) when it is individually trained and evaluated as a fall detector with 14 of the repositories presented in Table 1. Results clearly indicated that the performance dramatically varies depending on the dataset to which the detector is applied.
In the following sections, we thoroughly analyze the statistical properties of a representative number of these datasets to get a deeper understanding of the existing divergences between these repositories.
Journal of Healthcare Engineering

Selection of the Compared Datasets.
In order to compare the properties of the signals provided by different repositories on equal terms, we only select those datasets that contain inertial measurements captured on the same position. In particular, in a first analysis, we focus on those traces collected on the waist, as several studies [53][54][55][56][57] have shown that this is one of the most adequate positions to place an inertial sensor aimed at characterizing the general dynamics of the body. This choice benefits from the fact that the waist is near the center of mass of the human body in a standing posture. When compared to other placements such as a limb or the chest, the waist also provides better ergonomics, as it may enable the user to carry the wearable sensor in an almost seamless way (e.g., attached to a belt).
To ensure that the analysis is performed with a minimum number of samples, we only take into account those datasets with, at least, 300 samples. Consequently, we discard UR, FARSEEING, LDPA, and TST datasets, although they include traces captured with the sensor located on the waist. For a similar reason, we exclude the SMotion dataset [45], which is actually aimed at assessing fall risk and not fall detection systems, as it only contains 5 falls.
Finally, the Graz UT OL dataset is also discarded because of the small range of the employed accelerometer (±2g), which can prevent a proper representation of the acceleration peaks caused by falls (typically exceeding 4-5g).
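The selection criteria described above (same sensing position, a minimum of 300 samples, and an accelerometer range wider than ±2 g) can be written as a simple filter. The following is only a minimal sketch: the `DatasetInfo` record, its field names, and the example entries are hypothetical placeholders, not values taken from Tables 2 and 3, and using ±2 g as the exclusion limit merely mirrors the single case mentioned in the text.

```python
from dataclasses import dataclass


@dataclass
class DatasetInfo:
    name: str             # hypothetical identifier
    positions: tuple      # body positions that carry an inertial sensor
    n_samples: int        # number of traces (ADLs plus falls)
    accel_range_g: float  # accelerometer full-scale range, in +/- g


def select_datasets(candidates, position="waist", min_samples=300,
                    too_small_range_g=2.0):
    """Return the names of the datasets that pass the three criteria."""
    selected = []
    for d in candidates:
        if position not in d.positions:
            continue  # no sensor on the position under study
        if d.n_samples < min_samples:
            continue  # too few traces for the statistical analysis
        if d.accel_range_g <= too_small_range_g:
            continue  # range too narrow to capture impact peaks
        selected.append(d.name)
    return selected


# Hypothetical entries, for illustration only.
examples = [
    DatasetInfo("example_a", ("waist",), 1200, 8.0),
    DatasetInfo("example_b", ("wrist",), 900, 8.0),
    DatasetInfo("example_c", ("waist", "chest"), 120, 16.0),
    DatasetInfo("example_d", ("waist",), 500, 2.0),
]
print(select_datasets(examples))
```

The same function can be reused for the wrist by passing `position="wrist"`.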

Selection of the Characteristics for the Analysis
Note (Table 3). A: accelerometer, G: gyroscope, O: orientation measurements, M: magnetometer, SP: smartphone. (1) TST, UR, CMDFALL, and UP datasets also include the measurements (RGB, depth, and skeleton information) of Kinect sensors or video cameras, not considered in this table. (2) n.i.: not indicated by the authors.
As in most works in the literature, the study will be based on the signals collected by the triaxial accelerometers ($A_X[i]$, $A_Y[i]$, and $A_Z[i]$ for the i-th measurement), which are provided by the datasets. Future studies should contemplate the analysis of the signals collected by the gyroscope and, secondarily, the magnetometer. Nevertheless, it is still under discussion whether the information provided by the gyroscope significantly improves the success rate of methods merely based on the accelerometry signals (see [58] for a review of this issue). During the free-fall period before the impact, a collapse typically prompts a sudden drop of the acceleration components, which is interrupted by a sharp peak of the acceleration magnitude (sometimes followed by several secondary peaks) produced by the collision against the floor [59]. Therefore, to define a common basis to compare the traces, which present a wide variety of lengths, we focus on the interval of every measurement sequence where the highest difference between the "valleys" (decays) and peaks of the acceleration components is detected. Once this analysis interval is extracted, the rest of the trace is ignored. For this purpose, we set up a sliding observation window of duration $t_W = 0.5$ s, consisting of $N_W$ samples:
$$N_W = t_W \cdot f_s$$
where $f_s$ indicates the sampling rate of the sensors.
To find the analysis interval within each trace, we follow the procedure presented in [60].
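Thus, for each possible observation window within the sequences, we calculate the magnitude of the maximum variation of the acceleration:
$$A_{w_{diff}}[m] = \sqrt{\left(A_{X_{max}}[m]-A_{X_{min}}[m]\right)^2+\left(A_{Y_{max}}[m]-A_{Y_{min}}[m]\right)^2+\left(A_{Z_{max}}[m]-A_{Z_{min}}[m]\right)^2}$$
where $A_{X_{max}}[m]$, $A_{Y_{max}}[m]$, and $A_{Z_{max}}[m]$ designate the maximum values (and $A_{X_{min}}[m]$, $A_{Y_{min}}[m]$, and $A_{Z_{min}}[m]$ the minimum values) of the components measured by the accelerometer in the x-, y-, and z-axis, respectively, in the m-th sliding observation interval. Thus, for the x-axis, we have
$$A_{X_{max}}[m] = \max\{A_X[i] : i \in [m, m+N_W-1]\}, \qquad A_{X_{min}}[m] = \min\{A_X[i] : i \in [m, m+N_W-1]\}$$
The analysis or observation interval will correspond to the subset of consecutive samples $[k_o, k_o+N_W-1]$ in which this variation is highest:
$$A_{w_{diff}(max)} = A_{w_{diff}}[k_o] = \max\{A_{w_{diff}}[m] : m \in [1, N-N_W+1]\}$$
where $k_o$ is the index of the first sample of the analysis interval while $N$ denotes the cardinality (number of samples for each axis) of the trace.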
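The window search can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of [60]: the function names are hypothetical, rounding $t_W \cdot f_s$ to the nearest integer is an assumption, and the variation within each window is taken as the Euclidean norm of the per-axis peak-to-valley ranges, which is one reading of "the magnitude of the maximum variation of the acceleration".

```python
import math


def window_length(fs_hz, t_w=0.5):
    """Samples per observation window (rounding is an assumption)."""
    return max(1, round(t_w * fs_hz))


def find_analysis_interval(ax, ay, az, fs_hz, t_w=0.5):
    """Locate the window with the largest peak-to-valley variation.

    Returns (k_o, n_w, variation), where k_o is the 0-based index of the
    first sample of the analysis interval (the text counts from 1).
    """
    n = len(ax)
    n_w = min(window_length(fs_hz, t_w), n)
    best_start, best_value = 0, -1.0
    for m in range(n - n_w + 1):
        # Peak-to-valley range of each axis inside the m-th window.
        spans = []
        for axis in (ax, ay, az):
            window = axis[m:m + n_w]
            spans.append(max(window) - min(window))
        value = math.sqrt(sum(s * s for s in spans))
        if value > best_value:  # strict '>' keeps the earliest maximum
            best_start, best_value = m, value
    return best_start, n_w, best_value


# Toy trace sampled at 4 Hz: a single abrupt swing on the x-axis.
ax = [0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
ay = [0.0] * 10
az = [1.0] * 10
print(find_analysis_interval(ax, ay, az, fs_hz=4))
```

This direct search costs on the order of N times N_W operations per trace, which is acceptable for traces of a few thousand samples; a sliding maximum/minimum structure would be the natural optimization for longer recordings.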
In order to compare the different datasets, we extract the acceleration components of the signals during the analysis interval to compute the following twelve statistical features for all the traces.
(1) The mean Signal Magnitude Vector ($\mu_{SMV}$), which gives an idea of the average mobility experienced by the body during the analysis interval. This mean can be calculated as
$$\mu_{SMV} = \frac{1}{N_W}\sum_{i=k_o}^{k_o+N_W-1} SMV[i]$$
where $SMV[i]$ represents the Signal Magnitude Vector (SMV) of the acceleration for the i-th sample:
$$SMV[i] = \sqrt{A_X[i]^2 + A_Y[i]^2 + A_Z[i]^2}$$
(2) The standard deviation ($\sigma_{SMV}$) of $SMV[i]$, which describes the variability of the acceleration during the observation window.
(3) The mean absolute difference ($\mu_{SMV_{diff}}$) between two consecutive samples of the acceleration module. This parameter is useful as it informs about the brusque fluctuations of the acceleration during a fall [75].
(4) The mean rotation angle ($\mu_\theta$), which may help to detect the changes of the body orientation caused by a fall [75].
(5) The acceleration component in the direction perpendicular to the floor plane is strongly determined by gravity. Thus, the tilt of the body provoked by a fall usually triggers a noteworthy alteration of the acceleration components that are parallel to the floor plane when the individual remains static in an upright posture. To characterize the alteration of the body position with respect to the standing position, we also compute the mean magnitude ($\mu_{Ap}$) of the vector formed by these two acceleration components, whose axes depend on the placement and orientation of the accelerometer in each dataset.
(6) The aforementioned value of $A_{w_{diff}(max)}$, which gives an insight into the range of the variability of the three acceleration components.
(7) The peak or maximum ($SMV_{max}$) of the SMV, as a key element to describe the violence of the impact against the floor:
$$SMV_{max} = \max\{SMV[i] : i \in [k_o, k_o+N_W-1]\}$$
(8) The "valley" or minimum ($SMV_{min}$) of the SMV, to characterize the phase of free-fall:
$$SMV_{min} = \min\{SMV[i] : i \in [k_o, k_o+N_W-1]\}$$
(9) The skewness of $SMV[i]$ ($\gamma_{SMV}$), which describes the symmetry of the distribution of the acceleration.
(10) The Signal Magnitude Area (SMA) [43], a widely used feature to evaluate physical activity.
(11) Energy ($E$). Since falls are associated with rapid and energetic movements, we also consider the sum of the energy ($E$) estimated in the three axes during the observation interval [72].
(12) The mean of the autocorrelation function ($\mu_R$) of the acceleration magnitude captured during the observation interval, where $R[m]$ represents the m-th lag value in the series of the normalized autocorrelation coefficients of $SMV[i]$. This feature is taken into account because the acceleration during a conventional activity normally exhibits a certain degree of self-correlation that could be impacted by the unexpected movements caused by a fall.
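To make the feature set concrete, the sketch below computes the twelve statistics for one analysis interval using only the Python standard library. Several details are assumptions rather than the authors' exact expressions: the population form of the standard deviation and skewness, the rotation angle taken between consecutive acceleration vectors, the energy taken as the sum of squared samples over the three axes, the biased estimator for the normalized autocorrelation, and the choice of x and z as the axes parallel to the floor (exposed as the hypothetical `horizontal` parameter). The function and key names are likewise illustrative.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def window_features(ax, ay, az, horizontal=("x", "z")):
    """Features of one analysis interval (equal-length lists, n >= 2)."""
    n = len(ax)
    smv = [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)]
    mu = _mean(smv)
    sigma = math.sqrt(_mean([(v - mu) ** 2 for v in smv]))  # population form
    diffs = [abs(smv[i + 1] - smv[i]) for i in range(n - 1)]

    # Angle between consecutive acceleration vectors (assumed definition).
    angles = []
    for i in range(n - 1):
        norm = smv[i] * smv[i + 1]
        if norm > 0:
            dot = ax[i] * ax[i + 1] + ay[i] * ay[i + 1] + az[i] * az[i + 1]
            angles.append(math.acos(max(-1.0, min(1.0, dot / norm))))

    # Components parallel to the floor in an upright posture; the axes
    # depend on the sensor orientation, so they are passed as a parameter.
    axes = {"x": ax, "y": ay, "z": az}
    h1, h2 = axes[horizontal[0]], axes[horizontal[1]]

    # Per-axis peak-to-valley ranges inside the interval.
    spans = [max(a) - min(a) for a in (ax, ay, az)]

    # Normalized autocorrelation of SMV for lags 0..n-1 (biased estimator).
    if sigma > 0:
        acf = [sum((smv[i] - mu) * (smv[i + m] - mu) for i in range(n - m))
               / (n * sigma ** 2) for m in range(n)]
    else:
        acf = [0.0]  # constant signal: correlation left undefined, set to 0

    return {
        "mean_smv": mu,
        "std_smv": sigma,
        "mean_abs_diff": _mean(diffs),
        "mean_rotation_angle": _mean(angles) if angles else 0.0,
        "mean_horizontal": _mean([math.sqrt(a * a + b * b)
                                  for a, b in zip(h1, h2)]),
        "max_variation": math.sqrt(sum(s * s for s in spans)),
        "smv_max": max(smv),
        "smv_min": min(smv),
        "skewness": (_mean([(v - mu) ** 3 for v in smv]) / sigma ** 3
                     if sigma > 0 else 0.0),
        "sma": _mean([abs(x) + abs(y) + abs(z)
                      for x, y, z in zip(ax, ay, az)]),
        "energy": sum(x * x + y * y + z * z for x, y, z in zip(ax, ay, az)),
        "mean_autocorr": _mean(acf),
    }
```

The input lists are expected to contain only the samples of the analysis interval, so the function would normally be applied to the slice selected by the window search rather than to the whole trace.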

Comparison and Discussion of the Datasets
For an initial comparison of the statistical features of the different datasets, we utilize boxplots (or box-and-whisker plots), a widespread and intuitive visual tool, to display the data distribution in a standardized manner. Figures 1-12 show the boxplots of the twelve statistics when they are separately calculated for the ADLs and the fall movements of the seven datasets under study. In the graphs, for each dataset and type of activity (ADL/fall), the median of the corresponding statistic is denoted by the central line in each box, while the 25th and 75th percentiles are indicated by the lower and upper limits of the box. The dotted lines or "whiskers" represent an interval over and under the box of 1.5 IQR (the interquartile range, i.e., the height of the box between the 25th and 75th percentiles). All the data outside these margins (box and whiskers) are considered to be outliers and marked as red crosses in the figures.
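The quantities drawn in each boxplot can be computed directly. The sketch below is illustrative only: the function names are hypothetical, and the linear-interpolation percentile used here is one of several common conventions, so the quartiles may differ slightly from those produced by the plotting tool behind Figures 1-12. The helper `iqr_boxes_disjoint` checks whether the boxes of two samples fail to overlap.

```python
def percentile(sorted_values, p):
    """Percentile by linear interpolation; p in [0, 100]."""
    k = (len(sorted_values) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def boxplot_summary(values):
    """Median, quartiles, whisker ends, and outliers of one sample."""
    data = sorted(values)
    q1, median, q3 = (percentile(data, p) for p in (25, 50, 75))
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0],    # lowest point within 1.5 IQR of the box
        "whisker_high": inside[-1],  # highest point within 1.5 IQR of the box
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


def iqr_boxes_disjoint(sample_a, sample_b):
    """True when the 25th-75th percentile boxes of two samples do not overlap."""
    a, b = boxplot_summary(sample_a), boxplot_summary(sample_b)
    return a["q1"] > b["q3"] or b["q1"] > a["q3"]


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary["median"], summary["outliers"])
```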
The graphs show the high inter- and intravariability of the statistics of the traces. Regarding the intravariability, within each repository, the analysis identifies a wide IQR and a high number of outliers for almost all the characteristics, in particular for the ADLs. Similarly, when the boxplots of the different databases are compared, a huge heterogeneity is also present.
This intervariability among datasets is also noticeable (both for ADLs and falls) even in the case of a basic feature, such as the mean acceleration magnitude during the observation window (which is assumed to be linked to the period of greatest alteration in the body acceleration). For all the considered statistics and for both ADLs and falls, we can observe several pairs of datasets where the IQR intervals (which concentrate 50% of the samples) do not even overlap, i.e., the 25th percentile of the corresponding feature of a certain dataset exhibits a higher value than the 75th percentile of the same feature in a different dataset. In addition, the magnitude of the IQR strongly differs from one repository to another. In some cases, the estimated mean of certain statistics in one dataset is several times higher than in others. This is more visible for those characteristics associated with the loss of verticality: the mean rotation angle ($\mu_\theta$) and the mean magnitude of the acceleration components ($\mu_{Ap}$) perpendicular to the vertical axis while standing. The statistical significance of these divergences among the repositories can be systematically confirmed by an ANOVA (analysis of variance) test. Figures 13 and 14 depict the post hoc multiple comparison of the estimated means of the twelve features based on the results achieved by a one-way (or single-factor) ANOVA. In the figures, the circular marks indicate the mean, whereas the corresponding comparison interval for a 95% confidence level is represented by the line extending out from the symbol. The group means are considered to be significantly different if the intervals determined by the lines are disjoint.
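As a reminder of what the one-way ANOVA measures, the sketch below computes its F statistic for one feature, with one group of values per dataset. It is a minimal illustration with hypothetical names and toy numbers: it does not reproduce the post hoc multiple comparison shown in Figures 13 and 14, and it omits the p-value because the Python standard library has no F distribution (SciPy's `scipy.stats.f_oneway` returns both the statistic and the p-value).

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA.

    groups: list of lists, one list of feature values per dataset.
    Returns (f_stat, df_between, df_within).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variability of the samples around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between, df_within = k - 1, n_total - k
    if ss_within == 0:
        return float("inf"), df_between, df_within
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy values only; they are not taken from any of the repositories.
print(one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]))
```

A large F value indicates that the spread between the dataset means is large relative to the spread within each dataset, which is the situation the comparison intervals in the figures make visible.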
Each subgraph in these two figures shows, in red, those datasets that have a characteristic with a significantly different mean than that of the fall or ADL movements of another dataset (marked in blue), which is taken as a reference by way of an example. As can be seen in the figure, there are very few cross comparisons, indicated in grey, in which the null hypothesis is not rejected as the differences between the means of the characteristics are not significantly relevant.
This inconsistency in the characterization of the different datasets is also observed if we consider other durations of the observation window in which the maximum variation of the acceleration components is detected. Figures 15 and 16 present the analysis of variance when it is applied to the features computed for two different observation intervals (0.5 s and 1 s, respectively). For the sake of simplicity, the graphs only show the first six characteristics, although a similar disparity is found for the other six features.

Comparison of the Different Types of ADLs.
The differences analyzed in the previous section could be partly justified by the fact that the terms 'ADL' and 'falls' may hide a huge variety of different movements. This is particularly true for the groups labelled as ADLs, as they can encompass activities ranging from those that require almost no effort, such as standing, to those that are much more physically demanding (such as running). In spite of this evident heterogeneity, the authors of the datasets normally select the typology of the ADLs to be emulated by the volunteers without previously discussing the degree of mobility that the selected activities actually require.
In order to minimize the effects of this heterogeneity in the ADLs, we propose to individualize the previous ANOVA study taking into account the nature (physical effort) of the ADLs. For this purpose, as we also suggested in [76], we split the ADLs of each repository into three generic subcategories: basic ordinary movements (such as getting up, sitting, standing, and lying down), standard routines that entail some physical effort or a higher degree of mobility or leaning of the body (walking, climbing up and down stairs, picking an object from the floor, and tying shoe laces), and finally, sporting activities (running, jogging, jumping, and hopping).
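The three-way taxonomy can be represented as a lookup from activity label to subcategory, so that the features of each trace are grouped before the ANOVA is repeated per subcategory. In the sketch below, the activity names are the ones listed in the paragraph above, but the exact label strings used in each repository differ, so the matching rule (lowercase, exact match) and all function names are hypothetical.

```python
ADL_SUBCATEGORIES = {
    "basic": ("getting up", "sitting", "standing", "lying down"),
    "standard": ("walking", "climbing up stairs", "climbing down stairs",
                 "picking an object from the floor", "tying shoe laces"),
    "sporting": ("running", "jogging", "jumping", "hopping"),
}


def subcategory_of(label):
    """Return 'basic', 'standard', or 'sporting'; None if the label is unknown."""
    key = label.strip().lower()
    for subcategory, names in ADL_SUBCATEGORIES.items():
        if key in names:
            return subcategory
    return None


def group_by_subcategory(traces):
    """Group feature values by ADL subcategory.

    traces: iterable of (activity_label, feature_value) pairs.
    Unknown labels are skipped rather than guessed.
    """
    groups = {name: [] for name in ADL_SUBCATEGORIES}
    for label, value in traces:
        subcategory = subcategory_of(label)
        if subcategory is not None:
            groups[subcategory].append(value)
    return groups


# Toy feature values, for illustration only.
print(group_by_subcategory([("Walking", 1.2), ("running", 2.5),
                            ("sitting", 1.0), ("skiing", 3.0)]))
```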
By taking into account this taxonomy, Table 4 displays and catalogues the different types of ADLs and falls contained in the seven datasets under analysis. The table shows that each subcategory in each dataset is basically represented by the same three or four types of common movements. Thus, a certain homogeneity could be presumed. In two of the datasets (DOFDA and IMUFD), there are no sporting activities. As an extra type of 'nonfall' movement, the table also indicates which repositories include the emulation of near falls, that is to say, missteps, stumbles, trips, or any other type of accidental movement that involves a loss of balance but does not result in a fall. The individualized ANOVA analyses of the series of the six statistical features of the datasets are depicted in Figures 17 and 18 (for basic movements), Figures 19 and 20 (for standard movements), and Figures 21 and 22 (for sporting movements).
Despite the categorization and clustering of the traces, the graphs again reveal the great variability of the datasets when they are compared to each other. For all three movement types and for all metrics, the mean of the six statistical features of each dataset is significantly different from that calculated for, at least, two other datasets. The figures show that in a nonnegligible number of cases (some of which are highlighted in blue in the graphs), the null hypothesis can be rejected for the comparison of a certain mean of a particular dataset with the mean of the same metric of all the other datasets. For example, five out of the six contemplated features in the basic movements of the UMAFall repository present a mean value significantly different from those of all the other datasets. A similar behavior is detected in other repositories and types of movements (e.g., the sporting activities in the UP dataset).
A similar conclusion can be reached by analyzing the near-fall movements existing in two datasets (IMUFD and Erciyes). Figures 23 and 24 confirm that the six statistics with which these movements have been characterized present mean values that significantly differ for the two repositories.

Comparison for the Same Type of Movement: Walking.
The disparity in the statistical characterization of the traces is confirmed even when the same type of movement is considered as the basis for comparing the datasets. Figures 25 and 26 depict the results obtained when the ANOVA is exclusively applied to those movement samples (measured on the waist) labelled as "walking". We select this ADL due to its importance in real-life scenarios of FDSs, as it is the movement that normally precedes falls, and because it is present in the seven datasets (DLR, DOFDA, Erciyes, IMUFD, SisFall, UMAFall, and UP-Fall) that employ a sensor on the waist. For certain characteristics (for example, note the absence of overlapping intervals in the graphs corresponding to $\mu_\theta$ or $\mu_R$), the post hoc tests show that all or almost all datasets are significantly different.

Results for the Measurements on the Wrist.
To corroborate the previous results, we apply the same analysis to the datasets containing measurements captured on a completely different body position: the wrist. In spite of the particular (and independent) mobility of the wrist, this position has been selected in a significant number of studies on FDSs as the location for the detection sensor. The wrist offers the user better ergonomics than other typical placements, as humans are already accustomed to wearing watches. Moreover, commercial smartwatches (which are natively provided with inertial measurement units) can be employed to deploy the FDS without obliging the user to carry any supplementary device. In some articles that consider systems with more than one sensing mote, the wrist sensor is used as a backup node to confirm the detection decision taken from the measurements obtained on another body area.
To extend the study to the wrist-based measurements, we repeat the selection process described in Section 3 and select only those datasets that employed a sensor on that position (see Table 3). Thus, six datasets were selected: Erciyes, UP-Fall, and UMAFall (already utilized in the previous analysis of the traces obtained from the waist), as well as the CMDFall, SmartFall, and Smartwatch datasets. The results of the ANOVA of the series of the twelve statistical features of these six datasets (when an observation window is considered) are represented in [...]. As expected, the graphs show an even higher disparity between the datasets than those obtained on the waist. The way in which the volunteers are instructed to execute the ADLs and falls may particularly determine the position and movements of the hands during the activities. Thus, the measured dynamics may be extremely dependent on the testbed, which reduces the suitability of the traces for extrapolation to other scenarios.

Discussion.
This heterogeneity of the repositories may be caused by very different factors, which we could group as follows:
(i) Technological factors: inertial sensor problems and limitations (biases, calibration issues, and range) can affect the measurements.
(ii) Ergonomic factors: although we have compared datasets where the measurements were taken in a similar body area (the waist), measurements could be altered by the exact position of the sensor, the discomfort that the sensing device can cause in the user (which could influence the naturalness of the movements), or the firmness with which the device is attached to the body.
(iii) Factors determined by the design of the testbed: the variability of the datasets could be explained not only by the intrinsic variability (in number and types) of the performed movements but also by the particularities of the physical setting in which the movements take place: the route of the subjects during the execution of each activity, the external elements (stairs, chairs, and beds) used in the routines, or the mechanisms used to cushion the impact of the falls (mattresses, elbow pads, and helmets).
(iv) Human factors: finally, the data could be affected not only by the criteria for choosing the subjects (especially the age) but also by the particular training (or orders) that the volunteers receive to carry out the activities (in particular the falls).

Conclusions
This paper has presented a thorough study of the existing public repositories employed in the validation of Fall Detection Systems (FDSs) based on wearables. The paper compares and summarizes the main basic characteristics of up to 25 available datasets used as benchmarking tools in the evaluation of FDSs.
Due to the difficulties of obtaining inertial measurements of actual falls, all these databases (except one) were created by groups of volunteers that executed a predetermined set of ADLs (Activities of Daily Living) and mimicked falls in a controlled lab-type environment. In this regard, most works in the literature evaluate their proposals by analyzing their behavior when they are applied to just one (or at most two) of these datasets. In order to indirectly assess the validity of testing a certain FDS with a single dataset, we have systematically compared the statistical characteristics of the series contained in seven of these repositories. The selection of the analyzed datasets was based on the choice of a common position (waist) in which the sensor was located and on the cardinality of the measurement sets. In any case, by also analyzing the movements captured on the wrist, we showed that the conclusions could be extrapolated if other body locations with a higher degree of movement autonomy are considered. The study, which was restricted to the accelerometry signals (as they are widely employed in the related literature on FDSs), defined and computed twelve statistical features to characterize different properties of human mobility for each activity during the observation window (of fixed duration) in which the maximum variation of the acceleration magnitude is detected. The analysis was repeated with up to three different observation intervals without identifying a strong coherence in the characteristics obtained from the analysis of the different traces.
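The selection of the observation window mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure of the study: the "variation" of the acceleration magnitude within a window is assumed here to be the difference between its maximum and minimum values, the window length is expressed in samples, and the function names (`acceleration_magnitude`, `max_variation_window`) and the toy trace are hypothetical.

```python
import math


def acceleration_magnitude(samples):
    """Euclidean norm of each triaxial sample (x, y, z)."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]


def max_variation_window(samples, window_len):
    """Return (start, end) indices of the fixed-length window in which
    the acceleration magnitude varies the most.

    Assumption: variation = max(magnitude) - min(magnitude) inside the
    window. If several windows tie, the earliest one is returned.
    """
    magnitude = acceleration_magnitude(samples)
    if window_len <= 0 or window_len > len(magnitude):
        raise ValueError("window_len must be between 1 and the trace length")

    best_start, best_variation = 0, -1.0
    for start in range(len(magnitude) - window_len + 1):
        window = magnitude[start:start + window_len]
        variation = max(window) - min(window)
        if variation > best_variation:
            best_start, best_variation = start, variation
    return best_start, best_start + window_len


# Hypothetical trace (in g): rest, a sharp peak followed by a dip, rest.
toy_trace = [(0.0, 0.0, 1.0)] * 5 + [(0.0, 0.0, 3.0), (0.0, 0.0, 0.2)] + [(0.0, 0.0, 1.0)] * 5
start, end = max_variation_window(toy_trace, window_len=3)
print(start, end)
```

The statistical features would then be computed only over `samples[start:end]`. For a window of fixed duration in seconds, `window_len` would be that duration multiplied by the sampling rate of the dataset.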
In particular, by means of an ANOVA, we compared the means of the different statistics taking into account the nature (falls or ADLs) of the activity. This comparison was repeated after clustering the ADLs into three subcategories (basic, standard, and sporting activities) depending on the physical effort that they demand. In all cases, a significant difference of the means was found for almost all the datasets and features. The same conclusions were drawn even when a single, simple type of standard movement (walking) was selected to compare the databases. The divergence of the datasets could be explained by the complex interaction of a wide set of factors: the typology and number of activities (even for those in the same subcategory), the method to execute the programmed movements, the characteristics of the experimental subjects, the range, quality, and ergonomics of the sensors, the way in which the sensing device is fastened, and the elements employed to cushion the falls. In this sense, the study reveals an evident lack of consensus on the procedure followed to define the experimental testbeds in which the datasets are generated. For example, just one of the studied datasets includes (as nonlabelled ADLs) samples captured while monitoring the actual daily routines of the volunteers.
In any case, the heterogeneity of the datasets highlighted by this investigation calls into question the results of all those studies that test the FDS against a single repository. Thanks to the sophisticated methods currently used in the literature, normally based on machine learning or deep learning techniques, some studies have achieved quality metrics (sensitivity and specificity) in the recognition of ADLs and falls very close to 100%. However, these works do not normally evaluate the capability of these methods to reproduce these positive results when using datasets other than those considered during the training and initial validation of the FDS.
With this in mind, we should not ignore either that the credibility of the research on FDSs is still undermined by the lack of datasets with a representative number of real falls of older people (the target population of these emergency systems), which could be utilized to benchmark the detection methods in a more realistic scenario.
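The cross-dataset check discussed above (configuring a detector with one repository and evaluating it on another) can be sketched as follows. This is only an illustrative sketch: the threshold-based detector is a deliberately simple stand-in for the machine learning methods used in the literature, and the helper names, the threshold, and the peak values of both toy datasets are hypothetical and made up for the example.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Labels are booleans: True for a fall, False for an ADL.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def evaluate(threshold, dataset):
    """Apply a peak-magnitude threshold detector to (peak, is_fall) pairs."""
    y_true = [is_fall for _, is_fall in dataset]
    y_pred = [peak >= threshold for peak, _ in dataset]
    return sensitivity_specificity(y_true, y_pred)


# Made-up (peak acceleration in g, is_fall) pairs for two toy "datasets".
dataset_a = [(3.1, True), (2.9, True), (1.2, False), (1.4, False)]
dataset_b = [(2.1, True), (3.0, True), (1.3, False), (2.6, False)]

threshold = 2.5  # hypothetical value tuned on dataset_a only

print("same dataset :", evaluate(threshold, dataset_a))
print("cross dataset:", evaluate(threshold, dataset_b))
```

In this toy case, the detector separates both classes perfectly in the dataset used to tune it, while both metrics drop when it is applied to the second one. Reporting the two figures side by side is one simple way of exposing how well a method generalizes across repositories.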

Data Availability
The datasets employed in this paper are publicly available on the Internet. URLs to access the data are provided by their authors in the corresponding references (see References).

Conflicts of Interest
The authors declare no conflicts of interest.