Three Ways to Improve the Performance of Real-Life Camera-Based Fall Detection Systems



Introduction
Persons aged 65 years and over are affected the most by falls and their negative health consequences. Thirty to forty-five percent of this group living at home fall at least once a year [1], making fall incidents a major cause of health-related problems for older persons. The consequences can be physical, such as dehydration, pressure ulcers, and even death, as well as psychological, such as fear of falling, loss of self-confidence, and loss of independence [1,2]. One determining factor in the severity of these consequences is the amount of time spent on the floor [2]. Fall detection systems that can help ensure timely aid are therefore needed.
The Personal Emergency Response System (PERS) [3] is a commercially available solution for this. However, the fallen person has to press the button manually. Even when he or she is conscious after a fall and can reach the button, 80% do not use it to call for help [2]. An automatic fall detection system that does not need user intervention can overcome this problem. Because of the importance of this issue, a lot of research has been done to solve the fall detection challenge, as can be seen in the numerous available review articles [4-19]. There are different ways to detect a fall; Mubashir et al., for example, divide them into three categories: wearable sensors, vision, and ambient/fusion [14]. However, each system has its drawbacks. For example, a wearable system has to be worn at all times. This can be obtrusive for the monitored older person. Also, during the night or while taking a bath or shower, the system is often not worn [20]. Contactless methods, such as the camera-based system used in this research, can overcome this limitation.

Figure 1: Some examples of falls that end occluded and are therefore excluded from the restricted data set (see Section 3.1). The white box represents the detected bounding box (see Section 3.3). The gray ellipse represents the bounding ellipse (see Section 3.3). (a) The person falls backwards behind a table and also pushes a chair away. (b) The person falls backwards behind a table, and the chair that she was holding also falls over.
A lot of research has been done, but most of the proposed techniques do not cope well with real-life situations. For example, many camera-based fall detection algorithms use background subtraction to find the person in the image. Most researchers [21-26] assume that the silhouette resulting from background subtraction is an accurate representation of the person. However, this is not always the case. Other objects and persons, occlusions, and changing illumination conditions often interfere with the segmentation [27]. These conditions are underrepresented in most simulation data sets (DS). As almost all published research is validated only on simulated data [16,28], these challenges often remain untested. In our previous work [28], we showed that our fall detection algorithm, which was based on a simple background subtraction method, performed similarly to the state of the art on a publicly available simulation data set. However, it generated a large number of false alarms when tested on a real-life data set. Thus, it is important to increase the robustness of the algorithms used. For example, Belshaw et al. [22] use optical flow to increase robustness. Our approach uses a particle filter in combination with a person detector for this purpose.
Earlier, in [29], we tested our new approach on a newly created, publicly available simulation data set that is based on our real-life data set [30] and obtained promising results. As a first contribution, we show the results of this new approach on our real-life data set.
In previous work [28], we saw that in real life the person is often occluded while falling, for example, behind a table or a door. Some examples are shown in Figure 1. Normally we try to train our systems using all available data, including occluded falls in the training set. In previous work, this always generated a high number of false alarms. One of the causes could be that the fall detector also tries to detect these occluded falls, often unsuccessfully. The feature values extracted from an occluded fall can differ strongly from those of a visible fall while resembling those of a normal activity. As a second contribution, we show that training a fall detector using only nonoccluded falls can increase its performance.
Most fall detectors are based on a generic model that is trained using fall data and activities of daily life (ADL) of one or more participants. However, every person has their own way of living. The executed activities differ from person to person, and each person's walking pattern is unique. We propose incorporating online training to take these differences into account. At installation time the system is based on a generic model. During the following days, the system constantly monitors the person and adapts the model to this person's specific life pattern. In wearable approaches, this is more common. Medrano et al. showed that personalization increases the performance of smartphone-based fall detectors [31]. They use recorded falls from other persons to build a model, which is then personalized using the ADL of the monitored person. This type of personalization is not yet common in computer-vision-based fall detection algorithms. One example is the work of Yu et al. [26]. They use a single camera and an online one-class support vector machine to create a person-specific fall detection system. They use only ADL from different persons to create a generic model that is able to detect outliers. While the system is in use, the model is constantly retrained using the ADL of the monitored person to obtain a more person-specific model. To decrease false alarms, they include two rules to determine a fall: a fall is only reported when a large movement is detected using the motion energy image, and the person should remain on the floor longer than a certain time interval. We, on the other hand, use a discriminative approach. As a final contribution, we show that taking a generic model trained using both falls and ADL from different persons and retraining it afterwards by adding only negative data containing ADL of the monitored person can improve the performance of a fall detector. Below, we show how our approach is built and how it performs.
In this paper we first review some related work before describing our approach in more depth. In Section 3.1, we discuss the real-life data set used to show the performance of our fall detector. The more restricted data set containing only nonoccluded falls and the additional videos for online training are also discussed in that section. In Section 3.2, we explain how we combined our background subtraction method with a particle filter and a person detector. The extracted features that are used as input for our classifier are described in Section 3.3. After this, we explain how our approach for a personalized detector is implemented in Section 3.5. Next, we show the results of our three contributions in Section 4. We discuss these results and propose some future paths to solve the fall detection challenge in Section 5. We conclude the paper in Section 6.

Related Work
As stated above, camera-based fall detection systems have the advantage that they are contactless. Because of this, several research groups have focused on their development and explored a whole range of different approaches. Often a single camera is used to try to detect a fall in a room [21-26]. Regular RGB cameras have the advantage that the recorded images can be used afterwards to learn the root cause of a fall. For example, during the monitoring period, we discovered that one of the participants fell twice because she wanted to take something from the bottom shelf of her cupboard. Rearranging the cupboard prevented further fall incidents. Cameras can also be combined in a multicamera network using early or late fusion [32,33]. With early fusion it is possible to first extract a 3D figure of the person to try to make the fall detection more robust [32]. The commercialization of the Microsoft Kinect also provides the possibility of directly extracting 3D images in an affordable way. This has led to the use of this sensor to detect falls [34,35]. Another path that is sometimes taken is the use of thermal imagers [36]. These outperform normal cameras in dusky environments. An RGB camera with an appropriate filter can also be used together with a near-infrared light source as a cheap alternative in these circumstances.
At the algorithmic level, two different approaches are often used to detect falls: those that try to detect unusual events [23,32,37,38] and those that try to detect the action of falling directly [21,24-26,39]. The former use indirect evidence, such as prolonged inactivity at unusual locations. The latter extract features of the movement of the person or changes in their posture to try to detect the fall. For this, background subtraction is often used to find the moving foreground objects [21-26], the biggest of which is usually assumed to be the person. Domain knowledge is then often used to implement simple yet robust fall features, such as the aspect ratio of the bounding box around the person [24-26,39] or the angle of the surrounding ellipse [24-26,39]. Another commonly used technique is motion histograms [24,25,40].
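The two posture features named above can be illustrated with a minimal sketch. It assumes the silhouette is given as a binary mask (a list of rows of 0/1 values) and estimates the ellipse orientation from second-order image moments; the function names are hypothetical and this is not the implementation used in the cited works.

```python
import math

def bounding_box_aspect_ratio(mask):
    """Width / height of the tight bounding box around all foreground pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        raise ValueError("mask contains no foreground pixels")
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return width / height

def ellipse_angle_degrees(mask):
    """Angle of the major axis of the moment-based ellipse.

    0 means a horizontal (lying) shape, 90 a vertical (standing) shape.
    """
    pts = [(c, r) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask contains no foreground pixels")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    # Second-order central moments of the pixel cloud.
    mu20 = sum((x - mean_x) ** 2 for x, _ in pts) / n
    mu02 = sum((y - mean_y) ** 2 for _, y in pts) / n
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in pts) / n
    theta = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
    return abs(math.degrees(theta))

standing = [[1, 1] for _ in range(6)]   # 6 rows x 2 columns
lying = [[1] * 6 for _ in range(2)]     # 2 rows x 6 columns
```

A standing silhouette gives a small aspect ratio and an angle near 90 degrees, while a lying silhouette gives a large aspect ratio and an angle near 0, which is why these features are attractive for fall detection.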

Methods
To detect falls, the region where the person is present in the image has to be detected. From this region, features can be calculated to classify the fall. Section 3.2 explains our approach to detect the elliptical region of interest (ROI) in the image. The features that are extracted from this region and how they are used to classify a fall are elaborated on in Section 3.3. Next, our approach to test the effects of online training is discussed in Section 3.5. We start, however, with Section 3.1, which gives an overview of the different real-life data sets used for both training and validation in our experiments.

Data Set (DS).
During the last years, seven older persons with a high risk of falling were monitored continuously at their place of residence for periods ranging from three months to two years. This resulted in an extensive real-life data set consisting of 29 falls and a huge number of other activities of daily life. All videos are processed using greyscale values. More information about the data set itself and an overview of the different falls can be found in [28].
For our experiments, we used three different combinations of videos from this real-life data set. Table 1 gives an overview per person of the number of falls and hours of data included in each data set. Because we only have a small number of falls, we do not divide the data into fixed training and test sets. Instead, we use tenfold cross-validation in which each video fragment is kept as a whole.
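Keeping each video fragment whole during cross-validation means that all feature vectors of one video end up in the same fold. A minimal sketch of such a grouped split is given below; the round-robin assignment of videos to folds and the name `grouped_folds` are assumptions made for illustration only.

```python
def grouped_folds(video_ids, n_folds=10):
    """Split sample indices into folds so that no video is divided.

    video_ids[i] is the video that feature vector i was extracted from.
    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    unique_videos = sorted(set(video_ids))
    # Assumption: videos are assigned to folds in round-robin order.
    fold_of_video = {vid: i % n_folds for i, vid in enumerate(unique_videos)}
    splits = []
    for fold in range(n_folds):
        test = [i for i, v in enumerate(video_ids) if fold_of_video[v] == fold]
        train = [i for i, v in enumerate(video_ids) if fold_of_video[v] != fold]
        splits.append((train, test))
    return splits

# Toy example: 20 videos with 5 feature vectors each.
video_ids = [i // 5 for i in range(100)]
splits = grouped_folds(video_ids)
```

Splitting per video rather than per feature vector avoids the situation in which near-identical consecutive time slots of the same fall appear in both the training and the test set.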

DS complete.
Only part of the available falls and other videos was used in data set   for our experiments.Two inclusion criteria were formed for the fall to be included in these tests.These are the same as used in previous research [28].The first criterion was that only the person falling should be visible.This is because the extraction of the fall features can For the first experiment, we used the data set  .
The training was executed using the complete data set using tenfold cross-validation.The video segments are added as a whole to either the test or the training set.The second experiment was to show the effect of using only good data for the training of the fall detector; we again used tenfold crossvalidation to classify the videos of  .To compare the results with the ones obtained using  , we trained a fall detector using all videos from   and using this trained model we classified all videos from   that were not included in  .The third experiment is to show the effect of personalization or online training.Since several participants were included in our data set, we can simulate an installed system by using a base training set that is created by removing one of these persons from  .To generalize the results, a different training set is created for each available combination of days of this person from   combined with the base training set.As mentioned above, these additional videos only contained normal ADL, no falls.Using these different training sets, we can show the effect of adding one to five days of personalized training data.
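The construction of the personalized training sets can be sketched as follows: for a chosen number of days, every combination of the available ADL days of the held-out person is added to the base training set. The video labels and the function name below are hypothetical placeholders.

```python
from itertools import combinations

def personalized_training_sets(base_videos, person_days, n_days):
    """Return one training set per combination of n_days ADL days.

    base_videos: videos of all other participants (falls and ADL).
    person_days: ADL-only videos of the monitored person, one per day.
    """
    return [list(base_videos) + list(combo)
            for combo in combinations(person_days, n_days)]

# Hypothetical labels, for illustration only.
base = ["A_fall_1", "B_fall_1"]
days = ["D_day_1", "D_day_2", "D_day_3", "D_day_4", "D_day_5"]
sets_with_two_days = personalized_training_sets(base, days, 2)
```

With five available days there are 5, 10, 10, 5, and 1 combinations for one to five added days, so averaging over them reduces the dependence of the result on which particular days were chosen.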

Foreground Detection.
Previously, our foreground detection was based on background subtraction (BGS) using an approximate median (APM) filter [41]. Shadows were removed using cross correlation. The foreground was further cleaned up using an erosion/dilation step on all foreground pixels. This is a relatively standard workflow for BGS. A more detailed explanation can be found in [28]. One of the conclusions of that work was that a more robust foreground segmentation could improve the performance of our fall detector by reducing the number of false alarms. However, exploratory tests [28] using two background subtraction algorithms available in OpenCV [42], an improved adaptive Gaussian mixture model [43] and a probabilistic method that uses Bayesian inference [44], produced only minor improvements, mainly because we use gray-level images instead of color images. Here, we explore the use of a tracker to increase the robustness of our foreground detection. The general idea is to follow a person in the foreground image using an ellipse that surrounds the foreground blob corresponding to this person. We chose a particle filter (PF) [45] because it is able to cope with nonlinear motions, and we combined it with a people detector [46]. A PF considers multiple state hypotheses simultaneously, so it can deal with short-lived occlusions and can recover when it loses track for a short time.
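The approximate median idea can be sketched in a few lines: each background pixel moves one grey level toward the current frame, so over time the background converges to the temporal median. The sketch below works on plain lists of grey values; the threshold of 25 and the function names are illustrative assumptions, and the shadow removal and erosion/dilation steps are omitted.

```python
def update_background(background, frame, step=1):
    """Approximate median update: move each pixel one step toward the frame."""
    updated = []
    for bg_row, fr_row in zip(background, frame):
        row = []
        for b, f in zip(bg_row, fr_row):
            if f > b:
                row.append(b + step)
            elif f < b:
                row.append(b - step)
            else:
                row.append(b)
        updated.append(row)
    return updated

def foreground_mask(background, frame, threshold=25):
    """Mark pixels whose grey value differs strongly from the background."""
    return [[abs(f - b) > threshold for b, f in zip(bg_row, fr_row)]
            for bg_row, fr_row in zip(background, frame)]

background = [[100, 100], [100, 100]]
frame = [[110, 110], [110, 200]]
mask = foreground_mask(background, frame)      # only the 200 pixel is foreground
background = update_background(background, frame)
```

Because the update step is constant, a person who stays still is absorbed into the background after a number of frames proportional to the grey-level difference, which is one source of the ghost figures discussed in this section.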
Our experiments showed that a PF based only on foreground segmentation or on a normal histogram did not work well with our challenging real-life data. One reason for this is the lack of color information. Therefore, our implementation uses a combination of foreground segmentation, a weighted structural intensity histogram, and an upper-body detector to determine the weights of the different particles; together these make it possible to follow the person in the image. Each part has its own function. The upper-body detector helps to prevent and solve tracking losses. The foreground segmentation performs best for tracking the person while he is moving. The weighted structural intensity histogram and the upper-body detector help the tracker during periods with low motion, when the foreground might be integrated in the background. The histogram also helps to reduce the effects of ghost formation when the person has been integrated in the background. In this case, the intensity histogram is more robust than the upper-body detector given the low recall of the detector.
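How several cues can feed a particle filter is sketched below. The sketch assumes that each cue yields a score in [0, 1] per particle and that the scores are combined by multiplication before normalization; this combination rule, the small constant `eps`, and the function names are assumptions for illustration and are not taken from the paper's Appendix. Resampling uses the standard systematic scheme.

```python
import random

def combine_cues(fg_scores, hist_scores, detector_scores, eps=1e-6):
    """Combine three per-particle cue scores into normalized weights.

    Assumption: cues are treated as independent likelihoods and multiplied.
    eps keeps a particle alive when one cue gives a zero score.
    """
    raw = [(f + eps) * (h + eps) * (d + eps)
           for f, h, d in zip(fg_scores, hist_scores, detector_scores)]
    total = sum(raw)
    return [r / total for r in raw]

def systematic_resample(weights, rng):
    """Return particle indices drawn with systematic resampling."""
    n = len(weights)
    start = rng.random() / n
    indices = []
    j = 0
    cumulative = weights[0]
    for i in range(n):
        position = start + i / n
        # Advance to the particle whose cumulative weight covers this position.
        while position >= cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        indices.append(j)
    return indices

weights = combine_cues([0.9, 0.1, 0.5], [0.8, 0.2, 0.5], [0.7, 0.0, 0.5])
survivors = systematic_resample(weights, random.Random(0))
```

Particles that agree with all three cues receive most of the weight and are duplicated during resampling, while particles supported by none of the cues tend to disappear.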
To make the background update more robust, the prediction of the PF was used as feedback. The background was updated more slowly inside the region predicted by the PF and faster outside of this region. This reduced the appearance of a ghost figure, while other changes in the image (e.g., changes in lighting, other moving objects) were integrated faster. An overview of the implementation of the foreground detection is shown in Figure 2. A more in-depth explanation of the implementation of the PF can be found in the Appendix. The different configuration values were optimized using visual inspection on a set of validation videos containing only nonfall data and were kept fixed unless specified otherwise.
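The feedback from the tracker to the background model can be sketched as a region-dependent update rate. In the sketch below, pixels inside the predicted person region are updated only once every `slow_period` frames, while all other pixels are updated every frame; the period of 10 frames and the function name are illustrative assumptions, not the configuration used in the paper.

```python
def selective_update(background, frame, person_region, frame_index,
                     slow_period=10, step=1):
    """Approximate median update that is slower inside the tracked region.

    person_region[r][c] is True for pixels inside the ellipse predicted
    by the tracker. Those pixels are only updated every slow_period frames.
    """
    updated = []
    for r, (bg_row, fr_row) in enumerate(zip(background, frame)):
        row = []
        for c, (b, f) in enumerate(zip(bg_row, fr_row)):
            frozen = person_region[r][c] and frame_index % slow_period != 0
            if frozen or f == b:
                row.append(b)
            else:
                row.append(b + step if f > b else b - step)
        updated.append(row)
    return updated

background = [[100, 100]]
frame = [[120, 120]]
region = [[True, False]]        # left pixel lies inside the predicted ellipse
for index in range(1, 11):
    background = selective_update(background, frame, region, index)
```

After ten frames the pixel outside the region has moved ten grey levels toward the frame, whereas the pixel inside the region has moved only one, so a person standing still is absorbed into the background much more slowly than a lighting change elsewhere in the room.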
There are different ways to determine the region of interest (ROI) from which the fall features can be extracted. One possibility is to use the predicted ellipse of the particle filter. The performance of this predicted ellipse was benchmarked in a multicamera setting in [47], where it outperformed multiple dedicated multicamera trackers. This showed that the prediction of the tracker is good for following persons performing normal activities of daily life (ADL). However, a previous study using a challenging simulation data set [29] showed that the fall detector performed best when using the biggest foreground object in the image as ROI and not when using the prediction of the particle filter. This could be explained by the fact that a tracker smooths out sudden abrupt movements, of which a fall is a good example. Increasing the reaction speed of the tracker would allow it to follow falls better, but it would also make the tracker less robust and cause it to lose track more often. This could be counteracted by increasing the number of particles, but then the processing time would increase considerably, since the coefficients described above have to be calculated for each particle. The fall detection features are therefore extracted from the biggest detected blob available in the foreground image (see Figure 2(d)). This relies on the assumption that only one person is visible in the image. In real situations, this assumption is sometimes violated. A solution is to detect whether multiple persons are present and, if so, to deactivate the system. In most cases, this is acceptable because the other person could call for help. However, in the case of a married couple in which the person taking care of a spouse with cognitive impairment falls, this can cause severe problems. In this case, it is better to make the tracker itself work for multiple persons, as done by Young-Sook and HoonJae [48]. For the moment, this is not implemented yet.
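Selecting the biggest foreground blob as ROI amounts to a connected-component search on the binary foreground mask. A minimal pure-Python sketch using a breadth-first flood fill with 4-connectivity is shown below; the connectivity choice and the function name are assumptions made for illustration.

```python
from collections import deque

def largest_blob(mask):
    """Return the pixel coordinates (row, col) of the largest 4-connected blob."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    best = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Flood fill one connected component starting at (r, c).
            blob = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(blob) > len(best):
                best = blob
    return best

mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
roi_pixels = largest_blob(mask)
```

The sketch also makes the single-person assumption visible: if a second person or a moving object produces a bigger blob, the features are silently extracted from the wrong region.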
When looking at a fall in more detail, four phases can be distinguished, as stated by Noury et al. [50]: the prefall, critical, postfall, and recovery phases. All four of these phases contain valuable information to detect a fall. To include this temporal information, we use a feature vector with a stride length of one second that contains the mean and the maximum values of the features calculated over different time slots before, during, and after the fall. These time slots are shown in detail in Figure 4. We have one such feature vector for each time slot of one second covering a certain time frame.
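The construction of such a temporal feature vector can be sketched as follows. The sketch takes a per-second series of one feature and a list of time slots, and concatenates the mean and the maximum of every slot. The slot boundaries used here are hypothetical, since the actual slots are defined in Figure 4.

```python
def slot_features(values, slots):
    """Concatenate mean and maximum of a feature over several time slots.

    values: one feature value per second.
    slots: list of (start, end) index pairs, end exclusive.
    """
    vector = []
    for start, end in slots:
        window = values[start:end]
        vector.append(sum(window) / len(window))
        vector.append(max(window))
    return vector

# Hypothetical slots: 3 s before, 1 s during, and 2 s after the candidate fall.
speed = [1, 2, 3, 10, 4, 4]
slots = [(0, 3), (3, 4), (4, 6)]
vector = slot_features(speed, slots)
```

Sliding the slots forward by one second and repeating the computation yields one feature vector per second, which matches the stride length of one second mentioned above.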

3.4. Classifier.
These feature vectors can then be used by a support vector machine (SVM) to determine whether a feature vector corresponds to a fall or a nonfall. For this, the SVM first has to learn a model based on a training set containing both fall and nonfall examples. Since only a small number of fall feature vectors and a huge number of nonfall feature vectors were available, a linear SVM was used to reduce the problem of overfitting and to increase the processing speed. To prevent all vectors from being classified as nonfall, different weights were used for the negative and the positive data, chosen such that the two weights sum to one. The $F_\beta$ measure (see below) was used as cost function for finding the best combination of the weight of the negative data and the regularization parameter of the SVM. This way, the importance of the sensitivity (SENS) and the positive predictive value (PPV) could be weighted appropriately:

$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{PPV} \cdot \mathrm{SENS}}{\beta^2 \cdot \mathrm{PPV} + \mathrm{SENS}}, \qquad \mathrm{SENS} = \frac{TP}{TP + FN}, \qquad \mathrm{PPV} = \frac{TP}{TP + FP},$$

with TP being the number of true positives, FN the number of false negatives, and FP the number of false positives. Tenfold cross-validation over the complete data set was used for the evaluation of the fall detection algorithm. To reduce the false alarm rate, after each time slot was classified by the SVM, a median filter of length three was applied over all time slots to remove isolated single detections. Additionally, bursts of detections were grouped using nonmaximum suppression.
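The evaluation measure and the two postprocessing steps can be sketched as follows. The sketch assumes the standard F-measure built from SENS and PPV, zero padding at the borders of the median filter, and the SVM decision value as the score used for nonmaximum suppression; these choices and the function names are assumptions, not details confirmed by the text.

```python
def f_measure(tp, fp, fn, beta):
    """Weighted harmonic mean of PPV and SENS; beta > 1 favours sensitivity."""
    if tp == 0:
        return 0.0
    sens = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return (1 + beta ** 2) * ppv * sens / (beta ** 2 * ppv + sens)

def median3(labels):
    """Median filter of length three over 0/1 labels (zero padded)."""
    padded = [0] + list(labels) + [0]
    return [sorted(padded[i:i + 3])[1] for i in range(len(labels))]

def suppress_bursts(labels, scores):
    """Keep one detection per run of consecutive positives: the best score."""
    detections = []
    i = 0
    while i < len(labels):
        if labels[i]:
            j = i
            while j < len(labels) and labels[j]:
                j += 1
            detections.append(max(range(i, j), key=lambda k: scores[k]))
            i = j
        else:
            i += 1
    return detections

raw = [0, 1, 0, 0, 1, 1, 1, 0]                     # per-second SVM output
scores = [0.0, 0.9, 0.0, 0.0, 0.2, 0.8, 0.5, 0.0]
filtered = median3(raw)
alarms = suppress_bursts(filtered, scores)
```

In this toy sequence the isolated detection at second 1 is removed by the median filter, and the burst at seconds 4 to 6 is reported as a single alarm at the second with the highest score.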

Personalized Detector.
Most fall detection algorithms use a generic model based on training data recorded using one or more participants. However, it is very difficult to take into account the huge range of activities of daily life that a person can execute. Another challenge is that every person is unique, and their movement pattern can differ from that of other persons. The activities that are executed also differ from person to person. In an ideal world, a fall detection system would use fall and nonfall information from the person that it is monitoring. Since falls are rare, capturing fall data from the monitored person is difficult and could take a long time. Waiting for such data before being able to detect future falls is not feasible. A better way is to install a fall detection system that is based on generic information and then use the negative, nonfall instances that are being captured to increase the robustness of the system. This way, the living pattern of this person can be learned and taken into account to decrease the number of false alarms. Our approach for online training is to install the system and then retrain the model every 24 hours. This is easier and faster to implement than adapting the model every time an activity is captured. To show which performance gain can result from this, we use our existing real-life data set. As discussed in Section 3.1, we created a different training set for each available combination of days of the person, combined with the base training set. As mentioned above, these additional videos contained only normal ADL, no falls. A model was trained using each of these new training sets. The regularization parameter of the SVM is kept fixed, but the weight of the nonfall data has to be decreased as the training set grows. If it were not decreased, the weight of the nonfall data would become too high, and the sensitivity of the system would decrease.
The weight of the nonfall data for online training is calculated from three quantities: a factor that can be changed online to regulate the effect of the online training, the number of feature vectors in the complete training set for online training, and the number of feature vectors in the original base training set. To test the performance of these personalized detectors, we used all fall videos from the selected person.

Results

Robust Foreground Detection Validated Using DS complete.
The results of the fall detector based on our robust background subtraction method are shown in Table 2 for $\beta = 10$ in the $F_\beta$-measure. The results of a previous study using our less robust foreground detection method from [28] are shown for reference. These earlier experiments were executed with a data set that is very similar to the complete data set. One fall of participant A was not included before, and one of her falls used to have 24 hours of video instead of the now available 140 minutes. In that study we also showed that our original method performed similarly to the state of the art on a publicly available simulation data set.
If we compare the results of our robust BGS algorithm with those of our original BGS [28], there is a large decrease in false alarms for a similar sensitivity. The original system generated 1360 false alarms for a sensitivity of 38.1%, while our more robust system almost halves the number of false alarms while detecting more falls. Nevertheless, the results of the fall detector trained using the complete data set show that a high number of false alarms still occurred. Only 10 out of 22 falls were detected, while 688 false alarms were generated. This represents a sensitivity of 45.5% and a PPV of 1.4%. Figure 5 shows a precision-recall (PR) curve of these results.
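As a check on the reported figures, the sensitivity and PPV follow directly from the counts given above (10 detected falls, 12 missed falls, and 688 false alarms), assuming the usual definitions of both measures:

$$\mathrm{SENS} = \frac{TP}{TP + FN} = \frac{10}{10 + 12} \approx 45.5\%, \qquad \mathrm{PPV} = \frac{TP}{TP + FP} = \frac{10}{10 + 688} \approx 1.4\%.$$

The low PPV shows that, even with the more robust foreground detection, roughly one alarm in seventy corresponds to a real fall.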

Training Using Visible Falls.
As mentioned above, we also wanted to show the effect of using a more restricted data set for training the detector.In this case the person should be visible during the fall and he should remain on the floor for more than thirty seconds.The prior restriction made certain that the extracted features were more likely to be from the person and not from something else.Table 3 shows the results for each person using   to validate the system.We also added the results of the same falls of   but having been trained using   for comparison.This shows the effect of using the restricted training set in more depth.
The results using the more restricted training set show a large decrease in false alarms. Now nine out of eleven falls were detected while generating 352 false alarms, giving a sensitivity of 81.8% and a PPV of 2.49%. Looking at the number of false alarms generated per day, a decrease of 33.5%, from 57.6 to 38.3, can be observed.
However, two falls remained undetected of  .The first was a fall in which person B lies down after sitting on the knees for 30 seconds while trying to get up again.The speed towards the ground was rather low.Additionally, a high level of overillumination was present in the image, interfering with a robust detection.In the second fall, the table is pushed away and person C is partially occluded during part of the fall.Figure 6 shows some screenshots of both missed falls.
Of the 352 false detections, 21% were generated because a person was walking below the camera and was therefore no longer completely visible. Another 26% were generated due to the presence of two persons or other moving objects in the room. In 13% of the cases, the person's movement was misclassified; this often happened together with errors caused by the background update. A partial occlusion of the body (e.g., legs occluded by a table) caused 10% of the errors, and 9% of the false alarms were generated due to errors in the background update. In this last case, the person was starting to be integrated in the background, or he was moving after being integrated in the background, which caused a ghost figure. Shadows (certainly those generated by the sun) also remain a challenge, since they accounted for 8% of the alarms. The other false alarms were caused by illumination changes, leaving or entering the view of the camera, and so forth.
Table 2 shows the results when training a model using   and validating it on  .As mentioned in Section 3.3, tenfold cross-validation was used for the validation of the falls contained in  .The other falls of   were classified using a model that was trained using  . Table 4 shows these results more in detail over the different persons.This table also shows that the number of false alarms per day decreased from an average of 31.4 to 26 false alarms per day.These results are also contained in the PR curve shown in Figure 5.The doubling of the area under the curve (AUC) of the PR curve also shows the definite improvement in performance.

Personalized Fall Detector.
To show the possibility that using personalization can give a further improvement, we have to look at the distribution of the false alarms per day over the different persons in Table 3.This shows that person D had a very high false alarm rate.In this case, 201 out of the total 352 false alarms were generated in a single video fragment of 24 hours.  only contained one video of this person, so no data for person D was present in the training set.Person C also had a high false alarm rate of 42 alarms per day.But in this case two videos were present in the data set, one of them was available in the training set of the other one.This could be the reason that person C has a lower amount of false alarms than person D.
To reduce the number of false alarms further, two additional experiments were executed, one for person C and one for person D, to show the effects of personalization using online learning. For this, all fall videos excluding the ones from the current person were combined with additional videos from this person containing only normal activities. Several combinations and numbers of additional videos were tested. A fall detector was trained for every data set. Different values of the online-training factor that alters the weight parameter of the SVM were tested. Persons C and D were used independently for cross-validation to find an optimal value for this factor. A value of four gave the best results in both cases. The results for a factor of four are shown in Table 5.
The results show a different behavior for the two persons. For person D, the number of false positives decreased strongly, from 201 to 29, while the sensitivity was maintained. For person C, a small increase in false alarms was noticed, but, more importantly, the sensitivity also increased. Remember that in the previous experiments there was always a video containing a fall of person C in the training set of the other one; this explains the difference with the number of false alarms found in Table 3. An overview of the causes of false alarms per person when using five additional days is shown in Table 6.

Discussion

Robust Foreground Detection Validated Using DS complete.
The tests using our complete data set of twenty-two real-life videos showed that the results improved when a particle filter was used in combination with a person detector to create a more robust foreground detection.
Comparing the results with previous experiments using a less robust BGS shows a decrease of almost 50% in the number of false alarms while retaining or even increasing the sensitivity. This shows that a robust foreground segmentation is important for camera-based fall detection algorithms. The reduction is certainly substantial, but a high number of false alarms was still generated, and the detector failed to detect twelve out of twenty-two falls.

Training Using Visible Falls.
Looking at the available falls in more detail showed that a high number of them were partially or even completely occluded. In only 50% of the falls was the person visible after the fall. This caused problems since our foreground detection was not designed to be occlusion resistant. When the person was not visible during or after the fall, the features extracted from the foreground region were not representative of the person. Hence, training a fall detector using these erroneous feature values made it less robust. Removing these falls from the data set and training our fall detector using this subset showed an improvement. Most importantly, the AUC of the PR curve of the fall detector (see Figure 5) trained using only visible falls doubled compared with using all falls. The false alarm rate decreased from 31.4 to 26 false alarms per day. The sensitivity also decreased, from 45.5% to 40.9%: one fall less was detected. However, this was a fall in which the person was not visible, so its earlier detection could have been a lucky one. The other falls could not be detected by our current algorithm, which is not occlusion resistant. This shows that it is important to be able to cope with occlusions. This could partially be done by using only head tracking and detecting occlusions with it, or by placing more cameras in the room. It also shows that it is important to use only features that were extracted correctly and not to include noisy falls in the training data. As discussed in Section 4, the two falls from the restricted data set that were not detected were very challenging. The missed fall of person C was also almost completely occluded during the fall, but the person was visible shortly after the fall. Maybe it should have been removed from the restricted data set. Still, even when using only nonoccluded falls to train the model of the fall detector, quite a high number of false alarms were generated.

Personalized Fall Detector.
Each person has their own way of living and walking. Training a fall detector that is able to detect all falls and generate as few false alarms as possible is still very challenging. Analyzing the false alarms of person D showed that her walking speed was much higher than that of the other participants. Her speed while bending over to take something out of the cupboard also seemed a lot higher. In addition, the setup of the room and the camera view differed a lot from the others. Our belief is that it is better to use a generic detector as a base and let it learn the specific patterns of the person while the system is in use. Our results supported this by showing an improvement in the robustness of the system. A large decrease in the false alarm rate was noticed for person D, while an increase in sensitivity with only a minor increase in false alarms was obtained for person C. It would be better to test this on more persons, but unfortunately no additional persons with visible falls were available in our real-life data set. Adding more days would probably further decrease the number of generated false alarms, but only up to a certain point. To decrease the number of false alarms further, it is important to solve the challenges that cause errors in the foreground detection.
A personalized detector provides the possibility of integrating more environmental information in the future. One of the main causes of errors, for example, was the person walking below the camera. Border object detection could prevent these errors. A further option for personalization is to train the person detector to learn the appearance of the monitored person. This would improve the performance of the foreground detection and could also provide a means to detect occlusions.

Future Work.
As mentioned before, the performance of this kind of fall detection algorithm depends heavily on the placement of the camera in the room. More cameras are needed to cover the whole room, but even then dead spots remain possible. To increase the robustness further, our belief is that it is best to integrate two different ways of detecting a fall. The system proposed in this paper, which is based on transient information, could be combined with a detector that is based on contextual information. Such systems first need to learn the normal pattern of the person and can then detect deviations from it. Combining both kinds of information could improve the robustness of the system considerably. In addition to combining different algorithms that use the same sensor, performance can also be improved by combining several types of sensors. This approach has recently received a lot of attention, but it has its own challenges and issues, as described in [51].

Conclusion
In this paper, we showed three contributions to enhance the performance of our fall detector. As a first contribution, this paper showed the results of our new approach using our real-life data set. The use of a more robust foreground detection, which integrates person detection and tracking to segment the person from the background, increases the performance of our fall detection algorithm. It reduced the number of false alarms by 50% compared to our previous system while maintaining or even improving the sensitivity. However, the numerous occluded falls that were included in our real-life fall data set still caused the detector to generate a high number of false alarms. As a second contribution, we showed that using only nonoccluded falls in the training set reduced the false alarms from 31.4 to 26 per day. The AUC of this detector using selected falls was twice as large as that of the detector using a model trained on all available falls. But we saw that some persons have a higher number of false alarms than others. This could partially be explained by having less training data available for these persons. As a third contribution, we showed that personalization of the model used for the classification of falls can further improve the performance. Our approach of online training using only ADL from the person in question in addition to a generic data set further increased the robustness of our camera-based fall detection algorithm: it reduced the false alarm rate by a factor of 7 in one case and, in the other case, increased the sensitivity of the system by 17% for an 11% increase in false alarms. These optimizations provide a step forward in solving the fall detection challenge, but even then adding other cameras or sensors may be needed for a practical real-life system.

A. Foreground Segmentation
Our foreground segmentation is based on an approximate median (APM) filter (see Figures 2(a)-2(c)) combined with a particle filter (PF) to increase its robustness. Our implementation of this PF uses a combination of foreground segmentation, a weighted structural intensity histogram, and a person detector to follow the person in the image. To make the background update more robust, the prediction of the PF was used as feedback. The background (BG) was updated more slowly inside the prediction of the PF and faster outside this region. This reduced the formation of a ghost figure, while other changes in the image (e.g., changes in lighting, other moving objects) were integrated faster. The different parts of our foreground segmentation are explained in more detail below.
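The selective background update described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name, the step sizes `fast_step` and `slow_step`, and the choice to express "slower" as a smaller step are assumptions made for illustration only.

```python
import numpy as np

def update_background(bg, frame, person_mask, fast_step=1.0, slow_step=0.1):
    """Approximate median update with a region-dependent step size.

    bg          : float array, current background model
    frame       : gray-scale input frame (same shape as bg)
    person_mask : bool array, True inside the ellipse predicted by the tracker
    fast_step / slow_step : hypothetical step sizes (not taken from the paper)
    """
    # Each background pixel moves one step toward the current frame value.
    direction = np.sign(frame.astype(np.float32) - bg)
    # Inside the predicted person region the background adapts more slowly,
    # which limits the absorption of the person into the background.
    step = np.where(person_mask, slow_step, fast_step)
    return bg + step * direction
```

With such a rule, a person who stands still is absorbed into the background only slowly, whereas lighting changes outside the predicted ellipse are followed at the faster rate.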
A.1. Particle Filter. Particle filters estimate the probability distribution $p(x_t \mid z_{[1 \cdots t]})$ of the state vector $x_t$ of the tracked object given $z_{[1 \cdots t]}$, representing all the observations. This probability density function can be approximated using a set of $N$ weighted samples or particles. Increasing $N$ makes the particle filter more robust, but the time to process a single frame increases accordingly. Initial tests showed us that using 80 particles gave a good trade-off between processing time and accuracy.
Each particle corresponds to one state vector representing an ellipse that is defined by its center coordinates, its major and minor axis lengths, and the angle between the major axis and the ground plane. The speed at which each of these values changes is also recorded as part of the state vector, resulting in a ten-dimensional state vector. The weight of each particle is updated every frame and depends on the previous state and the current measurement function. Our measurements are based on foreground, weighted structural histogram, and person detection coefficients. These are explained in more detail below. The particle filter used is a Bootstrap filter implemented using the Bayesian Filtering Library [52]. To start tracking, an initialization step is needed. In our case, the tracker was initialized when a foreground object of over 5000 pixels was detected in our image of 640 by 480 pixels.
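To make the structure of such a tracker concrete, the sketch below shows one generic bootstrap filter iteration over ten-dimensional ellipse states. It does not use the Bayesian Filtering Library and is not the implementation of this work: the ordering of the state vector, the constant-velocity motion model, the noise level, and the `measure` callback are all assumptions.

```python
import numpy as np

N_PARTICLES = 80   # number of particles reported in the text
STATE_DIM = 10     # 5 ellipse parameters + their 5 rates of change
# Assumed layout: [cx, cy, major, minor, angle, d_cx, d_cy, d_major, d_minor, d_angle]

def predict(particles, rng, noise_std=1.0):
    """Propagate particles with a constant-velocity model plus Gaussian noise."""
    out = particles.copy()
    out[:, :5] += out[:, 5:]
    out += rng.normal(0.0, noise_std, size=out.shape)
    return out

def resample(particles, weights, rng):
    """Draw a new particle set with probability proportional to the weights."""
    total = weights.sum()
    if total <= 0:
        probs = np.full(len(particles), 1.0 / len(particles))
    else:
        probs = weights / total
    idx = rng.choice(len(particles), size=len(particles), p=probs)
    return particles[idx]

def bootstrap_step(particles, measure, rng, noise_std=1.0):
    """One iteration: predict, weight with the measurement function, resample."""
    predicted = predict(particles, rng, noise_std)
    weights = np.array([measure(p) for p in predicted], dtype=float)
    return resample(predicted, weights, rng), predicted, weights
```

Here `measure` stands for any non-negative score of how well the ellipse of a particle fits the current frame, such as a combination of the coefficients described in the following sections.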
A.2. Foreground Coefficient $C_{\text{FG}}$. The foreground coefficient (see Figure 2(f)) measures how well the ellipse fits the foreground object. The foreground was detected by thresholding the difference between the current frame and the background. If the difference in intensity level was more than eight, the pixel was detected as foreground. Unfortunately, shadows could also be included in the foreground this way. These were removed using cross correlation. This was followed by an erosion/dilation step to remove small noisy patches.
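The thresholding and erosion/dilation steps can be sketched as follows. This is an illustration under assumptions, not the implementation of this work: the 3 x 3 structuring element and all function names are chosen here, and the shadow removal by cross correlation is left out.

```python
import numpy as np

def threshold_foreground(frame, background, threshold=8):
    """Mark pixels whose intensity differs from the background by more than `threshold`."""
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    return diff > threshold

def _neighbourhood(mask):
    """Return the nine shifted copies of `mask` that form a 3 x 3 neighbourhood."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    return [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]

def erode(mask):
    """A pixel survives only if its whole 3 x 3 neighbourhood is foreground."""
    return np.logical_and.reduce(_neighbourhood(mask))

def dilate(mask):
    """A pixel becomes foreground if any pixel in its 3 x 3 neighbourhood is."""
    return np.logical_or.reduce(_neighbourhood(mask))

def clean_foreground(frame, background, threshold=8):
    """Threshold, then erode and dilate to drop small noisy patches."""
    return dilate(erode(threshold_foreground(frame, background, threshold)))
```

Erosion followed by dilation removes isolated pixels while restoring the outline of larger regions that survive the erosion.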
A high value for the foreground coefficient represented an ellipse that surrounds the foreground as well as possible without including too many background pixels. As shown in Figure 2(f), two layers were defined surrounding this bounding ellipse (BE). A penalty was given if foreground pixels were included in either of these layers. The exact value of the foreground coefficient was calculated with formula (A.1) from the following quantities: $\text{FG}_{\text{BE}}$, the number of foreground pixels contained in the bounding ellipse (BE); the surface of BE; OL, a layer surrounding BE that is 1.5 times the size of BE; and OOL, an additional layer twice the size of BE.

A.3. Weighted Structural Histogram Coefficient $C_{\text{H}}$. Histogram matching of the image in the bounding ellipse around the person is the basis for our second measurement function (see Figure 2(e)). In the literature, mostly a color histogram is used because it is more robust. But since video was also recorded during the night using near-infrared, only gray-scale values were available. A weighted structural histogram was used to make the histogram more distinctive [53]. The bounding ellipse was divided into four overlapping circles as seen in Figure 2(e). Each circle represented a different part of the body, more precisely the head and shoulders, the chest, the abdomen and hips, and the lower legs. Since background pixels are more likely to be included at the edges of a circle, the center pixels were given more weight than the ones on the outside of the circle using a Gaussian distribution. One exception to this rule was the circle containing the legs. When a person is walking, the legs move all over the circle, causing the center to contain background information at some points. Therefore the weights were evenly distributed over this whole circle. In the literature, each pixel is mostly included in only one bin of the histogram. Even small intensity changes can then cause a pixel to shift to another bin, sometimes causing dramatic changes. To reduce this effect, linear interpolation was used to divide the weight of each pixel over its bin and the neighboring bins.
Calculating the correspondence of two histograms can be done in different ways, for example with the Bhattacharyya or chi-squared distance, but our tests showed that correlation with the histogram model gave the best results. To calculate $C_{\text{H}}$, the correlation of the histograms for each part of the ellipse was calculated and combined as given in

$C_{\text{H}} = 0.3\, C_{\text{head}} + 0.35\, C_{\text{chest}} + 0.25\, C_{\text{abdomen}} + 0.1\, C_{\text{legs}}$. (A.2)
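As an illustration of the interpolated histogram and of the weighted combination in (A.2), consider the sketch below. The number of bins, the interpretation of "correlation" as the Pearson correlation between bin vectors, and all names are assumptions; the per-pixel weights (Gaussian or uniform, depending on the body part) are passed in by the caller.

```python
import numpy as np

# Part weights as given in (A.2).
PART_WEIGHTS = {"head": 0.3, "chest": 0.35, "abdomen": 0.25, "legs": 0.1}

def soft_histogram(values, pixel_weights, n_bins=16, max_value=256.0):
    """Weighted histogram in which each pixel is shared between two adjacent bins."""
    values = np.asarray(values, dtype=float).ravel()
    pixel_weights = np.asarray(pixel_weights, dtype=float).ravel()
    pos = values / max_value * n_bins - 0.5   # position measured in bin centers
    lower = np.floor(pos).astype(int)
    frac = pos - lower                        # share that goes to the upper bin
    hist = np.zeros(n_bins)
    np.add.at(hist, np.clip(lower, 0, n_bins - 1), pixel_weights * (1.0 - frac))
    np.add.at(hist, np.clip(lower + 1, 0, n_bins - 1), pixel_weights * frac)
    total = hist.sum()
    return hist / total if total > 0 else hist

def correlation(h1, h2):
    """Pearson correlation between two histograms (assumed similarity measure)."""
    a = h1 - h1.mean()
    b = h2 - h2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def histogram_coefficient(part_hists, model_hists):
    """Combine the per-part correlations with the weights of (A.2)."""
    return sum(w * correlation(part_hists[k], model_hists[k])
               for k, w in PART_WEIGHTS.items())
```

A pixel whose intensity lies exactly on the border between two bins contributes half of its weight to each, so a small intensity change shifts the histogram gradually instead of abruptly.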
During the initialization, the histogram was calculated from the biggest object and used as the starting model. The appearance of the person changes while moving, which could cause the tracker to lose track. To counteract this, the histogram model was updated during each frame with 0.5% of the current prediction. This was done by multiplying all bins of the model by 0.995 and adding to this the histogram of the current prediction multiplied by 0.005.
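This update rule is an exponential moving average and can be written in a few lines. The sketch below uses hypothetical function names; the half-life helper is added only to give a feeling for how quickly the starting model fades.

```python
import math
import numpy as np

def update_model(model, current, rate=0.005):
    """Blend the histogram of the current prediction into the model."""
    return (1.0 - rate) * model + rate * current

def half_life_frames(rate=0.005):
    """Number of frames after which the starting model contributes only 50%."""
    return math.log(0.5) / math.log(1.0 - rate)
```

With a rate of 0.5% per frame, the contribution of the starting model halves after roughly 138 frames, so gradual appearance changes are followed without letting a single frame dominate the model.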
A.4. Person Detection Coefficient $C_{\text{D}}$. Not only persons move through the room. Other objects, such as walking aids, can also be detected as foreground. To reduce the effect of these other objects, the Calvin upper-body detector [46] was used (see Figure 2(g)). This detector is based on the successful part-based object detection framework [54]. Unfortunately, it takes a huge amount of labeled training data to train a new model, so a standard model was used. We chose an upper-body detector because, in contrast to most pedestrian detectors, it can also detect sitting persons. Another major advantage of the Calvin detector is that the model is available online and can easily be used in the OpenCV framework. The detector returns a confidence level for each detection instance that it makes. Our tests showed that a value of over −0.45 represents a high confidence. But, as explained before, the model used is not trained specifically for our data. The higher point of view that was used differs from the data on which the model was trained. The posture of older persons, or the presence of a lump on their back, also differs from that of younger people. The detector can only detect upper bodies in an upright position, so persons who are lying down cannot be detected anymore. This caused quite a few false and missed detection instances, but, even with these drawbacks, the detector proved useful.
The detection was used in three ways. First, if a person was detected, the histogram model was updated in the same way as described above, but with a multiplication coefficient depending on the confidence of the detection. When the detection is very good, the histogram model is completely replaced with the histogram of the detected person. If the confidence level is lower, the value with which the histogram model was multiplied was defined using a Gaussian centered around −0.45 with $\sigma^2 = 0.15$. Second, if one or more person detection instances were available, five low-weight particles were replaced by a combination of these detection instances. The replaced particles were unlikely to correspond to the person but are now placed directly on the detected person. This is normally not done in the particle filter paradigm, but it increased the robustness of the tracker. To limit the processing time, the detection was only executed every five frames on a region that is centered around the detected foreground object but is 80% bigger. Third, the calculation of the final coefficient is changed, as explained next.
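One possible reading of the confidence-dependent update is sketched below. It is an assumption, not the implementation of this work: the text does not specify how the Gaussian is normalized, so the sketch scales it to a peak of 1 and treats every confidence of −0.45 or higher as a full replacement of the model.

```python
import math

HIGH_CONFIDENCE = -0.45   # confidence level reported as "high" in the text
VARIANCE = 0.15           # sigma squared of the Gaussian

def detection_update_rate(confidence):
    """Fraction of the histogram model replaced by the detected person's histogram.

    Assumed behaviour: 1.0 (full replacement) at or above the high-confidence
    level, and a Gaussian fall-off below it.
    """
    if confidence >= HIGH_CONFIDENCE:
        return 1.0
    return math.exp(-((confidence - HIGH_CONFIDENCE) ** 2) / (2.0 * VARIANCE))
```

The returned value can be used as the blending rate of the histogram model update, so that weak detections only nudge the model while strong detections reset it.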
A.5. Final Coefficient. Finally, the final coefficient was calculated as a combination of the foreground coefficient $C_{\text{FG}}$ and the histogram coefficient $C_{\text{H}}$. The detection coefficient $C_{\text{D}}$ was used to shift the weight between these two measures. The formula for the final coefficient was given by

$C_{\text{total}} = \alpha\, C_{\text{FG}} + (1 - \alpha)\, C_{\text{H}}$. (A.3)

During the initialization step, $\alpha$ was set to 0.65. When a detection instance was available, $\alpha$ was decreased according to the confidence of the detection as used for the update of the histogram model. For a reliable detection, $\alpha$ is decreased to 0.25. This increases the importance of $C_{\text{H}}$ and decreases that of $C_{\text{FG}}$. Values for $\alpha$ lower than 0.25 were clipped to make certain that the tracker does not stick to a nonmoving object. When no detection was available, $\alpha$ was gradually increased to 0.65 again. This reduces the chance that another object is tracked. If no detection was available for 20 frames, the initial value for $\alpha$ was used again.
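The weighting in (A.3) and the schedule of the weight factor can be sketched as follows. The linear mapping from detection confidence to weight factor and the linear recovery over 20 frames are assumptions; the text only states the end points (0.65 and 0.25) and the 20-frame limit.

```python
WEIGHT_INIT = 0.65       # weight of the foreground coefficient at initialization
WEIGHT_MIN = 0.25        # lower bound, reached for a reliable detection
RECOVERY_FRAMES = 20     # frames without detection after which the initial value applies

def final_coefficient(c_fg, c_hist, weight):
    """Weighted combination of foreground and histogram coefficients, as in (A.3)."""
    return weight * c_fg + (1.0 - weight) * c_hist

def weight_after_detection(rate):
    """Lower the weight for a detection; `rate` in [0, 1] expresses its reliability."""
    return max(WEIGHT_MIN, WEIGHT_INIT - rate * (WEIGHT_INIT - WEIGHT_MIN))

def weight_without_detection(weight):
    """Move the weight one step back toward its initial value (assumed linear)."""
    step = (WEIGHT_INIT - WEIGHT_MIN) / RECOVERY_FRAMES
    return min(WEIGHT_INIT, weight + step)
```

Starting from the lower bound, twenty consecutive frames without a detection bring the weight back to its initial value in this sketch, which matches the 20-frame limit mentioned above.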
A.6. Predicted State of Particle Filter. As mentioned above, a particle filter represents a probability density function using particles. From this function, a prediction can be calculated (see Figure 2(h)). In our case, the mean of the five best predictions was used. This prediction was found to be more stable than the weighted mean of all particles or the particle with the highest weight. This predicted ellipse was also used as feedback for the update of the background. To make the background update more robust, the background was updated more slowly inside the prediction of the PF and faster outside this region.
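Taking the mean of the five best particles can be sketched in a few lines. The function name is hypothetical, and the sketch averages all state components directly, including the angle, which is only reasonable when the angles of the best particles lie close together.

```python
import numpy as np

def predict_state(particles, weights, n_best=5):
    """Average the state vectors of the `n_best` particles with the highest weight."""
    best = np.argsort(weights)[-n_best:]
    return particles[best].mean(axis=0)
```

Compared with the single best particle, this estimate is less sensitive to one particle that happens to score well in a single frame.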
A.7. Clean-Up. To reduce the impact of erroneous foreground detection due to, for example, the continuous update of the background or spots from the sun, small blobs were omitted. The same size as for the initialization of the PF was used.
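Omitting small blobs amounts to labeling connected foreground regions and keeping only the large ones. The sketch below uses a breadth-first flood fill with 4-connectivity; the connectivity and the function name are assumptions, and the default threshold follows the 5000 pixels used for the initialization of the tracker.

```python
from collections import deque

import numpy as np

def remove_small_blobs(mask, min_pixels=5000):
    """Keep only connected foreground regions of more than `min_pixels` pixels."""
    h, w = mask.shape
    visited = np.zeros((h, w), dtype=bool)
    cleaned = np.zeros((h, w), dtype=bool)
    for y, x in zip(*np.nonzero(mask)):
        if visited[y, x]:
            continue
        # Collect all pixels of the blob that contains (y, x).
        blob = [(y, x)]
        visited[y, x] = True
        queue = deque(blob)
        while queue:
            cy, cx = queue.popleft()
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                    visited[ny, nx] = True
                    blob.append((ny, nx))
                    queue.append((ny, nx))
        if len(blob) > min_pixels:
            ys, xs = zip(*blob)
            cleaned[list(ys), list(xs)] = True
    return cleaned
```

A pure Python flood fill is slow on full-size frames; it is meant to show the logic, not to be used as is in a real-time system.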
A.8. Processing Time. Unfortunately, the use of this kind of robust tracker has an effect on the processing time. We used a PC with 16 GB of RAM and an Intel Core i5-4670 CPU from 2013 running at 3.40 GHz. The current processing time of one frame depends on the size of the person in the image, but it lies between 250 and 600 ms when no person detection is executed. As stated above, every five frames the upper-body detector is executed on part of the image. This adds between 100 and 300 ms to the processing time of that frame. Almost 90% of the processing time is needed to calculate the coefficients of all particles. This takes between 1 and 8 ms per particle, multiplied by the number of particles, 80 in our case. The processing speed can be increased by optimizing and reimplementing some parts of the code. But the highest performance gain can be expected from parallelizing the calculation of the coefficients of the particles and executing it on the GPU of the PC.

Figure 2: Overview of implementation of foreground (FG) segmentation based on robust background (BG) subtraction using a particle filter (PF) with three different measurement coefficients: FG, histogram, and person detection coefficient. (a) BG model. (b) Current input frame. (c) Detected FG. (d) Determined region of interest (ROI) as biggest foreground object (crosses indicate center and top of bounding ellipse (BE)). (e) Histogram coefficient used by PF. (f) FG coefficient used by PF (OL: first outer layer; OOL: second outer layer). (g) Person detection coefficient that controls the weight factor (small rectangle represents detected upper body; large rectangle represents extrapolated person). (h) Prediction of PF used to update BG model selectively: slowly inside the prediction, fast outside.

Figure 3: Extraction of salient subject features used for fall detection. Purple rectangle: bounding box of subject; white fine segmented line: best-fitting ellipse within subject bounding box; double green diamond: center of mass; blue octagon: head position (small, filled black rectangle is not part of feature extraction set; it is inserted in the figure presentation here to maintain subject anonymity) [28].

Figure 4: Overview of the contents of the feature vector (FV) used by the support vector machine (SVM; Section 3.3). The complete video is split into discrete, one-second time slots. One FV is created for each time slot. An FV contains information about the current time slot and (combinations of) other time slots. Each FV part consists of 10 features as shown (AR = aspect ratio, CAR = change of AR, FA = fall angle, CS = center speed, and HS = head speed) [28].

Figure 5: Precision-recall curve for comparison of our robust background subtraction (BGS) algorithm trained on   containing all falls and trained only using   containing only visible falls but validated on  . Both use $\beta = 10$ for the $F_{\beta}$-measure. The larger markings indicate the optimal trained model (AUC = Area Under Curve).

Figure 6: Screenshots of both falls of   that were not detected. (a) Person B lies down after sitting on the knees. (b) Person C pushes the table away while falling; only the legs are visible during and directly after the fall.

Table 1: Overview of number of falls and number of hours of video per person included in each data set. The data set   contains all falls that meet our inclusion criteria. Data set   is a subset of   with the additional restrictions that the fall should not be occluded and the person should stay on the floor for over 30 seconds after falling. The data set   contains additional videos with only nonfall data. The data set   is a combination of five additional videos of 24 hours for both persons.
3.1.4. Experiments. We ran a couple of different experiments.

Table 2: Results of robust background subtraction (BGS) trained on   and trained using visible falls of   but validated using  . Results from [28] using a slightly different data set are added as reference.

Table 3: Results of fall detection algorithm trained and validated using  . Results of the same falls trained and validated on   are added for comparison.

Table 4: Results per person for all videos from   using a model trained on   and one trained using  .

Table 5: Results for personalization using online training for persons C and D for  = 4. Different combinations of adding zero to five days are tested and averaged.

Table 6: Causes of false positives when using online training with five additional days.