Learning from Large-Scale Wearable Device Data for Predicting Epidemics Trend of COVID-19



Introduction
Since the outbreak of the coronavirus disease 2019 (COVID-19) pandemic, more than 300,000 people have been infected in at least 127 countries as of March 23, 2020, according to the World Health Organization's (WHO's) report [1]. COVID-19 spreads easily from person to person and has killed thousands of people [2][3][4][5]. Since the beginning of the COVID-19 outbreak, several studies have been carried out to forecast the epidemic trend of COVID-19 in China [6][7][8]. For example, Wu et al. built a Susceptible-Exposed-Infectious-Recovered (SEIR) model to simulate the epidemics across the major cities in China [7]. Yang et al. applied the Long Short Term Memory (LSTM) model to predict the number of newly infected COVID-19 cases by utilizing data from the outbreak of Severe Acute Respiratory Syndrome (SARS) in 2003 [6]. Although the models used in those studies could simulate the outbreak trend of the disease, they relied heavily on officially reported statistics; therefore, the timeliness of the models could be affected. In contrast, big data analysis, such as analysis of Internet data, may provide real-time surveillance and improve the timeliness of the forecasting [9][10][11][12][13][14][15][16]. For instance, Google developed the influenza epidemic prediction tool Google Flu Trends (GFT) to estimate the level of influenza activity based on the individual web search queries from different regions [9][10][11]. They assumed that more individuals in a certain region might search online for information about specific diseases if the influenza risk was higher in that region. Therefore, Google built a database containing 50 million of the most common web search queries on all influenza-related topics and constructed the risk prediction model GFT using this search query data as the input [9]. Google showed that GFT could help predict an influenza-like illness outbreak 7-10 days before the Centers for Disease Control and Prevention (CDC) report [10].
In fact, the surveillance report from the CDC usually has a lag time of around 1-2 weeks. Therefore, the result from Google indicated that big data analysis could improve timeliness for public health surveillance. However, search queries can be greatly influenced by social hotspots, which weakens the correlation between the search queries and the occurrence of influenza-like diseases [17].
With the rise in popularity of fitness band and smartwatch devices, physiological signs, such as heart rate, activity, sleep, etc., can be conveniently acquired from these wearable biosensors [18][19][20]. As of 2019, more than 100 million consumers owned Huami wearable devices, and the number continues to grow. In contrast with the big data from web search engines, data from wearable devices can provide more objective information on the health status of the users. For example, once users are infected with an influenza-like illness, their physiological signs would be altered. Radin et al. explored the relationship between the physiological anomaly rate from wearable device users and the influenza-like illness rate reported by the US CDC [21] to build regression models for predicting influenza-like illness cases within different states of America. They utilized the heart rate and sleep data from the wearable devices to improve upon the standard models. The prediction results have a strong correlation with the official data. Li et al. also investigated the role of physiological changes measured with wearable devices in the diagnosis and analysis of disease [22]. The researchers established a personalized disease detection framework, which identifies abnormal physical signs, e.g., from Lyme disease and other inflammatory responses, from the longitudinal data of the individuals. All the studies mentioned above can inform the way wearable device data is used for public health surveillance.
According to clinical studies [23][24][25], the most common symptoms at the onset of COVID-19 are fever, cough, and fatigue, which are closely related to the physiological signs measured by the wearable devices. Therefore, a good method to predict the epidemic trend of COVID-19 may involve building a prediction model based on the wearable device data.
The main purpose of this study is to provide a novel framework for predicting the trend of the COVID-19 outbreak within different countries and cities, using big data collected from wearable devices. There are two major contributions from this study: (1) a physiological anomaly detection method is developed that can identify the anomalous signs reflected by the physiological data from wearable sensors; (2) an online learning framework is proposed for public health emergency surveillance.

Physiological Anomaly Detection.
According to a study on fever and cardiac rhythm [26], heart rate increases by 8.5 beats per minute, on average, for every 1°C increase in body temperature, so an elevated resting heart rate (RHR) might be related to fever caused by COVID-19 or influenza-like illness. The basic anomaly detection method is based on the elevated RHR. Because shortened sleep length also causes an increase in RHR [27], we weaken the contribution of this factor in the physiological anomaly detection method.
RHR and sleep length are directly acquired with the corresponding sensors of Huami wearable devices. Both kinds of synchronized data from the accelerometer (ACC) sensor and the photoplethysmography (PPG) sensor are used to analyze sleep status (including sleep recognition and stage) for measuring sleep length. During sleep, the PPG data is used to compute the RHR. For each user, overall mean and standard deviation (SD) of RHR and sleep length throughout the entire period are calculated. A daily RHR is defined as an anomaly if it is larger than the average RHR plus 1.5 SD, and if in addition, the daily sleep is longer than the average sleep minus 0.5 SD. Considering that COVID-19 or influenza-like illness persist for several days, we define the detection standard of physiological anomaly as continuous anomaly measured for at least five consecutive days.
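The detection rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the per-user series are assumed to be equal-length daily lists, and the sample standard deviation is assumed because the text does not say which SD estimator is used.

```python
from statistics import mean, stdev


def daily_anomaly_flags(rhr, sleep, rhr_k=1.5, sleep_k=0.5):
    """Flag days with elevated RHR that are not explained by short sleep.

    rhr, sleep: one user's daily RHR (bpm) and sleep length (hours).
    A day is flagged when RHR > mean + 1.5 SD and sleep > mean - 0.5 SD.
    Assumption: sample SD (statistics.stdev); the source does not specify.
    """
    rhr_mu, rhr_sd = mean(rhr), stdev(rhr)
    sleep_mu, sleep_sd = mean(sleep), stdev(sleep)
    return [
        (r > rhr_mu + rhr_k * rhr_sd) and (s > sleep_mu - sleep_k * sleep_sd)
        for r, s in zip(rhr, sleep)
    ]


def has_physiological_anomaly(flags, min_run=5):
    """Return True if at least `min_run` consecutive days are flagged."""
    run = 0
    for flagged in flags:
        run = run + 1 if flagged else 0
        if run >= min_run:
            return True
    return False
```

For example, a user with 95 ordinary days followed by five days of clearly elevated RHR and normal sleep would be counted as a physiological anomaly, whereas the same RHR elevation accompanied by unusually short sleep would not.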

Online Prediction of COVID-19 Infection Rate.
The physiological anomaly detected by our method is an indication of fever, which in fact can be caused by COVID-19 or other influenza-like illness. Thus, the key point for COVID-19 infection rate prediction is to distinguish an anomaly arising from COVID-19 from the wider category of physiological anomalies. To this end, as shown in Figure 1, a heterogeneous neural network [28] regression model combining sparse categorical features and dense numerical features (CDNet) is proposed.
CDNet concatenates two subnetworks: CatNN and DenNN. The inputs of the CatNN are sparse categorical features, i.e., holiday activity, season, and weather. The inputs of the DenNN are the historical physiological anomaly rate, the active user density, and the historical officially reported COVID-19 rate, where the historically detected physiological anomaly rate is calculated by dividing the number of users detected with a physiological anomaly by the total number of active users. The output layer of CDNet, normalized by a sigmoid function, outputs the predicted physiological anomaly rate. The detailed inputs and outputs are summarized in equation (1):

$$R'_{t+1,k} = \mathrm{CDNet}\left(R_{t-j,k},\ r_{t-j,k},\ C_{t-j,k},\ c_{t-j,k},\ RC_{t,k},\ D_{t,k}\right) \quad (1)$$

where, for country or city $k$, the output $R'_{t+1,k}$ is the predicted physiological anomaly rate in the next period, $R_{t-j,k}$ is the physiological anomaly rate $j$ periods earlier, $r_{t-j,k}$ is the physiological anomaly rate in the same period as $R_{t-j,k}$ in the previous year, $C_{t-j,k}$ and $c_{t-j,k}$ are the corresponding categorical information with the same temporal definition as $R_{t-j,k}$ and $r_{t-j,k}$, respectively, $RC_{t,k}$ is the officially reported COVID-19 rate (ratio of the number of confirmed COVID-19 patients to the number of residents in the country or city) in the current period, and $D_{t,k}$ is the current active user density (ratio of the number of active users to the number of residents in the country or city). To account for regional disparity, four different CDNet models are trained separately for North China, Central China, South China, and South-Central Europe.
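To make the two-branch structure concrete, the following toy sketch concatenates an embedding-based branch for categorical inputs with a dense branch for numerical inputs and applies a sigmoid at the output. It is an assumption-laden illustration only: the class name `TinyCDNet`, the layer sizes, the random untrained weights, and the use of embedding tables and a single ReLU layer are all hypothetical, since the text does not describe the internal layers of CatNN or DenNN.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def anomaly_rate(n_anomalous_users, n_active_users):
    """Physiological anomaly rate: anomalous users / total active users."""
    return n_anomalous_users / n_active_users


class TinyCDNet:
    """Toy stand-in for CDNet (hypothetical layer sizes, untrained weights)."""

    def __init__(self, cat_sizes, n_dense, emb_dim=2, hidden=3, seed=0):
        rng = random.Random(seed)

        def w():
            return rng.uniform(-0.5, 0.5)

        # CatNN branch (assumed): one embedding table per categorical feature.
        self.emb = [[[w() for _ in range(emb_dim)] for _ in range(n)]
                    for n in cat_sizes]
        # DenNN branch (assumed): one linear layer followed by ReLU.
        self.dense_w = [[w() for _ in range(n_dense)] for _ in range(hidden)]
        # Output layer over the concatenated branch outputs.
        self.out_w = [w() for _ in range(emb_dim * len(cat_sizes) + hidden)]
        self.out_b = 0.0

    def predict(self, cat_ids, dense):
        # Look up and flatten the embedding of each categorical feature.
        cat_part = [v for table, i in zip(self.emb, cat_ids) for v in table[i]]
        # Dense branch: ReLU(W x).
        den_part = [max(0.0, sum(wi * xi for wi, xi in zip(row, dense)))
                    for row in self.dense_w]
        z = cat_part + den_part  # concatenation of the two subnetworks
        # The sigmoid keeps the predicted rate inside (0, 1).
        return sigmoid(sum(wi * zi for wi, zi in zip(self.out_w, z)) + self.out_b)


# Categorical inputs: holiday (2 levels), season (4), weather (3), all assumed.
model = TinyCDNet(cat_sizes=[2, 4, 3], n_dense=4)
rate = model.predict(cat_ids=[1, 0, 2], dense=[0.012, 0.010, 0.0001, 0.002])
```

The sigmoid output is a natural fit here because the target is a rate between 0 and 1.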
In order to get the predicted anomaly rate caused by COVID-19 for the next period, the predicted physiological anomaly rate with ($R'_{t+1,k}$) and without ($R'_{t+1,k} \mid RC_{t,k} = 0$) the supervision of officially reported data is calculated separately. As shown in Figure 2, the supervision is removed by setting $RC_{t,k}$ to 0. Then, the predicted anomaly rate caused by COVID-19 for the next period, $P'_{t+1,k}$, can be calculated as the difference between $R'_{t+1,k}$ and $R'_{t+1,k} \mid RC_{t,k} = 0$:

$$P'_{t+1,k} = R'_{t+1,k} - \left(R'_{t+1,k} \mid RC_{t,k} = 0\right) \quad (2)$$

To consecutively predict the epidemic trend of COVID-19, the CDNet model is trained in an online learning way. As shown in Figure 3, the initial CDNet model $M_0$ is trained with the inputs $R_{t-j,k}$, $r_{t-j,k}$, $C_{t-j,k}$, $c_{t-j,k}$, $j = 1, 2, \ldots, 7$, and with $R_{t,k}$ as the target. The weights of CDNet are updated step by step as COVID-19 spreads, using the newly arriving officially reported COVID-19 rate and detected physiological anomaly rate. The step size of the sliding window for online learning is set to 1 week.
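The with/without-supervision difference can be illustrated with a short sketch. The helper `covid_attributable_rate` and the stand-in `toy_predict` are hypothetical names introduced here; `toy_predict` is a made-up linear function used only so the example runs, not a trained CDNet.

```python
def covid_attributable_rate(predict, history, rc, density):
    """Predicted anomaly rate attributed to COVID-19 for the next period.

    predict: any callable (history, rc, density) -> predicted anomaly rate.
    The supervision is removed by calling the same model with rc = 0.
    """
    with_supervision = predict(history, rc, density)
    without_supervision = predict(history, 0.0, density)
    return with_supervision - without_supervision


def toy_predict(history, rc, density):
    # Hypothetical stand-in for a trained model: baseline anomaly rate plus
    # a term driven by the officially reported COVID-19 rate.
    # `density` is accepted for interface parity but unused in this toy.
    baseline = sum(history) / len(history)
    return baseline + 2.0 * rc


history = [0.010, 0.011, 0.012, 0.012, 0.013, 0.014, 0.015]  # last 7 periods
p_next = covid_attributable_rate(toy_predict, history, rc=0.0004, density=0.002)
```

Because both calls share the same inputs except the reported rate, whatever the model attributes to seasonal or holiday effects cancels in the difference.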

Experiments
3.1. Dataset. Anonymised sensor data of approximately 1.3 million users who wore Huami devices from July 1, 2017, to April 8, 2020 were obtained according to appropriate security control processes. All users are notified that their anonymised data could potentially be used for academic research under the Huami Privacy Policy.
All the users wore their Huami devices for at least 100 days throughout the entire period. Daily measures include RHR, activity, and sleep length, which are the bases of physiological anomaly detection. Data with missing RHR or sleep length were excluded. The daily COVID-19 infection rate data come from the CDC of the corresponding countries.
We build separate models for different countries and cities listed in Table 1, according to the geographical segmentation considering the regional and lifestyle differences. Taking North China as an example, we utilized data from five representative cities (Beijing, Shijiazhuang, Jinan, Taiyuan, and Tianjin) for analysis and model building. The detailed summaries of the active user numbers are also listed in Table 1. The users enrolled in the study were chosen from 19 cities of Central, Southern, and Northern China and seven South-Central European countries to sufficiently reveal the regional disparity.

Analysis Result in China.
The consecutive 3-year physiological anomaly rate curves in Wuhan together with the predicted physiological anomaly rate curves with and without the supervision of the officially reported COVID-19 infection rate in 2020 are illustrated in Figure 4. They are aligned by the time of the Chinese Spring Festival on the temporal axis. In the figure, all five curves peak around the time of the Chinese Spring Festival. In addition, the predicted physiological anomaly rate with the supervision of official data in 2020 fits well with the rate calculated by the anomaly detection algorithm, which validates the prediction performance of the CDNet. Additionally, the physiological anomaly rate curve excluding COVID-19 in 2020 overlaps with both the predicted and the detected physiological anomaly rate curves including COVID-19 in 2020 before the outbreak of COVID-19, which verifies the basic reliability of the model. After that, all three of these curves rise rapidly, which indicates that the outbreak of influenza-like illness is occurring alongside COVID-19. The predicted outbreak period aligns with the real-life situation. In addition, we also predicted the physiological anomaly rate curve from 2018 to 2019 with the prediction model and found that the predicted curve fits well with the total anomaly rate curve during the 2 years. This may indicate that the obvious separation happening around the Chinese Spring Festival between the predicted anomaly rate curves with and without the supervision of the officially reported COVID-19 infection rate in 2020 results from the outbreak of COVID-19. Figure 5 illustrates the predicted COVID-19 infection rate across five Chinese cities and the officially reported accumulating COVID-19 infection cases in Wuhan. In the figure, there is a clear outbreak period in the predicted infection rate curve for each city, which may correspond to that of the newly confirmed cases.
Taking Wuhan as an example, the predicted infection rate peaks around January 28, while the officially reported newly confirmed infection rate in Wuhan reached its highest on February 8 (the data after February 12 in Wuhan is omitted since the COVID-19 diagnostic criteria changed on that day, which causes a sudden sharp increase of 13,436 newly confirmed cases). The predicted disease peak is ahead of the officially reported peak by 11 days. The predicted earlier peak may indicate that health surveillance involving wearable sensors can play an [...] McGoogan also found there was a lag between the start of the illness and the diagnosis of COVID-19 by viral nucleic acid testing [2]. The newly infected cases actually peaked around January 28 if determined by the onset of the symptoms, which happens to be consistent with our findings. In addition, Figure 5 also shows that the predicted infection rate in Wuhan gradually decreases following January 28 and reaches a local minimum on February 1, which may correspond to the plateau in the officially reported accumulating infection curve that occurs after February 19. This result may indicate that the model can also predict the disease control outcome in advance. Moreover, Figure 5 shows that Wuhan has the highest predicted disease peak among the five cities. This is also consistent with the fact that Wuhan is the most affected city in China.

Analysis Result in Italy and Spain.
Figures 6(a) and 6(b) illustrate the predicted COVID-19 infection rate and the officially reported accumulating COVID-19 infection rate in Italy and Spain, respectively. The predicted infection rate in Italy rises rapidly from February 23, 2020, which coincides with the outbreak of COVID-19 in this country. As for Spain, the predicted infection rate starts to increase from February 29, which is 6 days later than Italy, and the predicted rate increases quickly following that. This is consistent with the real-life situation where the outbreak of COVID-19 was later in Spain. As shown in Figure 6, the principal peak in the predicted COVID-19 infection curve of both Italy and Spain has arrived by April 8. The largest numbers of newly confirmed infection cases were officially reported by Italy on March 21 and by Spain on March 25, whereas the predicted principal peaks for the two countries occur around March 13 and March 18, respectively. Both predicted principal peaks are ahead of the officially reported data by at least 1 week.
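The lead times quoted above follow from simple date arithmetic, checked here with the dates stated in the text. The year 2020 and the dictionary layout are assumptions for illustration, and the predicted peak dates are approximate ("around" March 13 and March 18).

```python
from datetime import date

# (predicted principal peak, officially reported peak of new cases)
peaks = {
    "Italy": (date(2020, 3, 13), date(2020, 3, 21)),
    "Spain": (date(2020, 3, 18), date(2020, 3, 25)),
}

# Lead time in days: how far the predicted peak precedes the official peak.
lead_days = {
    region: (official - predicted).days
    for region, (predicted, official) in peaks.items()
}
print(lead_days)  # {'Italy': 8, 'Spain': 7}
```

Both values are at least 7 days, which matches the "at least 1 week" statement.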

Correlation Analysis.
To evaluate the appropriateness of predicting the COVID-19 infection rate from the physiological anomaly rate, we chose 19 Chinese cities to calculate the correlation between the officially reported COVID-19 infection rate and the detected physiological anomaly rate using Pearson's correlation coefficient shown in equation (3). In the equation, $t_0$ represents the start of the COVID-19 outbreak, $t_1$ stands for the end of the study period, and $X$, $Y$ represent the officially reported COVID-19 infection rate and the physiological anomaly rate, respectively. The correlation analysis is performed in two steps. In the first step, we find the point on the physiological anomaly rate curve that corresponds to the outbreak peak point of the officially reported COVID-19 infection curve. In the second step, we align the curves by the two points and calculate the correlation coefficient:

$$\rho = \frac{\sum_{t=t_0}^{t_1} \left(X_t - \bar{X}\right)\left(Y_t - \bar{Y}\right)}{\sqrt{\sum_{t=t_0}^{t_1} \left(X_t - \bar{X}\right)^2}\ \sqrt{\sum_{t=t_0}^{t_1} \left(Y_t - \bar{Y}\right)^2}} \quad (3)$$

where $\bar{X}$ and $\bar{Y}$ are the means of $X_t$ and $Y_t$ over the period from $t_0$ to $t_1$.
Pearson's correlation coefficients, ρ, for different cities in China are listed in Table 2. The average ρ value reaches around 0.68, which is a strong correlation that further supports the view that physiological signs are useful for public health emergency alerts. However, some cities do not show a strong correlation, which may be due to the following reasons. Firstly, the officially reported cases of infection in some cities, e.g., Wuhan, were adjusted on certain days, resulting in sudden changes. Secondly, the number of active users in some cities, e.g., Nanning, is relatively small, which influences the performance of the model; therefore, the ρ value may improve further as the number of active users increases. Finally, some cities, e.g., Beijing, have an unstable user population and data noise due to the population shift.
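The two-step procedure can be sketched as below. This is an illustrative reading, not the authors' code: the function names are hypothetical, and each curve's reference point is assumed to be its global maximum, since the text does not state how the corresponding point on the anomaly curve is located.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def peak_aligned_pearson(official, anomaly):
    """Align two curves at their peaks, then correlate the overlap.

    Step 1 (assumed): take each curve's maximum as its peak point.
    Step 2: shift so the peaks coincide and compute Pearson's rho.
    Returns (rho, lag), where lag > 0 means the anomaly curve peaks earlier.
    """
    lag = official.index(max(official)) - anomaly.index(max(anomaly))
    if lag >= 0:
        x, y = official[lag:], anomaly[:len(anomaly) - lag]
    else:
        x, y = official[:len(official) + lag], anomaly[-lag:]
    n = min(len(x), len(y))  # keep only the overlapping part
    return pearson(x[:n], y[:n]), lag


# Synthetic curves: the anomaly rate has the same shape but peaks 2 steps early.
anomaly = [1, 2, 5, 9, 5, 2, 1, 1, 1, 1]
official = [1, 1, 1, 2, 5, 9, 5, 2, 1, 1]
rho, lag = peak_aligned_pearson(official, anomaly)
```

Aligning at the peaks removes the reporting delay before correlating, so ρ measures similarity of shape rather than of timing.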

Retention Effect.
In the above correlation analysis, it is noticeable that there might be some retention effect in the detected physiological anomaly rate. To be specific, some people with anomalous measurements may continue to wear their devices so that they are counted as anomalies on multiple days. This results in statistical error during the correlation analysis.
In order to analyze the impact, we calculate the retention rates of people detected as anomalies for several consecutive days. As shown in Figure 7, if a person is detected as an anomaly on a certain day, the probability of wearing the device decreases gradually from 3.5% to 0.2% over the following 4 days. This indicates that the retention effect may have very limited influence on the correlation analysis.

Discussion
In this study, a prediction model for COVID-19 epidemic trends has been realized using physiological data collected by wearable devices.
The results show that prediction with dynamic physiological data may have an advantage in alerting to the infection outbreak in advance. However, the detection method for calculating the physiological anomaly rate has some limitations.
Firstly, on holidays, e.g., Chinese Spring Festival, Christmas, etc., transportation and population shift, social activities, and alcohol drinking might greatly influence the physiological signs of the users. For example, the elevated RHR due to heavy drinking on holidays might persist for several days and greatly influence the detected physiological anomaly rate. Especially for China, the outbreak of COVID-19 and influenza-like illness overlaps with the Chinese Spring Festival. Thus, it is necessary to distinguish the elevated RHR cases induced by holiday activities from those induced by infection.
Secondly, the anomaly rate is the statistical description of wearable device users' physiological signs measured in the anomalous range. The validity of the statistical description depends on both the user scale and diversity. For example, [...] [2,3], the statistical performance of the model will be influenced if there is not enough coverage of such people.
Thirdly, although the current study provides a population evolution model for public health surveillance, it may be more meaningful for medical workers as well as individuals to take early precautions if an individualized health status prediction model is available. In the future, such prediction models based on wearable device data will be explored by incorporating more individual features, such as age, gender, body mass index (BMI), etc.

Conclusions
Public health emergencies can cause severe damage to the health and prosperity of our society. The popularity of wearable devices provides the opportunity for researchers to utilize big health data for public health emergency surveillance. In this study, a COVID-19 prediction framework using the health data from wearable devices was put forward. The proposed model could predict the epidemic trend of the COVID-19 outbreak in various countries and cities. The results from the study may shed light on a nationwide solution for the infectious disease surveillance system.

Data Availability
The concerned sensor data cannot be shared due to user privacy. For academic purposes, anonymised region-level statistics can be shared under agreement.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.