Estimation of Underreported Cases of Infections and Deaths from COVID-19 for Countries with Limited and Scarce Data: Examples from Nepal

The COVID-19 pandemic has overburdened public healthcare systems around the world. Further, the lockdowns imposed to curb the spread of the pandemic have been shown to have an adverse effect on the economic and health status of individuals. The pandemic has also compelled us to switch from the physical world to the virtual world, thus depriving us of the benefits of direct person-to-person contact. People in developing countries are especially affected, as an average person there lacks the basic skills needed to survive in the digital world. Due to limited COVID-19 testing capacity in such countries, there is also less testing. Less testing means less contact tracing, underreported cases, and rapid spread of the disease. In this paper, the underreported cases of daily infections and daily deaths are predicted using mathematical models, based on daily data published by the Government of Nepal. Kathmandu valley is taken as a model area for the estimation of underreporting. The behavior of the probability of infection, probability of recovery, and probability of death is also mathematically analyzed. A time-dependent susceptible, infected, and recovered (SIR) model is also proposed. The second wave of COVID-19 is analyzed in detail from 1 Feb 2021 to 1 June 2021. The effect of lockdown on the psychology of people is also modeled with principal component analysis, and the inherent and latent factors affecting people in lockdown are identified. This is based on detailed primary data collected from a survey of 277 households.


Introduction
Countries in the developing world have poor health infrastructure. The COVID-19 pandemic has put a lot of pressure on the public healthcare systems of such countries, which are already understaffed and underresourced. The PCR test, also called the swab test, is used for detecting COVID-19. Nepal has the capacity to conduct PCR tests on 23,000 samples daily on average, but tests are conducted on only 15,000 samples daily on average [1]. The Government of Nepal has announced free COVID-19 tests and treatment for its citizens. But services are much faster and more efficient in the private sector, where a PCR test costs around 20 USD. This is beyond the reach of an average citizen of such countries. These tests are also sometimes unreliable, as they may give a negative result the first time and a positive result the second time. Thus, fewer people turn up for testing. This has resulted in less contact tracing and a rapid increase in infections. Hence, there are underreported cases of infections and deaths from COVID-19. In Nepal, the first case of COVID-19 was reported on 23 January 2020. The patient was a 32-year-old man who had recently returned from Hubei, China. The patient recovered, and his contacts were asymptomatic [2]. The Government of Nepal enforced a strict lockdown from 24 March 2020. This lockdown was completely relaxed on 19 September 2020. The second wave of the COVID-19 epidemic in Nepal was triggered in April 2021. It was caused by the second wave of the COVID-19 epidemic in India, as Nepalese migrant workers returned to Nepal after lockdowns and curfews were announced in different parts of India [3]. Nepal shares an open border with its southern neighbor India. People from either side of the border can freely cross and work without work permits [4]. Initially, the positive cases of COVID-19 were mostly either Indian nationals working in Nepal or Nepali workers who had recently returned from India. Thousands of workers had returned to Nepal without proper screening. This was also the scenario in the first wave. Many workers were stranded in different parts of India due to lockdowns in both countries, and many were stranded at the border [5][6][7][8][9].
With the onset of the second wave of COVID-19 infections, the Government of Nepal enforced a strict lockdown on 29 April 2021. This lockdown was partially relaxed from 22 June 2021 [10]. From 22 June, vehicles could move on the road according to whether the number on the number plate was odd or even. Shops and department stores were allowed to be open till 11:00 a.m. only. The capital city Kathmandu is still the hotspot of the COVID-19 pandemic [11][12][13]. The Government of Nepal started the COVID-19 vaccination drive in January 2021. Around 10% of the total population had received the first dose of a COVID-19 vaccine by May 2021 [14,15].
In this paper, the underreported cases of daily infections and deaths are estimated for Kathmandu valley for the period from 1 Feb to 1 June 2021. The daily COVID-19 updates of the Government of Nepal provide detailed information for Nepal, and these data are used here. But in these updates, detailed information on individual areas, including Kathmandu valley, is missing. Kathmandu valley is the area with the highest population density in Nepal. Data on daily testing, infections, deaths, and recoveries for Nepal are used to draw strength for this estimation and inference. As the data collection process for COVID-19 is being monitored globally by the World Health Organization [16], it is assumed that the pattern exhibited by these data for Nepal is correct but underreported. The behavior of these data for Nepal is also analyzed in detail using the SIR model. Then, estimation is done for Kathmandu valley by regressing on the information gained from Nepal. The impact of the government lockdown on economic and mental health status is also estimated on the basis of primary data collected from an online survey of 277 households. The paper is arranged as follows. This section is followed by Materials and Methods, and then by Result and Discussion and Conclusion.

Data.
This study is based on two sets of data. The first set is the daily COVID-19 updates published by the Ministry of Health and Population; thus, it is secondary data. It is published on the official website of the Ministry of Health and Population, Government of Nepal, developed for COVID-19 updates [14]. It gives daily data on tested, infected, dead, recovered, and active cases, and on total infected, total recovered, and total dead cases. The data from 1 Feb 2021 to 1 June 2021 are taken for this study. The second dataset is primary data. An online survey on the COVID-19 pandemic was conducted. A detailed structured questionnaire comprising 55 questions was designed. These questions were framed with the objective of knowing the impact of the COVID-19 pandemic and the lockdown on the economic, emotional, and psychological wellbeing of an individual. The sampling technique was snowball sampling. Here, the questionnaire was initially distributed to a set of university students. They were asked to provide information about their families and were further asked to collect information from five additional families. The importance of each question was explained to this initial set of respondents. They were requested to continue this process with an additional set of five people. In this way, data were collected from 277 families comprising 1381 individuals. This large sample size supports the statistical viability of the results obtained, by application of the central limit theorem. The data collection was done in two phases. One set of data was collected in November and December 2020. Another set was collected in March and April 2021.
This survey collected information on the impact of the lockdown imposed during the first wave of COVID-19 infection, from 24 March 2020 to 19 September 2020.

The time-dependent susceptible, infected, and recovered (SIR) model for COVID-19 is as follows. Let S(t) represent the number of people susceptible to the disease on day t. Let X(t) represent the number of people infected with the disease on day t. Let R(t) represent the number of people recovered from the disease on day t. Here, R(t) means leaving the system either through recovery or through death on day t. The following is assumed:

$$S(t) + X(t) + R(t) = n(t), \tag{1}$$

where n(t) represents the number of people in the COVID-19 infection system on day t. It is assumed to be dependent on time. In the beginning of a pandemic, n(t) is a small number. As the pandemic spreads, n(t) becomes larger. At the end of the pandemic, n(t) shrinks to a smaller number. Let β(t) and γ(t) be the transmission rate and recovery rate of COVID-19 on day t. The three variables S(t), X(t), and R(t) are governed by the following differential equations [17,18]:

$$\frac{dS(t)}{dt} = -\frac{\beta(t)\,S(t)\,X(t)}{n(t)}, \tag{2}$$

$$\frac{dX(t)}{dt} = \frac{\beta(t)\,S(t)\,X(t)}{n(t)} - \gamma(t)\,X(t), \tag{3}$$

$$\frac{dR(t)}{dt} = \gamma(t)\,X(t). \tag{4}$$

The three variables S(t), X(t), and R(t) also satisfy (1). Equations (2)-(4) can be explained intuitively as follows. Equation (2) describes the change in the number of susceptible on day t. Let us assume that the total number of people in the system of COVID-19 infection on day t is n(t). Then, the probability that a randomly chosen person is in the susceptible state is (S(t)/n(t)). Hence, an individual in the infected state will contact (β(t)S(t)/n(t)) people in the susceptible state per unit time. This implies that the number of newly infected is (β(t)S(t)X(t)/n(t)), as there are X(t) people in the infected state on day t. Thus, the number of people in the susceptible state on day t decreases by (β(t)S(t)X(t)/n(t)). Further, every individual in the infected state recovers with rate γ(t), so there are γ(t)X(t) recovered on day t. This is shown in (4). This amount is subtracted in (3) to give the change in the number of infected people on day t.
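The dynamics described by (2)-(4) can be illustrated with a small numerical sketch. The one-day forward step used below is an assumption made for illustration and is not necessarily the scheme used in this study; the function names and the input values are hypothetical and are not the Nepal data.

```python
def sir_step(s, x, r, beta, gamma):
    """Advance S, X, R by one day with a forward step of the SIR equations."""
    n = s + x + r                      # people in the system, n(t) = S + X + R
    new_infected = beta * s * x / n    # flow from susceptible to infected
    new_removed = gamma * x            # flow from infected to recovered (or dead)
    return s - new_infected, x + new_infected - new_removed, r + new_removed


def simulate(s0, x0, r0, betas, gammas):
    """Run the model with day-specific rates beta(t) and gamma(t)."""
    path = [(s0, x0, r0)]
    for beta, gamma in zip(betas, gammas):
        path.append(sir_step(*path[-1], beta, gamma))
    return path


# Hypothetical starting values and rates, chosen only for illustration.
days = 5
path = simulate(990.0, 10.0, 0.0, betas=[0.30] * days, gammas=[0.10] * days)
for day, (s, x, r) in enumerate(path):
    print(day, round(s, 2), round(x, 2), round(r, 2))
```

Because the rates are passed as lists, a different β(t) and γ(t) can be supplied for every day, which is the time-dependent feature of the model.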

Principal Component Analysis (PCA).
PCA is a statistical approach that can be used to analyze the interrelationship among a large number of variables [19]. Here, the information contained in a number of original variables is condensed into a smaller set of variates (factors), and the loss of information is minimized in this process. This data summarization helps identify the underlying dimensions or factors. It also provides estimates of the factors and of the contribution of each variable to the factors (termed loadings). The unrotated factor matrix comprising the factor loadings is used when the main objective of the research is to find the best linear combination of variables. Here, the particular combination of original variables accounts for more of the variance in the data as a whole than any other linear combination. Suppose we have a set of N variables, $a^*_{1j}$ to $a^*_{Nj}$. These represent N variables related to the economic and mental health status with respect to the COVID-19 lockdown, for each household j. Let us denote the index constructed from them as the impact index. Further, let us standardize each variable by its mean and standard deviation; for example, $a_{1j} = (a^*_{1j} - a^*_1)/s^*_1$, where $a^*_1$ is the mean of $a^*_{1j}$ across households and $s^*_1$ is its standard deviation.
These selected variables are expressed as linear combinations of a set of underlying components for each household j:

$$
\begin{aligned}
a_{1j} &= v_{11} A_{1j} + v_{12} A_{2j} + \cdots + v_{1N} A_{Nj},\\
&\;\;\vdots\\
a_{Nj} &= v_{N1} A_{1j} + v_{N2} A_{2j} + \cdots + v_{NN} A_{Nj},
\end{aligned} \tag{13}
$$

where j = 1, . . . , J. The A's are the components, and the v's are the coefficients on each component for each variable. The "scoring factors" from the model are recovered by inverting the system implied by (13) and yield a set of estimates for each of the N principal components, for j = 1, . . . , J.
The impact index expressed in terms of the original (unnormalized) variables is therefore an index for each household j = 1, . . . , J.
With respect to the study of 277 households, there are 14 significant variables and 277 households. So, N = 14 and J = 277.
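A minimal sketch of how such an index could be computed is given here. It assumes that the scoring factors are the elements of the leading eigenvector of the correlation matrix, found by power iteration, and that the index is the household score rescaled to unit standard deviation. The function names and the response values are hypothetical; this is not the software or the data used in the study.

```python
import math


def standardize(rows):
    """Scale every column to mean 0 and sample standard deviation 1."""
    n, k = len(rows), len(rows[0])
    means = [sum(row[c] for row in rows) / n for c in range(k)]
    sds = [math.sqrt(sum((row[c] - means[c]) ** 2 for row in rows) / (n - 1))
           for c in range(k)]
    return [[(row[c] - means[c]) / sds[c] for c in range(k)] for row in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, k = len(z), len(z[0])
    return [[sum(row[i] * row[j] for row in z) / (n - 1) for j in range(k)]
            for i in range(k)]


def first_component(corr, iterations=500):
    """Power iteration for the leading eigenvector (unit length) and eigenvalue."""
    k = len(corr)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    eigenvalue = sum(v[i] * corr[i][j] * v[j] for i in range(k) for j in range(k))
    return v, eigenvalue


# Hypothetical ordinal responses (rows = households, columns = impact variables).
responses = [[1, 2, 1], [2, 2, 3], [3, 4, 2], [4, 3, 5], [5, 5, 4], [2, 1, 2]]

z = standardize(responses)
weights, eigenvalue = first_component(correlation_matrix(z))
explained_share = eigenvalue / len(weights)  # trace of a correlation matrix is k

# Household score on the first component, rescaled to standard deviation 1.
impact_index = [sum(w * a for w, a in zip(weights, row)) / math.sqrt(eigenvalue)
                for row in z]
print(round(explained_share, 3), [round(s, 2) for s in impact_index])
```

With this scaling, the index has mean 0 and standard deviation 1 across households, and a higher value indicates a household that scores higher on the variables loading on the first component.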

Result and Discussion
The results of this paper are given below in three parts. In the first part, results from the detailed statistical analysis of the COVID-19 pandemic survey data are discussed. In the second part, the results from the time-dependent susceptible, infected, and recovered (SIR) model are given. In the third part, underreported infections and deaths in Kathmandu valley are estimated for the period from 1 Feb 2021 to 1 June 2021. To ensure continuity in the discussion, the research outcomes are presented in this order.
A survey of 277 households was conducted. The responses were classified into categories for categorical data analysis [20]. The aim was to assess the impact of lockdown on the food security, economic status, and mental health of individuals. Information on 1371 individuals living in 277 households was gathered. The income groups of these households, classified into quintiles, are shown in Figure 1. Here, the highest income group is represented by Quintile 5, and the middle income group is represented by Quintile 3. It is seen from Figure 1 that 213 households belong to the middle income group, 5 belong to the highest income group, and 4 belong to the lowest income group. Figure 2 shows the classification of these families according to their personal savings. We see that in Quintile 1 and Quintile 2, 50% or more of the families responded that they had no savings. But from Quintile 3 to Quintile 5, 50% or more responded that they had savings. The distribution of the families according to the educational qualifications and type of employment of the head of the family is given in Figures 3 and 4. We see that in 98 households, the head has studied up to Bachelors level. This is followed by 75 households with an educational background of Class 9 to Class 12. Only 60 of the 277 households have a highly educated head of the family, with an education of Masters or above. Similarly, in 149 of the 277 households, the head of the family is employed in a white-collar job. Information on the economic, educational, and professional status of these households is provided in Figure 1 to Figure 4.

The interdependence of income group with 31 variables related to food security, economy, and mental health was tested using the chi-square test for independence of attributes. The results indicating dependence on the income group at the 10% level of significance are shown in Table 1. So, out of 31 variables, 14 variables were identified as dependent on the income status of the household. As seen from Table 1, these are related to the food security, economy, and mental health status of the family during the COVID-19 lockdown. The descriptive statistics of these variables are given in Table 2. As seen from Table 2, these are categorical data that can be classified on an ordinal scale. Out of 15 variables mentioned here, 13 range from 1 to 5. The response to personal savings is either yes or no. The response to fear of losing a job ranges from 0 to 5. The higher the value of these variables, the more severe the negative impact of the COVID-19 lockdown. The mean and SD values give the center and spread of the data.

The PCA of these 14 variables was conducted. These variables measure the impact of the COVID-19 lockdown and are highly dependent on the income status. These 14 impact variables are correlated with one another. Through PCA, these 14 variables are reduced to 4 orthogonal, uncorrelated variables explaining 58.998% of the total variance. The significance of this PCA is given in Table 3. The KMO is higher than 0.6. The Bartlett test of sphericity rejects the null hypothesis.
The null hypothesis here is that the correlation matrix is an identity matrix. These results of Table 3 justify PCA for these data. The first principal component has high factor loadings on increased financial insecurity, decreased income, food shortage, depression, and anxiety. This component can be related to the overall negative impact of the lockdown. It accounts for 28.53% of the total variability. The second component has high loadings on limited variety and fewer foods available during this period. This component explains 14% of the total variability. The third component has high factor loadings on financial insecurity and decreased income. It explains 8.709% of the total variability. The fourth component explains 7.676% of the total variability. It has high factor loadings on personal savings and fear of losing a job.

The impact index was regressed on the income status and the educational background of the head of the family, and the results are given in Table 4. This linear regression is highly significant with a p value ≤ 0.01. As seen from Table 4, the regression coefficients are also significant at the 5% level of significance. Although R² is 0.1, the regression and the regression coefficients are highly significant. This helps us in making valid conclusions and generalizations. We see from Table 4 that the impact index decreases as the income status and the education of the head of the family increase.

Journal of Environmental and Public Health

The impact index for the 277 households is shown in the histogram in Figure 5. The behavior of the impact index with respect to the different income quintiles over the 277 households is shown in the boxplot in Figure 5.
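The chi-square test for independence of attributes, used above to screen the 31 variables against the income group, can be sketched as follows. The contingency table is hypothetical and is not taken from the survey. The closed-form p value in the last step holds only for 2 degrees of freedom; for other table sizes, the chi-square distribution from a statistical library would be needed.

```python
import math


def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # count expected under independence
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = income group (low, middle, high), columns = savings (no, yes).
table = [[30, 10], [40, 40], [10, 30]]
stat, dof = chi_square_independence(table)

# For exactly 2 degrees of freedom, the upper-tail probability is exp(-stat / 2).
p_value = math.exp(-stat / 2)
dependent_at_10_percent = p_value < 0.10
print(round(stat, 3), dof, p_value, dependent_at_10_percent)
```

A variable would be retained for the PCA when its test against the income group rejects independence at the chosen level of significance.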
Here, the time-dependent SIR model was used, and the dynamics of change in the daily number of infected were analyzed. The daily data on the number of people tested for COVID-19 were taken as the data on the daily number of susceptible cases. It was assumed that the number of tests performed per day was directly proportional to the number of susceptible per day. Fewer tests performed due to a poor public health system meant underreported cases of susceptible and daily infections. Out of the total number of tests performed in a day, the number of tests with positive results indicated the number of those infected. Tests with negative results indicated healthy people. Let S(t) represent the number of people tested for COVID-19 on day t. Here, S(t) is the number of susceptible on day t. Let X(t) denote the number of infected on day t. Let D(t) represent the number of dead on day t. Let n(t) represent the number of people in the system of COVID-19 infection on day t. Here, a COVID-19 patient leaves the system when he or she recovers or dies from the disease. A system is defined as the system of COVID-19 disease. The probability that a person is infected is given by (16). The probability that a person has recovered is given by (17). It is the probability of leaving the system. The probability that a person has died is given by (18). The probability that a person in the COVID-19 system is susceptible is given by (19). The behavior of p_X(t), p_R(t), and p_S(t) is shown in Figure 6. This is observed over a period of 120 days. p_X(t) is the probability of entering the system. p_R(t) is the probability of leaving the system. We see from Figure 6 that the curves of the proportions of those infected and those recovered intersect at two points. The area between these two points of intersection corresponds to a susceptible person who has been infected with COVID-19 but has not yet recovered from it. It is the probability that a susceptible person is in the system of COVID-19 infection. It is 0.692 from the data.
So, if the entire population x is considered as susceptible at the peak of the epidemic, then the number of those infected but not recovered is 0.692x. Let us take Kathmandu valley as the model area. The population of Kathmandu valley is 1472000. So, in this second wave of COVID-19, 0.692 × 1472000 = 1018624 are infected. So next time, Nepal should prepare for 1018624 infected from Kathmandu valley alone over the entire wave of COVID-19 infection. Exit from this system is either by recovery or by death. This result is also validated on the primary data collected from the survey of 277 households comprising 1371 individuals. Here, 1371 individuals are taken as susceptible. They provided information on 885 COVID-19 infected. So, the proportion of those infected during the entire first wave was 0.641. This value is very close to the predicted proportion of COVID-19 infected during the second wave.
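The calculation above can be sketched in a few lines. The function `state_probabilities` is hypothetical: it assumes that each probability is the share of the corresponding group in n(t) = S(t) + X(t) + R(t), in line with the ratio (S(t)/n(t)) used in the model, and it may differ from the exact definitions in (16)-(19). The single-day counts are invented for illustration; only the value 0.692 and the population of 1472000 come from the text.

```python
def state_probabilities(s, x, r):
    """Share of susceptible, infected, and recovered in the system on one day."""
    n = s + x + r  # assumed composition of n(t)
    return {"p_S": s / n, "p_X": x / n, "p_R": r / n}


# Hypothetical counts for a single day, not taken from the daily updates.
probs = state_probabilities(s=8000, x=1500, r=500)
print(probs)

# Projection for Kathmandu valley using the values reported in the text.
p_in_system = 0.692      # probability that a susceptible person is in the system
population = 1_472_000   # population of Kathmandu valley
expected_infected = round(p_in_system * population)
print(expected_infected)
```

The projection scales linearly with the population, so the same probability can be applied to any other area once its population is known, provided the assumption of a common probability is acceptable there.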
We also see from Figure 6 that (S(t)/n(t)) narrows down as the number of COVID-19 infected per day reaches its peak. This implies that the number of susceptible people who test positive increases as the number of infected increases. A comparison between the observed and predicted numbers of infected people obtained by fitting the time-dependent SIR model is shown in Figure 7. Here, two models are compared. In Model 1, the mean of (S(t)/n(t)) over all values of t is considered. So, (S(t)/n(t)) = mean(S(t)/n(t)) = 0.8403. Model 2 assumes that the number of susceptible at time t is equal to the total population at time t; that is, (S(t)/n(t)) = 1. This is the general assumption in the case of the SIR model. The accuracy of these SIR models is given in Table 5. The accuracy of a model can be explained by R², the coefficient of determination. As seen from Table 5, in the original time-dependent SIR model with (S(t)/n(t)), R² is 0.9571. This implies that 95.71% of the variability of the daily infected data for Nepal can be explained. The R² value of Model 1 is 0.930. Thus, this model can explain 93% of the variability of the daily COVID-19 infected data. This is also seen in the close correspondence between the observed and predicted values in Figure 7. In Model 2, R² is 0.90, so this model explains 90% of the variability of the data. The higher the R², the closer the fit between the observed and the predicted values. These models have shown very good results for Kathmandu valley as well. In the original time-dependent SIR model with (S(t)/n(t)), R² is 0.968. This implies that 96.8% of the variability of the daily infected data can be explained. As seen from Table 5, R² for Model 1 is 0.956, and R² for Model 2 is 0.946. The results given in Table 5 validate and justify the assumptions made in the development of (16)-(19).
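For readers who wish to reproduce this kind of comparison, the coefficient of determination can be computed as in the following sketch. It uses the common definition R² = 1 − SS_res/SS_tot, which is assumed here and may differ from the exact computation behind Table 5. The observed and predicted series are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination between observed and predicted values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # unexplained variation
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)              # total variation
    return 1.0 - ss_res / ss_tot


# Hypothetical daily infected counts and model predictions.
observed = [100, 150, 230, 310, 420]
predicted = [110, 140, 240, 300, 410]
r2 = r_squared(observed, predicted)
print(round(r2, 4))
```

A value close to 1 means that the predicted series tracks the observed series closely, which is how the values in Table 5 are read.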
Table 5: Accuracy (R²) of the SIR models.
Area         Assumption on (S(t)/n(t))                  R²
Nepal        (S(t)/n(t))                                0.9571
Nepal        (S(t)/n(t)) = mean(S(t)/n(t)) = 0.8403     0.930
Nepal        (S(t)/n(t)) = 1                            0.90
Kathmandu    (S(t)/n(t))                                0.968
Kathmandu    (S(t)/n(t)) = mean(S(t)/n(t)) = 0.8577     0.956
Kathmandu    (S(t)/n(t)) = 1                            0.946

Kathmandu is the capital city of Nepal. Most COVID-19 patients are found in Kathmandu valley. According to the 2011 census, the population density of Nepal is 180, while Kathmandu valley has a population density of 4416 [21]. Here, population density is defined as the average number of people per square kilometer. Kathmandu valley comprises Kathmandu, Lalitpur, and Bhaktapur. It is the area with the highest population density in Nepal. The details of COVID-19 tests and testing facilities in Nepal are provided in Table 6. We made the following assumptions here:
(i) The pattern exhibited by COVID-19 data for Nepal was correct but underreported.
(ii) Kathmandu valley had the highest contribution to the daily numbers of COVID-19 susceptible, infected, recovered, and dead cases of Nepal. This was due to its highest population density.
(iii) The data on tests on a day were taken as the data on the number of susceptible on that day.
(iv) The pattern exhibited by the daily numbers of COVID-19 susceptible, infected, recovered, and dead cases for Nepal was highly influenced by the data on these variables from Kathmandu valley.
(v) The highest weights were assigned to Kathmandu valley in the COVID-19 pandemic landscape of Nepal.
This estimation is done in the following manner:
(i) The number of susceptible for Nepal can be explained by the curve given in Figure 8. This curve is described by a parabolic function fitted with R² = 0.913.
(ii) The daily number of susceptible cases for Kathmandu valley can also be explained by a parabolic function [21]. These data are not available in the official COVID-19 data published daily. It is assumed that the behavior of the daily number of susceptible cases for Kathmandu valley is the same as that of Nepal. From the COVID-19 updates, on 1 Feb 2021, 3316 tests were from Kathmandu valley. This can also be seen in Table 6. So, for Kathmandu valley,

$$S(t) = 3.840t^2 - 256t + 3568.16,$$

which gives S(1) = 3316. This holds from 1 Feb 2021 to 1 June 2021, a period of 120 days.
(iii) The probabilities of infection, death, and recovery in Kathmandu valley are assumed to be the same as those of Nepal. This is due to the highest concentration of COVID-19 patients here in comparison with any other part of Nepal. This can also be seen in Table 6. Using (16) and (18), the daily infections and daily deaths in Kathmandu valley are estimated.

These are reasonable estimates. The reasons behind these underreported cases are the poor public health system and hence less testing, less contact tracing, and fewer infections reported. Some deaths took place at home, and their causes could not be accounted for.
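As a quick check of the parabolic function for Kathmandu valley, the following sketch evaluates it on the first day of the study period. The function name is hypothetical, and the day index t = 1 for 1 Feb 2021 is an assumption consistent with the 3316 tests reported for that day.

```python
def susceptible_kathmandu(t):
    """Parabolic function for the daily number of susceptible in Kathmandu valley."""
    return 3.840 * t ** 2 - 256 * t + 3568.16


# Day index t = 1 is taken to be 1 Feb 2021 (assumed indexing).
first_day = susceptible_kathmandu(1)
print(round(first_day, 2))
```

The value on the first day reproduces the number of tests from Kathmandu valley reported in the daily update for 1 Feb 2021.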
Various studies on underreporting of COVID-19 have been conducted. For example, Garcia et al. [22] used an open online survey for indirect reporting of COVID-19. They compared the estimates obtained from the survey with serology study data from Spain. Krantz and Rao [23] used a wavelet approach in harmonic analysis to develop full epidemic data from partial data. They used governmental data on COVID-19 for their analyses. Zhu et al. [24] developed a wastewater SARS-CoV-2 RNA model based on the fecal shedding profile of infected individuals. Rao and Krantz [25] commented on retrospective adjustment in the calculation of the basic reproduction number, so that the reporting errors within the epidemic spread network could be corrected. Mackey et al. [26] used an unsupervised learning approach to detect and characterize user-generated conversations on Twitter. These conversations were associated with COVID-19 related symptoms, experiences with access to testing, and mentions of disease recovery.
This paper is novel in comparison with the previously mentioned papers. Here, underreporting is studied from the perspective of a country with limited and scarce official records. Such countries have inefficient registration of vital events such as birth, death, and migration. Remote geographical locations, lack of awareness, and lack of incentives are some of the reasons behind this regrettable state. With this backdrop in mind, the aim here was to develop a parsimonious model. This model should explain the scenario of COVID-19 accurately. This model should also have a physical interpretation. The following steps were taken to fulfill these objectives: (1) Primary data on 277 households were collected from an online survey.

Conclusion
In countries like Nepal, less testing, less contact tracing, and less reporting of COVID-19 have taken place. This is due to limited COVID-19 testing facilities. There are long queues at governmental testing facilities. Testing in the private sector is expensive and beyond the reach of common people.
Here, underreported cases of daily infections and daily deaths in Kathmandu valley are estimated for the period from 1 Feb 2021 to 1 June 2021. Kathmandu valley is the area with the highest population density in Nepal. It comprises the capital city Kathmandu, Lalitpur, and Bhaktapur. Governmental records state that in Kathmandu valley, the total COVID-19 infections and COVID-19 deaths during this time period are 113522 and 1576, respectively. But we predict 169512 and 4326, respectively. So, there are 55990 and 2750 cases of underreported infections and deaths, respectively. This was done under the assumption that the patterns exhibited by governmental records were correct but underreported. The time-dependent SIR model was also fitted to the COVID-19 data. Very high values of R² for Nepal and Kathmandu were obtained. This validated and justified the assumptions made while using this SIR model. Under these assumptions, the probability that a person is in the second wave of COVID-19 infection is predicted to be 0.692. So, Kathmandu valley with a population of 1472000 should prepare for 1018624 infected in the next wave of COVID-19 infection.
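The underreporting figures stated above follow directly from the reported and predicted totals, as the following sketch shows. The variable names are hypothetical, and the share of missed cases is an additional illustrative quantity derived from the same four numbers.

```python
# Totals for Kathmandu valley, 1 Feb 2021 to 1 June 2021, as stated in the text.
reported = {"infections": 113_522, "deaths": 1_576}
predicted = {"infections": 169_512, "deaths": 4_326}

# Underreported cases are the predicted cases that do not appear in official records.
underreported = {key: predicted[key] - reported[key] for key in reported}

# Share of the predicted cases that went unreported (derived, for illustration).
share_missed = {key: underreported[key] / predicted[key] for key in reported}

print(underreported)
print({key: round(value, 3) for key, value in share_missed.items()})
```

Under these totals, roughly one third of the predicted infections and close to two thirds of the predicted deaths are missing from the official records.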
The COVID-19 lockdown was imposed from 24 March to 19 Sept 2020. Out of 31 variables related to economic and mental health status in this COVID-19 lockdown, 14 variables significantly related to the income status were identified. This was based on primary data collected from an online survey of 277 households, for which a detailed structured questionnaire comprising 55 questions was designed. PCA was conducted on these 14 variables, and 4 principal components were identified. These components explained 58.998% of the variability of the data. The first principal component is related to the overall negative impact of the lockdown and explains 28.53% of the total variability. A COVID-19 lockdown impact index was calculated on the basis of the first principal component. Its behavior over the 277 households is analyzed and displayed here with the help of a histogram and a boxplot. It has a mean value of 0 and a standard deviation of 1 and ranges from −2 to 3.6. The higher the value of this COVID-19 impact index, the more severe the impact of the lockdown. The regression of the COVID-19 impact index on the income status and educational background of the head of the family gave the following results. The inherent value of the COVID-19 lockdown impact index is 1.811. This is a 96.4 percentile and top 3.6% value. As the income status increases by one quintile, the negative impact of the COVID-19 lockdown decreases by 0.477. Similarly, with a 1-unit increase in education, the negative impact of the lockdown decreases by 0.142. These results are based on data collected from 277 households.
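The regression results can be read as a simple prediction rule, sketched below. This assumes that the inherent value 1.811 is the intercept of a linear regression and that income status and education enter as numeric codes; the function name and the example codes are hypothetical, and the coding of the education variable is not specified here.

```python
def predicted_impact_index(income_quintile, education_level):
    """Predicted lockdown impact index from the reported regression coefficients."""
    intercept = 1.811          # inherent value of the index (assumed to be the intercept)
    income_effect = -0.477     # change per one-quintile increase in income status
    education_effect = -0.142  # change per one-unit increase in education
    return intercept + income_effect * income_quintile + education_effect * education_level


# Hypothetical household: middle income quintile, education code 2.
example = predicted_impact_index(income_quintile=3, education_level=2)
print(round(example, 3))
```

Under these assumptions, moving a household up one income quintile lowers the predicted index by 0.477, whatever its education level.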
Data Availability
The data used are described in the research article in detail.

Conflicts of Interest
The author declares that there are no conflicts of interest.