Water Pollution Detection Based on Hypothesis Testing in Sensor Networks

Water pollution detection is of great importance in water conservation. In this paper, the water pollution detection problems of the network and of the node in sensor networks are discussed. Both the case in which the monitoring noise is normally distributed and the case in which it is not are considered. The pollution detection problems are first analyzed based on hypothesis testing theory; then, the specific detection algorithms are given. Finally, two implementation examples illustrate how the proposed detection methods are used for water pollution detection in sensor networks and demonstrate their effectiveness.


Introduction
Water is the most important material for human survival and a valuable resource for industrial and agricultural production. With the development of the economy and industry, more kinds of pollutants are discharged into water environments such as rivers and lakes, and more water pollution disasters have happened. Detecting pollution in a timely manner is important for water conservation and is the precondition for locating the pollution source.
In most pollution monitoring and pollution source localization applications that use sensor networks, pollution is considered detected when the nodes report concentration values larger than a given threshold, as in the works on pollution monitoring [1-8] and the works on pollution source localization [9-12].
Since water already carries an initial pollutant concentration from normal production and daily life, it cannot be deduced that a pollution source exists merely because the sensor nodes have registered a concentration. At the same time, plankton, garbage, aquatic animals, plants, and so forth in the water environment interfere with water pollution monitoring and disturb the monitoring data. A proper decision threshold for determining whether there is pollution is therefore difficult to give in the simple detection method.
In this paper, hypothesis testing is adopted to solve the water pollution detection problems. Firstly, a brief description of the monitoring sensor network and of the problems in pollution detection is given. Secondly, theoretical approaches to solving the detection problems are analyzed based on hypothesis testing. Thirdly, the specific detection algorithms are given. Finally, implementation examples are given to illustrate the proposed pollution detection methods.
The static sensor nodes sample the concentrations synchronously with the same time interval. Background information such as the diffusion coefficient, the water depth, and the sampling interval is known in advance. The monitoring information is routed to the sink node and processed by the data processing center. The network deployment is shown in Figure 1.

The Detection Problems.
The pollution detection problem of the network is to determine whether the sensor network has found the pollution. More specifically, $n$ static nodes sample and store the concentrations uniformly with a time interval $\Delta t$. At the sampling time $t_k$, based on the samples $C(x_i, y_i, t_k)$, $i = 1, 2, 3, \ldots, n$, of the $n$ nodes, it is determined whether there is pollution at a given significance level $\alpha$ in hypothesis testing. The purpose of the pollution detection of the network is to detect the pollution in a timely manner.
The pollution detection problem of the node is to determine whether each node in the network has obtained concentration information about the pollution source. More specifically, all static nodes in the network sample and store the concentrations synchronously with a time interval $\Delta t$. At a given significance level $\alpha$ in hypothesis testing, it is determined whether the node $(x_i, y_i)$ has found the water pollution at $t_k$ based on the known samples $C(x_i, y_i, t_j)$, $j = 1, 2, 3, \ldots, k$.
Diffusion changes the concentration field slowly. When the network finds the pollution, not every sensor node has detected it. As time passes, more and more sensor nodes detect the pollution. In pollution source localization, a node can be used only after it has detected the pollution.

Pollution Detection Based on Hypothesis Testing
In [13], the present authors gave a simple discussion of water pollution detection under the assumption that the distribution of the monitoring noise is normal and known in advance. In this paper, the pollution detection problems are discussed in more general cases. Assume that the initial pollutant concentration (the pollution concentration of normal production and living sewage) in water is $C_0$. If there is no diffusion source, $C(x_i, y_i, t_k) = C_0 + \varepsilon$, $i = 1, 2, 3, \ldots, n$. If some node $(x_i, y_i)$ has detected the pollution, $C(x_i, y_i, t_k) = c(x_i, y_i, t_k) + C_0 + \varepsilon$, where $\varepsilon$ is the measurement noise of the sensor nodes and $c(x_i, y_i, t_k)$ is the theoretical concentration value related to the pollution source.
Remark 1. The concentration $c(x_i, y_i, t_k)$ changes over time and across locations. The specific forms of water pollution diffusion can be found in [14].

Distribution Test.
Under different statistical distributions of the samples, the specific hypothesis testing problems are different. In water pollution detection, the first step is to determine whether the distribution of the observation noise is normal.
In the initial state, only a few nodes perceive the pollution, or no node perceives it. When there is no pollution, $C(x_i, y_i, t_1) = C_0 + \varepsilon$, and if the distribution of $C(x_i, y_i, t_1)$ is normal, $\varepsilon$ is a normal variable.
To save cost, the number of sampling nodes is often limited in practical applications. The Shapiro-Wilk $W$ test [15], which is suited to small samples, can be used as the distribution test method here.
Step 3. At a given significance level $\alpha$ of hypothesis testing, if $W \leq W_\alpha$, the distribution is nonnormal; otherwise, the distribution is normal.
In the above steps, the values of $a_i$ and $W_\alpha$ can be obtained by table lookup [16].
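For reference, the standard textbook form of the Shapiro-Wilk statistic is reproduced below. It is given as background and is assumed, not confirmed, to coincide with the statistic (2) referred to in the detection algorithms. Here $x_{(1)} \leq \cdots \leq x_{(n)}$ are the ordered samples, $\bar{x}$ is the sample mean, and $a_i$ are the tabulated coefficients.

```latex
W = \frac{\left( \sum_{i=1}^{n} a_i \, x_{(i)} \right)^{2}}
         {\sum_{i=1}^{n} \left( x_i - \bar{x} \right)^{2}}
```

Values of $W$ close to 1 are consistent with normality, which is why a small value of $W$ leads to the conclusion that the distribution is nonnormal.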
The specific detection problems when the distribution of the sensing values is normal differ from those when the distribution is nonnormal. The detection methods in the two cases are discussed in the following.
Case 2 (hypothesis testing under nonnormal distribution). When the sample distribution is nonnormal, it is difficult to identify the specific distribution of the values. In this case, the Wilcoxon rank sum test is used directly [17].
Our test problem is to determine whether there is a significant difference between the two groups of independent samples in Table 1. The hypotheses are
$$H_0^{(1)}: \mu_1 = \mu_2, \qquad H_1^{(1)}: \mu_1 \neq \mu_2. \tag{6}$$
List the data in ascending order and allocate the ranks $R_i$ according to the order. The significance level of hypothesis testing is $\alpha$, and $T_1$ is the sum of ranks of Sample 1. When
$$T_1 \geq T_U(\alpha/2) \quad \text{or} \quad T_1 \leq T_L(\alpha/2), \tag{7}$$
reject $H_0^{(1)}$; there is a pollution source in the monitoring area. $T_U(\alpha/2)$ and $T_L(\alpha/2)$ are the upper tail value and lower tail value of the two-tailed rank sum test [17,18].

The Pollution Detection of the Node
[...] $j = 2, 3, \ldots, k$. The hypotheses are given by
$$H_0^{(3)}: \mu = 0, \qquad H_1^{(3)}: \mu \neq 0. \tag{8}$$
The test statistic is $t_2$, given by (9). The rejection region satisfies $P(|t_2| \geq t_{\alpha/2}(k-2)) = \alpha$; that is, when
$$|t_2| \geq t_{\alpha/2}(k-2), \tag{10}$$
reject $H_0^{(3)}$, and it is deduced that the node has detected the pollution. Here, $t_{\alpha/2}$ is the $\alpha/2$ quantile of the $t$-distribution, and $\alpha$ is the significance level of hypothesis testing [17,18].
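As an illustration of the node test under normal noise, the following is a minimal sketch. It assumes that the test samples are the $k-1$ successive differences of the node's readings and that the statistic is the usual one-sample $t$ statistic, which then has $k-2$ degrees of freedom; the exact statistic (9) should be taken from the original derivation. The function names, the readings, and the critical value (read from a $t$ table) are hypothetical and chosen only for illustration.

```python
import math
import statistics


def t_statistic(readings):
    """One-sample t statistic of the successive differences of the readings.

    Assumption (not from the source): under H0 the differences have zero mean.
    """
    diffs = [later - earlier for earlier, later in zip(readings, readings[1:])]
    if len(diffs) < 2:
        raise ValueError("at least 3 readings are needed")
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean / (sd / math.sqrt(len(diffs)))


def node_detects_pollution(readings, t_critical):
    """Reject H0 (mu = 0) when |t| >= t_critical, the tabulated t_{alpha/2}(k - 2)."""
    return abs(t_statistic(readings)) >= t_critical


# Hypothetical readings of one node at k = 6 sampling times.
rising = [1.00, 1.02, 1.05, 1.09, 1.12, 1.16]
flat = [1.00, 1.01, 0.99, 1.00, 1.01, 1.00]
T_CRIT = 2.776  # two-sided critical value for alpha = 0.05 and 4 degrees of freedom

print(node_detects_pollution(rising, T_CRIT))
print(node_detects_pollution(flat, T_CRIT))
```

Passing the critical value in as an argument mirrors the table lookup used in the text and keeps the sketch free of third-party dependencies.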
Case 2 (hypothesis testing under nonnormal distribution). The Wilcoxon rank sum test is used. Our test problem is to determine whether there is a significant difference between the two groups of independent samples in Table 2.
The corresponding hypothesis testing problem is denoted by (11). The same solving method as for hypothesis testing problem (6) can be used for (11).
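The rank sum procedure used in the nonnormal case, for the network and for the node alike, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the sample values are hypothetical, ties are given average ranks (a common convention that the text does not specify), and the tail values are assumed to come from a rank sum table for the sample sizes in use.

```python
def rank_sum(sample1, sample2):
    """Sum of the ranks of sample1 within the pooled data sorted in ascending order.

    Tied values receive the average of the ranks they occupy.
    """
    pooled = sorted(list(sample1) + list(sample2))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 hold the same value; their 1-based ranks are i+1..j.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(rank_of[value] for value in sample1)


def rejects_h0(t1, lower_tail, upper_tail):
    """Two-tailed criterion: reject H0 when T1 <= T_L or T1 >= T_U."""
    return t1 <= lower_tail or t1 >= upper_tail


# Hypothetical concentrations: Sample 1 before and Sample 2 after a suspected release.
sample_1 = [1.00, 1.20, 0.90, 1.10, 1.05, 0.95]
sample_2 = [1.50, 1.60, 1.40, 1.70, 1.55, 1.45]

t1 = rank_sum(sample_1, sample_2)
# Illustrative tail values for two samples of size 6 at alpha = 0.05;
# they should be confirmed against the table actually used.
print(t1, rejects_h0(t1, lower_tail=26, upper_tail=52))
```

Every value in Sample 1 lies below every value in Sample 2, so Sample 1 takes the six lowest ranks and its rank sum falls in the lower tail.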

Sample Size Requirements in Detection
3.4.1. Basic Requirements. According to the basic sample number requirements of the hypothesis testing methods [17,18], the basic sample size requirements of our detection methods are as follows. In the distribution test, the number of samples $n$ should satisfy $3 \leq n \leq 50$. When the distribution of the sample noise is normal, there should be at least 4 samples in the pollution detection of the network and in the pollution detection of the node, so $n \geq 4$ and $k \geq 5$.
When the distribution is not normal, there should be at least 6 samples in the pollution detection of the network and in the pollution detection of the node, so the sample numbers should satisfy $n \geq 6$ and $k \geq 6$.
In the pollution detection of the network, to reduce cost, the number of nodes is always given in advance. So there is a precondition: the number of sensor nodes satisfies $n \leq n_1$, where $n_1$ is a given number. So that a node can take part in pollution source localization in time, the number of sampling times in the pollution detection of the node should also not be large, and the maximum sampling number is also often given.
To reduce the probability of false alarm, under given thresholds $\beta, \delta > 0$, in the hypothesis testing problems (3) and (8), the significance level $\alpha$ should satisfy constraint (13) when $\mu \in H_1$ and $|\mu/\sigma| \geq \delta$.
(B) The Nonparametric Test. There are no explicit expressions for the test power in nonparametric tests. When the maximum number of samples is given, reducing the probability of a type I error often increases the probability of a type II error [19], so an appropriate significance level is necessary. In nonparametric tests, $\alpha = 0.05$ is often adopted.
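The trade-off between the two error types can be made concrete with a small calculation. Since nonparametric tests have no closed-form power, the sketch below swaps in a one-sided $z$-test with known variance purely as a stand-in; the function name, the effect size of 0.5, and the sample size of 6 are assumptions for illustration and are not taken from the experiments.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def type_two_error(alpha, effect_size, n):
    """Type II error probability of a one-sided z-test with known variance.

    alpha:       significance level (type I error probability)
    effect_size: true mean shift divided by the noise standard deviation
    n:           number of samples
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha)
    power = 1.0 - _STD_NORMAL.cdf(z_crit - effect_size * sqrt(n))
    return 1.0 - power


# With the sample size fixed, a stricter alpha leaves a larger type II error.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha:.2f}  type II error={type_two_error(alpha, 0.5, 6):.3f}")
```

With the number of samples fixed, the printed type II error grows as the significance level is tightened, which is the behaviour described above.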

The Detection Algorithms
Based on the theoretical analysis above, the pollution source detection algorithms are as follows.
Algorithm 1 (the pollution detection of the network).
Preconditions. The number of available samples $n$ is large enough and known. The samples at the first sampling time $t_1$ and at the detection time $t_k$ are known. The parameters $\beta$ and $\delta$, which are related to the test power, are given.
Step 1. Use the Shapiro-Wilk $W$ test to test the distribution of the sample noise according to (2) at the first sampling time $t_1$.
Step 2. If the distribution is normal, get the value range of the significance level according to (13), choose any significance level $\alpha$ in the range, calculate the test statistic as in (4), and go to Step 3. If the distribution is nonnormal, go to Step 4.
Step 3. When the test statistic satisfies the test criterion (5), there is pollution; otherwise, there is no pollution.
Step 4. List the data in ascending order and allocate the ranks according to the order. Calculate $T_1$, the sum of ranks of Sample 1 in Table 1. Look up the table to get the tail values of the rank sum test. When the sum $T_1$ satisfies the test criterion (7), there is pollution; otherwise, there is no pollution.
At time $t_k$, if the network does not detect the pollution, the pollution detection of the network is repeated at $t_{k+1}$.
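The control flow of Algorithm 1, including the repetition at the next sampling time, can be sketched as below. This is an outline under stated assumptions rather than the authors' implementation: the normality decision of Step 1 is passed in as a flag, the two tests are supplied as callables, and `mean_shift_test` is a deliberately crude, hypothetical stand-in for criteria (5) and (7).

```python
def first_detection_time(baseline, snapshots, noise_is_normal,
                         parametric_test, rank_sum_test):
    """Return the first sampling time at which the selected test reports pollution.

    baseline:        samples of all nodes at the first sampling time
    snapshots:       iterable of (time, samples) pairs in chronological order
    noise_is_normal: outcome of the distribution test (Step 1)
    Returns None when no snapshot triggers a detection.
    """
    # Step 2: the distribution test decides which criterion is applied.
    test = parametric_test if noise_is_normal else rank_sum_test
    for time, samples in snapshots:
        if test(baseline, samples):  # Step 3 or Step 4
            return time
        # No detection at this time: try again at the next sampling time.
    return None


def mean_shift_test(baseline, samples, margin=0.5):
    """Hypothetical placeholder test: flags a mean increase larger than margin."""
    return sum(samples) / len(samples) - sum(baseline) / len(baseline) > margin


baseline = [1.0, 1.1, 0.9, 1.0]
snapshots = [(1, [1.0, 1.1, 1.0, 0.9]),
             (2, [1.2, 1.0, 1.1, 1.0]),
             (3, [2.0, 2.1, 1.9, 2.2])]
print(first_detection_time(baseline, snapshots, True,
                           mean_shift_test, mean_shift_test))
```

In practice the two callables would wrap the parametric criterion and the rank sum criterion, with the significance level fixed beforehand.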
Algorithm 2 (the pollution detection of the node).
Preconditions. The number of samples $k$ is large enough and known. The samples at the first sampling time $t_1$ are known. The samples of the detection node $(x_i, y_i)$ are known. The parameters $\beta$ and $\delta$, which are related to the test power, are given.
Step 1. Use the Shapiro-Wilk $W$ test to test the distribution of the sample noise according to (2) at the first sampling time $t_1$.
Step 2. If the distribution is normal, get the value range of the significance level according to (13), choose any significance level $\alpha$ in the range, calculate the test statistic as in (9), and go to Step 3. If the distribution is nonnormal, go to Step 4.
Step 3. When the test statistic satisfies the test criterion (10), the node detects the pollution; otherwise, the node fails to detect the pollution.
Step 4. List the data in ascending order and allocate the ranks according to the order. Calculate $T_3$, the sum of ranks of Sample 3 in Table 2. Look up the table to get the tail values of the rank sum test. When the sum $T_3$ satisfies the test criterion (7), the node detects the pollution; otherwise, the node fails to detect the pollution.

Implementation Examples
Experiment 1. A simulation is carried out to test the proposed detection algorithms. The distribution of the monitoring noise is normal.
(A) The Pollution Detection of the Network. According to constraint (13) and the sample size table of the $t$ test in [18], it can be deduced that $\alpha > 0.025$ under the given parameters $\beta$, $\delta$, and $n$ in Table 3.
For different parameter values and significance levels, detection is performed at the initial observation time 0.01 h, and the results are shown in Table 3. The hypothesis testing detection method under the normal distribution is used. The results show that the pollution can be detected by the network soon.
(B) The Pollution Detection of the Node. Whether the node has detected the pollution source is determined based on the observed data of the node (1.05, 7.05). The monitoring data are shown in Table 4. The results are compared with those of the simple detection method, in which the pollution source is considered detected when the monitoring value is larger than a given threshold; the comparison is shown in Table 5.
In Table 5, "-" represents no result. Comparing the results in the table, it can be seen that the detection method using hypothesis testing is more stable if an appropriate significance level is chosen, whereas in the simple detection method the threshold should be as small as possible to detect the pollution source in time. However, once the noise in practical applications is considered, small thresholds may bring about large false alarm rates.
Experiment 2. A practical experiment is carried out to test the proposed detection algorithms.
Background. The water body measures 200 cm × 200 cm, and the average depth is 100 cm. There is a continuous source at the boundary. Starting from $t_0 = 0$ s, a solution of MgSO4 is discharged into the water continuously. The node deployment is depicted in Figure 2. The monitoring values of the different sensor nodes in the experiment are shown in Table 6.
Distribution Verification. At both 5 s and 10 s, for the significance levels $\alpha = 0.01$, $\alpha = 0.05$, and $\alpha = 0.1$, the test results all show that the distribution of the monitoring data is not normal. So, the detection methods based on the Wilcoxon rank sum test are used.
(A) The Pollution Detection of the Network. The given significance level is $\alpha = 0.05$, and the network detects the pollution at 30 s.
(B) The Pollution Detection of the Node. The significance level is $\alpha = 0.05$, and the times at which nodes 0, 1, 2, 3, 4, 5, 6, and 7 find the pollution are shown in Table 7. The detection method is as in (11).
The results show that the pollution can be detected by the nodes only when there are some increasing concentration samples.
From the results of the experiments above, it can be seen that an appropriate decision threshold is hard to give in the simple detection method, so pollution source detection using hypothesis testing is preferable. Whether or not the distribution of the sample noise is normal, a corresponding detection algorithm is available.

Conclusions
Water pollution detection is important in water environment monitoring. The pollution source detection problems of the network and of the node are discussed based on hypothesis testing. The sample size requirements of the different detection problems are also analyzed. In the implementation examples, the proposed pollution detection algorithms are tested, and the results demonstrate their effectiveness. This work mainly focuses on theoretical detection approaches based on hypothesis testing. In future work, more problems arising when the proposed detection algorithms are adopted in practical applications will be studied, such as optimized node detection methods for large or small concentration variations and the influence of the concentration variations on the statistical distribution in the distribution test step.

Table 3: The detection results of the network.

Table 5: The detection results compared with the simple detection method.

Table 6: The monitoring values in Experiment 2.

Table 7: The detection time of different nodes in Experiment 2.