Truth Discovery Technology for Mobile Crowd Sensing in Water Quality Monitoring

The water quality of urban inland rivers is an important index of urban environmental health, which can reflect a city's development level and its social and economic development. The water quality of these rivers strongly impacts the health and quality of life of the residents of urban and surrounding areas. Therefore, it is necessary to accurately assess the quality of water in urban inland rivers, which can also aid environmental protection departments in providing river governance. Generally, the water quality status of a city's inland rivers is assessed and released by environmental monitoring stations in various regions that deploy the corresponding water quality detection equipment at certain major locations of the river. However, these detection devices can only detect water quality at fixed locations, and often, the water quality of an urban inland river changes owing to the impact of the surrounding environment and residents it serves. Therefore, the water quality around a detection point does not always reflect the water quality of the entire river section. To better express the water quality status of a city's inland river, we propose a method based on a mobile crowd-sensing system that obtains the water quality data of the river during an entire period of time and then fuses these sensing data to obtain the best truth-value estimate of the water quality of the river. We can use this water quality truth value to conduct an objective evaluation of the water quality of a city's inland rivers. The water quality parameters obtained by the method can better represent the water quality status of the river, and the data are more accurate compared to the data collected and released by an environmental monitoring station. Through simulation and comparative analysis, we found that the water quality data obtained by the proposed method were more accurate, indicating that our method has more practical value than the detection device method.


Research Background.
With the rapid development of the economy and the improvement in people's living standards, the pursuit of a high quality of life has become more important. However, the discharge of industrial and agricultural wastewater and domestic sewage is increasing year over year, resulting in increasingly serious environmental pollution of urban watercourses. Urban river water quality is mainly affected by physical impacts (e.g., geomorphic impacts), chemical impacts (e.g., domestic sewage and industrial wastewater), urban ecological impacts (e.g., urban landscape water), and natural rainfall [1]. The deterioration of the urban river water environment not only affects the normal development of urban construction but also seriously threatens the physical health of urban residents and the ecological safety of the city. Therefore, it is necessary to conduct regular monitoring of river water quality and closely track the water quality situation of the river. Regular monitoring of river water quality not only provides a basis for environmental protection departments to carry out river treatment, but also provides an opportunity for urban residents to assess the river water quality. Currently, the water quality data of urban inland rivers are mainly derived from detection devices, which are deployed at fixed locations by environmental protection departments to obtain water quality data. However, this detection method has a high cost and an incomplete monitoring range, resulting in low data accuracy.
To monitor water quality more accurately, many scholars have carried out extensive related research. The main new technologies that have been developed are based on optical or electrochemical sensors. However, owing to the recent development of these technologies, more robust analyses and evaluations in real conditions are essential to guarantee the precision and repeatability of their use [2]. Some scholars have exploited the advantages of the Internet of Things (IoT) and long-range technology (LoRa) to build a water quality monitoring platform that can measure and collect key information about water quality. The LoRa gateway transmits these data to a server through a network. The method can then realize the online monitoring of water quality, and it has achieved good monitoring results [3,4].
With the growing popularity of smart mobile terminals such as smartphones, it has become possible to use the numerous sensors of these smart terminals or to connect portable detectors through these terminals. This offers great convenience for ordinary users to obtain sensing data and also provides the possibility of large-scale urban river water quality monitoring. After many common users collect data with their smart devices and submit the sensing data to a sensing platform for processing, we can obtain the final data from the sensing platform. This method, in which data are collected by common participants using their own smart terminal devices, is called a mobile crowd-sensing system (MCS) [5][6][7][8]. In applying an MCS to urban inland river water quality monitoring, the cost is low, the coverage of detection is wider, and the water quality of the entire river can be more accurately obtained in comparison to the detection device method. Further, the MCS enables residents and environmental protection departments to more promptly and accurately assess the water quality of the river. Thus, it has great significance for environmental protection departments to perform timely and effective prevention and control of river water pollution.
1.2. Aim of the Paper. Since urban inland river water quality is generally assessed and released by water quality monitoring stations, the data released by the monitoring stations only represent the water quality around the monitoring station equipment and cannot be used to objectively evaluate the water quality of the entire inland river. To better evaluate the water quality of the whole river, researchers need to deploy numerous sensors and other monitoring equipment, which can increase costs. To better monitor urban inland river water quality, we have two main objectives in this paper: (1) to establish an MCS to sense water quality data and develop an approach that does not require deploying any monitoring equipment and is low cost; and (2) to design a data fusion algorithm to process and fuse the sensed data to obtain the best true-value estimate of river water quality so that the result better reflects the true water quality condition.
1.3. Structure of the Paper. The rest of this paper is structured as follows. In Section 2, the related research work of other scholars is introduced. In Section 3, the research methods and data matrix are presented. In Section 4, the data fusion processing method used in this paper is described. In Section 5, we present the verification of the feasibility of the proposed method through experiments. Finally, we provide our conclusions in Section 6.

Related Work
Many scholars have conducted studies on how to better monitor the water quality of urban inland rivers. Sendra et al. [3] proposed the design and implementation of a LoRa-based wireless sensor network for monitoring the quality of water in coastal areas, rivers, and ditches. The network is composed of several wireless sensor nodes endowed with several sensors to physically measure water quality parameters such as turbidity and temperature, along with weather conditions such as temperature and relative humidity. The wireless sensor network is used to achieve water quality monitoring in the target area.
Sanya et al. [4] proposed a real-time platform based on the IoT and LoRa technology for monitoring water quality. The platform measures and collects critical information about water quality, including parameters such as the pH value, turbidity, and temperature of the surrounding atmosphere. However, the platform requires the deployment of numerous sensor nodes.
In Singh et al. [9], a real-time water quality monitoring system was installed to monitor the water quality of the Ganges river. The details of the real-time water quality monitoring system installed in the Ganges river and the results of various parameters obtained through the system were also presented.
Blanco-Gómez et al. [10] developed a low-cost monitoring system that can be integrated in a small buoy and attached to fishing and recreational boats, allowing citizens to gather water quality information (i.e., electrical conductivity (EC) and temperature) with their smartphones.
Vasudevan and Baskaran [11] presented a stand-alone photovoltaic (PV)/battery energy storage (BES)-powered water quality monitoring system based on the narrowband Internet of Things (NB-IoT) for aquaculture. The system could operate continuously and stably without losing power supply.
Gunia et al. [12] presented an IoT system to realize real-time water quality monitoring. The data collected by this system can be transferred to the cloud in real time to track the water body's quality and access the real-time quality data of different chemical and biological indicators, such as pH, dissolved oxygen (DO), total dissolved solids, and turbidity. However, the system has a high cost.
Chen et al. [13] presented the unmanned aerial vehicle-supported intelligent truth discovery (UAV-ITD) scheme to obtain truth data for low-cost MCS communications. The truth data discovery scheme needs to collect a portion of data samples to make inferences about the data of the entire network with high accuracy.
In the studies described above, the authors' main contribution to water quality monitoring was the construction of a monitoring system through which water quality data can be obtained for water quality monitoring. These monitoring systems have achieved relatively good monitoring results. However, they are costly due to the deployment of numerous water quality monitoring sensors and other related equipment, and some monitoring systems are only suitable for specific application scenarios.
Several other methods have also been proposed by scholars to monitor water quality. Jamroen et al. [14] used optical and electrochemical techniques to monitor water quality, and they obtained good results. However, the testers require expertise to use this method.
Bai et al. [15] developed data-based models to reduce the number of water quality parameters of monitoring programs using data mining. This approach was a promising alternative that can reduce the frequency of analyses in the laboratory and increase the spatial and temporal coverage of water quality monitoring networks. This method mainly models water quality data from water quality monitoring stations and mines the data to improve the monitoring effect.
Okpara et al. [16] presented an operational system for multisensor data fusion implemented at the Finnish Environment Institute. The system uses the Ensemble Kalman filter and smoother algorithms, which are often used for probabilistic analysis of multisensor data. The system considers the uncertainty and spatial and temporal correlations present in the available observation data to obtain accurate and realistic results. In this approach, numerous sensors need to be deployed.
da Silveira Barcellos and de Souza [17] proposed a multisource remote sensing water quality inversion method based on a small number of samples to solve the problem of scale inconsistency among multisource remote sensing data to achieve large scale and efficient inversion of the urban river water quality. Because complex nonlinear relationships must be solved between simple ground point data and remote sensing data in water quality inversion, a novel self-optimizing machine learning monitoring method was proposed that can automatically find the optimal parameters of the model from a small number of samples and then reduce the training time. In this method, the complex relationship between ground monitoring data and remote sensing data is solved primarily by machine learning.
Other scholars have focused on data privacy protection in the truth discovery process. Xiong et al. [18] proposed DPriTD, a decentralized privacy-preserving framework for truth discovery in crowd sensing. The proposed approach leverages the additively homomorphic property of Shamir's Secret Sharing scheme to protect user privacy.
Liu and Pan [19] presented a data masking-based privacy-preserving truth discovery framework that incorporates spatial and temporal correlations to solve the sparsity problem. The framework makes it possible to monitor air quality at a fine granularity using vehicular crowd sensing systems.
This paper focuses on monitoring water quality by means of mobile crowd sensing, which requires the recruitment of participants and the use of participants' devices to obtain water quality data and transmit the data to the sensing platform. This approach helps to obtain the best true estimate of the monitored river by data fusion of the data submitted by the participants. Moreover, the method does not require the deployment of monitoring equipment to complete water quality monitoring, and it has the advantages of a low cost and accurate monitoring.

Research Methods
Taking the water quality of urban inland rivers as the research object, first, we determine the range of water quality monitoring. We set n monitoring points on selected urban inland rivers. We set a radius of r around these monitoring points as the effective sensing area $N_i$ $(1 \le i \le n)$. The sensing platform releases the sensing task, and participants decide whether to participate in the sensing task based on their own time constraints and daily routes. Once participants have decided to participate in the task, they can participate in the task of sensing any sensing area or participate in the task of sensing several sensing areas and then submit the sensing data to the sensing platform within a specified time range. The sensing diagram is shown in Figure 1.
Suppose that n sensing points are set on a river; with these sensing points at the center and r as the radius, we define n data sensing areas. A total of k participants take part in the data sensing task, and the numbers of participants in the data sensing areas are $k_1, k_2, k_3, \ldots, k_n$, with $k_1 + k_2 + k_3 + \cdots + k_n \ge k$, because a participant may sense in several areas. Suppose that the vast majority of participants are not malicious participants and that the differences between the data submitted by participants are mainly caused by device and participant operational factors. Suppose that there are $\alpha$ water quality parameters; the ith parameter is represented by $X_i$ $(1 \le i \le \alpha)$, and the sensing data vector that participants need to submit is

$$Data = [N_{id}, T, X_1, X_2, X_3, \ldots, X_\alpha],$$

where $N_{id}$ is the collection area identifier, and $T$ is the time of data sensing. According to the characteristics of the participants' activities, the data sensing is mainly concentrated in the daytime, and the data sensing time period can be set according to actual needs. Suppose that a time interval is 30 min; for example, 7:00-7:30 is 1, and 7:30-8:00 is 2. By analogy, the value range of $T$ is $1 \le T \le 24$. The data submitted by the participants form a data matrix composed of the vectors $Data = [N_{id}, T, X_1, X_2, X_3, \ldots, X_\alpha]$. Owing to the differences in the capabilities of the data sensing devices held by different participants, some participants cannot sense data for some indicators. To accurately express the correctness of the data submitted by the participants, it is necessary to build a capability matrix $G$ for the participants. According to the dataset $[A_k, N_{id}, T, X_1, X_2, X_3, \ldots, X_\alpha]$ submitted by the participants, when $T = j$ and $N_{id} = id$, the data sensing matrix for the time period j of the idth sensing area is obtained as $Data_{id}$.
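As a minimal sketch of the time indexing described above, the following Python code maps a clock time to the slot index $T$. It assumes the sensing day starts at 7:00 with 30 min slots, so that the 24 slots cover 7:00 to 19:00; the names `time_slot` and `SensingRecord` are hypothetical and are not part of the original method.

```python
from dataclasses import dataclass
from datetime import time
from typing import List, Optional

SLOT_MINUTES = 30        # slot length stated in the text
DAY_START = time(7, 0)   # slot 1 starts at 7:00 (from the text's example)
NUM_SLOTS = 24           # value range of T given in the text


def time_slot(t: time) -> int:
    """Return the slot index T (1..NUM_SLOTS) for clock time t."""
    start = DAY_START.hour * 60 + DAY_START.minute
    minutes = t.hour * 60 + t.minute - start
    slot = minutes // SLOT_MINUTES + 1
    if minutes < 0 or slot > NUM_SLOTS:
        raise ValueError("time outside the sensing window")
    return slot


@dataclass
class SensingRecord:
    """One submitted vector [N_id, T, X_1, ..., X_alpha] plus the sender."""
    participant: str                 # A_k, the participant identifier
    area_id: int                     # N_id, the sensing area identifier
    slot: int                        # T, the time slot index
    values: List[Optional[float]]    # X_1..X_alpha; None if not sensed (assumption)
```

A record for a reading taken at 7:45 in area 3 would then carry `slot=time_slot(time(7, 45))`, which falls in the second slot.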
Assuming that the data matrix $Data_{id}$ has $k_{id}$ rows, $k_{id}$ participants take part in data sensing in the sensing area. From the dataset $[A_k, N_{id}, T, X_1, X_2, X_3, \ldots, X_\alpha]$, when $T = j$ (j is a constant) and $N_{id} = id$ (id is a constant), we can obtain $k_{id}$ identifiers $A_k$, where $A_k$ is the identifier of the participant. According to the capability matrix $G$, we can obtain the capability submatrices of the n data sensing areas, namely $G_1, G_2, G_3, \ldots, G_{id}, \ldots, G_n$. By calculating the Hadamard product $G_{id} \circ Data_{id}$, we can conclude that the true data matrix of sensing area $N_{id}$ is

$$Data_{N_{id}} = \begin{bmatrix} y_{11} & y_{12} & y_{13} & \cdots & y_{1i} & \cdots & y_{1\alpha} \\ y_{21} & y_{22} & y_{23} & \cdots & y_{2i} & \cdots & y_{2\alpha} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ y_{k_{id}1} & y_{k_{id}2} & y_{k_{id}3} & \cdots & y_{k_{id}i} & \cdots & y_{k_{id}\alpha} \end{bmatrix}.$$

Our goal is to perform data fusion on matrix $Data_{N_{id}}$ to obtain the best truth estimate for the sensing area $N_{id}$.
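The Hadamard-product step described above can be illustrated with a short pure-Python sketch. Because the entries of the capability matrix are not spelled out in the text, the sketch assumes that $G_{id}$ is binary, with 1 meaning that the participant's device can sense the parameter and 0 meaning that it cannot. The function names and sample numbers are illustrative only.

```python
from typing import List

Matrix = List[List[float]]


def hadamard(g: Matrix, data: Matrix) -> Matrix:
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    same_rows = len(g) == len(data)
    if not same_rows or any(len(gr) != len(dr) for gr, dr in zip(g, data)):
        raise ValueError("matrices must have the same shape")
    return [[gv * dv for gv, dv in zip(gr, dr)] for gr, dr in zip(g, data)]


def valid_column(g: Matrix, data: Matrix, i: int) -> List[float]:
    """Readings of parameter i from participants able to sense it.

    Assumption: entries with capability 0 are dropped rather than kept as 0.
    """
    return [dr[i] for gr, dr in zip(g, data) if gr[i] == 1]


# Two participants, three parameters (made-up values).
G = [[1, 1, 0],
     [1, 0, 1]]
DATA = [[7.1, 8.3, 0.5],
        [7.0, 9.9, 0.4]]
masked = hadamard(G, DATA)
```

In this sketch, readings for which the capability entry is 0 become 0 in `masked`, and `valid_column` returns only the readings that remain usable for a given parameter.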

Data Truth Discovery Process
In this section, the data preprocessing process is first introduced in Section 4.1 to derive the preprocessed data matrix. In Section 4.2, the details of removing the anomalous data in the preprocessed data matrix to obtain the true data matrix are presented, and the corresponding algorithm for removing the anomalous data is given. In Section 4.3, the matrix after removing the anomalous data is weighted and fused to obtain the best estimate of the true value of water quality (i.e., the final result of water quality). In Section 4.4, the algorithm of Section 4.2 and the weighted fusion process of Section 4.3 are analyzed to describe the feasibility of the algorithm.
[...]; there are $k_{id}$ elements. For the capability submatrix [...]

Removing Data with Excessive Differences.
To obtain the best truth estimate for the data, we need to fuse each data vector $Y'_i$. First, we designed an algorithm to remove data with excessive differences.
Assuming that most participants are not false data submitters, the differences between their data are caused by the sensing environment, sensing location, sensing time, device performance, and incorrect operation.
For any two data points, $y_i$ and $y_j$, in vector $Y'_i$, let $\beta$ be the similarity threshold set for the ith parameter indicator. If $dist_i(y_i, y_j) \le \beta$ holds, $y_i$ and $y_j$ are considered similar.
Suppose there are $k'_i$ valid data points in $Y'_i$; that is, $Y'_i = [y_1, y_2, \ldots, y_i, \ldots, y_{k'_i}]$. The algorithm for finding all similar elements in the vector $Y'_i$ is shown in Algorithm 1.
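A minimal Python sketch of the similarity search described above follows. It assumes that the distance $dist_i$ is the absolute difference between two readings, which the text does not specify, and the function name `similar_elements` is illustrative.

```python
from typing import List


def similar_elements(y: List[float], beta: float) -> List[float]:
    """Return the readings that have the most neighbours within beta."""
    if not y:
        return []
    # q[i] counts the readings (including y[i] itself) within beta of y[i].
    q = [sum(1 for yj in y if abs(yi - yj) <= beta) for yi in y]
    q_max = max(q)
    # Keep the readings whose count equals the largest count.
    return [yi for yi, qi in zip(y, q) if qi == q_max]
```

For example, with pH readings `[7.0, 7.1, 7.2, 9.5]` and `beta = 0.25`, the first three readings each have three neighbours and are kept, whereas the outlier 9.5 has only itself as a neighbour and is discarded. Note that with a tighter threshold only the most central readings attain the largest count, so the choice of $\beta$ controls how many readings survive.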

Weighted Average of Similar Data.
According to the method in Section 4.2, we calculate the effective similar data for each sensing area, perform weighted fusion on these similar data, and calculate the best truth estimate for these similar data [20]. The specific calculation method is as follows: after applying the method in Section 4.2, assume that the set of similar data for an index is $A(a_1, a_2, a_3, \ldots, a_\delta)$. Based on the accuracy of similar data, we divide the data in dataset $A(a_1, a_2, a_3, \ldots, a_\delta)$ into sections according to the range of section values. The specific diagram is shown in Figure 2.
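After calculating the average for each section (the average for the data in section 1, the average for the data in section 2, and so on up to the average for the data in section m), the fused data are obtained as the weighted combination of these section averages. For each vector sequence $Y_1, Y_2, Y_3, \ldots, Y_i, \ldots, Y_\alpha$ in data matrix

$$Data_{N_{id}} = \begin{bmatrix} y_{11} & y_{12} & y_{13} & \cdots & y_{1i} & \cdots & y_{1\alpha} \\ y_{21} & y_{22} & y_{23} & \cdots & y_{2i} & \cdots & y_{2\alpha} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ y_{k_{id}1} & y_{k_{id}2} & y_{k_{id}3} & \cdots & y_{k_{id}i} & \cdots & y_{k_{id}\alpha} \end{bmatrix},$$

the best truth estimate is calculated in this way. The results $Data^{truth}_1, \ldots, Data^{truth}_{id}, \ldots, Data^{truth}_n$ for the same time period in each sensing area are then fused, and the best truth value of the water quality parameter data for each time period is calculated using the weighted average method in Section 4.3.
Wireless Communications and Mobile Computing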
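The section-based fusion described above can be sketched in Python as follows. The section boundaries and weights used in the source are not recoverable from the text, so this sketch makes two loudly stated assumptions: the sections are m equal-width intervals between the smallest and largest similar reading, and each non-empty section is weighted by its number of readings raised to a hypothetical exponent `power`. With `power = 1` the result coincides with the plain mean; larger values emphasize densely populated sections.

```python
from typing import List


def sectioned_fusion(values: List[float], m: int, power: float = 1.0) -> float:
    """Fuse similar readings through weighted section averages (a sketch)."""
    if not values or m < 1:
        raise ValueError("need at least one value and one section")
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo  # all readings identical, nothing to fuse
    width = (hi - lo) / m  # assumption: equal-width sections
    sections: List[List[float]] = [[] for _ in range(m)]
    for v in values:
        # Clamp so that the maximum falls into the last section.
        idx = min(int((v - lo) / width), m - 1)
        sections[idx].append(v)
    nonempty = [s for s in sections if s]
    means = [sum(s) / len(s) for s in nonempty]
    # Assumption: weight grows with the number of readings in the section.
    raw = [len(s) ** power for s in nonempty]
    total = sum(raw)
    return sum(w / total * mu for w, mu in zip(raw, means))
```

For instance, calling `sectioned_fusion([7.0, 7.1, 7.2, 7.9], m=3, power=2)` places three readings in the first section and one in the last, so the fused value is pulled toward the cluster around 7.1.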

Algorithmic Analysis.
Generally, the water quality status of a city's inland rivers is assessed and released by environmental monitoring stations in various regions deploying corresponding water quality detection equipment at some major locations of the river. However, these detection devices can only detect water quality at fixed locations; often, the water quality of an urban inland river changes owing to the impact of the surrounding environment and residents it serves. Therefore, the method that water quality monitoring stations use to collect water quality data at fixed points to monitor water quality does not always represent the true water quality status of urban rivers.
The method proposed in this paper is based on an MCS, which allows participants along rivers to engage in tasks using their own portable sensing devices. Because of the mobility of participants, the sensing points are widely distributed, and with a large number of participants taking part in data sensing, a large amount of water quality data from different sensing points in the inland river can be obtained. Then, the data are fused to obtain the best truth value of the water quality data of the inland river, which can more accurately represent the true water quality of the inland river.

Experimental Environment and Data Sources.
Currently, more than 100 water quality parameters exist, including water temperature, pH, DO, EC, turbidity, permanganate index, ammonia nitrogen, total phosphorus, and total nitrogen. The most commonly used parameters are the pH, EC, DO, turbidity, and water temperature. Thus, our monitoring parameters were five conventional parameters of surface water quality: pH, EC, DO, turbidity, and water temperature. The experiment was conducted on the Pi River in the city of Lu'an, Anhui Province, China, and the test site was the Xin'an Ferry. Twenty people participated in the experiment. We set up several data sensing areas along the river, each with a radius of 200 m. The participants were mainly students who used portable water quality analyzers to collect data in the data collection area during specified sensing time periods, and they submitted the collected data using smartphones. Task participants decided whether to participate in data sensing tasks based on their time and daily action routes. If the participant decided to participate in the sensing task, the participant could participate in the task of any sensing area released for sensing. Participants had to submit their sensing data to the sensing platform within a specified time period.

Algorithm 1:
(1) For each element in $Y'_i$, first, the pointer $P$ points to the first element.
(2) For ($P$ points to each element $y_i$ one by one, $q_i = 0$, $i$++)
(3)   Traverse the vector $Y'_i$;
(4)   If the distance between $y_i$ and $y_j$ satisfies the condition $dist_i(y_i, y_j) \le \beta$;
(5)   Then $q_i = q_i + 1$, i.e., the value of the variable $q_i$ increases by 1;
(6)   Endif
(7) Endfor
(8) Get the array $[q_1, q_2, \ldots, q_i, \ldots, q_{k'_i}]$;
(9) Find the set $Q$ of entries in $[q_1, q_2, \ldots, q_i, \ldots, q_{k'_i}]$ whose value $q_i$ equals the largest value in the array;
(10) In array $Y'_i$, elements with the same subscript as elements in $Q$ are similar elements.
The format of the sensing data submitted by participants was $[N_i, T, X_1, X_2, X_3, X_4, X_5]$, where $N_i$ is the ID of the sensing area, $T$ is the data sensing period, and $X_1, X_2, X_3, X_4, X_5$ are the five water quality parameter data points submitted by the participants. For each record in the data matrix of the same sensing area, we determined whether to use the record based on the sensing time. If the record was within the time range, the record was used; otherwise, the record was not used. We used the distance formula to find all similar data within the same time range for the records used. Then, by weighting and fusing all similar records, we could obtain the best truth value of the water quality data. The true state of water quality could be assessed through the obtained truth value.
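The record selection step described above can be sketched as follows, assuming each submitted record is a tuple in the format $[N_i, T, X_1, \ldots, X_5]$. The helper name `select_records` and the sample values are made up for illustration and do not come from the experiment.

```python
from typing import List, Sequence, Tuple

# (N_i, T, X_1, X_2, X_3, X_4, X_5)
Record = Tuple[int, int, float, float, float, float, float]


def select_records(records: Sequence[Record], area_id: int, slot: int) -> List[List[float]]:
    """Per-parameter columns for the records of one area and one time slot."""
    rows = [r[2:] for r in records if r[0] == area_id and r[1] == slot]
    # Transpose so that each inner list holds the readings of one parameter.
    return [list(col) for col in zip(*rows)]


# Hypothetical submissions: two usable records, one from another area,
# and one from another time slot.
SAMPLE: List[Record] = [
    (1, 3, 7.1, 410.0, 8.2, 12.0, 21.5),
    (1, 3, 7.2, 405.0, 8.0, 11.5, 21.7),
    (2, 3, 7.6, 430.0, 7.5, 15.0, 21.9),
    (1, 4, 7.3, 400.0, 8.1, 11.0, 22.4),
]
columns = select_records(SAMPLE, area_id=1, slot=3)
```

Each column returned by this sketch corresponds to one parameter vector, to which the similarity search and the weighted fusion would then be applied.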
Our simulation environment was Windows 10, the simulation platform was Matlab 2016a, and the reference data used for comparing the results were sourced from https://szzdjc.cnemc.cn:8070/GJZ/Business/Publish/Main.html.

Simulation Results and Analysis.
Through simulation and comparative experiments, the results below were obtained.
Our experimental data collection areas were defined as follows: collection Area 1 covered a 200 m radius from the center of the water quality monitoring station, followed by collection Area 2, which was immediately adjacent to collection Area 1, and so forth for collection Areas 3, 4, and 5. The experiment employed 20 participants, and data were collected 10 times. Figures 3-7 show the data comparison graphs for the five core indicators of water quality: water temperature, pH, DO, EC, and turbidity. The blue line represents the monitoring data released by the monitoring station, the green line represents the results of the fusion of the data collected by the participants in collection Area 1, and the pink line represents the results of the fusion of the data collected by the participants in all collection areas (i.e., the best true-value estimate of the river). Comparing the blue and green lines in the five comparison plots, we can see that they almost overlap. Moreover, the data in Table 1 show that the errors of the five indicators (water temperature, pH, DO, EC, and turbidity) in collection Area 1 are 0.085, 0.013, 0.035, 0.13, and 0.08, respectively, which are within acceptable limits. The errors occurred for two main reasons: (1) differences in equipment performance caused differences in the accuracy of data collection and (2) the fixed-point monitoring equipment of the water quality monitoring station was not the same as the equipment used in mobile crowd sensing. The curve data and error data illustrate the reliability and correctness of using mobile crowd sensing to obtain water quality data. In contrast, the comparison of the pink line and blue line in the five graphs shows a large gap. Furthermore, the errors of the water temperature, pH, DO, EC, and turbidity indicators in the monitoring station data in Table 1 are 0.11, 0.039, 0.106, 0.53, and 0.98, respectively.
These errors are approximately 1.3, 3, 3, 4, and 12 times the corresponding errors in the first collection area. Therefore, the water quality data released by the monitoring station cannot represent the entire river's water quality condition, but instead represent the water quality within a limited range. In contrast, mobile crowd sensing acquires data from multiple collection areas, and the data fusion result represents the water quality condition of the entire river.
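The ratios quoted above can be checked directly from the reported errors. The short Python snippet below simply divides the errors reported for the monitoring station data by the errors reported for collection Area 1; the variable names are illustrative.

```python
# Errors reported in the text for collection Area 1.
area1_errors = {
    "water temperature": 0.085,
    "pH": 0.013,
    "DO": 0.035,
    "EC": 0.13,
    "turbidity": 0.08,
}
# Errors reported in the text for the monitoring station data.
station_errors = {
    "water temperature": 0.11,
    "pH": 0.039,
    "DO": 0.106,
    "EC": 0.53,
    "turbidity": 0.98,
}

# Ratio of the station error to the Area 1 error for each indicator.
ratios = {name: station_errors[name] / area1_errors[name] for name in area1_errors}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

The computed ratios round to the factors given in the text, with turbidity showing by far the largest discrepancy.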

Experimental Discussion.
Numerous scholars have researched and proposed various methods to effectively monitor water quality conditions. These include deploying water quality monitoring equipment at fixed points for continuous monitoring, using mobile equipment such as fishing boats or buoys to deploy monitoring equipment and transmitting data through LoRa technology to a background server for monitoring purposes, utilizing optical and electrochemical technology to monitor water quality conditions, and deploying multisensor data fusion systems for real-time water quality monitoring. While these methods can achieve the goal of water quality monitoring, they require deploying a significant amount of costly equipment.
The experimental data in this paper were obtained through mobile crowd sensing, which eliminates the need for additional monitoring equipment. Participants were recruited to use their own portable mobile terminals for data collection within a specified time period and area, according to the requirements issued by the sensing platform. The collected data were submitted by all participants to the sensing platform, which then employed data fusion to calculate the final water quality data. Because no additional water quality monitoring equipment is deployed, this approach is highly cost-effective and simple to operate, and the experiments show that it is also highly accurate. However, certain limitations remain: (1) different participants used varying portable data collection equipment, resulting in slight inconsistencies in the accuracy of the collected data; (2) differences in the operation of the equipment resulted in data inconsistencies; (3) all data sensed by participants within the specified time period were treated as valid, which relies on the water quality varying little within that time period; and (4) this sensing method can only obtain water quality data within a certain period of time, and it cannot support real-time monitoring.

Conclusion
The water quality of urban inland rivers is usually detected by the deployment of water quality testing equipment by environmental monitoring stations at certain major locations of the river. Although these devices can detect the water quality of urban inland rivers, this detection method has two disadvantages: (1) the cost is relatively high and (2) these detection devices can only detect water quality at fixed locations, whereas the quality of water is usually affected by the surrounding environment and the residents it serves, so the water quality near a testing point can hardly reflect the water quality of the whole river. To better represent the water quality of an urban inland river, we proposed a method for water quality detection based on an MCS. By fusing the data of multiple sensing areas, the best true estimate of the water quality of the river can be obtained so as to make a more objective evaluation of the water quality of an urban inland river. The water quality parameters obtained by this method are more representative of the water quality of the river and are more accurate than the data from environmental monitoring stations. This method of evaluating the water quality of urban inland rivers produces more accurate data, has a lower cost, and has better application value than the detection device method.
In this paper, we used mobile crowd sensing to monitor water quality and obtain the best true-value estimate of water quality data by fusion processing of monitoring data. Although this method does not require the deployment of water quality monitoring equipment, it requires recruiting participants to obtain data. The number and ability of participants directly affect the quality of the collected data. Our two main aims for future work are: (1) to design appropriate incentive models and methods to obtain better monitoring effects with fewer participants and (2) to combine water quality monitoring systems with artificial intelligence technology so that the water quality monitoring system can monitor water quality and analyze and predict water quality. The results of our analysis and prediction can help guide water quality management departments.

Data Availability
The data in the simulation were derived from the National Surface Water Quality Automatic Monitoring Real-time Data Release System (https://szzdjc.cnemc.cn:8070/GJZ/Business/Publish/Main.html).

Conflicts of Interest
The authors declare that they have no conflicts of interest.