Temporal Aspects of Surface Water Quality Variation Using Robust Statistical Tools

Robust statistical tools were applied to water quality datasets with the aim of determining the most significant parameters and their contribution towards temporal water quality variation. Surface water samples were collected from four different sampling points during dry and wet seasons and analyzed for their physicochemical constituents. Discriminant analysis (DA) showed strong discriminatory ability: for the dry season, five parameters (P < 0.05) afforded more than 96% correct assignation, while for the wet season the forward and backward stepwise modes used five and six parameters (P < 0.05), affording 68.20% and 82% correct assignation, respectively. Partial correlation results revealed a strong relationship (r_p = 0.829) between five-day biochemical oxygen demand (BOD5) and chemical oxygen demand (COD), controlling for the linear effect of nitrogen in the form of ammonia (NH3), in the dry season, and a moderate relationship (r_p = 0.614) between total solids (TS) and dissolved solids (DS), controlling for conductivity, in the wet season. Multiple linear regression identified the contribution of each variable, with r = 0.988, R² = 0.976 and r = 0.970, R² = 0.942 (P < 0.05) for dry and wet seasons, respectively. Repeated measure t-test confirmed that the surface water quality varies significantly between the seasons (P < 0.05).


Introduction
River water represents a readily available source of water for human activities, and historically many civilizations have relied on the ample supplies of fresh water found in major river catchments. Currently, rivers worldwide serve as the recipients of great quantities of waste discharged by agricultural, industrial, and domestic activities [1]. The availability of fresh water in rivers is one of the major issues facing the human population, especially in developing countries [2]. The constant discharge of domestic and industrial wastewater and seasonal surface run-off due to climate change all have a strong effect on river discharge and water quality [3]. Information on water quality and pollution sources is important for the implementation of sustainable water resource management strategies [4]. Physical and chemical characterization of the aquatic environment has become important because of the seasonality of river water [5].
High concentrations of pollutants of all kinds influence river water quality, determine how the water can be used, and can lead to diverse problems such as algal blooms, loss of oxygen, and loss of biodiversity [6]. It is, therefore, necessary to monitor river water quality, understand the chemistry of the water, and provide a reliable assessment of water quality for effective water resource management.
In modern research, different statistical techniques, such as multivariate statistical analysis through principal component analysis, cluster analysis, discriminant analysis, and multiple linear regression, have been used to evaluate and interpret complex datasets to better understand river water quality [3]. Statistical tools have often been used in exploratory data analyses for classification of sampling stations [7,8], identification of possible pollution sources [9-14], and identification of common patterns in data distribution that reveal the most significant variables responsible for river water variation [9,15-18]. Very recently, statistical approaches have been applied to data observed in several complex systems, where the problem of environmental data reduction and interpretation can be easily handled through the application of robust statistical techniques. These statistical analyses are better able than conventional methods to detect long-range correlations, including those that arise as artifacts of nonstationarity. Conventionally, the usual methods of interpretation of surface water quality are only descriptive and lack statistical significance. Furthermore, they rely only on univariate procedures, which are inadequate to characterize simultaneous similarities and differences between samples and variables in a complex environment, hence the need to apply robust statistical tools to surface water quality datasets. Several researchers have applied robust statistical tools to evaluate surface water quality variation. For example, a study conducted by Koklu et al. [15] revealed that DA gave indicator parameters responsible for large variation in water quality and that multiple regression analysis identified the important and effective parameters that contributed to water quality variation in the Melen River system, Turkey. A recent study conducted by Osman et al. [19] found that DA is an important multivariate statistical tool that reduces the dimensionality of the data and brings out the most statistically significant parameters that result in variation of the datasets. Zhang et al. [6] used DA to evaluate water quality variation in the southwest New Territories and Kowloon, Hong Kong. They concluded that DA provided an important data reduction by revealing only four and eight parameters with 84.2% and 96.1% correct assignment for temporal and spatial water quality variation, respectively.
Jakara Basin is located in northwestern Nigeria and lies in the center of Kano city, the most populous city in the whole of Nigeria with over six million people. The region has rapid population growth and industrial development, which increase the mass of sewage discharge. With an increase in population, surface water quality needs to be monitored continuously in order to take measures, when necessary, to sustain the potability of the surface water resources [20]. Jakara Basin lies between longitudes 8°31′E and 8°45′E and latitudes 12°10′N and 12°13′N. The basin covers about 30 km² with a northwest-southwest orientation, sprawling about 0.33°. The climate of the area is strongly influenced by the tropical maritime air masses during the wet season and tropical continental air masses during the dry season. The seasonal migration of the intertropical discontinuity (ITD) gives rise to two seasons, one dry and the other wet. The wet season lasts from June to September, although May is sometimes humid. The dry season extends from mid-October of one calendar year to mid-May of the next. The annual mean rainfall in the region is between 800 mm and 900 mm, with variation about the mean of up to ±30 percent. More than 300 mm of the rainfall is received in August alone, while the truly wet season lasts from June to September. In addition, the mean monthly temperature of the study area ranges between 21°C and 23°C with a diurnal range of 12-14°C [17].
The present study aims at evaluating the temporal variation of river water quality and determining the most meaningful parameters and their contribution towards water quality variation between dry and wet seasons in Jakara Basin.

Sample Collection and Analytical Technique.
Sampling was carried out every day from 1st April to 31st May, 2011 and from 31st July to 30th September, 2011 for dry and wet seasons, respectively, at four different sampling locations along the Jakara River. Samples were taken from 10 cm to 15 cm below the water surface using acid-washed plastic containers to avoid unpredicted changes. Samples were stored in a chilled cold box during transportation to the laboratory. Fifteen physicochemical water quality parameters were selected for analysis: dissolved oxygen (DO), five-day biochemical oxygen demand (BOD5), chemical oxygen demand (COD), suspended solids (SS), pH, conductivity, salinity, temperature, nitrogen in the form of ammonia (NH3), turbidity, dissolved solids (DS), total solids (TS), nitrates (NO3), chloride (Cl), and phosphates (PO4). Samples were analyzed in the Soil and Water Laboratory of the Ministry of Environment, Kano, Nigeria.
The samples were filtered using filter paper with a pore size of 5 μm [21]. Water temperature, DO, pH, conductivity, and turbidity of the water samples were determined using a multiparameter monitoring instrument (YSI Incorporated, Yellow Springs, OH, USA). The instruments were calibrated using specific calibrating solutions. A mean value was calculated for each parameter, with standard deviation (SD) being used as an indication of the precision of each parameter [18]. NH3 was measured using an ultraviolet (UV) absorbance spectrophotometer. The UV absorbance NH3 analyzer was calibrated to measure UV light within the wavelength range of 200-450 nm. NaOH reagent was added to the sample to act as a buffer by adjusting the pH of the sample to a value greater than 12. A second reagent, hypochlorite, was added to react with free NH3 in the samples to form monochloramine. The difference in UV absorbance is proportional to the amount of free NH3 in the sample. TS was measured by drying the sample at 105°C in a preweighed porcelain dish, cooling it in a dry atmosphere in a desiccator, and weighing it on an analytical balance; the mass of the dish was subtracted and the result divided by the original amount of sample. DS was measured indirectly: the water sample was filtered through a tared fiber filter, which was then dried, and the weight of the material captured on the filter was used to determine the total suspended solids (TSS). DS was then estimated as the difference between TS and TSS. BOD determination of the water samples was carried out using the standard method [22]. The dissolved oxygen content was determined before and after incubation. Samples were incubated for 5 days at 20°C in BOD bottles, and BOD5 was calculated after the incubation period. COD was determined after oxidation of organic matter in a strong tetraoxosulphate(VI) acid medium by K2Cr2O7 at 148°C, with back titration.
Cl was determined using 100 mL of the water sample, which was measured into a 250 mL conical flask, and the pH was adjusted with 1 M NaOH. Then 1 mL of K2CrO4 indicator was added and the sample was titrated with AgNO3 solution. A blank titration was carried out using distilled water, and Cl in mg/L was then calculated. NO3 and PO4 were determined using the colorimetric method [22].

Data Management and Treatment.
The normality of the data for each variable under study was checked by analyzing the kurtosis and skewness values. For the original data, the kurtosis ranged from −0.37 to 54.68 and from −0.29 to 14212, and the skewness ranged from −0.29 to 6.93 and from −0.02 to 3.73 for dry and wet season data, respectively, indicating that the data were not normally distributed.
The raw data of all the parameters under study were log transformed, x′ = log10(x). Although log transformation is generally used to bring geochemical data closer to a normal distribution, it can also be applied to standardize the datasets and reduce the influence of extreme cases and outliers [23-25]. After the transformation, the kurtosis ranged from −1.42 to 7.08 and from −0.75 to 6.52, and the skewness ranged from −2.50 to 2.06 and from −2.39 to 1.20 for dry and wet season data, respectively. These ranges showed that both datasets were now within the range of a normally distributed population.
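The screening step described above can be sketched in a few lines. This is a minimal illustration only, assuming moment-based (population) definitions of skewness and excess kurtosis; the statistical package used in the study may apply small-sample corrections, and the concentrations below are hypothetical, not data from the study.

```python
import math

def log10_transform(values):
    """Return log10 of each value; inputs must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log10 is undefined for non-positive values")
    return [math.log10(v) for v in values]

def _central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for symmetric data)."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3.0

# Hypothetical right-skewed concentrations (mg/L), for illustration only.
raw = [1.0, 1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 9.5, 40.0]
logged = log10_transform(raw)
print("skewness before/after:", skewness(raw), skewness(logged))
print("kurtosis before/after:", excess_kurtosis(raw), excess_kurtosis(logged))
```

On data with a long right tail, both statistics move towards zero after the transformation, which is the behaviour the reported ranges reflect.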

Discriminant Analysis (DA).
Discriminant analysis is a statistical method that determines the variables that discriminate between two or more naturally occurring groups [16]. It constructs a discriminant function (DF) for each group as in the equation

$$f(G_i) = k_i + \sum_{j=1}^{n} w_{ij}\, q_{ij},$$

where $i$ indexes the groups ($G$), $k_i$ is the constant inherent to each group, $n$ is the number of parameters used to classify a set of data into a given group, and $w_{ij}$ is the weight coefficient assigned by DA to a given selected parameter $q_{ij}$.
In this study, temporal (dry and wet season) data were evaluated. DA was applied to the log-transformed data using the standard, forward stepwise, and backward stepwise modes to construct DFs and evaluate temporal variation in river water quality.
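To make the role of a discriminant function concrete, the sketch below evaluates a linear function of the form f(G_i) = k_i + Σ w_ij q_ij for each season and assigns a sample to the group with the highest score. The linear form is assumed from the definitions above, and the constants, weights, and sample values are hypothetical; they are not the coefficients fitted in the study.

```python
def discriminant_scores(sample, constants, weights):
    """Evaluate k_i + sum_j(w_ij * q_ij) for every group i."""
    scores = {}
    for group, constant in constants.items():
        group_weights = weights[group]
        scores[group] = constant + sum(
            weight * sample[name] for name, weight in group_weights.items()
        )
    return scores

def classify(sample, constants, weights):
    """Assign the sample to the group whose discriminant score is largest."""
    scores = discriminant_scores(sample, constants, weights)
    return max(scores, key=scores.get)

# Hypothetical coefficients for two groups and two log-transformed parameters.
constants = {"dry": -10.0, "wet": -12.0}
weights = {
    "dry": {"DO": 4.0, "COD": 6.0},
    "wet": {"DO": 7.0, "COD": 3.0},
}
sample = {"DO": 0.5, "COD": 2.0}

print(discriminant_scores(sample, constants, weights))
print(classify(sample, constants, weights))
```

The stepwise modes differ only in which parameters end up with nonzero weights; the scoring and assignment step stays the same.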

Partial Correlation.
Partial correlation examines the relationship between two variables while the effect of a third variable is held constant. It is similar to Pearson's product moment correlation except that it also allows control for an additional variable. This is usually a variable suspected of influencing the two variables of interest [24,26].
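For a single control variable, the partial correlation can be computed from the three pairwise Pearson coefficients as r_xy.z = (r_xy − r_xz r_yz) / sqrt((1 − r_xz²)(1 − r_yz²)). The sketch below assumes this standard first-order formula; the function names and the readings are hypothetical and are not taken from the study.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr_from_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def partial_corr(x, y, z):
    """Partial correlation of x and y with the linear effect of z removed."""
    return partial_corr_from_r(pearson(x, y), pearson(x, z), pearson(y, z))

# Hypothetical log-transformed readings, for illustration only.
bod = [1.10, 1.25, 1.40, 1.32, 1.55, 1.60]
cod = [1.80, 1.95, 2.20, 2.05, 2.30, 2.42]
nh3 = [0.20, 0.35, 0.30, 0.42, 0.50, 0.48]

print("zero-order r:", round(pearson(bod, cod), 3))
print("partial r_p :", round(partial_corr(bod, cod, nh3), 3))
```

Comparing the zero-order and partial coefficients shows how much of the apparent relationship is carried by the control variable.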

Multiple Linear Regression.
Multiple linear regression is a statistical tool for understanding the relationship between an outcome variable and several predictors (independent variables) that best represent the relationship in a population [15]. The technique is used for both predictive and explanatory purposes within experimental and nonexperimental designs. Multiple linear regression can be expressed using the equation

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_m X_m + \varepsilon,$$

where $Y$ represents the dependent variable, $X_1, \ldots, X_m$ represent the independent variables, $\beta_0, \ldots, \beta_m$ represent the regression coefficients, and $\varepsilon$ represents the random error.
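A minimal sketch of how such a model can be fitted is given below. It assumes ordinary least squares solved through the normal equations, which is one common way to estimate the coefficients; it does not reproduce the stepwise selection used in the study, and the data are synthetic.

```python
def fit_ols(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bm*xm; returns [b0, ..., bm]."""
    rows = [[1.0] + list(row) for row in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y).
    M = [
        [sum(row[i] * row[j] for row in rows) for j in range(p)]
        + [sum(row[i] * target for row, target in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(M[k][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for k in range(p):
            if k != col:
                factor = M[k][col] / M[col][col]
                M[k] = [a - factor * b for a, b in zip(M[k], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

def r_squared(X, y, coefs):
    """Coefficient of determination of the fitted model."""
    preds = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - q) ** 2 for t, q in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

# Synthetic data generated from y = 1 + 2*x1 - 3*x2 (no noise).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, -2, 0, 2]
coefs = fit_ols(X, y)
print(coefs, r_squared(X, y, coefs))
```

Fitting the same model to standardized variables would give standardized (beta) coefficients, which can be compared across predictors measured in different units.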

Repeated Measure Sample t-Test.
This statistical tool performs a paired two-sample t-test to determine whether the difference between the sample means is statistically distinct from a hypothesized difference. The repeated measure test does not assume that the variances of both populations are equal; it is used when there is only one group and data are collected from it on two different occasions or under two different conditions [26]. The t-value resulting from the analysis ranges between −∞ and +∞, with a positive value indicating an increase and a negative value indicating a decrease. The repeated measure test statistic is calculated from the following quantities: t is the test statistic (following Student's t-distribution), X̄1 is the mean of the paired differences for the sample, X̄2 is the mean of the paired differences for the population, s²_P is the standard error of the mean of the paired differences for the sample, and n1 and n2 are the numbers of paired difference values.
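As a hedged illustration, the textbook paired-sample statistic divides the mean of the paired differences (minus any hypothesized difference) by its standard error, t = (d̄ − μ0) / (s_d / √n), with n − 1 degrees of freedom. The sketch below assumes this standard form rather than the exact equation used by the authors; the function name and the readings are hypothetical.

```python
import math
import statistics

def paired_t(first, second, hypothesized_diff=0.0):
    """Return (t, degrees_of_freedom) for a paired-sample t-test."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation / sqrt(n)).
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return (mean_diff - hypothesized_diff) / std_err, n - 1

# Hypothetical paired readings (dry vs. wet) at the same sampling occasions.
dry = [1.0, 2.0, 3.0, 4.0]
wet = [2.0, 4.0, 5.0, 8.0]
t_value, dof = paired_t(dry, wet)
print(t_value, dof)
```

With the first argument taken as the dry season, a negative t indicates that the wet season values are higher on average.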

Results and Discussion
Descriptive Statistics.
The descriptive statistics of the physicochemical parameters under study are given in Table 1, which summarizes the mean, standard deviation, variance, skewness, and kurtosis values of the fifteen measured parameters for both dry and wet season data. The pH of the water samples is slightly acidic to slightly above neutral, ranging from 6.67 to 7.14 for dry and wet seasons, respectively.
The means for temperature and conductivity ranged from 28°C and 1.96 μS/cm to 29.5°C and 4.10 μS/cm for dry and wet seasons, respectively. The values of DS, TS, SS, turbidity, and salinity are generally higher in the wet season. These parameters qualitatively reflect the status of inorganic pollution; dissolved solids increase salinity as well as conductivity. The high values of these parameters during the wet season could be the result of the geology of the area and the effects of soil erosion.
The mean values of DO, BOD5, and COD are more pronounced in the dry season than in the wet season. This represents organic and nutrient pollution and may result from natural organic matter decomposition. It suggests that during the dry season, the volume of water in the river is significantly reduced and there is substantial addition of organic materials from residential areas of Kano Metropolitan to the Jakara River [27].

Discriminant Functions.
The objective of DA was to test the significance of the discriminant functions and to determine the most significant variables responsible for water quality variation in both dry and wet seasons. Tables 2 and 3 show that the values of Wilks' lambda for each discriminant function were quite small (0.37, 0.42, 0.42 for the dry season and 0.25, 0.84, 0.48 for the wet season) for the standard, forward stepwise, and backward stepwise modes, respectively. In the forward stepwise mode, variables/parameters were included step by step, beginning with the most significant variable, until no significant changes were obtained. In the backward stepwise mode, variables/parameters were removed step by step, beginning with the least significant variable, until no significant changes were obtained [4,15,16].
The standard DA mode constructed discriminant functions including all fifteen parameters under study. The forward stepwise and backward stepwise modes showed that DO, COD, pH, NH3, and Cl are the most significant parameters responsible for water quality variation in the dry season, assigning more than 96% (P < 0.05) of cases correctly. In the wet season, the forward stepwise discriminant functions used five variables and assigned 68.20% (P < 0.05) of cases correctly. The forward stepwise mode showed that DO, BOD5, COD, SS, and Cl are the most significant parameters responsible for water quality variation in the wet season. However, the backward stepwise DA mode produced a classification matrix with more than 82% (P < 0.05) correct assignations using six variables: DO, COD, salinity, turbidity, SS, and TS. The box and whisker plots of the discriminating parameters identified by DA (forward stepwise and backward stepwise) for both seasons are given in Figures 1, 2 and 3.
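The "percent correct assignation" figures quoted above are read off a classification matrix. A minimal sketch of that calculation follows, assuming a matrix indexed as actual group by predicted group; the counts are hypothetical and do not correspond to the matrices in Tables 2 and 3.

```python
def percent_correct(matrix):
    """Percentage of cases on the diagonal of an actual-by-predicted matrix."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[group].get(group, 0) for group in matrix)
    return 100.0 * correct / total

# Hypothetical classification matrix: matrix[actual][predicted] = number of cases.
matrix = {
    "dry": {"dry": 48, "wet": 2},
    "wet": {"dry": 6, "wet": 44},
}
print(percent_correct(matrix))
```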

Temporal Control Relationship between Variables.
Partial correlation was applied to the log-transformed data to estimate the correlation between BOD5 and COD, controlling for the linear effect of NH3, in the dry season data and between TS and DS, controlling for conductivity, in the wet season data. There was a strong positive partial correlation (r_p = 0.829, P = 0.0001), with high NH3 content associated with high levels of BOD5 and COD, in the dry season and a moderate positive partial correlation (r_p = 0.614, P = 0.0001), with high conductivity associated with high levels of TS and DS, in the wet season (Table 4). An inspection of the zero-order correlations for the dry season (r = 0.866) and wet season (r = 0.993) suggests that controlling for NH3 and conductivity, respectively, has a strong influence on these relationships.

Temporal Water Quality Predictors.
To find the best predictors of water quality variation in the Jakara Basin, a stepwise multiple linear regression model was used. Before interpreting the results, the classical assumptions of linear regression were checked. An inspection of the normal P-P plot of regression standardized residuals revealed that all the observed values fall roughly along the straight line, indicating that the residuals are from a normally distributed population. Moreover, the scatter plot (standardized predicted values against observed values) indicated that the relationship between the dependent variable and the predictors is linear and that the residual variances are equal or constant. Based on the collinearity diagnostics table obtained, none of the model dimensions has a condition index above the threshold of 30.0, none of the tolerance values is smaller than 0.10, and none of the VIF statistics is greater than 10.0. This indicated that there is no multicollinearity problem among the predictor variables of the models. Since there is no multicollinearity problem between the predictors included in the final dry and wet season models, and the classical assumptions of normality, linearity, and equality of variance are all met, it is reasonable to conclude that the estimated multiple linear regression models explaining water quality variation in the Jakara Basin are stable and reliable.
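The tolerance and VIF thresholds quoted above are two views of the same quantity. For predictor j, tolerance is 1 − R_j², where R_j² comes from regressing that predictor on all the others, and VIF is its reciprocal, so a tolerance of 0.10 corresponds to a VIF of 10. The sketch below assumes these standard definitions; the function names and the R_j² values are hypothetical.

```python
def tolerance_and_vif(r_squared_j):
    """Return (tolerance, VIF) for a predictor with auxiliary R-squared r_squared_j."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("auxiliary R-squared must lie in [0, 1)")
    tolerance = 1.0 - r_squared_j
    return tolerance, 1.0 / tolerance

def flags_multicollinearity(r_squared_j, min_tolerance=0.10, max_vif=10.0):
    """True if the predictor breaches either rule-of-thumb threshold."""
    tolerance, vif = tolerance_and_vif(r_squared_j)
    return tolerance < min_tolerance or vif > max_vif

# Hypothetical auxiliary R-squared values for two predictors.
print(tolerance_and_vif(0.5), flags_multicollinearity(0.5))
print(tolerance_and_vif(0.95), flags_multicollinearity(0.95))
```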

Dry Season Water Quality Predictors.
Based on the stepwise method of linear regression, seven predictor variables were found to be significant in explaining water quality variation in the dry season (Table 5): DO, COD, SS, NH3, temperature, pH, and conductivity. The other variables were excluded because they did not contribute to explaining dry season water quality variation. The obtained R-square of 0.976 implies that the seven predictor variables explained about 97.6% of the water quality variation in the dry season.
The ANOVA table revealed that the F-statistic (F = 381.22) was very large and the corresponding P value was highly significant (P = 0.0001), lower than the alpha value (0.05). As shown in Table 5, the largest beta coefficient was that of DO (0.539), which means that DO makes the strongest unique contribution in explaining the variation of water quality in the dry season when the variance explained by the other predictors in the model is controlled. That is, a one standard deviation increase in the concentration of DO is associated with a 0.539 standard deviation increase in the variation of water quality in the dry season. The beta value for COD was the second highest in magnitude (−0.423), followed by NH3 (−0.184), SS (−0.097), temperature (0.079), and conductivity (−0.052). The beta value for pH was the smallest (−0.050), indicating that it made the least contribution to water quality variation in the dry season.

Wet Season Water Quality Predictors.
The water quality variation in the wet season was explained by five predictor variables, namely, DO, BOD5, SS, TS, and Cl. The R-square of 0.942 revealed that 94.2% of the variation of water quality during the wet season was explained by these five predictors. The wet season coefficient estimates of the model are presented in Table 6. TS had the largest beta coefficient among the parameters calibrated by stepwise regression analysis and therefore makes the strongest unique contribution to wet season water quality variation. The beta value for DO (0.547) was the second highest, followed by Cl (0.545) and BOD5 (−0.292), and the least contributor was SS with −0.292.
The ANOVA table showed that the F-statistic (F = 112.697) was very large and the corresponding P value was highly significant (P = 0.0001), lower than the alpha value (0.05). This indicated that the slope of the estimated linear regression model is not equal to zero for both seasons, confirming that there is a linear relationship between the predictors and water quality variation in the models.

Temporal Water Quality Variation.
Temporal variation of water quality was examined using the repeated measure sample t-test, which determines whether the mean of the samples obtained in the dry season differs from that of the wet season samples. A quick check of the box plot shown in Figure 4 indicates that the mean of the wet season is much higher than the mean of the dry season.
A repeated measure sample t-test was conducted to compare the means of dry and wet season samples. The null hypothesis states that there is no difference between the mean river water quality in the dry and wet seasons. Preliminary assumption testing for normality found no violation for the dry season (KS = 0.113, P = 0.200), and the Q-Q plot indicated that the distribution for the dry season is normal. Although the test for normality for wet season samples did not show a perfectly normal distribution (KS = 0.103, P = 0.100), an inspection of the Q-Q plot for wet season samples shows that the distribution approaches normal. The detrended Q-Q plot showed that the data fall within −0.25 to 0.75 and −1.5 to 1.0 for dry and wet seasons, respectively, showing that no data deviate from the normal distribution.
The result obtained from the paired sample t-test revealed that there is a significant difference in the means of dry and wet season samples. The decision is that the null hypothesis was rejected and the research hypothesis was supported, because the mean differences obtained were rather large, the t-statistic obtained was very large in magnitude (t = −27.372), and the corresponding P value (0.0001) was very much smaller than the alpha of 0.05. Comparing the eta-squared value obtained (η² = 0.86) with Cohen's [28] criteria (0.01 = small effect, 0.06 = moderate effect, and 0.14 = large effect), the magnitude of the mean differences was large, showing that river water quality varies largely between the two seasons.
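For a paired-sample t-test, eta squared is commonly obtained from the t-statistic and the number of pairs N as η² = t² / (t² + N − 1). The sketch below assumes this formula, which the text does not state explicitly, and applies the Cohen thresholds quoted above; the number of pairs used in the example is hypothetical.

```python
def eta_squared_paired(t_value, n_pairs):
    """Effect size for a paired t-test: t^2 / (t^2 + N - 1)."""
    if n_pairs < 2:
        raise ValueError("at least two pairs are required")
    return t_value ** 2 / (t_value ** 2 + (n_pairs - 1))

def cohen_label(eta_sq):
    """Classify an eta-squared value using the 0.01 / 0.06 / 0.14 thresholds."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "moderate"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"

# Hypothetical example: t = 2.0 from 5 paired observations.
effect = eta_squared_paired(2.0, 5)
print(effect, cohen_label(effect))
```

Because t is squared, the sign of the statistic does not affect the effect size; only its magnitude and the number of pairs matter.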

Conclusion
In this study, different statistical techniques were used to assess temporal variation in surface water quality of the Jakara River Basin. DA rendered an important data reduction, as it used only five and six parameters (DO, COD, pH, NH3, and Cl and DO, BOD5, SS, salinity, turbidity, and Cl), affording more than 96% and 68% correct assignation for dry and wet seasons, respectively. Thus, DA allowed a reduction in the dimensionality of the large datasets and revealed a few indicator parameters responsible for large variation in water quality. Further, partial correlation analysis revealed strong and moderate partial correlations between BOD5 and COD and between TS and DS, controlling for the linear effect of NH3 and conductivity, for the dry and wet seasons, respectively. Multiple linear regression supported DA and identified the contribution of each variable, with r = 0.988, R² = 0.976 and r = 0.970, R² = 0.942 (P < 0.05) for dry and wet seasons, respectively. Repeated measure t-test confirmed that the surface water quality varies significantly between dry and wet season samples (P < 0.05). These statistical tools provided a more objective interpretation of water quality variables, and, from the analyses, it is clear that DO, COD, BOD5, NH3, Cl, SS, turbidity, pH, and salinity were the most important parameters responsible for water quality variation in the Jakara River Basin. Consequently, this study suggests that further studies in this area should be conducted to identify the sources of the parameters revealed by these statistical techniques, so as to control the menace.