Water Quality Assessment of Gufu River in Three Gorges Reservoir (China) Using Multivariable Statistical Methods

To provide a sound basis for the scientific management of water resources, and guidance for sustaining the health of Gufu River and maintaining the stability of the water ecosystem of the Three Gorges Reservoir on the Yangtze River, central China, multivariate statistical methods including Cluster Analysis (CA), Discriminant Analysis (DA) and Principal Component Analysis (PCA) were applied to assess spatial-temporal variations and interpret water quality data. The data were obtained during one year (2010~2011) of monitoring of 13 parameters at 21 different sites (3003 observations). Hierarchical CA classified 11 months into 2 periods (the first and second periods) and the 21 sampling sites into 2 clusters based on similarities in water quality characteristics: the upper reaches (UR), with little anthropogenic interference, and the lower reaches (LR), which run through farming areas and towns and are subject to some human interference. Eight significant parameters (total phosphorus, total nitrogen, temperature, nitrate nitrogen, total organic carbon, total hardness, total alkalinity and silicon dioxide) were identified by DA, affording 100% correct assignations for temporal variation analysis, and five significant parameters (total phosphorus, total nitrogen, ammonia nitrogen, electrical conductivity and total organic carbon) were confirmed with 88% correct assignations for spatial variation analysis. PCA (with varimax rotation) was applied to identify potential pollution sources in the two clustered regions. Four Principal Components (PCs) explaining 91.19 and 80.57% of the total variance were obtained for the UR and LR regions, respectively. For the UR region, rainfall runoff, soil erosion, scouring and weathering of crustal materials and forest areas are the main sources of pollution. The pollution sources for the LR region are anthropogenic (domestic and agricultural runoff, hydropower exploitation and municipal waste). The study demonstrates the utility of multivariate statistical techniques for river water quality assessment, identification of pollution sources, and exploration of spatial and temporal variations in water quality.


INTRODUCTION
Water quality is an important indicator of a river's ecological system and directly affects water use and development in the river basin (Lopes et al., 2004). River water quality is shaped by both anthropogenic factors (land use; industrial, agricultural and urban activities; and exploitation of water resources) and natural processes (forest areas, soil erosion, precipitation, geological composition and weathering) (Carpenter et al., 1998). It is therefore the result of the combination and interaction of multiple factors at multiple levels; variations in water quality exert a series of influences on the aquatic ecosystem and largely reflect the basic characteristics of the drainage area (Bonacci and Roje-Bonacci, 2003). Given the spatial-temporal variations in the hydrochemistry of surface waters, regular monitoring is necessary for a representative and credible estimate of water quality (Dixon and Chiswell, 1996; Singh et al., 2004). Such monitoring produces large, complex data sets that are often difficult to interpret and from which it is hard to extract meaningful information. Multivariate statistical techniques such as Cluster Analysis (CA), Discriminant Analysis (DA) and Principal Component Analysis (PCA) have been widely applied to interpret large and complex data matrices consisting of many physico-chemical parameters, in order to better understand the water quality of the studied systems (Vega et al., 1998; Helena et al., 2000; Alberto et al., 2001; Brodnjak-Vončina et al., 2002; Reghunath et al., 2002; Simeonov et al., 2003; Bengraine and Marhaba, 2003; Liu et al., 2003). Multivariate statistical techniques have been employed to characterize and assess surface water quality, and they help to demonstrate spatial and temporal variations caused by natural and anthropogenic factors linked to seasonality (Vega et al., 1998; Reisenhofer et al., [...]). [...]
[...] (Jiang et al., 2002). Soil presents an evolution pattern with increasing altitude. The types of land use are various, largely consisting of forest land, arable land, industrial and mining land and residential land.
In the present research, twenty-one sampling points (Fig. 1), viz. Site-1~Site-21, from the source of Gufu River to its junction with Xiangxi River, were chosen as the water quality monitoring network. The difference in altitude between neighboring sampling sites is 50~100 m. The selected sampling sites cover a wide range of determinants at key locations and reasonably represent the hydrological characteristics of the river system.

Sampling and analytical methods:
The data sets for the 21 water quality sampling sites consisted of 13 water quality parameters monitored monthly over one year (August 2010 to July 2011, with no sampling in January 2011). Because of the spatial-temporal variations in the hydrochemistry of rivers, regular sampling is necessary for reliable estimates of water quality. The water quality parameters selected were Total Phosphorus (TP, mg/L), Total Nitrogen (TN, mg/L), Nitrate Nitrogen (NO₃⁻-N, mg/L), Ammonium Nitrogen (NH₄⁺-N, mg/L), Chemical Oxygen Demand (COD, mg/L), Dissolved Oxygen (DO, mg/L), Water Temperature (Temp, °C), Total Alkalinity (T-Alk, mg/L), Total Hardness (T-Hard, mmol/L), Silica (SiO₂, mg/L), Electrical Conductivity (EC, μS/cm), Total Organic Carbon (TOC, mg/L) and Chlorophyll a (Chl a, μg/mL). The sampling, preservation, transportation and analysis of the water samples followed standard methods (APHA, 1998; ASTM, 2001). Temp, DO and EC were measured in the field with a portable multimeter. All other parameters were determined in the laboratory according to standard protocols (ISO, 1986; APHA, 1998). The one-year data set consisted of 3003 observations of Gufu River water quality in the Three Gorges Reservoir.
Data treatment: Analysis of Variance (ANOVA) was used to test for significant spatial and temporal differences (p<0.05). Spatial and temporal correlations among water quality parameters were tested using Pearson's coefficient, with statistical significance set at p<0.05. Spatial and temporal variations of the river water quality parameters were evaluated using the Spearman non-parametric correlation coefficient (Spearman's R) via period-parameter and site-parameter correlation matrices (Alberto et al., 2001; Singh et al., 2004; Shrestha and Kazama, 2007).
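As an illustration of the rank-based coefficient mentioned above, the following minimal sketch computes Spearman's R as the Pearson correlation of average ranks. The function names and the sample values are hypothetical and are not taken from the study, which does not describe its implementation at this level.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_r(x, y):
    """Spearman's R: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example: sampling month index against a measured parameter.
month = [1, 2, 3, 4, 5, 6]
temp = [9.5, 11.0, 14.2, 18.9, 21.3, 24.0]  # made-up values, strictly increasing
r_month_temp = spearman_r(month, temp)
```

Because the coefficient depends only on ranks, any strictly increasing relationship gives R = 1 regardless of its shape, which is why it suits parameters that are not normally distributed.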
For CA and PCA, all log-transformed data sets were z-scale standardized (mean and variance set to 0 and 1, respectively) to eliminate the influence of different measurement units and variances of the variables and to render the data dimensionless (Lattin et al., 2003; Liu et al., 2003; Singh et al., 2004; Zhou et al., 2007). In addition, before performing PCA, the suitability of the data was examined with the Kaiser-Meyer-Olkin (KMO) and Bartlett's sphericity tests (Shrestha and Kazama, 2007; Varol and Şen, 2009).
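The pre-treatment described above can be sketched as follows. The text does not state the logarithm base or whether the sample or population standard deviation was used, so base 10 and the sample standard deviation are assumptions here, and the function name and input values are hypothetical.

```python
import math
import statistics


def log_zscore(values):
    """Log-transform positive values, then standardize to mean 0 and variance 1."""
    logged = [math.log10(v) for v in values]  # assumption: base-10 logarithm
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # assumption: sample standard deviation (n - 1)
    return [(v - mean) / sd for v in logged]


# Hypothetical concentrations for one parameter across three samples.
z = log_zscore([1.0, 10.0, 100.0])
```

After this step every parameter is dimensionless and contributes on the same scale, so a variable measured in μS/cm cannot dominate one measured in mg/L.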

Multivariate statistical methods:
In the present study, CA, DA and PCA were combined to perform multivariate analysis of the water quality data sets (Vega et al., 1998; Alberto et al., 2001; Simeonov et al., 2003; Panda et al., 2006; Shrestha and Kazama, 2007; Varol and Şen, 2009). CA and DA were carried out using STATISTICA 6.0 and PCA using SPSS 19.0. A summary of the theory behind CA, DA and PCA follows.
Cluster Analysis (CA): CA is an unsupervised pattern recognition technique that divides a large group of cases into smaller groups, or clusters, of relatively similar cases that are dissimilar to the cases in other groups. Hierarchical Clustering Analysis (HCA) is the most common approach: clusters are formed sequentially, starting with each case in a separate cluster and joining clusters step by step until only one cluster remains (Vega et al., 1998; Singh et al., 2004). The Euclidean distance usually gives the similarity between two samples, and a distance can be represented by the difference between the transformed values of the samples (Otto, 1998). In this study, both temporal and spatial variations in water quality were determined by HCA on the standardized data set using Ward's method with squared Euclidean distances as the measure of similarity (Otto, 1998; Vega et al., 1998; Helena et al., 2000).
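The agglomerative procedure with Ward's criterion can be sketched in a few lines. This is a naive illustration only, not the STATISTICA implementation used in the study; the function name and the toy points are hypothetical. At each step it merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares, which for clusters a and b equals n_a n_b / (n_a + n_b) times the squared Euclidean distance between their centroids.

```python
def ward_clusters(points, n_clusters):
    """Agglomerate points with Ward's criterion until n_clusters remain."""
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        (ma, ca), (mb, cb) = a, b
        d2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
        return len(ma) * len(mb) / (len(ma) + len(mb)) * d2

    while len(clusters) > n_clusters:
        # Pick the pair of clusters that is cheapest to merge.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        ni, nj = len(mi), len(mj)
        # The new centroid is the size-weighted mean of the two centroids.
        centroid = [(ni * u + nj * v) / (ni + nj) for u, v in zip(ci, cj)]
        merged = (mi + mj, centroid)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(members) for members, _ in clusters)


# Hypothetical standardized observations for five sites and two parameters.
sites = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]]
groups = ward_clusters(sites, 2)  # [[0, 1, 2], [3, 4]]
```

Recording the merge cost at each step instead of stopping at a fixed number of clusters would give the linkage heights plotted in a dendrogram.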
Discriminant Analysis (DA): Discriminant analysis automatically computes the classification functions. These are not to be confused with the discriminant functions. The classification functions can be used to determine to which group each case most likely belongs. There are as many classification functions as there are groups. Each function allows us to compute classification scores for each case for each group, by applying Eq. (1):

$$S_i = c_i + w_{i1} x_1 + w_{i2} x_2 + \cdots + w_{im} x_m \qquad (1)$$

where,
$i$ = The respective group
$1, 2, \ldots, m$ = The $m$ variables
$c_i$ = A constant for the $i$'th group
$w_{ij}$ = The weight for the $j$'th variable in the computation of the classification score for the $i$'th group
$x_j$ = The observed value for the respective case for the $j$'th variable
$S_i$ = The resultant classification score
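The classification score described above (a group constant plus a weighted sum of the observed values) can be evaluated as in the sketch below, with a case assigned to the group that has the highest score. All coefficients and observations are made up for illustration and are not the values reported in the tables of this study; the function names are hypothetical.

```python
def classification_scores(x, constants, weights):
    """Return S_i = c_i + sum_j w_ij * x_j for every group i."""
    return {
        group: constants[group] + sum(w * v for w, v in zip(weights[group], x))
        for group in constants
    }


def assign_group(x, constants, weights):
    """Assign the case to the group with the highest classification score."""
    scores = classification_scores(x, constants, weights)
    return max(scores, key=scores.get)


# Hypothetical coefficients for two groups and two variables (Temp, TN).
constants = {"wet": -10.0, "dry": -4.0}
weights = {"wet": [0.8, 2.0], "dry": [0.3, 1.0]}

warm_case = assign_group([20.0, 1.5], constants, weights)  # "wet"
cold_case = assign_group([8.0, 0.5], constants, weights)   # "dry"
```

Comparing the assigned groups with the known groups over all cases yields the classification matrix and the percentage of correct assignations.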
DA is used to determine the variables that discriminate between two or more naturally occurring groups. It operates on the original data, and the technique constructs a discriminant function for each group (Johnson and Wichern, 1992; Alberto et al., 2001; Singh et al., 2004), as in Eq. (2):

$$f(G_i) = k_i + \sum_{j=1}^{n} w_j p_j \qquad (2)$$

where,
$i$ = The number of groups ($G$)
$k_i$ = The constant inherent to each group
$n$ = The number of parameters used to classify a set of data into a given group
$w_j$ = The weight coefficient, assigned by DA to a given selected parameter ($p_j$)

In this study, DA was performed on the original data set using the standard, forward stepwise and backward stepwise modes to evaluate both the temporal and spatial variations in river water quality. The best discriminant functions for each mode were constructed considering the quality of the classification matrix and the number of parameters. The sites (spatial) and periods (temporal) were the grouping (dependent) variables, while all the measured parameters were the independent variables.

Principal Component Analysis (PCA):
The PCA is one of the most powerful and common techniques for reducing the dimensionality of multivariate problems. As a non-parametric method of classification, it makes no assumptions about the underlying statistical distribution of the data (Huang et al., 2011). PCA extracts the eigenvalues and eigenvectors from the covariance matrix of the original variables. An eigenvector is a list of coefficients (loadings or weightings) by which the original correlated variables are multiplied to obtain new uncorrelated (orthogonal) variables, called Principal Components (PCs), which are weighted linear combinations of the original variables. PCA provides information on the most significant parameters, those that describe the whole data set through their spatial and temporal variations, while excluding the less significant parameters with minimum loss of original information (Singh et al., 2004; Kannel et al., 2007). It is a powerful technique for pattern recognition that attempts to explain the variance of a large set of inter-correlated variables by transforming them into a smaller set of independent (uncorrelated) variables (principal components). Factor Analysis (FA) follows PCA and further reduces the contribution of the less significant variables obtained from PCA. The new groups of variables, known as Varifactors (VFs), are constructed by rotating the axes defined by PCA. A PC is a linear combination of observable water quality variables, whereas a VF can include unobservable, hypothetical, "latent" variables (Vega et al., 1998; Helena et al., 2000). PCA of the normalised variables (water-quality data set) was performed to extract significant PCs and to further reduce the contribution of variables with minor significance; these PCs were then subjected to varimax rotation (raw) to generate VFs.
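To make the eigenvalue and eigenvector extraction concrete, the sketch below finds the leading eigenpairs of a small symmetric matrix by power iteration with deflation. This numerical method is swapped in purely for illustration and is not what SPSS does internally; the 2 x 2 correlation matrix is hypothetical, and the varimax rotation step is not shown.

```python
def power_iteration(matrix, iterations=500):
    """Return the dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(matrix)
    vec = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        nxt = [sum(matrix[r][c] * vec[c] for c in range(n)) for r in range(n)]
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    # The Rayleigh quotient of the converged unit vector is the eigenvalue.
    mv = [sum(matrix[r][c] * vec[c] for c in range(n)) for r in range(n)]
    value = sum(a * b for a, b in zip(vec, mv))
    return value, vec


def principal_components(corr, n_components):
    """Extract the leading eigenpairs by power iteration with deflation."""
    work = [row[:] for row in corr]
    pairs = []
    for _ in range(n_components):
        value, vec = power_iteration(work)
        pairs.append((value, vec))
        # Deflate: subtract the component just found so the next one dominates.
        for r in range(len(work)):
            for c in range(len(work)):
                work[r][c] -= value * vec[r] * vec[c]
    return pairs


# Hypothetical correlation matrix of two strongly correlated variables.
corr = [[1.0, 0.8], [0.8, 1.0]]
pcs = principal_components(corr, 2)
eigenvalues = [value for value, _ in pcs]  # approximately 1.8 and 0.2
```

In this toy case the first component carries 1.8 / 2 = 90% of the total variance, so the two correlated variables could be summarized by a single PC with little loss of information.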

RESULTS AND DISCUSSION
The basic statistics for all of the water quality variables measured during the one-year sampling period at the 21 sites on the river are summarized in Table 1.
The spatial and temporal changes of the river water-quality parameters (Table 2) were estimated via period-parameter and site-parameter correlation matrices. Apart from T-Alk and Chl a, all the analyzed parameters were found to be significantly correlated with period (p<0.05). Among these, Temp and COD displayed the highest correlation coefficient (Spearman's R = 0.87). Other parameters exhibiting correlation with period were T-Hard (R = -0.81), DO (R = -0.78), NO₃⁻-N (R = 0.73), NH₄⁺-N (R = 0.73), SiO₂ (R = 0.71), TN (R = 0.69), TOC (R = -0.67), TP (R = 0.36) and EC (R = -0.39). The site-parameter correlation matrix indicated that TP, TN, NO₃⁻-N, NH₄⁺-N, Temp and DO were correlated with site. Among these, Temp showed the highest correlation coefficient (R = 0.828), followed by DO (R = -0.71), TN (R = 0.65), TP (R = 0.61), NO₃⁻-N (R = 0.60) and NH₄⁺-N (R = 0.51). The period-correlated and site-correlated parameters can be regarded as representing the major sources of temporal and spatial variation in the water quality of the river. In view of the source types in the river watershed, these correlations can be interpreted on the basis of the temporal and spatial features of the study region.
Temporal similarity and period grouping: Temporal CA generated a dendrogram (Fig. 2), grouping one year [...] backward stepwise mode, TN and Temp were also found to be significant variables. Thus, the temporal DA results suggest that TP, TN, Temp, NO₃⁻-N, TOC, T-Hard, T-Alk and SiO₂ were the most significant indicators for discriminating between the two periods, which means that these eight parameters explain most of the expected temporal variation in water quality.
Box and whisker plots of the discriminant parameters identified by temporal DA (forward stepwise mode) were used to assess the temporal patterns in water quality (Fig. 4). The average values of Temp, TN, NO₃⁻-N, TP and SiO₂ were higher in the first period than in the second period, while T-Alk, T-Hard and TOC showed the opposite trend. The first period corresponds to the wet season in the Three Gorges Reservoir, when rainy weather frequently causes soil loss (Withers and Lord, 2002), storm runoff, agricultural runoff (Changnon and Demissie, 1996; Mander et al., 1998), river bed degradation (Goolsby et al., 2000; Zhou et al., 2007) and so on, which makes nutrient (nitrogen and phosphorus) concentrations relatively higher in the first period. Temperatures are also higher in the wet season, which promotes weathering and leads to the increase in SiO₂. By comparison, the second period (dry season) has less precipitation and a drier climate, resulting in more highly mineralized water, which caused the increase in T-Alk and T-Hard (Zhou et al., 2001). In the second period, large amounts of dead wood and leaves decay, which increased the organic content measured as TOC.

Spatial variations in water quality:
To further evaluate spatial variations in water quality between the different stream segments, spatial DA was performed on the initial data set comprising 13 parameters after it had been divided into the two classes, UR and LR, by CA. The classes were treated as the dependent variables, while all the measured water quality parameters were treated as the independent variables. The values of Wilks' lambda and the Chi-square (Table 6) for each discriminant function varied from 0.041 to 0.268 and from 23.677 to 42.234, respectively, and the p-level was below 0.05, indicating that the spatial DA was credible and effective. Discriminant Functions (DFs) and Classification Matrices (CMs) obtained via the standard, forward stepwise and backward stepwise modes of DA are shown in Tables 7 and 8, respectively. Both the standard and forward stepwise mode DFs produced CMs with 100% correct assignations using 13 and 8 discriminant parameters, respectively (Table 8). The backward stepwise mode DA obtained CMs with close to 88% correct assignations using only 5 discriminant parameters (Tables 7 and 8). Thus, the spatial DA results suggest that TN, TP, NH₄⁺-N, EC and TOC were the most significant water quality parameters for discriminating between the two stream segments (UR and LR), which means that these five parameters explain most of the expected spatial variation in water quality.
Box and whisker plots of the discriminant parameters identified by spatial DA (backward stepwise mode) were used to assess the spatial patterns in water quality (Fig. 5). The average values of TP, NH₄⁺-N, EC and TOC were higher in the LR than in the UR, while TN showed the reverse trend. In the UR, which is relatively far from anthropogenic influences, TN and TP were influenced by natural factors; in the LR, with increased human disturbance, nitrogen and phosphorus were affected not only by natural factors but also by human activities. Presumably, nitrogen was more often influenced by natural factors, whereas phosphorus was more often affected by various human activities. With increasing human activity, the concentration of various ions in the water (reflected in EC) increased. From the UR down to the LR, the forest litter layer on the soil surface gradually accumulates, which led to an increase in TOC content.
Identification of potential pollution sources in sampling sites: PCA was employed on the data set (13 parameters) to compare the compositional patterns between the examined water parameters, to examine differences between UR and LR and to identify the latent factors behind the spatial variability. PCA of the two data sets yielded four PCs with eigenvalues >1 for each of the UR and LR sites, explaining 91.19 and 80.57% of the total variance in the respective water quality data sets. The corresponding VFs, variable loadings and variance explained are displayed in Table 9.
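The retention rule used here (keep components whose eigenvalue exceeds 1) and the resulting share of total variance can be sketched as follows. The eigenvalues are hypothetical and serve only to show the arithmetic; for a correlation matrix the eigenvalues sum to the number of variables, so each eigenvalue divided by that sum is the fraction of variance its component explains.

```python
def kaiser_retain(eigenvalues):
    """Keep eigenvalues > 1 and report the cumulative % of variance explained."""
    total = sum(eigenvalues)  # equals the number of variables for a correlation matrix
    kept = [ev for ev in sorted(eigenvalues, reverse=True) if ev > 1.0]
    cumulative = []
    running = 0.0
    for ev in kept:
        running += ev
        cumulative.append(100.0 * running / total)
    return kept, cumulative


# Hypothetical eigenvalues for a four-variable correlation matrix.
kept, cumulative = kaiser_retain([2.4, 1.2, 0.3, 0.1])
# Two components are retained, explaining about 60% and then 90% cumulatively.
```

The last entry of the cumulative list corresponds to a figure such as the total variance quoted for the retained components of each region.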
As shown in Table 9, for the UR data set, among the four VFs, VF1, explaining 36.19% of the total variance, was correlated (loading >0.7) with TN, NO₃⁻-N, Temp, SiO₂ and TOC, especially TN and NO₃⁻-N. Thus, it represented nitrogenous nutrient pollution, organic pollution and salt. VF2, explaining 26.11% of the total variance, was correlated with EC, [...]-N, and thus represented nitrogenous nutrient pollution. VF3, explaining 16.29% of the total variance, was correlated with TP and COD. Thus, VF3 represented phosphorus and organic pollution. VF4, explaining 13.79% of the total variance, was correlated with EC and Temp. Thus, it represented the ion content of the water.
The PCA results show that most of the change in water quality was explained by the nutrient group of pollutants (nitrogen and phosphorus) and by anthropogenic influences (agricultural activities, residential sewage discharge, etc.). Nitrogen, a leading factor in water quality change in the UR, where there is little anthropogenic interference, could be attributed to "geological" nitrogen (Holloway et al., 1998). T-Hard and T-Alk might arise from the dissolution of limestone and gypsum soils (Vega et al., 1998) and can thus be explained as a mineral component of the surface river water; this distribution of water quality also accords with the characteristics of mountain rivers (Day et al., 1998). EC in the UR was related mainly to T-Hard and T-Alk, although it could also be due to anthropogenic influences such as land use (Walker and Pan, 2006), hydropower exploitation (Zhang et al., 2010), and so on. SiO₂ would be related to natural weathering (Xie et al., 1999). Physical parameters such as Temp and DO simply reflect conditions in the river itself. DO was negatively correlated with Temp and ranged from 8.6 to 10.9 mg/L. This result showed that the river water was oxygen-saturated and had a strong self-purification capacity. TOC and COD represented organic pollution, which was probably related to the large amounts of dead wood and leaves discharged into the water from the dense vegetation cover in the river basin (Ye et al., 2006).

CONCLUSION
In this study, different multivariable statistical methods were successfully employed to assess spatial-temporal variations in the surface water quality of Gufu River in the Three Gorges Reservoir. Hierarchical CA classified 11 months into 2 periods (the first and second periods) and 21 sampling sites into 2 clusters (UR and LR), based on similarities in the water quality characteristics. DA gave good results both spatially and temporally, with good discriminatory ability confirmed by significance tests. For the temporal variation analysis, DA identified eight significant parameters (TP, TN, Temp, NO₃⁻-N, TOC, T-Hard, T-Alk and SiO₂) that discriminate between the periods with 100% correct assignations. For the spatial variation analysis, DA used only five significant parameters (TP, TN, NH₄⁺-N, EC and TOC) to discriminate between the regions with 88% correct assignations. PCA, in contrast, did not achieve appreciable data reduction: 11 parameters (85% of the original 13) were required to explain 91% of the data variability at the UR sites, and 11 parameters (85% of the original 13) were required to explain 80% of the data variability at the LR sites. For the UR region, the four VFs obtained from the PCs indicate that the eleven parameters responsible for water-quality variations relate mainly to the nutrient group of pollutants, soluble salts and organic pollution load, which derive mainly from natural processes such as rainfall runoff, soil erosion, scouring and weathering of crustal materials and forest areas. For the LR region, the four VFs obtained from the PCs indicate that the eleven parameters responsible for water-quality variations relate mainly to the nutrient group of pollutants and soluble salts, which result largely from anthropogenic impacts such as domestic and agricultural runoff, hydropower exploitation and municipal waste. To support better management of Gufu River, surface water quality variations due to anthropogenic interference in the LR region were examined and compared with those of the UR region.

Fig. 5: Spatial variation: (a) TP; (b) TN; (c) NH₄⁺-N; (d) EC; (e) TOC

[...] Temp, T-Hard and T-Alk, which can be explained as a mineral component of the surface water of the river. VF3, explaining 17.30% of the total variance, was correlated with COD and Chl a. Thus, VF3 represented organic pollution and eutrophication. VF4, explaining 11.58% of the total variance, was correlated with TP. Thus, it represented phosphorus nutrient pollution. For the LR data set, among the four VFs, VF1, explaining 27.00% of the total variance, was correlated (loading >0.7) with T-Hard, T-Alk, SiO₂ and Chl a. Thus, it represented mineral composition and [...]

Table 3: Wilks' Lambda and Chi-Square test of DA of temporal variation of water quality

Table 4: Classification functions (Eq. (2)) for discriminant analysis of temporal variations in water quality of Gufu River

Discriminant function coefficients for the dry season and wet season correspond to w_ij as defined in Eq. (1)

Table 5: Classification matrix for discriminant analysis of temporal variations in water quality of Gufu River

Table 7: Classification functions (Eq. (2)) for discriminant analysis of spatial variations in water quality of Gufu River

a: Upper reaches includes sites 1-9; b: Lower reaches includes sites 10-21; *: Discriminant function coefficients for different catchments correspond to w_ij as defined in Eq. (1)

Table 8: Classification matrix for discriminant analysis of spatial variations in water quality of the Gufu River

Table 9: Loadings of experimental variables (13) on significant principal components (with Varimax rotation) for the data set. Bold values indicate strong and moderate loadings