Measurement of Available Phosphorus and Potassium Contents in Soil using Visible-near-infrared Spectroscopy in Conjunction with SPA-LS-SVM Methods

Near Infrared Reflectance Spectroscopy (NIRS) applied on farmland can effectively estimate the available phosphorus and potassium contents of soil online. Spectral preprocessing, including Savitzky-Golay (SG), Standard Normal Variate (SNV), Multiplicative Scatter Correction (MSC) and SG 1st derivative, was used to eliminate system noise and external interference. Calibration models were created using the Radial Basis Function (RBF) and Least Squares Support Vector Machine (LS-SVM) methods, each taking as input the characteristic wavelengths obtained using the Successive Projections Algorithm (SPA). The results of predicting available phosphorus and potassium contents in soil with these two modeling methods were evaluated and the better model was selected. The results showed that the LS-SVM method with input from the characteristic wavelengths obtained using SPA had an advantage over the RBF modeling method. In the SPA-LS-SVM models, the correlation coefficient and mean square error of prediction were 0.8625 and 8.67 for available phosphorus and 0.7843 and 13.42 for available potassium. This indicates that SPA-based visible-near-infrared spectroscopy with LS-SVM modeling can be used as a method to accurately measure available phosphorus and potassium contents in soil.


INTRODUCTION
Soil nutrients are among the major limiting factors that affect crop growth. Crops need water and various nutrients from agricultural soils to grow, and plants require large amounts of soil phosphorus and potassium. These elements are among the major factors that determine plant growth and soil productivity, and they are an important basis for scientific and balanced fertilization. It is therefore important to know how phosphorus and potassium are distributed in agricultural soil so as to provide guidance for agricultural production (Rossel et al., 2006; Gomez et al., 2008; Reeves III and Smith, 2009; Pan et al., 2009; Reeves III, 2010; Yu, 2011; Wang et al., 2013). Traditionally, available phosphorus and potassium contents in soil are measured by chemical analysis, which is time-consuming and cumbersome and can hardly meet the need for rapid monitoring. In recent years, near-infrared spectroscopy has gradually become a powerful new tool for measuring available phosphorus and potassium contents in soil because it is fast, simple and reliable (Bogrekci and Lee, 2005a, 2005b, 2006; Bao et al., 2007; Barthes et al., 2008; Brichlemyer and Brown, 2010; Reeves III, 2010; Minasny et al., 2011).
Recent studies that used VIS/NIRS technology to build regression models for organic matter, total nitrogen, total phosphorus and total potassium in soil showed that these quantities can be estimated rapidly. Confalonieri et al. (2001) used VIS/NIRS technology to estimate available phosphorus and potassium contents in different soils, but the estimation results were unsatisfactory. Li et al. (2007) used Partial Least Squares (PLS) and Artificial Neural Network (ANN) methods to create models for estimating alkali-hydrolyzable nitrogen, available phosphorus and potassium contents in soil. The results showed that NIRS technology can feasibly estimate alkali-hydrolyzable nitrogen content in soil, but the feasibility of estimating available phosphorus and potassium contents still requires further study.
Near-infrared spectra mainly reflect the overtone and combination absorptions of organic matter. Different substances have complicated overlapping bands, so the full spectrum contains a lot of redundant information and noise, which degrades the predictive performance of the model. Wavelength selection methods include the wavelet algorithm, the genetic algorithm, Uninformative Variable Elimination (UVE), SPA, etc. The correlation coefficient method and the effective wavelength method often select thresholds based on subjective experience, without systematic and effective threshold selection criteria, while methods such as Monte Carlo UVE and the genetic algorithm require time-consuming, intensive computation during the search and yet generate unstable results. SPA can reduce the number of modeling variables to fewer than algorithms such as Monte Carlo UVE, the genetic algorithm and the wavelet algorithm, and it also improves the speed and efficiency of modeling. SPA has therefore been widely applied in selecting the characteristic bands of near-infrared spectra. This study examined the feasibility of applying SPA in conjunction with the LS-SVM modeling method to select and optimize modeling variables for soil near-infrared spectra. With 160 experimental soil samples collected, SPA was employed to select and optimize the modeling variables. The optimized variables were then used to create a model for predicting available phosphorus and potassium contents in soil.

EXPERIMENTS
Sample collection and treatment:
The samples were collected from agricultural soil in the city of Zhenjiang along the Yangtze River, at north latitude 32º 6 6 and east longitude 119º 2 4, within the monsoon climate zone in the southern part of the northern subtropical zone. Because the site lies in the Middle and Lower Yangtze Valley Plain, most of the soil is yellow-brown soil formed by the action of the Yangtze River. So that the samples covered a wide range of nutrient contents, sampling points were selected using the cross method at a depth of 0-20 cm. At each point, gravel and plants on the soil surface were removed and a 20 cm deep rectangular pit was dug. A 5 cm thick layer of soil was then shoveled off the side wall of the pit and collected in bags, resulting in 160 soil samples. So that the samples reflected the actual conditions of the agricultural land as far as possible, the soil was not ground but was dried in natural conditions and then passed through a 2 mm sieve. The soil was then packed into petri dishes 9 cm in diameter and 1 cm deep, and the surface was leveled with a ruler. The samples were divided into two parts, one for chemical analysis and the other for near-infrared analysis. Available phosphorus content in soil was measured by colorimetry using a spectrophotometer. Phosphorus in acid soil is present mainly in the forms of Fe-P and Al-P. The ability of fluoride ions to complex Fe3+ and Al3+ in acidic solution can be used to successively activate and release the relatively active iron and aluminum phosphates in such soil. Meanwhile, due to the effect of H+, some relatively active Ca-P can also be dissolved. The content can then be measured using the Mo-Sb colorimetric method. Available potassium content in soil was measured using the flame photometry method. When  1.

Spectral acquisition instrument and spectral preprocessing:
A Fourier transform near-infrared spectrometer of the Swiss Arcspectro FT Rocket type was used for spectroscopic measurement, with a spectral range of 350-2500 nm. The instrument is equipped with a 35 W halogen light source and a fiber optic probe, both mounted on the built-in bracket of the spectrometer at an included angle of 8°. The distance between the fiber optic probe and the soil sample is adjustable, and the probe is connected through an optical fiber to the spectrometer, which parses the reflectance spectrum and sends it to a computer via USB, as shown in Fig. 1.
Each sample was scanned 5 times at an interval of 1 nm. The averaged spectral curve is shown in Fig. 2.
Every five consecutive wavelengths were averaged to reduce the dimensionality of the spectrum. SNV, MSC and 1st derivative preprocessing were then each applied to the spectrum in order to compare the effect of different spectral preprocessing methods on the model prediction results. The results are shown in Table 2. According to this comparison, extracting the characteristic wavelengths from the original spectrum after SG 1st derivative preprocessing produced the best prediction model result.
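The preprocessing steps named above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, and the 5-point window and quadratic polynomial used for the SG 1st derivative are assumptions, since the text does not state the SG settings.

```python
from statistics import mean, stdev

def bin_average(spectrum, width=5):
    """Average every `width` consecutive points; a trailing partial bin is dropped."""
    n_bins = len(spectrum) // width
    return [mean(spectrum[i * width:(i + 1) * width]) for i in range(n_bins)]

def snv(spectrum):
    """Standard Normal Variate: centre and scale one spectrum by its own mean and std."""
    m, s = mean(spectrum), stdev(spectrum)
    return [(v - m) / s for v in spectrum]

def msc(spectra):
    """MSC: regress each spectrum on the mean spectrum, then remove offset and slope."""
    ref = [mean(col) for col in zip(*spectra)]
    ref_mean = mean(ref)
    sxx = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = mean(s)
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / sxx
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected

def sg_first_derivative(spectrum, step=1.0):
    """Savitzky-Golay 1st derivative with an assumed 5-point window and quadratic fit.

    The two points at each end, which lack a full window, are left out.
    """
    coeffs = (-2, -1, 0, 1, 2)  # 5-point quadratic SG derivative weights (divided by 10)
    out = []
    for i in range(2, len(spectrum) - 2):
        window = spectrum[i - 2:i + 3]
        out.append(sum(c * v for c, v in zip(coeffs, window)) / (10.0 * step))
    return out
```

Here `step` is the wavelength spacing after binning (5 nm if 1 nm readings are averaged in groups of five), so the derivative is expressed per nanometre.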

PREDICTION MODELING
Effect of sampling height on spectrum: The reflectance spectrum in soil measurement is affected by factors such as the structure of the measured component, stability and probe mounting height, so the effect of probe mounting height on measuring accuracy must be examined in order to increase the accuracy of online measurement. To facilitate field measurement, spectra of the soil samples were collected with the spectrometer's fiber optic probe at heights of 5, 7, 10, 12 and 15 cm above the measured object, and the spectral data were used together with the available phosphorus and potassium contents in soil to create PLS prediction models. The results are shown in Table 3. The table shows that a probe height of 10 cm above the soil sample surface gave the best modeling result.
The determination coefficients of the modeling calibration set were 0.7985 and 0.6965 and the root mean square errors of the modeling calibration set were 9.29 and 14.30, so a sampling height of 10 cm was adopted.

Selection of characteristic wavelength using SPA:
Abnormal samples seriously affect the prediction accuracy of the model. This study applied the 3σ edit rule and the Principal Component Analysis (PCA) score plot to the measured contents and spectral data to determine whether any values of available phosphorus and potassium contents in soil or any sample spectra were abnormal. Because quantitative analysis requires a large number of samples and the spectrum of each sample contains overlapping peaks, which results in redundant spectral information and indistinct characteristic absorption peaks, it is necessary to find the effective wavelengths that play a key role in the model (Wang et al., 2011). The Successive Projections Algorithm (SPA) selects a small number of wavelengths from the original data by projecting and mapping the spectral data, so as to summarize as much sample spectral information as possible and avoid information overlap to the greatest extent (Chen et al., 2013). The Competitive Adaptive Reweighted Sampling (CARS) method imitates the "survival of the fittest" principle of Darwinian evolution theory: it treats each wavelength variable as an individual and eliminates non-adaptive individuals step by step. CARS eliminates useless variables at an exponentially decreasing rate to select the optimal wavelengths (Sun et al., 2012). The Stability Competitive Adaptive Reweighted Sampling (sCARS) method builds on CARS by taking stability into account, and it eliminates redundant information by using the stability of a variable as an indicator of its modeling capability. The Random frog algorithm calculates the probability of each variable being selected by simulating a Markov chain that has a stationary distribution in the model space, thereby performing variable selection. In conjunction with the PLS regression algorithm, Random frog uses the absolute value of each variable's regression coefficient in the model as the basis for deciding whether to eliminate the variable in each iteration. It is a very effective method for selecting variables in high-dimensional data (Mouazen et al., 2005, 2006).
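Of the selection methods described above, the projection chain at the core of SPA is simple enough to sketch. The following is an illustrative pure-Python version under stated assumptions, not the authors' code: `spa_select` is a hypothetical name, the starting wavelength and the number of wavelengths are taken as given, and the usual outer loop that compares starting wavelengths and subset sizes by validation error is omitted.

```python
def spa_select(X, n_select, start=0):
    """Pick `n_select` column indices of X with minimal collinearity.

    X is a list of spectra (rows = samples, columns = wavelengths).
    At each step every unselected column is projected onto the subspace
    orthogonal to the last selected column, and the column with the
    largest remaining norm is selected next.
    """
    cols = [list(c) for c in zip(*X)]  # working copy, one list per wavelength
    selected = [start]
    for _ in range(n_select - 1):
        ref = cols[selected[-1]]
        ref_sq = sum(v * v for v in ref)
        if ref_sq == 0.0:
            break  # nothing left to project on; remaining columns carry no new information
        best, best_norm = None, -1.0
        for j, col in enumerate(cols):
            if j in selected:
                continue
            scale = sum(a * b for a, b in zip(col, ref)) / ref_sq
            cols[j] = [a - scale * b for a, b in zip(col, ref)]
            norm = sum(v * v for v in cols[j])
            if norm > best_norm:
                best, best_norm = j, norm
        if best is None:
            break
        selected.append(best)
    return selected
```

Because each projection removes the part of a column already explained by the previous choice, a wavelength that is nearly collinear with a selected one ends up with a small norm and is unlikely to be picked.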
In order to find the optimal wavelength selection method, characteristic wavelength selection was tested on the full-band spectral data using SPA, CARS, sCARS and Random frog, and a PLS model was created for the wavelength points selected by each method. The results are shown in Table 4. Table 4 shows that SPA can compress the raw spectral data, thus greatly reducing the computational complexity of the model and improving its stability as well as its prediction accuracy, provided that the number of characteristic wavelengths is equal. The algorithm attempts to find the wavelengths containing the minimum redundant information in the raw spectral data by sorting and filtering the selected wavelengths according to the contribution value of the test samples, thus preventing the selected wavelength data from containing overlapping information and removing redundant information. For these reasons, the SPA method was used for all subsequent prediction experiments. After the soil spectral data were compressed and filtered using SPA, nine contributing wavelength points remained: 447, 655, 722, 1055, 1255, 1467, 1678, 1890 and 2246 nm.
Modeling methods: Common methods for quantitative analysis in near-infrared spectroscopy include linear analysis methods and nonlinear methods such as neural networks. Preliminary experiments showed that linear analysis methods give poor prediction results for rapidly available phosphorus content in soil, so this study compares the modeling results of two nonlinear methods, Radial Basis Function (RBF) and Least Squares Support Vector Machine (LS-SVM), in an attempt to find the optimal modeling method.
RBF network model: An RBF network consists of an input layer, a hidden layer and an output layer. The input layer has 120 neurons; the hidden layer uses a Gaussian radial basis function with 5 neurons, as determined through comparative experiments; the output layer uses a linear transfer function with one neuron corresponding to the predicted value of available phosphorus or potassium content in soil.
During training, 120 samples were selected as input using the SPXY (Sample set Partitioning based on joint X-Y distance) method from the sample set from which all abnormal samples had been removed (for each sample, the characteristic wavelengths were selected using the SPA method). Weighing modeling and prediction accuracy together, an error target of 55 was determined through repeated tests, and the other parameters were left at their default values. After the network had been trained, the remaining 30 soil samples were fed into the trained RBF network as the prediction set.
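The SPXY partition mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' code: `spxy_split` is a hypothetical name, Euclidean distance is assumed for the spectra, and the sketch assumes that neither the spectra nor the reference values are all identical (otherwise the normalizing maxima would be zero).

```python
from math import dist

def spxy_split(X, y, n_cal):
    """Split sample indices into calibration and prediction sets (SPXY).

    The joint distance between two samples is the spectral distance plus the
    reference-value distance, each divided by its maximum over all pairs.
    Samples are then picked Kennard-Stone style: start from the most distant
    pair, then repeatedly add the sample farthest from those already chosen.
    """
    n = len(X)
    dx = [[dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    dy = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]
    max_dx = max(max(row) for row in dx)
    max_dy = max(max(row) for row in dy)
    d = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)] for i in range(n)]

    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    first, second = max(pairs, key=lambda p: d[p[0]][p[1]])
    selected = [first, second]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_cal and remaining:
        nxt = max(remaining, key=lambda k: min(d[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining
```

With the sample counts given in the text, a call such as `spxy_split(spectra, contents, 120)` would return the 120 calibration indices and leave the rest for prediction.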

LS-SVM:
LS-SVM is an improved support vector machine that maps the training data from the nonlinear input space to a high-dimensional feature space and replaces inequality constraints with equality constraints, so that minimizing the loss function in the high-dimensional space yields a linear fitting function. This significantly increases computational speed. During training, input X for LS-SVM was the 120 samples selected using the SPXY method; the number of dimensions of the input matrix was the number of characteristic wavelengths selected using SPA; output Y was the value of available phosphorus or potassium content in soil obtained from the 132 samples. During prediction, input X_t was the spectral information contained in the other 20 test samples; output Y_t was the predicted value of available phosphorus or potassium content in soil. Preliminary experiments determined that a regularization parameter (γ) of 0.72 and an RBF kernel function parameter (σ²) of 0.58 produced the best prediction result for phosphorus or potassium content in soil.
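To make the role of the equality constraints concrete, the sketch below trains an LS-SVM regressor by solving a single linear system. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the kernel is taken as exp(-||x - z||² / σ²) (some formulations divide by 2σ² instead), and the defaults simply reuse the γ and σ² values reported above.

```python
from math import exp

def rbf_kernel(a, b, sigma2):
    """RBF kernel, assuming the form exp(-||a - b||^2 / sigma2)."""
    return exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / sigma2)

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def lssvm_fit(X, y, gamma=0.72, sigma2=0.58):
    """Return (b, alpha) from the system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf_kernel(X[i], X[j], sigma2) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # ridge term from the squared-error loss
        A.append(row)
    sol = solve(A, [0.0] + list(y))
    return sol[0], sol[1:]

def lssvm_predict(X_train, b, alpha, x, sigma2=0.58):
    """Predict one sample: b + sum_i alpha_i * K(x, x_i)."""
    return b + sum(a * rbf_kernel(x, xi, sigma2) for a, xi in zip(alpha, X_train))
```

Training reduces to one linear solve instead of a quadratic program, which is the source of the speed advantage described in the text; in practice the inputs would normally be scaled before a kernel width such as σ² = 0.58 is meaningful.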

EXPERIMENTAL RESULTS AND ANALYSIS
Experimental results: Table 5 shows the correlation coefficients and mean square errors of prediction for available phosphorus and potassium contents in soil samples using the SPA-RBF and SPA-LS-SVM models. Fig. 3 shows the relation between the predicted and measured values of the prediction set samples for the RBF network prediction model built from the characteristic wavelength variables selected by SPA and the available phosphorus and potassium contents in soil. The determination coefficients of prediction for the prediction set of this model were 0.8330 and 0.7505 and the root mean square errors of prediction were 9.65 and 14.12, respectively. Fig. 4 shows the relation between the predicted and measured values of the 30 prediction set samples when SPA was used to select the optimal characteristic wavelength variables and LS-SVM was used for modeling. The determination coefficients of prediction were 0.8625 and 0.7843 and the root mean square errors of prediction were 8.67 and 13.42, respectively.
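For reference, the evaluation statistics quoted above can be computed as in the following sketch. The function names are hypothetical; both the correlation coefficient and the determination coefficient are shown because the text uses both terms, and the sketch does not assume which definition the authors applied.

```python
from math import sqrt
from statistics import mean

def rmsep(measured, predicted):
    """Root mean square error of prediction."""
    return sqrt(mean([(m - p) ** 2 for m, p in zip(measured, predicted)]))

def pearson_r(measured, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    mm, mp = mean(measured), mean(predicted)
    cov = sum((m - mm) * (p - mp) for m, p in zip(measured, predicted))
    var_m = sum((m - mm) ** 2 for m in measured)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / sqrt(var_m * var_p)

def r_squared(measured, predicted):
    """Determination coefficient: 1 - SS_residual / SS_total."""
    mm = mean(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mm) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot
```

A constant offset between predicted and measured values leaves the correlation coefficient at 1 but lowers the determination coefficient, so the two statistics are not interchangeable when models are compared.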
Experimental analysis: Because near-infrared spectral data for soil are massive, severe collinearity is present among the data points, network computation can easily fall into a local minimum, and parsing may be difficult. The model prediction results for available phosphorus and potassium contents in soil were satisfactory, as neither under-fitting nor over-fitting occurred in the two nonlinear models, indicating that spectroscopy is advisable for the prediction of available phosphorus and potassium contents. To analyze the reliability and adaptability of different models, the RBF and LS-SVM prediction models were compared on a comprehensive basis. The result showed that the difference between their determination coefficients and root mean square errors was not obvious and that both can meet the needs of practical application. However, LS-SVM, which uses a least squares linear system as its loss function, greatly reduces model complexity, increases computational speed and saves time. It is also better suited to real-time field measurement. In summary, LS-SVM features higher learning and prediction capabilities, so it can better solve the generalization problem of model prediction for available phosphorus and potassium contents in soil.
The prediction results for the available phosphorus and potassium contents in the soil samples were satisfactory in this test, probably because these contents were captured indirectly by the near-infrared spectrum through some active ingredients and minerals in the soil. The successful prediction was thus achieved by spectral modeling algorithms that rely on some kind of generative mechanism captured by the near-infrared signal. According to existing references, the characteristics of micronutrients such as available phosphorus, potassium, magnesium and calcium in soil, as well as the mechanisms behind their prediction from near-infrared spectra, are still unknown and need further discussion and study. At present, the overall model accuracy is not high, because the study was based on a variety of soil types, and the physical and chemical properties of different soils vary significantly with soil-forming factors such as climate, parent materials, topography and biology, as well as with human activities. The diversity of soil constituents and the unique spectral characteristics of each constituent give each soil spectrum its own features. In future studies, the accuracy of the model can be improved by, for example, designing a more stable light source, since light source stability is important for the accuracy and repeatability of the measurement system; taking additional factors such as ambient temperature and soil moisture into account; and increasing the number of modeling samples to improve the stability and reliability of the model.

CONCLUSION
After the original spectrum was preprocessed using the SG 1st derivative, SPA was used to determine the characteristic (i.e., effective) wavelengths that served as spectral data for the modeling set and prediction set. The following conclusions were drawn from the study on measurement of available phosphorus and potassium contents in soil using SPA in conjunction with LS-SVM:
• Differences in sampling height significantly affected the spectral data; a sampling height of 10 cm gave the highest accuracy. The handling of abnormal samples and the division of samples into modeling and prediction sets also had a substantial impact on model accuracy.
• Of the several methods for selecting characteristic wavelengths, the SPA method extracted a few columns of data from a massive amount of raw spectral data that summarize most of the sample spectral information, thus avoiding information overlap, greatly reducing the computational complexity of calibration modeling and effectively shortening the time required for modeling.
• Comparison of the prediction results of the two nonlinear modeling methods showed that SPA-LS-SVM gave the best prediction result, with determination coefficients of prediction of 0.8625 and 0.7843 and root mean square errors of prediction of 8.67 and 13.42, respectively. The model was accurate enough to basically meet the requirements of actual measurement.
• It is feasible to employ spectral analysis to rapidly measure available phosphorus and potassium contents in soil. However, the soil spectrum is complex, and the diffuse reflectance spectrum of soil is affected by soil quality, topography and the environment. Further study is still required to reduce and eliminate these interferences and to create prediction models with better universality and robustness.

Fig. 4: Scatter plot by using SPA-LS-SVM models; (a): available P, (b): available K

Table 2: Different spectral preprocessing methods of model prediction results

Table 3: Comparison of results in calibration model with different height to soil surface

Table 4: Comparison of parameters in calibration model with different variable selection methods