Principal component analysis (PCA) is a multivariate statistical tool that summarizes the information content of a large dataset by means of a smaller set of summary indices called principal components. It has been widely used for pattern recognition, signal processing and factor analysis. In this study, the PCA was used to check for multicollinearity and outliers in the dataset and, finally, to develop models with the principal components.
The PCA showed that the first principal component explained 69% of the variation and the second 23%, so that together they accounted for 92% of the total variation in the data (Fig. 2).
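The explained-variance percentages quoted above come from the variance carried by each principal component of the mean-centred data. As a minimal sketch (not the software or the data used in this study), the snippet below computes the per-component and cumulative explained variance with NumPy. The synthetic matrix `X`, its dimensions and the function name `explained_variance_ratio` are illustrative assumptions.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                    # mean-centre every variable (column)
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, largest first
    var = s ** 2                               # proportional to the PC variances
    return var / var.sum()

# Synthetic stand-in for a spectra matrix (samples x wavenumbers); NOT the study data.
rng = np.random.default_rng(0)
scores = rng.normal(size=(40, 2)) * np.array([3.0, 1.5])
loadings = rng.normal(size=(2, 200))
X = scores @ loadings + 0.1 * rng.normal(size=(40, 200))

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
print("PC1 and PC2:", ratio[:2], "cumulative after PC2:", cumulative[1])
```

Reading the cumulative value after the second component corresponds to the "both together" figure reported for Fig. 2.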
The score plot of the PCA, also known as the map of samples, showed that the samples were scattered without following any pattern, which indicated that there was no multicollinearity in the data (Fig. 2). Therefore, the spectroscopic data were suitable for model development.
An influence plot is commonly used to diagnose whether a dataset contains any outliers that might adversely affect the developed models. Fig. 3 shows that there were no outliers in the FT-NIR spectroscopic data of the agricultural residues, as no sample was plotted beyond the red lines.
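An influence plot typically places each sample by how strongly it pulls on the model (leverage) and how poorly the model reconstructs it (residual). The sketch below is one common way of obtaining these two quantities from a PCA model. It is an assumption about how such a plot may be built, not the procedure or the critical limits behind Fig. 3, and the function name `influence_stats` and the synthetic data are hypothetical.

```python
import numpy as np

def influence_stats(X, n_components):
    """Sample leverage and squared reconstruction residual for a PCA model."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # scores
    P = Vt[:n_components].T                      # loadings
    # Leverage: 1/n plus each sample's share of the sum of squares of every score vector
    leverage = 1.0 / len(X) + np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)
    # Residual: what the retained components fail to reconstruct
    q_resid = np.sum((Xc - T @ P.T) ** 2, axis=1)
    return leverage, q_resid

# Synthetic spectra with one deliberately corrupted sample (index 7); NOT the study data.
rng = np.random.default_rng(1)
X = (rng.normal(size=(40, 2)) * np.array([3.0, 1.5])) @ rng.normal(size=(2, 200))
X += 0.1 * rng.normal(size=X.shape)
X[7] += rng.normal(scale=1.0, size=200)

n_components = 2
leverage, q_resid = influence_stats(X, n_components)
print("largest residual at sample", int(np.argmax(q_resid)))
```

Plotting `q_resid` against `leverage` and adding critical limits would give a plot of the kind described; samples falling beyond the limits would be candidate outliers.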
Not all parts of the spectral data of a sample are equally informative. Therefore, to find the most informative section, the full spectra were divided into three parts and calibration models were developed with both the full spectral data and the segmented ranges. The efficiency of the two calibration techniques, PCR and PLSR, was assessed with the full-range data (8000-4000 cm-1) and the partial datasets 8000-7000, 7000-5500 and 5200-4060 cm-1. Calibrations were performed separately with the calibration dataset and with 6-fold cross-validation (CV) datasets. The coefficient of multiple determination (R2) is one of the parameters used to test model efficiency, and the comparisons here are made in terms of R2.
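To make the distinction between calibration R2 and cross-validated R2 concrete, the following is a minimal NumPy sketch of principal component regression evaluated both ways. It assumes a single response variable and a random 6-fold split. The function names, the fold assignment and the synthetic data are illustrative assumptions, not the software settings used in this study.

```python
import numpy as np

def r2_score(y, y_hat):
    """Coefficient of determination."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def pcr_fit(X, y, n_components):
    """Regress y on the first principal-component scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                                # loadings
    q, *_ = np.linalg.lstsq(Xc @ P, y - y_mean, rcond=None)
    return P @ q, x_mean, y_mean                           # coefficients in the original space

def pcr_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

def cross_val_r2(X, y, n_components, k=6, seed=0):
    """R2 of predictions made for each fold by a model fitted on the other folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    y_cv = np.empty(len(y), dtype=float)
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        y_cv[fold] = pcr_predict(pcr_fit(X[train], y[train], n_components), X[fold])
    return r2_score(y, y_cv)

# Synthetic data standing in for spectra and a reference value; NOT the study data.
rng = np.random.default_rng(0)
scores = rng.normal(size=(60, 3)) * np.array([3.0, 2.0, 1.0])
X = scores @ rng.normal(size=(3, 100)) + 0.05 * rng.normal(size=(60, 100))
y = scores @ np.array([1.0, -0.5, 2.0]) + 0.1 * rng.normal(size=60)

r2_cal = r2_score(y, pcr_predict(pcr_fit(X, y, 3), X))
r2_cv = cross_val_r2(X, y, 3, k=6)
print("calibration R2:", r2_cal, "6-fold CV R2:", r2_cv)
```

Because the cross-validated predictions come from models that never saw the predicted samples, a large gap between the two values is a sign that the calibration does not generalize.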
Among the full range and the three segmented ranges of spectral data, the most informative range was 7000-5500 cm-1 (Table 1) for the prediction models of lignin, holocellulose, α-cellulose, pentosan, extractive and ash in different agricultural wastes, which are potential raw materials for the pulp industry. In another study, the NIR range 7502-4246 cm-1 was found to be the most informative region for predicting holocellulose in plantation timber (Kothiyal et al., 2015).
| Parameter | Dataset | Full range (8000–4000 cm-1) PCR | Full range (8000–4000 cm-1) PLSR | 8000–7000 cm-1 PCR | 8000–7000 cm-1 PLSR | 7000–5500 cm-1 PCR | 7000–5500 cm-1 PLSR | 5200–4060 cm-1 PCR | 5200–4060 cm-1 PLSR |
|---|---|---|---|---|---|---|---|---|---|
| Lignin | Calibration | 0.242 | 0.262 | 0.228 | 0.237 | 0.472 | 0.806 | 0.663 | 0.778 |
| Lignin | CV | 0.141 | 0.092 | 0.150 | 0.096 | 0.301 | 0.373 | 0.178 | 0.404 |
| Holocellulose | Calibration | 0.746 | 0.882 | 0.006 | 0.016 | 0.706 | 0.905 | 0.909 | 0.909 |
| Holocellulose | CV | 0.215 | 0.473 | 0.000 | 0.000 | 0.503 | 0.661 | 0.657 | 0.691 |
| α-cellulose | Calibration | 0.781 | 0.745 | 0.193 | 0.569 | 0.741 | 0.921 | 0.878 | 0.879 |
| α-cellulose | CV | 0.271 | 0.208 | 0.000 | 0.057 | 0.499 | 0.764 | 0.378 | 0.426 |
| Pentosan | Calibration | 0.014 | 0.016 | 0.001 | 0.019 | 0.678 | 0.628 | 0.561 | 0.527 |
| Pentosan | CV | 0.000 | 0.000 | 0.000 | 0.000 | 0.361 | 0.251 | 0.175 | 0.078 |
| Extractive | Calibration | 0.006 | 0.032 | 0.005 | 0.022 | 0.263 | 0.279 | 0.016 | 0.035 |
| Extractive | CV | 0.000 | 0.000 | 0.000 | 0.000 | 0.062 | 0.094 | 0.000 | 0.000 |
| Ash | Calibration | 0.580 | 0.549 | 0.392 | 0.399 | 0.856 | 0.856 | 0.720 | 0.834 |
| Ash | CV | 0.329 | 0.190 | 0.141 | 0.203 | 0.493 | 0.502 | 0.425 | 0.476 |

Notes: values are R2. PCR, Principal Component Regression; PLSR, Partial Least Square Regression; CV, cross validation.
Table 1. Range selection of FT-NIR data.
Of the two most popular calibration techniques, PCR and PLSR, the latter showed better prediction efficiency with both the calibration and cross-validation datasets in the most informative range (7000-5500 cm-1, Table 1) for every parameter except pentosan. For pentosan, the R2 values were very close, but the PCR (R2 = 68%) gave a slightly better result than the PLSR (R2 = 63%). The two techniques performed similarly closely for the ash prediction model.
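The difference between the two techniques lies in how the latent factors are chosen: PCR takes the directions of largest variance in the spectra, whereas PLSR takes directions that covary with the response. The sketch below contrasts the two for a single response using a basic NIPALS-style PLS1. It is an illustration on synthetic data under stated assumptions, not the implementation used in this study, and all function and variable names are hypothetical.

```python
import numpy as np

def r2(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def pcr_coef(Xc, yc, n_components):
    """PCR: regress on the leading principal-component scores."""
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    q, *_ = np.linalg.lstsq(Xc @ P, yc, rcond=None)
    return P @ q

def pls1_coef(Xc, yc, n_components):
    """PLS1: each weight vector points along the covariance between X and y."""
    E, f = Xc.copy(), yc.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)
        t = E @ w                      # score vector
        p = E.T @ t / (t @ t)          # X loading
        c = (f @ t) / (t @ t)          # y loading
        E -= np.outer(t, p)            # deflate X
        f -= c * t                     # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

# Synthetic case: the high-variance direction (t1) is unrelated to y,
# while a low-variance direction (t2) drives it. NOT the study data.
rng = np.random.default_rng(0)
t1, t2 = 5.0 * rng.normal(size=50), rng.normal(size=50)
X = np.outer(t1, rng.normal(size=80)) + np.outer(t2, rng.normal(size=80))
X += 0.05 * rng.normal(size=X.shape)
y = 2.0 * t2 + 0.1 * rng.normal(size=50)
Xc, yc = X - X.mean(axis=0), y - y.mean()

r2_pcr_1 = r2(yc, Xc @ pcr_coef(Xc, yc, 1))
r2_pls_1 = r2(yc, Xc @ pls1_coef(Xc, yc, 1))
r2_pcr_2 = r2(yc, Xc @ pcr_coef(Xc, yc, 2))
r2_pls_2 = r2(yc, Xc @ pls1_coef(Xc, yc, 2))
print("1 factor:", r2_pcr_1, r2_pls_1, "| 2 factors:", r2_pcr_2, r2_pls_2)
```

This comparison holds the number of factors equal and looks only at the calibration fit; it says nothing by itself about cross-validated performance, which has to be checked separately.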
Spectral data contain noise and unwanted background information, which should be reduced or eliminated to improve the robustness of the calibration techniques used in this study. Therefore, the FT-NIR spectral data were subjected to four pretreatments: mean normalization, de-trending, de-trending combined with Savitzky-Golay (S-G) smoothing, and de-trending combined with the 2nd derivative. As before, the efficiency of the PCR and PLSR was assessed for both the calibration and cross-validation datasets.
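The pretreatments named above can be sketched in a few lines of NumPy. The definitions below are assumptions chosen for illustration: mean normalization is taken as dividing each spectrum by its own mean, de-trending as subtracting a fitted polynomial baseline, and the S-G filter as a local least-squares polynomial fit on equally spaced points. The window width, polynomial orders and function names are hypothetical and are not the settings used in this study.

```python
import numpy as np
from math import factorial

def mean_normalize(X):
    """Divide every spectrum (row) by its own mean value."""
    return X / X.mean(axis=1, keepdims=True)

def detrend(X, order=1):
    """Subtract a polynomial baseline fitted to each spectrum."""
    x = np.arange(X.shape[1], dtype=float)
    coefs = np.polyfit(x, X.T, order)            # one column of coefficients per spectrum
    return X - (np.vander(x, order + 1) @ coefs).T

def savgol_coeffs(window, polyorder, deriv=0):
    """Savitzky-Golay weights for the centre point of an odd-length window."""
    half = window // 2
    A = np.vander(np.arange(-half, half + 1, dtype=float), polyorder + 1, increasing=True)
    return np.linalg.pinv(A)[deriv] * factorial(deriv)

def savgol(X, window=11, polyorder=2, deriv=0):
    """Smooth (deriv=0) or differentiate each spectrum; edge points are dropped."""
    c = savgol_coeffs(window, polyorder, deriv)
    return np.apply_along_axis(lambda s: np.convolve(s, c[::-1], mode="valid"), 1, X)

# Check on simple curves with known answers (unit point spacing assumed).
x = np.arange(50.0)
X = np.vstack([x ** 2, 3.0 * x + 1.0])
d2 = savgol(X, window=11, polyorder=2, deriv=2)   # second derivative
smooth = savgol(X, window=11, polyorder=2)        # smoothing only
flat = detrend(X[1:], order=1)                    # straight line minus its own trend
normed = mean_normalize(X + 1.0)
print(d2[:, :3], flat[:, :3])
```

Combined pretreatments correspond to chaining the calls, for example `savgol(detrend(X), deriv=2)` for de-trending followed by the 2nd derivative. Derivatives amplify high-frequency noise, so the window width and polynomial order matter for how stable the resulting calibration is.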
As shown in Table 2, the calibration model for the prediction of α-cellulose in agricultural residues performed best when the FT-NIR spectral data were preprocessed with mean normalization, both with the calibration dataset (R2 = 94%) and the cross-validation dataset (R2 = 79%). For the lignin prediction model, de-trending combined with Savitzky-Golay (S-G) smoothing showed the best results (R2 = 83%). However, a severe effect was observed when the data were pretreated with de-trending and the 2nd derivative combined. Here, the PLSR R2 was extremely high for the calibration dataset (R2 ≈ 99% for holocellulose, α-cellulose and ash) but very low for the cross-validation dataset, which made the models too unstable for predicting these parameters from FT-NIR spectral data. In solid wood, calibration performance without any pre-treatment was reported to be poor in terms of R2 for the prediction of α-cellulose (65%-82%) and lignin (52%-95%) (Yeh et al., 2005).
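| Parameter | Dataset | Mean normalization PCR | Mean normalization PLSR | De-trending PCR | De-trending PLSR | De-trending + smoothing (S-G) PCR | De-trending + smoothing (S-G) PLSR | De-trending + 2nd derivative PCR | De-trending + 2nd derivative PLSR |
|---|---|---|---|---|---|---|---|---|---|
| Lignin | Calibration | 0.408 | 0.842 | 0.136 | 0.831 | 0.136 | 0.829 | 0.340 | 0.412 |
| Lignin | CV | 0.148 | 0.392 | 0.061 | 0.243 | 0.052 | 0.338 | 0.141 | 0.120 |
| Holocellulose | Calibration | 0.758 | 0.772 | 0.633 | 0.915 | 0.633 | 0.912 | 0.789 | 0.998 |
| Holocellulose | CV | 0.275 | 0.488 | 0.210 | 0.525 | 0.235 | 0.501 | 0.111 | 0.123 |
| α-cellulose | Calibration | 0.921 | 0.937 | 0.841 | 0.903 | 0.841 | 0.902 | 0.863 | 0.998 |
| α-cellulose | CV | 0.789 | 0.785 | 0.734 | 0.724 | 0.772 | 0.786 | 0.199 | 0.292 |
| Pentosan | Calibration | 0.639 | 0.658 | 0.511 | 0.466 | 0.510 | 0.466 | 0.566 | 0.976 |
| Pentosan | CV | 0.413 | 0.360 | 0.253 | 0.201 | 0.272 | 0.254 | 0.085 | 0.096 |
| Extractive | Calibration | 0.250 | 0.259 | 0.124 | 0.157 | 0.124 | 0.157 | 0.072 | 0.346 |
| Extractive | CV | 0.123 | 0.164 | 0.009 | 0.00 | 0.000 | 0.000 | 0.000 | 0.000 |
| Ash | Calibration | 0.604 | 0.583 | 0.792 | 0.803 | 0.793 | 0.472 | 0.443 | 0.999 |
| Ash | CV | 0.363 | 0.283 | 0.412 | 0.422 | 0.419 | 0.352 | 0.343 | 0.340 |

Notes: values are R2. PCR, Principal Component Regression; PLSR, Partial Least Square Regression; CV, cross validation.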
Table 2. Pre-treatment of FT-NIR spectral data.
For the prediction of lignin, the PLSR calibration model performed better than the PCR when the FT-NIR spectral data were de-trended and smoothed with Savitzky-Golay (S-G) filtering (R2 = 83%) (Fig. 4). The best-performing model required seven factors.
Holocellulose could be predicted from raw spectral data in the range 7000-5500 cm-1 with the PLSR (R2 = 91%) (Fig. 5). Here, the best-performing model needed seven factors.
To predict α-cellulose in non-wood samples, especially agricultural residues, the PLSR performed better than the PCR with FT-NIR spectral data in the range 7000-5500 cm-1 when they were preprocessed with mean normalization (R2 = 94%) (Fig. 6). The model required seven factors to perform best. The result obtained in this study for predicting α-cellulose was very close to that of Uddin et al. (2017) for Dhaincha samples (R2 = 95%) and Jute samples (R2 = 99%), where the FT-NIR spectral data were pre-processed with standard normal variate (SNV) and calibrated with the PLSR and an artificial neural network (ANN), respectively.
For pentosan, the PCR performed better than the PLSR with raw spectral data in the range 7000-5500 cm-1 (Fig. 7). The model consisted of the first six principal components (PCs), but it could not be recommended for practical use because its predictive performance was inadequate (R2 = 68%).
Lastly, the performance of the PCR and the PLSR was nearly the same for the prediction of ash in the samples, although slightly better results could be obtained from the latter with raw spectral data in the range 7000-5500 cm-1 (R2 = 86%) (Fig. 8). Here, the PLSR model was developed with the first six factors. The model was more stable than the models obtained under the other conditions.
The results obtained in this study were much better than those of Huang et al. (2010), who used NIR for rice straw samples and obtained, with the PLSR, R2 > 0.80 for cellulose, 0.60 < R2 < 0.70 for lignin, and R2 < 0.70 for hemicelluloses. Another study, in which Eucalyptus globulus was analysed by multivariate linear regression (MLR), reported R2 values ranging from 0.67 to 0.87 (Poke and Raymond, 2006).
A study by Li et al. (2015) on moso bamboo reported best R2 values of 0.91 for hemicelluloses, 0.98 for cellulose, and 0.94 for lignin, using NIR spectral data with the ANN for calibration.