Denoising and Principal Component Analysis of Amplified Raman Spectra from Red Blood Cells with Added Silver Nanoparticles

Incubated erythrocytes with and without silver nanoparticles (AgNP) were analyzed by Raman spectroscopy, resulting in two Raman spectra datasets. AgNP were added to red blood cells (RBC) in order to enhance the Raman signals, a technique known as surface-enhanced Raman scattering (SERS). The Raman spectra with and without AgNP were compared to test whether SERS had taken place. Because Raman and SERS spectra are cumbersome to interpret due to the noise they contain, we applied denoising criteria to detect and remove noise such as cosmic rays, shot noise, and the fluorescence contribution. After this, principal component analysis (PCA) was performed in order to reduce the dimensions of the spectra being studied. Only the key components necessary for a better interpretation of these spectra were considered. All of the noise had to be removed prior to the statistical analysis, to make sure the analysis was really based on the Raman measurements and not on other effects. As a result, the RBC Raman spectra with and without AgNP were denoised, which improved their resolution for better signal reading and data interpretation. Also, the first principal components (PC) were selected from each dataset under scrutiny, based on the weight of their information and the readability of their spectra. In conclusion, we were able to represent the given reference system with a more affordable and smaller dimension in which information loss was minimal.


Introduction
The use of Raman spectroscopy for biological purposes has increased the need to interpret the large datasets that result from applying this spectroscopic technique, especially due to its high sensitivity to subtle molecular changes. Although Raman spectroscopy is a powerful technique, its signals are inherently weak in biological systems. In order to record these weak signals, SERS [1][2][3] can be used. This technique increases the signal by several orders of magnitude; the enhancement factor can be as high as $10^{10}$ or $10^{11}$.
As Raman and SERS spectra are cumbersome to interpret because of the noise they contain, criteria for detecting and removing this noise are necessary for better data interpretation. The noises that corrupt the data are very common: cosmic rays or "spikes" [4] are produced by external sources, "shot" noise [4] is a result of the random nature of light, and fluorescence [4] is generated by the sample itself. All these noises affect the real data from RBC Raman spectra with and without AgNP [5], masking bands that contain valuable information, and must be removed.
After denoising the data, it is necessary to correct the variability and reject redundant variables. To do that, we need statistical/chemometrical methods [6]. Multivariate analysis techniques allow the processing and interpretation of multiple spectra amplified by AgNP (SERS reporters) [7] in RBC, each associated with a point or cell. For a better data analysis, spectra should not show any signs of noise.
In this work, we apply denoising techniques and PCA [6,8] as an alternative mathematical-statistical technique to improve the reading of relevant Raman signals (e.g., SERS), for a better interpretation of biological systems. Here, we propose an original theoretical/experimental method to enhance Raman spectra resolution in biological systems, making it possible to obtain inferences and predictions from the resulting data that help with disease diagnosis and treatment.

Materials and Methods
2.1. Experiment. Certain noises were removed from, and a multivariate analysis was applied to, Raman spectra from an experiment that consisted of the incubation of erythrocytes with and without AgNP. Fresh cells obtained by blood extraction from a healthy volunteer were used. Two Raman spectra datasets were analyzed: the first one without AgNP, used as cell control, and a second one with the nanoparticles added to enhance the Raman signal.
Cell control: a 10 μL aliquot of blood was diluted 1 : 100 (V : V) in buffer phosphate saline (BPS). Figure 1 shows a spectrum of RBC without AgNP.
Cells with AgNP: a small aliquot of blood was diluted 1 : 100 (V : V) in BPS solution. The AgNP solution was added in a final dilution of 1 : 10. Figure 2 shows an RBC spectrum with AgNP added.
The samples were incubated for 2 h at 25 ± 4 °C. Cell samples were then prepared for Raman spectrum measurement. A drop of the cell suspension was added to a silicon substrate previously coated with poly-L-lysine. The cells on the substrate were allowed to settle for 15 minutes to promote the immobilization of the RBC for measurement.

2.2. Instrumentation and Data Acquisition. Measurements were performed with a Raman microspectrometer (LabRAM HR, Horiba), equipped with a refrigerated CCD camera. Multiple scans were conducted, and each cell was irradiated five times at different points. The excitation wavelength was 488 nm, with a 100x objective, in "nonconfocal" conditions (1000 μm aperture). Each measurement was performed with a single exposure of 20 seconds, and spectra were collected in the 2500-550 cm⁻¹ range.
The Raman spectra were acquired with LabSpec software; Figure 3 shows 94 spectra of RBC without AgNP, which were taken as control in the experiment, and Figure 4 shows 89 RBC spectra with AgNP, to be analyzed and compared with the control spectra. The spectra denoising was performed using self-created software named My Raman Data Processing (MyRDP). The multivariate analysis was implemented using the Wolfram Mathematica 9 software.
Journal of Nanomaterials
MyRDP was developed in NetBeans IDE 7.0.1, which is a free integrated development environment made mainly for the Java programming language. This data treatment was programmed with the needs of this experiment in mind, always trying to yield the greatest possible robustness and to allow user-friendly interaction. Only a specific region of the spectra is needed, so the program allows the user to select the region of interest, which here ranges from 1800 cm⁻¹ to 599 cm⁻¹ (it may be different depending on what is being looked for). After the range of interest was selected, we moved on to the detection and elimination of spikes (cosmic rays).

2.3.1. Cosmic Rays (Spikes). An external nonoptical source of noise is generated by high-energy particles such as cosmic rays, which reach the equipment detector. Cosmic rays release many electrons that are indistinguishable from photoelectrons. Electrons generated by cosmic rays are concentrated in one or two of the elements of the detector. The result is a very narrow peak of high intensity in the Raman scattering spectrum. These peaks occur infrequently, at random times and at random positions in the spectrum. It is difficult to confuse a cosmic peak with a Raman band, because of the disparity in their characteristics, unless several cosmic rays fall into adjacent pixels of the detector, which is unlikely [4].
Cosmic rays or spikes are characterized by being intense and narrow peaks, generally three or four points with intensities n times greater than the adjoining points. It is known that the hemoglobin (Hb) spectrum shows intense peaks in the 1380-1330 cm⁻¹ and 1630-1530 cm⁻¹ regions. As the sample was measured on a silicon substrate, a peak was also found around 960 cm⁻¹, in the 990-930 cm⁻¹ region. Knowing where the characteristic bands are is very useful, because if a spike is detected on one of these bands, the whole spectrum is discarded.
Once this is known, a spike can be detected by scanning the spectrum point by point. At each point, the absolute value of the difference between the current point and the next point is checked; if it exceeds the value of 500, a spike begins at the next point. The elimination process is as follows. The current point, which is the last point before the spike begins, is subtracted from each of the following points; the number of consecutive points for which this difference is greater than 500 gives the spectral width of the spike, called "n" (it can be between 1 and 4 points). After the width of the spike has been determined, the point where the spike begins becomes the current point, and its value is replaced by the average of the point n positions before it and the point n positions after it. This replacement is repeated n times, once for each point of the spike.
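The spike criterion above can be sketched in a few lines of Python. This is only an illustration of the described rule, not the MyRDP code (which was written in Java and is not reproduced in the text); the function name `remove_spikes`, its parameters, and the clamping at the ends of the spectrum are assumptions.

```python
def remove_spikes(intensity, threshold=500.0, max_width=4):
    """Return a copy of `intensity` with narrow spikes replaced by local averages.

    A spike starts where two neighbouring points differ by more than
    `threshold`; its width n (1..max_width) is the number of consecutive
    points that differ from the last pre-spike point by more than `threshold`.
    """
    out = list(intensity)
    last = len(out) - 1
    i = 0
    while i < last:
        if abs(out[i + 1] - out[i]) > threshold:
            base = out[i]      # last point before the spike
            start = i + 1      # first point of the spike
            n = 0
            while (start + n <= last and n < max_width
                   and abs(out[start + n] - base) > threshold):
                n += 1
            # Replace each spike point by the mean of the points n positions
            # before and n positions after it (indices clamped to the data,
            # which is an assumption for spikes near the edges).
            for k in range(start, start + n):
                left = out[max(k - n, 0)]
                right = out[min(k + n, last)]
                out[k] = 0.5 * (left + right)
            i = start + n
        else:
            i += 1
    return out


# Synthetic example: a two-point spike on a flat background.
spectrum = [100, 102, 101, 5000, 5200, 103, 99, 100, 101]
print(remove_spikes(spectrum))
```

In this sketch the two spike points are replaced by averages of their unaffected neighbours, while all other points are left untouched. A check that discards the whole spectrum when the spike falls on a characteristic band, as described above, would be applied before the replacement.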

2.3.2. Shot (Smooth: FFT Filter). Shot noise is the result of the random nature of light. Its intensity is equal to the square root of the number of photons detected. It is an inevitable source of noise in the measurement of Raman spectra [4].
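As a small worked illustration of the square-root law, the snippet below computes the shot noise and the resulting signal-to-noise ratio for a given photon count. The function names are hypothetical and the counts are arbitrary numbers chosen for the example, not measured values.

```python
import math


def shot_noise(photon_count):
    """Shot-noise level for a positive number of detected photons."""
    return math.sqrt(photon_count)


def signal_to_noise(photon_count):
    """Signal divided by shot noise, which also equals sqrt(photon_count)."""
    return photon_count / shot_noise(photon_count)


for counts in (10_000, 40_000):
    print(counts, shot_noise(counts), signal_to_noise(counts))
```

Quadrupling the number of detected photons only doubles the signal-to-noise ratio, which is why weak Raman bands remain noisy and benefit from smoothing.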
Shot noise can be eliminated from spike-free data. So, we smoothed the spectra by implementing the fast Fourier transform (FFT) filter method [9]. The algorithm places some limitations on the signal and on the resulting spectrum; for example, the sampled signal to be transformed must consist of a number of samples equal to a power of two [10].
The idea that allows this optimization is the decomposition of the transform into transforms of 2 elements, where the index "k" can take the values 0 and 1. Once the simplest transforms have been resolved, they must be grouped into higher-level transforms, which must be solved again, and so on until the highest level is reached. At the end of this process, the results obtained must be rearranged. The fast Fourier transform algorithm was popularized by Cooley and Tukey in 1965 [11].
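The decomposition into two-element transforms can be sketched as a recursive radix-2 FFT. This is a generic textbook version written for illustration, not the implementation used in MyRDP; the helper `dft` is a direct evaluation of the discrete Fourier transform included only for comparison.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("number of samples must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])   # transform of the even-indexed samples
    odd = fft(x[1::2])    # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle            # group the two half-size results
        out[k + n // 2] = even[k] - twiddle
    return out


def dft(x):
    """Direct O(N^2) discrete Fourier transform, used here as a reference."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
            for k in range(n)]


signal = [1, 2, 3, 4, 0, -1, -2, -3]
print(max(abs(a - b) for a, b in zip(fft(signal), dft(signal))))
```

At the lowest level each transform has two elements, for example `fft([3, 5])` gives the sum and the difference of the two samples, which corresponds to k taking the values 0 and 1. In this recursive form the rearrangement of the results is implicit in the even/odd splitting.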
This smoothing consists of six steps, the first of which is (i) calculation of the mean value of the first 1% and of the last 1% of the spectrum data. The FFT used in this procedure is
$$F_k = \sum_{i=0}^{N-1} x_i \, e^{-2\pi\sqrt{-1}\,k i/N},$$
where $x_i$ is a sequence of length N. The IFFT is
$$x_i = \frac{1}{N}\sum_{k=0}^{N-1} F_k \, e^{2\pi\sqrt{-1}\,k i/N}.$$
Low-pass parabolic filter: let $f_{c1}$ be the pass frequency and $f_{c2}$ be the stop frequency. The window function $w(f)$ is expressed in terms of these two frequencies.
2.3.3. Fluorescence (Spectra Normalization). The noise generated by the sample includes unwanted optical and sample-generated emissions, as is the case of fluorescence. This phenomenon occurs when a photon hits the molecule and is absorbed: the molecule goes to an electronic excited state, where it remains on the order of nanoseconds, and then jumps to another excited state of lower energy, releasing a photon of lower frequency than the incident one. In Raman spectra, fluorescence usually presents as a gentle curvature of the baseline, and it can reach an intensity that completely masks the Raman bands. The noise generated by the sample also includes Raman intensity changes due to non-concentration-related changes in the sample. Both the intensity and the position of the bands may vary depending on the temperature of the sample, although these changes tend to be small. The heterogeneity of the sample can also create noise, since the analysis performed at one point in the sample need not be representative of the whole sample [4].
Because the signals from different spectra do not share the same intensity level, we need to subtract the background. To homogenize the spectra, it is first checked whether the background is nonlinear. This can be determined by comparing two specific regions, which we defined as the lower flat region and the upper flat region. The lower flat region was 640-599 cm⁻¹ and the upper flat region was 1800-1750 cm⁻¹.
(1) The mean of the lower flat region (640-599 cm⁻¹) is compared with the mean of the upper flat region (1800-1750 cm⁻¹).
(i) If the average of the upper flat region (1800-1750 cm⁻¹) is greater than the difference between the averages of the upper and lower flat regions plus the average of the lower flat region, then the background is nonlinear.
(a) The correction consists of normalizing each spectrum by the equation of the line through the two points $(x_1, y_1)$ and $(x_2, y_2)$:
$$y - y_1 = \frac{y_2 - y_1}{x_2 - x_1}\,(x - x_1). \quad (4)$$
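The line-based correction can be sketched as follows. The text does not state exactly how the two points are chosen inside each flat region, so this sketch assumes that each point is the mean wavenumber and mean intensity of its region; the function names and default region limits are illustrative only.

```python
def region_mean(shift, intensity, lo, hi):
    """Mean Raman shift and mean intensity of the points with lo <= shift <= hi."""
    pts = [(x, y) for x, y in zip(shift, intensity) if lo <= x <= hi]
    if not pts:
        raise ValueError("no data points in the requested region")
    n = len(pts)
    return sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n


def subtract_linear_background(shift, intensity,
                               lower=(599.0, 640.0), upper=(1750.0, 1800.0)):
    """Subtract the straight line through the two flat-region points."""
    x1, y1 = region_mean(shift, intensity, *lower)
    x2, y2 = region_mean(shift, intensity, *upper)
    slope = (y2 - y1) / (x2 - x1)
    return [y - (y1 + slope * (x - x1)) for x, y in zip(shift, intensity)]


# Synthetic spectrum: one band at 1000 cm^-1 on a sloping background.
shift = [float(x) for x in range(599, 1801)]
raw = [0.5 * x + 20.0 + (100.0 if x == 1000.0 else 0.0) for x in shift]
flat = subtract_linear_background(shift, raw)
print(flat[0], flat[401], flat[-1])
```

After the subtraction the synthetic band keeps its height above a baseline close to zero, so spectra with different background levels become directly comparable.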
2.4. Multivariate Analysis. After denoising the spectra, we are ready to perform the PCA statistical analysis. This multivariate analysis was done with a program that was also self-made, in Wolfram Mathematica 9, following the PCA algorithm [6]. Multivariate analysis is one of the main techniques of data analysis when many variables must be considered at the same time for interpretation.
The first step in multivariate data analysis is to build the data matrix X, from which the mean vector $\bar{x}$ [8] can be computed. From there, we can obtain the covariance (dispersion) matrix and the correlation matrix, and then begin to obtain the principal components in order to typify the data. This typification process makes the variance of all variables the same and equal to 1.

2.4.1. Principal Component Analysis (PCA). PCA is a descriptive dimension-reduction technique, which seeks to study the relationships of interdependence between groups of variables and individual variables. This analysis has its origins in the work of Pearson published in 1901 in the Philosophical Magazine with the title "On Lines and Planes of Closest Fit to Systems of Points in Space" [12].
It allows the decomposition of the original data into a model consisting of a signal part and a noise part. The analysis is oriented to modeling the variance-covariance structure of a data matrix, from which the eigenvalues corresponding to the principal components are extracted. Each principal component is a linear combination of the original variables. The first principal component represents the largest variance and thus corresponds to the highest eigenvalue. The second principal component is orthogonal to the first one, and each successive principal component is orthogonal to all of the previous ones and accounts for a decreasing proportion of the variance [12].
In summary, in principal component analysis, a sample of size n on p variables $X_1, X_2, \ldots, X_p$, initially correlated, is used to obtain a number k ≤ p of uncorrelated variables $Z_1, Z_2, \ldots, Z_k$, which are linear combinations of the original variables and which explain most of their variability. The principal components will be as follows:
$$Z_j = a_{1j} X_1 + a_{2j} X_2 + \cdots + a_{pj} X_p,$$
where $a_j = (a_{1j}, \ldots, a_{pj})$ are the eigenvectors associated with the covariance matrix. Only the k principal components that explain a high percentage of the variability of the original variables are preserved [12].
Raman spectra are n coordinate vectors defined in what is known as the signal space of normalized wave numbers:
$$E_i = (e_{i1}, e_{i2}, \ldots, e_{ip}), \quad i = 1, \ldots, n,$$
where $e_{ij}$ are the Raman intensities of the spectrum $E_i$ for wave number j (Raman shift).
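The routine for analyzing the Raman spectra was implemented in Wolfram Mathematica 9 and is as follows:
(1) Creation of the data matrix E with coordinates $e_{ij}$:
$$E = \begin{pmatrix} e_{11} & \cdots & e_{1p} \\ \vdots & \ddots & \vdots \\ e_{n1} & \cdots & e_{np} \end{pmatrix}. \quad (6)$$
The components of the mean vector of E are
$$\bar{e}_j = \frac{1}{n}\sum_{i=1}^{n} e_{ij}, \quad (8)$$
where $\bar{e}_j$ is the mean of each variable.
Then, it is necessary to obtain the standard deviation of each variable:
$$\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(e_{ij} - \bar{e}_j\right)^2}, \quad (9)$$
since typification consists of subtracting the mean value from each variable and dividing the result by the standard deviation, thus obtaining new values for the data matrix:
$$x_{ij} = \frac{e_{ij} - \bar{e}_j}{\sigma_j}, \quad (10)$$
where the typified data matrix will be $X = (x_{ij})$. We calculated the mean and the variance of the typified data to make sure that the typification was successful. These must be 0 and 1, respectively.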
(3) After typifying the data and forming a new data matrix X with them, we proceed to obtain the covariance matrix S, which will be given by
$$S = \frac{1}{n} X^{T} X,$$
where the components of S are proportional to the correlation coefficients, because the typification process does not affect the correlation between the variables.
So, we can say that the covariance matrix S is proportional to the correlation matrix R of the original or typified variables. We are now ready for the principal component analysis.
(4) We proceed to obtain the eigenvalues and eigenvectors.
The eigenvalues are the roots of the equation
$$\det(S - \lambda I) = 0.$$
The eigenvectors $a_i$ associated with the eigenvalues $\lambda_i$ are the result of solving the equation
$$(S - \lambda_i I)\, a_i = 0.$$
(5) Once the eigenvectors have been obtained in the matrix a, we proceed to extract the principal components:
$$Z = X a.$$
For the retention of the principal components, the most logical criterion is to retain the smallest number of components that explain a certain percentage of the variability, which is generally between 80% and 90% of the total. Another criterion would be to preserve the components whose eigenvalues are greater than or equal to 1 [6]. Finally, it is up to the researcher to decide what the best choice is.
The variability percentages were obtained in the following way:
$$\%\,\mathrm{Var}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j} \times 100.$$
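The routine above can be sketched in plain Python. This is not the Mathematica program used in this work: power iteration with deflation is swapped in for an eigen-solver so that the example needs no external libraries, the 1/n normalization is an assumption, and the small data matrix is made up for illustration. For real spectra with hundreds of variables a numerical library would be the practical choice.

```python
import math


def standardize(E):
    """Typify the columns of E (rows = spectra) to mean 0 and variance 1."""
    n, p = len(E), len(E[0])
    means = [sum(row[j] for row in E) / n for j in range(p)]
    # Assumes no column is constant (the standard deviation must be non-zero).
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in E) / n)
           for j in range(p)]
    return [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in E]


def covariance(X):
    """S = X^T X / n for already typified data."""
    n, p = len(X), len(X[0])
    return [[sum(row[a] * row[b] for row in X) / n for b in range(p)]
            for a in range(p)]


def top_eigenpairs(S, k, iters=1000):
    """Largest k eigenpairs of a symmetric matrix (power iteration + deflation)."""
    p = len(S)
    A = [row[:] for row in S]
    pairs = []
    for _ in range(k):
        v = [1.0 / (j + 1) for j in range(p)]      # arbitrary start vector
        for _step in range(iters):
            w = [sum(A[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = math.sqrt(sum(t * t for t in w))
            if norm == 0.0:
                break
            v = [t / norm for t in w]
        lam = sum(v[a] * sum(A[a][b] * v[b] for b in range(p)) for a in range(p))
        pairs.append((lam, v))
        # Deflation: remove the component just found.
        A = [[A[a][b] - lam * v[a] * v[b] for b in range(p)] for a in range(p)]
    return pairs


def variability_percent(eigenvalues, total):
    """Percentage of the total variance explained by each eigenvalue."""
    return [100.0 * lam / total for lam in eigenvalues]


def scores(X, vector):
    """Principal component scores Z = X a for one eigenvector a."""
    return [sum(x * a for x, a in zip(row, vector)) for row in X]


# Made-up data: 6 "spectra" with 3 variables, two of them strongly correlated.
E = [[1, 2.0, 1], [2, 4.1, 0], [3, 6.2, 1], [4, 7.9, 0], [5, 10.1, 1], [6, 12.0, 0]]
X = standardize(E)
S = covariance(X)
pairs = top_eigenpairs(S, 3)
total = sum(S[j][j] for j in range(len(S)))        # trace = total variance
print(variability_percent([lam for lam, _ in pairs], total))
print(scores(X, pairs[0][1]))
```

Because the data are typified, the total variance equals the number of variables, so the percentages can be read directly as shares of that total.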

Results and Discussion
An intensity difference is visible between the spectra in Figures 3 and 4. The Raman signal increment due to the presence of AgNP favors a better resolution of all peaks, especially those with low intensity in the RBC spectra of control cells. This increment in the signal-to-noise ratio favors the use of this technique for obtaining bioimages.
3.1. Denoising Raman Spectra. Next, we analyzed the Raman spectra of RBC incubated with and without AgNP in order to remove the specified noises, as described in Section 2.3. In Figures 6-9, we can observe the denoising process step by step. The order in which the noises are removed matters, because if we remove shot noise before removing spikes, the spikes are smoothed and become masked as Raman bands [13]. Figure 6 shows the region of interest, 1800-599 cm⁻¹, which is full of noise; we can clearly see the huge spikes, the shot noise, and the fluorescence contribution.
Figure 7 shows the Raman spectra after removing spikes. The spikes were removed by implementing a sequence of criteria that allowed us to identify and remove them, as described in Section 2.3.1. In Figure 8, we can observe the spectra after smoothing. The smoothing is done by applying the FFT filter; the FFT is an efficient algorithm for calculating the discrete Fourier transform (DFT) and its inverse. It is of great importance in a wide variety of applications, such as digital signal processing and digital filtering, among others.
In Figure 9, we see the spectra in the final state of the denoising process, where the fluorescence has been removed by creating a first-degree polynomial function, a straight line, from two specific regions. We selected the lowest region from the lower section and the lowest region from the upper section for each spectrum. At the end, a spectral correction was made using the equation of the line through the two points $(x_1, y_1)$ and $(x_2, y_2)$. With these two points, we construct a line, which we subtract from each spectrum for its normalization (equation (4)). Then, the spectra can be processed using multivariate analysis.

3.2. Statistical Analysis (PCA). After denoising all RBC Raman spectra with and without AgNP, we moved on to perform the PCA. From the equipment (LabRAM HR, Horiba), we obtained a data matrix consisting of n spectra with 1201 values each. These factors express the relationships between the wave number variables (1800-599 cm⁻¹). So, the spectra had n = 615 values. All the data were typified, and a complete PCA cross validation was performed using one or two principal components.
After applying PCA to the denoised Raman spectra, we obtained the statistical results shown in Table 1, where the PC variability percentages show that the first two PCs explain 98.86% of the information from the RBC Raman spectra without AgNP and 98.94% of the information from the RBC Raman spectra with AgNP. The PC selection was made by applying the component retention criteria explained in Section 2.4.1: we took the fewest components that explain between 80% and 90% of the total variability, preserving components whose eigenvalues are greater than or equal to 1.
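The two retention criteria can be written as short functions. The eigenvalues below are invented for illustration and are not the values of Table 1; the function names and the default 80% target are assumptions.

```python
def variability_percent(eigenvalues):
    """Percentage of the total variance carried by each eigenvalue."""
    total = sum(eigenvalues)
    return [100.0 * lam / total for lam in eigenvalues]


def retain_by_cumulative(eigenvalues, target=80.0):
    """Smallest number of components whose cumulative percentage reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for count, pct in enumerate(variability_percent(ordered), start=1):
        cumulative += pct
        if cumulative >= target:
            return count
    return len(ordered)


def retain_by_eigenvalue(eigenvalues):
    """Number of components with eigenvalue greater than or equal to 1."""
    return sum(1 for lam in eigenvalues if lam >= 1.0)


example = [3.2, 1.1, 0.4, 0.2, 0.1]   # hypothetical eigenvalues
print(retain_by_cumulative(example, 80.0))
print(retain_by_cumulative(example, 90.0))
print(retain_by_eigenvalue(example))
```

The two criteria need not agree, which is why the final choice is left to the researcher.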
In this case, we obtained two strong principal components, but Figures 10 and 11, which show the two PC spectra from each dataset, make it clear that only the first PC gives readable information in both cases.
By computing the correlation between the original variables and the retained components, we can also discard the second component, as shown in Table 2.
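A correlation of this kind can be computed as the Pearson coefficient between each original variable and the scores of a component. The sketch below uses made-up numbers, not the data behind Table 2, and the function names are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def variable_component_correlations(X, component_scores):
    """Correlation of every column of X (rows = spectra) with one component."""
    p = len(X[0])
    return [pearson([row[j] for row in X], component_scores) for j in range(p)]


# Made-up example: 4 spectra, 3 variables, and the scores of one component.
X = [[1, 2, 5], [2, 4, 3], [3, 6, 6], [4, 8, 2]]
component_scores = [1, 2, 3, 4]
print(variable_component_correlations(X, component_scores))
```

A component that correlates only weakly with the original variables contributes little to their interpretation, which is the kind of argument used here to discard the second component.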
Figure 12 shows the PC retained from both datasets. The characteristic peaks can be distinguished, and the main components match the ones reported in [14], but the spectra need more treatment, for example, to remove the silicon band and to show the intensity difference between the spectra more clearly.

Conclusions
In this work, noise was removed from the RBC Raman spectra of two representative independent tests, one incubated without AgNP and the second one with AgNP. During the Raman signal processing, cosmic rays (spikes) were eliminated by detection and elimination criteria (mathematical method), the data were smoothed using the FFT filter method, and fluorescence was eliminated by normalizing the data with the equation of the line through two points. More visible and manageable results were obtained for better handling in the statistical analysis, the PCA, which made it possible to represent the given reference system with a more affordable and smaller dimension, in such a way that information loss was minimal. This reduced the data analysis time, which was done by hand and was very cumbersome. In fact, the spectra denoising lasted for a month before the statistical analysis was carried out.

(ii) Construction of a straight line through these two points and subtraction of this line from the data
(iii) Computation of the FFT of the data obtained in the previous step
(iv) Application of the low-pass parabolic filter
(v) Computation of the IFFT of the filtered spectrum
(vi) Addition of the baseline to the data obtained in the previous step
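The smoothing steps of Section 2.3.2 can be sketched as below. Several details are assumptions rather than statements from the text: the parabolic window is taken as 1 below the pass frequency, 0 above the stop frequency, and 1 − ((f − f_pass)/(f_stop − f_pass))² in between; frequencies are expressed in cycles per sample; the baseline is anchored at the first and last points; and the number of samples must be a power of two.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 transform (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    sign = 1 if inverse else -1
    even, odd = fft(x[0::2], inverse), fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def parabolic_window(f, f_pass, f_stop):
    """Assumed low-pass parabolic window (see the lead-in)."""
    if f <= f_pass:
        return 1.0
    if f >= f_stop:
        return 0.0
    return 1.0 - ((f - f_pass) / (f_stop - f_pass)) ** 2


def fft_smooth(y, f_pass, f_stop):
    n = len(y)
    if n < 2 or n & (n - 1):
        raise ValueError("number of samples must be a power of two")
    m = max(1, n // 100)
    y_first = sum(y[:m]) / m                 # (i) mean of the first 1%
    y_last = sum(y[-m:]) / m                 #     and of the last 1%
    line = [y_first + (y_last - y_first) * i / (n - 1) for i in range(n)]
    residual = [yi - li for yi, li in zip(y, line)]          # (ii)
    spectrum = fft(residual)                                 # (iii)
    filtered = [spectrum[k] * parabolic_window(min(k, n - k) / n, f_pass, f_stop)
                for k in range(n)]                           # (iv)
    back = fft(filtered, inverse=True)                       # (v), scaled by 1/n below
    return [back[i].real / n + line[i] for i in range(n)]    # (vi)


noisy = [10 + 0.1 * i + (-1) ** i for i in range(64)]
print(fft_smooth(noisy, 0.05, 0.25)[:5])
```

Subtracting the line before the transform makes both ends of the data close to zero, which reduces the edge artefacts that the periodic FFT would otherwise introduce; the line is added back at the end so that the overall level of the spectrum is preserved.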

(2) Typification of the variables: it is necessary to obtain the mean vector from E:
$$\bar{e} = (\bar{e}_1, \ldots, \bar{e}_j, \ldots, \bar{e}_p). \quad (7)$$

Figure 10: Two principal component spectra with the highest variability percentage from RBC Raman spectra without AgNP.

Figure 11: Two principal component spectra with the highest variability percentage from RBC Raman spectra with AgNP.

Table 1: Principal component analysis results from RBC Raman spectra with and without AgNP.

Table 2: Correlation between original variables and components retained from RBC Raman spectra with and without AgNP.

Figure 12: PC spectra from dataset analyzed.