- First, you need to be familiar with how to work in R. Here I give some recommendations to start mastering spectroscopy analysis tasks.
- You need to know that there are basic R data structures used for analysis. Have a first read here in the Advanced R book by Hadley Wickham.
You will learn how to use different R base data structures and basic operations such as subsetting to explore and transform spectral data.
This hands-on tutorial teaches you the technical skills to work with data structures in R, using the manipulation of spectral data as an example.
- Use the
Clone or Download
button and clone/download this tutorial repository from GitHub to your computer.
- Unzip the folder.
- Double-click the
file and the tutorial is loaded as an R project (no need to use setwd()
and path = "yourverylonganduniquedirectoryonyourpc"
or other nasty hard-coded approaches to set the working directory, which make your projects not reproducible for others). Kindly follow these instructions to learn how to work with RStudio self-contained projects.
- For working with these R Markdown notebooks for reproducible and interactive analysis, see here.
- Have fun reproducing.
Spectroscopy modeling requires that we first organize our spectra well. In particular, proper and reproducible management of spectral data, metadata, and data from reference chemical analyses is key for the entire subsequent data processing and modeling workflow.
The Sustainable Agroecosystems group at ETH relies on Diffuse Reflectance Fourier Transform (DRIFT) infrared spectrometers manufactured by the company Bruker (see Figure ). The manufacturer relies on a proprietary binary format called OPUS to store an extensive amount of data that includes different types of intermediary spectra. For each sample that was measured a single OPUS file is produced.
First, we load the set of packages of tidyverse
(see here for details) and the simplerspec
package (see here). Simplerspec contains a universal file reader that allows you to read selected parameters (e.g. instrument, optic and acquisition parameters) and all types of spectra from a single OPUS binary file or a list of files.
I recommend that you set up a self-contained directory where all R scripts, data (spectra and chemical reference data), models, outputs (figures and text files of data and model summaries), and predictions live. Further, you can use the folder structure depicted in figure to organize your spectroscopy-related research projects.
When you have spectra that cover separate experiments and/or different locations and times, you might prefer to organize your spectra as sub-folders within data/spectra
. This hands-on is based on spectral data that were used to build and evaluate the YAMSYS spectroscopy reference models. Besides these reference spectra measured with a Bruker ALPHA mid-IR spectrometer at the Sustainable Agroecosystems group at ETH Zürich, there are other spectra that have been acquired to test different questions such as spectrometer cross-comparisons. Therefore, other comparison spectra are in separate paths, e.g. data/spectra/soilspec_eth_bin
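The self-contained project layout described above can be sketched in base R as follows. Only `data/spectra` is a folder name taken from this tutorial; all other folder names are hypothetical placeholders, and the project is created inside a temporary directory so that nothing on your machine is overwritten.

```r
# Minimal sketch of a self-contained project layout.
# Assumption: every folder name except "data/spectra" is illustrative only.
project_root <- file.path(tempdir(), "spectroscopy-project")

project_dirs <- c(
  "data/spectra",    # OPUS spectra files (name used in this tutorial)
  "data/soilchem",   # chemical reference data (hypothetical name)
  "models",          # fitted models (hypothetical name)
  "out/figs",        # figures (hypothetical name)
  "out/data",        # text summaries of data and models (hypothetical name)
  "predictions",     # model predictions (hypothetical name)
  "R"                # R scripts (hypothetical name)
)

# Create all folders, including intermediate ones
for (d in project_dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# List the created folders relative to the project root
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(created)
```

Because all paths are relative to the project root, the same scripts keep working when the project folder is moved or shared.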
In the Figure below you can see a file explorer screenshot showing OPUS files of three replicate scans for each of the first three reference soil samples. OPUS files have the extension .n
where n
represents an integer of repeated sample measurements starting from 0.
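As a small illustration of this naming convention, the replicate number and the sample part of a file name can be separated with regular expressions. This is only a sketch on three file names that appear in this tutorial; the variable names are made up for the example.

```r
# File names following the <sample>.<n> convention described above
opus_files <- c(
  "BF_lo_01_soil_cal.0",
  "BF_lo_01_soil_cal.1",
  "BF_lo_02_soil_cal.0"
)

# Everything after the last dot is the replicate number
replicate_no <- as.integer(sub("^.*\\.", "", opus_files))

# Everything before the trailing ".<digits>" identifies the sample
sample_id <- sub("\\.[0-9]+$", "", opus_files)

print(data.frame(file = opus_files, sample_id = sample_id,
                 replicate_no = replicate_no))
```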
We aim to read all the reference spectra contained within this folder. First, we get the full path names of the file names, which we subsequently assign to the object files
# Extract data from OPUS binary files; list of file paths
files <- list.files("data/spectra", full.names = TRUE)
Note that you need to set the full.names
argument to TRUE
(default is FALSE
) to get the paths of all OPUS spectra files contained within the target directory; otherwise R will not be able to find the files when using the universal simplerspec
OPUS reader.
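The effect of `full.names` can be checked on a throwaway folder. The sketch below uses a temporary directory and two empty, hypothetically named files instead of real OPUS files.

```r
# Create a temporary folder with two empty demo files (hypothetical names)
spectra_dir <- file.path(tempdir(), "demo_spectra")
dir.create(spectra_dir, showWarnings = FALSE)
file.create(file.path(spectra_dir, c("sample_a.0", "sample_a.1")))

# Default: file names only, without the folder they live in
short_names <- list.files(spectra_dir)

# full.names = TRUE: paths that R can resolve directly
full_paths <- list.files(spectra_dir, full.names = TRUE)

print(short_names)
print(file.exists(full_paths))
# The bare names only resolve if the working directory is spectra_dir itself
print(file.exists(short_names))
```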
You can compactly display the internal structure of the files object with str():
str(files)
## chr [1:284] "data/spectra/BF_lo_01_soil_cal.0" ...
The object files
has the data structure atomic vector. Atomic vectors have six possible basic (atomic) vector types. These are logical, integer, double (often called real or numeric), complex, character (string) and raw. Vector types can be returned by the R base function typeof(x)
, which returns the type or internal storage mode of an object x
. For the files
object it is
# Check type of files object
typeof(files)
## [1] "character"
We get the length of the vector or the number of elements by
# How many files are listed to read? length of vector
length(files)
## [1] 284
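To see the atomic vector types side by side, the following sketch builds one small example vector per type and queries it with `typeof()` and `lengths()`. The example values are arbitrary.

```r
# One small example vector for each of the six atomic types
examples <- list(
  logical   = c(TRUE, FALSE),
  integer   = 1:3,
  double    = c(0.5, 1.5),      # typeof() reports real numbers as "double"
  complex   = 1 + 2i,
  character = c("a", "b"),
  raw       = as.raw(c(1, 255))
)

# typeof() of each vector; the result is a named character vector
types <- vapply(examples, typeof, character(1))
print(types)

# Number of elements in each vector
n_elements <- lengths(examples)
print(n_elements)
```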
Base R has subsetting operations that allow you to extract pieces of data structures you are interested in. One of the three base subsetting operators is [
. We subset the character vector files
as follows:
# Subset the character vector to return its first three elements
# Subsetting can be seen as a complement to str()
# (1) Subsetting with positive integers (position)
files[1:3]
## [1] "data/spectra/BF_lo_01_soil_cal.0" "data/spectra/BF_lo_01_soil_cal.1"
## [3] "data/spectra/BF_lo_01_soil_cal.2"
# (2) Subsetting with negative integers (remove values)
head(files[-c(1:3)], n = 5L) # show only first 5 values
## [1] "data/spectra/BF_lo_02_soil_cal.0" "data/spectra/BF_lo_02_soil_cal.1"
## [3] "data/spectra/BF_lo_02_soil_cal.2" "data/spectra/BF_lo_03_soil_cal.0"
## [5] "data/spectra/BF_lo_03_soil_cal.1"
# The first three elements of the character vector are removed
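Besides positive and negative integers, `[` also accepts logical vectors and, for named vectors, names. The following sketch shows all four variants on a short, hypothetical vector of file names.

```r
# Hypothetical file names, used only for this illustration
demo_files <- c("s1.0", "s1.1", "s1.2", "s2.0", "s2.1")

# (1) Positive integers: keep the elements at these positions
first_and_third <- demo_files[c(1, 3)]

# (2) Negative integers: drop the elements at these positions
without_first_three <- demo_files[-(1:3)]

# (3) Logical vector: keep the elements where the condition is TRUE
s2_only <- demo_files[grepl("^s2", demo_files)]

# (4) Names: select elements of a named vector by name
named_files <- c(a = "s1.0", b = "s2.0")
by_name <- named_files["b"]

print(first_and_third)
print(without_first_three)
print(s2_only)
print(by_name)
```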
Bruker FTIR spectrometers produce binary files in the OPUS format that can contain different types of spectra and many parameters such as instrument type and settings that were used at the time of data acquisition and internal processing (e.g. Fourier transform operations). Basically, the entire set of Setup Measurement Parameters, selected spectra, and supplementary metadata such as the time of measurement are written into OPUS binary files. In contrast to simple text files that contain only plain text with a defined character encoding, binary files can contain any type of data represented as sequences of bytes (a single byte is a sequence of 8 bits, and each bit represents either 0 or 1).
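The relation between text, numbers, bytes and bits can be inspected directly in base R. This sketch is unrelated to the OPUS format itself; it only shows how the same kind of raw bytes can encode either characters or an integer.

```r
# Text as bytes: each ASCII character occupies one byte
text_bytes <- charToRaw("AB")
print(text_bytes)              # bytes shown in hexadecimal
print(as.integer(text_bytes))  # the same bytes as decimal numbers

# One byte as 8 bits (most significant bit first); 65 is the code of "A"
bits_of_A <- rev(as.integer(intToBits(65L))[1:8])
print(paste(bits_of_A, collapse = ""))

# A number as bytes: the integer 1 stored in 4 bytes, little-endian
int_bytes <- writeBin(1L, raw(), size = 4, endian = "little")
print(int_bytes)
```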
Figure shows a graphical representation from the OPUS viewer software to help you become familiar with the types of parameters OPUS files may contain.
You can download the OPUS viewer software from this Bruker webpage for free. However, Bruker only provides a Windows version, and the free version is limited to visualizing only final spectra. The remaining spectral blocks can be checked by choosing the menu Window > New Report Window and opening OPUS files via the menu File > Load File.
The types of spectra and associated data parameters that are saved after a single measurement depend on the options that are selected in the OPUS software. For data acquisition, these options are set under the Advanced tab of the Setup Measurement Parameters menu window in the OPUS software.
Depending on the standard of a binary file, different regions in a file can be interpreted differently by a program. For example, the information at some block positions needs to be interpreted as a certain type of number representation, whereas other positions hold text. Hence, the interpretation of different bit positions in the file requires either a priori knowledge provided by some file specifications or extensive reverse-engineering.
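Why a reader needs to know the layout of a binary file can be illustrated with a toy file. The layout used here (a 4-byte integer count, followed by that many 8-byte doubles, followed by a two-character label) is invented for this example and is not the OPUS format.

```r
# Write a toy binary file with an invented layout (NOT the OPUS format):
# [4-byte integer n] [n x 8-byte doubles] [2 bytes of ASCII text]
toy_file <- tempfile(fileext = ".bin")
con <- file(toy_file, "wb")
writeBin(3L, con, size = 4, endian = "little")
writeBin(c(0.1, 0.2, 0.3), con, size = 8, endian = "little")
writeBin(charToRaw("AB"), con)
close(con)

# Reading with knowledge of the layout recovers the content
con <- file(toy_file, "rb")
n_points <- readBin(con, what = "integer", n = 1, size = 4, endian = "little")
values   <- readBin(con, what = "double", n = n_points, size = 8,
                    endian = "little")
label    <- rawToChar(readBin(con, what = "raw", n = 2))
close(con)
print(n_points)
print(values)
print(label)

# Reading the first 8 bytes as a double instead gives a meaningless number
misread <- readBin(toy_file, what = "double", n = 1, size = 8,
                   endian = "little")
print(misread)
```

The bytes on disk are identical in both cases; only the assumed type and position change the result, which is why a missing file specification forces reverse-engineering.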
Instead of sharing the full binary file specification, Bruker ships the OPUS macro programming language or Microsoft Visual Basic scripts for automated data acquisition and processing. However, these approaches are very inflexible and not transparent, and therefore not reproducible. Hence, the idea of implementing a file reader that is integrated in the R statistical programming environment was targeted first in the soil.spec
R package created by Andrew Sila (ICRAF, Nairobi), Tomislav Hengl (ISRIC -- World Soil Information) and Thomas Terhoeven-Urselmans (former member of ICRAF, Nairobi). soil.spec
was created based on the African Soil Information Services (AfSIS) project (see here for more information). Because this reader worked only when applying a restricted set of settings and procedures in OPUS, the idea came up to modify and extend the previously mentioned soil.spec::read.opus()
function. This restriction is mainly due to the fact that positions where spectra occur are not fixed and there is no evident accessible information about the sequence of spectra and data parameters and the type of present spectra. Therefore, I have been working extensively on a universal Bruker OPUS format file reader that can correctly assign and read out different spectra types from any type of Bruker FTIR spectrometer with different blocks saved and with and without atmospheric compensation.
Simplerspec comes with a reader function written in R that is intended to be a universal Bruker OPUS file reader that extracts spectra and key metadata from files. Usually, one is mostly interested in extracting the final absorbance spectra (shown as AB
in the OPUS viewer software).
# Read spectra and metadata of all binary OPUS files into a list
spc_list <- read_opus_univ(fnames = files, extract = c("spc"), parallel = TRUE)
The extracted spectra and metadata are contained within a list. A list is a very flexible R data structure that can contain any other type of R object. You can think of lists as containers that help to save and transform objects. Lists can contain hierarchically nested elements, e.g. a list can contain lists. In this case, the list contains the following elements:
# Return the names of a list; names() returns a character vector
# of the element names and `[` extracts the first 10 names
names(spc_list)[1:10]
## [1] "BF_lo_01_soil_cal.0" "BF_lo_01_soil_cal.1" "BF_lo_01_soil_cal.2"
## [4] "BF_lo_02_soil_cal.0" "BF_lo_02_soil_cal.1" "BF_lo_02_soil_cal.2"
## [7] "BF_lo_03_soil_cal.0" "BF_lo_03_soil_cal.1" "BF_lo_03_soil_cal.2"
## [10] "BF_lo_04_soil_cal.0"
# The names of the spectral data list are identical to the names of the
# files that were read (file names without the path of the folder that
# contains the data)
files[1:10]
## [1] "data/spectra/BF_lo_01_soil_cal.0" "data/spectra/BF_lo_01_soil_cal.1"
## [3] "data/spectra/BF_lo_01_soil_cal.2" "data/spectra/BF_lo_02_soil_cal.0"
## [5] "data/spectra/BF_lo_02_soil_cal.1" "data/spectra/BF_lo_02_soil_cal.2"
## [7] "data/spectra/BF_lo_03_soil_cal.0" "data/spectra/BF_lo_03_soil_cal.1"
## [9] "data/spectra/BF_lo_03_soil_cal.2" "data/spectra/BF_lo_04_soil_cal.0"
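The idea of a named, nested list can be reproduced with a toy object. The structure below (a `metadata` list and a `spc` vector per file) is a hypothetical simplification and not the actual output of the simplerspec reader; the file names and values are made up.

```r
# Hypothetical file paths, used only to derive element names
toy_paths <- file.path("data/spectra", c("sample_a.0", "sample_a.1"))

# A list of lists: one element per file, each holding metadata and a spectrum.
# Assumption: this structure is illustrative, not the simplerspec output.
toy_spc_list <- list(
  list(metadata = list(instrument = "demo"), spc = c(0.10, 0.12, 0.15)),
  list(metadata = list(instrument = "demo"), spc = c(0.11, 0.13, 0.16))
)

# Name the elements after the files, without the folder path
names(toy_spc_list) <- basename(toy_paths)

print(names(toy_spc_list))

# Every element of the outer list is itself a list (nesting)
all_nested <- all(vapply(toy_spc_list, is.list, logical(1)))
print(all_nested)

# Compact overview of the first two nesting levels
str(toy_spc_list, max.level = 2)
```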
List subsetting/extraction of components: Lists can be subsetted similarly to atomic vectors by the [
and [[
operators. The most important difference is that [
returns a list (sub-list of list) and [[
returns the content of a single component of the list (note that a single component can still contain sub-lists). Based on the [
operator we can extract spectra and metadata from a single replicate measurement of a sample:
# Display structure
str(spc_list[1])
, whereas using [[
shows the content of the subsetted list and the name is not shown anymore. The content contains again 9 elements, as str()

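The difference between `[` and `[[` on lists can be verified on a small toy list. The element names and values are hypothetical and do not correspond to the elements returned by the simplerspec reader.

```r
# Toy nested list; names and values are made up for this illustration
toy <- list(
  first  = list(spc = c(0.1, 0.2), n = 2L),
  second = list(spc = c(0.3, 0.4), n = 2L)
)

# `[` returns a sub-list: still a list of length 1 that keeps the name
sub_list <- toy[1]
print(names(sub_list))

# `[[` returns the content of the component itself; the outer name is gone
component <- toy[[1]]
print(names(component))

# The component is again a list, so `[[` can be applied once more
first_spectrum <- component[["spc"]]
print(first_spectrum)
```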