
One of the main characteristics of the Stammler_2016_16S_spikein dataset is that the samples contain spike-in bacteria added at a known, fixed number of cells. These known bacterial loads can be used to recalibrate the raw counts of the matrix and obtain recalibrated absolute counts. This vignette shows how to recalibrate the count matrix based on the read counts of Salinibacter ruber. The procedure is referred to as spike-in-based calibration to total microbial load (SCML) in Stammler et al., 2016.

library(MicrobiomeBenchmarkData)
library(dplyr)
library(ggplot2)
library(tidyr)

0.1 Import data

tse <- getBenchmarkData('Stammler_2016_16S_spikein', dryrun = FALSE)[[1]]
#> Warning: No taxonomy_tree available for Stammler_2016_16S_spikein.
#> Finished Stammler_2016_16S_spikein.
counts <- assay(tse)

0.2 IDs of the spike-in bacteria

Identifiers of the spiked-in bacteria have the suffix ‘XXXX’.

Bacteria                       ID             Load (cells)
Salinibacter ruber             AF323500XXXX   3.0 x 10^8
Rhizobium radiobacter          AB247615XXXX   5.0 x 10^8
Alicyclobacillus acidiphilus   AB076660XXXX   1.0 x 10^8
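
The suffix convention can be used to separate the spike-in rows from the rest of a count matrix. The following is a minimal sketch on a toy matrix: the object names (`toy_counts`, `is_spike_in`, and so on), the two non-spike-in row names, and all count values are invented for illustration and are not taken from the dataset.

```r
# Toy count matrix. Only the three spike-in IDs come from the table above;
# 'OTU_A', 'OTU_B' and every count value are made up.
toy_counts <- matrix(
    c(10,  20,  30,
       5,   0,   8,
     100, 200,  50,
      40,  60,  20,
       7,   9,  11),
    nrow = 5, byrow = TRUE,
    dimnames = list(
        c("OTU_A", "OTU_B", "AF323500XXXX", "AB247615XXXX", "AB076660XXXX"),
        c("S1", "S2", "S3")
    )
)

# Rows whose identifier ends in 'XXXX' are the spike-in bacteria.
is_spike_in <- grepl("XXXX$", rownames(toy_counts))
spike_in_ids <- rownames(toy_counts)[is_spike_in]
spike_in_ids

# Split the matrix into spike-in rows and all remaining rows.
spike_in_counts <- toy_counts[is_spike_in, , drop = FALSE]
other_counts <- toy_counts[!is_spike_in, , drop = FALSE]
```

With the real data, the same `grepl()` call could presumably be applied to `rownames(counts)`.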

0.3 Recalibrate based on Salinibacter ruber abundance

This recalibration follows the original article. The only difference is that the recalibrated values are rounded to the nearest integer to obtain counts.

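## AF323500XXXX is the unique OTU corresponding to S. ruber
s_ruber <- counts['AF323500XXXX', ]
## Per-sample size factor: S. ruber counts relative to their mean across samples
size_factor <- s_ruber / mean(s_ruber)

## Divide each sample (column) by its size factor and round to integer counts
SCML_data <- counts
for (i in seq_len(ncol(SCML_data))) {
    SCML_data[, i] <- round(SCML_data[, i] / size_factor[i])
}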
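
To make the arithmetic concrete, here is a small worked sketch of the same size-factor calculation on a toy matrix. The object names (`toy_counts`, `toy_scml`, and so on), the 'OTU_A' and 'OTU_B' rows, and all count values are invented for illustration. It uses `sweep()` in place of the loop, which should give the same result for a plain matrix.

```r
# Toy matrix: one spike-in row (S. ruber ID) and two made-up OTUs.
toy_counts <- matrix(
    c(100, 200, 300,
       50, 100, 150,
       10,  40,  90),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("AF323500XXXX", "OTU_A", "OTU_B"), c("S1", "S2", "S3"))
)

# Size factor per sample: spike-in counts divided by their mean (200 here),
# giving 0.5, 1.0 and 1.5 for S1, S2 and S3.
s_ruber_toy <- toy_counts["AF323500XXXX", ]
size_factor_toy <- s_ruber_toy / mean(s_ruber_toy)

# Divide every column by its size factor, then round to integer counts.
toy_scml <- round(sweep(toy_counts, 2, size_factor_toy, "/"))
toy_scml
```

After calibration the spike-in row is constant across samples (200 in each column of this toy example), and the other rows are rescaled by the same per-sample factors.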

A brief comparison of the total counts per sample before and after calibration:

no_cal <- counts |> 
    colSums() |> 
    as.data.frame() |> 
    tibble::rownames_to_column(var = 'sample_id') |> 
    magrittr::set_colnames(c('sample_id', 'colSum')) |> 
    mutate(calibrated = 'no') |> 
    as_tibble()

cal <-  SCML_data |> 
    colSums() |> 
    as.data.frame() |> 
    tibble::rownames_to_column(var = 'sample_id') |> 
    magrittr::set_colnames(c('sample_id', 'colSum')) |> 
    mutate(calibrated = 'yes') |> 
    as_tibble()

data <- bind_rows(no_cal, cal)

data |> 
    ggplot(aes(sample_id, colSum)) + 
    geom_col(aes(fill = calibrated), position = 'dodge') +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

The counts matrix can be replaced in the original tse in order to preserve the same metadata.

assay(tse) <- SCML_data
tse
#> class: TreeSummarizedExperiment 
#> dim: 4036 17 
#> metadata(0):
#> assays(1): counts
#> rownames(4036): GQ448052 EU458484 ... DQ795992 GQ492848
#> rowData names(1): taxonomy
#> colnames(17): MID26 MID27 ... MID42 MID43
#> colData names(12): dataset subject_id ... country description
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

0.4 A more convenient way using the scml function included in the package

tse <- getBenchmarkData('Stammler_2016_16S_spikein', dryrun = FALSE)[[1]]
#> Warning: No taxonomy_tree available for Stammler_2016_16S_spikein.
#> Finished Stammler_2016_16S_spikein.
tse <- scml(tse, bac = "s")
#> Re-calibrating counts with Salinibacter ruber (AF323500)
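
For readers who want to apply the same idea to a plain matrix with a different spike-in, the manual steps can be wrapped in a small function. The sketch below is hypothetical: `scml_sketch()` is not part of MicrobiomeBenchmarkData and is not how `scml()` is implemented; it only repeats the size-factor calculation shown earlier and adds two input checks. The toy values are made up.

```r
# Hypothetical helper (not part of MicrobiomeBenchmarkData): recalibrate a
# count matrix using the row of a chosen spike-in identifier.
scml_sketch <- function(mat, spike_in_id) {
    if (!spike_in_id %in% rownames(mat)) {
        stop("Spike-in '", spike_in_id, "' not found in the row names.")
    }
    spike <- mat[spike_in_id, ]
    # A zero spike-in count would give a zero size factor and a division
    # by zero, so stop early in that case.
    if (any(spike == 0)) {
        stop("Spike-in has zero counts in at least one sample.")
    }
    size_factor <- spike / mean(spike)
    round(sweep(mat, 2, size_factor, "/"))
}

# Toy example using the R. radiobacter ID and one made-up OTU.
toy <- matrix(
    c(30, 60, 90,
      12, 12, 12),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("AB247615XXXX", "OTU_A"), c("S1", "S2", "S3"))
)
toy_result <- scml_sketch(toy, "AB247615XXXX")
toy_result
```

Stopping on a zero spike-in count is one possible design choice; dropping or flagging the affected samples would be another.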

1 Session information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] tidyr_1.3.1                     ggplot2_3.5.1                  
#>  [3] dplyr_1.1.4                     purrr_1.0.2                    
#>  [5] MicrobiomeBenchmarkData_1.8.0   TreeSummarizedExperiment_2.14.0
#>  [7] Biostrings_2.74.0               XVector_0.46.0                 
#>  [9] SingleCellExperiment_1.28.0     SummarizedExperiment_1.36.0    
#> [11] Biobase_2.66.0                  GenomicRanges_1.58.0           
#> [13] GenomeInfoDb_1.42.0             IRanges_2.40.0                 
#> [15] S4Vectors_0.44.0                BiocGenerics_0.52.0            
#> [17] MatrixGenerics_1.18.0           matrixStats_1.4.1              
#> [19] BiocStyle_2.34.0               
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1        farver_2.1.2            blob_1.2.4             
#>  [4] filelock_1.0.3          fastmap_1.2.0           lazyeval_0.2.2         
#>  [7] BiocFileCache_2.14.0    digest_0.6.37           lifecycle_1.0.4        
#> [10] tidytree_0.4.6          RSQLite_2.3.7           magrittr_2.0.3         
#> [13] compiler_4.4.1          rlang_1.1.4             sass_0.4.9             
#> [16] tools_4.4.1             utf8_1.2.4              yaml_2.3.10            
#> [19] knitr_1.48              labeling_0.4.3          S4Arrays_1.6.0         
#> [22] bit_4.5.0               curl_5.2.3              DelayedArray_0.32.0    
#> [25] abind_1.4-8             BiocParallel_1.40.0     withr_3.0.2            
#> [28] grid_4.4.1              fansi_1.0.6             colorspace_2.1-1       
#> [31] scales_1.3.0            tinytex_0.53            cli_3.6.3              
#> [34] rmarkdown_2.28          crayon_1.5.3            treeio_1.30.0          
#> [37] generics_0.1.3          httr_1.4.7              DBI_1.2.3              
#> [40] ape_5.8                 cachem_1.1.0            zlibbioc_1.52.0        
#> [43] parallel_4.4.1          BiocManager_1.30.25     vctrs_0.6.5            
#> [46] yulab.utils_0.1.7       Matrix_1.7-1            jsonlite_1.8.9         
#> [49] bookdown_0.41           bit64_4.5.2             magick_2.8.5           
#> [52] jquerylib_0.1.4         glue_1.8.0              codetools_0.2-20       
#> [55] gtable_0.3.6            UCSC.utils_1.2.0        munsell_0.5.1          
#> [58] tibble_3.2.1            pillar_1.9.0            htmltools_0.5.8.1      
#> [61] GenomeInfoDbData_1.2.13 R6_2.5.1                dbplyr_2.5.0           
#> [64] evaluate_1.0.1          lattice_0.22-6          highr_0.11             
#> [67] memoise_2.0.1           bslib_0.8.0             Rcpp_1.0.13            
#> [70] SparseArray_1.6.0       nlme_3.1-166            xfun_0.48              
#> [73] fs_1.6.5                pkgconfig_2.0.3