igblastr 0.99.7
The Immunoglobulin Basic Local Alignment Search Tool (IgBLAST) is a specialized bioinformatics tool developed by the National Center for Biotechnology Information (NCBI) for the analysis of B-cell receptor (BCR) and T-cell receptor (TCR) sequences (Ye et al., 2013). IgBLAST performs sequence alignment and annotation, with key outputs including germline V, D, and J gene assignments; detection of somatic hypermutations introduced during affinity maturation; identification and annotation of complementarity-determining regions (CDR1–CDR3) and framework regions (FR1–FR4); and both nucleotide and protein-level alignments.
igblastr is an R/Bioconductor package that provides functions to conveniently install and use a local IgBLAST installation from within R. The package is designed to make it as easy as possible to use IgBLAST in R by streamlining the installation of both IgBLAST and its associated germline databases. In particular, these installations can be performed with a single function call, do not require root access, and persist across R sessions.
The main function in the package is igblastn()
, a wrapper to the igblastn
standalone executable included in IgBLAST. In addition to igblastn()
,
the package provides:
A function (install_igblast()
) for conveniently downloading and
installing a pre-compiled IgBLAST from NCBI.
Functions to download and install germline databases from the
IMGT/V-QUEST download site, and to configure them for use with igblastn()
.
A set of built-in germline databases from OGRDB, the AIRR Community’s Open Germline Reference Database.
A set of built-in constant region (C-region) databases (a.k.a. constant region databases) from IMGT/V-QUEST.
Utility functions to parse the results returned by igblastn()
.
Utility functions to download data from OAS, the Observed Antibody Space database, and prepare it for use with IgBLAST.
Etc.
IgBLAST is described at https://pubmed.ncbi.nlm.nih.gov/23671333/
Online IgBLAST: https://www.ncbi.nlm.nih.gov/igblast/
Please use https://github.com/HyrienLab/igblastr/issues to report bugs, provide feedback, request features (etc) about igblastr.
Like any Bioconductor package, igblastr should be installed
with BiocManager::install()
:
if (!require("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("igblastr")
BiocManager::install()
will take care of installing the package dependencies
that are missing.
Load the package:
library(igblastr)
If IgBLAST is already installed on your system, you can tell
igblastr to use it by setting the environment variable
IGBLAST_ROOT
to the path of your IgBLAST installation.
See ?IGBLAST_ROOT
for more information.
Otherwise, simply call install_igblast()
to install the latest
version of IgBLAST. As of March 2025, NCBI provides pre-compiled
versions of IgBLAST for Linux, Windows, Intel Mac and Mac Silicon.
install_igblast()
will automatically download the appropriate pre-compiled
version of IgBLAST for your platform from the NCBI FTP site, and install it
in a location that will be remembered across R sessions.
if (!has_igblast())
install_igblast()
See ?install_igblast
for more information.
Note that we use has_igblast()
to avoid reinstalling IgBLAST if
igblastr already has access to a working IgBLAST installation.
Display basic information about the IgBLAST installation used by igblastr:
igblast_info()
## igblast_root: /home/biocbuild/.cache/R/igblastr/igblast_roots/1.22.0
## igblast_build: igblast 1.22.0, build Oct 11 2023 16:06:20
## igblastn_version: 1.22.0
## makeblastdb_version: 2.15.0+
## OS/arch: Linux/x86_64
## organisms: human, mouse, rabbit, rat, rhesus_monkey
The igblastr package includes a FASTA file containing 8,437 paired heavy and light chain human antibody sequences (16,874 individual sequences) retrieved from OAS. These sequences will serve as our query sequences, that is, the immunoglobulin (Ig) sequences that we will analyze in this vignette.
Before we can do so, the igblastn
standalone executable in IgBLAST (and,
by extension, our igblastn()
function) needs access to the germline V, D,
and J gene sequences for humans. Before igblastn
can use them, these
germline sequences must first be stored in three separate BLAST databases.
We will refer to these as the V-region, D-region, and J-region databases,
respectively. When considered together, we will refer to them as the
germline database.
The igblastr package includes a set of built-in germline databases for human and mouse that were obtained from the OGRDB database (AIRR community). These can be listed with:
list_germline_dbs(builtin.only=TRUE)
## db_name V D J
## _AIRR.human.IGH+IGK+IGL.202501 354 33 24
## _AIRR.mouse.CAST_EiJ.IGH+IGK+IGL.202501 184 9 22
## _AIRR.mouse.LEWES_EiJ.IGH+IGK+IGL.202501 169 11 22
## _AIRR.mouse.MSM_MsJ.IGH+IGK+IGL.202501 172 9 22
## _AIRR.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202501 149 9 22
## _AIRR.mouse.PWD_PhJ.IGH+IGK+IGL.202501 184 10 22
The last part of the database name indicates the date of the download in YYYYMM format.
If our query sequences are from humans or from one of the mouse strains listed
above, we can already select the appropriate database with use_germline_db()
and skip the subsection below.
AIRR/OGRDB and IMGT/V-QUEST are two providers of germline databases that
can be used with IgBLAST. If, for any reason, none of the built-in
AIRR/OGRDB germline databases is suitable (e.g., if your query sequences
are not from human or mouse), you can use install_IMGT_germline_db()
to install additional germline databases. Below, we show how to install the
latest human germline database from IMGT/V-QUEST.
First, we list the most recent IMGT/V-QUEST releases:
head(list_IMGT_releases())
## [1] "202518-3" "202506-1" "202449-1" "202430-2" "202416-4" "202405-2"
The organisms included in release 202506-1 are:
list_IMGT_organisms("202506-1")
## [1] "Aotus_nancymaae" "Bos_taurus"
## [3] "Camelus_dromedarius" "Canis_lupus_familiaris"
## [5] "Capra_hircus" "Chondrichthyes"
## [7] "Danio_rerio" "Equus_caballus"
## [9] "Felis_catus" "Gadus_morhua"
## [11] "Gallus_gallus" "Gorilla_gorilla_gorilla"
## [13] "Heterocephalus_glaber" "Homo_sapiens"
## [15] "Ictalurus_punctatus" "Lemur_catta"
## [17] "Macaca_fascicularis" "Macaca_mulatta"
## [19] "Mus_musculus" "Mus_musculus_C57BL6J"
## [21] "Mustela_putorius_furo" "Neogale_vison"
## [23] "Nonhuman_primates" "Oncorhynchus_mykiss"
## [25] "Ornithorhynchus_anatinus" "Oryctolagus_cuniculus"
## [27] "Ovis_aries" "Pongo_abelii"
## [29] "Pongo_pygmaeus" "Rattus_norvegicus"
## [31] "Salmo_salar" "Sus_scrofa"
## [33] "Teleostei" "Tursiops_truncatus"
## [35] "Vicugna_pacos"
Next, we install the human germline database from the latest IMGT/V-QUEST release:
install_IMGT_germline_db("202506-1", organism="Homo sapiens", force=TRUE)
See ?install_IMGT_germline_db
for more information.
Finally, we select the newly installed germline database for use
with igblastn()
:
use_germline_db("IMGT-202506-1.Homo_sapiens.IGH+IGK+IGL")
See ?use_germline_db
for more information.
To see the full list of local germline dbs:
list_germline_dbs()
## db_name V D J
## _AIRR.human.IGH+IGK+IGL.202501 354 33 24
## _AIRR.mouse.CAST_EiJ.IGH+IGK+IGL.202501 184 9 22
## _AIRR.mouse.LEWES_EiJ.IGH+IGK+IGL.202501 169 11 22
## _AIRR.mouse.MSM_MsJ.IGH+IGK+IGL.202501 172 9 22
## _AIRR.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202501 149 9 22
## _AIRR.mouse.PWD_PhJ.IGH+IGK+IGL.202501 184 10 22
## IMGT-202506-1.Homo_sapiens.IGH+IGK+IGL 700 47 34 *
## IMGT-202506-1.Mus_musculus.IGH+IGK+IGL 862 61 27
Note that the asterisk (*
) displayed at the far right of the output
from list_germline_dbs()
indicates the currently selected germline
database (you may need to scroll horizontally to see the asterisk).
See ?list_germline_dbs
for more information.
The igblastn
standalone executable in IgBLAST can also use constant
region (C region) sequences to improve the analysis. As with the
germline V, D, and J gene sequences, the C-region sequences should
typically come from the same organism as the query sequences, and they
must also be formatted as a BLAST database. We will refer to this database
as the C-region database.
The igblastr package includes a set of built-in C-region databases for human, mouse, and rabbit, obtained from IMGT/V-QUEST. These can be listed using:
list_c_region_dbs()
## db_name C
## _IMGT.human.IGH+IGK+IGL.202412 76
## _IMGT.mouse.IGH.202412 31
## _IMGT.rabbit.IGH.202412 28
The last part of the database name indicates the date of the download in YYYYMM format.
If your query sequences are from human, mouse, or rabbit, you can
select the appropriate database using use_c_region_db()
:
use_c_region_db("_IMGT.human.IGH+IGK+IGL.202412")
Calling list_c_region_dbs()
again should display an asterisk (*
)
at the far right of the output, indicating the currently selected
C-region database.
See ?list_c_region_dbs
for more information.
Now that we have selected a germline and C-region database for use
with igblastn()
, we are almost ready to call igblastn()
to perform
the alignment.
As mentioned earlier, the igblastr package includes a
FASTA file containing 8,437 paired heavy and light chain human antibody
sequences retrieved from OAS. These serve as our query sequences,
that is, the set of BCR sequences that we will analyse using igblastn()
.
To get the path to the query sequences, use:
query <- system.file(package="igblastr", "extdata", "1279067_1_Paired_sequences.fasta.gz")
The igblastr package also includes a JSON file containing metadata associated with the query BCR sequences:
json <- system.file(package="igblastr", "extdata", "1279067_1_Paired_All.json")
query_metadata <- jsonlite::fromJSON(json)
query_metadata
## $Run
## [1] 1279067
##
## $Link
## [1] "https://doi.org/10.1038/s41586-022-05371-z"
##
## $Author
## [1] "Jaffe et al., 2022"
##
## $Species
## [1] "human"
##
## $Age
## [1] 38
##
## $BSource
## [1] "PBMC"
##
## $BType
## [1] "Memory-B-Cells"
##
## $Vaccine
## [1] "None"
##
## $Disease
## [1] "None"
##
## $Subject
## [1] "Donor-3"
##
## $Longitudinal
## [1] "no"
##
## $`Unique sequences`
## [1] 8437
##
## $Isotype
## [1] "All"
##
## $Chain
## [1] "Paired"
The 8,347 paired sequences come from memory B cells isolated from peripheral blood mononuclear cell (PBMC) samples of a single human donor (age 38) with no known disease or vaccination history. The source for these sequences is the Jaffe et al. (2022) study; the DOI link to the publication is provided above.
Before calling igblastn()
, we first check the selected databases by
calling use_germline_db()
and use_c_region_db()
with no arguments:
use_germline_db()
## [1] "IMGT-202506-1.Homo_sapiens.IGH+IGK+IGL"
use_c_region_db()
## [1] "_IMGT.human.IGH+IGK+IGL.202412"
Now, let’s call igblastn()
. Since we are only interested in the best
alignment for each query sequence, we set num_alignments_V
to 1. This
may take up to around 3 min on a standard laptop:
AIRR_df <- igblastn(query, num_alignments_V=1)
By default, the result is a tibble in AIRR format:
AIRR_df
## # A tibble: 16,874 × 111
## sequence_id sequence sequence_aa locus stop_codon vj_in_frame v_frameshift
## <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl>
## 1 heavy_chain_A… ACAACCA… QVQLVQSGSE… IGH FALSE TRUE FALSE
## 2 light_chain_A… TGAGCGC… QSVLTQPPSV… IGL FALSE TRUE FALSE
## 3 heavy_chain_A… ACTTTCT… QVQLQESGPG… IGH FALSE TRUE FALSE
## 4 light_chain_A… GAGGAGT… DIQMTQSPSS… IGK FALSE TRUE FALSE
## 5 heavy_chain_A… GGGATCA… QVQLVQSGAE… IGH FALSE TRUE FALSE
## 6 light_chain_A… GCTGGGG… QSALTQPPSA… IGL FALSE TRUE FALSE
## 7 heavy_chain_A… GGAGCCC… EVQLVESGGG… IGH FALSE TRUE FALSE
## 8 light_chain_A… GGGGGCT… QSALTQPASV… IGL FALSE TRUE FALSE
## 9 heavy_chain_A… ATACTTT… QVQLQESGPG… IGH FALSE TRUE FALSE
## 10 light_chain_A… GAGCTAC… DIVMTQSPDS… IGK FALSE TRUE FALSE
## # ℹ 16,864 more rows
## # ℹ 104 more variables: productive <lgl>, rev_comp <lgl>, complete_vdj <lgl>,
## # d_frame <lgl>, v_call <chr>, d_call <chr>, j_call <chr>, c_call <chr>,
## # sequence_alignment <chr>, germline_alignment <chr>,
## # sequence_alignment_aa <chr>, germline_alignment_aa <chr>,
## # v_alignment_start <int>, v_alignment_end <int>, d_alignment_start <int>,
## # d_alignment_end <int>, j_alignment_start <int>, j_alignment_end <int>, …
See ?igblastn
for more information.
library(ggplot2)
library(dplyr)
library(scales)
library(ggseqlogo)
One common analysis of AIRR format data is to examine the distribution of percent mutation across BCR sequences. Here we analyze the percent mutation in the V segments of each chain type (heavy, kappa, and lambda). Note that V percent mutation is 100 - v_identity.
AIRR_df |>
ggplot(aes(locus, 100 - v_identity)) +
theme_bw(base_size=14) +
geom_point(position = position_jitter(width = 0.3), alpha = 0.1) +
geom_boxplot(color = "blue", fill = NA, outliers = FALSE, alpha = 0.3) +
ggtitle("Distribution of V percent mutation by locus") +
xlab(NULL)
Another common analysis is to investigate the distribution of germline genes (e.g., V genes). In this case, we typically stratify the analysis by locus or chain type.
plot_gene_dist <- function(AIRR_df, loc) {
df_v_gene <- AIRR_df |>
filter(locus == loc) |>
mutate(v_gene = sub("\\*[0-9]*", "", v_call)) |> # drop allele info
group_by(v_gene) |>
summarize(n = n(), .groups = "drop") |>
mutate(frac = n / sum(n))
df_v_gene |>
ggplot(aes(frac, v_gene)) +
theme_bw(base_size=13) +
geom_col() +
scale_x_continuous('Percent of sequences', labels = scales::percent) +
ylab("Germline gene") +
ggtitle(paste0(loc, "V gene prevalence"))
}
plot_gene_dist(AIRR_df, "IGH")
plot_gene_dist(AIRR_df, "IGK")
plot_gene_dist(AIRR_df, "IGL")
A third category of analysis focuses on CDR3 sequences, including their lengths and motifs, which are often visualized using sequence logo plots.
AIRR_df$cdr3_aa_length <- nchar(AIRR_df$cdr3_aa)
AIRR_df |>
group_by(locus, cdr3_aa_length) |>
summarize(n = n(), .groups = "drop") |>
ggplot(aes(cdr3_aa_length, n)) +
theme_bw(base_size=14) +
facet_wrap(~locus) +
geom_col() +
ggtitle("Histograms of CDR3 length by locus")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_col()`).
AIRR_df |>
filter(locus == "IGK", cdr3_aa_length == 9) |>
pull(cdr3_aa) |>
ggseqlogo(method = "probability") +
theme_bw(base_size=14) +
ggtitle("Logo plot of kappa chain CDR3 sequences that are 9 AA long")
At the moment, the igblastr package does not provide
access to the full functionality of the IgBLAST software. In particular,
the igblastn()
function only supports the analysis of B-cell receptor (BCR)
sequences. Also, the igblastp
standalone executable included in IgBLAST
has no counterpart in igblastr.
Some future developments include:
Adding support for T-cell receptor (TCR) sequence analysis.
This will require adding facilities to download and install TCR
oriented germline databases, and to configure them for use
with igblastn(..., ig_seqtype="TCR")
.
Implement igblastp()
, a wrapper to the igblastp
standalone executable
included in IgBLAST for protein-level alignments.
Facilities to retrieve germline databases from OGRDB, the AIRR Community’s Open Germline Reference Database.
Here is the output of sessionInfo()
on the system on which this document
was compiled:
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggseqlogo_0.2 scales_1.4.0 dplyr_1.1.4
## [4] ggplot2_3.5.2 igblastr_0.99.7 Biostrings_2.77.1
## [7] GenomeInfoDb_1.45.4 XVector_0.49.0 IRanges_2.43.0
## [10] S4Vectors_0.47.0 BiocGenerics_0.55.0 generics_0.1.4
## [13] tibble_3.3.0 BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.52 bslib_0.9.0
## [4] websocket_1.4.4 processx_3.8.6 vctrs_0.6.5
## [7] tools_4.5.0 ps_1.9.1 curl_6.3.0
## [10] pkgconfig_2.0.3 R.oo_1.27.1 RColorBrewer_1.1-3
## [13] lifecycle_1.0.4 compiler_4.5.0 farver_2.1.2
## [16] stringr_1.5.1 tinytex_0.57 chromote_0.5.1
## [19] htmltools_0.5.8.1 sass_0.4.10 yaml_2.3.10
## [22] later_1.4.2 pillar_1.10.2 crayon_1.5.3
## [25] jquerylib_0.1.4 R.utils_2.13.0 cachem_1.1.0
## [28] magick_2.8.7 tidyselect_1.2.1 rvest_1.0.4
## [31] digest_0.6.37 stringi_1.8.7 bookdown_0.43
## [34] labeling_0.4.3 fastmap_1.2.0 grid_4.5.0
## [37] cli_3.6.5 magrittr_2.0.3 dichromat_2.0-0.1
## [40] utf8_1.2.6 withr_3.0.2 UCSC.utils_1.5.0
## [43] promises_1.3.3 rmarkdown_2.29 httr_1.4.7
## [46] R.methodsS3_1.8.2 evaluate_1.0.3 knitr_1.50
## [49] rlang_1.1.6 Rcpp_1.0.14 xtable_1.8-4
## [52] glue_1.8.0 selectr_0.4-2 BiocManager_1.30.26
## [55] xml2_1.3.8 jsonlite_2.0.0 R6_2.6.1