This function builds a RepSeqExperiment object from alignment files generated by MiXCR and immunoseq, as well as files formatted following the AIRR standards guidelines.
A set of filters can be applied to the loaded files during the building process, including:
- filtering out ambiguous sequences
- filtering out unproductive sequences, i.e. out-of-frame sequences and sequences containing stop codons
- filtering out singletons, i.e. clones with an occurrence of 1
- filtering out short or excessively long amino acid CDR3 sequences
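The filters listed above can be pictured on a toy clone table. The sketch below is not the package's implementation: the column names (V, J, ntCDR3, aaCDR3, count), the length tolerance of 2 amino acids, and the choice to compute the mean CDR3 length over the whole table are assumptions made for illustration only.

```r
# Toy clone table; the column names are hypothetical, not the package's schema.
clones <- data.frame(
  V      = c("TRAV1", "TRAV2", "TRAV3", "TRAV4", "TRAV5", "TRAV6", "TRAV7"),
  J      = c("TRAJ1", "TRAJ2", "TRAJ3", "TRAJ4", "TRAJ5", "TRAJ6", "TRAJ7"),
  ntCDR3 = c("TGTGCTGTGAGAGATTTC",                    # productive
             "TGTGCNGTGAGAGATTTC",                    # ambiguous: contains an "N"
             "TGTGCTGTGAGAGATTT",                     # out of frame: 17 nucleotides
             "TGTGCTTGAAGAGATTTC",                    # in frame, but has a stop codon
             "TGTGCTGTGAGTGATTTC",                    # productive, seen once
             "TGTGCTGTGAGAGATAGCAACTATCAGTTAATCTTC",  # much longer than the others
             "TGTGCTGTGAGAGATTAC"),                   # productive
  aaCDR3 = c("CAVRDF", "CXVRDF", "CAVRD~", "CA*RDF",
             "CAVSDF", "CAVRDSNYQLIF", "CAVRDY"),
  count  = c(5, 3, 2, 4, 1, 6, 7),
  stringsAsFactors = FALSE
)

aa_th <- 2  # assumed tolerance for this toy table

# One logical flag per filter
ambiguous    <- grepl("N", clones$ntCDR3, fixed = TRUE)
out_of_frame <- nchar(clones$ntCDR3) %% 3 != 0
has_stop     <- grepl("[*~]", clones$aaCDR3)
singleton    <- clones$count == 1
aa_len       <- nchar(clones$aaCDR3)
bad_length   <- abs(aa_len - mean(aa_len)) > aa_th

# Keep only the rows that pass every filter
keep <- !(ambiguous | out_of_frame | has_stop | singleton | bad_length)
kept <- clones[keep, ]
print(kept[, c("V", "aaCDR3", "count")])
```

Computing each flag separately makes it easy to tabulate how many rows each filter would remove before combining them.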
Arguments
- fileList
a list of paths to the alignment files. File format can be one of the following: .tsv, .txt.gz, .zip, or .tar
- fileFormat
a character vector specifying the format of the input files, i.e. the tool that was used to generate/align the files. Should be one of "MiXCR", "immunoseq", or "MiAIRR".
- chain
a character vector indicating a single TCR or Ig chain to analyze. The vector can be one of the following:
- "TRA", "TRB", "TRG" or "TRD" for the TCR repertoires
- "IGH", "IGK" or "IGL" for the BCR repertoires
- sampleinfo
a data frame containing:
- a column with the sample_ids. Ids should match the base names of the corresponding files as well as their order in the list. This column should be assigned as row.names when the metadata file is loaded (see the example below).
- any additional columns with relevant information for the analyses. Group columns must be transformed into factors after loading (see the example below).
No specific column names or column order are required.
- keep.ambiguous
a boolean indicating whether or not ambiguous sequences containing an "N" in their nucleotide CDR3 should be left in. Default is FALSE.
- keep.unproductive
a boolean indicating whether or not unproductive sequences should be left in. Default is FALSE. Unproductive sequences include:
- out-of-frame sequences: sequences with frame shifts based on the number of nucleotides in their CDR3s
- sequences containing stop codons: aaCDR3 sequences containing a "*" or "~"
- filter.singletons
a boolean indicating whether or not ntClones (V+ntCDR3+J) with an occurrence of 1 should be filtered out. Default is FALSE.
- aa.th
an integer determining the CDR3 amino acid sequence length limits, i.e. the maximum number of amino acids by which a CDR3 length may deviate from the mean length. The default value is 8, which keeps amino acid CDR3s whose length falls inside the following range: mean length - 8 <= CDR3 length <= mean length + 8.
- outFiltered
a boolean indicating whether or not to write a data frame containing the filtered-out reads to the oData slot of the RepSeqExperiment object.
- cores
an integer indicating the number of CPU cores to be used by the function. The process is parallelized and can thus run on multiple cores. Default is 1.
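Several of the arguments above are constrained relative to one another, so it can help to check them before starting a long import. The helper below, check_inputs, is hypothetical and not part of AnalyzAIRR. It assumes that a sample id is the file's base name with one of the documented extensions stripped, which may differ from how the package derives ids.

```r
# Hypothetical pre-flight check; not an AnalyzAIRR function.
check_inputs <- function(fileList, sampleinfo, fileFormat, chain) {
  # Restrict the format and chain to the documented values
  fileFormat <- match.arg(fileFormat, c("MiXCR", "immunoseq", "MiAIRR"))
  chain <- match.arg(chain, c("TRA", "TRB", "TRG", "TRD", "IGH", "IGK", "IGL"))

  # Assumption: sample id = base name without a documented extension
  ids <- sub("\\.(tsv|txt\\.gz|zip|tar)$", "", basename(unlist(fileList)))

  # Ids must match the row names of sampleinfo, in the same order
  if (!identical(ids, rownames(sampleinfo))) {
    stop("sample ids derived from fileList do not match rownames(sampleinfo)")
  }
  invisible(list(fileFormat = fileFormat, chain = chain, sample_ids = ids))
}

# Usage with made-up paths (the files do not need to exist for this check)
files <- c("/data/run1/sampleA.tsv", "/data/run1/sampleB.txt.gz")
meta <- data.frame(cell_subset = factor(c("Treg", "Teff")),
                   row.names = c("sampleA", "sampleB"))
checked <- check_inputs(files, meta, fileFormat = "MiAIRR", chain = "TRA")
print(checked$sample_ids)
```

Failing early on a mismatch between file order and metadata order avoids silently attaching metadata to the wrong sample.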
Value
an object of class RepSeqExperiment that is used in all the analytical metrics proposed by the AnalyzAIRR package. See RepSeqExperiment-class for more details.
Examples
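# List the example MiAIRR alignment files shipped with the package
l <- list.files(system.file(file.path('extdata/MiAIRR'),
                            package = 'AnalyzAIRR'),
                full.names = TRUE)
# Load the sample metadata, using the first column (sample ids) as row names
metaData <- read.table(system.file(file.path('extdata/sampledata.txt'),
                                   package = 'AnalyzAIRR'),
                       sep = "\t",
                       row.names = 1, header = TRUE)
# Group columns must be converted to factors
metaData$cell_subset <- factor(metaData$cell_subset)
metaData$sex <- factor(metaData$sex)
# Build the RepSeqExperiment object from two samples and their metadata rows
dataset <- readAIRRSet(fileList = l[7:8],
                       fileFormat = "MiAIRR",
                       chain = "TRA",
                       sampleinfo = metaData[7:8, ],
                       filter.singletons = FALSE,
                       aa.th = 8,
                       outFiltered = FALSE,
                       cores = 3)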
#> Running on 3 cores...
#> Loading and filtering sequencesAssembling sequences...
#> Creating a RepSeqExperiment object...
#> Done.