An alternative method for building a RepSeqExperiment object

This function loads alignment files in formats other than the ones supported by the readAIRRSet function. These files should contain the following columns:

- sample_id: sample names

- ntCDR3: nucleotide CDR3 sequence

- aaCDR3: amino acid CDR3 sequence

- V: Variable gene name (IMGT nomenclature)

- J: Joining gene name (IMGT nomenclature)

- count: the occurrence of each sequence

The columns must appear in the order listed above. Input files can contain additional columns, but these are not carried over into the generated RepSeqExperiment object.
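As a rough illustration of the expected layout, the base R sketch below builds a small table with the six required columns in the required order and writes it to a tab-separated file. The sample name, sequences, gene names and temporary file path are invented for the example, and tab-delimited text is assumed from the .tsv extension accepted by fileList.

```r
# Hypothetical clonotype table: every value here is invented for illustration.
clones <- data.frame(
  sample_id = c("sample1", "sample1", "sample1"),
  ntCDR3    = c("TGTGCCAGCAGCTTAGGGACTGAAGCTTTCTTT",
                "TGTGCCAGCAGTGAAGCTTTCTTT",
                "TGTGCCAGCAGCCAAGAGACCCAGTACTTC"),
  aaCDR3    = c("CASSLGTEAFF", "CASSEAFF", "CASSQETQYF"),
  V         = c("TRBV5-1", "TRBV6-1", "TRBV7-2"),
  J         = c("TRBJ1-1", "TRBJ1-1", "TRBJ2-5"),
  count     = c(12L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Check that the required columns come first and in the documented order.
required <- c("sample_id", "ntCDR3", "aaCDR3", "V", "J", "count")
stopifnot(identical(names(clones)[seq_along(required)], required))

# Write a tab-separated file (the file name is a placeholder).
out_file <- file.path(tempdir(), "sample1.tsv")
write.table(clones, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Any extra columns would go after count so that the first six positions stay unchanged.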

Similarly to the readAIRRSet function, readFormatSet applies a series of filters during the building process. The possible filters that can be applied include:

- filtering out ambiguous sequences

- filtering out unproductive sequences, namely out-of-frame sequences and sequences containing stop codons

- filtering out singletons, i.e. clonotypes with an occurrence of 1

- filtering out amino acid CDR3 sequences that are too short or excessively long

Usage

readFormatSet(
  fileList,
  chain = c("TRA", "TRB", "TRG", "TRD", "IGH", "IGK", "IGL"),
  sampleinfo = NULL,
  keep.ambiguous = FALSE,
  keep.unproductive = FALSE,
  filter.singletons = FALSE,
  aa.th = 8,
  outFiltered = TRUE,
  cores = 1L
)

Arguments

fileList

a list of paths to the alignment files. The file format can be one of the following: .tsv, .txt.gz, .zip, or .tar

chain

a character string indicating the single TCR or Ig chain to analyze. The value can be one of the following:

- "TRA", "TRB", "TRG" or "TRD" for the TCR repertoires

- "IGH", "IGK" or "IGL" for the BCR repertoires

sampleinfo

a data frame containing:

- a column with the sample_ids. Ids should match the base names of the corresponding files as well as their order in the list. This column should be assigned as row.names when the metadata file is loaded (see the example below).

- any additional columns with relevant information for the analyses. Group columns must be transformed into factors after loading (see the example below).

No specific column names or column order are required.
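The preparation described above could look something like the following base R sketch. The metadata content, the column name group and the file paths are all hypothetical. In practice the table would be read from a metadata file rather than from an inline string.

```r
# Hypothetical metadata, inlined here so the sketch is self-contained.
meta_txt <- "sample_id\tgroup\nsample1\tcontrol\nsample2\ttreated"

# row.names = 1 assigns the first column (the sample ids) as row names.
sampleinfo <- read.table(text = meta_txt, sep = "\t", header = TRUE,
                         row.names = 1, stringsAsFactors = FALSE)

# Group columns are converted to factors after loading.
sampleinfo$group <- factor(sampleinfo$group)

# Placeholder paths: ids must match the file base names, in the same order.
files <- c("/path/to/sample1.tsv", "/path/to/sample2.tsv")
file_ids <- sub("\\.tsv$", "", basename(files))
stopifnot(identical(rownames(sampleinfo), file_ids))
```

The final check is only a convenience for catching mismatched or reordered samples before the files are loaded.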

keep.ambiguous

a boolean indicating whether or not ambiguous sequences containing an "N" in their nucleotide CDR3 should be left in. Default is FALSE.

keep.unproductive

a boolean indicating whether or not unproductive sequences should be left in. Default is FALSE. Unproductive sequences include:

- out-of-frame sequences: sequences with frame shifts based on the number of nucleotides in their CDR3s

- sequences containing stop codons: aaCDR3 sequences with a "*" or "~"
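The two criteria can be sketched in base R as follows. This is an illustration rather than the package's internal code: it assumes that a frame shift corresponds to a nucleotide CDR3 length that is not a multiple of three, and the sequences (including the aaCDR3 given for the out-of-frame entry) are invented.

```r
# Hypothetical CDR3 sequences, invented for illustration.
cdr3 <- data.frame(
  ntCDR3 = c("TGTGCCAGCAGCTTAGGGACTGAAGCTTTCTTT",   # 33 nt, in frame
             "TGTGCCAGCAGCTTAGGACTGAAGCTTTCTTT",    # 32 nt, frame shift
             "TGTGCCAGCTGATTAGGGACTGAAGCTTTCTTT"),  # 33 nt, contains TGA
  aaCDR3 = c("CASSLGTEAFF", "CASSLG~LKLSF", "CAS*LGTEAFF"),
  stringsAsFactors = FALSE
)

# Assumed criterion: out of frame when the length is not a multiple of 3.
out_of_frame <- nchar(cdr3$ntCDR3) %% 3 != 0

# Stop codons are flagged by "*" or "~" in the amino acid sequence.
has_stop <- grepl("[*~]", cdr3$aaCDR3)

unproductive <- out_of_frame | has_stop
productive <- cdr3[!unproductive, ]
```

With keep.unproductive = FALSE, only rows comparable to those in productive would be expected to remain.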

filter.singletons

a boolean indicating whether or not ntClones (V+ntCDR3+J) with an occurrence of 1 should be filtered out. Default is FALSE
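A minimal base R sketch of this filter is shown below. The data, the ntClone column name and the underscore separator are assumptions made for the example, not the package's internal representation.

```r
# Hypothetical clonotype table, invented for illustration.
clones <- data.frame(
  V      = c("TRBV5-1", "TRBV6-1", "TRBV7-2"),
  ntCDR3 = c("TGTGCCAGCAGCTTAGGGACTGAAGCTTTCTTT",
             "TGTGCCAGCAGTGAAGCTTTCTTT",
             "TGTGCCAGCAGCCAAGAGACCCAGTACTTC"),
  J      = c("TRBJ1-1", "TRBJ1-1", "TRBJ2-5"),
  count  = c(12L, 3L, 1L),
  stringsAsFactors = FALSE
)

# An ntClone combines the V gene, the nucleotide CDR3 and the J gene.
clones$ntClone <- paste(clones$V, clones$ntCDR3, clones$J, sep = "_")

# Singletons are ntClones observed exactly once.
is_singleton <- clones$count == 1
kept <- clones[!is_singleton, ]
```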

aa.th

an integer determining the CDR3 amino acid sequence length limits, i.e. the maximum number of amino acids by which a sequence length may deviate from the mean length. The default value is 8, which keeps amino acid CDR3s whose length falls inside the following range: mean length - 8 <= CDR3 length <= mean length + 8.
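As a worked example of this range, the base R sketch below applies the default threshold to five invented sequences. It assumes the mean is taken over all sequences shown. The source does not state whether the function computes the mean per sample or after other filters.

```r
# Hypothetical amino acid CDR3s with lengths 11, 10, 8, 3 and 35.
aa_cdr3 <- c("CASSLGTEAFF", "CASSQETQYF", "CASSEAFF", "CAF",
             paste0("CASS", strrep("G", 30), "F"))
aa_len <- nchar(aa_cdr3)

aa.th <- 8
mean_len <- mean(aa_len)   # (11 + 10 + 8 + 3 + 35) / 5 = 13.4

# Keep lengths within mean - aa.th and mean + aa.th, i.e. 5.4 to 21.4 here.
keep <- aa_len >= mean_len - aa.th & aa_len <= mean_len + aa.th
aa_cdr3[keep]
```

Here the 3-residue and 35-residue sequences fall outside the range and would be filtered out.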

outFiltered

a boolean indicating whether or not to write a data frame containing the filtered-out reads to the oData slot of the RepSeqExperiment object. Default is TRUE.

cores

an integer indicating the number of cores to be used by the function. The process is parallelized and can thus run on multiple cores. Default is 1.

Value

an object of class RepSeqExperiment, which is the input for all the analytical metrics provided by the AnalyzAIRR package. See RepSeqExperiment-class for more details.

Examples

# List the alignment files under extdata/informat in the AnalyzAIRR package
l <- list.files(system.file(file.path('extdata/informat'),
                            package = 'AnalyzAIRR'),
                full.names = TRUE)