Generating reference data
In order to work, autoDCR requires a few different reference files (which differ slightly depending on the mode being run). These files are automatically generated using the autoDCR refs command, which in turn uses the stitchr-style output of IMGTgeneDL, a tool used to automatically download and parse germline TCR allele data from IMGT/GENE-DB.
autoDCR data directory
autoDCR generates its own data directory during installation, and will download and manage its content there itself. If everything is working as it should, end users shouldn’t need to manually interact with their data directory much, but if it is required the location of this directory can be printed by running autoDCR dd.
autoDCR refs
refs is the subcommand that grabs and manages the reference data.
Basic referencing
autoDCR can theoretically be used to analyse any of the four common TCR loci (alpha, beta, gamma, delta) for which IMGT has sufficient information (see this page of the stitchr documentation for more details). The majority of TCRseq research (including my own) deals with human alpha/beta data, which is reflected in most of the autoDCR default parameters.
autoDCR identifies TCR genes used in a rearrangement by scanning reads with an Aho-Corasick trie built from short ‘tag’ sequences, themselves produced by taking overlapping subsequences from all allele sequences in the provided reference. In order to use autoDCR in its simplest form - i.e. annotating V, J, and CDR3 regions in TCRseq reads - one need only run:
# for human data
autoDCR refs
# for other species
autoDCR refs -s [species common name]
# e.g. for mouse
autoDCR refs -s mouse
When you run that, autoDCR grabs and processes the necessary data for its style of annotation.
By default this is done for all V and J genes of the TRA and TRB loci together, so that the same process can be applied to all human alpha/beta reads regardless of locus.
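To make the tag idea concrete, here is a minimal Python sketch of how overlapping tags might be cut from allele sequences and then used to score a read. It is an illustration under stated assumptions, not autoDCR's actual code: the function names, the step parameter, and the toy sequences are all hypothetical, and a plain dictionary lookup over every read window stands in for the Aho-Corasick trie that autoDCR itself uses.

```python
from collections import defaultdict


def build_tags(alleles, tag_len=20, step=10):
    """Cut overlapping tags from each allele.

    Returns {tag: [(allele_name, start_position), ...]}.
    `step` (distance between tag starts) is an assumption for this sketch.
    """
    tags = defaultdict(list)
    for name, seq in alleles.items():
        seq = seq.upper()
        for start in range(0, len(seq) - tag_len + 1, step):
            tags[seq[start:start + tag_len]].append((name, start))
    return dict(tags)


def scan_read(read, tags, tag_len=20):
    """Count tag hits per allele by testing every window of the read.

    A dictionary lookup is used here for brevity; a trie-based search
    finds the same matches in a single pass over the read.
    """
    read = read.upper()
    hits = defaultdict(int)
    for i in range(len(read) - tag_len + 1):
        for allele, _position in tags.get(read[i:i + tag_len], ()):
            hits[allele] += 1
    return dict(hits)


# Toy, made-up sequences (not real TCR genes) that differ by a single base.
toy_alleles = {
    "TOY_V1*01": "ACGTACGGTTCAGGCTAACGTTAGC",
    "TOY_V1*02": "ACGTACGGTTCAGGCTAACGATAGC",
}
# Short tags so the toy example stays readable (the real default is 20-mers).
toy_tags = build_tags(toy_alleles, tag_len=8, step=4)

# A toy read carrying the *02 version of the variable position.
toy_read = "GGACGGTTCAGGCTAACGATAGCTT"
toy_hits = scan_read(toy_read, toy_tags, tag_len=8)
print(toy_hits)
```

In this toy case the tags shared by both alleles score for each of them, while the one tag spanning the differing base scores only for TOY_V1*02, so that allele ends up with one more hit than TOY_V1*01. This is why each tag records every allele it covers.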
Running autoDCR refs under its default conditions will therefore generate a ‘HUMAN’ folder in the data directory. That will contain several kinds of files, including:
- General files relating to the establishment of the species-specific data directory:
  - imgt-data.fasta, containing all of the unfiltered reads from the IMGT reference for the requested species (humans, by default).
  - Locus-specific filtered versions of that (TRA.fasta, TRB.fasta etc).
  - Some automatically-produced motif files produced by IMGTgeneDL (C- and J-regionmotifs.tsv).
  - data-production-date.tsv, which contains information about the date and release of IMGT/GENE-DB used (which should be retained for reporting when publishing any results generated with a given reference).
- Several files prefixed AB_JV (representing ‘alpha/beta V and J genes’):
  - A *.fasta file, containing the subset of genes used for a specific trie.
  - A *.tags file, containing the specific tags used to populate the trie, and the specific alleles they are found in. These contain several fields:
    - tag sequences (by default 20-mers), used in the actual tag searching
    - tag jump values (comma-separated where applicable), which indicate the position in the gene of that tag
    - comma-separated lists of the alleles covered by each tag
  - A *.translate file, detailing automatically-inferred conserved V and J gene motifs used to help translate TCRs and identify CDR3 junctions, specifically with the fields:
    - TCR allele
    - relative position
    - amino acid found at the conserved residue position
    Note that in both file types, positions are counted forwards from the 5’ end of the V or backwards from the 3’ end of the J. Not only is it conceptually easier to count backwards for the Js (due to the start of the J being obscured by V(D)J recombination), but the conserved FGXG motifs tend to fall at specific positions relative to the end of the Js, at least in functional genes, presumably due to evolutionary constraints.
  - A *.log file, which is not used in subsequent TCR annotation efforts but which contains metadata produced during the automatic processing of a given species’ genes (see Notes section below).
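The position convention described for the V and J genes can be illustrated with a short Python sketch. It assumes amino acid coordinates, 0-based counting from the start of the V, and 1-based counting back from the end of the J. These indexing details, the function name, and the toy sequences are assumptions made for illustration and may not match how autoDCR stores its values.

```python
def residue_at(seq, position, gene_type):
    """Return the residue at a relative position.

    V genes: `position` is counted forwards from the start (0-based here).
    J genes: `position` is counted backwards from the end (1 = last residue).
    Both indexing bases are assumptions made for this sketch.
    """
    if gene_type == "V":
        return seq[position]
    if gene_type == "J":
        if position < 1:
            raise ValueError("J positions are counted from 1 at the 3' end")
        return seq[-position]
    raise ValueError(f"unknown gene type: {gene_type!r}")


# Toy, made-up protein fragments (not real germline sequences).
toy_v = "MKTLSVDAAVYFCAS"   # conserved C sits at index 12 from the start
toy_j = "NTEAFFGQGTRLTVV"   # F of the FGQG motif is 10 residues from the end

print(residue_at(toy_v, 12, "V"))
print(residue_at(toy_j, 10, "J"))

# Trimming the start of the J (as recombination does) leaves the
# end-anchored position unchanged.
trimmed_j = toy_j[3:]
print(residue_at(trimmed_j, 10, "J"))
```

Anchoring J positions to the 3’ end means the same stored value still finds the conserved residue after the 5’ end of the J has been trimmed, which a start-anchored position would not.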
You can specify exactly which loci autoDCR files are generated for using the --loci / -L flag, e.g.:
# Gamma/delta TCR references
autoDCR refs -L GD
# Single chain A/B references (to process individually)
autoDCR refs -L A
autoDCR refs -L B -sd
- Note the use of the skip_download / -sd flag. If not used, each use of autoDCR refs will overwrite the previous data. As some more advanced features of autoDCR require multiple iterations of refs to be applied to the same dataset, users must take care to ensure that the correct order of commands using -sd is observed.
Novel allele supplementation
As part of my interest in the impact of human TCR gene polymorphism, I maintain a resource tracking putative novel human TCR alleles (not featured in IMGT/GENE-DB) identified in the literature. If users wish to include these alleles in their database, they can run the following command:
autoDCR refs -nv
Note that this uses a function adapted from stitchr (version 0.3.0), which is described in slightly more detail in the stitchr docs.
Additional region referencing
In addition to typical V/J/CDR3 annotation, autoDCR also has some modes which permit additional annotation of the leader sequence and constant regions of a rearranged TCR.
In order to extract the necessary information for these modes, users must first generate the corresponding reference information. This is achieved by running autoDCR refs with the appropriate --regions / -r flag together with the skip_download / -sd flag, after running regular reference production (+/- novel alleles) for that species. E.g.:
# For human AB TCRs, first you must have run
autoDCR refs
# Or, if including novel alleles
autoDCR refs -nv
# Then to add the constant and leader files
autoDCR refs -sd -r CL
TCR protein sequence referencing
autoDCR is also capable of applying its TCR annotation functions to translated polypeptide sequences, as might be found in, say, repurposed structural data. In order to generate the corresponding files, autoDCR refs needs to be run with the boolean --protein / -aa flag. Users also need to supply a decreased sliding window length (which dictates the length of the tags) via the --sliding_window / -sw flag as below (using a length of 10, which seems to work reasonably well):
# For human AB TCRs, first you must have acquired standard nt V/J sequences
# Then generate amino acid versions from that
autoDCR refs -aa -sw 10
Use of the --protein / -aa flag automatically applies the skip_download flag, as it requires the corresponding nucleotide data to have been downloaded already.
Also note that this is one of the (even more) experimental features. As such it currently can only be used for standard V/J/CDR3 annotation, and may behave oddly with some less-common parameters.
Quick full human referencing
The following commands will establish all of the necessary files for the current breadth of autoDCR functionality, at least as relates to human alpha/beta TCR analysis:
autoDCR refs -nv
autoDCR refs -sd -r CL
autoDCR refs -aa -sw 10
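Because the later steps rely on the data produced by the first, it can help to script the sequence so that it stops at the first failure. The following is a minimal Python sketch of that idea. The helper names are hypothetical, it only uses the flags shown above, and by default it prints the commands rather than running them.

```python
import subprocess


def full_human_reference_commands(novel_alleles=True, protein_window=10):
    """Build the reference commands in the order they need to run."""
    first = ["autoDCR", "refs"] + (["-nv"] if novel_alleles else [])
    return [
        first,                                   # download + V/J references
        ["autoDCR", "refs", "-sd", "-r", "CL"],  # add constant/leader files
        ["autoDCR", "refs", "-aa", "-sw", str(protein_window)],  # protein tags
    ]


def run_in_order(commands, dry_run=True):
    """Print each command, and run it unless dry_run is set.

    check=True makes a failing step raise CalledProcessError, so later
    steps never run against missing or partial reference data.
    """
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


commands = full_human_reference_commands()
run_in_order(commands, dry_run=True)
```

Calling run_in_order(commands, dry_run=False) would execute the three steps for real, assuming autoDCR is installed and on the PATH.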
Things to note
- Using pip to uninstall autoDCR will likely result in the deletion of the data directory, so if you are likely to need its contents (e.g. if you have used a particular configuration to analyse data for publication) it is suggested that you make a copy of this directory before uninstalling.
- As all TRDV genes can be found recombined with TRAJ (even those genes not necessarily labelled as TRAVx/DVx, at least in humans), autoDCR automatically considers all TRDV genes when looking for alpha chain recombinations.
- When generating tags for a particular species for the first time, be sure to check the .log file produced in the relevant data directory. Conserved C/FGXG motifs are detected using regexes manually produced by generating positional weight matrices of all four human TCR loci, which allows detection of even non-canonical residues at the conserved positions, in allele sequences that may be incomplete. However these motifs may be less conserved between species, and so if the log file shows that there are many predicted-functional genes not being found with high-confidence motifs, users may wish to inspect and correct the regexes dictionary, stored in the regexes.json file located in the data directory.
- If for some reason multiple data directories for a given species need to be maintained simultaneously, you can navigate to the data directory and rename the species folder (e.g. you could have ‘HUMAN’ and ‘HUMAN2’), which could then be referred to specifically using the --species / -s flag of the different autoDCR commands. Each directory needs to be generated with a recognised common name however, as this is what IMGTgeneDL uses as a reference to pull out the right reads from IMGT/GENE-DB.
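As a hedged illustration of the first note, the data directory could be copied to a dated backup before uninstalling. In the Python sketch below the function names are hypothetical. It also assumes that autoDCR dd prints the directory path as the last non-empty line of its output, which should be checked against what the command actually prints on your system.

```python
import shutil
import subprocess
from datetime import date
from pathlib import Path


def find_data_dir():
    """Ask autoDCR for its data directory.

    Assumption: the path is the last non-empty line printed by `autoDCR dd`.
    """
    result = subprocess.run(
        ["autoDCR", "dd"], capture_output=True, text=True, check=True
    )
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    if not lines:
        raise RuntimeError("autoDCR dd produced no output")
    return Path(lines[-1])


def backup_data_dir(data_dir, backup_root):
    """Copy the whole data directory into backup_root under a dated name."""
    data_dir = Path(data_dir)
    target = Path(backup_root) / f"{data_dir.name}-backup-{date.today().isoformat()}"
    # copytree refuses to overwrite an existing target, which protects
    # an earlier backup made on the same day.
    shutil.copytree(data_dir, target)
    return target
```

A typical call would be backup_data_dir(find_data_dir(), Path.home() / "autoDCR-backups"), run before pip uninstall. The backup location here is only an example.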