autoDCR
A modified version of Decombinator for flexible TCR annotation
autoDCR (short for automatic Decombinator) is a Python script that performs T cell receptor (TCR) gene annotation of DNA sequencing reads. It is inspired by, and in part built upon, the core functionality of Decombinator, the TCR analysis software developed by the Chain lab at UCL (in part by me). It uses a similar conceptual framework: fast Aho-Corasick tries search for the presence of ‘tag’ sequences in DNA reads, and the tags found are used to identify V and J TCR genes.
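To make the tag-searching idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton that finds every occurrence of a set of tags in a read in a single pass. It is only an illustration of the general technique: the function names (build_automaton, find_tags) and the toy tags are made up for this example, and this is not the code or library that autoDCR or Decombinator actually use.

```python
from collections import deque


def build_automaton(tags):
    """Build a trie over the tags, then add Aho-Corasick failure links."""
    goto = [{}]   # goto[node][base] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> tags that end at this node
    for tag in tags:
        node = 0
        for base in tag:
            if base not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][base] = len(goto) - 1
            node = goto[node][base]
        out[node].append(tag)

    # Breadth-first pass: a child's failure link is derived from its parent's.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for base, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and base not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(base, 0)
            # Inherit tags that end at the suffix node as well.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_tags(read, automaton):
    """Return (start_position, tag) for every tag occurrence in the read."""
    goto, fail, out = automaton
    node = 0
    hits = []
    for i, base in enumerate(read):
        while node and base not in goto[node]:
            node = fail[node]
        node = goto[node].get(base, 0)
        for tag in out[node]:
            hits.append((i - len(tag) + 1, tag))
    return hits


# Toy tags and read, invented for illustration only.
automaton = build_automaton(["ACGT", "CGTA", "GT"])
hits = sorted(find_tags("TTACGTAC", automaton))
print(hits)
```

The read is scanned once regardless of how many tags are in the trie, which is what makes this kind of search practical for large sequencing datasets.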
The first big difference between the scripts is which tags are looked for. Standard Decombinator (V1 to V4) aims to find single CDR3-proximal tags (i.e. in the 3’ end of V genes or the 5’ end of J genes) that identify individual TCR genes, ideally covering all of the alleles of that gene. This produces an extremely concise trie which can search large datasets very rapidly. Instead, autoDCR generates overlapping tags across the entirety of all alleles of all genes for both TRA and TRB, producing a much larger trie. A single tag may then cover multiple genes, so genes are called from the intersection of the genes/alleles covered by all of the tags that matched in a read.
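The intersection-based calling described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the allele names and sequences are invented, the tag length of 4 and step of 1 are toy values rather than autoDCR's real settings, and a plain dictionary lookup stands in for the Aho-Corasick trie.

```python
def make_tags(alleles, tag_len=4):
    """Map each overlapping tag to the set of alleles that contain it."""
    tags = {}
    for name, seq in alleles.items():
        for i in range(len(seq) - tag_len + 1):
            tags.setdefault(seq[i:i + tag_len], set()).add(name)
    return tags


def call_alleles(read, tags, tag_len=4):
    """Intersect the allele sets of every tag found in the read."""
    candidates = None
    for i in range(len(read) - tag_len + 1):
        covered = tags.get(read[i:i + tag_len])
        if covered is None:
            continue  # this window matches no known tag
        candidates = covered if candidates is None else candidates & covered
    return candidates or set()


# Hypothetical reference: two alleles of one gene differing by a single base,
# plus an unrelated gene that happens to share one tag with V1*02.
alleles = {
    "V1*01": "ACGTTGCAAGCT",
    "V1*02": "ACGTTGCTAGCT",
    "V2*01": "GGATCCTTAGCA",
}
tags = make_tags(alleles)

resolved = call_alleles("TTGCTAGC", tags)   # spans the differing base
ambiguous = call_alleles("ACGTTG", tags)    # lies in the shared region
print(resolved, ambiguous)
```

In this toy case the tag TAGC is shared by V1*02 and V2*01, and TTGC by both V1 alleles, yet the intersection over all matched tags leaves only V1*02. A read that falls entirely within the region common to both V1 alleles stays ambiguous between them.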
This sacrifices the speed of Decombinator, but allows for greater resolution of TCR genes and alleles, while using and retaining sequence information from a greater portion of the input TCR read. This tag approach also makes the automation of tag generation much simpler, making it much easier to either update autoDCR when new TCR alleles are discovered, or even to apply it to situations like non-natural repertoires, or for novel species for which TCR information has recently become available.
Note that autoDCR is in active development and should be considered highly experimental. It does not offer as full a suite of functions as Decombinator (nor does it seek to). Most notably, it currently lacks the UMI-based error-correction functionality of Decombinator. Instead it is intended to supply a flexible framework of adaptable TCR annotation-related functions that can be tweaked for dataset-specific applications, including a few which I don’t feel are necessarily well catered for by existing tools.