`stitchr` input data

Species covered

stitchr can stitch any TCR locus for which it has the necessary raw data, appropriately formatted in the Data directory (which can be located by running stitchr -dd or stitchr --data_dir). Currently stitchr relies on IMGT/GENE-DB for its germline reference data, which contains enough TCR data for stitching (i.e. at least one leader, V gene, J gene, and constant region for a given locus) for the following species/loci:

stitchr species/loci table
Common Name	Genus species	TRA	TRB	TRG	TRD	Kazusa species ID
CAT	Felis catus	✔	✔	✔	✔	9685
COW	Bos taurus	✔	✔	✔	✔	9913
CYNOMOLGUS_MONKEY	Macaca fascicularis		✔			9541
DOG	Canis lupus familiaris	✔	✔	✔	✔	9615
DOLPHIN	Tursiops truncatus	✔		✔	✔	9739
DROMEDARY	Camelus dromedarius		✔	✔		9838
FERRET	Mustela putorius furo		✔			9669
HUMAN	Homo sapiens	✔	✔	✔	✔	9606
MOUSE	Mus musculus	✔	✔	✔	✔	10090
NAKED_MOLE-RAT	Heterocephalus glaber	✔	✔	✔	✔
PIG	Sus scrofa		✔			9823
RABBIT	Oryctolagus cuniculus		✔	✔	✔	9986
RHESUS_MONKEY	Macaca mulatta	✔	✔	✔	✔	9544
SHEEP	Ovis aries	✔	✔		✔	9940

(Note that D gene information is not required, as all D gene residues are by definition included entirely within the CDR3 junction sequences.)

The species can be specified using the -s / --species command line flag when running your stitchr command (default = ‘human’). E.g., here’s an example using everyone’s favourite mouse TCR, OT-I (sequences inferred from this plasmid on AddGene):

stitchr -s mouse -v TRBV12-1 -j TRBJ2-7 -cdr3 CASSRANYEQYF
stitchr -s mouse -v TRAV14D-1 -j TRAJ33 -cdr3 CAASDNYQLIW

It must be noted that many of the less well studied species have poorer gene annotation and less germline variation covered by IMGT, so TCRs produced using these datasets should be treated with more caution than, say, those for humans. E.g. different mouse strains will have different alleles (and different numbers of gene family members), so the accuracy of stitched TCRs will depend on the quality of both the germline gene information and the TCR clonotyping.

Generating new reference input files

You may wish to update your raw TCR data files after an IMGT update, as sequences can be added (or even changed), or as more species become available. Assuming IMGT maintains its current naming conventions and webhosting scheme, these tasks can be undertaken automatically using the stitchrdl command, which is a wrapper for the IMGTgeneDL script (details in its own repo).

Users should note that running this creates a data-production-date.tsv file in the directory of that species, containing the IMGT release number and the date the sequences were downloaded; this information should be included in any published reporting of the TCR sequences used. We also recommend that you update the germline TCR data for stitchr at the same time you update the database used in whatever TCR gene annotation software you use, to ensure that there’s no discrepancy in allele nomenclature between the two.

Stitchr data formatting

Each species you wish to stitch TCRs for must have its own folder in the installed Data/ directory, named after whatever value you wish to use when giving stitchr information through the -s / --species flag. Note that stitchr will assume every folder in Data/ that isn’t named ‘kazusa’ or ‘GUI-Examples’ is a potential TCR germline folder, so it’s advised not to put any other folders there. It is recommended to use stitchrdl to generate and update these folders where possible.
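As a rough illustration of the folder rule described above, the following sketch lists which sub-folders of a data directory would be read as species folders. The function name `candidate_species` is hypothetical and this is not stitchr's own code; it only mirrors the rule as stated (every folder except ‘kazusa’ and ‘GUI-Examples’).

```python
from pathlib import Path

# Folder names that the text says are not treated as germline folders.
NON_SPECIES_FOLDERS = {"kazusa", "GUI-Examples"}


def candidate_species(data_dir):
    """Return the names of sub-folders that would be read as species folders.

    Hypothetical helper for illustration only: plain files (such as
    additional-genes.fasta) are ignored, as are the two reserved folders.
    """
    return sorted(
        entry.name
        for entry in Path(data_dir).iterdir()
        if entry.is_dir() and entry.name not in NON_SPECIES_FOLDERS
    )
```

Running this against the path reported by stitchr -dd is a quick way to spot stray folders that might be mistaken for a species.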

Inside that folder there should be various files:

data-production-date.tsv
- Contains information about the germline reference and script versions used to generate this data
- Technically not used by stitchr as it runs, but contains important information for recording or relaying the products of stitchr
imgt-data.fasta
- Contains all of the FASTA reads that were successfully downloaded for this species
- Again, this file isn’t used during stitching but it’s a useful reference to have
J-region-motifs.tsv
- Contains automatically inferred CDR3 junction ending motifs and residues (using the process established in the autoDCR TCR assignation tool), for use in finding the ends of junctions in stitchr
C-region-motifs.tsv
- Contains automatically inferred in-frame constant region peptide sequences, for use in finding the correct frame of stitched sequences
TR[A/B/G/D].fasta
- FASTA files of the individual loci’s genes
- FASTA headers require a flag specifying what type of gene they are at the very end, after a tilde (~) character, being one of LEADER/VARIABLE/JOINING/CONSTANT

Note that the method used to automatically download TCR data (IMGTgeneDL) seems to struggle for constant regions in certain species, when trying to download the fully spliced sequences. As such, the script will instead download each of the individual exons and splice them together.

This is particularly important for users who wish to stitch gamma chain TCRs with constant regions which may have multiple possible exon configurations. Any constant region which uses non-conserved or duplicated exons has an additional suffix to the allele field, in which the non-standard exon labels are appended after an underscore. E.g. the human TRGC2*05 gene has the arrangement EX1+EX2T+EX2R+EX2+EX3 (instead of the usual EX1+EX2+EX3), so it is labelled TRGC2*05_TR. This allows users to have multiple isoforms of the same allele with different sequences.

Codon usage files

Non-templated bases are assigned by taking the most common nucleotide triplet for a given amino acid, as listed in a provided codon usage file.

Codon usage files are provided for all species for which data is available on the Kazusa website, and can be found in the Data/kazusa/ directory. Alternative files (e.g. those provided by HIVE) can be used, but must be in the same format. Unless specified using the -cu / --codon_usage flag, they must also be named according to the common species name used for the rest of the data and placed in that directory. U and T can be used interchangeably, as all U bases will be replaced with T anyway.

If no species-specific codon usage file is found the script will default to using the human file.
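The idea of picking the most common codon per amino acid can be sketched as below. This is a minimal illustration under stated assumptions, not stitchr's implementation: the input is assumed to be an already-parsed mapping of codon to (amino acid, frequency), the function names are hypothetical, and the frequencies in the usage note are made up rather than taken from Kazusa.

```python
def most_common_codons(usage):
    """Map each amino acid to its most frequent codon.

    `usage` maps codon -> (amino acid, frequency). Codons are converted
    to the DNA alphabet (U replaced by T), as described in the text.
    """
    best = {}
    for codon, (amino_acid, frequency) in usage.items():
        dna_codon = codon.upper().replace("U", "T")
        if amino_acid not in best or frequency > best[amino_acid][1]:
            best[amino_acid] = (dna_codon, frequency)
    return {aa: codon for aa, (codon, _) in best.items()}


def back_translate(peptide, codon_table):
    """Back-translate a peptide using one fixed codon per amino acid."""
    return "".join(codon_table[aa] for aa in peptide)


# Made-up frequencies, for illustration only (not real Kazusa values).
toy_usage = {
    "GCU": ("A", 18.0),
    "GCC": ("A", 28.0),
    "UGU": ("C", 10.0),
    "UGC": ("C", 12.0),
}
toy_table = most_common_codons(toy_usage)
toy_dna = back_translate("CA", toy_table)
```

With these toy numbers, `toy_dna` is `TGCGCC`, since UGC and GCC are the most frequent codons for C and A respectively.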

Preferred allele files

stitchr requires an exact allele to know which sequence to pull out of the database. By default, it always prioritises using the exact allele specified, but if the user just gives a gene identifier without an allele specified (e.g. TRAV1-1) then stitchr will use the prototypical ‘01’ allele (e.g. TRAV1-1*01), as this is the only allele which every IMGT-provided gene should theoretically have. It will similarly default to *01 if explicitly given an allele which it can’t find in the input data.
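The fallback behaviour described above can be summarised in a few lines. This is only a sketch of the stated logic, with a hypothetical function name and a simplified data structure (a set of (gene, allele) pairs); it does not reproduce how stitchr stores its reference data.

```python
def resolve_allele(name, known):
    """Pick the allele to use for a gene name, following the rules in the text.

    `name` is either 'GENE' or 'GENE*ALLELE'; `known` is a set of
    (gene, allele) pairs available in the reference data.
    """
    gene, _, allele = name.partition("*")
    # An explicitly requested allele is always preferred when it exists.
    if allele and (gene, allele) in known:
        return gene, allele
    # Otherwise fall back to the prototypical *01 allele.
    if (gene, "01") in known:
        return gene, "01"
    raise KeyError(f"No usable allele found for {name}")


# Illustrative reference set (allele numbers chosen for the example).
reference = {("TRAV1-1", "01"), ("TRAV1-1", "02")}
```

For example, `resolve_allele("TRAV1-1", reference)` and `resolve_allele("TRAV1-1*09", reference)` both return the *01 allele, while `resolve_allele("TRAV1-1*02", reference)` returns *02.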

There are occasions when this is not the biologically appropriate allele to choose (and even some examples where a gene lacks an *01 allele). While users can of course specify the allele explicitly when providing the gene, they may alternatively wish to make use of the -p / --preferred_alleles_path command line option, which allows them to point to a tab-delimited file detailing specific alleles which should be used. Here users can specify five fields:

Gene: the IMGT-format gene name
Allele: the preferred allele (i.e. the text after the ‘*’ in a complete name)
Region: one of LEADER/VARIABLE/JOINING/CONSTANT, which tells stitchr explicitly what kind of sequence it is
Loci: specify which locus or loci this preferred allele covers using three-letter codes, comma-delimiting if >1 (e.g. “TRB” or “TRA,TRD”)
Source: this field is not used by stitchr, but can be useful for keeping track of the origin of/reason for including each preferred allele

This feature is particularly of use when generating large numbers of stitched sequences from a particular individual or strain where non-prototypical alleles are known. Note that if you are specifying a different allele for a variable gene that has variants in its leader sequence as well, make sure you add entries for both VARIABLE and LEADER alleles.
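A preferred-alleles file could be generated programmatically along the following lines. This is a sketch under assumptions: it assumes the file starts with a header row naming the five fields listed above (check the provided template for the exact layout), the function name is hypothetical, and the rows are illustrative only.

```python
import csv
import io

# Column names taken from the field list above; the header row itself
# is an assumption, so compare against the provided template file.
FIELDS = ["Gene", "Allele", "Region", "Loci", "Source"]


def preferred_alleles_tsv(rows):
    """Render a list of row dictionaries as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative rows: a V gene allele plus the matching leader entry.
example_rows = [
    {"Gene": "TRBV24", "Allele": "03", "Region": "VARIABLE",
     "Loci": "TRB", "Source": "example entry"},
    {"Gene": "TRBV24", "Allele": "03", "Region": "LEADER",
     "Loci": "TRB", "Source": "example entry"},
]
example_text = preferred_alleles_tsv(example_rows)
```

The resulting text can be written to a file and passed to stitchr via the -p / --preferred_alleles_path option.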

A template and two example common mouse strain files are included in the templates/preferred-alleles/ directory. These examples are for the common mouse strains C57BL/6 and BALB/c, and were produced by using the subspecies/strain field of the IMGT headers. Note that even then users should take care, as some genes have multiple alleles associated with them, despite being from inbred mice – e.g. TRAV9D-2 has two alleles associated with it for BALB/c (01 and 03). I’ve tried to pick the ones that are more likely to be functional (F > ORF > P, e.g. choosing TRBV24*03 over 02 for BALB/c, or TRAV9D-4*04 over 02 for C57BL/6), or are from better inferred data (e.g. taking one with functionality ‘F’ over ‘(F)’).

Providing additional gene sequences

Sometimes you may wish to generate TCRs using additional gene sequences which won’t be provided by IMGT (at least in the context of a given species). This can be used to introduce sequences from other loci/species, and modified or otherwise non-naturally occurring gene combinations.

Genes to be included can be added to the Data/additional-genes.fasta file, and then when stitchr or thimble is run these sequences will be read in by use of the -xg flag. As constant region gene switching is a common modification used in TCR expression and engineering studies, human alpha/beta/gamma/delta and mouse alpha/beta constant regions have been preloaded into this file. Genes added to this fasta must have a FASTA header in the format:

>[reference]|[gene]*[allele]|[species]|[functionality]||||||||||||~[region]

Only the second and last fields are important for these additional genes, and all other fields can be left empty.
The second field contains gene name and allele information: the gene name can be any alphanumeric string (that doesn’t contain an asterisk), while the allele will usually be a zero-padded two (or more) digit integer (e.g. ‘01’), but can be anything. Any case can be used in gene names, but bear in mind all will be made upper case when running.
The last field is prefixed by a tilde, and describes what kind of gene region this is (LEADER/VARIABLE/JOINING/CONSTANT). This can usually be informed by the default fifth field of a standard IMGT header (V-REGION, J-REGION, EX1+EX2+EX3+EX4 (constant region), or L-PART1+L-PART2 (leader sequence)), but in practice that field alone wasn’t sufficient for stitchr to always assign the genes correctly, hence the need for this additional field.
Note that the fourth field describing functionality (F, P, or ORF) can be left blank, with empty fields being presumed functional. This feature does not impact stitched results; it only determines whether or not certain warning flags are raised for a given rearrangement.
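To make the header layout concrete, here is a small parsing sketch that pulls out the fields described above. It is not stitchr's parser: the function name is hypothetical, and the example header in the usage note uses a placeholder reference value.

```python
REGIONS = {"LEADER", "VARIABLE", "JOINING", "CONSTANT"}


def parse_gene_header(header):
    """Extract gene, allele, functionality and region from a FASTA header."""
    text = header.lstrip(">").strip()
    # The region flag is whatever follows the final tilde.
    fields, _, region = text.rpartition("~")
    if not fields or region not in REGIONS:
        raise ValueError(f"Missing or unknown region flag in: {header}")
    parts = fields.split("|")
    if len(parts) < 2 or "*" not in parts[1]:
        raise ValueError(f"Second field must be gene*allele in: {header}")
    gene, _, allele = parts[1].partition("*")
    # Empty functionality fields are presumed functional.
    functionality = parts[3] if len(parts) > 3 and parts[3] else "F"
    return {
        "gene": gene.upper(),  # gene names are upper-cased when running
        "allele": allele,
        "functionality": functionality,
        "region": region,
    }
```

For instance, `parse_gene_header(">ref|mTRBC1*01|Mus musculus|||||||||||||~CONSTANT")` yields the gene `MTRBC1`, allele `01`, a presumed functionality of `F`, and the region `CONSTANT`.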

The following can be used as template FASTA read headers:

>[reference]|TR[ABGD]V[#]*YY|Homo sapiens|F||||||||||||~LEADER

>[reference]|TR[ABGD]V[#]*YY|Homo sapiens|F||||||||||||~VARIABLE

>[reference]|TR[ABGD]J[#]*YY|Homo sapiens|F||||||||||||~JOINING

>[reference]|TR[ABGD]C[#]*YY|Homo sapiens|F||||||||||||~CONSTANT

While FASTA entries can be manually added to the additional-genes.fasta file (located via the stitchr -dd command), it is recommended to instead use stitchrdl, which has the capacity to automatically read sequences into this file via its -fa / --fasta flag:

stitchrdl -fa sequences_to_add.fasta

Not only will this likely be more convenient for the user, but modifying the additional-genes file this way will automatically make a backup of the file first (located in the data directory, prefixed with ‘archived-YYYY-MM-DD-’), permitting recreation of stitchr runs using historic additional sequences. When providing sequences in this manner, it is recommended to use the full templates above; however, users may optionally provide only a ‘gene*allele’ identifier, which the script will use to try to infer the gene region (with leaders being distinguished from variable regions based on length, being presumed to be < or >= 100 nt respectively).
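The kind of inference described above might look something like the sketch below. Only the 100 nt leader/variable threshold comes from the text; the function name and the use of the gene-name letter (V/J/C) to pick the region are assumptions made for illustration, and the real script may work differently.

```python
import re

LEADER_MAX_LENGTH = 100  # sequences shorter than this are presumed leaders


def infer_region(identifier, sequence):
    """Guess the region type from a 'gene*allele' identifier and its sequence."""
    gene = identifier.lstrip(">").split("*")[0].upper()
    # Assumption: the letter after the locus code marks the gene type.
    match = re.search(r"TR[ABGD]([VJC])", gene)
    if match is None:
        raise ValueError(f"Cannot infer a region from: {identifier}")
    gene_type = match.group(1)
    if gene_type == "V":
        # Leaders and variable regions share V gene names, so use length.
        return "LEADER" if len(sequence) < LEADER_MAX_LENGTH else "VARIABLE"
    return {"J": "JOINING", "C": "CONSTANT"}[gene_type]
```

Because leaders and V regions share the same gene name, a truncated V region shorter than 100 nt would be misread as a leader under this rule, which is one reason the full header templates are the safer option.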

Some things to remember when using custom sequences:

Functional leader sequences usually have lengths that are multiples of 3. They don’t need to be, but if they’re not the V gene will need to account for it to maintain the reading frame.
The 3’ nucleotide of the J gene is the first nucleotide of the first codon of the constant region.
Constant regions in default settings are trimmed by the script to run up to the codon just before the first stop codon (as occur in EX4UTR exons of TRAC and TRDC). This is not required, and stop codons can be left in if desired, but care must be taken if the intention is to use thimble or gui-stitchr with these genes to make bicistronic expression constructs. It’s recommended to leave stop codons off any constant regions added to additional-genes.fasta, and then provide them in thimble instead as needed.
Most of the gene sequence and format checks cannot be applied, so extra care must be taken to ensure input genes are valid. For instance, using the -xg flag automatically sets the -sc flag, which skips the usual constant region frame check (as stitchr doesn’t know what frame is intended, see below).
Extra genes added via the additional-genes.fasta file are added to the working dictionaries in stitchr after the reference gene sequences are read in; any extra genes with the same gene name/allele combination as one already in the dataset will overwrite the default sequence. If you wish to use both in the same rearrangement or thimble run, use novel naming in the input FASTA file - e.g. the example constant regions added have ‘m’ and ‘h’ prefixes, denoting their human or mouse origin, but any change that ensures unique names will work.

Using novel/inferred human TCR alleles

Note: this section details an experimental feature of stitchr, and is only recommended for advanced users.

I maintain a repo of putative novel human TCR alleles identified in the literature, mostly inferred from repertoire data but some from long-read genomic sequencing, most of which are not currently available in IMGT. If users wish to include these alleles in their stitchr references, they can be automatically read into the additional-genes.fasta file using the -na / --novel_alleles flag of stitchrdl, for use with the -xg / --extra_genes flag of stitchr or thimble.

Note that these alleles are provided as-is, with no additional or standardised QC applied beyond what is present in the source publications. It is possible to gain confidence in these alleles through use of the -ns / --novel_studies_threshold and -nd / --novel_donors_threshold flags to provide values of the number of studies and donors each allele must have been observed in to be included in this scrape.

An example of this process may therefore look like the following, in which novel alleles found in two or more studies and in five or more cumulative donors would be added to the extra genes file:

stitchrdl -na -ns 2 -nd 5

Skipping constant region checks

For the default loci covered, stitchr has a constant region frame-checking function that uses known correctly-translated sequences to infer the right frame (and where appropriate, placement of endogenous stop codons). If you wish to override these checks for some reason (most likely if you’re manually creating your own non-standard or engineered constant region sequences) then you can set the -sc / --skip_c_checks flag in the command line. Under these circumstances, stitchr will instead determine the correct frame of the C terminal domain by finding the one with the longest stretch of amino acids before hitting a stop codon. Note that this is less reliable and slower than using the pre-computed motif files. This feature will also only activate if the gene name of the relevant constant region is not found in the C-region-motifs.tsv file for that species.
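The longest-open-stretch heuristic can be sketched as follows. This is an illustrative version with a hypothetical function name, not stitchr's own code: it counts codons from the start of each of the three forward frames up to the first stop codon and keeps the frame with the longest run.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_open_frame(sequence):
    """Return (frame, codon_count) for the frame with the longest run
    of codons before the first stop codon. Ties go to the lowest frame."""
    sequence = sequence.upper()
    best_frame, best_length = 0, -1
    for frame in range(3):
        length = 0
        # Walk complete codons only, stopping at the first stop codon.
        for start in range(frame, len(sequence) - 2, 3):
            if sequence[start:start + 3] in STOP_CODONS:
                break
            length += 1
        if length > best_length:
            best_frame, best_length = frame, length
    return best_frame, best_length
```

For short or stop-rich sequences, two frames can easily give similar lengths, which illustrates why this approach is less reliable than the pre-computed motifs.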

If for some reason users wish to skip the C region checks (and instead use the automatically inferred translation frame) for a gene that is already covered in the pre-generated motifs file, they should add a renamed variant of that gene to the additional-genes.fasta file, and use the -xg extra genes flag. Note that using the -xg flag will automatically set the -sc / --skip_c_checks flag.
