Skip to content

Latest commit

 

History

History
481 lines (358 loc) · 19.4 KB

README.md

File metadata and controls

481 lines (358 loc) · 19.4 KB

gliph

  • GROUPING OF LYMPHOCYTE INTERACTIONS BY PARATOPE HOTSPOTS - GLIPH

Questions? Comments? Contact [email protected] / [email protected] [email protected]

Name

gliph - Grouping of Lymphocyte Interactions by Paratope Hotspots

Synopsis

gliph-group-discovery.pl [options ] --tcr TCR_TABLE

gliph-group-scoring.pl --convergence_file TCR_TABLE-convergence-groups.txt
--clone_annotations TCR_TABLE
--hla_file HLA_TABLE
--motif_pval_file TCR_TABLE.minp.ove10.txt

Description

GLIPH clusters TCRs that are predicted to bind the same MHC-restricted peptide antigen. When multiple donors have contributed to the clusters, and HLA genotypes for those donors are available, GLIPH additionally can provide predictions of which HLA-allele is presenting the antigen.

Typically the user will pass in a sequence set of hundreds to thousands of TCR sequences. This dataset will be analyzed for very similar TCRs, or TCRs that share CDR3 motifs that appear enriched in this set relative to their expected frequencies in an unselected naive reference TCR set.

GLIPH returns significant motif lists, significant TCR convergence groups, and for each group, a collection of scores for that group indicating enrichment for motif, V-gene, CDR3 length, shared HLA among contributors, and proliferation count. When HLA data is available, it predicts the likely HLA that the set of TCRs recognizes.

Installation

First, unpack gliph:

tar -xzvf gliph-1.0.tgz

You are done. The gliph commands are found in

gliph/bin/gliph-group-discovery.pl gliph/bin/gliph-group-scoring.pl

Optionally, for added convenience, you could add the path to your gliph/bin directory to your system $PATH.

Options

Required Data Inputs


The user must provide a table of TCR sequences. 

  --tcr TCR_TABLE

The format of the table is tab delimited, expecting the following columns in this 
order. Only TCRb is required for the primary component of the algorithm to function, 
but patient identity is required for HLA prediction. 

Example:

CDR3b		TRBV	TRBJ	CDR3a		TRAV		TRAJ	PatientCounts
CAADTSSGANVLTF	TRBV30	TRBJ2-6	CALSDEDTGRRALTF	TRAV19		TRAJ5	09/02171
CAATGGDRAYEQYF	TRBV2	TRBJ2-7	CAASSGANSKLTF	TRAV13-1	TRAJ56	03/04922
CAATQQGETQYF	TRBV2	TRBJ2-5	CAASYGGSARQLTF	TRAV13-1	TRAJ22	02/02591
CACVSNTEAFF	TRBV28	TRBJ1-1	CAGDLNGAGSYQLTF	TRAV25		TRAJ28	PBMC8631
CAGGKGNSPLHF	TRBV2	TRBJ1-6	CVVLRGGSQGNLIF	TRAV12-1	TRAJ42	02/02071
CAGQILAGSDTQYF	TRBV6-4	TRBJ2-3	CATASGNTPLVF	TRAV17		TRAJ29	09/00181
CAGRTGVSTDTQYF	TRBV5-1	TRBJ2-3	CAVTPGGGADGLTF	TRAV41		TRAJ45	02/02591
CAGYTGRANYGYTF	TRBV2	TRBJ1-2	CVVNGGFGNVLHC	TRAV12-1	TRAJ35	01/08733


Optional Data Inputs

The user may additional supply a table of HLA genotyping for each subject.

--hla HLA_TABLE

The format of the table is tab delimited, with each row beginning with the identity of a subject, and then two or more following column providing HLA identification. The number of total columns (HLA defined genotypes) is flexible.

Example:

09/0217 DPA101:03 DPA102:02 DPB104:01 DPB114:01 DQA101:02 09/0125 DPA102:02 DPA102:02 DPB105:01 DPB105:01 DQA106:01 03/0345 DPA102:01 DPA102:01 DPB117:01 DPB101:01 DQA101:03 03/0492 DPA101:03 DPA102:01 DPB103:01 DPB111:01 DQA101:02 02/0259 DPA101:03 DPA101:03 DPB1104:01 DPB102:01 DQA1*02:01

Optional Arguments


  --refdb DB             optional alternative reference database

  --gccutoff=1           global covergence distance cutoff. This is the maximum
			 CDR3 Hamming mutation distance between two clones sharing
			 the same V, same J, and same CDR3 length in order for them
		         to be considered to be likely binding the same antigen. 
			
                         This number will change depending on sample depth, as
			 with more reads, the odds of finding a similar sequence 
			 increases even in a naive repertoire. 

			 This number will also change depending on the species
			 evaluated and even the choice of reference database (memory
			 TCRs will be more likely to have similar TCRs than naive 
			 TCR repertoires). Thus, by default this is calculated at 
			 runtime if not specified.

  --simdepth=1000        simulated resampling depth for non-parametric convergence 
			 significance tests. This defines the number of random
			 repeat samplings into the reference distribution that
			 GLIPH performs when analyzing 
				1) global similarity cutoff
				2) local similarity motif enrichment
				3) V-gene enrichment
				4) CDR3 length enrichment
				5) clonal proliferation enrichment
			 A higher number will take longer to run but will produce 
			 more reproducible results. 

  --lcminp=0.01          local convergence minimum probability score cutoff. 
			 The score reports the probability that a random sample of the
			 same size as the sample set would but into the reference set
			 (i.e. naive repertoire) would generate an enrichment of the
			 given motif at least as high as has been observed in the 
			 sample set. It is set to 0.01 by default.

  --lcminove=10          local convergence minimum observed vs expected fold change.
			 This is a cutoff for the minimum fold enrichment over a
			 reference distribution that a given motif should have in 
			 the sample set in order to be considered for further evaluation.
			 It is set to 10 by default.

  --kmer_mindepth=3      minimum observations of kmer for it to be evaluated. This is
			 the minimum number of times a kmer should be observed in 
			 the sample set in order for it to be considered for further
			 evaluation. The number can be set higher to provide less
			 motif-based clusters with higher confidence. This could be
			 recommended if the sample set is greater than 5000 reads. Lowering
			 the value to 2 will identify more groups but likely at a cost
			 of an increase False Discovery Rate. 

  --global=1             Search for global TCR similarity (Default 1)

  --local=1              Search for local TCR similarity (Default 1)

  --make_depth_fig=0     Perform repeat random samplings at the sample set depth in 
			 order to visualize convergence

  --discontinuous=0      Allow discontinuous motifs (Default 0)

  --positional_motifs=0  Restrict motif clustering to a shared position that is fixed
			 from the N-terminal end of CDR3

  --cdr3len_stratify=0   Stratify by shared cdr3length distribution (Default 0)

  --vgene_stratify=0     Stratify by shared V-gene frequency distribution (Default 0)
 
  --public_tcrs=0        Reward motifs in public TCRs (Default 0)

Controlling Output
~~~~~~~~~~~~~~~~~~

GLIPH produces multiple output files. 

  --output FILE          Place command output into the named file.

  --verbose              Have *tmo* print messages describing
                         

Definitions

Convergence group - a set of multiple TCRs from one or more individuals that bind the same antigen in a similar manner through similar TCR contacts. GLIPH predicts convergence groups by asking "what is the probability that this cluster of similar TCRs could have appeared without the selection of common antigen?"

Global convergence - A pair of TCRs that share the same length CDR3 and differ by less than a certain number of amino acids in those CDR3s. Example: in a set of 300 random TCRs, finding two TCRs that only differ by one amino acid in their CDR3 would be highly unlikely.

Local convergence - A pair of TCRs that share in their CDR3 regions an amino acid motif that appears enriched in their sample set. Optionally this common motif could be positionally constrained. Example: in the malaria set, an enriched QRW motif was found in 23 unique TCRs from 12 individuals in a conserved position in their CDR3.

Reference set - A large database of TCR sequences that are not expected to be enriched for the specificities found in the sample set. Example: by default, GLIPH uses as a reference database over 200,000 nonredundant naive CD4 and CD8 TCRb sequences from 12 healthy controls.

Sample set - The input collection of TCRs under evaluation that are potentially enriched for a specificity not present in the reference set.

Function of Algorithm


GLIPH first makes the input sample set non-redundant at the CDR3 amino acid level.
The identity of the V-genes, the J-genes, the clonal frequencies, and the associated
HLA information for each TCR is hidden during the CDR3 analysis in order to act
as independent confirmation variables of the resulting clusters. 

GLIPH next counts the number of unique CDR3 sequences in the sample dataset. 

GLIPH next calculates the minimum hamming distance from each CDR3 in the sample 
dataset to the next closest same-length sequence in the sample set. This result is 
the minimum distance distribution.
 
GLIPH next loads the reference dataset. 

GLIPH next repeatedly resamples from the reference database at the nonredundant
sample set depth, calculating the minimum hamming distance distribution in each case.
This is repeated simdepth times (1000 by default). Probability of encountering
next closest members at distant 0, 1, 2... is tabulated across all random samples.  

GLIPH next compares the sample set minimum hamming distance distribution to that of
the reference set, to assign a probability to each hamming distance that sequences
of that distance are likely to appear randomly. 

Example: after analysis, it is found that out of 1000 simulations

   hamming distance in-test-set  in-ref-set    probability
   1                 14          0             0.001
   2                 37          8+/-3         0.02
   3                 194         147+/-35      0.24

GLIPH then selects a maximum hamming distance cutoff for grouping similar TCRs. 

NOTE: For the standard reference set this analysis has been precomputed for a range
of sampling depths, enabling a lookup table rather than simulation as a runtime
accelerator. 
 
GLIPH then analyzes all possible 3mer, 4mer, 5mer and discontinuous 3mer and 4mer
motifs for their frequency in the sample set. This is done excluding the first three
and last three residues in the CDR, as they are not observed to be in contact with
antigen in known crystal structures. This is done for every non-redundant CDR3 in
the reference set. 

Example: for a scan of the following CDR3, the following 3mers would be collected 

  CAS SFGSGGHYE TYF
  xxx|SFGSGGHYE|xxx
      SFG
       FGS
        GSG
         SGG
          GGH
           GHY
            HYE
      SFGS
       FGSG
        GSGG
         SGGH
          GGHY
           GHYE
      S.GS
       F.SG
      (etc)

The frequences off all motifs across the sample set are calculated. 

In order to establish whether these motifs are just naturally abundant or specifically
enriched by antigen, these frequencies are compared to a repeat random sampling of
the reference set at the same depth as the non-redundant database. The frequency
of every motif is collected with each random sampling in order to create a distribution.

If each motif could only be observed in a given sequence once, then the distribution of sampling
motif frequency means u become normally distributed and this result is equivalent
to calculating the frequencies of all motifs in the reference database, and then calculating
one-sided confidence intervals for expected frequencies of any given motif in the reference
database at any given sampling depth:
   
               _
CI(99.9% os) = y + t        (s / sqrt(n))
                 -  0.0005
                                                        _
where n is the sample set non-redundant CDR3 sample size, y is the motif mean frequency, and
s is the SD estimate at that sampling depth for the motif, as
       ________________
      |  _ _       _ 2
s =   |  \   (y1 - y)
      |  /__ 
      | --------------
     \|   (n - 1)

This approximation provides a runtime acceleration.

GLIPH then compares the probability that each motif in the sample set would have appeared
at the observed frequency by random change (calculated as the number of times the motif
was found at at least that depth in a random sampling, divided by the total number of
random samplings, or alternatively though the CI). 

GLIPH next generates a network of all CDR3s as nodes, and all edges as either global or
local interactions. The clusters can be optionally filtered with the following arguements

  --positional_motifs=0  Restrict motif edges to CDR3s where the motif is in a shared 
			 position that is fixed from the N-terminal end of CDR3

  --public_tcrs=0        Allow CDR3s from multiple subjects to increase the size of
			 their clusters. Rewards public TCRs but risks contamination
			 clouding results. 

  --global_vgene=0       Restricts global relationships to TCRs of common V-gene
			 Recommended but invalidates V-gene probability score

GLIPH then scores each individual cluster (candidate convergence group) by evaluating a
set of features that are independet of the CDR3 observations, assigning a probability 
p to each feature, and then combining those probabilities into a single score by conflation. 

For tests i through N, testing Pi(X=C) probability that cluster X is convergent, the combined
conflation score is given as

                      __N
P(X=C) =              ||  P (X=C)
                        i  i
           ----------------------------
            __N             __N
            ||  P (X=C)  +  ||  P (X!=C)
              i  i            i  i

The individual Pi(X=C) tests include 

1) global similarity probability
2) local motif probability
3) network size
4) enrichment of V-gene within cluster
5) enrichment of CDR3 length (spectratype) within cluster
6) enrichment of clonal expansion within cluster
7) enrichment of common HLA among donor TCR contributors in cluster 

Individual score components are calculated as follows:

== 3) Calculating network size p ==

For each discrete cluster, the probability p of a given cluster topology can be 
obtained from the number of members of the cluster by comparison to a lookup table 
calculated from repeat random sampling and GLIPH clustering of naive TCR sequences 
at a range of different sampling depths n from 25 to 5000, each performed 1000 times 
each. 

Example: at sampling depth n=500, clusters of size 5 have a probabilty p=0.002 of occuring
in naive TCR sample sets.

== 4) enrichment of V-gene within cluster ==

GLIPH hides V-gene usage for all clones prior to CDR3 analysis, and does not explore the
V-gene template endcoded amino acids in the local (motif) search. 

To evaluate whether there is an enrichment of common V-segment within the cluster over
the degree that would be expected from an unbiased sampling of TCRs of cluster size n
from the total dataset, we perform repeat random sampling at sample size n from the
total dataset, each time obtaining the V-genes for each clone and calculating a 
 V-gene Simpson's Diversity Index D for each sample, where D is interpreted as the probability
that any two members within the cluster would share a V-gene and is calculated as

      __V
     \   v(v-1)
     /__1
D = -----------
      __V
     \   n(n-1)
     /__1

where V is the number of all V-genes, v is the total counts of a given v-gene, and n 
is the sampling size (cluster size). 
 
The probability of the observed D for candidate convergence group is obtained as the 
one-tailed probability of observing a score at least that high in the D score distribution
from random sampled clusters of same size n.

== 4) enrichment of CDR3 within cluster ==

For local (motif) scoring, no stratification based on CDR3 length is performed. 

To evaluate whether there is an enrichment of common CDR3 length within the cluster over
the degree that would be expected from an unbiased sampling of TCRs of cluster size n
from the total dataset, we perform repeat random sampling at sample size n from the
total dataset, each time obtaining the CDR3 lengths for each clone and calculating a
 V-gene Simpson's Diversity Index D for each sample, where D is interpreted as the probability
that any two members within the cluster would share a CDR3length and is calculated as

      __C
     \   c(c-1)
     /__1
D = -----------
      __C
     \   n(n-1)
     /__1

where C is the number of all CDR3 lengths, c is the total counts of a given CDR3 length, 
and n is the sampling size (cluster size).

The probability of the observed D for candidate convergence group is obtained as the
one-tailed probability of observing a score at least that high in the D score distribution
from random sampled clusters of same size n.


== 5) enrichment of clonal expansion within cluster ==

The first step in GLIPH is to make all CDR3s non-redundant, hiding their clone frequences.
To evaluate whether there is an enrichment of expanded clones within a given convergence
group, the number of counts for each clone are recovered, and a convergence group expansion
coefficient e is calculated as

e = total clones / total unique clones

To evaluate whether there is an enrichment of expanded clones within the cluster over
the degree that would be expected from an unbiased sampling of TCRs of cluster size n
from the total dataset, we perform repeat random sampling at sample size n from the
total dataset, each time obtaining the clone counts for each clone and calculating e
for each random sample to establish a distribution. 

The probability of the observed e for candidate convergence group is obtained as the
one-tailed probability of observing a score at least that high in the e score distribution
from random sampled clusters of same size n.

== 6) enrichment of common HLA among donor TCR contributors in cluster ==

TCRs that recognize a common antigen should be constrained by the same HLA. Thus, their
contributing donors should contain that HLA allele in their genotype. 

When HLA genotype information is available, GLIPH uses combinatorial resampling without
replacement to estimate the probability that the collection of TCRs in the convergent
group recognizes any given HLA. For each HLA allele that appears at least twice in the
candidate convergence group, GLIPH obtains

a the number of subjects in the convergence group that harbor that allele
A the number of subjects in the study that harbor that allele 
n the number of subjects in the convergence group
N the number of subjects in the study

GLIPH uses this information to calculate the probability that a given HLA allele is
present by chance.
 
Examples
--------

To run GLIPH on a TCR_TABLE file mytcrtable.txt, run:
gliph-group-discovery.pl --tcr mytcrtable.txt

Tun run GLIPH on a list of CDR3s mycdr3list.txt, run:
gliph-group-discovery.pl --tcr mycdr3list.txt

To run GLIPH with an alternative mouse reference DB mouseDB.fa, run:
gliph-group-discovery.pl --tcr mytcrtable --refdb=mouseDB.fa
                                       
To run GLIPH slower with a more thorough simdepth, run
gliph-group-discovery.pl --tcr mytcrtable.txt --simdepth=10000

To run GLIPH slower with a more thorough simdepth and an altered lcminp, run
gliph-group-discovery.pl --tcr mytcrtable.txt --simdepth=10000 --lcminp=0.001

To score GLIPH clusters, run 
  gliph-group-scoring.pl --convergence_file TCR_TABLE-convergence-groups.txt \
                         --clone_annotations TCR_TABLE \
                         --hla_file HLA_TABLE \
                         --motif_pval_file TCR_TABLE.minp.ove10.txt