WES-analysis

A project showing the basic steps of Whole Exome Sequencing (WES) analysis, used to identify and discuss a single rare non-synonymous variant

(Please cite this work if you take help from it)

INTRODUCTION

The objective of this project was to analyze the Whole Exome Sequence (WES) of a female individual of South Asian ancestry (Indian Telugu living in the UK), with the aim of identifying and studying a rare non-synonymous variant. This was achieved by aligning the exome sequence, obtained from the 1000 Genomes Project, to the human reference genome (GRCh38).

Individual's sequence ID and information

a) Sequence ID: HG03973

b) Biosample ID: SAME1839728

c) Sample Run ID: ERR250841

d) Population (ethnicity and geographical location): Indian Telugu in the UK, South Asian Ancestry

e) Gender: Female

f) Cell line source: HG03973 at Coriell

WORKFLOW

Figure 1- Analytical pipeline for WES data analysis

METHODOLOGY

- First, the GRCh38 no-alt analysis set reference genome sequence file was downloaded from the NCBI FTP site:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

- The pre-built BWA index files were downloaded from the same location:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.26_GRCh38/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bwa_index.tar.gz

- The Whole Exome Sequence reads (run ERR250841) of the individual with sequence ID HG03973 were downloaded from the 1000 Genomes data portal:

https://www.internationalgenome.org/data-portal/sample

- The reads were mapped using the bwa mem tool:

bwa mem -M -R '@RG\tID:flowcell\tSM:HG03973' GCA_000001405.15_GRCh38_no_alt_analysis_set.fna ERR250841_1.fastq ERR250841_2.fastq > HG03973bwamem.sam

- Samtools fixmate was used to clean up read-pairing information and flags in the generated SAM file and to convert it to BAM:

samtools fixmate -O bam HG03973bwamem.sam HG03973bwamemfixmate.bam

- Samtools was then used to sort the bam file:

samtools sort -O bam -o HG03973sorted.bam -T /tmp/HG03973temp HG03973bwamemfixmate.bam

- The sorted BAM file was then indexed with Samtools:

samtools index HG03973sorted.bam

- Variants were called with bcftools to generate a compressed VCF file:

bcftools mpileup -Ou -f GCA_000001405.15_GRCh38_no_alt_analysis_set.fna HG03973sorted.bam | bcftools call -vmO z -o HG03973.vcf.gz

- Finally, the vcf file was indexed using tabix:

tabix -p vcf HG03973.vcf.gz

- The decompressed (gunzipped) VCF file, without its index, was uploaded to the wANNOVAR [5] online tool for annotation. The following parameter settings were used:

> Reference genome: hg38

> Gene Definition: RefSeq Gene

> Analysis: Individual

- The list of annotated variants obtained from wANNOVAR was filtered in stages. First, variants with exonic function “nonsynonymous SNV” were selected, which left 11,672 of the 24,608 variants. These were then filtered for ClinVar Significance “Pathogenic/Likely Pathogenic”, which left 21 variants. Of these, only 2 variants had both 1000G_ALL and 1000G_SAS frequencies < 0.01; both also had frequencies < 0.01 in the other 1000 Genomes subpopulations, in ExAC overall, and in all ExAC subpopulations. Between these two, the final variant was selected based on a CADD_phred score > 30, a PolyPhen-2 score close to 1, and a SIFT score < 0.05.
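The filtering stages described above can be sketched as a small shell function. This is only an illustration under stated assumptions, not the procedure actually used in this project: it assumes the annotation was exported as a tab-delimited file with a header row, and the column names (`ExonicFunc`, `ClinVar_SIG`, `1000G_ALL`, `1000G_SAS`, `CADD_phred`, `SIFT_score`, `Polyphen2_HDIV_score`) are hypothetical placeholders that must be matched to the real export. The PolyPhen-2 cutoff of 0.9 stands in for "close to 1", missing frequencies are treated as 0, and the ExAC checks are left out for brevity.

```shell
# Sketch only: every column name below is a hypothetical placeholder;
# replace them with the headers of the actual annotation export.
# Usage: filter_variants annotated.tsv > candidates.tsv
filter_variants() {
  awk -F'\t' '
    # True when an annotation value is present ("." marks a missing value).
    function present(v) { return v != "" && v != "." }
    # Missing population frequencies are treated as 0 (an assumption).
    function freq(v) { return present(v) ? v + 0 : 0 }
    # Accept only Pathogenic and Likely pathogenic ClinVar labels.
    function is_pathogenic(v,   s) {
      s = tolower(v); gsub(/_/, " ", s)
      return s == "pathogenic" || s == "likely pathogenic" || s == "pathogenic/likely pathogenic"
    }
    NR == 1 {                       # map header names to column numbers
      for (i = 1; i <= NF; i++) col[$i] = i
      print; next
    }
    # Rows that satisfy every condition are printed unchanged.
    $(col["ExonicFunc"]) == "nonsynonymous SNV" &&
    is_pathogenic($(col["ClinVar_SIG"])) &&
    freq($(col["1000G_ALL"])) < 0.01 &&
    freq($(col["1000G_SAS"])) < 0.01 &&
    present($(col["CADD_phred"])) && $(col["CADD_phred"]) + 0 > 30 &&
    present($(col["SIFT_score"])) && $(col["SIFT_score"]) + 0 < 0.05 &&
    present($(col["Polyphen2_HDIV_score"])) && $(col["Polyphen2_HDIV_score"]) + 0 > 0.9
  ' "$1"
}
```

Looking columns up by header name rather than by position keeps the sketch usable if the export adds or reorders columns.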

RESULTS AND DISCUSSION

Summary table of variants:

Total number of variants: 24,608

Table 1- Table showing the top 25 variants from all the 24,608 variants ordered by chromosome number

Total number of synonymous variants: 12,050

Table 2- Table showing the top 25 variants from all the 12,050 synonymous variants

Total number of non-synonymous variants: 11,667

Table 3- Table showing the top 25 variants from all the 11,667 nonsynonymous variants

Number of protein-truncating variants: 269

Table 4- Table showing the top 25 variants from all the 269 protein-truncating variants
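Per-category counts like the ones above can be tallied directly from an annotation export. The following is a minimal sketch, assuming a tab-delimited file with a header row; the column name `ExonicFunc` is a hypothetical placeholder, and which categories count as protein-truncating is left to the reader.

```shell
# Sketch only: "ExonicFunc" is a hypothetical column name.
# Usage: tally_exonic_function annotated.tsv
tally_exonic_function() {
  awk -F'\t' -v name="ExonicFunc" '
    NR == 1 {                       # locate the column by its header name
      for (i = 1; i <= NF; i++) if ($i == name) c = i
      if (!c) { print "column not found: " name > "/dev/stderr"; exit 1 }
      next
    }
    { n[$c]++ }                     # count one variant per data row
    END { for (k in n) printf "%s\t%d\n", k, n[k] }
  ' "$1" | sort
}
```

Each output line holds a category and its count, separated by a tab and sorted by category name.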

Table of potentially damaging or pathogenic nonsynonymous variants:

Table 5- List of potentially damaging/pathogenic nonsynonymous variants having ClinVar Significance Pathogenic/Likely Pathogenic along with their SIFT, Polyphen2 HDIV, Polyphen2 HVAR, CADD_raw, CADD_phred scores. Data obtained through the wANNOVAR annotation tool

Table 6- Remaining columns from Table 5. List of potentially damaging/pathogenic nonsynonymous variants having ClinVar Significance Pathogenic/Likely Pathogenic along with their allele frequency in the population of the individual(1000G_SAS) and popmax allele frequency in gnomAD (shown in Bold). Data obtained through the wANNOVAR annotation tool

Table 7- List of potentially damaging/pathogenic nonsynonymous variants having ClinVar Significance Pathogenic/Likely Pathogenic along with their allele frequency in 1000G_ALL and 1000G different sub-populations

Damaging variant information:

Selected single nonsynonymous variant that is predicted to be pathogenic, with low allele frequency (allele frequency less than 1%):

Gene: NUP93
Protein name: Nuclear pore complex protein Nup93
Variant ID (rs number): rs145146218
ClinVar ID: RCV000210641.1
Exonic function: nonsynonymous SNV
Variant genotype (DNA change and amino acid change):
- DNA change: C to T (start and stop position 56831918)
- Amino acid change: R (Arginine) to W (Tryptophan) at position 388
Variant frequency in the overall human population and in different ethnic/regional populations:

According to 1000Genomes:

1000G_All (variant frequency in overall population): 0.001
1000G_SAS (variant frequency in South Asian population): 0.003
1000G_EUR (variant frequency in European population): 0.002
Other populations (1000G_AFRICAN, AMERICAN, EAST ASIAN): 0.0
The highest frequency is in the South Asian population (T = 0.003)

Table 8– Variant frequency in overall and different human sub-populations according to 1000Genomes[1]

SIFT score: 0.001
Polyphen2_HDIV_score: 0.999
Polyphen2_HVAR_score: 0.94
CADD_raw: 7.534
CADD_phred: 34

Damaging variant analysis:

The nuclear pore complex is a massive structure that extends across the nuclear envelope, forming a gateway that regulates the flow of macromolecules between the nucleus and the cytoplasm. Nucleoporins are the main components of the nuclear pore complex in eukaryotic cells. The gene NUP93 encodes a nucleoporin protein that localizes both to the basket of the pore and to the nuclear entry of the central gated channel of the pore. The encoded protein is a target of caspase cysteine proteases that play a central role in programmed cell death by apoptosis. Alternative splicing results in multiple transcript variants encoding different isoforms[2].

The likely effect of the variant is that the mutant protein is unable to assemble into the nuclear pore complex, as reported by Braun et al. (2016) [3] in a study on Xenopus laevis egg extracts. The variant is implicated in steroid-resistant nephrotic syndrome (SRNS), a disease of the renal glomerular filter in humans and the second most frequent cause of end-stage kidney disease (ESKD) in the first 3 decades of life [3]. The variant has been reported in a Serbian girl with nephrotic syndrome type 12 who carried compound heterozygous mutations in the NUP93 gene; each unaffected parent was heterozygous for one of the mutations. Braun et al. (2016) studied different NUP93 variants in vitro in human podocytes, and some of the mutant proteins were not able to localize properly to the nuclear envelope. The R388W mutant protein localized properly to the nuclear envelope in human podocytes but was unable to restore nuclear envelope and nuclear pore complex assembly in NUP93-depleted Xenopus egg extracts. The mutation also impaired BMP7/SMAD4-dependent gene transcription [3], [4].

Previous citations on the R388W variant and any other variant on the same gene:

This variant (R388W) has been cited in one paper (PMID: 26878725), which describes its predicted role in steroid-resistant nephrotic syndrome through interference with BMP7-dependent SMAD signaling and the likely inability of the mutant nucleoporin to form the nuclear pore complex, which regulates the flow of macromolecules between the nucleus and the cytoplasm [3].

The same paper (PMID: 26878725) by Braun et al. (2016) reported four more damaging variants (GLY591VAL, TYR629CYS, 1-BP DEL 1326G, IVS13DS G-A +1) on the same gene.

The mutations GLY591VAL (reported in 2 siblings, born of consanguineous Turkish parents, with nephrotic syndrome type 12) and TYR629CYS (reported in 2 unrelated boys, each born of consanguineous Turkish parents, with nephrotic syndrome type 12) were shown, through in vitro functional studies, to abrogate the normal interaction of NUP93 with the phosphorylated, activated forms of SMAD1 and SMAD5 and with the nuclear import receptor IPO7, and to impair BMP7/SMAD4-dependent gene transcription [3].

The mutation 1-BP DEL, 1326G (reported in a German girl with nephrotic syndrome type 12) is a 1-bp deletion (c.1326delG, NM_014669.4) in exon 12 that results in a frameshift and premature termination (Lys442AsnfsTer14), producing a truncated protein. The mutation IVS13DS, G-A, +1 (reported in a German girl with nephrotic syndrome type 12) is a G-to-A transition (c.1537+1G-A, NM_014669.4) that results in a splice site mutation and the in-frame skipping of exon 13. Through in vitro studies, Braun et al. (2016) found that in both cases the mutant protein failed to localize properly to the nuclear envelope in human podocytes and was unable to restore nuclear envelope and nuclear pore complex assembly in NUP93-depleted Xenopus egg extracts. Both mutations also abrogated the normal interaction of NUP93 with the phosphorylated, activated forms of SMAD1 and SMAD5 and with the nuclear import receptor IPO7, and impaired BMP7/SMAD4-dependent gene transcription [3].

Structural model of the protein:

Figure 2- Nuclear pore complex protein Nup93. The colors indicate the Model confidence (Navy Blue-Very high (pLDDT* > 90), Sky Blue-Confident (90 > pLDDT > 70), Yellow-Low (70 > pLDDT > 50), Orange-Very low (pLDDT < 50)). *pLDDT- per-residue confidence score

Visualization of altered amino acid on the structural model:

Figure 3- Nup93 protein highlighting (red circle) the variant amino acid substitution position

The structures are taken from the AlphaFold Protein Structure Database [6]. AlphaFold produces a per-residue confidence score (pLDDT) between 0 and 100 [6]. Position 388 has a pLDDT of 89.92, which falls in the “Confident” category and is considered a good score. The regions surrounding the variant site have pLDDT values of around 90, at the threshold of the “Very high” confidence category of the AlphaFold Protein Structure Database [6]. The model can therefore be considered reliable overall.
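The pLDDT bands listed in the Figure 2 caption can be expressed as a small helper for checking individual residues. This is a minimal sketch: the function name `plddt_band` is hypothetical, and because the caption gives only strict inequalities, assigning a score of exactly 90, 70, or 50 to the lower band is an assumption.

```shell
# Sketch only: maps a pLDDT score (0-100) to the confidence bands
# named in the Figure 2 caption. Boundary handling is an assumption.
# Usage: plddt_band 89.92
plddt_band() {
  awk -v s="$1" 'BEGIN {
    s = s + 0                       # force numeric comparison
    if (s > 90)      print "Very high"
    else if (s > 70) print "Confident"
    else if (s > 50) print "Low"
    else             print "Very low"
  }'
}
```

Called with the score reported for position 388 (`plddt_band 89.92`), the helper returns the “Confident” band.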

The altered amino acid is not exactly on the protein surface but lies slightly inward. Because the variant is a single base substitution in the codon at position 388 and does not introduce a protein-truncating change, the overall structure of the protein is less likely to be altered. However, the substitution of the positively charged amino acid Arginine with the neutral amino acid Tryptophan is likely to affect amino acid chain interactions and may compromise or alter the secondary and tertiary structures of the protein, which could lead to abnormal interactions and eventually disease. It has already been reported [4] that this substitution changes the functioning of the Nup93 protein and is likely involved in the abnormal interaction of NUP93 with the phosphorylated, activated forms of SMAD1 and SMAD5 and with the nuclear import receptor IPO7 in steroid-resistant nephrotic syndrome, and that it impairs BMP7/SMAD4-dependent gene transcription [3].

Figure 4- Slightly rotated view of the protein structure in Figure 2, for better visualization of the variant amino acid position

REFERENCES

[1] “rs145146218 RefSNP Report - dbSNP - NCBI.” https://www.ncbi.nlm.nih.gov/snp/rs145146218/#frequency_tab (accessed Nov. 29, 2021).

[2] “NUP93 nucleoporin 93 [Homo sapiens (human)] - Gene - NCBI.” https://www.ncbi.nlm.nih.gov/gene/9688 (accessed Nov. 23, 2021).

[3] D. A. Braun et al., “Mutations in nuclear pore genes NUP93, NUP205 and XPO5 cause steroid-resistant nephrotic syndrome,” Nature Genetics, vol. 48, no. 4, pp. 457–465, Mar. 2016, doi: 10.1038/NG.3512.

[4] “OMIM Entry - * 614351 - NUCLEOPORIN, 93-KD; NUP93.” https://www.omim.org/entry/614351?search=nup93&highlight=nup93 (accessed Nov. 23, 2021).

[5] “wANNOVAR.” https://wannovar.wglab.org/ (accessed Nov. 29, 2021).

[6] “AlphaFold Protein Structure Database.” https://alphafold.ebi.ac.uk/entry/Q8N1F7 (accessed Nov. 27, 2021).
