AllanSCosta/semantic-protein-folding

Semantic Protein Folding

A pipeline for Protein Language Modeling + Protein Folding experimentation

Based on Distillation of MSA Embeddings to Protein Folded Structures (biorxiv preprint and full latest text).

This repository stands on the shoulders of giant work by the scientific community:

For experimental and prototypical access to internal code, these repos are collected under building_blocks (except sidechainnet). As development progresses they will be incorporated as original imports.

Experimentation Pipeline

General

  • debug: whether to use wandb logging
  • submit: submit training to a SLURM scheduler
  • name: name of experiment
  • note: experimental note
  • report_frequency: how often to log metrics in log file

Data

  • dataset_source: path to sidechainnet-formatted dataset
  • downsample: whether to uniformly at random downsample dataset
  • seq_clamp: clamp data at the sequence level, size of clamp
  • max_seq_len: throw out data with sequences larger than max_seq_len
  • num_workers: number of CPU workers for data fetching and loading
  • batch_size: batch size
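As a rough illustration of how the filtering options above could interact, here is a minimal sketch on plain sequence strings. The helper `prepare_sequences` is hypothetical and not part of this repository; treating `downsample` as a fraction and `seq_clamp` as a prefix crop are assumptions made only for the example.

```python
import random


def prepare_sequences(sequences, max_seq_len, seq_clamp, downsample=None, seed=0):
    """Hypothetical helper: apply the Data options to a list of sequence strings."""
    # max_seq_len: throw out sequences longer than the limit.
    kept = [s for s in sequences if len(s) <= max_seq_len]

    # downsample: keep a uniformly random subset.
    # Assumption: expressed here as a fraction in (0, 1].
    if downsample is not None and kept:
        rng = random.Random(seed)
        k = min(len(kept), max(1, int(len(kept) * downsample)))
        kept = rng.sample(kept, k)

    # seq_clamp: clamp each remaining sequence to at most seq_clamp residues.
    # Assumption: a prefix crop; a random window is an equally valid reading.
    return [s[:seq_clamp] for s in kept]


if __name__ == "__main__":
    data = ["A" * 5, "C" * 50, "G" * 500]
    print([len(s) for s in prepare_sequences(data, max_seq_len=100, seq_clamp=20)])
```

Filtering before clamping means a very long sequence is dropped rather than cropped; the opposite order would keep it, so the order matters if both options are set.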

Architecture

  • wipe_edge_information: drops out all edge features h_ij

  • topography_giveaway: instead of providing language-model-based h_ij, produces h_ij from ground-truth distances and orientations

  • giveaway_distance_resolution: number of bins of relative distance information to input

  • giveaway_angle_resolution: number of bins of relative orientation information to input

  • wiring_checkpoint: checkpoint the in-between Dense networks of the model

  • use_msa: use ESM-MSA-1 embeddings

  • use_seq: use ESM-1b embeddings

  • use_at: process h_ij with an Axial Transformer after distillation

  • use_gt: project 3D coordinates with Graph Transformer after distillation

  • use_en: refine with E(n)-Transformer given coords
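The `use_*` flags above suggest a staged pipeline. The sketch below lists the stages in one plausible order, inferred only from the option descriptions in this README (including the `gaussian_noise` note under EN TRANSFORMER); the function `pipeline_stages` and the stage labels are hypothetical and do not come from the repository's code.

```python
def pipeline_stages(use_msa, use_seq, use_at, use_gt, use_en):
    """Return the ordered stage names implied by the architecture flags (a sketch)."""
    stages = []
    if use_msa:
        stages.append("distill ESM-MSA-1 embeddings")
    if use_seq:
        stages.append("distill ESM-1b embeddings")
    if use_msa and use_seq:
        stages.append("ensemble seq + MSA")
    if use_at:
        stages.append("axial transformer on h_ij")
    if use_gt:
        stages.append("graph transformer -> 3D coordinates")
    elif use_en:
        # Without the graph transformer, the README says the E(n) stage
        # starts from a backbone with gaussian noise added.
        stages.append("gaussian noise -> starting coordinates")
    if use_en:
        stages.append("E(n)-transformer refinement")
    return stages


if __name__ == "__main__":
    for stage in pipeline_stages(True, False, True, True, True):
        print(stage)
```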

ESM-MSA-1 Distillation

  • node_msa_distill_layers: layer sizes of the Dense network for MSA node information extraction, e.g. [768, 256, 256, 128]
  • edge_msa_distill_layers: layer sizes of the Dense network for MSA edge information extraction, e.g. [96, 64, 64]

ESM-1b Distillation

  • node_seq_distill_layers: layer sizes of the Dense network for seq node information extraction, e.g. [1280, 256, 128]
  • edge_seq_distill_layers: layer sizes of the Dense network for seq edge information extraction, e.g. [160, 64, 64]

Seq + MSA ensemble

  • node_ens_distill_layers: layer sizes of the Dense network for ensemble node information extraction, e.g. [128, 128, 128]
  • edge_ens_distill_layers: layer sizes of the Dense network for ensemble edge information extraction, e.g. [64, 64]
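To make the layer lists in the three distillation sections concrete, the sketch below expands a list into consecutive (input, output) widths and counts the weights a plain fully connected stack would need. It assumes the first entry is the input width, which fits the node lists starting at 768 and 1280 (the embedding widths of ESM-MSA-1 and ESM-1b) but is not confirmed by this README. Both helpers are hypothetical.

```python
def dense_layer_shapes(layers):
    """Pair consecutive widths: [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(layers[:-1], layers[1:]))


def dense_parameter_count(layers):
    """Weights plus biases of a plain linear stack with these widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in dense_layer_shapes(layers))


if __name__ == "__main__":
    node_msa_distill_layers = [768, 256, 256, 128]
    print(dense_layer_shapes(node_msa_distill_layers))
    print(dense_parameter_count(node_msa_distill_layers))
```

Under this reading a two-entry list such as [64, 64] describes a single linear layer.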

AXIAL TRANSFORMER

  • at_checkpoint: if the axial transformer should be checkpointed
  • at_dim: axial transformer dim
  • at_depth: axial transformer depth
  • at_heads: axial transformer number of attention heads
  • at_dim_head: axial transformer dim head
  • at_window_size: axial transformer window size (for internal Long-Short optimization)

GRAPH TRANSFORMER

  • gt_checkpoint: graph transformer checkpoint
  • gt_dim: graph transformer dim
  • gt_edim: graph transformer edge dim
  • gt_depth: graph transformer depth
  • gt_heads: graph transformer number of heads
  • gt_dim_head: graph transformer dim head

EN TRANSFORMER

  • gaussian_noise: if the graph transformer is not used, the gaussian noise added to the backbone as a starting point
  • et_checkpoint: checkpoint en transformer
  • et_dim: dim of en transformer
  • et_edim: en transformer edge dim
  • et_depth: en transformer depth
  • et_heads: en transformer num heads
  • et_dim_head: en transformer dim head
  • et_coors_hidden_dim: hidden dim of internal coordinate-head mixer
  • en_num_neighbors: num neighbors to consider in 3d space
  • en_num_seq_neighbors: num neighbors to consider in sequence space
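The two neighbor options above can be pictured as the union of two neighborhoods per residue. The sketch below is a pure-Python illustration under two assumptions: `en_num_neighbors` picks the k nearest residues by Euclidean distance, and `en_num_seq_neighbors` picks residues within that many positions along the chain. The function `neighbor_sets` is hypothetical and is not the repository's implementation.

```python
import math


def neighbor_sets(coords, num_neighbors, num_seq_neighbors):
    """For each residue, return sorted indices of its spatial and sequence neighbors."""
    n = len(coords)
    result = []
    for i in range(n):
        # Sequence neighbors: residues within num_seq_neighbors positions of i.
        seq = {j for j in range(n) if j != i and abs(i - j) <= num_seq_neighbors}
        # Spatial neighbors: the num_neighbors closest residues in 3D.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )
        spatial = set(others[:num_neighbors])
        result.append(sorted(seq | spatial))
    return result


if __name__ == "__main__":
    # Residues 0 and 3 are far apart in sequence but close in space.
    coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
    print(neighbor_sets(coords, num_neighbors=1, num_seq_neighbors=1))
```

Keeping sequence neighbors alongside spatial ones guarantees that adjacent residues stay connected even when the current coordinates are poor.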

FOLDING STEPS

  • unroll_steps: during training, applies the en transformer without gradients for N steps, where N ~ U(0, unroll_steps) is sampled anew for each batch
  • train_fold_steps: during training, how many en transformer iterations to perform with gradients
  • eval_fold_steps: during testing, how many en transformer iterations to perform
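The training-time schedule described above can be sketched as two loops. This is a framework-free illustration: `refine` stands in for one en transformer iteration, and in a PyTorch trainer the first loop would typically run under `torch.no_grad()`. The function `fold_training_schedule` is hypothetical, and sampling N as an integer with both ends included is an assumption.

```python
import random


def fold_training_schedule(coords, refine, unroll_steps, train_fold_steps, rng):
    """Run N untracked refinement steps, then train_fold_steps tracked ones."""
    # N ~ U(0, unroll_steps), drawn once per batch (both ends included here).
    n_unroll = rng.randint(0, unroll_steps)

    # Unrolled steps: applied without gradients in a real trainer.
    for _ in range(n_unroll):
        coords = refine(coords)

    # Tracked steps: these iterations would contribute to the loss.
    trajectory = []
    for _ in range(train_fold_steps):
        coords = refine(coords)
        trajectory.append(coords)
    return n_unroll, trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "refinement": a counter that increments once per iteration.
    n, trajectory = fold_training_schedule(0, lambda x: x + 1, 5, 3, rng)
    print(n, trajectory)
```

Randomizing the number of untracked steps exposes the tracked iterations to starting points at different stages of refinement while keeping the backpropagated depth fixed at `train_fold_steps`.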

PREDICTIONS

  • angle_number_of_bins: number of bins to use for predicting relative orientations
  • distance_number_of_bins: number of bins to use for predicting relative distances
  • distance_max_radius: maximum radius for predicting relative distances
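The binning options above turn continuous targets into class labels. The sketch below shows one simple scheme: uniform distance bins on [0, distance_max_radius] with everything beyond the radius falling into the last bin, and uniform angle bins on [-pi, pi). Both choices are assumptions, and the helper names are hypothetical.

```python
import math


def distance_bin(distance, distance_number_of_bins, distance_max_radius):
    """Index of the uniform bin containing `distance`, clipped to the last bin."""
    width = distance_max_radius / distance_number_of_bins
    return min(int(distance // width), distance_number_of_bins - 1)


def angle_bin(angle, angle_number_of_bins):
    """Index of the uniform bin containing `angle` (radians), wrapped to [-pi, pi)."""
    fraction = ((angle + math.pi) % (2 * math.pi)) / (2 * math.pi)
    return min(int(fraction * angle_number_of_bins), angle_number_of_bins - 1)


if __name__ == "__main__":
    print(distance_bin(3.9, distance_number_of_bins=10, distance_max_radius=20.0))
    print(angle_bin(0.0, angle_number_of_bins=8))
```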

OPTIM

  • lr: learning rate

  • at_loss_coeff: axial transformer loss coefficient

  • gt_loss_coeff: graph transformer loss coefficient

  • et_loss_coeff: en transformer loss coefficient

  • et_drmsd: use drmsd for en transformer

  • max_epochs: number of epochs

  • validation_check_rate: how often to perform validation checks

  • validation_start: when to start validating
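One plausible reading of the three `*_loss_coeff` options is a weighted sum of per-module losses. The sketch below illustrates that reading only; the function `combine_losses` and the dictionary keys are hypothetical, and the repository may combine its losses differently.

```python
def combine_losses(losses, coeffs):
    """Weighted sum over the modules that produced a loss.

    losses: module name -> loss value, only for modules that are enabled
    coeffs: module name -> coefficient (at_loss_coeff, gt_loss_coeff, et_loss_coeff)
    """
    return sum(coeffs[name] * value for name, value in losses.items())


if __name__ == "__main__":
    coeffs = {"at": 0.5, "gt": 0.25, "et": 1.0}
    # Here the graph transformer is disabled, so it contributes nothing.
    print(combine_losses({"at": 2.0, "et": 1.0}, coeffs))
```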

STOCHASTICITY

  • coordinate_reset_prob: legacy, will be removed
  • msa_wipe_out_prob: probability of selecting MSA embeddings for wipe-out
  • msa_wipe_out_dropout: dropout of edge and node information for selected MSA embeddings
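The two wipe-out options above describe a two-stage random process: first decide whether an MSA embedding is selected, then drop out entries of the selected one. The sketch below illustrates this on a flat list of floats. The function `wipe_out_msa` is hypothetical, and it zeroes entries without the rescaling that standard dropout applies, which is a simplification.

```python
import random


def wipe_out_msa(features, msa_wipe_out_prob, msa_wipe_out_dropout, rng):
    """Return a copy of `features`, with entries zeroed if this sample is selected."""
    # Stage 1: select this embedding with probability msa_wipe_out_prob.
    if rng.random() >= msa_wipe_out_prob:
        return list(features)
    # Stage 2: zero each entry independently with probability msa_wipe_out_dropout.
    return [0.0 if rng.random() < msa_wipe_out_dropout else x for x in features]


if __name__ == "__main__":
    rng = random.Random(0)
    print(wipe_out_msa([1.0, 2.0, 3.0, 4.0], 0.5, 0.5, rng))
```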

Test, Retrain

  • test_model: path to model weights for testing
  • retrain_model: path to model weights for retraining
