Protein function prediction with GO #36
Protein Preprocessing Statistics
These are the statistics for the proteins that were ignored during preprocessing because they contain invalid amino acids or have a sequence length greater than 1002, as per the guidelines outlined in the paper:
The number of ignored proteins is negligible compared to the size of the whole dataset. I have attached the CSV file listing the IDs of the ignored proteins for reference. |
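A minimal sketch of the two filters described above, for orientation only. It assumes that "valid" means the 20 standard amino acid letters (the paper's exact alphabet may differ), and the function name `ignore_reason` and the toy IDs and sequences are made up for illustration.

```python
# Assumption: only the 20 standard amino acids count as valid.
VALID_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
MAX_SEQUENCE_LENGTH = 1002  # length limit quoted in the comment above


def ignore_reason(sequence, max_len=MAX_SEQUENCE_LENGTH):
    """Return why a sequence would be ignored, or None if it is kept."""
    if len(sequence) > max_len:
        return "too_long"
    if not set(sequence) <= VALID_AMINO_ACIDS:
        return "invalid_amino_acid"
    return None


# Toy data (hypothetical IDs and sequences)
proteins = {"P1": "MKTAYIAK", "P2": "MKXBZ", "P3": "A" * 1003}
reasons = {pid: ignore_reason(seq) for pid, seq in proteins.items()}
kept_ids = [pid for pid, reason in reasons.items() if reason is None]
```

Recording the reason per protein, rather than only a boolean, makes it straightforward to produce per-category statistics like the ones reported here.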
Shorten input sequence lengths:
|
Note: The issue will be implemented in 3 PRs: |
From #39 (comment):
Excluding long sequences instead of truncating them seems to be the better option. We should implement that. |
I performed a quick analysis by truncating all protein sequences to different maximum lengths and then checking for duplicates among the truncated sequences. The results show that the number of duplicated rows grows as the maximum sequence length is reduced, from 476 for the full sequences to 1011 at a maximum length of 200. Here are the results for different truncation lengths:

```python
import pandas as pd

df = pd.DataFrame(pd.read_pickle("data/GO_UniProt/GO250_BP/processed/data.pkl"))

# Truncated copies of each sequence at several maximum lengths
df['first_1002'] = df['sequence'].str[:1002]
df['first_700'] = df['sequence'].str[:700]
df['first_500'] = df['sequence'].str[:500]
df['first_200'] = df['sequence'].str[:200]

# Checking for duplicates in the original sequences
# (rows whose sequence occurs more than once)
df.groupby('sequence').filter(lambda x: len(x) > 1).shape
# (476, 1053)

# Checking for duplicates with truncated sequences
df.groupby('first_1002').filter(lambda x: len(x) > 1).shape
# (480, 1053)
df.groupby('first_700').filter(lambda x: len(x) > 1).shape
# (503, 1053)
df.groupby('first_500').filter(lambda x: len(x) > 1).shape
# (545, 1053)
df.groupby('first_200').filter(lambda x: len(x) > 1).shape
# (1011, 1053)
```
|
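The same check can be reproduced on toy data without the project's pickle file. The sketch below uses only the standard library; the helper name `count_duplicated_rows` and the example sequences are hypothetical, and it counts rows the same way as the `groupby(...).filter(...)` calls (every member of a group with more than one row).

```python
from collections import Counter


def count_duplicated_rows(sequences, max_len=None):
    """Count sequences whose (optionally truncated) form occurs more than once."""
    truncated = [seq[:max_len] for seq in sequences]  # seq[:None] keeps the full sequence
    counts = Counter(truncated)
    return sum(1 for seq in truncated if counts[seq] > 1)


# Toy sequences (made up): two are identical, a third differs only in the last residue
sequences = ["MKTAYIAKQR", "MKTAYIAKQL", "MKTAGGGGGG", "MKTAYIAKQR"]
results = {max_len: count_duplicated_rows(sequences, max_len) for max_len in (None, 9, 4)}
# results == {None: 2, 9: 3, 4: 4}
```

As in the real data, shorter truncation lengths merge more sequences into the same prefix, which is the argument for excluding long sequences rather than truncating them.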
I still get models that don't train at all, even when reducing the input sequence length. Two things we should do that might help with the GO task:
|
I finally found the problem: The labels were incomplete. The dataset only included the direct labels (as assigned by SwissProt), but ignored the transitive labels (all the superclasses of the direct labels). However, when deciding whether to include a class (based on having at least N samples), the transitive labels were used. This resulted in
That explains why the model didn't learn anything. I fixed this in 6511086 and now it is learning better. (Also, starting with an easier task helps - the |
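To illustrate the label problem described above, here is a small sketch of extending direct labels with all transitive superclasses. It is not the code from the fix; the toy hierarchy, the placeholder term names, and the function `with_superclasses` are assumptions for illustration.

```python
from collections import Counter


def with_superclasses(direct_labels, parents):
    """Return the direct labels plus all of their transitive superclasses."""
    closure = set()
    stack = list(direct_labels)
    while stack:
        term = stack.pop()
        if term not in closure:
            closure.add(term)
            stack.extend(parents.get(term, ()))
    return closure


# Hypothetical toy hierarchy: child term -> direct parent terms
parents = {"GO:C": ["GO:B"], "GO:B": ["GO:A"], "GO:D": ["GO:A"]}
# Direct labels per protein (placeholder IDs)
direct = {"P1": {"GO:C"}, "P2": {"GO:D"}}

full = {pid: with_superclasses(labels, parents) for pid, labels in direct.items()}

# Samples per class with direct labels only vs. with transitive labels
direct_counts = Counter(term for labels in direct.values() for term in labels)
full_counts = Counter(term for labels in full.values() for term in labels)
```

In this toy example `GO:A` has two samples when transitive labels are counted but none when only direct labels are stored, which is the kind of mismatch between class selection and the stored labels that the comment describes.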
Updated Protein Preprocessing Statistics
These are the updated statistics for the proteins that were ignored during preprocessing due to invalid amino acids, sequence lengths greater than 1002, or no valid associated GO IDs.
Note: A valid GO ID is one that has one of the following evidence codes, as per the paper: "EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC".
Ignored Proteins Stats:
Percentages (with respect to number of proteins associated with valid GO IDs):
Invalid/No Association
|
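A sketch of the evidence-code filter described in the note above. The set of codes is taken from that note; the annotation format (pairs of GO ID and evidence code), the placeholder IDs, and the function name `valid_go_ids` are assumptions for illustration.

```python
# Evidence codes listed in the note above
VALID_EVIDENCE_CODES = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"})


def valid_go_ids(annotations, valid_codes=VALID_EVIDENCE_CODES):
    """Keep only the GO IDs whose evidence code is in the accepted set."""
    return {go_id for go_id, code in annotations if code in valid_codes}


# Toy annotations as (GO ID, evidence code) pairs; IDs are placeholders.
# IEA (inferred from electronic annotation) is not in the accepted set.
annotations = {
    "P1": [("GO:A", "IDA"), ("GO:B", "IEA")],
    "P2": [("GO:C", "IEA")],
    "P3": [],
}
labels = {pid: valid_go_ids(annots) for pid, annots in annotations.items()}
no_valid_labels = sorted(pid for pid, ids in labels.items() if not ids)
```

Here `P2` (only a non-valid code) and `P3` (no annotation at all) both end up without valid GO IDs, corresponding to the "no valid associated GO IDs" category.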
According to the statistics, 20,737 proteins are not annotated with any GO labels. These proteins can be used for pretraining in our model pipeline. List of protein IDs (swiss_ids) that are not annotated with any GO label: no_go_id_proteins.csv |
That is good news. I would also use the proteins with non-valid experimental codes. Since we are not using them in finetuning, we might as well use them for pretraining. That would give us ~500,000 proteins, more than enough for pretraining. |
Pretraining Protein Dataset Statistics
Current Filtering Approach for Pretraining
Currently, the pretraining setup filters proteins based on:
This includes 493,688 proteins.
Question for Additional Filtering
Should we add an additional filter to include only proteins with a sequence length of ≤ 1002 (the maximum length being a hyperparameter with default 1002), similar to the approach we use in training? Also, please review the code for pretraining, |
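One way the length filter asked about above could look, as a sketch only. It assumes, based on the earlier discussion, that pretraining uses proteins without any validly annotated GO label; the record layout, the function name `select_for_pretraining`, and the toy data are hypothetical.

```python
def select_for_pretraining(records, max_len=1002):
    """Select protein IDs for pretraining.

    records: iterable of (swiss_id, sequence, has_valid_go_label) tuples.
    max_len: maximum sequence length; None disables the length filter.
    """
    selected = []
    for swiss_id, sequence, has_valid_go_label in records:
        if has_valid_go_label:
            continue  # assumed to be reserved for the supervised GO task
        if max_len is not None and len(sequence) > max_len:
            continue  # excluded rather than truncated
        selected.append(swiss_id)
    return selected


# Toy records (made-up IDs and sequences)
records = [
    ("P1", "MKTAYIAK", False),
    ("P2", "MKTAYIAK", True),
    ("P3", "A" * 1500, False),
]
with_filter = select_for_pretraining(records)
without_filter = select_for_pretraining(records, max_len=None)
```

Keeping `max_len` as a parameter with `None` as an opt-out lets both variants be compared without changing the code.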
The next steps here are:
|
While working on this, I observed some key differences between the original DeepGO paper and the latest DeepGO-SE paper:
Given these changes, should we update the experimental codes and amino acid definitions in our implementation to align with the latest paper? |
Also, while reviewing the latest (DeepGO-SE) paper, I noticed that the authors use a specific approach to dataset splitting. Here’s the relevant excerpt from the paper:
|
For the experimental codes and valid amino acids, I would say yes, we should update that. I don't have the expert knowledge to assess whether the H- ("high throughput") experimental codes are more or less reliable than the others, but if that is what DeepGO-SE uses, it should be fine. And for the splits: the method sounds interesting and might influence results (testing on random data might give better results than testing on data that is "different" from the training data). Therefore, we should use the same splits they used when training with their data. In the best case, the splits are included in the dataset they provide. |
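If the splits turn out not to be shipped with the dataset, a generic group-wise split is one fallback. The sketch below is not the DeepGO-SE procedure; it only assumes that each protein already has a cluster ID from some similarity clustering, and the function name `split_by_cluster`, the fractions, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/validation/test so no cluster is shared."""
    members = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        members[cluster_id].append(protein_id)

    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)

    # Fractions apply to the number of clusters, not the number of proteins
    n_train = int(fractions[0] * len(cluster_ids))
    n_val = int(fractions[1] * len(cluster_ids))
    parts = {
        "train": cluster_ids[:n_train],
        "validation": cluster_ids[n_train:n_train + n_val],
        "test": cluster_ids[n_train + n_val:],
    }
    return {name: sorted(p for c in ids for p in members[c]) for name, ids in parts.items()}


# Toy data: 20 proteins in 10 clusters of two (hypothetical IDs)
cluster_of = {f"P{i}": f"C{i // 2}" for i in range(20)}
splits = split_by_cluster(cluster_of)
```

Because similar proteins share a cluster and clusters are never divided, the test set contains only proteins from clusters unseen during training, which is the "different data" effect mentioned in the comment.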
Here is the link to access data from all relevant DeepGO papers: |
There’s a significant implementation difference between the two models regarding how they incorporate Protein-Protein Interaction (PPI) networks:
Quote from paper:
Quote from paper:
https://github.com/bio-ontology-research-group/deepgo2/blob/main/train_gat.py#L119 Question: |
DeepGO2 Invalid Amino acid handling using "X" amino acid notation: #64 (comment) |
Until now, we have only used our framework for ChEBI, but in principle, it should also be applicable to other data sets and prediction tasks. One such task is the prediction of protein functions as specified by the Gene Ontology in combination with protein data from UniProtKB. As an orientation, we can use the DeepGO paper which proposes a solution for this exact task. The goal is to apply our model to the GO / UniProtKB datasets and compare the results to those of DeepGO.
Tasks