Extract edge weights from graph file #945

Open
gauravdiwan89 opened this issue Nov 20, 2024 · 5 comments

Comments

@gauravdiwan89

Hello,

Is there a way to extract the MCL graph edge weights between pairs of proteins that have been declared as orthologues?

As far as I can tell, the OrthoFinder_graph.txt file should have these. However, is there a way to connect each weight in this file with the pair of proteins whose connection it describes?

Or maybe I'm misunderstanding something completely.

Any help would be appreciated.

@Jonathan-Holmes-Bioinformatics

Hi gauravdiwan89,

The format of the OrthoFinder_graph.txt should be:
geneID gene1ID:weight gene2ID:weight

As far as I am aware, the gene IDs used in OrthoFinder_graph.txt refer to rows in SequenceIDs.txt, which contains the gene ID and protein name pairs. So first find the row numbers (counting from 0) of the proteins you want to look at in SequenceIDs.txt. You can then parse the rows of OrthoFinder_graph.txt that start with those numbers and pull the weights from there.

If you need further help implementing this, I would be happy to provide a Python function that does it.
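As a rough illustration of the lookup described above, here is a minimal sketch that reads SequenceIDs.txt into a list, so that a 0-based row number gives back the gene ID and protein name. It assumes each row has the layout `speciesID_geneID: name`; the function name `load_sequence_ids` and the example rows are made up for illustration.

```python
def load_sequence_ids(lines):
    """Return a list where position i holds (internal ID, name) for row i."""
    table = []
    for line in lines:
        # Assumed row layout: "speciesID_geneID: original name and description"
        internal_id, _, name = line.rstrip("\r\n").partition(": ")
        table.append((internal_id, name))
    return table


# Made-up rows; with a real run you would pass open("SequenceIDs.txt") instead
example_rows = ["0_0: protA some description\n", "0_1: protB\n", "1_0: protC\n"]
ids = load_sequence_ids(example_rows)
print(ids[2])  # the entry for graph index 2
```

Keeping the rows in a list means the number used in the graph file can be used directly as the list index.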

@gauravdiwan89

gauravdiwan89 commented Nov 21, 2024

Hi Jonathan,

Thanks a lot for your comment. However, it's still not very clear to me how the file is structured.

Here are a few example lines from my file

(mclmatrix
begin

0    1709:0.113 2318:0.400 5011:2.376 5429:0.354 $
1    3420:1.101 4901:0.267 6616:0.378 6812:0.255 $
6    1584:1.910 2057:0.483 2964:0.350 5182:2.333 5785:1.074 7575:2.446 844769:1.923 2258948:0.927 2259171:0.872 2259359:0.822 2259498:0.969 2259625:1.942 2259680:0.751 2259867:1.940 2259918:0.967 2623723:0.146 2772717:0.224 3913911:0.138 3914167:0.278 $
7    80414:0.263 154874:0.270 246086:0.314 712422:0.399 735593:0.341 774493:0.278 972974:0.284 1073849:0.321 1406402:0.257 1593557:0.270 1593843:0.284 1691143:0.261 1901301:0.283 1939580:0.290 3773828:0.298 3775276:0.286 3777215:0.262 4045807:0.256 4128041:0.293 4135792:0.288 $
8    1150:0.739 1310:0.206 3215:0.205 4017:0.244 4524:1.111 5619:0.781 6319:0.872 6405:0.739 7527:0.779 980322:0.270 $
9    3026:1.966 3105:1.531 $
......

There is an even number of entries on each line, so it's clear there are multiple pairs. But when you say that the format is
geneID gene1ID:weight gene2ID:weight, I'm not clear what the first number on each line is. Did you actually mean OrthogroupID or SpeciesID?

If you can share a python function, that would be super! Thank you so much!

Best regards,
Gaurav

@Jonathan-Holmes-Bioinformatics

Hi Gaurav,

The first number in each row is a protein ID, and each further entry is a protein that OrthoFinder has linked to it via BLAST score.

In your example, the first number, 0, refers to the protein in the first row of the SequenceIDs.txt file. Each entry that follows is a protein and its edge weight in the format geneID:weight, and the row ends with the '$' sign.

0 1709:0.113 2318:0.400 5011:2.376 5429:0.354 $

From this row we can extract four gene pairs and their weights relative to gene ID 0:
gene1 gene2 weight
0 1709 0.113
0 2318 0.400
0 5011 2.376
0 5429 0.354

Through this you can extract the sparse matrix of all gene pair connections. However, each gene is coded numerically.

The data in SequenceIDs.txt lets you look up these ID numbers, so you can get the names for all gene pairs relative to gene 0.
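To make the row format concrete, here is a small sketch that splits one graph row into (gene1, gene2, weight) triples and then swaps the numbers for names. The helper names `parse_graph_row` and `name_edges` and the protein names in `names` are hypothetical; in practice the names would come from SequenceIDs.txt.

```python
def parse_graph_row(line):
    """Split one row of the graph file into (gene1, gene2, weight) triples."""
    fields = line.split()
    # Header lines such as "(mclmatrix" and "begin" do not start with a number
    if not fields or not fields[0].isdigit():
        return []
    source = int(fields[0])
    edges = []
    for entry in fields[1:]:
        if entry == "$":  # end-of-row marker
            break
        target, _, weight = entry.partition(":")
        edges.append((source, int(target), float(weight)))
    return edges


def name_edges(edges, names):
    """Replace numeric indices with names, keeping the number if none is known."""
    return [(names.get(a, str(a)), names.get(b, str(b)), w) for a, b, w in edges]


row = "0    1709:0.113 2318:0.400 5011:2.376 5429:0.354 $"
names = {0: "protA", 1709: "protB"}  # hypothetical names for illustration
for gene1, gene2, weight in name_edges(parse_graph_row(row), names):
    print(gene1, gene2, weight)
```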

@gauravdiwan89

Oh I see! So they are the indices of the proteins in the SequenceIDs.txt file. That helps me a lot. I will try to script this for my set of proteins and will get back to you if it doesn't work somehow.

Thanks!

@Jonathan-Holmes-Bioinformatics

I hope that helped. Sorry if it was unclear; the protein IDs undergo several changes during the OrthoFinder run.

I have quickly coded something that will check for and return the weights for a given pair of proteins, if you need something to start with.

Input:

1) Path to graph file
2) Path to SequenceIDs.txt
3) Gene 1 (either speciesID_geneID or the original sequence name)
4) Gene 2 (same format as Gene 1)

Output:

[gene1 v gene2, gene2 v gene1]

Function

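def get_gene_pair(graph_file, sequence_file, gene1, gene2):
    """Return [row_index, column_index, weight] for each edge between the two genes."""
    # A gene's index in the graph is its 0-based row number in SequenceIDs.txt.
    # Rows are matched on the internal ID before ": " or on the name after it.
    index = {}
    with open(sequence_file) as sfile:
        for pos, line in enumerate(sfile):
            internal_id, _, name = line.rstrip("\r\n").partition(": ")
            aliases = {internal_id, name, *name.split()[:1]}
            for gene in (gene1, gene2):
                if gene in aliases:
                    index[gene] = str(pos)
    for gene in (gene1, gene2):
        if gene not in index:
            raise ValueError(f"{gene} not found in {sequence_file}")

    wanted = {index[gene1], index[gene2]}
    result = []
    with open(graph_file) as gfile:
        for line in gfile:
            # split() copes with the run of spaces after the row index
            fields = line.split()
            if not fields or fields[0] not in wanted:
                continue
            for entry in fields[1:]:
                target, sep, weight = entry.partition(":")
                # Compare whole IDs so that e.g. index 6 does not match 1584
                if sep and target in wanted and target != fields[0]:
                    result.append([fields[0], target, weight])
    return result


# Example
pair_test = get_gene_pair("WorkingDirectory/OrthoFinder_graph.txt", "WorkingDirectory/SequenceIDs.txt", "0_60", "0_19026")
print(pair_test)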
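For a larger set of protein pairs, calling a per-pair function would re-read the graph file each time. A possible alternative, sketched below under the same assumptions about the row format, is to collect all requested weights in a single pass. The function name `collect_weights` and the second example row are made up for illustration, and the pairs are given as numeric graph indices.

```python
def collect_weights(graph_lines, pairs):
    """Return {(row_index, column_index): weight} for the requested index pairs.

    Both directions of each pair are looked up, since each direction is a
    separate entry in the file.
    """
    wanted = set()
    for a, b in pairs:
        wanted.add((a, b))
        wanted.add((b, a))
    sources = {a for a, _ in wanted}

    found = {}
    for line in graph_lines:
        fields = line.split()
        # Skip header lines and rows for genes nobody asked about
        if not fields or not fields[0].isdigit():
            continue
        source = int(fields[0])
        if source not in sources:
            continue
        for entry in fields[1:]:
            target, sep, weight = entry.partition(":")
            if sep and (source, int(target)) in wanted:
                found[(source, int(target))] = float(weight)
    return found


# The first data row is from the thread; the second is a made-up reverse entry
example_lines = [
    "(mclmatrix",
    "begin",
    "0    1709:0.113 2318:0.400 5011:2.376 5429:0.354 $",
    "1709    0:0.120 $",
]
print(collect_weights(example_lines, [(0, 1709), (0, 5429)]))
```

With a real run, `graph_lines` could simply be the open file handle for OrthoFinder_graph.txt, so the file is streamed rather than loaded into memory.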
