Extract edge weights from graph file #945

gauravdiwan89 · 2024-11-20T15:53:46Z

Hello,

Is there a way to extract the MCL graph edge weights between pairs of proteins that have been declared as orthologues?

As far as I can tell, the OrthoFinder_graph.txt file should have these. However, is there a way to connect each weight in this file with the pair of proteins it belongs to?

Or maybe I'm misunderstanding something completely.

Any help would be appreciated.

Jonathan-Holmes-Bioinformatics · 2024-11-21T13:36:30Z

Hi gauravdiwan89,

The format of the OrthoFinder_graph.txt should be:
geneID gene1ID:weight gene2ID:weight

As far as I am aware, the gene IDs used in OrthoFinder_graph.txt refer to rows in SequenceIDs.txt, which contains the gene ID and protein name pairs. So first identify the rows of the proteins you want to look at in that file (counting from 0). You can then parse the rows of OrthoFinder_graph.txt corresponding to each of these proteins and pull the weights from there.

If you need further help implementing this I would be happy to provide a python function which does this.
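
As a starting point, here is a minimal sketch of the lookup step. It assumes each line of SequenceIDs.txt looks like `speciesID_geneID: name description` and that the 0-based row number is the ID used in the graph file; I have not checked either against the OrthoFinder source. The name `load_sequence_index` and the example lines are made up for illustration.

```python
import io

def load_sequence_index(handle):
    """Map the internal ID and the sequence name to the 0-based row number."""
    index = {}
    for row, line in enumerate(handle):
        internal_id, _, description = line.rstrip("\n").partition(": ")
        index[internal_id] = row
        words = description.split()
        if words:
            # Treat the first whitespace-separated word as the sequence name.
            index[words[0]] = row
    return index

# Made-up lines standing in for a real SequenceIDs.txt
example = io.StringIO("0_0: protA some description\n0_1: protB\n1_0: protC\n")
index = load_sequence_index(example)
print(index["0_1"], index["protC"])
```

With a real file you would pass `open("SequenceIDs.txt")` instead of the `io.StringIO` object.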

gauravdiwan89 · 2024-11-21T14:59:13Z

Hi Jonathan,

Thanks a lot for your comment. However, it's still not very clear to me how the file is structured.

Here are a few example lines from my file

(mclmatrix
begin

0    1709:0.113 2318:0.400 5011:2.376 5429:0.354 $
1    3420:1.101 4901:0.267 6616:0.378 6812:0.255 $
6    1584:1.910 2057:0.483 2964:0.350 5182:2.333 5785:1.074 7575:2.446 844769:1.923 2258948:0.927 2259171:0.872 2259359:0.822 2259498:0.969 2259625:1.942 2259680:0.751 2259867:1.940 2259918:0.967 2623723:0.146 2772717:0.224 3913911:0.138 3914167:0.278 $
7    80414:0.263 154874:0.270 246086:0.314 712422:0.399 735593:0.341 774493:0.278 972974:0.284 1073849:0.321 1406402:0.257 1593557:0.270 1593843:0.284 1691143:0.261 1901301:0.283 1939580:0.290 3773828:0.298 3775276:0.286 3777215:0.262 4045807:0.256 4128041:0.293 4135792:0.288 $
8    1150:0.739 1310:0.206 3215:0.205 4017:0.244 4524:1.111 5619:0.781 6319:0.872 6405:0.739 7527:0.779 980322:0.270 $
9    3026:1.966 3105:1.531 $
......

There is an even number of entries on each line, so it's clear there are multiple pairs. But when you say that the format is
geneID gene1ID:weight gene2ID:weight, I'm not clear as to what the first number on each line is. Did you actually mean OrthogroupID or SpeciesID?

If you can share a python function, that would be super! Thank you so much!

Best regards,
Gaurav

Jonathan-Holmes-Bioinformatics · 2024-11-21T15:14:51Z

Hi Gaurav,

The first number in each row is a protein ID, with each further entry being a protein which OrthoFinder has linked to it via BLAST score.

In your example, the first number, 0, refers to the protein in the first row of the SequenceIDs.txt file. Each entry following it is a protein and its edge weight in the format geneID:weight, and the row ends with the '$' sign.

0 1709:0.113 2318:0.400 5011:2.376 5429:0.354 $

From this row we can extract four gene pairs and their weights relative to geneID 0:
gene1 gene2 weight
0 1709 0.113
0 2318 0.400
0 5011 2.376
0 5429 0.354

In this way you can extract the sparse matrix of all gene pair connections. However, each gene is coded numerically.

The data in SequenceIDs.txt lets you translate these ID numbers back to gene names, so you can identify all the gene pairs relative to gene 0.
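
To make the row format concrete, here is a rough sketch that splits one row into (gene1, gene2, weight) triples. It assumes each row sits on a single line and ends with `$`, as in your example; `parse_graph_row` is just an illustrative name, not something OrthoFinder provides.

```python
def parse_graph_row(line):
    """Split one row of the graph file into (source, target, weight) triples."""
    fields = line.split()
    # Skip blank lines and non-data lines such as "(mclmatrix", "begin" or ")".
    if not fields or not fields[0].isdigit():
        return []
    source = int(fields[0])
    triples = []
    for entry in fields[1:]:
        if entry == "$":  # end-of-row marker
            break
        target, _, weight = entry.partition(":")
        triples.append((source, int(target), float(weight)))
    return triples

row = "0    1709:0.113 2318:0.400 5011:2.376 5429:0.354 $"
for triple in parse_graph_row(row):
    print(*triple)
```

Using `line.split()` with no argument copes with the runs of spaces or tabs between the first ID and the entries.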

gauravdiwan89 · 2024-11-21T15:30:23Z

Oh I see! So they are the indices of the protein in the SequenceID file. That helps me a lot. I will try and script this for my set of proteins and will get back to you if this doesn't work somehow.

Thanks!

Jonathan-Holmes-Bioinformatics · 2024-11-21T15:45:04Z

I hope that helped. Sorry if it was unclear; the protein IDs undergo several changes during the OrthoFinder run.

I have quickly coded something that will check and return the weights for a given pair of proteins, if you need something to start with.

Input:

1) Path to graph file
2) Path to SequenceIDs.txt
3) Gene 1 (either as speciesID_geneID or as the sequence name)
4) Gene 2 (same format)

Output:

[gene1 v gene2, gene2 v gene1]

Function

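def get_gene_pair(graph_file, sequence_file, gene1, gene2):
    """Return [source, target, weight] for each direction of the gene1/gene2 edge."""
    # Find the 0-based row of each gene in SequenceIDs.txt; that row number
    # is the ID used in the graph file.
    index = {}
    with open(sequence_file) as sfile:
        for pos, line in enumerate(sfile):
            seq_id, _, name = line.strip().partition(": ")
            for gene in (gene1, gene2):
                # Accept the internal ID (e.g. 0_60), the full name, or its first word.
                if gene in (seq_id, name, name.split(" ")[0]):
                    index[gene] = str(pos)
    for gene in (gene1, gene2):
        if gene not in index:
            raise ValueError(f"{gene} not found in {sequence_file}")

    wanted = {index[gene1], index[gene2]}
    result = []
    with open(graph_file) as gfile:
        for line in gfile:
            fields = line.split()  # copes with tabs and runs of spaces
            # Only the rows that start with one of the two indices matter.
            if not fields or fields[0] not in wanted:
                continue
            for entry in fields[1:]:
                target, _, weight = entry.partition(":")
                # Compare whole IDs so that e.g. 6 does not match 1584 or a weight.
                if target in wanted and target != fields[0]:
                    result.append([fields[0], target, weight])
    return result


# Example
pair_test = get_gene_pair("WorkingDirectory/OrthoFinder_graph.txt", "WorkingDirectory/SequenceIDs.txt", "0_60", "0_19026")
print(pair_test)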
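
If you end up needing weights for many pairs, re-reading the graph file once per pair will be slow. Below is a rough sketch of a single-pass variant that takes pairs of numeric row indices (already looked up in SequenceIDs.txt). The name `weights_for_pairs` and the toy graph are made up for illustration, and the `3026` row in particular is invented, not taken from your file.

```python
import io

def weights_for_pairs(graph_handle, pairs):
    """Collect weights for many (index_a, index_b) pairs in a single pass."""
    # For each index, remember which partners we are interested in.
    wanted = {}
    for a, b in pairs:
        wanted.setdefault(a, set()).add(b)
        wanted.setdefault(b, set()).add(a)
    weights = {}
    for line in graph_handle:
        fields = line.split()
        # Skip blank lines and non-data lines such as "(mclmatrix" or "begin".
        if not fields or not fields[0].isdigit():
            continue
        source = int(fields[0])
        if source not in wanted:
            continue
        for entry in fields[1:]:
            target, sep, weight = entry.partition(":")
            # sep is empty for the trailing "$", which is skipped here.
            if sep and int(target) in wanted[source]:
                weights[(source, int(target))] = float(weight)
    return weights

# Toy graph; the 3026 row is invented for this example.
graph = io.StringIO(
    "(mclmatrix\nbegin\n"
    "0    1709:0.113 2318:0.400 $\n"
    "9    3026:1.966 3105:1.531 $\n"
    "3026    9:1.800 $\n"
    ")\n"
)
print(weights_for_pairs(graph, [(0, 2318), (9, 3026)]))
```

The result is keyed by (source, target), so both directions of an edge are kept separately when both rows are present in the file.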
