More species with lower quality genomes finishes faster #942
Species51 just finished, and it may be the higher quality result, I think, though it's tricky for me to reason about.

From the OrthoFinder logs for Species51 and Species67, I calculate the following.

Percentage of orthogroups with all species present: Species51 = 710/36662 = 1.9%

In this sense, if I am thinking about things correctly, Species51 orthogroups are likely to have about 6x greater species representation per orthogroup than Species67 orthogroups.

I'm less sure what to make of the Species51 G50 being in half as many orthogroups as in Species67, and at a much lower percentage of all clusters (Species51 = 2962/36662 = 0.0808 = 8% vs Species67 = 6546/49297 = 0.133 = 13%). This might be a good thing: I think it means the Species51 orthogroups are greater in sequence diversity, and so again likely to be greater in species diversity, compared to Species67?

My sense, though I don't know if it is correct, from working with OrthoFinder and other clustering tools over the last few years is that including lower quality genomes may:

1) cryptically inflate the number of sequences from a poor quality species in a given orthogroup in some cases (when multiple partial gene models of what is in reality a single gene are present in the assembly and make it into the same orthogroup), or, more likely,
2) splinter orthogroups, when one or more partial gene models fail to make it into the orthogroup the gene belongs to because, being partial sequences, they cannot meet the thresholds for inclusion. They may then become singletons or form small spurious orthogroups of low species diversity.

Critically, these singletons or spurious orthogroups will look like innovations/gains/novelty in sequence diversity in a comparative analysis of species and orthogroups, making them dangerous artifacts to be avoided when possible. So it is better not to include low-quality genomes, unless there is no alternative.
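For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the percentages quoted above. The counts are copied from this comment. The dictionary keys and the `pct` helper are hypothetical names chosen for illustration, not OrthoFinder output fields. The Species67 count of orthogroups with all species present is not given above, so it is left as `None` rather than guessed.

```python
# Counts copied from the figures quoted in the comment above.
# Key names are illustrative only; they are not OrthoFinder field names.
runs = {
    "Species51": {"orthogroups": 36662, "all_species_ogs": 710, "g50_ogs": 2962},
    # The all-species count for Species67 is not stated in the comment.
    "Species67": {"orthogroups": 49297, "all_species_ogs": None, "g50_ogs": 6546},
}


def pct(part, whole):
    """Return part/whole as a percentage, or None when part is unknown."""
    if part is None:
        return None
    return 100.0 * part / whole


def fmt(value):
    """Format a percentage to one decimal place, or 'n/a' when unknown."""
    return "n/a" if value is None else f"{value:.1f}%"


for name, r in runs.items():
    all_species = pct(r["all_species_ogs"], r["orthogroups"])
    g50_share = pct(r["g50_ogs"], r["orthogroups"])
    print(f"{name}: all-species orthogroups = {fmt(all_species)}, "
          f"G50 orthogroups as share of all = {fmt(g50_share)}")
```

Filling in the missing Species67 count from its own log would allow the 6x comparison to be checked the same way.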
Along these lines, the fact that Species67 had about 30% more species (67 vs 51) but finished roughly 25% faster than Species51 suggests it has many smaller orthogroups that are likely of lower quality than those of Species51. I'm guessing OrthoFinder is able to work through many smaller orthogroups more rapidly than fewer larger ones, for reasons unknown to me.

Or maybe I am going down the wrong path in interpreting Species51 vs Species67?

Any guidance on assessing / selecting between the two, and/or on scaling up to Species230 and Species919 from here, would be great!

Thank you very much :)
Eric
I used the new core-assign feature of OrthoFinder3 to expand Species51 to include the additional lower quality species of Species67. Below are the stats from all three runs. Again, I am wondering how to assess quality across all three and select one to work with.

Species51: Number of species 51
Species67: Number of species 67
Species51_assign67: Number of species 67

Of note, more genes are in orthogroups using core-assign expansion in Species51_assign67 than in the standard core run of Species67, and the G50 sequences are contained in fewer orthogroups for Species51_assign67. Both suggest Species51_assign67 might be considered the better of the two 67-species results, or maybe it suggests over-clustering?

Thank you!
Eric
Hi!
I'm running the new OrthoFinder3 on two species sets, Species51 and Species67. Species in both represent animal phyla or their major unicellular outgroups. Phylogenetically, Species51 has one or two species per major group and Species67 has one or several species per major group. All species in Species51 are in Species67. The additional species in Species67 fall within the major groups in Species51 but are lower or much lower quality genomes: the additional Species67 genomes have N50s below 1 Mb, with a few in the low 1000s, whereas Species51 genomes are chromosome-scale or better and have N50s above 1 Mb, typically tens to hundreds of Mb.
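To make the N50 criterion concrete, here is a minimal Python sketch of the standard N50 calculation and the 1 Mb split described above. The function names `n50` and `passes_cutoff` are hypothetical, and the sketch assumes the sequence lengths of each assembly (in bp) have already been extracted. It is not part of OrthoFinder.

```python
def n50(lengths):
    """Return N50: the largest length L such that sequences of
    length >= L together cover at least half of the assembly."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


def passes_cutoff(lengths, cutoff_bp=1_000_000):
    """True when the assembly N50 is above the cutoff (1 Mb by default,
    the threshold mentioned in the post)."""
    return n50(lengths) > cutoff_bp


# Toy example with made-up lengths, for illustration only.
toy_assembly = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_assembly))
```

A per-genome table of N50 values computed this way could be posted alongside the OrthoFinder statistics to show which species fall on each side of the cutoff.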
I started identical OrthoFinder3 jobs on Species51 and Species67 at the same time. Species67 completed in 4-5 days, while Species51 is still running and will reach one week later today.
Given the similar but higher quality input of Species51, I expected it to produce higher quality clustering and orthogroups in the end, and I assumed that with 25% fewer species it would finish faster. Now that Species67 has completed at least days before Species51, I am confused.
My guess is that Species51 has orthogroups that are much larger than Species67 and it is taking longer due to their processing. Does this seem likely?
Once Species51 completes, I am wondering what you would recommend for assessing / selecting between the two?
The bigger picture is that I will take either Species51 or Species67 as the starting point and then use the new expansion feature of OrthoFinder3 to go to Species230 and then Species919, which are the end targets.
Any guidance or suggestions on how best to proceed once I have Species51 - or things I can post here that would be helpful to you to evaluate - would be great.
Thank you very much! Eric