Skip to content

Commit

Permalink
docs: Add tab for known ARETE issues (#181)
Browse files Browse the repository at this point in the history
Signed-off-by: jvfe <[email protected]>
  • Loading branch information
jvfe authored Jan 8, 2024
1 parent 0d9f702 commit 90cda86
Show file tree
Hide file tree
Showing 2 changed files with 28 additions and 0 deletions.
27 changes: 27 additions & 0 deletions docs/issues.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
# Known issues in ARETE

## PopPUNK

We have experienced issues with PopPUNK in ARETE runs, primarily related to the distances and clusters generated and how these affect both the **subsampling** and **recombination** subworkflows.

Sometimes the distances generated are too small or the number of clusters changes between executions of the same dataset.
While we can't solve the latter since it is a result of how PopPUNK itself is defined, the former can be mitigated by adjusting the subsampling thresholds with `--core_similarity` (99.9 by default) and `--accessory_similarity` (99 by default), or disabling subsampling altogether (`--enable_subsetting false`).

A useful course of action is to first run your dataset through the [PopPUNK entry](https://beiko-lab.github.io/arete/usage/#poppunk-entry), and then choose the appropriate parameters for your final pipeline run.
The PopPUNK entry takes at most 2 hours to run, even on [large datasets](https://beiko-lab.github.io/arete/resource_profiles/).
Disabling PopPUNK in your execution is also simple to do with `--skip_poppunk`.

## rSPR

rSPR, which you can enable with `--run_rspr` or with the [rSPR entry](https://beiko-lab.github.io/arete/usage/#rspr-entry), is known to be **very slow**, especially with larger datasets.

The default of 3 days for rSPR runtimes should be enough for some runs, but for most larger datasets it won't be sufficient. In this case, if you do want to run rSPR, we suggest two possible routes:

- Increasing the default time allocation for the `RSPR_EXACT` processes. [Check out how](https://beiko-lab.github.io/arete/usage/#custom-resource-requests).
- Ignoring timeout errors altogether and finish the pipeline execution with whatever finished running in these 3 days. **This is the default course of action for ARETE**.

By choosing the second course of action, we ignore timeout errors generated with `RSPR_EXACT` and finish the execution of downstream processes, i.e. `RSPR_HEATMAP`, with whatever results we already have.
This process generates a heatmap of Tree size and Exact rSPR distance.

While `RSPR_HEATMAP` should execute with the results that were generated up to the timeout, we have heard from users that this process can still not run, even when results from `RSPR_EXACT` were generated.
This is unfortunate but it shouldn't be a big problem: The only output given by `RSPR_HEATMAP` is the aforementioned heatmap, which can also be generated externally by using our [`rspr_heatmap.py`](https://github.com/beiko-lab/arete/blob/master/bin/rspr_heatmap.py) script or your own downstream analysis.
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ nav:
- Usage: usage.md
- FAQ: faq.md
- Output: output.md
- Known issues: issues.md
- Citations: CITATIONS.md
- Roadmap: ROADMAP.md
- Reference:
Expand Down

0 comments on commit 90cda86

Please sign in to comment.