Add Workflow Run RO-crate format #39

Open
wants to merge 17 commits into
base: master
from

Conversation

famosab

@famosab famosab commented Dec 18, 2024

We worked on a first version of the plugin that can render valid RO-Crates for any workflow run.

Happy to receive feedback to get this finished up :)

Continues #19 and #33.

famosab and others added 16 commits November 18, 2024 15:45
add encodingFormat for nextflow.config
feat: add wrroc to valid formats
* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>
* feat: add README to create

* feat: ignore vscode

* fix: make getIntermediateOutputFiles work again (#18) (#19)

* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>

* feat: add README to json

* feat: check first if readme exists

* Add readme to hasPart

Signed-off-by: fbartusch <[email protected]>

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>
* Add getEncodingFormat function that return the encoding format for a file
* handle YAML files manually

Signed-off-by: fbartusch <[email protected]>
* main workflow complies (more or less) with ComputationalWorkflow profile version 1.0
  (if set in manifest add license, url, version, description, ...)
* Correct value vor ActionStatus

Signed-off-by: fbartusch <[email protected]>
* start with metaYaml imports

* merge dev-wrroc into metaYaml (#23)

* add encodingFormat for nextflow.config

* add encodingFormat for main.nf

* feat: add wrroc to valid formats

* fix: make getIntermediateOutputFiles work again (#18)

* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>

* feat: add README to crate (#14)

* feat: add README to create

* feat: ignore vscode

* fix: make getIntermediateOutputFiles work again (#18) (#19)

* fx: make getIntermediateOutputFiles work again

* Fix bugs

fixes #16
fixes #17

---------

Co-authored-by: fbartusch <[email protected]>

* feat: add README to json

* feat: check first if readme exists

* Add readme to hasPart

Signed-off-by: fbartusch <[email protected]>

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>

* WIP

* only add from meta if meta exists

* remove usage from ext args

* add module name to id

---------

Signed-off-by: fbartusch <[email protected]>
Co-authored-by: fbartusch <[email protected]>
@famosab
Author

famosab commented Dec 18, 2024

@bentsherman maybe you can have a look here :)

@simleo

simleo commented Dec 18, 2024

Have you got an example RO-Crate generated with this version of the plugin?

@famosab
Author

famosab commented Dec 18, 2024

ro-crate-metadata.json
This was created using the plugin and this pipeline: https://github.com/famosab/wrrocmetatest

@bentsherman
Member

Is this superseding #33 now?

@bentsherman bentsherman changed the base branch from master to workflow-run-crate December 18, 2024 15:46
@bentsherman bentsherman changed the base branch from workflow-run-crate to master December 18, 2024 15:47
@simleo

simleo commented Dec 19, 2024

ro-crate-metadata.json

I ran runcrate report (https://github.com/ResearchObject/runcrate) on a directory containing that file and this is what I got:

action: #7d8bdcb2-6ea3-4132-b134-61b40ef98b8d
  instrument: main.nf (['File', 'SoftwareSourceCode', 'ComputationalWorkflow', 'HowTo'])
  started: 2024-12-18T11:33:47.967336+01:00
  ended: 2024-12-18T11:33:50.017405+01:00
  inputs:
    Users/famke/04_other/BioHackathon24/testdata/read2.fq.gz
    Users/famke/04_other/BioHackathon24/testdata/read1.fq.gz
    work/b2/0766652cd477129c14781bb6d8b148/test_1.fastp.fastq.gz
    work/b2/0766652cd477129c14781bb6d8b148/test_2.fastp.fastq.gz
    testsheet.csv <- #input
    None <- #genome
    s3://ngi-igenomes/igenomes/ <- #igenomes_base
    True <- #igenomes_ignore
    results <- #outdir
    copy <- #publish_dir_mode
    None <- #email
    None <- #email_on_fail
    False <- #plaintext_email
    False <- #monochrome_logs
    False <- #help
    False <- #help_full
    False <- #show_hidden
    None <- #version
    https://raw.githubusercontent.com/nf-core/test-datasets/ <- #pipelines_testdata_base_path
    None <- #config_profile_name
    None <- #config_profile_description
    master <- #custom_config_version
    https://raw.githubusercontent.com/nf-core/configs/master <- #custom_config_base
    None <- #config_profile_contact
    None <- #config_profile_url
    True <- #validate_params
    None <- #genomes
  outputs:
    fastp/test_2.fastp.fastq.gz
    fastp/test.fastp.html
    fastp/test.fastp.log
    fastp/test.fastp.json
    fastp/test_1.fastp.fastq.gz
    megahit/test.contigs.fa.gz
    megahit/intermediate_contigs/k51.contigs.fa.gz
    megahit/intermediate_contigs/k51.final.contigs.fa.gz
    megahit/intermediate_contigs/k71.contigs.fa.gz
    megahit/intermediate_contigs/k51.addi.fa.gz
    megahit/intermediate_contigs/k71.final.contigs.fa.gz
    megahit/intermediate_contigs/k71.addi.fa.gz
    megahit/test.log
    megahit/intermediate_contigs/k51.local.fa.gz
    work/b2/0766652cd477129c14781bb6d8b148/versions.yml
    work/34/f6dbb451e056d59cd528618daaa6cc/versions.yml

action: #b20766652cd477129c14781bb6d8b148
  step: main.nf#main/FAMOSAB_WRROCMETATEST:WRROCMETATEST:FASTP
  instrument: #Script_fd7c4fa8fd93ef0e@7337bd2e (SoftwareApplication)
  inputs:
    Users/famke/04_other/BioHackathon24/testdata/read1.fq.gz
    Users/famke/04_other/BioHackathon24/testdata/read2.fq.gz
  outputs:
    fastp/test_1.fastp.fastq.gz
    fastp/test_2.fastp.fastq.gz
    fastp/test.fastp.json
    fastp/test.fastp.html
    fastp/test.fastp.log
    work/b2/0766652cd477129c14781bb6d8b148/versions.yml

action: #34f6dbb451e056d59cd528618daaa6cc
  step: main.nf#main/FAMOSAB_WRROCMETATEST:WRROCMETATEST:MEGAHIT
  instrument: #Script_55473827b5576695@174cb0d8 (SoftwareApplication)
  inputs:
    work/b2/0766652cd477129c14781bb6d8b148/test_1.fastp.fastq.gz
    work/b2/0766652cd477129c14781bb6d8b148/test_2.fastp.fastq.gz
  outputs:
    megahit/test.contigs.fa.gz
    megahit/intermediate_contigs/k51.contigs.fa.gz
    megahit/intermediate_contigs/k51.final.contigs.fa.gz
    megahit/intermediate_contigs/k71.contigs.fa.gz
    megahit/intermediate_contigs/k71.final.contigs.fa.gz
    megahit/intermediate_contigs/k51.addi.fa.gz
    megahit/intermediate_contigs/k71.addi.fa.gz
    megahit/intermediate_contigs/k51.local.fa.gz
    megahit/intermediate_contigs/k51.final.contigs.fa.gz
    megahit/intermediate_contigs/k71.final.contigs.fa.gz
    megahit/test.log
    work/34/f6dbb451e056d59cd528618daaa6cc/versions.yml

Since I don't have the original crate I'm wondering, do all relative paths correspond to existing files in the crate, e.g., is there a Users/famke/04_other/BioHackathon24/testdata/read2.fq.gz in the crate?

This was created using the plugin and this pipeline: https://github.com/famosab/wrrocmetatest

I tried to run the pipeline following the instructions at the repo above. I had to manually change the plugin's version (to 1.1.0-DEV) after installing it, otherwise Nextflow downloads the [email protected] and replaces the installed one. I got this error (?) message:

Unknown parentDir: /home/simleo/repos/wrrocmetatest/read1.fq.gz
Unknown parentDir: /home/simleo/repos/wrrocmetatest/read2.fq.gz
Unexpected input file: /home/simleo/repos/wrrocmetatest/read1.fq.gz
Unexpected input file: /home/simleo/repos/wrrocmetatest/read2.fq.gz

Despite that, the run went on and I got an RO-Crate, but without the read{1,2}.fq.gz input files. I have Nextflow 23.10.0.

@famosab
Author

famosab commented Dec 19, 2024

@simleo I'll try to address the issues you ran into :)

Since I don't have the original crate I'm wondering, do all relative paths correspond to existing files in the crate, e.g., is there a Users/famke/04_other/BioHackathon24/testdata/read2.fq.gz in the crate?

Yes, this path is copied as-is into the folder that contains the JSON file. That might not be the most elegant solution, though: the input files could also go into a folder called input or similar. What do you think?

[Image: screenshot of the crate directory structure]

I tried to run the pipeline following the instructions at the repo above. I had to manually change the plugin's version (to 1.1.0-DEV) after installing it, otherwise Nextflow downloads the [email protected] and replaces the installed one. I got this error (?) message:

Yes, the installation of the plugin needs to be fixed. I always run make install from the folder where I worked on the plugin and then use the appropriate version; I think it is 1.3.0 in our case.

Seems like something is off with your inputs. Did you download the files as described in the README of the pipeline?

@simleo

simleo commented Dec 19, 2024

Yes this path is copied like this to the folder in which the json can be found. But that might not be the most elegant solution as the input files could also be put in a folder called input or something. What do you think?

It's not that important; what matters is that there are no clashes between files due to their names.
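
To illustrate the point about name clashes, here is a minimal Python sketch (not the plugin's Groovy code) comparing two ways of placing absolute input paths inside a crate. The function names `mirror_path` and `flatten_path`, the `input` folder name, and the sample paths are hypothetical, and the sketch assumes every input is an absolute POSIX path.

```python
from pathlib import PurePosixPath


def mirror_path(source: str) -> str:
    """Mirror an absolute path inside the crate by dropping the leading '/'."""
    path = PurePosixPath(source)
    return str(path.relative_to(path.anchor))


def flatten_path(source: str, folder: str = "input") -> str:
    """Place the file directly under a single folder, keeping only its name."""
    return f"{folder}/{PurePosixPath(source).name}"


# Two hypothetical inputs that share a file name but live in different directories.
sources = ["/data/run_a/reads.fq.gz", "/data/run_b/reads.fq.gz"]

mirrored = [mirror_path(s) for s in sources]
flattened = [flatten_path(s) for s in sources]

# Mirroring keeps the paths distinct; flattening maps both files to one name.
print(len(set(mirrored)), len(set(flattened)))
```

Mirroring the full path is verbose but cannot collide for distinct absolute sources. A flat `input` folder would need some extra disambiguation step whenever two inputs share a file name.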

Seems like something is off with your inputs. Did you download the files as described in the README of the pipeline?

Yes, and I put them in the root directory of the wrrocmetatest repo as cloned on my machine. My testsheet.csv is:

sample,fastq_1,fastq_2
test,/home/simleo/repos/wrrocmetatest/read1.fq.gz,/home/simleo/repos/wrrocmetatest/read2.fq.gz

The command I ran is:

nextflow run main.nf -profile docker --input testsheet.csv --outdir results -c testdata.config

@famosab
Author

famosab commented Dec 20, 2024

I have Nextflow 23.10.0.

Then I think you might need to update to a more recent Nextflow version as I only tested this for versions above 24.04.

But you should be able to test the plugin with any pipeline - maybe running an nf-core pipeline that is better maintained is more reliable!

@famosab
Author

famosab commented Jan 7, 2025

@bentsherman I would appreciate any feedback on this so we can get this finished up this month? As soon as people start using the plugin I guess more requests for changes / improvements / updates will come in.
@simleo Are there things that are missing from your point of view?

@simleo

simleo commented Jan 7, 2025

ro-crate-metadata.json

The metadata file looks OK. One thing I suggest changing is the PropertyValue instances for parameters that haven't been specified: it's better to omit them entirely than to add them with a "value": null.
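
A minimal Python sketch of this suggestion (again, not the plugin's Groovy code): unset parameters are skipped instead of being emitted with a null value. The function name `parameter_entities` and the `#<name>-pv` identifier scheme are hypothetical, and a real crate would carry further properties on each entity.

```python
from typing import Any


def parameter_entities(params: dict[str, Any]) -> list[dict[str, Any]]:
    """Build PropertyValue entities, skipping parameters that were never set."""
    entities = []
    for name, value in params.items():
        if value is None:
            # Unset parameter: leave it out instead of emitting "value": null.
            continue
        entities.append({
            "@id": f"#{name}-pv",  # hypothetical identifier scheme
            "@type": "PropertyValue",
            "name": name,
            "value": str(value),
        })
    return entities


# Parameter names taken from the report above; values as shown there.
params = {"outdir": "results", "genome": None, "validate_params": True}
entities = parameter_entities(params)
print([entity["name"] for entity in entities])
```

Note that `False` and empty strings are kept on purpose: only `None` is treated as "not specified" in this sketch.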

Regarding the crate structure, it looks fine from your screenshot. Some files added to the crate are not listed in the metadata (e.g. README.txt). This is not an error, but the crate would be better if they were listed in the metadata with a name and/or description, or left out of the crate if they are not essential. This could be left to a future release.

However, I still cannot reproduce your result after upgrading Nextflow to 24.10.3. The output directory is missing several files and directories including ro-crate-metadata.json. The command line output shows this:

Unknown parentDir: /home/simleo/repos/wrrocmetatest/read2.fq.gz
Unknown parentDir: /home/simleo/repos/wrrocmetatest/read1.fq.gz

(compared with Nextflow 23.10.0, the Unexpected input file messages have disappeared, but the ones above remain), and there's a java.lang.IllegalArgumentException: 'other' is different type of Path in the .nextflow.log.

@famosab
Author

famosab commented Jan 8, 2025

Can you try changing the config file to something like this (mainly updating the version of the plugin)?

plugins {
    id '[email protected]'
}

prov {
    enabled = true
    formats {
        wrroc {
            file = "${params.outdir}/ro-crate-metadata.json"
            overwrite = true
            agent {
                name = "John Doe"
                orcid = "https://orcid.org/0000-0000-0000-0000"
            }
            license = "https://spdx.org/licenses/MIT"
            profile = "provenance_run_crate"
        }
    }
}

@simleo

simleo commented Jan 9, 2025

Thanks @famosab, I finally managed to install the right version of the plugin. The run finished with no errors and no warnings, and I got a valid RO-Crate.

It's looking pretty good already, but some things need to be fixed. I think the most important one is test_{1,2}.fastp.fastq.gz being listed among the workflow inputs: these are intermediate files, produced by FASTP and consumed by MEGAHIT, so they should not be listed among workflow inputs. Similarly, intermediate files should not be listed among workflow outputs (in fact it's not clear to me which files are the actual workflow outputs: all of MEGAHIT's outputs?). Other remarks:

  • It would be nice to have the testsheet.csv file in the crate. Now only its name is listed, represented as the value of a PropertyValue: I guess this is due to the fact that the workflow somehow sees it as a string rather than a file.

  • The dependencies of main.nf, direct and indirect (e.g. workflows/wrrocmetatest.nf, subworkflows/nf-core/utils_nfcore_pipeline/main.nf, modules/nf-core/fastp/main.nf, etc.) should be included in the crate as files.

@bentsherman bentsherman changed the base branch from master to workflow-run-crate January 10, 2025 20:35
@bentsherman bentsherman changed the base branch from workflow-run-crate to master January 10, 2025 20:35
@bentsherman bentsherman changed the base branch from master to workflow-run-crate January 10, 2025 20:40
@bentsherman bentsherman changed the base branch from workflow-run-crate to master January 10, 2025 20:43
This was referenced Jan 10, 2025
@bentsherman bentsherman changed the title from "Finalize addition of Workflow Run RO-crate format" to "Add Workflow Run RO-crate format" Jan 10, 2025
@bentsherman
Member

Taking a look this afternoon. Expect some minor edits soon

Signed-off-by: Ben Sherman <[email protected]>
Member

@bentsherman bentsherman left a comment

I did some minor cleanup so far. The render() function is pretty long, so I'm going to see if I can move some code into helper functions to make it easier to read at a high level. Please refrain from making edits for now as I work through the code. I left some comments/questions in the meantime.

final configMap = session.config

// Set RO-Crate Root and workdir
this.crateRootDir = Path.of(params['outdir'].toString()).toAbsolutePath()
Member

I think this should be the parent directory of the ro-crate-metadata.json instead, because params.outdir is not universal to all pipelines
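
As a rough illustration of this suggestion, in Python rather than the plugin's Groovy: the crate root can be derived from the configured metadata file path alone, without relying on a pipeline-specific parameter. The function name `crate_root` is hypothetical.

```python
from pathlib import Path


def crate_root(metadata_file: str) -> Path:
    """Return the absolute directory that contains the metadata file."""
    return Path(metadata_file).absolute().parent


# The relative path is interpreted against the current working directory.
print(crate_root("results/ro-crate-metadata.json"))
```

With this approach the crate root follows whatever `file` path the user configures for the format, so pipelines that do not define `params.outdir` are still handled.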

Comment on lines +94 to +99
// Add intermediate input files (produced by workflow tasks and consumed by other tasks)
workflowInputs.addAll(getIntermediateInputFiles(tasks, workflowInputs))
final workflowInputMapping = getWorkflowInputMapping(workflowInputs)

// Add intermediate output files (produced by workflow tasks and consumed by other tasks)
workflowOutputs.putAll(getIntermediateOutputFiles(tasks, workflowOutputs))
Member

workflowInputs and workflowOutputs are meant to contain only the pipeline inputs/outputs, not the intermediate outputs. I would keep them separate.
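
One way to keep the three groups apart, sketched in Python rather than the plugin's Groovy. The function name `classify_files` and the task dictionaries are hypothetical. Treating "produced but never consumed" files as terminal is only a heuristic for illustration: which files count as actual workflow outputs depends on what the pipeline publishes.

```python
def classify_files(tasks):
    """Split task files into workflow inputs, intermediates, and terminal files."""
    produced = set()
    consumed = set()
    for task in tasks:
        consumed.update(task["inputs"])
        produced.update(task["outputs"])
    workflow_inputs = consumed - produced  # never produced by any task
    intermediates = consumed & produced    # produced by one task, consumed by another
    terminal = produced - consumed         # produced and never consumed downstream
    return workflow_inputs, intermediates, terminal


# Simplified version of the FASTP -> MEGAHIT run reported earlier in the thread.
tasks = [
    {
        "name": "FASTP",
        "inputs": {"read1.fq.gz", "read2.fq.gz"},
        "outputs": {"test_1.fastp.fastq.gz", "test_2.fastp.fastq.gz", "test.fastp.json"},
    },
    {
        "name": "MEGAHIT",
        "inputs": {"test_1.fastp.fastq.gz", "test_2.fastp.fastq.gz"},
        "outputs": {"test.contigs.fa.gz"},
    },
]

workflow_inputs, intermediates, terminal = classify_files(tasks)
print(sorted(workflow_inputs))
print(sorted(intermediates))
```

Under this split, the `test_{1,2}.fastp.fastq.gz` files land in the intermediate group and stay out of the workflow-level input list.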

Comment on lines +101 to +102
// Copy workflow input files into RO-Crate
workflowInputMapping.each { source, dest ->
Member

It looks like you are copying the intermediate files into the RO-crate? I don't think this is feasible in general. I think it would be better to save only a record of the task inputs/outputs with a checksum. That is what the BCO does at least.
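
A minimal Python sketch of recording a file by checksum instead of copying it (not the plugin's Groovy code, and not the BCO implementation). The function name `file_record` and the record keys `path`, `size`, and `sha256` are illustrative assumptions, not a confirmed RO-Crate or BCO schema.

```python
import hashlib
import tempfile
from pathlib import Path


def file_record(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Describe a file by path, size, and SHA-256 digest without copying it."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large intermediate files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "path": str(path),
        "size": path.stat().st_size,
        "sha256": digest.hexdigest(),
    }


# Demonstrate on a throwaway file with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "versions.yml"
    sample.write_bytes(b"example\n")
    record = file_record(sample)

print(record["size"], record["sha256"])
```

A record like this stays small regardless of file size, which is what makes it workable for intermediate files in a large work directory.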

Comment on lines +132 to +133
// Copy workflow output files into RO-Crate
workflowOutputs.each { source, dest ->
Member

Putting intermediate files aside, the workflow outputs should already be present in the output directory, so there should be no need to copy them. The most I would do is verify that each workflow output actually resides in the output directory and warn the user if it doesn't.
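
The check described here could look roughly like the following Python sketch (not the plugin's Groovy code). The function name `outputs_outside` and the warning text are hypothetical; `Path.is_relative_to` requires Python 3.9 or later.

```python
import warnings
from pathlib import Path


def outputs_outside(outdir: Path, outputs: list[Path]) -> list[Path]:
    """Return the outputs that do not reside under outdir, warning for each one."""
    root = outdir.resolve()
    outside = []
    for output in outputs:
        if not output.resolve().is_relative_to(root):
            warnings.warn(f"Workflow output is outside the output directory: {output}")
            outside.append(output)
    return outside


# Paths modeled on the report earlier in the thread; the files need not exist.
outputs = [Path("results/fastp/test.fastp.html"), Path("work/b2/versions.yml")]
flagged = outputs_outside(Path("results"), outputs)
print(flagged)
```

Nothing is copied in this sketch: outputs already under the output directory are left where they are, and only the stray ones are reported.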

Comment on lines +203 to +204
// Copy workflow into crate directory
Files.copy(scriptFile, crateRootDir.resolve(scriptFile.getFileName()), StandardCopyOption.REPLACE_EXISTING)
Member

@simleo you mentioned copying all of the pipeline code into the crate, but is this really necessary? If the crate specifies a git repository, revision, and main script path, the user can reproduce the pipeline code at any time.

Comment on lines +223 to +227
// license information
final license = [
"@id" : manifest.license,
"@type": "CreativeWork"
]
Member

Nextflow 24.10 added the manifest.license config option. Since the latest version of nf-prov requires 24.10, I went ahead and refactored this bit to use this option instead of a custom option.
