
Quality Checking Pipeline #23

Draft · wants to merge 69 commits into `main`
Conversation

@claymcleod (Member) commented Jun 28, 2023

Continuation of #3 and #13.

Rendered

claymcleod and others added 30 commits February 18, 2020 19:37
a-frantz and others added 20 commits November 21, 2020 15:09
@claymcleod force-pushed the rfcs/qc-workflow branch 3 times, most recently from 5205bc4 to 7850c1c on June 28, 2023 at 15:25
time computing the information themselves while also informing appropriate
use of the data.

You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/3).

Member:
Should this be updated to PR #23 or do we want to link to all of the QC PRs? That would be the only way to have the complete discussion available.

Comment on lines +17 to +20
2. Publish a collection of metrics that end-users of St. Jude Cloud can leverage
to assess the quality of the data available. This context should save users
time computing the information themselves while also informing appropriate
use of the data.

Member:
Do we accomplish that? Most of our discussion to this point has been about which metrics we want to use to determine a "quality" dataset. Would an end user want a different set of metrics to evaluate a dataset's applicability to their problem?

Comment on lines +26 to +27
St. Jude Cloud is a large repository of omics data available for request from
the academic and non-profit community. As such, the project processes thousands

@adthrasher (Member) commented Jun 29, 2023:
Is it appropriate to constrain our description of St. Jude Cloud this way? What about visualizations? What about imaging data? There are likely to be other examples in the future.

Suggested change
St. Jude Cloud is a large repository of omics data available for request from
the academic and non-profit community. As such, the project processes thousands
St. Jude Cloud provides a large repository of omics data available for request from
the academic and non-profit community. As such, the project processes thousands

Thus, the scope of this RFC, and the QC of samples on the project in general, is
limited to the _computational_ QC of the files produced for publication in St.
Jude Cloud. While we do produce _experimental_ QC results (such as `fastqc`
reports), these are rarely used to decide which files pass or fail our

Member:
We may want to note here whether we would exclude data or simply mark the QC failure.

Member:
I'm not sure this is true. In my discussions with @dfinkels about his QC process, I've never seen him make this distinction between computational and experimental QC. I think it's a valid distinction to make, but the claim "these (experimental QC metrics) are rarely used to decide which files pass or fail" seems erroneous to me.

And agree with @adthrasher here, we should be explicit about what "QC fail" means. Is it an exclusion from upload or a note in the metadata? We've historically done either, depending on severity of the problems detected.

David, can you comment on this paragraph please?

| Percentage of Reads Aligned ([link](#percentage-of-reads-aligned)) | [picard] | Number of mapped reads divided by the total number of reads as a percentage. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) |
| Median Insert Size ([link](#median-insert-size)) | [picard] | Median size of the fragment that is inserted between the sequencing adapters (estimated in silico). | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) |
| Percentage Duplication ([link](#percentage-duplication)) | [picard] | Percentage of the reads that are marked as PCR or optical duplicates. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) |
| ≥ 30X Coverage ([link](#-30x-coverage)) | [mosdepth] | The percentage of locations that are covered by at least 30 reads for the whole genome, the exonic regions, and the coding sequence regions specifically. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) |

Member:
I'm not sure a simple No for RNA-Seq is appropriate here. I think we need to qualify that we run everything except the WG coverage for RNA-Seq.
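As an aside on how picard-derived numbers like the table rows above can be consumed downstream, here is a minimal, hedged sketch that parses a picard-style metrics file. The `SAMPLE` content is synthetic; `PCT_PF_READS_ALIGNED` and the other column names follow picard's alignment summary output but should be verified against your picard version:

```python
# Hedged sketch: pull "Percentage of Reads Aligned" out of a picard-style
# metrics file. The SAMPLE content below is synthetic.
import csv
import io

SAMPLE = """\
## htsjdk.samtools.metrics.StringHeader
## METRICS CLASS\tpicard.analysis.AlignmentSummaryMetrics
CATEGORY\tTOTAL_READS\tPF_READS_ALIGNED\tPCT_PF_READS_ALIGNED
PAIR\t1000000\t985000\t0.985
"""

def pct_reads_aligned(text: str) -> float:
    # picard prefixes header/comment lines with "##"; drop them, keep the TSV.
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("##")]
    rows = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    # Use the PAIR row (overall paired-read stats) and report a percentage.
    pair = next(r for r in rows if r["CATEGORY"] == "PAIR")
    return float(pair["PCT_PF_READS_ALIGNED"]) * 100

print(round(pct_reads_aligned(SAMPLE), 2))  # 98.5
```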

quality scores across the sample (peaks for lower scores in the orange or red
areas of the graph are cause for concern).
- The **per base sequence content**, which shows if there are any biases in what
nucleotides are showing up at particular locations in the reads. For whole

Member:
Suggested change
nucleotides are showing up at particular locations in the reads. For whole
nucleotides are called at particular locations in the reads. For whole

nucleotides are showing up at particular locations in the reads. For whole
genome and whole exome data, there should be relatively few biases. However,
for RNA-Seq, you will see a bias at the 5' region of the reads.
- The **per sequence GC content** chart, which is highly informative of

Member:
Given that the distributions are rarely uniform and smooth, we should add some discussion here of what to actually expect and why.

For example:
[screenshot: example per-sequence GC content distribution]
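To make the per-sequence GC idea concrete, a toy sketch computing GC percentage per read over synthetic sequences (this is not FastQC's actual test, which additionally compares the observed distribution to a fitted theoretical curve):

```python
# Toy sketch of the "per sequence GC content" idea: GC percentage per read.
# Reads below are synthetic; FastQC's distribution-fitting step is omitted.
reads = [
    "ACGTACGTACGT",  # 6 of 12 bases are G/C -> 50%
    "GGGCCCGGGCCC",  # 100%
    "ATATATATATAT",  # 0%
]

def gc_percent(seq: str) -> float:
    return 100 * sum(base in "GC" for base in seq) / len(seq)

print(sorted(gc_percent(r) for r in reads))  # [0.0, 50.0, 100.0]
```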

a graph on which we find that, often, even good data fails FastQC's stringent test).
- The **overrepresented sequences** chart, which indicates whether you're
sequencing the same sequence over and over.
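The overrepresentation check can be sketched as a simple frequency count. This is a toy illustration with synthetic reads and an illustrative cutoff, not FastQC's actual algorithm or threshold:

```python
# Toy sketch of the "overrepresented sequences" check: flag any sequence
# that accounts for more than an illustrative fraction of all reads.
from collections import Counter

reads = ["ACGT", "ACGT", "ACGT", "TTTT", "GGGG"]
THRESHOLD = 0.5  # illustrative cutoff; FastQC's real cutoff is far lower

counts = Counter(reads)
flagged = [seq for seq, n in counts.items() if n / len(reads) > THRESHOLD]
print(flagged)  # ['ACGT']
```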
- The **adapter content** chart, which we use to ensure there are minimal to no

Member:
This is a good example of required background knowledge. By contrast, above you explain what C and G stand for and describe the nucleotide bonding. Here we assume that the reader is familiar with sequencing adapters.

We presume Anaconda is installed and available. If not, please follow the installation instructions at [Anaconda](https://www.anaconda.com/) first.

```bash
conda create --name bio-qc \
```

Member: This is missing kraken2 and mosdepth. It also includes fastq-screen, which we no longer run.

```bash
conda activate bio-qc
```

For linting created fastqs, `fqlib` must be installed. See installation instructions [here](https://github.com/stjude/fqlib).

Member:
This needs some additional text (or maybe less). I would either say "fqlib must be installed. See installation instructions here". Or provide a description of linting.

Member:
Also, isn't fqlib in conda now?
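Following up on the environment comments above, a hedged configuration sketch of what a consolidated conda setup might look like. The bioconda package names (`fq`, `kraken2`, `mosdepth`) and the `fq lint` subcommand are assumptions drawn from the tools' public repositories, not confirmed by this RFC; verify before use:

```shell
# Hedged sketch: create the QC environment with the tools the reviewers
# mention (bioconda package names are assumptions; verify before use).
conda create --name bio-qc \
    --channel bioconda --channel conda-forge \
    kraken2 mosdepth fq

conda activate bio-qc

# fqlib's `fq` binary lints paired FASTQ files (subcommand per the fqlib README):
fq lint sample_R1.fastq.gz sample_R2.fastq.gz
```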
