Quality Checking Pipeline #23
base: main
Conversation
Co-Authored-By: Clay McLeod <[email protected]>
Co-authored-by: Andrew Thrasher <[email protected]>
Co-authored-by: Michael Macias <[email protected]>
5205bc4 to 7850c1c
time computing the information themselves while also informing appropriate
use of the data.

You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/3).
Should this be updated to PR #23 or do we want to link to all of the QC PRs? That would be the only way to have the complete discussion available.
2. Publish a collection of metrics that end-users of St. Jude Cloud can leverage
to assess the quality of the data available. This context should save users
time computing the information themselves while also informing appropriate
use of the data.
Do we accomplish that? Most of our discussion to this point has been about what metrics we want to use to determine a "quality" dataset. Are there different metrics that an end user would want in order to evaluate the applicability of a dataset to a problem?
St. Jude Cloud is a large repository of omics data available for request from
the academic and non-profit community. As such, the project processes thousands
Is it appropriate to constrain our description of St. Jude Cloud this way? What about visualizations? What about imaging data? There are likely to be other examples in the future.
Suggested change:
- St. Jude Cloud is a large repository of omics data available for request from
- the academic and non-profit community. As such, the project processes thousands
+ St. Jude Cloud provides a large repository of omics data available for request from
+ the academic and non-profit community. As such, the project processes thousands
Thus, the scope of this RFC, and the QC of samples on the project in general, is
limited to the _computational_ QC of the files produced for publication in St.
Jude Cloud. While we do produce results that define _experimental_ results (such
as `fastqc`), these are rarely used to decide which files pass or fail our
We may want to note here whether we would exclude data or simply mark the QC failure.
I'm not sure this is true. In my discussions with @dfinkels about his QC process, I've never seen him make this distinction between computational and experimental QC. I think it's a valid distinction to make, but the claim "these (experimental QC metrics) are rarely used to decide which files pass or fail" seems erroneous to me.
And agree with @adthrasher here, we should be explicit about what "QC fail" means. Is it an exclusion from upload or a note in the metadata? We've historically done either, depending on severity of the problems detected.
David, can you comment on this paragraph please?
| Percentage of Reads Aligned ([link](#percentage-of-reads-aligned)) | [picard] | Number of mapped reads divided by the total number of reads as a percentage. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) |
| Median Insert Size ([link](#median-insert-size)) | [picard] | Median size of the fragment that is inserted between the sequencing adapters (estimated in silico). | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) |
| Percentage Duplication ([link](#percentage-duplication)) | [picard] | Percentage of the reads that are marked as PCR or optical duplicates. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) |
| ≥ 30X Coverage ([link](#-30x-coverage)) | [mosdepth] | The percentage of locations that are covered by at least 30 reads for the whole genome, the exonic regions, and the coding sequence regions specifically. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) |
I'm not sure a simple `No` for RNA-Seq is appropriate here. I think we need to qualify that we run everything except the WG coverage for RNA-Seq.
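
For context, a minimal sketch of how the ≥ 30X coverage values could be derived with mosdepth; the file names, BED paths, and flags here are illustrative assumptions, not the pipeline's actual invocation.

```bash
# Illustrative only -- sample.bam, exons.bed, and the output prefixes are placeholders.

# Whole genome: the *.mosdepth.global.dist.txt output reports the cumulative
# proportion of bases covered at >= each depth; the 30X row gives the value in the table.
mosdepth --no-per-base sample.wgs sample.bam

# Exonic / coding regions: restrict to a BED file and count bases reaching 30X per region.
mosdepth --no-per-base --by exons.bed --thresholds 30 sample.exon sample.bam
```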
quality scores across the sample (peaks for lower scores in the orange or red
areas of the graph are cause for concern).
- The **per base sequence content**, which shows if there are any biases in what
nucleotides are showing up at particular locations in the reads. For whole
Suggested change:
- nucleotides are showing up at particular locations in the reads. For whole
+ nucleotides are called at particular locations in the reads. For whole
nucleotides are showing up at particular locations in the reads. For whole
genome and whole exome data, there should be relatively few biases. However,
for RNA-Seq, you will see a bias at the 5' region of the reads.
- The **per sequence GC content** chart, which is highly informative of
a graph that we find, often, even good data fails FastQC's stringent test).
- The **overrepresented sequences** chart, which gives an indication if you're
sequencing the same sequence over and over.
- The **adapter content** chart, which we use to ensure there are minimal to no
This is a good example of required background knowledge. By contrast, above you explain what `C` and `G` stand for and describe the nucleotide bonding. Here we assume that the reader is familiar with sequencing adapters.
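
For reference, a minimal sketch of generating the FastQC reports discussed in this list; the input file names and output directory are placeholders rather than the pipeline's actual command.

```bash
# Illustrative only -- input FASTQs and output directory are placeholders.
# FastQC produces the per-base quality, per-base sequence content, GC content,
# overrepresented sequence, and adapter content reports described above.
mkdir -p qc/fastqc
fastqc --outdir qc/fastqc sample_R1.fastq.gz sample_R2.fastq.gz
```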
We presume Anaconda is available and installed. If not, please follow the link to [Anaconda](https://www.anaconda.com/) first.

```bash
conda create --name bio-qc \
```
This is missing `kraken2` and `mosdepth`. It also includes `fastq-screen`, which we no longer run.
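
As a rough sketch of what the corrected environment could look like, based only on the tools named in this RFC and the comment above; the exact package list and channels are assumptions, not the final command.

```bash
# Illustrative only -- package list is an assumption based on the comment above:
# drop fastq-screen, add kraken2 and mosdepth.
conda create --name bio-qc \
    --channel bioconda \
    --channel conda-forge \
    fastqc \
    picard \
    kraken2 \
    mosdepth
```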
```bash
conda activate bio-qc
```

For linting created fastqs, `fqlib` must be installed. See installation instructions [here](https://github.com/stjude/fqlib).
This needs some additional text (or maybe less). I would either say "fqlib must be installed. See installation instructions here". Or provide a description of linting.
Also, isn't `fqlib` in conda now?
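
For illustration, a minimal sketch of linting a read pair with fqlib's `fq` binary; the file names are placeholders, and the fqlib README remains the authoritative reference for usage.

```bash
# Illustrative only -- read file names are placeholders.
# `fq lint` checks that the paired FASTQ files are well-formed and that
# read names remain in sync between the two files.
fq lint sample_R1.fastq.gz sample_R2.fastq.gz
```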
Continuation of #3 and #13.
Rendered