Quality Checking Pipeline #23
base: main
Conversation
Co-Authored-By: Clay McLeod <[email protected]>
Co-authored-by: Andrew Thrasher <[email protected]>
Co-authored-by: Michael Macias <[email protected]>
5205bc4 to 7850c1c
time computing the information themselves while also informing appropriate
use of the data.

You can find the relevant discussion on the [associated pull request](https://github.com/stjudecloud/rfcs/pull/3).
Should this be updated to PR #23 or do we want to link to all of the QC PRs? That would be the only way to have the complete discussion available.
2. Publish a collection of metrics that end-users of St. Jude Cloud can leverage
to assess the quality of the data available. This context should save users
time computing the information themselves while also informing appropriate
use of the data.
Do we accomplish that? Most of our discussion to this point has been about what metrics we want to use to determine a "quality" dataset. Are there different metrics that an end user would want in order to evaluate the applicability of a dataset to a problem?
St. Jude Cloud is a large repository of omics data available for request from
the academic and non-profit community. As such, the project processes thousands
Is it appropriate to constrain our description of St. Jude Cloud this way? What about visualizations? What about imaging data? There are likely to be other examples in the future.
Suggested change:
- St. Jude Cloud is a large repository of omics data available for request from
- the academic and non-profit community. As such, the project processes thousands
+ St. Jude Cloud provides a large repository of omics data available for request from
+ the academic and non-profit community. As such, the project processes thousands
Thus, the scope of this RFC, and the QC of samples on the project in general, is
limited to the _computational_ QC of the files produced for publication in St.
Jude Cloud. While we do produce results that define _experimental_ results (such
as `fastqc`), these are rarely used to decide which files pass or fail our
We may want to note here whether we would exclude data or simply mark the QC failure.
I'm not sure this is true. In my discussions with @dfinkels about his QC process, I've never seen him make this distinction between computational and experimental QC. I think it's a valid distinction to make, but the claim "these (experimental QC metrics) are rarely used to decide which files pass or fail" seems erroneous to me.
And agree with @adthrasher here, we should be explicit about what "QC fail" means. Is it an exclusion from upload or a note in the metadata? We've historically done either, depending on severity of the problems detected.
David, can you comment on this paragraph please?
| Percentage of Reads Aligned ([link](#percentage-of-reads-aligned)) | [picard] | Number of mapped reads divided by the total number of reads as a percentage. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) |
| Median Insert Size ([link](#median-insert-size)) | [picard] | Median size of the fragment that is inserted between the sequencing adapters (estimated in silico). | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) |
| Percentage Duplication ([link](#percentage-duplication)) | [picard] | Percentage of the reads that are marked as PCR or optical duplicates. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) |
| ≥ 30X Coverage ([link](#-30x-coverage)) | [mosdepth] | The percentage of locations that are covered by at least 30 reads for the whole genome, the exonic regions, and the coding sequence regions specifically. | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![Yes](https://img.shields.io/badge/yes-brightgreen) | ![No](https://img.shields.io/badge/no-red) |
I'm not sure a simple `No` for RNA-Seq is appropriate here. I think we need to qualify that we run everything except the WG coverage for RNA-Seq.
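
For context, a minimal sketch of how the ≥ 30X coverage values could be derived with mosdepth; the file names, BED paths, and flags here are illustrative assumptions, not the pipeline's actual invocation.

```bash
# Illustrative only -- sample.bam, exons.bed, and the output prefixes are placeholders.

# Whole genome: the *.mosdepth.global.dist.txt output reports the cumulative
# proportion of bases covered at >= each depth; the 30X row gives the value in the table.
mosdepth --no-per-base sample.wgs sample.bam

# Exonic / coding regions: restrict to a BED file and count bases reaching 30X per region.
mosdepth --no-per-base --by exons.bed --thresholds 30 sample.exon sample.bam
```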
quality scores across the sample (peaks for lower scores in the orange or red
areas of the graph are cause for concern).
- The **per base sequence content**, which shows if there are any biases in what
nucleotides are showing up at particular locations in the reads. For whole
Suggested change:
- nucleotides are showing up at particular locations in the reads. For whole
+ nucleotides are called at particular locations in the reads. For whole
nucleotides are showing up at particular locations in the reads. For whole
genome and whole exome data, there should be relatively few biases. However,
for RNA-Seq, you will see a bias at the 5' region of the reads.
- The **per sequence GC content** chart, which is highly informative of
a graph that we find, often, even good data fails FastQC's stringent test).
- The **overrepresented sequences** chart, which gives an indication if you're
sequencing the same sequence over and over.
- The **adapter content** chart, which we use to ensure there are minimal to no
This is a good example of required background knowledge. By contrast, above you explain what `C` and `G` stand for and describe the nucleotide bonding. Here we assume that the reader is familiar with sequencing adapters.
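
For reference, a minimal sketch of generating the FastQC reports discussed in this list; the input file names and output directory are placeholders rather than the pipeline's actual command.

```bash
# Illustrative only -- input FASTQs and output directory are placeholders.
# FastQC produces the per-base quality, per-base sequence content, GC content,
# overrepresented sequence, and adapter content reports described above.
mkdir -p qc/fastqc
fastqc --outdir qc/fastqc sample_R1.fastq.gz sample_R2.fastq.gz
```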
We presume Anaconda is available and installed. If not, please follow the link to [Anaconda](https://www.anaconda.com/) first.

```bash
conda create --name bio-qc \
```
This is missing `kraken2` and `mosdepth`. It also includes `fastq-screen`, which we no longer run.
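
As a rough sketch of what the corrected environment could look like, based only on the tools named in this RFC and the comment above; the exact package list and channels are assumptions, not the final command.

```bash
# Illustrative only -- package list is an assumption based on the comment above:
# drop fastq-screen, add kraken2 and mosdepth.
conda create --name bio-qc \
    --channel bioconda \
    --channel conda-forge \
    fastqc \
    picard \
    kraken2 \
    mosdepth
```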
```bash
conda activate bio-qc
```

For linting created fastqs, `fqlib` must be installed. See installation instructions [here](https://github.com/stjude/fqlib).
This needs some additional text (or maybe less). I would either say "fqlib must be installed. See installation instructions here". Or provide a description of linting.
Also, isn't `fqlib` in conda now?
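
For illustration, a minimal sketch of linting a read pair with fqlib's `fq` binary; the file names are placeholders, and the fqlib README remains the authoritative reference for usage.

```bash
# Illustrative only -- read file names are placeholders.
# `fq lint` checks that the paired FASTQ files are well-formed and that
# read names remain in sync between the two files.
fq lint sample_R1.fastq.gz sample_R2.fastq.gz
```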
Continuation of #3 and #13.
Rendered