copied over contents of Intro to R workshop

hbctraining · Jun 10, 2019 · 502bbd8
1 parent b9063cb
commit 502bbd8

Showing 111 changed files with 67,249 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -1,2 +1,43 @@
-# EpiR
-6 hour introduction to R for Epi summer program
+## Introduction to R
+
+| Audience | Computational skills required | Duration |
+|:----------|:-------------|:----------|
+| Biologists | None | 1-day workshop (~ 5.5 hours of trainer-led time)|
+
+### Description
+This repository has teaching materials for a hands-on **Introduction to R** workshop. The workshop will introduce participants to the basics of R and RStudio. R is a simple programming environment that enables the effective handling of data, while providing excellent graphical support. RStudio is a tool that provides a user-friendly environment for working with R. 
+
+These materials are intended to provide both basic R programming knowledge and practice in applying it to make data analysis more efficient.
+
+> These materials were developed for a trainer-led workshop, but are also amenable to self-guided learning.
+
+### Learning Objectives
+
+1. **R syntax**: Understand the different 'parts of speech'.
+2. **Data types and structures in R**: Describe the various data types and data structures.
+3. **Data inspection and wrangling**: Use functions and indices to inspect and subset data from various data structures.
+4. **Visualizing data**: Demonstrate the use of the ggplot2 package to create plots for easy data visualization.
+
+### Lessons
+
+Below are links to the lessons and suggested schedules:
+
+* [1-day schedule](https://hbctraining.github.io/Intro-to-R/schedules/1.5-day.html)
+
+### Installation Requirements
+
+Download the most recent versions of R and RStudio for the appropriate OS using the links below:
+
+ - [R](https://cran.r-project.org/) 
+ - [RStudio](https://www.rstudio.com/products/rstudio/download/#download)
+
+### Dataset
+
+All the files used for the above lessons are linked within, but can also be [accessed here](https://github.com/hbctraining/Intro-to-R-with-DGE/tree/master/data).
+
+---
+*These materials have been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*
+
+* *Some materials used in these lessons were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/). 
+All Data Carpentry instructional material is made available under the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0).*
+
diff --git a/_config.yml b/_config.yml
@@ -0,0 +1,2 @@
+theme: jekyll-theme-cayman
+title: Introduction to R
diff --git a/assets/css/style.scss b/assets/css/style.scss
@@ -0,0 +1,8 @@
+---
+---
+
+@import "{{ site.theme }}";
+
+.page-header { color: #fff; text-align: center; background-image: url("../images/dna-sequence-1600x800.jpg"); }
+
+.main-content h1, .main-content h2, .main-content h3, .main-content h4, .main-content h5, .main-content h6 { margin-top: 2rem; margin-bottom: 1rem; font-weight: normal; color: #000000; }
diff --git a/assets/images/dna-sequence-1600x800.jpg b/assets/images/dna-sequence-1600x800.jpg
diff --git a/data/Mov10_full_meta.txt b/data/Mov10_full_meta.txt
@@ -0,0 +1,9 @@
+	sampletype	MOVexpr
+Mov10_kd_2	MOV10_knockdown	low
+Mov10_kd_3	MOV10_knockdown	low
+Mov10_oe_1	MOV10_overexpression	high
+Mov10_oe_2	MOV10_overexpression	high
+Mov10_oe_3	MOV10_overexpression	high
+Irrel_kd_1	control	normal
+Irrel_kd_2	control	normal
+Irrel_kd_3	control	normal
diff --git a/data/animals.csv b/data/animals.csv
@@ -0,0 +1,7 @@
+speed,color
+Elephant,40,Gray
+Cheetah,120,Tan
+Tortoise,0.1,Green
+Hare,48,Grey
+Lion,80,Tan
+PolarBear,30,White
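The header row of `animals.csv` has two fields while every data row has three. When that happens, `read.csv()` treats the first column as row names. A minimal sketch of that behaviour follows; it supplies the file contents inline through the `text` argument (an assumption made only to keep the example self-contained, where a lesson would normally pass the file path).

```r
# Same contents as data/animals.csv, supplied inline so the sketch is self-contained
csv_text <- "speed,color
Elephant,40,Gray
Cheetah,120,Tan
Tortoise,0.1,Green
Hare,48,Grey
Lion,80,Tan
PolarBear,30,White"

# The header has one fewer field than the data rows, so the first
# column (the animal names) is used as the row names
animals <- read.csv(text = csv_text)

rownames(animals)             # the six animal names
colnames(animals)             # "speed" "color"
animals["Cheetah", "speed"]   # look up a value by row name and column name
```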
diff --git a/data/counts.rpkm b/data/counts.rpkm
diff --git a/data/mouse_exp_design.csv b/data/mouse_exp_design.csv
@@ -0,0 +1,13 @@
+genotype,celltype,replicate
+sample1,Wt,typeA,1
+sample2,Wt,typeA,2
+sample3,Wt,typeA,3
+sample4,KO,typeA,1
+sample5,KO,typeA,2
+sample6,KO,typeA,3
+sample7,Wt,typeB,1
+sample8,Wt,typeB,2
+sample9,Wt,typeB,3
+sample10,KO,typeB,1
+sample11,KO,typeB,2
+sample12,KO,typeB,3
diff --git a/data/normalized_counts.txt b/data/normalized_counts.txt
diff --git a/dataset.zip b/dataset.zip
diff --git a/homework/Intro_to_R_hw.md b/homework/Intro_to_R_hw.md
@@ -0,0 +1,165 @@
+# Introduction to R practice
+
+## Creating vectors/factors and dataframes
+
+1. We are performing RNA-Seq on cancer samples being treated with three different types of treatment (A, B, and P). You have 12 samples total, with 4 replicates per treatment. Write the R code you would use to construct your metadata table as described below.  
+     - Create the vectors/factors for each column (Hint: you can type out each vector/factor, or if you want the process to go faster, try exploring the `rep()` function).
+     - Put them together into a dataframe called `meta`.
+     - Use the `rownames()` function to assign row names to the dataframe (Hint: you can type out the row names as a vector, or if you want the process to go faster, try exploring the `paste()` function).
+
+     Your finished metadata table should have information for the variables `sex`, `stage`, `treatment`, and `myc` levels: 
+
+     | |sex	| stage	| treatment	| myc |
+     |:--:|:--: | :--:	| :------:	| :--: |
+     |sample1|	M	|I	|A	|2343|
+     |sample2|	F	|II	|A	|457|
+     |sample3	|M	|II	|A	|4593|
+     |sample4	|F	|I	|A	|9035|
+     |sample5|	M	|II	|B	|3450|
+     |sample6|	F|	II|	B|	3524|
+     |sample7|	M|	I|	B|	958|
+     |sample8|	F|	II|	B|	1053|
+     |sample9|	M|	II|	P|	8674|
+     |sample10	|F|	I	|P	|3424|
+     |sample11|	M	|II	|P	|463|
+     |sample12|	F|	II|	P|	5105|
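The `rep()` and `paste()` hints in question 1 can be illustrated on a smaller, made-up design. The sketch below is not the homework's `meta` table; the names `group`, `batch`, `score` and `toy` are hypothetical.

```r
# Hypothetical toy design: 2 groups with 3 replicates each
group <- factor(rep(c("ctrl", "drug"), each = 3))   # ctrl ctrl ctrl drug drug drug
batch <- rep(c("x", "y"), times = 3)                 # x y x y x y
score <- c(10, 12, 9, 20, 22, 19)

# Combine the vectors column-wise into a data frame
toy <- data.frame(group, batch, score)

# paste() builds "s_1" ... "s_6" without typing each name
rownames(toy) <- paste("s", 1:6, sep = "_")

toy
```

`each` repeats every element in turn, while `times` repeats the whole vector, which is the difference between a blocked and an alternating pattern.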
+
+
+## Subsetting vectors/factors and dataframes
+
+2. Using the `meta` data frame from question #1, write out the R code you would use to perform the following operations (questions **DO NOT** build upon each other):
+
+     - return only the `treatment` and `sex` columns using `[]`:
+     - return the `treatment` values for samples 5, 7, 9, and 10 using `[]`:
+     - use `filter()` to return all data for those samples receiving treatment `P`:
+     - use `filter()`/`select()` to return only the `stage` and `treatment` columns for those samples with `myc` > 5000:
+     - remove the `treatment` column from the dataset using `[]`:
+     - remove samples 7, 8 and 9 from the dataset using `[]`:
+     - keep only samples 1-6 using `[]`:
+     - add a column called `pre_treatment` to the beginning of the dataframe with the values T, F, F, F, T, T, F, T, F, F, T, T (Hint: use `cbind()`): 
+     - change the names of the columns to: "A", "B", "C", "D":
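The `[]` operations listed in question 2 follow a few general patterns. The sketch below shows them on a hypothetical toy data frame rather than on `meta`, and uses base R only; it does not cover `filter()` or `select()`, which need an additional package.

```r
# Hypothetical toy data frame (not the homework's `meta`)
toy <- data.frame(dose = c(1, 5, 10, 50),
                  response = c("no", "no", "yes", "yes"),
                  row.names = c("s1", "s2", "s3", "s4"))

toy[, c("response", "dose")]   # keep columns by name, in the order given
toy[c(1, 3), "dose"]           # values of one column for rows 1 and 3
toy[toy$dose > 5, ]            # rows where a logical condition is TRUE
toy[, -2, drop = FALSE]        # drop column 2; drop = FALSE keeps a data frame
toy[-c(2, 3), ]                # drop rows 2 and 3

# Add a column at the front, then rename all columns
toy2 <- cbind(flag = c(TRUE, FALSE, FALSE, TRUE), toy)
colnames(toy2) <- c("X", "Y", "Z")
```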
+
+## Extracting components from lists
+3. Create a new list, `list_hw`, with three components: the `glengths` vector, the dataframe `df`, and the `number` value. Use this list to answer the questions below. `list_hw` has the following structure (NOTE: the components of this list are not currently named):
+
+          [[1]]
+          [1]   4.6  3000.0 50000.0 
+
+          [[2]]
+                 species  glengths 
+            1    ecoli    4.6
+            2    human    3000.0
+            3    corn     50000.0
+
+          [[3]]
+          [1] 8
+
+Write out the R code you would use to perform the following operations (questions **DO NOT** build upon each other):
+ - return the second component of the list:
+ - return `50000.0` from the first component of the list:
+ - return the value `human` from the second component: 
+ - give the components of the list the following names: "genome_lengths", "genomes", "record":
+
+## Creating figures with ggplot2
+
+![plot_image](plotcounts.png)
+
+4. Use ggplot2 to recreate the plot above from the provided metadata and counts datasets. The [metadata table](https://github.com/hbc/Intro-to-R-2-day/raw/master/data/Mov10_full_meta.txt) describes an experiment that you have set up for RNA-seq analysis, while the [associated count matrix](https://github.com/hbc/Intro-to-R-2-day/raw/master/data/normalized_counts.txt) gives the normalized counts for each sample for every gene. Download the count matrix and metadata using the links provided.
+
+     Follow the instructions below to build your plot. Write the code you used and provide the final image.
+
+     - Read in the metadata file using: `meta <- read.delim("Mov10_full_meta.txt", sep="\t", row.names=1)`
+
+     - Read in the count matrix file using: `data <- read.delim("normalized_counts.txt", sep="\t", row.names=1)`
+
+     - Create a vector called `expression` that contains the normalized count values from the row in normalized_counts that corresponds to the MOV10 gene.  
+
+     - Check the class of this expression vector. Then, convert it to a numeric vector using `as.numeric(expression)`
+
+     - Bind that vector to your metadata data frame (`meta`) and call the new data frame `df`. 
+
+     - Create a ggplot by constructing the plot line by line:
+
+          - Initialize a  ggplot with your `df` as input.
+
+          - Add the `geom_jitter()` geometric object with the required aesthetics, x and y.
+
+          - Color the points based on `sampletype`
+
+          - Add the `theme_bw()` layer 
+
+          - Add the title "Expression of MOV10" to the plot
+
+          - Change the x-axis label to be blank
+
+          - Change the y-axis label to "Normalized counts"
+
+          - Using `theme()` change the following properties of the plot:
+
+               - Remove the legend (Hint: use ?theme help and scroll down to legend.position)
+
+               - Change the plot title size to 1.5x the default and center align
+
+               - Change the axis title to 1.5x the default size
+
+               - Change the size of the axis text only on the y-axis to 1.25x the default size
+               
+               - Rotate the x-axis text to 45 degrees using `axis.text.x=element_text(angle=45, hjust=1)`
+
+## Practice with nested functions (optional)
+
+Let's derive some nested functions similar to those we will use in our RNA-Seq analysis. The following dataframes, `value_table` and `meta`, should be used to address the questions below (you do not actually need to create these dataframes):
+
+**value_table**
+
+| |MX1|	MX2|	MX3|
+|:--: |:--:|	:--:|	:--:|
+|KD.2	|-222517.197	|-21756.82	|-16036.035|
+|KD.3	|17453.907	|-30058.14	|-25837.482|
+|OE.1	|-31247.923|	73061.38	|7019.940|
+|OE.2	|-4184.355	|61994.47	|1777.858|
+|OE.3|	147391.709	|11970.45	|-18663.686|
+|IR.1|	-32247.617	|-27896.01	|29383.153|
+|IR.2	|25456.820|	-30714.29	|19148.752|
+|IR.3	|99894.656|	-36601.04|	3207.501|
+
+**meta**
+
+| |sampletype|	MOVexpr|
+|:--: |:--:|	:--:|
+|KD.2|	MOV10_knockdown	|low|
+|KD.3	|MOV10_knockdown|	low|
+|OE.1	|MOV10_overexpression	|high|
+|OE.2|	MOV10_overexpression|	high|
+|OE.3	|MOV10_overexpression	|high|
+|IR.1	|siRNA|	normal|
+|IR.2	|siRNA|	normal|
+|IR.3|	siRNA	|normal|
+
+
+
+5. We would like to count the number of samples which have normal Mov10 expression (`MOVexpr`) in the `meta` dataset. Let's do this in steps:
+
+   - Write the R code you would run to return the row numbers of the samples with `MOVexpr` equal to "normal": 
+
+   - Write the R code you would run to determine the number of elements in the `MOVexpr` column: 
+
+   - Now, try to combine your first two actions into a single line of code using nested functions to count the elements of the `MOVexpr` column for which MOV10 expression is normal:
+
+6. Now we would like to add the `MX1` and `MX3` columns to the `meta` data frame. Let's do this in steps:
+
+   - Write the R code you would run to extract columns `MX1` and `MX3` from the `value_table` and to save it to a variable `mx` (hint: you will need to use the `c()` function to specify the columns you want to extract): 
+
+   - Using the `cbind()` function, write the R code you would use to add the columns in your `mx` variable to the end of your `meta` dataset:
+
+   - Now, try to combine your first two actions into a single line of code using nested functions (hint: you do not need to generate the `mx` variable) to add the `MX1` and `MX3` columns to the `meta` data frame:
+
+7. Finally, we would like to extract only those rows from the `meta` dataset for replicate 2 from all conditions (KD.2, OE.2, IR.2). Let's do this in steps:
+
+   - Write the function you would use to determine the row names of the `meta` dataset: 
+
+   - Using the `which()` function, write the R code you would run to determine the location of the row name `KD.2` in the `meta` dataset: 
+
+   - Using the `which()` function, write the R code you would use to determine the location of row names `KD.2`, `OE.2`, and `IR.2` in the `meta` dataset (use the OR operator ("\|") to return multiple locations):
+
+   - Now, extract the rows from the `meta` dataset with row names `KD.2`, `OE.2`, and `IR.2` in a single line of code using nested functions:
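The nested-function pattern in questions 5 to 7 (compute positions with `which()`, then pass them straight to another function or to `[]`) can be sketched on a hypothetical toy data frame. The names `toy`, `status` and `idx` are made up for illustration and are not the homework's data.

```r
# Hypothetical toy data frame with row names
toy <- data.frame(status = c("low", "high", "normal", "normal"),
                  row.names = c("A.1", "A.2", "B.1", "B.2"),
                  stringsAsFactors = FALSE)

which(toy$status == "normal")           # positions of the matching rows
length(which(toy$status == "normal"))   # nested call: how many rows match

# Locate rows by name with the OR operator, then use the positions to subset
idx <- which(rownames(toy) == "A.2" | rownames(toy) == "B.2")
toy[idx, , drop = FALSE]                # drop = FALSE keeps the result a data frame
```

Writing the intermediate `idx` first and then substituting its definition into the brackets is one way to build up a nested call step by step.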