copied over contents of Intro to R workshop
marypiper committed Jun 10, 2019
1 parent b9063cb commit 502bbd8
Showing 111 changed files with 67,249 additions and 2 deletions.
45 changes: 43 additions & 2 deletions README.md
@@ -1,2 +1,43 @@
# EpiR
6 hour introduction to R for Epi summer program
## Introduction to R

| Audience | Computational skills required | Duration |
|:----------|:-------------|:----------|
| Biologists | None | 1-day workshop (~ 5.5 hours of trainer-led time)|

### Description
This repository has teaching materials for a hands-on **Introduction to R** workshop. The workshop will introduce participants to the basics of R and RStudio. R is a simple programming environment that enables the effective handling of data, while providing excellent graphical support. RStudio is a tool that provides a user-friendly environment for working with R.

These materials are intended to provide both basic R programming knowledge and its application to more efficient data analysis.

> These materials are developed for a trainer-led workshop, but are also amenable to self-guided learning.
### Learning Objectives

1. **R syntax**: Understand the different 'parts of speech'.
2. **Data types and structures in R**: Describe the various data types and data structures.
3. **Data inspection and wrangling**: Demonstrate the use of functions and indices to inspect and subset data from various data structures.
4. **Visualizing data**: Demonstrate the use of the ggplot2 package to create plots for easy data visualization.

### Lessons

Below are links to the lessons and suggested schedules:

* [1-day schedule](https://hbctraining.github.io/Intro-to-R/schedules/1.5-day.html)

### Installation Requirements

Download the most recent versions of R and RStudio for the appropriate OS using the links below:

- [R](https://cran.r-project.org/)
- [RStudio](https://www.rstudio.com/products/rstudio/download/#download)

### Dataset

All the files used for the above lessons are linked within, but can also be [accessed here](https://github.com/hbctraining/Intro-to-R-with-DGE/tree/master/data).

---
*These materials have been developed by members of the teaching team at the [Harvard Chan Bioinformatics Core (HBC)](http://bioinformatics.sph.harvard.edu/). These are open access materials distributed under the terms of the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.*

* *Some materials used in these lessons were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/).
All Data Carpentry instructional material is made available under the [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/) (CC BY 4.0).*

2 changes: 2 additions & 0 deletions _config.yml
@@ -0,0 +1,2 @@
theme: jekyll-theme-cayman
title: Introduction to R
8 changes: 8 additions & 0 deletions assets/css/style.scss
@@ -0,0 +1,8 @@
---
---

@import "{{ site.theme }}";

.page-header { color: #fff; text-align: center; background-image: url("../images/dna-sequence-1600x800.jpg"); }

.main-content h1, .main-content h2, .main-content h3, .main-content h4, .main-content h5, .main-content h6 { margin-top: 2rem; margin-bottom: 1rem; font-weight: normal; color: #000000; }
Binary file added assets/images/dna-sequence-1600x800.jpg
9 changes: 9 additions & 0 deletions data/Mov10_full_meta.txt
@@ -0,0 +1,9 @@
sampletype MOVexpr
Mov10_kd_2 MOV10_knockdown low
Mov10_kd_3 MOV10_knockdown low
Mov10_oe_1 MOV10_overexpression high
Mov10_oe_2 MOV10_overexpression high
Mov10_oe_3 MOV10_overexpression high
Irrel_kd_1 control normal
Irrel_kd_2 control normal
Irrel_kd_3 control normal
7 changes: 7 additions & 0 deletions data/animals.csv
@@ -0,0 +1,7 @@
speed,color
Elephant,40,Gray
Cheetah,120,Tan
Tortoise,0.1,Green
Hare,48,Grey
Lion,80,Tan
PolarBear,30,White
38,829 changes: 38,829 additions & 0 deletions data/counts.rpkm

Large diffs are not rendered by default.

13 changes: 13 additions & 0 deletions data/mouse_exp_design.csv
@@ -0,0 +1,13 @@
genotype,celltype,replicate
sample1,Wt,typeA,1
sample2,Wt,typeA,2
sample3,Wt,typeA,3
sample4,KO,typeA,1
sample5,KO,typeA,2
sample6,KO,typeA,3
sample7,Wt,typeB,1
sample8,Wt,typeB,2
sample9,Wt,typeB,3
sample10,KO,typeB,1
sample11,KO,typeB,2
sample12,KO,typeB,3
23,369 changes: 23,369 additions & 0 deletions data/normalized_counts.txt

Large diffs are not rendered by default.

Binary file added dataset.zip
Binary file not shown.
165 changes: 165 additions & 0 deletions homework/Intro_to_R_hw.md
@@ -0,0 +1,165 @@
# Introduction to R practice

## Creating vectors/factors and dataframes

1. We are performing RNA-Seq on cancer samples being treated with three different types of treatment (A, B, and P). You have 12 samples total, with 4 replicates per treatment. Write the R code you would use to construct your metadata table as described below.
- Create the vectors/factors for each column (Hint: you can type out each vector/factor, or if you want the process to go faster, try exploring the `rep()` function).
- Put them together into a dataframe called `meta`.
- Use the `rownames()` function to assign row names to the dataframe (Hint: you can type out the row names as a vector, or if you want the process to go faster, try exploring the `paste()` function).

Your finished metadata table should have information for the variables `sex`, `stage`, `treatment`, and `myc` levels:

| |sex | stage | treatment | myc |
|:--:|:--: | :--: | :------: | :--: |
|sample1| M |I |A |2343|
|sample2| F |II |A |457|
|sample3 |M |II |A |4593|
|sample4 |F |I |A |9035|
|sample5| M |II |B |3450|
|sample6| F| II| B| 3524|
|sample7| M| I| B| 958|
|sample8| F| II| B| 1053|
|sample9| M| II| P| 8674|
|sample10 |F| I |P |3424|
|sample11| M |II |P |463|
|sample12| F| II| P| 5105|
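
As a minimal sketch of the `rep()` and `paste()` hints above, here is how the two functions behave on made-up values. The names `dose`, `batch`, and `toy` are hypothetical and unrelated to the exercise; this is an illustration, not the expected answer.

```r
# Toy example only; all values and names are invented for illustration.
dose  <- rep(c("low", "high"), each = 3)   # each value repeated 3 times in a row
batch <- rep(c(1, 2, 3), times = 2)        # the whole vector repeated twice

# Combine the vectors into a data frame, storing `dose` as a factor.
toy <- data.frame(dose = factor(dose), batch = batch)

# paste() with sep = "" glues a prefix onto each number: "run1" ... "run6".
rownames(toy) <- paste("run", 1:6, sep = "")
toy
```

`each` repeats every element in place, while `times` repeats the vector as a whole; which one fits depends on how the rows of the table are ordered.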


## Subsetting vectors/factors and dataframes

2. Using the `meta` data frame from question #1, write out the R code you would use to perform the following operations (questions **DO NOT** build upon each other):

- return only the `treatment` and `sex` columns using `[]`:
- return the `treatment` values for samples 5, 7, 9, and 10 using `[]`:
- use `filter()` to return all data for those samples receiving treatment `P`:
- use `filter()`/`select()` to return only the `stage` and `treatment` columns for those samples with `myc` > 5000:
- remove the `treatment` column from the dataset using `[]`:
- remove samples 7, 8 and 9 from the dataset using `[]`:
- keep only samples 1-6 using `[]`:
- add a column called `pre_treatment` to the beginning of the dataframe with the values T, F, F, F, T, T, F, T, F, F, T, T (Hint: use `cbind()`):
- change the names of the columns to: "A", "B", "C", "D":
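
The bracket-based operations listed above follow a few general patterns. The sketch below shows them on an invented data frame (`toy`, with hypothetical columns `size` and `group`), using base R only. The `filter()`/`select()` items are left out, since they presumably rely on an add-on package that this sketch does not assume is installed.

```r
# Toy data frame; not the `meta` table from question 1.
toy <- data.frame(
  size  = c(10, 25, 7, 31),
  group = c("a", "b", "a", "b"),
  row.names = c("s1", "s2", "s3", "s4")
)

toy[, c("group", "size")]    # choose columns by name, in the order given
toy[c(1, 3), "size"]         # values of one column for rows 1 and 3
toy[toy$size > 20, ]         # rows where a logical condition is TRUE
toy[, -2, drop = FALSE]      # a negative index drops column 2
toy[-c(2, 3), ]              # a negative index drops rows 2 and 3

# cbind() places columns in argument order, so a new column can go first.
toy2 <- cbind(flag = c(TRUE, FALSE, TRUE, FALSE), toy)
colnames(toy2) <- c("X", "Y", "Z")   # one new name per column
```

The general form is `data[rows, columns]`: leaving one side of the comma empty keeps everything along that dimension.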

## Extracting components from lists
3. Create a new list, `list_hw`, with three components: the `glengths` vector, the dataframe `df`, and the `number` value. Use this list to answer the questions below. `list_hw` has the following structure (NOTE: the components of this list are not currently named):

    [[1]]
    [1]     4.6  3000.0 50000.0

    [[2]]
      species glengths
    1   ecoli      4.6
    2   human   3000.0
    3    corn  50000.0

    [[3]]
    [1] 8

Write out the R code you would use to perform the following operations (questions **DO NOT** build upon each other):
- return the second component of the list:
- return `50000.0` from the first component of the list:
- return the value `human` from the second component:
- give the components of the list the following names: "genome_lengths", "genomes", "record":
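
The operations above rest on the difference between selecting a list component and indexing inside it. Here is a minimal sketch on an invented list (`toy_list` and its contents are hypothetical, not `list_hw`):

```r
# Toy list with three unnamed components: a vector, a data frame, a string.
toy_list <- list(
  c(2.5, 10, 40),
  data.frame(id = c("x", "y"), n = c(1, 2)),
  "note"
)

toy_list[[1]]           # double brackets return the component itself
toy_list[[1]][2]        # then index into that component
toy_list[[2]]$id[1]     # a column of the data frame component, first value
toy_list[[2]][2, "n"]   # row 2 of column "n" in the data frame component

# Naming the components allows access with `$`.
names(toy_list) <- c("values", "table", "label")
toy_list$label
```

Single brackets (`toy_list[1]`) would instead return a shorter list, which is why the double brackets come first when a value inside a component is needed.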

## Creating figures with ggplot2

![plot_image](plotcounts.png)

4. Recreate the plot above with ggplot2, using the provided metadata and counts datasets. The [metadata table](https://github.com/hbc/Intro-to-R-2-day/raw/master/data/Mov10_full_meta.txt) describes an experiment that you have set up for RNA-seq analysis, while the [associated count matrix](https://github.com/hbc/Intro-to-R-2-day/raw/master/data/normalized_counts.txt) gives the normalized counts for each sample for every gene. Download the count matrix and metadata using the links provided.

Follow the instructions below to build your plot. Write the code you used and provide the final image.

- Read in the metadata file using: `meta <- read.delim("Mov10_full_meta.txt", sep="\t", row.names=1)`

- Read in the count matrix file using: `data <- read.delim("normalized_counts.txt", sep="\t", row.names=1)`

- Create a vector called `expression` that contains the normalized count values from the row in normalized_counts that corresponds to the MOV10 gene.

- Check the class of this expression vector. Then, convert it to a numeric vector using `as.numeric(expression)`

- Bind that vector to your metadata data frame (`meta`) and call the new data frame `df`.

- Create a ggplot by constructing the plot line by line:

- Initialize a ggplot with your `df` as input.

- Add the `geom_jitter()` geometric object with the required aesthetics, x and y.

- Color the points based on `sampletype`

- Add the `theme_bw()` layer

- Add the title "Expression of MOV10" to the plot

- Change the x-axis label to be blank

- Change the y-axis label to "Normalized counts"

- Using `theme()` change the following properties of the plot:

- Remove the legend (Hint: use ?theme help and scroll down to legend.position)

- Change the plot title size to 1.5x the default and center align

- Change the axis title to 1.5x the default size

- Change the size of the axis text only on the y-axis to 1.25x the default size
- Rotate the x-axis text to 45 degrees using `axis.text.x=element_text(angle=45, hjust=1)`
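
The overall workflow above (pull one gene's row, convert it to numeric, bind it to the metadata, then build the plot layer by layer) can be sketched on invented data. Everything in this block is hypothetical: the `counts` and `info` tables, the gene names, and the sizes chosen in `theme()`. It assumes the ggplot2 package is installed and is not the solution to the exercise.

```r
library(ggplot2)

# Invented counts for two genes across four samples (not the MOV10 data).
counts <- data.frame(
  s1 = c(5.1, 80), s2 = c(6.3, 75), s3 = c(20.4, 82), s4 = c(22.8, 79),
  row.names = c("geneA", "geneB")
)
info <- data.frame(
  condition = c("ctrl", "ctrl", "treated", "treated"),
  row.names = c("s1", "s2", "s3", "s4")
)

# Selecting one row of a data frame gives a one-row data frame,
# so convert it to a plain numeric vector before binding.
gene_a <- counts["geneA", ]
class(gene_a)
gene_a <- as.numeric(gene_a)
plot_df <- cbind(info, gene_a)

# Build the plot one layer at a time, joining layers with `+`.
p <- ggplot(plot_df) +
  geom_point(aes(x = condition, y = gene_a, color = condition)) +
  theme_bw() +
  ggtitle("Toy expression of geneA") +
  xlab("Condition") +
  ylab("Counts") +
  theme(
    plot.title  = element_text(size = rel(1.2), hjust = 0.5),  # scaled, centered
    axis.text.y = element_text(size = rel(1.1))
  )
p
```

`rel()` scales a text size relative to the theme default, which is one way to express "1.5x the default" style requirements without hard-coding point sizes.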

## Practice with nested functions (optional)

Let's derive some nested functions similar to those we will use in our RNA-Seq analysis. The following dataframes, `value_table` and `meta`, should be used to address the questions below (you do not actually need to create these dataframes):

**value_table**

| |MX1| MX2| MX3|
|:--: |:--:| :--:| :--:|
|KD.2 |-222517.197 |-21756.82 |-16036.035|
|KD.3 |17453.907 |-30058.14 |-25837.482|
|OE.1 |-31247.923| 73061.38 |7019.940|
|OE.2 |-4184.355 |61994.47 |1777.858|
|OE.3| 147391.709 |11970.45 |-18663.686|
|IR.1| -32247.617 |-27896.01 |29383.153|
|IR.2 |25456.820| -30714.29 |19148.752|
|IR.3 |99894.656| -36601.04| 3207.501|

**meta**

| |sampletype| MOVexpr|
|:--: |:--:| :--:|
|KD.2| MOV10_knockdown |low|
|KD.3 |MOV10_knockdown| low|
|OE.1 |MOV10_overexpression |high|
|OE.2| MOV10_overexpression| high|
|OE.3 |MOV10_overexpression |high|
|IR.1 |siRNA| normal|
|IR.2 |siRNA| normal|
|IR.3| siRNA |normal|



5. We would like to count the number of samples which have normal Mov10 expression (`MOVexpr`) in the `meta` dataset. Let's do this in steps:

- Write the R code you would run to return the row numbers of the samples with `MOVexpr` equal to "normal":

- Write the R code you would run to determine the number of elements in the `MOVexpr` column:

- Now, try to combine your first two actions into a single line of code using nested functions to determine the number of elements in the MOVexpr column with expression levels of MOV10 being normal:

6. Now we would like to add the `MX1` and `MX3` columns to the `meta` data frame. Let's do this in steps:

- Write the R code you would run to extract columns `MX1` and `MX3` from the `value_table` and to save it to a variable `mx` (hint: you will need to use the `c()` function to specify the columns you want to extract):

- Using the `cbind()` function, write the R code you would use to add the columns in your `mx` variable to the end of your `meta` dataset:

- Now, try to combine your first two actions into a single line of code using nested functions (hint: you do not need to generate the `mx` variable) to add the `MX1` and `MX3` columns to the `meta` file:

7. Finally, we would like to extract only those rows from the `meta` dataset for replicate 2 from all conditions (KD.2, OE.2, IR.2). Let's do this in steps:

- Write the function you would use to determine the row names of the `meta` dataset:

- Using the `which()` function, write the R code you would run to determine the location of the row name `KD.2` in the `meta` dataset:

- Using the `which()` function, write the R code you would use to determine the location of row names `KD.2`, `OE.2`, and `IR.2` in the `meta` dataset (use the OR operator ("\|") to return multiple locations):

- Now, extract the rows from the `meta` dataset with row names `KD.2`, `OE.2`, and `IR.2` using a single line of code using nested functions:
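
Questions 5 to 7 all nest one function call inside another, so that the inner result feeds the outer call directly. Here is a minimal sketch of that idea on invented tables (`toy` and `scores`, with hypothetical row and column names), not the `meta` and `value_table` data:

```r
# Toy tables; names and values are invented for illustration.
toy <- data.frame(
  status = c("up", "flat", "up", "down", "flat"),
  row.names = c("a.1", "a.2", "b.1", "b.2", "c.1"),
  stringsAsFactors = FALSE
)
scores <- data.frame(p = 1:5, q = 6:10, r = 11:15, row.names = rownames(toy))

which(toy$status == "up")           # positions where the condition is TRUE
length(which(toy$status == "up"))   # inner call runs first, then length()

# Extract columns and bind them in one step, without an intermediate variable.
combined <- cbind(toy, scores[, c("p", "r")])

# which() on the row names, with | combining two conditions, nested inside [ ].
picked <- toy[which(rownames(toy) == "a.1" | rownames(toy) == "b.1"), , drop = FALSE]
picked
```

R evaluates nested calls from the inside out, so writing each step separately first, as the questions suggest, makes it easier to check the intermediate values before combining them.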