1 Overview of RNA-Seq analysis workflow
2 Experiment design
3 Preliminary analysis
4 Statistics behind differential expression analysis
5 Differential expression analysis
6 Pathway analysis
7 More on RANSeq analysis and differential expression analysis

1 Overview of RNA-Seq analysis workflow

2 Experiment design

3 Preliminary analysis

3.1 Experimental design

targets = read.delim('http://bioinf.wehi.edu.au/edgeR/F1000Research2016/targets.txt', 
                     stringsAsFactors=FALSE)
targets

##                GEO        SRA CellType    Status
## MCL1.DG GSM1480297 SRR1552450        B    virgin
## MCL1.DH GSM1480298 SRR1552451        B    virgin
## MCL1.DI GSM1480299 SRR1552452        B  pregnant
## MCL1.DJ GSM1480300 SRR1552453        B  pregnant
## MCL1.DK GSM1480301 SRR1552454        B lactating
## MCL1.DL GSM1480302 SRR1552455        B lactating
## MCL1.LA GSM1480291 SRR1552444        L    virgin
## MCL1.LB GSM1480292 SRR1552445        L    virgin
## MCL1.LC GSM1480293 SRR1552446        L  pregnant
## MCL1.LD GSM1480294 SRR1552447        L  pregnant
## MCL1.LE GSM1480295 SRR1552448        L lactating
## MCL1.LF GSM1480296 SRR1552449        L lactating

To study the luminal cells in the mouse mammary gland.

The first two columns are GEO and SRA identifiers
CellType: basal stem-cell enriched cells (B) and com- mitted luminal cells (L)
Status: mammary glands of virgin, pregnant and lactating mice

group <- paste(targets$CellType, targets$Status, sep=".")
group <- factor(group)
table(group)

## group
## B.lactating  B.pregnant    B.virgin L.lactating  L.pregnant    L.virgin 
##           2           2           2           2           2           2

Why do we combine the Cell Type and Status to create group?

3.2 Downloading the read counts

FileURL = paste(
  "http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE60450",
  "format=file",
  "file=GSE60450_Lactation-GenewiseCounts.txt.gz",
  sep="&"
)
download.file(FileURL, "GSE60450_Lactation-GenewiseCounts.txt.gz")

3.3 Importing data

GenewiseCounts <- read.delim("GSE60450_Lactation-GenewiseCounts.txt.gz",
                             row.names="EntrezGeneID")
colnames(GenewiseCounts) <- substring(colnames(GenewiseCounts),1,7)
dim(GenewiseCounts)

## [1] 27179    13

head(GenewiseCounts)

##           Length MCL1.DG MCL1.DH MCL1.DI MCL1.DJ MCL1.DK MCL1.DL MCL1.LA
## 497097      3634     438     300      65     237     354     287       0
## 100503874   3259       1       0       1       1       0       4       0
## 100038431   1634       0       0       0       0       0       0       0
## 19888       9747       1       1       0       0       0       0      10
## 20671       3130     106     182      82     105      43      82      16
## 27395       4203     309     234     337     300     290     270     560
##           MCL1.LB MCL1.LC MCL1.LD MCL1.LE MCL1.LF
## 497097          0       0       0       0       0
## 100503874       0       0       0       0       0
## 100038431       0       0       0       0       0
## 19888           3      10       2       0       0
## 20671          25      18       8       3      10
## 27395         464     489     328     307     342

3.3.1 Create `DGElist` data object

The edgeR package stores data in a simple list-based data object called DGElist, main components include:

a matrix of read counts
sample information in the data.frame format
optional gene annotation

y <- DGEList(genes=GenewiseCounts[,1,drop=FALSE],
             GenewiseCounts[,-1], 
             group=group)
options(digits=3)
y$samples

##               group lib.size norm.factors
## MCL1.DG    B.virgin 23227641            1
## MCL1.DH    B.virgin 21777891            1
## MCL1.DI  B.pregnant 24100765            1
## MCL1.DJ  B.pregnant 22665371            1
## MCL1.DK B.lactating 21529331            1
## MCL1.DL B.lactating 20015386            1
## MCL1.LA    L.virgin 20392113            1
## MCL1.LB    L.virgin 21708152            1
## MCL1.LC  L.pregnant 22241607            1
## MCL1.LD  L.pregnant 21988240            1
## MCL1.LE L.lactating 24723827            1
## MCL1.LF L.lactating 24657293            1

3.3.2 Adding gene annotation

The row names of the DGElist object y are ENTREZIDs. We can map gene annotations to these IDs.

# source("https://bioconductor.org/biocLite.R")
# biocLite("org.Mm.eg.db")
# library(org.Mm.eg.db)
y$genes$Symbol <- mapIds(org.Mm.eg.db, 
                         rownames(y),
                         keytype="ENTREZID", 
                         column="SYMBOL")

## 'select()' returned 1:1 mapping between keys and columns

head(y$genes)

##           Length  Symbol
## 497097      3634    Xkr4
## 100503874   3259 Gm19938
## 100038431   1634 Gm10568
## 19888       9747     Rp1
## 20671       3130   Sox17
## 27395       4203  Mrpl15

Remove genes that don’t have official gene symbols.

y <- y[!is.na(y$genes$Symbol), ]
dim(y)

## [1] 26327    12

3.3.3 Filtering to remove low counts

From biological point of view, a gene must be expressed at some mini- mal level before it is likely to be translated into a protein or to be considered biologically important. From a statistical point of view, genes with consistently low counts are very unlikely be assessed as significantly DE because low counts do not provide enough statistical evidence for a reliable judgement to be made.

Filter genes based on count-per-million(CPM) values instead of simply on raw counts. The CPM takes into account the library size.
What problem could have if the filtering is based on raw counts?

keep <- rowSums(cpm(y) > 0.5) >= 2
table(keep)

## keep
## FALSE  TRUE 
## 10686 15641

y <- y[keep, , keep.lib.sizes=FALSE]

3.4 Exploring differences between libraries

Use Multi-Dimensional Scaling (MDS) plots to visualize the difference of within- and betwee-groups.

y <- calcNormFactors(y)
pch <- c(0,1,2,15,16,17)
colors <- rep(c("darkgreen", "red", "blue"), 2)
plotMDS(y, col=colors[group], pch=pch[group], xlim=c(-4, 4), ylim=c(-5,5))
legend("topleft", legend=levels(group), pch=pch, col=colors, ncol=2)

Use Mean-Difference (MD) plots to visualize log-fold change between two libraries against the average log-expression.

what are those diagnoal lines in the MD plots?

par(mfrow=c(1,2))
plotMD(y, column=1, ylim=c(-10, 10))
plotMD(y, column=11, ylim=c(-10, 10))

plotMD(y[y$counts[,1] < 3, ], column = 1, ylab="read counts < 3")
plotMD(y[y$counts[,1] == 300, ], column = 1, ylab = "read counts = 300")

par(mfrow=c(1,1))

4 Statistics behind differential expression analysis

5 Differential expression analysis

5.1 Design matrix

design <- model.matrix(~0+group)
colnames(design) <- levels(group)
design

##    B.lactating B.pregnant B.virgin L.lactating L.pregnant L.virgin
## 1            0          0        1           0          0        0
## 2            0          0        1           0          0        0
## 3            0          1        0           0          0        0
## 4            0          1        0           0          0        0
## 5            1          0        0           0          0        0
## 6            1          0        0           0          0        0
## 7            0          0        0           0          0        1
## 8            0          0        0           0          0        1
## 9            0          0        0           0          1        0
## 10           0          0        0           0          1        0
## 11           0          0        0           1          0        0
## 12           0          0        0           1          0        0
## attr(,"assign")
## [1] 1 1 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

Usually, the predictors in the linear model in differential expression analysis are categorical variables (e.g., the group variable). Have you encountered situations in your research that the predictors are continous variables?

5.2 Normalization (estimate size factors)

See the change in norm.factors column before and after normalization.

y <- calcNormFactors(y)
y$samples

##               group lib.size norm.factors
## MCL1.DG    B.virgin 23137472        1.235
## MCL1.DH    B.virgin 21687755        1.213
## MCL1.DI  B.pregnant 23974787        1.125
## MCL1.DJ  B.pregnant 22545375        1.069
## MCL1.DK B.lactating 21420532        1.036
## MCL1.DL B.lactating 19916685        1.087
## MCL1.LA    L.virgin 20273585        1.370
## MCL1.LB    L.virgin 21568458        1.368
## MCL1.LC  L.pregnant 22117517        1.006
## MCL1.LD  L.pregnant 21877287        0.924
## MCL1.LE L.lactating 24657903        0.529
## MCL1.LF L.lactating 24600304        0.535

Effective library size = library size X normalizatin factor.

The normalization factors of all the libraries multiply to unity. A normalization factor below one indicates that a small number of high count genes are monopolizing the sequencing, causing the counts for other genes to be lower than would be usual given the library size. As a result, the effective library size will be scaled down for that sample. Here we see that the luminal-lactating samples have low normalization factors. This is a sign that these samples contain a number of very highly upregulated genes.

5.3 Dispersion estimation

Dispersion = Coefficient of Variation = Stadard Deviation divided by Mean Biological Cofficient of Variation (BCV) = sqaure root Dispersion = square root Coefficient of Variation

The dispersion trend tends to decrease smoothly with abundance and to asymptotic to a constant value for genes with larger counts

y <- estimateDisp(y, design, robust=TRUE)
par(mfrow=c(1,2))
plotBCV(y, ylim=c(0,2.5))

fit <- glmQLFit(y, design, robust=TRUE)
plotQLDisp(fit, ylim=c(0,2.5))

par(mfrow=c(1,1))

5.4 Testing for differential expression

design

##    B.lactating B.pregnant B.virgin L.lactating L.pregnant L.virgin
## 1            0          0        1           0          0        0
## 2            0          0        1           0          0        0
## 3            0          1        0           0          0        0
## 4            0          1        0           0          0        0
## 5            1          0        0           0          0        0
## 6            1          0        0           0          0        0
## 7            0          0        0           0          0        1
## 8            0          0        0           0          0        1
## 9            0          0        0           0          1        0
## 10           0          0        0           0          1        0
## 11           0          0        0           1          0        0
## 12           0          0        0           1          0        0
## attr(,"assign")
## [1] 1 1 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

B.LvsP <- makeContrasts(B.lactating-B.pregnant, levels=design)
res <- glmQLFTest(fit, contrast=B.LvsP)

topTags(res)

## Coefficient:  1*B.lactating -1*B.pregnant 
##        Length  Symbol logFC logCPM   F   PValue      FDR
## 12992     765 Csn1s2b  6.08  10.19 421 4.77e-11 7.47e-07
## 211577   2006  Mrgprf  5.15   2.75 343 1.32e-10 7.94e-07
## 226101   7094    Myof  2.32   6.45 322 1.96e-10 7.94e-07
## 381290   8292  Atp2b4  2.14   6.15 320 2.03e-10 7.94e-07
## 140474  11281    Muc4 -7.17   6.06 308 2.61e-10 8.16e-07
## 231830   3346 Micall2 -2.25   5.19 282 4.43e-10 1.15e-06
## 24117    2242    Wif1 -1.82   6.77 261 7.10e-10 1.59e-06
## 12740    1812   Cldn4 -5.32   9.87 298 9.07e-10 1.73e-06
## 21953     667   Tnni2  5.75   3.86 313 9.93e-10 1.73e-06
## 231991   2873   Creb5  2.57   4.87 241 1.17e-09 1.83e-06

Determine which genes are DE genes based on the glmQLFTest result.

is.de <- decideTestsDGE(res)
summary(is.de)

##    1*B.lactating -1*B.pregnant
## -1                        2758
## 0                        10396
## 1                         2487

Visualize log-fold change against average abundance.

plotMD(res, status=is.de, values=c(1,-1), col=c("red","blue"),
       legend="topright")

Are the DE genes identified based on P values or the FDR values in the res? If we use P values to filter DE genes, which DE genes are more likely to be false positives?

par(mfrow=c(1,2))
# based on P value
plotMD(res[res$table$PValue < 0.05, ], values=c(1,-1),
       col=c("red","blue"),legend="topright")

# Based on FDR
plotMD(res[is.de != 0, ], values=c(1,-1),
       col=c("red","blue"),legend="topright")

par(mfrow=c(1,1))

Are all statistically significant DE genes biologically meaningful to us?

Test whether the log-fold changes are significantly greater than a number.

tr <- glmTreat(fit, contrast=B.LvsP, lfc=log2(1.5))

par(mfrow=c(1,2))
# original
plotMD(res[is.de != 0, ], values=c(1,-1),
       col=c("red","blue"),legend="topright")

# log-fold change greater than 1.5
is.de.1point5 = decideTestsDGE(tr)
plotMD(res[is.de.1point5 != 0, ], values=c(1,-1),
       col=c("red","blue"),legend="topright")

par(mfrow=c(1,1))

5.5 Heat map clustering

logCPM <- cpm(y, prior.count=2, log=TRUE)
rownames(logCPM) <- y$genes$Symbol
colnames(logCPM) <- paste(y$samples$group, 1:2, sep="-")
o <- order(tr$table$PValue)
logCPM <- logCPM[o[1:30],]
logCPM <- t(scale(t(logCPM)))
col.pan <- colorpanel(100, "blue", "white", "red")
heatmap.2(logCPM, col=col.pan, Rowv=TRUE, scale="none",
          trace="none", dendrogram="both", cexRow=1, cexCol=1.4, 
          density.info="none",
          margin=c(7,8), lhei=c(2,6), lwid=c(2,5))

5.6 ANalysis Of DEViance (ANODEV)

The differential expression analysis comparing two groups can be easily extended to comparisons between three or more groups. This is done by creating a matrix of independent contrasts. In this manner, users can perform a one-way analysis of deviance (ANODEV) for each gene.

The QL F-test identifies genes that are DE between the three groups. This combines the three pairwise comparisons into a single F-statistic and p-value.

con <- makeContrasts(
     L.PvsL = L.pregnant - L.lactating,
     L.VvsL = L.virgin - L.lactating,
     L.VvsP = L.virgin - L.pregnant, levels=design)
res <- glmQLFTest(fit, contrast=con)
topTags(res)

## Coefficient:  LR test on 2 degrees of freedom 
##       Length  Symbol logFC.L.PvsL logFC.L.VvsL logFC.L.VvsP logCPM    F
## 19242   2021     Ptn        -1.54         7.26        8.802   7.97 2389
## 13645   4757     Egf        -5.36        -7.22       -1.863   3.67 1123
## 52150   4089   Kcnk6        -2.42        -7.00       -4.578   5.91 1020
## 15439   1345      Hp         1.08         5.42        4.337   4.93  991
## 12992    765 Csn1s2b        -8.55       -11.36       -2.810  10.19 1050
## 14183   5346   Fgfr2        -1.15         3.95        5.097   7.38  952
## 20856   1793    Stc2        -1.81         3.20        5.006   6.11  920
## 11941   7050  Atp2b2        -7.37       -10.56       -3.189   6.60 1134
## 13358   1678 Slc25a1        -4.13        -4.91       -0.783   7.50  887
## 17068    691    Ly6d         3.42         9.24        5.820   4.69  887
##         PValue      FDR
## 19242 3.70e-17 5.79e-13
## 13645 4.49e-15 3.14e-11
## 52150 8.28e-15 3.14e-11
## 15439 9.92e-15 3.14e-11
## 12992 1.00e-14 3.14e-11
## 14183 1.28e-14 3.15e-11
## 20856 1.59e-14 3.15e-11
## 11941 1.78e-14 3.15e-11
## 13358 2.01e-14 3.15e-11
## 17068 2.01e-14 3.15e-11

What does a significant PValue or FDR value in the table above mean? Does it mean all three groups are significantly different from each other?.

See the MD plot below, why are there no “down-regulated” genes?

is.de = decideTestsDGE(res)
plotMD(res, status=is.de, values=c(1,-1), col=c("red","blue"),
       legend="topright")

The decideTestsDGE() function: > If object contains F-tests or LRTs for multiple contrasts, then the genes are simply classified as significant (1) or not significant. In this case, the log2-fold-change theshold lfc has to be achieved by at least one of the contrastsf or a gene to be significant. >

6 Pathway analysis

6.1 Gene ontology analysis

To count the number of DE genes that are annotated with each possible GO term. GO terms that occur frequently in the list of DE genes are said to be over-represented or enriched.

fit <- glmQLFit(y, design, robust=TRUE)
tr <- glmTreat(fit, contrast=B.LvsP, lfc=log2(1.5))
go <- goana(tr, species="Mm")
topGO(go, n=15)

##                                            Term Ont    N Up Down  P.Up
## GO:0007059               chromosome segregation  BP  261  1   56 0.998
## GO:0000280                     nuclear division  BP  507 10   79 0.792
## GO:1903047           mitotic cell cycle process  BP  665  8   92 0.993
## GO:0007067             mitotic nuclear division  BP  407  5   68 0.972
## GO:0022402                   cell cycle process  BP  949 15  111 0.975
## GO:0000278                   mitotic cell cycle  BP  747  8   96 0.998
## GO:0048285                    organelle fission  BP  553 12   79 0.697
## GO:0051301                        cell division  BP  496  3   72 1.000
## GO:0000070 mitotic sister chromatid segregation  BP  125  0   35 1.000
## GO:0098813       nuclear chromosome segregation  BP  203  1   44 0.993
## GO:0000776                          kinetochore  CC  123  1   34 0.952
## GO:0007049                           cell cycle  BP 1317 17  129 0.999
## GO:0000819         sister chromatid segregation  BP  148  0   37 1.000
## GO:0000775       chromosome, centromeric region  CC  176  1   40 0.987
## GO:0098687                   chromosomal region  CC  299  5   49 0.856
##              P.Down
## GO:0007059 3.22e-23
## GO:0000280 7.62e-23
## GO:1903047 9.70e-23
## GO:0007067 1.83e-21
## GO:0022402 2.30e-21
## GO:0000278 2.31e-21
## GO:0048285 1.93e-20
## GO:0051301 4.41e-19
## GO:0000070 7.25e-19
## GO:0098813 1.20e-18
## GO:0000776 3.60e-18
## GO:0007049 4.23e-18
## GO:0000819 4.66e-18
## GO:0000775 7.83e-18
## GO:0098687 2.16e-15

The P.Up and P.Down columns contain the p-values for over-representation of the GO term in the up- and down-regulated genes, respectively. Note that the p-values are not adjusted for multiple testing—we would usually ignore GO terms with p-values greater than about 10^-5 (why such a small p value?).

6.2 KEGG pathway analysis

keg <- kegga(tr, species="Mm")
topKEGG(keg, n=15, truncate=34)

##                                          Pathway   N Up Down     P.Up
## path:mmu03008 Ribosome biogenesis in eukaryot...  76  0   19 1.00e+00
## path:mmu05150    Staphylococcus aureus infection  30  0   10 1.00e+00
## path:mmu04110                         Cell cycle 120  0   20 1.00e+00
## path:mmu00100               Steroid biosynthesis  18  5    0 4.55e-05
## path:mmu00240              Pyrimidine metabolism  93  0   15 1.00e+00
## path:mmu04970                 Salivary secretion  63  8    7 1.01e-04
## path:mmu00900    Terpenoid backbone biosynthesis  21  5    0 1.02e-04
## path:mmu04972               Pancreatic secretion  66  8    4 1.42e-04
## path:mmu04610 Complement and coagulation casc...  47  2    9 3.05e-01
## path:mmu00230                  Purine metabolism 151  2   17 8.78e-01
## path:mmu04510                     Focal adhesion 187 12    8 1.43e-03
## path:mmu04261 Adrenergic signaling in cardiom... 120  9    5 1.98e-03
## path:mmu05144                            Malaria  38  2    7 2.26e-01
## path:mmu04640         Hematopoietic cell lineage  60  2    9 4.17e-01
## path:mmu04911                  Insulin secretion  60  6    5 2.72e-03
##                 P.Down
## path:mmu03008 2.96e-09
## path:mmu05150 1.02e-06
## path:mmu04110 1.51e-06
## path:mmu00100 1.00e+00
## path:mmu00240 4.75e-05
## path:mmu04970 3.65e-02
## path:mmu00900 1.00e+00
## path:mmu04972 4.21e-01
## path:mmu04610 4.35e-04
## path:mmu00230 1.30e-03
## path:mmu04510 7.26e-01
## path:mmu04261 7.24e-01
## path:mmu05144 2.39e-03
## path:mmu04640 2.67e-03
## path:mmu04911 1.79e-01

Possible species values include “Hs” (human), “Mm” (mouse), “Rn” (rat), “Dm” (fly) or “Pt” (chimpanzee), but other values are possible if the corresponding organism package is available.

6.3 FRY gene set tests

Aims to test for particular gene expression signatures or particular pathways

6.3.1 Extract Entrez Gene Identifies

cyt.go <- c("GO:0032465", "GO:0000281", "GO:0000920")
Rkeys(org.Mm.egGO2ALLEGS) <- cyt.go
cyt.go.genes <- as.list(org.Mm.egGO2ALLEGS)
cyt.go.genes

## $`GO:0032465`
##      ISO      ISO      IMP      ISO      ISO      ISO      ISO      ISO 
##  "11848"  "12145"  "12190"  "12211"  "12531"  "12704"  "12795"  "13489" 
##      ISO      ISO      ISO      ISO      ISO      ISO      ISO      ISO 
##  "13490"  "13605"  "14539"  "16001"  "16569"  "18226"  "18754"  "20609" 
##      IBA      IBA      ISO      IBA      ISO      IMP      ISO      ISO 
##  "20871"  "20877"  "20877"  "20878"  "23834"  "23988"  "23988"  "26370" 
##      ISO      ISO      ISO      IMP      ISO      ISO      ISO      ISO 
##  "26554"  "26934"  "50850"  "52679"  "54673"  "56194"  "56208"  "57028" 
##      ISO      ISO      ISO      ISO      ISO      ISO      ISO      ISS 
##  "66371"  "66700"  "67903"  "68112"  "71819"  "72008"  "73139"  "73139" 
##      ISO      ISO      ISO      ISO      IDA      ISO      ISO      ISO 
##  "74393"  "75305"  "75669"  "78610"  "83560"  "83770" "101565" "102774" 
##      IMP      ISO      ISO      IBA      ISO      ISO      ISO      ISO 
## "108961" "109333" "116733" "211660" "211660" "216805" "216963" "225326" 
##      ISO      ISO      IMP      ISO      ISO      ISO      ISO 
## "227937" "239336" "240641" "240641" "242785" "381293" "387349" 
## 
## $`GO:0000281`
##      ISO      IMP      ISO      IGI      ISO      ISO      IMP      ISO 
##  "11735"  "11789"  "11789"  "11840"  "11848"  "12615"  "12631"  "12704" 
##      IMP      ISO      IGI      ISO      ISO      ISO      ISO      ISO 
##  "13163"  "16571"  "16765"  "16765"  "18226"  "18817"  "19348"  "20658" 
##      ISO      ISO      ISO      IEA      ISO      ISO      ISO      ISO 
##  "20742"  "26934"  "50850"  "57784"  "66515"  "66616"  "68743"  "69028" 
##      ISO      ISO      IBA      ISO      ISO      ISO      IMP      ISO 
##  "70527"  "71819"  "74107"  "74107"  "75305"  "75608"  "77579"  "77579" 
##      ISO      ISO      ISO      ISA      ISO      ISO      ISO      IMP 
##  "80986"  "84092" "102774" "108907" "170625" "216049" "216846" "235072" 
##      ISO      IMP 
## "235406" "240641" 
## 
## $`GO:0000920`
##      ISO      ISO      ISO      ISO      ISO      ISO      ISO      ISO 
##  "18571"  "19348"  "20479"  "66371"  "66700"  "67064"  "68942"  "68953" 
##      ISO      IBA      ISO      ISO      ISO      ISO      ISO      ISO 
##  "69028"  "74107"  "74107"  "75608"  "76959" "105513" "116733" "208092" 
##      ISO 
## "234852"

6.3.2 FRY test

# note that the row names of GenewiseCounts are the Entrez Gene Identifiers
B.VvsL <- makeContrasts(B.virgin-B.lactating, levels=design)
fry(y, index=cyt.go.genes, design=design, contrast=B.VvsL)

##            NGenes Direction  PValue     FDR PValue.Mixed FDR.Mixed
## GO:0032465     49        Up 0.00055 0.00165     3.82e-06  7.03e-06
## GO:0000920     16      Down 0.00186 0.00207     4.69e-06  7.03e-06
## GO:0000281     37        Up 0.00207 0.00207     1.20e-05  1.20e-05

6.3.3 Visualize FRY test

Using the code from the article generates wrong barcode plot. Need to switch the “Up” and “Down” labels.

res <- glmQLFTest(fit, contrast=B.VvsL)
index1 <- rownames(fit) %in% cyt.go.genes[[1]]
index2 = rownames(fit) %in% cyt.go.genes[[2]]
barcodeplot(res$table$logFC, index=index1, index2 = index2, labels=c("Down (B.lactating)", "Up (B.virgin)"),
            main=paste0(cyt.go[1], '(Up), ' , cyt.go[2], '(Down)'))

Genes associated with GO term GO:0032465 tend to be up-regulated in the basal cells of virgin mice compared to lactating mice, and genes associated with GO term GO:0000920 tend to be down-regulated confirming the result of the fry test above.

6.4 Camera gene set enrichment analysis

load(url("http://bioinf.wehi.edu.au/software/MSigDB/mouse_c2_v5p1.rdata"))
idx <- ids2indices(Mm.c2,id=rownames(y))
BvsL.v <- makeContrasts(B.virgin - L.virgin, levels=design)
cam <- camera(y, idx, design, contrast=BvsL.v, inter.gene.cor=0.01)
options(digits=2)
head(cam,14)

##                                                     NGenes Direction
## LIM_MAMMARY_STEM_CELL_UP                               782        Up
## LIM_MAMMARY_LUMINAL_MATURE_DN                          169        Up
## SASAI_RESISTANCE_TO_NEOPLASTIC_TRANSFROMATION           80        Up
## LIM_MAMMARY_STEM_CELL_DN                               664      Down
## FARMER_BREAST_CANCER_CLUSTER_4                          74        Up
## NABA_BASEMENT_MEMBRANES                                 52        Up
## HAEGERSTRAND_RESPONSE_TO_IMATINIB                       35        Up
## ROZANOV_MMP14_TARGETS_SUBSET                            83        Up
## REACTOME_COLLAGEN_FORMATION                             70        Up
## REACTOME_NCAM1_INTERACTIONS                             71        Up
## OXFORD_RALB_TARGETS_UP                                  27        Up
## ANASTASSIOU_CANCER_MESENCHYMAL_TRANSITION_SIGNATURE    148        Up
## PAPASPYRIDONOS_UNSTABLE_ATEROSCLEROTIC_PLAQUE_DN        68        Up
## LIM_MAMMARY_LUMINAL_PROGENITOR_UP                       92      Down
##                                                      PValue     FDR
## LIM_MAMMARY_STEM_CELL_UP                            2.0e-43 9.5e-40
## LIM_MAMMARY_LUMINAL_MATURE_DN                       3.8e-25 9.0e-22
## SASAI_RESISTANCE_TO_NEOPLASTIC_TRANSFROMATION       4.1e-20 6.5e-17
## LIM_MAMMARY_STEM_CELL_DN                            2.3e-19 2.7e-16
## FARMER_BREAST_CANCER_CLUSTER_4                      2.9e-19 2.8e-16
## NABA_BASEMENT_MEMBRANES                             1.6e-17 1.2e-14
## HAEGERSTRAND_RESPONSE_TO_IMATINIB                   4.0e-17 2.7e-14
## ROZANOV_MMP14_TARGETS_SUBSET                        5.2e-16 3.1e-13
## REACTOME_COLLAGEN_FORMATION                         1.3e-15 6.8e-13
## REACTOME_NCAM1_INTERACTIONS                         2.2e-15 1.0e-12
## OXFORD_RALB_TARGETS_UP                              4.6e-15 2.0e-12
## ANASTASSIOU_CANCER_MESENCHYMAL_TRANSITION_SIGNATURE 7.1e-15 2.8e-12
## PAPASPYRIDONOS_UNSTABLE_ATEROSCLEROTIC_PLAQUE_DN    1.3e-14 4.9e-12
## LIM_MAMMARY_LUMINAL_PROGENITOR_UP                   2.0e-14 6.9e-12

res <- glmQLFTest(fit, contrast=BvsL.v)
barcodeplot(res$table$logFC,
            index=idx[["LIM_MAMMARY_STEM_CELL_UP"]],
            index2=idx[["LIM_MAMMARY_STEM_CELL_DN"]],
            labels=c("Down (B.lactating)", "Up (B.virgin)"),
            main="LIM_MAMMARY_STEM_CELL",alpha=1)

Can you think of other use cases of FRY or Camera analysis?

What other downstream analysis can you do after you get a list of DE genes?

What RNA-Seq analysis pipeline do you use for your RNA-Seq data? Do you prefer running the whole workflow in the R environment? Why or why not?

7 More on RANSeq analysis and differential expression analysis

Other materials that I found useful to understand RNASeq and differential expression analysis: http://www.pathwaycommons.org/guide/primers/functional_analysis/rna_sequencing_analysis/

EPP531-1122

Ming Chen

11/19/2017

1 Overview of RNA-Seq analysis workflow

2 Experiment design

3 Preliminary analysis

3.1 Experimental design

3.2 Downloading the read counts

3.3 Importing data

3.3.1 Create `DGElist` data object

3.3.2 Adding gene annotation

3.3.3 Filtering to remove low counts

3.4 Exploring differences between libraries

4 Statistics behind differential expression analysis

5 Differential expression analysis

5.1 Design matrix

5.2 Normalization (estimate size factors)

5.3 Dispersion estimation

5.4 Testing for differential expression

5.5 Heat map clustering

5.6 ANalysis Of DEViance (ANODEV)

6 Pathway analysis

6.1 Gene ontology analysis

6.2 KEGG pathway analysis

6.3 FRY gene set tests

6.3.1 Extract Entrez Gene Identifies

6.3.2 FRY test

6.3.3 Visualize FRY test

6.4 Camera gene set enrichment analysis

7 More on RANSeq analysis and differential expression analysis

EPP531-1122

Ming Chen

11/19/2017

1 Overview of RNA-Seq analysis workflow

2 Experiment design

3 Preliminary analysis

3.1 Experimental design

3.2 Downloading the read counts

3.3 Importing data

3.3.1 Create DGElist data object

3.3.2 Adding gene annotation

3.3.3 Filtering to remove low counts

3.4 Exploring differences between libraries

4 Statistics behind differential expression analysis

5 Differential expression analysis

5.1 Design matrix

5.2 Normalization (estimate size factors)

5.3 Dispersion estimation

5.4 Testing for differential expression

5.5 Heat map clustering

5.6 ANalysis Of DEViance (ANODEV)

6 Pathway analysis

6.1 Gene ontology analysis

6.2 KEGG pathway analysis

6.3 FRY gene set tests

6.3.1 Extract Entrez Gene Identifies

6.3.2 FRY test

6.3.3 Visualize FRY test

6.4 Camera gene set enrichment analysis

7 More on RANSeq analysis and differential expression analysis

3.3.1 Create `DGElist` data object