WGBSFlow/Methylator pipeline version XXX
Developed and maintained at the BiBs platform (Epigenetics and Cell Fate Center - UMR7216) by Olivier Kirsh, Elouan Bethuel and Magali Hennion.
Please send feedback to: bibs@parisepigenetics.com.
Don’t forget to read and sign our charter.
Remember to cite pipeline authors in your publication.
Methylator runs R scripts and uses MethylKit package to handle DNA methylation data.
Conda within a Singularity/Apptainer image handles R version and package dependencies for improved reproducibility. Refer to the ‘R session information’ section for version details.
# load packages
library(EnrichedHeatmap)
library(GenomicRanges)
library(IRanges)
library(mixtools)
library(data.table)
library(tzdb)
library(readr)
library(genomation)
library(methylKit)
library(dplyr)
library(tidyr)
library(magrittr)
library(ggplot2)
library(xlsx)
library(yaml)
library(factoextra)
library(FactoMineR)
library(stringr)
library(annotatr)
library(GenomicFeatures)
Read analysis parameters from ‘config_wgbs.yaml’ or ‘config_nanopore.yaml’ and extract sample names and groups from the corresponding metadata file.
# load the config file
yaml.file <- yaml.load_file('config_ongoing_run.yaml')
# extract the information from the yaml file
DATATYPE <- yaml.file$DATATYPE
MINCOV <- yaml.file$MINCOV
DESTRAND <- yaml.file$DESTRAND
TS <- yaml.file$TILESIZE
RTS <- yaml.file$STEPSIZE
LEVEL <- yaml.file$LEVEL
COV.PERC <- yaml.file$COV.PERC
UNITE <- yaml.file$UNITE
CUSTOM_ANNOT <- yaml.file$CUSTOM_ANNOT
NB_CPG_TILES <- yaml.file$NB_CPG_TILES
ORG <- yaml.file$ASSEMBLY
merge_annot <- yaml.file$MERGE_WITH_BASICS_ANNOT
meta <- read.table(yaml.file$METAFILE, sep='\t', header=T)
# colors
full.color = "#de0000"
high.color = "#e17e65"
mid.color = "#c6c6c6"
low.color = "#7c6fac"
unmeth.color = "#0f208f"
hyper.color = "#de0000"
hypo.color = "#0f208f"
not.color = "#c6c6c6"
titre <- paste0("DNA methylation exploratory analysis on ", LEVEL, " for ", mark)
knitr::kable(meta, "pipe")
| sample | group |
|---|---|
| SRR9016926_sub500000_chr19 | WT |
| SRR9016927_sub500000_chr19 | 1KO |
| SRR9016928_sub500000_chr19 | DKO |
| SRR11806587_sub500000_chr19 | WT |
| SRR11806588_sub500000_chr19 | 1KO |
| SRR11806589_sub500000_chr19 | DKO |
Methylator takes primary DNA methylation data and generates one RData environment with one MethylRaw object per sample.
The MethylRaw object stores the following for each CpG:
- Genomic coordinates (Chrom, Start, End, Strand)
- Coverage (number of reads)
- Number of methylated and unmethylated cytosines (for computing the methylation status)
It contains the total number of CpG, sample_id, genome assembly and methylation context (CpG).
# load Rdata
# myobj
vect_myobj <- c()
treatment <- as.integer(factor(meta$group, ordered = TRUE))
recup_myobj <- function(RDATA){
load(RDATA)
myobj
}
vect_myobj <- lapply(liste_Rdata_path, recup_myobj)
myobj <- methylRawList(vect_myobj, treatment=c(treatment))
head(myobj, n = 20)
## [[1]]
## methylRaw object with 68609 rows
## --------------
## chr start end strand coverage numCs numTs
## 1 chr19 3050120 3050120 - 1 1 0
## 2 chr19 3051793 3051793 - 1 1 0
## 3 chr19 3051860 3051860 + 1 0 1
## 4 chr19 3051902 3051902 + 1 1 0
## 5 chr19 3051907 3051907 + 1 1 0
## 6 chr19 3051940 3051940 + 1 0 1
## --------------
## sample.id: WT_1
## assembly: mm39
## context: CpG
## resolution: base
##
##
## [[2]]
## methylRaw object with 77141 rows
## --------------
## chr start end strand coverage numCs numTs
## 1 chr19 3051793 3051793 - 1 0 1
## 2 chr19 3051820 3051820 - 1 0 1
## 3 chr19 3051822 3051822 - 1 0 1
## 4 chr19 3051861 3051861 - 1 1 0
## 5 chr19 3051903 3051903 - 2 0 2
## 6 chr19 3051908 3051908 - 2 0 2
## --------------
## sample.id: 1KO_1
## assembly: mm39
## context: CpG
## resolution: base
##
##
## [[3]]
## methylRaw object with 83572 rows
## --------------
## chr start end strand coverage numCs numTs
## 1 chr19 3050119 3050119 + 1 0 1
## 2 chr19 3050120 3050120 - 1 0 1
## 3 chr19 3050317 3050317 - 1 0 1
## 4 chr19 3050813 3050813 - 1 0 1
## 5 chr19 3050818 3050818 - 1 0 1
## 6 chr19 3050955 3050955 + 1 1 0
## --------------
## sample.id: DKO_1
## assembly: mm39
## context: CpG
## resolution: base
##
##
## [[4]]
## methylRaw object with 76702 rows
## --------------
## chr start end strand coverage numCs numTs
## 1 chr19 3051819 3051819 + 1 0 1
## 2 chr19 3051821 3051821 + 1 1 0
## 3 chr19 3051860 3051860 + 1 0 1
## 4 chr19 3051903 3051903 - 2 0 2
## 5 chr19 3051908 3051908 - 1 0 1
## 6 chr19 3051914 3051914 - 1 0 1
## --------------
## sample.id: WT_2
## assembly: mm39
## context: CpG
## resolution: base
##
##
## [[5]]
## methylRaw object with 81732 rows
## --------------
## chr start end strand coverage numCs numTs
## 1 chr19 3050603 3050603 + 1 0 1
## 2 chr19 3051012 3051012 + 1 0 1
## 3 chr19 3051067 3051067 + 1 0 1
## 4 chr19 3051154 3051154 + 1 0 1
## 5 chr19 3051573 3051573 + 1 0 1
## 6 chr19 3051820 3051820 - 1 0 1
## --------------
## sample.id: 1KO_2
## assembly: mm39
## context: CpG
## resolution: base
##
##
## [[6]]
## methylRaw object with 67543 rows
## --------------
## chr start end strand coverage numCs numTs
## 1 chr19 3051154 3051154 + 2 0 2
## 2 chr19 3051895 3051895 - 1 0 1
## 3 chr19 3051903 3051903 - 2 0 2
## 4 chr19 3051908 3051908 - 2 1 1
## 5 chr19 3051914 3051914 - 2 1 1
## 6 chr19 3052077 3052077 + 1 1 0
## --------------
## sample.id: DKO_2
## assembly: mm39
## context: CpG
## resolution: base
rm(vect_myobj)
rm(treatment)
Analysis on CpG or Tiles - LEVEL : CpG
Merge reads from both strands? DESTRAND : TRUE
Minimum coverage - MINCOV : 5
Discard the bases that have more than COV.PERCth percentile of coverage - COV.PERC : 99.9
Type of unification (all or at least one per group) - UNITE : all
DNA methylation percentage is computed with the formula = 100 * C /C + T.
The ENCODE data standards recommend a 30X coverage for accurate DNA methylation inference and statistical power in differential analysis.
With this analysis, we will explore how the coverage depth threshold penalizes coverage breadth.
Methylator first computes the theoretical number of CpGs or Tiles present in your reference genome (FASTA file).
fasta <- Biostrings::readDNAStringSet(genome_path)
dnf <- Biostrings::dinucleotideFrequency(fasta,simplify.as="collapsed")
nb_all_cpg <- dnf["CG"]*2 # multiply by 2 because we work with the CGs on both strands
options("scipen"=100, "digits"=4)
rm(fasta)
Theoretical number of CpG for wgbs : 110386
### eval=(LEVEL=="Tiles") -> this cell is not run if CpG is choosen
fasta <-Biostrings::readDNAStringSet(genome_path)
nb_all_nucleo <- Biostrings::vcountPattern("A", fasta,max.mismatch=0, min.mismatch=0) +
Biostrings::vcountPattern("T", fasta,max.mismatch=0, min.mismatch=0) +
Biostrings::vcountPattern("G", fasta,max.mismatch=0, min.mismatch=0) +
Biostrings::vcountPattern("C", fasta,max.mismatch=0, min.mismatch=0)
nb_all_nucleo <- sum(as.vector(nb_all_nucleo))
nb_all_tiles <- nb_all_nucleo / (TS * RTS)
options("scipen"=100, "digits"=4)
rm(fasta)
Tiles are not used in this analysis.
Methylator explores how the minimum coverage filter from 1 to 20 reduces the total number of analyzed CpGs.
Use these figures to determine the best threshold and detect samples with inadequate global coverage compared to others.
The MINCOV value can be modified in the ‘config_wgbs.yaml’ or ‘config_nanopore.yaml’ files.
## Filtering minimum COVERAGE THRESHOLD (1 to 30) exploration for each sample
cov <- c(1:15,20,25,30) # filterByCoverage(myobj,lo.count= value), keep cov in lower case.
CpG_mapped <- NULL # Number of mapped CpG
sample_id <- NULL # Sample ID
group <- NULL
for(i in 1:length(myobj)){
print(myobj[[i]]@sample.id)
CpG_sample <- NULL
for(j in cov){
fobj <- filterByCoverage(myobj[[i]],
lo.count=j, # variable from 1 to 20
lo.perc=NULL,
hi.count=NULL,
hi.perc=COV.PERC)
CpG_sample <- c(CpG_sample, dim(fobj)[1]) # nb of mapped CpG remain after filtering by coverage filter j
sample_id <- c(sample_id, fobj@sample.id)
group <- c(group, strsplit(fobj@sample.id, "_")[[1]][1])
}
print(CpG_sample)
CpG_mapped <- c(CpG_mapped, CpG_sample)
}
## [1] "WT_1"
## [1] 68466 63367 56345 47498 37429 27585 19115 12456 7626 4344 2368 1179
## [13] 554 170 0 0 0 0
## [1] "1KO_1"
## [1] 76998 70983 62269 50845 38271 26836 17289 10343 5767 2977 1404 560
## [13] 190 0 0 0 0 0
## [1] "DKO_1"
## [1] 83411 73949 61169 46390 32249 20584 12171 6758 3396 1573 666 191
## [13] 0 0 0 0 0 0
## [1] "WT_2"
## [1] 76624 72514 66421 58006 47842 37361 27538 19073 12482 7812 4698 2728
## [13] 1490 787 382 0 0 0
## [1] "1KO_2"
## [1] 81613 77244 70551 61164 49836 38082 27505 18683 11924 7295 4261 2339
## [13] 1247 639 301 0 0 0
## [1] "DKO_2"
## [1] 67468 64388 59766 53001 44619 35498 26579 18843 12502 8002 4826 2772
## [13] 1548 819 418 0 0 0
rm(fobj)
As a total number of CpG per sample
# Draw Barplot (ggplot2)
################################################################################
if(DATATYPE == "WGBS"){
dfct <- data.frame(Coverage_Filter = as.factor(cov),
CpG_number = CpG_mapped,
CpG_percentage = CpG_mapped/nb_all_cpg,
Sample_ID = sample_id,
group = group )
bpctt <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_number, fill = Sample_ID)) +
geom_bar(stat="identity", position=position_dodge(), color="black") +
ggtitle("Min Cov Value : Number of retained CpG (by sample)") +
xlab("Minimun number of reads") + ylab("Number of CpG") +
coord_cartesian(ylim = c(0, nb_all_cpg*1.1 )) +
geom_hline(aes(yintercept = nb_all_cpg, linetype="Number of Theoretical CpG"), colour = "red", size = 1.2) +
scale_linetype_manual(name ="", values = 'dotted') +
theme_classic(base_size = 14) +
theme(plot.title = element_text(hjust = 0.5))
bpctt
}else{
dfct <- data.frame(Coverage_Filter = as.factor(cov),
CpG_number = CpG_mapped,
Sample_ID = sample_id,
group = group )
bpctt <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_number, fill = Sample_ID)) +
geom_bar(stat="identity", position=position_dodge(), color="black") +
ggtitle("Min Cov Value : Number of retained CpG (by sample)") +
xlab("Minimun number of reads") + ylab("Number of CpG") +
coord_cartesian(ylim = c(0, nb_all_cpg*1.1 )) +
scale_linetype_manual(name ="", values = 'dotted') +
theme_classic(base_size = 14) +
theme(plot.title = element_text(hjust = 0.5))
bpctt
}
As a proportion of each sample in your study
if(DATATYPE == "WGBS"){
bpcttg <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_percentage, fill = Sample_ID)) +
geom_bar(stat="identity", position="fill", color="black") +
ggtitle("Min Cov Value : % of retained CpG (by sample)") +
xlab("Minimun number of reads") + ylab("Sample proportion") +
ylim(0,1) +
theme_classic(base_size = 14) +
theme(plot.title = element_text(hjust = 0.5))
bpcttg
bppc <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_percentage, fill = Sample_ID)) +
geom_bar(stat="identity", position=position_dodge(), color="black") +
ggtitle("Min Cov Value : Percentage of retained CpG (by sample)") +
xlab("Minimun number of reads") + ylab("% of CpG") +
coord_cartesian(ylim = c(0, 1)) +
scale_linetype_manual(name ="", values = 'dotted') +
theme_classic(base_size = 14) +
theme(plot.title = element_text(hjust = 0.5))
bppc
}else{
bpcttg <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_number, fill = Sample_ID)) +
geom_bar(stat="identity", position="fill", color="black") +
ggtitle("Min Cov Value : % of retained CpG (by sample)") +
xlab("Minimun number of reads") + ylab("Sample proportion") +
theme_classic(base_size = 14) +
theme(plot.title = element_text(hjust = 0.5))
bpcttg
}
As a total number of CpG per sample
# bpcttc <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_number, fill = group)) +
# geom_bar(stat="identity", position=position_dodge(), color="black") +
# ggtitle("Min Cov Value : Number of retained CpG (by condition)") +
# xlab("Minimun number of reads") + ylab("Number of CpG") +
# coord_cartesian(ylim = c(0, nb_all_cpg )) +
# geom_hline(aes(yintercept = nb_all_cpg, linetype="Number of Theoretical CpG"), colour = "red", size = 1.2) +
# scale_linetype_manual(name ="", values = 'dotted') +
# theme_classic(base_size = 14) +
# theme(plot.title = element_text(hjust = 0.5))
# bpcttc
As a proportion of each sample in your study
# bpcttgc <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_percentage, fill = group)) +
# geom_bar(stat="identity", position="fill", color="black") +
# ggtitle("Proportion of retained CpG (by condition)") +
# xlab("Minimun number of reads") + ylab("Sample proportion") +
# ylim(0,1) +
# theme_classic(base_size = 14) +
# theme(plot.title = element_text(hjust = 0.5))
#
# bpcttgc
Methylator applies the ‘MINCOV’ filter of 5 on each sample, removing CpGs below this minimum coverage threshold.
It also discards the sites that have more than 99.9th percentile of coverage. Those values are editable in the configuration file.
## [[1]]
## methylRaw object with 37429 rows
## --------------
## chr start end strand coverage numCs numTs
## 14 chr19 3065878 3065878 - 6 0 6
## 34 chr19 3078956 3078956 - 7 5 2
## 35 chr19 3079413 3079413 + 7 6 1
## 36 chr19 3079414 3079414 - 6 5 1
## 40 chr19 3079526 3079526 - 8 7 1
## 41 chr19 3079664 3079664 - 6 3 3
## --------------
## sample.id: WT_1
## assembly: mm39
## context: CpG
## resolution: base
##
##
## [[2]]
## methylRaw object with 38271 rows
## --------------
## chr start end strand coverage numCs numTs
## 38 chr19 3078799 3078799 + 5 3 2
## 44 chr19 3078955 3078955 + 6 1 5
## 53 chr19 3079664 3079664 - 5 0 5
## 65 chr19 3080405 3080405 + 5 0 5
## 67 chr19 3080609 3080609 + 11 1 10
## 71 chr19 3081126 3081126 + 7 1 6
## --------------
## sample.id: 1KO_1
## assembly: mm39
## context: CpG
## resolution: base
##
##
## [[3]]
## methylRaw object with 32249 rows
## --------------
## chr start end strand coverage numCs numTs
## 40 chr19 3078603 3078603 + 9 1 8
## 41 chr19 3078604 3078604 - 5 1 4
## 42 chr19 3078649 3078649 + 7 1 6
## 47 chr19 3078832 3078832 - 6 0 6
## 50 chr19 3078956 3078956 - 7 2 5
## 53 chr19 3079445 3079445 - 6 2 4
## --------------
## sample.id: DKO_1
## assembly: mm39
## context: CpG
## resolution: base
##
##
## [[4]]
## methylRaw object with 47842 rows
## --------------
## chr start end strand coverage numCs numTs
## 39 chr19 3078603 3078603 + 7 2 5
## 40 chr19 3078604 3078604 - 5 5 0
## 41 chr19 3078649 3078649 + 5 3 2
## 42 chr19 3078650 3078650 - 9 9 0
## 50 chr19 3078956 3078956 - 6 5 1
## 51 chr19 3079413 3079413 + 7 6 1
## --------------
## sample.id: WT_2
## assembly: mm39
## context: CpG
## resolution: base
##
##
## [[5]]
## methylRaw object with 49836 rows
## --------------
## chr start end strand coverage numCs numTs
## 45 chr19 3076963 3076963 + 5 0 5
## 56 chr19 3078832 3078832 - 5 2 3
## 59 chr19 3078955 3078955 + 6 1 5
## 70 chr19 3079896 3079896 - 5 0 5
## 72 chr19 3079903 3079903 - 5 1 4
## 73 chr19 3080073 3080073 - 7 1 6
## --------------
## sample.id: 1KO_2
## assembly: mm39
## context: CpG
## resolution: base
##
##
## [[6]]
## methylRaw object with 44619 rows
## --------------
## chr start end strand coverage numCs numTs
## 44 chr19 3078832 3078832 - 5 0 5
## 46 chr19 3078875 3078875 - 5 0 5
## 47 chr19 3078955 3078955 + 5 0 5
## 66 chr19 3080168 3080168 + 8 3 5
## 67 chr19 3080169 3080169 - 5 0 5
## 68 chr19 3080405 3080405 + 6 1 5
## --------------
## sample.id: DKO_2
## assembly: mm39
## context: CpG
## resolution: base
Calculate the number of CpG on raw and filtered data, determining the percentage of lost sites after filtering.
nb_row <- c()
nb_filtered <- c()
sample_id <- c()
group <- c()
for(i in 1:length(dm)){
print(dm[[i]]@sample.id)
sample_id <- c(sample_id, dm[[i]]@sample.id)
group <- c(group, strsplit(dm[[i]]@sample.id, "_")[[1]][1] )
print(paste0("Number of ", LEVEL, " on raw data: ", dm[[i]] %>% dim %>% .[1]))
nb_row <- c(nb_row, dm[[i]] %>% dim %>% .[1])
print(paste0("Number of ", LEVEL, " on filtered data: ", f.dm[[i]] %>% dim %>% .[1]))
nb_filtered <- c(nb_filtered, f.dm[[i]] %>% dim %>% .[1])
print(paste0("Percentage of lost rows after filtering dm: ",
signif(100*(dm[[i]] %>% dim %>% .[1] - f.dm[[i]] %>% dim %>% .[1] ) /
dm[[i]] %>% dim %>% .[1], digits = 4), " %" ))
cat("\n")
}
## [1] "WT_1"
## [1] "Number of CpG on raw data: 68609"
## [1] "Number of CpG on filtered data: 37429"
## [1] "Percentage of lost rows after filtering dm: 45.45 %"
##
## [1] "1KO_1"
## [1] "Number of CpG on raw data: 77141"
## [1] "Number of CpG on filtered data: 38271"
## [1] "Percentage of lost rows after filtering dm: 50.39 %"
##
## [1] "DKO_1"
## [1] "Number of CpG on raw data: 83572"
## [1] "Number of CpG on filtered data: 32249"
## [1] "Percentage of lost rows after filtering dm: 61.41 %"
##
## [1] "WT_2"
## [1] "Number of CpG on raw data: 76702"
## [1] "Number of CpG on filtered data: 47842"
## [1] "Percentage of lost rows after filtering dm: 37.63 %"
##
## [1] "1KO_2"
## [1] "Number of CpG on raw data: 81732"
## [1] "Number of CpG on filtered data: 49836"
## [1] "Percentage of lost rows after filtering dm: 39.03 %"
##
## [1] "DKO_2"
## [1] "Number of CpG on raw data: 67543"
## [1] "Number of CpG on filtered data: 44619"
## [1] "Percentage of lost rows after filtering dm: 33.94 %"
df_stat <- data.frame(ID = sample_id, nb_row = nb_row, nb_filtered = nb_filtered, group=group)
if(LEVEL == "Tiles"){
max.value <- nb_all_tiles
}else{
max.value <- nb_all_cpg
}
nb_row_sci <- NULL
nb_filtered_sci <- NULL
for(i in nb_row){nb_row_sci <- c(nb_row_sci, (formatC(i, format='e', digits=2)))}
for(i in nb_filtered){nb_filtered_sci <- c(nb_filtered_sci, (formatC(i, format='e', digits=2)))}
bf <- ggplot(df_stat, aes(x=as.factor(ID), y=nb_row, fill=group)) +
geom_bar(stat="identity", color="black") +
geom_text(aes(label=nb_row_sci), vjust=1.6, color="white", size= 3)+
ggtitle(paste0("Number of ", LEVEL, " in raw data")) +
xlab("Samples") + ylab(paste0("Number of ", LEVEL)) +
ylim(c(0, max.value*1.1)) +
geom_hline(aes(yintercept = max.value, linetype=paste0("Theoretical number of ",LEVEL)), colour = "red", size = 1.2) +
scale_linetype_manual(name ="", values = 'dotted') +
theme_bw(base_size = 14)
af <- ggplot(df_stat, aes(x=as.factor(ID), y=nb_filtered, fill=group)) +
geom_bar(stat="identity", color="black") +
geom_text(aes(label=nb_filtered_sci), vjust=1.6, color="white", size= 3)+
ggtitle(paste0("Number of ", LEVEL, " in filtered data")) +
xlab("Samples") + ylab(paste0("Number of ", LEVEL)) +
ylim(c(0, max.value*1.1)) +
geom_hline(aes(yintercept = max.value, linetype=paste0("Theoretical number of ",LEVEL)), colour = "red", size = 1.2) +
scale_linetype_manual(name ="", values = 'dotted') +
theme_bw(base_size = 14)
par(mfrow = c(1,2))
bf
af
rm(nb_row, nb_row_sci)
rm(nb_filtered, nb_filtered_sci)
rm(sample_id)
rm(group)
For a differential analysis, each CpG or tile must be present in each sample/group. Two unifications are available :
- “all”, where only the CpGs covered in all samples are retained (stringent but recommended),
- “at least one per group”, where a CpG is retained if covered at leat once per group (generates NA, not recommended).
Filtered and united (“all”) data (with all samples)
## chr start end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2
## 1 chr19 3078955 3078955 + 7 5 2 6 1 5
## 2 chr19 3080405 3080405 + 9 4 5 5 0 5
## 3 chr19 3080609 3080609 + 13 8 5 11 1 10
## 4 chr19 3083013 3083013 + 10 9 1 5 1 4
## 5 chr19 3083062 3083062 + 7 3 4 10 2 8
## 6 chr19 3089018 3089018 + 8 6 2 13 2 11
## 7 chr19 3128966 3128966 + 12 9 3 21 3 18
## 8 chr19 3130088 3130088 + 5 5 0 6 1 5
## 9 chr19 3130109 3130109 + 5 4 1 9 3 6
## 10 chr19 3130130 3130130 + 6 6 0 5 1 4
## 11 chr19 3130134 3130134 + 12 12 0 5 1 4
## 12 chr19 3130191 3130191 + 12 10 2 7 2 5
## 13 chr19 3130195 3130195 + 12 12 0 14 1 13
## 14 chr19 3130209 3130209 + 12 10 2 12 0 12
## 15 chr19 3130335 3130335 + 14 13 1 8 2 6
## 16 chr19 3130430 3130430 + 15 14 1 5 0 5
## 17 chr19 3130791 3130791 + 9 1 8 7 1 6
## 18 chr19 3131084 3131084 + 9 6 3 10 3 7
## 19 chr19 3140156 3140156 + 6 3 3 12 1 11
## 20 chr19 3140191 3140191 + 15 10 5 11 1 10
## coverage3 numCs3 numTs3 coverage4 numCs4 numTs4 coverage5 numCs5 numTs5
## 1 7 2 5 6 5 1 6 1 5
## 2 5 1 4 6 5 1 6 0 6
## 3 8 0 8 6 5 1 21 1 20
## 4 6 2 4 5 3 2 13 3 10
## 5 8 2 6 5 3 2 12 0 12
## 6 11 1 10 8 3 5 10 0 10
## 7 10 3 7 17 10 7 14 2 12
## 8 6 2 4 15 13 2 23 3 20
## 9 12 3 9 5 1 4 16 2 14
## 10 12 2 10 11 5 6 28 6 22
## 11 13 3 10 5 3 2 28 9 19
## 12 12 3 9 13 10 3 18 2 16
## 13 12 0 12 13 8 5 16 2 14
## 14 5 2 3 9 7 2 15 0 15
## 15 7 2 5 9 8 1 6 1 5
## 16 14 5 9 13 12 1 9 1 8
## 17 15 0 15 12 2 10 12 1 11
## 18 8 3 5 8 6 2 11 3 8
## 19 18 2 16 10 4 6 31 3 28
## 20 10 2 8 28 15 13 20 5 15
## coverage6 numCs6 numTs6
## 1 5 0 5
## 2 16 3 13
## 3 13 4 9
## 4 8 1 7
## 5 6 2 4
## 6 8 3 5
## 7 16 3 13
## 8 16 5 11
## 9 5 0 5
## 10 6 2 4
## 11 7 3 4
## 12 15 8 7
## 13 13 5 8
## 14 16 6 10
## 15 12 6 6
## 16 5 0 5
## 17 13 1 12
## 18 5 1 4
## 19 14 3 11
## 20 22 6 16
Filtered and united (“at least one per group”) data (with all samples)
## chr start end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2
## 1 chr19 3078955 3078955 + 7 5 2 6 1 5
## 2 chr19 3079663 3079663 + 6 3 3 5 0 5
## 3 chr19 3080072 3080072 + 6 6 0 NA NA NA
## 4 chr19 3080168 3080168 + 8 4 4 NA NA NA
## 5 chr19 3080405 3080405 + 9 4 5 5 0 5
## 6 chr19 3080609 3080609 + 13 8 5 11 1 10
## 7 chr19 3081126 3081126 + 8 6 2 7 1 6
## 8 chr19 3081305 3081305 + NA NA NA 5 0 5
## 9 chr19 3082071 3082071 + 10 6 4 NA NA NA
## 10 chr19 3082173 3082173 + 16 14 2 NA NA NA
## 11 chr19 3082452 3082452 + NA NA NA NA NA NA
## 12 chr19 3082673 3082673 + NA NA NA 5 1 4
## 13 chr19 3082683 3082683 + 5 4 1 5 0 5
## 14 chr19 3083004 3083004 + 5 5 0 NA NA NA
## 15 chr19 3083013 3083013 + 10 9 1 5 1 4
## 16 chr19 3083062 3083062 + 7 3 4 10 2 8
## 17 chr19 3083094 3083094 + 5 2 3 9 0 9
## 18 chr19 3083140 3083140 + 6 3 3 NA NA NA
## 19 chr19 3083855 3083855 + NA NA NA NA NA NA
## 20 chr19 3084251 3084251 + 5 5 0 5 2 3
## coverage3 numCs3 numTs3 coverage4 numCs4 numTs4 coverage5 numCs5 numTs5
## 1 7 2 5 6 5 1 6 1 5
## 2 11 1 10 11 10 1 NA NA NA
## 3 5 2 3 5 2 3 7 1 6
## 4 5 0 5 6 2 4 9 1 8
## 5 5 1 4 6 5 1 6 0 6
## 6 8 0 8 6 5 1 21 1 20
## 7 NA NA NA NA NA NA 7 3 4
## 8 8 1 7 5 4 1 NA NA NA
## 9 8 1 7 12 7 5 10 0 10
## 10 8 1 7 13 5 8 13 0 13
## 11 10 2 8 5 4 1 5 0 5
## 12 7 0 7 9 7 2 5 1 4
## 13 5 0 5 10 8 2 NA NA NA
## 14 8 1 7 5 5 0 19 8 11
## 15 6 2 4 5 3 2 13 3 10
## 16 8 2 6 5 3 2 12 0 12
## 17 6 1 5 NA NA NA 9 1 8
## 18 NA NA NA 11 9 2 8 0 8
## 19 6 2 4 9 8 1 6 1 5
## 20 NA NA NA 5 5 0 5 2 3
## coverage6 numCs6 numTs6
## 1 5 0 5
## 2 NA NA NA
## 3 NA NA NA
## 4 13 3 10
## 5 16 3 13
## 6 13 4 9
## 7 6 3 3
## 8 NA NA NA
## 9 8 0 8
## 10 6 0 6
## 11 NA NA NA
## 12 NA NA NA
## 13 NA NA NA
## 14 7 1 6
## 15 8 1 7
## 16 6 2 4
## 17 8 4 4
## 18 9 2 7
## 19 5 5 0
## 20 5 3 2
Compute the minimum of replicat by condition/group
replicat_group <- data.frame(table(meta$group))
min_nb_replicat <- min(replicat_group$Freq)
min.per.group.max <- min_nb_replicat-1
Le nombre minimal de réplicat par group est : 2, vous pouvez donc unifier avec un min.per.group maximal de : 1
min.per.group <- yaml.file$MIN_PER_GROUP
if(min.per.group > min.per.group.max){
print("Error, vous avez choisit une unification non compatible avec vos nombres de réplicats par group,
par defaut l'unification se fera avec le paramètre all ")
min.per.group <- "all"
}
## [1] "Error, vous avez choisit une unification non compatible avec vos nombres de réplicats par group, \n par defaut l'unification se fera avec le paramètre all "
list_comparisons <- yaml.file$COMPARISON
for(comparison in list_comparisons){
control.cond <- comparison[1]
treat.cond <- comparison[2]
nam_fu <- paste0("fu.meth_", control.cond, "_", treat.cond)
nam_fu1 <- paste0("fu1.meth_", control.cond, "_", treat.cond)
PREP_RDATA_CONDI <- paste0(bigoutput_path, "PrepC_", control.cond, "_", treat.cond, "_", mark, ".RData")
load(PREP_RDATA_CONDI)
assign(nam_fu, fu.meth_condi)
assign(nam_fu1, fu1.meth_condi)
rm(myobj_condi)
rm(dm_condi)
rm(f.dm_condi)
rm(fu.meth_condi)
rm(fu1.meth_condi)
}
df_uni <- NULL
for(comparison in list_comparisons){
control.cond <- comparison[1]
treat.cond <- comparison[2]
nam_fu <- paste0("fu.meth_", control.cond, "_", treat.cond)
nam_fu1 <- paste0("fu1.meth_", control.cond, "_", treat.cond)
fu.meth_condi <- get(nam_fu)
fu1.meth_condi <- get(nam_fu1)
if(is.null(df_uni)){
df_uni <- data.frame(fu.meth_condi %>% dim %>% .[1], fu1.meth_condi %>% dim %>% .[1], paste0(control.cond, "_", treat.cond))
colnames(df_uni) <- c("all", "at1", "comparisons")
}else{
nb_row <- data.frame(fu.meth_condi %>% dim %>% .[1], fu1.meth_condi %>% dim %>% .[1], paste0(control.cond, "_", treat.cond))
colnames(nb_row) <- c("all", "at1", "comparisons")
df_uni <- rbind(df_uni, nb_row)
}
}
df_long <- tidyr::pivot_longer(df_uni, 1:2, names_to = "unification", values_to = "nb")
df_long$max.value <- rep(max.value, (length(list_comparisons)*2))
df_long$perct <- round(((df_long$nb / df_long$max.value)*100),2)
unf <- ggplot(df_long, aes(x= as.factor(comparisons), y = nb, fill= unification)) +
geom_bar(stat="identity", position=position_dodge(), color="black") +
geom_text(aes( x = as.factor(comparisons), y = nb, label=paste0(perct, "%"), fill = unification),
position = position_dodge(width = 1), vjust = +3, color="white", size = 4) +
ggtitle(paste0("Retained ", LEVEL, " in filtered and unified data, for each comparison")) +
xlab("Comparisons") + ylab(paste0("Nb of ", LEVEL, " retained ")) +
theme_classic(base_size = 14) +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 45, vjust = 0.6, hjust=0.5))
unf
On raw, filtered and united data
if(UNITE == "all" & notempty.fu){
for(i in seq(from=5, to=length(fu.meth), by=3)){
df_fu.meth = na.omit(getData(fu.meth))
print("methylation statistics per base (unified on all samples)")
print("summary")
print(summary(df_fu.meth[,i+1]/df_fu.meth[,i]))
print("percentiles")
print(quantile(df_fu.meth[,i+1]/df_fu.meth[,i], na.rm = TRUE))
rm(df_fu.meth)
}
} else if (UNITE == "one" & notempty.fu1) {
for(i in seq(from=5, to=length(fu1.meth), by=3)){
df_fu1.meth = na.omit(getData(fu1.meth))
print("methylation statistics per base (unified on at least one sample)")
print("summary")
print(summary(df_fu1.meth[,i+1]/df_fu1.meth[,i]))
print("percentiles")
print(quantile(df_fu1.meth[,i+1]/df_fu1.meth[,i], na.rm = TRUE))
cat("\n")
rm(df_fu1.meth)
}
}
## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.750 0.913 0.781 1.000 1.000
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 0.000 0.750 0.913 1.000 1.000
## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.154 0.186 0.300 1.000
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 0.0000 0.0000 0.1538 0.3000 1.0000
## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.104 0.167 1.000
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 0.0000 0.0000 0.0000 0.1667 1.0000
## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.714 0.900 0.771 1.000 1.000
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 0.0000 0.7143 0.9000 1.0000 1.0000
## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.167 0.207 0.333 1.000
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 0.0000 0.0000 0.1667 0.3333 1.0000
## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0556 0.1407 0.2222 1.0000
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 0.00000 0.00000 0.05556 0.22222 1.00000
On raw, filtered and united data
for(i in 1:length(f.dm)){
print(paste0("sample ID : ",f.dm[[i]]@sample.id))
print("dm, raw data")
getCoverageStats(dm[[i]],
plot=FALSE,both.strands=FALSE)
print("f.dm, filtered data")
getCoverageStats(f.dm[[i]],
plot=FALSE,both.strands=FALSE)
}
## [1] "sample ID : WT_1"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 5.00 5.11 7.00 32.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 1 2 3 3 4 5 6 6 7 9 10 13 13
## 99.9% 100%
## 15 32
##
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 5.00 7.00 7.01 8.00 14.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 5 5 5 6 6 7 7 8 9 10 11 13 13
## 99.9% 100%
## 14 14
##
## [1] "sample ID : 1KO_1"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 4.00 4.76 6.00 54.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 1 2 3 3 4 4 5 6 7 8 9 11 12
## 99.9% 100%
## 14 54
##
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 5.00 6.00 6.71 8.00 13.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 5 5 5 6 6 6 7 7 8 9 10 12 12
## 99.9% 100%
## 13 13
##
## [1] "sample ID : DKO_1"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 4.00 4.13 5.00 48.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 1 1 2 3 3 4 4 5 6 7 8 10 11
## 99.9% 100%
## 13 48
##
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 5.00 6.00 6.41 7.00 12.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 5 5 5 5 6 6 6 7 8 9 9 11 12
## 99.9% 100%
## 12 12
##
## [1] "sample ID : WT_2"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 4.0 5.0 5.7 7.0 87.0
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 1 2 3 4 5 5 6 7 8 10 11 14 15
## 99.9% 100%
## 18 87
##
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 6.00 7.00 7.39 9.00 17.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 5 5 5 6 6 7 7 8 9 10 12 14 15
## 99.9% 100%
## 17 17
##
## [1] "sample ID : 1KO_2"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 5.00 5.57 7.00 81.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 1 2 3 4 5 5 6 7 8 9 11 13 15
## 99.9% 100%
## 17 81
##
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 6.00 7.00 7.25 8.00 16.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 5 5 5 6 6 7 7 8 9 10 11 14 15
## 99.9% 100%
## 16 16
##
## [1] "sample ID : DKO_2"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 6.00 5.96 8.00 91.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 1 2 3 4 5 6 6 7 8 10 11 14 15
## 99.9% 100%
## 18 91
##
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 6.00 7.00 7.51 9.00 17.00
## percentiles:
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 95% 99% 99.5%
## 5 5 5 6 6 7 8 8 9 11 12 14 15
## 99.9% 100%
## 17 17
if(UNITE == "all" & notempty.fu){
for(i in seq(from=5, to=length(fu.meth), by=3)){
df_fu.meth = getData(fu.meth)
print("read coverage statistics per base (unified on all samples)")
print("summary")
print(summary(df_fu.meth[,i]))
print("percentiles")
print(quantile(df_fu.meth[,i]))
cat("\n")
rm(df_fu.meth)
}
} else if (UNITE == "one" & notempty.fu1) {
for(i in seq(from=5, to=length(fu1.meth), by=3)){
df_fu1.meth = na.omit(getData(fu1.meth))
cat("\n")
print("read coverage statistics per base (unified on at least one sample)")
print("summary")
print(summary(df_fu1.meth[,i]))
print("percentiles")
print(quantile(df_fu1.meth[,i]))
rm(df_fu1.meth)
}
}
## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5 7 11 11 14 27
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 5 7 11 14 27
##
## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 6.00 9.00 9.88 13.00 24.00
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 5 6 9 13 24
##
## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 6.00 7.00 8.68 11.00 23.00
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 5 6 7 11 23
##
## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 7.0 12.0 11.8 15.0 30.0
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 5 7 12 15 30
##
## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 7.0 11.0 11.3 15.0 31.0
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 5 7 11 15 31
##
## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 8.0 13.0 12.7 16.0 31.0
## [1] "percentiles"
## 0% 25% 50% 75% 100%
## 5 8 13 16 31
#### with ggplot
#### recode getMethylationStats and getCoverageStats with ggplot
dfpc <- NULL
group <- c()
print("Global methylation (raw data)")
## [1] "Global methylation (raw data)"
means <- NULL
for(i in 1:length(dm)){
group_sample <- strsplit(dm[[i]]@sample.id, "_")[[1]][1]
# for each condition convert a dm to a dataframe, add 2 columns (percentage methyl and sample id),
# save this dataframe as df and
dm[[i]] %>%
getData() %>% # native function of methylkit package, return a data.frame object
mutate(., Meth_perc = 100 * (numCs/coverage)) %>% # add a column for methylation percentage
mutate(., ID = dm[[i]]@sample.id) %>%
mutate(., group = group_sample) -> df # add a column for the sample id
dfpc <- rbind(dfpc, df) # bind rows , build a dataframe with long format (ajoute les dataframes les uns à las suite des autres pour céer un dataframe au format long )
print(paste0(dm[[i]]@sample.id, ": ", signif(sum(df$Meth_perc)/dim(df)[1], digits = 4), " %"))
means <- rbind(means, c(dm[[i]]@sample.id, signif(sum(df$Meth_perc)/dim(df)[1], digits = 4), group_sample))
rm(df) # delete the dataframe associate to the condition
}
## [1] "WT_1: 68.94 %"
## [1] "1KO_1: 17.08 %"
## [1] "DKO_1: 9.227 %"
## [1] "WT_2: 67.48 %"
## [1] "1KO_2: 18.59 %"
## [1] "DKO_2: 12.86 %"
means <- data.frame(means)
colnames(means) <- c("ID", "Meth_perc", "group")
means$Meth_perc <- as.numeric(means$Meth_perc)
bgm <- ggplot(means, aes(x=ID, y= Meth_perc, fill = group)) +
geom_bar(stat = "identity") + ggtitle(paste0("Average Methylation on raw data")) +
labs(x = "Samples", y = "Average Methylation in %") +
ylim(0,100) +
theme_bw(base_size = 14) +
geom_text(data = means, aes(label = Meth_perc, y = Meth_perc - 2), size = 5, angle = 90) +
theme( legend.title = element_text(),
axis.text.x = element_text(angle = 45, vjust = 0.6, hjust=0.5),
axis.title.x=element_text(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
bgm
#Plot of CpG Methylation on raw data (before filtering coverage and unite)
p <- ggplot(dfpc, aes(x=Meth_perc, fill=group, color=group)) +
geom_histogram(position="identity",
alpha=0.6,
breaks = seq(0, 100, 5)) +
ggtitle(paste0("% of ", LEVEL, " Methylation on raw data")) +
facet_grid(. ~ ID ) +
labs(x = "% of Methylation", y = "Counts") +
theme_bw(base_size = 14) +
theme(
legend.title = element_text(),
axis.title.x = element_text(),
axis.text.x = element_text(angle=45, hjust=1),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
p
rm(dm)
vp <- ggplot(dfpc, aes(x=ID, y=Meth_perc, fill = ID)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.05) +
labs( title = paste0("Violin plot of % ", LEVEL, " methylation on raw data") ,
x = "conditions", y = "Methylation %" ) +
theme_bw(base_size = 14) +
theme(axis.title.x=element_blank(),
axis.text.x = element_text(angle = 45),
plot.title = element_text(hjust = 0.5),
legend.position = "none")
vp
#Plot of CpG or Tiles Coverage on raw data (before filtering coverage and unite)
c <- ggplot(dfpc, aes(x=log2(coverage), fill=group, color=group)) +
geom_histogram(position="identity", alpha=0.4) +
ggtitle(paste0(LEVEL, " Coverage on raw data")) +
facet_grid(. ~ ID ) +
labs(x = paste0("log2(reads) per ", LEVEL) , y = "Counts") +
xlim(1,10) +
theme_bw(base_size = 14) +
theme(
legend.title = element_text(),
axis.title.x=element_text(),
axis.text.x = element_text(angle = 55, hjust = 1),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
c
rm(dfpc)
#### with ggplot
#### recode getMethylationStats and getCoverageStats with ggplot for filtered data
### Same ideas that for raw data but with f.dm instead of dm
f.dfpc <- NULL
group <- c()
print("Global methylation (filtered & united data)")
## [1] "Global methylation (filtered & united data)"
for(i in 1:length(f.dm)){
group <- c(group, rep(strsplit(f.dm[[i]]@sample.id, "_")[[1]][1] , dim(f.dm[[i]])[1] ))
f.dm[[i]] %>%
getData() %>%
mutate(., Meth_perc = 100 * (numCs/coverage)) %>%
mutate(., ID = f.dm[[i]]@sample.id) -> f.df
#sprintf("%s : %.2f %%", f.dm[[i]]@sample.id, sum(f.df$numCs)*100/sum(f.df$coverage)) # global methylation
print(paste0(f.dm[[i]]@sample.id, ": ", signif(sum(f.df$numCs)*100/sum(f.df$coverage), digits = 4), " %"))
f.dfpc <- rbind(f.dfpc, f.df)
rm(f.df)
}
## [1] "WT_1: 75.38 %"
## [1] "1KO_1: 18.17 %"
## [1] "DKO_1: 10.13 %"
## [1] "WT_2: 66.99 %"
## [1] "1KO_2: 19.03 %"
## [1] "DKO_2: 13.11 %"
f.dfpc <- cbind(f.dfpc, group)
fp <- ggplot(f.dfpc, aes(x=Meth_perc, fill=group, color= group)) +
geom_histogram(position="identity",
alpha=0.6,
breaks = seq(0, 100, 5)) +
ggtitle(paste0("% of ", LEVEL, " Methylation on filtered data")) +
facet_grid(. ~ ID) +
labs(x = "% of Methylation", y = "Counts") +
theme_bw(base_size = 14) +
theme(
legend.title = element_text(),
axis.title.x=element_text(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
fp
if(UNITE == "all" & notempty.fu){
##################################################################
#### recode getMethylationStats and getCoverageStats with ggplot for unified (All) data
df_fu.meth <- getData(fu.meth)
u.dfpc <- NULL
ID <- NULL
group <- NULL
for(i in 1:length(fu.meth@sample.ids)){
ID <- c(ID, rep(fu.meth@sample.ids[i], length(df_fu.meth[,1])))
group <- c(group, rep(strsplit(fu.meth@sample.ids[i], "_")[[1]][1], length(df_fu.meth[,1])))
}
for(i in seq(from=5, to=length(fu.meth), by=3)){
chr <- df_fu.meth[,1]
start <- df_fu.meth[,2]
end <- df_fu.meth[,3]
strand <- df_fu.meth[,4]
coverage <- df_fu.meth[,i]
numCs <- df_fu.meth[,i+1]
numTs <- df_fu.meth[,i+2]
Meth_perc <- (df_fu.meth[,i+1]/df_fu.meth[,i])*100
df <- data.frame(chr, start, end , strand, coverage, numCs, numTs, Meth_perc)
u.dfpc <- rbind(u.dfpc, df)
}
u.dfpc <- cbind(u.dfpc, ID)
u.dfpc <- cbind(u.dfpc, group)
ufp <- ggplot(u.dfpc, aes(x=Meth_perc, fill=group, color= group)) +
geom_histogram(position="identity",
alpha=0.6,
breaks = seq(0, 100, 5)) +
ggtitle(paste0("% of " ,LEVEL, " Methylation on filtered and united (all) data")) +
facet_grid(. ~ ID) +
labs(x = "% of Methylation", y = "Counts") +
theme_bw(base_size = 14) +
theme(
legend.title = element_text(),
axis.title.x=element_text(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
ufp
} else if (UNITE == "one" & notempty.fu1) {
##################################################################
#### recode getMethylationStats and getCoverageStats with ggplot for unified (at least one per group) data
df_fu1.meth <- getData(fu1.meth)
df_fu1.meth[is.na(df_fu1.meth)] <- 0
u1.dfpc <- NULL
ID <- NULL
group <- NULL
for(i in 1:length(fu1.meth@sample.ids)){
ID <- c(ID, rep(fu1.meth@sample.ids[i], length(df_fu1.meth[,1])))
group <- c(group, rep(strsplit(fu1.meth@sample.ids[i], "_")[[1]][1], length(df_fu1.meth[,1])))
}
for(i in seq(from=5, to=length(fu1.meth), by=3)){
chr <- df_fu1.meth[,1]
start <- df_fu1.meth[,2]
end <- df_fu1.meth[,3]
strand <- df_fu1.meth[,4]
coverage <- df_fu1.meth[,i]
numCs <- df_fu1.meth[,i+1]
numTs <- df_fu1.meth[,i+2]
Meth_perc <- (df_fu1.meth[,i+1]/df_fu1.meth[,i])*100
df1 <- data.frame(chr, start, end , strand, coverage, numCs, numTs, Meth_perc)
u1.dfpc <- rbind(u1.dfpc, df1)
}
u1.dfpc <- cbind(u1.dfpc, ID)
u1.dfpc <- cbind(u1.dfpc, group)
u1.dfpc <- na.omit(u1.dfpc)
u1fp <- ggplot(u1.dfpc, aes(x=Meth_perc, fill=group, color= group)) +
geom_histogram(position="identity",
alpha=0.6,
breaks = seq(0, 100, 5)) +
ggtitle(paste0("% of " ,LEVEL, " Methylation on filtered and united (at least one per group) data")) +
facet_grid(. ~ ID) +
labs(x = "% of Methylation", y = "Counts") +
theme_bw(base_size = 14) +
theme(
legend.title = element_text(),
axis.title.x=element_text(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
u1fp
}
vfp <- ggplot(f.dfpc, aes(x=ID, y=Meth_perc, fill = ID)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.05) +
labs( title = paste0("% of ", LEVEL, " methylation on filtered data") ,
x = "conditions", y = "Methylation %" ) +
theme_bw(base_size = 14) +
theme(axis.title.x=element_blank(),
plot.title = element_text(hjust = 0.5),
legend.position = "none")
vfp
if(UNITE == "all" & notempty.fu){
uvfp <- ggplot(u.dfpc, aes(x=ID, y=Meth_perc, fill = ID)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.05) +
labs( title = paste0( "% of ", LEVEL, " methylation on filtered and united (all) data") ,
x = "conditions", y = "Methylation %" ) +
theme_bw(base_size = 14) +
theme(axis.title.x=element_blank(),
plot.title = element_text(hjust = 0.5),
legend.position = "none")
uvfp
}else if (UNITE == "one" & notempty.fu1){
u1vfp <- ggplot(u1.dfpc, aes(x=ID, y=Meth_perc, fill = ID)) +
geom_violin(trim=FALSE) +
geom_boxplot(width=0.05) +
labs( title = paste0("% of ", LEVEL , " methylation on filtered and united (at least one per group) data") ,
x = "conditions", y = "Methylation %" ) +
theme_bw(base_size = 14) +
theme(axis.title.x=element_blank(),
plot.title = element_text(hjust = 0.5),
legend.position = "none")
u1vfp
}
fc <- ggplot(f.dfpc, aes(x=log2(coverage), fill=group, color=group)) +
geom_histogram(position="identity", alpha=0.4) +
ggtitle(paste0(LEVEL, " Coverage on filtered data")) +
facet_grid(. ~ ID) +
labs(x = paste0("log2(reads) per ", LEVEL) , y = "Counts") +
theme_bw(base_size = 14) +
theme(
legend.title = element_text(),
axis.title.x=element_text(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
fc
if(UNITE == "all" & notempty.fu){
ufc <- ggplot(u.dfpc, aes(x=log2(coverage), fill=group, color=group)) +
geom_histogram(position="identity", alpha=0.4) +
ggtitle(paste0(LEVEL, " Coverage on filtered and united (all) data")) +
facet_grid(. ~ ID) +
labs(x = paste0("log2(reads) per ", LEVEL) , y = "Counts") +
theme_bw(base_size = 14) +
theme(
legend.title = element_text(),
axis.title.x=element_text(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
ufc
}else if (UNITE == "one" & notempty.fu1){
u1fc <- ggplot(u1.dfpc, aes(x=log2(coverage), fill=group, color=group)) +
geom_histogram(position="identity", alpha=0.4) +
ggtitle(paste0(LEVEL, " Coverage on filtered and united (at least one per group) data")) +
facet_grid(. ~ ID) +
labs(x = "Log2 of read coverage per base", y = "Counts") +
theme_bw(base_size = 14) +
theme(
legend.title = element_text(),
axis.title.x=element_text(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
u1fc
}
rm(dfpc)
if(UNITE == "all" & notempty.fu){
met_cov <- ggplot(u.dfpc, aes(x= Meth_perc, y=log2(coverage))) +
geom_point() +
labs(title = "DNA Methylation percentage by coverage") +
xlab("Methylation percentage") +
ylab("Coverage (Log2)") +
theme_bw(base_size = 14)
print(met_cov)
}else if (UNITE == "one" & notempty.fu1){
met_cov <- ggplot(u1.dfpc, aes(x= Meth_perc, y=log2(coverage))) +
geom_point() +
labs(title = "DNA Methylation percentage by coverage") +
xlab("Methylation percentage") +
ylab("Coverage (Log2)") +
theme_bw(base_size = 14)
print(met_cov)
}
Dendrogram is performed on united data with “euclidean” distance and “ward.D” clustering methods
## Clustering
### useless on merged samples
if(dim(meta)[1] > 2){
if(UNITE == "all" & notempty.fu){
mat <- getData(fu.meth)
mat <- mat[ rowSums(is.na(mat))==0, ]
meth.mat <- mat[, fu.meth@numCs.index]/(mat[,fu.meth@numCs.index] + mat[,fu.meth@numTs.index] )
names(meth.mat)=fu.meth@sample.ids
d <- dist(scale(t(meth.mat)), method = "euclidean")
hc <- hclust(d, method = "ward.D")
plot(hc, main = "Cluster Dendrogram (Unite all)")
} else if (UNITE == "one" & notempty.fu1) {
mat1 <- getData(fu1.meth)
mat1 <- mat1[ rowSums(is.na(mat1))==0, ]
meth.mat1 <- mat1[, fu1.meth@numCs.index]/(mat1[,fu1.meth@numCs.index] + mat1[,fu1.meth@numTs.index] )
names(meth.mat1)=fu1.meth@sample.ids
d1 <- dist(scale(t(meth.mat1)), method = "euclidean")
hc1 <- hclust(d1, method = "ward.D")
plot(hc1, main = "Cluster Dendrogram (at least one per group) ")
}
}
Each sample is colored by condition
By defaut the PCA is perform on data united (unite all)
but you can change for a unification at least one per group
if(dim(meta)[1] > 2){
if(UNITE == "all" & notempty.fu){
res.pca <- PCA(t(meth.mat), ncp = 5, graph = FALSE, scale.unit = TRUE)
screeplot <- fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50), main= "Scree plot PCA unite all ")
pca <- fviz_pca_ind(res.pca, title = "PCA on filtered and united data (unite all)", col.ind = meta$group)
print(pca)
print(screeplot)
rm(mat)
rm(meth.mat)
}else if (UNITE == "one" & notempty.fu1){
res.pca1 <- PCA(t(meth.mat1), ncp = 5, graph = FALSE, scale.unit = TRUE)
screeplot <- fviz_eig(res.pca1, addlabels = TRUE, ylim = c(0, 50), main= "Scree plot PCA (at least one per group) ")
pca <- fviz_pca_ind(res.pca1, title = "PCA on filtered and united data (at least one per group)", col.ind = meta$group, repel = TRUE)
print(pca)
print(screeplot)
rm(mat1)
rm(meth.mat1)
}
}
Perform a correlation plot (methylKit package) on data on filtered and unified CpG
## Correlations
if(UNITE == "all" & notempty.fu){
getCorrelation(fu.meth,plot=TRUE)
} else if (UNITE == "one" & notempty.fu1) {
getCorrelation(fu1.meth,plot=TRUE)
}
## WT_1 1KO_1 DKO_1 WT_2 1KO_2 DKO_2
## WT_1 1.0000 0.4309 0.2523 0.8991 0.4118 0.3126
## 1KO_1 0.4309 1.0000 0.2357 0.4380 0.4642 0.2624
## DKO_1 0.2523 0.2357 1.0000 0.2597 0.2632 0.5544
## WT_2 0.8991 0.4380 0.2597 1.0000 0.4244 0.3259
## 1KO_2 0.4118 0.4642 0.2632 0.4244 1.0000 0.2753
## DKO_2 0.3126 0.2624 0.5544 0.3259 0.2753 1.0000
Methylator performs a DNA methylation segmentation analysis by samples (global) and at genomic features on filtered and unified CpG
The methylation status is categorised into five levels:
- Full : 100% methylated
- High : between 75 and 100% excluded
- Mid : between 25 et 75% exluded
- Low : between 0 excluded and 25% excluded
- UnMeth : 100% unmethylated
Five gene features are represented
- Genes
- Intergenic
- Introns
- Exons
- Promoters
In this annotations you can see three types of regions around CGI :
- CpG Islands
- CpG Shores
- CpG Shelves
Images source : annotatr vignette
For each sample, Methylator gives the number of CpG with a given methylation status
load(annotations_path)
if(UNITE == "all" & notempty.fu){
df_fu <- getData(fu.meth)
df_final <- data.frame()
sample_cpt <- 1
for (i in seq(5, length(fu.meth), by=3)) {
df_f.meth_annot <- NULL
df <- cbind(df_fu[, 0:4], df_fu[, i:(i+2)])
meth_perct <- df[,6] / df[,5]
meth_status <- ifelse(meth_perct == 0 , "UnMeth", ifelse(meth_perct == 1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
sample <- rep(fu.meth@sample.ids[sample_cpt], dim(fu.meth)[1])
df <- cbind(df, meth_status, sample)
df_f.meth_annot <- rbind(df_f.meth_annot, df)
colnames(df_f.meth_annot) <- c("chr" ,"start" ,"end" , "strand" , "coverage" , "numCs" , "numTs" , "meth_status" , "sample")
if(sample_cpt == 1){
df_final <- df_f.meth_annot
}
else{
df_final <- rbind(df_final, df_f.meth_annot)
}
sample_cpt <- sample_cpt + 1
}
bps <- ggplot(df_final, aes(x= sample, fill = factor(meth_status, levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
geom_bar() +
scale_fill_manual("Methylation status ",
values = c( 'Full' = full.color,
'High' = high.color,
'Mid' = mid.color,
'Low' = low.color,
'UnMeth' = unmeth.color)) +
theme_bw(base_size = 14) +
labs(title = paste0('Methylation status by sample (absolut counts) '),
x = "Samples", y = paste0("Number of ", LEVEL)) +
theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_text(size = 12 ),
axis.title.y = element_text(size = 12 ))
print(bps)
} else if (UNITE == "one" & notempty.fu1) {
df_fu1 <- getData(fu1.meth)
df_final <- data.frame()
sample_cpt <- 1
for (i in seq(5, length(fu1.meth), by=3)) {
df_f.meth_annot <- NULL
df <- cbind(df_fu1[, 0:4], df_fu1[, i:(i+2)])
df <- na.omit(df)
meth_perct <- df[,6] / df[,5]
meth_status <- ifelse(meth_perct == 0 , "UnMeth", ifelse(meth_perct == 1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
sample <- rep(fu1.meth@sample.ids[sample_cpt], dim(df)[1])
df <- cbind(df, meth_status, sample)
df_f.meth_annot <- rbind(df_f.meth_annot, df)
colnames(df_f.meth_annot) <- c("chr" ,"start" ,"end" , "strand" , "coverage" , "numCs" , "numTs" , "meth_status" , "sample")
if(sample_cpt == 1){
df_final <- df_f.meth_annot
}
else{
df_final <- rbind(df_final, df_f.meth_annot)
}
sample_cpt <- sample_cpt + 1
}
bps <- ggplot(df_final, aes(x= sample, fill = factor(meth_status, levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
geom_bar(position = "fill") +
scale_fill_manual("Methylation status ",
values = c( 'Full' = full.color,
'High' = high.color,
'Mid' = mid.color,
'Low' = low.color,
'UnMeth' = unmeth.color)) +
theme_bw(base_size = 14) +
labs(title = paste0('Methylation status by Sample '),
x = "Samples", y = paste0("Number of ", LEVEL)) +
theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_text(size = 12 ),
axis.title.y = element_text(size = 12 ))
print(bps)
}
rm(df_fu)
rm(df)
rm(meth_perct)
rm(meth_status)
rm(df_final)
Now for each sample, methylator gives the number (1st plot) and the proportion(2nd plot) of CpG with a given methylation status
according to basic genomic annotations (Genes & CGI)
df__f.meth_annot_glb <- NULL
# Plot with fu.meth and fu1.meth!!!
if(UNITE == "all" & notempty.fu){
df_fu <- getData(fu.meth)
y <- 1
for (i in seq(5, length(fu.meth), by=3)) {
df_f.meth_annot <- NULL
df <- cbind(df_fu[, 0:4], df_fu[, i:(i+2)])
meth_perct <- df[,6] / df[,5]
meth_status <- ifelse(meth_perct == 0 , "UnMeth", ifelse(meth_perct == 1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
df <- cbind(df, meth_status)
df_f.meth_annot <- rbind(df_f.meth_annot, df)
dfpc_regions <- makeGRangesFromDataFrame(df_f.meth_annot ,
keep.extra.columns=TRUE,
seqinfo= annotations@seqinfo,
ignore.strand=F)
##########################################################################################
dfpc_annotated <- annotate_regions(regions = dfpc_regions,
annotations = annotations,
ignore.strand = T,
quiet = FALSE)
dfpc_annotated <- dfpc_annotated %>% data.frame
dfpc_annotated$annot.type <- factor(dfpc_annotated$annot.type , levels = IGV)
##########################################################################################
bpmaa <- ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status, levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
geom_bar() +
scale_fill_manual("Methylation status ",
values = c( 'Full' = full.color,
'High' = high.color,
'Mid' = mid.color,
'Low' = low.color,
'UnMeth' = unmeth.color)) +
theme_bw(base_size = 14) +
labs(title = paste0('Methylation status by annotation (absolut counts) '),
subtitle = paste0(f.dm[[y]]@sample.id),
x = "Annotations", y = paste0("Number of ", LEVEL)) +
theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_text(size = 12 ),
axis.title.y = element_text(size = 12 ))
print(bpmaa)
bpmare <- ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status, levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
geom_bar(position = "fill") +
scale_fill_manual("Methylation status ",
values = c( 'Full' = full.color,
'High' = high.color,
'Mid' = mid.color,
'Low' = low.color,
'UnMeth' = unmeth.color)) +
theme_bw(base_size = 14) +
labs(title = paste0('Methylation status by annotation (relative counts) '),
subtitle = paste0(f.dm[[y]]@sample.id),
x = "Annotations", y = paste0("Number of ", LEVEL)) +
theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_text(size = 12 ),
axis.title.y = element_text(size = 12 ))
print(bpmare)
y <- y+1
}
} else if (UNITE == "one" & notempty.fu1) {
df_fu1 <- getData(fu1.meth)
y <- 1
for (i in seq(5, length(fu1.meth), by=3)) {
df_f.meth_annot <- NULL
df <- cbind(df_fu1[, 0:4], df_fu1[, i:(i+2)])
df <- na.omit(df)
meth_perct <- df[,6] / df[,5]
meth_status <- ifelse(meth_perct == 0 , "UnMeth", ifelse(meth_perct == 1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
df <- cbind(df, meth_status)
df_f.meth_annot <- rbind(df_f.meth_annot, df)
dfpc_regions <- makeGRangesFromDataFrame(df_f.meth_annot ,
keep.extra.columns=TRUE,
seqinfo= annotations@seqinfo,
ignore.strand=F)
dfpc_annotated <- annotate_regions(regions = dfpc_regions,
annotations = annotations,
ignore.strand = T,
quiet = FALSE)
dfpc_annotated <- dfpc_annotated %>% data.frame
dfpc_annotated$annot.type <- factor(dfpc_annotated$annot.type , levels = IGV)
bpmaa <- ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status, levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
geom_bar() +
scale_fill_manual("Methylation status ",
values = c( 'Full' = full.color,
'High' = high.color,
'Mid' = mid.color,
'Low' = low.color,
'UnMeth' = unmeth.color)) +
theme_bw(base_size = 14) +
labs(title = paste0('Methylation status by annotation (absolut counts) '),
subtitle = paste0(f.dm[[y]]@sample.id),
x = "Annotations", y = paste0("Number of ", LEVEL)) +
theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_text(size = 12 ),
axis.title.y = element_text(size = 12 ))
print(bpmaa)
bpmare <- ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status, levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
geom_bar(position = "fill") +
scale_fill_manual("Methylation status ",
values = c( 'Full' = full.color,
'High' = high.color,
'Mid' = mid.color,
'Low' = low.color,
'UnMeth' = unmeth.color)) +
theme_bw(base_size = 14) +
labs(title = paste0('Methylation status by annotation (relative counts)'),
subtitle = paste0(f.dm[[y]]@sample.id),
x = "Annotations", y = paste0("Number of ", LEVEL)) +
theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_text(size = 12 ),
axis.title.y = element_text(size = 12 ))
print(bpmare)
y <- y+1
}
}
## Annotating...
## Annotating...
## Annotating...
## Annotating...
## Annotating...
## Annotating...
Now for each sample, methylator gives the number (1st plot) and the proportion(2nd plot) of CpG with a given methylation status
according custom genomic features & tracks
if((CUSTOM_ANNOT == TRUE) && (merge_annot == FALSE)){
library(GenomicRanges)
df_custom_tracks <- data.frame(list_custom_tracks[[1]])
for(i in 2:length(list_custom_tracks)){
df <- data.frame(list_custom_tracks[[i]])
df_custom_tracks <- rbind(df_custom_tracks, df)
}
GR_custom_tracks <- makeGRangesFromDataFrame(df_custom_tracks, keep.extra.columns = TRUE)
if(UNITE == "all" & notempty.fu){
df_fu <- getData(fu.meth)
j <- 1
for (i in seq(5, length(fu.meth), by=3)) {
for(y in 1:length(unique(meta_annot$group))){
df_f.meth_annot <- NULL
df <- cbind(df_fu[, 0:4], df_fu[, i:(i+2)])
meth_perct <- df[,6] / df[,5]
meth_status <- ifelse(meth_perct == 0 , "UnMeth", ifelse(meth_perct == 1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
df <- cbind(df, meth_status)
df_f.meth_annot <- rbind(df_f.meth_annot, df)
dfpc_regions <- makeGRangesFromDataFrame(df_f.meth_annot ,
keep.extra.columns=TRUE,
#seqinfo= annotations@seqinfo,
ignore.strand=F)
dfpc_annotated <- annotate_regions(regions = dfpc_regions,
annotations = GR_custom_tracks[GR_custom_tracks$group == unique(meta_annot$group)[y] ],
ignore.strand = TRUE,
quiet = TRUE)
dfpc_annotated <- dfpc_annotated %>% data.frame
dfpc_annotated$annot.type <- factor(dfpc_annotated$annot.type)
print("-------------------------------------------------------")
print(meta_annot$group[y])
print("-------------------------------------------------------")
bpmaa <- ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status, levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
geom_bar() +
scale_fill_manual("Methylation status ",
values = c( 'Full' = full.color,
'High' = high.color,
'Mid' = mid.color,
'Low' = low.color,
'UnMeth' = unmeth.color)) +
theme_bw(base_size = 14) +
labs(title = paste0('Methylation status by annotation (absolut counts) '),
subtitle = paste0(f.dm[[j]]@sample.id),
x = "Annotations", y = paste0("Number of ", LEVEL)) +
theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_text(size = 12 ),
axis.title.y = element_text(size = 12 ))
print(bpmaa)
bpmare <- ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status, levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
geom_bar(position = "fill") +
scale_fill_manual("Methylation status ",
values = c( 'Full' = full.color,
'High' = high.color,
'Mid' = mid.color,
'Low' = low.color,
'UnMeth' = unmeth.color)) +
theme_bw(base_size = 14) +
labs(title = paste0('Methylation status by annotation (relative counts) '),
subtitle = paste0(f.dm[[j]]@sample.id),
x = "Annotations", y = paste0("Number of ", LEVEL, " ")) +
theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_text(size = 12 ),
axis.title.y = element_text(size = 12 ))
print(bpmare)
}
j <- j+1
}
} else if (UNITE == "one" & notempty.fu1) {
df_fu1 <- getData(fu1.meth)
y <- 1
for (i in seq(5, length(fu1.meth), by=3)) {
df_f.meth_annot <- NULL
df <- cbind(df_fu1[, 0:4], df_fu1[, i:(i+2)])
df <- na.omit(df)
meth_perct <- df[,6] / df[,5]
meth_status <- ifelse(meth_perct == 0 , "UnMeth", ifelse(meth_perct == 1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
df <- cbind(df, meth_status)
df_f.meth_annot <- rbind(df_f.meth_annot, df)
dfpc_regions <- makeGRangesFromDataFrame(df_f.meth_annot ,
keep.extra.columns=TRUE,
seqinfo= annotations@seqinfo,
ignore.strand=F)
dfpc_annotated <- annotate_regions(regions = dfpc_regions,
annotations = annotations,
ignore.strand = T,
quiet = FALSE)
dfpc_annotated <- dfpc_annotated %>% data.frame
dfpc_annotated$annot.type <- factor(dfpc_annotated$annot.type , levels = IGV)
bpmaa <- ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status, levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
geom_bar() +
scale_fill_manual("Methylation status ",
values = c( 'Full' = full.color,
'High' = high.color,
'Mid' = mid.color,
'Low' = low.color,
'UnMeth' = unmeth.color)) +
theme_bw(base_size = 14) +
labs(title = paste0('Methylation status by annotation (absolut counts) '),
subtitle = paste0(f.dm[[y]]@sample.id),
x = "Annotations", y = paste0("Number of ", LEVEL)) +
theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_text(size = 12 ),
axis.title.y = element_text(size = 12 ))
print(bpmaa)
bpmare <- ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status, levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
geom_bar(position = "fill") +
scale_fill_manual("Methylation status ",
values = c( 'Full' = full.color,
'High' = high.color,
'Mid' = mid.color,
'Low' = low.color,
'UnMeth' = unmeth.color)) +
theme_bw(base_size = 14) +
labs(title = paste0('Methylation status by annotation (relative counts) '),
subtitle = paste0(f.dm[[y]]@sample.id),
x = "Annotations", y = paste0("Number of ", LEVEL, " ")) +
theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_text(size = 12 ),
axis.title.y = element_text(size = 12 ))
print(bpmare)
y <- y+1
}
}
}
#rm(fu.meth)
rm(fu1.meth)
rm(dfpc_annotated)
rm(dfpc_regions)
# list_sample_gr <- NULL
# list_sample_id <- NULL
#
# for(i in 1:length(f.dm)){
# sample_gr <- data.frame(f.dm[[i]])
# sample_gr <- makeGRangesFromDataFrame(sample_gr , keep.extra.columns = TRUE)
# meth_perct <- (sample_gr$numCs/ (sample_gr$numCs + sample_gr$numTs ))*100
# meth_perct[is.na(meth_perct)] <- 0
# sample_gr$meth_perct <- meth_perct
# sample_gr@strand@values <- rep("*", length(sample_gr@strand@values))
#
# list_sample_gr <- c(list_sample_gr, sample_gr)
# list_sample_id <- c(list_sample_id, f.dm[[i]]@sample.id)
# rm(sample_gr)
# rm(meth_perct)
# }
#
# names(list_sample_gr) <- list_sample_id
#
# genes_gr <- annotations[annotations$type == paste0(ORG,"_genes")]
# tss = IRanges::promoters(genes_gr, upstream = 0, downstream = 1)
# i <- 1
# list_heatmap <- NULL
#
# for(group in names(list_sample_gr)){
#
# name <- paste0("enrichedHeatmap_", group, sep = "")
# meth_sample <- data.frame(list_sample_gr[i])
# colnames(meth_sample) <- c("seqnames", "start", "end", "width", "strand", "coverage", "numCs", "numTs", "meth_perct")
# meth_sample <- makeGRangesFromDataFrame(meth_sample, keep.extra.columns = TRUE)
# mat = normalizeToMatrix(meth_sample, tss, value_column = "meth_perct", mean_mode = "absolute", extend = 5000, w = 50, background = NA, smooth = TRUE)
# enrich_heatmap <- EnrichedHeatmap(mat, name = paste0("methylation", group) , column_title = group, top_annotation = HeatmapAnnotation(
# enriched = anno_enriched(
# ylim = c(0, 100),
# axis_param = list(
# side = "left",
# facing = "inside"
# ))))
# assign(name, enrich_heatmap)
# list_heatmap <- list_heatmap + get(name)
# i <- i + 1
# }
#
#
# draw(list_heatmap, show_heatmap_legend = FALSE, row_title = "Enriched Heatmap, Methylation near TSS")
# df_gr_cust_all <- data.frame(list_custom_tracks[[1]])
#
# for(i in 2:length(list_custom_tracks)){
#
# df_gr_cust <- data.frame(list_custom_tracks[[i]])
# df_gr_cust_all <- rbind(df_gr_cust_all, df_gr_cust)
# }
#
# gr_cust_all <- makeGRangesFromDataFrame(df_gr_cust_all, keep.extra.columns = TRUE)
#
# for(i in unique(meta_annot$group)){
#
# custom <- gr_cust_all[gr_cust_all$group == i]
# mean_width <- round(mean(custom@ranges@width), digits = 0)
# custom <- resize(custom, width = mean_width , fix = "center")
# custom_name <- custom$group[1]
#
# sml <- ScoreMatrixList(targets=list_sample_gr, windows=custom, bin.num=50, strand.aware=TRUE, is.noCovNA = TRUE, weight.col = "numCs")
# sml.sub =intersectScoreMatrixList(sml, reorder = FALSE)
# multiHeatMatrix(sml.sub, xlab= custom_name , matrix.main=names(list_sample_gr))
# plotMeta(sml.sub )
# }
#
# rm(gr_cust_all)
# rm(df_gr_cust_all)
# rm(custom, custom_name)
# rm(sml, sml.sub)
Saves the whole R session of the script (variables, objects …)
at the end of its execution. The RData file produced can then be opened in a local environment (if you have enough space)
to make modifications. For example to retouch figures.
rm(annotations)
# save the R Session in RData folder
save.image(file = rdata_out)
# R Session
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 11 (bullseye)
##
## Matrix products: default
## BLAS/LAPACK: /opt/conda/envs/wgbsflow/lib/libopenblasp-r0.3.21.so
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8
## [5] LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C.UTF-8
## [9] LC_ADDRESS=C.UTF-8 LC_TELEPHONE=C.UTF-8
## [11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C.UTF-8
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GenomicFeatures_1.46.1 AnnotationDbi_1.56.2 Biobase_2.54.0
## [4] annotatr_1.20.0 stringr_1.5.0 FactoMineR_2.6
## [7] factoextra_1.0.7 yaml_2.3.5 xlsx_0.6.5
## [10] ggplot2_3.3.6 magrittr_2.0.3 tidyr_1.3.0
## [13] dplyr_1.0.10 methylKit_1.20.0 genomation_1.26.0
## [16] readr_2.1.2 tzdb_0.3.0 data.table_1.14.2
## [19] mixtools_1.2.0 EnrichedHeatmap_1.24.0 GenomicRanges_1.46.1
## [22] GenomeInfoDb_1.30.1 IRanges_2.28.0 S4Vectors_0.32.4
## [25] BiocGenerics_0.40.0 ComplexHeatmap_2.10.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.3 R.utils_2.12.2
## [3] tidyselect_1.2.0 RSQLite_2.2.20
## [5] htmlwidgets_1.6.1 BiocParallel_1.28.3
## [7] munsell_0.5.0 codetools_0.2-19
## [9] DT_0.27 withr_2.5.0
## [11] colorspace_2.1-0 filelock_1.0.2
## [13] highr_0.10 knitr_1.42
## [15] leaps_3.1 ggsignif_0.6.4
## [17] fastseg_1.40.0 rJava_1.0-6
## [19] labeling_0.4.2 MatrixGenerics_1.6.0
## [21] emmeans_1.8.4-1 bbmle_1.0.25
## [23] GenomeInfoDbData_1.2.7 farver_2.1.1
## [25] seqPattern_1.26.0 bit64_4.0.5
## [27] coda_0.19-4 vctrs_0.5.2
## [29] generics_0.1.3 xfun_0.37
## [31] BiocFileCache_2.2.0 regioneR_1.26.0
## [33] R6_2.5.1 doParallel_1.0.17
## [35] clue_0.3-64 locfit_1.5-9.7
## [37] bitops_1.0-7 cachem_1.0.6
## [39] DelayedArray_0.20.0 assertthat_0.2.1
## [41] promises_1.2.0.1 BiocIO_1.4.0
## [43] scales_1.2.1 gtable_0.3.1
## [45] multcompView_0.1-8 rlang_1.0.6
## [47] scatterplot3d_0.3-42 GlobalOptions_0.1.2
## [49] splines_4.1.3 rstatix_0.7.2
## [51] rtracklayer_1.54.0 impute_1.68.0
## [53] broom_1.0.3 abind_1.4-5
## [55] BiocManager_1.30.19 reshape2_1.4.4
## [57] backports_1.4.1 httpuv_1.6.8
## [59] qvalue_2.26.0 tools_4.1.3
## [61] gridBase_0.4-7 ellipsis_0.3.2
## [63] jquerylib_0.1.4 RColorBrewer_1.1-3
## [65] Rcpp_1.0.10 plyr_1.8.8
## [67] progress_1.2.2 zlibbioc_1.40.0
## [69] purrr_1.0.1 RCurl_1.98-1.10
## [71] prettyunits_1.1.1 ggpubr_0.5.0
## [73] GetoptLong_1.0.5 SummarizedExperiment_1.24.0
## [75] ggrepel_0.9.3 cluster_2.1.4
## [77] circlize_0.4.15 mvtnorm_1.1-3
## [79] matrixStats_0.63.0 hms_1.1.2
## [81] xlsxjars_0.6.1 mime_0.12
## [83] evaluate_0.20 xtable_1.8-4
## [85] XML_3.99-0.11 emdbook_1.3.12
## [87] mclust_6.0.0 shape_1.4.6
## [89] compiler_4.1.3 biomaRt_2.50.0
## [91] bdsmatrix_1.3-6 tibble_3.1.8
## [93] KernSmooth_2.23-20 crayon_1.5.2
## [95] R.oo_1.25.0 htmltools_0.5.4
## [97] segmented_1.6-2 later_1.3.0
## [99] DBI_1.1.3 dbplyr_2.3.0
## [101] MASS_7.3-58.2 rappdirs_0.3.3
## [103] car_3.1-1 Matrix_1.5-3
## [105] cli_3.6.0 R.methodsS3_1.8.2
## [107] parallel_4.1.3 pkgconfig_2.0.3
## [109] flashClust_1.01-2 GenomicAlignments_1.30.0
## [111] numDeriv_2016.8-1.1 xml2_1.3.3
## [113] foreach_1.5.2 bslib_0.4.2
## [115] XVector_0.34.0 estimability_1.4.1
## [117] digest_0.6.31 Biostrings_2.62.0
## [119] rmarkdown_2.20 restfulr_0.0.15
## [121] curl_4.3.3 kernlab_0.9-32
## [123] shiny_1.7.4 Rsamtools_2.10.0
## [125] gtools_3.9.4 rjson_0.2.21
## [127] lifecycle_1.0.3 nlme_3.1-162
## [129] jsonlite_1.8.4 carData_3.0-5
## [131] limma_3.50.3 BSgenome_1.62.0
## [133] fansi_1.0.4 pillar_1.8.1
## [135] lattice_0.20-45 KEGGREST_1.34.0
## [137] fastmap_1.1.0 httr_1.4.4
## [139] plotrix_3.8-2 survival_3.5-0
## [141] interactiveDisplayBase_1.32.0 glue_1.6.2
## [143] png_0.1-8 iterators_1.0.14
## [145] BiocVersion_3.14.0 bit_4.0.5
## [147] stringi_1.7.12 sass_0.4.5
## [149] blob_1.2.3 AnnotationHub_3.2.0
## [151] memoise_2.0.1