WGBSFlow/Methylator pipeline version XXX
Developed and maintained at the BiBs platform (Epigenetics and Cell Fate Center - UMR7216) by Olivier Kirsh, Elouan Bethuel and Magali Hennion.

Please send feedback to: bibs@parisepigenetics.com.
Don’t forget to read and sign our charter.

Remember to cite pipeline authors in your publication.

1 R packages requirements

Methylator runs R scripts and uses MethylKit package to handle DNA methylation data.
Conda within a Singularity/Apptainer image handles R version and package dependencies for improved reproducibility. Refer to the ‘R session information’ section for version details.

# load packages
library(EnrichedHeatmap)
library(GenomicRanges)
library(IRanges)
library(mixtools)
library(data.table)
library(tzdb)
library(readr)
library(genomation)
library(methylKit)
library(dplyr)
library(tidyr)
library(magrittr)
library(ggplot2)
library(xlsx)
library(yaml)
library(factoextra)
library(FactoMineR)
library(stringr)
library(annotatr)
library(GenomicFeatures)

2 Configuration and settings

Read analysis parameters from ‘config_wgbs.yaml’ or ‘config_nanopore.yaml’ and extract sample names and groups from the corresponding metadata file.

# load the config file
yaml.file <- yaml.load_file('config_ongoing_run.yaml')

# extract the information from the yaml file
DATATYPE <- yaml.file$DATATYPE
MINCOV <- yaml.file$MINCOV  
DESTRAND <- yaml.file$DESTRAND  
TS <- yaml.file$TILESIZE  
RTS <- yaml.file$STEPSIZE  
LEVEL <- yaml.file$LEVEL  
COV.PERC <- yaml.file$COV.PERC  
UNITE <- yaml.file$UNITE  
CUSTOM_ANNOT <- yaml.file$CUSTOM_ANNOT
NB_CPG_TILES <- yaml.file$NB_CPG_TILES
ORG <- yaml.file$ASSEMBLY
merge_annot <- yaml.file$MERGE_WITH_BASICS_ANNOT

meta <- read.table(yaml.file$METAFILE, sep='\t', header=T)

# colors
full.color = "#de0000"
high.color = "#e17e65"
mid.color  = "#c6c6c6"
low.color = "#7c6fac"
unmeth.color = "#0f208f"
hyper.color = "#de0000"
hypo.color = "#0f208f"
not.color = "#c6c6c6"

titre <- paste0("DNA methylation exploratory analysis on ", LEVEL, " for ", mark) 
knitr::kable(meta, "pipe")

sample	group
SRR9016926_sub500000_chr19	WT
SRR9016927_sub500000_chr19	1KO
SRR9016928_sub500000_chr19	DKO
SRR11806587_sub500000_chr19	WT
SRR11806588_sub500000_chr19	1KO
SRR11806589_sub500000_chr19	DKO

3 Load RData

Methylator takes primary DNA methylation data and generates one RData environment with one MethylRaw object per sample.
The MethylRaw object stores the following for each CpG:
- Genomic coordinates (Chrom, Start, End, Strand)
- Coverage (number of reads)
- Number of methylated and unmethylated cytosines (for computing the methylation status)

It contains the total number of CpG, sample_id, genome assembly and methylation context (CpG).

# load Rdata 
# myobj

vect_myobj <- c()
treatment <- as.integer(factor(meta$group, ordered = TRUE))

recup_myobj <- function(RDATA){
  load(RDATA)
  myobj
}

vect_myobj <- lapply(liste_Rdata_path, recup_myobj)
myobj <- methylRawList(vect_myobj, treatment=c(treatment))
head(myobj, n = 20)

## [[1]]
## methylRaw object with 68609 rows
## --------------
##     chr   start     end strand coverage numCs numTs
## 1 chr19 3050120 3050120      -        1     1     0
## 2 chr19 3051793 3051793      -        1     1     0
## 3 chr19 3051860 3051860      +        1     0     1
## 4 chr19 3051902 3051902      +        1     1     0
## 5 chr19 3051907 3051907      +        1     1     0
## 6 chr19 3051940 3051940      +        1     0     1
## --------------
## sample.id: WT_1 
## assembly: mm39 
## context: CpG 
## resolution: base 
## 
## 
## [[2]]
## methylRaw object with 77141 rows
## --------------
##     chr   start     end strand coverage numCs numTs
## 1 chr19 3051793 3051793      -        1     0     1
## 2 chr19 3051820 3051820      -        1     0     1
## 3 chr19 3051822 3051822      -        1     0     1
## 4 chr19 3051861 3051861      -        1     1     0
## 5 chr19 3051903 3051903      -        2     0     2
## 6 chr19 3051908 3051908      -        2     0     2
## --------------
## sample.id: 1KO_1 
## assembly: mm39 
## context: CpG 
## resolution: base 
## 
## 
## [[3]]
## methylRaw object with 83572 rows
## --------------
##     chr   start     end strand coverage numCs numTs
## 1 chr19 3050119 3050119      +        1     0     1
## 2 chr19 3050120 3050120      -        1     0     1
## 3 chr19 3050317 3050317      -        1     0     1
## 4 chr19 3050813 3050813      -        1     0     1
## 5 chr19 3050818 3050818      -        1     0     1
## 6 chr19 3050955 3050955      +        1     1     0
## --------------
## sample.id: DKO_1 
## assembly: mm39 
## context: CpG 
## resolution: base 
## 
## 
## [[4]]
## methylRaw object with 76702 rows
## --------------
##     chr   start     end strand coverage numCs numTs
## 1 chr19 3051819 3051819      +        1     0     1
## 2 chr19 3051821 3051821      +        1     1     0
## 3 chr19 3051860 3051860      +        1     0     1
## 4 chr19 3051903 3051903      -        2     0     2
## 5 chr19 3051908 3051908      -        1     0     1
## 6 chr19 3051914 3051914      -        1     0     1
## --------------
## sample.id: WT_2 
## assembly: mm39 
## context: CpG 
## resolution: base 
## 
## 
## [[5]]
## methylRaw object with 81732 rows
## --------------
##     chr   start     end strand coverage numCs numTs
## 1 chr19 3050603 3050603      +        1     0     1
## 2 chr19 3051012 3051012      +        1     0     1
## 3 chr19 3051067 3051067      +        1     0     1
## 4 chr19 3051154 3051154      +        1     0     1
## 5 chr19 3051573 3051573      +        1     0     1
## 6 chr19 3051820 3051820      -        1     0     1
## --------------
## sample.id: 1KO_2 
## assembly: mm39 
## context: CpG 
## resolution: base 
## 
## 
## [[6]]
## methylRaw object with 67543 rows
## --------------
##     chr   start     end strand coverage numCs numTs
## 1 chr19 3051154 3051154      +        2     0     2
## 2 chr19 3051895 3051895      -        1     0     1
## 3 chr19 3051903 3051903      -        2     0     2
## 4 chr19 3051908 3051908      -        2     1     1
## 5 chr19 3051914 3051914      -        2     1     1
## 6 chr19 3052077 3052077      +        1     1     0
## --------------
## sample.id: DKO_2 
## assembly: mm39 
## context: CpG 
## resolution: base

rm(vect_myobj)
rm(treatment)

4 Configuration file parameters

Analysis on CpG or Tiles - LEVEL : CpG
Merge reads from both strands? DESTRAND : TRUE
Minimum coverage - MINCOV : 5
Discard the bases that have more than COV.PERCth percentile of coverage - COV.PERC : 99.9
Type of unification (all or at least one per group) - UNITE : all

5 Sequencing depth of coverage

DNA methylation percentage is computed with the formula = 100 * C /C + T.

The ENCODE data standards recommend a 30X coverage for accurate DNA methylation inference and statistical power in differential analysis.

With this analysis, we will explore how the coverage depth threshold penalizes coverage breadth.
Methylator first computes the theoretical number of CpGs or Tiles present in your reference genome (FASTA file).

5.1 Theoretical number of CpG

fasta <- Biostrings::readDNAStringSet(genome_path)
dnf <- Biostrings::dinucleotideFrequency(fasta,simplify.as="collapsed")
nb_all_cpg <- dnf["CG"]*2   # multiply by 2 because we work with the CGs on both strands

options("scipen"=100, "digits"=4)

rm(fasta)

Theoretical number of CpG for wgbs : 110386

5.2 Theoretical number of Tiles

### eval=(LEVEL=="Tiles") -> this cell is not run if CpG is choosen
fasta <-Biostrings::readDNAStringSet(genome_path)
nb_all_nucleo <- Biostrings::vcountPattern("A", fasta,max.mismatch=0, min.mismatch=0) + 
                 Biostrings::vcountPattern("T", fasta,max.mismatch=0, min.mismatch=0) + 
                 Biostrings::vcountPattern("G", fasta,max.mismatch=0, min.mismatch=0) + 
                 Biostrings::vcountPattern("C", fasta,max.mismatch=0, min.mismatch=0)

nb_all_nucleo <- sum(as.vector(nb_all_nucleo))

nb_all_tiles <- nb_all_nucleo / (TS * RTS)
options("scipen"=100, "digits"=4)

rm(fasta)

Tiles are not used in this analysis.

5.3 Number and percentage of retained CpGs as a function of coverage filter

Methylator explores how the minimum coverage filter from 1 to 20 reduces the total number of analyzed CpGs.
Use these figures to determine the best threshold and detect samples with inadequate global coverage compared to others.
The MINCOV value can be modified in the ‘config_wgbs.yaml’ or ‘config_nanopore.yaml’ files.

## Filtering minimum COVERAGE THRESHOLD (1 to 30) exploration for each sample

cov <- c(1:15,20,25,30)       # filterByCoverage(myobj,lo.count= value), keep cov in lower case.
CpG_mapped <- NULL       # Number of mapped CpG
sample_id <- NULL          # Sample ID
group <- NULL 

for(i in 1:length(myobj)){
  print(myobj[[i]]@sample.id)
  CpG_sample <- NULL
  for(j in cov){
    fobj <- filterByCoverage(myobj[[i]],
            lo.count=j, # variable from 1 to 20
            lo.perc=NULL,
            hi.count=NULL,
            hi.perc=COV.PERC)
    CpG_sample <- c(CpG_sample, dim(fobj)[1]) # nb of mapped CpG remain after filtering by coverage filter j
    sample_id <- c(sample_id, fobj@sample.id)
    group <- c(group, strsplit(fobj@sample.id, "_")[[1]][1])
  }
  print(CpG_sample)
  CpG_mapped <- c(CpG_mapped, CpG_sample)
}

## [1] "WT_1"
##  [1] 68466 63367 56345 47498 37429 27585 19115 12456  7626  4344  2368  1179
## [13]   554   170     0     0     0     0
## [1] "1KO_1"
##  [1] 76998 70983 62269 50845 38271 26836 17289 10343  5767  2977  1404   560
## [13]   190     0     0     0     0     0
## [1] "DKO_1"
##  [1] 83411 73949 61169 46390 32249 20584 12171  6758  3396  1573   666   191
## [13]     0     0     0     0     0     0
## [1] "WT_2"
##  [1] 76624 72514 66421 58006 47842 37361 27538 19073 12482  7812  4698  2728
## [13]  1490   787   382     0     0     0
## [1] "1KO_2"
##  [1] 81613 77244 70551 61164 49836 38082 27505 18683 11924  7295  4261  2339
## [13]  1247   639   301     0     0     0
## [1] "DKO_2"
##  [1] 67468 64388 59766 53001 44619 35498 26579 18843 12502  8002  4826  2772
## [13]  1548   819   418     0     0     0

rm(fobj)

5.3.1 By samples

As a total number of CpG per sample

# Draw Barplot (ggplot2)
################################################################################

if(DATATYPE == "WGBS"){
  
  dfct <- data.frame(Coverage_Filter = as.factor(cov),
                    CpG_number = CpG_mapped,
                    CpG_percentage = CpG_mapped/nb_all_cpg, 
                    Sample_ID = sample_id, 
                    group = group )
  
  bpctt <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_number, fill = Sample_ID)) +
    geom_bar(stat="identity", position=position_dodge(), color="black") +
    ggtitle("Min Cov Value : Number of retained CpG (by sample)") +
    xlab("Minimun number of reads") + ylab("Number of CpG") +
    coord_cartesian(ylim = c(0, nb_all_cpg*1.1 )) +
    geom_hline(aes(yintercept = nb_all_cpg, linetype="Number of Theoretical CpG"), colour = "red", size = 1.2) + 
    scale_linetype_manual(name ="", values = 'dotted') +
    theme_classic(base_size = 14) +
    theme(plot.title = element_text(hjust = 0.5))
  bpctt


}else{
  
  dfct <- data.frame(Coverage_Filter = as.factor(cov),
                     CpG_number = CpG_mapped,
                     Sample_ID = sample_id, 
                     group = group )
  
  bpctt <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_number, fill = Sample_ID)) +
  geom_bar(stat="identity", position=position_dodge(), color="black") +
  ggtitle("Min Cov Value : Number of retained CpG (by sample)") +
  xlab("Minimun number of reads") + ylab("Number of CpG") +
  coord_cartesian(ylim = c(0, nb_all_cpg*1.1 )) +
  scale_linetype_manual(name ="", values = 'dotted') +
  theme_classic(base_size = 14) +
  theme(plot.title = element_text(hjust = 0.5))
bpctt

}

As a proportion of each sample in your study

if(DATATYPE == "WGBS"){
  
  bpcttg <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_percentage, fill = Sample_ID)) +
    geom_bar(stat="identity", position="fill", color="black") +
    ggtitle("Min Cov Value : % of retained CpG (by sample)") +
    xlab("Minimun number of reads") + ylab("Sample proportion") +
    ylim(0,1) + 
    theme_classic(base_size = 14) +
    theme(plot.title = element_text(hjust = 0.5))
  bpcttg
  
  
  bppc <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_percentage, fill = Sample_ID)) +
    geom_bar(stat="identity", position=position_dodge(), color="black") +
    ggtitle("Min Cov Value : Percentage of retained CpG (by sample)") +
    xlab("Minimun number of reads") + ylab("% of CpG") +
    coord_cartesian(ylim = c(0, 1)) +
    scale_linetype_manual(name ="", values = 'dotted') +
    theme_classic(base_size = 14) +
    theme(plot.title = element_text(hjust = 0.5))
  bppc

}else{

  bpcttg <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_number, fill = Sample_ID)) +
    geom_bar(stat="identity", position="fill", color="black") +
    ggtitle("Min Cov Value : % of retained CpG (by sample)") +
    xlab("Minimun number of reads") + ylab("Sample proportion") +
    theme_classic(base_size = 14) +
    theme(plot.title = element_text(hjust = 0.5))
  bpcttg

}

5.3.2 By groups

As a total number of CpG per sample

# bpcttc <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_number, fill = group)) +
#   geom_bar(stat="identity", position=position_dodge(), color="black") +
#   ggtitle("Min Cov Value : Number of retained CpG (by condition)") +
#   xlab("Minimun number of reads") + ylab("Number of CpG") +
#   coord_cartesian(ylim = c(0, nb_all_cpg )) +
#   geom_hline(aes(yintercept = nb_all_cpg, linetype="Number of Theoretical CpG"), colour = "red", size = 1.2) + 
#   scale_linetype_manual(name ="", values = 'dotted') + 
#   theme_classic(base_size = 14) +
#   theme(plot.title = element_text(hjust = 0.5))
# bpcttc

As a proportion of each sample in your study

# bpcttgc <- ggplot(dfct, aes(x = Coverage_Filter, y = CpG_percentage, fill = group)) +
#   geom_bar(stat="identity", position="fill", color="black") +
#   ggtitle("Proportion of retained CpG (by condition)") +
#   xlab("Minimun number of reads") + ylab("Sample proportion") +
#   ylim(0,1) + 
#   theme_classic(base_size = 14) +
#   theme(plot.title = element_text(hjust = 0.5))
#   
# bpcttgc

6 DNA methylation exploration on CpG

6.1 Filtering and unification

6.1.1 Filtering by coverage

Methylator applies the ‘MINCOV’ filter of 5 on each sample, removing CpGs below this minimum coverage threshold.
It also discards the sites that have more than 99.9th percentile of coverage. Those values are editable in the configuration file.

## [[1]]
## methylRaw object with 37429 rows
## --------------
##      chr   start     end strand coverage numCs numTs
## 14 chr19 3065878 3065878      -        6     0     6
## 34 chr19 3078956 3078956      -        7     5     2
## 35 chr19 3079413 3079413      +        7     6     1
## 36 chr19 3079414 3079414      -        6     5     1
## 40 chr19 3079526 3079526      -        8     7     1
## 41 chr19 3079664 3079664      -        6     3     3
## --------------
## sample.id: WT_1 
## assembly: mm39 
## context: CpG 
## resolution: base 
## 
## 
## [[2]]
## methylRaw object with 38271 rows
## --------------
##      chr   start     end strand coverage numCs numTs
## 38 chr19 3078799 3078799      +        5     3     2
## 44 chr19 3078955 3078955      +        6     1     5
## 53 chr19 3079664 3079664      -        5     0     5
## 65 chr19 3080405 3080405      +        5     0     5
## 67 chr19 3080609 3080609      +       11     1    10
## 71 chr19 3081126 3081126      +        7     1     6
## --------------
## sample.id: 1KO_1 
## assembly: mm39 
## context: CpG 
## resolution: base 
## 
## 
## [[3]]
## methylRaw object with 32249 rows
## --------------
##      chr   start     end strand coverage numCs numTs
## 40 chr19 3078603 3078603      +        9     1     8
## 41 chr19 3078604 3078604      -        5     1     4
## 42 chr19 3078649 3078649      +        7     1     6
## 47 chr19 3078832 3078832      -        6     0     6
## 50 chr19 3078956 3078956      -        7     2     5
## 53 chr19 3079445 3079445      -        6     2     4
## --------------
## sample.id: DKO_1 
## assembly: mm39 
## context: CpG 
## resolution: base 
## 
## 
## [[4]]
## methylRaw object with 47842 rows
## --------------
##      chr   start     end strand coverage numCs numTs
## 39 chr19 3078603 3078603      +        7     2     5
## 40 chr19 3078604 3078604      -        5     5     0
## 41 chr19 3078649 3078649      +        5     3     2
## 42 chr19 3078650 3078650      -        9     9     0
## 50 chr19 3078956 3078956      -        6     5     1
## 51 chr19 3079413 3079413      +        7     6     1
## --------------
## sample.id: WT_2 
## assembly: mm39 
## context: CpG 
## resolution: base 
## 
## 
## [[5]]
## methylRaw object with 49836 rows
## --------------
##      chr   start     end strand coverage numCs numTs
## 45 chr19 3076963 3076963      +        5     0     5
## 56 chr19 3078832 3078832      -        5     2     3
## 59 chr19 3078955 3078955      +        6     1     5
## 70 chr19 3079896 3079896      -        5     0     5
## 72 chr19 3079903 3079903      -        5     1     4
## 73 chr19 3080073 3080073      -        7     1     6
## --------------
## sample.id: 1KO_2 
## assembly: mm39 
## context: CpG 
## resolution: base 
## 
## 
## [[6]]
## methylRaw object with 44619 rows
## --------------
##      chr   start     end strand coverage numCs numTs
## 44 chr19 3078832 3078832      -        5     0     5
## 46 chr19 3078875 3078875      -        5     0     5
## 47 chr19 3078955 3078955      +        5     0     5
## 66 chr19 3080168 3080168      +        8     3     5
## 67 chr19 3080169 3080169      -        5     0     5
## 68 chr19 3080405 3080405      +        6     1     5
## --------------
## sample.id: DKO_2 
## assembly: mm39 
## context: CpG 
## resolution: base

7 Some statistics on raw and filtered data

7.1 Filtering statistics

Calculate the number of CpG on raw and filtered data, determining the percentage of lost sites after filtering.

nb_row <- c()
nb_filtered <- c()
sample_id <- c()
group <- c()

for(i in 1:length(dm)){
  
  print(dm[[i]]@sample.id)
  sample_id <- c(sample_id, dm[[i]]@sample.id)    
  group <- c(group, strsplit(dm[[i]]@sample.id, "_")[[1]][1] )
  print(paste0("Number of ", LEVEL, " on raw data: ", dm[[i]] %>% dim %>% .[1]))

  nb_row <- c(nb_row,  dm[[i]] %>% dim %>% .[1])
  print(paste0("Number of ", LEVEL, " on filtered data: ", f.dm[[i]] %>% dim  %>% .[1]))
  nb_filtered <- c(nb_filtered,  f.dm[[i]] %>% dim %>% .[1])
  print(paste0("Percentage of lost rows after filtering dm: ", 
          signif(100*(dm[[i]] %>% dim %>% .[1] - f.dm[[i]] %>% dim  %>% .[1] ) /
          dm[[i]] %>% dim %>% .[1], digits = 4), " %" ))
  cat("\n")
}

## [1] "WT_1"
## [1] "Number of CpG on raw data: 68609"
## [1] "Number of CpG on filtered data: 37429"
## [1] "Percentage of lost rows after filtering dm: 45.45 %"
## 
## [1] "1KO_1"
## [1] "Number of CpG on raw data: 77141"
## [1] "Number of CpG on filtered data: 38271"
## [1] "Percentage of lost rows after filtering dm: 50.39 %"
## 
## [1] "DKO_1"
## [1] "Number of CpG on raw data: 83572"
## [1] "Number of CpG on filtered data: 32249"
## [1] "Percentage of lost rows after filtering dm: 61.41 %"
## 
## [1] "WT_2"
## [1] "Number of CpG on raw data: 76702"
## [1] "Number of CpG on filtered data: 47842"
## [1] "Percentage of lost rows after filtering dm: 37.63 %"
## 
## [1] "1KO_2"
## [1] "Number of CpG on raw data: 81732"
## [1] "Number of CpG on filtered data: 49836"
## [1] "Percentage of lost rows after filtering dm: 39.03 %"
## 
## [1] "DKO_2"
## [1] "Number of CpG on raw data: 67543"
## [1] "Number of CpG on filtered data: 44619"
## [1] "Percentage of lost rows after filtering dm: 33.94 %"

df_stat <- data.frame(ID = sample_id, nb_row = nb_row, nb_filtered = nb_filtered, group=group)

if(LEVEL == "Tiles"){
max.value <- nb_all_tiles
}else{
max.value <- nb_all_cpg
}


nb_row_sci <- NULL
nb_filtered_sci <- NULL
for(i in nb_row){nb_row_sci <- c(nb_row_sci, (formatC(i,  format='e', digits=2)))}
for(i in nb_filtered){nb_filtered_sci <- c(nb_filtered_sci, (formatC(i,  format='e', digits=2)))}


bf <- ggplot(df_stat, aes(x=as.factor(ID), y=nb_row, fill=group)) +
  geom_bar(stat="identity", color="black") +
  geom_text(aes(label=nb_row_sci), vjust=1.6, color="white", size= 3)+
  ggtitle(paste0("Number of ", LEVEL, " in raw data")) +
  xlab("Samples") + ylab(paste0("Number of ", LEVEL)) + 
  ylim(c(0, max.value*1.1)) + 
  geom_hline(aes(yintercept = max.value, linetype=paste0("Theoretical number of ",LEVEL)), colour = "red", size = 1.2) + 
  scale_linetype_manual(name ="", values = 'dotted') + 
  theme_bw(base_size = 14)
    

af <- ggplot(df_stat, aes(x=as.factor(ID), y=nb_filtered, fill=group)) +
  geom_bar(stat="identity", color="black") +
  geom_text(aes(label=nb_filtered_sci), vjust=1.6, color="white", size= 3)+
  ggtitle(paste0("Number of ", LEVEL, " in filtered data")) +
  xlab("Samples") + ylab(paste0("Number of ", LEVEL)) + 
  ylim(c(0, max.value*1.1)) + 
  geom_hline(aes(yintercept = max.value, linetype=paste0("Theoretical number of ",LEVEL)), colour = "red", size = 1.2) + 
  scale_linetype_manual(name ="", values = 'dotted') + 
  theme_bw(base_size = 14)

par(mfrow = c(1,2))
bf

af

rm(nb_row, nb_row_sci)
rm(nb_filtered, nb_filtered_sci)
rm(sample_id)
rm(group)

7.1.1 Sample unification

For a differential analysis, each CpG or tile must be present in each sample/group. Two unifications are available :
- “all”, where only the CpGs covered in all samples are retained (stringent but recommended),
- “at least one per group”, where a CpG is retained if covered at leat once per group (generates NA, not recommended).

Filtered and united (“all”) data (with all samples)

##      chr   start     end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2
## 1  chr19 3078955 3078955      +         7      5      2         6      1      5
## 2  chr19 3080405 3080405      +         9      4      5         5      0      5
## 3  chr19 3080609 3080609      +        13      8      5        11      1     10
## 4  chr19 3083013 3083013      +        10      9      1         5      1      4
## 5  chr19 3083062 3083062      +         7      3      4        10      2      8
## 6  chr19 3089018 3089018      +         8      6      2        13      2     11
## 7  chr19 3128966 3128966      +        12      9      3        21      3     18
## 8  chr19 3130088 3130088      +         5      5      0         6      1      5
## 9  chr19 3130109 3130109      +         5      4      1         9      3      6
## 10 chr19 3130130 3130130      +         6      6      0         5      1      4
## 11 chr19 3130134 3130134      +        12     12      0         5      1      4
## 12 chr19 3130191 3130191      +        12     10      2         7      2      5
## 13 chr19 3130195 3130195      +        12     12      0        14      1     13
## 14 chr19 3130209 3130209      +        12     10      2        12      0     12
## 15 chr19 3130335 3130335      +        14     13      1         8      2      6
## 16 chr19 3130430 3130430      +        15     14      1         5      0      5
## 17 chr19 3130791 3130791      +         9      1      8         7      1      6
## 18 chr19 3131084 3131084      +         9      6      3        10      3      7
## 19 chr19 3140156 3140156      +         6      3      3        12      1     11
## 20 chr19 3140191 3140191      +        15     10      5        11      1     10
##    coverage3 numCs3 numTs3 coverage4 numCs4 numTs4 coverage5 numCs5 numTs5
## 1          7      2      5         6      5      1         6      1      5
## 2          5      1      4         6      5      1         6      0      6
## 3          8      0      8         6      5      1        21      1     20
## 4          6      2      4         5      3      2        13      3     10
## 5          8      2      6         5      3      2        12      0     12
## 6         11      1     10         8      3      5        10      0     10
## 7         10      3      7        17     10      7        14      2     12
## 8          6      2      4        15     13      2        23      3     20
## 9         12      3      9         5      1      4        16      2     14
## 10        12      2     10        11      5      6        28      6     22
## 11        13      3     10         5      3      2        28      9     19
## 12        12      3      9        13     10      3        18      2     16
## 13        12      0     12        13      8      5        16      2     14
## 14         5      2      3         9      7      2        15      0     15
## 15         7      2      5         9      8      1         6      1      5
## 16        14      5      9        13     12      1         9      1      8
## 17        15      0     15        12      2     10        12      1     11
## 18         8      3      5         8      6      2        11      3      8
## 19        18      2     16        10      4      6        31      3     28
## 20        10      2      8        28     15     13        20      5     15
##    coverage6 numCs6 numTs6
## 1          5      0      5
## 2         16      3     13
## 3         13      4      9
## 4          8      1      7
## 5          6      2      4
## 6          8      3      5
## 7         16      3     13
## 8         16      5     11
## 9          5      0      5
## 10         6      2      4
## 11         7      3      4
## 12        15      8      7
## 13        13      5      8
## 14        16      6     10
## 15        12      6      6
## 16         5      0      5
## 17        13      1     12
## 18         5      1      4
## 19        14      3     11
## 20        22      6     16

Filtered and united (“at least one per group”) data (with all samples)

##      chr   start     end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2
## 1  chr19 3078955 3078955      +         7      5      2         6      1      5
## 2  chr19 3079663 3079663      +         6      3      3         5      0      5
## 3  chr19 3080072 3080072      +         6      6      0        NA     NA     NA
## 4  chr19 3080168 3080168      +         8      4      4        NA     NA     NA
## 5  chr19 3080405 3080405      +         9      4      5         5      0      5
## 6  chr19 3080609 3080609      +        13      8      5        11      1     10
## 7  chr19 3081126 3081126      +         8      6      2         7      1      6
## 8  chr19 3081305 3081305      +        NA     NA     NA         5      0      5
## 9  chr19 3082071 3082071      +        10      6      4        NA     NA     NA
## 10 chr19 3082173 3082173      +        16     14      2        NA     NA     NA
## 11 chr19 3082452 3082452      +        NA     NA     NA        NA     NA     NA
## 12 chr19 3082673 3082673      +        NA     NA     NA         5      1      4
## 13 chr19 3082683 3082683      +         5      4      1         5      0      5
## 14 chr19 3083004 3083004      +         5      5      0        NA     NA     NA
## 15 chr19 3083013 3083013      +        10      9      1         5      1      4
## 16 chr19 3083062 3083062      +         7      3      4        10      2      8
## 17 chr19 3083094 3083094      +         5      2      3         9      0      9
## 18 chr19 3083140 3083140      +         6      3      3        NA     NA     NA
## 19 chr19 3083855 3083855      +        NA     NA     NA        NA     NA     NA
## 20 chr19 3084251 3084251      +         5      5      0         5      2      3
##    coverage3 numCs3 numTs3 coverage4 numCs4 numTs4 coverage5 numCs5 numTs5
## 1          7      2      5         6      5      1         6      1      5
## 2         11      1     10        11     10      1        NA     NA     NA
## 3          5      2      3         5      2      3         7      1      6
## 4          5      0      5         6      2      4         9      1      8
## 5          5      1      4         6      5      1         6      0      6
## 6          8      0      8         6      5      1        21      1     20
## 7         NA     NA     NA        NA     NA     NA         7      3      4
## 8          8      1      7         5      4      1        NA     NA     NA
## 9          8      1      7        12      7      5        10      0     10
## 10         8      1      7        13      5      8        13      0     13
## 11        10      2      8         5      4      1         5      0      5
## 12         7      0      7         9      7      2         5      1      4
## 13         5      0      5        10      8      2        NA     NA     NA
## 14         8      1      7         5      5      0        19      8     11
## 15         6      2      4         5      3      2        13      3     10
## 16         8      2      6         5      3      2        12      0     12
## 17         6      1      5        NA     NA     NA         9      1      8
## 18        NA     NA     NA        11      9      2         8      0      8
## 19         6      2      4         9      8      1         6      1      5
## 20        NA     NA     NA         5      5      0         5      2      3
##    coverage6 numCs6 numTs6
## 1          5      0      5
## 2         NA     NA     NA
## 3         NA     NA     NA
## 4         13      3     10
## 5         16      3     13
## 6         13      4      9
## 7          6      3      3
## 8         NA     NA     NA
## 9          8      0      8
## 10         6      0      6
## 11        NA     NA     NA
## 12        NA     NA     NA
## 13        NA     NA     NA
## 14         7      1      6
## 15         8      1      7
## 16         6      2      4
## 17         8      4      4
## 18         9      2      7
## 19         5      5      0
## 20         5      3      2

8 Uniting statistics (for each condition)

Compute the minimum of replicat by condition/group

replicat_group <- data.frame(table(meta$group))
min_nb_replicat <- min(replicat_group$Freq)
min.per.group.max <- min_nb_replicat-1

Le nombre minimal de réplicat par group est : 2, vous pouvez donc unifier avec un min.per.group maximal de : 1

min.per.group <- yaml.file$MIN_PER_GROUP

if(min.per.group > min.per.group.max){
  print("Error, vous avez choisit une unification non compatible avec vos nombres de réplicats par group, 
         par defaut l'unification se fera avec le paramètre all ")
  min.per.group <- "all"
}

## [1] "Error, vous avez choisit une unification non compatible avec vos nombres de réplicats par group, \n         par defaut l'unification se fera avec le paramètre all "

list_comparisons <- yaml.file$COMPARISON

for(comparison in list_comparisons){
  
  control.cond <- comparison[1]
  treat.cond <- comparison[2]
  nam_fu <- paste0("fu.meth_", control.cond, "_", treat.cond)
  nam_fu1 <- paste0("fu1.meth_", control.cond, "_", treat.cond)
  
  PREP_RDATA_CONDI <- paste0(bigoutput_path, "PrepC_", control.cond, "_", treat.cond, "_", mark, ".RData")
  load(PREP_RDATA_CONDI)
  
  assign(nam_fu, fu.meth_condi)
  assign(nam_fu1, fu1.meth_condi)
  rm(myobj_condi)
  rm(dm_condi)
  rm(f.dm_condi)
  rm(fu.meth_condi)
  rm(fu1.meth_condi)
}

df_uni <- NULL

for(comparison in list_comparisons){

  control.cond <- comparison[1]
  treat.cond <- comparison[2]
  nam_fu <- paste0("fu.meth_", control.cond, "_", treat.cond)
  nam_fu1 <- paste0("fu1.meth_", control.cond, "_", treat.cond)

  fu.meth_condi <- get(nam_fu)
  fu1.meth_condi <- get(nam_fu1)

  if(is.null(df_uni)){
    df_uni <- data.frame(fu.meth_condi %>% dim %>% .[1], fu1.meth_condi %>% dim %>% .[1],  paste0(control.cond, "_", treat.cond))
    colnames(df_uni) <- c("all", "at1", "comparisons")
  }else{
    nb_row <- data.frame(fu.meth_condi %>% dim %>% .[1], fu1.meth_condi %>% dim %>% .[1],  paste0(control.cond, "_", treat.cond))
    colnames(nb_row) <- c("all", "at1", "comparisons")
    df_uni <- rbind(df_uni, nb_row)
  }
}

df_long <- tidyr::pivot_longer(df_uni, 1:2,  names_to = "unification", values_to = "nb")
df_long$max.value <- rep(max.value, (length(list_comparisons)*2))
df_long$perct <- round(((df_long$nb / df_long$max.value)*100),2)

unf <- ggplot(df_long, aes(x= as.factor(comparisons), y = nb, fill= unification)) +
  geom_bar(stat="identity", position=position_dodge(), color="black") +
  geom_text(aes( x = as.factor(comparisons), y = nb, label=paste0(perct, "%"), fill = unification), 
            position = position_dodge(width = 1), vjust =  +3, color="white", size = 4) + 
  ggtitle(paste0("Retained ", LEVEL, " in filtered and unified data, for each comparison")) +
  xlab("Comparisons") + ylab(paste0("Nb of ", LEVEL, " retained ")) +
  theme_classic(base_size = 14) +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 45, vjust = 0.6, hjust=0.5))

unf

8.1 Global methylation statistics

On raw, filtered and united data

if(UNITE == "all" & notempty.fu){

  for(i in seq(from=5, to=length(fu.meth), by=3)){
    df_fu.meth = na.omit(getData(fu.meth))
    print("methylation statistics per base (unified on all samples)")
    print("summary")
    print(summary(df_fu.meth[,i+1]/df_fu.meth[,i]))
    print("percentiles")
    print(quantile(df_fu.meth[,i+1]/df_fu.meth[,i], na.rm = TRUE))
    rm(df_fu.meth)
  }

} else if (UNITE == "one" & notempty.fu1) {

  for(i in seq(from=5, to=length(fu1.meth), by=3)){
    df_fu1.meth = na.omit(getData(fu1.meth))
    print("methylation statistics per base (unified on at least one sample)")
    print("summary")
    print(summary(df_fu1.meth[,i+1]/df_fu1.meth[,i]))
    print("percentiles")
    print(quantile(df_fu1.meth[,i+1]/df_fu1.meth[,i], na.rm = TRUE))
    cat("\n")
    rm(df_fu1.meth)
  }
}

## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.750   0.913   0.781   1.000   1.000 
## [1] "percentiles"
##    0%   25%   50%   75%  100% 
## 0.000 0.750 0.913 1.000 1.000 
## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.154   0.186   0.300   1.000 
## [1] "percentiles"
##     0%    25%    50%    75%   100% 
## 0.0000 0.0000 0.1538 0.3000 1.0000 
## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.104   0.167   1.000 
## [1] "percentiles"
##     0%    25%    50%    75%   100% 
## 0.0000 0.0000 0.0000 0.1667 1.0000 
## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.714   0.900   0.771   1.000   1.000 
## [1] "percentiles"
##     0%    25%    50%    75%   100% 
## 0.0000 0.7143 0.9000 1.0000 1.0000 
## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.167   0.207   0.333   1.000 
## [1] "percentiles"
##     0%    25%    50%    75%   100% 
## 0.0000 0.0000 0.1667 0.3333 1.0000 
## [1] "methylation statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0556  0.1407  0.2222  1.0000 
## [1] "percentiles"
##      0%     25%     50%     75%    100% 
## 0.00000 0.00000 0.05556 0.22222 1.00000

8.2 Coverage Statistics

On raw, filtered and united data

for(i in 1:length(f.dm)){
  print(paste0("sample ID : ",f.dm[[i]]@sample.id))
  print("dm, raw data")
  getCoverageStats(dm[[i]],
                   plot=FALSE,both.strands=FALSE)
  print("f.dm, filtered data")
  getCoverageStats(f.dm[[i]],
                   plot=FALSE,both.strands=FALSE)
}

## [1] "sample ID : WT_1"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    5.00    5.11    7.00   32.00 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     1     2     3     3     4     5     6     6     7     9    10    13    13 
## 99.9%  100% 
##    15    32 
## 
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    5.00    7.00    7.01    8.00   14.00 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     5     5     5     6     6     7     7     8     9    10    11    13    13 
## 99.9%  100% 
##    14    14 
## 
## [1] "sample ID : 1KO_1"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    4.76    6.00   54.00 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     1     2     3     3     4     4     5     6     7     8     9    11    12 
## 99.9%  100% 
##    14    54 
## 
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    5.00    6.00    6.71    8.00   13.00 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     5     5     5     6     6     6     7     7     8     9    10    12    12 
## 99.9%  100% 
##    13    13 
## 
## [1] "sample ID : DKO_1"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    4.00    4.13    5.00   48.00 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     1     1     2     3     3     4     4     5     6     7     8    10    11 
## 99.9%  100% 
##    13    48 
## 
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    5.00    6.00    6.41    7.00   12.00 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     5     5     5     5     6     6     6     7     8     9     9    11    12 
## 99.9%  100% 
##    12    12 
## 
## [1] "sample ID : WT_2"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     4.0     5.0     5.7     7.0    87.0 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     1     2     3     4     5     5     6     7     8    10    11    14    15 
## 99.9%  100% 
##    18    87 
## 
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    6.00    7.00    7.39    9.00   17.00 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     5     5     5     6     6     7     7     8     9    10    12    14    15 
## 99.9%  100% 
##    17    17 
## 
## [1] "sample ID : 1KO_2"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    5.00    5.57    7.00   81.00 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     1     2     3     4     5     5     6     7     8     9    11    13    15 
## 99.9%  100% 
##    17    81 
## 
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    6.00    7.00    7.25    8.00   16.00 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     5     5     5     6     6     7     7     8     9    10    11    14    15 
## 99.9%  100% 
##    16    16 
## 
## [1] "sample ID : DKO_2"
## [1] "dm, raw data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    6.00    5.96    8.00   91.00 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     1     2     3     4     5     6     6     7     8    10    11    14    15 
## 99.9%  100% 
##    18    91 
## 
## [1] "f.dm, filtered data"
## read coverage statistics per base
## summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    6.00    7.00    7.51    9.00   17.00 
## percentiles:
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%   99% 99.5% 
##     5     5     5     6     6     7     8     8     9    11    12    14    15 
## 99.9%  100% 
##    17    17

if(UNITE == "all" & notempty.fu){

  for(i in seq(from=5, to=length(fu.meth), by=3)){
    df_fu.meth = getData(fu.meth)
    print("read coverage statistics per base (unified on all samples)")
    print("summary")
    print(summary(df_fu.meth[,i]))
    print("percentiles")
    print(quantile(df_fu.meth[,i]))
    cat("\n")
    rm(df_fu.meth)
  }

} else if (UNITE == "one" & notempty.fu1) {

  for(i in seq(from=5, to=length(fu1.meth), by=3)){
    
    df_fu1.meth = na.omit(getData(fu1.meth))
    cat("\n")
    print("read coverage statistics per base (unified on at least one sample)")
    print("summary")
    print(summary(df_fu1.meth[,i]))
    print("percentiles")
    print(quantile(df_fu1.meth[,i]))
    rm(df_fu1.meth)
  }
}

## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5       7      11      11      14      27 
## [1] "percentiles"
##   0%  25%  50%  75% 100% 
##    5    7   11   14   27 
## 
## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    6.00    9.00    9.88   13.00   24.00 
## [1] "percentiles"
##   0%  25%  50%  75% 100% 
##    5    6    9   13   24 
## 
## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    6.00    7.00    8.68   11.00   23.00 
## [1] "percentiles"
##   0%  25%  50%  75% 100% 
##    5    6    7   11   23 
## 
## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0     7.0    12.0    11.8    15.0    30.0 
## [1] "percentiles"
##   0%  25%  50%  75% 100% 
##    5    7   12   15   30 
## 
## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0     7.0    11.0    11.3    15.0    31.0 
## [1] "percentiles"
##   0%  25%  50%  75% 100% 
##    5    7   11   15   31 
## 
## [1] "read coverage statistics per base (unified on all samples)"
## [1] "summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0     8.0    13.0    12.7    16.0    31.0 
## [1] "percentiles"
##   0%  25%  50%  75% 100% 
##    5    8   13   16   31

9 DNA methylation % and coverage plots on raw data

the percentage of DNA methylation on raw data (histogram and violin plot)
the number of read coverage per base on raw data

#### with ggplot
#### recode getMethylationStats and getCoverageStats with ggplot

dfpc <- NULL
group <- c()
print("Global methylation (raw data)")

## [1] "Global methylation (raw data)"

means <- NULL 

for(i in 1:length(dm)){
  group_sample <- strsplit(dm[[i]]@sample.id, "_")[[1]][1]

  # for each condition convert a dm to a dataframe,  add 2 columns (percentage methyl and sample id),
  # save this dataframe as df and
  dm[[i]] %>%
    getData() %>%  # native function of methylkit package, return a data.frame object
    mutate(., Meth_perc = 100 * (numCs/coverage))  %>%  # add a column for methylation percentage
    mutate(., ID = dm[[i]]@sample.id) %>%
    mutate(., group = group_sample) -> df # add a column for the sample id
  dfpc <- rbind(dfpc, df) # bind rows , build a dataframe with long format  (ajoute les dataframes les uns à las suite des autres pour céer un dataframe au format long )
  print(paste0(dm[[i]]@sample.id, ": ", signif(sum(df$Meth_perc)/dim(df)[1], digits = 4), " %"))
  means <- rbind(means, c(dm[[i]]@sample.id, signif(sum(df$Meth_perc)/dim(df)[1], digits = 4), group_sample))
  rm(df) # delete the dataframe associate to the condition
}

## [1] "WT_1: 68.94 %"
## [1] "1KO_1: 17.08 %"
## [1] "DKO_1: 9.227 %"
## [1] "WT_2: 67.48 %"
## [1] "1KO_2: 18.59 %"
## [1] "DKO_2: 12.86 %"

means <- data.frame(means)
colnames(means) <- c("ID", "Meth_perc", "group")
means$Meth_perc <- as.numeric(means$Meth_perc)

bgm <- ggplot(means, aes(x=ID, y= Meth_perc, fill = group)) + 
       geom_bar(stat = "identity")  + ggtitle(paste0("Average Methylation on raw data")) +
       labs(x = "Samples",  y = "Average Methylation in %") +
       ylim(0,100) + 
       theme_bw(base_size = 14) +  
       geom_text(data = means, aes(label = Meth_perc, y = Meth_perc - 2), size = 5, angle = 90) + 
       theme( legend.title = element_text(),
              axis.text.x = element_text(angle = 45, vjust = 0.6, hjust=0.5),
              axis.title.x=element_text(),
              plot.title = element_text(hjust = 0.5),
              plot.subtitle = element_text(hjust = 0.5))
bgm

#Plot of CpG Methylation on raw data (before filtering coverage and unite)
p <- ggplot(dfpc, aes(x=Meth_perc, fill=group, color=group)) +
  geom_histogram(position="identity",
                 alpha=0.6,
                 breaks = seq(0, 100, 5)) +
  ggtitle(paste0("% of ", LEVEL, " Methylation on raw data")) +
  facet_grid(. ~ ID ) + 
  labs(x = "% of Methylation",  y = "Counts") +
  theme_bw(base_size = 14) +  
  theme(
    legend.title = element_text(),
    axis.title.x = element_text(),
    axis.text.x = element_text(angle=45, hjust=1),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))

p
rm(dm)

vp <- ggplot(dfpc, aes(x=ID, y=Meth_perc, fill = ID)) +  
  geom_violin(trim=FALSE) +  
  geom_boxplot(width=0.05) + 
  labs( title = paste0("Violin plot of % ", LEVEL, " methylation on raw data") ,
        x = "conditions", y = "Methylation %" )  +
  theme_bw(base_size = 14) +               
  theme(axis.title.x=element_blank(),
        axis.text.x = element_text(angle = 45),
        plot.title = element_text(hjust = 0.5), 
        legend.position = "none")
vp

#Plot of CpG or Tiles Coverage on raw data (before filtering coverage and unite)
c <- ggplot(dfpc, aes(x=log2(coverage), fill=group, color=group)) +
  geom_histogram(position="identity", alpha=0.4) +
  ggtitle(paste0(LEVEL, " Coverage on raw data")) +
  facet_grid(. ~ ID ) + 
  labs(x = paste0("log2(reads) per ", LEVEL) ,  y = "Counts") +
  xlim(1,10) + 
  theme_bw(base_size = 14) +  
  theme(
    legend.title = element_text(),
    axis.title.x=element_text(),
    axis.text.x = element_text(angle = 55, hjust = 1),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))

c
rm(dfpc)

10 DNA Methylation % and coverage plots on filtered and united data

the percentage of CpG methylation on filtered and united data (histogram and violin plot)
the number of read coverage per base on filtered and united data

####  with ggplot
#### recode getMethylationStats and getCoverageStats with ggplot for filtered data
### Same ideas that for raw data but with f.dm instead of dm

f.dfpc <- NULL
group <- c()
print("Global methylation (filtered & united data)")

## [1] "Global methylation (filtered & united data)"

for(i in 1:length(f.dm)){
  group <- c(group, rep(strsplit(f.dm[[i]]@sample.id, "_")[[1]][1] , dim(f.dm[[i]])[1] ))
  
  f.dm[[i]] %>%
    getData() %>%
    mutate(., Meth_perc = 100 * (numCs/coverage))  %>%
    mutate(., ID = f.dm[[i]]@sample.id) -> f.df
  #sprintf("%s : %.2f %%", f.dm[[i]]@sample.id, sum(f.df$numCs)*100/sum(f.df$coverage))  # global methylation 
  print(paste0(f.dm[[i]]@sample.id, ": ",  signif(sum(f.df$numCs)*100/sum(f.df$coverage), digits = 4), " %"))
  f.dfpc <- rbind(f.dfpc, f.df)
  rm(f.df)
  
}

## [1] "WT_1: 75.38 %"
## [1] "1KO_1: 18.17 %"
## [1] "DKO_1: 10.13 %"
## [1] "WT_2: 66.99 %"
## [1] "1KO_2: 19.03 %"
## [1] "DKO_2: 13.11 %"

f.dfpc <- cbind(f.dfpc, group)


fp <- ggplot(f.dfpc, aes(x=Meth_perc, fill=group, color= group)) +
  geom_histogram(position="identity",
                 alpha=0.6,
                 breaks = seq(0, 100, 5)) +
  ggtitle(paste0("% of ", LEVEL, " Methylation on filtered data")) +
  facet_grid(. ~ ID) + 
  labs(x = "% of Methylation",  y = "Counts") +
  theme_bw(base_size = 14) +  
  theme(
    legend.title = element_text(),
    axis.title.x=element_text(),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))

fp


if(UNITE == "all" & notempty.fu){
  
  ##################################################################
  #### recode getMethylationStats and getCoverageStats with ggplot for unified (All) data
  
  df_fu.meth <- getData(fu.meth)
  u.dfpc <- NULL
  ID <- NULL 
  group <- NULL
  
  for(i in 1:length(fu.meth@sample.ids)){
    ID <- c(ID, rep(fu.meth@sample.ids[i], length(df_fu.meth[,1])))
    group <- c(group, rep(strsplit(fu.meth@sample.ids[i], "_")[[1]][1], length(df_fu.meth[,1])))
  }
  
  for(i in seq(from=5, to=length(fu.meth), by=3)){
    chr <- df_fu.meth[,1]
    start <- df_fu.meth[,2]
    end <- df_fu.meth[,3]
    strand <- df_fu.meth[,4]
    coverage <- df_fu.meth[,i]
    numCs <- df_fu.meth[,i+1]
    numTs <- df_fu.meth[,i+2]
    Meth_perc <- (df_fu.meth[,i+1]/df_fu.meth[,i])*100 
    df <- data.frame(chr, start, end , strand, coverage, numCs, numTs, Meth_perc)
    u.dfpc <- rbind(u.dfpc, df)
  }
  
  u.dfpc <- cbind(u.dfpc, ID)
  u.dfpc <- cbind(u.dfpc, group)
  
  
  ufp <- ggplot(u.dfpc, aes(x=Meth_perc, fill=group, color= group)) +
    geom_histogram(position="identity",
                   alpha=0.6,
                   breaks = seq(0, 100, 5)) +
    ggtitle(paste0("% of " ,LEVEL, " Methylation on filtered and united (all) data")) +
    facet_grid(. ~ ID) + 
    labs(x = "% of Methylation",  y = "Counts") +
    theme_bw(base_size = 14) +  
    theme(
      legend.title = element_text(),
      axis.title.x=element_text(),
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5),
      strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
  
  ufp
  
} else if (UNITE == "one" & notempty.fu1) {

  ##################################################################
  #### recode getMethylationStats and getCoverageStats with ggplot for unified (at least one per group) data
  
  df_fu1.meth <- getData(fu1.meth)
  df_fu1.meth[is.na(df_fu1.meth)] <- 0
  u1.dfpc <- NULL
  ID <- NULL 
  group <- NULL
  
  for(i in 1:length(fu1.meth@sample.ids)){
    ID <- c(ID, rep(fu1.meth@sample.ids[i], length(df_fu1.meth[,1])))
    group <- c(group, rep(strsplit(fu1.meth@sample.ids[i], "_")[[1]][1], length(df_fu1.meth[,1])))
  }
  
  for(i in seq(from=5, to=length(fu1.meth), by=3)){
    chr <- df_fu1.meth[,1]
    start <- df_fu1.meth[,2]
    end <- df_fu1.meth[,3]
    strand <- df_fu1.meth[,4]
    coverage <- df_fu1.meth[,i]
    numCs <- df_fu1.meth[,i+1]
    numTs <- df_fu1.meth[,i+2]
    Meth_perc <- (df_fu1.meth[,i+1]/df_fu1.meth[,i])*100 
    df1 <- data.frame(chr, start, end , strand, coverage, numCs, numTs, Meth_perc)
    u1.dfpc <- rbind(u1.dfpc, df1)
  }
  
  u1.dfpc <- cbind(u1.dfpc, ID)
  u1.dfpc <- cbind(u1.dfpc, group)
  
  u1.dfpc <- na.omit(u1.dfpc)
  
  
  u1fp <- ggplot(u1.dfpc, aes(x=Meth_perc, fill=group, color= group)) +
    geom_histogram(position="identity",
                   alpha=0.6,
                   breaks = seq(0, 100, 5)) +
    ggtitle(paste0("% of  " ,LEVEL, " Methylation on filtered and united (at least one per group) data")) +
    facet_grid(. ~ ID) + 
    labs(x = "% of Methylation",  y = "Counts") +
    theme_bw(base_size = 14) +  
    theme(
      legend.title = element_text(),
      axis.title.x=element_text(),
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5),
      strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
  
  u1fp
}

vfp <- ggplot(f.dfpc, aes(x=ID, y=Meth_perc, fill = ID)) +  
  geom_violin(trim=FALSE) +  
  geom_boxplot(width=0.05) + 
  labs( title = paste0("% of ", LEVEL, " methylation on filtered data") ,
        x = "conditions", y = "Methylation %" )  +
  theme_bw(base_size = 14) +               
  theme(axis.title.x=element_blank(),
        plot.title = element_text(hjust = 0.5), 
        legend.position = "none")
vfp


if(UNITE == "all" & notempty.fu){

  uvfp <- ggplot(u.dfpc, aes(x=ID, y=Meth_perc, fill = ID)) +  
    geom_violin(trim=FALSE) +  
    geom_boxplot(width=0.05) + 
    labs( title = paste0( "% of ", LEVEL, " methylation on filtered and united (all) data") ,
          x = "conditions", y = "Methylation %" )  +
    theme_bw(base_size = 14) +               
    theme(axis.title.x=element_blank(),
          plot.title = element_text(hjust = 0.5), 
          legend.position = "none")
  uvfp

}else if (UNITE == "one" & notempty.fu1){

  u1vfp <- ggplot(u1.dfpc, aes(x=ID, y=Meth_perc, fill = ID)) +  
    geom_violin(trim=FALSE) +  
    geom_boxplot(width=0.05) + 
    labs( title = paste0("% of ", LEVEL , " methylation on filtered and united (at least one per group) data") ,
          x = "conditions", y = "Methylation %" )  +
    theme_bw(base_size = 14) +               
    theme(axis.title.x=element_blank(),
          plot.title = element_text(hjust = 0.5), 
          legend.position = "none")
  u1vfp
}

fc <- ggplot(f.dfpc, aes(x=log2(coverage), fill=group, color=group)) +
  geom_histogram(position="identity", alpha=0.4) +
  ggtitle(paste0(LEVEL, " Coverage on filtered data")) +
  facet_grid(. ~ ID) + 
  labs(x = paste0("log2(reads) per ", LEVEL) ,  y = "Counts") +
  theme_bw(base_size = 14) +  
  theme(
    legend.title = element_text(),
    axis.title.x=element_text(),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))

fc


if(UNITE == "all" & notempty.fu){

  ufc <- ggplot(u.dfpc, aes(x=log2(coverage), fill=group, color=group)) +
    geom_histogram(position="identity", alpha=0.4) +
    ggtitle(paste0(LEVEL, " Coverage on filtered and united (all) data")) +
    facet_grid(. ~ ID) + 
    labs(x = paste0("log2(reads) per ", LEVEL) , y = "Counts") +
    theme_bw(base_size = 14) +  
    theme(
      legend.title = element_text(),
      axis.title.x=element_text(),
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5),
      strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
  
  ufc

}else if (UNITE == "one" & notempty.fu1){

  u1fc <- ggplot(u1.dfpc, aes(x=log2(coverage), fill=group, color=group)) +
    geom_histogram(position="identity", alpha=0.4) +
    ggtitle(paste0(LEVEL, " Coverage on filtered and united (at least one per group) data")) +
    facet_grid(. ~ ID) + 
    labs(x = "Log2 of read coverage per base",  y = "Counts") +
    theme_bw(base_size = 14) +  
    theme(
      legend.title = element_text(),
      axis.title.x=element_text(),
      plot.title = element_text(hjust = 0.5),
      plot.subtitle = element_text(hjust = 0.5),
      strip.text.x = element_text(size = 12, color = "black", face = "bold.italic"))
  
  u1fc
}


rm(dfpc)

11 DNA Methylation percentage by coverage

if(UNITE == "all" & notempty.fu){
  
  met_cov <- ggplot(u.dfpc, aes(x= Meth_perc,  y=log2(coverage))) + 
             geom_point() + 
             labs(title = "DNA Methylation percentage by coverage") +
             xlab("Methylation percentage") +
             ylab("Coverage (Log2)") +
             theme_bw(base_size = 14)
  print(met_cov)
  
}else if (UNITE == "one" & notempty.fu1){
  
    met_cov <- ggplot(u1.dfpc, aes(x= Meth_perc,  y=log2(coverage))) + 
               geom_point() + 
               labs(title = "DNA Methylation percentage by coverage") +  
               xlab("Methylation percentage") + 
               ylab("Coverage (Log2)") +
               theme_bw(base_size = 14)
    print(met_cov)
}

12 Clustering

Dendrogram is performed on united data with “euclidean” distance and “ward.D” clustering methods

## Clustering
### useless on merged samples

if(dim(meta)[1] > 2){

  if(UNITE == "all" & notempty.fu){
    
    mat <- getData(fu.meth)
    mat <- mat[ rowSums(is.na(mat))==0, ] 
    meth.mat <- mat[, fu.meth@numCs.index]/(mat[,fu.meth@numCs.index] + mat[,fu.meth@numTs.index] )          
    names(meth.mat)=fu.meth@sample.ids
    
    d <- dist(scale(t(meth.mat)), method = "euclidean")
    hc <- hclust(d, method = "ward.D")
    plot(hc, main = "Cluster Dendrogram (Unite all)")

  
  } else if (UNITE == "one" & notempty.fu1) {
  
    mat1 <- getData(fu1.meth)
    mat1 <- mat1[ rowSums(is.na(mat1))==0, ]
    meth.mat1 <- mat1[, fu1.meth@numCs.index]/(mat1[,fu1.meth@numCs.index] + mat1[,fu1.meth@numTs.index] )
    names(meth.mat1)=fu1.meth@sample.ids
    
    d1 <- dist(scale(t(meth.mat1)), method = "euclidean")
    hc1 <- hclust(d1, method = "ward.D")
    plot(hc1, main = "Cluster Dendrogram (at least one per group) ")
    
  }
}

13 Principal component analysis (PCA)

Each sample is colored by condition
By defaut the PCA is perform on data united (unite all)
but you can change for a unification at least one per group

if(dim(meta)[1] > 2){
  
  if(UNITE == "all" & notempty.fu){
  
    res.pca <- PCA(t(meth.mat), ncp = 5, graph = FALSE, scale.unit = TRUE)
    screeplot <- fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50), main= "Scree plot PCA unite all  ")
    pca <- fviz_pca_ind(res.pca,  title = "PCA on filtered and united data (unite all)", col.ind = meta$group)
  
    print(pca)
    print(screeplot)
    rm(mat)
    rm(meth.mat)
  
  }else if (UNITE == "one" & notempty.fu1){
  
    res.pca1 <- PCA(t(meth.mat1), ncp = 5, graph = FALSE, scale.unit = TRUE)
    screeplot <- fviz_eig(res.pca1, addlabels = TRUE, ylim = c(0, 50), main= "Scree plot PCA (at least one per group)  ")
    pca <- fviz_pca_ind(res.pca1,  title = "PCA on filtered and united data (at least one per group)", col.ind = meta$group, repel = TRUE)
  
    print(pca)
    print(screeplot)
    rm(mat1)
    rm(meth.mat1)
  }
}

14 Correlations

Perform a correlation plot (methylKit package) on data on filtered and unified CpG

## Correlations
if(UNITE == "all" & notempty.fu){
  getCorrelation(fu.meth,plot=TRUE)
} else if (UNITE == "one" & notempty.fu1) {
  getCorrelation(fu1.meth,plot=TRUE)
}

##         WT_1  1KO_1  DKO_1   WT_2  1KO_2  DKO_2
## WT_1  1.0000 0.4309 0.2523 0.8991 0.4118 0.3126
## 1KO_1 0.4309 1.0000 0.2357 0.4380 0.4642 0.2624
## DKO_1 0.2523 0.2357 1.0000 0.2597 0.2632 0.5544
## WT_2  0.8991 0.4380 0.2597 1.0000 0.4244 0.3259
## 1KO_2 0.4118 0.4642 0.2632 0.4244 1.0000 0.2753
## DKO_2 0.3126 0.2624 0.5544 0.3259 0.2753 1.0000

15 DNA methylation segmentation and annotation

Methylator performs a DNA methylation segmentation analysis by samples (global) and at genomic features on filtered and unified CpG

15.1 Segmentation

The methylation status is categorised into five levels:
- Full : 100% methylated
- High : between 75 and 100% excluded
- Mid : between 25 et 75% exluded
- Low : between 0 excluded and 25% excluded
- UnMeth : 100% unmethylated

15.2 Genomic features

15.2.1 What is a gene?

Five gene features are represented
- Genes
- Intergenic
- Introns
- Exons
- Promoters

15.2.2 CpG Islands, Shores and Shelves

In this annotations you can see three types of regions around CGI :
- CpG Islands
- CpG Shores
- CpG Shelves

Schematic gene annotations

Images source : annotatr vignette

15.3 Global Methylation status by sample

For each sample, Methylator gives the number of CpG with a given methylation status

load(annotations_path)



if(UNITE == "all" & notempty.fu){

  df_fu <- getData(fu.meth)
  df_final <- data.frame()
  sample_cpt <- 1 
  
  for (i in seq(5, length(fu.meth), by=3)) {

    df_f.meth_annot <- NULL
    df <- cbind(df_fu[, 0:4], df_fu[, i:(i+2)])
    meth_perct <- df[,6] / df[,5]
  
    meth_status <- ifelse(meth_perct == 0  , "UnMeth", ifelse(meth_perct ==  1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
    
    sample <- rep(fu.meth@sample.ids[sample_cpt], dim(fu.meth)[1])
    df <- cbind(df, meth_status, sample)
    df_f.meth_annot <- rbind(df_f.meth_annot, df)
    colnames(df_f.meth_annot) <- c("chr" ,"start" ,"end" , "strand" , "coverage" , "numCs" , "numTs" , "meth_status" , "sample")
    
    if(sample_cpt ==  1){
      df_final <- df_f.meth_annot
    }
    else{
      df_final <- rbind(df_final, df_f.meth_annot)
    }
    sample_cpt <- sample_cpt + 1 
  }
  
  
  bps <-  ggplot(df_final, aes(x= sample, fill = factor(meth_status,  levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
              geom_bar() + 
              scale_fill_manual("Methylation status  ",
                      values = c( 'Full' = full.color,
                                  'High' = high.color,
                                  'Mid'  = mid.color,
                                  'Low'  = low.color, 
                                  'UnMeth' = unmeth.color)) +
              theme_bw(base_size = 14) +
              labs(title = paste0('Methylation status by sample (absolut counts) '),
                   x = "Samples",  y = paste0("Number of ", LEVEL)) + 
              theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
                    axis.title.x = element_text(size = 12 ), 
                    axis.title.y = element_text(size = 12 ))
  
    print(bps)
    
    
} else if (UNITE == "one" & notempty.fu1) {
  
  
  df_fu1 <- getData(fu1.meth)
  df_final <- data.frame()
  sample_cpt <- 1 
  
  for (i in seq(5, length(fu1.meth), by=3)) {
    
    df_f.meth_annot <- NULL
    df <- cbind(df_fu1[, 0:4], df_fu1[, i:(i+2)])
    df <- na.omit(df)
    meth_perct <- df[,6] / df[,5]
  
    meth_status <- ifelse(meth_perct == 0  , "UnMeth", ifelse(meth_perct ==  1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
    
    sample <- rep(fu1.meth@sample.ids[sample_cpt], dim(df)[1])
    df <- cbind(df, meth_status, sample)
    df_f.meth_annot <- rbind(df_f.meth_annot, df)
    colnames(df_f.meth_annot) <- c("chr" ,"start" ,"end" , "strand" , "coverage" , "numCs" , "numTs" , "meth_status" , "sample")
    
    if(sample_cpt ==  1){
      df_final <- df_f.meth_annot
    }
    else{
      df_final <- rbind(df_final, df_f.meth_annot)
    }
    sample_cpt <- sample_cpt + 1 
    
  }
  
  bps <-  ggplot(df_final, aes(x= sample, fill = factor(meth_status,  levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
              geom_bar(position = "fill") + 
              scale_fill_manual("Methylation status ",
                      values = c( 'Full' = full.color,
                                  'High' = high.color,
                                  'Mid'  = mid.color,
                                  'Low'  = low.color, 
                                  'UnMeth' = unmeth.color)) +
              theme_bw(base_size = 14) +
              labs(title = paste0('Methylation status by Sample  '),
                   x = "Samples",  y = paste0("Number of ", LEVEL)) + 
              theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
                    axis.title.x = element_text(size = 12 ), 
                    axis.title.y = element_text(size = 12 ))
  
    print(bps)

}

rm(df_fu)
rm(df)
rm(meth_perct)
rm(meth_status)
rm(df_final)

15.4 Methylation status at genomic features

Now for each sample, methylator gives the number (1st plot) and the proportion(2nd plot) of CpG with a given methylation status
according to basic genomic annotations (Genes & CGI)

df__f.meth_annot_glb <- NULL 

# Plot with fu.meth  and fu1.meth!!! 

if(UNITE == "all" & notempty.fu){

  df_fu <- getData(fu.meth)
  y <- 1
  
  for (i in seq(5, length(fu.meth), by=3)) {
    
    df_f.meth_annot <- NULL
    df <- cbind(df_fu[, 0:4], df_fu[, i:(i+2)])
    meth_perct <- df[,6] / df[,5]
  
    meth_status <- ifelse(meth_perct == 0  , "UnMeth", ifelse(meth_perct ==  1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
    
    df <- cbind(df, meth_status)
    df_f.meth_annot <- rbind(df_f.meth_annot, df)
    
    dfpc_regions <- makeGRangesFromDataFrame(df_f.meth_annot ,
                                             keep.extra.columns=TRUE,
                                             seqinfo= annotations@seqinfo,
                                             ignore.strand=F)
    
    
    ##########################################################################################
    
    
    dfpc_annotated <- annotate_regions(regions = dfpc_regions,
                                       annotations = annotations,
                                       ignore.strand = T,
                                       quiet = FALSE)
    
    dfpc_annotated <- dfpc_annotated  %>% data.frame
    dfpc_annotated$annot.type <- factor(dfpc_annotated$annot.type , levels = IGV)
  
    
    ##########################################################################################
    
    
    bpmaa <-  ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status,  levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
              geom_bar() + 
              scale_fill_manual("Methylation status  ",
                      values = c( 'Full' = full.color,
                                  'High' = high.color,
                                  'Mid'  = mid.color,
                                  'Low'  = low.color, 
                                  'UnMeth' = unmeth.color)) +
              theme_bw(base_size = 14) +
              labs(title = paste0('Methylation status by annotation (absolut counts) '),
                   subtitle = paste0(f.dm[[y]]@sample.id), 
                   x = "Annotations",  y = paste0("Number of ", LEVEL)) + 
              theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
                    axis.title.x = element_text(size = 12 ), 
                    axis.title.y = element_text(size = 12 ))
  
    print(bpmaa)
    
    
    bpmare <-  ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status,  levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
              geom_bar(position = "fill") + 
              scale_fill_manual("Methylation status  ",
                      values = c( 'Full' = full.color,
                                  'High' = high.color,
                                  'Mid'  = mid.color,
                                  'Low'  = low.color, 
                                  'UnMeth' = unmeth.color)) +
              theme_bw(base_size = 14) +
              labs(title = paste0('Methylation status by annotation (relative counts)  '),
                   subtitle = paste0(f.dm[[y]]@sample.id), 
                   x = "Annotations",  y = paste0("Number of ", LEVEL)) + 
              theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
                    axis.title.x = element_text(size = 12 ), 
                    axis.title.y = element_text(size = 12 ))
  
    print(bpmare)
    
    y <- y+1
  }

  
  
} else if (UNITE == "one" & notempty.fu1) {


  
  df_fu1 <- getData(fu1.meth)
  y <- 1
  
  for (i in seq(5, length(fu1.meth), by=3)) {
    
    df_f.meth_annot <- NULL
    df <- cbind(df_fu1[, 0:4], df_fu1[, i:(i+2)])
    df <- na.omit(df)
    meth_perct <- df[,6] / df[,5]
  
    meth_status <- ifelse(meth_perct == 0  , "UnMeth", ifelse(meth_perct ==  1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
    
    df <- cbind(df, meth_status)
    df_f.meth_annot <- rbind(df_f.meth_annot, df)
    
    dfpc_regions <- makeGRangesFromDataFrame(df_f.meth_annot ,
                                             keep.extra.columns=TRUE,
                                             seqinfo= annotations@seqinfo,
                                             ignore.strand=F)
    
    dfpc_annotated <- annotate_regions(regions = dfpc_regions,
                                       annotations = annotations,
                                       ignore.strand = T,
                                       quiet = FALSE)
    
    
    dfpc_annotated <- dfpc_annotated  %>% data.frame
    dfpc_annotated$annot.type <- factor(dfpc_annotated$annot.type , levels = IGV)
    
    bpmaa <-  ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status,  levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
              geom_bar() + 
              scale_fill_manual("Methylation status ",
                      values = c( 'Full' = full.color,
                                  'High' = high.color,
                                  'Mid'  = mid.color,
                                  'Low'  = low.color, 
                                  'UnMeth' = unmeth.color)) +
              theme_bw(base_size = 14) +
              labs(title = paste0('Methylation status by annotation (absolut counts) '),
                   subtitle = paste0(f.dm[[y]]@sample.id), 
                   x = "Annotations",  y = paste0("Number of ", LEVEL)) + 
              theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
                    axis.title.x = element_text(size = 12 ), 
                    axis.title.y = element_text(size = 12 ))
  
    print(bpmaa)
    
    bpmare <-  ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status,  levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
              geom_bar(position = "fill") + 
              scale_fill_manual("Methylation status ",
                      values = c( 'Full' = full.color,
                                  'High' = high.color,
                                  'Mid'  = mid.color,
                                  'Low'  = low.color, 
                                  'UnMeth' = unmeth.color)) +
              theme_bw(base_size = 14) +
              labs(title = paste0('Methylation status by annotation (relative counts)'),
                   subtitle = paste0(f.dm[[y]]@sample.id), 
                   x = "Annotations",  y = paste0("Number of ", LEVEL)) + 
              theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
                    axis.title.x = element_text(size = 12 ), 
                    axis.title.y = element_text(size = 12 ))
  
    print(bpmare)
    
    y <- y+1
  }

}

## Annotating...

## Annotating...

## Annotating...

## Annotating...

## Annotating...

## Annotating...

15.5 Methylation status at custom genomic features

Now for each sample, methylator gives the number (1st plot) and the proportion(2nd plot) of CpG with a given methylation status
according custom genomic features & tracks

if((CUSTOM_ANNOT == TRUE) && (merge_annot == FALSE)){

  library(GenomicRanges)
  
  df_custom_tracks <- data.frame(list_custom_tracks[[1]])
  
  for(i in 2:length(list_custom_tracks)){
    df <- data.frame(list_custom_tracks[[i]])
    df_custom_tracks <- rbind(df_custom_tracks, df)
  }
  GR_custom_tracks <- makeGRangesFromDataFrame(df_custom_tracks, keep.extra.columns = TRUE)
  
  
  
  if(UNITE == "all" & notempty.fu){
  
    df_fu <- getData(fu.meth)
    j <- 1
  
    for (i in seq(5, length(fu.meth), by=3)) {
  
      for(y in 1:length(unique(meta_annot$group))){
  
        df_f.meth_annot <- NULL
        df <- cbind(df_fu[, 0:4], df_fu[, i:(i+2)])
        meth_perct <- df[,6] / df[,5]
  
        meth_status <- ifelse(meth_perct == 0  , "UnMeth", ifelse(meth_perct ==  1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
  
        df <- cbind(df, meth_status)
        df_f.meth_annot <- rbind(df_f.meth_annot, df)
  
  
        dfpc_regions <- makeGRangesFromDataFrame(df_f.meth_annot ,
                                                 keep.extra.columns=TRUE,
                                                 #seqinfo= annotations@seqinfo,
                                                 ignore.strand=F)
  
        dfpc_annotated <- annotate_regions(regions = dfpc_regions,
                                            annotations = GR_custom_tracks[GR_custom_tracks$group == unique(meta_annot$group)[y] ],
                                            ignore.strand = TRUE,
                                            quiet = TRUE)
  
        dfpc_annotated <- dfpc_annotated  %>% data.frame
        dfpc_annotated$annot.type <- factor(dfpc_annotated$annot.type)
  
        print("-------------------------------------------------------")
        print(meta_annot$group[y])
        print("-------------------------------------------------------")
  
        bpmaa <-  ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status,  levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
                  geom_bar() +
                  scale_fill_manual("Methylation status ",
                          values = c( 'Full' = full.color,
                                      'High' = high.color,
                                      'Mid'  = mid.color,
                                      'Low'  = low.color,
                                      'UnMeth' = unmeth.color)) +
                  theme_bw(base_size = 14) +
                  labs(title = paste0('Methylation status by annotation (absolut counts) '),
                       subtitle = paste0(f.dm[[j]]@sample.id),
                       x = "Annotations",  y = paste0("Number of ", LEVEL)) +
                  theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
                        axis.title.x = element_text(size = 12 ),
                        axis.title.y = element_text(size = 12 ))
  
        print(bpmaa)
  
        bpmare <-  ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status,  levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
                  geom_bar(position = "fill") +
                  scale_fill_manual("Methylation status ",
                          values = c( 'Full' = full.color,
                                      'High' = high.color,
                                      'Mid'  = mid.color,
                                      'Low'  = low.color,
                                      'UnMeth' = unmeth.color)) +
                  theme_bw(base_size = 14) +
                  labs(title = paste0('Methylation status by annotation (relative counts)  '),
                       subtitle = paste0(f.dm[[j]]@sample.id),
                       x = "Annotations",  y = paste0("Number of ", LEVEL, " ")) +
                  theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
                        axis.title.x = element_text(size = 12 ),
                        axis.title.y = element_text(size = 12 ))
  
        print(bpmare)
  
      }
      j <- j+1
    }
  
  } else if (UNITE == "one" & notempty.fu1) {
  
  
    df_fu1 <- getData(fu1.meth)
    y <- 1
  
    for (i in seq(5, length(fu1.meth), by=3)) {
  
      df_f.meth_annot <- NULL
      df <- cbind(df_fu1[, 0:4], df_fu1[, i:(i+2)])
      df <- na.omit(df)
      meth_perct <- df[,6] / df[,5]
  
      meth_status <- ifelse(meth_perct == 0  , "UnMeth", ifelse(meth_perct ==  1, "Full", ifelse(meth_perct < 1/4, "Low", ifelse(meth_perct > 3/4, "High", "Mid"))))
  
      df <- cbind(df, meth_status)
      df_f.meth_annot <- rbind(df_f.meth_annot, df)
  
      dfpc_regions <- makeGRangesFromDataFrame(df_f.meth_annot ,
                                               keep.extra.columns=TRUE,
                                               seqinfo= annotations@seqinfo,
                                               ignore.strand=F)
  
      dfpc_annotated <- annotate_regions(regions = dfpc_regions,
                                         annotations = annotations,
                                         ignore.strand = T,
                                         quiet = FALSE)
  
  
      dfpc_annotated <- dfpc_annotated  %>% data.frame
      dfpc_annotated$annot.type <- factor(dfpc_annotated$annot.type , levels = IGV)
  
      bpmaa <-  ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status,  levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
                geom_bar() +
                scale_fill_manual("Methylation status ",
                        values = c( 'Full' = full.color,
                                    'High' = high.color,
                                    'Mid'  = mid.color,
                                    'Low'  = low.color,
                                    'UnMeth' = unmeth.color)) +
                theme_bw(base_size = 14) +
                labs(title = paste0('Methylation status by annotation (absolut counts) '),
                     subtitle = paste0(f.dm[[y]]@sample.id),
                     x = "Annotations",  y = paste0("Number of ", LEVEL)) +
                theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
                      axis.title.x = element_text(size = 12 ),
                      axis.title.y = element_text(size = 12 ))
  
      print(bpmaa)
  
      bpmare <-  ggplot(dfpc_annotated, aes(x= annot.type, fill = factor(meth_status,  levels=c("Full", "High", "Mid", "Low", "UnMeth")))) +
                geom_bar(position = "fill") +
                scale_fill_manual("Methylation status ",
                        values = c( 'Full' = full.color,
                                    'High' = high.color,
                                    'Mid'  = mid.color,
                                    'Low'  = low.color,
                                    'UnMeth' = unmeth.color)) +
                theme_bw(base_size = 14) +
                labs(title = paste0('Methylation status by annotation (relative counts)  '),
                     subtitle = paste0(f.dm[[y]]@sample.id),
                     x = "Annotations",  y = paste0("Number of ", LEVEL, " ")) +
                theme(axis.text.x = element_text( size = 8, angle = 45, vjust = 1, hjust = 1),
                      axis.title.x = element_text(size = 12 ),
                      axis.title.y = element_text(size = 12 ))
  
      print(bpmare)
  
      y <- y+1
    }
  }
}


#rm(fu.meth)
rm(fu1.meth)
rm(dfpc_annotated)
rm(dfpc_regions)

16 Enriched Heatmap filtered data only (TSS)

# list_sample_gr <- NULL
# list_sample_id <- NULL
# 
# for(i in 1:length(f.dm)){
#   sample_gr <- data.frame(f.dm[[i]])
#   sample_gr <- makeGRangesFromDataFrame(sample_gr , keep.extra.columns = TRUE)
#   meth_perct <- (sample_gr$numCs/ (sample_gr$numCs + sample_gr$numTs ))*100
#   meth_perct[is.na(meth_perct)] <- 0 
#   sample_gr$meth_perct <- meth_perct
#   sample_gr@strand@values <- rep("*", length(sample_gr@strand@values)) 
# 
#   list_sample_gr <- c(list_sample_gr, sample_gr)
#   list_sample_id <- c(list_sample_id, f.dm[[i]]@sample.id)
#   rm(sample_gr)
#   rm(meth_perct)
# }
# 
# names(list_sample_gr) <- list_sample_id
# 
# genes_gr <- annotations[annotations$type == paste0(ORG,"_genes")]
# tss = IRanges::promoters(genes_gr, upstream = 0, downstream = 1)
# i <- 1 
# list_heatmap <- NULL
# 
# for(group in names(list_sample_gr)){
#   
#   name <- paste0("enrichedHeatmap_", group, sep = "")
#   meth_sample <- data.frame(list_sample_gr[i])
#   colnames(meth_sample) <- c("seqnames", "start", "end", "width", "strand", "coverage", "numCs", "numTs", "meth_perct")
#   meth_sample <- makeGRangesFromDataFrame(meth_sample, keep.extra.columns = TRUE)
#   mat = normalizeToMatrix(meth_sample, tss, value_column = "meth_perct", mean_mode = "absolute", extend = 5000, w = 50, background = NA, smooth = TRUE)
#   enrich_heatmap <- EnrichedHeatmap(mat, name = paste0("methylation", group) , column_title = group, top_annotation = HeatmapAnnotation(
#                                                                                       enriched = anno_enriched(
#                                                                                       ylim = c(0, 100),
#                                                                                       axis_param = list(
#                                                                                       side = "left",
#                                                                                       facing = "inside"
#                                                                                       ))))
#   assign(name, enrich_heatmap)
#   list_heatmap <- list_heatmap + get(name)
#   i <- i + 1 
# }
# 
# 
# draw(list_heatmap, show_heatmap_legend = FALSE, row_title = "Enriched Heatmap, Methylation near TSS")

17 Enriched Heatmap filtered data only (custom annotations)

# df_gr_cust_all <- data.frame(list_custom_tracks[[1]])
# 
# for(i in 2:length(list_custom_tracks)){
# 
#   df_gr_cust <- data.frame(list_custom_tracks[[i]])
#   df_gr_cust_all <- rbind(df_gr_cust_all, df_gr_cust)
# }
# 
# gr_cust_all <- makeGRangesFromDataFrame(df_gr_cust_all, keep.extra.columns = TRUE)
# 
# for(i in unique(meta_annot$group)){
# 
#   custom <- gr_cust_all[gr_cust_all$group == i]
#   mean_width <- round(mean(custom@ranges@width), digits = 0)
#   custom <- resize(custom, width = mean_width , fix = "center")
#   custom_name <- custom$group[1]
# 
#   sml <- ScoreMatrixList(targets=list_sample_gr, windows=custom, bin.num=50, strand.aware=TRUE, is.noCovNA = TRUE, weight.col = "numCs")
#   sml.sub =intersectScoreMatrixList(sml, reorder = FALSE)
#   multiHeatMatrix(sml.sub,  xlab= custom_name  , matrix.main=names(list_sample_gr))
#   plotMeta(sml.sub )
# }
# 
# rm(gr_cust_all)
# rm(df_gr_cust_all)
# rm(custom, custom_name)
# rm(sml, sml.sub)

18 Save RData

Saves the whole R session of the script (variables, objects …)
at the end of its execution. The RData file produced can then be opened in a local environment (if you have enough space)
to make modifications. For example to retouch figures.

rm(annotations)

# save the R Session in RData folder
save.image(file = rdata_out)

19 R session information

# R Session
sessionInfo()

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 11 (bullseye)
## 
## Matrix products: default
## BLAS/LAPACK: /opt/conda/envs/wgbsflow/lib/libopenblasp-r0.3.21.so
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8          LC_NUMERIC=C             
##  [3] LC_TIME=C.UTF-8           LC_COLLATE=C.UTF-8       
##  [5] LC_MONETARY=C.UTF-8       LC_MESSAGES=C.UTF-8      
##  [7] LC_PAPER=C.UTF-8          LC_NAME=C.UTF-8          
##  [9] LC_ADDRESS=C.UTF-8        LC_TELEPHONE=C.UTF-8     
## [11] LC_MEASUREMENT=C.UTF-8    LC_IDENTIFICATION=C.UTF-8
## 
## attached base packages:
## [1] stats4    grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GenomicFeatures_1.46.1 AnnotationDbi_1.56.2   Biobase_2.54.0        
##  [4] annotatr_1.20.0        stringr_1.5.0          FactoMineR_2.6        
##  [7] factoextra_1.0.7       yaml_2.3.5             xlsx_0.6.5            
## [10] ggplot2_3.3.6          magrittr_2.0.3         tidyr_1.3.0           
## [13] dplyr_1.0.10           methylKit_1.20.0       genomation_1.26.0     
## [16] readr_2.1.2            tzdb_0.3.0             data.table_1.14.2     
## [19] mixtools_1.2.0         EnrichedHeatmap_1.24.0 GenomicRanges_1.46.1  
## [22] GenomeInfoDb_1.30.1    IRanges_2.28.0         S4Vectors_0.32.4      
## [25] BiocGenerics_0.40.0    ComplexHeatmap_2.10.0 
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.3                    R.utils_2.12.2               
##   [3] tidyselect_1.2.0              RSQLite_2.2.20               
##   [5] htmlwidgets_1.6.1             BiocParallel_1.28.3          
##   [7] munsell_0.5.0                 codetools_0.2-19             
##   [9] DT_0.27                       withr_2.5.0                  
##  [11] colorspace_2.1-0              filelock_1.0.2               
##  [13] highr_0.10                    knitr_1.42                   
##  [15] leaps_3.1                     ggsignif_0.6.4               
##  [17] fastseg_1.40.0                rJava_1.0-6                  
##  [19] labeling_0.4.2                MatrixGenerics_1.6.0         
##  [21] emmeans_1.8.4-1               bbmle_1.0.25                 
##  [23] GenomeInfoDbData_1.2.7        farver_2.1.1                 
##  [25] seqPattern_1.26.0             bit64_4.0.5                  
##  [27] coda_0.19-4                   vctrs_0.5.2                  
##  [29] generics_0.1.3                xfun_0.37                    
##  [31] BiocFileCache_2.2.0           regioneR_1.26.0              
##  [33] R6_2.5.1                      doParallel_1.0.17            
##  [35] clue_0.3-64                   locfit_1.5-9.7               
##  [37] bitops_1.0-7                  cachem_1.0.6                 
##  [39] DelayedArray_0.20.0           assertthat_0.2.1             
##  [41] promises_1.2.0.1              BiocIO_1.4.0                 
##  [43] scales_1.2.1                  gtable_0.3.1                 
##  [45] multcompView_0.1-8            rlang_1.0.6                  
##  [47] scatterplot3d_0.3-42          GlobalOptions_0.1.2          
##  [49] splines_4.1.3                 rstatix_0.7.2                
##  [51] rtracklayer_1.54.0            impute_1.68.0                
##  [53] broom_1.0.3                   abind_1.4-5                  
##  [55] BiocManager_1.30.19           reshape2_1.4.4               
##  [57] backports_1.4.1               httpuv_1.6.8                 
##  [59] qvalue_2.26.0                 tools_4.1.3                  
##  [61] gridBase_0.4-7                ellipsis_0.3.2               
##  [63] jquerylib_0.1.4               RColorBrewer_1.1-3           
##  [65] Rcpp_1.0.10                   plyr_1.8.8                   
##  [67] progress_1.2.2                zlibbioc_1.40.0              
##  [69] purrr_1.0.1                   RCurl_1.98-1.10              
##  [71] prettyunits_1.1.1             ggpubr_0.5.0                 
##  [73] GetoptLong_1.0.5              SummarizedExperiment_1.24.0  
##  [75] ggrepel_0.9.3                 cluster_2.1.4                
##  [77] circlize_0.4.15               mvtnorm_1.1-3                
##  [79] matrixStats_0.63.0            hms_1.1.2                    
##  [81] xlsxjars_0.6.1                mime_0.12                    
##  [83] evaluate_0.20                 xtable_1.8-4                 
##  [85] XML_3.99-0.11                 emdbook_1.3.12               
##  [87] mclust_6.0.0                  shape_1.4.6                  
##  [89] compiler_4.1.3                biomaRt_2.50.0               
##  [91] bdsmatrix_1.3-6               tibble_3.1.8                 
##  [93] KernSmooth_2.23-20            crayon_1.5.2                 
##  [95] R.oo_1.25.0                   htmltools_0.5.4              
##  [97] segmented_1.6-2               later_1.3.0                  
##  [99] DBI_1.1.3                     dbplyr_2.3.0                 
## [101] MASS_7.3-58.2                 rappdirs_0.3.3               
## [103] car_3.1-1                     Matrix_1.5-3                 
## [105] cli_3.6.0                     R.methodsS3_1.8.2            
## [107] parallel_4.1.3                pkgconfig_2.0.3              
## [109] flashClust_1.01-2             GenomicAlignments_1.30.0     
## [111] numDeriv_2016.8-1.1           xml2_1.3.3                   
## [113] foreach_1.5.2                 bslib_0.4.2                  
## [115] XVector_0.34.0                estimability_1.4.1           
## [117] digest_0.6.31                 Biostrings_2.62.0            
## [119] rmarkdown_2.20                restfulr_0.0.15              
## [121] curl_4.3.3                    kernlab_0.9-32               
## [123] shiny_1.7.4                   Rsamtools_2.10.0             
## [125] gtools_3.9.4                  rjson_0.2.21                 
## [127] lifecycle_1.0.3               nlme_3.1-162                 
## [129] jsonlite_1.8.4                carData_3.0-5                
## [131] limma_3.50.3                  BSgenome_1.62.0              
## [133] fansi_1.0.4                   pillar_1.8.1                 
## [135] lattice_0.20-45               KEGGREST_1.34.0              
## [137] fastmap_1.1.0                 httr_1.4.4                   
## [139] plotrix_3.8-2                 survival_3.5-0               
## [141] interactiveDisplayBase_1.32.0 glue_1.6.2                   
## [143] png_0.1-8                     iterators_1.0.14             
## [145] BiocVersion_3.14.0            bit_4.0.5                    
## [147] stringi_1.7.12                sass_0.4.5                   
## [149] blob_1.2.3                    AnnotationHub_3.2.0          
## [151] memoise_2.0.1

DNA methylation exploratory analysis on CpG for 5mC

Authors

Olivier Kirsh1

Elouan Bethuel2

28 May, 2024