Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff_methylsig not working on tile_by_windows() object #47

Open
anairlema opened this issue Jul 17, 2020 · 8 comments
Open

diff_methylsig not working on tile_by_windows() object #47

anairlema opened this issue Jul 17, 2020 · 8 comments

Comments

@anairlema
Copy link

Dear Raymond,
Thank you for your previous help on "input methylRaw object #46": methyldacker works very good and fast for us!
I'm sorry but we need again your help!
We have done the analysis for DMCs and it works good. Here you can find our workflow:

library(methylSig)
library(bsseq)
files=c("R1T0Sorted_CpG.bedGraph", "R2T0Sorted_CpG.bedGraph","R3T0Sorted_CpG.bedGraph","R4T0Sorted_CpG.bedGraph","R5T0Sorted_CpG.bedGraph","NR1T0Sorted_CpG.bedGraph","NR2T0Sorted_CpG.bedGraph","NR3T0Sorted_CpG.bedGraph","NR4T0Sorted_CpG.bedGraph","NR5T0Sorted_CpG.bedGraph")
sampleTable=data.frame(row.names=c("R1T0","R2T0","R3T0","R4T0","R5T0","NR1T0","NR2T0","NR3T0","NR4T0","NR5T0"), >condition= c("RT0", "RT0", "RT0", "RT0","RT0","NRT0","NRT0","NRT0","NRT0","NRT0"))
bsseq_stranded= bsseq::read.bismark(files, colData= sampleTable, rmZeroCov = FALSE, strandCollapse = FALSE)
bs = filter_loci_by_coverage(bsseq_stranded, min_count = 10, max_count = 450)
bs = filter_loci_by_group_coverage(bs = bs ,group_column = "condition",c("RT0" = 4, "NRT0" = 4))
diff_gr = diff_methylsig(bs = bs,group_column ="condition", comparison_groups = c("case"= "RT0", "control"= "NRT0"), disp_groups = c("case"= TRUE, "control"= TRUE), local_window_size = 200, t_approx = TRUE)
diff_gr
GRanges object with 1738232 ranges and 9 metadata columns:
seqnames ranges strand | meth_case meth_control meth_diff
|
[1] chr1 10496 * | 91.1583 93.0084 -1.850019
[2] chr1 10524 * | 91.4252 93.1693 -1.744116
[3] chr1 10541 * | 91.5805 93.2643 -1.683799
[4] chr1 133164 * | 96.0348 95.0444 0.990369
[5] chr1 133179 * | 96.0451 95.0886 0.956530
... ... ... ... . ... ... ...
[1738228] chrM 15761 * | 0.00611571 0.255843 -0.249728
[1738229] chrM 15927 * | 0.58592912 0.434289 0.151640
[1738230] chrM 16361 * | 1.15901535 0.000000 1.159015
[1738231] chrM 16412 * | 1.01423891 0.000000 1.014239
[1738232] chrM 16428 * | 0.97455756 0.000000 0.974558
direction pvalue fdr disp_est log_lik_ratio

[1] NRT0 0.171782 0.359444 27.1023 1.98211
[2] NRT0 0.176765 0.368282 28.2966 1.93061
[3] NRT0 0.191406 0.393883 29.0562 1.80436
[4] RT0 0.221074 0.444723 147.7053 1.62245
[5] RT0 0.253027 0.497847 141.0267 1.40681
... ... ... ... ... ...
[1738228] NRT0 1.00000e+00 1.00000e+00 1e+06 -91875.8
[1738229] RT0 1.00000e+00 1.00000e+00 1e+06 -33330.5
[1738230] RT0 1.20458e-39 1.07358e-38 1e+06 494245.5
[1738231] RT0 4.16131e-44 3.79191e-43 1e+06 645829.0
[1738232] RT0 1.32290e-42 1.19683e-41 1e+06 636780.2
df

[1] 24.3996
[2] 25.3728
[3] 24.6859
[4] 15.8837
[5] 15.8837
... ...
[1738228] 15.27924
[1738229] 8.19568
[1738230] 17.12816
[1738231] 18.80084
[1738232] 18.10066


seqinfo: 25 sequences from an unspecified genome; no seqlengths

We have good results for DMCs, in line with the biology of our experiment. Now, we would like to analyse also DMRs, so we have applied the "diff_methylsig" function to the tile data , as described in the manual:

Tile_bs200 = tile_by_windows(bs = bs, win_size = 200)
diff_grTile200 = diff_methylsig(bs = Tile_bs200,group_column ="condition", comparison_groups = c("case"= "RT0", "control"= "NRT0"), disp_groups = c("case"= TRUE, "control"= TRUE), local_window_size = 0, t_approx = TRUE)

If we looked at the created object we get:

diff_grTile200
GRanges object with 15467792 ranges and 9 metadata columns:
seqnames ranges strand | meth_case meth_control meth_diff
|
[1] chr1 1-200 * | NA NA NA
[2] chr1 201-400 * | NA NA NA
[3] chr1 401-600 * | NA NA NA
[4] chr1 601-800 * | NA NA NA
[5] chr1 801-1000 * | NA NA NA
... ... ... ... . ... ... ...
[15467788] chrM 15801-16000 * | 0.613726 0.445216 0.168510
[15467789] chrM 16001-16200 * | NA NA NA
[15467790] chrM 16201-16400 * | 2.380692 0.000000 2.380692
[15467791] chrM 16401-16600 * | 0.686588 0.000000 0.686588
[15467792] chrM 16601-16628 * | NA NA NA
direction pvalue fdr disp_est log_lik_ratio df

[1] NA NA NA NA 0
[2] NA NA NA NA 0
[3] NA NA NA NA 0
[4] NA NA NA NA 0
[5] NA NA NA NA 0
... ... ... ... ... ... ...
[15467788] RT0 0.8252913 0.961851 1e+06 0.0520292 8
[15467789] NA NA NA NA 0
[15467790] RT0 0.0519281 0.491728 1e+06 5.2064488 8
[15467791] RT0 0.1583328 0.636830 1e+06 2.4209577 8
[15467792] NA NA NA NA 0


seqinfo: 25 sequences from an unspecified genome; no seqlengths

As you can see we have almost "NA" values and when we filter the data based on fdr and meth_diff (fdr<0.1 and meth_diff≥|25|) we have no results.

SubsetSigDMRsHyper=subset(diff_grTile200, "fdr"<= 0.1 & "meth_diff">=25)
SubsetSigDMRsHypo=subset(diff_grTile200, "fdr"<= 0.1 & "meth_diff"<=-25)
SubsetSigDMRsHyper
GRanges object with 0 ranges and 9 metadata columns:
seqnames ranges strand | meth_case meth_control meth_diff direction
|
pvalue fdr disp_est log_lik_ratio df


seqinfo: 25 sequences from an unspecified genome; no seqlengths

SubsetSigDMRsHypo
GRanges object with 0 ranges and 9 metadata columns:
seqnames ranges strand | meth_case meth_control meth_diff direction
|
pvalue fdr disp_est log_lik_ratio df


seqinfo: 25 sequences from an unspecified genome; no seqlengths

diff_grTile200$"fdr"<= 0.1
[1] NA NA NA NA NA NA NA NA NA NA NA NA
[13] NA NA NA NA NA NA NA NA NA NA NA NA
[25] NA NA NA NA NA NA NA NA NA NA NA NA
[37] NA NA NA NA NA NA NA NA NA NA NA NA
[49] NA NA NA NA FALSE NA NA NA NA NA NA NA
[61] NA NA NA NA NA NA NA NA NA NA NA NA
[73] NA NA NA NA NA NA NA NA NA NA NA NA
[85] NA NA NA NA NA NA NA NA NA NA NA NA
[97] NA NA NA NA NA NA NA NA NA NA NA NA
[109] NA NA NA NA NA NA NA NA NA NA NA NA
[121] NA NA NA NA NA NA NA NA NA NA NA NA
[133] NA NA NA NA NA NA NA NA NA NA NA NA
[145] NA NA NA NA NA NA NA NA NA NA NA NA
[157] NA NA NA NA NA NA NA NA NA NA NA NA
[169] NA NA NA NA NA NA NA NA NA NA NA NA
[181] NA NA NA NA NA NA NA NA NA NA NA NA
[193] NA NA NA NA NA NA NA NA NA NA NA NA
[205] NA NA NA NA NA NA NA NA NA NA NA NA

What are we missing? Is it possible to have 11482 significant DMCs and No significant DMRs?
Moreover, we have also tried to use local information when analysing the tile data but we get an error:

diff_grTile200 = diff_methylsig(bs = Tile_bs200,group_column ="condition", comparison_groups = c("case"= "RT0", "control"= "NRT0"), disp_groups = c("case"= TRUE, "control"= TRUE), local_window_size = 200, t_approx = TRUE)
Error in diff_methylsig(bs = Tile_bs, group_column = "condition", comparison_groups = c(case = "RT0", :
Cannot use local information on region-resolution data. Detected local_window_size > 0 and median width of loci > 2

If I understand well, in the old version of methylsig (we usually use version 0.4.5beta) we can set "local.disp" and "local.meth" in the "methylSigCalc" function when doing it on tile data but it seems to be not possible on the new version: Is it correct?

Thank you in advance for all your support!

Anair

@rcavalcante
Copy link
Member

Hello,

I'm glad methyldackel helped! I will address your questions a little out of order.

If I understand well, in the old version of methylsig (we usually use version 0.4.5beta) we can set "local.disp" and "local.meth" in the "methylSigCalc" function when doing it on tile data but it seems to be not possible on the new version: Is it correct?

I can't recall whether this was allowed then, but, conceptually, tiling the data is taking local information with a different kernel function (one that is evenly weighted). I don't think it makes sense to both tile the data and explicitly use the local information in the old methylSigCalc function.

As to the problem of NAs, when tiling data I usually do the following order of calls:

  1. bsseq::read.bismark()
  2. tile_by_windows()
  3. filter_loci_by_coverage()
  4. filter_loci_by_group_coverage()
  5. diff_methylsig()

In other words, I don't throw out any data until I've grouped the CpG-level data into tiles.

The presence of regions like chr1 1-200 in your diff_grTile200 object suggests to me that you didn't run filter_loci_by_group_coverage() after tiling the data. All those NAs are the result of regions with no signal in them. The tiling function is a little "dumb" in the sense that it doesn't filter out non-zero regions for you, so you should run filter_loci_by_group_coverage() after tiling.

What are we missing? Is it possible to have 11482 significant DMCs and No significant DMRs?

It's hard to know without looking at some examples in your data. It might be helpful to look at the distribution of methylation differences in your regions and at volcano plots for CpG-level and region-level results. It might also be helpful to look at the number of significant regions with different thresholds.

Thanks.

@anairlema
Copy link
Author

Thank you for your prompt answer!
I can't recall whether this was allowed then, but, conceptually, tiling the data is taking local information with a different kernel function (one that is evenly weighted). I don't think it makes sense to both tile the data and explicitly use the local information in the old methylSigCalc function.

Regarding this aspect of using local information on the old methylSigCalc function, we have some example of its utility on old analysis.
I agree that it really depends on the pattern of methylation of the different types of cancers. If there is a lot of local methylation, with some long stretches of consecutive CpGs that are equally affected, then using local methylation as TRUE would be the correct option. However, since in in an our previously analysis, when we did the different tile size it seemed to be that we were getting more results with smaller tiles, this gave us the indication that we should test for the local methylation FALSE option. Here you can find a summary of p-values histograms using local.meth TRUE compare to the FALSE one when using the old methylSigCalc function

meth<- methylSigReadData(files, sample.ids= sample.id, assembly= "hg19", treatment= treatment, context= "CpG", destranded=TRUE)
methFilter<- methylSigReadData(files, sample.ids= sample.id, assembly= "hg19", treatment= treatment, context= "CpG", destranded=TRUE, minCount=10, maxCount= 500, quiet=FALSE)
methTile=methylSigTile(methFilter, win.size=25)

You can find attached the file1 with p-value histograms:
Figure 1

AllDMRsT1_0TF= methylSigCalc(methTile, groups=c(1,0), min.per.group=3,local.disp=TRUE, local.meth=FALSE,num.cores=2)
Figure2
AllDMRsT1_0FF= methylSigCalc(methTile, groups=c(1,0), min.per.group=3,local.disp=FALSE, local.meth=FALSE,num.cores=2)
Figure3
AllDMRsT1_0TT= methylSigCalc(methTile, groups=c(1,0), min.per.group=3,local.disp=TRUE, local.meth=TRUE,num.cores=2)
Figure4
AllDMRsT1_0FT= methylSigCalc(methTile, groups=c(1,0), min.per.group=3,local.disp=FALSE, local.meth=TRUE,num.cores=2)

**As to the problem of NAs, when tiling data I usually do the following order of calls:

  1. bsseq::read.bismark()
  2. tile_by_windows()
  3. filter_loci_by_coverage()
  4. filter_loci_by_group_coverage()
  5. diff_methylsig()**

Correct suggestion! We have now performed again the analysis following your suggested order of calls:

bsseq_stranded= bsseq::read.bismark(files, colData= sampleTable, rmZeroCov = FALSE, strandCollapse = FALSE)
Tile200New=tile_by_windows(bs =bsseq_stranded, win_size = 200)
bsTile = filter_loci_by_coverage(Tile200New, min_count = 10, max_count = 450)
bsTile = filter_loci_by_group_coverage(bs = bsTile ,group_column = "condition",c("RT0" = 4, "NRT0" = 4))
diff_grTile200New = diff_methylsig(bs = bsTile ,group_column ="condition", comparison_groups = c("case"= "RT0", "control"= "NRT0"), disp_groups = c("case"= TRUE, "control"= TRUE), local_window_size = 0, t_approx = TRUE)

Anyway, we are still getting no significant DMRs using fdr≤ 0.1 and meth_diff≥|25|.
Looking at the results it seems that the problem are fdr values: we have significant differences regions on meth_diff but with high fdr values (the smaller one seems to be around 0.4!).
To confirm if it was a real biological event, we have done the analysis on the same samples using methylkit and eDMR (I know that methylkit have a better algoritm to do this type of analysis but we have tried anayway!).
In this case we were able to capture more than 2000 significant DMRs (2156 DMRs with fdr<0.05 and mreth.diff≥|20|):

library(methylKit)
list("R1T0_CpGfinal.txt", "R2T0_CpGfinal.txt", "R3T0_CpGfinal.txt", "R4T0_CpGfinal.txt", "R5T0_CpGfinal.txt", "NR1T0_CpGfinal.txt", "NR2T0_CpGfinal.txt", "NR3T0_CpGfinal.txt", "NR4T0_CpGfinal.txt", "NR5T0_CpGfinal.txt")
myobj=methRead(file.list, >sample.id=list("R1T0","R2T0","R3T0","R4T0","R5T0","NR1T0","NR2T0","NR3T0","NR4T0","NR5T0"), assembly="hg19",treatment=c(1,1,1,1,1,0,0,0,0,0),context="CpG")
filtered.myobj=filterByCoverage(myobj,lo.count=10,lo.perc=NULL,hi.count=450,hi.perc=99.9)
norm_myobj=normalizeCoverage(filtered.myobj)
meth=unite(norm_myobj, destrand=TRUE)
library(edmr)
myMixmdl_ResvsNonRes=myDiff.to.mixmdl(myDiffResvsNonRes)
myDMR_ResvsNonRes=edmr(myDiffResvsNonRes, mode=1, ACF=TRUE, num.DMCs=2, num.CpGs=3, DMC.qvalue=1, DMC.methdiff=0,DMR.methdiff=0 )
myDMR_ResvsNonRes.sig=filter.dmr(myDMR_ResvsNonRes, DMR.qvalue=0.05, mean.meth.diff=20, num.CpGs=3, num.DMCs=2)

It's hard to know without looking at some examples in your data. It might be helpful to look at the distribution of methylation differences in your regions and at volcano plots for CpG-level and region-level results. It might also be helpful to look at the number of significant regions with different thresholds.

You can find attached the volcano plot for our CpGs file2 and DMRs file3.

Thank you very much!

Anair
File1
File2
File3

@sartorma
Copy link
Member

sartorma commented Jul 17, 2020 via email

@anairlema
Copy link
Author

Dear Maureen,

thank you for your suggestion. We've tried to perform the analysis using 25 bp tiling window size, as usually done in the old methylsig version.
Unfortunately, we have again no significant DMRs, you can find attached the code used for the analysis.
Do you have any other suggestion?
Thank you for your help!

Best,
Code25Tilewindows.txt

Anair

@anairlema
Copy link
Author

Dear Maureen and Raymond,

We have performed a comparison analysis on the same data using an old (v0.4.5beta) and new version of methylsig.
You can find attached a summary with results and pvalue histograms.

As you can see, with these new version we are able to capture less differences in DMRs using these specific data.
Based on these results, we will probably use the older version 0.4.4 instead of the new one.

Since I have read that you're going to do a new release around October,I really would like to ask you to consider the possibility to use local information just for the dispersion calculation, just for the methylation or for both dispersion and methylation. It will be very helpful when analyzing data like ours.

Thank you very much!

Best,
Anair
Summary.xlsx

@rcavalcante
Copy link
Member

Hi Anair,

Indeed, the plan is to add back the option of specifying using local information for methylation, dispersion, or both.

However, the sites allowed for use in local information changed from 0.4.4 to the present version, so your results with local information will still likely differ. We do not plan to revert that change.

I'll give an example to illustrate how the local information differs, apologies if you read this in another issue. Say there is a CpG within a 200bp window of the CpG being tested, but that it only has data for 2 samples and the min.per.group is 4. In the old version (0.4.4), that local CpG would be used to estimate the methylation/dispersion, but in the current release it is not. In effect, we are constraining local information to higher quality CpGs.

Thanks,
Raymond

@santur90
Copy link

I met a totally same issue as anairlema described, got no DMRs using fdr although the |meth_diff| > 25. And the tile size is 25bp, with no local information. Actually the same datasets did give me perfect results with old version of methylSig (0.4.4)

@tuuba06
Copy link

tuuba06 commented Jun 7, 2024

Hi,
Sorry, I'm new to methylSig. I need help. I tried the diff_methylsig option and got my data. I want to see hyper/hypo methylation results, what command should I use for this?
Thank you for your help.

Thanks,
Tuba

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants