forked from rschwess/deepC
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtutorial_predict_and_plot.Rmd
511 lines (371 loc) · 23.6 KB
/
tutorial_predict_and_plot.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
---
title: "DeepC Tutorial Predict and Plot"
output:
html_document: default
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = "~/data_local/deepC_write_tutorials_offline/run_deepc_tutorial/")
# need to run in console as workaround
setwd("~/data_local/deepC_write_tutorials_offline/run_deepc_tutorial/")
```
This tutorial runs through predicting the chromatin interactions of an example region, plotting the predictions as well as the skeleton and the "raw" Hi-C interactions. It will also demonstrate how to plot 1D tracks such as ChIP-seq signals attached to the deepC plots. Further, the tutorial demonstrates how to make a prediction of the impact of sequence variant, plotting the reference, variant and interactions and a differential plot and also calculate an associated damage score.
You can find the smaller example files in the gitHub repository in *./tutorials*. Larger files are linked under *formatted_data_links*.
The tutorial runs in R which is what we use for analysing and plotting the predictions. The deepC model training and running predictions is run in python. Because the python runs can be slow depending on your set up we recommend running them in a terminal. Commands to copy paste are listed in the tutorial and example output files are provided.
For running the tutorials I recommend copying the tutorials folder from the deepC github repo to the outside of the repository. So the larger files you will download don't clock up your repo version. e.g. in terminal run:
```{bash, eval=FALSE}
# In terminal run
cp -R ../../repositories/deepC/tutorials ./run_deepc_tutorial
cd ./run_deepc_tutorial
```
## Global Set up
In your R session / RStudio:
Load packages and source the custom functions for deepC and general Hi-C processing and visualisation.
```{r packages, message=FALSE}
library(tidyverse)
library(cowplot)
theme_set(theme_cowplot())
library(RColorBrewer)
source("./deepC/helper_for_preprocessing_and_analysis/functions_for_HiC.R")
source("./deepC/helper_for_preprocessing_and_analysis/functions_for_deepC.R")
```
Set global options, specifiying bin size and bp context of the model.
```{r}
sample <- "GM12878"
bin.size <- 5000
window.size <- 1000000 + bin.size # bp_context
interaction.value.bins <- 10 # how many HiC skeleton bins have been used (10 for all published models)
prediction.bins <- window.size/bin.size # number of predictions bins (also called number of classes later). Refers to number of bins in the vertial interaction pole
# depends on the bin.size and bp_context selected: 201 for 5kb models | 101 for 10k b models
slope <- 1/(bin.size/2) # slope for visualizing interactions with the central bin in later visualizations
```
## Chromatin interactions of reference sequence
We start by predicting the chromatin interactions of a region of chr17 (test chromosome in the deepC manuscript). No variants yet just reference sequence.
### Running the deepC prediction
Have a look at the example regions file we provide the deepC prediction script with. Format: chromosome start end variant in tab separated columns. The format is bed-like so 0-based, half open coordinates!
To only predict on the reference sequence use the variant_tag 'reference' without quotation. We can run longer patches of sequence like in the example below (recommended up to ~ 5 Mb).
For longer patches it is recommended to split the region of intrest into chunks of 3 - 5 Mb and concatinate the predictions afterwards.
```{bash}
# Run in terminal
head example_region_short.bed
```
Now we run the predictions over the example regions. It is worth running this command in a shell script for nohup or submission to a cluster as this is the labour intensive step. I usually run the python script first and then move to analysis and visualization within R.
Please note that you will need a whole genome fasta file and a fasta index. For the tutorial you can download a chr17 only and a hg19 whole genome fasta file and index from (https://github.com/rschwess/deepC/tree/master/data_links). Alternatively, please download a genome wide fasta file from UCSC (e.g. https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/) and index it with e.g. samtools.
You will also need to download a trained model from here: https://github.com/rschwess/deepC/tree/master/models and have it extracted. The primary GM12878 model is used for this tutorial.
For running on a GPU set --run_on gpu otherwise --run_on cpu.
For orientation, the example prediction run takes ~ 3 min on a GeForce GTX 980M and ~ 2 h running on a CPU
Note: Example output files are already provided with the deepC github repository if you want to skip this step.
```{bash, eval=FALSE}
# Run in bash
# explanation
python ../tensorflow1_version/run_deploy_shape_deepCregr.py --input example_region_short.bed \ # input deepC variant bed-like file
--out_dir ./test_predict_out \ #output directory
--name_tag predict \ # name tag to add to ouput files
--model ./model_deepCregr_5kb_GM12878_primary/model \ # trained deepC model downloaded and extracted
--genome ./hg19_chr17_fasta_for_test/hg19_chr17.fa \ # link to whole genome or chromosome wise fasta file (needs a fasta index) or test chr17 fasta file dowloaded and extracted
--use_softmasked=False \ #Specify if to include base pairs soft masked in the fasta file (lower case) default=False
--bp_context 1005000 \ # bp context (1 Mb + bin.size)
--add_window 500000 \ # how much bp to add to either side of the specified window
--num_classes 201 \ # The number of classes corresponds to the number of outputs (output bins of the vertical pole) (201 for 5kb models; 101 for 10kb models)
--bin_size 5000 \ # bin size matching to the model and input data selected
--run_on gpu # specify to run on gpu or cpu (cpu takes significantly longer)
# actually run
python ../tensorflow1_version/run_deploy_shape_deepCregr.py --input example_region_short.bed \
--out_dir ./test_predict_out \
--name_tag predict \
--model ./model_deepCregr_5kb_GM12878_primary/model \
--genome ./hg19_chr17_fasta_for_test/hg19_chr17.fa \
--use_softmasked=False \
--bp_context 1005000 \
--add_window 500000 \
--num_classes 201 \
--bin_size 5000 \
--run_on gpu
```
Having a look at the output format. The variant file (also called so for reference predictions) is a tab separated file: chr start end pred_bin_1 pred_bin_2 ....
The coordinates specify the sequence window (1Mb + bin size) centered on this are the vertical pole interaction predictions moving from close to distal interactions. The coordinates refer to the center of a bin.sized bin. Get the center bin the predictions map to by calculating the center of this window (-half bin size to get the left most coordinate of the middle bin).
The predictions are regression values corresponding to the HiC skeleton.
The header files list the reference or variant region and where they map rlative to the reference genome. (#Mapping to relative reference coordinates:) which refers to the first and last bin over which the vertical interaction pole is predicted. They also specify if any bp shifts occured due to differences between reference and variant length.
This is obviously not relevant for reference predictions and for variants that do not alter the number of base pairs.
We can concatinate multiple reference predictions files to combine bigger regions or whole chromosome predictions in a single file. To do this just concatinate, remove header lines (grep -v "#") and sort (if not concatinated in order).
```{bash}
# run in terminal
# run this line if you are using the prediction you computed yourself
head test_predict_out/class_predictions_predict_1_chr17_71000000_71999999.txt | cut -f 1,2,3,4,5,6
# run this if you want to use the provided prediction output
#head test_predict_out/class_predictions_predict_provided_1_chr17_71000000_71999999.txt | cut -f 1,2,3,4,5,6
```
### Visualizing the predicted interactions
Read the reference prediction file. Use the *readDeepcVariantFile* function. If reading in a concatinated variant file without header use *readDeepcVariantFileNoHeader*
```{r}
# change file name to provided file if not running the prediction yourself
ref.pred <- readDeepcVariantFile("./test_predict_out/class_predictions_predict_1_chr17_71000000_71999999.txt",
prediction.bins = prediction.bins,
bin.size = bin.size,
gather = T) # specify gather TRUE or FALSE if to gather the prediction data in a long format for ggplot2 etc. FALSE: broad format
# print structure
str(ref.pred)
```
The deepC variant file is stored in a list which captures the actual predictions in a data frame as well as the genomic coordinates of the variant and relative mapping (same in reference case) and the bp shift. To view the predictions access:
```{r}
ref.pred$df
```
For plotting Hi-C data in triangular format (half of the matrix spanning the horizontally layed out genome) in ggplot2 we implemented a function that transforms the interaction values into a diamond shape (4 polygon points). We then plot them as the area between the points. Note this function yields really large vector graphics so only save as raster images (png/jpeg).
```{r}
# provide the interaction data frame and the bin.size
t.rdf <- triangularize(ref.pred$df, bin = bin.size)
```
We set the limits for plotting and plot the predicted interactions.
```{r single_ref_pred, fig.height = 3, fig.width = 11}
# set limits for plotting
limit1 <- ref.pred$variant.start - 500000
limit2 <- ref.pred$variant.end + 500000
# create plot
plot.rdf <- t.rdf %>%
ggplot(aes(x = pos, y = (bin*bin.size)/1000, fill = value, group = polyid)) +
geom_polygon() +
labs(y = "Genomic Distance [kb]") +
scale_fill_gradientn(colours = brewer.pal(9, 'YlOrRd')) +
coord_cartesian(xlim=c(limit1, limit2))
plot(plot.rdf)
```
The plot shows the predicted (skeleton) interaction values with genomic position on the x-axis and the genomic distance between the elements interacting on the y-axis.
Now lets overlap this with the actual Hi-C skeleton. Here we read in the already formated Hi-C skeleton data (example download link under *formated_data_links*). Check out the tutorial of how to generate your skeleton data from Hi-C data to reproduce or create your own. The format is tab separated: chr start end skeleton_interaction_values. The coordinates refer to the bp_context sized window and the interaction values are centered on this window. The interactions are comma separated coresponding to the prediction bins from close to diagonal to distal interactions.
```{bash}
# Run in Terminal
head ./example_skeleton_gm12878_5kb_chr17.bed
```
```{r}
skel.file <- "./example_skeleton_gm12878_5kb_chr17.bed"
# function can be used for HiC or Skeleton values as long as they are collated in a comma separated single column
skel.df <- readDeepcInputHicBed(skel.file, prediction.bins = prediction.bins, bin.size = bin.size, gather=TRUE)
head(skel.df)
```
Triangularize the skeleton and plot the overlay.
```{r skel_ref_overlay, fig.height = 6, fig.width = 11}
# set limits to what is covered in the reference prediction file
limit1 <- min(ref.pred$df$pos)
limit2 <- max(ref.pred$df$pos)
# triangularize --> format to diamond shape polygons
t.skel.df <- triangularize(skel.df, bin.size)
# create the plot
plot.skel.df <- t.skel.df %>%
ggplot(aes(x = pos, y = (bin*bin.size)/1000, fill = value, group = polyid)) +
geom_polygon() +
labs(y = "Genomic Distance [kb]") +
scale_fill_gradientn(colours = brewer.pal(9, 'YlOrRd')) +
xlim(limit1, limit2)
# plot using cowplot and helper themes for merging.
plot_grid(plot.skel.df + upper_overlay_theme,
plot.rdf + lower_overlay_theme,
nrow = 2, align = "v")
```
### Adding the source HiC data
To add the source HiC data we read in a sparse matrix file, either in HiC-Pro format (bins with bin ids and interaction values) and a bed file specifying the genomic coordinates of the bin ids or a matrix with genomic coordinates instead of bin ids e.g. as downloaded from Rao et al. (GSE63525) use the intrachromsosomal interaction matrices and apply norm factors as desired (here KRnorm). Example GM12878 sparse matrix for chr17 is linked in *formatted_data_links*. Note this format lists the left most base pair as identifier for the respective bin.
```{bash}
# Run in terminal
head gm12878_primary_chr17_5kb.contacts.KRnorm.matrix
```
Here is how we read in Hi-C data, process and plot them.
```{r}
# single chromosome intra chromosomal interaction matrix
matrix.norm.loc <- "./gm12878_primary_chr17_5kb.contacts.KRnorm.matrix"
# to read in a HiC matrix with the format coord_bin1 coord_bin2 interaction_value (e.g. as Rao et al. downloads)
# provide the matrix file, the chromosome and bin.size arguments
# if the format is like HiC-Pro (bin1_id bin2_id interaction_value) with a separate bed-like coordinate file
# provide the coordinate file in the "coords=" argument of the Import function
hic <- ImportHicproMatrix(matrix.norm.loc, chr="chr17", bin.size=bin.size) # import
hic <- trimHicCoords(hic, start = limit1-window.size, end = limit2+window.size) # remove HiC contacts that are not withon our region of interest
hic <- trimHicRange(hic, range = window.size + bin.size) # remove all interactions that are more then bp_context apart
# check the first covered position in the HiC data
start.pos <- hic$start.pos - bin.size/2
# to map the hic data into consistent bins for deepC training, prediction and plotting we create get a binned genome template
# this requires bedtools to be installed as the R function calls bedtools in the background in bash
# add at least the bp_context window.size to the region of interest coordinates
# make sure the start and end position match to the binning of the Hi-C data
# use custom floor/ceiling funcitons for your bin size
binned.genome <- getBinnedChrom(chr="chr17", start=custom_floor(limit1-window.size, bin.size), end=custom_ceil(limit2+window.size, bin.size), window=window.size, step=bin.size)
binned.genome <- as_tibble(binned.genome)
names(binned.genome) <- c("chr", "start", "end")
# inspect
binned.genome
# filter out incomplete megabase bins
binned.genome <- binned.genome %>%
mutate(width = end - start) %>%
filter(width == window.size) %>%
select(-width)
# # # if necessary add sub telomeric regions to binned genome
# to.add <- tibble(chr = chrom, start = seq(start.pos - window.size/2, start.pos - bin.size, bin.size))
# to.add <- to.add %>% mutate(end = start + window.size)
# binned.genome <- rbind(to.add, binned.genome)
# format HiC data into deepC vertical pole distance bin data frame format
# link to the helper perl script in your deepC repository copy
hic.df <- getZigZagWindowInteractionsPerl(hic, binned.genome, window.size, bin.size, query.pl =
"./deepC/helper_for_preprocessing_and_analysis/match_query_table.pl")
hic.df <- as_tibble(hic.df)
# to visualize hic data we add a position column and gather
hic.df <- hic.df %>%
mutate(pos = start + (end - start)/2) %>%
select(chr, pos, c(4:(prediction.bins+3))) %>%
gather(bin, value, -chr, -pos) %>%
mutate(bin = as.numeric(bin)) %>%
mutate(pos = if_else(bin %% 2 == 0, pos - bin.size/2, pos)) %>% # this indents the even distance bin positions to get the zig-zag layout
filter(value > 0)
# To visualize HiC data we susually quantile squeeze the very low and very high interaction values to increase contrast in the middle value range
hic.df$value <- SetValueRange(hic.df$value, min=as.numeric(quantile(hic.df$value, .05)), max=as.numeric(quantile(hic.df$value, .95)))
```
Now plot the Hi-C data as overlap.
```{r, fig.height = 9, fig.width = 11}
# triangularize
t.hic.df <- triangularize(hic.df, bin.size)
plot.hic.df <- ggplot(t.hic.df, aes(x = pos, y = (bin*bin.size)/1000, fill = value, group = polyid)) +
geom_polygon() +
labs(y = "Genomic Distance [kb]") +
scale_fill_gradientn(colours = brewer.pal(9, 'YlOrRd')) +
xlim(limit1, limit2) +
theme(strip.background = element_blank())
# plot using cowplot and helper themes for merging.
plot_grid(plot.hic.df + upper_overlay_theme,
plot.skel.df + upper_overlay_theme,
plot.rdf + lower_overlay_theme,
nrow = 3, align = "v")
```
### Overlap with 1D signals
We can also overlay 1D signals from bigwig files such as CTCF ChIP-seq or DNase-seq tracks. Example tracks for GM12878 in 50 bp windows are downloadable from *formated_data_links* or just checkout ENCODE or your ressources of choice.
```{r plot_tracks_overlap, fig.height = 11, fig.width = 11, message=FALSE}
# we require some more libraries for this
library(rtracklayer)
# library(GenomicRanges)
bw.ctcf <- "./ctcf_gm12878_encode_broad_merged_w50.bw"
bw.dnase <- "./dnase_gm12878_encode_uw_merged_w50.bw"
# initialize GRanges object
groi <- makeGRangesFromDataFrame(tibble(chrom = "chr17", start = limit1, end = limit2))
# import bgigwig signal over region of interest
gr.ctcf <- import(bw.ctcf, which = groi) # Import BigWig Signal
df.ctcf <- as.tibble(gr.ctcf)
df.ctcf <- df.ctcf %>%
mutate(pos = start + (end - start)/2) %>%
select(c(pos, width, score))
gr.dnase <- import(bw.dnase, which = groi) # Import BigWig Signal
df.dnase <- as.tibble(gr.dnase)
df.dnase <- df.dnase %>%
mutate(pos = start + (end - start)/2) %>%
select(c(pos, width, score))
# combine in data frame
df.tracks <- rbind(df.ctcf %>% mutate(source = "CTCF"),
df.dnase %>% mutate(source = "DNase"))
# plot 1D signal
p.tracks <- df.tracks %>%
ggplot(aes(x = pos, y = score, col = source)) +
geom_area() +
scale_color_brewer(palette = "Set1") +
xlim(limit1, limit2) +
facet_wrap(~source, nrow = 2, strip.position = "right") +
theme(strip.background = element_blank(), legend.position = "None")
plot.skel.pred.tracks <- plot_grid(plot.hic.df + upper_overlay_theme,
plot.skel.df + upper_overlay_theme,
plot.rdf + lower_overlay_theme,
p.tracks + theme(
axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank()
),
nrow = 4, align = "v", axis = "tlbr", rel_heights = c(1,1,1,.7))
plot(plot.skel.pred.tracks)
```
## Predicting the impact of sequence variants
Now lets delete a CTCF site. To delete a sequence supply '.' in the variant tag (without quotation) e.g.
```{bash}
#Run in terminal
head example_variant.bed
```
To replace a sequence chunk (or single base pair) with a sequence variant just supply the variant sequence in the variant_tag column. The length of the sequences don't have to match.
Note that when the reference and variant sequence have unequal length, the sequence window in focus is shifted according to the difference which can lead to a shadow of predicted changes. You might want to consider supplementing N's instead of deleting bases.
We go ahead and delete the example variant sequence as provided. We run the deepC predictor again supplying the variant deepC bed file. We specify to predict over an area of half the sequence window around the variant.
```{bash, eval=FALSE}
# Run in terminal
python ../tensorflow1_version/run_deploy_shape_deepCregr.py \
--input example_variant.bed \
--out_dir ./test_variant_out \
--name_tag predict_variant \
--model ./model_deepCregr_5kb_GM12878_primary/model \
--genome ./hg19_chr17_fasta_for_test/hg19_chr17.fa \
--bp_context 1005000 \
--add_window 500000 \
--num_classes 201 \
--bin_size 5000 \
--run_on gpu
```
First lets have a quick look at the variant prediction output. Format as with the reference file before but now the bp to adjust are non-zero because we deleted 350 bps.
```{bash}
# Run in terminal
# use the provided file instead if you did not run the prediction yourself.
head test_variant_out/class_predictions_predict_variant_1_chr17_71706322_71706671.txt | cut -f 1,2,3,4,5
```
We read in the variant prediction data. (Use the "provided" txt file if you haven't run the prediction yourself.)
```{r}
# load variant predictions
var.loc <- "test_variant_out/class_predictions_predict_variant_1_chr17_71706322_71706671.txt"
var <- readDeepcVariantFile(var.loc, prediction.bins = prediction.bins, bin.size = bin.size)
```
And plot the overlap between reference and variant predictions as well as a substraction plot
```{r deletion_overlap_plot, fig.height = 9, fig.width = 9, message=FALSE}
# filter the reference prediction for only those positions that are also covered (affected by) the variant prediction
ref.df2 <- ref.pred$df %>%
filter(pos > 0) %>%
filter(pos >= min(var$df$pos)) %>%
filter(pos <= max(var$df$pos))
# Make a new reference plot
p.ref <- triangularize(ref.df2, bin.size) %>%
ggplot(aes(x = pos, y = (bin*bin.size)/1000, fill = value, group=polyid)) +
geom_polygon() +
geom_vline(xintercept = var$variant.start, linetype = "dotted") +
labs(y = "Genomic Distance [kb]", x = "chr17") +
ggtitle("Reference") +
# scale_x_continuous(breaks=c(71400000, 71700000, 72000000)) +
scale_fill_gradientn(colours = brewer.pal(9, 'YlOrRd')) +
theme(strip.background = element_blank())
# make a variant
p.var <- triangularize(var$df, bin.size) %>%
ggplot(aes(x = pos, y = (bin*bin.size)/1000, fill = value, group=polyid)) +
geom_polygon() +
labs(y = "Genomic Distance [kb]", x = "chr17") +
geom_vline(xintercept = var$variant.start, linetype = "dotted") +
# scale_x_continuous(breaks=c(71400000, 71700000, 72000000)) +
ggtitle("Variant") +
scale_fill_gradientn(colours = brewer.pal(9, 'YlOrRd')) +
theme(strip.background = element_blank())
# Make a differential plot
p.diff <- plotDiffDeepC(ref.df2, var$df, bin.size = 5000) +
geom_vline(xintercept = var$variant.start, linetype = "dotted") +
geom_abline(slope = 1/(bin.size/(2*bin.size/1000)), intercept = -var$variant.start/(bin.size/(2*bin.size/1000)), linetype = "dashed") +
geom_abline(slope = -1/(bin.size/(2*bin.size/1000)), intercept = var$variant.start/(bin.size/(2*bin.size/1000)), linetype = "dashed") +
ggtitle(paste0("Var - Ref"))
# Make the combined plot
# adding the CTCF and DNase track from earlier as well
plot.variant.combined <- plot_grid(p.ref + upper_overlay_theme,
p.var + upper_overlay_theme,
p.diff + upper_overlay_theme,
p.tracks + xlim(c(min(var$df$pos), max(var$df$pos))) + theme(
axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank()
),
nrow = 4, align = "v", axis = "tlbr", rel_heights = c(1,1,1,.8))
plot(plot.variant.combined)
```
Note that the CTCF and DNase tracks are still mapped to the reference. For larger deletions we would want to shift all data past the deletion by the bp_difference. For smaller differences the effect is however neglectable.
To get a single difference score we calculate the absolute difference (var - ref) normalized for the total number of bin interactions affected.
```{r}
# calculate the total number of HiC interaction tiles (bin positions) affected by the variant
# used for normalizing the damage sscore
covered.tiles <- dim(var$df)[1]/prediction.bins * (window.size/bin.size)
# calculate the differntial data frame
diff.df <- getDiffDeepC(ref.df2, var$df)
# calculate the sum of the absolut difference and normalize for the number of covered tiles
avg.abs.diff <- sum(abs(diff.df$diff)) / covered.tiles
print(avg.abs.diff)
```