HiCDOC has moved to: mzytnicki/HiCDOC
HiCDOC normalizes intrachromosomal Hi-C matrices and detects A/B compartments with multiple replicates using unsupervised learning. This repository is a proof of concept.
The following dependencies are required:
-
R >= 3.5
-
Python >= 3.6
-
MultiHiCcompare[publication][installation]
R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' R -e 'BiocManager::install("multiHiCcompare")'
-
gcMapExplorer[publication][installation]
pip3 install Cython gcMapExplorer
On macOS, you might need this preliminary (with homebrew):
brew install cmake gcc export CC=/usr/local/bin/gcc-9
-
argparse, numpy, sklearn, statsmodels, matplotlib, plotly
R -e 'install.packages("argparse")' pip3 install argparse numpy sklearn statsmodels matplotlib plotly
HiCDOC is segmented in a collection of scripts, which can be used to construct a custom pipeline.
The default pipeline can be run with hicdoc.sh
. It consists of 5 steps:
- Normalize technical biases with
normalize_cyclic_loess.r
- Normalize biological biases with
normalize_knight_ruiz.py
- Normalize distance effect with
normalize_distance_rnr_combined.py
- Detect compartments and compute measures with
detect_constrained_k_means.py
- Plot compartment changes and measures with
plot_compartment_changes.py
Each script accepts a tab-separated multi-replicate sparse matrix with a header line and optional comment lines.
# Optional comment lines
# ...
chromosome position 1 position 2 replicate 1.1 replicate 1.2 replicate 2.1 ...
3 1500000 7500000 145 184 72 ...
...
The interaction proportions between position 1
and position 2
are reported
in each replicate column, named replicate <condition.replicate>
. There is no
limit to the number of replicates and conditions.
./hicdoc.sh
-i <file> Input matrix file
-d <directory> Output directory
Run the default pipeline on the input matrix. The matrix interactions will be
normalized by normalize_cyclic_loess.r
, normalize_knight_ruiz.py
and
normalize_distance_rnr_combined.py
, then A/B compartments will be detected
with detect_constrained_k_means.py
and plotted alongside various measures with
plot_compartment_changes.py
.
./join_replicates.py
-i <file> ... Input matrix files
-o <file> Output matrix file
--replicates <condition.replicate> ... Replicate column names
in the same order as the input files
[--inputs-have-headers] Add if the input matrices have a header line
[--comments "<comment line>" ...] Comment lines to add to the top of the output file
Join single-replicate matrices (chromosome, position 1, position 2, interaction) into one multi-replicate sparse matrix.
./normalize_cyclic_loess.r
-i <file> Input matrix file
-o <file> Output matrix file
Normalize technical biases (sequencing depth, restriction enzyme, etc.) with a cyclic loess[publication][implementation].
The cyclic loess method constructs a MD plot (difference ~ genomic distance) for each replicate pair, then recursively corrects their interaction proportions, such that the mean difference between each replicate pair reaches zero at each genomic distance.
Low-proportions interaction vectors are NOT filtered before normalization.
./normalize_knight_ruiz.py
-i <file> Input matrix file
-o <file> Output matrix file
Normalize biological biases (GC content, repeated sequences, etc.) with the Knight-Ruiz algorithm[publication][implementation].
The Knight-Ruiz matrix balancing algorithm normalizes coverage by transforming each replicate matrix into a doubly stochastic matrix.
A filter is applied before normalization, removing low-proportions interaction vectors whose number of zeros exceeds the 99th percentile of the distribution of zeros per interaction vector.
./normalize_distance_rnr_combined.py
-i <file> Input matrix file
-o <file> Output matrix file
[--expected <file>] Output "expected" interaction proportions
at each genomic distance
Normalize distance effect (linear proximity affecting interaction proportions) with a combined radius-neighbors regression[implementation].
First, an interactions ~ distance plot is constructed from all the combined replicates. Then a radius-neighbors regression is applied to estimate the "expected" interaction proportion at each genomic distance. Finally, each "observed" interaction proportion is divided by its "expected" value for its genomic distance.
A filter is applied to ignore empty interaction vectors before estimating "expected" interactions.
./normalize_distance_rnr_individual.py
-i <file> Input matrix file
-o <file> Output matrix file
[--expected <file>] Output "expected" interaction proportions
at each genomic distance
Normalize distance effect with an individual radius-neighbors regression[implementation] for each replicate.
First, an interactions ~ distance plot is constructed for each individual replicate. Then a radius-neighbors regression is applied to estimate the "expected" interaction proportion at each genomic distance for each replicate. Finally, each "observed" interaction proportion is divided by its "expected" value for its genomic distance.
A filter is applied to ignore empty interaction vectors before estimating "expected" interactions.
./normalize_distance_mean_individual.py
-i <file> Input matrix file
-o <file> Output matrix file
[--expected <file>] Output "expected" interaction proportions
at each genomic distance
Normalize distance effect with an individual interaction mean estimation[implementation] for each replicate.
First, an interactions ~ distance plot is constructed for each individual replicate. Then the mean of interactions is computed to estimate the "expected" interaction proportion at each genomic distance for each replicate. Finally, each "observed" interaction proportion is divided by its "expected" value for its genomic distance.
A filter is applied to ignore empty interaction vectors before estimating "expected" interactions.
./normalize_vectors_min_max.py
-i <file> Input matrix file
-o <file> Output matrix file
Min-max scale each interaction vector to [0, 1].
./detect_constrained_k_means.py
-i <file> Input matrix file
-o <file> Output compartments file
[-k <n>] Number of compartments to detect
Default: 2
[--distances <file> ...] Output distances to centroids
One file per compartment
[--concordance <file>] Output concordance file
[--silhouette <file>] Output Silhouette coefficient file
Detect compartments using constrained k-means[publication][implementation]. The algorithm applies a compartment label to each genomic position based on interaction vectors differences, with the additional constraint that, for each given condition, different replicates of the same genomic position must belong to the same compartment.
The original algorithm operates somewhat naively. At each iteration, it selects a compartment for the first replicate it encounters, then forces that compartment onto subsequent replicates. For HiCDOC, the implementation has been modified. At each iteration, the algorithm selects a compartment that fits best for the majority of replicates.
To determine the strength of membership of each genomic position replicate to its compartment, a Silhouette coefficient[implementation] is computed, as well as a concordance value. The concordance is a confidence measure designed specifically for HiCDOC. Each genomic position replicate is given a concordance between -1 and 1. The value is positive or negative depending on which compartment fits best, and its distance from 0 indicates the strength of membership. Concordance is the ratio of distance to each centroid, normalized by the distance between the two centroids, scaled to the [-1, 1] interval:
./plot_matrix.py
-i <file> Input matrix file
-p <prefix> Output figure prefix
[--measure <file>] Input measure file
[--name <name>] Name of the measure to write on the figure
Plot a matrix, with an optional measure (concordance, silhouette or distance).
One figure will be created per chromosome and replicate. Each figure is saved to
prefix_resolution_chromosome_replicate.png
.
./plot_ma.py
-i <raw file> <normalized file> Input raw and normalized matrix file
-p <prefix> Output figure prefix
Create MA plots (difference ~ average) for each pair of replicates. One figure
will be created per chromosome. Each figure is saved to
prefix_resolution_chromosome.png
.
./plot_expected.py
-i <file> Input "expected" interaction proportions file
-p <prefix> Output figure prefix
Create an interactions ~ distance plot with the "expected" interaction
proportions. One figure will be created per chromosome. Each figure is saved to
prefix_resolution_chromosome.png
.
./plot_compartment_changes.py
-i <file> Input compartments file
-p <prefix> Output figure prefix
[--distances <file> ...] Input distances to centroids
One file per compartment
[--concordance <file>] Input concordance file
[--silhouette <file>] Input Silhouette coefficient file
Plot compartment changes. One figure will be created per chromosome and
condition pair. Each figure is saved to
prefix_resolution_chromosome_conditions.png
.
./plot_concordance_changes.py
-i <compartments file> <concordance file> Input compartments and concordance files
-p <prefix> Output figure prefix
Plot concordance changes. One figure will be created per chromosome and
condition pair. Each figure is saved to
prefix_resolution_chromosome_conditions.png
.
Philip A. Knight, Daniel Ruiz, A fast algorithm for matrix balancing, IMA Journal of Numerical Analysis, Volume 33, Issue 3, July 2013, Pages 1029–1047, https://doi.org/10.1093/imanum/drs019
Rajendra Kumar, Haitham Sobhy, Per Stenberg, Ludvig Lizana, Genome contact map explorer: a platform for the comparison, interactive visualization and analysis of genome contact maps, Nucleic Acids Research, Volume 45, Issue 17, 29 September 2017, Page e152, https://doi.org/10.1093/nar/gkx644
John C Stansfield, Kellen G Cresswell, Mikhail G Dozmorov, multiHiCcompare: joint normalization and comparative analysis of complex Hi-C experiments, Bioinformatics, 2019, https://doi.org/10.1093/bioinformatics/btz048
Kiri Wagstaff, Claire Cardie, Seth Rogers, Stefan Schrödl, Constrained K-means Clustering with Background Knowledge, Proceedings of 18th International Conference on Machine Learning, 2001, Pages 577-584, https://pdfs.semanticscholar.org/0bac/ca0993a3f51649a6bb8dbb093fc8d8481ad4.pdf