hicstuff command line interface demo

Getting started

The easiest way to generate matrices is to use hicstuff pipeline. By default the command only requires two fastq files (forward and reverse reads) and a genome in fasta format.

The pipeline command can be used to generate the Hi-C contact map from the input reads.

hicstuff pipeline --genome genome.fa \
                  --outdir results \
                  forward.fq \
                  reverse.fq

For instance, this will create a directory named “results”, containing three output text files with tab-separated columns.

  • abs_fragments_contacts_weighter.txt: Sparse matrix file with 3 columns the rows, column and values of nonzero pixels. The first row contains the shape and total number of nonzero pixels in the matrix.
  • fragments_list.txt: Contains genomic coordinates of the matrix bins (row/columns).
  • info_contigs.txt: Contains chromosome names, theirs length and number of bins.

Using an indexed genome

When given a genome in fasta format, hicstuff regenerates the index everytime unless it finds an index starting by the genome filename. For example using bowtie2, you can index the genome like this:

bowtie2-build genome.fa genome.fa

And when running hicstuff with --genome=genome.fa, it will automatically find the index files and not regenerate it.

Setting the binning

By default, matrices generated are binned at 5kb. You can change this using the --enzyme option. This option allows to specify either a bin size, or an enzyme. For example if you set --enzyme='DpnII', the matrix will be at restriction fragments resolution.

Temporary files

By default temporary files are removed when the pipeline finishes. They can be kept by adding the --no-cleanup flag. For example, if you run:

hicstuff pipeline -g genome.fa --no-cleanup --prefix demo

The --prefix option used here gives a common prefix to all output files, overriding the default output file names as follows:

  • abs_fragments_contacts_weighted.txt -> demo.mat.tsv
  • fragments_list.txt -> demo.frags.tsv
  • info_contigs.txt -> demo.chr.tsv

And the output folder should now contain a tmp/ subfolder and look like this:

output
├── demo.chr.tsv
├── demo.frags.tsv
├── demo.hicstuff_20190423185220.log
├── demo.mat.tsv
├── demo.distance_law.txt
├── plots
│   ├── event_distance.pdf
│   ├── event_distribution.pdf
│   └── frags_hist.pdf
└── tmp
    ├── demo.for.bam
    ├── demo.genome.fasta
    ├── demo.rev.bam
    ├── demo.valid_idx_filtered.pairs
    ├── demo.valid_idx.pairs
    └── demo.valid.pairs

Additional options

The hicstuff pipeline command has additional options allowing to tweak various parameters and change the output files or their format. You can always consult hicstuff pipeline --help to get a comprehensive description of those options, but here are a few of them:

  • --filter: Filters out spurious 3C events, such as self religations or undigested fragments. This is only really useful at very fine resolutions (1-2kb) and not needed most of the time. This option is only meaningful when --enzyme is given a restriction enzyme and not a bin size.
  • --duplicates: Removes PCR duplicates, defined as sets of pairs having identical mapping positions for both reads.
  • --matfmt: allows to specify what file format should be used for the matrix. Available formats are bg2 (bedgraph2d), graal (the text format described above) and cool, a binary format that is probably the most appropriate for large genomes.
  • --distance-law: Computes the distance-law table (i.e. the probability of contacts as a function of genomic distance).
  • --plot: Enables plotting. When used in conjunction with --filter or --distance-law, this will generate figures showing properties of your Hi-C data.

Advanced usage

Starting from intermediate files

If your hicstuff run was interrupted, or you wish to align the reads separately, the --start-stage option allows the hicstuff pipeline command to take intermediate files as input. For example to skip the alignment step and start from aligned reads:

hicstuff pipeline --start-stage bam --genome genome.fa forward.bam reverse.bam

Or if you already have a pairs file:

hicstuff pipeline --start-stage pairs --genome genome.fa valid.pairs

Generating the distance law

The distance law is the probability of contact of two fragments in function of the distance between these fragments. There are two ways to compute it with hicstuff. The first one using the full pipeline with the option --distance-law, as done above. It’s possible to add an option --centromeres if you want to compute the distance law on separate arms. The output of this command will be a raw table of the distance without any treatment of the data. It will be then possible with the command distancelaw to process this table.

The second way is to use the command distancelaw with the pairs file as input:

hicstuff distancelaw --average \
                     --big-arm-only \
                     --centromeres centromeres.txt \
                     --frags output/demo.frags.tsv \
                     --inf 3000 \
                     --outputfile-img output/demo_distance_law.svg \
                     --labels labels.txt \
                     --sup 500000 \
                     --pairs output/tmp/demo.valid_idx_filtered.pairs

For instance, this will create an image with the distance law generated from the pairs file given in input. The distance law will be the average between all the distance laws of the arms bigger than 500kb. The logspace used to plot it will have a base 1.1 by default. The limits of the x axis will be 3kb and 500kb.

Note that if hicstuff pipeline was given the --distance-law option, the output folder should contain a file named distance_law.txt containing the precomputed interaction frequencies. This file can be provided to the hicstuff distancelaw command using --dist-tbl distance_law.txt instead of the pairs file.

Operations on Hi-C matrices

All commands described below can take the output of hicstuff as input. They will accept either a bg2 matrix, a cool matrix or a graal matrix. Note that when using a graal matrix, you will usually need to specify the fragments file using --frags since genomic coordinates are not encoded in this matrix format.

Visualizing the matrix

Below are example commands that can be used to visualise Hi-C matrices with hicstuff.

# Viewing a normalized (=balanced) matrix in graal format at 5kb resolution
hicstuff view --binning 5kb --normalize --frags output/demo.frags.tsv output/demo.mat.tsv
# Viewing a log ratio of 2 matrices in cool format
hicstuff view sample1.cool sample2.cool
# Viewing a raw matrix in bedgraph2 format at 500bp resolution with log transformed contacts
hicstuff view --binning 500bp --transform log high_res.bg2

This will show an interactive heatmap using matplotlib. In order to save the matrix to a file instead, one could add --output output/demo.png

Note there are others options allowing to process the matrix which are documented in the help message.

Converting between formats

Output files from hicstuff pipeline can be converted between different format using the command hicstuff convert. For example to generate the file cool_output/demo.cool from files in the default hicstuff format:

hicstuff convert --frags output/demo.frags.tsv \
                 --chroms output/demo.chr.tsv \
                 --to cool \
                 output/demo.mat.tsv
                 output/demo

Notice that the command takes 2 positional arguments, the first one is the matrix file and the second is the prefix to give for output files, to which the extension will be added depending on the output format chosen. Input format is inferred automatically from the input matrix.

Rebinning existing matrices

Files previously produced by hicstuff pipeline can be rebinned at a lower resolutions using the hicstuff rebin command. This will generate a new matrix, a new fragments_list.txt and a new info_contigs.txt, all with updated number of bins:

hicstuff rebin -f output/demo.frags.tsv \
               -c output/demo.chr.tsv \
               --binning 1kb \
               output/demo.mat.tsv \
               rebin_1kb

When working with cool or bedgraph2 files, the command is simpler as we don’t need fragments and contig files:

# Rebins demo.cool at 10kb and saves the results to rebinned.cool
hicstuff rebin --binning 10kb demo.cool rebinned

Subsampling contacts

For many applications, differences in sequencing coverage will impact results. To avoid this, one can subsample contacts from Hi-C matrices to ensure the different samples to be compared have comparable signal. This functionality is implemented in the hicstuff subsample command, which allows to keep a fixed number of contacts from a matrix, or to extract a fraction of contacts:

# Keep 30% contacts in matrix.cool and save the results to subsamples_30.cool
hicstuff subsample --prop 0.5 matrix.cool subsample_30
# Keep 1 million contacts in matrix.bg2 and save the result in subsample_1M.bg2
hicstuff subsample --prop 1000000 matrix.bg2 subsample_1M