Fragment-level HiC correction

This module provides classes for fragment-level Hi-C data analysis. The base class “HiCdataset” can load, save and merge Hi-C datasets, apply various filters, and save binned heatmaps.

An additional class, HiCStatistics, contains methods to analyze Hi-C data at the fragment level. This includes read statistics, scalings, etc.

Input data

When used together with iterative mapping, this class can load files from h5dicts created by iterative mapping.

The class can also read any dictionary-like structure, such as a plain dictionary or an np.savez archive. The minimal required information is the positions of the two sides of each read, but providing strand information is advised.

If restriction fragment assignment is not provided, it will be automatically recalculated.
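For example, a minimal dict-like input can be built with np.savez (written to an in-memory buffer here; the field names follow the convention used by parseInputData, and all values are made up for illustration):

```python
import io

import numpy as np

# Minimal hypothetical input: chromosomes, cut positions and strands
# for both sides of three reads. All values are illustrative.
buf = io.BytesIO()
np.savez(
    buf,
    chrms1=np.array([0, 0, 1], dtype=np.int8),
    chrms2=np.array([0, 1, 1], dtype=np.int8),
    cuts1=np.array([15000, 42000, 7000], dtype=np.int64),
    cuts2=np.array([91000, 13000, 250000], dtype=np.int64),
    strands1=np.array([True, False, True]),
    strands2=np.array([False, True, True]),
)
buf.seek(0)

# np.load returns a dict-like object keyed by the same names,
# which is the kind of structure parseInputData expects.
data = np.load(buf)
print(sorted(data.files))
```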

Warning

A 1-bp difference in the positions of restriction sites will force certain algorithms, such as scaling calculations, to throw an exception. It is advised to supply restriction site data only if it was generated by the iterative mapping code.

Concepts

All read data is stored in a synchronized h5dict-based dictionary of arrays. Each variable has a fixed name and type, as specified in the self.vectors variable. Whenever the variable is accessed from the program, it is loaded from the h5dict.

Whenever a set of reads needs to be excluded from the dataset, the maskFilter method is called; it goes over all stored arrays and overwrites them. This method automatically rebuilds fragment data.
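The idea can be sketched in plain numpy (a toy stand-in, not the actual h5dict-backed implementation):

```python
import numpy as np

# Toy stand-in for the synchronized arrays: every per-read vector has
# the same length, and one boolean mask filters all of them at once.
vectors = {
    "chrms1": np.array([0, 0, 1, 2]),
    "chrms2": np.array([0, 1, 1, 2]),
    "cuts1": np.array([100, 200, 300, 400]),
}

def mask_filter(vectors, mask):
    """Keep only the reads where mask is True, across every vector."""
    return {name: arr[mask] for name, arr in vectors.items()}

cis = vectors["chrms1"] == vectors["chrms2"]   # e.g. keep cis reads only
filtered = mask_filter(vectors, cis)
print(filtered["cuts1"])   # [100 300 400]
```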

Filtering the data

This class has many built-in methods for filtering the data. However, one can easily construct another filter, as the one-liner examples below show.

>>> Dset = HiCdataset(**kwargs)
>>> Dset.fragmentFilter((Dset.ufragmentlen > 1000) * (Dset.ufragmentlen < 4000))
#Keep reads from fragments between 1kb and 4kb long.

>>> Dset.maskFilter(Dset.chrms1 == Dset.chrms2)  #keep only cis reads

>>> Dset.maskFilter((Dset.chrms1 != 14) * (Dset.chrms2 != 14))
#Exclude all reads touching chromosome 15 (yes, chromosome indices are zero-based!)

>>> Dset.maskFilter(Dset.dist1 + Dset.dist2 > 500)
#Keep only reads from random breaks, assuming 500 bp is the maximum molecule length
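In these one-liners, masks are combined with `*` and `+`, which act on numpy boolean arrays as element-wise AND and OR. A self-contained illustration with made-up chromosome indices:

```python
import numpy as np

# Illustrative chromosome indices for four reads
chrms1 = np.array([14, 14, 3, 14])
chrms2 = np.array([14, 2, 3, 5])

# On boolean arrays, '*' is element-wise AND and '+' is element-wise OR
both_off_14 = (chrms1 != 14) * (chrms2 != 14)   # drops every read touching chrom 14
any_off_14 = (chrms1 != 14) + (chrms2 != 14)    # drops only cis chrom-14 reads

print(both_off_14)
print(any_off_14)
```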

API documentation

class hiclib.fragmentHiC.HiCdataset(filename, genome, enzymeName=u'fromGenome', maximumMoleculeLength=500, inMemory=False, mode=u'a', tmpFolder=u'/tmp', dictToStoreIDs=u'dict')[source]

Base class to operate on HiC dataset.

This class stores all information about HiC reads on a hard drive. Whenever a variable corresponding to any record is used, it is loaded/saved from/to the HDD.

Applying any filter modifies the content of the current working file in place. Thus, to preserve the original data, load it into a separate working file before filtering.

Methods

buildAllHeatmap(resolution[, ...]) __optimized for large datasets__
buildHeatmapWithOverlapCpp(resolution[, ...]) __NOT optimized for large datasets__
evaluate(expression, internalVariables[, ...]) Still experimental class to perform evaluation of any expression on hdf5 datasets Note that out_variable should be writable by slices.
exitProgram(a)
filterByCisToTotal([cutH, cutL]) __NOT optimized for large datasets__
filterDuplicates([mode, tmpDir, chunkSize]) __optimized for large datasets__
filterExtreme([cutH, cutL]) __optimized for large datasets__
filterLarge([cutlarge, cutsmall]) __optimized for large datasets__
filterOrientation() __NOT optimized for large datasets__
filterRsiteStart([offset]) __optimized for large datasets__
filterTooClose([minRsitesDist]) __NOT optimized for large datasets__
fragmentFilter(fragments) __optimized for large datasets__
fragmentSum([fragments, strands, useWeights]) __optimized for large datasets__
getFragmentHeatmap([absFragIdx1, absFragIdx2]) returns fragment heatmap from absFragIdx1 to absFragIdx2 not including the end
getHiResHeatmap(resolution, chromosome[, ...])
getHiResHeatmapWithOverlaps(resolution, ...)
iterativeCorrection([numsteps, normToLen]) Perform fragment-based iterative correction of Hi-C data.
iterativeCorrectionFromMax([minimumCount, ...]) TODO: rewrite this to account for a new fragment model
load(filename[, buildFragments]) Loads dataset from file to working file; check for inconsistency
maskFilter(mask) __optimized for large datasets__
merge(filenames) combines data from multiple datasets
parseInputData(dictLike[, zeroBaseChrom]) __NOT optimized for large datasets__
plotRsiteStartDistribution([offset, length]) run plt.show() after this function.
plotScaling([fragids1, fragids2, ...]) plots scaling, possibly using a subset of fragments or weights
printMetadata([saveTo])
printStats()
save(filename) Saves dataset to filename, does not change the working file.
saveByChromosomeHeatmap(filename[, ...]) Saves chromosome by chromosome heatmaps to h5dict.
saveHeatmap(filename[, resolution, ...]) Saves heatmap to filename at given resolution.
saveHiResHeatmapWithOverlaps(filename, ...) Creates within-chromosome heatmaps at very high resolution, assigning each fragment to all the bins it overlaps with, proportional to the area of overlaps.
saveSuperHighResMapWithOverlaps(filename, ...) Creates within-chromosome heatmaps at very high resolution, assigning each fragment to all the bins it overlaps with, proportional to the area of overlaps.
setSimpleHighResHeatmap()
updateGenome(newGenome[, oldGenome, putMetadata]) __partially optimized for large datasets__
writeFilteringStats()

__init__ method

Initializes an empty dataset by default. Depending on the mode argument, it can instead open and work with an existing dataset.

Parameters:

filename : string

A filename to store HiC dataset in an HDF5 file.

genome : folder with genome, or Genome object

A folder with fastq files of the genome and gap table from Genome browser. Alternatively, mirnylib.genome.Genome object.

maximumMoleculeLength : int, optional

Maximum length of molecules in the HiC library, used as a cutoff for dangling ends filter

inMemory : bool, optional

Create dataset in memory. Filename is ignored then, but still needs to be specified.

mode : str

‘r’ - Readonly, file must exist

‘r+’ - Read/write, file must exist

‘w’ - Create file, overwrite if exists

‘w-‘ - Create file, fail if exists

‘a’ - Read/write if exists, create otherwise (default)

dictToStoreIDs : dict-like or “dict” or “h5dict”

A dictionary to store rsite IDs. If “dict”, store them in memory. If “h5dict”, create a default h5dict (in the /tmp folder). Any other object is used directly, whether it is an h5dict or a dictionary.

Methods

buildAllHeatmap(resolution, countDiagonalReads=u'Once', useWeights=False)[source]

__optimized for large datasets__ Creates an all-by-all heatmap in accordance with mapping provided by ‘genome’ class

Parameters:

resolution : int or str

Resolution of a heatmap. May be an int or ‘fragment’ for restriction fragment resolution.

countDiagonalReads : “once” or “twice”

How many times to count reads in the diagonal bin

useWeights : bool

If True, then take weights from ‘weights’ variable. False by default.
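The core binning step can be sketched in plain numpy (a single-chromosome toy example; the real method maps positions through the Genome object, and all values below are illustrative):

```python
import numpy as np

# Toy single-chromosome version of heatmap building: bin each read pair
# by its two cut positions and symmetrize.
resolution = 1000
cuts1 = np.array([150, 2300, 2600, 900])
cuts2 = np.array([2100, 2500, 150, 950])

n_bins = 3
bins1, bins2 = cuts1 // resolution, cuts2 // resolution

heatmap = np.zeros((n_bins, n_bins))
np.add.at(heatmap, (bins1, bins2), 1)   # handles repeated (i, j) pairs correctly
heatmap += heatmap.T                    # symmetrize; diagonal is now doubled
heatmap[np.diag_indices_from(heatmap)] /= 2   # i.e. countDiagonalReads="Once"
print(heatmap)
```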

buildHeatmapWithOverlapCpp(resolution, countDiagonalReads=u'Twice', maxBinSpawn=10)[source]

__NOT optimized for large datasets__ Creates an all-by-all heatmap in accordance with mapping provided by ‘genome’ class

This method assigns fragments to all bins which the fragment overlaps, proportionally

Parameters:

resolution : int or str

Resolution of a heatmap. May be an int or ‘fragment’ for restriction fragment resolution.

countDiagonalReads : “once” or “twice”

How many times to count reads in the diagonal bin

maxBinSpawn : int, optional, not more than 10

Discard a read if it spans more than maxBinSpawn bins

evaluate(expression, internalVariables, externalVariables={}, constants={u'np': <module 'numpy'>}, outVariable=u'autodetect', chunkSize=u'default')[source]

Still experimental method to evaluate an arbitrary expression on hdf5 datasets. Note that out_variable should be writable by slices.

(TODO: it would be great to autodetect the values of internal variables by parsing the expression.)

Note

See example of usage of this class in filterRsiteStart, parseInputData, etc.

Warning

Please avoid passing internal variables as “self.cuts1” - use “cuts1”

Warning

You have to pass all the modules and functions (e.g. np) in a “constants” dictionary.

Parameters:

expression : str

Mathematical expression, single or multi line

internal_variables : list of str

List of variables (“chrms1”, etc.), used in the expression

external_variables : dict , optional

Dict of {str:array}, where str indicates name of the variable, and array - value of the variable.

constants : dict, optional

Dictionary of constants to be used in the evaluation. Because evaluation happens without namespace, you should include numpy here if you use it (included by default)

out_variable : str or tuple or None, optional

Variable to output the data. Either internal variable, or tuple (name,value), where value is an array
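The chunked-evaluation idea can be sketched as follows (a simplified stand-in: the real evaluate() works on hdf5-backed arrays and can autodetect the output; the function below is hypothetical):

```python
import numpy as np

# Hedged sketch of chunk-wise evaluation: apply an expression to
# fixed-size slices so large arrays never have to load whole.
def evaluate_chunked(expression, internal, out, chunk_size, constants):
    """Evaluate `expression` on chunk_size-long slices of the arrays."""
    n = len(out)
    for start in range(0, n, chunk_size):
        local = {k: v[start:start + chunk_size] for k, v in internal.items()}
        local.update(constants)          # modules like np must be passed in
        out[start:start + chunk_size] = eval(expression, {}, local)
    return out

cuts1 = np.arange(10, dtype=float)
result = np.zeros(10)
evaluate_chunked("np.sqrt(cuts1)", {"cuts1": cuts1}, result,
                 chunk_size=4, constants={"np": np})
```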

filterByCisToTotal(cutH=0.0, cutL=0.01)[source]

__NOT optimized for large datasets__ Remove fragments with too low or too high cis-to-total ratio.

Parameters:

cutH : float, 0<=cutH<1, optional

Fraction of the fragments with the largest cis-to-total ratio to be removed.

cutL : float, 0<=cutL<1, optional

Fraction of the fragments with the lowest cis-to-total ratio to be removed.
filterDuplicates(mode=u'auto', tmpDir=u'default', chunkSize=100000000)[source]

__optimized for large datasets__ removes duplicate molecules
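The underlying idea: two molecules are duplicates when their chromosomes and both cut positions coincide. A toy numpy sketch (the actual method is chunked for large datasets; all values below are illustrative):

```python
import numpy as np

# A molecule is a duplicate when chromosome and both cut positions
# coincide with an earlier read; keep only the first copy.
chrms1 = np.array([0, 0, 0, 1])
cuts1 = np.array([100, 100, 250, 100])
cuts2 = np.array([900, 900, 300, 900])

pairs = np.stack([chrms1, cuts1, cuts2], axis=1)
# return_index gives the first occurrence of each unique row (stable sort)
_, first = np.unique(pairs, axis=0, return_index=True)

mask = np.zeros(len(pairs), dtype=bool)
mask[first] = True
print(mask)   # the second read duplicates the first
```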

filterExtreme(cutH=0.005, cutL=0)[source]

__optimized for large datasets__ removes fragments with most and/or least # counts

Parameters:

cutH : float, 0<=cutH < 1, optional

Fraction of the most-counts fragments to be removed

cutL : float, 0<=cutL<1, optional

Fraction of the least-counts fragments to be removed
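A plain-numpy sketch of the fraction-based cut (illustrative counts, not the hiclib implementation):

```python
import numpy as np

# Keep fragments between the cutL-th and (1 - cutH)-th count quantiles.
counts = np.array([1, 5, 7, 2, 100, 3, 4, 6, 8, 2])
cutH, cutL = 0.1, 0.1

order = np.argsort(counts)
n = len(counts)
low = int(np.floor(n * cutL))         # drop this many smallest
high = n - int(np.floor(n * cutH))    # ... and this many largest

keep = np.zeros(n, dtype=bool)
keep[order[low:high]] = True
print(counts[keep])
```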

filterLarge(cutlarge=100000, cutsmall=100)[source]

__optimized for large datasets__ removes very large and small fragments

Parameters:

cutlarge : int

remove fragments larger than it

cutsmall : int

remove fragments smaller than it

filterOrientation()[source]

__NOT optimized for large datasets__

filterRsiteStart(offset=5)[source]

__optimized for large datasets__ Removes reads that start within offset bp of a restriction site

Parameters:

offset : int

Number of bp to exclude next to the rsite (the offset position itself is not excluded)

filterTooClose(minRsitesDist=2)[source]

__NOT optimized for large datasets__ Remove fragment pairs separated by fewer than minRsitesDist restriction sites within the same chromosome.

fragmentFilter(fragments)[source]

__optimized for large datasets__ keeps only reads that originate from fragments in the ‘fragments’ variable; for DS reads, both sides must qualify

Parameters:

fragments : numpy.array of fragment IDs or bools

List of fragments to keep, or their indexes in self.rFragIDs

fragmentSum(fragments=None, strands=u'both', useWeights=False)[source]

__optimized for large datasets__ returns sum of all counts for a set or subset of fragments

Parameters:

fragments : list of fragment IDs, optional

Use only these fragments. By default all fragments are used.

strands : 1,2 or “both” (default)

Use only the first or second side of the read (the first side includes SS reads, the second does not)

useWeights : bool, optional

If set to True, will give a fragment sum with weights adjusted for iterative correction.
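In plain numpy, the per-fragment sum amounts to a bincount over the fragment-ID vectors of both read sides (names and values below are illustrative):

```python
import numpy as np

# Total counts per fragment, counting both sides of every read
fragids1 = np.array([0, 0, 2, 1])
fragids2 = np.array([1, 2, 2, 0])
n_frag = 3

frag_sum = (np.bincount(fragids1, minlength=n_frag)
            + np.bincount(fragids2, minlength=n_frag))
print(frag_sum)   # [3 2 3]
```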

getFragmentHeatmap(absFragIdx1=0, absFragIdx2=None)[source]

returns fragment heatmap from absFragIdx1 to absFragIdx2, not including the end. Sort of works, but is still buggy.

iterativeCorrection(numsteps=10, normToLen=False)[source]

Perform fragment-based iterative correction of Hi-C data.
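The fragment-level details live in the source; the underlying matrix-balancing idea can be sketched on a small dense heatmap (a minimal Sinkhorn-style sketch under that assumption, not the hiclib implementation):

```python
import numpy as np

def iterative_correction(matrix, numsteps=10):
    """Balance a symmetric contact map so all rows sum to the same value."""
    m = matrix.astype(float).copy()
    bias = np.ones(len(m))
    for _ in range(numsteps):
        s = m.sum(axis=1)
        s /= s.mean()            # relative coverage of each row
        bias *= s
        m /= s[:, None]          # divide rows...
        m /= s[None, :]          # ...and columns by the coverage
    return m, bias

heatmap = np.array([[10.,  4.,  2.],
                    [ 4., 20.,  6.],
                    [ 2.,  6.,  8.]])
corrected, bias = iterative_correction(heatmap, numsteps=50)
print(corrected.sum(axis=1))   # all row sums converge to the same value
```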

iterativeCorrectionFromMax(minimumCount=50, precision=0.01)[source]

TODO: rewrite this to account for a new fragment model

load(filename, buildFragments=u'deprecated')[source]

Loads dataset from file to working file; checks for inconsistency

maskFilter(mask)[source]

__optimized for large datasets__ keeps only reads designated by mask

Parameters:

mask : array of bools

Indexes of reads to keep

merge(filenames)[source]

combines data from multiple datasets

Parameters:

filenames : list of strings

List of folders to merge to current working folder

parseInputData(dictLike, zeroBaseChrom=True, **kwargs)[source]

__NOT optimized for large datasets__ (use chunking as suggested in pipeline2015) Inputs data from a dictionary-like object, containing coordinates of the reads. Performs filtering of the reads.

A good example of a dict-like object is a numpy.savez

Warning

Restriction fragments MUST be specified exactly as in the Genome class.

Warning

Strand information is needed for proper scaling calculations, but will be imitated if not provided

Parameters:

dictLike : dict or dictLike object, or string with h5dict filename

Input reads

dictLike[“chrms1,2”] : array-like

Chromosomes of 2 sides of the read

dictLike[“cuts1,2”] : array-like

Exact position of cuts

dictLike[“strands1,2”], essential : array-like

Direction of the read

dictLike[“rsites1,2”], optional : array-like

Position of rsite to which the read is pointing

dictLike[“uprsites1,2”] , optional : array-like

rsite upstream (larger genomic coordinate) of the cut position

dictLike[“downrsites1,2”] , optional : array-like

rsite downstream (smaller genomic coordinate) of the cut position

zeroBaseChrom : bool , optional

Use zero-base chromosome counting if True, one-base if False

enzymeToFillRsites : None or str, optional if rsites are specified

Enzyme name to use with Bio.restriction

keepSameFragment : bool, optional

Do not remove same fragment reads

noFiltering : bool, optional

If True then no filters are applied to the data. False by default. Overrides removeSS. Experimental, do not use if you are not sure.

keepSingleSided : bool, optional

Do not remove SS reads

keepReadsMolecules : bool, optional

Do not remove “–> <–” read pairs shorter than the size-selection cutoff

plotRsiteStartDistribution(offset=5, length=200)[source]

run plt.show() after this function.

plotScaling(fragids1=None, fragids2=None, useWeights=False, excludeNeighbors=None, enzyme=None, normalize=True, normRange=None, withinArms=True, mindist=1000, maxdist=None, regions=None, appendReadCount=True, **kwargs)[source]

plots scaling, possibly using a subset of fragments or weights; possibly normalizes after calculating

Plan of scaling calculation:

  1. Subdivide the whole genome into regions:

    1. Different chromosomes

    2. Different arms

    3. User-defined squares/rectangles on a contact map:

      -(chromosome, start, end) square around the diagonal

      -(chr, st1, end1, st2, end2) rectangle

  2. Use either all fragments, or only interactions between two groups of fragments (e.g. to compare scaling for small fragments with that for large ones, to test Hi-C protocol issues, or to see the effect of weights).

  3. (optional) Calculate the correction associated with the fragment length dependence.

  4. Subdivide all possible genomic separations into log-spaced bins.

  5. Calculate the expected number of fragment pairs within each bin (possibly with weights from step 3). If exclusion of neighbors is specified, the expected number of fragments accounts for it.

Parameters:

fragids1, fragids2 : np.array of fragment IDs, optional

Scaling is calculated only for interactions between fragids1 and fragids2. If omitted, all fragments are used. If a boolean array is supplied, it serves as a mask for fragments.

useWeights : bool, optional

Use weights calculated from fragment length

excludeNeighbors : int or None, optional

If None, all fragment pairs are considered. If integer, only fragment pairs separated by at least this number of r-fragments are considered.

enzyme : string (“HindIII”,”NcoI”)

If excludeNeighbors is used, you have to specify restriction enzyme

normalize : bool, optional

Do an overall normalization of the answer, by default True.

withinArms : bool, optional

Set to false to use whole chromosomes instead of arms

mindist, maxdist : int, optional

Use separations from mindist to maxdist

regions : list of (chrom,start,end) or (ch,st1,end1,st2,end2), optional

Restrict scaling calculation to only certain squares of the map

appendReadCount : bool, optional

Append read count to the plot label

plot : bool, optional

If False then do not display the plot. True by default.

**kwargs : optional

All other keyword args are passed to plt.plot

Returns:

(bins, probabilities) : values to plot on the scaling plot
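The core of the computation, log-spaced distance bins normalized to probability per unit separation, can be sketched in plain numpy (synthetic separations; the real method also normalizes by the expected number of fragment pairs per bin):

```python
import numpy as np

# Synthetic cis-read separations with a heavy tail (illustrative only)
rng = np.random.default_rng(0)
dists = rng.pareto(1.0, 20000) * 1000 + 1000

mindist, maxdist = 1000, 1e6
bins = np.logspace(np.log10(mindist), np.log10(maxdist), 20)
counts, edges = np.histogram(dists, bins=bins)

widths = np.diff(edges)
prob = counts / counts.sum() / widths     # probability per unit separation
mids = np.sqrt(edges[:-1] * edges[1:])    # geometric bin centers
# (mids, prob) is what a scaling plot displays, typically on log-log axes
```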

save(filename)[source]

Saves dataset to filename, does not change the working file.

saveByChromosomeHeatmap(filename, resolution=10000, includeTrans=True, countDiagonalReads=u'Once')[source]

Saves chromosome-by-chromosome heatmaps to an h5dict. This method is not as memory-demanding as saving an all-by-all heatmap.

Keys of the h5dict are of the format [“1 14”], where chromosomes are zero-based, and there is one space between numbers.

Warning

Chromosome numbers are always zero-based.

Only “chr3” labels are one-based in this package.

Parameters:

filename : str

Filename of the h5dict with the output

resolution : int

Resolution to save heatmaps

includeTrans : bool, optional

Build inter-chromosomal heatmaps (default: True)

countDiagonalReads : “once” or “twice”

How many times to count reads in the diagonal bin

saveHeatmap(filename, resolution=1000000, countDiagonalReads=u'Once', useWeights=False, useFragmentOverlap=False, maxBinSpawn=10)[source]

Saves heatmap to filename at a given resolution. For small genomes where the number of fragments per bin is small, please set useFragmentOverlap to True. This will assign each fragment to all bins that the fragment spans.

Parameters:

filename : str

Filename of the output h5dict

resolution : int or str

Resolution of a heatmap. May be an int or ‘fragment’ for restriction fragment resolution.

countDiagonalReads : “once” or “twice”

How many times to count reads in the diagonal bin

useWeights : bool

If True, then take weights from ‘weights’ variable. False by default. If using iterativeCorrectionFromMax (fragment-level IC), use weights.

useFragmentOverlap : bool (optional)

Set this to True if you have few fragments per bin (bin size < 20 kb for HindIII). It will consume more RAM and be slower.

saveHiResHeatmapWithOverlaps(filename, resolution, countDiagonalReads=u'Twice', maxBinSpawn=10, chromosomes=u'all')[source]

Creates within-chromosome heatmaps at very high resolution, assigning each fragment to all the bins it overlaps with, proportional to the area of overlaps.

Parameters:

resolution : int or str

Resolution of a heatmap.

countDiagonalReads : “once” or “twice”

How many times to count reads in the diagonal bin

maxBinSpawn : int, optional, not more than 10

Discard a read if it spans more than maxBinSpawn bins

saveSuperHighResMapWithOverlaps(filename, resolution, chunkSize=20000000, chunkStep=10000000, countDiagonalReads=u'Twice', maxBinSpawn=10, chromosomes=u'all')[source]

Creates within-chromosome heatmaps at very high resolution, assigning each fragment to all the bins it overlaps with, proportional to the area of overlaps.

Parameters:

resolution : int or str

Resolution of a heatmap.

countDiagonalReads : “once” or “twice”

How many times to count reads in the diagonal bin

maxBinSpawn : int, optional, not more than 10

Discard a read if it spans more than maxBinSpawn bins

updateGenome(newGenome, oldGenome=u'current', putMetadata=False)[source]

__partially optimized for large datasets__ Updates the dataset to a new genome with fewer chromosomes. Use it to delete chromosomes. By default, removes all DS reads involving the deleted chromosomes.

Parameters:

newGenome : Genome object

Genome to replace the old genome, with fewer chromosomes

removeSSreads : “trans”(default), “all” or “none”

“trans”: remove all reads from deleted chromosomes, ignore the rest. “all”: remove all SS reads from all chromosomes. “none”: mark all trans reads as SS reads.

putMetadata : bool (optional)

Writes metadata for M and Y reads

oldGenome : Genome object or idx2label of old genome, optional