
(2012) Unique mappability tracks for several species

I have generated per-base unique mappability tracks for a large range of read lengths for several key species.

Download URL

Readme

Each directory corresponds to a particular assembly of a species and contains a file named globalmap_k<min>tok<max>.tgz. Extracting this tar.gz archive creates a directory called global_k<min>tok<max>, which contains one binary file per chromosome representing the unique mappability of that chromosome. Each track simultaneously encodes mappability at all read lengths from <min> to <max>.
    (a) The files are in uint8 (unsigned 8-bit integer) binary format.
    (b) Each file is a vector of unsigned 8-bit integers whose length equals the length of the chromosome. The elements of the vector are >= 0.
    (c) A value of 'x' at a position means that the position is PERFECTLY unique in the genome for all k-mers of length >= x starting at that position on the + strand.
    (d) A value of 0 at a position means that the position is not unique for any of the k-mer lengths (k=<min> to <max>).
    (e) To obtain the uniqueness map for a particular k, apply the following operation to the vector: (vector > 0) & (vector <= k)
    (f) To obtain the uniqueness map for the - strand, right-shift the + strand uniqueness map for k-mers of length <len> by <len-1> positions,
        i.e. if position 1 is UNIQUE on the + strand for <len=3> then position 3 is UNIQUE on the - strand.
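
The thresholding and strand-shift operations described above can be sketched in Python using only the standard library. This is a minimal illustration, not the code used to build the tracks: the function names unique_map_for_k and minus_strand_map are hypothetical, the toy vector uses made-up values, and filling the first <len-1> positions of the - strand map with 0 is an assumption, since the readme does not say how those positions should be treated. Note that Python indexes from 0, whereas the example above counts positions from 1.

```python
def unique_map_for_k(vector: bytes, k: int) -> bytes:
    """Return a 0/1 map: 1 where a k-mer starting at that + strand position is unique.

    Implements (vector > 0) & (vector <= k) via a 256-entry lookup table.
    """
    table = bytes(1 if 0 < value <= k else 0 for value in range(256))
    return vector.translate(table)


def minus_strand_map(plus_map: bytes, k: int) -> bytes:
    """Right-shift the + strand 0/1 map for length k by k-1 positions.

    Assumption: the first k-1 positions, which have no source value, are set to 0.
    """
    shift = k - 1
    n = len(plus_map)
    if shift >= n:
        return bytes(n)  # every position is shifted out of range
    return bytes(shift) + plus_map[:n - shift]


# Toy track with made-up values (a real track holds one value per base).
toy_vector = bytes([0, 2, 3, 5, 0, 2, 4, 0])
plus = unique_map_for_k(toy_vector, 3)
minus = minus_strand_map(plus, 3)
print(list(plus))   # [0, 1, 1, 0, 0, 1, 0, 0]
print(list(minus))  # [0, 0, 0, 1, 1, 0, 0, 1]
```

The lookup table keeps the whole operation inside bytes.translate, which avoids building a Python list with one element per base.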

Example Usage

How to read the files in matlab
% First gunzip and untar the globalmap_k20tok54.tgz file
% You will see one file for each chromosome
tmp_uMap = fopen('chr1.uint8.unique','r');
uMapdata = fread(tmp_uMap,'*uint8');
fclose(tmp_uMap);
You can similarly read the files in any other programming language as a vector of unsigned 8-bit integers. Convert the vector to doubles if you like (although this is a waste of memory) or write it out as a text file if you prefer.
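
As one illustration of reading the files in another language, the sketch below loads a track in Python without any third-party packages. A bytes object already behaves as a vector of unsigned 8-bit integers, so no conversion is needed. The function name read_unique_map is hypothetical, and the file name is the one used in the MATLAB example; the script assumes the archive has already been extracted into the working directory and prints nothing if the file is absent.

```python
import os


def read_unique_map(path: str) -> bytes:
    """Read one per-chromosome track as raw bytes (one uint8 value per base)."""
    with open(path, "rb") as handle:
        return handle.read()


if __name__ == "__main__":
    path = "chr1.uint8.unique"  # file name taken from the MATLAB example
    if os.path.exists(path):
        track = read_unique_map(path)
        print("chromosome length:", len(track))
        # Nonzero entries are positions that are unique for at least one k.
        print("positions unique for some k:", len(track) - track.count(0))
```

Indexing the result, for example track[0], returns a Python integer between 0 and 255, so the values can be compared against k directly.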