Some resources:
My research proposal (attached) - explains the kind of analysis we did so far. I'll add my thesis here once it's written.
A page with links to fits and other figures produced by this analysis
A Quickstart guide - how to get the code, set it up and run the main flows
An overview of how the code is organized
"Hierarchical priors" (attached) - a document on how we use different levels of priors for our regression problem
"Minimization formulas" (attached) - derives the expressions used in the code for the objective function and its gradients
"Bootstrap" (attached) - discusses our approach to using bootstrap to get more robust estimates for the change distributions
Requested features
(0) Compatibility. Add a symbolic link so that files referencing ronniemaor/HTR/ do not fail.
(1) Scalability
-- png files under ...gene-region-fits should be placed in separate directories by gene name,
e.g. gene-region-fits/CALB1. Otherwise we end up with ~300K files in a single directory, which is bad for NFS and confuses matlab.
-- Track the number of spline fits that were computed, without merging the pkl cache files.
-- Add a python script that distributes jobs over multiple machines, by issuing commands that look like:
ssh $CTX "cd $DIR; python compute_fits.py @compute_fits.args --dataset $DATASET --pathway $PATHWAY"
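A minimal sketch of such a script, assuming a hypothetical host list and illustrative argument values (the host names, directory path, and the use of --part here are placeholders, not part of the repository):

```python
"""Sketch: launch compute_fits.py on several machines over ssh.
HOSTS, DIR and the dataset/pathway values are illustrative assumptions."""
import subprocess

HOSTS = ['ctx03', 'ctx04', 'ctx05']          # hypothetical machine names
DIR = '~/projects/timefit/code/scripts'      # hypothetical remote path

def make_command(host, dataset, pathway, part, nparts):
    # Build the remote command; --part k/n splits the genes across machines
    remote = ('cd {dir}; python compute_fits.py @compute_fits.args '
              '--dataset {ds} --pathway {pw} --part {k}/{n}').format(
                  dir=DIR, ds=dataset, pw=pathway, k=part, n=nparts)
    return ['ssh', host, remote]

def main():
    # Start one job per host, then wait for all of them to finish
    procs = [subprocess.Popen(make_command(host, 'brainspan2014', 'all',
                                           i + 1, len(HOSTS)))
             for i, host in enumerate(HOSTS)]
    for p in procs:
        p.wait()

if __name__ == '__main__':
    main()
```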
1. Installation prerequisites
Packages are already installed on cortex.
To install on your local machine or laptop, you need to install git and python.
git. On Linux: sudo apt-get install git. On Windows I recommend gitextensions: a great git GUI that also installs the Windows port of git itself.
On Linux:
sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
sudo apt-get install python-pip
pip install -U scikit-learn
pip install jinja2
On Windows a great way to start is to install winpython, which comes with the Spyder IDE and all the above libraries.
If you work with virtualenv, then after you get the code using git (see below), you can use the file requirements.txt in the project's root folder to install all the required libraries, by running
pip install -r requirements.txt
PROJECTDIR=~/projects/timefit
mkdir $PROJECTDIR
cd $PROJECTDIR
Create a sibling directory data, linked to the existing data directory:
cd $PROJECTDIR; ln -s /cortex/users/ronniemaor/timefit/data/ data
Create two directories, cache and results, on /cortex (they will contain large files) and link them from your project directory.
cd $PROJECTDIR
mkdir /cortex/users/$USER/timefits/cache
ln -s /cortex/users/$USER/timefits/cache
mkdir /cortex/users/$USER/timefits/results
ln -s /cortex/users/$USER/timefits/results
Get the code from github to a subdirectory called code:
git clone git@github.com:ronniemaor/timefit.git code
It is also recommended to copy the initial contents of the cache and results from other people to save time in reproducing previous runs. For example:
cp -r /cortex/users/ronniemaor/timefit/cache/* $PROJECTDIR/cache
cp -r /cortex/users/ronniemaor/timefit/results/* $PROJECTDIR/results
(this might take a while)
To run the main flow:
cd $PROJECTDIR/code/scripts
python compute_fits.py -v \
--dataset brainspan2014 \
--from_age postnatal \
--shape sigslope --sigma_prior normal --priors sigslope80 \
--pathway all --scaling none --html --mat
The script do_one_fit.py fits a shape for a single gene/region:
python <projdir>/code/scripts/do_one_fit.py
This creates the file <projdir>/results/fit.png
For more options:
python do_one_fit.py --help
The main script, code/scripts/compute_fits.py, has many options. To see them and basic usage, run with --help:
python compute_fits.py --help
--pathway PATHWAY Specifies the set of genes to use. When PATHWAY is serotonin, cannabinoids or all, a predefined set of gene names is used. Otherwise PATHWAY is treated as a file path (relative to the data directory) containing the list of genes. Accepted file formats: files ending in .mat must be matlab files with a single variable that is a cell array of strings; any other file is read as text, with gene names separated by whitespace or commas.
--mat Create a matlab file with the fits data
--html Create figures and an html file pointing to them in a nice table
--dataset DATASET. One of brainspan2014, kang2011, etc. Use --help for the full list.
--change_dist. Compute the distribution of changes.
--from_age postnatal. Ignore prenatal timepoints.
--shape SHAPE. One of sigmoid, poly1 or spline. For sigmoid fits, the priors are set with: --shape sigslope --sigma_prior normal --priors sigslope80
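For illustration, here is a sketch of how a PATHWAY file could be parsed according to the rules above (the function name is hypothetical; the real logic lives inside the project's data-loading code):

```python
"""Sketch: read a gene list per the --pathway file rules.
read_gene_list is a hypothetical helper, not part of the codebase."""
import re

def read_gene_list(path):
    if path.endswith('.mat'):
        # .mat files hold a single variable: a cell array of strings
        from scipy.io import loadmat
        mat = loadmat(path)
        var = [k for k in mat if not k.startswith('__')][0]
        return [str(cell[0]) for cell in mat[var].flatten()]
    # Plain text: gene names separated by whitespace or commas
    with open(path) as f:
        return [g for g in re.split(r'[,\s]+', f.read()) if g]
```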
You can also read arguments from a text file by using @. For example,
python compute_fits.py -v @compute_fits.args
will read a set of arguments from the file compute_fits.args.
Most options accept one value. If they are specified twice then the last value specified is used. This behavior allows you to use the file as defaults and override some of the values on the command line.
Lines starting with # in the text file are ignored (treated as comments).
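An illustrative compute_fits.args might look like this (the values are examples only; this assumes a line may hold several tokens that are split on whitespace):

```
# compute_fits.args - example defaults (illustrative values)
-v
--dataset brainspan2014
--from_age postnatal
--shape sigslope --sigma_prior normal --priors sigslope80
--pathway all
```

Since the last value of a repeated option wins, a run like python compute_fits.py @compute_fits.args --dataset kang2011 keeps these defaults but overrides the dataset.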
Currently for each gene, regions are fit in parallel using N-1 processes, where N is the number of cores on your machine.
If you're fitting many genes, use the --part k/n option to split the work on several machines, e.g.
ctx03>> cd $PROJECTDIR; python compute_fits.py --shape poly1 --part 1/3
ctx04>> python compute_fits.py --shape poly1 --part 2/3
ctx05>> python compute_fits.py --shape poly1 --part 3/3
Each of these will compute part of the genes and write the fits to files named e.g. <base filename>.pkl.2_of_3
After all the partial jobs are done, run again without --part. The "main" run will detect the cached partial results and consolidate all the parts into one file.
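The exact partitioning scheme is internal to compute_fits.py; as an illustration only, a contiguous k-of-n split could look like:

```python
"""Sketch: divide a gene list into part k of n (1-based), as --part k/n might.
The contiguous-chunk scheme here is an assumption for illustration."""

def split_part(genes, k, n):
    # Ceiling division so every gene lands in exactly one part
    size = (len(genes) + n - 1) // n
    return genes[(k - 1) * size : k * size]
```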
code/timing/script_export_to_matlab.py
There are quite a few other scripts in the scripts, timing and writing subdirectories (most are in scripts; timing contains code related to analyzing change distributions, some of which was ported to compute_fits.py; writing contains scripts with hardcoded values that produce the figures for my RP and thesis).
These scripts use the main library for specific ad-hoc needs. They are in varying states of disrepair: most should work or require minor fixes, but those I haven't used for a while will probably need at least some work to get running.