We developed a grid-based optimization method to search for the ChIP-seq peak-calling parameters that best identify cell identity genes. The flow chart is shown in Figure 2A. The grid is constructed from three parameters: peak height, the left boundary of the peak selection region, and the right boundary of the peak selection region.
In the first round of the grid-based survey, the entire grid is divided into 1,000 equal-sized cells (10 values for each parameter). The parameter set at the center of each cell is used to perform peak selection and to assign a peak width or peak height value to each gene. The genes are then ranked by either peak width (Broadness) or peak height (Sharpness). Finally, the enrichment P value of each gene category among the top 500 genes is calculated with Fisher's exact test. The cell with the best P value is carried forward to the next round of the survey.
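The enrichment step can be sketched as follows. This is a minimal illustration, not the DANPOS2 code: the function name `enrichment_p_value` and its arguments are hypothetical, the test is assumed to be one-sided (over-representation), and gene identifiers in the ranked list are assumed to be unique. The P value is the upper tail of the hypergeometric distribution, which is what a one-sided Fisher's exact test on the 2 × 2 table computes.

```python
from math import comb


def enrichment_p_value(ranked_genes, category_genes, top_n=500):
    """One-sided Fisher's exact test for a gene category among the top-ranked genes.

    ranked_genes   : list of unique gene IDs, best-ranked first (assumption)
    category_genes : iterable of gene IDs in the category being tested
    top_n          : size of the top list (500 in the text)
    """
    universe = set(ranked_genes)
    category = set(category_genes) & universe   # ignore genes that were never ranked
    top = set(ranked_genes[:top_n])

    total = len(universe)             # all ranked genes
    in_category = len(category)       # category genes among them
    drawn = len(top)                  # genes in the top list
    overlap = len(top & category)     # category genes in the top list

    # P(X >= overlap) under the hypergeometric null, using exact integer arithmetic.
    tail = sum(
        comb(in_category, i) * comb(total - in_category, drawn - i)
        for i in range(overlap, min(in_category, drawn) + 1)
    )
    return tail / comb(total, drawn)
```

For example, with 10 ranked genes, a category of 2 genes and a top list of 2, the P value when both category genes rank at the top is 1/45. Where scipy is available, `scipy.stats.fisher_exact` with `alternative="greater"` should give the same quantity.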
In the second round of the grid-based survey, the candidate cell from the previous round is first expanded by two additional cells on each side in every dimension, giving a larger grid of 125 (5 × 5 × 5) cells. This new grid is then surveyed in the same way as in the previous round.
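The construction of the first-round grid and the expansion around the best cell can be sketched as below. The helper names (`cell_centers`, `first_round`, `expand_best_cell`) and the example ranges are hypothetical, and the sketch reflects one reading of the text: the best cell plus two neighbouring cells on each side of every axis gives 5 × 5 × 5 = 125 cells. Clipping the expanded region to the original parameter bounds is left out for brevity.

```python
from itertools import product


def cell_centers(low, high, n_cells):
    """Centers of n_cells equal-width cells spanning [low, high]."""
    step = (high - low) / n_cells
    return [low + (i + 0.5) * step for i in range(n_cells)]


def first_round(ranges, n_cells=10):
    """Parameter sets for one round: the center of every cell (10**3 = 1000 for 3 axes)."""
    axes = [cell_centers(low, high, n_cells) for low, high in ranges]
    return list(product(*axes))


def expand_best_cell(center, steps, pad=2):
    """Bounds of the best cell widened by `pad` cells on each side of every axis."""
    return [(c - (pad + 0.5) * s, c + (pad + 0.5) * s) for c, s in zip(center, steps)]


# Hypothetical ranges: peak height, left boundary, right boundary.
ranges = [(0, 10), (0, 100), (0, 100)]
parameter_sets = first_round(ranges)          # 1000 candidate parameter sets
best_center = (4.5, 45.0, 45.0)               # pretend this cell had the best P value
next_ranges = expand_best_cell(best_center, steps=(1.0, 10.0, 10.0))
```

Passing `next_ranges` back into `first_round` repeats the survey on the smaller region, which is how the loop would continue until the minimum step is reached.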
The survey continues until the minimum step is reached (for the selection region, the minimum step is 1 kb; for peak height, only candidate height cutoffs are considered). All searched parameter sets and their corresponding P values are saved. For each value of a given parameter, the graph shows the largest negative log10 P value obtained across all values of the other two parameters. The line plot of the grid optimization of promoter width was created in Excel. More details can be found in the grid function of DANPOS2 Dev.
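The summary used for the graph, the best score per value of one parameter, can be sketched as follows. The function name and the `(params, p_value)` record layout are assumptions made for illustration, and all P values are assumed to be strictly positive.

```python
from math import log10


def best_score_per_value(results, axis):
    """Map each value of one parameter to the largest -log10(P) over the other two.

    results : iterable of (params, p_value), where params is a 3-tuple
              (peak height, left boundary, right boundary) -- hypothetical layout
    axis    : index (0, 1 or 2) of the parameter to summarize
    """
    best = {}
    for params, p_value in results:
        score = -log10(p_value)
        key = params[axis]
        if key not in best or score > best[key]:
            best[key] = score
    return dict(sorted(best.items()))   # sorted by parameter value for plotting
```

The returned mapping can be plotted directly, with parameter values on the x axis and the best negative log10 P value on the y axis.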
The decision boundary is computed by custom code in CIG_predict.py. The plot is created with the Python library matplotlib.
ROC curves are generated by custom code using the Python library sklearn, and the final curve is the average over 100 rounds of cross-validation on the test datasets. P values are calculated following Hanley’s method [4]. The detailed code is in CIG_predict.py.
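The text does not state which variant of Hanley's method is used. As a hedged illustration only, the sketch below assumes the Hanley and McNeil standard error of the area under the ROC curve and a two-sided z-test between two AUCs that are treated as independent (no correction for AUCs estimated on the same samples). The function names are hypothetical and this is not the code in CIG_predict.py.

```python
from math import erfc, sqrt


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC from the Hanley-McNeil approximation."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return sqrt(variance)


def compare_independent_aucs(auc_a, auc_b, n_pos, n_neg):
    """Two-sided z-test for two AUCs, ASSUMING they are independent.

    Fails with a division by zero if both standard errors are 0 (both AUCs exactly 1).
    """
    se_a = hanley_mcneil_se(auc_a, n_pos, n_neg)
    se_b = hanley_mcneil_se(auc_b, n_pos, n_neg)
    z = (auc_a - auc_b) / sqrt(se_a ** 2 + se_b ** 2)
    p_value = erfc(abs(z) / sqrt(2))   # two-sided tail of the standard normal
    return z, p_value
```

Under this approximation the standard error shrinks as the numbers of positive and negative samples grow, so the same AUC difference becomes more significant with larger test sets.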
Spearman correlations between features are calculated with the Python library pandas and then submitted to the software MeV [5], version 4.8.1, to draw heat maps. The variance inflation factor (VIF) is calculated using the scipy library. All code can be found in CIG_stat.py.
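For reference, Spearman correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The dependency-free sketch below illustrates that definition; the helper names are hypothetical and this is not the code in CIG_stat.py. In pandas, `DataFrame.corr(method="spearman")` returns the full feature-by-feature matrix.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation between two equal-length feature vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```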