Materials and methods related to Figure 1
We first queried PubMed using the search expression “cell identity”[Title/abstract] OR “cell marker”[Title/abstract], which returned 7581 abstracts. We then searched these abstracts for the names of the 297 cell types listed in the SHOGoiN database and ranked the cell types by the number of associated abstracts. To retrieve cell identity genes, we conducted a manual literature review for the 10 top-ranked cell types that also had RNA-Seq data and ChIP-Seq data for the histone modifications H3K4me3, H3K4me1, H3K27ac, and H3K27me3. We defined control genes by requiring that their names never appear together with the name of the given cell type in the literature or in any annotation database, i.e., Entrez Gene, GeneCards, Ensembl, Gene Ontology, and KEGG. We then used the Python random library to generate a random control gene list containing the same number of genes as the curated identity gene list.
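The ranking and sampling steps described above can be sketched in Python as follows. This is a minimal illustration and not the code used in the study: the function names, the case-insensitive whole-word matching rule, and the fixed random seed are all assumptions introduced here, and the candidate gene pool is assumed to already contain only genes that passed the co-mention filter.

```python
import random
import re
from collections import Counter


def rank_cell_types(abstracts, cell_types):
    """Count the abstracts that mention each cell type and rank by that count."""
    counts = Counter()
    for name in cell_types:
        # Assumption: case-insensitive match that is not embedded in a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        counts[name] = sum(1 for text in abstracts if pattern.search(text))
    # Highest count first; ties are broken alphabetically for reproducibility.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def sample_control_genes(candidate_genes, identity_genes, seed=0):
    """Draw, without replacement, as many control genes as there are identity genes."""
    # Guard against overlap between the two lists (hypothetical safety step).
    pool = sorted(set(candidate_genes) - set(identity_genes))
    if len(pool) < len(identity_genes):
        raise ValueError("not enough candidate genes to match the identity gene list")
    rng = random.Random(seed)  # the seed is an assumption, used only for repeatability
    return rng.sample(pool, len(identity_genes))
```

For example, sample_control_genes(candidates, identity_genes) returns a list with exactly len(identity_genes) distinct entries, none of which is a curated identity gene.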
The RNA-Seq data and the H3K4me3, H3K4me1, H3K27ac, and H3K27me3 ChIP-Seq data for 10 cell types, namely H1-hESC, CD34+ hematopoietic stem cells, GM12878, human umbilical vein endothelial cells (HUVEC), human mammary epithelial cells (HMEC), neural cells, mid radial glial cells, lung fibroblasts (NHLF), mesenchymal stem cells (MSC), and human skeletal muscle myoblasts (HSMM), were downloaded from the GEO database and the ENCODE project (https://www.encodeproject.org/) 1.
The human reference genome sequence version hg19 and the UCSC known gene list were downloaded from the UCSC Genome Browser website 2. RNA-Seq raw reads were mapped to the human genome version hg19 using TopHat version 2.1.1 with default parameter values. The expression value for each gene was determined using Cuffdiff in Cufflinks version 2.2.1 with default parameter values. ChIP-Seq reads were first mapped to the hg19 human genome using Bowtie:
bowtie -p 8 -m 1 --chunkmbs 512 --best hg19_reference_genome fastq_file
Wig files were generated using DANPOS2 version 2.2.3:
python danpos.py dpeak sample -b input --smooth_width 0 -c 25000000 --frsz 200 --extend 200 -o output_dir
Quantile normalization was performed using DANPOS2 version 2.2.3:
python danpos.py wiq --buffer_size 50 hg19.chrom.sizes.xls wig --reference reference.qnor.sort.wiq --rformat wiq --rsorted 1
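As a conceptual illustration of quantile normalization against a reference, the following Python sketch replaces each signal value with the reference value at the same relative rank. It is not the DANPOS2 implementation: the function name, the in-memory list inputs, and the arbitrary handling of tied values are assumptions made for this example.

```python
def quantile_normalize_to_reference(values, reference):
    """Replace each value with the reference value at the same relative rank."""
    if not values:
        return []
    if not reference:
        raise ValueError("reference must not be empty")
    ref_sorted = sorted(reference)
    n, m = len(values), len(ref_sorted)
    # Indices of the sample values, ordered from the smallest to the largest value.
    # Tied values keep their input order (an arbitrary choice for this sketch).
    order = sorted(range(n), key=lambda i: values[i])
    normalized = [0.0] * n
    for rank, idx in enumerate(order):
        # Map the rank in the sample onto the matching position in the reference.
        ref_pos = 0 if n == 1 else round(rank * (m - 1) / (n - 1))
        normalized[idx] = ref_sorted[ref_pos]
    return normalized
```

After this transformation the sample takes on the value distribution of the reference while the rank order of its positions is preserved, which is what makes signal from different samples comparable.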
BigWig files were generated using the tool wigToBigWig with the following command line:
wigToBigWig -clip sample.bgsub.Fnor.wig hg19.sizes.xls sample.bw
The tool wigToBigWig was downloaded from the ENCODE project website (https://www.encodeproject.org/software/wigtobigwig/) 1. The “hg19.sizes.xls” in the command line is a file containing the length of each chromosome in the human genome. We then submitted the bigWig files to the UCSC Genome Browser (https://genome.ucsc.edu) to visualize the ChIP-Seq signal at each base pair 2,3.
Peak calling was performed using DANPOS2 DEV:
python danpos.py dpeak wig -q heights --smooth_width 0 -c 25000000 -o output_dir
The skewness and kurtosis of each peak were calculated using in-house Python code (skewness_kurtosis_run.py). Both skewness and kurtosis are centered at zero. If no signal was detected in the peak-calling region, skewness and kurtosis were set to zero. Feature values were extracted using in-house Python code (CIG_feature_table.py). P values were determined by the Wilcoxon test.
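A minimal sketch of how such shape features could be computed is given below. It is not the content of skewness_kurtosis_run.py, which is not reproduced here. The sketch assumes that the moments are taken over the per-position signal values within one peak region, and it reads "centered at zero" as the convention in which a normal distribution has a skewness of 0 and an excess kurtosis of 0; the function name and both of these choices are assumptions.

```python
def peak_skewness_kurtosis(signal):
    """Return (skewness, excess kurtosis) of the signal values in one peak region."""
    n = len(signal)
    # No signal detected in the region: both features are set to zero.
    if n == 0 or not any(signal):
        return 0.0, 0.0
    # A constant signal has zero variance, so the moments are undefined;
    # reporting zero in that case is an assumption of this sketch.
    if max(signal) == min(signal):
        return 0.0, 0.0
    mean = sum(signal) / n
    m2 = sum((x - mean) ** 2 for x in signal) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in signal) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in signal) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # subtract 3 so a normal distribution gives 0
    return skewness, excess_kurtosis
```

With this convention, a symmetric signal profile yields a skewness of zero, and a profile with a long tail of high values yields a positive skewness.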