bioinformatics

A nucleotide is one of the structural components, or building blocks, of DNA and RNA. A nucleotide consists of a base (in DNA, one of adenine, thymine, guanine, and cytosine; in RNA, uracil takes the place of thymine) plus a molecule of sugar and one of phosphoric acid.

Promoter is the part of a gene that contains the information to turn the gene on or off. The process of transcription is initiated at the promoter.

The non-coding strand is the strand of DNA that does not carry the information necessary to make a protein. It is complementary to the coding strand (its base-paired counterpart rather than a literal mirror image) and is also known as the antisense strand.
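The relationship between the two strands can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `reverse_complement` and the example fragment are made up, and only the four unambiguous DNA bases are handled.

```python
# Watson-Crick pairing for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written 5'->3' like the input."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


coding = "ATGGCC"  # hypothetical coding-strand fragment
antisense = reverse_complement(coding)
print(antisense)  # GGCCAT
```

Applying the function twice returns the original fragment, which is a quick sanity check that the two strands determine each other.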

Autosome: Any chromosome other than a sex chromosome.

Gene: An ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product such as an enzyme, protein, or RNA molecule.

Haploid: A cell with half the usual number of chromosomes, or only one chromosome set. In humans that is 23 chromosomes. Sperm cells and egg cells are haploid.

The human haploid genome contains 23 chromosomes, of which 22 are autosomes common to both females and males. The two sex chromosomes are called X and Y. The egg cell contains only the X chromosome, while the sperm cell contains either X or Y. Every somatic cell (that is, every cell other than the germ cells) in the human body contains 46 chromosomes: 22 autosomes from each of the parents and a pair of either X-X (female) or X-Y (male). Since there are two sets of the 22 autosomes, it is enough to sequence only one set of 22 autosomes plus the two sex chromosomes, X and Y, amounting to a total of 24 chromosomes to be sequenced.

http://scienceandreason.blogspot.com/2006/11/alternative-splicing.html

http://scienceandreason.blogspot.com/search/label/gene%20expression

http://scienceandreason.blogspot.com/2007/06/rna-tails-and-gene-expression.html

The human genome has fewer than 25,000 different genes, in a genome of 3.12 billion base pairs. And the human genome is far from the largest. Ordinary corn has 5 billion base pairs and 50,000 genes.

It's estimated that humans use at least 100,000 different proteins, maybe a lot more, so the point is that some genes must be capable of coding for a lot more than just one protein. It's now understood that this is accomplished by the process known as alternative splicing.

Only a few years ago – definitely less than ten years – gene expression was thought to be a fairly simple process. One gene coded for one protein. The gene was "transcribed" from DNA to messenger RNA (mRNA), and in turn the mRNA was used to direct the manufacture of proteins in structures called ribosomes.

But then there was a series of "complications". Genes could be turned "on" or "off" by means of transcription factors, which are separate proteins produced by separate genes, and which are capable of either promoting or suppressing the transcription of other genes. Further, genes are not straight uninterrupted segments of DNA that correspond directly (via mRNA) to proteins, because genes contain segments called introns that are edited out of finished mRNA and ignored. And what is more, coding segments of genes (called exons) can be spliced together in different ways to produce finished mRNA (see the alternative-splicing link above). This makes it possible to obtain multiple distinct proteins from a single gene.

And then, outside of the RNA transcription process, it turns out that small bits of RNA, called microRNA (miRNA) and small interfering RNA (siRNA), which are coded for in parts of the genome long thought to be "junk", can become attached to mRNA and inhibit (or perhaps at times promote) production of proteins from it. Nor should we forget to mention ribozymes, which can also mess around with mRNA. And if all that weren't enough, there are also a variety of epigenetic factors which can turn on or off entire segments of a genome.

Is that all? No. There are probably a number of other mechanisms that modify, regulate, and control gene expression – mechanisms as yet undiscovered. After all, there's a lot of "junk" DNA, whose function we still have no clue about – except that a lot of it isn't truly "junk".

MicroRNA

MicroRNA (miRNA) is a short (about 21 to 23 nucleotides) single-stranded RNA molecule that is now recognized as playing an important role in gene regulation – even though the term has been in use only since 2001. It is similar to, but distinct from, another type of short RNA, known as small interfering RNA (siRNA).

Although miRNA and siRNA both have gene regulation functions, there are subtle differences. MiRNA may be slightly shorter than siRNA (which has 20 to 25 nucleotides). MiRNA is single-stranded, while siRNA is formed from two complementary strands. The two kinds of RNA are encoded slightly differently in the genome. And the mechanism by which they regulate genes is slightly different.

MiRNA attaches to a piece of messenger RNA (mRNA), which is the template for building a protein, in a non-coding part at one end of the molecule (typically the 3' untranslated region). This acts as a signal to prevent translation of the mRNA into a protein. SiRNA, on the other hand, pairs with near-perfect complementarity to its target mRNA, often within the coding region, and the mRNA is then typically cleaved and degraded rather than simply blocked.


Cell

Chromosome

Exon / Intron

Exon is the region of a gene that contains the code for producing the gene's protein. Each exon codes for a specific portion of the complete protein. In some species (including humans), a gene's exons are separated by long regions of DNA (called introns or sometimes "junk DNA") that have no apparent function.

Intron is a noncoding sequence of DNA that is initially copied into RNA but is cut out of the final RNA transcript.

Gene Expression

Gene expression is the process by which proteins are made from the instructions encoded in DNA.

Networks

http://en.wikipedia.org/wiki/Gene_regulatory_network

http://en.wikipedia.org/wiki/Signal_transduction

BLAST, ClustalW, FASTA

Pairwise alignment / multiple sequence alignment http://en.wikipedia.org/wiki/Sequence_alignment

Levenshtein (edit) distance; affine gap penalty: gap_penalty = cost_opening + cost_extension * gap_length
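Both ideas on the line above can be sketched briefly in Python. The function names and the default costs (10.0 to open, 0.5 per gapped position) are illustrative assumptions, not values taken from these notes or from any particular aligner.

```python
def affine_gap_penalty(gap_length: int,
                       cost_opening: float = 10.0,
                       cost_extension: float = 0.5) -> float:
    """Penalty for one gap: a fixed opening cost plus a per-position cost."""
    if gap_length <= 0:
        return 0.0
    return cost_opening + cost_extension * gap_length


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


print(affine_gap_penalty(3))             # 11.5
print(levenshtein("kitten", "sitting"))  # 3
```

With an affine penalty, one gap of length 3 costs less than three separate gaps of length 1, which is why aligners tend to prefer it over a purely per-position cost.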

http://www.biostat.wisc.edu/bmi776/syllabus.html

BLOSUM (BLOcks SUbstitution Matrix)

PAM (point accepted mutation, sometimes expanded as percent accepted mutation)

matrix_value=log(freq_observed/freq_expected)

matrix_value=0 means substitution expected at random

matrix_value<0 means substitution less likely than by chance

matrix_value>0 means substitution more often than by chance
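A small Python sketch of the log-odds rule, assuming base-2 logarithms and made-up frequencies. Published matrices also scale and round the values, which is left out here; the function name `log_odds` is hypothetical.

```python
import math


def log_odds(freq_observed: float, freq_expected: float) -> float:
    """Log-odds score in bits: log2(observed / expected)."""
    return math.log2(freq_observed / freq_expected)


# Hypothetical frequencies, chosen only to show the three cases.
print(log_odds(0.01, 0.01))   # 0.0  -> substitution seen as often as chance
print(log_odds(0.005, 0.01))  # -1.0 -> seen half as often as chance
print(log_odds(0.04, 0.01))   # 2.0  -> seen four times as often as chance
```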

Sequence Alignment and Assembling (local pdf)

http://www.dnabaser.com

sequence assemblers: http://en.wikipedia.org/wiki/Sequence_assembly

http://mummer.sourceforge.net for comparing an entire genome against another

http://www.repeatmasker.org screens DNA sequences for repeats and low complexity sequences

Local alignment / global alignment: In local alignment, the alignment of local, high-scoring subsequences takes precedence over the overall alignment.

Smith-Waterman - dynamic programming method for local alignment
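A compact sketch of Smith-Waterman in Python, assuming a simple scoring scheme (match +2, mismatch -1, linear gap -2). These scores and the function name are illustrative choices, not taken from the notes; real tools usually use a substitution matrix and affine gaps.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # scores are floored at zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))  # (6, 'ACG', 'ACG')
```

The floor at zero is what makes the alignment local: a poorly matching prefix is dropped instead of dragging the score down, so the shared "ACG" is found even though the flanking sequence does not align.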

http://bix.ucsd.edu/bioalgorithms http://www.geneious.com

Microarrays http://discover.nci.nih.gov/tools.jsp

http://discover.nci.nih.gov/microarrayAnalysis/Affymetrix.Preprocessing.jsp

If you run the same biological sample on two separate microarrays you will get slightly different results.

This is just part of the inherent variation that you have with any laboratory assay.

Normalization is a method that attempts to remove some of this variation.

http://www.rci.rutgers.edu/~cabrera/ST/c5.pdf

1. Multiply each array by a constant to make the mean (median) intensity the same for each array.

2. Adjust the arrays using some control or housekeeping genes that you would expect to have the same intensity level across all of the samples.

3. Match the percentiles of each array.

4. Adjust using a nonlinear smoothing curve.

5. Adjust using control genes
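Methods 1 and 3 in the list above can be sketched in plain Python. This is a simplified illustration on made-up intensities; the function names are hypothetical, ties are not treated specially, and production pipelines typically work on log-transformed data with dedicated packages.

```python
from statistics import median


def scale_to_common_median(arrays):
    """Method 1: multiply each array by a constant so all medians agree."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)  # common median to aim for
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def quantile_normalize(arrays):
    """Method 3: give every array the same set of percentile values."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean intensity at each rank across all arrays.
    rank_means = [sum(s[r] for s in sorted_arrays) / len(arrays)
                  for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result.append(normalized)
    return result


print(scale_to_common_median([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]))
# [[1.5, 3.0, 4.5], [1.5, 3.0, 4.5]]
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

Median scaling only shifts the overall level of each array, while quantile normalization forces the whole intensity distribution to match, which is a stronger assumption about the samples.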

Software for Next Generation Sequencing (NGS)

http://ensembl.genome.tugraz.at/ http://www.phrap.org/ http://www.softgenetics.com/

http://code.google.com/p/mosaik-aligner/ http://www.clcbio.com/ http://www.scubeindia.com/SoftGenetics/nextgene.html

http://samtools.sourceforge.net/

http://maq.sourceforge.net/

http://maq.sourceforge.net/glfProgs.shtml

http://www.politigenomics.com/

http://bioinformatics.bc.edu/marthlab/EagleView http://bioinformatics.oxfordjournals.org/cgi/reprint/btp611v1.pdf

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527701/ http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000369

http://www.geospiza.com/index.shtml

http://www.valexllc.com/a-compatibility.html

http://seqanswers.com/forums/showthread.php?t=43

http://www.politigenomics.com/next-generation-sequencing-informatics

Next-Generation Sequencing Statistics

Notes:

- Units: B – bytes, b – bases
- PA is primary analysis (includes image feature extraction and base calling)
- PA CPU is calculated as the wall-clock time multiplied by the number of CPU cores
- ABI SOLiD data, except rate, are representative of a single slide
- ABI SOLiD and Illumina GA IIx primary analysis is done on instrument
- 454 paired-end reads vary in length depending on location of internal adapter
- SRA is the size of the files (SFF, SRF, or FASTQ) that are submitted to the NCBI Short Read Archive

SOLiD

http://solidsoftwaretools.com/gf/

The SOLiD system has an interesting but confusing color-coding scheme, where two bases are read at a time by ligation. There is a 3-base gap between incorporations. Four reading frames are sequenced simultaneously.
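The two-base color code itself can be sketched in Python. The sketch assumes the commonly described mapping in which each adjacent pair of bases gives one of four colors (0 to 3), which is equivalent to XOR-ing 2-bit base codes with A=0, C=1, G=2, T=3. The function names are made up, and this covers only the encoding, not the ligation chemistry.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode_colors(seq: str):
    """Turn a base sequence into colors, one per adjacent pair of bases."""
    codes = [BASE_CODE[b] for b in seq.upper()]
    return [x ^ y for x, y in zip(codes, codes[1:])]


def decode_colors(first_base: str, colors) -> str:
    """Rebuild the bases from a known first base and the color calls."""
    code = BASE_CODE[first_base.upper()]
    out = [BASES[code]]
    for c in colors:
        code ^= c  # each color says how to step from one base to the next
        out.append(BASES[code])
    return "".join(out)


colors = encode_colors("ATGGC")
print(colors)                      # [3, 1, 0, 3]
print(decode_colors("A", colors))  # ATGGC
```

Because every base contributes to two neighboring colors, a single wrong color call changes all downstream bases on decoding, which is part of why the scheme is confusing to work with directly.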
