gmonce_doc

Glossary

(unless otherwise noted, the definitions are taken from Wikipedia)

Dice coefficient

Twice the size of the intersection of two sets, divided by the sum of the sizes of the two sets.

For strings and bigrams:

2 n_t / (n_x + n_y)

where n_t is the number of bigrams that occur in both strings, and n_x and n_y are the number of bigrams in each string.
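The bigram form of the coefficient can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this note, and it assumes that repeated bigrams are counted with their multiplicity and that two strings without any bigrams score 0.

```python
from collections import Counter


def bigrams(s):
    """Return the list of character bigrams of a string."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice_coefficient(x, y):
    """Dice coefficient 2 n_t / (n_x + n_y) computed on character bigrams."""
    bx, by = bigrams(x), bigrams(y)
    if not bx and not by:
        return 0.0  # assumed convention for strings shorter than 2 characters
    # n_t: bigrams shared by both strings (multiset intersection, an assumption)
    n_t = sum((Counter(bx) & Counter(by)).values())
    return 2 * n_t / (len(bx) + len(by))


# "night" -> ni, ig, gh, ht ; "nacht" -> na, ac, ch, ht ; only "ht" is shared
print(dice_coefficient("night", "nacht"))  # 0.25
```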

Gene / Proteins

In cells, a gene is a portion of an organism's DNA which contains both "coding" sequences that determine what the gene does, and "non-coding" sequences that determine when the gene is active (expressed). When a gene is active, the coding and non-coding sequences are copied in a process called transcription, producing an RNA copy of the gene's information. This piece of RNA can then direct the synthesis of proteins via the genetic code. In other cases, the RNA is used directly, for example as part of the ribosome. The RNA may undergo special post-transcriptional processing steps required to convert it into a mature, functional form. These molecules resulting from gene expression, whether RNA or protein, are known as gene products, and are responsible for the development and functioning of all living things.

Genomics 

Genomics is the study of an organism's entire genome. The field includes intensive efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping efforts.

Molecular Biology  

Molecular biology is the study of biology at a molecular level.

Functional genomics 

Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions. Unlike genomics and proteomics, functional genomics focuses on the dynamic aspects such as gene transcription, translation, and protein-protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures.

Performance measures (Raychaudhuri)

Sensitivity = Recall = TP / (TP + FN)

Sensitivity is the same as recall: how many of the existing positives I retrieve.

Specificity = TN / (TN + FP)

Specificity is how many of the negatives I correctly identify as negatives.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is how many I get right out of the total (both positives and negatives).

Precision = TP / (FP + TP)

Precision is how many of the items I label as positive are actually positive.
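As a quick check of the four formulas, here is a small Python sketch that computes them from the counts of a confusion matrix. The function name and the example counts are invented for illustration, and the sketch assumes that none of the denominators is zero.

```python
def performance_measures(tp, tn, fp, fn):
    """Compute the four measures from true/false positive/negative counts.

    Assumes every denominator is non-zero (no guard against empty classes).
    """
    return {
        "sensitivity": tp / (tp + fn),                # also called recall
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (fp + tp),
    }


# Made-up counts: 50 real positives, 50 real negatives
measures = performance_measures(tp=40, tn=45, fp=5, fn=10)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts the sensitivity is 40/50, the specificity 45/50, the accuracy 85/100 and the precision 40/45.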

DNA Sequencing

The term DNA sequencing encompasses biochemical methods for determining the order of the nucleotide bases, adenine, guanine, cytosine, and thymine, in a DNA oligonucleotide. The sequence of DNA constitutes the heritable genetic information in nuclei, plasmids, mitochondria, and chloroplasts that forms the basis for the developmental programs of all living organisms. Determining the DNA sequence is therefore useful in basic research studying fundamental biological processes, as well as in applied fields such as diagnostic or forensic research.

Protein Sequencing 

Proteins are found in every cell and are essential to every biological process. Protein structure is very complex: determining it involves first protein sequencing (determining the amino acid sequences of the protein's constituent peptides), and then determining what conformation the protein adopts and whether it is complexed with any non-peptide molecules.

Genome

In biology the genome of an organism is its whole hereditary information and is encoded in the DNA (or, for some viruses, RNA). This includes both the genes and the non-coding sequences of the DNA.

Genetic code 

The genetic code is the set of rules by which information encoded in genetic material (DNA or RNA sequences) is translated into proteins (amino acid sequences) by living cells. Specifically, the code defines a mapping between tri-nucleotide sequences called codons, and amino acids; every triplet of nucleotides in a nucleic acid sequence specifies a single amino acid.  
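The codon-to-amino-acid mapping can be illustrated with a toy Python sketch. The table below is deliberately partial (6 of the 64 codons, written in the DNA alphabet), the function name is made up, and the choices of marking unknown codons with "?" and stopping at the first stop codon are assumptions of this example.

```python
# Partial codon table (DNA alphabet); only a handful of the 64 codons are listed.
CODON_TABLE = {
    "ATG": "M",  # methionine (also the usual start codon)
    "TTT": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "GCT": "A",  # alanine
    "AAA": "K",  # lysine
    "TAA": "*",  # stop codon
}


def translate(dna):
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TABLE.get(codon, "?")  # "?" = codon not in the toy table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGTTTGGCAAATAA"))  # MFGK
```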

Gene expression

Gene expression is the process by which inheritable information from a gene, such as the DNA sequence, is made into a functional gene product, such as protein or RNA.

measure gene expression

It means measuring the amount of RNA produced (under certain conditions). The amount of RNA, the well-known intermediary, largely determines how much protein, the cell's workhorse, will be present under those conditions. Microarrays measure the amount of RNA for thousands of genes under a given condition. (Martín Graña)

The expression of many genes is regulated after transcription (e.g., by microRNAs or ubiquitin ligases), so an increase in mRNA concentration need not always increase expression. Nevertheless, mRNA levels can be quantitatively measured by Northern blotting, a process in which a sample of RNA is separated on an agarose gel and hybridized to a radio-labeled RNA probe that is complementary to the target sequence.

Homologous

[Anatomical structures that perform the same function in different biological species and evolved from the same structure in some ancestor species are homologous. In genetics, homology can be observed in DNA sequences that code for proteins (genes) and in noncoding DNA. For protein coding genes, one can compare translated amino-acid sequences of different genes. Sequence homology may also indicate common function.]

SWISS-PROT

Swiss-Prot is a manually curated biological database of protein sequences.

Yeast

"Levadura" in Spanish.

BLAST

In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene.

Position-Specific Iterative BLAST (PSI-BLAST)

This program is used to find distant relatives of a protein. First, a list of all closely related proteins is created. These proteins are combined into a general "profile" sequence, which summarises significant features present in these sequences. A query against the protein database is then run using this profile, and a larger group of proteins is found. This larger group is used to construct another profile, and the process is repeated. By including related proteins in the search, PSI-BLAST is much more sensitive in picking up distant evolutionary relationships than a standard protein-protein BLAST.

aberrantly expressed gene

For a gene whose behavior is known (i.e., its expression levels, in this case), this presumably means that its expression is very low or very high. (MG)

promoter site

In biology, a promoter is a regulatory region of DNA generally located upstream (towards the 5' region of the sense strand) of a gene that allows transcription of the gene.

metabolic pathway

In biochemistry, a metabolic pathway is a series of chemical reactions occurring within a cell. In each pathway, a principal chemical is modified by chemical reactions.

microarray chips

Designing a microarray chip requires a sequenced genome (or, failing that, a list of gene fragments for a more modest microarray). Having the genome makes it possible to define the probes that capture the genes of interest. In your example, they probably chose those 5,000-odd genes and manufactured probes to measure the expression of that group under several conditions. Each "spot" is specific to one gene, and can detect that gene in a sample of total DNA (e.g., to compare organisms in terms of the presence/absence of a gene) or of total RNA (as in your case, to measure the expression of the group of genes under different conditions). (MG)

cosine metric

x y^T / (||x|| ||y||)

Measures how similar two vectors are (for example, word vectors): it is the cosine of the angle between them.
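A minimal Python sketch of the cosine measure, taken here as the dot product of the two vectors divided by the product of their Euclidean norms. The function name is made up, and returning 0 for a zero vector is an assumed convention (the cosine is undefined in that case).

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention: undefined for a zero vector
    return dot / (norm_x * norm_y)


# Hypothetical word-count vectors for two short documents
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # same direction, so close to 1
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal, so 0.0
```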

tf-idf

The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
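One common variant of the weighting can be sketched as follows. The formula used here is an assumption of this example, since many variants exist: tf is the term count divided by the document length, and idf is log(N / df), where N is the number of documents and df is the number of documents that contain the term. The function name and the toy corpus are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (documents are token lists)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    ["gene", "expression", "gene"],
    ["protein", "expression"],
    ["gene", "sequence"],
]
for doc_weights in tf_idf(corpus):
    print(doc_weights)
```

With this variant, a term that appears in every document gets weight 0, which matches the idea that frequent corpus-wide words say little about any single document.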