TY BSc. Botany (Sem VI)

TY BOT (Paper IV) SBO504 Sem VI    CURRENT TRENDS IN PLANT SCIENCES II


Unit I: Plant Biotechnology II

DNA sequence analysis–

·         DNA sequencing refers to methods for determining the order of the nucleotide bases adenine, guanine, cytosine and thymine in a molecule of DNA.

·         The first DNA sequences were obtained by academic researchers in the early 1970s, using laborious methods based on two-dimensional chromatography.

·         With the development of dye-based sequencing methods with automated analysis, DNA sequencing has become easier and faster.

Maxam – Gilbert Method (The Chemical Method)

1) Maxam–Gilbert sequencing is a method of DNA sequencing developed by Allan Maxam and Walter Gilbert in 1976–1977.

2) This method is also known as chemical modification method because it involves chemical modification of DNA.

3) This method takes advantage of a two-step process involving piperidine and two chemicals, dimethyl sulfate (DMS) and hydrazine, that selectively attack purines (A and G) and pyrimidines (T and C), respectively.

4) Purines react with dimethyl sulfate (DMS) and pyrimidines react with hydrazine in such a way as to break the glycosidic bond between the deoxyribose sugar and the base.

5) Piperidine will then catalyze phosphodiester bond cleavage where the base has been displaced.


6) Moreover, dimethyl sulfate (DMS) and piperidine alone will selectively cleave guanine (G) nucleotides, but dimethyl sulfate and piperidine in formic acid will cleave both guanine and adenine. Similarly, hydrazine and piperidine will cleave both thymine and cytosine nucleotides, whereas hydrazine and piperidine in 1.5 M NaCl will cleave only cytosine nucleotides.

7) Applying these selective reactions to DNA sequencing involves creating a single-stranded DNA substrate carrying a radioactive label on the 5′ end.

8) This labeled substrate is subjected to four separate cleavage reactions, each of which creates a population of labeled cleavage products ending in known nucleotides.

9) The reactions are loaded on high-percentage polyacrylamide gels and the fragments are resolved by electrophoresis.

10) The gel is then transferred to a light-proof X-ray film cassette, a piece of X-ray film is placed over the gel, and the cassette is placed in a freezer for several days.

11) Wherever a labeled fragment stops on the gel, the radioactive tag exposes the film due to particle decay (autoradiography).

12) Electrophoresis, whether in an acrylamide or an agarose matrix, resolves nucleic acid fragments in inverse order of length; that is, smaller fragments run faster through the gel matrix than larger fragments. The dark autoradiographic bands on the film therefore represent the 5′→3′ DNA sequence when read from bottom to top.

13) For example, a band in the lanes corresponding to the C only and the C + T reactions would be called a C. If the band was present in the C + T reaction lane but not in the C only reaction lane, it would be called a T. The same decision process applies to the G only and the G + A reaction lanes. Sequences would be confirmed by running replicate reactions on the same gel and comparing the autoradiographic patterns between replicates.
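The lane-reading rule described above can be sketched in a few lines of Python. This is only an illustration: the lane names ("G", "G+A", "C", "C+T") and the way the gel is represented (one set of lanes per band position, ordered bottom to top) are assumptions made for the example, not part of the method itself.

```python
# Illustrative sketch only; the data layout is an assumption for this example.
def call_base(lanes_with_band):
    """Call one base from the set of lanes that show a band at one gel position."""
    lanes = set(lanes_with_band)
    if lanes == {"G", "G+A"}:
        return "G"  # band in both the G only and the G + A lanes
    if lanes == {"G+A"}:
        return "A"  # band in G + A but not in G only
    if lanes == {"C", "C+T"}:
        return "C"  # band in both the C only and the C + T lanes
    if lanes == {"C+T"}:
        return "T"  # band in C + T but not in C only
    return "N"      # pattern that cannot be interpreted


def read_gel(positions_bottom_to_top):
    """Read the 5'->3' sequence from band patterns ordered bottom to top."""
    return "".join(call_base(p) for p in positions_bottom_to_top)


# Hypothetical gel with five band positions, shortest fragment first.
gel = [{"G", "G+A"}, {"G+A"}, {"C+T"}, {"C", "C+T"}, {"C+T"}]
print(read_gel(gel))  # GATCT
```

Any position whose band pattern does not fit one of the four rules is reported as N, which is where replicate reactions would be consulted.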

 

Key Features:

·         Base-specific cleavage of DNA by certain chemicals

·         Four different chemicals, one for each base

·         A set of DNA fragments of different sizes

·         DNA fragments contain up to 500 nucleotides

 

Advantages:

·         Purified DNA can be read directly

·         Homopolymeric DNA runs are sequenced as efficiently as heterogeneous DNA sequences

·         Can be used to analyze DNA protein interactions (i.e. footprinting)

·         Can be used to analyze nucleic acid structure and epigenetic modifications to DNA

 

Sanger’s method of sequencing (Chain Termination Methods)

1. Sequencing is the process of determining the precise order of nucleotides in a DNA template, usually after amplification of a specific locus. Sanger sequencing was developed by Frederick Sanger in 1977.

2. Sanger sequencing is also known as the chain termination method. In this process, dideoxynucleotides are selectively incorporated in the sequencing reaction. Unlike a deoxynucleotide, a dideoxynucleotide lacks the 3′ OH group on the sugar, which is needed to form the phosphodiester bond between two nucleotides.

3. The following components are required for Sanger sequencing:

1. A DNA template to be sequenced

2. An oligonucleotide primer labelled at the 5′-end with 32P

3. A DNA sequencing polymerase

4. Four deoxynucleoside triphosphates: dATP, dGTP, dCTP, dTTP

5. Four dideoxy nucleoside triphosphates (nucleoside triphosphates lacking both 2′- and 3′- hydroxy groups): ddATP, ddGTP, ddCTP, ddTTP

4. Extending the primer with the polymerase and the four dNTPs alone would not provide any information on the sequence of the template; for this purpose, four sequencing reactions are carried out in separate tubes.

5. In each tube a small quantity of the key ingredient, a 2',3′-dideoxy nucleoside triphosphate is added.

6. The dideoxy nucleoside triphosphate ddATP is added to tube 1, ddGTP to tube 2, ddCTP to tube 3 and ddTTP to tube 4.


7. The polymerase enzyme does not discriminate between deoxynucleotide triphosphates (dNTPs) and dideoxynucleotide triphosphates (ddNTPs), so either can be added at each step.

8. If a dNTP is added, the DNA chain will continue to grow.

9. If a ddNTP is added, the DNA chain will terminate, as it has no 3′-hydroxyl group to react with an incoming nucleoside triphosphate: no further nucleosides can be added.

10. The result in each tube is a mixture of oligonucleotides of different lengths, all terminated with a particular ddNTP.


11. In tube 1 all the terminations will be at A, in tube 2 at G, in tube 3 at C, and in tube 4 at T.

12. The oligos can then be separated according to their size by electrophoresis.

13. If all four ladders are run side by side on a polyacrylamide gel, and the gel is exposed to a photographic film, the 32P-labelled fragments will produce an image that can be used to read the DNA sequence.
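The logic of the four-tube experiment can be imitated with a small Python sketch. It assumes an idealised reaction in which termination occurs at every possible position, and it counts fragment lengths from the first base added after the primer; the template sequence and function names are hypothetical.

```python
# Idealised sketch of chain termination; all names and data are illustrative.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesized_strand(template_3_to_5):
    """New strand (5'->3') made by copying a template written 3'->5'."""
    return "".join(COMPLEMENT[b] for b in template_3_to_5)


def termination_lengths(new_strand):
    """For each ddNTP tube, list the lengths of the terminated fragments."""
    tubes = {"A": [], "G": [], "C": [], "T": []}
    for length, base in enumerate(new_strand, start=1):
        tubes[base].append(length)  # a fragment of this length ends in `base`
    return tubes


def read_ladder(tubes):
    """Read the bands from shortest to longest (bottom to top of the gel)."""
    bands = sorted((n, base) for base, lengths in tubes.items() for n in lengths)
    return "".join(base for _, base in bands)


template = "TACGGT"                 # hypothetical template, written 3'->5'
new = synthesized_strand(template)  # ATGCCA
tubes = termination_lengths(new)    # e.g. the ddCTP tube holds lengths 4 and 5
print(read_ladder(tubes))           # ATGCCA
```

The sequence read from the gel is that of the newly synthesized strand, so the template sequence is its complement.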


Key Features

·         Uses dideoxy nucleotides to terminate DNA synthesis.

·         DNA synthesis reactions in four separate tubes

·         Radioactive dATP is also included in all the tubes so the DNA products will be radioactive.

·         Yielding a series of DNA fragments whose sizes can be measured by electrophoresis.

·         The last base in each of these fragments is known.


Advantage

Chain termination methods have greatly simplified DNA sequencing.

 

Pyrosequencing:

1. Pyrosequencing is a method of DNA sequencing (determining the order of nucleotides in DNA) based on the "sequencing by synthesis" principle.

2. In this method the sequencing is performed by detecting the nucleotide incorporated by a DNA polymerase.

3. Pyrosequencing relies on light detection based on a chain reaction when pyrophosphate is released. Hence, the name pyrosequencing.

4. Procedure

 The DNA to be sequenced is broken up into fragments of ~100 base pairs and denatured to form single-stranded DNA (ssDNA).

5. Each reaction receives a cocktail of reagents:

i. DNA polymerase: for adding deoxyribonucleotides to the ssDNA

ii. Adenosine phosphosulfate (APS)

iii. ATP sulfurylase: an enzyme that forms ATP from adenosine phosphosulfate (APS) and pyrophosphate (PPi)

iv. Luciferin: substrate for the enzyme luciferase

v. Luciferase: an ATP-dependent enzyme that catalyzes the conversion of luciferin to oxyluciferin with the liberation of light

6. As the reaction is started incoming nucleotides are added to the 3' end of the growing chain.

7. The nucleotides are supplied as four deoxynucleoside triphosphates. As each nucleotide is added, a molecule containing two phosphate groups, called pyrophosphate (PPi), is split off.

8. For synthesis of DNA, three of the usual deoxyribonucleotides are used (dTTP, dCTP and dGTP), but instead of dATP (which would trigger the luciferase reaction), deoxyadenosine alpha-thiotriphosphate (dATPαS) is used. DNA polymerase ignores the difference and uses it whenever a T is encountered on the ssDNA template, but luciferase does not recognize it.

9. The addition of one of the four deoxynucleotide triphosphates (dNTPs) (dATPαS, which is not a substrate for a luciferase, is added instead of dATP to avoid noise) initiates the second step.

10. DNA polymerase incorporates the correct, complementary dNTPs onto the template. This incorporation releases pyrophosphate (PPi).

11. ATP sulfurylase converts PPi to ATP in the presence of adenosine 5´ phosphosulfate.

12. This ATP acts as a substrate for the luciferase-mediated conversion of luciferin to oxyluciferin, which generates visible light in amounts proportional to the amount of ATP.

13. The light produced in the luciferase-catalyzed reaction is detected by a camera and analyzed in a program.

14. In any well where the complementary nucleotide is present at the 3' end of the template, the nucleotide is added and pyrophosphate is liberated.

15. The amount of light is proportional to the number of that nucleotide added. So if, for example, the incoming nucleotide is dGTP, and there is a string of 3 Cs on the template, the light emitted will be 3 times brighter than if only one C is present.

16. A detector picks up the light (if any) from each well and the data are recorded.

17. Then each of the remaining 3 nucleotides is added in sequence.

18. Then the sequence of 4 additions is repeated until synthesis is complete.
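The repeated cycle of nucleotide additions can be sketched as a small simulation. The following Python example is only an idealised model: it assumes a fixed dispensation order (A, C, G, T), perfect incorporation, and one light unit per nucleotide added; the template and the function names are hypothetical.

```python
# Idealised pyrosequencing model; names, order and data are assumptions.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def pyrogram(template_3_to_5, dispensation="ACGT", max_cycles=50):
    """Return a list of (nucleotide, light units), one entry per dispensation.

    "A" stands for dATPαS, which the polymerase uses in place of dATP.
    """
    pos = 0
    signals = []
    for _ in range(max_cycles):
        for nt in dispensation:
            count = 0
            # Incorporate the nucleotide as long as it pairs with the template.
            while pos < len(template_3_to_5) and COMPLEMENT[template_3_to_5[pos]] == nt:
                count += 1
                pos += 1
            signals.append((nt, count))  # light is proportional to count
            if pos == len(template_3_to_5):
                return signals
    return signals


def sequence_from(signals):
    """Rebuild the synthesized strand from the recorded light signals."""
    return "".join(nt * n for nt, n in signals)


signals = pyrogram("CCCAT")    # hypothetical template with a run of three Cs
print(signals)                 # [('A', 0), ('C', 0), ('G', 3), ('T', 1), ('A', 1)]
print(sequence_from(signals))  # GGGTA
```

The run of three Cs on the template gives a signal of 3 when dGTP is dispensed, matching the proportionality described in point 15.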

                   

Applications of Pyrosequencing

Global DNA methylation: Luminometric Methylation Assay (LUMA) is a high throughput and quantitative method to estimate genome-wide DNA methylation. The starting material is genomic DNA, and the method uses combined DNA cleavage by methylation-sensitive enzymes and polymerase extension assay using Pyrosequencing. The assay can be completed in ~50 mins (excluding the digestion process).

Gene-specific DNA methylation: The Pyrosequencer can be used to detect gene-specific DNA methylation patterns at a specific genomic region (e.g., the promoter of a gene of interest). The average length of the region tested is ~100 bp. The starting material is bisulfite-treated genomic DNA. The assay can be completed in ~60 mins (excluding the PCR preparation).

 

 

Polymerase Chain Reaction (PCR).

1. The polymerase chain reaction (PCR) was originally developed in 1983 by the American biochemist Kary Mullis.

2. PCR is used in molecular biology to make many copies of (amplify) small sections of DNA.

3. Using PCR it is possible to generate thousands to millions of copies of a particular section of DNA from a very small amount of DNA.

4. Basic PCR set-up requires several components and reagents including:

i. A DNA template that contains the DNA target region to amplify

ii. A DNA polymerase; an enzyme that polymerizes new DNA strands; heat-resistant Taq polymerase is especially common, as it is more likely to remain intact during the high-temperature DNA denaturation process.

 

Taq polymerase

Like DNA replication in an organism, PCR requires a DNA polymerase enzyme that makes new strands of DNA, using existing strands as templates. The DNA polymerase typically used in PCR is called Taq polymerase, after the heat-tolerant bacterium from which it was isolated (Thermus aquaticus).

T. aquaticus lives in hot springs and hydrothermal vents. Its DNA polymerase is very heat-stable and is most active around 70 °C (a temperature at which a human or E. coli DNA polymerase would be non-functional). This heat stability makes Taq polymerase ideal for PCR, because high temperature is used repeatedly in PCR to denature the template DNA, that is, to separate its strands.

iii. Two DNA primers that are complementary to the 3' (three prime) ends of each of the sense and anti-sense strands of the DNA target. DNA polymerase can only bind to and elongate from a double-stranded region of DNA; without primers there is no double-stranded initiation site at which the polymerase can bind.

Primers

Like other DNA polymerases, Taq polymerase can only make DNA if it's given a primer, a short sequence of nucleotides that provides a starting point for DNA synthesis. In a PCR reaction, the experimenter determines the region of DNA that will be copied, or amplified, by the primers.

PCR primers are short pieces of single-stranded DNA, usually around 20 nucleotides in length.


Two primers are used in each PCR reaction, and they are designed so that they flank the target region (region that should be copied). That is, they are given sequences that will make them bind to opposite strands of the template DNA. The primers bind to the template by complementary base pairing.

iv. Deoxynucleoside triphosphates, or dNTPs (sometimes called "deoxynucleotide triphosphates"; nucleotides containing triphosphate groups), the building blocks from which the DNA polymerase synthesizes a new DNA strand.

v. A buffer solution providing a suitable chemical environment for optimum activity and stability of the DNA polymerase.

vi. Bivalent cations, typically magnesium (Mg) or manganese (Mn) ions. Mg2+ is the most common, but Mn2+ can be used for PCR-mediated DNA mutagenesis, as a higher Mn2+ concentration increases the error rate during DNA synthesis.

vii. Monovalent cations, typically potassium (K) ions.

5. Typically, PCR consists of a series of 20–40 repeated temperature changes, called thermal cycles, with each cycle commonly consisting of two or three discrete temperature steps (see figure below).

6. The cycling is often preceded by a single temperature step at a very high temperature (>90 °C, i.e. >194 °F) and followed by one hold at the end for final product extension or brief storage.

7. The temperatures used and the length of time they are applied in each cycle depend on a variety of parameters, including the enzyme used for DNA synthesis, the concentration of bivalent ions and dNTPs in the reaction, and the melting temperature (Tm) of the primers.
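As a rough illustration of how primer melting temperature depends on base composition, the sketch below uses the Wallace rule, Tm ≈ 2 °C × (A + T) + 4 °C × (G + C). This rule is not part of these notes; it is a common rule-of-thumb estimate for short oligonucleotides and is assumed here only for illustration. The primer sequence is hypothetical.

```python
# Rough estimate only: the Wallace rule is an assumption brought in for
# illustration and is a coarse approximation for short primers.
def wallace_tm(primer):
    """Estimate primer melting temperature (°C) from base composition."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


print(wallace_tm("ATGCATGCATGCATGCATGC"))  # 60
```

A GC-rich primer of the same length gives a higher estimate, which is one reason annealing temperatures differ between primer pairs.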

The steps common to most PCR methods are as follows


1. Denaturation (96 °C): Heat the reaction strongly to separate, or denature, the DNA strands. This provides single-stranded template for the next step.

2. Annealing (55–65 °C): As the temperature of the mixture is lowered to about 55 °C, the primers base pair with the complementary regions flanking the target on each DNA strand. This process is called renaturation or annealing. A high concentration of primer ensures that each DNA strand anneals to a primer rather than to the other DNA strand.

3. Extension (72 °C): Raise the reaction temperature so that Taq polymerase extends the primers, synthesizing new strands of DNA.

         ·            This cycle repeats 25–35 times in a typical PCR reaction, which generally takes 2–4 hours, depending on the length of the DNA region being copied.

         ·            If the reaction is efficient (works well), the target region can go from just one or a few copies to billions.

         ·            That’s because it’s not just the original DNA that’s used as a template each time.

         ·            Instead, the new DNA that’s made in one round can serve as a template in the next round of DNA synthesis.

         ·            There are many copies of the primers and many molecules of Taq polymerase floating around in the reaction, so the number of DNA molecules can roughly double in each round of cycling.

         ·            This pattern of exponential growth is shown in the image below
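The doubling described in the points above can be written as N = N0 × (1 + E)^n, where N0 is the starting number of copies, n the number of cycles and E the efficiency (E = 1 for perfect doubling). The short Python sketch below evaluates this expression; the function name and the efficiency parameter are illustrative assumptions.

```python
# Minimal sketch of exponential amplification; names are illustrative.
def copies_after(cycles, start_copies=1, efficiency=1.0):
    """Copies after `cycles` rounds; efficiency 1.0 means perfect doubling."""
    return start_copies * (1 + efficiency) ** cycles


for n in (10, 20, 30):
    print(n, f"{copies_after(n):.0f}")
# 10 1024
# 20 1048576
# 30 1073741824
```

Starting from a single molecule, 30 perfectly efficient cycles give about a billion copies; real reactions have an efficiency below 1 and eventually plateau as reagents run out.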

Applications of PCR

1. PCR is used in analysing clinical specimens for the presence of infectious agents, including HIV, hepatitis, malaria, anthrax, etc.

2. PCR can provide information on a patient’s prognosis, and predict response or resistance to therapy. Many cancers are characterized by small mutations in certain genes.

3. PCR is used in the analysis of mutations that occur in many genetic diseases (e.g. cystic fibrosis, sickle cell anaemia, phenylketonuria, muscular dystrophy).

4. PCR is also used in forensics laboratories and is especially useful because only a tiny amount of original DNA is required, for example, sufficient DNA can be obtained from a droplet of blood or a single hair.

5. PCR is an essential technique in cloning procedure which allows generation of large amounts of pure DNA from tiny amount of template strand and further study of a particular gene.

6. The Human Genome Project (HGP) for determining the sequence of the 3 billion base pairs in the human genome relied heavily on PCR.

7. PCR has been used to identify and to explore relationships among species in the field of evolutionary biology.

8. In anthropology, it is also used to understand the ancient human migration patterns.

9. In archaeology, it has been used to identify ancient human populations. PCR is also commonly used by palaeontologists to amplify DNA from extinct species or from ancient cryopreserved remains, so that the DNA can be studied further.

 

Types of PCR

Nested PCR

1. Nested PCR is a modification of PCR that aims to reduce non-specific amplification of DNA.

2. In regular PCR, the primers sometimes bind to non-specific sites on the DNA, which is not desired.


3. To avoid this, nested PCR makes use of two sets of primers. The first pair aims to amplify a longer fragment of DNA than is required and as such are aimed outside of the target DNA region.

4. The second pair is then aimed, or nested, inside the first PCR product, and amplifies only the area of interest from it.
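The two rounds of nested PCR can be imitated in silico by searching a template for primer binding sites. The sketch below is a simplification that assumes exact primer matches and a single binding site per primer; the template, the primer sequences and the function names are all hypothetical.

```python
# Simplified in-silico PCR; exact matching only, all sequences hypothetical.
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))


def amplify(template, fwd, rev):
    """Return the product between two primers, or None if either fails to bind.

    template and fwd are written 5'->3' on the top strand;
    rev is written 5'->3' on the bottom strand.
    """
    start = template.find(fwd)
    if start == -1:
        return None
    site = revcomp(rev)  # where the reverse primer pairs, seen on the top strand
    end = template.find(site, start + len(fwd))
    if end == -1:
        return None
    return template[start:end + len(site)]


target = "ACGTACGTAC"
template = "TT" + "GACCTA" + "GGCATCAG" + target + "CTTGGAAC" + "GATTCAGG" + "AA"

outer = amplify(template, "GACCTA", "CCTGAATC")  # first round, outer primers
inner = amplify(outer, "GGCATC", "GTTCCAAG")     # second round, nested primers
print(len(outer), len(inner))                    # 40 26
```

The second round uses the first product as its template, so a non-specific first-round product that lacks the inner primer sites would return None and not be amplified further.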

Reverse transcription PCR (RT PCR)

1. This technique starts with an RNA template and works in reverse, producing a strand of complementary DNA (cDNA) through the use of an enzyme known as reverse transcriptase.

2. This process can be done entirely within one test tube, known as "one-step RT-PCR", which has the advantage of minimising any possible temperature or environmental variations.

3. Alternatively, it can be split across two tubes, with the reverse transcriptase creating the complementary DNA in one tube and the PCR being performed in the second.

4. This "two-step RT-PCR" has the advantage of minimising RNA degradation by keeping the PCR separate, allowing multiple measurements to be performed from the same small RNA sample.

5. Both methods allow extremely low quantities of RNA to be detected through amplification of the complementary DNA made from it.
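The reverse transcription step can be illustrated with a short Python sketch that builds the first-strand cDNA as the reverse complement of the RNA, with T pairing opposite A. The RNA sequence and the function name are hypothetical, and the example ignores primers and enzyme behaviour.

```python
# Minimal sketch of reverse transcription; sequence and names are illustrative.
def first_strand_cdna(rna):
    """Return the cDNA (5'->3') complementary to an RNA written 5'->3'."""
    pair = {"A": "T", "U": "A", "G": "C", "C": "G"}
    return "".join(pair[b] for b in reversed(rna.upper()))


print(first_strand_cdna("AUGGCCUAA"))  # TTAGGCCAT
```

The resulting cDNA strand then serves as the template for ordinary PCR.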


Quantitative PCR (Real Time PCR)

1. Q-PCR (Quantitative PCR) is used to measure the quantity of a PCR product (preferably real-time).

2. It is the method of choice to quantitatively measure starting amounts of DNA, cDNA or RNA.

3. Q-PCR is commonly used to determine whether a DNA sequence is present in a sample and the number of its copies in the sample.

4. In qPCR, the same procedure as in conventional PCR is followed, but with two major differences:

5. First, the amplified DNA is fluorescently labelled (usually with cyanine-based fluorescent dyes).

6. Second, the amount of fluorescence released during amplification is directly proportional to the amount of amplified DNA.

7. Fluorescence is monitored during the whole PCR process (along all 30 to 45 cycles).

8. The higher the initial number of DNA molecules in the sample, the faster the fluorescence will increase during the PCR cycles.

9. In other words, if a sample contains more targets the fluorescence will be detected in earlier cycles.

10. The cycle in which fluorescence is first detected above the background is termed the quantitation cycle (Cq for short) and is the basic result of qPCR.

11. Lower Cq values mean higher initial copy numbers of the target. This is the basic principle of quantitative approach that real-time PCR provides.
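The link between Cq and starting copy number can be made concrete with a small calculation. If the product roughly doubles each cycle, a sample that reaches the detection threshold k cycles earlier started with about 2^k times more target. The Python sketch below assumes both samples amplify with the same efficiency; the function name and the example Cq values are hypothetical.

```python
# Sketch of relative quantification from Cq values; assumes equal efficiency
# in both samples. Names and numbers are illustrative.
def fold_difference(cq_sample, cq_reference, efficiency=1.0):
    """How many times more starting target the sample has than the reference."""
    return (1 + efficiency) ** (cq_reference - cq_sample)


print(fold_difference(20, 23))  # 8.0
```

A sample detected three cycles earlier than the reference therefore contained about eight times more target at the start.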

 

Amplified fragment length polymorphism (AFLP) PCR

1.      It is a PCR-based technique that uses selective amplification of a section of digested DNA fragments to generate unique fingerprints for genomes of interest.

2.      This technique can quickly generate large numbers of marker fragments for any organism, without prior knowledge of the genomic sequence.

3.      AFLP PCR uses restriction enzymes to digest genomic DNA and allows attachment of adaptors to the sticky ends of the fragments.

4.      A part of the restriction fragments is then selected to be amplified by using primers that are complementary to the adaptor sequence.

5.      The amplified fragments are separated by denaturing gel electrophoresis and visualized.

6.      AFLP PCR is employed for a variety of applications, such as assessing genetic diversity within species or among closely related species, inferring population-level phylogenies and biogeographic patterns, generating genetic maps and determining relatedness among cultivars.
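The digestion step that underlies an AFLP fingerprint can be sketched in Python. The example assumes the enzyme pair EcoRI (G^AATTC) and MseI (T^TAA), which is a common choice but is not specified in these notes, and it leaves out adaptor ligation and selective amplification. The two short "genomes" are hypothetical and differ by a single base in one restriction site.

```python
# Simplified digest-and-compare sketch; enzymes, sequences and names are
# assumptions made for illustration.
import re

# Recognition site and cut position within the site (top strand).
SITES = {"EcoRI": ("GAATTC", 1), "MseI": ("TTAA", 1)}


def digest(seq, enzymes=("EcoRI", "MseI")):
    """Cut seq at every recognition site and return the fragments in order."""
    cuts = set()
    for name in enzymes:
        site, offset = SITES[name]
        for m in re.finditer(f"(?={site})", seq):  # lookahead finds every site
            cuts.add(m.start() + offset)
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def fingerprint(seq):
    """Sorted fragment lengths, standing in for the band pattern on a gel."""
    return sorted(len(f) for f in digest(seq))


genome_a = "CCGAATTCGGGTTAACCC"
genome_b = "CCGAATTCGGGTTGACCC"  # MseI site lost through one base change
print(fingerprint(genome_a))     # [3, 6, 9]
print(fingerprint(genome_b))     # [3, 15]
```

A single base change that removes a restriction site changes the fragment lengths, which is the kind of polymorphism an AFLP fingerprint detects.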


DNA Barcoding

Basic features:

§  DNA barcoding has developed in concert with genomics-based investigations.

§  DNA barcoding is a tool for rapid species identification based on DNA sequences.

§  It is a relatively new concept that has been developed for providing rapid, accurate and automatable species identification using standardized DNA sequences

§  DNA barcodes consist of a standardized short sequence of DNA (400–800 bp) that in principle should be easily generated and characterized for all species on the planet.

§  A massive on-line digital library of barcodes will serve as a standard to which the DNA barcode sequence of an unidentified sample from the forest, garden, or market can be matched.

§  DNA barcoding will allow users to efficiently recognize known species and speed the discovery of species yet to be found in nature.

§  DNA barcoding aims to use the information of one or a few gene regions to identify all species of life.

§  The most important characteristic features of a DNA barcode are its universality, its specificity of variation and its ease of use.

§  This means that the gene segment used as a barcode should be suitable for a wide range of taxa, should have high variation between species but should be conserved within the species, so that the intra-specific variation will be insignificant.

§  DNA is to be used for PCR amplification. Therefore, the gene sequences used for barcoding should be short enough to be PCR amplified easily.
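Matching an unknown sample against a reference library can be sketched as a nearest-neighbour search on sequence distance. The Python example below uses the proportion of differing sites (p-distance) between sequences that are assumed to be already aligned and of equal length; the species names and the very short sequences are hypothetical, and real barcodes are several hundred base pairs long.

```python
# Toy barcode lookup; assumes pre-aligned sequences of equal length.
# Species names and sequences are invented for illustration.
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def identify(query, library):
    """Return (species, distance) for the closest reference barcode."""
    scored = ((species, p_distance(query, ref)) for species, ref in library.items())
    return min(scored, key=lambda pair: pair[1])


library = {
    "Species A": "ATGGCTTACG",
    "Species B": "ATGGCATTCG",
    "Species C": "TTGGCTAACC",
}
query = "ATGGCTTACC"
print(identify(query, library))  # ('Species A', 0.1)
```

For such a lookup to be reliable, the variation within a species has to be much smaller than the variation between species, as the points above require of a good barcode region.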

Nuclear genome sequence:

In DNA barcoding, a nuclear DNA segment is expected to provide more information on species identity, including hybridization events.

To date, the Internal Transcribed Spacer (ITS) regions of the ribosomal DNA (rDNA) are the only nuclear DNA regions that have been tested for suitability as barcodes in plants.

The difficulty of achieving universal PCR amplification of single- or low-copy genes, especially from low-quality DNA, and the conservation of functional genes across large lineages could be the major reasons why so few genes have been tested for nuclear genome barcoding.

 

Internal transcribed spacer (ITS) regions of nuclear ribosomal cistron

 

The rDNA cistron is a multigene family encoding the nucleic acid core of the ribosome. Within the cell, the rDNA is arranged as tandemly repeated units containing the 18S, 5.8S and 26S coding regions and two internal transcribed spacers (ITS1 and ITS2) present on either side of the 5.8S region (Figure).


Generally, the rDNA units are reiterated thousands of times and are organized into large blocks in the chromosome called the nucleolar organizer regions.

One of the most remarkable features of the rDNA is that the individual units of this multigene family do not evolve independently; instead, all the units evolve in a concerted manner, such that a high level of overall sequence homogeneity exists among copies of the rDNA within a species, while the sequence differs among species.

This high sequence homogeneity is achieved through a process initially termed horizontal evolution but later renamed concerted evolution, which involves unequal crossing over and gene conversion.

Currently, nuclear ITS is considered one of the most useful phylogenetic markers for both plants and animals, because of its ubiquitous nature, biparental inheritance and comparatively high rate of evolutionary change.

Likewise, species-level discrimination and technical ease have also contributed to its wider acceptability as a powerful phylogenetic marker.

Another advantage is that the ITS1 and ITS2 regions can be PCR-amplified separately by anchoring primers in the conserved coding genes.

This facilitates easy amplification of ITS even from poor quality or degraded DNA.

Nuclear ITS is still considered to be a powerful phylogenetic tool at the species level. Its suitability as a barcode in plants has been tested along with nine other loci from the chloroplast genome.

Considering the availability of universal primers, presence of multiple copies in cells, high universality and good species discriminatory power, nuclear ITS is a potential candidate for barcoding in plants.

It has recently been used as a barcode for identifying a reproductively isolated and cryptic species of Asimitellaria (a genus of flowering plants in the family Saxifragaceae) from its close relatives. On the basis of this observation, it was suggested that nrDNA can be of use for accurate and efficient delimitation of plant biological species in lineages with various life history traits (annuals, perennials, trees, aquatics) and evolutionary backgrounds.

A recent report on the use of ITS2 to identify medicinal plants and their close relatives again proved the potential of this nuclear gene as a useful barcode for plants.

 

Chloroplast genome sequence:

The chloroplast genome shares several attributes of mitochondrial genomes such as conserved gene order, high copy number per cell, amenability to PCR amplification and availability of universal primers.

Hence, chloroplast genes could be considered as analogous to the mitochondrial gene that has been used for DNA barcoding in animals.

However, compared to mt-DNA genes in animals, chloroplast genes in plants have a slower rate of evolution; therefore, finding suitable gene sequences with sufficient species discriminatory power is a great challenge.

Nonetheless, because of the uniparental inheritance, lack of recombination and structural stability of both the genic and intergenic regions of the chloroplast genome, many genes have been examined carefully for their potential as barcodes in plants.


The chloroplast genome of higher plants is a circular structure with a size of 120–160 kbp (Figure). The general architecture of the chloroplast genome consists of a Large and a Small Single-Copy region (LSC and SSC) separated by two copies of a large Inverted Repeat (IRa and IRb).

The chloroplast genome contains all the rRNA genes (four genes in higher plants), tRNA genes (35 genes) and other genes for those proteins synthesized in the chloroplast (~ 100 genes) that are essential for its existence.

On the basis of the considerable amount of information available from phylogenetic studies and recent testing with limited number of taxa, potentially useful genic and intergenic loci were initially selected as potential candidates for testing as barcodes for the land plants.

Efficacy of these sequences as barcodes has been examined individually and in combination with other loci on a large number of samples from a wide range of species covering all the major taxonomic lineages.

 

SINGLE-LOCUS DNA BARCODES

Researchers have recently proposed the use of the whole plastid genome sequence in plant identification, but the concept has not yet been universally accepted because of the high sequencing cost and the difficulties involved in obtaining complete plastid genome sequences. The most widely studied single-locus barcodes are described below.

 

rbcL gene (ribulose-bisphosphate carboxylase gene) sequence:

Among the plastid genes, rbcL is the best characterized gene sequence. Therefore, most of the investigating groups tested its suitability for barcoding.

It is widely used in phylogenetic investigations, with over 50000 sequences available in GenBank.

The advantages of this gene are that it is easy to amplify, sequence and align in most land plants and is a good DNA barcoding region for plants at the family and genus levels.

It encodes the large subunit of ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO). rbcL was the first gene to be sequenced from plants.

Although rbcL by itself does not meet the desired attributes of a barcoding locus, it is accepted that rbcL in combination with various plastid or nuclear loci can make accurate identifications.

matK gene (maturase K) sequence:

matK has a high evolutionary rate, suitable length and obvious interspecific divergence, as well as a low transition/transversion rate. Among the chloroplast genes, matK is one of the most rapidly evolving. It has a length of about 1550 bp and encodes the enzyme maturase, which is involved in the splicing of group II introns from RNA transcripts.

 

Since matK is embedded in the group II intron of the lysine gene trnK, it can be easily PCR-amplified with a primer set designed from the conserved regions of the genes trnK, rps16 and psbA. matK has been used as a marker to construct plant phylogenies because of its rapid evolution and the ubiquitous presence in plants.

However, failure of PCR amplification of matK in some taxonomic groups was also reported. In order to circumvent this problem, new sets of primers were developed, which work well in most of the major taxonomic groups.

Lahaye et al. (2008) used specific primers to amplify the matK gene of 1667 angiosperm plant samples and achieved a success rate of 100%.

Fazekas et al. (2008) attempted the identification of 92 species from 32 genera using the matK barcode but only achieved a success rate of 56%.

These findings demonstrate that the matK barcode alone is not a suitable universal barcode.


Present status of barcoding in plants:

The focus of barcoding studies in plants is mostly on assessing the relative efficiency of molecular markers and on phylogenetic studies. Not all DNA segments from plants have been tested so far as candidates for a standard plant barcode.

Although some of the loci tested had many promising characters, they had several limitations as well. For instance, rbcL and trnL (UAA) have higher universality, but they lack adequate species discriminatory power.

matK and trnH–psbA have higher species resolution, but problems remain with PCR amplification and sequencing.

Considering all the available data on universality, sequence quality retrieved with a single pair of primers, difficulties in sequence alignment, and species discriminatory power, along with the cost of sequencing and other analysis, the CBOL Plant Working Group preferred a two-locus barcode combination consisting of the rbcL and matK genes as the standard barcode for land plants.

 

This two-locus combination will act as the universal barcode for land plants.

 

The selection was based on the fact that rbcL has long been used in phylogenetic studies, and high-quality sequences can be retrieved across phylogenetically divergent lineages with existing protocols. rbcL also performs well in discrimination tests in combination with other loci.

Likewise, the matK gene sequence, as stated above, has the highest evolution rate among plastid genes, and thus has high species discriminatory power.

Furthermore, recently developed primers have improved its PCR amplification and sequencing in a wide range of angiosperms.

 

In this context, it is also equally important to note some of the major criticisms against the currently proposed barcode system.

Spooner, after investigating the efficacy of psbA–trnH, matK and nrITS as barcodes on 104 accessions from 63 species of wild potatoes, reported that sequences of psbA–trnH, matK and nrITS failed to provide species-specific markers, especially in the section Petota.

 

The plastid genes failed to provide adequate differentiation, whereas the nrITS sequences exhibited high intraspecific variations.

Similar difficulties were also observed earlier with the matK gene in elucidating interspecific relationships in many genera of the subfamily Magnolioideae (family Magnoliaceae) and in another family, Lauraceae.

Thus, for projects like identification or circumscription of species which require high resolution, the presently proposed two-locus standard barcode is not sufficient.

 

In the meantime some researchers, without waiting for a perfect barcoding system to emerge, have already started the barcoding of plants, and the results emerging from some of these studies are promising.

For instance, using trnH–psbA as barcodes, floating pennywort (Hydrocotyle ranunculoides L.f.) was distinguished from its most closely related congeners.

 

In another study using matK and trnH–psbA as DNA barcodes, Raghupathy et al. discriminated a new cryptic species of grass Tripogon cope, as deciphered by the hill tribes, from its close relatives in the Western Ghats and part of the Nilgiri Biosphere Reserve in India.

 

Further, using rbcL, matK and trnH–psbA as barcodes, a new genus Vachellia Wight & Arn. was discriminated from its closest relative Acacia Mill.

Subsequent morphometric analysis confirmed the cryptic nature of these sister species and the limitations of the existing classification based on ‘phenetic’ data. Results of some of these studies demonstrated that the DNA barcoding system has the potential to resolve some of the taxonomic problems which cannot be resolved by morphology-based taxonomy alone.


Unit II: Bioinformatics

Bioinformatics: The term was coined by Paulien Hogeweg and Ben Hesper in 1970. It is the science of collecting and analyzing complex biological data using information technology. It involves the computational tools and methods used to manage, analyze and manipulate large volumes of biological data.

Bioinformatics is multidisciplinary: it combines many scientific fields, such as computer science, biology, mathematics, biotechnology, statistics and biochemistry.

Aims of bioinformatics

      Development of databases containing all biological information.

      Development of better tools for data designing, annotation and mining.

      Design and development of drugs by using simulation software.

      Design and development of software tools for protein structure prediction, function annotation and docking analysis.

      Creation and development of software to improve tools for analyzing sequences for their function and similarity with other sequences.


Where bioinformatics helps:

      Experimental molecular biology

      Genetics and genomics

      Generating biological data

      Analysis of gene and protein expression

      Comparison of genomic data

      Simulation and modeling of DNA, RNA and protein

 

Biological databases:

Biological data are complex, exception-ridden, vast and incomplete. Therefore several databases have been created and interpreted to ensure unambiguous results. A collection of biological data arranged in computer-readable form, which speeds up search and retrieval and is convenient to use, is called a biological database. A good database must contain up-to-date information.

 

Ø  One of the hallmarks of modern genomic research is the generation of enormous amounts of raw sequence data.

Ø  As the volume of genomic data grows, sophisticated computational methodologies are required to manage the data deluge.

Ø  Thus, the very first challenge in the genomics era is to store and handle the staggering volume of information through the establishment and use of computer databases.

Ø  A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system.

Ø  A simple database might be a single file containing many records, each of which includes the same set of information.

Ø  The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information.

Example: A few popular databases are GenBank from NCBI (National Center for Biotechnology Information), SwissProt from the Swiss Institute of Bioinformatics and PIR from the Protein Information Resource.
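
The idea of a simple database as a set of records sharing the same fields, retrieved by search criteria, can be sketched in Python as follows. This is only an illustration: the record IDs, field names and values are invented, and real databases such as GenBank use far richer record formats.

```python
# Hypothetical records: every record carries the same set of fields.
# The IDs and values below are made up for illustration only.
records = [
    {"id": "REC001", "organism": "Arabidopsis thaliana", "molecule": "DNA", "length": 1200},
    {"id": "REC002", "organism": "Oryza sativa", "molecule": "protein", "length": 350},
    {"id": "REC003", "organism": "Arabidopsis thaliana", "molecule": "protein", "length": 410},
]

def query(records, **criteria):
    """Return every record whose fields match all the given criteria."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

hits = query(records, organism="Arabidopsis thaliana", molecule="protein")
print([r["id"] for r in hits])
```

Because every record has the same structure, one small query function is enough to retrieve entries by any combination of fields.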

 

Importance of biological database

A range of information such as biological sequences, structures, binding sites, metabolic interactions, molecular action, functional relationships, protein families, motifs and homologs can be retrieved by using biological databases. The main purpose of a biological database is to store and manage biological data and information in computer-readable form.

 

Types of Biological Databases

 

Based on their contents, biological databases can be roughly divided into two categories 

 

1. Primary databases

·         Primary databases are also called archival databases.

·         They are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure.

·         A primary database contains only sequence or structural information.

·         Databases derived from the analysis or treatment of primary data are secondary databases. They are very important for inferring protein function.

·         Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.

·         Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.

 

Examples of some primary biological databases:

GenBank,

EMBL,

Swiss-Prot,

Protein Information Resource (PIR)

 

GenBank

One of the fastest growing repositories of known nucleotide sequences, GenBank (Genetic Sequence Databank) has a flat-file structure. Each entry is an ASCII text file, readable by both humans and computers. Besides sequence data, GenBank files contain information such as accession numbers and gene names, phylogenetic classification and references to published literature. This database has been developed and maintained at the NCBI, Bethesda, MD, USA, as a part of the International Nucleotide Sequence Database Collaboration (INSDC).

It is an open access sequence database.

It coordinates with individual laboratories and other sequence databases like EMBL and DDBJ.

It is an annotated collection of all nucleotide sequences that are available to the public.
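
To show why a keyword-based ASCII flat file is readable by both humans and computers, here is a minimal Python sketch that parses one record. The record below is a made-up toy in a GenBank-like layout, not a real entry, and the parser handles only the few keywords shown; the function name parse_record is an assumption.

```python
# A made-up record in a GenBank-like layout (not a real database entry).
flat_file = """\
LOCUS       TOY0001      24 bp    DNA
DEFINITION  Hypothetical example sequence.
ACCESSION   TOY0001
ORIGIN
        1 atggcctaag gtacctgaat tcgg
//
"""

def parse_record(text):
    """Collect keyword fields and the sequence from one flat-file record."""
    fields, sequence, in_origin = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):              # end-of-record marker
            break
        if in_origin:
            # Sequence lines: drop the position number and the spaces.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True                   # the sequence follows this keyword
        elif line and not line[0].isspace():
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["sequence"] = "".join(sequence).upper()
    return fields

record = parse_record(flat_file)
print(record["ACCESSION"], len(record["sequence"]))
```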

 

The nucleotide database was divided into three databases at NCBI:

 Core Nucleotide database, Expressed Sequence Tag (EST) and Genome Survey Sequence (GSS). The Core Nucleotide database holds most of the nucleotide sequences in use; it includes all nucleotide records that are not in the EST and GSS databases. Sequences can be submitted to GenBank using the BankIt, Sequin and tbl2asn tools.

 

EMBL (European Molecular Biology Laboratory)

• A comprehensive database of DNA and RNA sequences, the EMBL nucleotide sequence database is collected from the scientific literature and patent offices, and from direct submissions by researchers. EMBL has been prepared in collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ).

• It was established in 1980.

• It is maintained by the EBI (European Bioinformatics Institute).

 

Swiss-Prot

This is a curated protein sequence database that offers a high level of integration with other databases and also has a very low level of redundancy. Swiss-Prot strives to provide protein sequences with a high level of annotation (for instance, the description of protein function, domain structure and post-translational modifications).

 It was established in 1986 and has been maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library.

TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all translations of EMBL nucleotide sequence entries that are not yet integrated in Swiss-Prot.

Currently Swiss-Prot has 0.5 million and TrEMBL has 7.6 million sequences.

 

Protein Information Resource (PIR)

 PIR is an integrated public bioinformatics resource to support genomic and proteomic research and scientific studies. Nowadays, PIR offers a wide variety of resources mainly oriented to assisting the propagation and consistency of protein annotations, such as PIRSF, iProClass and iProLINK.

 

2. Secondary databases

·         Secondary databases comprise data derived from the results of analyzing primary data.

·         Secondary databases often draw upon information from numerous sources, including other databases (primary and secondary), controlled vocabularies and the scientific literature.

·         They are highly curated, often using a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from the public record of science.

 

·         InterPro (protein families, motifs and domains)

·         UniProt Knowledgebase (sequence and functional information on proteins) 

·         Ensembl (variation, function, regulation and more layered onto whole genome sequences)

 

 However, many data resources have both primary and secondary characteristics. For example, UniProt accepts primary sequences derived from peptide sequencing experiments. However, UniProt also infers peptide sequences from genomic information, and it provides a wealth of additional information, some derived from automated annotation (TrEMBL), and even more from careful manual analysis (SwissProt).

There are also specialized databases that cater to particular research interests. For example, Flybase, HIV sequence database, and Ribosomal Database Project are databases that specialize in a particular organism or a particular type of data.

 

Example of Secondary Biological Database

Protein data bank

• PDB (Protein data bank) is a repository for 3D structural data obtained by x-ray crystallography or NMR spectroscopy of proteins and nucleic acids.

• The Research Collaboratory for Structural Bioinformatics (RCSB) PDB provides a variety of tools and resources for studying the structures of biological macromolecules and their relationships to sequence, function and disease.

 

Importance of Databases

·         Databases act as a store house of information.

·         Databases are used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria.

·         It allows knowledge discovery, which refers to the identification of connections between pieces of information that were not known when the information was first entered. This facilitates the discovery of new biological insights from raw data.

·         Secondary databases have become the molecular biologist’s reference library over the past decade or so, providing a wealth of information on just about any gene or gene product that has been investigated by the research community.

·         It helps to solve cases where many users want to access the same entries of data.

·         Allows the indexing of data.

·         It helps to remove redundancy of data.

 

 

Exploration of Data Bases:

Data exploration is the first step in biological data analysis. It involves the use of data visualization tools and statistical techniques to uncover the characteristics of a data set and its initial patterns. During exploration, raw or primary data is typically reviewed with a combination of manual workflows and automated data-exploration techniques to visually explore data sets, look for similarities, patterns and outliers, and identify the relationships between different variables.
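
As a small illustration of this first-pass exploration, the Python sketch below computes one summary statistic (GC content) for a handful of sequences and flags possible outliers. The sequences are invented, and the outlier threshold of 0.4 is an arbitrary assumption chosen only for the example.

```python
from statistics import mean, median

# Hypothetical sequences standing in for a freshly retrieved data set.
sequences = {
    "seq1": "ATGCGCGCTA",
    "seq2": "ATATATATAT",
    "seq3": "GGGCCCGGGC",
    "seq4": "ATGCATGCAT",
}

def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

gc = {name: gc_content(seq) for name, seq in sequences.items()}
print("mean GC:", mean(gc.values()), "median GC:", median(gc.values()))

# Flag sequences whose GC content lies far from the median (arbitrary threshold).
middle = median(gc.values())
outliers = [name for name, value in gc.items() if abs(value - middle) > 0.4]
print("possible outliers:", outliers)
```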

 

Retrieval of Desired Data:

In any type of database, data retrieval is the process of identifying and extracting data from the database, based on a query provided by the user or application.

Data retrieval typically requires writing and executing data retrieval or extraction commands or queries on a database. Based on the query provided, the database looks for and retrieves the data requested. Applications and software generally use various queries to retrieve data in different formats. In addition to simple or smaller data, data retrieval can also include retrieving large amounts of data, usually in the form of reports.
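
A query-based retrieval of this kind can be sketched with Python's built-in sqlite3 module. The table name, column names and rows are hypothetical and chosen only for illustration; biological databases expose their own query interfaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE entries (accession TEXT, organism TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [   # hypothetical rows
        ("TOY0001", "Arabidopsis thaliana", 1200),
        ("TOY0002", "Oryza sativa", 800),
        ("TOY0003", "Arabidopsis thaliana", 450),
    ],
)

# The query states WHAT is wanted; the database finds and returns the matches.
rows = conn.execute(
    "SELECT accession, length FROM entries "
    "WHERE organism = ? AND length > ? ORDER BY accession",
    ("Arabidopsis thaliana", 500),
).fetchall()
print(rows)
```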

 

BLAST (Basic Local Alignment Search Tool):

An important goal of genomics is to determine if a particular sequence is like another sequence. This is accomplished by comparing the new sequence with sequences that have already been reported and stored in a database. Further, phylogenetic studies are necessary to determine the orthologous/paralogous nature of the two aligned sequences.

Sequence alignments are basically of two types: global and local.

The global approach compares one whole sequence with other entire sequences. The output of a global alignment is a one-to-one comparison of two sequences. The global approach is useful when you are comparing a small group of sequences, but becomes computationally expensive as the number of sequences in the comparison increases.

The local method uses a subset of a sequence and attempts to align it to a subset of other sequences.

Local alignments reveal regions that are highly similar, but do not necessarily provide a comparison across the entire two sequences. Local alignment tools use heuristic programming methods that are better suited to searching very large databases, but they do not necessarily give the optimal solution.

Even given this limitation, local alignments are very important to the field of genomics because they can uncover regions of homology that are related by descent between two otherwise diverse sequences.
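
The difference between the global and the local approach can be made concrete with a small Python sketch. It uses exact dynamic programming (Needleman-Wunsch style for global, Smith-Waterman style for local), which is swapped in here for clarity; it is not the heuristic used by database search tools. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of sub-sequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

a, b = "TTTACGTTTT", "GGACGGG"           # hypothetical sequences sharing only "ACG"
print("global:", alignment_score(a, b, local=False))
print("local: ", alignment_score(a, b, local=True))
```

For two otherwise unrelated sequences that share a short region, the local score picks out the shared region while the global score is dragged down by the mismatching flanks.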


The most common local alignment tool is BLAST (Basic Local Alignment Search Tool), developed by Altschul et al. (1990). The operative phrase in the name is local alignment. BLAST is a set of algorithms that attempt to find a short fragment of a query sequence that aligns perfectly with a fragment of a subject sequence found in a database. The score of that initial alignment must be greater than a neighborhood score threshold (T).

For the original BLAST algorithm, the fragment is then used as a seed to extend the alignment in both directions. The extension continues until the score of the aligned segment no longer increases. Said another way, BLAST looks for short sequences in the query that match short sequences found in the database.
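
The seed-and-extend idea can be sketched in Python as below. This is a simplified illustration under stated assumptions, not BLAST's actual implementation: it starts from an exact word hit, allows no gaps, uses a toy scoring of +1 for a match and -1 for a mismatch instead of a substitution matrix, and simply scans to the end of the sequences while remembering the best score seen. The sequences and the function name are hypothetical.

```python
def extend_seed(query, subject, q_start, s_start, word=3, match=1, mismatch=-1):
    """Ungapped extension of an exact word hit in both directions.

    Returns (best_score, query_start, query_end) of the best-scoring segment.
    """
    best = score = word * match                  # score of the seed itself
    best_left, best_right = q_start, q_start + word

    # Extend to the right of the seed.
    i, j = q_start + word, s_start + word
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, best_right = score, i

    # Extend to the left, continuing from the best right-hand score.
    score = best
    i, j = q_start - 1, s_start - 1
    while i >= 0 and j >= 0:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, best_left = score, i
        i -= 1
        j -= 1
    return best, best_left, best_right

query, subject = "GGACGTACCA", "TTACGTACTT"       # hypothetical sequences
score, start, end = extend_seed(query, subject, q_start=3, s_start=3)
print(score, query[start:end])
```

Here the three-letter seed CGT grows into the six-letter segment ACGTAC, after which further extension only lowers the score.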

The first step of the BLAST algorithm is to break the query into short words of a specific length. A word is a series of characters from the query sequence. The default word length is three characters. The words are constructed by using a sliding window of three characters.

For example, twelve amino acids near the amino terminal of the Arabidopsis thaliana protein phosphoglucomutase sequence are:

NYLENFVQATFN

This sequence is broken down into three-character words by selecting the first three amino acid characters, moving over one character, selecting the next three amino acid characters, and so on, to create the following ten words:

NYL YLE LEN ENF NFV FVQ VQA QAT ATF TFN
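
The sliding-window step can be written in a few lines of Python. This is a minimal sketch; the function name make_words is an assumption, and the word size of three follows the example in the text.

```python
def make_words(sequence, size=3):
    """Slide a window of the given size along the sequence, one character at a time."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

words = make_words("NYLENFVQATFN")
print(len(words), " ".join(words))
```

A sequence of length L gives L - size + 1 words, so the twelve-residue example yields ten words.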

These words are then compared against a sequence in a database. Here is an example of a word match with rabbit muscle phosphoglucomutase (subject line):

Query          ENF

Subject  SSTNYAENTIQSIISTVEPAQR

 

On the basis of the search query and the search database, BLAST is of the following types:


·         BLASTn (Nucleotide BLAST): Compares one or more nucleotide query sequences to a subject nucleotide sequence or a database of nucleotide sequences. This is useful when trying to determine the evolutionary relationships among different organisms by using nucleotide sequences.

·         BLASTx (translated nucleotide sequence searched against protein sequences): Compares a nucleotide query sequence that is translated in six reading frames (resulting in six protein sequences) against a database of protein sequences. Because BLASTx translates the query sequence in all six reading frames and provides combined significance statistics for hits to different frames, it is particularly useful when the reading frame of the query sequence is unknown or when it contains errors that may lead to frame shifts or other coding errors. Thus BLASTx is often the first analysis performed with a newly determined nucleotide sequence.

·         tBLASTn (protein sequence searched against translated nucleotide sequences): Compares a protein query sequence against the six-frame translations of a database of nucleotide sequences. It is useful for finding homologous protein coding regions in unannotated nucleotide sequences such as expressed sequence tags (ESTs) and draft genome records (HTG), located in the BLAST databases est and htgs, respectively. A tBLASTn search is the only way to search for these potential coding regions at the protein level.

·         BLASTp (Protein BLAST): Compares one or more protein query sequences to a subject protein sequence or a database of protein sequences. This is useful when trying to identify a protein.
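
The six-frame translation used by BLASTx and tBLASTn can be illustrated with the short Python sketch below. It uses the standard genetic code and a made-up nine-base sequence; the function names are assumptions, and real tools handle ambiguous bases and alternative genetic codes that this sketch ignores.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Complement each base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; trailing one or two bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Three frames on the given strand (+1..+3) and three on its reverse complement."""
    dna = dna.upper()
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames

frames = six_frames("ATGGCCTAA")       # hypothetical query sequence
for name, protein in frames.items():
    print(name, protein)
```

Each of the six resulting protein strings would then be compared against the protein database, which is why the reading frame of the query need not be known in advance.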

Protein structure analysis and application:

The function of a protein is directly dependent on its structure, its interactions with other proteins, and its location within cells, tissues, and organs. The structure and function of proteins are studied on a large scale in proteomics, which enables the identification of protein biomarkers associated with specific disease states and provides potential targets for therapeutic treatment. The understanding of protein structure and the mapping of protein location, expression levels, and interactions yield valuable information that can be used to infer protein function.


Protein structure refers to the three-dimensional arrangement of amino acid atoms in a protein molecule. Proteins are polymers formed from sequences of amino acids, the monomers of the polymer. A single amino acid monomer may also be called a residue indicating a repeating unit of a polymer. Proteins form by amino acids undergoing condensation reactions, in which the amino acids lose one water molecule in each reaction attaching to one another with a peptide bond.

 

Protein Structure: The structure of a protein can be classified into four levels (primary, secondary, tertiary and quaternary) by degree of structural complexity.

i.                    Primary structure is the amino acid sequence that makes up the polypeptide chain.

ii.                  In secondary structure, non-covalent interactions, particularly hydrogen bonds between amino acids, form preliminary three-dimensional structures such as α-helices and β-strands.

iii.                Tertiary structure describes the assembly of secondary structures to obtain the overall structure of the protein, mainly divided into globular and fibrous proteins.

iv.                Quaternary structure describes the three-dimensional arrangement of a protein having two or more polypeptide chains linked by disulfide bridges or hydrogen bonds.


Protein folding: Protein folding is the physical process by which a polypeptide, translated from a sequence of mRNA as a linear chain of amino acids, folds into its characteristic three-dimensional, functional native structure.


Scientists are still trying to learn how the primary structure of a protein determines its other levels of structure. The primary forces that stabilize a protein's three dimensional structure are:

(a) Sequestration of hydrophobic amino acids away from water;

(b) Maximizing van der Waals contacts in the interior of proteins (minimizing open space);

(c) Maximizing hydrogen bonds (in α helices or β sheets, for example), and

(d) Ion pairing between oppositely charged amino acids (Arg and Glu, for example).

 

PROTEIN STRUCTURE DETERMINATION

The determination of three-dimensional protein structures at atomic resolution is useful in the elucidation of protein function, structure-based drug design, and molecular docking.

NMR: Nuclear magnetic resonance (NMR) spectroscopy is used to obtain information about the structure and dynamics of proteins. In NMR, the spatial location of atoms is determined by their chemical shifts. For protein NMR, proteins are typically labeled with stable isotopes (15N, 13C, 2H) to enhance sensitivity and facilitate structural deconvolution. Isotopic labels are typically introduced by supplying isotopically labeled nutrients in the growth medium during protein expression.

X-ray crystallography: Protein X-ray crystallography can be used to obtain the three-dimensional structure of proteins through X-ray diffraction of crystallized proteins. Crystals are grown by seeding highly concentrated protein in solutions that promote precipitation, with ordered protein crystals forming under suitable conditions. X-rays are aimed at the protein crystal, which scatters the X-rays onto an electronic detector or film. The crystals are rotated to capture diffraction in three dimensions, enabling calculation of the position of each atom in the crystallized molecule by Fourier Transform.

PROTEIN MAPPING

Mapping of the location and expression level of proteins in specific cells, tissues, and organs aids in the functional study of the proteome. Spatial distribution of proteins is key to protein function, with improper localization or expression triggering various disease states. Mapping projects such as the Human Protein Atlas provide a proteomic resource for biomarker discovery and aid in the understanding of disease pathology. Mapping of the interactome helps define the molecular interactions that occur on a cellular level, assisting in the understanding of protein function and providing valuable potential drug targets for disease.

PROTEIN STRUCTURE DETERMINATION Vs. PREDICTION:

Protein structures are experimentally determined using X-ray crystallography, NMR and cryo-electron microscopy methods. However, each method has its own constraints in terms of sample preparation, resolution limits and molecular size.

 

Uses and applications of protein structure:

Ø  Structure-function relationship of a protein.

Ø  Structural characterization of drug target.

Ø  Structure based drug designing.

Ø  Structural basis of the disease.

Ø  Structural mechanism of drug toxicity.

Ø  Structural approach to overcoming drug adverse effects

Ø  Structural perspective to understand evolutionary background of protein.

Ø  Structural biology platform for personalized medicine.

 

Multiple Sequence Alignment

Multiple Sequence Alignment (MSA) is generally the alignment of three or more biological sequences (protein or nucleic acid) of similar length. From the output, homology can be inferred and the evolutionary relationships between the sequences studied.

As in the evolution of enzymes, functionally significant residues are conserved more than the rest of the residues in the sequence. A conserved residue may be present singly or as part of a contiguous short stretch, known as a sequence motif.

The residues inside the active site of the enzyme are conserved for functional roles such as substrate binding and chemical catalysis. The catalytic residues are the actual reactive amino acids and are also involved in stabilization of the transition state.

In addition, glycine and proline residues outside the active site are conserved for accurate folding of the protein, which positions the active site residues for binding and catalysis of the substrate.

Pairwise sequence alignment is used as a similarity search tool to find similar sequences in databases and to identify members of a sequence family. Pairwise sequence alignment allows structural, functional and evolutionary relationships to be drawn between two sequences.

Alignment of several sequences that catalyze the same reaction, however, is more useful for detecting conserved residues and identifying their functional roles. The alignment of several sequences is known as multiple sequence alignment (MSA), which allows detection of conserved residues that are otherwise hidden in a pairwise alignment.

The detection of conserved residues gives an insight into substrate binding, chemical catalysis and folding patterns in proteins. Detection of conserved residues inside and outside the active site may lay the foundation for an initial set of peptides for developing a QSAR model for structure-based protein design.
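
Once sequences are aligned, detecting fully conserved positions is a column-by-column check. The Python sketch below illustrates this on three invented, already aligned protein fragments; the sequences and the function name are hypothetical, and real analyses usually score partial conservation as well.

```python
# Hypothetical pre-aligned protein fragments ('-' marks a gap).
alignment = [
    "MKGSH-LVP",
    "MRGSHALVP",
    "MKGTH-LIP",
]

def conserved_columns(alignment):
    """Return (position, residue) for every column identical in all sequences."""
    conserved = []
    for pos, column in enumerate(zip(*alignment)):   # zip(*...) walks the columns
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved.append((pos, column[0]))
    return conserved

print(conserved_columns(alignment))
```

With only two sequences many columns would look conserved by chance; adding more sequences narrows the list to the positions that are most likely functionally important.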

 

Further, an MSA is a prerequisite for constructing a phylogenetic tree reflecting evolutionary divergence over time. The phylogenetic tree, therefore, may be used to select closely related multiple template structures for interactive homology modeling of protein sequences.

 

Phylogenetic analysis:

What is Phylogenetic Analysis?

Phylogenetic analysis is the study of the evolutionary development of a species, a group of organisms or a particular characteristic of an organism, and of the relationships within and among species.

       Branching diagrams are made during phylogenetic analysis to represent the evolutionary relationships between the same or different species and organisms that have developed from a common ancestor.

       It is also used to analyze the characteristics of an organism’s genes, proteins, organs, etc.

       The diagram that is used during the phylogenetic analysis is also known as the phylogenetic tree.

       Phylogenetic analysis is used for different purposes that include a collection of biological diversity information, information on genetic classification, as well as learning of different developmental stages that occur during evolution.

       As the sequencing techniques have been advancing at a rapid pace, phylogenetic analysis has been used to understand the evolutionary relationship among the species by comparing the sequences of the gene.

       When genetic material is not available, morphological data can be used to estimate evolutionary relationships.

       However, most current phylogenetic analysis software and algorithms have limitations such as low accuracy, high time complexity, complex results, and restrictive assumptions about the size of the data set.

       When two sequences of two organisms are quite similar, then we assume that both of the organisms have been derived from the same ancestors.

To understand the distance, clade, taxa, and the relationship between the species in a tree, it is important to understand the phylogenetic tree.


What is Phylogenetic Tree?

       The diagram that represents the lines of evolutionary descent of different species, organisms, or genes from a common ancestor is known as a phylogenetic tree or simply phylogeny.

       Since the time of Charles Darwin, tree diagrams have been used in evolutionary biology.

       The tree is composed of leaves (tips), nodes, and branches, where two neighbouring nodes (taxonomic units) are connected by one branch (internal branch).

       In the phylogenetic tree, species, populations, individuals, or genes are represented by leaves, and these leaves are connected to nodes by a branch (external branch).

       The flow of genetic information between subsequent generations is represented by the branch, and genetic change or divergence is denoted by the branch length.

       Similarly, the degree of divergence is generally estimated as the average number of nucleotide substitutions per site.

       A node represents the point at which two or more descendant lineages arise from an ancestral lineage when the phylogenetic tree is read from the root towards the tips.

       After such a split, each newly generated lineage evolves independently.

       Topology represents the evolutionary development of the generation through the progressive branching pattern created by lineage splitting.

       The phylogenetic tree can be rooted or unrooted as well as scaled or unscaled, depending on our study requirements and what kind of tree we require.

       So rooting of the phylogenetic tree is essentially required for a better understanding of the directionality of evolution and genetic evidence.

       There are various methods to accurately estimate the tree root using gene sequence data and assumptions that include a molecular clock, midpoint rooting, and outgroup rooting.

       Whereas the phylogenetic tree that is unrooted only represents the relationship among the species without showing the ancestral root of origin.

       Similarly, in a scaled tree the branch length is proportional to the amount of genetic divergence that took place on the branch.

 

Different parts of the phylogenetic tree

·         Branches: The path of genetic information transfer from one generation to another is represented by the branches. Genetic change is denoted by the branch length; that is, a longer branch means that more genetic change, or a higher degree of divergence, has occurred. Generally, we estimate the average number of nucleotide or protein substitutions per site to measure the genetic change.

It is more common to see the branch length represented by a scale bar, and this scale indicates the number of substitutions per site. Branch length can be shown on the phylogeny.

Figure: A sample sequence alignment of human and mouse sequences.

In the simple alignment of human and mouse sequences shown in the figure, we can count the number of sites that differ between the two sequences. One site out of ten differs, so there are 1/10 = 0.1 substitutions per site. In practice, an evolutionary model is used for inferring the genetic changes that have occurred.
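
The calculation of substitutions per site can be reproduced with a few lines of Python. The two ten-base sequences below are invented stand-ins for the human and mouse sequences of the figure, chosen so that exactly one site differs; the function name p_distance is an assumption. This simple proportion does not apply any evolutionary model.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)

human = "ACGTTGCAAC"   # hypothetical aligned sequences, one differing site
mouse = "ACGTTGCATC"
print(p_distance(human, mouse))
```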


·         Nodes: At the ends of the branches there are nodes, which represent sequences at various points in evolutionary history. The nodes of a tree consist of the tips, the internal nodes, and the root.

·         Tips: Tips are also known as external nodes and represent the sample sequences or the species of interest in the constructed phylogenetic tree.

·         Internal Nodes: The junction points where more than one branch meets, which represent ancestral sequences, are called internal nodes.

·         Root: The root is the most recent common ancestor of all the taxa and one of the important internal nodes in the tree. It tells us the direction of evolution and is the oldest part of the tree. Genetic information flows from the root toward the tips.
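
The parts of a tree described above map naturally onto a small nested data structure. In the Python sketch below, each node is a tuple of name, branch length and children; all names and branch lengths are invented for illustration, and the representation itself is an assumption rather than a standard format.

```python
# Each node: (name, branch_length_to_parent, children).
# The root has no parent, so its branch length is 0. All values are invented.
tree = ("root", 0.0, [
    ("ancestor1", 0.1, [            # an internal node
        ("speciesA", 0.2, []),      # tips have no children
        ("speciesB", 0.3, []),
    ]),
    ("speciesC", 0.5, []),
])

def tip_distances(node, distance_so_far=0.0):
    """Map each tip name to the summed branch length from the root."""
    name, length, children = node
    total = distance_so_far + length
    if not children:                # a tip (external node)
        return {name: total}
    result = {}
    for child in children:
        result.update(tip_distances(child, total))
    return result

distances = tip_distances(tree)
for tip, dist in distances.items():
    print(tip, round(dist, 2))
```

Reading the structure from the root toward the tips follows the direction of evolution, and summing branch lengths along the way gives the total divergence of each tip from the root.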

 

Construction of Phylogenetic tree:

Prior to 1990, phylogenetic inferences were generally presented as narrative scenarios. Such methods are often ambiguous and lack explicit criteria for evaluating alternative hypotheses.

Phylogenetic analysis as a whole can be classified into distance-based and character-based analysis; an MSA is the raw material for both. In the MEGA software, a new alignment is created and the sequences of the genes to be compared are retrieved in FASTA format.

The retrieved sequences are not yet aligned, so the alignment is performed within MEGA and then saved in the MEGA format. The final file is fully aligned and shows the regions of similarity among the nucleotide sequences that were taken.

The tree is generated from the ClustalW alignment, using either all sites of the multiple alignment reconstructed from the BLAST results or gap-free sites only. The phylogenetic relationships within the data set can then be analyzed from the tree and used in further work; the length of each clade corresponds to the time units given with the phylogenetic tree.
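
As an illustration of a distance-based method, the Python sketch below builds a tree with UPGMA, one simple clustering approach: the two closest clusters are joined repeatedly, and distances to the new cluster are averaged. This is named here only as an example technique, not as the procedure MEGA applies by default. The taxon names and pairwise distances are invented, and the function returns only the joining order, not branch lengths.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA; dist maps frozenset({a, b}) -> distance.

    Returns a nested tuple describing the order in which taxa were joined.
    """
    clusters = {name: (name, 1) for name in names}    # label -> (subtree, size)
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x < y),
            key=lambda pair: d[frozenset(pair)],
        )
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Distance from the new cluster to each remaining one: size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
    return next(iter(clusters.values()))[0]

names = ["A", "B", "C", "D"]                          # hypothetical taxa
dist = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("B", "C")): 6,
    frozenset(("A", "D")): 10, frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
tree = upgma(names, dist)
print(tree)
```

In this toy data set A and B are joined first, then C, then D, mirroring the increasing pairwise distances.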

Application of Phylogenetic Studies:

       Phylogenetic analysis provides in-depth knowledge and understanding of how species evolved through different genetic modifications. Using this technique, scientists and researchers are able to identify the path that connects present-day organisms with their ancestral origin and to predict the genetic divergence that might occur in the future.

       It has applications in wide areas including medical and biological fields along with forensic science, conservation biology, drug discovery, epidemiology, prediction of protein structure and functions, and gene function prediction.

       It is applicable to accurately estimate the evolutionary relationship among species by the use of gene sequence data in molecular phylogenetic analysis.

       Phylogenetic analysis is also applicable for gathering information related to pathogen outbreaks. Besides this, the source of pathogen transmission can also be investigated by tracing epidemiological linkages.

       It plays a vital role in conservation biology for predicting which species are on the verge of extinction and which species must be taken care of.

       Phylogenetic analysis is also applicable in comparative genomics for studying relationships between the genome of different species and one example of it is gene finding.

       Similarly screening pharmacologically related species helps to identify the members which have closer pharmacological significance.