Tandem repeats are repeating DNA sequence motifs that make up approximately 3% of the human genome.1 Tandem repeats whose repeat unit is 1-6 base pairs (bp) in length are classified as short tandem repeats (STRs), which are known to be enriched in conserved genomic regions.2 Some STRs have been shown to alter gene expression and are therefore termed expression short tandem repeats (eSTRs).2 The specific mechanisms by which they influence gene expression remain unclear; however, eSTRs have been shown to colocalize with regulatory elements and to modulate histone modifications.2 They have also been shown to serve as binding sites for some transcriptional repressor proteins,3 and may affect the affinity of such proteins for their binding regions. STRs have notably high mutation rates between generations because DNA polymerase slippage during replication alters the number of repeats present in daughter strands. In particular, STR mutation rates range from 10⁻⁸ to 10⁻² mutations per locus per generation, and can therefore far exceed those of single nucleotide polymorphisms (SNPs), which typically have rates close to 10⁻⁸ mutations per generation.4 STRs thus make a substantial contribution to overall genetic variation in the human population. Because they are also correlated with gene expression, they may contribute to clinical phenotypes such as Huntington’s disease and fragile X syndrome.5
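To make the STR definition concrete, the following is a minimal sketch of how runs of a 1-6 bp repeat unit could be located in a sequence. The function name `find_strs`, the `min_repeats` threshold of three, and the example sequence are illustrative assumptions rather than part of any cited method; dedicated STR callers handle imperfect repeats and sequencing error, which this sketch does not.

```python
import re

def find_strs(seq, min_unit=1, max_unit=6, min_repeats=3):
    """Return (start, unit, repeat_count) for each perfect tandem repeat.

    Assumptions (hypothetical, for illustration only): repeats are perfect,
    the sequence uses only A/C/G/T, and a run needs at least `min_repeats`
    copies of the unit to be reported.
    """
    # The lazy quantifier tries the shortest repeat unit first, so "ATATAT"
    # is reported with unit "AT" rather than a longer unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        repeat_count = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, repeat_count))
    return hits

# Example: a CAG repeat (the unit expanded in Huntington's disease) and an AT repeat.
example = "GGTACAGCAGCAGCAGTCATATATATGC"
for start, unit, count in find_strs(example):
    print(f"position {start}: ({unit}) x {count}")
```

A polymerase slippage event would appear in such output as a change in the repeat count at an otherwise unchanged position and unit.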
In 2014, Melnikov et al. published a protocol for a massively parallel reporter assay (MPRA), which was used to assess an entire library of synthesized oligonucleotides in parallel.6 Briefly, the team designed an oligonucleotide library containing the sequences to be studied and obtained it from a commercial vendor. The single-stranded DNA was made double-stranded and amplified by emulsion PCR, and the amplified products were cloned into plasmid vectors along with a luciferase reporter gene. The result was a library of plasmid vectors, each containing the reporter gene and one of the many DNA sequences to be analyzed. The entire plasmid library was then cotransfected into a human cell line, and the effects on gene expression were measured from sequencing counts of the mRNA transcribed from the reporter gene.6
Recent developments in sequencing technology have enabled multiplexed sequencing of heterogeneous samples, in which different sequences are distinguished by a designed, unique barcode incorporated into each sample sequence.7 Because the output of an MPRA is the set of reporter gene mRNA transcripts collected from a large number and variety of transfected cells, each collected transcript must include a barcode at its 3’ end that indicates which sample it is associated with. High-throughput (multiplexed) RNA sequencing is well suited to handling this volume of transcripts in a cost-effective manner. In RNA sequencing, mRNA is converted into cDNA, which is then run through high-throughput next-generation sequencing to obtain both the sequence of each read and the read counts.8 The process can be made even more cost-effective with a newer method, Tag-Seq, which requires only the barcode at the 3’ end of the sequence for analysis. Tag-Seq generates a read only from the 3’ end of the cDNA, which captures the barcode and identifies what was transcribed, supporting downstream analysis of overall gene expression.9 This method may be less reliable for determining the exact sequence of each transcript, but it reports how many reads each transcript produces, which is directly related to gene expression.
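The barcode counting step described above can be sketched as follows. This is a simplified illustration, not the pipeline used in the cited work: it assumes a fixed-length barcode occupying the last bases of each read, exact barcode matching, and a hypothetical lookup table `barcode_to_sample`. The function name, barcodes, sample names, and reads are all invented for the example.

```python
from collections import Counter

def count_barcodes(reads, barcode_to_sample, barcode_len):
    """Tally reads per sample using the barcode at the 3' end of each read.

    Assumptions (hypothetical): the barcode is the final `barcode_len` bases
    of the read, and only exact matches to a known barcode are counted.
    Returns (per-sample counts, number of reads with no known barcode).
    """
    counts = Counter()
    unassigned = 0
    for read in reads:
        barcode = read[-barcode_len:].upper()
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned += 1  # sequencing error or unknown barcode
        else:
            counts[sample] += 1
    return counts, unassigned

# Invented example data: two library members, four reads.
barcode_to_sample = {"GACCA": "STR_5_repeats", "TTGCA": "STR_10_repeats"}
reads = ["ACGTTTGACCA", "GGGTACTTGCA", "TTTTTTGACCA", "ACGTACGNNNN"]

counts, unassigned = count_barcodes(reads, barcode_to_sample, barcode_len=5)
for sample, n in sorted(counts.items()):
    print(sample, n)
print("unassigned reads:", unassigned)
```

Because only the barcode is inspected, the rest of the transcript sequence is never verified, which reflects the trade-off noted above between sequence-level reliability and inexpensive read counting.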