Methods
Figure 1. Method flowchart. Data were extracted from GSE152216, processed using Galaxy and Python, and then analyzed with UCSC BLAT Search (cross-referenced via BLAST) and g:Profiler.
Data Mining
Data used for this project are available under GEO accession GSE152216, published by Hänsel-Hertsch et al. (2020) [8, 9]. The dataset contains qG4-ChIP-seq data from 22 PDTX models, with 4 technical replicates for each model. Three patient models (STG139M_181, STG143_284, STG139M_284) were chosen at random for this project, and the data for the first replicate of each were used (GSM4609246 for STG139M_181 rep1, GSM4609259 for STG143_284 rep1, GSM4609251 for STG139M_284 rep1) (Figure 2A). Raw next-generation sequencing (NGS) data for the 3 patient models were obtained by opening each GSM record and viewing its Sequence Read Archive (SRA) entry (Figure 2B). Each SRA entry lists multiple SRA run files, and the run with the highest number of bases was chosen for each model (SRR11978411 for the STG139M_181 model, SRR11978480 for the STG143_284 model, SRR11978440 for the STG139M_284 model) (Figure 2C). FASTA files were downloaded with both the “Filter” and “Clipped” pre-processing options selected. These files contain the short immunoprecipitated DNA sequences derived from the PDTX models.
Figure 2. Example screenshots of the data mining workflow. A) Excerpt from the GEO Accession Viewer page for GSE152216; the 3 randomly chosen patient models are boxed. B) Example GEO Accession Viewer page for sample GSM4609246, corresponding to Patient 1; the SRA Run Selector link used to view SRR files is boxed. C) All SRR files from a sample; the file with the highest total base count was chosen, and the FASTA file was downloaded with both the “Filter” and “Clipped” pre-processing options selected.
Data Processing
The c-MYC promoter sequence was chosen because it has been shown to contain G4-forming sequences. The motif chosen for alignment was GGG GAG GGT GGG GAG GGT GGG GA [10, 11]. This motif was searched against the SRR FASTA files from the Data Mining step using the fuzznuc tool on Galaxy, allowing up to 3 mismatches (Figure 3A). The complementary strand was also searched (Figure 3A). The output text file was then run through a custom Python script (see Supplementary Materials) to extract, from the SRR files, the full raw sequences containing the desired G4 motif (Figures 3B and 3C).
Figure 3. Example screenshots of the data processing workflow. A) Excerpt from the fuzznuc tool on Galaxy. B) Example preview of the fuzznuc output. C) Example output from the Python script, showing the extracted full raw sequences containing the desired motif.
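The motif search and read extraction described above can be illustrated with a minimal Python sketch. This is not the script from the Supplementary Materials: it performs the approximate search directly rather than parsing fuzznuc output, and it assumes that a "mismatch" means a base substitution (no insertions or deletions). The function names (`parse_fasta`, `has_approx_match`, `reads_with_motif`) are hypothetical and chosen only for illustration.

```python
from typing import Iterator, List, Tuple

# Motif from the text, with the spacing removed (23 nt).
MOTIF = "GGG GAG GGT GGG GAG GGT GGG GA".replace(" ", "")

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def has_approx_match(read: str, pattern: str, max_mismatches: int) -> bool:
    """True if some window of `read` differs from `pattern` by at most
    `max_mismatches` substitutions (assumed meaning of "mismatch")."""
    read = read.upper()
    n = len(pattern)
    for i in range(len(read) - n + 1):
        mismatches = 0
        for a, b in zip(read[i:i + n], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # this window fails; try the next offset
        else:
            return True  # inner loop finished without exceeding the limit
    return False


def reads_with_motif(fasta_text: str, motif: str = MOTIF,
                     max_mismatches: int = 3) -> List[Tuple[str, str]]:
    """Return the full reads that contain the motif on either strand."""
    rc = reverse_complement(motif)
    return [
        (header, seq)
        for header, seq in parse_fasta(fasta_text)
        if has_approx_match(seq, motif, max_mismatches)
        or has_approx_match(seq, rc, max_mismatches)
    ]
```

Calling `reads_with_motif(open(path).read())` on a downloaded FASTA file (where `path` is a placeholder for the local file name) would return the header and full sequence of every read with a hit. For large SRR files, a streaming parser would be preferable to reading the whole file into memory.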
Data Analysis
Two full raw sequences containing the desired G4 motif, obtained as described in the Data Processing section, were chosen from each of the 3 patient models' SRR files, giving a total of 6 raw sequences used as references. A UCSC BLAT Search was performed on each sequence to determine where, and over which portion of the sequence, it aligned to the human genome (hg19) (Figure 4). The location of each reference sequence was then viewed on the UCSC Genome Browser to determine whether it fell within any gene. As a further cross-reference, BLAST was performed on the aligned sequence together with ~100 bp of upstream and downstream flanking sequence to double-check whether the sequence could be found in any gene. For each patient, a few genes identified through this cross-referencing were noted for any regulatory or genetic elements. From these data, we chose 12 genes, with their respective ENSG codes, for g:Profiler enrichment analysis (refer to the Genomic Analysis Table in the Supplementary Materials tab).
Figure 4. Example screenshot of the UCSC BLAT Search output. “a” (in red) links to the UCSC Genome Browser, which shows the location of the BLAT sequence along with information on the surrounding region. “b” shows the full aligned sequence and the sequences ~100 bp upstream and downstream.
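The ~100 bp flanking window used for the BLAST cross-check amounts to simple coordinate arithmetic on a BLAT hit. The sketch below is illustrative only: the `Interval` class and both functions are hypothetical, hits are assumed to be stored as 0-based, half-open intervals, and the window is clamped at the chromosome start but not at its end (which would require chromosome lengths).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive


def flank(hit: Interval, flank_bp: int = 100) -> Interval:
    """Extend a hit by `flank_bp` on both sides, clamping at position 0."""
    return Interval(hit.chrom, max(0, hit.start - flank_bp), hit.end + flank_bp)


def to_browser_position(iv: Interval) -> str:
    """Format as a 1-based, inclusive position string (chrom:start-end)."""
    return f"{iv.chrom}:{iv.start + 1}-{iv.end}"
```

With a hit stored as `Interval(chrom, start, end)`, `to_browser_position(flank(hit))` gives a position string that can be pasted into the Genome Browser to retrieve the hit plus its flanking sequence.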