Methods

Constructing a motif finder

G-quadruplexes (G4s) were identified using RegEx [14] . RegEx is a module that uses pattern matching to identify specified motifs in a given sample [14-16]. Previous research indicates that G4s can be identified through the repeating DNA motif of three guanines followed by a stretch of one to seven varying nucleotides G(3+) + N(1-7) + G(3+) + N(1-7) + G(3+) + N(1-7) + G3(+) [3] . In our model we have chosen to be more stringent in G quadruplex identification and passed the following expression to RegEx:

GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG

Genome sequences for organisms of interest were then downloaded from the National Center for Biotechnology Information (NCBI) as FASTA files. The G4 motif finder was then applied to these sequences and provide matches for each sequence as either an iterator or a list, depending on the needs of the user.

Let's apply this to the Thermus thermophilus genome

Calculating the frequency of G4s of an organism

The matches can then be quantified by finding the length of output of the motif finder. This analysis is then furthered by calculating the frequency of G4s for each sequence after depositing genome lengths and G4 counts into individual lists.

Summarizing the results

The results of this analysis can be converted into a DataFrame to be easily visualized by the user.

Associating G-quadruplexes with genes

Once G4s were identified, their positions were compared with the known positions of genes from annotated genome files downloaded from the NCBI website as FASTA files.

Let's apply this to yeast chromosome 1

If the G4 is not within a gene, find the closest gene and the downstream gene.

Converting the results into a data frame of top-hits

Representative example generating an image for total G4s for each yeast chromosome

Associated Code and Data files

Figure Generation

All figures were made in Python using matplotlib.pyplot. Figures associated from GO-term enrichment analysis are adapted from G:Profiler [17].

Data Files

All genome sequences were downloaded from the NCBI website

Human genome sequences: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/

Annotated human genome sequences: https://www.ncbi.nlm.nih.gov/datasets/gene/GCF_000001405.40/

Yeast genome sequences: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000146045.2/

Annotated yeast genome sequences: https://www.ncbi.nlm.nih.gov/datasets/gene/GCF_000146045.2/

Candidatus Carsonella ruddii genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000287275.1/

Buchnera aphidicola genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_003099975.1/

Ehrlichia chaffeensis genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000632965.1/

Saccharopolyspora erythraea genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000062885.1/

Methanococcoides burtonii genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000013725.1/

Streptomyces avermitilis genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009765.2/

Thermus caldilimi genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_004684245.1/

Mycoplasmoides genitalium genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000027325.1/

Colwellia psychrerythraea genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000012325.1/

Thermus thermophilus genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_019973715.1/

Methanopyrus kandleri genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000007185.1/

Aspergillus fumigatus genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000002655.1/

Fusarium oxysporom genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_013085055.1/

Data packages used

BioPython (SeqIO) RegEx

pandas matplotlib.pyplot

numpy os

Identifying G4s in genome arrays

import os

from Bio import SeqIO

import re

import pandas as pd

import matplotlib.pyplot as plt

# point directory to file with all sequences

os.chdir('/Users/zoyaansari/Desktop/Bioinformatics/Sequences')

path = '/Users/zoyaansari/Desktop/Bioinformatics/Sequences'

file_names = os.listdir(path)

single_genome_headers = []

single_genome_quadruplex_counts = []

single_genome_sequence_length = []

single_genome_quadruplex_frequency = []

single_genome_quadruplex_per_1M = []

# only select fasta files

sequence_list = []

for file in file_names:

if '.fasta' in file:

sequence_list.append(file.upper())

# read fasta sequences from the list of fasta files

for name in sequence_list:

seq_file = open(name, 'r')

for seq_record in SeqIO.parse(seq_file, 'fasta'):

sequence = str(seq_record.seq)

header = seq_record.description

single_genome_headers.append(header)

sequence_length = len(sequence)

single_genome_sequence_length.append(sequence_length)

# define a motif

motif = r'GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG'

# return the matches as something that a foor loop can iterate through

matches = re.finditer(motif, sequence)

#count how many g quadruplexes are in the sequences

number_of_quadruplexes = len(list(matches))

single_genome_quadruplex_counts.append(number_of_quadruplexes)

base_pair_per_quadruplex = number_of_quadruplexes/sequence_length

# calculate the frequency of g-quadruplexes by sequence length by dividing quadruplexes and lengths

for i in range(len(single_genome_headers)):

single_genome_quadruplex_frequency.append((single_genome_quadruplex_counts[i]/single_genome_sequence_length[i]))

single_genome_quadruplex_per_1M.append(single_genome_quadruplex_frequency[i]*1000000)

# turn this into a dataframe

df_single_genome = pd.DataFrame(list(zip(single_genome_headers, single_genome_quadruplex_counts, single_genome_sequence_length, single_genome_quadruplex_per_1M)),

columns = ['Name','Quadruplex count', 'Genome Length (bp)', 'Quadruplexes per million bps'])

df_single_genome

# identifying genes in A. fumigatus
import os

from Bio import SeqIO

import re

import pandas as pd

import matplotlib.pyplot as plt

# for genomes with multiple sequence (for Aspergillus fumigatus)

# point directory to file with all sequences

import os

os.chdir('/Users/zoyaansari/Desktop/Bioinformatics/Sequences/Aspergillus_fumigatus')

path = '/Users/zoyaansari/Desktop/Bioinformatics/Sequences/Aspergillus_fumigatus'

file_names = os.listdir(path)

length_of_genome = []

number_of_quadruplexes = []

af_name = ['Aspergillus fumigatus']

# only select fasta files

sequence_list = []

for file in file_names:

if '.fasta' in file:

sequence_list.append(file.upper())

# read fasta sequences from the list of fasta files

for name in sequence_list:

seq_file = open(name, 'r')

for seq_record in SeqIO.parse(seq_file, 'fasta'):

sequence = str(seq_record.seq)

header = seq_record.description

length_of_chromosome = len(seq_record.seq)

length_of_genome.append(length_of_chromosome)

# define a motif

motif = r'GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG'

# return the matches as a list something that a for loop can iterate through

matches = re.findall(motif, sequence)

#count how many g quadruplexes are in the sequences

length = len(matches)

number_of_quadruplexes.append(length)

total_quadruplexes = sum(number_of_quadruplexes)

# calculate the size of the genome

size_of_genome = sum(length_of_genome)

frequency_per_1M = (total_quadruplexes/size_of_genome)*1000000

# turn this into a DataFrame by first making a dictionary

af_dict = {'Name': af_name, 'Quadruplex count': total_quadruplexes, 'Genome Length (bp)': size_of_genome, 'Quadruplexes per million bps':frequency_per_1M}

af_df = pd.DataFrame(af_dict)

af_df

# identifying genes in F. oxysporom

import os

from Bio import SeqIO

import re

import pandas as pd

import matplotlib.pyplot as plt

# for genomes with multiple sequence (for Aspergillus fumigatus)

# point directory to file with all sequences

import os

os.chdir('/Users/zoyaansari/Desktop/Bioinformatics/Sequences/Fusarium_oxysporom')

path = '/Users/zoyaansari/Desktop/Bioinformatics/Sequences/Fusarium_oxysporom'

file_names = os.listdir(path)

length_of_genome = []

number_of_quadruplexes = []

fo_name = ['Fusarium oxysporom']

# only select fasta files

sequence_list = []

for file in file_names:

if '.fasta' in file:

sequence_list.append(file.upper())

# read fasta sequences from the list of fasta files

for name in sequence_list:

seq_file = open(name, 'r')

for seq_record in SeqIO.parse(seq_file, 'fasta'):

sequence = str(seq_record.seq)

header = seq_record.description

length_of_chromosome = len(seq_record.seq)

length_of_genome.append(length_of_chromosome)

# define a motif

motif = r'GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG'

# return the matches as a list something that a for loop can iterate through

matches = re.findall(motif, sequence)

#count how many g quadruplexes are in the sequences

length = len(matches)

number_of_quadruplexes.append(length)

total_quadruplexes = sum(number_of_quadruplexes)

# calculate the size of the genome

size_of_genome = sum(length_of_genome)

frequency = (total_quadruplexes/size_of_genome)*1000000

# turn this into a DataFrame by first making a dictionary

fo_dict = {'Name': fo_name, 'Quadruplex count': total_quadruplexes, 'Genome Length (bp)': size_of_genome, 'Quadruplexes per million bps':frequency}

fo_df = pd.DataFrame(fo_dict)

fo_df

Genome array visualization

# Join all of the dataframes into one dataframe using concatenate

important_dataframes = [df_single_genome, af_df, fo_df]

final_df = pd.concat(important_dataframes)

final_df

# Visualizing by name

# turn the data frame into a bar graph

final_df.plot(kind='bar', x='Name',y='Quadruplexes per million bps',color='darkblue', width = 0.7)

# set the title

plt.title('G-Quadruplex frequency across organisms')

# show the plot

plt.show()

# Add in the GC data as a new column

GC_values = [14, 25.5, 30, 71, 41, 70.5, 65, 31.5, 38, 69.5, 61, 50, 47.5]

final_df['GC percentage'] = GC_values

#final_df

#GC_dict = {'Streptomyces avermitilis' : 70.5, 'Saccharopolyspora erythraea': 71, 'Candidatus Carsonella ruddii': 14, 'Buchnera aphidicola': 25.5, 'Ehrlichia chaffeensis': 30, 'Methanococcoides burtonii' : 41 , 'Thermus caldilimi' : 65 ,'Mycoplasmoides genitalium' : 31.5,'Colwellia psychrerythraea' : 38 , 'Thermus thermophilus': 69.5,'Methanopyrus kandleri': 61 , 'Aspergillus fumigatus': 50 , 'Fusarium oxysporum': 47.5 }

# sort the list in order of GC content

final_df_sorted = final_df.sort_values(by=['GC percentage'], ascending = False)

# reset the index

final_df_sorted = final_df_sorted.reset_index(drop=True)

final_df_sorted

# sort the list in order of GC content

final_df_GC_graph = final_df.sort_values(by=['GC percentage'], ascending = True)

# reset the index

final_df_GC_graph_reset = final_df_GC_graph.reset_index(drop=True)

final_df_GC_graph_reset

# visualize by GC content

# turn the data frame into a bar graph

final_df_GC_graph.plot(kind='bar', x='GC percentage',y='Quadruplexes per million bps',color='purple', width = 0.7)

# set the title

plt.title('G-Quadruplex frequency across organisms with varying GC content')

# show the plot

plt.show()

# organize by genome length

# Look at relationship with genome length and temperature affect occurance of G quadruplexes

genome_length_df = final_df.sort_values(by=['Genome Length (bp)'], ascending = True)

genome_length_df_reset = genome_length_df.reset_index(drop=True)

genome_length_df_reset

# visualize the plot

# turn the data frame into a bar graph

genome_length_df.plot(kind='bar', x='Genome Length (bp)',y='Quadruplexes per million bps',color='maroon', width = 0.7)

# set the title

plt.title('G-Quadruplex frequency across organisms organized by genome length')

# show the plot

plt.show()

G4 motif finder tutorial

# Motif finder tutorial

# point directory to file with all sequences

import os

os.chdir('/Users/zoyaansari/Desktop/Bioinformatics/Sequences')

path = '/Users/zoyaansari/Desktop/Bioinformatics/Sequences'

from Bio import SeqIO

import re

# read in the FASTA file for the organism of interest

seq_file = open('Thermus thermophilus.fasta', 'r')

for seq_record in SeqIO.parse(seq_file, 'fasta'):

sequence = str(seq_record.seq)

header = seq_record.description

sequence_length = len(sequence)

# define a motif

motif = r'GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG'

# return the matches as something that a foor loop can iterate through

matches = re.findall(motif, sequence)

#count how many g quadruplexes are in the sequence

number_of_quadruplexes = len(matches)

# print('There are', number_of_quadruplexes, "G4s in", header)

# calculate the frequency of g quadruplexes in the sequence for every 1 million base pairs

frequency = (number_of_quadruplexes/sequence_length)*1000000

# print("The frequency of G4s in", header, "is", "{:.2f}".format(frequency), "G4s per 1 million base pairs")

# turn values into lists

list_header = []

list_genome_length = []

list_number_of_quadruplexes = []

list_frequency = []

list_header.append(header)

list_genome_length.append(sequence_length)

list_number_of_quadruplexes.append(number_of_quadruplexes)

list_frequency.append(frequency)

# turn this into a dataframe

T_thermophilus_df = pd.DataFrame(list(zip(list_header, list_number_of_quadruplexes, list_genome_length, list_frequency)),

columns = ['Name','Quadruplex count', 'Genome Length (bp)', 'Quadruplexes per million bps'])

T_thermophilus_df

Looping through multiple chromosome files

for i in range(1,18):

filename = "Yeast_Chromosome" + str(i)

seq_file = open(filename, 'r')

for seq_record in SeqIO.parse(seq_file, 'fasta'):

sequence = str(seq_record.seq)

positionsfile = "Yeast_Chromosome_Positions" + str(i)

positions_df = pd.read_csv(positionsfile, sep='\t')

for i in range(1,24):

filename = "Human_Chromosome" + str(i)

seq_file = open(filename, 'r')

for seq_record in SeqIO.parse(seq_file, 'fasta'):

sequence = str(seq_record.seq)

positionsfile = "Human_Chromosome_Positions" + str(i)

positions_df = pd.read_csv(positionsfile, sep='\t')

Identifying G4s within genes

GQuad_Starts = df['Start'].tolist()

GQuad_Ends = df['End'].tolist()

Gene_Starts = positions_df['Begin'].tolist()

Gene_Ends = positions_df['End'].tolist()

All_Genes = positions_df['Name'].tolist()

for i in range(num_rows):

A = GQuad_Starts[i]

B = GQuad_Ends[i]

for i in range(len(Gene_Starts)):

C = Gene_Starts[i]

D = Gene_Ends[i]

if A >= C and B <= D:

#print("This motif is within a gene")

#print("The gene is", All_Genes[i])

list_of_genes.append(All_Genes[i])

#else:

#print("This motif is not within a gene")

Identifying G4s outside genes

#chromosome 2

from Bio import SeqIO

import re

import pandas as pd

list_of_genes = []

closest_genes = []

gene_distance = []

downstream_genes = []

gene_downstream = []

chromosome_number = "Y"

# Open a file and define what the sequence is

seq_file = open(f'Human_chromosome_{chromosome_number}.fasta', 'r')

for seq_record in SeqIO.parse(seq_file, 'fasta'):

sequence = str(seq_record.seq).upper()

# Read in the annotated file for reference genome with gene annotation

positions_df = pd.read_csv('Human_annotated_genes.tsv', sep='\t')

# Filter the DataFrame to only contain genes for Chromosome I

positions_df = positions_df[positions_df['Chromosome'] == f"{chromosome_number}"]

motif = r'GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG'

matches = re.finditer(motif, sequence)

start_positions = []

end_positions = []

header = []

for match in matches:

start_positions.append(match.start() + 1)

end_positions.append(match.end())

header.append(match.group())

df = pd.DataFrame({'G quadruplex': header, 'Start': start_positions, 'End': end_positions})

num_rows = len(df)

print(num_rows)

GQuad_Starts = df['Start'].tolist()

GQuad_Ends = df['End'].tolist()

Gene_Starts = positions_df['Begin'].tolist()

Gene_Ends = positions_df['End'].tolist()

All_Genes = positions_df['Symbol'].tolist()

# Iterate through each G-quadruplex

for i in range(num_rows):

A = GQuad_Starts[i]

B = GQuad_Ends[i]

# Check if the G-quadruplex falls within any gene

within_gene = False

for j in range(len(All_Genes)):

C = Gene_Starts[j]

D = Gene_Ends[j]

if A >= C and B <= D:

within_gene = True

list_of_genes.append(All_Genes[j])

break

# If the G-quadruplex is not within a gene, find the closest gene and the downstream gene

if not within_gene:

min_distance = float('inf')

closest_gene = None

min_distance_d = float('inf')

downstream_gene = None

for j in range(len(All_Genes)):

C = Gene_Starts[j]

D = Gene_Ends[j]

distance = abs(C - B)

d_distance = (C - B)

if distance < min_distance:

min_distance = distance

closest_gene = All_Genes[j]

if 0 < d_distance < min_distance_d:

min_distance_d = d_distance

downstream_gene = All_Genes[j]

closest_genes.append(closest_gene)

gene_distance.append(min_distance)

downstream_genes.append(downstream_gene)

gene_downstream.append(min_distance_d)

# Create DataFrame

gene_df = pd.DataFrame(list(zip(closest_genes, gene_distance, downstream_genes, gene_downstream)),

columns=['Closest_gene', 'Distance', 'Downstream_gene', 'Downstream_distance'])

print(gene_df)

downstream_gene_count_sorted.to_csv('downstream_gene_count_sorted.csv', index=False)

print(len(list_of_genes))

print(len(gene_df))

Turning gene top-hits into a data frame

# count how many g quadruplexes are in a gene

gene_count = {}

for gene in list_of_genes:

if gene in gene_count:

gene_count[gene] += 1

else:

gene_count[gene] = 1

# turn gene count into a data frame

gene_count_df = pd.DataFrame.from_dict(gene_count, orient='index')

gene_count_df.columns =['Count']

# sort data frame by descending order

gene_count_sorted = gene_count_df.sort_values(by=['Count'], ascending = False)

# select the top ten enriched genes

gene_count_sorted.iloc[0:10]