G-quadruplexes (G4s) were identified using RegEx [14] . RegEx is a module that uses pattern matching to identify specified motifs in a given sample [14-16]. Previous research indicates that G4s can be identified through the repeating DNA motif of three guanines followed by a stretch of one to seven varying nucleotides G(3+) + N(1-7) + G(3+) + N(1-7) + G(3+) + N(1-7) + G3(+) [3] . In our model we have chosen to be more stringent in G quadruplex identification and passed the following expression to RegEx:
GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG
Genome sequences for organisms of interest were then downloaded from the National Center for Biotechnology Information (NCBI) as FASTA files. The G4 motif finder was then applied to these sequences and provide matches for each sequence as either an iterator or a list, depending on the needs of the user.
Let's apply this to the Thermus thermophilus genome
Calculating the frequency of G4s of an organism
The matches can then be quantified by finding the length of output of the motif finder. This analysis is then furthered by calculating the frequency of G4s for each sequence after depositing genome lengths and G4 counts into individual lists.
Summarizing the results
The results of this analysis can be converted into a DataFrame to be easily visualized by the user.
Once G4s were identified, their positions were compared with the known positions of genes from annotated genome files downloaded from the NCBI website as FASTA files.
Let's apply this to yeast chromosome 1
If the G4 is not within a gene, find the closest gene and the downstream gene.
Converting the results into a data frame of top-hits
Representative example generating an image for total G4s for each yeast chromosome
All figures were made in Python using matplotlib.pyplot. Figures associated from GO-term enrichment analysis are adapted from G:Profiler [17].
All genome sequences were downloaded from the NCBI website
Human genome sequences: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/
Annotated human genome sequences: https://www.ncbi.nlm.nih.gov/datasets/gene/GCF_000001405.40/
Yeast genome sequences: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000146045.2/
Annotated yeast genome sequences: https://www.ncbi.nlm.nih.gov/datasets/gene/GCF_000146045.2/
Candidatus Carsonella ruddii genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000287275.1/
Buchnera aphidicola genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_003099975.1/
Ehrlichia chaffeensis genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000632965.1/
Saccharopolyspora erythraea genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000062885.1/
Methanococcoides burtonii genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000013725.1/
Streptomyces avermitilis genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009765.2/
Thermus caldilimi genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_004684245.1/
Mycoplasmoides genitalium genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000027325.1/
Colwellia psychrerythraea genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000012325.1/
Thermus thermophilus genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_019973715.1/
Methanopyrus kandleri genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000007185.1/
Aspergillus fumigatus genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000002655.1/
Fusarium oxysporom genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_013085055.1/
BioPython (SeqIO) RegEx
pandas matplotlib.pyplot
numpy os
import os
from Bio import SeqIO
import re
import pandas as pd
import matplotlib.pyplot as plt
# point directory to file with all sequences
os.chdir('/Users/zoyaansari/Desktop/Bioinformatics/Sequences')
path = '/Users/zoyaansari/Desktop/Bioinformatics/Sequences'
file_names = os.listdir(path)
single_genome_headers = []
single_genome_quadruplex_counts = []
single_genome_sequence_length = []
single_genome_quadruplex_frequency = []
single_genome_quadruplex_per_1M = []
# only select fasta files
sequence_list = []
for file in file_names:
if '.fasta' in file:
sequence_list.append(file.upper())
# read fasta sequences from the list of fasta files
for name in sequence_list:
seq_file = open(name, 'r')
for seq_record in SeqIO.parse(seq_file, 'fasta'):
sequence = str(seq_record.seq)
header = seq_record.description
single_genome_headers.append(header)
sequence_length = len(sequence)
single_genome_sequence_length.append(sequence_length)
# define a motif
motif = r'GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG'
# return the matches as something that a foor loop can iterate through
matches = re.finditer(motif, sequence)
#count how many g quadruplexes are in the sequences
number_of_quadruplexes = len(list(matches))
single_genome_quadruplex_counts.append(number_of_quadruplexes)
base_pair_per_quadruplex = number_of_quadruplexes/sequence_length
# calculate the frequency of g-quadruplexes by sequence length by dividing quadruplexes and lengths
for i in range(len(single_genome_headers)):
single_genome_quadruplex_frequency.append((single_genome_quadruplex_counts[i]/single_genome_sequence_length[i]))
single_genome_quadruplex_per_1M.append(single_genome_quadruplex_frequency[i]*1000000)
# turn this into a dataframe
df_single_genome = pd.DataFrame(list(zip(single_genome_headers, single_genome_quadruplex_counts, single_genome_sequence_length, single_genome_quadruplex_per_1M)),
columns = ['Name','Quadruplex count', 'Genome Length (bp)', 'Quadruplexes per million bps'])
df_single_genome
# identifying genes in A. fumigatus
import os
from Bio import SeqIO
import re
import pandas as pd
import matplotlib.pyplot as plt
# for genomes with multiple sequence (for Aspergillus fumigatus)
# point directory to file with all sequences
import os
os.chdir('/Users/zoyaansari/Desktop/Bioinformatics/Sequences/Aspergillus_fumigatus')
path = '/Users/zoyaansari/Desktop/Bioinformatics/Sequences/Aspergillus_fumigatus'
file_names = os.listdir(path)
length_of_genome = []
number_of_quadruplexes = []
af_name = ['Aspergillus fumigatus']
# only select fasta files
sequence_list = []
for file in file_names:
if '.fasta' in file:
sequence_list.append(file.upper())
# read fasta sequences from the list of fasta files
for name in sequence_list:
seq_file = open(name, 'r')
for seq_record in SeqIO.parse(seq_file, 'fasta'):
sequence = str(seq_record.seq)
header = seq_record.description
length_of_chromosome = len(seq_record.seq)
length_of_genome.append(length_of_chromosome)
# define a motif
motif = r'GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG'
# return the matches as a list something that a for loop can iterate through
matches = re.findall(motif, sequence)
#count how many g quadruplexes are in the sequences
length = len(matches)
number_of_quadruplexes.append(length)
total_quadruplexes = sum(number_of_quadruplexes)
# calculate the size of the genome
size_of_genome = sum(length_of_genome)
frequency_per_1M = (total_quadruplexes/size_of_genome)*1000000
# turn this into a DataFrame by first making a dictionary
af_dict = {'Name': af_name, 'Quadruplex count': total_quadruplexes, 'Genome Length (bp)': size_of_genome, 'Quadruplexes per million bps':frequency_per_1M}
af_df = pd.DataFrame(af_dict)
af_df
# identifying genes in F. oxysporom
import os
from Bio import SeqIO
import re
import pandas as pd
import matplotlib.pyplot as plt
# for genomes with multiple sequence (for Aspergillus fumigatus)
# point directory to file with all sequences
import os
os.chdir('/Users/zoyaansari/Desktop/Bioinformatics/Sequences/Fusarium_oxysporom')
path = '/Users/zoyaansari/Desktop/Bioinformatics/Sequences/Fusarium_oxysporom'
file_names = os.listdir(path)
length_of_genome = []
number_of_quadruplexes = []
fo_name = ['Fusarium oxysporom']
# only select fasta files
sequence_list = []
for file in file_names:
if '.fasta' in file:
sequence_list.append(file.upper())
# read fasta sequences from the list of fasta files
for name in sequence_list:
seq_file = open(name, 'r')
for seq_record in SeqIO.parse(seq_file, 'fasta'):
sequence = str(seq_record.seq)
header = seq_record.description
length_of_chromosome = len(seq_record.seq)
length_of_genome.append(length_of_chromosome)
# define a motif
motif = r'GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG'
# return the matches as a list something that a for loop can iterate through
matches = re.findall(motif, sequence)
#count how many g quadruplexes are in the sequences
length = len(matches)
number_of_quadruplexes.append(length)
total_quadruplexes = sum(number_of_quadruplexes)
# calculate the size of the genome
size_of_genome = sum(length_of_genome)
frequency = (total_quadruplexes/size_of_genome)*1000000
# turn this into a DataFrame by first making a dictionary
fo_dict = {'Name': fo_name, 'Quadruplex count': total_quadruplexes, 'Genome Length (bp)': size_of_genome, 'Quadruplexes per million bps':frequency}
fo_df = pd.DataFrame(fo_dict)
fo_df
# Join all of the dataframes into one dataframe using concatenate
important_dataframes = [df_single_genome, af_df, fo_df]
final_df = pd.concat(important_dataframes)
final_df
# Visualizing by name
# turn the data frame into a bar graph
final_df.plot(kind='bar', x='Name',y='Quadruplexes per million bps',color='darkblue', width = 0.7)
# set the title
plt.title('G-Quadruplex frequency across organisms')
# show the plot
plt.show()
# Add in the GC data as a new column
GC_values = [14, 25.5, 30, 71, 41, 70.5, 65, 31.5, 38, 69.5, 61, 50, 47.5]
final_df['GC percentage'] = GC_values
#final_df
#GC_dict = {'Streptomyces avermitilis' : 70.5, 'Saccharopolyspora erythraea': 71, 'Candidatus Carsonella ruddii': 14, 'Buchnera aphidicola': 25.5, 'Ehrlichia chaffeensis': 30, 'Methanococcoides burtonii' : 41 , 'Thermus caldilimi' : 65 ,'Mycoplasmoides genitalium' : 31.5,'Colwellia psychrerythraea' : 38 , 'Thermus thermophilus': 69.5,'Methanopyrus kandleri': 61 , 'Aspergillus fumigatus': 50 , 'Fusarium oxysporum': 47.5 }
# sort the list in order of GC content
final_df_sorted = final_df.sort_values(by=['GC percentage'], ascending = False)
# reset the index
final_df_sorted = final_df_sorted.reset_index(drop=True)
final_df_sorted
# sort the list in order of GC content
final_df_GC_graph = final_df.sort_values(by=['GC percentage'], ascending = True)
# reset the index
final_df_GC_graph_reset = final_df_GC_graph.reset_index(drop=True)
final_df_GC_graph_reset
# visualize by GC content
# turn the data frame into a bar graph
final_df_GC_graph.plot(kind='bar', x='GC percentage',y='Quadruplexes per million bps',color='purple', width = 0.7)
# set the title
plt.title('G-Quadruplex frequency across organisms with varying GC content')
# show the plot
plt.show()
# organize by genome length
# Look at relationship with genome length and temperature affect occurance of G quadruplexes
genome_length_df = final_df.sort_values(by=['Genome Length (bp)'], ascending = True)
genome_length_df_reset = genome_length_df.reset_index(drop=True)
genome_length_df_reset
# visualize the plot
# turn the data frame into a bar graph
genome_length_df.plot(kind='bar', x='Genome Length (bp)',y='Quadruplexes per million bps',color='maroon', width = 0.7)
# set the title
plt.title('G-Quadruplex frequency across organisms organized by genome length')
# show the plot
plt.show()
# Motif finder tutorial
# point directory to file with all sequences
import os
os.chdir('/Users/zoyaansari/Desktop/Bioinformatics/Sequences')
path = '/Users/zoyaansari/Desktop/Bioinformatics/Sequences'
from Bio import SeqIO
import re
# read in the FASTA file for the organism of interest
seq_file = open('Thermus thermophilus.fasta', 'r')
for seq_record in SeqIO.parse(seq_file, 'fasta'):
sequence = str(seq_record.seq)
header = seq_record.description
sequence_length = len(sequence)
# define a motif
motif = r'GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG'
# return the matches as something that a foor loop can iterate through
matches = re.findall(motif, sequence)
#count how many g quadruplexes are in the sequence
number_of_quadruplexes = len(matches)
# print('There are', number_of_quadruplexes, "G4s in", header)
# calculate the frequency of g quadruplexes in the sequence for every 1 million base pairs
frequency = (number_of_quadruplexes/sequence_length)*1000000
# print("The frequency of G4s in", header, "is", "{:.2f}".format(frequency), "G4s per 1 million base pairs")
# turn values into lists
list_header = []
list_genome_length = []
list_number_of_quadruplexes = []
list_frequency = []
list_header.append(header)
list_genome_length.append(sequence_length)
list_number_of_quadruplexes.append(number_of_quadruplexes)
list_frequency.append(frequency)
# turn this into a dataframe
T_thermophilus_df = pd.DataFrame(list(zip(list_header, list_number_of_quadruplexes, list_genome_length, list_frequency)),
columns = ['Name','Quadruplex count', 'Genome Length (bp)', 'Quadruplexes per million bps'])
T_thermophilus_df
for i in range(1,18):
filename = "Yeast_Chromosome" + str(i)
seq_file = open(filename, 'r')
for seq_record in SeqIO.parse(seq_file, 'fasta'):
sequence = str(seq_record.seq)
positionsfile = "Yeast_Chromosome_Positions" + str(i)
positions_df = pd.read_csv(positionsfile, sep='\t')
for i in range(1,24):
filename = "Human_Chromosome" + str(i)
seq_file = open(filename, 'r')
for seq_record in SeqIO.parse(seq_file, 'fasta'):
sequence = str(seq_record.seq)
positionsfile = "Human_Chromosome_Positions" + str(i)
positions_df = pd.read_csv(positionsfile, sep='\t')
GQuad_Starts = df['Start'].tolist()
GQuad_Ends = df['End'].tolist()
Gene_Starts = positions_df['Begin'].tolist()
Gene_Ends = positions_df['End'].tolist()
All_Genes = positions_df['Name'].tolist()
for i in range(num_rows):
A = GQuad_Starts[i]
B = GQuad_Ends[i]
for i in range(len(Gene_Starts)):
C = Gene_Starts[i]
D = Gene_Ends[i]
if A >= C and B <= D:
#print("This motif is within a gene")
#print("The gene is", All_Genes[i])
list_of_genes.append(All_Genes[i])
#else:
#print("This motif is not within a gene")
#chromosome 2
from Bio import SeqIO
import re
import pandas as pd
list_of_genes = []
closest_genes = []
gene_distance = []
downstream_genes = []
gene_downstream = []
chromosome_number = "Y"
# Open a file and define what the sequence is
seq_file = open(f'Human_chromosome_{chromosome_number}.fasta', 'r')
for seq_record in SeqIO.parse(seq_file, 'fasta'):
sequence = str(seq_record.seq).upper()
# Read in the annotated file for reference genome with gene annotation
positions_df = pd.read_csv('Human_annotated_genes.tsv', sep='\t')
# Filter the DataFrame to only contain genes for Chromosome I
positions_df = positions_df[positions_df['Chromosome'] == f"{chromosome_number}"]
motif = r'GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG[A-Z]{1,7}GGG'
matches = re.finditer(motif, sequence)
start_positions = []
end_positions = []
header = []
for match in matches:
start_positions.append(match.start() + 1)
end_positions.append(match.end())
header.append(match.group())
df = pd.DataFrame({'G quadruplex': header, 'Start': start_positions, 'End': end_positions})
num_rows = len(df)
print(num_rows)
GQuad_Starts = df['Start'].tolist()
GQuad_Ends = df['End'].tolist()
Gene_Starts = positions_df['Begin'].tolist()
Gene_Ends = positions_df['End'].tolist()
All_Genes = positions_df['Symbol'].tolist()
# Iterate through each G-quadruplex
for i in range(num_rows):
A = GQuad_Starts[i]
B = GQuad_Ends[i]
# Check if the G-quadruplex falls within any gene
within_gene = False
for j in range(len(All_Genes)):
C = Gene_Starts[j]
D = Gene_Ends[j]
if A >= C and B <= D:
within_gene = True
list_of_genes.append(All_Genes[j])
break
# If the G-quadruplex is not within a gene, find the closest gene and the downstream gene
if not within_gene:
min_distance = float('inf')
closest_gene = None
min_distance_d = float('inf')
downstream_gene = None
for j in range(len(All_Genes)):
C = Gene_Starts[j]
D = Gene_Ends[j]
distance = abs(C - B)
d_distance = (C - B)
if distance < min_distance:
min_distance = distance
closest_gene = All_Genes[j]
if 0 < d_distance < min_distance_d:
min_distance_d = d_distance
downstream_gene = All_Genes[j]
closest_genes.append(closest_gene)
gene_distance.append(min_distance)
downstream_genes.append(downstream_gene)
gene_downstream.append(min_distance_d)
# Create DataFrame
gene_df = pd.DataFrame(list(zip(closest_genes, gene_distance, downstream_genes, gene_downstream)),
columns=['Closest_gene', 'Distance', 'Downstream_gene', 'Downstream_distance'])
print(gene_df)
downstream_gene_count_sorted.to_csv('downstream_gene_count_sorted.csv', index=False)
print(len(list_of_genes))
print(len(gene_df))
# count how many g quadruplexes are in a gene
gene_count = {}
for gene in list_of_genes:
if gene in gene_count:
gene_count[gene] += 1
else:
gene_count[gene] = 1
# turn gene count into a data frame
gene_count_df = pd.DataFrame.from_dict(gene_count, orient='index')
gene_count_df.columns =['Count']
# sort data frame by descending order
gene_count_sorted = gene_count_df.sort_values(by=['Count'], ascending = False)
# select the top ten enriched genes
gene_count_sorted.iloc[0:10]