Code, Data Sets, and Gal4 Lines

The first thing we needed to do was find the downregulated and upregulated genes in the Atilano_2021_polyGR Google Sheets file, then compare every gene name in column H against the gene names in the first column of Cell_Types_Avg_Counts. The data we are handling is outlined in the sheets:

Steps for determining the final list of genes:

1. Use sagescode1.py 

2. Remove the blank lines and sort upregulated.csv and downregulated.csv in Excel

3. Isolate just the gene names from both and create two new csv files called upgenes.csv and downgenes.csv

4. Isolate just the gene names from the big gene list of fly genes with human orthologs that you're comparing it to, and make sure it's also in a csv file

5. Use the merging program to merge upgenes.csv with the gene names csv file from step 4

6. Repeat step 5 with downgenes.csv

7. Use sagescode2.py to put the mergedup.csv file (output from step 5) back into a csv file with each of the gene's average values as stated in cell counts

8. Repeat step 7 with mergeddown.csv (output from step 6)

9. Now you just need to find the cut off value for the averages to make the final lists

1. Use sagescode1.py to list the upregulated and downregulated genes that are also present in Cell_Types_Avg_Count.csv, each with its non-MB average count

Step 1: sagescode1.py


The code below was written in Visual Studio Code using Python 3. It checks whether the items in the upregulated and downregulated columns also appear in the average counts Google Sheet.


This part of the code reads every entry in the downregulated column, splits it on the commas between gene names, and adds each gene name as a key in a dictionary, with a placeholder value of 1.

import csv

down = {}
with open('down.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # Column H (index 7) holds a comma-separated list of gene names.
        for gene in row[7].split(", "):
            down[gene] = 1


This part does the same with the upregulated genes

up = {}
with open('up.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # Column H (index 7) holds a comma-separated list of gene names.
        for gene in row[7].split(", "):
            up[gene] = 1
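To make the splitting step concrete, here is a minimal sketch run on made-up rows rather than the real sheet. The gene names (geneA, geneB, ...) and the filler columns are hypothetical, and a set is used in place of the dictionary with placeholder values, since only the names are needed.

```python
import csv
import io

# Hypothetical two-row extract; only column H (index 7) matters here.
sample = io.StringIO(
    'a,b,c,d,e,f,g,"geneA, geneB, geneC"\n'
    'a,b,c,d,e,f,g,"geneB, geneD"\n'
)

genes = set()
for row in csv.reader(sample):
    # Split the cell on ", " and strip stray whitespace from each name.
    for name in row[7].split(", "):
        name = name.strip()
        if name:
            genes.add(name)

print(sorted(genes))
```

A gene that appears in more than one row is stored only once, which is the same effect the dictionary keys have in the script.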


This creates a dictionary whose keys are the gene names in the cell types file and whose values are the non-MB average counts.

count = {}
with open('count.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # Column 0 is the gene name; column 3 is the non-MB average count.
        count[row[0]] = row[3]
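The values read this way are strings, which matters later when the counts are sorted or compared with a cutoff. The sketch below shows one way to store them as numbers instead; the header text and the rows are made up, and only the column positions (0 and 3) come from the script above.

```python
import csv
import io

# Hypothetical stand-in for count.csv: gene name in column 0,
# non-MB average in column 3. The header text is made up.
sample = io.StringIO(
    "gene,col1,col2,non_mb_avg\n"
    "geneA,1,2,530.5\n"
    "geneB,3,4,12\n"
)

count = {}
for row in csv.reader(sample):
    try:
        count[row[0]] = float(row[3])
    except (ValueError, IndexError):
        # Skips the header row and any short or non-numeric row.
        continue

print(count)
```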



Next, downregulated.csv is created. For every key in the down dictionary, if that gene name is also in the count dictionary (that is, if it is in the cell types file), a new row is written to the file with the gene name in the first column and the non-MB count in the second. The output has a blank row between every data row, most likely because the file is opened without newline='' (on Windows an extra carriage return is then written after each row); this is easier to take care of later in Excel.

with open('downregulated.csv', 'w') as f:
    w = csv.writer(f)
    # One row per downregulated gene that also appears in count.
    for i in down:
        if i in count:
            w.writerow([i, count[i]])



upregulated.csv is created the same way from the upregulated genes.

with open('upregulated.csv', 'w') as f:
    w = csv.writer(f)
    # One row per upregulated gene that also appears in count.
    for i in up:
        if i in count:
            w.writerow([i, count[i]])
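If the blank rows come from the way the output file is opened, they can be avoided at the source. The Python csv documentation recommends opening files with newline='' when writing. The sketch below shows this with hypothetical rows and a temporary directory; it is not the script we ran.

```python
import csv
import os
import tempfile

rows = [["geneA", "530.5"], ["geneB", "12"]]  # hypothetical values

path = os.path.join(tempfile.mkdtemp(), "upregulated.csv")
# newline='' lets the csv module control line endings itself.
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the file back to confirm there are no empty rows in between.
with open(path, newline="") as f:
    written = list(csv.reader(f))

print(written)
```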


2. Edit the blank lines and sort upregulated.csv and downregulated.csv using Excel

This is what the gene list looks like straight out of the code in step 1:

Find and select  ---> Go to Special  ---> Blanks

Delete the selected rows to remove all the empty rows from the file.

This is the gene list, pre sort.

Now sort the genes.

Final Upregulated sorted list:

Repeat all these steps for the downregulated list as well 
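The same cleanup can also be scripted instead of done by hand. The sketch below drops the empty rows and sorts what remains; the data is made up, and sorting by the count column from largest to smallest is an assumption, since the steps above do not state the sort order used in Excel.

```python
import csv
import io

# Hypothetical file content with the blank rows described above.
raw = io.StringIO("geneA,12\n\ngeneB,530.5\n\ngeneC,47\n\n")

# csv.reader yields [] for a blank line, so `if row` filters those out.
rows = [row for row in csv.reader(raw) if row]
# Assumed sort key: the numeric count in column 1, largest first.
rows.sort(key=lambda row: float(row[1]), reverse=True)

for row in rows:
    print(row)
```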

3 & 4. Isolate just the gene names from both and create two new csv files called upgenes.csv and downgenes.csv. Then, isolate just the gene names from the previously found gene list of all fly genes with human orthologs you're comparing it to and make sure it's also in a .csv file


We did this by copying the first column of upregulated.csv and pasting it into a new Excel spreadsheet. Then we saved the spreadsheet as a .csv file and named it upgenes.csv. The same was done for downregulated.csv, creating downgenes.csv.

The same was done with the merged gene list from step 4: we isolated just the gene names and saved them to a separate file, which our group named flyhumanecandm.csv.
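Copying the first column can also be done in a few lines of Python. This is a sketch using in-memory text in place of upregulated.csv and upgenes.csv, with hypothetical gene names and counts.

```python
import csv
import io

source = io.StringIO("geneA,530.5\ngeneB,12\n")  # stands in for upregulated.csv
target = io.StringIO()                            # stands in for upgenes.csv

writer = csv.writer(target, lineterminator="\n")
for row in csv.reader(source):
    if row:
        # Keep only the gene name from column 0.
        writer.writerow([row[0]])

print(target.getvalue())
```

The result is one gene name per line. The merging program in the next step strips a trailing comma and newline from each line it reads, so it works with this shape.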

5 & 6. Use the merging program to merge upgenes.csv with the genes file from step 4, and repeat with downgenes.csv


Code for the merging program:


## Author: Catherine Calma
## Date: 9/15/21
## Description: User will input mode (?-get current directory, #-change directory,
##              T-convert text to csv, O-compare overlap of two csv files,
##              J-join two csv files, C-convert FBgn to gene IDs)
##              T: use this mode to convert .txt files to .csv files for easier comparison
##                 The output will be a user-named .csv file in the current working directory
##              O: enter two files to compare. The output will be a user-named .csv file in the
##                 current working directory containing all overlap in the two files
##              ?: get working directory
##              #: change working directory
##              J: enter two .csv files to join. The output will be a user-named .csv file with
##                 the union of the entered files
##              C: convert a FBgn list to gene IDs. External web link
##              Currently only works when files with genes ONLY are entered and returns CSV with
##              genes ONLY. C:\Users\cathe\Desktop\Brain VIP\Data Sets


import os

import webbrowser



def read_file(file_to_open):
    # Open a text file and read it into a list of lines.
    # The open file object is returned too; csv_to_list/create_list close it.
    gene_file = open(file_to_open, 'r')
    lines = gene_file.readlines()
    return lines, gene_file


def csv_to_list(lines, gene_file):
    # Build a list from csv lines, stripping trailing commas and newlines
    # and skipping lines that are empty afterwards.
    gene_list = []
    for item in lines:
        item = item.rstrip(',\n')
        if (item != ''):
            gene_list.append(item)
    gene_file.close()
    return gene_list


def create_list(lines, gene_file):
    # Create a list from lines of text (one gene per line).
    gene_list = []
    for gene in lines:
        gene = gene.rstrip('\n')
        gene_list.append(gene)
    gene_file.close()
    return gene_list


def store_output(file_to_store, gene_list):
    # Write one item per line, each followed by a comma, to the output file.
    out_file = open(file_to_store, 'w')
    for item in gene_list:
        out_file.write(str(item) + ',\n')
    out_file.close()


def compare(gene_list_one, gene_list_two):
    # Return the genes found in both lists (set intersection; order not kept).
    shared_genes = set(gene_list_one) & set(gene_list_two)
    shared_genes = list(shared_genes)
    return shared_genes


def join(gene_list_one, gene_list_two):
    # Return the genes found in either list (set union; order not kept).
    joined_genes = set(gene_list_one).union(set(gene_list_two))
    joined_genes = list(joined_genes)
    return joined_genes


def main():
    prompt = 'A'
    while (prompt != 'Q'):
        mode = input('What would you like to do? (?-get current directory, #-change\n'
        'directory, T-convert text to csv, O-compare overlap two csv files, J-join two csv files\n'
        'C-convert FBgn to Gene IDs)\n')
        if mode == 'T':
            file_to_open = input('Enter the name of a text file you want to open:\n')
            file_to_store = input('Name the csv file you want to use for outputs:\n')
            lines, gene_file = read_file(file_to_open)
            gene_list = create_list(lines, gene_file)
            store_output(file_to_store, gene_list)
            print('Your data is now in ' + file_to_store +'.\n')
        elif mode == 'O':
            file_one =  input('Enter name of first csv file:\n')
            file_two = input('Enter name of second csv file:\n')
            file_to_store = input('Name the csv file you want to use for outputs:\n')
            lines_one, genes_one = read_file(file_one)
            lines_two, genes_two = read_file(file_two)
            gene_list_one = csv_to_list(lines_one, genes_one)
            gene_list_two = csv_to_list(lines_two, genes_two)
            shared_genes = compare(gene_list_one, gene_list_two)
            store_output(file_to_store, shared_genes)
            print('Overlap from ' + file_one + ' and ' + file_two + ' are now stored in ' + file_to_store + '.\n')
        elif mode == 'J':
            file_one =  input('Enter name of first csv file:\n')
            file_two = input('Enter name of second csv file:\n')
            file_to_store = input('Name the csv file you want to use for outputs:\n')
            lines_one, genes_one = read_file(file_one)
            lines_two, genes_two = read_file(file_two)
            gene_list_one = csv_to_list(lines_one, genes_one)
            gene_list_two = csv_to_list(lines_two, genes_two)
            joined_genes = join(gene_list_one, gene_list_two)
            store_output(file_to_store, joined_genes)
            print(file_one + ' and ' + file_two + ' are now joined in ' + file_to_store + '.\n')
        elif mode == '?':
            print('Your current working directory is ' + os.getcwd()) #give current working directory
        elif mode == '#':
            path = input('What directory to use? (C:\\...)\n')
            os.chdir(path)
        elif mode == 'C':
            webbrowser.open("https://www.biotools.fr/drosophila/fbgn_converter")
        else:
            print('Invalid mode')
        prompt = input('Would you like to continue? (hit enter or enter Q to quit)\n')


main()


This is the terminal output for using the program:
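Independently of the interactive prompts, the overlap (mode O) and join (mode J) steps come down to set operations on two lists of gene names. The sketch below shows both on hypothetical lists; it is an illustration of what compare() and join() compute, not a capture of the program's terminal session.

```python
# Hypothetical gene lists standing in for upgenes.csv and the ortholog list.
up_genes = ["geneA", "geneB", "geneC"]
ortholog_genes = ["geneB", "geneC", "geneD"]

# Same operation as compare(): keep names present in both lists.
shared = sorted(set(up_genes) & set(ortholog_genes))
# Same operation as join(): keep names present in either list.
joined = sorted(set(up_genes) | set(ortholog_genes))

print(shared)
print(joined)
```

Sets do not preserve order, so the merging program writes genes in an arbitrary order. The sketch sorts the results only to make the output reproducible.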

7 & 8.  Use sagescode2.py to put the mergedup.csv file (output from step 5) and mergeddown.csv file (output from step 6) back into a csv file with each of the gene's average values as stated in cell counts.

sagescode2.py 


import csv

finalup = {}
with open('mergedup.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # Column 0 holds the gene name written by the merging program.
        for gene in row[0].split(", "):
            finalup[gene] = 1


finaldown = {}
with open('mergeddown.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        for gene in row[0].split(", "):
            finaldown[gene] = 1


count = {}
with open('count.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # Column 0 is the gene name; column 3 is the non-MB average count.
        count[row[0]] = row[3]


This creates two new files, upwithavgs.csv and downwithavgs.csv:

with open('upwithavgs.csv', 'w') as f:
    w = csv.writer(f)
    for i in finalup:
        if i in count:
            w.writerow([i, count[i]])

with open('downwithavgs.csv', 'w') as f:
    w = csv.writer(f)
    for i in finaldown:
        if i in count:
            w.writerow([i, count[i]])

             

As with sagescode1.py, the output has blank rows that need to be removed, and the data needs to be sorted. To do this, repeat step 2 on upwithavgs.csv and downwithavgs.csv and save the files.

9. Find the cut off value for the averages to make the final lists

The cutoff value was 500 for the upregulated list and 2110 for the downregulated list.
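Applying a cutoff can be sketched as a simple filter. The value 500 is the upregulated cutoff stated above, and the rows are made up. Keeping the genes at or above the cutoff is an assumption, because the text does not say which side of the cutoff was retained.

```python
import csv
import io

UP_CUTOFF = 500  # value stated above for the upregulated list

# Hypothetical stand-in for the cleaned upwithavgs.csv.
data = io.StringIO("geneA,530.5\ngeneB,12\ngeneC,500\n")

# ASSUMPTION: genes at or above the cutoff are kept; the text does not
# say which side of the cutoff is retained, so flip the comparison if needed.
kept = [row[0] for row in csv.reader(data) if row and float(row[1]) >= UP_CUTOFF]

print(kept)
```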

Gal4 Lines:

The final merged gene lists for the upregulated and downregulated genes, with cutoffs applied, are above. Because most of our genes (CG4678 being an exception) were also in group 2B's lists, our group used the information group 2B already had to produce our list of stocks.