Dr. Janhavi Gupta

Code to Decode Mysteries of Cellular Processes

August 2022

It is summer 2022. Another school year has just wrapped up, and it was a roller coaster ride as we tried to return to normalcy after almost two years of remote and hybrid learning due to the COVID-19 pandemic. Through it all, my students persevered in their learning. The last two years have made us notice many facets of our lives that we ignore in the humdrum of everyday life. Any mention of yet another Zoom, Google Meet, or Teams meeting is now met with an eye roll. However, I am thankful that remote work opened the door for me, a science educator sitting approximately 2900 miles away in New Jersey, to take part in distance learning offered by Fred Hutchinson Cancer Center! I hope this blog post inspires my students to pursue opportunities regardless of the obstacles they may perceive. The logistics and the path forward can always be worked out later!

As a former research scientist, I can satisfy students’ curiosity about the research lab environment. Yet I often yearn to go back to the research field and keep abreast of exciting discoveries. The Hutch Fellowship for Excellence in STEM Teaching presented a perfect opportunity to learn about cutting-edge research on a topic I am passionate about: cancer biology. As a high school teacher at UCVTS - Magnet High School, I have taught Honors Biology, AP Biology, Human Anatomy and Physiology, and a new course, Fundamentals of Biomedical Engineering. Dr. Rafalowski, our school principal, asked me to design the biomedical engineering course because many students were interested in exploring or pursuing biomedical engineering but had no engineering electives to fill this niche. I designed the course so that students have the freedom to explore any topic in the vast field of biomedical engineering in depth through research paper presentations or final projects. Over the years I have found that my students gravitate towards neural networks and machine learning. As a person with zero coding experience, I therefore wanted to learn a few of the programming languages used in computational biology, such as Python and R.

My direct mentor Christine Dien, bioinformatics analyst in Dr. Setty’s lab. Photo provided by Ms. Dien.

Communicating my learning experience and implementing portions of computational biology data analysis in my biomedical engineering curriculum will equip my students with new skill sets, especially now that basic coding is becoming a necessity in so many fields. The Hutch Teacher Fellowship offered a perfect solution: I could work remotely while learning some basics of computational biology. I was matched with Dr. Manu Setty’s lab, which develops novel algorithms to model and understand some of the fundamental, unanswered questions in biology hidden in big datasets such as those generated by RNA-seq experiments. Throughout the summer I was patiently guided by my amazing direct mentor Christine Dien, who is a bioinformatics analyst in Dr. Setty’s lab.

During my first summer of the Hutch Teacher Fellowship, I worked through a graduate-level online course on computational biology (TFCB2021) taught at Fred Hutch by Drs. Subramaniam, Bedford, Matsen, Bradley, Bloom, Ha, and Setty. As a novice who was clueless about the ‘command prompt’, I found the initial learning curve steep. I am thankful to my eldest son, who loves programming, and my husband, who has a knack for understanding complex math and statistics, for encouraging me to persevere through the initial phase when I was swimming in new terms. Having completed the course, I can vouch that the toughest part was installing all the software and getting it to work harmoniously! I had tremendous help from my son in installing the software on my Windows machine, which required a bit of expertise.

Research in Dr. Setty’s Lab

Dr. Setty’s lab focuses on developing new computational tools to help model and visualize complex biological processes such as cell differentiation and the gene regulatory networks governing cell fate. The machine learning algorithms developed in his lab are instrumental in unraveling and envisioning the answers to fundamental questions hidden in massive datasets. For example, the Palantir algorithm orders differentiating cells along continuous trajectories, from stem cells to terminally differentiated states. Analyses of this kind rely on statistical techniques for clustering, dimensionality reduction, and data visualization, such as K-means clustering, PCA, tSNE, UMAP, and heat maps. To gain an in-depth understanding of these methods, I frequently referred to the YouTube channel StatQuest with Josh Starmer. This channel is an excellent resource covering everything from simple statistics fundamentals to complex statistical models, machine learning, and neural networks. It can easily be added to upper-level classes such as AP Biology to enrich students' understanding of basic statistical analysis.

Computational biologist Dr. Manu Setty. Photo by Fred Hutch.

TFCB2021: Tools for Computational Biology Course

Overview: The TFCB2021 course, which spanned 19 lectures and a few homework exercises, was an excellent primer on many computational biology tools. During the course, I found it efficient to work on two machines: one to run the code and another to follow along with the recorded video at the same time. The goal of the course was to learn to work in the Unix shell and to program in Python and R using VSCode and/or Jupyter notebooks. These code editors can run code cell by cell or line by line, and placing a # in front of a line comments it out so that it is not run. As a novice, I found that using # to comment out pieces of code was critical to understanding each step of a data analysis. One of the important concepts covered in the first few classes was reproducible science, including how to construct tables (tidy data) that can be read easily by a computer. Tools to manipulate tables in both Python and R were covered in later classes. In an effort toward reproducible science, many publications now require the submission of digital artifacts such as code, and GitHub is one tool used to host and share code publicly. In my quest to understand more about data reproducibility, I stumbled across an extremely useful resource published by the journal Nature: Statistics for Biologists. It can be adapted for higher-level courses such as AP Biology, which include basic statistical analyses as part of the curriculum.

My work set-up with two machines to run the code simultaneously with the corresponding video lecture series. Image credit: Janhavi Gupta.

Unix/bash shell: Massive datasets are a common output of high-throughput technologies, and managing them is beyond the capacity of our personal computers. The initial portion of the course was designed to teach basic Unix shell skills such as changing directories, navigating the file system, and connecting remotely to the ‘rhino’ high-performance computing servers housed at Fred Hutch.

Python (the for loops and pandas): This was a great learning primer for coding neophytes. The lessons started with learning basic data structures, writing ‘for loop’ statements, and creating and using dictionaries. As a beginner, I found the new terminology staggering! It was heartening to learn that many experienced coders frequently resort to reading the manual or searching Google or Stack Overflow for answers when stuck. Code editors also have a helpful feature that suggests the syntax as you write the code.
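To give a flavor of those first lessons, here is a minimal sketch of a for loop that fills a dictionary. It is my own illustration rather than code from the course, and the function name count_bases and the example sequence are made up for this post.

```python
def count_bases(sequence):
    """Count how many times each base appears in a DNA sequence."""
    counts = {}                      # empty dictionary: base -> number of times seen
    for base in sequence:            # the for loop visits one letter at a time
        counts[base] = counts.get(base, 0) + 1
    return counts


# A made-up 10-letter sequence, used only for illustration.
print(count_bases("ATGCGATACG"))
```

The dictionary starts empty, and the loop adds a new key the first time it meets each base. Placing a # in front of the print line is an easy way to switch it off while experimenting.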

One of the in-class problems was finding prime numbers. The initial code I wrote took a couple of tries to run without error messages and was super slow at crunching out the primes. The next two versions of the code were much faster and returned a list of the prime numbers up to 50,000 in merely a few seconds. There are many ways to solve the same problem! The Python series of lectures ended with using the pandas and scikit-learn libraries to carry out basic supervised and unsupervised learning exercises, such as regression and PCA, respectively. PCA in particular is frequently used to reduce the dimensionality of big datasets so that their patterns can be visualized.
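The prime-number exercise is a nice example of how the same problem can be solved slowly or quickly. The sketch below is not the code from the course or my original attempt. It simply contrasts a straightforward trial-division approach with the classic Sieve of Eratosthenes, and the function names are my own.

```python
def primes_naive(limit):
    """Slow approach: test every number against every smaller divisor."""
    primes = []
    for n in range(2, limit + 1):
        is_prime = True
        for divisor in range(2, n):
            if n % divisor == 0:     # found a divisor, so n is not prime
                is_prime = False
                break
        if is_prime:
            primes.append(n)
    return primes


def primes_sieve(limit):
    """Faster approach: cross out the multiples of each prime (Sieve of Eratosthenes)."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]


print(primes_sieve(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Both functions return the same list. The sieve avoids repeated division by crossing out multiples instead, which is why it scales much better to limits such as 50,000.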

Some fun with word art made from the terminology introduced in the first few lectures. Image credit: Janhavi Gupta. Created using wordart.com.

Screenshot of VSCode showing code that computes the reverse complement of selected tumor suppressor genes.

Image credit: TFCB2021 course.
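For readers who cannot make out the screenshot, a reverse complement can be computed in a few lines of Python. This is a minimal sketch of the idea and not the course's code. The function name and the short example sequence are my own, and the sketch assumes the sequence contains only the letters A, C, G, and T.

```python
# Each base pairs with its complement: A with T, and C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence made of A, C, G, and T."""
    # Read the sequence backwards and swap each base for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


print(reverse_complement("ATGGCC"))  # GGCCAT
```

A dictionary does the base pairing here, so the same data structure from the early lessons turns out to be useful for a real biological task.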

R (and the kernel problem): The second half of the lecture series used the R programming language. The Bioconductor and tidyverse packages have many built-in functions for working with biological data tables and visualizing the results.

My first foray into R was smooth sailing. The next day, however, after I loaded a few packages commonly used for biological analysis, the program stopped working and repeatedly returned a kernel error message. This is a common incompatibility problem between Windows and Conda environments. Christine Dien, my direct mentor in Dr. Manu Setty’s lab, suggested transferring the lecture files to RStudio instead and modifying a few commands. The R programming language is powerful for many statistical analyses and for data visualization. However, for more heavy-duty computational analyses some scientists turn to Python, which can be relatively faster. In this course, R was used to analyze genomic data and bulk RNA-seq data. We returned to Python to analyze single-cell RNA-seq data, for which packages such as scanpy and cellxgene are available to process and visualize scRNA-seq datasets.

Course completion: This course was a tremendous learning experience despite a few hiccups, which mainly stemmed from software programs not working harmoniously. These hiccups, and the error messages during the coding exercises, were an excellent reminder that real-world research does not work perfectly and needs iterative improvement. Just as with learning any new skill or language, repeated practice was instrumental in building my confidence during my first foray into the world of computational biology. Towards the end of the course, it was easy to follow along with the lectures discussing more complex genomic and RNA-seq analyses. In retrospect, most coding problems can be solved by referring to free online manuals or searching websites such as Stack Overflow. It takes a bit of patience and a willingness to learn from mistakes, some of which were as silly as forgetting to close a parenthesis!

The more important aspect is to understand why the data is manipulated in a certain way for statistical analysis or data visualization. I am thankful for an incredibly rich and priceless learning experience in computational biology this summer, one that was well suited to remote participation. I am now more abreast of current research and can enrich student learning by incorporating computational tools into my curriculum.

An immense thanks to my son Aditya Gupta, my awesome mentor Christine Dien, Dr. Manu Setty, and Dr. Kristen Bergsman.

Dr. Janhavi Gupta teaches Honors Biology and Fundamentals of Biomedical Engineering at Union County Vocational Technical School – Magnet High School in Scotch Plains, New Jersey. She is a former research scientist who worked in the field of cell and molecular biology.