According to Charles Darwin's theory of evolution, heritable variation and natural selection drive evolution: a mutation occurs in the genetic code, and over the generations a mutant that survives is naturally selected. DNA, the rule book of the cellular machinery, contains the entire genome, out of which different segments (known as genes, the pages of the book) are expressed in different cells, giving each cell its unique identity. For a cell to commit to a specific identity, it needs an additional layer of information, a "memory," that records the past and thereby shapes the future; this phenomenon is executed through epigenetics ("above" genetics). Two major factors governing epigenetics are DNA methylation and the form of the histone proteins. Heavily methylated stretches of DNA are dampened in activity. Histones, on the other hand, form the inner scaffold around which DNA is wrapped, and different forms of histones activate different genes. Interestingly, both factors can be modified by local environmental factors ("nurture") and can thus change the behaviour of a cell. In short, epigenetic marks can make parts of the genome more active or less active. Heritable variation in the epigenetic code could add a role for nurture to Darwin's original theory of evolution; however, this is still debated.
The attached image, taken from https://alternityhealthcare.com/telomere-epigenetic-testing/, emphasises how, even with the same genetic code (nature), the same person can end up looking different simply because of lifestyle and environmental factors (nurture).
Hello fellas!
Yesterday, I was sitting at home, thinking about all the things I have learned in the past few years. The very first thing that came to mind was Bash scripting, which has made my life so much easier and more agile that I felt I should share it with others and let them experience how useful it is. Although I have used it extensively in my research, I still feel there is a lot more to learn. That is why I will not be writing about technical details here; instead, I will give you an idea of its wide applicability. I intend to motivate you to learn it, and I hope it will ease your life the same way it did mine. There is a small story about how I ended up learning it, and I suppose this story will tell you the importance of Bash scripting.
I joined IISc for my Ph.D. and decided to work under Prof. KGA and Prof. SP as a joint student. My research area, in a nutshell, was molecular simulation of biological systems. Even though most of the tools for MD simulation are primarily available on Linux, I, being a regular Windows user, started working in the lab on Windows only. Once, I went to SP's office with some preliminary results. He asked me to draw figures impromptu from the raw data and analyse them. I started using the GUI (graphical user interface) of some usual Windows software. Within a minute, he told me that I should change my operating system to Linux and start using the CLI (command-line interface). For a while, I wondered why I should use another operating system when I could do everything I needed in the existing one. Since I was unaware that this would save me a lot of trouble in the future, I remained stubborn. Then came the global guru, Google, to help me with my decision. I googled and noticed that all the tools of my research field were well developed for Linux. Thereby, I decided to change my operating system and started using the CLI. It was not smooth in the beginning; however, over time I kept asking the guru for help and successfully wrote whole scripts for my research. Today, I thank Prof. SP, who pushed me into this in the initial phase of my research.
Let me give a brief introduction to the Linux CLI for those who have never been exposed to one and have no clue what it is. In the Linux CLI, you give commands as text on the terminal (similar to the Command Prompt in Windows), the same way you give input with the mouse or keyboard in Windows. When you write those commands line by line in a file, that file becomes a script. Executing the script runs all the commands written in it without any user intervention. It can work like a macro; how much time can be saved this way can be imagined from the fact that the impromptu figures I was asked to draw took a few minutes each on Windows, but on the Linux CLI I can get them done within a minute. Moreover, it is a lot of fun writing scripts and executing them.
For example: suppose you want to copy 100 files from one folder to another and then, based on their names, perform some operation on each file (say, copy the last column of the files with a .txt extension and the first column of the files with a .dat extension into a single new file). All of this can be done with a single script. If this is something you need to do in your daily job, you just have to execute the script once and sit back.
Over that one year of my research, I learned and used only what I needed in my scripts. I never learned Bash scripting formally as a course, which constrained my knowledge of it. The lockdown of the pandemic period, which I am sure you are well aware of, allowed me to learn it formally along with most of its uses. Now I can say with even more confidence that this is a way to ease your life. I also admit that once you level up to the advanced stuff, it becomes much more difficult. But, as we all know, practice makes perfect. I suggest you apply whatever you learn in your day-to-day life.
To start working on bash scripting, I have some recommendations.
As a beginner:
Better to start by using some commands on the terminal from cheat sheets.
Then you can learn more extensively from tutorials.
1. Wiki Page: This covers most of the information comprehensively.
2. Ryan's tutorials: He also has some other amazing tutorials; do not shy away from learning other stuff.
As an advanced user:
1. Advanced Bash-Scripting Guide, Mendel Cooper: nicely done and quite complete, but not easy; do not start with this as a beginner.
2. Classic Shell Scripting, Arnold Robbins (Amazon): very well done, but assumes you have some previous knowledge of shell scripting.
3. Bash Pocket Reference, Arnold Robbins: for advanced users only.
I would like to end this with a statement: "Change is the synonym of SUCCESS." This world is changing fast; every day something new and better comes along. If you don't train yourself to adapt to change and keep yourself up to date with today's knowledge, your growth will be halted.
Burrah! Have a Bash!
Suppose you are provided with a simple machine and asked to enhance its efficacy. How would you go about it? First, you would need to understand every part of the machine and its functioning. After you fully understand every detail, you need to understand the collective motion of the parts. Once you gather all this information, you can, in principle, modify the machine in any desired way. Similarly, understanding any lifeform on Earth, including the human body, one of the most complex machines, requires understanding every minute detail of its components and their communication networks.
One of the major classes of entities in humans is proteins; roughly 20 percent of the human body consists of proteins, and they exist in every part of our body. Proteins are synthesized naturally within our bodies. During synthesis, a protein has to fold precisely into a well-arranged structure to sustain a specific function. During folding, proteins change their conformation at the molecular level, i.e., they change their 3D structure by rearranging their atomic positions. Misfolding of proteins results in aggregate (amyloid) formation that often leads to diseases such as kidney failure, Alzheimer's, and other neurodegenerative diseases. On the other hand, appropriately folded proteins take part in every bit of the body's physiology, ranging from muscular strength to cell signalling. Additionally, folded proteins can change their structure again by interacting with external molecules such as drugs, vaccines, and ligands. For example, enzymes are proteins that are essential for speeding up many processes, such as phosphorylation, digestion, and respiration. They interact with external molecules and are activated by changing their structure.
In some cases, the human body also produces unfolded proteins known as intrinsically disordered proteins (IDPs), which do not have a well-arranged structure. However, IDPs tend to fold into a stable configuration upon binding to other molecules, such as the cell membrane, ligands, or other proteins, in order to support specific functions. This folding from an unstable, disordered structure to an ordered structure is also an example of conformational change. Proper molecular-level knowledge of their folding mechanism is essential to gain insight into the nature of IDPs.
The folding of IDPs requires flexibility in parts of the protein, which allows them to fold into a stable structure upon interaction with foreign molecules. However, the extent to which flexibility is essential in other proteins is unknown, especially for membrane proteins. Membrane proteins are proteins that interact with the cell membrane and change their structure in doing so. The cell membrane works as a defence mechanism against many pathogenic activities, such as the insertion of a virus or a bacterial infection. Our research focuses on a specific membrane protein, the pore-forming toxin cytolysin A (ClyA). Bacteria secrete ClyA in a water-soluble form that binds to the plasma membrane of a target (human) cell, puncturing the cell by forming a pore in its membrane. ClyA displays remarkable flexibility that enables it to interconvert between two structures, the water-soluble form and the membrane-bound form, and it shows one of the largest known conformational changes in protein structure on membrane binding.
In our work, we identified the parts (motifs) of ClyA whose flexibility directly affects its activity: the loss of flexibility in those parts does not allow ClyA to puncture the human cell membrane. Our finding suggests that flexibility in general might be the key to the conformational changes and folding that proteins need for their functional activity. This information can now help drug development, by targeting those regions and blocking their flexibility. As an analogy, your room door can only open because its hinges are attached at the edge; the hinges provide the flexibility to swing the door when you apply force at the handle. Any deformity in the hinges would leave your door stuck, and it would not be functional anymore. Therefore, knowing that the hinges are the essential part of the door's functioning allows you to design a blocker that stops the hinges' motion. You must all be familiar with the coronavirus: the spike proteins on top of the virus first interact with the human cell to facilitate the virus's insertion into the cell, and they have to undergo a structural change to carry out this function. Most vaccines are designed to block this interaction of the coronavirus with the human cell membrane.
A molecular-level understanding of a protein's structural change is essential; it would allow us to tune the process in a desired way, for example, to avoid protein misfolding. For instance, understanding the conformational change around an enzyme's active site can help in making better drugs. The lock-and-key picture of drug binding, whereby a small molecule (the key) binds to a protein target (the lock) and blocks its active site, is well understood. In practice, however, protein binding sites are not rigid, and in some cases quite significant changes in protein shape accompany small-molecule binding. Without knowing and understanding the nature of these changes at the outset, the rational design of small-molecule drugs to block these flexible pockets is impossible.
Overall, capturing the changes in proteins, whether due to internal mechanisms (folding or conformational change) or external stimuli (drugs or vaccines), is what my Ph.D. work is about, and for it I use the molecular dynamics (MD) simulation technique. Experiments struggle to capture the details of how atoms and molecules move and carry out their functions, and they are very expensive compared with computational techniques. MD simulation, in contrast, allows us to track the time evolution of every atom of a protein. Although MD simulation can in principle capture any structural change, we can only simulate a very short time span (a few microseconds), and most of the interesting phenomena occur on much longer time scales; protein conformational changes, for example, often take place over seconds to minutes. Therefore, capturing the molecular-level mechanism of a long-time-scale process is a present-day challenge for the scientific community.
That is where we started developing a method that allows us to capture a long-time-scale process using the idea of the string method. The idea of the string method is to minimize the total energy along the path of a protein's structural change from one state to another. If you want to go from Kashmir to Kanyakumari, you would prefer the road that costs you the least in terms of money and time. Finding a minimum-energy path connecting the initial and final states is the principle we used to develop our method. The developed method now allows us to capture the mechanism of any enzyme or protein. It is not restricted to proteins; practically, it can be applied to any process for which one can write down the total energy cost. A better understanding of a protein's mechanism lets us see its flexible binding sites and subsequently enables us to design more efficient drugs or vaccines for a disease, facilitating the proper interaction of the drug with the protein. Molecular-level knowledge of every protein's mechanism would be the first step towards understanding the mechanisms of all lifeforms on Earth, as stated at the beginning of the story.
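To make the idea of a minimum-energy path concrete, here is a minimal, self-contained sketch of the simplified string method on a toy two-well potential. This is only an illustration of the principle, not our actual implementation: the potential, the endpoints, the number of images, and the step sizes are all made-up choices for the example.

import numpy as np

# Toy double-well potential V(x, y) = (x^2 - 1)^2 + y^2 (an assumed illustrative choice),
# with minima ("states") at (-1, 0) and (1, 0) and a saddle point at (0, 0).
def grad_V(x, y):
    return np.stack([4.0 * x * (x ** 2 - 1.0), 2.0 * y], axis=-1)

def reparametrize(path):
    # Redistribute the images so they are equally spaced in arc length along the string.
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s /= s[-1]
    s_new = np.linspace(0.0, 1.0, len(path))
    return np.stack([np.interp(s_new, s, path[:, d]) for d in range(path.shape[1])], axis=1)

def string_method(start, end, n_images=20, dt=0.01, n_steps=2000):
    # Initial guess: a straight line between the two states, deliberately perturbed in y.
    path = np.linspace(start, end, n_images)
    path[:, 1] += 0.5 * np.sin(np.linspace(0.0, np.pi, n_images))
    for _ in range(n_steps):
        path = path - dt * grad_V(path[:, 0], path[:, 1])  # relax every image downhill
        path = reparametrize(path)                          # keep the images evenly spread
    return path

mep = string_method(np.array([-1.0, 0.0]), np.array([1.0, 0.0]))
print(mep[len(mep) // 2])   # the middle image ends up near the saddle point (0, 0)

The string starts as a perturbed line between the two minima; repeated downhill steps plus reparametrization relax it toward the minimum-energy path through the saddle, which is the same principle applied to protein conformational changes, only with far more dimensions and a far more complicated energy function.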
Hello peeps!
Before jumping to the topic, I would like to give you an idea of PCA. PCA (principal component analysis) is a powerful technique for dimensionality reduction (which helps in visualization). Further, you can classify the data in the reduced dimensions, and much more. Dimensionality reduction is one of the biggest challenges wherever we deal with a large amount of data. The simplest example is transforming 2-D points that lie along a line into 1-D points by representing each of them as its distance along the line from one endpoint.
(SEE below Figure A)
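As a tiny illustration of this 2-D to 1-D reduction, each point on a line can be replaced by a single number, its distance along the line from one endpoint (the coordinates below are made up for the example):

import numpy as np

# Made-up coordinates: four points that all lie on one straight line in 2-D.
points = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])

# Unit vector along the line, taken from the two endpoints.
direction = points[-1] - points[0]
unit = direction / np.linalg.norm(direction)

# Each 2-D point collapses to one number: its distance along the line from the first endpoint.
distances = (points - points[0]) @ unit
print(distances)   # approximately [0.  2.24  4.47  6.71]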
Another example: consider a class of 100 students, where each student has credited two courses. We are interested in stratifying the students based on their performance. This is a trivial problem; we can simply draw a 2-D graph with the x- and y-axes as the two courses and plot a point for each student based on the scores obtained in the respective subjects. In the figure below, the scatter plot helps classify the students into two categories based on their performance.
(SEE below Figure B)
Now consider that every student is registered in 10 subjects. This poses a problem: we cannot plot a 10-dimensional graph. This is where PCA comes in; it transforms the data into a new set of 10 dimensions, called principal components, ordered so that the first dimension carries the maximum variance of the data and the last carries the least. In doing so, most of the time the first and second principal components are enough to give most of the information about the data.
In the figure below, the amount of money spent versus the population is given for 100 cities. The green line is the first principal component, and the blue dotted line is the second principal component. The first principal component lies in the direction of maximum variance in the data and is orthogonal to the second principal component, which ensures the components are independent of each other. Here, the first principal component alone is enough to give most of the information in the data, since there is not much variation along the second principal component. PCA is an unsupervised approach, since it involves only the independent variables and not the dependent variable (the response).
(SEE below Figure C)
How do you calculate principal components?
Let's consider that there are a total of n observations of p variables. To visualize them, we could draw 2-D plots of all possible pairs of variables, which is tedious and hard to interpret. PCA helps to find a small number of uncorrelated dimensions from these correlated variables. Each principal component is a linear combination of the original variables. The first principal component Z1 is the normalized linear combination of the variables X1, X2, …, Xp.
(SEE below Figure D with i=1)
Here, W11, W21, …, Wp1 are the loadings of the first principal component. The normalization condition is imposed to prevent the loadings from blowing up: since the first principal component has the maximum variance, without this condition the values of the loadings could grow without bound.
Here, we are only interested in the variance of the data; therefore, we assume that the mean of each variable is zero (each column mean is zero). We can transform our data to satisfy this by subtracting the mean value (over all observations in the column) from each variable.
Now, the linear combination of the variables that has the largest variance is given by, (SEE below Figure D)
Since the average of every column is zero, the average of the Z's will also be zero. Therefore, to maximize the sample variance, we have to maximize the variance of the n values of Zi1. Here Z11, Z21, …, Zn1 are the scores of the first principal component. This is an optimization problem. (SEE below Figure E)
This problem can be solved by eigendecomposition. Using the method of Lagrange multipliers, the Lagrangian function can be written as: (SEE below Figure F)
Here, the variance is: (SEE below Figure G)
By solving the above two equations: (SEE below Figure H)
W is an eigenvector of the covariance matrix V, and the maximizing vector is the one associated with the largest eigenvalue λ. We can evaluate the loadings W from this simple eigenvalue problem, and therefore obtain the scores of the principal components.
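For readers who prefer the equations written out rather than as figures, here is my transcription of what Figures D-H express, in standard notation, assuming V denotes the sample covariance matrix of the mean-centred data:

% First principal component: a normalized linear combination of the variables (Figure D with i = 1)
Z_{i1} = W_{11} X_{i1} + W_{21} X_{i2} + \dots + W_{p1} X_{ip},
\qquad \sum_{j=1}^{p} W_{j1}^{2} = 1

% Maximize the sample variance of the scores (Figure E)
\max_{W_{11},\dots,W_{p1}} \; \frac{1}{n} \sum_{i=1}^{n} Z_{i1}^{2}
\;=\; \max_{\mathbf{W}_1} \; \mathbf{W}_1^{\top} V \, \mathbf{W}_1
\quad \text{subject to} \quad \mathbf{W}_1^{\top} \mathbf{W}_1 = 1

% Lagrangian and its stationarity condition (Figures F, G and H)
\mathcal{L}(\mathbf{W}_1, \lambda)
 = \mathbf{W}_1^{\top} V \mathbf{W}_1 - \lambda \left( \mathbf{W}_1^{\top} \mathbf{W}_1 - 1 \right),
\qquad \frac{\partial \mathcal{L}}{\partial \mathbf{W}_1} = 0
\;\Longrightarrow\; V \mathbf{W}_1 = \lambda \mathbf{W}_1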
Example: Before working through a problem, let's brush up on some basic concepts of statistical analysis.
Variance: It is a measure of the variability, or spread, in a set of data. Mathematically, it is the average squared deviation from the mean score. We use the following formula to compute the variance.
Var(X) = Σ (Xᵢ − X̄)² / N = Σ xᵢ² / N
where,
N is the number of scores in the set,
X̄ is the mean of the N scores,
Xᵢ is the ith raw score in the set,
xᵢ = Xᵢ − X̄ is the ith deviation score,
Var(X) is the variance of all the scores in the set.
Covariance: It is a measure of the extent to which corresponding elements from two sets of ordered data move in the same direction. We use the following formula to compute covariance.
Cov(X, Y) = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / N = Σ xᵢyᵢ / N
Variance-Covariance Matrix: Variance and covariance are often displayed together in a variance-covariance matrix, also known as a covariance matrix. The variances appear along the diagonal and covariances appear in the off-diagonal elements.
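If you want to verify these formulas numerically, here is a small NumPy check. The score columns below are made-up numbers, not the data from the table that follows; note that np.var divides by N by default, while np.cov needs bias=True to divide by N instead of N−1:

import numpy as np

# Hypothetical score columns, purely to check the formulas numerically.
X = np.array([90.0, 60.0, 30.0, 70.0, 50.0])
Y = np.array([80.0, 55.0, 35.0, 65.0, 45.0])

# Deviation scores (raw score minus the column mean).
x = X - X.mean()
y = Y - Y.mean()

# Population variance and covariance written straight from the definitions (divide by N).
print((x ** 2).mean(), np.var(X))                       # np.var divides by N by default, so the values match
print((x * y).mean(), np.cov(X, Y, bias=True)[0, 1])    # bias=True makes np.cov divide by N too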
Problem: In the table below, the data shows the scores obtained by the students in three different subjects.
Shown in red along the diagonal, we see the variance of the scores for each test. The art test has the biggest variance (720) and the English test the smallest (360), so we can say that art scores are more variable than English scores. (SEE below Figure I)
The covariance between math and English is positive (360), and the covariance between math and art is positive (180). This means the scores tend to covary in a positive way. As scores in math go up, scores in art and English also tend to go up, and vice versa.
The covariance between English and art, however, is zero. This means there tends to be no predictable relationship between the movement of English and art scores.
NOTE: Here, you can divide each column by its corresponding standard deviation so that every column has unit variance.
Compute the eigenvectors and eigenvalues from the covariance matrix:
By solving Det(V − λI) = 0.
Calculate the corresponding eigenvectors and sort them with decreasing values of the corresponding eigenvalues λ.
Choose K eigenvectors. In this case, we are reducing 3 dimensions to 2 dimensions; therefore, we keep two eigenvectors. This gives the loading matrix 'W' of size (3×2).
Transform the original data into the new subspace: Y (2×5) = Wᵀ (2×3) × Aᵀ (3×5), where A is the (mean-centred) matrix of the original scores.
Y is your new principal components matrix. You can plot these components.
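Here is a minimal NumPy sketch of the whole recipe above, end to end. Since the score table itself is only shown as a figure, the matrix below uses placeholder numbers rather than the actual values:

import numpy as np

# Placeholder scores for 5 students in 3 subjects (math, English, art) -- illustrative only.
A = np.array([[90., 60., 90.],
              [85., 70., 40.],
              [60., 55., 70.],
              [70., 65., 85.],
              [35., 40., 45.]])

# 1. Mean-centre each column so that every subject has zero mean.
A_centred = A - A.mean(axis=0)

# 2. Covariance matrix V (dividing by N, as in the formulas above).
V = (A_centred.T @ A_centred) / A.shape[0]

# 3. Eigendecomposition of V (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(V)
order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep K = 2 eigenvectors -> loading matrix W of shape (3, 2).
W = eigvecs[:, :2]

# 5. Project onto the new subspace: Y = W^T A^T has shape (2, 5).
Y = W.T @ A_centred.T
print(Y.T)   # each row is one student described by two principal components

(The scikit-learn walkthrough below additionally standardizes each column before the decomposition, so its numbers will differ, but the procedure is the same.)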
Python code to solve this problem;
Import some important libraries;
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline   # use this in a Jupyter notebook so plots render inline
Read the data from an Excel sheet (you can use any method to load the data); I have added two more columns here to allow more interpretation of the data.
df_original=pd.read_excel('data.xlsx')
Check the head of data;
df_original.head()
(SEE below Figure J)
Drop the last two columns (later we will use them to classify the data)
df=df_original.drop(['percentage','dictation'],axis=1)
Now we need to transform our data. There is a StandardScaler feature in the scikit-learn library of Python, which transforms the data by subtracting the mean and dividing by the standard deviation so that each column has unit variance.
from sklearn.preprocessing import StandardScaler
Make an object of this feature and fit it to our data;
scaler = StandardScaler()
scaler.fit(df)
scaled_data = scaler.transform(df)
Now we can import the PCA feature and define an object (here we can specify how many components we are interested in), then fit it to our data and transform the data with the transform method;
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
We can check the shape of our data;
scaled_data.shape
(5, 3)
x_pca.shape
(5, 2)
We have reduced the dimension from 3 to 2. Now, let's make a scatter plot of these two principal components, with colour coding based on the original 'dictation' data.
plt.figure(figsize=(5,3))
plt.scatter(x_pca[:,0],x_pca[:,1],cmap='plasma',c=df_original['dictation'])
plt.xlabel('First principal component')
plt.ylabel('Second Principal Component')
(SEE below Figure K)
Here, the colours have been assigned based on the 'dictation' column, and we can see a clear separation between the students provided by PCA.
We can also plot the principal components in relation to the original data. (SEE below Figure L)
In the above figure, the 1st component correlates well with math and English, whereas it does not correlate well with art. The 2nd component does not correlate well with any of the subjects; therefore, here just one component was enough to describe the data.
We can check the set of all eigenvectors (also known as loadings). They are stored in an attribute of the fitted PCA object.
print(pca.components_)
[[-0.70710678 -0.66666667 -0.23570226]
 [-0.          0.33333333 -0.94280904]]
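To quantify how much of the total variance each component captures (which backs up the observation that one component nearly suffices here), you can also inspect the explained_variance_ratio_ attribute of the fitted PCA object:

print(pca.explained_variance_ratio_)   # fraction of the total variance captured by each component

It returns one fraction per retained component; a value close to 1 for the first component confirms that most of the information lies along a single direction.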
This has many applications in real-world problems. Most systems are described in a high-dimensional space, such as the energy surface of a system consisting of a large number of atoms, the universe, or any process that responds to a large number of variables. I think God also used it to simplify the universe: we see only 3D, yet there may be more dimensions in this world (just a thought).
Thank you for reading. If there is any mistake or way to improve the article, please write to me. For any further queries, you can reach out to me.
An Introduction to Statistical Learning: with Applications in R, by Gareth James et al.