According to Charles Darwin's theory of evolution, heritable variation and natural selection drive evolution: a mutation occurs in the genetic code, and over the generations a mutant that survives is naturally selected. DNA, the rule book of the cellular machinery, contains the entire genome, out of which different segments (known as genes, the pages of the book) are expressed in different cells, giving each cell its unique identity. For a cell to commit to a specific identity, it needs an additional layer of information, a "memory," that records the past and thereby shapes the future; this phenomenon is executed through epigenetics ("above" genetics). Two major factors governing epigenetics are DNA methylation and the form of the histone proteins. Heavily methylated stretches of DNA are dampened in activity. Histones, on the other hand, form the inner scaffold around which DNA is wrapped, and different forms of histones activate different genes. Interestingly, both factors can be modified by local environmental factors ("nurture") and can thus change the behaviour of a cell. In short, epigenetic marks can make parts of the genome more active or less active. Heritable variation in the epigenetic code could add a role for nurture to Darwin's original theory of evolution; however, this is still debated.
The attached image, taken from https://alternityhealthcare.com/telomere-epigenetic-testing/, emphasises how, even with the same genetic code (nature), the same person can end up looking different simply because of lifestyle and environmental factors (nurture).
Hello fellas!
Yesterday, I was sitting at home, thinking about all the things I have learned in the past few years. The very first thing that came to mind was Bash scripting, which has made my life so much easier and more agile that I felt I should share it with others and let them experience how useful it is. Although I have used it extensively in my research, I still feel there is a lot more to learn. That is why I will not be writing about technical details here; instead, I will give you an idea of its wide applicability. I intend to motivate you to learn it, and I hope it will ease your life the same way it did mine. There is a small story about how I ended up learning it, and I suppose this story will tell you the importance of Bash scripting.
I joined IISc for my Ph.D. and decided to work under Prof. KGA and Prof. SP as a joint student. My research area, in a nutshell, was molecular simulation of biological systems. Even though most of the tools for MD simulation are primarily available on Linux, I, being a regular Windows user, started working in the lab on Windows only. Once, I went to SP's office with some preliminary results. He asked me to draw figures impromptu from the raw data and analyse them. I started using the GUI (graphical user interface) of some usual Windows software. Within a minute, he told me that I should change my operating system to Linux and start using the CLI (command-line interface). For a while, I wondered why I should use another operating system when I could do everything I needed in the existing one. Since I was unaware that this would save me a lot of trouble in the future, I remained stubborn. Then came the global guru, Google, to help me with my decision. I googled and noticed that all the tools of my research field were well developed for Linux. Thereby, I decided to change my operating system and started using the CLI. It was not smooth in the beginning; however, over time I kept asking the guru for help and successfully wrote whole scripts for my research. Today, I thank Prof. SP, who pushed me into this in the initial phase of my research.
Let me give a brief introduction to the Linux CLI for those who have never been exposed to one and have no clue what it is. In the Linux CLI, you give commands as text on the terminal (similar to the Command Prompt in Windows), the same way you give input with the mouse or keyboard in Windows. When you write those commands line by line in a file, that file becomes a script. Executing the script runs all the commands written in it without any user intervention. It can work like a macro; how much time can be saved this way can be imagined from the fact that the impromptu figures I was asked to draw took a few minutes each on Windows, but on the Linux CLI I can get them done within a minute. Moreover, it is a lot of fun writing scripts and executing them.
For example: suppose you want to copy 100 files from one folder to another and then, based on their names, perform some operation on each file (say, copy the last column of the files with a .txt extension and the first column of the files with a .dat extension into a single new file). All of this can be done with a single script. If this is something you need to do in your daily job, you just have to execute the script once and sit back.
Over that one year of my research, I learned and used only what I needed in my scripts. I never learned Bash scripting formally as a course, which constrained my knowledge of it. The lockdown of the pandemic period, which I am sure you are well aware of, allowed me to learn it formally along with most of its uses. Now I can say with even more confidence that this is a way to ease your life. I also admit that once you level up to the advanced stuff, it becomes much more difficult. But, as we all know, practice makes perfect. I suggest you apply whatever you learn in your day-to-day life.
To start working on bash scripting, I have some recommendations.
As a beginner:
Better to start by using some commands on the terminal from cheat sheets.
Then you can learn more extensively from tutorials.
1. Wiki Page: This covers most of the information comprehensively.
2. Ryan's tutorials: He also has some other amazing tutorials; do not shy away from learning other stuff.
As an advanced user:
1. Advanced Bash-Scripting Guide, Mendel Cooper: nicely done and quite complete, but not easy; do not start with this as a beginner.
2. Classic Shell Scripting, Arnold Robbins (Amazon): very well done, but assumes you have some previous knowledge of shell scripting.
3. Bash Pocket Reference, Arnold Robbins: for advanced users only.
I would like to end this with a statement: "Change is the synonym of SUCCESS." This world is changing fast; every day something new and better comes along. If you don't train yourself to adapt to change and keep yourself up to date with today's knowledge, your growth will be halted.
Burrah! Have a Bash!
Suppose you are provided with a simple machine and asked to enhance its efficacy. How would you go about it? First, you would need to understand every part of the machine and its functioning. After you fully understand every detail, you need to understand the collective motion of the parts. Once you gather all this information, you can, in principle, modify the machine in any desired way. Similarly, understanding any lifeform on Earth, including the human body, one of the most complex machines, requires understanding every minute detail of its components and their communication networks.
One of the major classes of entities in humans is proteins; roughly 20 percent of the human body consists of proteins, and they exist in every part of our body. Proteins are synthesized naturally within our bodies. During synthesis, a protein has to fold precisely into a well-arranged structure to sustain a specific function. During folding, proteins change their conformation at the molecular level, i.e., they change their 3D structure by rearranging their atomic positions. Misfolding of proteins results in aggregate (amyloid) formation that often leads to diseases such as kidney failure, Alzheimer's, and other neurodegenerative diseases. On the other hand, appropriately folded proteins take part in every bit of the body's physiology, ranging from muscular strength to cell signalling. Additionally, folded proteins can change their structure again by interacting with external molecules such as drugs, vaccines, and ligands. For example, enzymes are proteins that are essential for speeding up many processes, such as phosphorylation, digestion, and respiration. They interact with external molecules and are activated by changing their structure.
In some cases, the human body also produces unfolded proteins known as intrinsically disordered proteins (IDPs), which do not have a well-arranged structure. However, IDPs tend to fold into a stable configuration upon binding to other molecules, such as the cell membrane, ligands, or other proteins, in order to support specific functions. This folding from an unstable, disordered structure to an ordered structure is also an example of conformational change. Proper molecular-level knowledge of their folding mechanism is essential to gain insight into the nature of IDPs.
The folding of IDPs requires flexibility in parts of the protein, which allows them to fold into a stable structure upon interaction with foreign molecules. However, the extent to which flexibility is essential in other proteins is unknown, especially for membrane proteins. Membrane proteins are proteins that interact with the cell membrane and change their structure in doing so. The cell membrane works as a defence mechanism against many pathogenic activities, such as the insertion of a virus or a bacterial infection. Our research focuses on a specific membrane protein, the pore-forming toxin cytolysin A (ClyA). Bacteria secrete ClyA in a water-soluble form that binds to the plasma membrane of a target (human) cell, puncturing the cell by forming a pore in its membrane. ClyA displays remarkable flexibility that enables it to interconvert between two structures, the water-soluble form and the membrane-bound form, and it shows one of the largest known conformational changes in protein structure on membrane binding.
In our work, we identified the parts (motifs) of ClyA whose flexibility directly affects its activity: the loss of flexibility in those parts does not allow ClyA to puncture the human cell membrane. Our finding suggests that flexibility in general might be the key to the conformational changes and folding that proteins need for their functional activity. This information can now help drug development, by targeting those regions and blocking their flexibility. As an analogy, your room door can only open because its hinges are attached at the edge; the hinges provide the flexibility to swing the door when you apply force at the handle. Any deformity in the hinges would leave your door stuck, and it would not be functional anymore. Therefore, knowing that the hinges are the essential part of the door's functioning allows you to design a blocker that stops the hinges' motion. You must all be familiar with the coronavirus: the spike proteins on top of the virus first interact with the human cell to facilitate the virus's insertion into the cell, and they have to undergo a structural change to carry out this function. Most vaccines are designed to block this interaction of the coronavirus with the human cell membrane.
A molecular-level understanding of a protein's structural change is essential; it would allow us to tune the process in a desired way, for example, to avoid protein misfolding. For instance, understanding the conformational change around an enzyme's active site can help in making better drugs. The lock-and-key picture of drug binding, whereby a small molecule (the key) binds to a protein target (the lock) and blocks its active site, is well understood. In practice, however, protein binding sites are not rigid, and in some cases quite significant changes in protein shape accompany small-molecule binding. Without knowing and understanding the nature of these changes at the outset, the rational design of small-molecule drugs to block these flexible pockets is impossible.
Overall, capturing the changes in proteins, whether due to internal mechanisms (folding or conformational change) or external stimuli (drugs or vaccines), is what my Ph.D. work is about, and for it I use the molecular dynamics (MD) simulation technique. Experiments struggle to capture the details of how atoms and molecules move and carry out their functions, and they are very expensive compared with computational techniques. MD simulation, in contrast, allows us to track the time evolution of every atom of a protein. Although MD simulation can in principle capture any structural change, we can only simulate a very short time span (a few microseconds), and most of the interesting phenomena occur on much longer time scales; protein conformational changes, for example, often take place over seconds to minutes. Therefore, capturing the molecular-level mechanism of a long-time-scale process is a present-day challenge for the scientific community.
That is where we started developing a method that allows us to capture a long-time-scale process using the idea of the string method. The idea of the string method is to minimize the total energy along the path of a protein's structural change from one state to another. If you want to go from Kashmir to Kanyakumari, you would prefer the road that costs you the least in terms of money and time. Finding a minimum-energy path connecting the initial and final states is the principle we used to develop our method. The developed method now allows us to capture the mechanism of any enzyme or protein. It is not restricted to proteins; practically, it can be applied to any process for which one can write down the total energy cost. A better understanding of a protein's mechanism lets us see its flexible binding sites and subsequently enables us to design more efficient drugs or vaccines for a disease, facilitating the proper interaction of the drug with the protein. Molecular-level knowledge of every protein's mechanism would be the first step towards understanding the mechanisms of all lifeforms on Earth, as stated at the beginning of the story.
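To make the idea of a minimum-energy path concrete, here is a minimal, self-contained sketch of the simplified string method on a toy two-well potential. This is only an illustration of the principle, not our actual implementation: the potential, the endpoints, the number of images, and the step sizes are all made-up choices for the example.

import numpy as np

# Toy double-well potential V(x, y) = (x^2 - 1)^2 + y^2 (an assumed illustrative choice),
# with minima ("states") at (-1, 0) and (1, 0) and a saddle point at (0, 0).
def grad_V(x, y):
    return np.stack([4.0 * x * (x ** 2 - 1.0), 2.0 * y], axis=-1)

def reparametrize(path):
    # Redistribute the images so they are equally spaced in arc length along the string.
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s /= s[-1]
    s_new = np.linspace(0.0, 1.0, len(path))
    return np.stack([np.interp(s_new, s, path[:, d]) for d in range(path.shape[1])], axis=1)

def string_method(start, end, n_images=20, dt=0.01, n_steps=2000):
    # Initial guess: a straight line between the two states, deliberately perturbed in y.
    path = np.linspace(start, end, n_images)
    path[:, 1] += 0.5 * np.sin(np.linspace(0.0, np.pi, n_images))
    for _ in range(n_steps):
        path = path - dt * grad_V(path[:, 0], path[:, 1])  # relax every image downhill
        path = reparametrize(path)                          # keep the images evenly spread
    return path

mep = string_method(np.array([-1.0, 0.0]), np.array([1.0, 0.0]))
print(mep[len(mep) // 2])   # the middle image ends up near the saddle point (0, 0)

The string starts as a perturbed line between the two minima; repeated downhill steps plus reparametrization relax it toward the minimum-energy path through the saddle, which is the same principle applied to protein conformational changes, only with far more dimensions and a far more complicated energy function.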
Hello peeps!
Before jumping to the topic, I would like to give you an idea of PCA. PCA (principal component analysis) is a powerful technique for dimensionality reduction (which helps in visualization). Further, you can classify the data in the reduced dimensions, and much more. Dimensionality reduction is one of the biggest challenges wherever we deal with a large amount of data. The simplest example is transforming 2-D points that lie along a line into 1-D points by representing each of them as its distance along the line from one endpoint.
(SEE below Figure A)
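As a tiny illustration of this 2-D to 1-D reduction, each point on a line can be replaced by a single number, its distance along the line from one endpoint (the coordinates below are made up for the example):

import numpy as np

# Made-up coordinates: four points that all lie on one straight line in 2-D.
points = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])

# Unit vector along the line, taken from the two endpoints.
direction = points[-1] - points[0]
unit = direction / np.linalg.norm(direction)

# Each 2-D point collapses to one number: its distance along the line from the first endpoint.
distances = (points - points[0]) @ unit
print(distances)   # approximately [0.  2.24  4.47  6.71]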
Another example: consider a class of 100 students, where each student has credited two courses. We are interested in stratifying the students based on their performance. This is a trivial problem; we can simply draw a 2-D graph with the x- and y-axes as the two courses and plot a point for each student based on the scores obtained in the respective subjects. In the figure below, the scatter plot helps classify the students into two categories based on their performance.
(SEE below Figure B)
Now consider that every student is registered in 10 subjects. This poses a problem: we cannot plot a 10-dimensional graph. This is where PCA comes in; it transforms the data into a new set of 10 dimensions, called principal components, ordered so that the first dimension carries the maximum variance of the data and the last carries the least. In doing so, most of the time the first and second principal components are enough to give most of the information about the data.
In the figure below, the amount of money spent versus the population is given for 100 cities. The green line is the first principal component, and the blue dotted line is the second principal component. The first principal component lies in the direction of maximum variance in the data and is orthogonal to the second principal component, which ensures the components are independent of each other. Here, the first principal component alone is enough to give most of the information in the data, since there is not much variation along the second principal component. PCA is an unsupervised approach, since it involves only the independent variables and not the dependent variable (the response).
(SEE below Figure C)
How do you calculate principal components?
Let's consider that there are a total of n observations of p variables. To visualize them, we could draw 2-D plots of all possible pairs of variables, which is tedious and hard to interpret. PCA helps to find a small number of uncorrelated dimensions from these correlated variables. Each principal component is a linear combination of the original variables. The first principal component Z1 is the normalized linear combination of the variables X1, X2, …, Xp.
(SEE below Figure D with i=1)
Here, W11, W21, …, Wp1 are the loadings of the first principal component. The normalization condition is imposed to prevent the loadings from blowing up: since the first principal component has the maximum variance, without this condition the values of the loadings could grow without bound.
Here, we are only interested in the variance of the data; therefore, we assume that the mean of each variable is zero (each column mean is zero). We can transform our data to satisfy this by subtracting the mean value (over all observations in the column) from each variable.
Now, the linear combination of the variables that has the largest variance is given by, (SEE below Figure D)
Since the average of every column is zero, the average of the Z's will also be zero. Therefore, to maximize the sample variance, we have to maximize the variance of the n values of Zi1. Here Z11, Z21, …, Zn1 are the scores of the first principal component. This is an optimization problem. (SEE below Figure E)
This problem can be solved by eigendecomposition. Using the method of Lagrange multipliers, the Lagrangian function can be written as: (SEE below Figure F)
Here, the variance is: (SEE below Figure G)
By solving the above two equations: (SEE below Figure H)
W is an eigenvector of the covariance matrix V, and the maximizing vector is the one associated with the largest eigenvalue λ. We can evaluate the loadings W from this simple eigenvalue problem, and therefore obtain the scores of the principal components.
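For readers who prefer the equations written out rather than as figures, here is my transcription of what Figures D-H express, in standard notation, assuming V denotes the sample covariance matrix of the mean-centred data:

% First principal component: a normalized linear combination of the variables (Figure D with i = 1)
Z_{i1} = W_{11} X_{i1} + W_{21} X_{i2} + \dots + W_{p1} X_{ip},
\qquad \sum_{j=1}^{p} W_{j1}^{2} = 1

% Maximize the sample variance of the scores (Figure E)
\max_{W_{11},\dots,W_{p1}} \; \frac{1}{n} \sum_{i=1}^{n} Z_{i1}^{2}
\;=\; \max_{\mathbf{W}_1} \; \mathbf{W}_1^{\top} V \, \mathbf{W}_1
\quad \text{subject to} \quad \mathbf{W}_1^{\top} \mathbf{W}_1 = 1

% Lagrangian and its stationarity condition (Figures F, G and H)
\mathcal{L}(\mathbf{W}_1, \lambda)
 = \mathbf{W}_1^{\top} V \mathbf{W}_1 - \lambda \left( \mathbf{W}_1^{\top} \mathbf{W}_1 - 1 \right),
\qquad \frac{\partial \mathcal{L}}{\partial \mathbf{W}_1} = 0
\;\Longrightarrow\; V \mathbf{W}_1 = \lambda \mathbf{W}_1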
Example: Before working through a problem, let's brush up on some basic concepts of statistical analysis.
Variance: It is a measure of the variability, or spread, in a set of data. Mathematically, it is the average squared deviation from the mean score. We use the following formula to compute the variance.
Var(X) = Σ (Xᵢ − X̄)² / N = Σ xᵢ² / N
where,
N is the number of scores in the set,
X̄ is the mean of the N scores,
Xᵢ is the ith raw score in the set,
xᵢ = Xᵢ − X̄ is the ith deviation score,
Var(X) is the variance of all the scores in the set.
Covariance: It is a measure of the extent to which corresponding elements from two sets of ordered data move in the same direction. We use the following formula to compute covariance.
Cov(X, Y) = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / N = Σ xᵢyᵢ / N
Variance-Covariance Matrix: Variance and covariance are often displayed together in a variance-covariance matrix, also known as a covariance matrix. The variances appear along the diagonal and covariances appear in the off-diagonal elements.
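If you want to verify these formulas numerically, here is a small NumPy check. The score columns below are made-up numbers, not the data from the table that follows; note that np.var divides by N by default, while np.cov needs bias=True to divide by N instead of N−1:

import numpy as np

# Hypothetical score columns, purely to check the formulas numerically.
X = np.array([90.0, 60.0, 30.0, 70.0, 50.0])
Y = np.array([80.0, 55.0, 35.0, 65.0, 45.0])

# Deviation scores (raw score minus the column mean).
x = X - X.mean()
y = Y - Y.mean()

# Population variance and covariance written straight from the definitions (divide by N).
print((x ** 2).mean(), np.var(X))                       # np.var divides by N by default, so the values match
print((x * y).mean(), np.cov(X, Y, bias=True)[0, 1])    # bias=True makes np.cov divide by N too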
Problem: In the table below, the data shows the scores obtained by the students in three different subjects.
Shown in red along the diagonal, we see the variance of the scores for each test. The art test has the biggest variance (720) and the English test the smallest (360), so we can say that art scores are more variable than English scores. (SEE below Figure I)
The covariance between math and English is positive (360), and the covariance between math and art is positive (180). This means the scores tend to covary in a positive way. As scores in math go up, scores in art and English also tend to go up, and vice versa.
The covariance between English and art, however, is zero. This means there tends to be no predictable relationship between the movement of English and art scores.
NOTE: Here, you can divide each column by its corresponding standard deviation so that every column has unit variance.
Compute the eigenvectors and eigenvalues from the covariance matrix:
By solving Det(V − λI) = 0.
Calculate the corresponding eigenvectors and sort them with decreasing values of the corresponding eigenvalues λ.
Choose K eigenvectors. In this case, we are reducing 3 dimensions to 2 dimensions; therefore, we keep two eigenvectors. This gives the loading matrix 'W' of size (3×2).
Transform the original data into the new subspace: Y (2×5) = Wᵀ (2×3) × Aᵀ (3×5), where A is the (mean-centred) matrix of the original scores.
Y is your new principal components matrix. You can plot these components.
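Here is a minimal NumPy sketch of the whole recipe above, end to end. Since the score table itself is only shown as a figure, the matrix below uses placeholder numbers rather than the actual values:

import numpy as np

# Placeholder scores for 5 students in 3 subjects (math, English, art) -- illustrative only.
A = np.array([[90., 60., 90.],
              [85., 70., 40.],
              [60., 55., 70.],
              [70., 65., 85.],
              [35., 40., 45.]])

# 1. Mean-centre each column so that every subject has zero mean.
A_centred = A - A.mean(axis=0)

# 2. Covariance matrix V (dividing by N, as in the formulas above).
V = (A_centred.T @ A_centred) / A.shape[0]

# 3. Eigendecomposition of V (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(V)
order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep K = 2 eigenvectors -> loading matrix W of shape (3, 2).
W = eigvecs[:, :2]

# 5. Project onto the new subspace: Y = W^T A^T has shape (2, 5).
Y = W.T @ A_centred.T
print(Y.T)   # each row is one student described by two principal components

(The scikit-learn walkthrough below additionally standardizes each column before the decomposition, so its numbers will differ, but the procedure is the same.)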
Python code to solve this problem;
Import some important libraries;
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline   # use this in a Jupyter notebook so plots render inline
Read the data from an Excel sheet (you can use any method to load the data); I have added two more columns here to allow more interpretation of the data.
df_original=pd.read_excel('data.xlsx')
Check the head of data;
df_original.head()
(SEE below Figure J)
Drop the last two columns (later we will use them to classify the data)
df=df_original.drop(['percentage','dictation'],axis=1)
Now we need to transform our data. There is a StandardScaler feature in the scikit-learn library of Python, which transforms the data by subtracting the mean and dividing by the standard deviation so that each column has unit variance.
from sklearn.preprocessing import StandardScaler
Make an object of this feature and fit it to our data;
scaler = StandardScaler()
scaler.fit(df)
scaled_data = scaler.transform(df)
Now we can import the PCA feature and define an object (here we can specify how many components we are interested in), then fit it to our data and transform the data with the transform method;
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
We can check the shape of our data;
scaled_data.shape
(5, 3)
x_pca.shape
(5, 2)
We have reduced the dimension from 3 to 2. Now, let's make a scatter plot of these two principal components, with colour coding based on the original 'dictation' data.
plt.figure(figsize=(5,3))
plt.scatter(x_pca[:,0],x_pca[:,1],cmap='plasma',c=df_original['dictation'])
plt.xlabel('First principal component')
plt.ylabel('Second Principal Component')
(SEE below Figure K)
Here, the colours have been assigned based on the 'dictation' column, and we can see a clear separation between the students provided by PCA.
We can also plot the principal components in relation to the original data. (SEE below Figure L)
In the above figure, the 1st component correlates well with math and English, whereas it does not correlate well with art. The 2nd component does not correlate well with any of the subjects; therefore, here just one component was enough to describe the data.
We can check the set of all eigenvectors (also known as loadings). They are stored in an attribute of the fitted PCA object.
print(pca.components_)
[[-0.70710678 -0.66666667 -0.23570226]
 [-0.          0.33333333 -0.94280904]]
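To quantify how much of the total variance each component captures (which backs up the observation that one component nearly suffices here), you can also inspect the explained_variance_ratio_ attribute of the fitted PCA object:

print(pca.explained_variance_ratio_)   # fraction of the total variance captured by each component

It returns one fraction per retained component; a value close to 1 for the first component confirms that most of the information lies along a single direction.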
This has many applications in real-world problems. Most systems are described in a high-dimensional space, such as the energy surface of a system consisting of a large number of atoms, the universe, or any process that responds to a large number of variables. I think God also used it to simplify the universe: we see only 3D, yet there may be more dimensions in this world (just a thought).
Thank you for reading. If there is any mistake or way to improve the article, please write to me. For any further queries, you can reach out to me.
An Introduction to Statistical Learning: with Applications in R, by Gareth James et al.