Tutorials

Introduction

This tutorial provides a step-by-step introduction to use dMI. It includes (1) learning dynamics of trajectories by model training; (2) searching for proper number of states by evaluating the training goodness and (3) the degree of overfitting; (4) quantifying integrated information embedded in dynamical patterns of signaling responses; (5) the comparison with the previous methods of calculating information transmission.

After downloading the package, the user can implement the algorithm by the commands below.


Installing dMI

Required software

dMI runs in the MATLAB environment. MATLAB R2019a or higher is encouraged, as earlier version may have compatibility issue.


Download the package

Go to GitHub, and download the package. The "server" user can clone the package.

After downloading, please add the path of the package in MATLAB software option "Set Path", or cd folder path. Assuming that the code is in the folder C:\MATLAB\dMI for Windows user, then

>cd C:\MATLAB\dMI

For Unix or Mac user, the format of folder path is different and needs to be adjusted.


Test the package

>TestCode;

>clear;

The command 'clear' is to clean up the workspace. We suggest to use 'clear' first when changing dataset.


Computational time

We implemented the package on personal desktop with intel(R) core(tm) i7-8700 CPU @ 3.7GHz. For a dataset with two stimuli conditions and each has around 500 trajectories of 150 time points, the computational time of each steps is summarized below.

Step 1 (Learning dynamics of trajectories): Using hidden Markov model with 4 hidden states and 4 emission states, it took around 10 minutes. Using time-inhomogeneous Markov model with 4 states, it took around 5 minutes. The computational time increases when the numbers of states increase. When the number of states reaches 32, it may take around 12 hours.

Step 2 (Training goodness of the model): As this step requires to scan a set of states, then the computational time scales with the set of parameter scan.

Step 3 (Overfitting of the model): As this step also requires to scan a set of states, it takes similar time as step 2.

Step 4 (Maximum mutual information from dynamics): This step took less than 10 minutes for two stimuli. It scales with the number of stimuli in the dataset.

Step 5 (Maximum mutual information from the previous methods): It took around 3 minutes for the vector method and 1 minute for the time-point method. It scales with the number of stimuli in the dataset.

A quick implementation on the package

Before going to step-by-step usage of the package, we first provide a quick experience. With being in the main folder of the package, the user simply need run

>Main;

dMI will train hidden Markov model for the test dataset with two stimuli. The model has 3 hidden states and 3 emission states as default. The results will be saved as a mat file in the main folder.

dMI generates other output files in the sub-folder with the same name as the mat file:

  • The transition matrix and emission matrix for the model.

  • The heatmaps of the data, binned data, and sampled trajectories.

  • The measures on the training goodness.

The figure, mat, .xlsx files of the maximum mutual information will be saved in the sub-folder of "Figures/MaximumMI", and named with pre-fix "MaxMI" followed by the name of the file.


The user can improve the model training by increasing the number of states, and can also use the other methods such as time-inhomogeneous model, vector method. These and other useful manipulations are carefully demonstrated below.

Step 1: Learning dynamics of trajectories

Dataset

We have provided two sample datasets. One is a test dataset with two conditions, data1.mat. To use it,

>Dataset=0;

The other is the NFkB dataset, which can be download from the supplementary of our published paper. After downloading the dataset to the main folder, you can use it with setting

>Dataset=1;

for the WT data. For the KO data, please use:

>MainPara.Mutant=1;

Then, run the script "Main.m".

The user can choose either of the datasets for the following steps, but the test dataset will take less computational time.


User-specified data input

If the user wants to input dataset, please replace the test dataset as follows.

dMI requires one input file:

  • Data: the file name of the dataset, 'data.mat'. The dataset should contain a set of numeric matrices of single-cell signaling responses, where rows are time points, and columns are single cells. It should be put into the same folder of the package main folder, otherwise please specify the folder path. You can replace the mat file of test dataset and provide the name of the data file

>Data='data1.mat';

The data can be the following data structure in MATLAB:

    1. Cell: 1xN, with N the number of stimulus conditions and each element is a matrix. Each matrix contains the single-cell signaling responses as a set of time series.

    2. Matrix: if data of various stimuli is saved in a set of matrices, then the user needs to construct it as a Cell. See a sample code in MatrixToCell.m.

    3. Table: the user needs to provide the label of each stimulus condition. dMI then generates the matrices of all the conditions as a cell, as shown by the first sample dataset in LoadDataset.m.

The data will be normalized with maximum to be 10 (arbitrary unit).


Three other input files are not necessary, but can be specified if necessary:

  • condition: the name label of each stimulus condition. It should be the string of labeling (as a cell structure) for the stimuli's conditions.

>condition={'1','2'};

If the user does not specify it, it will be assigned as '1', '2', '3',... . Please make sure the number of conditions is the same as that of the data.

  • ID: the index of each stimulus condition. It should be the numerical indices (as a vector):

>ID=[1,2];

If the user does not specify it, it will be assigned as 1, 2, 3,... .

  • Label: name of the output file

>LabelName='UserData';

If the user does not specify it, it will be assigned as 'UserData' or 'NFkB', depending on which dataset was used. The output folder name will be the LabelName with the parameters of model type and states number.


Parameter input

dMI also allows the user to specify the parameters for training the model:

  • Model type: the model used to learn the data or calculate channel capacity.

>MainPara.Model=1;

The current choices include: 1. Time-inhomogeneous Markov model (Kampen, 1992); 2. Hidden Markov model (Bahl et al., 1986); 3. The vector method (Selimkhanov et al., 2014 ); 4. The time-point method (Cheong et al., 2011). Every time the user wants to apply a type of model, this parameter needs to be reassigned. The latter two choices will be elaborated below in the section of the previous methods.

  • Number of states:

    1. Time-inhomogeneous Markov model: dimension of transition matrix.

>OtherPara.state=2; %The number of states.

    1. Hidden Markov model: number of states.

>OtherPara.binsize=2; %The number of emission states.

>OtherPara.state=2; %The number of hidden states.

  • Other parameters:

    1. Genotype: genotype of cells. For the current NFkB dataset, 0 is wild type and 1 is a mutant. To use the mutant data, when Dataset=1, set

>MainPara.Mutant=1;

    1. TimePointsToUse: the number of time points used to train the model.

>MainPara.TimePointsToUse=143;

    1. Time length in the unit of hour:

>MainPara.TotalTimeLength=11.9;

After specifying all the parameters, run the package by Main.m.

In a short summary, all the above parameters, using the test dataset and hidden Markov model, are given together as:

>Dataset=0;Data='data1.mat';

>MainPara.Model=1;OtherPara.binsize=2;OtherPara.state=2;

>MainPara.Mutant=0;MainPara.TimePointsToUse=143;MainPara.TotalTimeLength=11.9;

>Main;

If the user did not specify, the parameter will be the above default ones. For the NFkB dataset, the states number needs to reach 32 for sufficiently good model training.


Perturbations

dMI also provides various perturbation on the data for interesting study:

  • Selective conditions: it calculates the maximum mutual information for arbitrary choice of selected conditions.

>MainPara.CCSelect=0;% First run the package with 0, and then use 1 to select conditions.

>MainPara.CCSelectCondi=[1 2];

MainPara.CCSelectCondi is to provide the index of chosen conditions.

We suggest to first train the model for all the conditions, with setting MainPara.CCSelect=0. Then, user can choose certain conditions, and run the Main script. The package will calculate and plot the maximum mutual information among the selected conditions, based on the currently trained model.

  • PermuteData: it allows to permute the ordering of time points for the data of time series.

>MainPara.permuteData=1;% To permute data.

>MainPara.permuteMode=3;% mode of permutation.

There are different ways of permuting the data by setting MainPara.permuteMode: 1. descending order; 2. ascending order; 3. random permutation.

  • PartialTraining:

>MainPara.PartialTraining=1;

>OtherPara.PartialRatio=1/2;

the variable OtherPara.PartialRatio specifies the sub-ratio of full data.

  • Jackknife method has been included when calculating the channel capacity.


Output

dMI generates the output files as follows:

  • The model trained by the single-cell signaling activities of each stimulus condition.

  1. Time-inhomogeneous Markov model: the set of transition matrices.

  2. Hidden Markov model: one emission matrix and one transition matrix.

  • The heatmaps of the data, binned data, and sampled trajectories, as shown below.

  • The maximum mutual information between selected stimuli, such as certain pairwise stimuli. One can then plot the maximum mutual information between various datasets, such as different genotypes shown below.

The results will be saved as a mat file in main folder. The name have a structure of:

Dataset+Model+States number+Perturbations.

The figures will be saved in the sub-folder with the same name as the mat file.

Step 2: Training goodness of the model

dMI can quantify the goodness of model training, when using various number of states.

To conduct the procedure, the user should first set

>MainPara.PlotGoodness=1;

Then, the user needs to choose how to scan the number of states.

  • For the hidden Markov model, you can choose to:

    1. Vary both the states with their ratio fixed,

>OtherPara.Scancase=1;

The default ratio between hidden states and emission states is 2, and the numbers of scanning hidden states are 2^[1:6]. The user can adjust these by using

>OtherPara.RatioScan=2;

>OtherPara.stateScan=2.^[1:6];

    1. Vary the number of hidden states,

>OtherPara.Scancase=2;

The default fixed emission states is 32, and the numbers of scanning hidden states are 2^[1:6]. The user can adjust these by using

>OtherPara.binsizeScanFix=32;

>OtherPara.stateScan=2.^[1:6];

    1. Vary the number of emission states,

>OtherPara.Scancase=3;

The default fixed hidden states is 64, and the numbers of scanning emission states are 2^[1:6]. The user can adjust these by using

>OtherPara.stateScanFix=32;

>OtherPara.binsizeScan=2.^[1:6];

  • For the time-inhomogeneous Markov model, there is only one state number to change, by using OtherPara.Scancase=1.

The default value OtherPara.Scancase=0 is to run the model training with a fixed number of states as above. To change to the default, use

>OtherPara.Scancase=0;

After choosing the option to scan states, run the package by

>Main;


Output

dMI generates the output of model goodness as follows:

  • The histogram of false nearest neighbor counts.

  • The KL-divergence between data and sampled trajectories, and entropy of data at each time point.

  • The log-likelihood of test dataset.

All the three measures are estimated with various number of states, as shown below. The results will be saved as mat files in main folder, with the file name specifying the number of states. After generating the mat files, the package will plot three measures versus the numbers of states. The figures will be saved in the sub-folder of "Figures/Goodness", with subfolder's name as the type of states' scanning.

Step 3: Overfitting of the model

dMI evaluates the overfitting of the model by conducting an equal-partition procedure within each condition.

To use it,

>OtherPara.Scancase=4;

The default numbers of scanning hidden states are 2^[1:6]. The user can adjust these by using

>OtherPara.stateScan=2.^[1:6];

After that, run the Main script.


Output

It will repeat the procedures for various number of states, and provide a hint on the critical number of states for the overfitting, as examplified in the figures below. The maximum mutual information between the two equal-partitioned trajectory ensembles within each condition evaluates how much they are recognized as two different conditions by the model. The ideal maximum mutual information between the two subsets should be 0, because they are under the same stimulus. Otherwise, the model is likely to be overfitted.

The results will be saved as mat files in main folder's subfolder with post-fix "Partition", and the files' name specify the numbers of states. The figures will be saved in the sub-folder of "Figures/Overfitting", with subfolder's name as the model name plus a post-fix "Partition".

Step 4: Maximum mutual information from dynamics

After finding the proper number of states from the training goodness and overfitting, the user can go back to calculate the maximum mutual information using the proper number of states, as demonstrated above. The user can also choose the specific stimuli.


Output

The results will be saved in the sub-folder of "Figures/MaximumMI", and the figures will be named with pre-fix "MaxMI". The post-fix "condi_" specifies the number of conditions, such as for the selected conditions. Besides the figures, the data will also be saved as a mat file and a .xlsx file for further uses.

Step 5: Maximum mutual information from the previous methods

Vector method:

To use vector method (Selimkhanov et al., 2014), simply set

>MainPara.Model=3; Main;

There are two different schemes of using the time points. 1. the scheme of using the continuous time points from the first to the current; 2 the neighbor scheme: for each time point, using its 5 neighbor time points. These two schemes can be chosen by

>MainPara.WindowMode=1;

>MainPara.WindowLength=5;

where the second parameter allows to choose the number of neighbor time points. After that, run the Main script. For the vector method, the user can also permute the time points as demonstrated above.


Time-point method

To use time-point method (Cheong et al., 2011), simply set

>MainPara.Model=4; Main;


Output

The results will be saved as a mat file in main folder. The file name have a structure of:

Method+Dataset+Number of conditions.

The figures will be saved in the sub-folder of "Figures/MaximumMI", and named with pre-fix "MaxMI" followed by the name of the file, specifying the type of method.

Acknowledgements

  • Third-party packages

  1. We used the MACKtrack package (https://github.com/brookstaylorjr/MACKtrack.git ) to plot the heatmaps of single-cell signaling trajectories.

  2. We thank Roy Wollman's group for sharing the code of vector method.

  3. The decoding-based method was not included here, because it can be separately implemented by the user-friendly package (https://github.com/swainlab/mi-by-decoding).

  • Data

  1. The Dataset=2 was designed for the p53 dataset from (Chen et al., 2016). We suggest the user to ask the authors of the paper for the data.

  2. The Dataset=3 was designed for the Erk dataset from (Regot et al., 2014). We suggest the user to ask the authors of the paper for the data.

Getting help

For any questions, please contact jamestang23@gmail.com, or Signaling Systems Laboratory.