# Systems biology modeling

This page contains agreed upon standards and procedures for model development and analysis within our extended group.

## Model storage

The model should be stored in the Systems Biology Markup Language (`sbml`) format. There must be a root model file from which variants for specific purposes are derived. An example of a model variant: a simplified simulation file for parameter search that uses conservation laws and thermodynamic constraints explicitly (to reduce complexity). These constraints may be missing from the root model but are fulfilled implicitly if the model is instantiated with parameter values that satisfy them. For easier parsing, the model can be converted into the `SBtab` format. SBtab's tabular format (csv) makes the model accessible to standard tools such as `awk`, `sed` and `grep`.
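As a sketch of what such standard-tool access looks like, a minimal tab-separated SBtab-like table can be read with Python's `csv` module. The declaration line, column names and reaction rows below are illustrative assumptions, not a definitive SBtab layout:

```python
import csv
import io

# Hypothetical SBtab-like table: a '!!SBtab' declaration line, a header row
# of '!'-prefixed column names, then one row per reaction (illustrative only).
sbtab_text = """!!SBtab TableType='Reaction' TableName='Example'
!ID\t!ReactionFormula\t!KineticLaw
R1\tA + B <=> C\tk1*A*B - k2*C
R2\tC <=> D\tk3*C - k4*D
"""

def read_sbtab(text):
    """Parse a tab-separated SBtab-like table into a list of dicts,
    skipping the '!!SBtab' declaration line."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("!!")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    return list(reader)

rows = read_sbtab(sbtab_text)
```

The same table is equally accessible to `awk` or `grep` on the command line, which is the point of the tabular export.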

When the model is at an early stage and cannot yet be made public, it is stored either in a private version control repository (e.g. Git, maybe on KTH GitHub) or as a file in the Human Brain Project Collaboratory.

Once a model needs to be made public, it should be uploaded to the BioModels database.

## Version numbers

The model should have a date stamp and a version number of the form `major.minor.patch` in its metadata.

Whenever a consistent set of biological components is present and some justification can be given for all of their kinetic laws, the model can be versioned `1.0.0`. An addition of subsystems, compounds or reactions increases the major version (`2.0.0`). A change in some reaction's kinetic term, and similar small changes to the model, increases the minor version (`1.1.0`); trivial mathematical changes or fixes for implementation mistakes increment the patch version (`1.0.1`).
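The bump rules above can be summarized in a short sketch; the change-type names (`subsystem`, `kinetics`, `fix`) are our own labels for the three cases, not part of any standard:

```python
def bump_version(version, change):
    """Bump a 'major.minor.patch' version string following the conventions
    above (illustrative sketch; the change-type names are made up):
      'subsystem' - added subsystems, compounds or reactions -> major
      'kinetics'  - changed kinetic terms, similar small changes -> minor
      'fix'       - trivial math changes, implementation fixes -> patch
    """
    major, minor, patch = (int(x) for x in version.split("."))
    if change == "subsystem":
        return f"{major + 1}.0.0"
    if change == "kinetics":
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")
```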

## Model Simulation

An `sbml` model can be exported into various formats for specific purposes such as sampling. It must be confirmed that any set of model exports (into Python, MATLAB, R, etc.) is mutually compatible: given a test input, they produce very similar output signals, at least within the solver algorithm's error bounds.
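A compatibility check of this kind can be as simple as a pointwise comparison of the exported models' output trajectories against the solver tolerance; the trajectories below are made-up numbers for illustration:

```python
def compatible(signal_a, signal_b, abs_tol):
    """Check that two simulations of the same test input, produced by two
    different model exports, agree pointwise within `abs_tol` (the solver's
    error bound)."""
    if len(signal_a) != len(signal_b):
        return False
    return max(abs(a - b) for a, b in zip(signal_a, signal_b)) <= abs_tol

# Output of one species from two hypothetical exports (made-up numbers):
python_out = [0.00, 0.48, 0.86, 1.10]
matlab_out = [0.00, 0.4805, 0.8597, 1.1004]
```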

## Parameter Estimation

Given a running model, the free parameters must be determined by fitting the model to available data. Typically the model will not be fully identifiable, and the amount of data will not yet suffice to determine even those parameters that do influence the output function. For this reason we should use a robust parameter search method that includes some kind of regularization, or that explores the objective function globally.

Methods of this kind are:

- MCMC sampling; e.g. adaptive Metropolis algorithm, Hamiltonian Monte Carlo, SMMALA
- Profile Likelihood method; explores the objective function globally
- swarm optimization with a regularizing term (such as the minimum *p*-norm solution)
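To make the first option concrete, here is a plain random-walk Metropolis sampler on a stand-in log-posterior (a standard normal); the adaptive variants listed above additionally tune the proposal width or covariance on the fly. This is a minimal sketch, not a recommended production sampler:

```python
import math
import random

def log_target(theta):
    # Stand-in log-posterior: standard normal. In a real application this
    # would be the log-likelihood (model vs. data) plus the log-prior.
    return -0.5 * theta * theta

def metropolis(log_p, theta0, steps, step_size, rng):
    """Random-walk Metropolis: propose a Gaussian step, accept with
    probability min(1, p(proposal)/p(current))."""
    theta, lp = theta0, log_p(theta0)
    chain = []
    for _ in range(steps):
        prop = theta + rng.gauss(0.0, step_size)
        lp_prop = log_p(prop)
        if math.log(rng.random()) < lp_prop - lp:  # accept/reject
            theta, lp = prop, lp_prop
        chain.append(theta)
    return chain

rng = random.Random(1)
chain = metropolis(log_target, 0.0, 20000, 1.0, rng)
```

The resulting chain is a sample from the target density; its histogram approximates the posterior rather than reporting a single optimum.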

The objective is to minimize the mismatch between the simulation's output function and the data in some statistically justified way. In Bayesian model analysis this process starts with the noise model of the measurement. For the most part, biological data can be assumed to have a Gaussian (normal) error or perhaps a (multiplicative) log-normal error. However, data is sometimes obtained from very few replicates (e.g. three measurements), so the parameters of these error probability densities *N(μ, Σ)* remain very crudely guessed. This distribution is then used to quantify the disagreement between model and data: it measures the probability that such a difference is due to measurement noise.
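Under the Gaussian noise model, the statistically justified mismatch is the (negative) log-likelihood of the data given the simulation output; a minimal sketch with a per-point standard deviation:

```python
import math

def gaussian_loglik(data, model_out, sigma):
    """Log-likelihood of the data under the noise model
    y_i ~ N(model_out_i, sigma_i^2); maximizing this (or minimizing its
    negative) is the statistically justified fit criterion for Gaussian
    errors."""
    return sum(
        -0.5 * ((y - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for y, m, s in zip(data, model_out, sigma)
    )
```

For multiplicative log-normal errors, the same formula applies after taking logarithms of both data and model output.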

## Prior Parameter Distribution and Regularisation

Because of the difficulties described above concerning identifiability, but also because of the high cost and consequently low amount of data, the parameter search is often an *ill-posed* inverse problem. If we insist on one *true* answer, then this presumably *optimal parameter vector* will typically depend strongly on the starting locations of the search (and lie in a local minimum of the objective function); it may also be volatile and change drastically when new data is added to the set. For these reasons it won't be a satisfying answer, and we prefer a posterior probability density function over the parameter space, which represents all acceptable fits and also respects the constraints from the prior distribution.

The numerical convergence issues in the optimization of an ill-posed problem can be overcome by introducing a regularizing term (e.g. minimizing the norm of the solution). In Bayesian model analysis, a prior distribution can play the role of such a regularization. On the other hand, it can be used to bound the search space by choosing a flat prior with an upper and lower bound; in that case the prior provides no regularization, only bounds. Of course, we can do both at the same time and in addition include previous results, as the name *prior distribution* suggests. The prior distribution is a useful feature of Bayesian methods, not a burden.
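The correspondence between regularization and priors can be made explicit in one line: a Tikhonov-style penalty on the parameter norm is, up to additive constants, the negative log-density of a zero-mean Gaussian prior. A sketch:

```python
def regularized_misfit(misfit, theta, lam):
    """Tikhonov-style regularized objective: data misfit plus
    lam * ||theta||_2^2. For Gaussian errors this equals (up to constants)
    the negative log-posterior under a zero-mean normal prior on theta
    with variance 1 / (2 * lam)."""
    return misfit + lam * sum(t * t for t in theta)
```

Setting `lam = 0` recovers the unregularized (flat-prior) case, which bounds nothing and favors nothing.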

### Options for the Prior Distribution

- Log-normal distribution; these are simply normal in log-space, *θ = log(k)*, where the optimization will take place. Thermodynamic constraints on parameters (detailed balance restrictions) are expressed as products, which convert to sums in log-space; for this reason the dependent parameters will also appear normal.
- Log-uniform distribution; favors a small 1-norm in normal space.
- Uniform distribution; only restricts the search space and does not favor any specific value within it.
- Kernel density estimate; a sum of kernel distributions centered around a previously obtained representative sample. It can describe any shape and is useful for iteration, but is expensive to calculate.
- Copula-based estimate of an arbitrary distribution, derived from a previously obtained sample.
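The first option is especially convenient in log-space: the log-prior of a rate constant *k* reduces to an ordinary Gaussian term in *θ = log(k)*. A minimal sketch, with made-up hyperparameters `mu` and `sigma`:

```python
import math

def lognormal_logprior(k, mu, sigma):
    """Log-density (up to an additive constant) of a Gaussian prior placed
    on theta = log(k). Because the prior is defined in log-space, it is
    just a normal density there, which is why the optimization is carried
    out over theta rather than k."""
    theta = math.log(k)
    return -0.5 * ((theta - mu) / sigma) ** 2
```

The prior peaks at `k = exp(mu)`, so `mu` encodes the expected order of magnitude of the rate constant and `sigma` how many log-units of uncertainty we allow.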