## Elephants

Parameter fitting is the dirty secret of applied math and mathematical biology. It is difficult, frustrating, time consuming and very easy to mess up. The good news is that when it is done well there may be very little reward mainly due to a general mistrust in the scientific community for fit models. Outside of more mathematically oriented journals the parameter fitting process applied is buried in a brief footnote or relegated to the supplementary material.

This mistrust for models with many parameters is reflected in the famous scientific idiom attributed to John von Neumann “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” This expresses the idea that a complex model with many parameters is likely to display any behaviour you want if you have enough parameters.

The reality of course is much more complicated than the elephant quote suggests; whether you can really fit the elephant depends crucially on the structure of our model and the structure of the data used in the fit. However, there are ways to determine how well the data constrains the parameters. Even when the parameters are poorly determined by the data, the model can still make useful predictions (see Sloppy Models) . This is a growing and fascinating topic and should be useful to modelers and experimentalists alike as it can be used in both model selection and experimental design.

For example, a particularly frustrating misconception in the circadian literature is that the ability of a model to show 24 hour oscillations is evidence that it can used to describe the circadian clock. No . Any dynamical system which has a limit cycle can show 24 hour oscillations by a simple rescaling of time. Perhaps what is meant by this assertion is that the system shows 24 hour oscillations for biophysically reasonable parameter values. Although even this may be difficult to argue for many of the parameters we may have little idea of what is biophysically reasonable. This is an example of the data doing little to constrain the model. We will see that other forms of data can provide better evidence in favor of the model.

In this series of posts I am going to examine parameter fitting for biochemical models, specifically tailored to systems biology and rhythmic processes (circadian rhythms, etc). In this first post I introduce the least-square cost function typically used to compare simulations and data. Later posts will address finding good parameter fits and then accessing the model.

## Least Square Cost Functions

In parameter fitting for dynamical systems our model takes the form:

$\displaystyle \frac{d\vec{y}}{dt}=\vec{f}(\vec{y},\vec{\theta}) \ \ \ \ \ (1)$

where ${\vec{\theta} \in \mathbb{R}^{N_p}}$ is the parameter vector. We assume that we have data of the form ${\{(Y_j, \sigma_j)\}_{j=1}^{N_d}}$ corresponding to a value of the dynamical variable ${Y_j}$ at time ${t_j}$ with uncertainty ${\sigma_j}$.

The idea of the parameter fitting procedure is to define some cost function which gives a metric for how well the data and the model agree as a function of the parameters. Thus, the cost function will achieve a minimum where the “best fit” is achieved.

Mathematically, we may define the problem as: Find a parameter vector ,${\vec{\theta}}$ ,which minimizes the cost ${C:\vec{\theta} \rightarrow \mathbb{R}}$. A good way to define the cost function in this case is to set

$\displaystyle C(\vec{\theta})=\frac{1}{2} \sum_{j=1}^{N_d} \left(R_j(\vec{\theta})\right)^2 \ \ \ \ \ (3)$

where ${R_j(\vec{\theta})}$ is the residual between data point j and the simulation. When a cost function is written in these terms (the sum of the residuals squared) it is called a least square cost function. Under this definition our parameter fitting routine will look for the parameter set which minimizes the distance between the simulation and the model squared.