Work package 2 of our project involves the development of computational tools for bioimaging, with particular application to low-resource settings. This means avoiding, where possible, reliance on high-performance computing facilities. It also means producing models that are robust to changes in imaging modality, since we anticipate that users will produce images with a wide variety of microscopes, including instruments they have constructed themselves.
Our computational tools should be able to perform three tasks:
Image classification. Ultimately, we want to be able to diagnose cells from their images. We also want to check that the images in our data sets are accurately labelled.
Visual summaries of whole libraries of image data. We want to help researchers look for patterns in large image data sets, both patterns due to biologically interesting phenomena and patterns due to unwanted artifacts of experimental procedure.
Image compression. We want to help researchers make the best use of large amounts of information. In practice, this means keeping only a fraction of the signals from each image, so that what is kept can be stored and analysed quickly; a minimal sketch of this idea follows.
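As an illustration of what keeping only a fraction of the signals might look like, the sketch below compresses an image by retaining its largest-magnitude Fourier coefficients and discarding the rest. The transform, the retained fraction, and the function names are illustrative assumptions, not our production method.

```python
import numpy as np

def compress_image(image, keep_fraction=0.05):
    """Keep only the largest-magnitude Fourier coefficients of an image."""
    coeffs = np.fft.fft2(image).ravel()
    n_keep = max(1, int(keep_fraction * coeffs.size))
    # Indices of the n_keep largest-magnitude coefficients; these are all we store.
    keep_idx = np.argpartition(np.abs(coeffs), -n_keep)[-n_keep:]
    return keep_idx, coeffs[keep_idx], image.shape

def decompress_image(keep_idx, values, shape):
    """Rebuild an approximate image from the retained coefficients."""
    coeffs = np.zeros(int(np.prod(shape)), dtype=complex)
    coeffs[keep_idx] = values
    return np.real(np.fft.ifft2(coeffs.reshape(shape)))

# A random array stands in for a real microscopy image.
rng = np.random.default_rng(0)
image = rng.random((128, 128))
idx, vals, shape = compress_image(image, keep_fraction=0.05)
approx = decompress_image(idx, vals, shape)
print("mean squared reconstruction error:", np.mean((image - approx) ** 2))
```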
Our approach to the problems outlined above is guided predominantly by the objective of image compression. More specifically, we try to engineer mathematical functions that encode image data in a parsimonious way. These functions are allowed to vary with cell type and imaging modality, so that some variants are better at encoding specific types of images. If we are presented with a new image, we can try compressing it with each of the different function variants and see which does best. This gives us a way of guessing the cell type and imaging modality of that image.
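The sketch below illustrates this classification-by-compression idea, under the assumption that each variant is an encode/decode pair and that reconstruction error is used to judge how well a variant compresses an image. The variant labels and toy encoders are placeholders, not our actual models.

```python
import numpy as np

def classify_by_compression(image, variants):
    """Guess an image's label by seeing which encoder variant compresses it best.

    `variants` maps a label (e.g. a cell type / modality pair) to an
    (encode, decode) pair of functions. The label whose variant gives the
    smallest reconstruction error is returned as the guess.
    """
    errors = {}
    for label, (encode, decode) in variants.items():
        code = encode(image)      # parsimonious representation
        approx = decode(code)     # reconstruction from that representation
        errors[label] = np.mean((image - approx) ** 2)
    return min(errors, key=errors.get), errors

# Toy variants: each keeps a different linear projection of the image.
rng = np.random.default_rng(1)
basis_a = rng.random((16, 64))    # stand-ins for learned, variant-specific encoders
basis_b = rng.random((16, 64))
variants = {
    "cell_type_A": (lambda x: basis_a @ x.ravel(),
                    lambda c: (np.linalg.pinv(basis_a) @ c).reshape(8, 8)),
    "cell_type_B": (lambda x: basis_b @ x.ravel(),
                    lambda c: (np.linalg.pinv(basis_b) @ c).reshape(8, 8)),
}
guess, errs = classify_by_compression(rng.random((8, 8)), variants)
print(guess, errs)
```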
This approach is not new. It plays a part in many other statistical procedures, such as Principal Components Analysis (PCA) and Independent Components Analysis (ICA), in which we try to separate signals into one set of components (that we might want to keep) that vary in an interesting way from one data object to another, and another set of components (that we might want to discard) that are relatively homogeneous across data objects.
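For example, a minimal PCA-based split of an image library into kept and discarded components might look like the following sketch; the toy library, the number of retained components, and the variable names are illustrative assumptions.

```python
import numpy as np

# Each row is one flattened image from the library (toy data here).
rng = np.random.default_rng(2)
library = rng.random((200, 64 * 64))

# Centre the data and compute principal components via the SVD.
mean = library.mean(axis=0)
U, S, Vt = np.linalg.svd(library - mean, full_matrices=False)

k = 20                                # number of components we keep
kept_scores = U[:, :k] * S[:k]        # per-image codes: the part we store
discarded_energy = (S[k:] ** 2).sum() / (S ** 2).sum()

# Any image in the library can now be stored as k numbers and reconstructed approximately.
reconstruction = kept_scores @ Vt[:k] + mean
print(f"fraction of variance discarded: {discarded_energy:.3f}")
```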
Our work in this area has exposed theoretical and methodological challenges that we are pursuing both for our own purposes and as a contribution to the wider machine learning research community. Three of these challenges are outlined below.
The best ways to measure and reward parsimony are not obvious. To compress data we need to identify properties of it that we can afford to throw away, but it is rarely straightforward to define an explicit cost that makes the word 'afford' precise. It is also not clear which types of cost function are effective at guiding models towards parsimonious data encodings.
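One family of cost functions we could consider, purely as an illustration, trades reconstruction error against a charge on how much of the encoding is kept, in the spirit of rate-distortion and sparse-coding objectives; the weighting and function name below are arbitrary placeholders.

```python
import numpy as np

def parsimony_loss(image, code, reconstruction, sparsity_weight=0.1):
    """Reward encodings that reconstruct well while keeping few active signals.

    The first term measures what we lose by throwing information away; the
    second charges a price for every unit of encoding we keep. Choosing the
    relative weight is exactly the part that is hard to do in a principled way.
    """
    distortion = np.mean((image - reconstruction) ** 2)  # cost of what was discarded
    rate = np.mean(np.abs(code))                         # price of what was kept
    return distortion + sparsity_weight * rate
```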
Model architecture is hard to specify. Model architecture refers to the way different sub-functions are composed to produce a more complicated function. The machine learning literature includes many customizable schematics for such architectures, but their strengths are often supported by empirical evidence rather than theoretical argument. In our experiments we prioritize architectures that are simple for a human to understand and computationally efficient.
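To make the idea of composing sub-functions concrete, the sketch below assembles an encoder from two simple stages (a linear projection and a pointwise nonlinearity) with a matching linear decoder; the stage choices, sizes, and random weights are illustrative assumptions rather than our actual architecture.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two simple sub-functions composed into an encoder, plus a linear decoder.
W_enc = rng.normal(size=(32, 64 * 64)) * 0.01
W_dec = rng.normal(size=(64 * 64, 32)) * 0.01

def encode(image):
    """Linear projection followed by a pointwise nonlinearity."""
    return np.tanh(W_enc @ image.ravel())

def decode(code):
    """Single linear map back to image space."""
    return (W_dec @ code).reshape(64, 64)

image = rng.random((64, 64))
code = encode(image)                  # 32 numbers instead of 4096 pixels
approx = decode(code)
print(code.shape, np.mean((image - approx) ** 2))
```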
Model training is highly sensitive to changes in the loss function. More specifically, we have found that models trained to minimize a particular loss function are difficult to retrain to minimize a second loss function, even when the two loss functions seem to be very similar.