Lecture abstracts

Plenary 1: Dr Bryan Lawrence, British Atmospheric Data Centre

Managing complex datasets and accompanying information for reuse and repurpose

For centuries the main method of scientific communication has been the academic paper, itself developed as a reaction to the non-scalability of the “personal communication” (then known as a “letter”). In the 21st century, we now find that the academic paper is not always sufficient to communicate all that needs to be known about some scientific event, whether it is the development of a theory, an observation, a simulation, or some combination thereof. As a consequence, nearly all scientific communities are producing methods of defining and documenting datasets of importance, and building systems to augment (annotate) their data resources or to amalgamate, reprocess and reuse their data – often in ways unforeseen by the originator of the data. Such systems range from heavily bespoke, tightly architected systems, such as that developed in the climate community to support global climate model inter-comparison; via systems of intermediate complexity developed, for example, using “linked data” principles; to loose assemblies of web pages using vanilla web technologies. Concepts of publication are becoming blurred, with publication meaning anything from “I put it on Twitter” to “I published in Nature”.

In this talk, we’ll present a taxonomy of the information that (nearly) all these systems try to address and discuss the nature of publication in the 21st century. We’ll describe how information is built up during the life-cycle of datasets, and the importance of data provenance in the production of knowledge. We’ll present our concept of the “value proposition” for maintaining digital data, and we’ll confront the conflict between the open access revolution and the importance of information security. The material will be mainly illustrated with examples from the environmental sciences, but we believe the concepts discussed, and the conclusions drawn, are generic.

Plenary 2: Prof. Jeremy Nicholson, Imperial College London

Modelling supersystem biology in health and disease - the translational medicine challenge

Systems biology tools are now being applied at the individual and population level, utilizing analytical and statistical methods that report non-invasively on integrated biological functions. Metabolic phenotyping offers an important window on integrated system function, and both NMR and mass spectrometric methods have been successfully applied to characterize and quantify a wide range of metabolites in biological fluids and tissues to explore the biochemical sequelae of human disease processes (1). A major feature of human biology that has only recently been recognised is the extensive interaction with the gut microbiome (2). These symbiotic supraorganismal interactions greatly increase the degrees of freedom of the system, and there is extensive transgenomic control of metabolism that poses a significant challenge to current modelling approaches. In disease states, metabolic profiles and spectroscopic signatures are changed characteristically according to the exact site and mechanism of the lesion (3). The use of chemometrics allows interrogation of spectroscopic data, can give direct diagnostic information, and can aid both the detection of novel biomarkers of disease and the integration of metabolic data with other omics sets, including direct genome-metabolome mapping (4). Such diagnostics can be extremely sensitive for the detection of low-level damage in a variety of organ systems, are potentially a powerful new adjunct to conventional procedures for disease assessment, and can help explain complex gene-environment interactions that generate disease risks. Examples of the application of metabonomics to system-level information recovery from tissues and biofluids will be given, with reference to personalised healthcare and pharmaco-metabonomic profiling (5,6), phenotyping patient journeys, and human population screening using novel spectroscopy-driven quantitative metabolome-wide association study approaches (7) to discover population biomarkers of disease risk.

1. Nicholson, J.K. et al. (2002) Nature Rev. Drug Disc. 1(2), 153-161.

2. Nicholson, J.K. et al. (2004) Nature Biotech. 22, 1268-1274.

3. Nicholson, J.K. and Lindon, J.C. (2008) Nature 455, 1054-1056.

4. Dumas, M.E. et al. (2007) Nature Genetics 39, 666-672.

5. Clayton, T.A. et al. (2006) Nature 440, 1073-1077.

6. Clayton, T.A. et al. (2009) PNAS 106, 14728-14733.

7. Holmes, E. et al. (2008) Nature 453, 396-400.

Plenary 3: Prof. Richard Brereton, University of Bristol

Pattern recognition as an aid in Mass Spectrometry

Modern mass spectrometry allows the generation of large quantities of data, the size and complexity of databases being almost unimaginable to a previous generation. Coupled with this enormous expansion in technical capability, the power of desktop computers has increased exponentially: Moore's law suggests a doubling of computer power per unit cost each year, or a 1000-fold improvement in a decade.

Much of chemometrics as currently practised uses methods first implemented twenty or more years ago, in an era of small datasets, limited computing power and linear problems. Many of the modern frontline applications of mass spectrometry and related coupled techniques involve areas such as biology, medicine, forensics and heritage studies. These often require approaches to data analysis different from the traditional ones. Machine learning and data mining offer many tools that are as yet seldom applied to mass spectrometry data.

This presentation will show how pattern recognition techniques can be applied to a large variety of problems, including: the use of head-space mass spectrometry to determine whether or not soils are polluted; gas chromatography-mass spectrometry for metabolic profiling of mice and men; tandem mass spectrometry in forensics; and gas chromatography-mass spectrometry for studying the archaeological origins of pottery.

The application of modern approaches from machine learning, including self-organising maps and support vector machines, will be illustrated alongside more traditional methods for pattern recognition.
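
As a rough illustration of the kind of pattern recognition mentioned above (and not a description of the speaker's actual workflow), the following Python sketch trains a support vector machine on a synthetic table of mass-spectral intensities; the data, class labels and parameter choices are purely illustrative, and a self-organising map would require an additional library not shown here.

```python
# Minimal sketch: SVM classification of mass-spectral fingerprints.
# X (samples x m/z intensity features) and y (class labels) would normally
# come from an existing peak table; the data below are synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))        # 60 samples, 200 m/z channels
y = np.repeat([0, 1], 30)             # two classes, e.g. polluted vs clean soil
X[y == 1, :10] += 1.0                 # inject a small class difference

# Autoscale each variable, then fit an RBF-kernel support vector machine;
# cross-validation gives an honest estimate of classification accuracy.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```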

Lecture 1: Alice Laures, GlaxoSmithKline

Use of chemometrics for anti-counterfeiting activities: Ensuring patient safety and assisting criminal investigations

Counterfeiting of GSK’s toothpaste brands, especially Sensodyne, is an increasing problem. The substitution of cheaper diethylene glycol (DEG) for glycerine in counterfeit toothpaste was identified a few years ago. DEG is potentially toxic, and its level in counterfeit samples is measured by GC-MS in order to evaluate the health risk to patients who may be using the counterfeit toothpaste.

In addition to DEG levels, extra information can be gained from the counterfeit toothpaste analyses by GC-MS, which might be useful to classify the toothpastes and determine the number of sources/manufacturers of the counterfeit material. This information can be used to help lead investigators to the site of manufacture of the counterfeit material with the ultimate aim of closing these sites down.

Organoleptic compounds are used in many commercial products to improve sensory properties such as taste, colour, odour or feel. They can be readily measured by GC-MS using an electronic nose (e-nose) approach, which is an ideal method for rapidly measuring these compounds and objectively comparing the results with those of other samples using pattern recognition techniques. An in-house e-nose has been developed with which a large amount of GC-MS data can be generated, exported, pre-processed and analysed using Principal Component Analysis and K-means clustering in Matlab. This system can clearly distinguish between samples and has been used to reveal a number of potential groups of counterfeit suppliers.
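
The following Python sketch illustrates, on synthetic data, the general PCA-plus-K-means workflow described above; it is not the in-house implementation (which the abstract states was written in Matlab), and the cluster count and matrix sizes are assumptions made purely for illustration.

```python
# Minimal sketch: PCA scores followed by K-means clustering, in the spirit of
# grouping counterfeit samples by likely source. X is a samples x variables
# matrix of pre-processed GC-MS intensities; the data here are synthetic and
# the choice of three clusters is purely illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 50))
               for c in (0.0, 1.0, 2.0)])          # three simulated sources

X_scaled = StandardScaler().fit_transform(X)               # autoscale variables
scores = PCA(n_components=2).fit_transform(X_scaled)       # 2-D score space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
print(labels)                                              # putative source groups
```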

Lecture 2: Dr Mathias Nilsson, University of Manchester

Multivariate analysis of diffusion weighted NMR data

Mixture analysis by diffusion NMR is a powerful technique that is steadily gaining ground. The standard way to resolve individual component NMR spectra is by diffusion-ordered spectroscopy (DOSY) [1], where each peak is analysed individually. However, the whole spectrum of a molecular species generally shows identical diffusion behaviour, and this covariance can be effectively exploited to defeat problems with spectral overlap by multivariate methods such as SCORE [2], where whole spectra are fitted simultaneously. When diffusion is complemented by a third independent dimension (such as relaxation [3] or concentration change during a reaction [4,5]), the data can become trilinear, allowing the use of powerful multi-way methods such as PARAFAC (Parallel Factor analysis) [6]. All multivariate methods are very sensitive to deviations from model behaviour; examples include the effects of non-uniform field gradients, which cause deviations from pure exponential decay in hard-modelled methods such as SCORE, and inconsistencies in peak phase, frequency and shape, which also affect model-free decompositions such as PARAFAC. The effects of such deviations, and some remedies, are discussed, along with illustrative examples. Many of the above methods have been implemented as Matlab toolboxes and are freely available.

1. Johnson CS. Prog. Nucl. Magn. Reson. Spectrosc. 1999, 34, 203-256.

2. Nilsson M, Morris GA. Anal. Chem. 2008, 80, 3777-3782.

3. Nilsson M, Botana A, Morris GA. Anal. Chem. 2009, 81, 8119-8125.

4. Nilsson M, Khajeh M, Botana A, Bernstein MA, Morris GA. Chem. Commun. 2009, 1252-1254.

5. Khajeh M, Botana A, Bernstein MA, Nilsson M, Morris GA. Anal. Chem. 2010, 82, 2102-2108.

6. Bro R. Chemometr. Intell. Lab. Syst. 1997, 38, 149-171.
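
As a minimal illustration of the exponential decay model underlying the diffusion fitting discussed in this abstract, the Python sketch below fits a single Stejskal-Tanner-type decay with scipy; it does not implement SCORE or PARAFAC, and the gradient parameters and diffusion coefficient are illustrative values only.

```python
# Minimal sketch of the decay model behind DOSY-type fitting: peak intensity
# falls as I(b) = I0 * exp(-D * b), where b collects the gradient parameters
# (gamma^2 g^2 delta^2 (Delta - delta/3)). A single decay is fitted here;
# SCORE-style processing fits whole spectra for several components at once.
import numpy as np
from scipy.optimize import curve_fit

def decay(b, i0, d):
    return i0 * np.exp(-d * b)

b = np.linspace(0.0, 5.0e9, 16)          # s m^-2, illustrative gradient weighting
true_d = 5.0e-10                         # m^2 s^-1, typical small-molecule value
rng = np.random.default_rng(2)
intensity = decay(b, 1.0, true_d) * (1 + 0.01 * rng.normal(size=b.size))

popt, _ = curve_fit(decay, b, intensity, p0=(1.0, 1e-10))
print(f"fitted D = {popt[1]:.2e} m^2/s")
```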

Lecture 3: Dr Heather Chassaing, Pfizer

Automated data processing - Enabling rapid Mass Spectrometric analysis of complex pharmaceutical samples

As advances in mass spectrometric technology facilitate the acquisition of information-rich data sets, the interpretation of the data remains the rate-determining step of the metabolite identification process. A variety of software technologies that facilitate metabolite identification during the various stages of the drug discovery process will be presented. In addition, solutions for storing and retrieving this valuable information will also be discussed.

Lecture 4: Mark Earll, Syngenta

Data processing and visualisation in MS based metabolomics

MS-based metabolomics presents the considerable challenge of deconvoluting three-dimensional LC-MS or GC-MS data, followed by metabolite identification. In our approach we have used the open-source MZmine software, combined with accurate-mass UPLC, as the foundation of a metabolomics platform that uses both in-house and public metabolite libraries to aid identification. We compare the results of data processing using alternative software, such as Genedata's Refiner MS and MetAlign, and assess the robustness of the analytical chemistry by re-analysing samples six months apart. The metabolic trajectories of four experimental tomato genotypes during ripening will then be visualised using time-based Orthogonal PLS modelling, which clearly differentiates genotype, ripening and systematic experimental effects.
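
Orthogonal PLS itself is not available in scikit-learn, so the Python sketch below uses ordinary PLS regression against ripening time as a simplified stand-in for the time-based modelling described above; the feature table, dimensions and effect sizes are synthetic and purely illustrative.

```python
# Minimal sketch: regressing a metabolite feature table against ripening time
# with PLS. OPLS, as used in the talk, is not in scikit-learn, so plain PLS
# regression stands in here; all data are synthetic.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
n_samples, n_features = 40, 300
time = np.linspace(0, 10, n_samples)                # days after onset of ripening
X = rng.normal(size=(n_samples, n_features))
X[:, :20] += np.outer(time, np.ones(20)) * 0.3      # features that track ripening

pls = PLSRegression(n_components=2).fit(X, time)
scores = pls.transform(X)       # sample positions (trajectory) in latent space
print(scores[:5])
```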

Lecture 5: Dr John Langley, University of Southampton

We've always had data, so what's changed? - What's the challenge and what's the solution?

It is worth considering whether the Data to Knowledge problem is a new one. Is this a new phenomenon? The answer is definitely no. Is it an issue today? This time the answer is a resounding yes.

This presentation considers the origins of the Data to Knowledge problem and what has changed to bring the issue to the fore. Examples will show that there is a wealth of information in a variety of data from many applications, ranging from a simple mass spectrum and tandem MS to chromatographic and molecular modelling approaches. Some current informatics tools will be briefly reviewed, and finally there will be an attempt to consolidate some of these approaches into a vision for automated Data to Knowledge in the future.

Lecture 6: James McKenzie, University of York

Data fusion in metabolomic studies

Chemometric approaches are often used for the analysis of complex data sets, such as those obtained from metabolomic studies. Principal Components Analysis, for example, creates new variables from existing ones, whilst Genetic Algorithms maintain the existing variables, selecting those most important for discrimination. Enhanced discriminating power can be achieved by the integrated analysis of multiple data sets, yet chemometric methods are usually applied to data sets on an individual basis. Here, data fusion approaches are used to combine information from complementary data sets obtained by 1H NMR and LC-MS in order to maximise the information extracted. These analytical techniques are both widely used in metabolomic studies, but as yet, few chemometric methods exist to merge the data analysis. The use of two ionisation techniques (ESI and APCI) and the analysis of both positive and negative ions potentially give four data sets from LC-MS alone with no immediately obvious relationship. Furthermore, the combination of data from LC-MS and NMR allows ambiguities in compound identification to be removed. Issues that are of importance in fused systems include the scaling of data sets, feature selection, and ensuring that results can be related back to the original variables.

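As a simplified illustration of one possible fusion strategy (not necessarily the one used in this work), the Python sketch below block-scales synthetic 1H NMR and LC-MS tables so that neither block dominates, concatenates them, and runs a joint PCA whose loadings can be split back into the original blocks; all sizes and the 1/sqrt(variables) block weighting are assumptions.

```python
# Minimal sketch of low-level data fusion: autoscale each block, weight it by
# 1/sqrt(number of variables) so neither block dominates, concatenate, and run
# a joint PCA. Block names and sizes are illustrative; real NMR / LC-MS feature
# tables would replace the synthetic arrays.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
nmr = rng.normal(size=(30, 500))       # 30 samples x 500 NMR bins
lcms = rng.normal(size=(30, 2000))     # 30 samples x 2000 LC-MS features

def block_scale(block):
    scaled = StandardScaler().fit_transform(block)
    return scaled / np.sqrt(block.shape[1])        # equalise total block variance

fused = np.hstack([block_scale(nmr), block_scale(lcms)])
pca = PCA(n_components=2).fit(fused)
scores = pca.transform(fused)

# Loadings split back into the original blocks, so results can be related
# back to the original variables, as the abstract emphasises.
nmr_loadings = pca.components_[:, :nmr.shape[1]]
lcms_loadings = pca.components_[:, nmr.shape[1]:]
print(scores.shape, nmr_loadings.shape, lcms_loadings.shape)
```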

Lecture 7: Kirsten Hobby, AstraZeneca

Semi-quantitation of metabolites by LC-MS to satisfy MIST guidelines: A bespoke solution for delivering concise summary data when facing multiple dose, patient and sampling time variables

The pharmaceutical industry faces new analytical challenges in the practical application of the FDA’s “Metabolites In Safety Testing” (MIST) guidelines, which require the provision of a ‘semi-quantitative’ assessment of drug metabolites in advance of the availability of bio-analytical standard compounds or radio-labelled parent compound. Semi-quantitation of metabolites spans both the quantitative and qualitative mass spectrometry disciplines, and while the LC-MS hardware is capable of creating suitable data, the conversion of these MS data into a clear and concise output is currently a major bottleneck.

The MIST guidelines recommend expressing the ‘semi-quantitation’ of metabolite AUC, assuming a similar response to the parent, relative to the parent AUC when ‘steady state’ plasma concentrations are achieved during multiple dosing. The industry-leading qualitative metabolite mining software is currently unable to provide tabulated summaries of sample groups, and is certainly not capable of calculating AUC data for metabolites from multiple analyses or of accommodating any ‘calibration’ of the metabolite chromatographic peak areas via the independent calculation of ‘MS response factors’.

A rapid processing turnaround from LC-MS datasets to summarised metabolite information is very desirable, as it is highly likely that the need to rapidly confirm or deny the presence of metabolites is intertwined with the need to discover their quantitative significance across both dose and patient groups. A custom application will be described that interfaces directly with the data output from the metabolite mining software and performs the necessary calculation of metabolite AUCs, metabolite-specific ‘calibration’ and a final tabular summary according to dose levels.
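
As a minimal, hypothetical illustration of the calculation the abstract describes (not the custom application itself), the Python sketch below computes metabolite and parent AUCs from peak areas over the sampling times by the trapezoidal rule, applies an assumed MS response factor, and reports the metabolite-to-parent ratio; all values are invented.

```python
# Minimal sketch of the MIST-style calculation: AUC for each metabolite from
# its LC-MS peak areas over the sampling times (trapezoidal rule), optionally
# corrected by an MS response factor, expressed relative to the parent AUC.
import numpy as np

times = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 24.0])      # h post-dose
parent_area = np.array([0, 900, 1200, 1000, 600, 250, 30])   # chromatographic peak areas
metab_area = np.array([0, 60, 150, 220, 240, 150, 40])

response_factor = 1.0   # assume equal MS response to parent unless calibrated

parent_auc = np.trapz(parent_area, times)
metab_auc = np.trapz(metab_area * response_factor, times)
print(f"metabolite/parent AUC ratio: {metab_auc / parent_auc:.2%}")
```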

Vendor Presentations

ACD/Labs

Spectrus – An introduction to the latest in analytical and chemical data to knowledge

ACD/Spectrus is a new generation of analytical and chemical knowledge management software that enables research organizations to extract, retain, and leverage their knowledge in ways not previously possible.

Built on the foundation of our unique capabilities in bringing together disparate forms of analytical and chemical information, ACD/Spectrus enables:

    • Greater efficiency and productivity for Chemists and Spectroscopists

    • Access to legacy knowledge

    • Savings of time and money for the organization

The evolution of ACD/Spectrus will include a portal to the knowledge base, an all-in-one processing tool for routine analytical interpretation; analytical laboratory workbooks for advanced processing and interpretation of analytical, chromatographic, and chemical project data; and an improved databasing architecture.

Spectrus will better enable the transformation of chemical and analytical information into knowledge in a single, integrated platform, complementing your organization's knowledge management infrastructure.

Agilent

Accurate mass measured MSMS & library searching routines - Making target and non-target compound screening a meaningful reality

The use of accurate mass measured data from LC-MS analysis has become more routine in recent years, especially with significant advances in the design and use of orthogonal acceleration-TOF MS. Instrument developments have resulted in greater resolution, higher mass measurement accuracy, increased in-spectrum dynamic range, stability of the instrument to laboratory temperature changes and ease of operation, all resulting in more meaningful and useful data. Whilst accurate mass measurement of a single molecular ion of interest provides the user with some means of determining elemental formulas, the use of accurate mass MSMS data provides more comprehensive information on a molecule's structure. In this paper, data will be presented to show how isotopic patterns from a QTOF instrument can be used with very high degrees of accuracy to deconvolute elemental formulas. In addition, it will be shown how accurate mass measured MS and MSMS information can be used in conjunction to reduce the potential number of elemental formulas that fit a given mass measurement and thus give more detailed structural information. A screening approach will then be presented to show how accurate mass measured MSMS data from LC-QTOF instrumentation can be searched automatically using forward and reverse fit algorithms (similar to those applied to GC-EI data) to identify both target and non-targeted compounds in relation to a specific application.
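
The sketch below gives a deliberately simplified, illustrative flavour of formula determination from accurate mass alone: it brute-forces CHNO compositions within a ppm tolerance of a measured neutral monoisotopic mass. It does not use isotope patterns or MSMS data and is not the vendor's algorithm; the element ranges, tolerance and example mass (caffeine) are assumptions.

```python
# Minimal sketch: brute-force generation of CHNO elemental formulas within a
# ppm tolerance of a measured neutral monoisotopic mass. Real software also
# scores isotope patterns and MS/MS fragments, which this sketch omits.
from itertools import product

MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def candidate_formulas(measured_mass, tol_ppm=5.0, max_atoms=(30, 60, 6, 10)):
    hits = []
    for c, h, n, o in product(*(range(m + 1) for m in max_atoms)):
        mass = c * MONO["C"] + h * MONO["H"] + n * MONO["N"] + o * MONO["O"]
        if abs(mass - measured_mass) / measured_mass * 1e6 <= tol_ppm:
            hits.append((f"C{c}H{h}N{n}O{o}", mass))
    return hits

# 194.0804 Da is the neutral monoisotopic mass of caffeine (C8H10N4O2),
# used here purely as an example input.
for formula, mass in candidate_formulas(194.0804):
    print(formula, f"{mass:.4f}")
```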

Bruker

From gene to compound: Mining for meaning in LC-MS secondary metabolome data sets

Myxobacteria are promising producers of natural products exhibiting potent biological activities, and several myxobacterial metabolites are currently under investigation as potential leads for novel drugs. However, the myxobacteria are also a striking example of the divergence between the genetic capacity for the production of secondary metabolites and the number of compounds that have been characterised to date: the number of identified metabolites is usually significantly lower than expected from genome sequence information. ESI-UHR-Q-TOF analysis of secondary metabolites from both wild-type and mutant myxobacteria produces rich data on known and unknown compounds. Appropriate data handling allows the same data sets to be used both for (1) automated screening for metabolites of interest using high-accuracy MS data and for (2) the discovery of new components using molecular feature discovery algorithms and principal component analysis (PCA). The chemical species found can then be identified by molecular formula and database searching. Finally, structures can be further probed or confirmed by MS/MS analysis, to take the discovery path from the gene knockout to the resulting change in chemical composition.
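
As a toy illustration of the kind of wild-type versus knockout comparison that follows molecular feature finding (and not the vendor's actual software), the Python sketch below flags features in a synthetic intensity table whose mean level changes strongly between the two strains; replicate numbers, distributions and the fold-change cut-off are arbitrary.

```python
# Minimal sketch: flagging molecular features whose intensity changes between
# wild-type and knockout cultures, the kind of comparison that follows feature
# finding and precedes formula/database look-up. The feature table is synthetic.
import numpy as np

rng = np.random.default_rng(5)
n_features = 500
wild = rng.lognormal(mean=5, sigma=0.3, size=(4, n_features))    # 4 replicates
mutant = wild * rng.lognormal(mean=0, sigma=0.1, size=(4, n_features))
mutant[:, :5] *= 0.05            # pretend five features vanish in the knockout

log_fc = np.log2(mutant.mean(axis=0) / wild.mean(axis=0))
changed = np.where(np.abs(log_fc) > 2)[0]    # > 4-fold change, illustrative cut-off
print("features changed by knockout:", changed)
```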

Mestrelab Research

From analytical data to structural information

The fully automated validation of structural proposals based on analytical data has been a long-standing aim for the chemistry community, to which many resources and efforts have been devoted over a period of years. This aim has, however, remained elusive, mainly due to the many difficulties presented by the automatic analysis of NMR spectra, in particular 1H NMR and 2D spectra. In this presentation, we will cover our latest advances in the fully automatic analysis of 1H NMR spectra, and how the derived 1H NMR analysis results are combined with other experimental results in a novel, extremely flexible scoring system which evaluates and potentially ranks structure proposals in full automation. We will present the underlying basis of the system, as well as current results and work in progress, aiming to give the audience an accurate feel for what can currently be achieved with these types of systems, what the current obstacles and difficulties are, and what can be expected in the near future.
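
Purely as an illustration of the general idea of combining several lines of evidence into one ranking (and making no claim about how Mestrelab's scoring system actually works), the Python sketch below combines hypothetical per-technique agreement scores with assumed weights and ranks two candidate structures.

```python
# Purely illustrative sketch of a weighted scoring scheme for ranking candidate
# structures against several lines of analytical evidence. The weights, score
# names and candidates are hypothetical.
WEIGHTS = {"1H_NMR": 0.5, "HSQC": 0.3, "MS": 0.2}   # assumed relative importance

def overall_score(evidence):
    """Combine per-technique agreement scores (0-1) into one weighted score."""
    return sum(WEIGHTS[k] * evidence.get(k, 0.0) for k in WEIGHTS)

candidates = {
    "proposal A": {"1H_NMR": 0.92, "HSQC": 0.85, "MS": 1.0},
    "proposal B": {"1H_NMR": 0.40, "HSQC": 0.75, "MS": 1.0},
}
ranked = sorted(candidates, key=lambda c: overall_score(candidates[c]), reverse=True)
for name in ranked:
    print(name, f"{overall_score(candidates[name]):.2f}")
```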

Thermo

Title TBA

Waters

Using SDMS to manage and reduce complex MS data sets using a simple print capture technique

Waters NuGenesis SDMS is used in many MS laboratories to capture and catalogue the raw data files from multiple MS instrument types. However, the biggest gains in productivity and confident data management result from simply capturing the printed data reports and reusing those data to produce specialised and summarised reports. This presentation will illustrate the principle, along with a number of practical MS examples where this is used.