Developing models in systems biology is a complex, knowledge-intensive activity. Some modelers (especially novices) benefit from model development tools with a graphical user interface (GUI). However, advanced model development is frequently done using text-based representations of models. At present, the tools for text-based model development are limited, typically just a text editor that provides features such as copy, paste, find, and replace. Since these tools are not "model aware", they do not provide features for model building (e.g., autocompletion of species names), model analysis (e.g., hover messages that provide information about chemical species), and model translation (converting between model representations). We refer to these collectively as BAT (building, analyzing, translating) features.
We developed the Antimony Web Editor (AWE), a tool for building, analyzing, and translating models written in the Antimony modeling language, a human-readable representation of SBML models. AWE is a source editor, a tool with language-specific features. For example, there is autocompletion of variable names to assist with model building, hover messages that aid in model analysis, and bidirectional editing of model representations to improve model translation. These features are made possible by incorporating several sophisticated capabilities into AWE: analysis of the Antimony grammar (e.g., to identify model symbols and their types); a query system for accessing knowledge sources for chemical species and reactions; and automatic conversion between different model representations (e.g., between Antimony and SBML).
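Below is a minimal sketch of what an Antimony model looks like and how it maps to SBML. It assumes the tellurium Python package for the Antimony toolchain, and the model itself is illustrative rather than taken from AWE.

```python
import tellurium as te  # assumed available; bundles the Antimony toolchain

# An illustrative two-reaction pathway written in Antimony.
ANTIMONY_MODEL = """
model simple_pathway
    J1: S1 -> S2; k1*S1
    J2: S2 -> S3; k2*S2
    S1 = 10; S2 = 0; S3 = 0
    k1 = 0.5; k2 = 0.2
end
"""

r = te.loada(ANTIMONY_MODEL)     # compile Antimony into a simulatable model
sbml = r.getSBML()               # the equivalent SBML representation
result = r.simulate(0, 20, 100)  # time course from t=0 to t=20 in 100 points
```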
A beta of AWE will be available in April 2024. A prior implementation of the AWE features, VSCode-Antimony, is available as an open source extension in the VSCode Marketplace. Details of the VSCode-Antimony implementation can be found in a publication in Oxford Bioinformatics.
This is joint work with a number of students, research staff, and faculty at the University of Washington. In the Allen School of Computer Science: Steve Ma, Adrich Fan, Eva Liu, Sai Anis Konanki, Edison Shunming, and Sam Chou. In BioEngineering, Prof. Herbert M Sauro and Dr. Lucian Smith. In Biomedical and Health Informatics, Prof. John H Gennari. In the eScience Institute, Dr. Joseph L Hellerstein.
A Two Species Harmonic Oscillator With An Exact Solution in the Time Domain
Oscillatory behavior is critical to many life-sustaining processes such as cell cycles, circadian rhythms, and Notch signaling. Important biological functions depend on the characteristics of these oscillations (hereafter, oscillation characteristics or OCs): frequency (e.g., event timings), amplitude (e.g., signal strength), and phase (e.g., event sequencing). This work is a theoretical study of an oscillating reaction network that quantifies the relationships between OCs and the structure and parameters of the reaction network. We consider a reaction network with two species whose dynamics can be described by a system of linear differential equations, which makes it possible to construct a two-species harmonic oscillator (2SHO). We obtain exact, closed-form formulas for the OCs of the 2SHO. These formulas are used to develop the parameterizeOscillator algorithm, which parameterizes the 2SHO to achieve desired oscillation characteristics. The OC formulas are also employed to analyze the roles of the reactions in the network and to comment on other studies of oscillatory reaction networks. For example, others have stated that nonlinear dynamics are required to create oscillations in reaction networks; the 2SHO is a counterexample, since it is an oscillating reaction network whose dynamics are described by a system of linear differential equations. A further insight of potential interest is that our formulas show that to obtain oscillations in the 2SHO, the rate of the reaction that causes negative feedback must exceed the rate of the reaction that causes positive feedback. Details of this work are reported in Joseph L. Hellerstein, "An Oscillating Reaction Network With an Exact Closed Form Solution in the Time Domain," BMC Bioinformatics 24, 466 (2023).
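As a minimal sketch of the linearity point (not the 2SHO parameterization from the paper), the system below has a Jacobian with zero trace and positive determinant, so its eigenvalues are purely imaginary and the two species oscillate indefinitely; a real reaction network would oscillate around a positive fixed point so that concentrations stay nonnegative.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative Jacobian: trace 0 and determinant 1, so eigenvalues are +/- 1j
# and the linear system dx/dt = A x oscillates at 1 radian per unit time.
A = np.array([[1.0, -2.0],
              [1.0, -1.0]])

def rhs(t, x):
    return A @ x

sol = solve_ivp(rhs, [0.0, 20.0], [1.0, 0.0], max_step=0.01)
print(np.linalg.eigvals(A))  # [0.+1.j, 0.-1.j]: sustained oscillation
```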
Control engineering deals with designing feedback loops with objectives such as regulation to a setpoint, elimination of oscillations, and fast settling times. ControlSBML is a Python package that leverages the Python Control Systems Library (based on a similar MATLAB package) to do control engineering with SBML models. ControlSBML embeds a capability to simulate SBML models and produces TransferFunction objects that can be used in the Python Control Systems Library. Doing so is helpful with the following tasks (a short sketch follows the list):
controllability analysis to select system inputs and outputs;
system identification by fitting SBML models to transfer functions; and
control design by producing open loop and closed loop transfer functions and a grid search for "optimal" control designs.
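The sketch below illustrates the kind of Python Control Systems Library objects that ControlSBML produces; the transfer function and controller gains are hypothetical rather than fit from an SBML model.

```python
import control  # the Python Control Systems Library

# Hypothetical open-loop transfer function, as if obtained by system
# identification against an SBML model: G(s) = 2 / (s^2 + 3s + 2).
G = control.TransferFunction([2.0], [1.0, 3.0, 2.0])

# A proportional-integral controller with illustrative gains:
# C(s) = (s + 0.5) / s.
C = control.TransferFunction([1.0, 0.5], [1.0, 0.0])

closed_loop = control.feedback(C * G, 1)   # unity-feedback closed loop
t, y = control.step_response(closed_loop)  # settling toward the setpoint
```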
ControlSBML is an open source Python package on GitHub and can be installed using pip install controlSBML. It has been under active development since 2022 and has been used in the course BIOEN 498/599 "Advanced Biological Control Systems" in the University of Washington Department of Bioengineering.
The Reproducibility Portal is a website for accessing biomedical papers, their simulation models, and associated data. The effort is part of the NIH-funded Center for Reproducible Biomedical Models. The website includes a search capability, displays of abstracts and models, and a flexible plotting facility for data. Links are provided to the full paper and to the BioSimulations website, which provides tools for running simulations. Project participants are at the University of Washington. Allen School of Computer Science: Tony Yuan, Tanya Naveem, Juliette Park. Department of BioEngineering: Joseph L. Hellerstein, Herbert M. Sauro, Lucian Smith.
Developing credible biomedical models is often essential for creating novel medical diagnostics and commercially viable metabolic pathways. A central challenge in the development of these models is properly identifying the chemical species in reactions, and the reactions themselves. For example, there are over a thousand variants of glucose. Such identifications are achieved through the use of annotations, standards for specifying the elements of biomedical models. Selecting appropriate annotations remains a barrier to constructing credible biomedical models. This motivated our development of an automated model annotation system (AMAS). AMAS uses information embedded in the models, such as species names and the reactions in which they participate. By extending techniques from Natural Language Processing (NLP), AMAS scores possible annotations of model elements, which are then presented to modelers to examine in more detail. AMAS is being developed as a pip-installable Python package. Details of the work can be found in this publication in Oxford Bioinformatics.
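The toy sketch below conveys the flavor of name-based annotation scoring; it is not AMAS's actual algorithm, and the candidate terms are shown only for illustration.

```python
# A toy illustration of scoring candidate annotations for a species name.
from difflib import SequenceMatcher

# Hypothetical candidate ChEBI annotations for a species named "glucose_c".
CANDIDATES = {
    "CHEBI:17234": "glucose",
    "CHEBI:4167": "D-glucopyranose",
    "CHEBI:17925": "alpha-D-glucose",
}

def score_annotations(species_name, candidates):
    """Rank candidate annotations by string similarity to the species name."""
    scores = {
        term: SequenceMatcher(None, species_name.lower(), label.lower()).ratio()
        for term, label in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: -item[1])

print(score_annotations("glucose_c", CANDIDATES))
```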
This is joint work with Woosub Shin (University of Auckland), John H Gennari (Biomedical and Health Informatics), and Herbert M. Sauro (Department of Bioengineering).
This project studies evolutionary adaptation between two organisms (Desulfovibrio vulgaris, a bacterium, and Methanococcus maripaludis, an archaeon) that do not co-occur in any known ecosystem. We study how these two organisms adapt to an artificial environment in which they are grown together. The data collected describes changes in genotype (mutations) and phenotype (growth rate and yield). Progress to date includes: identifying coordinated groups of mutations that occur within species (e.g., for changes in transport, signaling, and metabolism); identifying relationships between mutations of different species (e.g., for metabolic synergies); and developing a regression model that relates genotype to phenotype. This is joint work with David Stahl (Civil & Environmental Engineering, University of Washington), Nejc Stopnisek (Michigan State University), and Serdar Turkarslan (Institute for Systems Biology).
Hundreds to thousands of kinetics models of biological systems have been developed to study gene regulation and metabolic pathways, evaluate drug therapies, and support other applications. For example, the BioModels database contains approximately a thousand models described using SBML (systems biology markup language). Models written in SBML (or similar markup languages) are a special kind of software that deals with mass action kinetics. However, current practice for model development fails to consider testing, debugging, design for reuse, design patterns, or any of the vast knowledge commonly applied by software engineers. As a result, most kinetics models are relatively small (perhaps 25 reactions), rarely reused or even reproducible, and scale poorly. This project is about bringing well-understood tools and techniques from software engineering into the development of kinetics models to improve the quality of the models, speed their development, and increase scalability. Two efforts are underway: TemplateSB, which provides kinetics models with a template capability to reduce the number of reactions that must be written; and SBMLlint, which checks conservation of mass in SBML models (a sketch of the underlying idea follows the citation below). This is joint work with Herbert Sauro (BioEngineering, University of Washington). Key publication is
· Joseph L. Hellerstein, Stanley Gu, Kiri Choi, and Herbert Sauro. Recent Advances in Biomedical Simulations: A Manifesto for Model Engineering. To appear in F1000. It details how various technologies and best practices in software engineering can aid in building biomedical models.
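Below is a minimal sketch of the idea behind checking conservation of mass, not SBMLlint's implementation: conserved moieties correspond to vectors in the left null space of the stoichiometry matrix.

```python
import numpy as np
from scipy.linalg import null_space

# Stoichiometry matrix for the network A -> B, B -> A
# (rows are species A and B; columns are the two reactions).
N = np.array([[-1.0,  1.0],
              [ 1.0, -1.0]])

conserved = null_space(N.T)  # left null space of N
print(conserved)             # ~[0.707, 0.707]: the total A + B is conserved
```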
Digital spreadsheets are arguably the most pervasive environment for end user programming on the planet. Although spreadsheets simplify many calculations, they fail to address requirements for expressivity, reuse, complex data, and performance. SciSheets (from "scientific spreadsheets") is an open source project that provides novel features to address these requirements: (1) formulas can be arbitrary Python scripts as well as expressions (formula scripts), which addresses expressivity by allowing calculations to be written as algorithms; (2) spreadsheets can be exported as functions in a Python module (function export), which addresses reuse since exported code can be reused in formulas and/or by external programs and improves performance since calculations can execute in a low-overhead environment; and (3) tables can have columns that are themselves tables (subtables), which addresses complex data such as hierarchically structured data and n-to-m relationships. SciSheets is hosted at https://github.com/ScienceStacks/SciSheets (a hypothetical formula script is sketched after the citation below). Key publication is
· A Clark and JL Hellerstein. SciSheets: Providing the Power of Programming with the Simplicity of Spreadsheets. Scientific Computing With Python (SciPy), 2017. DOI 10.25080/shinma-7f4c6e7-007. It describes the motivation for, use cases of, and design of SciSheets.
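The sketch below is a hypothetical example of a formula script (feature 1 above): the column is computed by a small algorithm rather than a single expression, and after function export the same code could be called from an external program. All names are illustrative, not SciSheets' actual output.

```python
import numpy as np

def michaelis_menten(substrate, vmax=1.0, km=0.5):
    """A column formula written as a script: Michaelis-Menten velocities."""
    substrate = np.asarray(substrate, dtype=float)
    return vmax * substrate / (km + substrate)

# After function export, the formula is reusable outside the spreadsheet.
velocities = michaelis_menten([0.1, 0.5, 1.0, 5.0])
```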
This self-paced course is intended for social science professionals who are familiar with the web and spreadsheets, but lack programming experience. The course describes a data process used at Google for preparing, analyzing, and applying data to organizational needs. My role was the subject matter expert who led the development of the content. The course has been featured in the Huffington Post and TechCrunch.
Motivated by a need to identify future workloads for the Google cloud that are big data and/or big compute, I initiated a BigScience project with the following objectives: (a) drive integration within the Google cloud; (b) identify new value-add capabilities; (c) make contributions to data-intensive science; and (d) drive revenue. The project had two engineering initiatives: (1) Exacycle, a low-cost supercomputer constructed from “waste cycles” in the Google Cloud, and (2) BigApply, a cloud infrastructure for scientists to manage parallel jobs. Some of the science results from Exacycle include insights into the mechanisms for G-Protein Coupled Receptors.
The traditional approach to capacity planning and performance tuning in Google is to use traces to drive benchmarks, an approach that provides little ability to do workload forecasting or to exploit workload characteristics in scheduling. I led efforts that: (a) characterized task resource consumption (Sigmetrics Performance Evaluation Review, 2010); (b) demonstrated that task resource consumption is fairly constant during the execution of most long-running, resource-intensive tasks (The 5th Workshop on Large Scale Distributed Systems and Middleware, Seattle, 2011); (c) characterized and modeled task placement constraints that restrict which resources tasks can consume (Symposium on Cloud Computing, 2011); and (d) released the Google Cluster Data (~50 variables for 1 month of data). The results have been used in Google for performance tuning, capacity planning, and scheduler design.
Managing Google compute clusters requires tools that can answer what-if questions about the performance impact of changes in hardware, jobs, and scheduling policies. I led a team that built tools used by cluster administrators and service managers to diagnose and resolve problems with scheduling quality. Our contributions include: (a) scaling the tools to be used for all submissions of production jobs; (b) developing dashboards that provide actionable explanations of why a job experiences scheduling delays; and (c) inventing and deploying a new technology for what-if analysis of scheduling quality using resource shapes supplied by machines and demanded by jobs.
Google's storage stack contains multiple layers, a structure that greatly complicates the identification and resolution of performance problems. Motivated by problems with intermittent long response times (tail latencies) in the storage stack, I formed a team to use the Dapper RPC tracing tool to analyze problems with storage latencies. We developed tools that identified and led to the resolution of performance problems in several key Google storage technologies.
The Google data center operating system employs a charge-back system to ensure that cloud resources are allocated in accordance with corporate investments. However, jobs often specify resource requirements far in excess of what they actually consume, thereby underutilizing expensive computing resources. Resource estimation provides a mechanism to dynamically adjust resource requirements in a way that makes available resources that can be used speculatively by other jobs. My contributions were to: (a) provide a formal analysis of the resource estimation feedback mechanism (which identified several bugs) and (b) optimize the choice of parameters used in the feedback mechanism to achieve better resource utilization. Further, I designed and implemented a two-tier scheme for resource management that provides an explicit mechanism for speculative use of resources.
The requirements for thread management depend on the specifics of the workloads: I/O-intensive workloads do better with a high level of concurrency, while CPU-intensive workloads prefer lower concurrency. To address these varied requirements, I modified the Google thread manager to provide a mechanism for user policies for managing thread concurrency levels.
The .NET thread pool in version 3.x was a major source of bugs because of the use of complex, interrelated thresholds to determine concurrency levels. I developed a thread pool based on a simple hill-climbing principle, which was introduced in .NET 4.0 (Sigmetrics Performance Evaluation Review, 2009). Reports to date are that reliability and performance have improved considerably.
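The sketch below illustrates the hill-climbing principle, not the .NET 4.0 implementation: perturb the concurrency level, keep moving in the direction that improves throughput, and reverse direction otherwise. The measure_throughput callback is hypothetical.

```python
def tune_concurrency(measure_throughput, threads=4, step=1, iterations=50):
    """Hill climbing on the thread count to maximize observed throughput."""
    best = measure_throughput(threads)
    for _ in range(iterations):
        candidate = max(1, threads + step)
        observed = measure_throughput(candidate)
        if observed > best:
            threads, best = candidate, observed  # keep climbing this way
        else:
            step = -step                         # reverse direction
    return threads
```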
This project was motivated by problems at Microsoft with meeting service level expectations for responding to software errors. It turned out that the processes for sustaining engineering in Developer Division involved 7 to 10 different departments with largely informal data exchanges and service expectations. I modeled the process, and identified several optimizations that significantly reduced delays and increased reliability. The results drove the development of a new dashboard for tracking bugs and fixes. Aspects of this work are reported in Network Operations and Management, April, 2008.
Developed formal methodologies and constructed engineering patterns to enable software practitioners to apply control engineering to computing systems. This work was motivated by the observation that software systems often scale poorly because of deficiencies in the way they handle dynamics, especially changes in workloads and resource characteristics (e.g., failures). The problem of dynamics exists in many engineering disciplines, such as mechanical, electrical, and aeronautic engineering. In these areas, control theory has proved to be an effective tool for analysis and design. The goal of this work has been to achieve similar benefits in engineering software systems.
Prior to this work, there were a number of papers that used control theory to analyze data networks and servers. While these efforts developed interesting techniques, they did not address the objective of providing practical results for professional software engineers. To achieve this objective, three steps were required: (1) identify a simple subset of control theory that can be easily digested by software engineers and provides a good correspondence between model predictions and the empirical observations of dynamics, especially settling times and oscillations; (2) demonstrate that these techniques can provide substantial improvements in commercial products with production workloads; (3) develop educational materials and pedagogy to teach control theory and its application to software practitioners.
The first step, identifying an appropriate subset of control theory, was largely accomplished through a series of "science experiments" on IBM's Lotus Domino Server and the Apache Web Server. We used discrete, deterministic, linear time-invariant (LTI) systems with pole placement design. Our innovations were largely in modeling, as described in "Using Control Theory to Achieve Service Level Objectives in Performance Management" (Real Time Systems Journal, 2002, 159 citations), and in system identification for multiple input multiple output (MIMO) models, as described in "Using MIMO Feedback Control to Enforce Policies for Interrelated Metrics With Application to the Apache Web Server" (Network Operations and Management, 2002, 95 citations).
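The sketch below is a generic discrete-time pole placement illustration using the python-control package; the state-space model is made up for the example, not the Domino or Apache models from the cited papers.

```python
import numpy as np
import control

# Illustrative discrete-time LTI system x[k+1] = A x[k] + B u[k]
# with open-loop poles at 0.8 and 0.7.
A = np.array([[0.8, 0.1],
              [0.0, 0.7]])
B = np.array([[0.0],
              [1.0]])

# Place the closed-loop poles at 0.5 and 0.6 (inside the unit circle) for
# fast, non-oscillatory settling, using state feedback u[k] = -K x[k].
K = control.place(A, B, [0.5, 0.6])
print(np.linalg.eigvals(A - B @ K))  # -> [0.5, 0.6]
```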
The second step, demonstrating the value of control theory in commercial products, has been an ongoing effort for several years. In 2005, IBM shipped the Universal Database Server (DB2) v8.2, which incorporates utilities throttling, a solution to a persistent problem that administrators have with ensuring that utilities such as BACKUP, RESTORE, and REBALANCE make progress but do not cause excessive performance degradation of production work. As detailed in "Throttling Utilities in the IBM DB2 Universal Database Server" (American Control Conference, 2004), our methodology for control engineering played a central role in DB2 v8.2, especially in designing an effective actuator using self-imposed sleep, estimating the impact on production work, and designing a controller that responds quickly but does not oscillate. It turns out that the Microsoft Hotmail team also had challenges with managing administrative work, and they used our DB2 solution for their design.
Release 9.1 of the DB2 product contains another feature in which control engineering plays a central role: self-tuning memory management. Administrators face a considerable challenge in choosing the correct size of buffer pools for mixes of production workloads. In concept, this is a constrained optimization problem in which the objective is to minimize data access times by allocating memory to buffers subject to a constraint on the total size of memory. As discussed in "Incorporating Cost of Control Into the Design of a Load Balancing Controller" (invited paper, Real-Time and Embedded Technology and Application Systems Symposium, 2004), this problem can be recast as a simple regulatory control. This turns out to be a significant advance in control modeling, since the approach has broad application to load balancing problems, which are very common in resource management of computing systems.
Last, I have made progress with education on the theory and practice of control engineering for software systems. The starting point was to write a textbook for software professionals and researchers, Feedback Control of Computing Systems (Wiley, 2004, 136 citations). Next, I gave tutorials and short classes at ACM Sigmetrics, the University of California at Berkeley, and Stanford University. Ultimately, this led to full-semester courses at Columbia University (Spring & Fall, 2004) and the University of Washington at Seattle (Winter, 2008).
Key publications are:
Feedback Control of Computing Systems (with Yixin Diao, Sujay Parekh, and Dawn Tilbury), Wiley (2004).
"Using MIMO Feedback Control to Enforce Policies for Interrelated Metrics With Application to the Apache Web Server," Y Diao, N Gandhi, JL Hellerstein, S Parekh, and DM Tilbury. Network Operations and Management, April 15-19, 2002, pp. 219-234. Best paper award.
"Using Control Theory to Achieve Service Level Objectives in Performance Management," S Parekh, N Gandhi, JL Hellerstein, D Tilbury, TS Jayram, J Bigus, Real Time Systems Journal, Vol.23, No. 1-2, 2002. [159 citations]
"Achieving Service Rate Objectives With Decay Usage Scheduling," IEEE Transactions on Software Engineering, Vol. 19, 1993, pp. 813-825.