[2/27/2020] - I realized that this document is a little too long and lacks focus. I have decided that links to supporting information are probably a better approach.
Collaboration. I have written several software packages in materials science and have worked with collaborators at different national labs, as well as with graduate and undergraduate students. There is a fundamental difficulty in sharing information between collaborators and people interested in your work. Scientists generally do not operate through a "top-down" research process, and as a result traditional project management tools such as Gantt charts have little value. Research often changes direction, and this in turn affects research goals. This makes rigid, plan-driven management a poor fit for research software.
Over the course of writing different software packages for materials science, I have decided to standardize my approach to software development. I wrote this document for three purposes: (1) I wasn't really sure what I was doing at the time, so I went around and collected ideas, tried them out, and adjusted them for my purposes; (2) I often collaborate with new students, and there is a tendency for people to do things in the way you don't want them done if you don't provide specific direction; and (3) since I aspire to have a research group one day, it seems easier to develop this documentation piece by piece rather than to write it all at once.
Right now, most of my software development uses Python, so this documentation might be a little Python-specific. The choice of Python is fairly straightforward: Python is lightweight, easy to use, and an ideal language for writing input to be used by third-party codes (such as LAMMPS and VASP) and for processing the output of those codes.
This software development process is pieced together from modern software development practices, incorporating practices popular in open-source projects, modified for the academic environment, and tailored for scientific software development. These choices were used in the development of the Materials Ex Machina (MEXM) package.
Developing software to support computational science efforts requires modifying the Agile methodology. Ultimately, the purpose of any research group is the dissemination of research results. A computational materials science group typically produces three major products: publications, software, and computational results; at the current time, scientific research largely focuses upon publications. This document suggests that centering a research group's efforts around the development of software will increase the productivity of all three.
The direct adoption of an Agile software development process is likely to fail, since that process primarily focuses on DEVELOPING SOFTWARE FOR CUSTOMERS. For a computational materials science group, the researchers are both the producers and the consumers of the software, where the purpose of the software is to produce computational results relevant for publishing.
While it is important to allow researchers some choice of tools, it is also important to maintain a consistent toolchain. There should be a clear preference for particular tools, but depending upon circumstances this should not be a hard constraint. For example, consider a research group that has decided to standardize its software development on Python. Many third-party applications are written in other languages: VASP is written in Fortran, while LAMMPS is written in C++. It may be more expedient to get the job done in the native language of the tool of choice and to release the results as a patch. However, this must be balanced against the tight coupling it creates: your patch is now dependent upon the stability of the third-party code base.
Personally, I prefer an integration approach where your own code creates the input files and parses the output files of the third-party code. This creates what I refer to as loose coupling: if the format of the input or output file changes, then only the classes which produce or parse those files need to be updated. This creates a clean delineation between your software application and the third-party software application.
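A minimal sketch of this loose-coupling idea. The class names, the key=value input format, and the "total energy" output line are all hypothetical, invented for illustration; they are not the formats of any real third-party code:

```python
import re

class InputFileWriter:
    """Serializes a dict of simulation parameters into the input
    format of a (hypothetical) third-party code."""

    def write(self, parameters, path):
        with open(path, "w") as f:
            for key, value in parameters.items():
                f.write(f"{key} = {value}\n")

class OutputFileParser:
    """Extracts results from the (hypothetical) output format.
    If the third-party format changes, only this class needs updating;
    the rest of the application is insulated from the change."""

    energy_pattern = re.compile(r"total energy\s*=\s*([-\d.eE+]+)")

    def parse(self, path):
        with open(path) as f:
            text = f.read()
        match = self.energy_pattern.search(text)
        return {"total_energy": float(match.group(1))} if match else {}
```

The third-party application itself never appears in this code; it only ever sees the files these classes produce and emit, which is the clean delineation described above.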
[ADD UML FILE HERE]
Let us consider the three major products of a computational materials science research group: publications, software, and computational results.
Publications. Ultimately, the production of publishable results is the end goal. In the same way that software development needs to be iterative, the production of documents needs to be iterative as well. As researchers become more familiar with a research topic, their understanding of the problem as well as their approach changes. These changes should be documented, as doing so makes writing the parts of a paper that depend on background information (relevant equations, descriptions of the methods, and citations of prior work) simpler. In essence, a computational science group needs tools to manage its thoughts, notes, and documents.
Software/Computational Tools. It is inevitable that a researcher will use simulation results to make calculations and produce graphs. These artifacts may include spreadsheets, output files from other codes, MATLAB scripts, and one-shot applications. For each project, the principal investigator should establish a work folder and set appropriate permissions. If the visibility of these applications is set appropriately within the group, then later projects can easily build upon this knowledge. In addition, when publishing papers, the ability for readers to follow along and produce similar results provides a level of provenance for the work.
Computational Results. A computational group produces a large amount of data, much of which is lost due to the cost of storage. This is unfortunate considering the high computational cost of some simulations. The setup and results of these simulations are invaluable for software development, as they can be used for different types of integration and regression testing. Additionally, machine learning and artificial intelligence approaches, which are being used increasingly, require large existing datasets.
When designing a software package or computational science automation, it is important to treat computational simulations as steps in a process. As a result, MEXM is largely designed around passing data objects, which can be marshaled to and demarshaled from data files. One should think of each software process as taking in datafiles produced by other software processes and producing datafiles which can be used by other software processes.
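A minimal sketch of this data-object approach. The `SimulationResult` class and its fields are hypothetical stand-ins, not part of MEXM; the point is only the marshal/demarshal pattern:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SimulationResult:
    """A hypothetical data object passed between software processes."""
    structure_id: str
    total_energy: float

    def marshal(self, path):
        # write the object to a datafile another process can consume
        with open(path, "w") as f:
            json.dump(asdict(self), f)

    @classmethod
    def demarshal(cls, path):
        # reconstruct the data object from a datafile
        with open(path) as f:
            return cls(**json.load(f))
```

One process ends by calling `marshal`, the next begins by calling `demarshal`; the datafile on disk is the only contract between them.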
I need to figure out a workflow toolkit that can work on a cluster, with some level of persistence using either a flat file or sqlite3, and that can maintain state using a cron job. The best way to do this is probably by implementing a sqlite3 schema. I need to think of this as a software-as-a-service approach, where applications communicate with each other through a message bus. But more on that later!
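A rough sketch of what such a sqlite3 schema might look like. The table layout and the state names are assumptions, not a worked-out design; a cron job could periodically call `pending_jobs` to find work that still needs attention:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id INTEGER PRIMARY KEY AUTOINCREMENT,
    name   TEXT NOT NULL,
    state  TEXT NOT NULL DEFAULT 'queued'  -- queued / running / done / failed
)
"""

def connect(db_path):
    # a flat sqlite3 file gives persistence across cron invocations
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn

def submit_job(conn, name):
    cur = conn.execute("INSERT INTO jobs (name) VALUES (?)", (name,))
    conn.commit()
    return cur.lastrowid

def update_state(conn, job_id, state):
    conn.execute("UPDATE jobs SET state = ? WHERE job_id = ?", (state, job_id))
    conn.commit()

def pending_jobs(conn):
    # the cron job polls this to pick up queued work
    rows = conn.execute("SELECT job_id, name FROM jobs WHERE state = 'queued'")
    return rows.fetchall()
```

Because sqlite3 is in the Python standard library and stores everything in a single file, this works on a cluster head node with no database server to maintain.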
One of the issues with input and output files is that parsing proprietary formats is difficult. For the purposes of automation, it is better to use a standards-based format for data and then have utility programs that can create and display these files. JSON, YAML, and XML have become ubiquitous formats for this purpose. The differences between these formats are discussed in [1].
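For illustration, the same record can be serialized with Python's standard library in both JSON and XML (the record itself is made up; YAML support requires a third-party library such as PyYAML):

```python
import json
import xml.etree.ElementTree as ET

record = {"element": "Ni", "lattice_constant": 3.52}

# JSON: native support in the standard library, compact and easy to parse
json_text = json.dumps(record)

# XML: more verbose, but also supported by the standard library
root = ET.Element("structure")
for key, value in record.items():
    child = ET.SubElement(root, key)
    child.text = str(value)
xml_text = ET.tostring(root, encoding="unicode")
```

Either way, the parsing burden moves from hand-written readers of a proprietary format to well-tested standard-library code.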
The days when computational research groups had dedicated computing clusters are over. It is more common for research groups to purchase or be allocated resources at one or more high-performance computing (HPC) centers. I saw this change during my PhD at the University of Florida: when I started (2013), our research group purchased and maintained its own MPI clusters, but we eventually moved to computational resources maintained by the University of Florida, the Department of Energy, or the National Science Foundation (2019). The situation is similar where I am now, at Ohio State University (2020).
Moreover, computational scientists may have resources as part of their research grant. These can include hours from project sponsors (such as the Department of Energy) or from a granting institution such as the National Science Foundation (NSF).
At the current time, most computational efforts involve the use of third-party codes such as VASP for density functional theory (DFT) and LAMMPS for molecular dynamics. Early efforts to automate computational materials science workflows were designed for a singular purpose, and code reuse was largely an afterthought.
These are some notes I am putting together on a software development process to support a scientific computing environment.
The waterfall model is a breakdown of project activities into linear sequential phases, where each phase depends on the deliverables of the previous one and corresponds to a specialisation of tasks. The approach is typical for certain areas of engineering design. In software development, it tends to be among the less iterative and flexible approaches, as progress flows in largely one direction ("downwards" like a waterfall) through the phases of conception, initiation, analysis, design, construction, testing, deployment and maintenance. [Wikipedia - Waterfall method].
The traditional method of software development is known as the Software Development Lifecycle (SDLC).
The waterfall model has a fundamental problem: it assumes that the software project can be fully defined at the beginning of the development effort. Development in a research environment is fast-paced and iterative in nature, so both the methodology and the software must be continually updated and refined. More recent software development models support iterative development and implement concepts of continual improvement; these approaches include Rapid Application Development and Agile [Wikipedia - Rapid Application Development] [Wikipedia - Agile]. The construction phase of an iterative development model consists of programming and application development, coding, unit and integration testing, and system testing.
However, the Agile Software Development process has fundamental weaknesses [2]. For reference, the four values of the Agile Manifesto are:
Individuals and interactions over processes and tools.
Working software over comprehensive documentation.
Customer collaboration over contract negotiation.
Responding to change over following a plan.
"Agile methodologies core is people. People are the main factor for success or failure. One of the main important principles in agile methodologies is that people can respond earlier and faster. Also fast response to emergent issues in agile methodologies enhance the overall process since it doesn’t require following procedures for writing documents and calling several level of managers."[2] This makes Agile-style processes ideal for the research environment, where researchers work on projects separately, within their own research group, and with external collaborators, while maintaining the software for the general interest.
One of the main reasons for using iterative development is that it allows the developer to benefit from what was learned through the development of earlier deliverable versions of the system [3]. This matches how science typically progresses, with code developing over time.
The purpose of this process is to reduce technical debt, which is the cost of additional rework caused by choosing an easy solution now instead of a better approach that would take longer. In the development of computational solutions, technical debt may come in several forms, stemming from both software architecture issues and theoretical/methodological implementation considerations.
Unit tests are written by developers and test a single unit of code. Unit testing is a method by which individual units of code are tested to determine whether they are fit for use. It helps to reduce the cost of bug fixes, since bugs are identified during the early phases of the development lifecycle.
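A minimal illustration of unit testing with Python's built-in unittest module; the `lattice_volume` function is a made-up example, not taken from any real package:

```python
import unittest

def lattice_volume(a):
    """Volume of a cubic cell with lattice constant a (illustrative unit)."""
    if a <= 0:
        raise ValueError("lattice constant must be positive")
    return a ** 3

class TestLatticeVolume(unittest.TestCase):
    """Each test exercises the unit in isolation."""

    def test_cubic_volume(self):
        self.assertAlmostEqual(lattice_volume(2.0), 8.0)

    def test_rejects_nonpositive(self):
        with self.assertRaises(ValueError):
            lattice_volume(-1.0)
```

Tests like these run in milliseconds, so they can be executed on every commit, catching bugs long before a simulation campaign depends on the code.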
Integration testing is executed by testers and tests the integration between software modules. It is a software testing technique where individual units of a program are combined and tested as a group. Test stubs and test drivers are used to assist in integration testing. Integration testing is performed in two ways: the bottom-up method and the top-down method.
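A small sketch of the bottom-up style: two individually testable units (a formatter and a parser, both made up for illustration) are combined and verified as a group:

```python
def format_record(record):
    """Unit 1: serialize a dict of strings into 'key=value' lines."""
    return "\n".join(f"{key}={value}" for key, value in record.items())

def parse_record(text):
    """Unit 2: parse 'key=value' lines back into a dict of strings."""
    result = {}
    for line in text.splitlines():
        key, _, value = line.partition("=")
        result[key] = value
    return result

def test_format_parse_roundtrip():
    """Integration test: the two units must agree with each other.

    Each unit can pass its own unit tests yet still disagree about
    the shared format; only a combined test catches that."""
    record = {"element": "Cu", "potential": "eam"}
    assert parse_record(format_record(record)) == record
```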
Functional testing - verifies that the software fulfills the functional requirements of the system.
Non-functional testing - verifies non-functional requirements of the system, such as performance, reliability, and usability.
Regression testing - performed after code fixes, upgrades, or any other system maintenance to check that the new code has not broken the existing code.
The Kanban Template. Use this simple Kanban template to keep the engineering team on the same page and moving through work fluidly.
Break down the roadmap by adding tasks as cards to the Backlog list.
Move the cards one-by-one through Design as they become more fleshed out.
When a card is fully specced out and designs are attached, move it to To Do for engineers to pick up.
Engineers move cards to Doing and assign themselves to the cards, so the whole team stays informed of who is working on what.
Cards then move through Code Review when they're ready for a second set of eyes. The team can set a List Limit (with the List Limit Power-up) on the number of cards in Code Review, as a visual indicator for when the team needs to prioritize reviews rather than picking up new work.
Once cards move through Testing and eventually ship to production, move them to Done and celebrate!
[1] William Wong, "What's the Difference Between JSON, XML, and YAML?" Link.
[2] Adel Hamdan Mohammad and Tariq Alwada'n, "Agile software methodologies: strength and weakness," International Journal of Engineering Science and Technology 5.3 (2013): 455. Link.
[3] C. Larman and V.R. Basili, "Iterative and Incremental Development: A Brief History," Computer 36.6 (2003): 47-56.
Coding Standards
Software Package Standards
Source Code Control
Developing New Features and Testing