First Steps in Scientific Programming

Scientific programming is a relatively well defined concept in which software design and implementation are applied to analyse data acquired with various scientific instruments, to build mathematical models, to compare the results of models against data, to explore and visualise data, to pursue unexpected results leading to the discovery of new phenomena, to process raw data into new data products, to do statistics, to simulate possible results using Monte-Carlo techniques, and more. In my experience, software to perform many of these tasks does exist, but it is rarely enough to do everything necessary. It then becomes necessary to get your hands dirty and start programming. After all, a good percentage of scientific work is at the leading edge, pushing technology and creating new ways of solving problems, and scientific programming plays a big role in it.

This book is not focused on teaching you a specific programming language; for that, there are hundreds of excellent references plus the resources available on the Internet. My objective is to provide you with guidance and programming tips which apply to nearly every language: principles which you need to grasp in order to construct a piece of software suitable for scientific applications.

Available from iBooks as an e-book and from Amazon as an e-book and print on demand.

Who would benefit from reading this book?

Researchers in various fields of science and engineering face similar challenges when solving their problems. What changes is the subject: physics, astronomy, engineering, chemistry, biology, Earth observation and many others.

My expertise is in the physical sciences, astronomy to be specific, but I've had the chance to notice how similar the situations are in other branches. Naturally, I focus on some aspects related to physics, astronomy and engineering, like how to treat time.

I consciously focused on programming techniques instead of on a single language. The reason is that there is such a variety of languages that one has to be flexible; besides, it is vital to recognise early on that no single language can solve all the problems faced in research projects.

About the author

Patricio Ortiz holds a PhD in astronomy from the University of Toronto, Canada. He has a keen interest in programming as a means to create tools to help his research when no tools were available. He has taught at the graduate and undergraduate levels in astronomy, instrumentation, and applied programming. Throughout his career, Patricio has interacted with students at every level as well as post-graduates, which has helped him identify the most critical subjects needed by young scientists in the physical sciences and usually not covered by the current literature. He has worked on projects involving automated supernova detection systems and the detection of fast-moving solar system bodies, including near-Earth objects, and he was involved in the Gaia project (European Space Agency) for nearly ten years. Patricio also developed an ontology system used since its conception by the astronomical community to identify equivalent quantities. He also worked on an Earth observation project, which gave him the opportunity to work extensively with high-performance computers, leading to his development of an automated task-submission system which significantly decreased the execution time of data reduction for extended missions.

Patricio now works as a Research Software Engineer at the Department of Automatic Control and Systems Engineering, University of Sheffield. He uses C, Fortran, Python, Java and Perl as his main toolkits, and as a pragmatic person, he uses the language which suits a problem best. Amongst his interests are: scientific data visualisation as a discovery tool, photography and (human) languages.

Email: firststepsinsciprog@gmail.com / pfortiz@protonmail.com

Why write this book?

I wrote this book based on two premises: a) knowledge inside my head is useless if I don't share it with others who could benefit from it, by making the learning curve associated with programming in a scientific environment less steep; and b) I noticed a niche, a gap in the available literature. As our field is relatively small, there is not much material focused on it. There are plenty of great books on how to learn computer languages, but their examples do not apply to science. There are plenty of books now about big-data science, but their paradigm is different from ours. I also want to open readers' minds and show them that programming well is more important than the language used, as during their careers they will probably have to learn not one but several computer languages and apply them to different situations.

Summary of contents

1: The modern computer

I believe it is essential for users to know what parts a computer is composed of, whether we are talking about a smartphone, a laptop, a desktop or a supercomputer. I go through the most relevant components so that readers understand the role each plays in the execution of a program: the CPU (central processing unit), memory, the GPU (graphics processing unit), permanent storage (hard disks, both local and remote), and the operating system (OS).

2: How computers store information

As most scientific projects involve reading and writing data, it is vital to understand that data is stored in binary format and how the information is organised inside the storage devices. Input and output (I/O) must be understood because it is one of the reasons a program may slow down significantly if it is not done correctly. It is also crucial to understand how the operating system protects files from being accessed by other users; this is a key point when working on large projects.
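As a small illustration of the I/O point above, here is a minimal Python sketch (the file name and sizes are arbitrary choices for the example): reading a binary file in large blocks keeps the number of system calls low, which is usually what keeps I/O from becoming a bottleneck.

```python
import os
import tempfile

# Create a small binary file of sample data (256 KiB), then read it
# back in 64 KiB blocks. The block size is illustrative; the point is
# to read many bytes per call rather than one byte at a time.
payload = bytes(range(256)) * 1024

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

chunks = []
with open(path, "rb") as f:          # "rb": read in binary mode
    while True:
        block = f.read(64 * 1024)    # one system-level read per block
        if not block:                # empty bytes object marks end of file
            break
        chunks.append(block)

data = b"".join(chunks)
os.remove(path)
assert data == payload               # we recovered exactly what we wrote
```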

3: Which language to use?

This is a question without a unique answer, as there is a variety of good languages available. Many times you will be forced to use a language because it is the one the project uses, and it may not be a language you have studied before. What I try to emphasise is that one should see a language as a tool, and as such, there is no such thing as a universal language. Some will be superior in specific tasks, but not in others. You may also have to decide between a language which allows you to program very quickly but is not designed for high performance and one which runs very fast but is complex to program. Sometimes you have the choice between a compiled language, a scripting language, or a UNIX tool which will save you from programming altogether. On top of that, you may choose an object-oriented language, a procedural language, or a mix of both.

4: What software can and can't do

Knowing the capacities and limitations of what software can do for you is vital to avoid the frustrations arising from unfounded expectations. It is better, of course, to process data with a program whose steps can be repeated easily than with a spreadsheet.

5: How to write (scientific) software

Perhaps one of the main mistakes we make is to start programming by immediately editing code in the language of choice. This is like getting into your car and starting to drive without planning your journey. Understanding the objectives, the resources available, and the data to be used and produced is paramount.

Typing the code is not enough, though: unless you use a scripting language, you will need to compile the code and use the collections of software, called libraries or packages, which your code needs. This is a step you will have to perform several times until your code works correctly. It is recommended to use build tools which assemble your code with simple commands and recognise what has changed and needs recompilation.

Under all circumstances, be positive, but also be realistic. Having a single copy of work in which you have invested hundreds of hours is not advisable. If you keep it on your laptop, what if the laptop is stolen, ends up at the bottom of a river, or is destroyed by fire? Don't assume that university computers are 100% protected either. The point is that I strongly encourage you to use a version-control system, which not only keeps a copy of your code but also keeps versions of it in a safe place, usually the cloud.

6: Main software elements and tools

Most languages have elements in common, and this is what I describe in this chapter: what goes in memory; variables, constants and objects; the assignment of values to variables, whether from constants, other variables or expressions involving both. I also cover flow control; a computer will rarely execute a fixed sequence of instructions.

The flexibility to execute instructions many times in loops makes computers great at performing complex, repetitive groups of instructions in which only a part changes from one cycle of the loop to the next. The flexibility to make decisions which depend on the value of a given variable, in the form of conditionals, is also one of the reasons you will program.
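Loops and conditionals can be sketched together in a few lines of Python; the detector readings and the bad-value flag below are invented for the example.

```python
# Hypothetical example: loop over simulated detector readings and keep
# running statistics, using a conditional to skip flagged bad values
# (marked here as negative numbers).
readings = [3.2, -1.0, 4.8, 2.1, -1.0, 5.5]   # -1.0 marks a bad reading

total = 0.0
n_good = 0
for r in readings:      # the loop: same instructions, new value each cycle
    if r < 0:           # the conditional: decide based on the data
        continue        # skip flagged readings
    total += r
    n_good += 1

mean = total / n_good   # 4 good readings with mean ~3.9
print(n_good, mean)
```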

The same set of instructions may be applicable to several parts of a program, and for that, languages provide you with subprograms or methods which can be applied to the same kind of arguments but with different values at different stages of the execution.
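A minimal sketch of that idea in Python: one subprogram, written once, applied to different data at different stages of the execution (the function name and the temperature values are made up for the illustration).

```python
def mean_and_spread(values):
    """Return (mean, max - min) of a sequence: one routine, many call sites."""
    m = sum(values) / len(values)
    return m, max(values) - min(values)

# The same subprogram applied to different arguments:
temps_day = [14.0, 18.5, 21.0]
temps_night = [8.0, 9.5, 7.5]

print(mean_and_spread(temps_day))
print(mean_and_spread(temps_night))
```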

How you utilise the system's memory can vary from "I don't care, I'll leave it up to the language" to having full control in cases when you are pushing the system to the limits due to the data volume.

Human-readable text is sometimes all that you are interested in, so I give a brief description of the common functionality offered for manipulating the so-called "strings".
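Typical string operations look like this in Python; the line of text below is an invented example of a comma-separated record.

```python
# Split a comma-separated record into fields and clean up whitespace;
# strip() and split() are among the most commonly used string methods.
line = "  2023-04-01, 12.50, OK  "

fields = [f.strip() for f in line.strip().split(",")]
date, value_text, flag = fields

value = float(value_text)   # convert the numeric field from text
print(date, value, flag)    # 2023-04-01 12.5 OK
```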

As part of the planning, I mention the use of flowcharts and one of the many languages designed to handle project modelling.

7: Interfaces

A piece of software is rarely written as a monolithic entity, with all functionality in a single piece of code or file. Using subprograms and methods is highly advisable to reduce the amount of coding and to make use of standardised methods for common tasks, like reading and writing files. An interface, in this context, defines the way a subprogram interacts with its calling body.

Also, if your program produces scientific data which may or will be used by other pieces of code or even other users, the more thorough you are in describing the content of a data file, the better. I call this a "data interface". It is quite difficult to predict how long your results will be needed and how widespread they will be; hence, writing a well-described data file (and proper documentation) is advised.

Finally, how you name and organise your data-files is another form of interface.

8: Demo code vs production code

Software is no different from other products on the market, cars, for example. When you buy a car, you buy a well-tested, production-quality model. All the prototypes, proofs of concept, failures, half-failures and intermediate steps never make it to the market.

There is a big difference between demo code and production code. Demo code is written, for instance, to implement an equation found in a book, where the number of executions will be one or very limited, and every element of the equation is defined at the same time. You do need this code: you need to prove that it produces the correct results for your equation, and you must show that it is robust enough to handle situations where the input parameters are not within the expected range.

In production mode, once you need to apply this equation hundreds of thousands of times within loops, you must think of a smart way to reimplement the demo code so that it does not compute things time and time again when computing them once would do.
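The point can be sketched with a Gaussian evaluated many times for a fixed width: the demo version recomputes the sigma-dependent factors on every call, while the production-minded version computes them once and reuses them. The function names here are invented for the example.

```python
import math

# Demo version: every call recomputes the parts that depend only on sigma.
def gauss_demo(x, sigma):
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

# Production-minded version: compute the sigma-dependent constants once,
# then reuse them across hundreds of thousands of evaluations.
def make_gauss(sigma):
    norm = 1.0 / (sigma * math.sqrt(2 * math.pi))
    inv_two_sigma2 = 1.0 / (2 * sigma * sigma)
    def gauss(x):
        return norm * math.exp(-x * x * inv_two_sigma2)
    return gauss

g = make_gauss(1.5)
# Both versions agree; only the amount of repeated work differs.
for x in (0.0, 0.5, 1.0):
    assert abs(g(x) - gauss_demo(x, 1.5)) < 1e-12
```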

The sooner you start thinking from this perspective, the better.

9: Altering someone else's code

As a young scientist/engineer, you will rarely need to start coding from scratch. It is highly likely that one of your first tasks will be to modify already-existing code to enhance its functionality.

I present you with tips on how to make this process simpler, emphasising the need to create new and well-documented versions, always keeping a copy of the original piece of software intact.

10: Finding problems in the code

Unless you are a programming genius, chances are that somewhere in the design or the implementation, the programs you write will have problems: bugs, in the jargon. Finding those bugs is time-consuming, and I give you a list of hints on how to make this procedure less painful.

Even if your program is doing everything it is supposed to do, once you start plugging data into it, a new set of circumstances may arise which can make your code crash (if you are lucky) or produce meaningless results. This scenario is also part of debugging, and it is related to how to make your code resilient to unexpected failures, at least to the ones you can predict could happen.
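One common way to build in that resilience is to trap the failures you can predict at the point where external data enters the program. A minimal Python sketch, with an invented function name and an arbitrary valid range:

```python
def parse_reading(text):
    """Convert one text field to a float, trapping the failures we can predict."""
    try:
        value = float(text)
    except ValueError:
        return None                        # signal "bad record" instead of crashing
    if not (0.0 <= value < 1000.0):        # hypothetical valid range for this quantity
        return None
    return value

rows = ["12.5", "garbage", "-3.0", "999.9"]
clean = [v for v in (parse_reading(r) for r in rows) if v is not None]
print(clean)   # [12.5, 999.9]
```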

11: Testing code

Thoroughly testing that each software component behaves the way it is expected to may sound obvious to some and strange to others; yet, in the era of big and complex code bases, testing that everything works as expected is vital.

Testing does not stop there; once we know everything works properly individually, we must test how it works as an ensemble.

Any failures need to be fixed as soon as possible.

Testing also involves other elements, like performance: does my code do the job in a reasonable amount of time? Will it scale well once I apply it to a ton of data?
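A first, rough performance test can be as simple as timing a function on a known workload; the function below is an invented stand-in for real work.

```python
import time

def slow_sum(n):
    """Stand-in for a real computation: sum the integers 0 .. n-1 in a loop."""
    total = 0
    for i in range(n):
        total += i
    return total

t0 = time.perf_counter()           # high-resolution timer for benchmarking
result = slow_sum(1_000_000)
elapsed = time.perf_counter() - t0

# Check correctness AND note the cost; timing a small case and a large
# case gives a first hint of how the code scales.
print(f"sum = {result}, elapsed = {elapsed:.4f} s")
```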

Finally, despite all the tests you have made, it is advisable to plot the most relevant results of your code and let your brain inspect them for anomalies; no matter how infrequent anomalies may be, when the data volume is large you will see them appear. When you create plots, use the full dynamic range of the variables, not just the range you expect them to cover.

12: Performance enhancers

It doesn't matter if your code runs in 0.5 seconds or in 2 seconds, or if it creates a data file of 300 KB or 3.5 MB to store the data, IF AND ONLY IF you will run that code a handful of times, always under your supervision.

On the other hand, if your code will end up being applied hundreds of thousands of times on massive data sets, only apps which run fast and optimise the use of disk space will allow you to complete your project.

I give a series of tips on how to enhance the performance of a piece of software, and I strongly believe that good programming practices should be learned as soon as possible and applied even in situations where the software will be used only a handful of times. Keep in mind that you cannot foresee the future, and sometimes that little program which is meant to run only a few times ends up having to be applied on a much larger scale.

Performance enhancement helps to make the software scalable.

13: Code scalability

Code scalability refers to how a piece of software will perform if we need to apply it to a volume 10 or 100 times its intended original volume. In the last few years, it also means how easy it is to port that piece of software to a high-performance computer (HPC).

Granted, for some projects, this is not an issue, but for others, it is critical to think about scalability from the very beginning.

14: Working in parallel environments

Parallel computing has been around for a while; it refers to designing code in such a way that it can be executed on several processors within the same machine, on several computers (HPC), or on the processors of a GPU.

These situations force you to make a much more thorough design around the system in which the code will run. Some parts of the code will be easily parallelised, while others will necessarily need to be run sequentially.
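The easily parallelised case, often called "embarrassingly parallel", is when the same function is applied independently to each chunk of the data, followed by a sequential step that combines the partial results. A minimal Python sketch (threads keep the example simple and portable; for CPU-bound work in Python one would typically use processes instead):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Independent work on one chunk: here, a sum of squares."""
    return sum(x * x for x in chunk)

data = list(range(1000))
# Split the data into independent chunks, one per worker task.
chunks = [data[i:i + 250] for i in range(0, 1000, 250)]

# Parallel part: each chunk is processed by a worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunks))

# Sequential part: combining the partial results cannot be parallelised.
total = sum(partials)
assert total == sum(x * x for x in data)
```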

I emphasise the aspects of working in HPC environments, where a number of issues need to be accounted for which are not a concern when running on a single computer.

15: Working with remote computers

Gone is the time when one would sit at one computer and do all the processing on one workstation (PC). Universities and national facilities provide large facilities for data storage and processing, and we are expected to make use of them. It is important, then, to know how to connect to these machines and how best to make use of them.

Working remotely also means transferring data from one machine to another, and there are some ways which are far superior to others under certain circumstances.

16: UNIX basics

UNIX is the base of several widely used operating systems, like Linux and macOS, and starting with Windows 10, Microsoft allows you to run a flavour of it as well.

UNIX provides a rich environment to customise your experience (the shells) as well as providing you with hundreds of "tools" (commands) which can make your life significantly simpler.

I list and describe what I consider to be the most useful commands in Unix, as I am very aware that it has a steep learning curve.

Another characteristic of Unix-based systems, particularly when accessed remotely, is that they require you to use what is called a "terminal". This situation is not new for pre-millennials, but it may be entirely unknown to a generation which has grown up working only with graphical user interfaces (i.e., clicking here and there).

17: Automated execution

Human-controlled execution is when it is you who launches a piece of software, and it is you who evaluates, minutes, hours or days afterwards, whether it executed properly, and who then launches the next step (if any). Human-controlled execution is adequate for low-volume, less data-intensive situations, but it is not at all suitable for circumstances where the data volume is huge.

Automating the execution steps is highly recommended in a high-volume regime. You want an automated procedure capable of launching each processing step, waiting for it to complete, evaluating its results and acting accordingly: ringing a bell if it failed, launching the next processing step if it succeeded.

You need to prepare your software to work in automated mode: you must make it generate "messages" which the controlling software can evaluate.
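The simplest such "message" is a process exit code. A minimal Python sketch of a controlling script launching one step and acting on its status (the child command here is a trivial stand-in for a real processing step):

```python
import subprocess
import sys

# A stand-in processing step: a child Python process that prints a
# message and exits with status 0 (success by UNIX convention).
step = [sys.executable, "-c", "print('step 1 done')"]

result = subprocess.run(step, capture_output=True, text=True)

if result.returncode == 0:
    print("success:", result.stdout.strip())   # here one would launch the next step
else:
    print("FAILED:", result.stderr.strip())    # here one would "ring the bell"
```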

18: Random numbers

Random numbers are needed in many aspects of modern software applications, in particular in simulations like the Monte-Carlo method. Many languages provide ways to generate "random numbers", but generating truly random numbers is not that simple, and many of the generating routines produce the so-called pseudo-random-numbers sequences. These sequences repeat if you use the same seed. Sometimes this is sufficient, sometimes not.
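The seed behaviour described above can be demonstrated directly with Python's standard pseudo-random generator:

```python
import random

# Seeding the generator makes a pseudo-random sequence reproducible:
# the same seed always yields the same numbers.
random.seed(42)
first = [random.random() for _ in range(3)]

random.seed(42)
second = [random.random() for _ in range(3)]
assert first == second      # identical sequences from identical seeds

random.seed(43)
third = [random.random() for _ in range(3)]
assert first != third       # a different seed gives a different sequence
```

Reproducibility is exactly what you want when debugging a simulation; when you need runs to differ, you must vary the seed (or let the library pick one from a system source).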

19: Working with time

Time is one of those variables which is simple to use; its complexity lies in how you represent it, that is, in which scale you express it. For some applications, keeping time as a date is fine; for others, just counting the number of seconds, minutes or days after an event is suitable; for yet others, a better-defined scale is needed, like Julian Date, some flavour of Modified Julian Date, or Unix time.
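The scales mentioned above are related by fixed offsets, which a short Python sketch can show for a single instant (using the standard relations MJD = JD - 2400000.5 and Unix epoch = JD 2440587.5, i.e. MJD 40587.0; leap-second subtleties are ignored here):

```python
from datetime import datetime, timezone

# One instant, several representations.
t = datetime(2000, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

unix_seconds = t.timestamp()              # seconds since 1970-01-01T00:00Z
mjd = 40587.0 + unix_seconds / 86400.0    # Modified Julian Date
jd = mjd + 2400000.5                      # Julian Date

print(unix_seconds, mjd, jd)
assert jd == 2451545.0                    # J2000.0 reference epoch
```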

If you need to design a database with time in mind, how do you keep time? Or do you hold more than one measure of time? Human-related activities may benefit from distinguishing weekdays from weekends.

There is plenty to say about time.

20: Coordinate systems

I wrote this section for those who need to deal with spatial coordinates, like in astronomy or geography or anything which requires a geo-location attached to it.

21: Data processing

I wrote this section with a focus on applications in which data from instruments undergo several transformations, including calibration and quality control, before being converted into intermediate or final products.

Because they are products, you are likely not to be the only consumer of the data, and in that respect, it is important to follow some steps to make sure that whoever uses your data uses it for purposes for which it is suitable.

Data processing involves strategies on how to organise data from the acquisition stage to the final stages, passing through how to organise your data on disk in the best possible way.

22: Databases: the basics

Organising your information is key to the success of your project and how easy it is for others to access its results.

One aspect of organising the data is the use of databases. Sometimes you can store your data in databases; sometimes the volume is so large that this is impossible, but it is still possible to collect metadata in databases to make the data more discoverable, and perhaps to expose your data to external users via a web interface.

I cover basic database concepts, but what you need to keep in mind is that the choice of your database system is critical from day one. Choosing a database system is a bit like marriage.