Non-Technical

These pages describe my work at a non-technical level, suitable for a general audience. To read about my work on a technical level, see the research pages.

My work as an astronomer and "data scientist" divides into two main areas:

    • Astronomy

    • Data mining

As an astronomer, I study the universe beyond the Earth. A lot of the universe that we know to exist is beyond the Earth. Written as a fraction of known space, it is:

0.9999999999999999999999999999999999999999999999999999999999999

and possibly much larger!

As a data scientist, I employ data mining to further my research. Data mining is the exploration of datasets to find new and useful patterns. The datasets are often very large, complex, and high-dimensional. This means that there are many numbers to describe each object, and just as three numbers describe a point in 3D space, so many more numbers can be said to describe a point in a space with many more dimensions. This complexity and high dimensionality mean that many useful patterns may exist in the data but are very difficult to find except by this approach. (Think of, say, a method that is equivalent to plotting every dimension against every other all at the same time, instead of making many individual plots and trying to see patterns.)

The Fourth Paradigm

Data mining embodies the fourth paradigm of science. The first three paradigms are:

    1. Empirical observation (thousands of years)

    2. Hypotheses and theories under the scientific method (last 500 years)

    3. Computation and simulation (last 50 years)

The fourth paradigm is the exploration of large datasets in a data-driven manner, which means that, as well as hypotheses being formed and checked, the data themselves reveal new patterns that can lead to new insight.

Data mining also has negative connotations in some contexts: it is possible for a great deal of information to be collected about a person and their actions. However, as with any powerful technology, it is not the technology itself, but the uses to which it is put, that can be positive or negative. Data mining has been used for a great deal of positive benefit.

My research

An important aspect of both astronomy and data mining is that they are science-driven. One asks the questions first, and then uses appropriate tools to try to answer them. This includes data mining, which is also data-driven, but usually we have some idea what we are looking for!

Questions I am interested in include:

    • What is the faint-end slope of the galaxy luminosity function in the Virgo Cluster?

    • How does the luminosity function vary with environment?

    • Is there a universal luminosity function for a given type of galaxy, regardless of environment?

    • Did the faint galaxies in clusters form in situ, or did they fall into the cluster over time?

The galaxy luminosity function is the number of galaxies per volume of space, as a function of how bright they are. Generally, there are fewer bright galaxies and more faint ones, and the overall numbers are a strong function of where one is. The faint-end slope tells how rapidly the number of galaxies increases as one probes to fainter brightnesses. If one maps out where the galaxies are, the large-scale structure of the universe resembles a bath sponge, with filaments and walls of galaxies, galaxy clusters where these intersect, and in between, huge voids containing almost no galaxies at all.
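
For the mathematically inclined, the luminosity function is often summarized by the standard Schechter parameterization (a textbook form, given here only to make "faint-end slope" concrete; it is not specific to my own analysis):

    \phi(L)\,dL = \phi^{*} \left(\frac{L}{L^{*}}\right)^{\alpha} e^{-L/L^{*}} \,\frac{dL}{L^{*}}

where L^{*} is a characteristic brightness, \phi^{*} sets the overall numbers, and the exponent \alpha is the faint-end slope referred to above.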

The Virgo Cluster is the nearest large cluster of galaxies to us. If our galaxy, the Milky Way, can be thought of as being in the "suburbs", then Virgo is the nearest city. As such, because we can see much more detail in Virgo than in clusters further away, it provides a natural "laboratory" in which to study galaxies, in order to find out how they formed and evolved, and the effect of the dense cluster environment.

The Next Generation Virgo Cluster Survey is a survey of 100 square degrees of sky containing the Virgo Cluster of galaxies. The full moon is half a degree across, so think of 20 full moons in a row, then a square on the sky with sides of that length. It is probing to a brightness of 25.7 magnitudes in the g band, which is a particular range of wavelengths of visible light, and to comparable depths in the other bands, u, r, i, and z. These are the same bands as the well-known Sloan Digital Sky Survey, but this survey is probing 3.5 magnitudes fainter, which corresponds to a factor of about 25 in brightness.
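
For readers curious about that arithmetic: magnitudes are a logarithmic scale on which 5 magnitudes correspond to exactly a factor of 100 in brightness, so a difference of \Delta m magnitudes corresponds to a brightness ratio of

    10^{\Delta m / 2.5}, \qquad \text{and for } \Delta m = 3.5: \quad 10^{3.5/2.5} = 10^{1.4} \approx 25.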

Big data

Many areas are facing a flood of data, with the amounts to be analyzed doubling every 18 months, or, in some fields, such as areas of medicine, every 6 months! Astronomy is no exception to this, as detectors on telescopes act like giant digital cameras, constantly sending back huge files.

This rapid expansion of data means that it is becoming increasingly impractical to move it around: although data sizes, hard disk space, and processor speeds are all growing exponentially, and thus somewhat match each other, other parts of the system, such as network speed and hard disk speed, are not.

Thus, rather than downloading the data to your own computer (bringing the data to the analysis), it is becoming increasingly sensible to run the analysis on a machine where the data already reside: bringing the analysis to the data. In astronomy, this is embodied by the concept of the Virtual Observatory. While the Virtual Observatory was not developed just for this (it also aims to make all astronomy data available in a coherent and interoperable way), the tools it provides at the various data centres are starting to help a great deal in enabling large datasets to be analyzed.

Another interesting aspect of the largest datasets is that it is in effect becoming necessary for them to be streamed, which means that the computer works through successive parts of the data in turn, because they cannot all be held in memory at once. While this is obviously true of real-time data, such as that from social networks or variable astronomical objects, it means that even data that have nothing to do with time effectively become a time series. So the application of time-series methods will become increasingly important.
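
As a small illustration of the idea, here is a minimal sketch in Python of working through a file in successive pieces (the filename and chunk size are hypothetical, not from any particular survey):

    # Work through a file too large for memory in fixed-size chunks,
    # rather than reading it all at once.
    def stream_chunks(path, chunk_size=64 * 1024 * 1024):  # 64 MB at a time
        """Yield successive pieces of a file without loading it all."""
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    total = 0
    for chunk in stream_chunks("survey_image_data.bin"):
        total += len(chunk)  # a real analysis would process each chunk here
    print(f"Processed {total} bytes, one chunk at a time")

Each chunk arrives in order, just like successive moments of a time series, which is why time-series thinking applies even to data with no time in them.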

The data size of the Virgo survey is about 50 terabytes. By astronomical standards, this is not particularly large: some surveys are reaching into the petabyte range. However, it is still considerably larger than the hard drive on one's desktop, and is subject to the complexity and high dimensionality mentioned above.

The data sizes will continue to increase. In the early 2020s, the Square Kilometre Array is expected to produce many exabytes of data. While we may be able to store it, remember what was said above about disk input/output speed and transferring things over a network!

Cloud computing

Cloud computing is the notion that, much as the utility company provides your electricity and water, and your ISP your internet connection, so companies can provide you, over the internet, with computing power itself.

Linking to the big data idea above, this means that, if the data are stored on the cloud, users can naturally take the analysis to the data, as required when the data are large.

At the Canadian Astronomy Data Centre, where I am based, we have embodied cloud computing in a project called CANFAR (Canadian Advanced Network for Astronomical Research). This allows Canadian astronomers to operate a virtual machine (VM), which is just like operating a normal desktop or laptop, but does not physically correspond to a particular piece of hardware. Once one has installed the software and written one's code to say what to do (a job), the job, or many jobs, can be submitted and run on up to 500 processors simultaneously. This is like taking one's desktop or laptop and cloning it 500 times, so one gets results up to 500 times faster: a job that would take a day on one machine instead takes less than three minutes!

An important aspect of CANFAR is that one is not restricted in what code can be run. Scientists like to be able to run their own code, and it is often vital, because some of the code has to be written specifically to do the analysis they want to do.

Links with industry

Because the National Research Council Canada, where I work, is part of the Canadian Government rather than a university, an important part of our work is to cultivate links with local businesses. The data mining portion of my work provides a natural route to this, because it provides the ability to find useful information in large amounts of real-world data. Astronomy data may seem esoteric, and somewhat "not of this world", but they share a great deal with more down-to-Earth data such as medical images, pharmaceuticals, customer behaviour, social networks, and so on. The data mining methods used on astronomy data can be applied to all these other data.

The World's First Science Analytics Cloud

With this in mind, we are collaborating with the Fastlab, a group at the Georgia Institute of Technology in Atlanta, which is the world's largest group dedicated to machine learning: another name for data mining, that is, training computers to find useful patterns in data. Originally a purely academic group, they have now produced a commercial software product, Skytree, which is designed to be the first large-scale, industrial-strength machine learning software.

We have placed the Skytree software on the Canadian Astronomy Data Centre's CANFAR cloud computing system (mentioned above), where it can be used on all 500 of CANFAR's processors simultaneously. Yet another name for data mining, machine learning, etc. is analytics, and we have thus created the world's first Science Analytics Cloud. This is ahead of what other science institutions and businesses, even in places such as Silicon Valley in California, are currently doing.