Research : Big Data Astronomy

Let's PySpark the Universe

Would you like to know what these figures mean? Please check out: http://cosmicweb.barabasilab.com/

1. Why PySpark for Astronomy?

Apache Spark is now a de facto standard in Big Data platforms. Also, in astronomy, Python has become the major tool for scientific computation.


Though I have found that the `dataframe` is not yet widely used in astronomy, many astronomical studies are inevitably catalog-driven. Hence, I am pretty sure that it is only a matter of time before astronomers adopt the `dataframe` as their main data-handling tool.


So... how can we handle dataframes at "astronomical" scales? How can we apply Python UDFs in a massive, parallel way?

Yes... PySpark. Though this tool was born for Big Data industries, it is a perfect tool for astronomy!


2. How to build a private stand-alone Spark Cluster?

I have built a stand-alone Spark cluster, composed of an iMac (master/name node) and Linux Ubuntu machines (worker/data nodes).

The details can be found at:

https://github.com/shongscience/PySpark-the-Universe/wiki/%5B1%5D-My-Private-Spark-Hadoop-Cluster

3. Jupyter Notebook and CLI script

The PySpark shell can be launched within a Jupyter Notebook. We can sketch out all our ideas in the Jupyter environment.

Then, we can wrap up the sketches into a single Python script. We can run this PySpark code from the CLI using `spark-submit`.
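A minimal sketch of both workflows, using the standard `PYSPARK_DRIVER_PYTHON` environment variables; the master URL and script name are hypothetical:

```shell
# Launch the PySpark shell inside a Jupyter Notebook:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark --master spark://mymaster:7077    # hypothetical master URL

# Later, run the wrapped-up script from the CLI:
spark-submit --master spark://mymaster:7077 my_analysis.py   # hypothetical script
```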

4. Dataproc in Google Cloud Platform

Though most calculations can be done on my stand-alone Spark/Hadoop cluster, some memory-intensive jobs (such as those needing 1 TB of memory) still need GCP.
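As a rough sketch, a high-memory Dataproc cluster can be created from the CLI. The cluster name, region, and machine shapes below are illustrative; two `n1-highmem-96` workers (624 GB each) give roughly 1.2 TB of aggregate RAM:

```shell
gcloud dataproc clusters create bigmem-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --worker-machine-type=n1-highmem-96 \
    --master-machine-type=n1-standard-8
```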

For details on how to use Dataproc, check this link:

https://github.com/shongscience/PySpark-the-Universe/wiki/%5B3%5D-Dataproc-in-Google-Cloud

Notable Updates on Spark:

In the 2.3 release, Apache Arrow became available, which removes the pain of "pickling" and "unpickling" in PySpark.

Also, Spark can run on Kubernetes clusters.

In the 2.4 release, the notorious 2 GB buffer limitation was removed. Now, we can broadcast quite large variables to workers.

The new "Project Hydrogen" has started, which will bring distributed TensorFlow to Spark!

The new `koalas` project has started. There will be little gap between pandas and Spark dataframes!

5. How to use PySpark to solve astronomical problems

5.1 Multiverse Simulations

This work has been submitted to the journal MNRAS. Here is the main text: arXiv:1903.07626


< Spark Clusters >

< Sample Selections >

< Results >

Triangle counts, connected components, and average degrees can constrain cosmologies very effectively at Big Data scales.

The largest sample is STD-HR2048, composed of 57 million nodes with 206 million connections.

Only Big Data tools can make this plot!

Finite-Size Scalings of Graph Statistics