My Data Science Blogs and Useful Links

My PySpark Examples : Jupyter Notebooks and Python Scripts

1. Big Data and Deep Learning

  • Read and trim a big CSV file with 300+ million rows from HDFS, then save it as a parquet table in a local file system (a minimal PySpark sketch follows this list): Jupyter Notebook
  • Trim the 29791 sub-volumes into train and test sets
    • A Simple Trim without feature extraction or PySpark parallelism : Jupyter Notebook
    • A Serious Trim that extracts features with a Pandas UDF for parallel computation (see the second sketch after this list) : Jupyter Notebook
  • Run MLPs: Jupyter Notebook
    • For better predictions: Voronoi tessellation, or a CNN with HEALPix convolution kernels?
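A minimal PySpark sketch of the read-trim-save step above, assuming placeholder HDFS/local paths and column names (the notebook's actual schema is not reproduced here):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("trim-big-csv").getOrCreate()

# Read the 300+ million-row CSV directly from HDFS (path is a placeholder).
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/big_table.csv"))

# Keep only the needed columns and rows before writing (example trim).
trimmed = (df.select("id", "x", "y", "z", "value")
             .filter(F.col("value") > 0))

# Save as a parquet table in the local file system.
trimmed.write.mode("overwrite").parquet("file:///home/user/data/big_table.parquet")
```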

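And a minimal sketch of the Pandas UDF feature extraction, assuming each sub-volume is identified by a hypothetical `subvol_id` column; `applyInPandas` distributes the per-group work across the executors:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-udf-features").getOrCreate()
df = spark.read.parquet("file:///home/user/data/big_table.parquet")

def extract_features(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds one sub-volume; compute simple summary features for it.
    return pd.DataFrame({
        "subvol_id": [pdf["subvol_id"].iloc[0]],
        "mean_value": [pdf["value"].mean()],
        "std_value": [pdf["value"].std()],
    })

features = df.groupBy("subvol_id").applyInPandas(
    extract_features,
    schema="subvol_id long, mean_value double, std_value double",
)
features.show(5)
```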

2. Big Data and Graph Analyses

  • Interactively calculate graph statistics using GraphFrames on a stand-alone Spark cluster (a minimal GraphFrames sketch follows this list) : Jupyter Notebook
  • Calculate graph statistics using Spark GraphFrames on GCP Dataproc : Python Script
  • Summarize the results with figures : Jupyter Notebook
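A minimal GraphFrames sketch of the kind of graph statistics computed in the notebooks above, using a toy three-node graph instead of the real data (the real vertices and edges would be loaded from parquet):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # start Spark with --packages graphframes:graphframes:...

spark = SparkSession.builder.appName("graph-stats").getOrCreate()

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])
g = GraphFrame(vertices, edges)

g.degrees.show()          # degree of each vertex
g.triangleCount().show()  # number of triangles through each vertex

spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")  # required by connectedComponents
g.connectedComponents().show()
```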

Some Typical Data Science

A Kaggle Competition


< Feature importances from the best-fit Random Forest model >


Overall, LightGBM trains faster and predicts better than Random Forest for this problem.
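A rough sketch of how such a comparison can be set up, using synthetic data and default settings rather than the competition's actual features and tuning:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
cols = [f"f{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
lgbm = LGBMClassifier(n_estimators=200, random_state=0).fit(X, y)

# Feature importances from both fitted models, side by side.
importances = pd.DataFrame({
    "feature": cols,
    "rf_importance": rf.feature_importances_,
    "lgbm_importance": lgbm.feature_importances_,
}).sort_values("rf_importance", ascending=False)
print(importances)
```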

Some Little Data Science

A Usual Data Science

  • Data Description : Public Data of NYPD Vehicle Accidents Report
  • This data set is good to play with using Pandas (a short pandas sketch follows this list): Jupyter Notebook
    • Multiple-car collisions are more likely in January than in May, with very high statistical significance; the intuitive guess that multiple-car collisions are more common in winter is confirmed by the data.
    • Since 2013, the number of car accidents has been increasing.
    • Brooklyn is the borough with the most car accidents.
    • The most frequent car accidents per area occur near ZIP = 10022, around East 55th Street and 2nd Avenue in Manhattan.
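A short pandas sketch of the group-by checks behind these bullet points; the file name and the `CRASH DATE` and `BOROUGH` column names are assumptions about the export, not guaranteed to match the exact CSV headers:

```python
import pandas as pd

df = pd.read_csv("nypd_collisions.csv", parse_dates=["CRASH DATE"])  # placeholder file name

df["year"] = df["CRASH DATE"].dt.year
df["month"] = df["CRASH DATE"].dt.month

# Accidents per year: increasing since 2013?
print(df.groupby("year").size())

# Accidents per borough: which borough ranks first?
print(df["BOROUGH"].value_counts())

# January vs. May counts; a chi-square or Poisson test would quantify
# the significance of the winter/summer difference.
print(df["month"].value_counts().loc[[1, 5]])
```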

Some Little but Big Data Science

A Big Data Science

  • Data Description : All job postings from NYSE and Nasdaq corporations, provided by Thinknum; 116GB in CSV format.
  • What I am doing:
    • Put all CSV files into HDFS, then trim the data set by job `category`: Jupyter Notebook
    • Do a quick check of monthly job openings in Information Technology for each state, especially CA, NY, and TX (a minimal PySpark sketch follows the note below) : Jupyter Notebook

Though the apparent "March Spike" in the CA job market looks special, the job titles are ordinary, mostly the usual engineer and analyst openings. And although Thinknum's full data set is big (116GB), the trimmed subset for the CA job market is still too small to establish any specific trend in IT job openings.
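A minimal PySpark sketch of the trim-and-count steps above; the HDFS path and the `category`, `state`, and `posted_date` column names are placeholders for the Thinknum schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("job-postings").getOrCreate()

jobs = (spark.read
        .option("header", "true")
        .csv("hdfs:///data/thinknum/job_postings/*.csv"))

# Keep only Information Technology postings.
it_jobs = jobs.filter(F.col("category") == "Information Technology")

# Monthly job openings for CA, NY, and TX.
monthly = (it_jobs
           .filter(F.col("state").isin("CA", "NY", "TX"))
           .withColumn("month", F.date_format(F.col("posted_date"), "yyyy-MM"))
           .groupBy("state", "month")
           .count()
           .orderBy("state", "month"))
monthly.show()
```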

Settings for Fun Data Science on YouTube Statistics :


  • Set up the YouTube Data API (a minimal client sketch follows this list)
    • conda install `google-api-python-client` and `oauth2client`
    • In the Google Cloud console, enable the YouTube Data API and get a developer API key
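A minimal sketch of building the API client with `google-api-python-client`, assuming the developer API key is pasted in as `DEV_API_KEY` and the video id is a placeholder:

```python
from googleapiclient.discovery import build

DEV_API_KEY = "YOUR_DEVELOPER_API_KEY"  # from the Google Cloud console

youtube = build("youtube", "v3", developerKey=DEV_API_KEY)

# Request snippet and statistics for one video.
response = (youtube.videos()
            .list(part="snippet,statistics", id="SOME_VIDEO_ID")
            .execute())
print(response)
```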


Extracting Information from JSON query results
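For example, the `videos().list` call returns a JSON-like dict; here is a small sketch of pulling a few fields out of it (the `response` below is a toy dict shaped like the real reply):

```python
response = {
    "items": [
        {"snippet": {"title": "Some Video"},
         "statistics": {"viewCount": "12345", "likeCount": "678"}},
    ]
}

for item in response.get("items", []):
    title = item["snippet"]["title"]
    stats = item["statistics"]
    print(f'{title}: {stats.get("viewCount")} views, {stats.get("likeCount")} likes')
```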

Have Fun!



About My Oldie Data Reduction Experiences :

About developing new data reduction techniques: I developed an objective method to remove the stellar continuum emission from narrow-band images in order to derive emission-line images (2014, PASP, 126, 79; https://sites.google.com/site/shongcontisub/home) while working on stellar feedback in nearby starburst galaxies, and an automated, objective method to detect emission lines in faint-object spectroscopy (2014, PASP, 126, 1048) while working on the clustering properties of Lyman-alpha emitters. For the HETDEX project, I found a way to minimize the effect of time-dependent sampling patterns on power-spectrum measurements and improved an algorithm for finding guide and wavefront-sensor stars for the HETDEX survey. My personal GitHub repository is https://github.com/shongscience.


=== legacy links ===

===HOD Tools=======================================================

Python Modules (Python 2.x) for Halo Occupation Models from Dr. Steven Murray (http://hmf.icrar.org/)

== Install :

$ conda install numpy scipy matplotlib astropy ipython numba

$ pip install cached_property

$ CAMBURL=http://camb.info/CAMB_Mar13.tar.gz pip install git+git://github.com/steven-murray/pycamb.git #need "wget"

$ pip install git+git://github.com/steven-murray/hmf.git@develop

$ pip install git+git://github.com/steven-murray/halomod.git@develop

== check : pip list |grep halomod (or hmf, pycamb)

== editable dev-install for "halomod" from a local dir (setup.py dir)

[shong@beethoven:~/work/pywork/halomodels/halomod-develop]$ pip install -e .

== example : see the attachment "shong_test_halomod.html"; for further questions, you can contact Dr. Steven Murray.

==================================================================