My Data Science Blogs and Useful Links
Useful Articles and Blogs:
- SQL and Dataframe
- Semi-Structured Data
- Immutable Data and Functional Programming
- Back Propagation of ConvNet
- Microsoft Machine Learning for Apache Spark
- Introducing Flint: A time-series library for Apache Spark
- XGBoost and LightGBM on Spark
- XGBoost vs. LightGBM vs. Catboost
- Data Scientists vs. Data Engineers
- What Separates Good from Great Data Scientists?
- Combating High Cardinality Features in Supervised Machine Learning
- 6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples)
- Machine Learning with PySpark : EX1, EX2, ...
- Outlier Detection : EX1, EX2, EX3, ...
- Misc. Tips : Fastest Way to remove `$` chars
- Developing Applications with Google Cloud Platform
- Analyzing billion-objects catalog interactively: Apache Spark for physicists (arXiv:1807.03078)
- ASTROIDE: A Unified Astronomical Big Data Processing Engine over Spark (IEEE Transactions on Big Data)
- AXS: A framework for fast astronomical data processing based on Apache Spark (arXiv:1905.09034)
- Vaex: Big Data exploration in the era of Gaia (arXiv:1801.02638)
My PySpark Examples : Jupyter Notebooks and Python Scripts
1. Big Data and Deep Learning
- Read and trim a big CSV file with 300+ million rows from HDFS, then save it as a Parquet table in a local file system: Jupyter Notebook
- Trim the 29,791 sub-volumes into train and test sets
- A simple trim, without feature extraction or PySpark parallelism : Jupyter Notebook
- A serious trim, extracting features with a Pandas UDF for parallel computation : Jupyter Notebook
- Run MLPs: Jupyter Notebook
- For better predictions, Voronoi tessellation? Or a CNN with HEALPix convolution kernels?
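The "serious trim" step above pairs a pure-pandas feature extractor with Spark's parallelism. A minimal sketch of that pattern, assuming illustrative column names (`volume_id`, `density`) rather than the ones actually used in the notebooks:

```python
import pandas as pd

def extract_features(pdf):
    """Toy per-group feature extraction: summary statistics of a `density`
    column. Column and group names are illustrative assumptions."""
    return pd.DataFrame({
        "volume_id": [pdf["volume_id"].iloc[0]],
        "mean_density": [pdf["density"].mean()],
        "max_density": [pdf["density"].max()],
    })

# Spark wiring (requires a SparkSession; shown as comments so the
# pure-pandas core above stays runnable on its own):
# result = (df.groupBy("volume_id")
#             .applyInPandas(extract_features,
#                            schema="volume_id long, mean_density double, max_density double"))
```

Keeping the extraction logic in a plain pandas function makes it easy to unit-test locally before handing it to Spark.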
2. Big Data and Graph Analyses
- Interactively calculate graph statistics using GraphFrames on a standalone Spark cluster : Jupyter Notebook
- Calculate graph statistics using Spark GraphFrames on GCP Dataproc : Python Code
- Summarize the results with figures : Jupyter Notebook
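The graph-statistics step can be sketched without a cluster: the pure-Python function below computes vertex degrees from an edge list, with the GraphFrames equivalent shown as comments. Names and statistics here are assumptions, not necessarily those in the notebooks.

```python
from collections import Counter

def degree_stats(edges):
    """Compute vertex degrees from an undirected edge list
    [(src, dst), ...] and return (degree dict, mean degree)."""
    deg = Counter()
    for src, dst in edges:
        deg[src] += 1
        deg[dst] += 1
    mean_deg = sum(deg.values()) / len(deg) if deg else 0.0
    return dict(deg), mean_deg

# GraphFrames equivalent on a Spark cluster (requires the `graphframes`
# package and a SparkSession; DataFrame names are assumptions):
# from graphframes import GraphFrame
# g = GraphFrame(vertices_df, edges_df)
# g.degrees.show()           # per-vertex degrees
# g.triangleCount().show()   # per-vertex triangle counts
```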
Some Typical Data Science
A Kaggle Competition
- Microsoft Malware Predictor Revisited 2019
- Let's solve this problem using Apache Spark
- Read, explore, index, and impute the data: Jupyter Notebook
- Assemble all features for ML models : Jupyter Notebook
- Try several ML models to find the best predictor (Random Forest vs. LightGBM) : Jupyter Notebook
< Feature Importances from the best fit of Random Forest Model >
Overall, LightGBM trains faster and predicts better than Random Forest for this problem.
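Indexing the many categorical columns before assembling features is the core of the pipeline above. Below is a minimal pure-Python stand-in for Spark's StringIndexer, followed by a commented Spark pipeline sketch; column names like `SmartScreen` and `HasDetections` come from the public Kaggle dataset, but treat the wiring as a sketch, not the notebooks' actual code.

```python
from collections import Counter

def string_index(values):
    """Minimal stand-in for Spark's StringIndexer: map each category to an
    integer label ordered by descending frequency (StringIndexer's default),
    with ties broken alphabetically here for determinism."""
    freq = Counter(values)
    ordered = sorted(freq, key=lambda c: (-freq[c], c))
    mapping = {c: i for i, c in enumerate(ordered)}
    return [mapping[v] for v in values], mapping

# Spark pipeline sketch (requires a SparkSession; shown as comments):
# from pyspark.ml.feature import StringIndexer, VectorAssembler
# from pyspark.ml.classification import RandomForestClassifier
# indexer = StringIndexer(inputCol="SmartScreen", outputCol="SmartScreen_idx")
# assembler = VectorAssembler(inputCols=[...], outputCol="features")
# rf = RandomForestClassifier(labelCol="HasDetections", featuresCol="features")
```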
Some Little Data Science
A Usual Data Science
- Data Description : Public Data of NYPD Vehicle Accidents Report
- This data set is good to play with using Pandas: Jupyter Notebook
- Multiple-car collisions are more likely in January than in May, with very high statistical significance. The intuitive guess, more multiple-car collisions in winter, is confirmed by the data.
- Since 2013, the number of car accidents has been increasing.
- Brooklyn is the #1 Borough in the occurrences of car accidents.
- The most frequent car accidents per area occur near ZIP = 10022 around East 55th Street and 2 Ave. in Manhattan.
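The month and borough counts above reduce to a couple of pandas group-bys. A toy sketch with made-up rows; the real dataset's column names differ, so treat `date`, `borough`, and `vehicles` as assumptions:

```python
import pandas as pd

# Toy stand-in for the NYPD collisions table (made-up rows).
df = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-03", "2015-01-17", "2015-05-02",
                            "2016-01-09", "2016-05-21"]),
    "borough": ["BROOKLYN", "MANHATTAN", "BROOKLYN", "BROOKLYN", "QUEENS"],
    "vehicles": [3, 2, 2, 4, 2],
})

multi = df[df["vehicles"] >= 3]                    # multiple-car collisions
by_month = multi["date"].dt.month.value_counts()   # e.g. January vs. May
by_borough = df["borough"].value_counts()          # borough ranking
```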
Some Little but Big Data Science
A Big Data Science
- Data Description : All Job Postings from NYSE and Nasdaq corporations, provided by Thinknum. 116 GB in CSV format.
- What I am doing:
- Put all CSV files into HDFS, then trim the data set based on job `category`: Jupyter Notebook
- Do a quick check of Monthly Job Openings in Information Technology for each state, especially CA, NY, and TX : Jupyter Notebook
Though the apparent "March Spike" in the CA job market seems special, the job titles look normal, mostly hiring engineers and analysts as usual. And though Thinknum's data are big (116 GB), the trimmed data set for the CA job market is still too small to support any scientific claim about specific trends in IT job openings.
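Counting monthly openings per state and category is a simple aggregation. A stdlib-only sketch, with field names assumed rather than taken from the actual Thinknum schema:

```python
from collections import Counter

def monthly_openings(rows, state, category="Information Technology"):
    """Count job postings per 'YYYY-MM' month for one state and category.
    `rows` are (date_iso, state, category) tuples; the real Thinknum
    schema differs, so treat these fields as assumptions."""
    counts = Counter()
    for date_iso, st, cat in rows:
        if st == state and cat == category:
            counts[date_iso[:7]] += 1   # keep only the 'YYYY-MM' prefix
    return dict(counts)
```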
Setup for Fun Data Science on YouTube Statistics :
- Useful tools and docs :
- Set up YouTubeDataAPI
- conda install `google-api-python-client` and `oauth2client`
- In Google Cloud, enable `YouTubeDataAPI` and get a DevAPIKey
- Extracting Information from JSON query results
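Extracting fields from the API's JSON responses can be sketched like this. The response shape follows the documented `search.list` format of the YouTube Data API v3, and the commented query wiring assumes `google-api-python-client` and a valid API key:

```python
import json

def extract_video_stats(response_text):
    """Pull (videoId, title) pairs out of a YouTube Data API v3
    `search.list` JSON response, skipping non-video results."""
    payload = json.loads(response_text)
    return [(item["id"]["videoId"], item["snippet"]["title"])
            for item in payload.get("items", [])
            if item.get("id", {}).get("kind") == "youtube#video"]

# Issuing the query itself (requires google-api-python-client and DevAPIKey):
# from googleapiclient.discovery import build
# youtube = build("youtube", "v3", developerKey=DEV_API_KEY)
# response = youtube.search().list(q="data science", part="snippet").execute()
```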
Have Fun!
About My Oldie Data Reduction Experiences :
On developing new data reduction techniques: I developed an objective method to remove the stellar continuum emission from narrow-band images to derive emission-line images (2014, PASP, 126, 79; https://sites.google.com/site/shongcontisub/home) while working on stellar feedback in nearby starburst galaxies, and an automated, objective method to detect emission lines in faint-object spectroscopy (2014, PASP, 126, 1048) while working on the clustering properties of Lyman-alpha emitters. For the HETDEX project, I found a way to minimize the effect of time-dependent sampling patterns on power spectrum measurements and improved an algorithm to find guide and wavefront sensor stars for the HETDEX survey. My personal GitHub repository is https://github.com/shongscience.
=== legacy links ===
===HOD Tools=======================================================
Python Modules (Python 2.x) for Halo Occupation Models from Dr. Steven Murray (http://hmf.icrar.org/)
== Install :
$ conda install numpy scipy matplotlib astropy ipython numba
$ pip install cached_property
$ CAMBURL=http://camb.info/CAMB_Mar13.tar.gz pip install git+git://github.com/steven-murray/pycamb.git #need "wget"
$ pip install git+git://github.com/steven-murray/hmf.git@develop
$ pip install git+git://github.com/steven-murray/halomod.git@develop
== check : pip list |grep halomod (or hmf, pycamb)
== editable dev-install for "halomod" from a local dir (setup.py dir)
[shong@beethoven:~/work/pywork/halomodels/halomod-develop]$ pip install -e .
== example : see the attachment "shong_test_halomod.html"; for further questions you can contact Dr. Steven Murray.
==================================================================