My Data Science Blogs and Useful Links
Useful Articles and Blogs:
- SQL and Dataframe
- Semi-Structured Data
- Immutable Data and Functional Programming
- Back Propagation of ConvNet
- Microsoft Machine Learning for Apache Spark
- Introducing Flint: A time-series library for Apache Spark
- XGBoost and LightGBM on Spark
- XGBoost vs. LightGBM vs. Catboost
- Data Scientists vs. Data Engineers
- What Separates Good from Great Data Scientists?
- Combating High Cardinality Features in Supervised Machine Learning
- 6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples)
- Machine Learning with PySpark : EX1, EX2, ...
- Outlier Detection : EX1, EX2, EX3, ...
- Misc. Tips : Fastest Way to remove `$` chars
- Developing Applications with Google Cloud Platform
- Analyzing billion-objects catalog interactively: Apache Spark for physicists (arXiv:1807.03078)
- ASTROIDE: A Unified Astronomical Big Data Processing Engine over Spark (IEEE Transactions on Big Data)
- AXS: A framework for fast astronomical data processing based on Apache Spark (arXiv:1905.09034)
- Vaex: Big Data exploration in the era of Gaia (arXiv:1801.02638)
My PySpark Examples : Jupyter Notebooks and Python Scripts
1. Big Data and Deep Learning
- Read and trim a big CSV file with 300+ million rows from HDFS, then save it as a Parquet table in a local file system: Jupyter Notebook
- Trim the 29,791 sub-volumes into train and test sets
- A simple trim, without feature extraction or PySpark parallelism : Jupyter Notebook
- A serious trim, extracting features with a Pandas UDF for parallel computation : Jupyter Notebook
- Run MLPs: Jupyter Notebook
- For better predictions, Voronoi tessellation? Or a CNN with HEALPix convolution kernels?
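The "serious trim" step above pairs a pure-pandas feature extractor with Spark's parallelism. A minimal sketch of that pattern, assuming illustrative column names (`volume_id`, `density`) rather than the ones actually used in the notebooks:

```python
import pandas as pd

def extract_features(pdf):
    """Toy per-group feature extraction: summary statistics of a `density`
    column. Column and group names are illustrative assumptions."""
    return pd.DataFrame({
        "volume_id": [pdf["volume_id"].iloc[0]],
        "mean_density": [pdf["density"].mean()],
        "max_density": [pdf["density"].max()],
    })

# Spark wiring (requires a SparkSession; shown as comments so the
# pure-pandas core above stays runnable on its own):
# result = (df.groupBy("volume_id")
#             .applyInPandas(extract_features,
#                            schema="volume_id long, mean_density double, max_density double"))
```

Keeping the extraction logic in a plain pandas function makes it easy to unit-test locally before handing it to Spark.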
2. Big Data and Graph Analyses
- Interactively calculate graph statistics using GraphFrames on a standalone Spark cluster : Jupyter Notebook
- Calculate graph statistics using Spark GraphFrames on GCP Dataproc : Python Code
- Summarize the results with figures : Jupyter Notebook
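The graph-statistics step can be sketched without a cluster: the pure-Python function below computes vertex degrees from an edge list, with the GraphFrames equivalent shown as comments. Names and statistics here are assumptions, not necessarily those in the notebooks.

```python
from collections import Counter

def degree_stats(edges):
    """Compute vertex degrees from an undirected edge list
    [(src, dst), ...] and return (degree dict, mean degree)."""
    deg = Counter()
    for src, dst in edges:
        deg[src] += 1
        deg[dst] += 1
    mean_deg = sum(deg.values()) / len(deg) if deg else 0.0
    return dict(deg), mean_deg

# GraphFrames equivalent on a Spark cluster (requires the `graphframes`
# package and a SparkSession; DataFrame names are assumptions):
# from graphframes import GraphFrame
# g = GraphFrame(vertices_df, edges_df)
# g.degrees.show()           # per-vertex degrees
# g.triangleCount().show()   # per-vertex triangle counts
```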
Some Typical Data Science
A Kaggle Competition
- Microsoft Malware Predictor Revisited 2019
- Let's solve this problem using Apache Spark
- Read, explore, index, and impute the data: Jupyter Notebook
- Assemble all features for ML models : Jupyter Notebook
- Try several ML models to find the best predictor (Random Forest vs. LightGBM) : Jupyter Notebook
< Feature Importances from the best fit of Random Forest Model >
Overall, LightGBM trains faster and predicts better than Random Forest for this problem.
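Indexing the many categorical columns before assembling features is the core of the pipeline above. Below is a minimal pure-Python stand-in for Spark's StringIndexer, followed by a commented Spark pipeline sketch; column names like `SmartScreen` and `HasDetections` come from the public Kaggle dataset, but treat the wiring as a sketch, not the notebooks' actual code.

```python
from collections import Counter

def string_index(values):
    """Minimal stand-in for Spark's StringIndexer: map each category to an
    integer label ordered by descending frequency (StringIndexer's default),
    with ties broken alphabetically here for determinism."""
    freq = Counter(values)
    ordered = sorted(freq, key=lambda c: (-freq[c], c))
    mapping = {c: i for i, c in enumerate(ordered)}
    return [mapping[v] for v in values], mapping

# Spark pipeline sketch (requires a SparkSession; shown as comments):
# from pyspark.ml.feature import StringIndexer, VectorAssembler
# from pyspark.ml.classification import RandomForestClassifier
# indexer = StringIndexer(inputCol="SmartScreen", outputCol="SmartScreen_idx")
# assembler = VectorAssembler(inputCols=[...], outputCol="features")
# rf = RandomForestClassifier(labelCol="HasDetections", featuresCol="features")
```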
Some Little Data Science
A Usual Data Science
- Data Description : Public Data of NYPD Vehicle Accidents Report
- This data set is good to play with using Pandas: Jupyter Notebook
- Multiple-car collisions are more likely in January than in May, with very high statistical significance. The intuitive guess, more multiple-car collisions in winter, is confirmed by the data.
- Since 2013, the number of car accidents has been increasing.
- Brooklyn is the #1 Borough in the occurrences of car accidents.
- The most frequent car accidents per area occur near ZIP = 10022 around East 55th Street and 2 Ave. in Manhattan.
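The month and borough counts above reduce to a couple of pandas group-bys. A toy sketch with made-up rows; the real dataset's column names differ, so treat `date`, `borough`, and `vehicles` as assumptions:

```python
import pandas as pd

# Toy stand-in for the NYPD collisions table (made-up rows).
df = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-03", "2015-01-17", "2015-05-02",
                            "2016-01-09", "2016-05-21"]),
    "borough": ["BROOKLYN", "MANHATTAN", "BROOKLYN", "BROOKLYN", "QUEENS"],
    "vehicles": [3, 2, 2, 4, 2],
})

multi = df[df["vehicles"] >= 3]                    # multiple-car collisions
by_month = multi["date"].dt.month.value_counts()   # e.g. January vs. May
by_borough = df["borough"].value_counts()          # borough ranking
```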
Some Little but Big Data Science
A Big Data Science
- Data Description : All Job Postings from NYSE and Nasdaq corporations, provided by Thinknum. 116 GB in CSV format.
- What I am doing:
- Put all CSV files into HDFS, then trim the data set based on job `category`: Jupyter Notebook
- Do a quick check of Monthly Job Openings in Information Technology for each state, especially CA, NY, and TX : Jupyter Notebook
Though the apparent "March Spike" in the CA job market seems special, the job titles look normal, mostly hiring engineers and analysts as usual. And though Thinknum's data are big (116 GB), the trimmed data set for the CA job market is still too small to support any scientific claim about specific trends in IT job openings.
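Counting monthly openings per state and category is a simple aggregation. A stdlib-only sketch, with field names assumed rather than taken from the actual Thinknum schema:

```python
from collections import Counter

def monthly_openings(rows, state, category="Information Technology"):
    """Count job postings per 'YYYY-MM' month for one state and category.
    `rows` are (date_iso, state, category) tuples; the real Thinknum
    schema differs, so treat these fields as assumptions."""
    counts = Counter()
    for date_iso, st, cat in rows:
        if st == state and cat == category:
            counts[date_iso[:7]] += 1   # keep only the 'YYYY-MM' prefix
    return dict(counts)
```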
Setup for Fun Data Science on YouTube Statistics :
- Useful tools and docs :
- Set up YouTubeDataAPI
- conda install `google-api-python-client` and `oauth2client`
- In Google Cloud, enable `YouTubeDataAPI` and get a DevAPIKey
- Extracting Information from JSON query results
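Extracting fields from the API's JSON responses can be sketched like this. The response shape follows the documented `search.list` format of the YouTube Data API v3, and the commented query wiring assumes `google-api-python-client` and a valid API key:

```python
import json

def extract_video_stats(response_text):
    """Pull (videoId, title) pairs out of a YouTube Data API v3
    `search.list` JSON response, skipping non-video results."""
    payload = json.loads(response_text)
    return [(item["id"]["videoId"], item["snippet"]["title"])
            for item in payload.get("items", [])
            if item.get("id", {}).get("kind") == "youtube#video"]

# Issuing the query itself (requires google-api-python-client and DevAPIKey):
# from googleapiclient.discovery import build
# youtube = build("youtube", "v3", developerKey=DEV_API_KEY)
# response = youtube.search().list(q="data science", part="snippet").execute()
```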
Have Fun!
About My Oldie Data Reduction Experiences :
On developing new data reduction techniques: I developed an objective method to remove the stellar continuum emission from narrow-band images to derive emission-line images (2014, PASP, 126, 79; https://sites.google.com/site/shongcontisub/home) while working on stellar feedback in nearby starburst galaxies, and an automated, objective method to detect emission lines in faint-object spectroscopy (2014, PASP, 126, 1048) while working on the clustering properties of Lyman-alpha emitters. For the HETDEX project, I found a way to minimize the effect of time-dependent sampling patterns on power spectrum measurements and improved an algorithm to find guide and wavefront sensor stars for the HETDEX survey. My personal GitHub repository is https://github.com/shongscience.
=== legacy links ===
===HOD Tools=======================================================
Python Modules (Python 2.x) for Halo Occupation Models from Dr. Steven Murray (http://hmf.icrar.org/)
== Install :
$ conda install numpy scipy matplotlib astropy ipython numba
$ pip install cached_property
$ CAMBURL=http://camb.info/CAMB_Mar13.tar.gz pip install git+git://github.com/steven-murray/pycamb.git #need "wget"
$ pip install git+git://github.com/steven-murray/hmf.git@develop
$ pip install git+git://github.com/steven-murray/halomod.git@develop
== check : pip list |grep halomod (or hmf, pycamb)
== editable dev-install for "halomod" from a local dir (setup.py dir)
[shong@beethoven:~/work/pywork/halomodels/halomod-develop]$ pip install -e .
== example : see the attachment "shong_test_halomod.html"; for further questions you can contact Dr. Steven Murray.
==================================================================