Machine Learning Using Stata
In this post, I show how to run machine learning algorithms in Stata 16 using two dedicated commands that I developed:
- r_ml_stata_cv: for ML regression purposes
- c_ml_stata_cv: for ML classification purposes
As an illustrative example, I show step by step how to fit a regression tree.
Before starting, install Python (version 2.7 or later) and the Python packages scikit-learn, numpy, and pandas. If you want suggestions on how to install Python and its packages, look here.
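Before installing the ML commands, it can help to confirm that Stata actually sees the Python installation and the three packages. The lines below are a minimal sketch using Stata 16's built-in python commands, written in do-file form (without the dot prompt). Note that scikit-learn is imported under the module name sklearn.

```stata
* Show which Python installation Stata is currently linked to
python query

* List the Python installations Stata can find on this machine
python search

* Check that each required package is available to Stata's Python
* (an error here means the package is missing from that installation)
python which sklearn
python which numpy
python which pandas
```

If any of the python which lines fails, install the missing package into the Python installation reported by python query before going on.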
* Install the Stata ML command
. ssc install r_ml_stata_cv
* Look at the documentation
. help r_ml_stata_cv
* Load the initial dataset
. sysuse boston, clear
* Form the train and test datasets
. get_train_test , dataname("boston") split(0.80 0.20) split_var(svar) rseed(101)
* Form the target and the features
. global y "medv"
. global X "zn indus chas nox rm age dis rad tax ptratio black lstat"
* Run tree regression in default mode
. use boston_train, clear
. r_ml_stata_cv $y $X , mlmodel("tree") data_test("boston_test") default prediction("pred") seed(10)
* Run tree regression with specific tree depth
. cap rm CV.dta
. use boston_train, clear
. r_ml_stata_cv $y $X , mlmodel("tree") data_test("boston_test") prediction("pred") tree_depth(3) ///
cross_validation("CV") n_folds(5) seed(10)
* Run tree regression with cross-validated tree depth
. cap rm CV.dta
. use boston_train, clear
. r_ml_stata_cv $y $X , mlmodel("tree") data_test("boston_test") prediction("pred") ///
tree_depth(1 2 3 4 5 6 7 8 9) cross_validation("CV") n_folds(5) seed(10) graph_cv
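To make the idea behind n_folds(5) concrete, here is a minimal hand-written sketch of 5-fold cross-validation in plain Stata. It is not the code behind r_ml_stata_cv. It assumes the auto dataset shipped with Stata and uses linear regression as a stand-in learner, because Stata has no native regression tree. The variable names u, fold, sqerr, and yhat are illustrative only.

```stata
* Illustrative only: 5-fold cross-validation written out by hand.
* Assumptions: Stata's shipped auto dataset, and linear regression as a
* stand-in learner; this is not the code behind r_ml_stata_cv.
sysuse auto, clear
set seed 10

* Shuffle the observations, then deal them into 5 folds of near-equal size
generate double u = runiform()
sort u
generate byte fold = mod(_n - 1, 5) + 1

* For each fold: fit on the other four folds, predict the held-out fold
generate double sqerr = .
forvalues k = 1/5 {
    quietly regress price mpg weight if fold != `k'
    quietly predict double yhat, xb
    quietly replace sqerr = (price - yhat)^2 if fold == `k'
    drop yhat
}

* Average squared error over all held-out predictions
summarize sqerr, meanonly
display "5-fold CV mean squared error: " r(mean)
```

Conceptually, choosing the tree depth by cross-validation repeats this loop once for each candidate depth and keeps the depth with the best average held-out score.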
References
Cerulli, G. 2020. C_ML_STATA: Stata module to implement machine learning classification in Stata. Statistical Software Components, Boston College Department of Economics. Available at: https://econpapers.repec.org/software/bocbocode/s458830.htm.
Cerulli, G. 2020. R_ML_STATA: Stata module to implement machine learning regression in Stata. Statistical Software Components, Boston College Department of Economics. Available at: https://econpapers.repec.org/software/bocbocode/s458831.htm.
Cerulli, G. 2020. A super-learning machine for predicting economic outcomes. MPRA Paper 99111, University Library of Munich, Germany.
James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning: with Applications in R. New York, Springer.
Raschka, S., Mirjalili, V. 2019. Python Machine Learning. 3rd Edition, Packt Publishing.
How to cite this post
Cerulli, G. (2020). "Machine learning using Stata". Available at: https://sites.google.com/view/giovannicerulli/machine-learning-in-stata