Machine Learning Using Stata

In this post, I show how to implement machine learning algorithms in Stata 16 using two dedicated commands that I developed:


  • r_ml_stata_cv: for ML regression purposes

  • c_ml_stata_cv: for ML classification purposes

As an illustrative example, I show step by step how to implement a regression tree.

Before starting, install Python (version 2.7 or later) and the Python packages scikit-learn, numpy, and pandas. If you want suggestions on how to install Python and its packages, look here.
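
Before moving on, it can help to confirm that the three packages are visible to the Python interpreter that Stata is set up to call. The following is a minimal sketch using only the Python standard library; the helper name `missing_packages` is hypothetical and is not part of the Stata commands.

```python
import importlib.util

# Packages named in the prerequisites; "sklearn" is the import name of scikit-learn.
REQUIRED = ["sklearn", "numpy", "pandas"]


def missing_packages(names):
    """Return the names that the current Python interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

If any package is reported as missing, install it for that same interpreter before running the commands below.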

* Install the Stata ML command

. ssc install r_ml_stata_cv

* Look at the documentation

. help r_ml_stata_cv

* Load the initial dataset

. sysuse boston, clear


* Form the train and test datasets

. get_train_test , dataname("boston") split(0.80 0.20) split_var(svar) rseed(101)
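
The idea behind a seeded 80/20 split can be sketched in a few lines of Python. This is only an illustration of the general technique, assuming a simple shuffle of row indices; it is not the internal code of get_train_test, and the function name `train_test_indices` is hypothetical.

```python
import random


def train_test_indices(n_obs, train_share=0.80, seed=101):
    """Shuffle row indices 0..n_obs-1 and cut them into train and test parts."""
    rng = random.Random(seed)      # private generator, so the split is reproducible
    indices = list(range(n_obs))
    rng.shuffle(indices)
    n_train = round(train_share * n_obs)
    # Sort each part only to make the result easier to read.
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = train_test_indices(10)
print(len(train_idx), len(test_idx))
```

Fixing the seed is what makes the split repeatable: calling the function again with the same arguments returns the same two index lists.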


* Form the target and the features

. global y "medv"

. global X "zn indus chas nox rm age dis rad tax ptratio black lstat"


* Run tree regression in default mode

. use boston_train, clear

. r_ml_stata_cv $y $X , mlmodel("tree") data_test("boston_test") default prediction("pred") seed(10)


* Run tree regression with a specific tree depth

. cap rm CV.dta

. use boston_train, clear

. r_ml_stata_cv $y $X , mlmodel("tree") data_test("boston_test") prediction("pred") tree_depth(3) ///
    cross_validation("CV") n_folds(5) seed(10)


* Run tree regression with cross-validated tree depth

. cap rm CV.dta

. use boston_train, clear

. r_ml_stata_cv $y $X , mlmodel("tree") data_test("boston_test") prediction("pred") ///
    tree_depth(1 2 3 4 5 6 7 8 9) cross_validation("CV") n_folds(5) seed(10) graph_cv
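
To make the last step more concrete, here is a toy, pure-Python sketch of choosing a tree depth by K-fold cross-validation. It is an illustration of the general technique only, under strong simplifying assumptions (a single feature, a synthetic step-shaped dataset, a hand-written tree); it is not the code that r_ml_stata_cv runs, and all function names are hypothetical.

```python
import random
from statistics import mean


def fit_tree(xs, ys, depth):
    """Fit a one-feature regression tree by recursive binary splitting."""
    # Stop when the depth budget is spent or the node cannot be split further.
    if depth == 0 or len(set(xs)) < 2:
        return mean(ys)                          # leaf: predict the node mean
    best = None
    for cut in sorted(set(xs))[1:]:              # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, cut)
    cut = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x < cut]
    right = [(x, y) for x, y in zip(xs, ys) if x >= cut]
    return (cut,
            fit_tree([x for x, _ in left], [y for _, y in left], depth - 1),
            fit_tree([x for x, _ in right], [y for _, y in right], depth - 1))


def predict(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        cut, below, above = tree
        tree = below if x < cut else above
    return tree


def cv_mse(xs, ys, depth, n_folds=5, seed=10):
    """Average held-out mean squared error over n_folds folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        tree = fit_tree([xs[i] for i in train], [ys[i] for i in train], depth)
        errors.append(mean((ys[i] - predict(tree, xs[i])) ** 2 for i in held_out))
    return mean(errors)


def best_depth(xs, ys, depths):
    """Return the depth in the grid with the smallest cross-validated MSE."""
    return min(depths, key=lambda d: cv_mse(xs, ys, d))


# Synthetic data (an assumption for illustration): a step at x = 20.
xs = list(range(40))
ys = [0.0 if x < 20 else 10.0 for x in xs]
print("Selected depth:", best_depth(xs, ys, range(1, 10)))
```

Each candidate depth is scored on data the tree did not see during fitting, and the depth with the lowest average held-out error is kept. Reusing one seed across depths means every candidate is compared on the same folds.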

References

  • Cerulli, G. 2020. C_ML_STATA: Stata module to implement machine learning classification in Stata. Statistical Software Components, Boston College Department of Economics. Available at: https://econpapers.repec.org/software/bocbocode/s458830.htm.

  • Cerulli, G. 2020. R_ML_STATA: Stata module to implement machine learning regression in Stata. Statistical Software Components, Boston College Department of Economics. Available at: https://econpapers.repec.org/software/bocbocode/s458831.htm.

  • Cerulli, G. 2020. A super-learning machine for predicting economic outcomes. MPRA Paper 99111, University Library of Munich, Germany.

  • James, G., Witten, D., Hastie, T., Tibshirani, R. 2013. An Introduction to Statistical Learning: with Applications in R. New York, Springer.

  • Raschka, S., Mirjalili, V. 2019. Python Machine Learning. 3rd Edition, Packt Publishing.

How to cite this post

Cerulli, G. (2020). "Machine learning using Stata". Available at: https://sites.google.com/view/giovannicerulli/machine-learning-in-stata