Stata tips


sysuse auto.dta, clear

reg mpg foreign

tabstat mpg, by(foreign)

ttest mpg, by(foreign)


** use regressions to get statistical significance

foreach i of varlist price mpg weight  length {

reg `i' foreign

** collect results by storing coefficients as scalars

estadd scalar Domestic = _b[_cons] 

estadd scalar Foreign = _b[_cons]  + _b[foreign] 

estadd scalar Difference =  _b[foreign]

estimates store `i'

}


** display results

esttab price mpg weight  length,  nose nogaps b(%9.3f) star(* 0.1 ** 0.05 *** 0.01) scalar( Domestic Foreign Difference Observations)


** export results

esttab price mpg weight  length using ttest, csv replace nose nogaps b(%9.3f) star(* 0.1 ** 0.05 *** 0.01) scalar(Domestic Foreign Difference N)


With few manipulations in Excel you can obtain the table on the right.

use http://www.stata-press.com/data/r10/bpwide.dta

table agegrp sex, stat(mean bp_before) nototal

** use regressions to get statistical significance

reg bp_before i.agegrp if sex==1 // women


** collect results by using the command nlcom, use post to store them 

nlcom (type1: _b[_cons]) ///

  (type2: _b[_cons] + _b[2.agegrp]) ///

  (type3: _b[_cons] + _b[3.agegrp]) ///

  (diff1: _b[2.agegrp]) ///

  (diff2: _b[3.agegrp]) ///

  (diff3: _b[3.agegrp] - _b[2.agegrp]), ///   

  post

estimates store ttestf  


reg bp_before i.agegrp if sex==0 // men

nlcom (type1: _b[_cons]) ///

  (type2: _b[_cons] + _b[2.agegrp]) ///

  (type3: _b[_cons] + _b[3.agegrp]) ///

  (diff1: _b[2.agegrp]) ///

  (diff2: _b[3.agegrp]) ///

  (diff3: _b[3.agegrp] - _b[2.agegrp]), ///   

  post

estimates store ttestm


** display results

esttab ttest* , star(* 0.1 ** 0.05 *** 0.01) noomitt not  nogaps scalar(  )  b(%9.3f)  nocons   mtitle("Women" "Men")


** export results

esttab ttest*  using  ttest_part2 , csv replace star(* 0.1 ** 0.05 *** 0.01) noomitt not  nogaps scalar(  )  b(%9.3f)  nocons   mtitle("Women" "Men") 


With few manipulations in Excel you can obtain the table on the right.


use  http://www.stata-press.com/data/r10/auto.dta

**here I create a variable with all missing values

replace rep78= .

dropmiss, force


use  http://www.stata-press.com/data/r10/auto.dta

global allvar "price mpg trunk"

foreach x of global allvar{

   sum `x', meanonly

   if r(mean) <= 20 {

       display in smcl as text "dropped {result}`x' " 

       drop `x'

   }

}


use  http://www.stata-press.com/data/r10/auto.dta

local allvar "price mpg trunk"

** storing labels

foreach x of local  allvar  {

local l`x': variable label `x'

}

collapse price mpg trunk, by(rep78)

**restoring the labels

foreach x of  local  allvar {

label var `x' "`l`x''"

}

Outcome:

use  http://www.stata-press.com/data/r10/auto.dta

local allvar "price mpg rep78"


** Here I create some missing data-points

replace price = . if price<4600

ge missingvars = ""

foreach x of  local  allvar  {

replace missingvars = missingvars + "`x'" + "; " if `x' ==. 

}


use  http://www.stata-press.com/data/r10/auto.dta


** here I create some missing data-points

replace price = . if price<4600


** We are interested in identify the unit of observation that has at least one missing value in the 3 variables: price, mpg and rep78

egen missing = rownonmiss(price mpg rep78)

 keep if missing < 3

outsheet using Missing.csv, c replace

    - or How to create a local that contains the values taken by a variable (levelsof)


use  http://www.stata-press.com/data/r10/auto.dta


** Here I create a string variable with 3 entry: low, middle and high to be used later in the example


ge category = "low" if price < 3900

replace category = "middle" if price>=3900 & price<8000

replace category = "high" if price>=8000 


** Here I create a local that contains the element of the variable category: "low middle high"

** This command is useful when the variable contains a long list of countries, for example, and one is interested in producing a graph or a particular statistic 

** for each of them and the option "by" is not available.


levelsof category, local(categ)  

foreach i of local categ{

    sum mpg if category=="`i'"

}

This command works also for regressions:

levelsof category, local(categ)  

foreach i of local categ{

reg mpg price weight length  if category=="`i'"

}


use  http://www.stata-press.com/data/r10/auto.dta


** the local  "makeless3900"  lists all the makers that have a price lower than 3900, names are separated by a comma

levelsof make if price < 3900, local(makeless3900) separate(,) clean


** the variable "makeless3900" lists  all the makers that have a price lower than 3900 (separated by a comma):

ge makeless3900 = "`makeless3900'" 


The easiest way to extract a portion of a string variable is to first map the content of the string variable into its various components. This can be done using the command regexm.

Below I report some examples.

To extract one of the components one should use the command: regexs:  ge newvar = regexs(#component) if (Map))


- to extract the first component: ge newvarname = regexs(1) if (regexm(varname,"([0-9][0-9])[/]([0-9][0-9])[/]([0-9][0-9][0-9][0-9])"))  gives you the string 10 (using the first example)

- to extract the second component: ge newvarname = regexs(2) if (regexm(varname,"([0-9][0-9])[/]([0-9][0-9])[/]([0-9][0-9][0-9][0-9])")) gives you the string 07 (using the first example)

- etc. 


If the structure of the string variable varies across observations then the method above cannot be applied. There are other ways in which one can extract a portion of a string variable; these can be found here: http://www.ats.ucla.edu/stat/stata/faq/regex.htm


 ->   8b. How to extract the first or the last portion of a string variable: 


a. To select the the first NUMERIC part of a string, for example the first 3 numbers of 3340098:

                ge newvarname = regexs(1) if (regexm(varname,"(^[0-9][0-9][0-9])"));   

the last 3 numbers: ge newvarname = regexs(1) if (regexm(varname,"([0-9][0-9][0-9]$)"))


b. To select the the first LETTERS of a string, for example the first 3 letters of Lon0098:

                ge newvarname = regexs(1) if (regexm(varname,"(^[a-zA-z][a-zA-z][a-zA-z])"));   

the last 3 letters: ge newvarname = regexs(1) if (regexm(varname,"([a-zA-z][a-zA-z][a-zA-z]$)"))


c. To select the the first LETTERS or NUMERIC characters of a string, for example the first 3 elements of A3b0098:

                ge newvarname = regexs(1) if (regexm(varname,"(^[0-9a-zA-z][0-9a-zA-z][0-9a-zA-z])"));  

the last 3 elements: ge newvarname = regexs(1) if (regexm(varname,"([0-9a-zA-z][0-9a-zA-z][0-9a-zA-z]$)"))


use http://www.stata-press.com/data/r10/bpwide.dta

In this dataset people are categorized by gender and age-group. I intend to create a variable indicating to which percentile of the distribution of a variable (bp_before) a person belongs to within their gender and age-group. 


Here is an example of the data:

** Here I create a variable that gives a unique identifier to each combination of sex and age-group.

egen levels = group(sex agegrp)

** Here I create a local containing all the unique elements of the variable levels

levelsof levels, local(locallevel)

** Here I create a variable that indicate the decile of the distribution each observation belongs to by sex and age-group

gen decile= .

foreach i in `locallevel' {

    xtile decile_temp=bp_before if levels ==`i', nq(10)

    replace decile= decile_temp if missing(decile)

    drop decile_temp

}

** Here is the final dataset:

** Patient number 39 is in the top decile of the distribution of bp_before within his gender and age-group: male patients aged 46-59. In the same group, patient number 40 belongs instead to the 7th decile of the distribution of db_before.


When variable names have a numeric component it is possible to combine them sequentially using loops. Below is a simple example.

Consider a dataset that records a firm's capital costs for every year and the variables have the following format: capital2000, capital2001, capital2002 etc. To construct, for example, biannual changes in capital one could use the following loop:


forvalues i = 0/10{

    ge change`=`i'+2002' 

  = capital`=`i'+2002' - capital`=`i'+2000'

}


Note that the above calculations could be better conducted by first reshaping the dataset and then using time-series operator, nevertheless the procedure might be useful in other situations where a reshape is not convenient option.  


Consider a dataset where variables names include a common component such as the year of reference (for example: consumption1998, income1998, age1998 etc.).  

To change the year of reference in all variable names you can use the following procedure:


foreach x of varlist *1998{

    local `x'_y = regexr("`x'", "1998", "2000")

    rename `x' ``x'_y'

}


the command regexr will substitute all "1998" with "2000".  If you want to get read of the common part ( "1998")  from all variable names:


foreach x of varlist *1998{

    local `x'_y = regexr("`x'", "1998", "")

    rename `x' ``x'_y'

}


Consider a dataset with a long list of variables with a common prefix such as:

marketshare_a

marketshare_b

...

marketshare_z


To create a local containing the non-common part of the variable names (a, b, ..., z) one could use the following procedure:

To create a local that contains the names of all variables:


local c ""

foreach x of varlist marketshare_a-marketshare_z{

    local c = "`c' `x'"

}


The commands above will create the local:  "marketshare_a marketshare_b marketshare_c ... marketshare_z"

To create a local that contains the unique part of the variable names:


local c=subinstr("`c'","marketshare_","",.)

di "`c'"


The commands above will create the local:  "a b c ... z"

Consider a dataset where a variable contain a description and you want to classify an observation on the basis of some elements of this description. In the "auto" dataset below, for example, we have a variable indicating the model of the car: Merc. Bobcat, Chev. Impala ...


use  http://www.stata-press.com/data/r10/auto.dta


To create a variable that indicates whether the car is a Mercedes, for example, we can use the following command:


gen d_merc = strpos(make, "Merc")


This command will produce a variable that takes value zero if the word "Merc" does not appear in the string and a number corresponding to the position of the word in the string when the word is present.


This is very useful when one needs to reshape a dataset and the j variable is a string with spaces. It is possible to convert strings with spaces into Stata names using the following command:


ge stringwithnospace = strtoname(stringwithspace)


Blanks are substituted with underscores _ .

Considering a dataset that record data at pair level, pairs are likely to be repeated (for example a-b and b-a). To drop duplicate pairs one can use the following command:

** generate identifier that uses a unique code for identical pairs


egen pair1 = concat(reporter partner)

egen pair2 = concat(partner reporter)

replace pair1 = pair2 if reporter > partner


** keep only one pair


bysort pair1: keep if _n==1


Consider several datasets with the following filenames:


ISIC24_ITA_18aug2003

ISIC25_FRA_19aug2003

etc.


If one wants to merge or manipulate these datasets using a macro it is possible to select each of them and modify their names with the following commands:

** fs needs to be installed: ssc install fs

** Select path to locate the folder where all files are stored


cd "C://Myfolder"


** the following command select all datasets containing the word "ISIC" in the folder indicated above


fs *ISIC*.dta


** Here we decompose the name of each dataset into its different elements. 


foreach f in `r(files)'{

    local sector=substr("`f'",5,2)

    local country=substr("`f'",8,3)

    di "`country'"

    ** here we open each dataset, modify some variables (if necessary) and give the dataset a temporary name with the format we consider useful.

    use "`f'"

    rename x y

    tempfile data_`sector'_`country'

    save `data_`sector'_`country''

}


Now the temporary dataset so created can be used in standard macro routines.




I found this much easier way that the one I was using before (presented further below):

ssc install find

ssc install rcd

rcd  "C:\Mydirectory" : find *.dta , match( VariableIamlookingfor ) show

also works with do-files:

rcd  "C:\Mydirectory" : find *.do , match( VariableIamlookingfor ) show


Consider a folder with several datasets. To search for a variable through all the datasets one can use the following command.


** Select the directory where the folder is located

cd "C:\Mydirectory" 


** Create a local with all the filenames in the folder

fs *.dta


** Search through the datasets. In this example the variable we are looking for is called "commID". I generate a variable named "comm" so that the program does not stop if it cannot find "commID".


foreach f in `r(files)'{

    clear

    use `f'

    di "`f'"

    ge comm = 1

    codebook comm*, compact

}


** If the word you are looking for (e. g. district) appears in the variable label then you should use lookfor:

foreach f in `r(files)'{

    clear

    use `f'

    di "`f'"

    lookfor district

}




The following commands work only with Stata12 and newer versions. With the option "describe" of the command "import excel" it is possible to see and store the names of all the worksheets in an excel file:


import excel "path/excelfile.xlsx", describe

return list


Names are store in locals: r(worksheet_1), r(worksheet_2) etc. In this example the excel file has 5 worksheet. To import all worksheets one needs to save each of them as a separate file (it is also possible to save them as Stata temporary files).


forvalues i = 1/5{

    clear

    import excel "excelfile.xlsx", sheet('r(worksheet`i''))

    save sheet`i', replace

}


clear

    forvalues i = 2/5{

    use sheet1

    mmerge id using sheet`i'

}




a) Convert the pdf into MS word using an online application such as: convert.files

b) Often when converting a table, multiple lines in the same cells are separated by line-breaks (indicated by ¶) which means that when copied into excel they will appear as multiple rows.

c) Replace the line-breaks symbol with an empty space. MS Word allows to represent Line-breaks using the symbol ^p. Replace "^p" with "" in MS word.

d) Copy the table in excel

e) Import into Stata using the command: import excel "table.xlsx", sheet("sheet1") firstrow 



Sometimes you might have different versions of a dataset that you want to compare. Here is a simple way to do it.


use "DatasetA", clear


*Identifier should be the first variable in your dataset

order identifier


*Store variable names. "firstvar" is the first variable after the identifier while "lastvar" is the last variable in the dataset

local x 

foreach var of varlist firstvar - lastvar {

local x `x' `var'

}

di "`x'"


*Rename variables to allow comparison

rename * z_*


*Check identifier to us merge command

rename ne_identifier identifier


*Merge the two new and old datasets

merge 1:1 identifier using  "DatasetB"


*Compare all variables 

foreach i of local x{

compare `i' z_`i'

}



*Set folder when files are located

cd "mypath/mypath/myfolder/"


*Create local with names of all Stata files

fs *.dta


* Drop  name of file from local to avoid double entry 

local files ""

foreach f in `r(files)'{

  local files = "`files' `f'"

 }


di "`files'" 

local drop "Name_of_file_to_drop.dta"

local files: list files- drop

di "`files'"


* Append files

use Name_of_file_to_drop.dta, clear

foreach f of local files{

  append using `f'

 }


*Here are some examples of how dates can appear in Stata (string format):


set obs 5 

g date = "16/10/1979" in 1 

replace date = "25/12/1985" in 2 

replace date = "08/12/2010" in 3 

replace date = "03-05-1985" in 4 

replace date = "08-09-1983" in 5 


* Days

g day = date(date, "DMY") 

format day %td 


* Weeks

g week = date(date, "DMY") 

replace week = wofd(week) 

format week %tw 


* Months

g month = date(date, "DMY") 

replace month = mofd(month) 

format month %tm 


* Quarters

g quarter = date(date, "DMY") 

replace quarter = qofd(quarter) 

format quarter %tq 


* Semesters

g semester = date(date, "DMY") 

replace semester = hofd(semester) 

format semester %th 


* Years

g year = date(date, "DMY") 

replace year = yofd(year) 

format year %ty


Consider a dataset that records a firm's capital costs for every year and the variables have the following format: capital2000, capital2001, capital2002 etc. To construct, for example, biannual changes in capital one could use the following loop:


forvalues i = 0/10{

    ge change`=`i'+2002'   = capital`=`i'+2002' - capital`=`i'+2000'

}


Note that the above calculations could be better conducted by first reshaping the dataset and then using time-series operator, nevertheless the procedure might be useful in other situations where a reshape is not convenient option.