Stata tips
How to create a table with t-tests (option 1)
sysuse auto.dta, clear
** group means and a simple t-test for a single variable (the "manual" way)
reg mpg foreign
tabstat mpg, by(foreign)
ttest mpg, by(foreign)
** use regressions to get statistical significance
foreach i of varlist price mpg weight length {
reg `i' foreign
** collect results by storing coefficients as scalars
estadd scalar Domestic = _b[_cons]
estadd scalar Foreign = _b[_cons] + _b[foreign]
estadd scalar Difference = _b[foreign]
estimates store `i'
}
** display results
esttab price mpg weight length, nose nogaps b(%9.3f) star(* 0.1 ** 0.05 *** 0.01) scalar(Domestic Foreign Difference N)
** export results
esttab price mpg weight length using ttest, csv replace nose nogaps b(%9.3f) star(* 0.1 ** 0.05 *** 0.01) scalar(Domestic Foreign Difference N)
With a few manipulations in Excel you can then turn the exported results into a clean t-test table.
How to create a table with t-tests (option 2) - Multiple categories
use http://www.stata-press.com/data/r10/bpwide.dta
table agegrp sex, stat(mean bp_before) nototal
** use regressions to get statistical significance
reg bp_before i.agegrp if sex==1 // women
** collect results by using the command nlcom, use post to store them
nlcom (type1: _b[_cons]) ///
(type2: _b[_cons] + _b[2.agegrp]) ///
(type3: _b[_cons] + _b[3.agegrp]) ///
(diff1: _b[2.agegrp]) ///
(diff2: _b[3.agegrp]) ///
(diff3: _b[3.agegrp] - _b[2.agegrp]), ///
post
estimates store ttestf
reg bp_before i.agegrp if sex==0 // men
nlcom (type1: _b[_cons]) ///
(type2: _b[_cons] + _b[2.agegrp]) ///
(type3: _b[_cons] + _b[3.agegrp]) ///
(diff1: _b[2.agegrp]) ///
(diff2: _b[3.agegrp]) ///
(diff3: _b[3.agegrp] - _b[2.agegrp]), ///
post
estimates store ttestm
** display results
esttab ttest* , star(* 0.1 ** 0.05 *** 0.01) noomitted not nogaps b(%9.3f) nocons mtitle("Women" "Men")
** export results
esttab ttest* using ttest_part2 , csv replace star(* 0.1 ** 0.05 *** 0.01) noomitted not nogaps b(%9.3f) nocons mtitle("Women" "Men")
With a few manipulations in Excel you can then turn the exported results into a clean t-test table.
How to drop variables that have all missing values:
use http://www.stata-press.com/data/r10/auto.dta
** here I create a variable with all missing values
replace rep78 = .
** dropmiss is a user-written command (see: findit dropmiss)
dropmiss, force
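If dropmiss is not available, the same result can be obtained with built-in commands only; here is a minimal sketch (it loops over every variable currently in memory):
foreach x of varlist _all {
quietly count if !missing(`x')
** drop the variable if it has no non-missing values
if r(N) == 0 {
display as text "dropping `x' (all missing)"
drop `x'
}
}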
How to drop variables satisfying a condition:
use http://www.stata-press.com/data/r10/auto.dta
global allvar "price mpg trunk"
foreach x of global allvar{
sum `x', meanonly
if r(mean) <= 20 {
display in smcl as text "dropped {result}`x' "
drop `x'
}
}
How to restore variables' labels after collapsing a dataset:
use http://www.stata-press.com/data/r10/auto.dta
local allvar "price mpg trunk"
** storing labels
foreach x of local allvar {
local l`x': variable label `x'
}
collapse price mpg trunk, by(rep78)
**restoring the labels
foreach x of local allvar {
label var `x' "`l`x''"
}
How to generate a variable that indicates the names of the variables with missing data for each observation:
use http://www.stata-press.com/data/r10/auto.dta
local allvar "price mpg rep78"
** Here I create some missing data-points
replace price = . if price<4600
ge missingvars = ""
foreach x of local allvar {
replace missingvars = missingvars + "`x'" + "; " if missing(`x')
}
How to create an excel file that reports the missing data-points of a dataset:
use http://www.stata-press.com/data/r10/auto.dta
** here I create some missing data-points
replace price = . if price<4600
** We are interested in identifying the units of observation that have at least one missing value in the three variables: price, mpg and rep78
egen missing = rownonmiss(price mpg rep78)
keep if missing < 3
outsheet using Missing.csv, c replace
How to produce summary statistics (or regressions) for each category of a variable:
- or How to create a local that contains the values taken by a variable (levelsof)
use http://www.stata-press.com/data/r10/auto.dta
** Here I create a string variable with three categories (low, middle and high) to be used later in the example
ge category = "low" if price < 3900
replace category = "middle" if price>=3900 & price<8000
replace category = "high" if price>=8000
** Here I create a local that contains the elements (levels) of the variable category: "low middle high"
** This command is useful when, for example, the variable contains a long list of countries and one is interested in producing a graph or a particular statistic
** for each of them but the option "by" is not available.
levelsof category, local(categ)
foreach i of local categ{
sum mpg if category=="`i'"
}
The same approach also works for regressions:
levelsof category, local(categ)
foreach i of local categ{
reg mpg price weight length if category=="`i'"
}
How to create a variable that lists the elements (countries, firms, etc.) satisfying certain conditions:
use http://www.stata-press.com/data/r10/auto.dta
** the local "makeless3900" lists all the makers that have a price lower than 3900, names are separated by a comma
levelsof make if price < 3900, local(makeless3900) separate(,) clean
** the variable "makeless3900" lists all the makers that have a price lower than 3900 (separated by a comma):
ge makeless3900 = "`makeless3900'"
How to extract a portion of a string variable:
The easiest way to extract a portion of a string variable is to first map the content of the string variable into its various components. This can be done using the command regexm.
Below I report some examples for a string variable containing dates written as two digits, two digits and four digits separated by slashes.
To extract one of the components use regexs: ge newvar = regexs(#) if regexm(varname, "pattern"), where # is the number of the parenthesized component in the pattern.
- to extract the first component (the first pair of digits, e.g. the string 10): ge newvarname = regexs(1) if (regexm(varname,"([0-9][0-9])[/]([0-9][0-9])[/]([0-9][0-9][0-9][0-9])"))
- to extract the second component (the second pair of digits, e.g. the string 07): ge newvarname = regexs(2) if (regexm(varname,"([0-9][0-9])[/]([0-9][0-9])[/]([0-9][0-9][0-9][0-9])"))
- etc.
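As a self-contained sketch of the commands above, using a made-up date value purely for illustration:
clear
set obs 1
ge datestr = "10/07/2010"
** component 1 = first pair of digits, 2 = second pair, 3 = the four-digit part
ge part1 = regexs(1) if regexm(datestr,"([0-9][0-9])[/]([0-9][0-9])[/]([0-9][0-9][0-9][0-9])")
ge part2 = regexs(2) if regexm(datestr,"([0-9][0-9])[/]([0-9][0-9])[/]([0-9][0-9][0-9][0-9])")
ge part3 = regexs(3) if regexm(datestr,"([0-9][0-9])[/]([0-9][0-9])[/]([0-9][0-9][0-9][0-9])")
** part1 contains "10", part2 contains "07", part3 contains "2010" (all as strings)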
If the structure of the string variable varies across observations then the method above cannot be applied. There are other ways in which one can extract a portion of a string variable; these can be found here: http://www.ats.ucla.edu/stat/stata/faq/regex.htm
How to extract the first or the last portion of a string variable:
a. To select the first NUMERIC characters of a string, for example the first 3 digits of 3340098:
ge newvarname = regexs(1) if (regexm(varname,"(^[0-9][0-9][0-9])"))
and the last 3 digits: ge newvarname = regexs(1) if (regexm(varname,"([0-9][0-9][0-9]$)"))
b. To select the first LETTERS of a string, for example the first 3 letters of Lon0098:
ge newvarname = regexs(1) if (regexm(varname,"(^[a-zA-Z][a-zA-Z][a-zA-Z])"))
and the last 3 letters: ge newvarname = regexs(1) if (regexm(varname,"([a-zA-Z][a-zA-Z][a-zA-Z]$)"))
c. To select the first LETTERS or NUMERIC characters of a string, for example the first 3 characters of A3b0098:
ge newvarname = regexs(1) if (regexm(varname,"(^[0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z])"))
and the last 3 characters: ge newvarname = regexs(1) if (regexm(varname,"([0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z]$)"))
How to create a variable indicating the percentile of the distribution of a variable by categories:
use http://www.stata-press.com/data/r10/bpwide.dta
In this dataset people are categorized by gender and age-group. I intend to create a variable indicating which percentile of the distribution of a variable (bp_before) a person belongs to within their gender and age-group.
** Here I create a variable that gives a unique identifier to each combination of sex and age-group.
egen levels = group(sex agegrp)
** Here I create a local containing all the unique elements of the variable levels
levelsof levels, local(locallevel)
** Here I create a variable that indicates the decile of the distribution each observation belongs to, by sex and age-group
gen decile= .
foreach i in `locallevel' {
xtile decile_temp=bp_before if levels ==`i', nq(10)
replace decile= decile_temp if missing(decile)
drop decile_temp
}
** Here is the final dataset:
** Patient number 39 is in the top decile of the distribution of bp_before within his gender and age-group: male patients aged 46-59. In the same group, patient number 40 instead belongs to the 7th decile of the distribution of bp_before.
How to write loops that consider sequential numbers (such as i-1 or i+1)
When variable names have a numeric component it is possible to combine them sequentially using loops. Below is a simple example.
Consider a dataset that records a firm's capital costs for every year and the variables have the following format: capital2000, capital2001, capital2002 etc. To construct, for example, two-year changes in capital one could use the following loop:
forvalues i = 0/10{
ge change`=`i'+2002' = capital`=`i'+2002' - capital`=`i'+2000'
}
Note that the above calculations could be better conducted by first reshaping the dataset and then using time-series operators; nevertheless, the procedure might be useful in other situations where reshaping is not a convenient option.
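For reference, a minimal sketch of the reshape alternative mentioned above; it assumes the data contain a firm identifier, here called firmid, which is not part of the original example:
reshape long capital, i(firmid) j(year)
xtset firmid year
** two-year change using the lag operator
ge change = capital - L2.capital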
How to rename variables that have part of their names in common:
Consider a dataset where variables names include a common component such as the year of reference (for example: consumption1998, income1998, age1998 etc.).
To change the year of reference in all variable names you can use the following procedure:
foreach x of varlist *1998{
local `x'_y = regexr("`x'", "1998", "2000")
rename `x' ``x'_y'
}
The command regexr substitutes "1998" with "2000" in each variable name. If you want to get rid of the common part ("1998") in all variable names:
foreach x of varlist *1998{
local `x'_y = regexr("`x'", "1998", "")
rename `x' ``x'_y'
}
How to select part of variable names that have a common prefix:
Consider a dataset with a long list of variables with a common prefix such as:
marketshare_a
marketshare_b
...
marketshare_z
To create a local containing the non-common part of the variable names (a, b, ..., z) one could use the following procedure.
First, create a local that contains the names of all variables:
local c ""
foreach x of varlist marketshare_a-marketshare_z{
local c = "`c' `x'"
}
The commands above will create the local: "marketshare_a marketshare_b marketshare_c ... marketshare_z"
To create a local that contains the unique part of the variable names:
local c=subinstr("`c'","marketshare_","",.)
di "`c'"
The commands above will create the local: "a b c ... z"
How to find a word in a string variable:
Consider a dataset where a variable contains a description and you want to classify observations on the basis of some elements of this description. In the "auto" dataset below, for example, we have a variable indicating the model of the car: Merc. Bobcat, Chev. Impala ...
use http://www.stata-press.com/data/r10/auto.dta
To create a variable that indicates whether the car is a Mercedes, for example, we can use the following command:
gen d_merc = strpos(make, "Merc")
This command produces a variable that takes the value zero if the word "Merc" does not appear in the string and, when the word is present, a number corresponding to the position at which it starts.
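If a 0/1 indicator is preferred to the position, the result can be converted directly; a minimal sketch (the variable name is just illustrative):
** equals 1 if "Merc" appears anywhere in make, 0 otherwise
gen byte is_merc = strpos(make, "Merc") > 0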
How to substitute blanks with underscores in string variables
This is very useful when one needs to reshape a dataset and the j variable is a string with spaces. It is possible to convert strings with spaces into Stata names using the following command:
ge stringwithnospace = strtoname(stringwithspace)
Blanks are substituted with underscores _ .
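A minimal sketch (the variable name and its content below are made up for illustration):
clear
set obs 1
ge stringwithspace = "market share 2000"
ge stringwithnospace = strtoname(stringwithspace)
** stringwithnospace now contains "market_share_2000"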
How to drop duplicate pairs
Consider a dataset that records data at the pair level, with variables reporter and partner; pairs are likely to be repeated (for example a-b and b-a). To drop duplicate pairs one can use the following commands:
** generate identifier that uses a unique code for identical pairs
egen pair1 = concat(reporter partner)
egen pair2 = concat(partner reporter)
replace pair1 = pair2 if reporter > partner
** keep only one pair
bysort pair1: keep if _n==1
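A self-contained sketch of the same commands with made-up toy data, purely for illustration:
clear
input str1 reporter str1 partner value
"a" "b" 10
"b" "a" 10
"a" "c" 5
end
egen pair1 = concat(reporter partner)
egen pair2 = concat(partner reporter)
replace pair1 = pair2 if reporter > partner
bysort pair1: keep if _n==1
** only one of the a-b / b-a rows survives, plus the a-c row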
How to use multiple datasets that have complex filenames?
Consider several datasets with the following filenames:
ISIC24_ITA_18aug2003
ISIC25_FRA_19aug2003
etc.
If one wants to merge or manipulate these datasets in a loop, it is possible to select each of them and modify their names with the following commands:
** fs needs to be installed: ssc install fs
** Select path to locate the folder where all files are stored
cd "C://Myfolder"
** the following command select all datasets containing the word "ISIC" in the folder indicated above
fs *ISIC*.dta
** Here we decompose the name of each dataset into its different elements.
foreach f in `r(files)'{
local sector=substr("`f'",5,2)
local country=substr("`f'",8,3)
di "`country'"
** here we open each dataset, modify some variables (if necessary) and give the dataset a temporary name with the format we consider useful.
use "`f'"
rename x y
tempfile data_`sector'_`country'
save `data_`sector'_`country''
}
The temporary datasets created in this way can then be used in subsequent loops (for example to merge or append them).
How to search a variable through several datasets?
I found this much easier than the approach I was using before (presented further below):
ssc install find
ssc install rcd
rcd "C:\Mydirectory" : find *.dta , match( VariableIamlookingfor ) show
also works with do-files:
rcd "C:\Mydirectory" : find *.do , match( VariableIamlookingfor ) show
The older approach: consider a folder with several datasets. To search for a variable through all the datasets one can use the following commands.
** Select the directory where the folder is located
cd "C:\Mydirectory"
** Create a local with all the filenames in the folder
fs *.dta
** Search through the datasets. In this example the variable we are looking for is called "commID". I generate a variable named "comm" so that the program does not stop if no variable matching comm* exists in a dataset.
foreach f in `r(files)'{
clear
use `f'
di "`f'"
ge comm = 1
codebook comm*, compact
}
** If the word you are looking for (e.g. district) appears in the variable label then you should use lookfor:
fs *.dta
foreach f in `r(files)'{
clear
use `f'
di "`f'"
lookfor district
}
How to import multiple worksheets of an excel file.
The following commands work only with Stata 12 and newer versions. With the option "describe" of the command "import excel" it is possible to see and store the names of all the worksheets in an Excel file:
import excel "path/excelfile.xlsx", describe
return list
Names are stored in r(worksheet_1), r(worksheet_2), etc. In this example the Excel file has five worksheets. To import all worksheets one can save each of them as a separate file (it is also possible to save them as Stata temporary files). Since r() results may be cleared by other commands, it is safer to copy the worksheet names into locals first.
forvalues i = 1/5{
local sheet`i' "`r(worksheet_`i')'"
}
forvalues i = 1/5{
clear
import excel "path/excelfile.xlsx", sheet("`sheet`i''")
save sheet`i', replace
}
** mmerge is a user-written command (install it with: ssc install mmerge)
use sheet1, clear
forvalues i = 2/5{
mmerge id using sheet`i'
}
How to import a table from a pdf into excel and Stata
a) Convert the pdf into MS word using an online application such as: convert.files
b) Often when converting a table, multiple lines in the same cell are separated by line breaks (indicated by ¶), which means that when copied into Excel they will appear as multiple rows.
c) Replace the line-break symbol with an empty space. In MS Word's Find and Replace, the ¶ mark is represented by the symbol ^p; replace "^p" with "" in MS Word.
d) Copy the table into Excel
e) Import into Stata using the command: import excel "table.xlsx", sheet("sheet1") firstrow
How to compare datasets in Stata
Sometimes you might have different versions of a dataset that you want to compare. Here is a simple way to do it.
use "DatasetA", clear
*Identifier should be the first variable in your dataset
order identifier
*Store variable names. "firstvar" is the first variable after the identifier while "lastvar" is the last variable in the dataset
local x
foreach var of varlist firstvar - lastvar {
local x `x' `var'
}
di "`x'"
*Rename variables to allow comparison
rename * z_*
*Rename the identifier back so that the merge command can use it
rename z_identifier identifier
*Merge the two new and old datasets
merge 1:1 identifier using "DatasetB"
*Compare all variables
foreach i of local x{
compare `i' z_`i'
}
How to append multiple datasets using fs
*Set folder where files are located
cd "mypath/mypath/myfolder/"
*Create local with names of all Stata files
fs *.dta
* Drop the name of the base file from the local to avoid appending it twice
local files ""
foreach f in `r(files)'{
local files = "`files' `f'"
}
di "`files'"
local drop "Name_of_file_to_drop.dta"
local files: list files - drop
di "`files'"
* Append files
use Name_of_file_to_drop.dta, clear
foreach f of local files{
append using `f'
}
Some tips on how to manipulate dates in Stata
*Here are some examples of how dates can appear in Stata (string format):
set obs 5
g date = "16/10/1979" in 1
replace date = "25/12/1985" in 2
replace date = "08/12/2010" in 3
replace date = "03-05-1985" in 4
replace date = "08-09-1983" in 5
* Days
g day = date(date, "DMY")
format day %td
* Weeks
g week = date(date, "DMY")
replace week = wofd(week)
format week %tw
* Months
g month = date(date, "DMY")
replace month = mofd(month)
format month %tm
* Quarters
g quarter = date(date, "DMY")
replace quarter = qofd(quarter)
format quarter %tq
* Semesters
g semester = date(date, "DMY")
replace semester = hofd(semester)
format semester %th
* Years
g year = date(date, "DMY")
replace year = yofd(year)
format year %ty