Stata tips

How to create a table with t-tests (option 1)

sysuse auto.dta, clear

reg mpg foreign

tabstat mpg, by(foreign)

ttest mpg, by(foreign)

** use regressions to get statistical significance

foreach i of varlist price mpg weight length {

reg `i' foreign

** collect results by storing coefficients as scalars

estadd scalar Domestic = _b[_cons]

estadd scalar Foreign = _b[_cons] + _b[foreign]

estadd scalar Difference = _b[foreign]

estimates store `i'

}

** display results

esttab price mpg weight length, nose nogaps b(%9.3f) star(* 0.1 ** 0.05 *** 0.01) scalar( Domestic Foreign Difference Observations)

** export results

esttab price mpg weight length using ttest, csv replace nose nogaps b(%9.3f) star(* 0.1 ** 0.05 *** 0.01) scalar(Domestic Foreign Difference N)

With few manipulations in Excel you can obtain the table on the right.

How to create a table with t-tests (option 2) - Multiple categories

use http://www.stata-press.com/data/r10/bpwide.dta

table agegrp sex, stat(mean bp_before) nototal

** use regressions to get statistical significance

reg bp_before i.agegrp if sex==1 // women

** collect results by using the command nlcom, use post to store them

nlcom (type1: _b[_cons]) ///

(type2: _b[_cons] + _b[2.agegrp]) ///

(type3: _b[_cons] + _b[3.agegrp]) ///

(diff1: _b[2.agegrp]) ///

(diff2: _b[3.agegrp]) ///

(diff3: _b[3.agegrp] - _b[2.agegrp]), ///

post

estimates store ttestf

reg bp_before i.agegrp if sex==0 // men

nlcom (type1: _b[_cons]) ///

(type2: _b[_cons] + _b[2.agegrp]) ///

(type3: _b[_cons] + _b[3.agegrp]) ///

(diff1: _b[2.agegrp]) ///

(diff2: _b[3.agegrp]) ///

(diff3: _b[3.agegrp] - _b[2.agegrp]), ///

post

estimates store ttestm

** display results

esttab ttest* , star(* 0.1 ** 0.05 *** 0.01) noomitt not nogaps scalar( ) b(%9.3f) nocons mtitle("Women" "Men")

** export results

esttab ttest* using ttest_part2 , csv replace star(* 0.1 ** 0.05 *** 0.01) noomitt not nogaps scalar( ) b(%9.3f) nocons mtitle("Women" "Men")

With few manipulations in Excel you can obtain the table on the right.

How to drop variables that have all missing values:

use http://www.stata-press.com/data/r10/auto.dta

**here I create a variable with all missing values

replace rep78= .

dropmiss, force

How to drop variables satisfying a condition:

use http://www.stata-press.com/data/r10/auto.dta

global allvar "price mpg trunk"

foreach x of global allvar{

sum `x', meanonly

if r(mean) <= 20 {

display in smcl as text "dropped {result}`x' "

drop `x'

}

How to restore variables' labels after collapsing a dataset:

use http://www.stata-press.com/data/r10/auto.dta

local allvar "price mpg trunk"

** storing labels

foreach x of local allvar {

local l`x': variable label `x'

}

collapse price mpg trunk, by(rep78)

**restoring the labels

foreach x of local allvar {

label var `x' "`l`x''"

}

How to generate a variable that indicates the names of the variables with missing data for each observation:

Outcome:

use http://www.stata-press.com/data/r10/auto.dta

local allvar "price mpg rep78"

** Here I create some missing data-points

replace price = . if price<4600

ge missingvars = ""

foreach x of local allvar {

replace missingvars = missingvars + "`x'" + "; " if `x' ==.

}

How to create an excel file that reports the missing data-points of a dataset:

use http://www.stata-press.com/data/r10/auto.dta

** here I create some missing data-points

replace price = . if price<4600

** We are interested in identify the unit of observation that has at least one missing value in the 3 variables: price, mpg and rep78

egen missing = rownonmiss(price mpg rep78)

keep if missing < 3

outsheet using Missing.csv, c replace

How to produce summary statistics (or regressions) for each category of a variable:

- or How to create a local that contains the values taken by a variable (levelsof)

use http://www.stata-press.com/data/r10/auto.dta

** Here I create a string variable with 3 entry: low, middle and high to be used later in the example

ge category = "low" if price < 3900

replace category = "middle" if price>=3900 & price<8000

replace category = "high" if price>=8000

** Here I create a local that contains the element of the variable category: "low middle high"

** This command is useful when the variable contains a long list of countries, for example, and one is interested in producing a graph or a particular statistic

** for each of them and the option "by" is not available.

levelsof category, local(categ)

foreach i of local categ{

sum mpg if category=="`i'"

}

This command works also for regressions:

levelsof category, local(categ)

foreach i of local categ{

reg mpg price weight length if category=="`i'"

}

How to create a variable that lists the elements (countries, firms, etc.) satisfying certain conditions:

use http://www.stata-press.com/data/r10/auto.dta

** the local "makeless3900" lists all the makers that have a price lower than 3900, names are separated by a comma

levelsof make if price < 3900, local(makeless3900) separate(,) clean

** the variable "makeless3900" lists all the makers that have a price lower than 3900 (separated by a comma):

ge makeless3900 = "`makeless3900'"

How to extract a portion of a string variable:

The easiest way to extract a portion of a string variable is to first map the content of the string variable into its various components. This can be done using the command regexm.

Below I report some examples.

To extract one of the components one should use the command: regexs: ge newvar = regexs(#component) if (Map))

- to extract the first component: ge newvarname = regexs(1) if (regexm(varname,"([0-9][0-9])[/]([0-9][0-9])[/]([0-9][0-9][0-9][0-9])")) gives you the string 10 (using the first example)

- to extract the second component: ge newvarname = regexs(2) if (regexm(varname,"([0-9][0-9])[/]([0-9][0-9])[/]([0-9][0-9][0-9][0-9])")) gives you the string 07 (using the first example)

- etc.

If the structure of the string variable varies across observations then the method above cannot be applied. There are other ways in which one can extract a portion of a string variable; these can be found here: http://www.ats.ucla.edu/stat/stata/faq/regex.htm

-> 8b. How to extract the first or the last portion of a string variable:

a. To select the the first NUMERIC part of a string, for example the first 3 numbers of 3340098:

ge newvarname = regexs(1) if (regexm(varname,"(^[0-9][0-9][0-9])"));

the last 3 numbers: ge newvarname = regexs(1) if (regexm(varname,"([0-9][0-9][0-9]$)"))

b. To select the the first LETTERS of a string, for example the first 3 letters of Lon0098:

ge newvarname = regexs(1) if (regexm(varname,"(^[a-zA-z][a-zA-z][a-zA-z])"));

the last 3 letters: ge newvarname = regexs(1) if (regexm(varname,"([a-zA-z][a-zA-z][a-zA-z]$)"))

c. To select the the first LETTERS or NUMERIC characters of a string, for example the first 3 elements of A3b0098:

ge newvarname = regexs(1) if (regexm(varname,"(^[0-9a-zA-z][0-9a-zA-z][0-9a-zA-z])"));

the last 3 elements: ge newvarname = regexs(1) if (regexm(varname,"([0-9a-zA-z][0-9a-zA-z][0-9a-zA-z]$)"))

How to create a variable indicating the percentile of the distribution of a variable by categories:

use http://www.stata-press.com/data/r10/bpwide.dta

In this dataset people are categorized by gender and age-group. I intend to create a variable indicating to which percentile of the distribution of a variable (bp_before) a person belongs to within their gender and age-group.

Here is an example of the data:

** Here I create a variable that gives a unique identifier to each combination of sex and age-group.

egen levels = group(sex agegrp)

** Here I create a local containing all the unique elements of the variable levels

levelsof levels, local(locallevel)

** Here I create a variable that indicate the decile of the distribution each observation belongs to by sex and age-group

gen decile= .

foreach i in `locallevel' {

xtile decile_temp=bp_before if levels ==`i', nq(10)

replace decile= decile_temp if missing(decile)

drop decile_temp

}

** Here is the final dataset:

** Patient number 39 is in the top decile of the distribution of bp_before within his gender and age-group: male patients aged 46-59. In the same group, patient number 40 belongs instead to the 7th decile of the distribution of db_before.

How to write loops that consider sequential numbers (such as i-1 or i+1)

When variable names have a numeric component it is possible to combine them sequentially using loops. Below is a simple example.

Consider a dataset that records a firm's capital costs for every year and the variables have the following format: capital2000, capital2001, capital2002 etc. To construct, for example, biannual changes in capital one could use the following loop:

forvalues i = 0/10{

ge change`=`i'+2002'

= capital`=`i'+2002' - capital`=`i'+2000'

}

Note that the above calculations could be better conducted by first reshaping the dataset and then using time-series operator, nevertheless the procedure might be useful in other situations where a reshape is not convenient option.

How to rename variables that have part of their names in common:

Consider a dataset where variables names include a common component such as the year of reference (for example: consumption1998, income1998, age1998 etc.).

To change the year of reference in all variable names you can use the following procedure:

foreach x of varlist *1998{

local `x'_y = regexr("`x'", "1998", "2000")

rename `x' ``x'_y'

}

the command regexr will substitute all "1998" with "2000". If you want to get read of the common part ( "1998") from all variable names:

foreach x of varlist *1998{

local `x'_y = regexr("`x'", "1998", "")

rename `x' ``x'_y'

}

How to select part of variable names that have a common prefix:

Consider a dataset with a long list of variables with a common prefix such as:

marketshare_a

marketshare_b

...

marketshare_z

To create a local containing the non-common part of the variable names (a, b, ..., z) one could use the following procedure:

To create a local that contains the names of all variables:

local c ""

foreach x of varlist marketshare_a-marketshare_z{

local c = "`c' `x'"

}

The commands above will create the local: "marketshare_a marketshare_b marketshare_c ... marketshare_z"

To create a local that contains the unique part of the variable names:

local c=subinstr("`c'","marketshare_","",.)

di "`c'"

The commands above will create the local: "a b c ... z"

How to find a word in a string variable:

Consider a dataset where a variable contain a description and you want to classify an observation on the basis of some elements of this description. In the "auto" dataset below, for example, we have a variable indicating the model of the car: Merc. Bobcat, Chev. Impala ...

use http://www.stata-press.com/data/r10/auto.dta

To create a variable that indicates whether the car is a Mercedes, for example, we can use the following command:

gen d_merc = strpos(make, "Merc")

This command will produce a variable that takes value zero if the word "Merc" does not appear in the string and a number corresponding to the position of the word in the string when the word is present.

How to substitute blank with underscore in strings with blanks

This is very useful when one needs to reshape a dataset and the j variable is a string with spaces. It is possible to convert strings with spaces into Stata names using the following command:

ge stringwithnospace = strtoname(stringwithspace)

Blanks are substituted with underscores _ .

How to drop duplicate pairs

Considering a dataset that record data at pair level, pairs are likely to be repeated (for example a-b and b-a). To drop duplicate pairs one can use the following command:

** generate identifier that uses a unique code for identical pairs

egen pair1 = concat(reporter partner)

egen pair2 = concat(partner reporter)

replace pair1 = pair2 if reporter > partner

** keep only one pair

bysort pair1: keep if _n==1

How to use multiple datasets that have complex filenames?

Consider several datasets with the following filenames:

ISIC24_ITA_18aug2003

ISIC25_FRA_19aug2003

etc.

If one wants to merge or manipulate these datasets using a macro it is possible to select each of them and modify their names with the following commands:

** fs needs to be installed: ssc install fs

** Select path to locate the folder where all files are stored

cd "C://Myfolder"

** the following command select all datasets containing the word "ISIC" in the folder indicated above

fs *ISIC*.dta

** Here we decompose the name of each dataset into its different elements.

foreach f in `r(files)'{

local sector=substr("`f'",5,2)

local country=substr("`f'",8,3)

di "`country'"

** here we open each dataset, modify some variables (if necessary) and give the dataset a temporary name with the format we consider useful.

use "`f'"

rename x y

tempfile data_`sector'_`country'

save `data_`sector'_`country''

}

Now the temporary dataset so created can be used in standard macro routines.

How to search a variable through several datasets?

I found this much easier way that the one I was using before (presented further below):

ssc install find

ssc install rcd

rcd "C:\Mydirectory" : find *.dta , match( VariableIamlookingfor ) show

also works with do-files:

rcd "C:\Mydirectory" : find *.do , match( VariableIamlookingfor ) show

Consider a folder with several datasets. To search for a variable through all the datasets one can use the following command.

** Select the directory where the folder is located

cd "C:\Mydirectory"

** Create a local with all the filenames in the folder

fs *.dta

** Search through the datasets. In this example the variable we are looking for is called "commID". I generate a variable named "comm" so that the program does not stop if it cannot find "commID".

foreach f in `r(files)'{

clear

use `f'

di "`f'"

ge comm = 1

codebook comm*, compact

}

** If the word you are looking for (e. g. district) appears in the variable label then you should use lookfor:

foreach f in `r(files)'{

clear

use `f'

di "`f'"

lookfor district

}

How to import multiple worksheets of an excel file.

The following commands work only with Stata12 and newer versions. With the option "describe" of the command "import excel" it is possible to see and store the names of all the worksheets in an excel file:

import excel "path/excelfile.xlsx", describe

return list

Names are store in locals: r(worksheet_1), r(worksheet_2) etc. In this example the excel file has 5 worksheet. To import all worksheets one needs to save each of them as a separate file (it is also possible to save them as Stata temporary files).

forvalues i = 1/5{

clear

import excel "excelfile.xlsx", sheet('r(worksheet`i''))

save sheet`i', replace

}

clear

forvalues i = 2/5{

use sheet1

mmerge id using sheet`i'

}

How to import a table from a pdf into excel and Stata

a) Convert the pdf into MS word using an online application such as: convert.files

b) Often when converting a table, multiple lines in the same cells are separated by line-breaks (indicated by ¶) which means that when copied into excel they will appear as multiple rows.

c) Replace the line-breaks symbol with an empty space. MS Word allows to represent Line-breaks using the symbol ^p. Replace "^p" with "" in MS word.

d) Copy the table in excel

e) Import into Stata using the command: import excel "table.xlsx", sheet("sheet1") firstrow

How to compare datasets in Stata

Sometimes you might have different versions of a dataset that you want to compare. Here is a simple way to do it.

use "DatasetA", clear

*Identifier should be the first variable in your dataset

order identifier

*Store variable names. "firstvar" is the first variable after the identifier while "lastvar" is the last variable in the dataset

local x

foreach var of varlist firstvar - lastvar {

local x `x' `var'

}

di "`x'"

*Rename variables to allow comparison

rename * z_*

*Check identifier to us merge command

rename ne_identifier identifier

*Merge the two new and old datasets

merge 1:1 identifier using "DatasetB"

*Compare all variables

foreach i of local x{

compare `i' z_`i'

}

How to append multiple datasets using fs

*Set folder when files are located

cd "mypath/mypath/myfolder/"

*Create local with names of all Stata files

fs *.dta

* Drop name of file from local to avoid double entry

local files ""

foreach f in `r(files)'{

local files = "`files' `f'"

}

di "`files'"

local drop "Name_of_file_to_drop.dta"

local files: list files- drop

di "`files'"

* Append files

use Name_of_file_to_drop.dta, clear

foreach f of local files{

append using `f'

}

Some tips on how to manipulate dates in Stata

*Here are some examples of how dates can appear in Stata (string format):

set obs 5

g date = "16/10/1979" in 1

replace date = "25/12/1985" in 2

replace date = "08/12/2010" in 3

replace date = "03-05-1985" in 4

replace date = "08-09-1983" in 5

* Days

g day = date(date, "DMY")

format day %td

* Weeks

g week = date(date, "DMY")

replace week = wofd(week)

format week %tw

* Months

g month = date(date, "DMY")

replace month = mofd(month)

format month %tm

* Quarters

g quarter = date(date, "DMY")

replace quarter = qofd(quarter)

format quarter %tq

* Semesters

g semester = date(date, "DMY")

replace semester = hofd(semester)

format semester %th

* Years

g year = date(date, "DMY")

replace year = yofd(year)

format year %ty

Working with variable names

forvalues i = 0/10{

ge change`=`i'+2002' = capital`=`i'+2002' - capital`=`i'+2000'

}

Google Sites

Report abuse