Stata Beginner's Guide

(Note: this page assumes that you know a little basic statistics.)

  1. Run Stata. The first step is to open a dataset to work with. In Stata for Windows or Mac OS, the easiest way is File -> Open. If you are using Stata for Unix, or want to write a .do file for your analysis, use the command use followed by the location of the dataset.

    1. use "C:\Documents and Settings\EFoster\My Documents\stata guide\nps_example.dta"
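The same step can be saved in a .do file so the analysis can be rerun later. The following is a minimal sketch, assuming a file called analysis.do (a hypothetical name) and the dataset path shown above:

```stata
* analysis.do (hypothetical file name, not part of the original guide)
* Load the example dataset; the clear option drops any data already in memory.
* Adjust the path to wherever your copy of nps_example.dta is stored.
use "C:\Documents and Settings\EFoster\My Documents\stata guide\nps_example.dta", clear

* First look at the variables in the dataset
describe
```

Typing do analysis.do in the Command window, with Stata's working directory set to the folder that holds the file, runs each line in order.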

  2. Next, use the command describe to see what kind of data you have.

    1. . describe
       Contains data from C:\Documents and Settings\EFoster\nps_example.dta
         obs:         1,095                          Sierra Leone 2005 National Public Services Survey
        vars:            30                          23 Nov 2007 14:43
        size:        59,130 (99.9% of memory free)   (_dta has notes)
       -------------------------------------------------------------------------------
                     storage  display    value
       variable name   type   format     label          variable label
       -------------------------------------------------------------------------------
       province        byte   %9.0g      provinces      province
       district        byte   %18.0g     districts      district
       localcouncil    int    %21.0g     localcouncils  Local Council Area
       ea_code         long   %12.0f                    enumeration area code
       hh_no           byte   %9.0g                     household number within EA
       stratum         byte   %8.0g      rural_urban    urban or rural
       ...
       srno            float  %9.0g
       -------------------------------------------------------------------------------
       Sorted by:  ea_code  hh_no
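describe also accepts a list of variables, which helps when a dataset has many columns. A short sketch using variable names taken from the output above; the notes command is included only because the header reports that _dta has notes:

```stata
* Describe only the variables of interest
describe province district stratum

* List the notes attached to the dataset
notes
```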

  3. To look at a categorical variable, use the command tab, which gives you a breakdown of the different values the variable takes on, with their absolute and relative frequencies.

    1. . tab religion
          religion |      Freq.     Percent        Cum.
       ------------+-----------------------------------
         Christian |        252       23.01       23.01
            Muslim |        837       76.44       99.45
             Other |          6        0.55      100.00
       ------------+-----------------------------------
             Total |      1,095      100.00
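Two common variations on tab, sketched here with variables that appear elsewhere in this guide (religion and gender). By default tab leaves out observations where the variable is missing; the missing option shows them as a separate row:

```stata
* Include missing values as their own category in the frequency table
tab religion, missing

* Two-way table of religion by gender, with row percentages
tab religion gender, row
```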

  4. For a continuous variable, sum shows you various useful facts such as the number of observations, the mean, the standard deviation, the minimum and the maximum.

    1. . sum age
           Variable |       Obs        Mean    Std. Dev.       Min        Max
       -------------+--------------------------------------------------------
                age |      1088     41.3557     15.4491         18         90

    2. Note that we only have 1088 observations for age so there are 7 observations where it is missing. The age of respondents in our dataset ranges from 18 to 90 with a mean of 41.4.
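The missing observations can also be counted directly, and the detail option of summarize adds percentiles and the variance. A brief sketch:

```stata
* Count the observations where age is missing
count if missing(age)

* Percentiles, variance, skewness and kurtosis in addition to the basics
summarize age, detail
```

Given the output above (1,095 observations in the dataset, 1088 with a value for age), count should report 7.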

  5. These two commands can be combined (actually we use tab with the sum option) to allow us to look at average age by religion.

    1. . tab religion, sum(age)
                   |          Summary of age
          religion |        Mean   Std. Dev.       Freq.
       ------------+------------------------------------
         Christian |   39.458167   15.367018         251
            Muslim |          42   15.440129         831
             Other |        31.5    11.84483           6
       ------------+------------------------------------
             Total |   41.355699   15.449099        1088
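An alternative way to get a similar table is tabstat, which lets you choose which statistics to report. This is a sketch of one possible call, not the approach used in the original guide:

```stata
* Mean, standard deviation and nonmissing count of age within each religion
tabstat age, by(religion) statistics(mean sd n)
```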

  6. Now that we've had a look at our data, let's do some basic statistics. Suppose we want to do a t test of the hypothesis that the average age of male and female respondents is the same. We'll use the command ttest.

    1. . ttest age, by(gender)
       Two-sample t test with equal variances
       ------------------------------------------------------------------------------
       Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
       ---------+--------------------------------------------------------------------
           male |     559    43.65653     .663978    15.69855    42.35233    44.96073
         female |     529    38.92439    .6439895    14.81176    37.65929    40.18948
       ---------+--------------------------------------------------------------------
       combined |    1088     41.3557    .4683696     15.4491    40.43669    42.27471
       ---------+--------------------------------------------------------------------
           diff |            4.732144    .9264646                2.914281    6.550007
       ------------------------------------------------------------------------------
           diff = mean(male) - mean(female)                              t =   5.1077
       Ho: diff = 0                                     degrees of freedom =     1086
           Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
        Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

    2. This command produces a lot of output; the most important parts are the group means and the two-sided p-value. The average age of men is 43.7 and of women is 38.9. The two-sided p-value, Pr(|T| > |t|), is reported as 0.0000, so we reject the hypothesis that male and female respondents have the same average age.
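To see where the t statistic comes from, the numbers in the diff row can be plugged into Stata's display command. This is a sketch using values typed in from the output above; the last line is an optional variation, not part of the original analysis:

```stata
* t statistic = difference in means / standard error of the difference
display 4.732144 / .9264646

* Two-sided p-value from the t distribution with 1086 degrees of freedom
display 2 * ttail(1086, 5.1077)

* Same test without assuming equal variances in the two groups
ttest age, by(gender) unequal
```

The first line should come out close to the reported t = 5.1077, and the second should be a very small number, which Stata rounds to 0.0000 in the table.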

  7. Next, let's run a regression. Since most of our respondents are the heads of their households (or their spouses), we would expect older respondents to have more children and therefore bigger households. Let's regress household size on the age of the respondent to see if this is true, using the command regress (abbreviated reg below).

    1. . reg hhsize age, r
       Linear regression                                      Number of obs =    1076
                                                              F(  1,  1074) =   10.31
                                                              Prob > F      =  0.0014
                                                              R-squared     =  0.0140
                                                              Root MSE      =  4.6486
       ------------------------------------------------------------------------------
                    |               Robust
             hhsize |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
       -------------+----------------------------------------------------------------
                age |   .0358748     .011174     3.21   0.001     .0139494    .0578001
              _cons |   6.013781    .4596593    13.08   0.000     5.111849    6.915714
       ------------------------------------------------------------------------------

    2. (The option , r specifies that we want robust standard errors.) This command estimates that hhsize = 6.01 + 0.036 * age. The coefficient on age is positive (older respondents have bigger households on average) as we expected and statistically significant (p-value of 0.001).
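The estimated equation can be used to predict household size at a given age. A small sketch, run immediately after the regression above; the age of 40 and the variable name hhsize_hat are illustrative choices, not from the original guide:

```stata
* Predicted household size for a 40-year-old, typed in from the output above
display 6.013781 + .0358748 * 40

* Same calculation from the stored coefficients (run right after reg)
display _b[_cons] + _b[age] * 40

* Fitted values for every observation; hhsize_hat is a hypothetical name
predict hhsize_hat
```

Both display lines should give roughly 7.45, meaning the model predicts a household of about seven to eight people for a 40-year-old respondent.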

  8. To explore the relationship between age and household size, we might want to fit a quadratic model, that is, estimate an equation of the form hhsize = a + b1 × age + b2 × age^2. To do this, we need to create a variable equal to age squared and add it to the regression. To create a new variable, we use the command generate (abbreviated gen below). (To create more complicated new variables you'll also need replace.)

    1. . gen age2 = age*age
       (7 missing values generated)
       . reg hhsize age age2, r
       Linear regression                                      Number of obs =    1076
                                                              F(  2,  1073) =    5.37
                                                              Prob > F      =  0.0048
                                                              R-squared     =  0.0143
                                                              Root MSE      =  4.6501
       ------------------------------------------------------------------------------
                    |               Robust
             hhsize |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
       -------------+----------------------------------------------------------------
                age |   .0111948    .0614485     0.18   0.855    -.1093781    .1317677
               age2 |   .0002617    .0006779     0.39   0.700    -.0010685    .0015919
              _cons |   6.524329     1.28781     5.07   0.000     3.997417    9.051241
       ------------------------------------------------------------------------------

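    2. (Note that the coefficient on age squared is not statistically significant, with a p-value of 0.700, and the R-squared barely changes, from 0.0140 to 0.0143, so there is no evidence that the quadratic model fits the data better.)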
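As an illustration of how generate and replace work together for a more complicated variable, here is a sketch that builds an age-group variable. The name agegroup and the cutoffs of 30 and 50 are hypothetical choices made for this example:

```stata
* Start with missing everywhere, then fill in the categories one by one.
* agegroup is a hypothetical variable name used only for illustration.
gen agegroup = .
replace agegroup = 1 if age < 30
replace agegroup = 2 if age >= 30 & age < 50
* Stata treats missing as larger than any number, so exclude it explicitly
replace agegroup = 3 if age >= 50 & !missing(age)

* Check the result
tab agegroup
```

Without the !missing(age) condition, the observations with missing age would be placed in the oldest group, because a missing value compares as greater than any number in Stata.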

  9. Let's explore our data graphically now and create a histogram of household sizes.

    1. . histogram hhsize
       (bin=30, start=0, width=1.1666667)
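By default, histogram picked 30 bins of width 1.17, which do not line up with whole numbers. Assuming hhsize only takes whole-number values, as a count of household members would, one possible refinement is:

```stata
* One bar per household size, with counts instead of densities on the y-axis
histogram hhsize, discrete frequency
```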

  10. Now let's look at a scatter plot of household size versus the respondent's age.

    1. . scatter hhsize age

    2. (This is not a very useful graphic, and the relationship between age and household size that we saw in our regression does not show up plainly here.)
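One way to make the relationship easier to see is to overlay the fitted regression line on the scatter plot. This is a sketch of one option rather than the original author's method; the jitter amount and the output file name are arbitrary choices:

```stata
* Scatter plot with a little random jitter so overlapping points are visible,
* overlaid with the fitted line from a regression of hhsize on age
twoway (scatter hhsize age, jitter(3)) (lfit hhsize age)

* Save the graph; the file name is a hypothetical example
graph export "hhsize_age.png", replace
```

Because both variables take whole-number values, many points sit exactly on top of each other; jitter spreads them slightly so dense areas stand out. PNG export may not be available in console-only versions of Stata.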
