Stata tips

These are some Stata commands that have been helpful to me. I list some commands here because I occasionally forget how to do them, and I want to have my commonly-used tricks in one place. Acknowledgment: A lot of this information comes from browsing at Statalist.


Search string variables for some particular value (not an exact match)

list var if regexm(var, "word")

Where "word" is the value you are looking for, even if it's a small part of the entry. For example, if you look at a name variable, including middle and last names, and want to find every observation of a person with the first name "John," regardless of their middle or last name:

list name if regexm(name, "John")

The command regexm has a number of helpful options.


Create a year variable from a string date variable

If you have dates in whatever string format, such as 17/9/2000, and you want a non-string variable containing only the year, you first convert the variable into Stata Internal Form (SIF), then extract the year. It's only two commands:

gen double dateSIF = date(stringdate, "DMY")

gen newyearvariable = year(dateSIF)

Where stringdate is your original variable, dateSIF is it formatted, and newyearvariable is self-explanatory. The specificiation "DMY" lets Stata know the data is in day-month-year format, but you can do MDY and many other formats. Search Stata's datetime for more.


Delete partial contents of a string variable

replace month = subinstr(month, "/", "",.)

Where you have a month variable and want to take the diagonals or slashes out of every observation.


To “fill in” or “fill down” time-series data

Where id is the identifier (country code, personal ID, etc.), and var is the variable.

bysort id: replace var=var[_n-1]

You might want to address missing data with this modification:

bysort id: replace var=var[_n-1] if var==. & var[_n-1]!=.

To fill “up”

bysort id: replace var=var[_n+1]

Cannot tsset or merge because of repeated values, or merge variable “does not uniquely identify observations”

Where id is the identifier variable, and timevar is the time variable (year, etc.)

gen problem = .

bysort id: replace problem = 1 if timevar==timevar[_n-1]

Remove commas from numbers

replace var = subinstr(var,",", "",.)

Of course, this is easily done in Excel, too. But subinstr can be used for many mass changes to variables.


Limit the number of iterations of a model

For 10 iterations, for example. After the command, type:

iter(10)

Use this with caution, because of course if a model won’t converge, the model or the data probably have a problem.

Scientific notation, making variable values their actual value

If it’s 12 digits…

format var %12.0f

Need a quantity in "per 100,000 inhabitants" format

It's not a Stata issue, but it comes up.

gen homicides100th = homicides/(population/100000)


Round up or round down a non-integer variable

gen varround = round(var)

gen varint = int(var)


Create a quarterly date variable (not a string)

bysort id: generate date = tq(2003q1) + _n-1

format %tq date


Create a cumulative count variable with time series data

by id: gen cumulative_attacks=sum(attacks)


Create year dummies for all years

tab year, gen(y)


Identify the first observation in a group (panel data)

For example if you want to create a variable indicating the first year that a unit (country, person, terrorist group, etc.) is observed:

gen firstyear=.

bysort id: replace firstyear = year if _n==1


Identify the last observation in a group (panel data)

Following the previous example:

gen lastyear = 1 if id!=id[_n+1]


Plot the time-series trends of a lot of units with a plot or panel for each

xtline var, t(timevariable) i(id)

If you want to see all the lines in a single plot/panel, just add "overlay" at the end.

Home CV Research Data Teaching Stata tips Contact

***