binscatterhist (Stata)

Program availability

Binscatterhist (version 2.2) can be installed from the SSC repository by running the command: ssc install binscatterhist

Feedback and bug reports are greatly appreciated.

Main features

The binscatter package (Stepner, 2013) is one of the most useful data-analysis tools in Stata (StataCorp, 2019) and is widely used in the social sciences (e.g. Ash et al., 2020; Chetty et al., 2013). Scatterplots (twoway scatter) visualize the relationship between two variables, but can be hard to read when the sample is large. Binscatter solves this issue by collapsing observations into bins and fitting a regression line. The cost of such a compact representation is a loss of information on the plotted variables. We therefore propose a modified version of the program: binscatterhist. Binscatterhist compensates for this loss of information by allowing the user to plot the underlying distribution of the variable(s). Similar programs already exist for other statistical software, e.g. ggscatterhist for R or scatterhist for MATLAB.

The figure above shows an example binscatterhist of wage and tenure from the publicly available National Longitudinal Survey of Women 1988 (Stata dataset nlsw88). The specification includes grade fixed effects and robust standard errors. The graph shows a positive correlation between wage and tenure, with a slope of .15 (standard error .02), significant at the 1% level, and a sample size of 2228. By default, binscatterhist creates 20 equal-sized bins, so where the plotted points lie close together, as on the left side of the plot, the underlying density of observations is higher. The horizontal distance between points is, however, hard to read at a glance, especially when comparing points that are far apart or at very different heights on the y-axis. Binscatterhist addresses this by adding two histograms to the graph that show the variables' distributions. In the example, they show more clearly that tenure is skewed and that wage is also slightly skewed, with several high-frequency bins.

Binscatterhist generates binned scatterplots, a non-parametric visualization of the relationship between two variables. The essential features of binscatter remain unchanged: the program "groups the x-axis variable into equal-sized bins, computes the mean of the x-axis and y-axis variables within each bin, then creates a scatterplot of these data points" (Stepner, 2013). The novel features and the differences from the original program are listed below.
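
The quoted procedure can be sketched by hand with standard Stata commands. This is only an illustration of the idea, not the package's internal code: it assumes xtile's quantile cut-offs as the binning rule, tenure_bin is a variable name introduced here, and ties in tenure may leave bins of slightly unequal size.

```stata
* Illustrative sketch only; not the internal code of binscatter or binscatterhist
webuse nlsw88, clear

* keep observations where both plotted variables are observed
keep if !missing(wage, tenure)

* group the x-axis variable into 20 quantile-based bins (tenure_bin is a hypothetical name)
xtile tenure_bin = tenure, nquantiles(20)

* replace the data with the mean of x and y within each bin
collapse (mean) wage tenure, by(tenure_bin)

* scatterplot of the bin means
twoway scatter wage tenure
```

Note that the sketch plots only the bin means. A fit line would be estimated on the underlying observations before collapsing rather than on the collapsed bin means.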

Residualization method

When controls() or absorb() is specified, binning is performed after residualization. If absorb() is specified, binscatterhist residualizes with reghdfe (Correia, 2019) by default; areg can be used instead by adding the option regtype(areg). When controls are included without fixed effects, residualization is performed via a standard regression.
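
Residualizing before binning can be sketched with standard commands for the case of controls without fixed effects. The sketch assumes age as an illustrative control and assumes that the sample mean is added back to each residual so that the axes keep their original scale, as the original binscatter documents; the variable names wage_res, tenure_res and tenure_bin are introduced here and are not part of the package.

```stata
* Illustrative sketch only; age is an arbitrary control chosen for the example
webuse nlsw88, clear
keep if !missing(wage, tenure, age)

* residualize y and x on the control, then add back each variable's sample mean
foreach v of varlist wage tenure {
    quietly regress `v' age
    predict double `v'_res, residuals
    quietly summarize `v'
    quietly replace `v'_res = `v'_res + r(mean)
}

* bin on the residualized x variable and plot the bin means
xtile tenure_bin = tenure_res, nquantiles(20)
collapse (mean) wage_res tenure_res, by(tenure_bin)
twoway scatter wage_res tenure_res
```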

Estimation results and sample size reporting

When the options coefficient(#) and sample are specified, binscatterhist adds a text box reporting the coefficient of the fit line (rounded to the nearest #), its standard error, and the sample size. The text box adjusts its position according to the sign of the slope.
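
The numbers shown in the text box can be cross-checked by running the regression directly. A minimal sketch, assuming a specification with grade fixed effects and robust standard errors, and using areg rather than reghdfe so that no additional package is needed; the standard error may differ slightly from the one reghdfe reports.

```stata
* Illustrative cross-check; not part of binscatterhist itself
webuse nlsw88, clear

* wage on tenure with grade fixed effects and heteroskedasticity-robust standard errors
areg wage tenure, absorb(grade) vce(robust)

* slope and standard error shown with two decimals, plus the estimation sample size
display "slope = " %4.2f _b[tenure] " (" %4.2f _se[tenure] "), N = " e(N)
```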

Clustered and robust standard errors

The program allows for robust or clustered standard errors. These enter the estimation at two points. First, clustered standard errors indirectly affect the residualization, because the sample may change if the cluster variable is missing for some observations. Second, they are used in the estimation of the fit line, so the standard error and sample size reported in the text box are adjusted accordingly.
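
The effect of a cluster variable on the sample can be seen with a plain regression. The sketch below uses industry as an arbitrary cluster variable, chosen only for illustration, and compares the estimation sample with and without clustering; the locals n_robust and n_cluster are names introduced here.

```stata
* Illustrative sketch; industry is an arbitrary cluster variable chosen for the example
webuse nlsw88, clear

* robust standard errors: sample defined by wage and tenure only
quietly regress wage tenure, vce(robust)
local n_robust = e(N)

* clustered standard errors: observations with a missing cluster variable are dropped
quietly regress wage tenure, vce(cluster industry)
local n_cluster = e(N)

display "N (robust) = `n_robust', N (clustered on industry) = `n_cluster'"
```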

Examples of usage

The following examples use the National Longitudinal Survey of Women 1988, which can be loaded with the command: webuse nlsw88

Code: binscatterhist wage tenure

With no options specified, binscatterhist produces the same output as binscatter.

Code: binscatterhist wage tenure, histogram(tenure)

When the histogram() option is added, binscatterhist produces a histogram for the requested variable.






Code: binscatterhist wage tenure, histogram(wage tenure) ymin(4)

Here, histograms are requested for both variables. The ymin() option (and, analogously, xmin()) fixes the misplacement of the x-axis histogram by telling binscatterhist where to set the histogram's baseline on the y-axis.

Code: binscatterhist wage tenure, histogram(wage tenure) ymin(4) yhistbarwidth(50) xhistbarwidth(50) ybstyle(outline) xbstyle(outline)

The xbstyle(outline) and ybstyle(outline) options draw the histogram bars as outlines, which gives a simpler visualization of the histograms.

Code: binscatterhist wage tenure, histogram(wage tenure) ymin(4) xhistbarheight(15) yhistbarheight(15) xhistbins(40) yhistbins(40)

Binscatterhist also allows the height of the bars and the number of histogram bins to be adjusted. Here, a greater bar height and a higher number of bins are set.


Code: binscatterhist wage tenure, absorb(grade) vce(robust) coef(0.01) sample xmin(-2.2) ymin(5) histogram(wage tenure) xhistbarheight(15) yhistbarheight(15) xhistbins(40) yhistbins(40)

When fixed effects are included, binscatterhist runs a reghdfe. The coefficient() and sample options add a text box reporting the slope of the fit line, its standard error, and the sample size. Standard errors are robust to heteroskedasticity.


Code:

replace tenure=-tenure

binscatterhist wage tenure, regtype(areg) absorb(grade) vce(robust) coef(0.01) sample xmin(-22) ymin(5) histogram(wage tenure) xhistbarheight(15) yhistbarheight(15) xhistbins(40) yhistbins(40)

When the slope of the fit line is negative, the text box automatically adjusts its positioning. The option regtype(areg) runs an areg instead of a reghdfe to create the residuals.


References

Ash, E., Galletta, S., Hangartner, D., Margalit, Y., Pinna, M., 2020. The Effect of Fox News on Health Behavior During COVID-19. SocArXiv abqe5. Center for Open Science. URL: https://EconPapers.repec.org/RePEc:osf:socarx:abqe5.

Chetty, R., Friedman, J.N., Saez, E., 2013. Using differences in knowledge across neighborhoods to uncover the impacts of the EITC on earnings. American Economic Review 103, 2683–2721. URL: https://www.aeaweb.org/articles?id=10.1257/aer.103.7.2683, doi:10.1257/aer.103.7.2683.

Correia, S., 2019. REGHDFE: Stata module to perform linear or instrumental-variable regression absorbing any number of high-dimensional fixed effects. URL: https://EconPapers.repec.org/RePEc:boc:bocode:s457874.

Jann, B., 2014. ADDPLOT: Stata module to add twoway plot objects to an existing twoway graph. Statistical Software Components, Boston College Department of Economics. URL: https://ideas.repec.org/c/boc/bocode/s457917.html.

StataCorp, 2019. Stata Statistical Software: Release 16. College Station, TX: StataCorp LLC.

Stepner, M., 2013. BINSCATTER: Stata module to generate binned scatterplots. Statistical Software Components, Boston College Department of Economics. URL: https://ideas.repec.org/c/boc/bocode/s457709.html.