Data Analysis & Coursework

The "clean" payoff to education and income inequality in China (ongoing research)

posted Oct 22, 2015, 1:50 AM by Jiqun Liu   [ updated Oct 25, 2015, 10:26 PM ]

What does a "clean" payoff mean?

It is widely acknowledged that education has a significant effect on one's salary or income, and this conclusion has been supported by an extensive body of empirical studies. However, reverse causation is almost inevitable if we estimate the payoff to education solely by OLS regression. That is, one's wage can also influence the time spent on education, which biases the coefficient estimate. For example, an increase in one's wage may provide him or her with more financial support to pursue higher degrees. Thus, I employed instrumental variable (IV) regression to correct for reverse causation and estimate the "clean" payoff to education.

How exactly is the IV regression done in this case?

According to dozens of previous studies, parents' education levels tend to be positively correlated with their children's education level, so they can be used as instrumental variables. I employed these variables as IVs in this research, revisiting the question of the education payoff in the Chinese context. Other demographic factors and related variables were also included in the analysis as control variables.
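The two-stage logic described above can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the analysis behind this post: the variable names (`father`, `mother`, `shock`), the sample size, and the true payoff of 0.08 are all made up, and the endogeneity is generated by an unobserved common shock as a simple stand-in for the feedback between income and education.

```python
import random

def ols(rows, y):
    """Least-squares coefficients of y on the columns in rows (normal equations)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                f = a[row][col] / a[col][col]
                a[row] = [v - f * w for v, w in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

random.seed(0)
n = 5000
true_payoff = 0.08  # assumed "clean" payoff used to generate the data

# Hypothetical instruments: parents' years of education.
father = [random.gauss(9, 3) for _ in range(n)]
mother = [random.gauss(9, 3) for _ in range(n)]
# Unobserved shock that moves both education and income (source of OLS bias).
shock = [random.gauss(0, 1) for _ in range(n)]
educ = [3 + 0.3 * f + 0.3 * m + s + random.gauss(0, 1)
        for f, m, s in zip(father, mother, shock)]
ln_income = [1 + true_payoff * e + 0.5 * s + random.gauss(0, 0.5)
             for e, s in zip(educ, shock)]

# Naive OLS of ln(income) on education.
b_ols = ols([[1.0, e] for e in educ], ln_income)[1]

# Stage 1: predict education from the instruments.
g = ols([[1.0, f, m] for f, m in zip(father, mother)], educ)
educ_hat = [g[0] + g[1] * f + g[2] * m for f, m in zip(father, mother)]

# Stage 2: regress ln(income) on predicted education.
# The point estimate is the 2SLS estimate; standard errors computed from this
# manual second stage would be wrong, which is why dedicated IV routines exist.
b_2sls = ols([[1.0, eh] for eh in educ_hat], ln_income)[1]

print("OLS :", round(b_ols, 3))
print("2SLS:", round(b_2sls, 3))
```

In this setup the OLS coefficient is pushed upward by construction, while the 2SLS coefficient should land near the assumed payoff, which mirrors the pattern reported in the sensitivity analysis.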

Main results of the pilot study

The main results of the pilot study fall into two parts. The first part is the first-stage regression result of the 2SLS estimation. The second part is a sensitivity analysis.

First-stage regression for instrumental variables

[Table: first-stage regression; columns: Adjusted R-sq, Partial R-sq, F statistic]

The F statistic presented in the table above is much greater than the corresponding critical value provided in Stock & Yogo's (2005) work and thus supports the relevance (non-weakness) of the two aforementioned IVs.
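For readers unfamiliar with the statistic, the first-stage F test for the excluded instruments can be sketched as below. This is an illustration on simulated data, assuming one hypothetical control (`age`) and two hypothetical instruments (`father`, `mother`); the threshold of 10 is the commonly cited rule of thumb, not the Stock & Yogo critical value used in the post.

```python
import random

def ols_rss(rows, y):
    """Residual sum of squares from regressing y on the columns in rows."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                f = a[row][col] / a[col][col]
                a[row] = [v - f * w for v, w in zip(a[row], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

random.seed(1)
n = 1000
age = [random.uniform(20, 60) for _ in range(n)]    # hypothetical control
father = [random.gauss(9, 3) for _ in range(n)]     # hypothetical instrument
mother = [random.gauss(9, 3) for _ in range(n)]     # hypothetical instrument
educ = [2 + 0.3 * f + 0.3 * m + 0.02 * a + random.gauss(0, 1.5)
        for f, m, a in zip(father, mother, age)]

# Restricted first stage: controls only. Unrestricted: controls plus instruments.
rss_r = ols_rss([[1.0, a] for a in age], educ)
rss_u = ols_rss([[1.0, a, f, m] for a, f, m in zip(age, father, mother)], educ)

q = 2  # number of excluded instruments
k = 4  # number of coefficients in the unrestricted first stage
first_stage_f = ((rss_r - rss_u) / q) / (rss_u / (n - k))
partial_r2 = (rss_r - rss_u) / rss_r  # share of leftover variation explained by the IVs

print("Partial R-sq:", round(partial_r2, 3))
print("F statistic :", round(first_stage_f, 1))
print("Above rule-of-thumb threshold of 10:", first_stage_f > 10)
```

The partial R-sq and the F statistic are two views of the same comparison: both measure how much the instruments add once the controls are already in the first-stage regression.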

Regression for ln(income): Sensitivity analysis based on model comparison

[Table: coefficient estimates by model; recovered column headers: Drop health, Drop social]

* p<.05, ** p<.01
(Other variables, such as family history, occupation, location, and membership in organizations, were controlled in statistical analysis and omitted for brevity.)

Briefly, the coefficient on years of education in every IV regression (2SLS, 3SLS, Drop social, Drop health) is significant and smaller than the OLS coefficient, to varying degrees. This result indicates that the two IVs are effective in tackling the reverse causation problem between education and income, and that the positive feedback from income may be partially, if not entirely, corrected by the two IVs for education. Further, the coefficient on "clean" education is quite stable and stays significant across the different statistical models, which suggests that the proposed "clean" education payoff is robust.

Future studies

1. Considering the geographic segregation between rich and poor and the highly unbalanced distribution of education resources in China (such as advanced high schools, outstanding teachers and funds for facilities), the birthplace of an individual may also serve as an effective instrumental variable for his or her years of education, since it can help to tackle the potential omitted variable problem. According to Grusky's previous research and lectures on the correlations among an individual's class origin, his or her college attendance and macro-level income inequality, this IV may also perform well in analyses of the U.S. income distribution, if relevant, high-quality data are available.
2. More control variables need to be considered and added to the models. The control variables used in this research may have implications for relevant studies in other regions or countries. For example, this research, based on a data set collected in post-socialist China, found that Communist Party membership still has a significant positive effect on one's income across diverse job sectors. Therefore, insofar as the distribution of social resources and opportunities is largely affected by certain groups with (explicit or implicit) socioeconomic privileges, membership in groups and communities may deserve more attention in inequality studies.
3. According to Liu & Grusky's (2015) work "Inequality in the Third Industrial Revolution", the payoffs to education and to skills have typically been conflated in previous studies. Hence, specific factors measuring skills and professional capabilities should be analyzed separately in future studies.
4. In the modern information society, people's skills in using information & communication technology (ICT) and their abilities in seeking, searching for and using relevant information tend to be increasingly important for raising wages and narrowing income gaps (supporting evidence can be found in my empirical work conducted in rural China during 2013-2014). Given this reality, future studies of skills, education and inequality may need to pay more attention to human information behavior, ICT acceptance and information literacy.

What affects the scientific productivity of a PhD student? An analysis based on zero-inflated regression model

posted Oct 21, 2015, 7:18 PM by Jiqun Liu   [ updated Oct 26, 2015, 1:12 AM ]

This coursework replicated Long & Freese's (2001) study on the scientific productivity of biochemists. The dependent variable art is the number of articles published in the three years before receiving the Ph.D. The independent variables are gender (fem), whether the scientist is married (mar), the number of children under age 5 (kid5), the prestige of the Ph.D. department, ranging from .75 to 5 (phd), and the number of articles published by the scientist's mentor in the last three years (ment). This work provided many valuable insights for my research on information inequality, ICT acceptance and their effects on other aspects of social inequality. Specifically, I focused on the frequency of using a cell phone or PC to search for information and communicate with others online, and on how these habits affect one's access to reliable job information and opportunities.
The example data set used in this coursework is couart2.dta.

Main results: ZIP & ZINB estimation using part of Long's (1997) data set

              ZIP's b             ZINB's b
Vuong test    20.27 (p<0.001)     26.08 (p<0.001)

Based on the results of the Vuong test, we may conclude that the zero-inflated Poisson and zero-inflated negative binomial regression models fit the data set better and predict zeros more accurately than the standard Poisson and NB models.
According to the results of the ZIP and ZINB regressions, female PhD students tend to publish less than their male counterparts. Another ceteris paribus conclusion is that having young children also significantly reduces a PhD student's number of publications. On the other hand, a mentor's publications appear to motivate his or her doctoral students to publish more in the 3 years prior to receiving the PhD. Department prestige and marital status do not have significant effects on publication counts during the PhD.
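To show why a zero-inflated model predicts zeros better, here is a minimal intercept-only sketch on simulated counts. It is not the model estimated on couart2.dta: there are no covariates, the mixing share `true_pi` and rate `true_lam` are assumed values, and the fit uses a simple EM iteration rather than Stata's estimator.

```python
import math
import random

def poisson_draw(lam, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def fit_zip(y, iters=500):
    """Intercept-only zero-inflated Poisson fit by EM; returns (pi, lam)."""
    n, zeros, total = len(y), sum(1 for v in y if v == 0), sum(y)
    pi, lam = 0.5, max(total / n, 1e-6)
    for _ in range(iters):
        # E-step: probability that an observed zero is a structural zero.
        w = pi / (pi + (1 - pi) * math.exp(-lam))
        # M-step: update the mixing share and the Poisson rate.
        pi = zeros * w / n
        lam = total / (n - zeros * w)
    return pi, lam

rng = random.Random(2)
true_pi, true_lam, n = 0.3, 2.0, 4000  # assumed values for the simulation
y = [0 if rng.random() < true_pi else poisson_draw(true_lam, rng) for _ in range(n)]

pi_hat, lam_hat = fit_zip(y)
obs_zero = sum(1 for v in y if v == 0) / n
pois_zero = math.exp(-sum(y) / n)                    # plain Poisson fitted by the mean
zip_zero = pi_hat + (1 - pi_hat) * math.exp(-lam_hat)

print("observed share of zeros :", round(obs_zero, 3))
print("Poisson predicted zeros :", round(pois_zero, 3))
print("ZIP predicted zeros     :", round(zip_zero, 3))
```

The plain Poisson model has only one parameter to match both the mean and the zeros, so it underpredicts zeros whenever a separate process generates extra ones; the ZIP mixture adds a parameter for exactly that share.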

Further, in order to compare the performance of the different regression models, I plotted the difference between the observed proportion for each count and the mean predicted probability from each of the four models (P, NB, ZIP, ZINB), as the figure below shows.

Four models' deviations from the observed proportion of each count

As the figure shows, the zero-inflated negative binomial model provides the best prediction of a PhD student's scientific productivity in this case, based on the given data set.
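The quantity plotted in the figure, observed proportion minus predicted probability at each count, can be computed as in the sketch below. The frequency table `freq` is a made-up example, not the couart2.dta data; only the Poisson and NB models are shown, both without covariates, and the NB dispersion `alpha` is a rough moment estimate rather than a maximum likelihood fit.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of count k with mean mu."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def negbin_pmf(k, mu, alpha):
    """Negative binomial (NB2) probability; variance = mu + alpha * mu**2."""
    r = 1.0 / alpha
    return math.exp(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                    + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

# Hypothetical frequency table: count -> number of observations.
freq = {0: 30, 1: 25, 2: 18, 3: 10, 4: 7, 5: 4, 6: 3, 7: 2, 9: 1}
y = [k for k, c in freq.items() for _ in range(c)]
n = len(y)

mu = sum(y) / n
var = sum((v - mu) ** 2 for v in y) / n
alpha = (var - mu) / mu ** 2  # moment estimate; requires overdispersion (var > mu)

dev_pois, dev_nb = [], []
for k in range(10):
    observed = y.count(k) / n
    dev_pois.append(observed - poisson_pmf(k, mu))
    dev_nb.append(observed - negbin_pmf(k, mu, alpha))

for k in range(10):
    print(k, round(dev_pois[k], 3), round(dev_nb[k], 3))
```

With covariates in the model, the predicted probability at each count would be averaged over observations before subtracting it from the observed proportion; without covariates that average is just the fitted probability itself.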

The effects of individual-level factors on different income groups: Application of quantile regression

posted Oct 21, 2015, 9:08 AM by Jiqun Liu   [ updated Oct 21, 2015, 9:10 AM ]

This work employed quantile regression to go beyond the relationship between the means of variables and estimate different quantiles of the response variable. Using this model, I analyzed the different effects of several individual-level factors, such as education level, race, experience and IQ, on the wages of different income groups. Other control variables (marital status, location, etc.) were also included in the model. With the regression results and related percentile charts, we can better understand the origin and divergent tendency of today's income gaps among different groups and communities.
I mainly use Stata as my data analysis tool. The example data set used here is card.dta.
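The idea of estimating a separate slope for each quantile can be sketched with a toy example. The data below are constructed by hand (not card.dta) so that the return to a year of education fans out from 0.02 to 0.10 across the wage distribution, and the fit is a brute-force search that is only practical for tiny samples; in practice a dedicated routine such as Stata's qreg would be used.

```python
from itertools import combinations

def check_loss(residual, q):
    """Asymmetric absolute loss minimized by the q-th quantile."""
    return q * residual if residual >= 0 else (q - 1) * residual

def quantile_line(xs, ys, q):
    """Exact one-regressor quantile regression for small samples.

    An optimal line can always be found among the lines through two data
    points, so every such pair is tried and the lowest total loss is kept.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # vertical line, not a valid candidate
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        loss = sum(check_loss(y - intercept - slope * x, q) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]

# Toy data: at each education level there are five workers whose log wages
# grow with education at different rates (hypothetical values).
returns = [0.02, 0.04, 0.06, 0.08, 0.10]
educ, lwage = [], []
for years in range(8, 18):
    for r in returns:
        educ.append(years)
        lwage.append(5 + r * (years - 7))

slope_q10 = quantile_line(educ, lwage, 0.1)[1]
slope_q50 = quantile_line(educ, lwage, 0.5)[1]
slope_q90 = quantile_line(educ, lwage, 0.9)[1]

print("slope at q=0.1:", round(slope_q10, 4))
print("slope at q=0.5:", round(slope_q50, 4))
print("slope at q=0.9:", round(slope_q90, 4))
```

An ordinary least-squares fit on the same data would report a single average slope, hiding the fact that the payoff differs between the bottom and the top of the distribution.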

Main results: quantile regression on log(wage) q(0.5)

[Table: columns: Basic model, Full model; final row: Pseudo R2]

Graphs of the coefficients of quantile regression on log(wage) (Full model)

According to the coefficients at different quantiles of log(wage) in the graphs, an individual's education level (measured in years of education), for example, may have different effects on different income groups. For low-income groups, the payoff to an additional year of education is weaker than for relatively higher-income groups. When it comes to IQ and race, however, the payoffs to these factors in the top 10% income group are lower than in the bottom 10% income group.
