Goal of Research

FALLACY: The goal of research is to estimate a regression model.

TRUTH: Regression models are always a means to an end. We have some hypothesis about the real world and how it works. There are three main purposes of regression models:

Description: The model describes trends and correlations present in the data: whether inequality has been increasing or decreasing, whether X and Y are correlated, whether this correlation has been strengthening or weakening, etc.

Assessment of Hypotheses: Economic hypotheses, like the idea that free trade leads to economic growth, can be assessed via regression models, which might support or disconfirm the hypothesis.

Policy Decisions: Regression models can provide estimates of the impact of one variable on another, which can be useful for policy.

In all cases, there is some background theory which is used to justify the regression model, and the data sheds light on the theory. There is only one exception to this rule.

EXCEPTIONAL CASE:

The exception is EXPLORATORY DATA ANALYSIS. This means we have no idea how things work in the world, and we run some regressions to explore; in this case the regression is used to generate interesting hypotheses. For example, if we find a negative correlation between inflation and unemployment, we might come up with the hypothesis that reducing unemployment leads to inflation.

A regression model can then be used to provide some support for this hypothesis. It can also be used to describe aspects of the real world in a way that helps us understand them, or to make policy decisions.

NORMAL CASE:

In the usual case, we have some background theory. The regression is ONLY ONE element of support for the background theory. One can and should bring in other evidence, historical and qualitative, to support (or disconfirm) the hypothesis being researched. The main point is that the regression model is a MEANS to a goal, and not the END or the GOAL. One cannot run a regression and consider the research finished. We now give some specific examples.

EXAMPLES:

Here are some examples of this very common error taken from actual proposals submitted for evaluation:

T1. What is the effect of Terrorism on FDI?

T2. What is the effect of Growth on Income Distribution?

T3. What is the effect of Migration from Kashmir on Crop Yields in Kashmir?

There are two problems that one must resolve before running a regression to find the effect of X on Y:

PROBLEM 1: Why do we want to know the relationship between these two variables? How will this knowledge help solve some problem that we face?

PROBLEM 2: Does the regression equation make sense? Can it be estimated with the data available?

Topic T1: What is the effect of Terrorism on FDI?

DOES IT MAKE SENSE?

The regression equation does make sense. Foreign investors look at conditions within the country in order to decide whether or not to invest. If conditions are bad they will invest less, and if they are good they will invest more. So there is a direct link between conditions in the country (AS PERCEIVED by investors) and the amount of FDI which will come in.

CAN IT BE ESTIMATED WITH AVAILABLE DATA?

This equation cannot really be estimated using time series data for Pakistan. Why? Over the past thirty years, a huge number of changes have occurred in the country, one of which is the rise of terrorism, mainly linked to the Afghan wars. Because many factors have varied simultaneously, it would be very hard to isolate the effect of terrorism from everything else which has been changing, especially with a small data set like this one. If we looked at a panel data set which has many countries with differing levels of terrorism incidents, then it might be possible to isolate the effects of terrorism from everything else which has been happening.

One important issue to understand in this context is the difference between EXPERIMENTAL DATA and an OBSERVATIONAL STUDY. In experimental data, we try to make sure that the only variable which changes is the one we are interested in; all other variables are controlled so that they do not change. Then the effect of the variable of interest can be isolated. In an observational study, we just look at what is happening in the real world. All factors are changing at the same time, and so we cannot be sure which factor is causing the changes that we see. There is also something called a QUASI-EXPERIMENTAL STUDY, where we try to make observations that resemble experiments. Some very strong conditions are needed for this to work.
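The difference between observational and controlled variation can be illustrated with a small simulation. This is only a sketch with made-up numbers, not an estimate from real data: the effect sizes (-2 for terrorism, -3 for a second factor labelled `instability`), the 30-year length, the FDI equation, and all variable names are assumptions chosen for illustration.

```python
import random

def ols_slope(x, y):
    """Slope b from the simple regression y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

random.seed(1)
YEARS = 30
TRUE_TERRORISM_EFFECT = -2.0   # assumed for illustration
INSTABILITY_EFFECT = -3.0      # assumed effect of a second, hypothetical factor

def fdi(terrorism, instability):
    """Hypothetical FDI equation: both factors reduce investment."""
    return (100.0 + TRUE_TERRORISM_EFFECT * terrorism
            + INSTABILITY_EFFECT * instability + random.gauss(0, 1))

# Observational time series: both factors rise together over the 30 years.
terror_obs = [t + random.gauss(0, 1) for t in range(YEARS)]
instab_obs = [t + random.gauss(0, 1) for t in range(YEARS)]
fdi_obs = [fdi(tr, ins) for tr, ins in zip(terror_obs, instab_obs)]
slope_observational = ols_slope(terror_obs, fdi_obs)

# Experiment-like data: terrorism varies while the other factor is held fixed.
terror_exp = [float(t) for t in range(YEARS)]
fdi_exp = [fdi(tr, 10.0) for tr in terror_exp]
slope_controlled = ols_slope(terror_exp, fdi_exp)

print(f"true effect of terrorism: {TRUE_TERRORISM_EFFECT}")
print(f"observational slope:      {slope_observational:.2f}")
print(f"controlled slope:         {slope_controlled:.2f}")
```

In the observational series, the slope on terrorism also picks up the effect of the second factor, because the two move together, so it comes out far from the assumed -2. When the second factor is held fixed, the slope is close to the assumed effect.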

Topic T2: How does growth affect income distribution?

Here the question itself does not make much sense. There are some types of growth (called pro-poor growth) which improve income distribution, and other types which lead to more concentration of wealth at the top. So a better question might be: what types of growth policies are pro-poor and what types are anti-poor? But this is a very different question, and cannot immediately be answered by running a regression. Another question might be: growth has been taking place in Pakistan; has it been helping the poor or helping the rich? That is, to which classes of people has the extra money generated by growth been going? Again, this is a very different question.

To illustrate with a different example, suppose we ask: What is the impact of MEDICINE on MALARIA? For this we run a regression with Malaria Incidence as the dependent variable and IMPORTS OF MEDICINE as the independent variable. Only a small percentage of the medicine is anti-malarial, so only that part would be relevant. So this regression equation does not make much sense. Answers would vary according to whether the proportion of anti-malarial medicine was high or low.

There is another important issue, exogeneity/endogeneity, which will be discussed in detail later on. It is the following: if there is more malaria, then we will import more medicine to fight it. Thus malaria medicine is not exogenous (that is, it is not determined by outside factors). It is endogenous: the level of the dependent variable determines how much medicine we import. Suppose we import exactly as much medicine as is needed for the number of malaria cases that we have. Then we will have a perfect regression:

Malaria = 0 + 1 x MalariaMedicine + error

with R-sq = 100%. The coefficient will be positive, and we can say (!) that Malaria is caused by Malaria Medicine: the more we import, the more cases of Malaria we get!!! This interpretation is wrong because Malaria Medicine is not exogenous.
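The perfect-fit case can be reproduced with invented numbers. The sketch below assumes that malaria cases vary from year to year for outside reasons and that imports are set to exactly one unit of medicine per case; the data, the 30-year sample, and the names `malaria_cases` and `medicine_imports` are hypothetical.

```python
import random

def ols_simple(x, y):
    """Return (intercept, slope, r_squared) for the regression y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

random.seed(0)

# Hypothetical data: cases are driven by outside factors, not by medicine.
malaria_cases = [random.randint(500, 1500) for _ in range(30)]

# Assumption from the text: we import exactly as much medicine as is needed,
# so imports are a response to the number of cases.
medicine_imports = [float(cases) for cases in malaria_cases]

# The regression is run in the "wrong" direction: malaria on medicine.
intercept, slope, r_squared = ols_simple(medicine_imports, malaria_cases)

print(f"Malaria = {intercept:.2f} + {slope:.2f} x MalariaMedicine")
print(f"R-sq = {r_squared:.2%}")
```

The fit is perfect and the coefficient is positive, even though in this constructed data medicine has no effect on malaria at all: causation runs entirely from cases to imports. The regression output alone cannot reveal this.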

Topic T3: Does Migration from Kashmir affect crop yields in Kashmir?

Migration is not itself a determinant of yield. Therefore it does not make sense to study the impact of migration on yields directly. One must go through the determinants of yield to determine the impact of migration on yield: see [link] for a detailed explanation.

SOLUTIONS:

So how can we fix these problems?

First, understand that we don't start our research by saying: I am going to discover the effect of X on Y. We must start with a real world problem. For example, I start by asking: how much money in the government budget should be spent on higher education versus primary education? Note that this is a real practical problem. If we spend 1 million on education, should we put 500,000 into higher education and 500,000 into elementary schools, or more or less?

To answer this question, we will need to run a regression which will tell us the rates of return to higher education and to elementary education. This will be one part of the information needed to answer the question. Other parts will have to be found by reading the literature on education. The whole set of information needed to answer this question would be the subject of an MS Thesis, which will review all related literature. Regression will be just one part of the whole picture, not the entire focus of the thesis. For more detailed information, see How to choose a topic for MS Thesis Research.

Second, understand the limitations of data. This should be understood in an intuitive way, directly, without doing statistics. Suppose I want to find out the effect of remittances on consumption. In my mind, I have the idea that remittances from abroad lead to luxury consumption, but not so much to investment. Or else that remittances are invested in land, leading to higher real estate prices. Can I find this out by using macro data? That is, I use annual data on remittances and annual data on luxury consumption versus investment, or on land prices. Will this tell us what we want to know? HIGHLY UNLIKELY. This is an observational study. Consumption and land prices have been subject to a lot of influences. Everything has been changing over the past twenty years: wars, 9/11, Pakistan's atomic bomb, the embargo, oil price hikes, etc. With only a small amount of data, it is hard to sort out the influences and make sure that the changes we see are due to remittances, and not to some other factor.

ON THE OTHER HAND, if we look at micro level household data in HIES or some such, there are thousands of households. Some have remittances and some don't. Looking at consumption patterns within these households, it should be possible to find out how remittances are spent, and to compare with the case of no remittances. So here, micro level data should enable us to answer the question, which macro level data cannot.

This type of thinking/reasoning must be done in ADVANCE of running regressions, which can easily give misleading answers. For another example of a case where data is insufficient for the hypothesis being tested, but we can get results by CHANGING and INCREASING the data set, look at [link].
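The household comparison described above can be sketched with simulated data. Nothing here comes from HIES: the number of households, the 30% share receiving remittances, the spending levels, and the assumed extra luxury spending of 50 are all invented. The sketch also assumes that receiving remittances is unrelated to other household characteristics, which real survey data would not guarantee.

```python
import random

random.seed(2)
N_HOUSEHOLDS = 4000
TRUE_EXTRA_LUXURY = 50.0  # assumed extra luxury spending in remittance households

with_remit, without_remit = [], []
for _ in range(N_HOUSEHOLDS):
    receives = random.random() < 0.3       # assumed: 30% receive remittances
    luxury = 200.0 + random.gauss(0, 40)   # hypothetical baseline plus household noise
    if receives:
        with_remit.append(luxury + TRUE_EXTRA_LUXURY)
    else:
        without_remit.append(luxury)

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Sample variance with the usual n - 1 denominator."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Compare mean luxury spending between the two groups of households.
difference = mean(with_remit) - mean(without_remit)
std_error = (variance(with_remit) / len(with_remit)
             + variance(without_remit) / len(without_remit)) ** 0.5

print(f"households with / without remittances: {len(with_remit)} / {len(without_remit)}")
print(f"difference in mean luxury spending: {difference:.1f} (s.e. {std_error:.1f})")
```

With thousands of households the difference between the two groups is estimated with a small standard error, which twenty annual macro observations could not deliver. Whether that difference can be read as the effect of remittances still depends on the two groups being comparable in other respects.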