Census Data

A Few Notes on China's Census Data

Currently, microdata are available for five waves of China's census: 1982, 1990, 2000, 2010, and 2015. The first three can be downloaded from IPUMS International after registration and application.

When downloading the data, make sure you select the "harmonized variables" rather than the "source variables", because geographical information (e.g. province of the household) is concealed in the "source" version.

Moreover, the World Bank has more documents for the 1982 and 1990 censuses, which are not available from IPUMS:



2005, 2010, and 2015 data are being circulated privately. 

There are also privately circulated 1990 and 2000 data. The 1990 one is exactly the IPUMS one, but with the original household ID (which is concealed in the IPUMS version) and a broken "Zhejiang province" file. The 2000 one seems to be a subsample of the IPUMS one: the privately circulated version has a sample size of around 1 million, while the IPUMS version has 10 million observations.


What's wrong with the 1987 Census?

In the 1987 small census, the population share in cities and towns is incredibly high: 37%. This figure is much higher than in the 1982 census (21%) and the 1990 census (26%). What's wrong with it?

Long story short, it is due to changes in the definition of cities and towns, both in statistical methods and in administrative divisions.

In 1982, the population in cities and towns was defined as all people living in the administrative areas of cities and towns. Later, many counties were "upgraded" to towns, which made the population in towns increase significantly.

In the 1990 census, if measured in this way, the urbanization rate (population share in cities and towns) would therefore reach 53%. This figure could not reflect the real urbanization rate in China. To solve this problem, the NBS called the previous method "the first definition" and developed another definition, which classifies the administrative divisions more carefully, called "the second definition". Starting with the 1990 census, figures under both definitions are reported and researchers can choose between them (in most cases, you should use the second definition).

Back to the 1987 small census. In this census, apparently, the NBS used the first definition, but they never issued an official announcement about it. In the 1987 official report, they say urbanization was really fast compared with 1982. In the 1990 official report, they simply forget about the 1987 and 1982 censuses!

So if you don't know that the second definition came into existence after 1990*, the data of the 1987 census will really puzzle you. To use the 1987 census, of course, we could apply some complicated adjustment methods. However, the easiest way is to count "population in cities" as "population in cities and towns". In that way, the share will be 18%. A bit too low is better than a lot too high.

*The NBS seems to have already realized this problem in 1989. After the 1987 census, they began to conduct yearly population surveys. The 1988 survey seems to be a preliminary one. The 1989 survey is a more standard one and is reported in the 1990 yearbook. In the survey data (Chapter 4 of the 1990 yearbook), the population share in cities and towns is 22%, which is consistent with the 1990 data. Meanwhile, in Chapter 5 of the same yearbook, the so-called "statistical data" show the share to be 42%.

It is impossible to identify households in the 2010 and 2015 official census microdata

The official microdata of the 2010 and 2015 censuses don't have household identifiers. This is because "In order to prevent the leakage of personal characteristics, the database has been anonymized and information identifying households or even individuals has been removed." (official document)

You may think even without household identifiers, it is still possible to identify a household from the combination of other variables such as "the area of the house", "birth within the household", etc.

However, besides dropping sensitive variables, they randomly dropped individuals from households, making these two official datasets useless if you want to study anything related to the household.

How do I know? Because there is a raw version of the 2015 census microdata that is about 1.5 times the size of the official version. Initially, I thought the official version had dropped some sensitive households. After comparing the two datasets, I realized that they actually dropped individuals from within households. The average household size in China is around 3-4, and they dropped one member from each household, which explains the difference in size.
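The size argument can be checked with a line of arithmetic. The sketch below is only an illustration under a simplifying assumption that is not in the data documentation: every household has the same size and loses exactly one member. The function name `raw_to_official_ratio` is hypothetical.

```python
def raw_to_official_ratio(household_size: int) -> float:
    """Ratio of raw to official record counts if every household of the
    given size loses exactly one member (a simplifying assumption)."""
    if household_size < 2:
        raise ValueError("a household must keep at least one member")
    return household_size / (household_size - 1)


for size in (3, 4):
    print(size, round(raw_to_official_ratio(size), 2))
# prints:
# 3 1.5
# 4 1.33
```

Under this assumption, a raw file about 1.5 times the size of the official one is what you would expect if a typical three-person household lost one member.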

Here is an example. Suppose you try to identify households by "the area of the house".

From the official data, you find 5 people in Dongcheng District, Beijing, whose house area is 7 square meters. They seem to belong to three households. Maybe (1, 3), (2), (4, 5), or (1, 5), (2), (3, 4)?

However, the raw data version shows that there are actually 7 people! The true household combination is (1, 6), (2), (4, 5), (3, 7).

We can see that, due to individuals being dropped from households, it is impossible to identify the true household from the official version. The raw version of the 2015 census is available but the raw version of the 2010 census is still waiting to be leaked : )


ID   relation     area_of_the_house   family_members   birth_year
1    house head   7                   2                1984
2    house head   7                   1                1991
3    spouse       7                   2                1959
4    house head   7                   2                1978
5    spouse       7                   2                1977
================ censored data below ================
6    spouse       7                   2                1994
7    house head   7                   2                1957
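The ambiguity in this example can be reproduced with a short sketch. It uses the rows of the table above and a hypothetical helper, `candidate_groupings`, that pairs each two-person house head with one spouse in every possible way. This is only an illustration of the matching problem, not the procedure used to compare the two datasets.

```python
from itertools import permutations

# Rows copied from the table above: (ID, relation, area, family_members, birth_year).
OFFICIAL = [
    (1, "house head", 7, 2, 1984),
    (2, "house head", 7, 1, 1991),
    (3, "spouse", 7, 2, 1959),
    (4, "house head", 7, 2, 1978),
    (5, "spouse", 7, 2, 1977),
]
CENSORED = [
    (6, "spouse", 7, 2, 1994),
    (7, "house head", 7, 2, 1957),
]
# The true grouping stated in the text.
TRUE_GROUPING = frozenset({(1, 6), (2,), (4, 5), (3, 7)})


def candidate_groupings(rows):
    """Enumerate every way to pair two-person heads with spouses.

    Hypothetical helper: one-person heads form their own household, and
    each remaining head is matched with exactly one spouse.
    """
    singles = [(r[0],) for r in rows if r[1] == "house head" and r[3] == 1]
    heads = [r[0] for r in rows if r[1] == "house head" and r[3] == 2]
    spouses = [r[0] for r in rows if r[1] == "spouse" and r[3] == 2]
    groupings = set()
    for perm in permutations(spouses, len(heads)):
        pairs = [tuple(sorted(pair)) for pair in zip(heads, perm)]
        groupings.add(frozenset(pairs + singles))
    return groupings


official = candidate_groupings(OFFICIAL)
raw = candidate_groupings(OFFICIAL + CENSORED)

print(len(official))              # 2 candidate groupings from the official rows
print(TRUE_GROUPING in official)  # False: the truth cannot be recovered
print(TRUE_GROUPING in raw)       # True: only the raw rows contain it
```

The two candidates from the official rows are exactly the guesses listed above, and neither is correct. Note that house area alone still leaves several candidates even with the raw rows, so this sketch shows only that the official file rules out the true grouping.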
