Census Tree FAQ

About the Census Tree

What is the Census Tree?

The Census Tree is the largest-ever database of record links among the historical U.S. censuses, with over 700 million links for people living in the United States between 1850 and 1940. It includes 314 million census-to-census links for women and 41 million links for Black Americans.

The links in the Census Tree will enable promising research in the social, behavioral, and economic sciences. For example, linked census data can be used to measure the intergenerational transmission of wealth or education, to estimate the long-term impacts of childhood circumstances, and to document trends in family formation. Because the Census Tree is large and highly representative of the population, researchers will be able to include small or under-represented groups in work that has excluded them in the past.

In practical terms, the Census Tree database is a set of crosswalks between the IPUMS versions of the 1850-1940 full-count decennial censuses. Our “Get the Data” page will direct you to the ICPSR repositories that host the crosswalks.
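
As a rough illustration of what working with a crosswalk involves, the sketch below joins two census extracts through a table of links. The column names (histid_1900, histid_1910), the record IDs, and the tiny in-memory records are assumptions made for this example only; consult the ICPSR documentation for the actual file layout.

```python
# Hypothetical census extracts keyed by a person identifier (all values invented).
census_1900 = {
    "A1": {"name": "Mary Smith", "age": 8},
    "A2": {"name": "John Brown", "age": 30},
}
census_1910 = {
    "B7": {"name": "Mary Smith", "age": 18},
    "B9": {"name": "John Brown", "age": 40},
    "B3": {"name": "Ann Lee", "age": 25},   # no link to 1900 in this example
}

# A crosswalk is a list of (ID in one census, ID in another census) pairs.
# The key names below are assumed, not taken from the published files.
crosswalk_1900_1910 = [
    {"histid_1900": "A1", "histid_1910": "B7"},
    {"histid_1900": "A2", "histid_1910": "B9"},
]


def link_records(crosswalk, left, right, left_key, right_key):
    """Return one combined row per crosswalk link found in both extracts."""
    linked = []
    for row in crosswalk:
        person_left = left.get(row[left_key])
        person_right = right.get(row[right_key])
        if person_left is None or person_right is None:
            continue  # the link points outside one of the extracts; skip it
        linked.append({"earlier": person_left, "later": person_right})
    return linked


panel = link_records(crosswalk_1900_1910, census_1900, census_1910,
                     "histid_1900", "histid_1910")
for pair in panel:
    print(pair["earlier"]["name"], pair["earlier"]["age"], "->", pair["later"]["age"])
```

The result is a small two-period panel: each row pairs the same person's record in the earlier and later census.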

For much more on the Census Tree, its properties, and how it is created, see Buckles, Haws, Price, and Wilbert (2024).

What is FamilySearch and the Family Tree?

FamilySearch is a “non-profit family history organization dedicated to connecting families across generations” [i] that is offered as a service of the Church of Jesus Christ of Latter-day Saints. FamilySearch started in 1894 as the Genealogical Society of Utah. Its website (familysearch.org) provides access to historical record collections and to a wiki-style platform, called the Family Tree, through which individuals can gather information about their ancestors. In 2020 there were over 1.2 billion individual profiles on the Family Tree and over 12 million registered users, making it the largest genealogy website in the world.

[i] https://www.familysearch.org/en/home/about

How was the Census Tree created?

The core of the Census Tree comes from information provided by users of FamilySearch.org, an online genealogy platform. Users can attach digitized historical records to the profiles of their ancestors, including the decennial censuses from 1850-1940. Any time a user links two different census records to a single profile, this creates a census-to-census link. There are over 317 million user-provided links, which constitute a dataset we call the Family Tree.
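
The way attachments turn into links can be sketched in a few lines. The profile and record identifiers below are hypothetical, and the code is only an illustration of the idea described above, not the project's actual pipeline: every pair of census years attached to the same profile yields one census-to-census link.

```python
from itertools import combinations

# Hypothetical profiles: each maps a census year to the attached record ID.
profiles = {
    "profile_1": {1900: "rec_a", 1910: "rec_b", 1930: "rec_c"},
    "profile_2": {1880: "rec_d"},                 # one census only: no link
    "profile_3": {1910: "rec_e", 1920: "rec_f"},
}


def census_links(attached):
    """Yield ((year1, id1), (year2, id2)) for every pair of attached census years."""
    for y1, y2 in combinations(sorted(attached), 2):
        yield (y1, attached[y1]), (y2, attached[y2])


links = [link for attached in profiles.values() for link in census_links(attached)]
for earlier, later in links:
    print(earlier, "->", later)
```

Note that a profile with records from 1900, 1910, and 1930 produces three links, including the non-adjacent 1900-1930 pair.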

We build on the Family Tree in two ways. First, we use the Family Tree as training data for a machine learning algorithm to create additional census-to-census links. Second, we add links from the Census Linking Project, the IPUMS Multigenerational Longitudinal Panel, and FamilySearch record hints. After filtering the links for quality and adjudicating conflicts, we have the Census Tree.

For a more detailed description of the methodology behind the Census Tree, please see Buckles, Haws, Price, and Wilbert (2024).

How is the Census Tree different from other available sets of census links?

The Census Tree is the largest set of census-to-census links ever created, with over 700 million unique links. It includes over 70% of the possible links that could be made for men, and over 60% of possible links for women.

For comparison, the Census Linking Project has match rates for men in the 20-30% range and no links for women. The Multigenerational Longitudinal Panel has match rates of 55% and 42% for men and women, respectively, but it does not attempt to link non-adjacent censuses (e.g., 1910 to 1930).

The Census Tree achieves these high match rates while maintaining high precision (the fraction of matches that are believed to be correct). Thus, the Census Tree has moved the frontier for linking the historical U.S. Censuses.

Please see Buckles, Haws, Price, and Wilbert (2024) for more on how the Census Tree builds on and compares to other sets of census-to-census links.

What do we know about the quality of the Census Tree links?

The quality of the links from the Family Tree, our XGBoost algorithm, the Census Linking Project, and the IPUMS Multigenerational Longitudinal Panel has been assessed in prior work; see, for example, Price et al. (2021) and Abramitzky et al. (2021). Additionally, we checked the quality of the full Census Tree by randomly selecting a sample of the 1900-1910 links and asking research assistants to independently verify that each link is correct. The results of this exercise are available in Buckles et al. (2023). To summarize, between 89% and 94% of the links were determined to be correct, depending on how we treat links the verifiers were unsure about. The rates were higher (between 94% and 97%) for links that were identified by more than one source.
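
One plausible reading of the 89% to 94% range is that the lower figure counts every "unsure" link as incorrect and the upper figure counts every "unsure" link as correct. The sketch below shows that calculation. The tallies are hypothetical, chosen only so the bounds reproduce the reported range; they are not the counts from the actual verification exercise.

```python
def precision_bounds(correct, incorrect, unsure):
    """Return (lower, upper) shares of links judged correct.

    lower treats every unsure link as wrong; upper treats every unsure
    link as right.
    """
    total = correct + incorrect + unsure
    lower = correct / total
    upper = (correct + unsure) / total
    return lower, upper


# Hypothetical tallies (not the real data), picked to match the reported range.
low, high = precision_bounds(correct=890, incorrect=60, unsure=50)
print(f"{low:.0%} to {high:.0%}")
```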

Where can I find the code and data that have been used in the creation of the Census Tree?

See our Methodology page for links to training data and code used in the creation of the Census Tree.

How do I get started using the Census Tree?

We have created a Quick Start Guide to help you understand the data and how to use it.

How do I cite the Census Tree in research using the data?

When downloading the crosswalks from ICPSR, please use the project citation provided with each set of links.

Please also cite the following papers that describe the Census Tree methodology:

If you are using links from the Census Linking Project or the IPUMS Multigenerational Longitudinal Panel, please cite those sources accordingly:

Finally, researchers should be sure to cite the IPUMS versions of the full-count censuses used in the work.

How can I tell you about my paper that uses the Census Tree?

We can't wait to see what other scholars do with the Census Tree, and will be building a Bibliography that tracks this research. Please use the Contact page on our website to tell us about your work!

About FamilySearch and the Family Tree

How is the Family Tree different from the Census Tree?

The Family Tree is a subset of the Census Tree links. The Family Tree includes only the user-generated links from FamilySearch. These links are then used as training data for the machine learning algorithm to produce additional links, which we further combine with links from the Census Linking Project, the Multigenerational Longitudinal Panel (IPUMS), and FamilySearch hints. After filtering the links for quality and resolving conflicts among the various sets of links, we have the full set of available links known as the Census Tree.

How does FamilySearch work for its registered users?

When someone first registers on FamilySearch.org, they enter information that they know about their parents, grandparents, and other relatives. If a deceased relative appears to be similar to a profile that is already on the Tree (e.g., in terms of name, dates of birth or death, or birth location), then the site will suggest that the existing profile be linked to the user’s family tree. In this way, the user’s individual tree becomes connected to the large, wiki-style Family Tree (the largest of which connects over 400 million profiles). It is very common for users with ancestors in the United States, and increasingly for users with ancestors elsewhere, to quickly find a relative with an existing profile that allows them to link into the Family Tree. The Family Tree is a “wiki” in the sense that it is a public, shared platform: when individuals have ancestors in common, any one of them can add and edit information, and everyone else will see those edits when they visit that ancestor’s profile.

Users may also start to attach records to relatives’ FamilySearch profiles, including census records, birth and death certificates, images from yearbooks or newspapers, church records, and military records. FamilySearch has a growing collection of over 4 billion digital images for these records, and it partners with other record sources such as Ancestry.com and findmypast.com. Users can search the digitized record collection themselves using the site’s search forms. The FamilySearch website can also suggest possible record matches to users when the digitized information is similar to that provided by the user on the profile.

It is important to remember that while individual users can include living persons in their own personal family tree, only information about deceased persons appears on the wiki-style Family Tree. As a result, researchers only have access to records for deceased persons.

What is included in an individual profile on the Family Tree?

The image linked here shows an example of an individual profile on the Family Tree. Each profile includes sections for vital information, family members, sources, notes, and memories. On the right of the profile there are also help features that provide links to record hints for the person, flags for likely errors in the profile (e.g., being born after a parent dies), search tools, an edit history, and other features.

The image linked here shows the source page for this profile. The source page identifies the user that attached the record, includes links for viewing the source, and provides hints for additional record linking based on the established records (described in detail below).

How do users find records to attach to the Family Tree?

There are three ways that sources become attached to a profile on the Family Tree: record hints, search, and private information.

First, the most common way that a source becomes attached to a profile is through a record hint. Hints appear in the upper right-hand corner of the screen on an individual’s profile page and are sometimes sent to users through an email campaign. They are generated by a matching algorithm that FamilySearch has developed, which is based on a neural net trained on data generated by genealogists.

Second, users can use the search features on FamilySearch to look for a person in various records. This approach allows the user to specify which features to search on, including the person’s name, birth place, birth year, family member names, residence place, and race. Users can set narrow or wide ranges around the dates of life events and can include wildcards. For example, the query J* Pri?e searches for anyone whose first name starts with J and whose surname matches Pri_e, where the blank can be any single character. These advanced search features allow users to find records that are missed by the record hint algorithms.
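
To make the wildcard example concrete, the sketch below applies the same two patterns with Python's shell-style matching from the standard library. It is a stand-in for illustration: FamilySearch's own search may treat case, spelling variants, or special characters differently, and the names in the list are invented.

```python
from fnmatch import fnmatchcase


def matches(given, surname, given_pattern, surname_pattern):
    """Case-insensitive wildcard match on given name and surname.

    '*' stands for any run of characters (including none) and '?' for
    exactly one character, as in the J* Pri?e example.
    """
    return (fnmatchcase(given.lower(), given_pattern.lower())
            and fnmatchcase(surname.lower(), surname_pattern.lower()))


# Invented names for illustration.
people = [("Joseph", "Price"), ("Jane", "Prise"), ("James", "Pryce"), ("Mary", "Price")]
hits = [p for p in people if matches(p[0], p[1], "J*", "Pri?e")]
print(hits)
```

Here "Pryce" is rejected because the pattern fixes the third letter as "i", and "Mary Price" is rejected because the first name does not start with J.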

Third, users find records in some other way, either through a manual search of all of the records for a particular town or by using research compiled in books or other sources. In addition, all of the other major genealogical websites (Ancestry, FindMyPast, MyHeritage) provide record hints and search tools, and their users take the information they find on those websites and attach the same sources to profiles on FamilySearch.

Does FamilySearch verify the links?

FamilySearch does not have an automated process to check whether records are attached to the correct person. Many of the attached sources are based on record hints provided by FamilySearch. Its matching algorithm uses a precision threshold of 95%, so the hinted links are highly likely to be correct matches, and the user plays a key role in providing a final validation before attaching the source.

There are also features within the FamilySearch platform that help users catch and fix mistakes. For example, if someone attempts to attach a profile on the Family Tree to a source that is already attached to another profile, they will see the profile to which the record is already attached. At this point, it is easy for the user to compare the profiles and to detach the source from the original profile if that link is deemed incorrect. They can also enter a reason statement to explain their decision. Furthermore, when users are doing research on a particular individual, they will often look over the sources attached to that person for additional information that should be included on the profile or to identify family members that might have been missed. In the process of comparing information across sources and reconciling conflicting information, they will often discover that one of the sources was incorrectly attached to that person, and they can easily detach it. This wiki-style aspect of the Tree is one of the most important features for ensuring high quality in the long run (though it can lower quality in the short run).

Is there any evidence on the quality of the links?

Here we summarize two exercises to examine the quality of the data; see Price et al. (2021) for more detail.

In the first exercise, we compare links from the Family Tree with the links created by the human trainers working on the LIFE-M project. LIFE-M provided us with a set of 54,000 individuals that they had linked from an Ohio birth certificate to the 1940 census. We were able to find about 12,000 people from their sample who were attached to both an Ohio birth certificate and the 1940 census on the Family Tree. Of these, 1,060 links were identified by both LIFE-M and the Family Tree, and we found that the links agreed 94% of the time. For the few cases where there was disagreement, we asked research assistants to determine by hand, using traditional family history tools, which match was correct. Adding the links that the research assistants identified as correct to those where LIFE-M and the Family Tree data agree, we conclude that the links based on the Family Tree were correct 98% of the time.

In the second exercise, we began with 500,000 matches for our Ohio sample between the 1910 and 1920 censuses and randomly sampled 100 records from the 1920 census. We gave these 100 records to trained research assistants and asked them to use the search tools on Ancestry to identify the number of potential matches for each person in the 1910 census and to determine, based on their inspection of the information in the two records, which of those possible matches was correct. On average, they identified 12 individuals in the 1910 census that were a possible match for each person in the sample from the 1920 census. The 1910 census record that they labeled as a match for each 1920 census record agreed with the match in the Family Tree data 98% of the time. We replicated this with a random sample of 350 record links from our full data set. For those 350 records, the research assistants were able to find a link 94% of the time, and the links they found agreed with the link in the Family Tree data 99% of the time.

Both of these exercises suggest that the Family Tree links achieve a level of accuracy similar to or better than that created by skilled human trainers, at a much lower cost.

How much does the Family Tree change over time?

The wiki-style structure of the Family Tree means that the profiles on the Family Tree are being continually updated and edited. As such, the training data that we gather from the Family Tree at one point in time might differ from the training data we would obtain at a different point in time.

Each profile on the Family Tree includes an edit history that records every change made to the profile, the date it occurred, and the user who made it. To test how much the training data we use in Price et al. (2021) changes over time, we examined the edit history for a random sample of 10,000 linked pairs in our training data. For each linked pair, we gathered data from the edit history to see when each census source in the pair was attached to the profile. We find that 69% of the time the two census records were linked to the individual's profile on the same day, and 84% of the time they were linked in the same year. In cases where the year differed, we used the later of the two dates. This table provides the year that each of our linked pairs was added to the Tree.
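
The timing calculation described above can be sketched as follows. The attachment dates are made up for illustration, and the variable names are hypothetical; only the logic (same-day share, same-year share, and dating each pair by the later attachment) follows the text.

```python
from datetime import date

# Hypothetical attachment dates for the two census sources in each linked pair.
pairs = [
    (date(2018, 3, 4), date(2018, 3, 4)),     # attached on the same day
    (date(2016, 1, 10), date(2016, 9, 2)),    # same year, different day
    (date(2014, 5, 1), date(2019, 7, 15)),    # different years
    (date(2020, 2, 2), date(2020, 2, 2)),     # attached on the same day
]

# Share of pairs whose two sources were attached on the same day / in the same year.
same_day = sum(a == b for a, b in pairs) / len(pairs)
same_year = sum(a.year == b.year for a, b in pairs) / len(pairs)

# Date each pair by the later of its two attachments, as described in the text.
link_years = [max(a, b).year for a, b in pairs]

print(f"same day: {same_day:.0%}, same year: {same_year:.0%}")
print(link_years)
```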

Note that almost half of our training pairs were created in 2018 or later. This confirms that the Tree continues to grow rapidly; those interested in using the Tree for training data will therefore want to use the most recent data available. Nevertheless, we have found that in practice, updating the training data with newly made links does not meaningfully change the number or composition of the links made by the machine learning algorithm.

This random sample of our training data also provides some interesting insight into the contributors to the training data. There were 9,141 unique registered users who attached one of the 20,000 sources in our random sample of 10,000 linked pairs (two attached sources for each pair). Of these attached sources, 17% were attached by users who attached only one source, 41% by users who attached two sources, 20% by users who attached 3-4 sources, and the other 22% by users who attached five or more of the sources within this random sample of our training data. In addition, for 71% of the linked pairs in this random sample, both sources were attached by the same registered user.
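
A tabulation like the one above can be produced by counting sources per user and then grouping sources by how active their contributor was. The user IDs below are invented, and the bucket labels simply mirror the categories in the text; this is a sketch of the bookkeeping, not the code used for the reported figures.

```python
from collections import Counter

# Hypothetical: the user ID that attached each source in a sample.
attached_by = ["u1", "u2", "u2", "u3", "u3", "u3",
               "u4", "u4", "u4", "u4", "u4", "u5"]

# Number of sampled sources attached by each user.
per_user = Counter(attached_by)


def bucket(n):
    """Map a user's source count to the categories used in the text."""
    if n == 1:
        return "1"
    if n == 2:
        return "2"
    if n <= 4:
        return "3-4"
    return "5+"


# Count sources (not users) in each bucket, then convert to shares.
sources_by_bucket = Counter(bucket(per_user[user]) for user in attached_by)
shares = {b: count / len(attached_by) for b, count in sources_by_bucket.items()}

print(len(per_user), "unique users")
print(shares)
```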

Who are the users of FamilySearch, and how representative is the Family Tree of the general population?

There are two key questions to consider when thinking about how FamilySearch users might produce a Family Tree that is not representative of a population. First, what are the characteristics of the users themselves? Here, specific concerns include the facts that FamilySearch users have access to computers/smart phones, internet, and time, or that members of the Church of Jesus Christ of Latter-day Saints may be more prevalent among FamilySearch users than they are in the general population. Second, how might the behavior of users affect who ends up on the Tree and who does not? For example, are users more likely to look for or find information about successful relatives, which would lead to their over-representation on the Tree?

Unfortunately, we do not have demographic information that allows us to provide summary statistics for the 12 million+ FamilySearch users. However, we can compare the characteristics from records on the Family Tree to other population records, to help us assess the representativeness of the Tree. We summarize the key findings here; see Price et al. (2021) for the full results.

When comparing the census profiles that are on the Family Tree to the full population, we see that those on the Tree are similar in terms of gender, age, household size, and the probability of being the household head. However, those on the Tree are more likely to be white, married, and literate, and are more likely to be living in their birth state. Interestingly, we find that those on the Tree have a lower occupation score, which suggests that users are not more likely to look for or find information on more successful relatives.

While these results suggest that there is selection into the Tree along some characteristics, we note that when using the Family Tree data (or any samples produced using it as training data), it is possible to re-weight the sample to be representative of the desired population by following the procedure outlined in Bailey et al. (2019). The fact that the Family Tree includes over 1.2 billion profiles means that even for under-represented groups, there will likely be sufficient support in the data for this approach.
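
The general idea of re-weighting can be shown with a deliberately simple cell-based scheme: each person in the linked sample is weighted by the ratio of their group's share in the target population to its share in the sample. This is a simplified stand-in and not necessarily the procedure in Bailey et al. (2019); the group labels and shares are hypothetical.

```python
from collections import Counter


def cell_weights(sample_cells, population_shares):
    """Weight each sampled person by (population share) / (sample share) of their cell."""
    counts = Counter(sample_cells)
    n = len(sample_cells)
    return [population_shares[cell] / (counts[cell] / n) for cell in sample_cells]


# Hypothetical linked sample that over-represents the "literate" cell.
sample = ["literate"] * 9 + ["not_literate"] * 1
population = {"literate": 0.8, "not_literate": 0.2}   # assumed target shares

weights = cell_weights(sample, population)

# After weighting, the under-represented cell has its population share.
weighted_share = (
    sum(w for w, cell in zip(weights, sample) if cell == "not_literate")
    / sum(weights)
)
print(round(weighted_share, 3))
```

The scheme only works when every population cell appears at least once in the sample, which is why sufficient support in the data matters.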

What about survivor bias?

Since people using a genealogical platform often seek out their own ancestors first, the Family Tree might under-represent individuals who never had descendants. One thing that allays this concern is that while users on FamilySearch tend to start by focusing on finding information about their direct ancestors, they generally then turn their attention to descendancy research. Descendancy research is gathering information on the children, grandchildren, and great-grandchildren of each of your ancestors. This approach to family history allows everyone the chance to be gathered into the Family Tree, including those with no living descendants.

To test for the extent of survivorship bias on the Family Tree, we took a random sample of women age 35 from the 1910 census and compared the coverage rate on the Family Tree of women based on whether or not they had ever had children (that census year includes a question for women about how many children they have ever had). We created a random sample of 5,000 women who had had children and 5,000 women who had never had children. We found that 46% of women with children in the 1910 census had a profile on the Tree, compared to 18% of women who had never had children. These results suggest that individuals who do not have children are less likely to currently have a profile on the Family Tree but they do still have a significant presence and are included in our training data.

Are there other uses for the Family Tree data, beyond its use as training data?

Absolutely. As one example, the Family Tree itself provides a rich set of links that traditional linking methods and even machine learning are unlikely to be able to provide. To see this, consider the case of "maiden" names. Women’s records from before and after marriage have historically been difficult to link because a woman’s surname usually changed at marriage. Neither the research assistant nor the machine learning algorithm has the information needed to link “Mary Gaddie” as a child to “Mary Caswell” as a married adult. However, family members often have this private information and can successfully create this link. Family members also have private information about occupations, geographical moves, and the names of family members. Given that the Family Tree contains tens of millions of record links, this is a valuable source of difficult-to-link records.

Additionally, because the Family Tree data contain millions of links that are considered “ground truth,” they can be used to check the validity of other methods. For example, Abramitzky et al. (2019) and Bailey et al. (2019) have used the Family Tree data in this way.

How can I access the FamilySearch API?

Data from the Family Tree can be accessed using the FamilySearch API. Documentation for the API is provided on the FamilySearch API resources page, and FamilySearch also provides a helpful getting started page. The FamilySearch API is designed primarily for app developers who use data from the Family Tree to create family history experiences; however, there is increasing interest in partnering with academics as a way to increase the quality and growth of the Family Tree.

An App Key is required to use the FamilySearch API and instructions for how to obtain one are provided on the FamilySearch API getting started page. The BYU Record Linking Lab is willing to share our experience in using the FamilySearch API. We can also help identify ways that academic projects can facilitate the growth of data in the Family Tree or add value back to the Family Tree, as these are two of the primary considerations that FamilySearch uses when determining whether to grant access to an App Key. Data acquisition and record linking efforts in economics and other fields have tremendous potential to add value back to the Family Tree.

Any other advice?

Yes! The best way to understand the FamilySearch platform and the Family Tree is to create an account and build your family tree. Doing so will help you gain a deeper understanding of the user experience, and how that experience is reflected in the data you want to use.

Last update: 8/2/23
