The description below is also available as PDF here.
Summary
Since March 2001, the USPTO has published the large majority of all patent applications. Because research has used patents extensively for a variety of measures, these publications are a valuable source of information, particularly given that around 50% of all patent applications fail (link), so a large share of activity goes unobserved when only granted patents are used. This dataset consists of scraped patent application data, split into tables describing the patent (title, dates, types), its technological classes (IPC and USPTO), its inventors (names, locations), and its assignees (names, locations).
Introduction
Research of many kinds has used patent data to observe or measure a wide range of concepts and mechanisms. Despite the progress that has been made in using such data, most of this work may be subject to an unobservability bias or selective sampling issue. After all, not all inventions are patented, and even in settings where innovations are regularly patented, around 50% of all patent applications still fail.
To address the latter issue, one can use all patent applications rather than only granted patents. Whereas some patent offices (including the EPO) have always published all patent applications, the USPTO only started to publish them in 2001 (link). Publication occurs 18 months after filing, or earlier upon request by the applicant. Certain types of applicants (individuals and small enterprises) can opt out of this otherwise mandatory publication.
Source of data
The data published here were originally provided by the USPTO and retrieved from Google USPTO Bulk Downloads (link). Specifically, all data stem from the “Patent Application Publication Bibliographic Data”. All weekly XML files have been scraped and converted into tables that follow a ‘relational database’ format. All string variables have been converted to uppercase, trimmed, and stripped of non-ASCII characters.
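As an illustration of the string cleaning described above, here is a minimal Python sketch. It is not the code used to build the dataset: the function name `normalize_field` is hypothetical, and simply dropping non-ASCII characters is an assumption (the original cleaning may, for instance, have transliterated accented letters instead).

```python
def normalize_field(value: str) -> str:
    """Drop non-ASCII characters, trim whitespace, and uppercase a string."""
    # Assumption: non-ASCII characters are removed outright, not transliterated.
    ascii_only = value.encode("ascii", errors="ignore").decode("ascii")
    # Only leading and trailing whitespace is trimmed; inner spaces are kept.
    return ascii_only.strip().upper()


# "\u00fc" is the letter u with a diaeresis; it is removed by the cleaning step.
print(normalize_field("  M\u00fcller GmbH "))  # prints MLLER GMBH
```

Under this assumption, names containing accented characters lose those characters entirely, which is worth keeping in mind when matching inventor or assignee names against other sources.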
Description of tables and fields
See details in the PDF file.
Disclaimer
All copyright of this data remains with the original information providers (USPTO and/or patent applicant). This data may solely be used for non-commercial purposes. Though care has been taken in collecting this data, it cannot be guaranteed there are no errors or mistakes. Please use this data at your own risk.
Frequently-asked questions
How do I use this dataset on Stata 11/12/13?
Please download the CSV files and import them with the ‘insheet’ command (Stata 11/12) or the ‘import delimited’ command (Stata 13).
Help, I get an out-of-memory error!
First, it may be an issue with your Stata settings. Use ‘set max_memory’ to increase the memory limit (in Stata 11, which predates this setting, use ‘set memory’ instead).
Second, your computer may not have enough memory. There are a couple of solutions. One can use TxtFileSplitter to split the CSV files into smaller parts and import these into Stata. Alternatively, one can load only certain variables into Stata (‘use app_id asg_organization using assignees.dta’) or only certain observations (‘use in 1/1000 using assignees.dta’).
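If no file-splitting tool is at hand, the same idea can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `split_csv`, the output naming scheme, and the UTF-8 encoding are assumptions, not part of the dataset or of any tool mentioned above.

```python
import csv
from pathlib import Path


def split_csv(source, rows_per_part, out_dir="."):
    """Split a CSV file into numbered parts, repeating the header in each part."""
    src = Path(source)
    written = []
    with src.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return written  # empty input file: nothing to split
        out, writer, count = None, None, 0
        for row in reader:
            # Start a new part file on the first row and whenever the
            # current part has reached rows_per_part data rows.
            if writer is None or count >= rows_per_part:
                if out is not None:
                    out.close()
                # Hypothetical naming scheme: <name>_part<N>.csv
                path = Path(out_dir) / f"{src.stem}_part{len(written) + 1}.csv"
                out = path.open("w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                written.append(str(path))
                count = 0
            writer.writerow(row)
            count += 1
        if out is not None:
            out.close()
    return written
```

For example, `split_csv("assignees.csv", 500000)` would write `assignees_part1.csv`, `assignees_part2.csv`, and so on to the current directory, each starting with the original header row so that every part can be imported into Stata on its own.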
There are mistakes in the dataset.
Yes, there are indeed some mistakes in the data, such as abbreviations for non-existent states and countries, or e-mail addresses in the phone number field. However, these mistakes were already present in the original data. Depending on the purpose of your research, you may have to clean the data further.
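As one example of such further cleaning, the sketch below blanks out phone values that actually contain an e-mail address. It is a rough illustration only: the function name `clean_phone` and the simple pattern are assumptions, and the pattern is a heuristic rather than a full e-mail validator.

```python
import re

# Heuristic: something@something.something with no spaces (an assumption,
# not a complete e-mail grammar).
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def clean_phone(value: str) -> str:
    """Return an empty string when a phone value actually holds an e-mail."""
    return "" if EMAIL_PATTERN.search(value) else value


print(clean_phone("JOHN@EXAMPLE.COM"))  # prints an empty line
print(clean_phone("555-0100"))          # prints 555-0100
```

Whether to blank such values, move them to another field, or drop the observation depends on the research question.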