Description

The full description of the Patent Grant Data v2017.1 dataset is available as a PDF file. The most important information is summarized below.

Summary

This website provides the bibliographic data from patents granted by the USPTO between January 1976 and June 2017. The data were obtained from free sources and are provided in a convenient relational database format. They may be used for non-commercial purposes only.

Introduction

Patent data have been used extensively for both theoretical and applied research in a large number of fields. Various groups have presented patent datasets for research purposes, free or paid, for one or more patent agencies, and for various periods. This dataset is similar to those already published, with two major differences. First, it runs through June 2017, which is more recent than many free alternatives. More importantly, it publishes all bibliographic fields, not only the most important ones. These additional fields may open up new research topics.

Data source

The data published here were originally provided by the USPTO and retrieved from Google USPTO Bulk Downloads (link). Specifically, all data stem from the “Patent Grant Bibliographic Data”.

Data processing

The USPTO has changed the format and fields of patents over time. This is reflected in the format in which these data are provided and in the bibliographic information that can be extracted. Between 1976 and 2017, eight different formats were used.

All bibliographic data were scraped from the XML files. First, the structure of each XML format was identified to ensure that all possible fields for each patent were captured. Following the USPTO format, we grouped these variables and created a table for each group. Second, all XML files were scraped by version-year combination. These raw data were then cleansed in a number of ways to create the ‘processed data’. For each version, the data from the different years were merged and each field/variable was given a unique name. In addition, the fields were formatted:

- Numeric data were stored as numbers (instead of strings)

- Dates were stored in date formats (instead of strings)

- Strings were cleansed of non-ASCII characters

- IPC classifications were split into sections, classes, and groups
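
The formatting steps listed above can be sketched roughly as follows. This is only an illustration and not the code used to build the dataset: the function names are hypothetical, the `YYYYMMDD` date layout is an assumption about the raw files, and dropping accents via Unicode decomposition is just one possible way to remove non-ASCII characters. The IPC pattern assumes symbols of the usual form `H04L 29/06` (section, class, subclass, main group, subgroup).

```python
import re
import unicodedata
from datetime import date, datetime


def to_number(value):
    """Return the string as an int, or None if it is blank or not numeric."""
    try:
        return int(value.strip())
    except ValueError:
        return None


def to_date(value, fmt="%Y%m%d"):
    """Parse a date string (assumed layout: YYYYMMDD); return None on failure."""
    try:
        return datetime.strptime(value.strip(), fmt).date()
    except ValueError:
        return None


def strip_non_ascii(value):
    """Decompose accented letters, then drop every remaining non-ASCII character."""
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Section letter, two-digit class, subclass letter, main group "/" subgroup.
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{1,6})$")


def split_ipc(symbol):
    """Split an IPC symbol into its parts; return None if it does not match."""
    match = IPC_PATTERN.match(symbol.strip())
    if match is None:
        return None
    section, ipc_class, subclass, main_group, subgroup = match.groups()
    return {
        "section": section,
        "class": ipc_class,
        "subclass": subclass,
        "main_group": main_group,
        "subgroup": subgroup,
    }
```

Returning `None` for values that cannot be parsed, rather than raising an error, keeps a single malformed field from interrupting the processing of a whole file; whether the published data handle such cases this way is not stated here.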

The various fields of the cleansed data were then standardized across versions. For example, v1.0 stores the inventor name as a single full-name field, while the other versions split it into separate first-name and last-name fields. In the standardized data, inventor name is a single field that contains either the full name or the concatenation of the first and last name. Unfortunately, this also means that certain variables are lost. For a full overview, please see the description in the 'Fields' section below. The standardized data are published in CSV and Stata formats.
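
As a minimal sketch of the inventor-name example, the function below builds one name field from a record of either kind. The keys `full_name`, `first_name`, and `last_name` are hypothetical and chosen for illustration only; the actual field names are listed in the full-text description (PDF).

```python
def standardize_inventor_name(record):
    """Return a single inventor name from a record of either layout.

    Uses the full name when present (the v1.0-style layout); otherwise
    joins first and last name with a space (the layout of later versions).
    The dictionary keys are hypothetical, not the dataset's field names.
    """
    full_name = (record.get("full_name") or "").strip()
    if full_name:
        return full_name
    parts = [
        (record.get("first_name") or "").strip(),
        (record.get("last_name") or "").strip(),
    ]
    # Skip empty parts so that a missing first or last name leaves no stray space.
    return " ".join(part for part in parts if part)
```

The sketch also shows why information is lost: once the two parts are joined, the separate first-name and last-name fields can no longer be recovered reliably from the single field.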

Data structure

The data structure, as described below, applies solely to the final data.

Tables

Fields

Please see the full-text description (PDF) for all fields per table.

Disclaimer

All copyright in these data remains with the original information providers (USPTO and/or patent applicant). The data may be used for non-commercial purposes only. Although care has been taken in collecting the data, the absence of errors cannot be guaranteed. Please use the data at your own risk.
