Data Gathering

Libraries

import requests
import csv
import os
import datetime
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

requests
- Used to make HTTP requests, such as GET or POST, to access APIs or websites.
csv
- Provides functionality to read from and write to CSV (Comma-Separated Values) files.
os
- Provides a way to interact with the operating system, such as handling file paths or environment variables.
datetime
- Used to manipulate dates and times, offering date formatting, time calculations, and more.
pandas
- A powerful data manipulation library used for data analysis, providing data structures like DataFrames.
numpy
- A library for numerical operations and array handling, offering efficient computation for large datasets.
seaborn
- A data visualization library built on top of matplotlib, providing an easy interface for creating attractive statistical graphics.
matplotlib.pyplot
- The plotting interface of matplotlib, a library used for creating static, interactive, and animated visualizations in Python.

How to Start Loading API Data

First, we are going to need the main URL.

url = "https://api.patentsview.org/patents/query"

Now we need to attach the parameters that describe what we are going to request.

Q : Query

A query is a request to retrieve specific information from a database, API, or other data source. It typically specifies conditions or filters that narrow the results down to what the user needs. For example, in the context of an API, a query might ask for records where a particular field matches a given value or falls within a range. The query defines the exact subset of data you want to access, rather than returning all available data. In this code, the query restricts the search to a date range (patent_date from 2018-01-01 through 2024-12-31) and to topic phrases in the patent title (machine learning / ML, Natural Language Processing / NLP).
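As a minimal sketch of how such a query can be assembled, the snippet below builds the same kind of nested structure from a date range and a list of phrases. The helper name build_query and its parameters are hypothetical and only for illustration; the operator keys (_and, _or, _gte, _lte, _text_phrase) and the field names are the ones used in the payload in this section.

```python
def build_query(start_date, end_date, phrases,
                date_field="patent_date", text_field="patent_title"):
    """Hypothetical helper: AND a date range with an OR over title phrases."""
    return {
        "_and": [
            {"_gte": {date_field: start_date}},  # on or after start_date
            {"_lte": {date_field: end_date}},    # on or before end_date
            # any one of the phrases may match the text field
            {"_or": [{"_text_phrase": {text_field: p}} for p in phrases]},
        ]
    }


query = build_query(
    "2018-01-01",
    "2024-12-31",
    ["machine learning", "ML", "Natural Language Processing", "NLP"],
)
```

Building the query from a list keeps the topic phrases in one place, so adding or removing a topic does not require editing the nested dictionaries by hand.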

F : Fields

This parameter specifies which fields the API should return in the response. It lets you customize the response by including only the data fields you are interested in. In this code, all the fields are requested in order to get the raw data.

O : Options

This parameter specifies additional options for the query, such as pagination (the page number and the number of results per page). Because a single request returns only a limited number of results, we use these options to set the page size and to step through the full result list.
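The paging arithmetic that follows from these options can be sketched as a ceiling division. The function name page_count is an assumption made for this illustration; the figures are the page size used in this section (1000) and the total of 7092 patents reported later in this section.

```python
def page_count(total, per_page):
    """Number of pages needed to cover `total` results (ceiling division)."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return (total + per_page - 1) // per_page


pages = page_count(7092, 1000)
print(pages)  # 8: seven full pages plus one page holding the last 92 results
```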

query_payload = {
    # q: patents dated 2018-2024 whose title matches any of the topic phrases
    "q": {
        "_and": [
            {
                "_gte": {"patent_date": "2018-01-01"}
            },
            {
                "_lte": {"patent_date": "2024-12-31"}
            },
            {
                "_or": [
                    {"_text_phrase": {"patent_title": "machine learning"}},
                    {"_text_phrase": {"patent_title": "ML"}},
                    {"_text_phrase": {"patent_title": "Natural Language Processing"}},
                    {"_text_phrase": {"patent_title": "NLP"}}
                ]
            }
        ]
    },
    # f: fields to return
    "f": [
        "appcit_app_number",
        "appcit_category",
        "appcit_date",
        "appcit_kind",
        # ...
        "wipo_field_id",
        "wipo_field_title",
        "wipo_sector_title",
        "wipo_sequence"
    ],
    # o: paging options
    "o": {
        "per_page": 1000,
        "page": 1
    }
}

After setting up all the search options, we request a response from the URL. With the options above, each response contains up to 1000 patents per page, and more pages follow. For each page, we use the requests library to send a POST request and read the API response.

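# First request: find out how many patents match the query
response = requests.post(url, json=query_payload)
response.raise_for_status()  # stop early on an HTTP error
data = response.json()

total_patent_count = data['total_patent_count']
per_page = query_payload['o']['per_page']
# Ceiling division: number of pages needed to cover every match
total_pages = (total_patent_count + per_page - 1) // per_page

all_patents = []

# Request each page in turn (page 1 is requested again here for simplicity)
for page in range(1, total_pages + 1):
    query_payload['o']['page'] = page
    response = requests.post(url, json=query_payload)
    response.raise_for_status()
    data = response.json()
    patents = data.get('patents', [])
    all_patents.extend(patents)
    print(f"Retrieved page {page}/{total_pages}")

print(f"Total patents retrieved: {len(all_patents)}\n")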
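The loop above can also be wrapped in a function that reuses the first response instead of requesting page 1 twice. The following is a sketch under stated assumptions, not the method used in this section: fetch_all, FakeResponse, and fake_post are hypothetical names, and the fake stand-in exists only so the paging logic can be run without network access. The response keys total_patent_count and patents are the ones read by the code above.

```python
import copy


def fetch_all(post, url, payload,
              total_key="total_patent_count", items_key="patents"):
    """Collect every page of results.

    `post` is any callable shaped like requests.post(url, json=...) whose
    return value offers raise_for_status() and json().
    """
    payload = copy.deepcopy(payload)  # do not mutate the caller's dict
    per_page = payload["o"]["per_page"]
    items, page, total_pages = [], 1, 1
    while page <= total_pages:
        payload["o"]["page"] = page
        response = post(url, json=payload)
        response.raise_for_status()
        data = response.json()
        if page == 1:
            # The first response tells us how many pages exist in total
            total = data.get(total_key) or 0
            total_pages = (total + per_page - 1) // per_page
        items.extend(data.get(items_key) or [])
        page += 1
    return items


class FakeResponse:
    """Offline stand-in for a response object (illustration only)."""

    def __init__(self, body):
        self._body = body

    def raise_for_status(self):
        pass

    def json(self):
        return self._body


def fake_post(url, json):
    """Serve 5 made-up records, sliced according to the paging options."""
    records = [{"patent_title": f"title {i}"} for i in range(5)]
    per_page, page = json["o"]["per_page"], json["o"]["page"]
    start = (page - 1) * per_page
    return FakeResponse({
        "patents": records[start:start + per_page],
        "total_patent_count": len(records),
    })


demo_payload = {"q": {}, "f": ["patent_title"], "o": {"per_page": 2, "page": 1}}
demo = fetch_all(fake_post, "https://example.invalid/query", demo_payload)
print(len(demo))
```

With the real library, the call would presumably be fetch_all(requests.post, url, query_payload). Passing the request function in as an argument keeps the paging logic testable offline.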

Now the data is ready. Write it to a CSV file, with the field names as the header row, to prepare for data cleaning.

with open('patents.csv', 'w', newline='') as csvfile:
    # Column names, in the order they appear in the CSV header
    fieldnames = [
        "appcit_app_number",
        "appcit_category",
        "appcit_date",
        "appcit_kind",
        # ...
        "wipo_field_id",
        "wipo_field_title",
        "wipo_sector_title",
        "wipo_sequence"
    ]

    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    items = all_patents
    for item in items:
        # One row per patent; a missing field is written as an empty string
        writer.writerow({
            "appcit_app_number": item.get("appcit_app_number", ""),
            "appcit_category": item.get("appcit_category", ""),
            "appcit_date": item.get("appcit_date", ""),
            "appcit_kind": item.get("appcit_kind", ""),
            "appcit_sequence": item.get("appcit_sequence", ""),
            # ...
            "wipo_field_id": item.get("wipo_field_id", ""),
            "wipo_field_title": item.get("wipo_field_title", ""),
            "wipo_sector_title": item.get("wipo_sector_title", ""),
            "wipo_sequence": item.get("wipo_sequence", "")
        })
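Since every row maps each field name to item.get(name, ""), the row dictionary can also be generated from the fieldnames list itself. The sketch below shows that variant on an in-memory buffer and reads the result back to check it. The helper write_rows and the two sample records are made up for this illustration and are not data from the API.

```python
import csv
import io


def write_rows(handle, records, fieldnames):
    """Write records as CSV, filling missing fields with empty strings."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        writer.writerow({name: record.get(name, "") for name in fieldnames})


# Made-up sample records; the second one lacks a patent_date
sample = [
    {"patent_title": "Machine learning system", "patent_date": "2019-05-07"},
    {"patent_title": "NLP pipeline"},
]

buffer = io.StringIO()
write_rows(buffer, sample, ["patent_title", "patent_date"])

# Read the CSV text back to confirm the header and the empty-string default
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(rows[1])
```

Deriving each row from fieldnames keeps the header and the row contents in sync, so a field added to the list cannot be forgotten in the row dictionary.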

Total patents retrieved: 7092

Now there is a fresh CSV file built from the PatentsView API.

To inspect the raw data, see the sub-tab.