Because this is freshly generated data, the first thing to do is check what we actually received. Load the data from the CSV file and use the .head() method to inspect the first few rows.
import pandas as pd

df = pd.read_csv("patents.csv")
df.head()
Oh my god. What happened to the data?
df.isna().sum()
Let's find out why there are so many NaN values in this output. Most of these columns are almost entirely NaN, so let's get rid of those useless columns.
# Keep only the columns with fewer than 7092 missing values
df = df.loc[:, df.isna().sum() < 7092]
df.isna().sum()
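The same column filter can also be written with DataFrame.dropna(axis=1, thresh=...), where thresh is the minimum number of non-NaN values a column must keep to survive. A minimal sketch on a made-up frame (the column names here are invented for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame: column "b" is mostly NaN, column "a" is mostly filled.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
})

# Keep columns with fewer than 2 NaN values...
kept_loc = toy.loc[:, toy.isna().sum() < 2]

# ...which is equivalent to requiring at least len(toy) - 1 non-NaN values.
kept_thresh = toy.dropna(axis=1, thresh=len(toy) - 1)

print(kept_loc.columns.tolist())     # ['a']
print(kept_thresh.columns.tolist())  # ['a']
```

The thresh form avoids hard-coding the NaN count, which helps if the row count of the export changes.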
Now some usable values are showing up. Let's check which columns remain.
df.columns
There are several data types remaining. The patent title and abstract carry the key textual information. Latitude/longitude can serve as locations, and state/country are also useful parameters. The patent date and processing time are datetime values, while the patent type is categorical. The citation counts are also a good signal for identifying strong patents.
But there are still NaN values to work through.
need_clean = df.columns[df.isna().any()].tolist()
need_clean
Here are the descriptions of the columns with missing values:
detail_desc_length : The character count (length) of the detailed description text on a given patent
patent_abstract : Abstract of the patent
patent_average_processing_time : Average processing time for patents in the same USPC mainclass category as the selected patent
patent_firstnamed_assignee_city : City of the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_country : Country of the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_id : Assignee ID (assignee_id) for the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_latitude : Latitude of the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_location_id : Unique database ID for the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_longitude : Longitude of the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_state : State of the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_inventor_city : City of the location of the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_country : Country of the location of the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_id : Inventor ID (inventor_id) for the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_latitude : Latitude of the location of the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_location_id : Unique database ID for the location of the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_longitude : Longitude of the location of the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_state : State of the location of the first-named (i.e. first in the list) inventor on the patent
Some columns hold unique content that we cannot substitute, such as patent_abstract and the ID columns. In those cases it is best to remove the rows with missing content.
For patent_average_processing_time and detail_desc_length, since these are numeric averages and lengths, we can replace NaN values with the column mean.
For latitude/longitude/location and country/city: when one of these is NaN, the others are NaN as well. This usually happens when the inventor/assignee is from outside the US and the major countries, and fewer than 3% of rows are affected, so we remove those rows.
Missing state values occur when the inventor/assignee is from outside the US but from a major country. We can replace those NaNs with "Foreigner".
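The claim that latitude, longitude, location ID, country, and city go missing together can be checked directly by counting how many of those fields are NaN per row: if missingness co-occurs, the counts cluster at 0 and at the full group size. A minimal sketch with made-up stand-in data (the toy column names are assumptions, not the real endpoint names):

```python
import pandas as pd
import numpy as np

# Made-up stand-in for the assignee location columns.
loc_cols = ["latitude", "longitude", "location_id", "country", "city"]
toy = pd.DataFrame({
    "latitude":    [40.7, np.nan, 34.0],
    "longitude":   [-74.0, np.nan, -118.2],
    "location_id": ["L1", np.nan, "L3"],
    "country":     ["US", np.nan, "US"],
    "city":        ["NYC", np.nan, "LA"],
})

# Count NaNs per row across the location group. Rows are either
# fully present (0 NaNs) or fully missing (5 NaNs) when the
# co-missingness claim holds.
nan_counts = toy[loc_cols].isna().sum(axis=1)
print(nan_counts.value_counts().to_dict())  # {0: 2, 5: 1}
```

Running the same check on the real df (with the real column names) before dropping rows confirms the assumption instead of trusting it.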
# Numeric columns: fill NaN with the column mean, rounded to one decimal
df['detail_desc_length'] = df['detail_desc_length'].fillna(round(df['detail_desc_length'].mean(), 1))
df['patent_average_processing_time'] = df['patent_average_processing_time'].fillna(round(df['patent_average_processing_time'].mean(), 1))
# State columns: label non-US rows from major countries as "Foreigner"
df['patent_firstnamed_assignee_state'] = df['patent_firstnamed_assignee_state'].fillna("Foreigner")
df['patent_firstnamed_inventor_state'] = df['patent_firstnamed_inventor_state'].fillna("Foreigner")
need_clean = df.columns[df.isna().any()].tolist()
need_clean
df = df.dropna(subset=need_clean)
df.isna().sum()
Now every cell has content, and we still have 6871 rows × 30 columns.
Next is reordering the columns. Because they are sorted alphabetically, the layout looks disorganized; let's re-sort them by priority and by endpoint group.
df.columns
df=df[[
"patent_number",
"patent_title",
"patent_kind",
"patent_type",
"patent_year",
"patent_date",
"patent_abstract",
"patent_firstnamed_assignee_city",
"patent_firstnamed_assignee_country",
"patent_firstnamed_assignee_id",
"patent_firstnamed_assignee_latitude",
"patent_firstnamed_assignee_location_id",
"patent_firstnamed_assignee_longitude",
"patent_firstnamed_assignee_state",
"patent_firstnamed_inventor_city",
"patent_firstnamed_inventor_country",
"patent_firstnamed_inventor_id",
"patent_firstnamed_inventor_latitude",
"patent_firstnamed_inventor_location_id",
"patent_firstnamed_inventor_longitude",
"patent_firstnamed_inventor_state",
"patent_num_cited_by_us_patents",
"patent_num_combined_citations",
"patent_num_foreign_citations",
"patent_num_us_application_citations",
"patent_num_us_patent_citations",
"patent_num_claims",
"detail_desc_length",
"patent_processing_time",
"patent_average_processing_time"
]]
df.columns
Lastly, check each column's data type and convert it to the right one.
df.dtypes
# Low-cardinality string columns become 'category'
for col in ['patent_type', 'patent_kind',
            'patent_firstnamed_assignee_country', 'patent_firstnamed_assignee_state',
            'patent_firstnamed_inventor_country', 'patent_firstnamed_inventor_state']:
    df[col] = df[col].astype('category')
df['patent_date']=pd.to_datetime(df['patent_date'])
df['patent_year']=pd.to_datetime(df['patent_year'], format='%Y').dt.year
df.dtypes
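Converting low-cardinality string columns to 'category' mostly pays off in memory: pandas stores small integer codes plus a single copy of each unique label. A quick sketch on made-up data (exact byte counts vary by platform, but the direction does not):

```python
import pandas as pd

# 30,000 rows but only three distinct labels -> ideal for 'category'.
s_obj = pd.Series(["utility", "design", "plant"] * 10_000)
s_cat = s_obj.astype("category")

# deep=True counts the actual string storage, not just pointers.
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes > cat_bytes)  # True: the categorical column is far smaller
```

Categoricals also make later groupby and value_counts calls on these columns faster.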
Now the data is fully cleaned, with no NaN values, and ready to be visualized.