Because this is freshly generated data, the first thing to do is check what we actually received. Load the data from the CSV file and use the .head() method to inspect the first few rows.
import pandas as pd

df = pd.read_csv("patents.csv")
df.head()
Oh my god. What happened to the data?
df.isna().sum()
Let's find out why there are so many NaN values in this output. Most of these columns are almost entirely NaN, so let's get rid of those useless columns.
# Keep only the columns with fewer than 7092 missing values
df = df.loc[:, df.isna().sum() < 7092]
df.isna().sum()
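The same column filter can also be written with DataFrame.dropna(axis=1, thresh=...), where thresh is the minimum number of non-NaN values a column must keep to survive. A minimal sketch on a made-up frame (the column names here are invented for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame: column "b" is mostly NaN, column "a" is mostly filled.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
})

# Keep columns with fewer than 2 NaN values...
kept_loc = toy.loc[:, toy.isna().sum() < 2]

# ...which is equivalent to requiring at least len(toy) - 1 non-NaN values.
kept_thresh = toy.dropna(axis=1, thresh=len(toy) - 1)

print(kept_loc.columns.tolist())     # ['a']
print(kept_thresh.columns.tolist())  # ['a']
```

The thresh form avoids hard-coding the NaN count, which helps if the row count of the export changes.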
Now some usable values are showing up. Let's check which columns remain.
df.columns
There are several data types remaining. The patent title and abstract carry the key textual information. Latitude/longitude can serve as locations, and state/country are also useful parameters. The patent date and processing time are datetime values, while the patent type is categorical. The citation counts are also a good signal for identifying strong patents.
But there are still NaN values to work through.
need_clean = df.columns[df.isna().any()].tolist()
need_clean
Here are the descriptions of the columns with missing values:
detail_desc_length : The character count (length) of the detailed description text on a given patent
patent_abstract : Abstract of the patent
patent_average_processing_time : Average processing time for patents in the same USPC mainclass category as the selected patent
patent_firstnamed_assignee_city : City of the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_country : Country of the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_id : Assignee ID (assignee_id) for the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_latitude : Latitude of the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_location_id : Unique database ID for the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_longitude : Longitude of the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_assignee_state : State of the location of the first-named (i.e. first in the list) assignee on the patent
patent_firstnamed_inventor_city : City of the location of the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_country : Country of the location of the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_id : Inventor ID (inventor_id) for the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_latitude : Latitude of the location of the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_location_id : Unique database ID for the location of the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_longitude : Longitude of the location of the first-named (i.e. first in the list) inventor on the patent
patent_firstnamed_inventor_state : State of the location of the first-named (i.e. first in the list) inventor on the patent
Some columns hold unique content that we cannot substitute, such as patent_abstract and the ID columns. In those cases it is best to remove the rows with missing content.
For patent_average_processing_time and detail_desc_length, since these are numeric averages and lengths, we can replace NaN values with the column mean.
For latitude/longitude/location and country/city: when one of these is NaN, the others are NaN as well. This usually happens when the inventor/assignee is from outside the US and the major countries, and fewer than 3% of rows are affected, so we remove those rows.
Missing state values occur when the inventor/assignee is from outside the US but from a major country. We can replace those NaNs with "Foreigner".
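The claim that latitude, longitude, location ID, country, and city go missing together can be checked directly by counting how many of those fields are NaN per row: if missingness co-occurs, the counts cluster at 0 and at the full group size. A minimal sketch with made-up stand-in data (the toy column names are assumptions, not the real endpoint names):

```python
import pandas as pd
import numpy as np

# Made-up stand-in for the assignee location columns.
loc_cols = ["latitude", "longitude", "location_id", "country", "city"]
toy = pd.DataFrame({
    "latitude":    [40.7, np.nan, 34.0],
    "longitude":   [-74.0, np.nan, -118.2],
    "location_id": ["L1", np.nan, "L3"],
    "country":     ["US", np.nan, "US"],
    "city":        ["NYC", np.nan, "LA"],
})

# Count NaNs per row across the location group. Rows are either
# fully present (0 NaNs) or fully missing (5 NaNs) when the
# co-missingness claim holds.
nan_counts = toy[loc_cols].isna().sum(axis=1)
print(nan_counts.value_counts().to_dict())  # {0: 2, 5: 1}
```

Running the same check on the real df (with the real column names) before dropping rows confirms the assumption instead of trusting it.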
# Numeric columns: fill NaN with the column mean, rounded to one decimal
df['detail_desc_length'] = df['detail_desc_length'].fillna(round(df['detail_desc_length'].mean(), 1))
df['patent_average_processing_time'] = df['patent_average_processing_time'].fillna(round(df['patent_average_processing_time'].mean(), 1))
# State columns: label non-US rows from major countries as "Foreigner"
df['patent_firstnamed_assignee_state'] = df['patent_firstnamed_assignee_state'].fillna("Foreigner")
df['patent_firstnamed_inventor_state'] = df['patent_firstnamed_inventor_state'].fillna("Foreigner")
need_clean = df.columns[df.isna().any()].tolist()
need_clean
df = df.dropna(subset=need_clean)
df.isna().sum()
Now every cell has content, and we still have 6871 rows × 30 columns.
Next is reordering the columns. Because they are sorted alphabetically, the layout looks disorganized; let's re-sort them by priority and by endpoint group.
df.columns
df=df[[
"patent_number",
"patent_title",
"patent_kind",
"patent_type",
"patent_year",
"patent_date",
"patent_abstract",
"patent_firstnamed_assignee_city",
"patent_firstnamed_assignee_country",
"patent_firstnamed_assignee_id",
"patent_firstnamed_assignee_latitude",
"patent_firstnamed_assignee_location_id",
"patent_firstnamed_assignee_longitude",
"patent_firstnamed_assignee_state",
"patent_firstnamed_inventor_city",
"patent_firstnamed_inventor_country",
"patent_firstnamed_inventor_id",
"patent_firstnamed_inventor_latitude",
"patent_firstnamed_inventor_location_id",
"patent_firstnamed_inventor_longitude",
"patent_firstnamed_inventor_state",
"patent_num_cited_by_us_patents",
"patent_num_combined_citations",
"patent_num_foreign_citations",
"patent_num_us_application_citations",
"patent_num_us_patent_citations",
"patent_num_claims",
"detail_desc_length",
"patent_processing_time",
"patent_average_processing_time"
]]
df.columns
Lastly, check each column's data type and convert it to the right one.
df.dtypes
# Low-cardinality string columns become 'category'
for col in ['patent_type', 'patent_kind',
            'patent_firstnamed_assignee_country', 'patent_firstnamed_assignee_state',
            'patent_firstnamed_inventor_country', 'patent_firstnamed_inventor_state']:
    df[col] = df[col].astype('category')
df['patent_date']=pd.to_datetime(df['patent_date'])
df['patent_year']=pd.to_datetime(df['patent_year'], format='%Y').dt.year
df.dtypes
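Converting low-cardinality string columns to 'category' mostly pays off in memory: pandas stores small integer codes plus a single copy of each unique label. A quick sketch on made-up data (exact byte counts vary by platform, but the direction does not):

```python
import pandas as pd

# 30,000 rows but only three distinct labels -> ideal for 'category'.
s_obj = pd.Series(["utility", "design", "plant"] * 10_000)
s_cat = s_obj.astype("category")

# deep=True counts the actual string storage, not just pointers.
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes > cat_bytes)  # True: the categorical column is far smaller
```

Categoricals also make later groupby and value_counts calls on these columns faster.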
Now the data is fully cleaned, with no NaN values, and ready to be visualized.