Patent Classification using a Multi-layer Perceptron

I have updated the algorithm to use BERT for classification. A new training sample will also be produced soon. Please look for updates.

This page explains how I classified patents into quality-improving ones (product patents) and cost-reducing ones (process patents). In the text below, I shorten these to quality patents and cost patents.

A quality patent is one that improves the quality of an existing product/function or creates a new product/function. This definition coincides with the definition of product innovation in the Oslo Manual (2018): "a new or improved good or service that differs significantly from the firm’s previous goods or services and that has been introduced on the market."

A cost patent is one that improves production efficiency or reduces production cost. This definition largely overlaps with the Oslo Manual's definition of a (business) process innovation: "a new or improved business process for one or more business functions that differs significantly from the firm’s previous business processes and that has been brought into use in the firm."

The documentation assumes basic knowledge of the multi-layer perceptron model and its training process, but I will explain the rationale for each step. The code is adapted from Google TensorFlow's movie reviews tutorial.

1. Training Sample

I hired two research assistants, one with a science background and the other with an economics background, to classify 6000 randomly selected patents into the two categories, quality (improving) and cost (reducing), based on the definitions above. I refer to them below as RA1 and RA2.

The table below tabulates their classification results:

RA1 is more likely to classify a patent as quality-improving, and RA2 is more likely to classify a patent as cost-reducing. To check the quality of their classifications, two additional RAs (RA3 and RA4) were hired to classify the first 1000 of the 6000 patents. RA1's classification agrees with RA3's on 802 of these patents and with RA4's on 840; RA2's agrees with RA3's on 599 and with RA4's on 659. Based on these overlaps, RA1's classification seems more credible than RA2's. This is verified in the training stage.

2. A Neural Network Model

I considered two classes of neural network models: sequence models and n-grams models. A sequence model takes into account the order of words in a patent text (title or abstract), while an n-grams model considers only the bag of words. The n-grams model performed better than the sequence model.

This is unsurprising: according to the Google Developers website, n-grams models work better when the ratio of the number of samples to the average number of words per sample is less than or equal to 1,500, while sequence models work better when the ratio is greater than 1,500.

The average abstract contains 250 words, so the ratio here is 6000/250 = 24, far below 1,500.

The n-grams model I chose is the multi-layer perceptron (MLP). Below I explain how I train this model using the 6000 classified patents.

2.1. Text preprocessing

Patent titles and abstracts are processed in the following steps (a code sketch follows the example below):

- Convert all text to lowercase and remove punctuation.
- Replace numbers with number signs. [1]
- Remove stop words. [2]
- Lemmatize the remaining words. [3]

[1] Numbers are replaced with number signs because patent abstracts often describe numbered figures, in which the different parts of the invention are labeled with numbers. These numbers say nothing about how similar two patents are.

[2] Stop words (chosen from nltk.corpus) are the most commonly used words in the English language. Since we use an n-grams model for patent classification, stop words do not carry much information.

[3] Lemmatization converts all plurals to singulars and all verb tenses to present. The lemmatizer used here is WordNetLemmatizer from the nltk.stem package.

For example, here is a patent abstract before preprocessing:

The utility model relates to a filter outside an aquarium that can be emptied automatically. The filter contains a shell, a filter layer, an upper cover body, a diving pump, a water inlet and a water outlet and is characterized in that: the utility model also contains a releaser that comprises a compressed air plug, a compressed air plug sleeve, a seal rubber ring, a return spring and a check valve; the compressed air plug sleeve, upper part and lower part of which are open is positioned in the upper cover body; the compressed air plug is connected with the compressed air plug sleeve by the seal rubber ring and the peak of the compressed air plug goes through a pylome of the upper cover body and extends outside of the upper cover body; the compressed air plug sleeve is communicated with the water outlet; the check valve is positioned in the water outlet; the return spring is arranged between the compressed air plug and the upper cover body. The utility model solves the problem that the water pump has to be started by absorbing water with mouth in the current technology and has the advantages of simple structure, convenient application, etc.

This is the patent abstract after preprocessing:

utility model relates filter outside aquarium emptied automatically filter contains shell filter layer upper cover body diving pump water inlet water outlet characterized utility model also contains releaser comprises compressed air plug compressed air plug sleeve seal rubber ring return spring check valve compressed air plug sleeve upper part lower part open positioned upper cover body compressed air plug connected compressed air plug sleeve seal rubber ring peak compressed air plug go pylome upper cover body extends outside upper cover body compressed air plug sleeve communicated water outlet check valve positioned water outlet return spring arranged compressed air plug upper cover body utility model solves problem water pump started absorbing water mouth current technology advantage simple structure convenient application etc
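
Below is a minimal sketch of this preprocessing pipeline, assuming nltk's stopword list and WordNetLemmatizer named in the notes above; the function name and the regular expressions are illustrative, not the exact code used.

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords")  # one-time downloads of nltk data
    nltk.download("wordnet")

    STOP_WORDS = set(stopwords.words("english"))
    LEMMATIZER = WordNetLemmatizer()

    def preprocess(text):
        text = text.lower()
        text = re.sub(r"\d+", "#", text)        # [1] replace numbers with number signs
        text = re.sub(r"[^a-z#\s]", " ", text)  # strip punctuation
        tokens = [t for t in text.split() if t not in STOP_WORDS]  # [2] remove stop words
        # [3] lemmatize: nouns first (plurals -> singulars), then verbs (-> present tense)
        tokens = [LEMMATIZER.lemmatize(LEMMATIZER.lemmatize(t), pos="v") for t in tokens]
        return " ".join(tokens)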

This is a frequency plot of the most popular words in patent abstracts:

2.2. Vectorization

This step converts the preprocessed patent texts to matrices.

First, the preprocessed patent texts (now consisting mostly of nouns and verbs) are converted to bags of words composed of two- and three-word grams. For example, "I am happy today" is converted to {"I am", "I am happy", "am happy", "am happy today", "happy today"}.

Second, a dictionary is built from these two- and three-word grams. Here the dictionary contains 10,000-12,000 grams, depending on whether patent titles or abstracts are used as the patent texts.

Third, the bags of words are converted into a sparse matrix whose columns correspond to the grams in the dictionary and whose rows each correspond to one patent text. An entry is 1 if the patent text contains the gram and 0 otherwise.

Fourth, each positive entry is multiplied by a weighting factor called TF-IDF, where TF refers to term frequency (the number of times a gram appears in the patent text) and IDF refers to inverse document frequency (the inverse of the number of patent texts in which the gram appears). TF-IDF weighting increases the weight of a gram that appears many times within a patent text and decreases the weight of a gram that appears across many patent texts.
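
Here is a minimal sketch of the vectorization step, assuming scikit-learn's TfidfVectorizer; the dictionary-size cap and the variable names (train_texts, val_texts) are illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(
        ngram_range=(2, 3),  # two- and three-word grams
        max_features=12000,  # cap the dictionary at roughly 10,000-12,000 grams
    )
    X_train = vectorizer.fit_transform(train_texts)  # sparse matrix: one row per patent text
    X_val = vectorizer.transform(val_texts)          # reuse the dictionary built on training texts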

2.3. Sample balancing, shuffling and splitting

The patent classification sample is unbalanced: the numbers of quality and cost patents are unequal, which can lead to poor training performance. I balance the sample using a technique called oversampling, in which the under-represented category is replenished with patents drawn at random, with replacement, from that same category.

After balancing, the sample is shuffled and randomly split into three groups: a training set, a validation set, and a test set. The test set contains 100 patents; 80% of the remaining patents go into the training set and the remaining 20% into the validation set.
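
A minimal sketch of the balancing and splitting steps, assuming numpy and scikit-learn; texts and labels stand for the 6000 classified patent texts and their labels (1 = quality, 0 = cost).

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    def oversample(texts, labels):
        """Balance the sample by resampling each category up to the size of the largest one."""
        texts, labels = np.asarray(texts), np.asarray(labels)
        classes, counts = np.unique(labels, return_counts=True)
        idx = np.concatenate([
            rng.choice(np.flatnonzero(labels == c), size=counts.max(), replace=True)
            for c in classes
        ])
        idx = rng.permutation(idx)  # shuffle after balancing
        return texts[idx], labels[idx]

    texts, labels = oversample(texts, labels)
    # Hold out a 100-patent test set, then split the rest 80/20.
    rest_texts, test_texts, rest_y, test_y = train_test_split(
        texts, labels, test_size=100, random_state=0)
    train_texts, val_texts, train_y, val_y = train_test_split(
        rest_texts, rest_y, test_size=0.2, random_state=0)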

2.4. Model building and training

This section builds a multi-layer perceptron (MLP) model using Google TensorFlow's Python modules.

An MLP model, illustrated in the figure below, contains one input layer, one output layer, and one or more hidden layers. The input layer receives the vectorized patent texts and passes them to the hidden layers through a linear transformation. Each hidden layer contains several neuron-like processing units (each with a ReLU activation function) that process the information and pass it down to the next layer. The output layer applies a sigmoid function that assigns each patent text a value between 0 and 1.
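
A minimal sketch of such an MLP in tf.keras, following the structure just described; the number of units, the dropout rate, and the learning rate are illustrative assumptions.

    import tensorflow as tf

    def build_mlp(input_dim, units=64, dropout_rate=0.2):
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(input_dim,)),               # vectorized patent text
            tf.keras.layers.Dropout(dropout_rate),
            tf.keras.layers.Dense(units, activation="relu"),  # hidden layer of ReLU units
            tf.keras.layers.Dropout(dropout_rate),
            tf.keras.layers.Dense(1, activation="sigmoid"),   # output: value between 0 and 1
        ])
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
            loss="binary_crossentropy",  # the cross-entropy loss described below
            metrics=["accuracy"],
        )
        return model

    model = build_mlp(input_dim=X_train.shape[1])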

Training proceeds in sessions called epochs. Each epoch begins with the model feeding batches of training samples into the input layer and ends once the accuracy and loss have been calculated for both the training and validation sets. In each epoch, the model adjusts the parameters of its neurons to reduce its loss on the training set.

The model's performance on the training and validation sets is measured by its accuracy and loss. The former is the percentage of correct predictions. The latter is a cross-entropy function that captures how far the predicted probabilities are from the true labels.
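
For reference, for a patent with true label y (1 for quality, 0 for cost) and predicted probability p, the binary cross-entropy loss is

    loss(y, p) = -[ y log(p) + (1 - y) log(1 - p) ]

averaged over all patents in the set.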

The model stops training when its accuracy and loss stop improving or once it has finished 1000 epochs, whichever comes first.
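
A minimal sketch of the training loop with early stopping, again in tf.keras; the patience value and batch size are illustrative assumptions.

    # Stop when the validation loss stops improving, or after 1000 epochs at most.
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)

    history = model.fit(
        X_train.toarray(), train_y,  # densify the sparse TF-IDF matrix for Keras
        validation_data=(X_val.toarray(), val_y),
        epochs=1000,
        batch_size=128,
        callbacks=[early_stop],
    )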

2.5. Hyperparameter tuning

The last step is to improve the MLP model's performance (its accuracy and loss) by tuning its hyperparameters, such as the number of hidden layers, the number of units per layer, and the dropout rate.
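
A minimal sketch of a grid search over two of these hyperparameters, reusing build_mlp from above; the grids are illustrative assumptions.

    best = None
    for units in (32, 64, 128):
        for dropout_rate in (0.2, 0.3, 0.4):
            model = build_mlp(X_train.shape[1], units=units, dropout_rate=dropout_rate)
            hist = model.fit(
                X_train.toarray(), train_y,
                validation_data=(X_val.toarray(), val_y),
                epochs=1000, batch_size=128, verbose=0,
                callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)],
            )
            val_loss = min(hist.history["val_loss"])
            if best is None or val_loss < best[0]:
                best = (val_loss, units, dropout_rate)  # keep the best configuration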

I trained six models, each on a different training sample: RA1's classification, RA2's classification, and the common classification, each combined with either patent abstracts or patent titles.

"Common" refers to those patents where the two RAs gave the same classification.

The figures below show the accuracy and loss rates of these models after tuning:

RA1's classification + patent abstract (accuracy)

RA1's classification + patent abstract (loss)

RA1's classification + patent title (accuracy)

RA1's classification + patent title (loss)

RA2's classification + patent abstract (accuracy)

RA2's classification + patent abstract (loss)

RA2's classification + patent title (accuracy)

RA2's classification + patent title (loss)

Common classification + patent abstract (accuracy)

Common classification + patent abstract (loss)

Common classification + patent title (accuracy)

Common classification + patent title (loss)

The table below summarizes the six models' performance on the test set (the random sample of 100 patents that never entered the training process):

Overall, RA1's classification based on patent abstracts seems to deliver the best results.

3. Patent Classification

The fifth model (common classification + patent abstract) is used to classify patents. The model reads a preprocessed and vectorized patent text and assigns the patent a probability: the likelihood that it is a quality patent.

Patents with a probability greater than or equal to 0.5 are classified as quality patents. The others are classified as cost patents.
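
A minimal sketch of this final step, reusing the preprocess function, the vectorizer, and the trained model from the sketches above; abstracts stands for a hypothetical list of raw patent abstracts.

    import numpy as np

    cleaned = [preprocess(a) for a in abstracts]
    X = vectorizer.transform(cleaned).toarray()
    probs = model.predict(X).ravel()  # probability that each patent is a quality patent
    labels = np.where(probs >= 0.5, "quality", "cost")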

Here is the abstract of a quality patent (PATSTAT APPLN_ID 8051302): 

The present invention relates to novel hypolipidemic, antiobesity, hypocholesterolemic and antidiabetic compounds. More particularly, the present invention relates to novel alkyl carboxylic acids of the general formula (I), their stereoisomers, pharmaceutically acceptable salts thereof and pharmaceutical compositions containing them where all symbols are as defined in the description.

Here is the abstract of a cost patent (PATSTAT APPLN_ID 481988354):

The utility model discloses a sunflower head harvesting device, including a plurality of slot type branch gansu province ware, a plurality ofly move back that pole ware, auger unload that piece, chain are taken off, harvesting platform with take off that the seed quick -wittedly is connected conveyer belt, spiral auger, is taken off the seed machine, melon seed storehouse, hydraulic motor, auger and push away the seed piece, the slot type divide gansu province ware with it sets up to move back pole ware interval, the slot type divides gansu province ware sprocket, upper scraper blade, upper chain scraper, upper scraper blade, the chain scraper of lower floor sprocket, the front chain wheel of lower floor, the chain chain scraper of lower floor, lower floor's scraper blade and the front chain wheel of lower floor axle behind the preceding scraper chain shaft in upper strata, upper scraper blade front chain wheel, upper scraper blade, the harvester bench is equipped with radar monitor camera. The utility model discloses sunflower head harvesting device has realized that degree of automation is high, has improved labor efficiency, has saved the manpower from picking, convey, take off the seed, unloading the automatic line production of seed, and material resources provide the condition for the plant in a large area of sunflower.