Integration of innovation and technology transfer concepts to provide market-driven solutions through incubation programs
Document Classification Based on Metadata and Keywords Extraction
Abstract: We present a model for the automatic extraction of metadata and keywords to be used in the classification of scientific documents. The model consists of three stages: metadata extraction, keywords extraction, and document classification. At the metadata extraction stage, various metadata items are extracted from research documents in the commerce domain, such as the title of the thesis/research article, author(s), advisor(s), year, publisher, type, and abstract. At the keywords extraction stage, Latent Semantic Indexing (LSI) is used to extract the underlying topics from these documents. At the classification stage, which depends on the metadata and keywords extraction stages, three classification algorithms are used: Stochastic Gradient Descent (SGD), Linear Support Vector Classification (LSVC), and K-Nearest Neighbors (KNN). SGD achieved the highest classification accuracy (80.5%) compared to LSVC and KNN when applied to the Arabic document corpus, while LSVC achieved the highest classification accuracy (81.5%) compared to SGD and KNN when applied to the English document corpus.
Initially, a dataset of more than 1000 research articles in the commerce domain is built. The documents come from the social sciences domain, specifically from the following subdomains (classes): Personnel Management, Public Accounting Auditing, Finance, Commerce, Investment, Public Finance, Marketing, Business Ethics, Advertising, Insurance, Bank Loans, Public Debts, Inflation, and Taxation.
The dataset is later divided into train and test sets and used for various purposes, including experimentation.
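As a sketch of this splitting step (the 80/20 ratio, the toy corpus, and the label names below are illustrative assumptions, not taken from the paper), the labeled documents can be shuffled and partitioned as follows:

```python
import random

def train_test_split(documents, test_ratio=0.2, seed=42):
    """Shuffle labeled documents and split them into train and test sets.

    `documents` is a list of (text, class_label) pairs; the 80/20 ratio
    is an assumption for illustration, not stated in the paper.
    """
    rng = random.Random(seed)
    shuffled = documents[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Toy corpus standing in for the 1000+ commerce-domain documents.
corpus = [(f"document text {i}", "Finance" if i % 2 else "Marketing")
          for i in range(10)]
train_set, test_set = train_test_split(corpus)
```

Fixing the random seed keeps the split reproducible across experiment runs, so the same train and test sets can be reused when comparing the three classifiers.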
The figure below shows an abstract view of the basic machine learning model that is used to extract metadata and keywords and to classify the given research document(s) according to the predefined Library of Congress classification scheme. Based on the domain-specific research dataset prepared in the first stage of the model, the model is trained and tested to extract metadata and keywords as well as to classify research documents according to the given classification scheme. Metadata include the research title, author(s), year and month, publisher, etc. Keywords include research topic-related terms and words, and some phrases as needed. These output data items are fed back to the model so it can use them in future extractions and classifications.
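The keyword-extraction and classification stages can be sketched with a scikit-learn pipeline that chains TF-IDF features, truncated SVD (the standard realization of LSI), and an SGD classifier. The toy corpus, class labels, and hyperparameters below are illustrative assumptions; the paper does not specify them:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier

# Tiny stand-in corpus; the real model is trained on the 1000+ documents.
docs = [
    "bank loan interest credit payment",
    "loan repayment bank credit rate",
    "advertising campaign brand market customers",
    "market brand promotion advertising sales",
    "insurance policy premium claim coverage",
    "premium claim insurance policy risk",
]
labels = ["Bank Loans", "Bank Loans", "Marketing",
          "Marketing", "Insurance", "Insurance"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                    # term weighting
    ("lsi", TruncatedSVD(n_components=2,             # LSI topic space
                         random_state=0)),
    ("clf", SGDClassifier(loss="hinge",              # SGD classifier stage
                          random_state=0)),
])
pipe.fit(docs, labels)
pred = pipe.predict(["insurance premium claim"])[0]
```

Swapping the final step for `LinearSVC` or `KNeighborsClassifier` reproduces the other two classifiers the paper compares, with the LSI stage shared across all three.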
The figure below shows the architecture of the cloud-transformed realization of the basic machine learning model shown above. After the basic model is trained and tested, it is serialized together with the research dataset (train and test) to Google Cloud Storage as bucket storage and to Google App Engine as a model implemented in Python, together with the needed machine learning and data processing functions. The model running in Google App Engine requests and loads the research dataset(s) from Google Cloud Storage as needed. The implemented model (now an extraction tool) exposes a RESTful interface through which a user can submit extraction and classification tasks and retrieve their results.
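A minimal sketch of the serialization step, using Python's pickle module; the bucket and blob names are hypothetical (the paper does not give them), so the Google Cloud Storage upload itself is shown in comments only:

```python
import pickle

# Stand-in for the trained model object (extraction + classification pipeline).
trained_model = {"vectorizer": "tfidf", "reducer": "lsi", "classifier": "sgd"}

# Serialize the trained model to bytes so it can be shipped to the cloud.
model_bytes = pickle.dumps(trained_model)

# Upload sketch (hypothetical names; requires the google-cloud-storage package):
# from google.cloud import storage
# client = storage.Client()
# bucket = client.bucket("research-dataset-bucket")      # hypothetical bucket
# bucket.blob("models/extractor.pkl").upload_from_string(model_bytes)

# The code running in Google App Engine restores the model the same way
# in reverse: download the blob's bytes, then unpickle them.
restored = pickle.loads(model_bytes)
```

Keeping the serialized model in a bucket separate from the App Engine deployment means the model can be retrained and re-uploaded without redeploying the application.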
An extraction tool is built and deployed to the Google cloud, specifically to Google App Engine (GAE). It is based on the basic model illustrated above and on the cloud-transformed realization of the model shown above. The user uploads a research document as text or PDF through the GUI, and it is sent to the tool running in Google App Engine. Based on the model, the tool analyses the document and returns the extracted metadata, the extracted keywords, and the chosen classification of the given research document.
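The tool's response for an uploaded document can be sketched as a single handler function; the field names, the first-line-as-title heuristic, and the placeholder class label are illustrative assumptions, not the paper's actual extraction logic:

```python
def analyse_document(text):
    """Return metadata, keywords, and a class label for a research document.

    A toy stand-in for the deployed GAE tool: the real tool applies the
    trained extraction/classification model instead of these heuristics.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0] if lines else ""
    # Naive keyword pick: most frequent words longer than four characters.
    words = [w.lower().strip(".,") for w in text.split() if len(w) > 4]
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    keywords = sorted(freq, key=freq.get, reverse=True)[:5]
    return {
        "metadata": {"title": title},
        "keywords": keywords,
        "classification": "Finance",  # placeholder, not a real prediction
    }

result = analyse_document(
    "A Study of Bank Loans\nbank loans interest rates bank loans"
)
```

In the deployed tool this dictionary would be returned as the JSON body of the RESTful response to the user's upload request.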