Integration of innovation and technology transfer concepts to provide market-driven solutions through incubation programs

MetaExtractor

Automatic Content Classification Based on Extraction of Metadata and Keywords

Making the search for scientific content faster and more effective

The MetaExtractor project aims to develop a general machine learning model and a software tool that automatically extract metadata and keywords for use in searching and classifying scientific documents.

Stages and Milestones

Literature Search and Review

Data Collection and Preparation

Extracting search-related metadata from these documents

Extracting classification-related keywords from these documents

Building the Model: training, testing, and classifying

Building A Cloud-Transformed Realization of the Model

Transforming the model into a cloud-based tool

Stages and Milestones

Literature Search and Review

Searching for and reviewing existing work on metadata extraction, keyword extraction, and document classification based on extracted metadata and keywords, including work on Arabic-language documents.

Data Collection and Preparation

Selecting and collecting research documents to form a dataset of English and Arabic documents in the commerce domain. Determining a number of research documents sufficient to be divided later into training and test datasets when the classification algorithm is applied. Applying data preprocessing such as cleaning, transforming, reformatting, correcting, and combining as necessary.
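The later division into training and test datasets could be sketched as follows. This is only an illustration under stated assumptions, not the project's actual procedure: the function name `split_dataset`, the 80/20 ratio, the fixed seed, and the file names are all hypothetical.

```python
import random


def split_dataset(documents, test_fraction=0.2, seed=42):
    """Shuffle documents reproducibly and return (train, test) lists.

    test_fraction and seed are illustrative defaults, not project settings.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(documents)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    # Keep at least one document in the test set.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names standing in for the collected research documents.
docs = [f"paper_{i:02d}.pdf" for i in range(10)]
train_docs, test_docs = split_dataset(docs)
```

Fixing the seed keeps the split reproducible, so training and evaluation runs can be compared on the same documents.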

Extracting search-related metadata from these documents

Preparing the training dataset by extracting important metadata from these documents, such as title, abstract, author(s), institution, dates, and other features and attributes related to the research topic. This metadata drives the search mechanism and is intended to reduce both the time and the effort that searching requires.
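To make the kind of record involved more concrete, here is a rough rule-based sketch of pulling a few of these fields out of plain text. It is not the project's machine learning model. It assumes, purely for illustration, that the title is on the first line, the authors on the second, and that the abstract follows an "Abstract" heading. The names `DocumentMetadata` and `extract_metadata` and the sample text are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DocumentMetadata:
    title: str = ""
    authors: list = field(default_factory=list)
    year: str = ""
    abstract: str = ""


def extract_metadata(text):
    """Heuristic placeholder: assumes title on line 1 and authors on line 2."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    meta = DocumentMetadata()
    if lines:
        meta.title = lines[0]
    if len(lines) > 1:
        # Split the author line on commas and the word "and".
        parts = re.split(r",|\band\b", lines[1])
        meta.authors = [p.strip() for p in parts if p.strip()]
    # First four-digit number that looks like a year (19xx or 20xx).
    year = re.search(r"\b(?:19|20)\d{2}\b", text)
    if year:
        meta.year = year.group(0)
    # Text after an "Abstract" heading, up to a "Keywords" line or the end.
    match = re.search(
        r"^abstract[:.\s]*(.*?)(?=^keywords\b|\Z)",
        text,
        flags=re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    if match:
        meta.abstract = " ".join(match.group(1).split())
    return meta


# Made-up sample document used only to exercise the sketch.
sample = """Consumer Trust in Online Retail
Jane Doe, John Smith and Ali Hassan
Example University, 2019

Abstract
We study trust in online retail.
Results are based on a survey.

Keywords: trust, retail
"""
metadata = extract_metadata(sample)
```

Real documents rarely follow such a fixed layout, which is one reason a learned model is the goal here rather than fixed rules.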

Extracting classification-related keywords from these documents

Preparing the training dataset by extracting important keywords from these documents, such as topic-related keywords. These keywords drive the mechanism for classifying a given test research document according to the predefined classification schemes within the repository.
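One common way to pick topic-related keywords is TF-IDF scoring, sketched below with the standard library only. The source does not say which technique the project uses, so treat TF-IDF here as an assumption. The stop-word list, the function names, and the three-document corpus are illustrative, and Arabic text would need its own normalization and stop words.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much larger.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and keep runs of letters that are not stop words."""
    return [t for t in re.findall(r"[^\W\d_]+", text.lower()) if t not in STOPWORDS]


def top_keywords(corpus, doc_index, k=3):
    """Return the k terms of corpus[doc_index] with the highest TF-IDF score."""
    tokenized = [tokenize(doc) for doc in corpus]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    counts = Counter(tokenized[doc_index])
    total = sum(counts.values())
    n_docs = len(corpus)
    scores = {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Made-up one-line "documents" from the commerce domain.
corpus = [
    "Pricing strategy and pricing models in retail markets",
    "Consumer trust in online retail",
    "Supply chain risk in global markets",
]
keywords = top_keywords(corpus, 0)
```

Terms that occur often in one document but rarely elsewhere score highest, so "pricing" ranks above "retail" and "markets", which also appear in other documents.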

Building the Model: training, testing, and classifying

Developing a generalized machine learning model based on the prepared training and test data. The model is able to extract metadata and keywords and, based on these, to classify documents into groups under one or more specified classification schemes.
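As a rough illustration of the train, test, and classify cycle, the sketch below implements a small multinomial Naive Bayes classifier over keyword lists. The source does not name the project's algorithm, so Naive Bayes is a stand-in chosen for brevity. The class name, the labels, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over keyword lists with add-one smoothing."""

    def fit(self, keyword_lists, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for keywords, label in zip(keyword_lists, labels):
            self.term_counts[label].update(keywords)
        self.vocabulary = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, keywords):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total_terms = sum(self.term_counts[label].values())
            score = math.log(n_label / n_docs)  # log prior of the class
            for term in keywords:
                if term not in self.vocabulary:
                    continue  # ignore terms never seen during training
                count = self.term_counts[label][term]
                score += math.log((count + 1) / (total_terms + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, keyword_lists, labels):
    """Fraction of documents whose predicted label matches the true label."""
    hits = sum(model.predict(k) == y for k, y in zip(keyword_lists, labels))
    return hits / len(labels)


# Toy training and test data with made-up class labels.
train_keywords = [
    ["pricing", "retail", "discount"],
    ["consumer", "trust", "retail"],
    ["supply", "chain", "logistics"],
    ["shipping", "logistics", "warehouse"],
]
train_labels = ["marketing", "marketing", "operations", "operations"]
test_keywords = [["logistics", "shipping"], ["retail", "trust"]]
test_labels = ["operations", "marketing"]

model = NaiveBayesClassifier().fit(train_keywords, train_labels)
test_accuracy = accuracy(model, test_keywords, test_labels)
```

Evaluating on documents held out from training, as in `accuracy` above, is what the separate test dataset is for.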

Building A Cloud-Transformed Realization of the Model

Transforming the model into a cloud-based model deployed on Google App Engine, implemented together with the needed machine learning and data processing functions. The whole dataset is stored in Google Cloud Storage, where the model can access it.

Transforming the model into a cloud-based tool

Implementing this model as a proof-of-concept tool and realizing it as a standalone web/cloud application.
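A standalone web application of this kind might expose the model through a small HTTP endpoint. The sketch below uses only the standard library's WSGI conventions. The `/classify` route, the JSON shape, and the `classify` stand-in function are assumptions made for illustration, not the project's actual interface.

```python
import io
import json


def classify(text):
    """Hypothetical stand-in for the trained model: a single fixed rule."""
    return "operations" if "logistics" in text.lower() else "unclassified"


def app(environ, start_response):
    """WSGI app: POST /classify with JSON {"text": ...} returns {"label": ...}."""
    json_header = [("Content-Type", "application/json")]
    if environ.get("REQUEST_METHOD") != "POST" or environ.get("PATH_INFO") != "/classify":
        start_response("404 Not Found", json_header)
        return [b'{"error": "not found"}']
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        text = payload["text"]
    except (ValueError, KeyError, TypeError):
        start_response("400 Bad Request", json_header)
        return [b'{"error": "expected a JSON body with a text field"}']
    body = json.dumps({"label": classify(str(text))}).encode("utf-8")
    start_response("200 OK", json_header + [("Content-Length", str(len(body)))])
    return [body]


# In-process example call; no server is started.
request_body = json.dumps({"text": "Logistics costs in retail"}).encode("utf-8")
environ = {
    "REQUEST_METHOD": "POST",
    "PATH_INFO": "/classify",
    "CONTENT_LENGTH": str(len(request_body)),
    "wsgi.input": io.BytesIO(request_body),
}
statuses = []
response = app(environ, lambda status, headers: statuses.append(status))
```

For local experiments, the same `app` object can be served with `wsgiref.simple_server.make_server` from the standard library. Deployment configuration for a cloud host is not shown here.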
