Integration of innovation and technology transfer concepts to provide market-driven solutions through incubation programs
Making scientific content search faster and more effective
The MetaExtractor project aims to develop a general machine learning model and a software tool that automatically extract metadata and keywords for use in searching and classifying scientific documents.
Surveying and reviewing existing work on metadata extraction, keyword extraction, and document classification based on extracted metadata and keywords, including work on Arabic-language documents.
Selecting and collecting research documents to form a dataset of English and Arabic documents in the commerce domain. Determining how many research documents are needed so that the dataset can later be divided into training and testing sets for the classification algorithm. Applying data preprocessing as necessary, such as cleaning, transforming, reformatting, correcting, and combining.
Preparing the training dataset by extracting important metadata from these documents, such as title, abstract, author(s), institution, dates, and other features and attributes related to the research topic. This metadata drives the search mechanism and thereby reduces the time and effort needed to search.
Preparing the training dataset by extracting important topic-related keywords from these documents. These keywords drive the mechanism that classifies a given test research document according to the predefined classification schemes within the repository.
Developing a generalized machine learning model based on the prepared training and test data. The model extracts metadata and keywords and uses them to classify documents into groups under one or more specified classification schemes.
Deploying the model as a cloud-based model on Google App Engine, together with the required machine learning and data processing functions. The whole dataset is stored in Google Cloud Storage, where the model can access it.
Implementing this model as a proof-of-concept tool and realizing it as a standalone web/cloud application.
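As a rough illustration of the dataset splitting and keyword extraction steps listed above, the sketch below splits a small corpus into training and testing sets and ranks each document's terms by TF-IDF. It is a minimal sketch under stated assumptions, not the project's actual method: the text does not name a keyword extraction technique, so TF-IDF is a stand-in, and the corpus `DOCS`, the `STOPWORDS` set, and the function names `tokenize`, `split_dataset`, and `tfidf_keywords` are all hypothetical. The tokenization is deliberately naive; Arabic text in particular would likely need a dedicated normalizer and tokenizer.

```python
import math
import random
import re
from collections import Counter

# Hypothetical toy corpus; the project's real dataset is not shown in the text.
DOCS = [
    "Electronic commerce adoption among small retail firms",
    "Consumer trust in online retail payment systems",
    "Supply chain finance for small export firms",
    "Mobile payment adoption and consumer trust",
]

# Hypothetical, deliberately tiny stopword list for the toy corpus.
STOPWORDS = {"in", "for", "and", "among", "of", "the"}


def tokenize(text):
    """Lowercase, split on non-word characters, and drop stopwords."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]


def split_dataset(docs, test_ratio=0.25, seed=0):
    """Shuffle deterministically and return (train, test) lists."""
    shuffled = docs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for stability.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


if __name__ == "__main__":
    train, test = split_dataset(DOCS)
    print(len(train), "training documents,", len(test), "testing documents")
    for doc, keywords in zip(DOCS, tfidf_keywords(DOCS)):
        print(doc, "->", keywords)
```

In a fuller pipeline, the extracted keywords and metadata fields would become the features on which a classifier is trained with the training set and evaluated with the testing set.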