What is it?
ETL (Extract, Transform, and Load) is a type of data integration process that consists of three phases:
Extract: Data is collected from various source systems, such as databases, CRM systems, and other repositories.
Transform: The extracted data is cleaned, enriched, converted, and formatted to meet the target system’s requirements.
Load: The transformed data is then loaded into a data warehouse or other target system.
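The three phases above can be sketched as a minimal pipeline. This is an illustrative toy, assuming hypothetical in-memory source and target structures rather than the Observatory's actual systems:

```python
# Minimal ETL sketch: extract from a source, transform records, load into a target.
def extract(source):
    """Collect raw records from a source system (here, a plain list)."""
    return list(source)

def transform(records):
    """Clean and format records to the target schema (here, trim and lowercase)."""
    return [{"name": r["name"].strip().lower()} for r in records]

def load(records, target):
    """Load transformed records into the target store (here, a list)."""
    target.extend(records)
    return target

warehouse = []
raw = [{"name": "  Solar Panel "}, {"name": "Wind Turbine"}]
load(transform(extract(raw)), warehouse)
print(warehouse)  # → [{'name': 'solar panel'}, {'name': 'wind turbine'}]
```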
The term "smart" in Smart ETL comes from the use of Artificial Intelligence (AI) to automate and optimize the ETL process.

Why is it relevant for this Observatory?
Currently, our data sources are the STT catalogs described below, from which we need to extract all types of data (text, images, links, QR codes, ...) and then classify the STTs according to the taxonomy created in this project. This last task is the most important, as it allows Observatory users to carry out searches tailored to their needs.
Doing these tasks manually would take too long, especially the classification task. By using AI, we can automate and optimize these processes.
In the future, we plan to expand our data sources to include automatic web searches, i.e. finding information on the web without the user having to type anything.
How did we implement it?
For the extraction phase we use the Python library PyMuPDF, along with other common libraries for text data, and Pillow, OpenCV, and fastai for infographic data.
In the transformation stage, the classification task is carried out using Large Language Models (LLMs), i.e., models of the ChatGPT type. We selected this AI technology because LLMs are the best at interpreting human-like text, which is essential for categorizing STTs based on their descriptions.
Before detailing the classification process, we must first define what a prompt is.
Prompt - a set of instructions or a question given to an LLM to elicit a specific response or action from it.
There are three possible approaches to classifying STTs using LLMs:
Fine-tuning - The process of adjusting a pre-trained LLM by continuing training on a smaller, specialized dataset to improve its performance on specific tasks or domains.
Prompt-guided - The method of providing the LLM with specific prompts that guide it to generate the desired output or perform a particular task, without additional training.
Fine-tuning combined with prompt-guided
Due to the lack of computing resources required for fine-tuning, we opted for a prompt-guided approach.
Within this approach there are several ways to proceed. We could employ the following techniques:
Chain-of-Thought (CoT) - Involves prompting the LLM to provide intermediate steps or reasoning paths leading to an answer.
Tree-of-Thought (ToT) - Involves prompting the LLM to consider various reasoning pathways and potential outcomes, much like branches on a tree, to reach a comprehensive answer.
Zero-shot learning - The LLM generates a response without any prior examples, based solely on the prompt.
Few-shot learning - The LLM is provided with a few examples to guide its response.
Role-Based Expert Assignment - Involves prompting the LLM to adopt expert roles for specific solution domains, so that each domain is handled by a simulated expert specialized in that area.
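To make the difference between the last two techniques concrete, a zero-shot prompt asks for a classification directly, while a few-shot prompt prepends worked examples. The categories and STT names below are invented purely for illustration:

```python
# Zero-shot: the LLM gets only the task, with no examples.
zero_shot = (
    "Classify the following STT into one of the categories A, B, or C.\n"
    "STT: Solar-powered water pump."
)

# Few-shot: a handful of solved examples precede the STT to classify.
examples = [
    ("Low-cost biogas digester", "A"),
    ("Community rainwater harvesting tank", "B"),
]
few_shot = "".join(
    f"STT: {desc}\nCategory: {cat}\n\n" for desc, cat in examples
) + "STT: Solar-powered water pump.\nCategory:"

print(few_shot)
```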
After researching the most suitable technique(s) for classifying STTs, we chose few-shot learning combined with role-based expert assignment.
Our methodology with the LLM to classify STTs is summarized as follows:
The first prompt explains the STT taxonomy.
The following prompts provide the few-shot classification examples that guide the LLM (one per prompt). The number of examples is sufficient to cover all the categories of the taxonomy.
Finally, the unclassified STTs are provided (one per prompt). In each prompt, before providing the STT to be classified, we ask the LLM to create one expert per application domain of the STT taxonomy (in this case, 3 experts). Each expert assigns a score from 1 to 5 to every category in their domain; the categories with a score higher than 3, if any, are that expert's answer to the classification problem.
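The prompt sequence above can be sketched as follows. The `build_prompts` helper, its arguments, and the wording of the prompts are our own illustration; in practice, the resulting prompts are sent one by one to a chat LLM:

```python
def build_prompts(taxonomy_text, examples, stts_to_classify, n_experts=3):
    """Build the ordered prompt sequence: taxonomy first, then one few-shot
    example per prompt, then one expert-scored request per unclassified STT."""
    prompts = [f"Here is the STT taxonomy:\n{taxonomy_text}"]
    # One classification example per prompt, covering every category.
    for desc, categories in examples:
        prompts.append(f"Example STT: {desc}\nCategories: {', '.join(categories)}")
    # One unclassified STT per prompt, with role-based expert scoring.
    for desc in stts_to_classify:
        prompts.append(
            f"Create {n_experts} experts, one per application domain of the taxonomy. "
            "Each expert scores every category in their domain from 1 to 5; "
            "categories scored above 3 are that expert's answer.\n"
            f"STT to classify: {desc}"
        )
    return prompts
```

For example, `build_prompts("Domain A: cat1, cat2", [("Biogas digester", ["cat1"])], ["Solar pump"])` yields three prompts: the taxonomy, one example, and one classification request.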
Our experiments with Copilot (Bing Chat), ChatGPT, and Gemini have shown that this classification is feasible. However, the performance metrics obtained so far are not good enough for large-scale classification.
A possible future direction for better results is to fine-tune an LLM and, perhaps, to combine it with our current methodology to get the best possible results. In that case, the first prompt would no longer be necessary, since the LLM would already be an expert in the subject.
Content from the following catalogs has been added to the Observatory with permission from the organizations listed below.