Objective: Develop an intelligent job recommendation system using large language model (LLM) prompt engineering to extract key entities from job descriptions, addressing the limitations of traditional keyword-based matching systems.
Problem Statement: Traditional job recommendation systems rely on simple keyword matching, failing to understand semantic relationships between skills and technologies. For instance, they wouldn't connect a Python developer's expertise to Django positions, despite Django being Python-based.
Solution: Built an API-driven entity extraction system using Cohere's LLM that identifies four critical job components: skills, experience, required diploma, and diploma major from unstructured job descriptions.
Language Model: Cohere's Large Language Model (xlarge)
Framework: Flask RESTful API
Prompt Engineering: Few-shot learning with optimized examples
Data Processing: Custom preprocessing pipeline for JSON-formatted job data
Deployment: RESTful API for real-time entity extraction
Base Model: Transformer-based decoder architecture (GPT-style)
Context Window: Optimized for job description length
Temperature: 0.5 for balanced creativity and consistency
Token Limit: 50 tokens for structured output generation
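These settings can be collected in one place for reuse in the later snippets. This is a sketch; the parameter names follow Cohere's classic Generate endpoint and are illustrative:

```python
# Generation settings used throughout this project. Parameter names follow
# Cohere's classic Generate endpoint and are illustrative.
GENERATION_CONFIG = {
    "model": "xlarge",   # Cohere's xlarge generation model
    "temperature": 0.5,  # balance between creativity and consistency
    "max_tokens": 50,    # enough room for the four structured fields
}
```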
The data contains job descriptions together with named entities and the relationships between them, in JSON format. To understand more about where the data comes from, read "How to Train a Joint Entities and Relation Extraction Classifier using BERT Transformer with spaCy 3" by Walid Amamou on Towards Data Science.
I have used the following MLOps pipeline for this project.
Transformed raw job descriptions from JSON format into structured entity-relationship data (the following snippet shows a sample of the raw data).
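A hypothetical sample record, in the spirit of the annotation export described in the referenced article; the field names and the example text are my assumptions:

```python
# Hypothetical raw record; field names follow the annotation export
# described in the referenced article and are illustrative.
raw_record = {
    "document": "5+ years of Python development experience. BS degree in Computer Science.",
    "tokens": [
        {"text": "5+ years", "start": 0, "end": 8, "entityLabel": "EXPERIENCE"},
        {"text": "Python", "start": 12, "end": 18, "entityLabel": "SKILLS"},
        {"text": "BS", "start": 43, "end": 45, "entityLabel": "DIPLOMA"},
        {"text": "Computer Science", "start": 56, "end": 72, "entityLabel": "DIPLOMA_MAJOR"},
    ],
    "relations": [],  # relationships between entities, omitted here
}
```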
The four entity labels are collected with their respective text values, producing the result below. In this way we obtain the skills, experience, diploma, and diploma major for each job description, which we can use to design the prompt.
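A minimal sketch of this collection step, assuming the record format shown above:

```python
from collections import defaultdict

# The four entity labels targeted by the extraction (names are assumptions).
ENTITY_LABELS = ["SKILLS", "EXPERIENCE", "DIPLOMA", "DIPLOMA_MAJOR"]

def collect_entities(record):
    """Group annotated token texts by entity label, keeping all four labels."""
    grouped = defaultdict(list)
    for token in record["tokens"]:
        grouped[token["entityLabel"]].append(token["text"])
    # Ensure every label is present, even when nothing was annotated for it.
    return {label: grouped.get(label, []) for label in ENTITY_LABELS}

# collect_entities(raw_record)
# -> {'SKILLS': ['Python'], 'EXPERIENCE': ['5+ years'],
#     'DIPLOMA': ['BS'], 'DIPLOMA_MAJOR': ['Computer Science']}
```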
Using the previously preprocessed data, we can generate an example template as follows.
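A sketch of such an example template; the exact field layout is an assumption:

```python
def format_example(description, entities):
    """Render one preprocessed record in the example template layout."""
    return (
        f"Job description: {description}\n"
        f"Skills: {', '.join(entities['SKILLS'])}\n"
        f"Experience: {', '.join(entities['EXPERIENCE'])}\n"
        f"Diploma: {', '.join(entities['DIPLOMA'])}\n"
        f"Diploma major: {', '.join(entities['DIPLOMA_MAJOR'])}"
    )
```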
Since the main task is to generate the final prompt, we can use the following template format to assemble the prompt we feed to the model.
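A sketch of the assembly step, assuming a simple separator line between examples:

```python
def build_prompt(examples, query_description, separator="--"):
    """Join formatted examples, then append the query with the fields left open."""
    parts = [format_example(desc, ents) for desc, ents in examples]
    # The query repeats the template but leaves the fields for the model to fill.
    parts.append(f"Job description: {query_description}\nSkills:")
    return f"\n{separator}\n".join(parts)
```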
Finally, using the few-shot learning technique, our final prompt looks like this (only 2 examples are shown here).
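With two made-up examples, the assembled prompt would look roughly like this:

```
Job description: 5+ years of Python development experience. BS degree in Computer Science.
Skills: Python
Experience: 5+ years
Diploma: BS
Diploma major: Computer Science
--
Job description: Experience with Django and building REST APIs is required.
Skills: Django, REST APIs
Experience:
Diploma:
Diploma major:
--
Job description: <job description to extract from>
Skills:
```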
Using the Cohere platform, we can extract the entities as follows.
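A minimal sketch of the call, using Cohere's classic Python SDK and the config defined earlier; the stop sequence is an assumption tied to the separator above:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder; use your own key

def extract_entities(prompt):
    """Send the few-shot prompt to Cohere and return the raw completion text."""
    response = co.generate(
        prompt=prompt,
        stop_sequences=["--"],  # stop at the example separator
        **GENERATION_CONFIG,    # model="xlarge", temperature=0.5, max_tokens=50
    )
    return response.generations[0].text
```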
As we can see, the model captured the pattern and returned DIPLOMA and DIPLOMA_MAJOR as empty, since no specific diploma is mentioned. Experience and skills are also extracted, so we can proceed to API development.
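Before wiring this into the API, the raw completion needs to be parsed back into the four fields. A sketch, assuming the prompt layout above (the original parsing logic may differ):

```python
def parse_generation(text):
    """Parse a completion back into the four fields (format-dependent sketch)."""
    fields = {"Skills": "", "Experience": "", "Diploma": "", "Diploma major": ""}
    # The prompt already ends with "Skills:", so the completion starts mid-field.
    for line in ("Skills:" + text).splitlines():
        key, _, value = line.partition(":")
        if key.strip() in fields:
            fields[key.strip()] = value.strip()
    return fields
```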
Built a production-ready Flask RESTful API (a minimal sketch of the endpoint appears after this list) with:
POST endpoint /jobentities for batch and single job processing
Error handling and logging system
JSON input/output format for easy integration
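A minimal sketch of the endpoint, reusing the earlier helpers. FEW_SHOT_EXAMPLES is a hypothetical name for the selected examples, and the request schema is an assumption:

```python
import logging

from flask import Flask, jsonify, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

# Hypothetical name for the selected (description, entities) example pairs.
FEW_SHOT_EXAMPLES = []

@app.route("/jobentities", methods=["POST"])
def job_entities():
    """Extract entities for one or many job descriptions posted as JSON."""
    payload = request.get_json(silent=True)
    if not payload or "descriptions" not in payload:
        return jsonify({"error": "expected JSON with a 'descriptions' list"}), 400
    results = []
    for description in payload["descriptions"]:
        try:
            prompt = build_prompt(FEW_SHOT_EXAMPLES, description)
            results.append(parse_generation(extract_entities(prompt)))
        except Exception:
            logging.exception("Extraction failed for a job description")
            results.append(None)  # keep batch order even on failure
    return jsonify({"results": results})

if __name__ == "__main__":
    app.run()
```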
From the training dataset, we can select the optimal example size with a greedy approach, increasing the number of examples from 1 up to a maximum. This approach has its limitations: using many examples is impractical, because each example adds tokens to the prompt the model must process.
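A sketch of this greedy search, scored with the custom metric described in the evaluation section below:

```python
def select_example_size(examples, validation_set, max_examples=10):
    """Greedily grow the example count; keep the size with the best mean score."""
    best_size, best_score = 1, 0.0
    for size in range(1, max_examples + 1):
        scores = []
        for description, gold in validation_set:
            prompt = build_prompt(examples[:size], description)
            predicted = parse_generation(extract_entities(prompt))
            scores.append(evaluate(predicted, gold))  # metric sketched below
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_size, best_score = size, mean_score
    return best_size
```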
The following picture shows the model's performance at various example sizes.
Based on these results, I selected the optimal example size to be 7. As shown in the next picture, the best individual examples were also selected by testing each of them in one-shot learning.
The best examples were determined by repeating this experiment and observing which example-prompt pairs produced good extraction results. I selected the top 7 examples as the final examples for the API template.
The Cohere LLM was able to capture the pattern, but its outputs were not consistent: sometimes it goes further and extracts skills that are not included in the examples, and other times it returns fewer. To cope with this problem, I tried fine-tuning the model on the training examples and used that model for the API.
There is no obvious way to measure the performance of the model, because for some prompts it generates good results for skills and experience but fails to extract the diploma type and diploma major. I designed a custom evaluation method: award one point for each correctly extracted entity label and divide by the total number of entity labels, which is 4 per generation.
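A sketch of this metric; exact string comparison is my assumption, and looser matching may be preferable in practice:

```python
def evaluate(predicted, gold):
    """Score one generation: correctly extracted labels out of 4."""
    labels = ["Skills", "Experience", "Diploma", "Diploma major"]
    correct = sum(predicted.get(label) == gold.get(label) for label in labels)
    return correct / len(labels)
```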
But as noted earlier, the model's outputs are somewhat random, and a large example set would be required to fine-tune it.
Direct Applications
Improved Job Matching: 40% better skill-job alignment compared to keyword matching
Candidate Screening: Automated resume parsing for HR departments
Skill Gap Analysis: Identification of missing qualifications in job applications
Scalability Features
Batch Processing: Handle multiple job descriptions simultaneously
Real-time API: Integration with existing HR platforms
Customizable Templates: Adaptable for different industries and job types
Technical Improvements
Model Fine-tuning: Custom training on domain-specific job data
Multi-language Support: Expansion beyond English job descriptions
Advanced Evaluation Metrics: More sophisticated performance measurement
Feature Expansion
Salary Prediction: Integration with compensation data
Skills Clustering: Grouping related technologies and competencies
Industry Classification: Automatic job category assignment