A Bloom filter is a space-efficient probabilistic data structure designed to test whether a given element is a member of a set. It uses the principles of hashing to efficiently store data.
Why do we use Bloom filters?
Bloom filters can very quickly answer the question: "Does this item exist in the set?". It takes O(1) constant time to add elements to a bloom filter and query them. You can think of a bloom filter as a data structure containing a number of buckets. When an item is added to the bloom filter, it passes through a pre-determined number of hash functions. These hash functions return a set of indices. Buckets of these indices are activated in the bloom filter. This is a constant time operation. When an item is queried, it is again passed through the hash functions. The indices returned are used to check if the element exists in the bloom filter. As you can imagine, this also happens in constant time.
Drawbacks:
One of the major drawbacks of the bloom filter is that it results in false positives. When you query an item, it could return true even though the item does not exist. It should be noted that the bloom filter can be configured to make this extremely rare. Increasing the size of the bloom filter or reducing the number of hash functions can decrease the false positive rate. Another drawback is that an item cannot be removed from the bloom filter.
Inspiration:
As I frantically spent my winter vacation tweaking my resume and applying to various companies for an internship, I couldn't help but wonder about the immense volume of applicants, for a given job role. The immense number of resumes that a company has to go through to select a candidate is not an easy task. Companies have begun to employ automated resume screening tools which search, sort, and filter candidate's resumes. These Applicant Tracking Systems (ATS) rank candidates based on keywords and other criteria set by the recruiter. The ATS uses artificial intelligence to identify keywords that determine your skills, experiences, certifications, degrees, etc.
I sensed an opportunity to enhance this efficiency using bloom filters. Remember that one of the drawbacks of the bloom filter is the existence of false positives. This means that a resume that is screened by a Bloom filter may detect a skill that does not actually exist in the resume. I find this to be an acceptable trade-off given the time and space saved. The purpose of the initial screening is to narrow down the pool of candidates for further rounds of assessment. The occasional false positive identified by the Bloom filter becomes inconsequential in the broader context of the hiring process.
With this in mind, I implemented an automated skill detection program using bloom filters. This system utilizes a Bloom filter initialized with a curated list of common technical skills prevalent in the IT industry. I incorporated the murmur3 hash algorithm in the bloom filter. By feeding PDF resumes through this filter, the program accurately extracts skills, streamlining the skill identification process.
I successfully demonstrated the efficiency gains achieved by employing the Bloom filter in terms of both space and time.
Key Features:
To optimize space utilization, the size of the bloom filter is dynamically allocated based on the input dataset.
An additional filter is used to check for duplicates in the skill list. This prevents redundancies in the final output.
Results:
The bloom filter implementation saves a considerable amount of space. For the given resume, it takes less than 50% of the space from the traditional search
Time is also saved, however, it is negligible
Further Developments:
To automate the Fine-tuning of the false positive rate parameter of the Bloom filter to strike a balance between memory efficiency and accuracy.
Add a GUI to make the interface more user-friendly
Incorporate NLP techniques to extract context and semantics from resumes for efficient detection
Introduce a feedback mechanism where users can provide insights into the accuracy of the extracted skills. This could include the automated update to the dataset.
Update:
Implemented an intuitive graphical user interface (GUI) tailored for job recruiters, providing a platform for efficient resume analysis based on specific skill sets. The user-friendly interface enables recruiters to swiftly assess resumes, facilitating a seamless and targeted skill-matching process.