The "ElectroCom61" dataset contains 2,121 annotated images of electronic components sourced from the Electronic Lab Support Room of United International University (UIU). It was designed to support the development and validation of machine learning models for real-time detection of electronic components. To mimic real-world scenarios and improve the robustness of models trained on this data, images were captured under varied lighting conditions and against diverse backgrounds, and each component was photographed from multiple angles. After collection, images were standardized through auto-orientation and resized to 640x640 pixels, which introduces some stretching. The dataset comprises 61 distinct classes of commonly used electronic components and was split into training (70%), validation (20%), and test (10%) sets.
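A 70/20/10 split like the one above can be sketched with a seeded shuffle; this is a minimal illustration (the split function and seed are hypothetical, not the dataset authors' actual tooling), but the resulting partition sizes follow directly from the 2,121-image total:

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle deterministically, then cut into train/val/test partitions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# ElectroCom61 has 2,121 images; 70/20/10 yields these partition sizes:
train, val, test = split_dataset(range(2121))
print(len(train), len(val), len(test))  # 1484 424 213
```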
The MosquitoFusion dataset comprises 1,204 meticulously curated images and serves as a comprehensive resource for advancing real-time mosquito detection models. It is divided into training, validation, and test sets containing 87%, 8%, and 5% of the images, respectively. Preprocessing involves auto-orientation and resizing to a standard 640x640 pixels, and a filter-null criterion ensures that every image contains at least one annotation. Augmentations, including flips, rotations, crops, and grayscale conversion, increase the dataset's diversity and foster robust model training. With its focus on quality and variety, the dataset provides a solid foundation for evaluating and improving real-time mosquito detection models.
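The flip, rotation, and grayscale augmentations mentioned above can be sketched in pure Python on a nested-list pixel grid; this is an illustrative sketch only (real pipelines use an image library), and the sample image is hypothetical:

```python
def hflip(img):
    """Horizontal flip: reverse each row of an H x W pixel grid."""
    return [row[::-1] for row in img]

def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def to_grayscale(img):
    """Per-pixel luminance grayscale (ITU-R BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in img]

# A hypothetical 2x2 RGB image:
img = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (255, 255, 255)]]
assert hflip(hflip(img)) == img  # flipping twice restores the original
```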
The accompanying paper, "MosquitoFusion: A Multiclass Dataset for Real-Time Detection of Mosquitoes, Swarms, and Breeding Sites Using Deep Learning", was accepted at the ICLR 2024 Tiny Papers track, where it was selected among the top papers under the "Invited to present (notable)" category.
Toxic Prompt Classification Dataset: The Toxic Prompt Classification Dataset contains a collection of prompts labeled as either toxic or non-toxic, with the primary aim of identifying prompts designed to manipulate or exploit large language models (LLMs). The initial dataset consists of 16,029 non-toxic (FALSE) prompts and 1,952 toxic (TRUE) prompts, highlighting an imbalance in class distribution. To address this, synthetic data generation was performed using GPT, resulting in a significantly expanded dataset with 37,333 toxic (TRUE) samples and 16,053 non-toxic (FALSE) samples.
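The class imbalance described above, and its reversal after synthetic data generation, can be verified directly from the reported counts (the helper function is a hypothetical sketch):

```python
def imbalance_ratio(counts):
    """Majority-to-minority class ratio for a label -> count mapping."""
    return max(counts.values()) / min(counts.values())

# Counts reported in the dataset description:
before = {"non_toxic": 16029, "toxic": 1952}
after = {"non_toxic": 16053, "toxic": 37333}

print(round(imbalance_ratio(before), 2))  # 8.21 -- non-toxic dominates
print(round(imbalance_ratio(after), 2))   # 2.33 -- toxic is now the majority
```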
Forbidden Question Reasoning Dataset: The Forbidden Question Reasoning Dataset is designed to classify and analyze prompts that contain questions deemed inappropriate or unethical for LLMs to answer. It includes 13 distinct scenarios, covering a wide range of forbidden question types, such as Illegal Activity, Hate Speech, Malware Generation, Physical Harm, Economic Harm, Fraud, Pornography, Political Lobbying, Privacy Violence, Legal Opinion, Financial Advice, Health Consultation, and Government Decision. Each scenario is represented by 8,250 samples, ensuring balanced representation across classes.
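A label mapping for the 13 scenarios, together with the total sample count implied by 8,250 samples per class, can be sketched as follows (the mapping and class indices are illustrative, not an official schema):

```python
# The 13 scenarios listed in the dataset description:
SCENARIOS = [
    "Illegal Activity", "Hate Speech", "Malware Generation", "Physical Harm",
    "Economic Harm", "Fraud", "Pornography", "Political Lobbying",
    "Privacy Violence", "Legal Opinion", "Financial Advice",
    "Health Consultation", "Government Decision",
]
# Hypothetical name -> integer-id mapping for training a classifier:
LABEL_TO_ID = {name: i for i, name in enumerate(SCENARIOS)}

SAMPLES_PER_SCENARIO = 8250
total = len(SCENARIOS) * SAMPLES_PER_SCENARIO
print(total)  # 107250 samples across 13 balanced classes
```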
VisText-Mosquito is a comprehensive multimodal dataset designed to support the detection of mosquito breeding sites, the segmentation of water surfaces, and the generation of natural language reasoning for explainable AI applications. It consists of three core components:
Breeding Place Detection: This part includes 1,828 images with 3,752 annotations across five classes: Coconut-Exocarp, Vase, Tire, Drain-Inlet, and Bottle. The images were collected from diverse urban, semi-urban, and rural environments in Bangladesh under daylight conditions to ensure visual consistency. Detection performance was validated using state-of-the-art object detection models, including YOLOv5s, YOLOv8n, and YOLOv9s, with YOLOv9s achieving the highest mAP@50.
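The mAP@50 metric used above counts a predicted box as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5; a minimal IoU computation, with boxes as (x1, y1, x2, y2) and the example coordinates purely hypothetical, looks like this:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A hypothetical predicted box vs. its ground truth:
pred = (100, 100, 300, 300)
truth = (150, 150, 350, 350)
print(iou(pred, truth) >= 0.5)  # False -- IoU is about 0.39, below the 0.5 cutoff
```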
Water Surface Segmentation: This component contains 142 images with 253 annotations across two classes: Vase with Water and Tire with Water. YOLOv8x-Seg and YOLOv11n-Seg models were used to validate segmentation performance in detecting water surfaces within the identified containers.
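Segmentation annotations like "Vase with Water" are typically stored as polygons; the annotated water-surface area can be recovered with the shoelace formula. This is a generic sketch (the polygon coordinates are hypothetical, not taken from the dataset):

```python
def polygon_area(points):
    """Shoelace formula: area of a polygon given (x, y) vertices in order."""
    n = len(points)
    s = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

# Hypothetical "Vase with Water" polygon in pixel coordinates:
water = [(10, 10), (60, 10), (60, 40), (10, 40)]
print(polygon_area(water))  # 1500.0 square pixels
```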
Textual Reasoning Generation: Each image is linked with a natural language reasoning statement that explains the presence or absence of breeding risk. A fine-tuned BLIP model was used to generate these explanations, achieving strong performance on BLEU, BERTScore, and ROUGE-L metrics.
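ROUGE-L, one of the reported metrics, scores generated reasoning text by the longest common subsequence (LCS) it shares with a reference; a minimal sketch of the F1 variant over whitespace tokens (the example sentences are hypothetical):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 from LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

print(rouge_l_f1("stagnant water in the tire",
                 "stagnant water in the tire"))  # 1.0 -- a perfect match
```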