ABSTRACT: This research addresses the limitations of keyword search in biomedical literature by developing a Semantic Textual Similarity (STS) model. Using a teacher-student framework, we trained a model on the large-scale BioASQ dataset to capture complex medical language. We compared a baseline dual-loss training model against an "Adjusted Dual-Loss" model, which adds further regularization to improve training stability.
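The abstract does not publish the exact objective, so the following is a minimal sketch of a generic dual-loss setup under stated assumptions: a supervised STS loss plus a distillation loss against teacher similarity scores, with an L2 penalty standing in for the "Adjusted" model's added regularization. All function names, weights, and values are illustrative, not taken from the paper.

```python
def mse(pred, target):
    """Mean squared error between two equal-length score lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def dual_loss(student_scores, gold_scores, teacher_scores,
              weights, alpha=0.5, l2=0.0):
    """Weighted sum of a supervised loss (vs. gold similarity labels) and a
    distillation loss (vs. teacher scores), plus optional L2 regularization.
    Setting l2 > 0 mimics the 'adjusted' variant's extra regularization."""
    supervised = mse(student_scores, gold_scores)
    distill = mse(student_scores, teacher_scores)
    reg = l2 * sum(w * w for w in weights)  # hypothetical regularization term
    return alpha * supervised + (1 - alpha) * distill + reg

# Toy example: three sentence pairs scored by student, gold labels, and teacher.
student = [0.8, 0.2, 0.6]
gold = [1.0, 0.0, 0.5]
teacher = [0.9, 0.1, 0.55]
baseline = dual_loss(student, gold, teacher, weights=[0.3, -0.2], l2=0.0)
adjusted = dual_loss(student, gold, teacher, weights=[0.3, -0.2], l2=0.01)
```

The regularized loss is strictly larger for nonzero weights, which is the mechanism by which it discourages large parameter values and, in training, reduces overfitting.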
The results demonstrated that the adjusted model trained significantly more stably and resisted overfitting more effectively. However, a critical finding was severe data leakage: thousands of text snippets were repeated across the training and test sets. This data-integrity issue compromises the validity of the reported performance metrics. Therefore, while the adjusted training method shows promise, the model's reliability must be re-evaluated on a clean, deduplicated dataset.
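The overlap described above can be checked with a simple set intersection over normalized snippets. This is a sketch of the general technique, not the project's actual audit code; the normalization rule and example snippets are assumptions.

```python
def normalize(snippet: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies match."""
    return " ".join(snippet.lower().split())

def find_leakage(train_snippets, test_snippets):
    """Return normalized snippets appearing in both splits (the leaked set)."""
    train_set = {normalize(s) for s in train_snippets}
    test_set = {normalize(s) for s in test_snippets}
    return train_set & test_set

# Toy example: one snippet leaks across splits despite casing/spacing changes.
train = ["BRCA1 mutations increase breast cancer risk.",
         "Aspirin inhibits COX enzymes."]
test = ["brca1 mutations  increase breast cancer risk.",
        "Statins lower LDL cholesterol."]
leaked = find_leakage(train, test)
```

Removing `leaked` items from the test split (or regenerating the split at the document level) is the usual remedy before re-reporting metrics.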
For more details, please visit the AI in Natural Language Processing - Hugging Face Project website: