Appraisal of: Wagner M, Rosumeck S, Küffmeier C, Döring K, Euler U. A validation study revealed differences in design and performance of MEDLINE search filters for qualitative research. J Clin Epidemiol. 2020;120:17-24
Reviewer(s):
Andrew Booth
Full Reference:
Wagner M, Rosumeck S, Küffmeier C, Döring K, Euler U. A validation study revealed differences in design and performance of MEDLINE search filters for qualitative research. J Clin Epidemiol. 2020;120:17-24
Short description:
This validation study assessed the performance of MEDLINE search filters designed to identify qualitative research. The authors aimed to provide a comparative overview of existing search filters and to determine which achieved the highest sensitivity, the highest precision, or the best balance between the two.
Thirteen search filters were validated against a gold standard generated through a relative recall approach. The gold standard consisted of 2,323 references (131 systematic reviews and 2,192 primary qualitative studies) derived from 145 systematic reviews of qualitative studies indexed in DARE between 2009 and 2014. These references were checked for MEDLINE indexing and used to test whether each search filter successfully retrieved them.
Performance metrics included sensitivity (the percentage of the gold standard retrieved), precision (the percentage of retrieved records that were relevant), and the number needed to read (NNR). The validation was conducted in MEDLINE via the Ovid platform.
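For reference, the standard formulas behind these measures are given below; in the relative recall setting used here, "relevant" operationally means "part of the gold standard":

\[ \text{Sensitivity} = \frac{\text{gold-standard records retrieved by the filter}}{\text{all gold-standard records}} \times 100\% \]

\[ \text{Precision} = \frac{\text{relevant records retrieved}}{\text{all records retrieved}} \times 100\%, \qquad \text{NNR} = \frac{1}{\text{Precision}} \]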
Key findings showed that the Wong et al. (2004) filter labeled "Wong c" achieved the highest sensitivity (93.63%) but required screening 1,418 articles per relevant article found. The MeSH term "Qualitative Research" alone achieved the best precision (2.15%) but the lowest sensitivity (22.56%). The University of Texas filter provided the best balance with 81.96% sensitivity and 0.80% precision (NNR: 126).
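These figures are internally consistent, since NNR is the reciprocal of precision expressed as a fraction:

\[ \text{NNR}_{\text{Texas}} = \frac{1}{0.0080} \approx 125 \;(\text{reported as 126, presumably from the unrounded precision}), \qquad \text{Precision}_{\text{Wong c}} \approx \frac{1}{1{,}418} \approx 0.07\% \]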
The study concludes that search filter selection should depend on project-specific demands, with recommendations varying based on whether comprehensive retrieval or a manageable screening workload is prioritized.
Limitations stated by the author(s):
1. Precision underestimation: The relative recall approach does not allow true precision to be calculated, because the number of relevant records among all those retrieved cannot be determined without manual screening. This likely led to underestimation of true precision and, correspondingly, overestimation of NNR (see the illustration after this list).
2. Lack of qualitative verification: The authors did not manually verify that all references in the gold standard were actually qualitative research, acknowledging that some may not have been.
3. Potential selection bias in gold standard: Some systematic review authors planned to exclude studies of poor quality, which may have introduced bias into gold standard 2 (primary studies), though such exclusions were so rarely reported that the impact appears limited.
4. Topic-specificity concerns: The gold standard was generated irrespective of medical subject, but performance measures may differ when filters are combined with topic-specific search terms.
5. Circular validation potential: Using a gold standard created from systematic reviews that themselves used qualitative search terms may have influenced validation results.
6. Limited practical applicability: True performance in everyday use depends on combining filters with topic-specific terms and other limitations, which could considerably change the performance metrics.
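To make limitation 1 concrete, the relative recall calculation can only count gold-standard records as relevant hits (a sketch of the general approach, not the authors' exact notation):

\[ \text{Precision}_{\text{relative}} = \frac{\text{gold-standard records retrieved}}{\text{all records retrieved}} \le \text{true precision} \]

because records retrieved from outside the gold standard may nonetheless be relevant qualitative studies. The reported NNR values are therefore upper bounds on the true screening burden.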
Limitations stated by the reviewer(s):
Strengths of the study:
• Comprehensive and systematic identification of existing search filters through established sources
• Large, topic-independent gold standard (n=2,323) covering an extended publication period (1968-2014)
• Transparent and reproducible methodology with clear reporting
• First validation of the University of Texas filter, filling an important evidence gap
• Practical utility: provides clear guidance for filter selection based on different research needs
• Appropriate use of a precision ratio to enable comparison across filters despite the limitations of the relative recall approach
Critical limitations and concerns:
1. Single reviewer bias: Study selection, plausibility assessment, and data extraction were conducted by a single reviewer without independent verification. This introduces a significant risk of selection bias and reduces reproducibility. [Study Selection Bias; Reproducibility]
2. Circular validation problem: The gold standard was derived from systematic reviews that likely used qualitative search filters themselves (approximately 60% included a qualitative search block). This creates potential for circular validation where filters are tested against studies found using similar methods, potentially inflating performance estimates. [Methodology Bias; Construct Validity]
3. Lack of true gold standard: No manual verification that included studies actually employed qualitative methods. The authors acknowledge this, but the impact could be substantial: if 10-15% of included studies were not truly qualitative, all performance calculations would be significantly affected. [Misclassification Bias]
4. Platform limitation: Validation was restricted to MEDLINE via Ovid only. Performance may differ on other platforms (PubMed, Embase), where filters may need translation or adaptation, limiting generalizability. [External Validity]
5. Unclear exclusion criteria: The "plausibility check" that led to exclusion of the Shaw filter (Shaw b) and the DeJean hybrid filter is not sufficiently detailed. The judgment about what constitutes "considerable syntax errors" or appropriate subject-specific terms ("women's stor*") appears subjective in the absence of explicit criteria. [Reporting Bias; Transparency]
6. Incomplete quality assessment: No formal quality appraisal of included filter development studies. The review does not assess whether original filter developers used appropriate validation methods, adequate gold standards, or reported findings transparently. [Quality Assessment Gap]
7. Limited gold standard diversity: Gold standard derived only from DARE-indexed systematic reviews (2009-2014) may not represent all types of qualitative research, particularly newer methodologies, implementation science studies, or qualitative research in databases beyond those typically searched (MEDLINE, CINAHL, PsycINFO, Embase). [Sampling Bias; Representation]
8. Missing sensitivity analysis: No exploration of how different components of the gold standard (systematic reviews vs. primary studies; different time periods; different topics) might affect filter performance. Table 4 provides some of this, but a more granular analysis would strengthen the findings. [Analysis Depth]
9. Temporal validity concerns: Some filters were developed before the MeSH term "Qualitative Research" was introduced (2003). While the authors created separate filters using this term, the validation does not address whether older filters should be updated or retired. [Currency]
10. No inter-rater reliability: For the subset of judgments made (filter inclusion, plausibility), no second reviewer verification or inter-rater reliability testing was performed. [Reliability]
11. Incomplete reporting of search strategy: While the authors describe using "well-known sources," the full search strategy for identifying filters is not provided, making it difficult to assess comprehensiveness or replicate the review. [PRISMA Compliance]
Study Type:
Validation study (methodological study)
Related Chapters:
• B. Designing strategies - general
Tags:
• Search filters
• Qualitative research
• MEDLINE
• Databases
• Validation study
• Sensitivity
• Precision
• Information retrieval
• Systematic reviews
• Gold standard
• Relative recall