Appraisal of: Shaw RL, Booth A, Sutton AJ, Miller T, Smith JA, Young B, Jones DR, Dixon-Woods M. Finding qualitative research: an evaluation of search strategies. BMC Med Res Methodol. 2004;4:5.
Reviewer(s):
Andrew Booth
Full Reference:
Shaw RL, Booth A, Sutton AJ, Miller T, Smith JA, Young B, Jones DR, Dixon-Woods M. Finding qualitative research: an evaluation of search strategies. BMC Med Res Methodol. 2004;4:5.
Short description:
This empirical study evaluated the effectiveness of three different electronic search strategies for identifying qualitative research in the specific topic area of support for breast-feeding. The authors selected this topic because qualitative research was likely to be particularly valuable in this area, there was likely to be a substantial body of qualitative research available, and an existing Cochrane systematic review had already developed recognized search terms for the subject area that were neutral to methodology.
The three search strategies evaluated were: Strategy 1 using thesaurus terms (controlled vocabulary/subject headings specific to each database, such as MeSH terms in MEDLINE); Strategy 2 using free-text terms (over 40 commonly used qualitative methodology terms searched in titles, abstracts and keywords); and Strategy 3 using broad-based terms (three broad free-text terms: "qualitative", "findings", "interview$" plus the thesaurus term "Interviews"). All three strategies incorporated the breast-feeding support search terms from the previous Cochrane review and were applied across six electronic bibliographic databases representing medicine, nursing, and social sciences: MEDLINE, EMBASE, CINAHL, British Nursing Index, ASSIA, and Social Sciences Citation Index.
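To make the logic of these strategies concrete, the broad-based strategy can be pictured as a Boolean combination of roughly the form (qualitative OR findings OR interview$ OR Interviews [thesaurus term]) AND (breast-feeding support terms from the Cochrane review), translated into each database's own syntax and field codes. This is a reconstruction of the structure for illustration only, not the authors' published search strings.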
The authors evaluated each strategy using two key metrics borrowed from screening test terminology: recall (the proportion of potentially relevant records identified, analogous to sensitivity) and precision (the proportion of actually relevant records among those identified, analogous to positive predictive value). Relevance was defined by two criteria: whether records addressed the topic of breast-feeding support and whether they used a recognized qualitative methodology. Relevance judgments were made by a team with expertise in qualitative research and in the topic area, based on abstracts or, where abstracts were unavailable (23% of cases), on full-text articles. Difficult cases were resolved by consensus.
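A minimal sketch of how recall and precision are calculated, using Python with purely hypothetical counts (the figures below are placeholders, not the study's data):

    # Illustrative only: recall and precision as defined in the paper,
    # computed from hypothetical counts.
    def recall(relevant_retrieved, total_relevant):
        # Proportion of all relevant records that the strategy retrieved (sensitivity).
        return relevant_retrieved / total_relevant

    def precision(relevant_retrieved, total_retrieved):
        # Proportion of retrieved records that were actually relevant (positive predictive value).
        return relevant_retrieved / total_retrieved

    # A strategy retrieving 200 of 250 relevant records among 4,000 hits:
    print(recall(200, 250))      # 0.8
    print(precision(200, 4000))  # 0.05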
The combined total initial yield across all three strategies was 7,420 potentially relevant records after eliminating duplicates and non-human research. This figure was used as a proxy for the population of qualitative studies on breast-feeding support, as determining the "true" population would have required gold-standard methods like extensive hand-searching that were not logistically feasible. Following abstract screening for both subject matter and methodology, 262 records were judged actually relevant to both breast-feeding support and qualitative research. These relevant studies were published between 1976 and 2002 across over 100 different journals.
The results demonstrated important trade-offs between recall and precision. For recall, the broad-based strategy performed best, identifying 52.7% of the total initial yield (3,912 records), while the thesaurus and free-text strategies identified 47.6% (3,537 records) and 46.5% (3,451 records) respectively. Critically, no single strategy was sufficiently comprehensive to identify all potentially relevant records, suggesting that relying on one approach alone would miss important studies.
For precision, all three strategies performed poorly, though the thesaurus strategy was marginally better at 5.4% (meaning that, of the records initially identified as potentially relevant, only 5.4% proved actually relevant), compared with 4.9% for the free-text and 4.7% for the broad-based strategy. Overall, 96% of the potentially relevant records initially retrieved were subsequently judged irrelevant to the specific criteria. Furthermore, even the strategy that retrieved the largest share of the actually relevant records (thesaurus terms) identified only 72.9% of the relevant records found by all three strategies combined, reinforcing that no single strategy was adequate.
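The precision and recall figures for the thesaurus strategy can be roughly reconciled from the reported (rounded) percentages; the intermediate count below is therefore approximate rather than taken directly from the paper:

    # Approximate reconciliation of the reported thesaurus-strategy figures.
    retrieved = 3537                           # records retrieved by the thesaurus strategy
    relevant_found = round(0.054 * retrieved)  # ~191 actually relevant records (5.4% precision)
    total_relevant = 262                       # relevant records found by all strategies combined
    print(relevant_found / total_relevant)     # ~0.729, i.e. the reported 72.9%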
The study identified 2,608 records relevant to breast-feeding support but not using qualitative methodology, and 26 records reporting qualitative research but not relevant to breast-feeding support, highlighting the challenge of simultaneously filtering by both topic and methodology.
The authors conclude that searching for qualitative research involves unavoidable trade-offs between recall and precision. They recommend that a combination of search strategies using both thesaurus and free-text terms is required to maximize recall, but warn that attempts to maximize comprehensiveness will result in poor precision and necessitate screening large numbers of irrelevant records. They argue that improvements are needed in how bibliographic databases index qualitative research and suggest that authors could assist by making their study designs more explicit and using structured abstracts where possible.
Limitations stated by the author(s):
The authors acknowledge that it was not possible for practical reasons to identify the "true" population of qualitative research in the area of breast-feeding support. Establishing a gold standard would have required hand-searching of journals, but initial analyses indicated that relevant references were so widely scattered across journals that this would be logistically impractical – they estimated it would require searching almost 30 years of more than 20 journals to retrieve just half of the relevant records identified through electronic searches. Instead, they used the total initial yield of all three strategies across six databases as a proxy for the population. The authors note that their findings are based on one specific topic area and may not generalize to other subject domains. They also acknowledge that 23% of records lacked abstracts in the databases, requiring retrieval of full-text articles for assessment, which added time and expense to the review process.
Limitations stated by the reviewer(s):
Weaknesses:
1. Lack of gold standard / uncertain denominator [Measurement Bias; External Validity]: The most fundamental limitation is that the study cannot establish the true population of qualitative studies on breast-feeding support. Without a gold standard method (such as comprehensive hand-searching), the authors cannot know how many relevant studies they actually missed. The recall figures are therefore calculated against a proxy denominator (the combined yield of all three strategies) rather than the true population. This means the actual sensitivity of each strategy could be lower than reported. The study essentially evaluates relative performance of strategies against each other rather than absolute performance against a known complete set of relevant studies. A hypothetical illustration of this point is given after the list of weaknesses below.
2. Limited generalizability to other topic areas [External Validity]: The study examined only one subject area – support for breast-feeding. The performance characteristics of search strategies may vary considerably across different clinical or health topics depending on factors such as: terminology standardization in that field, number and distribution of relevant studies, journal coverage in databases, and indexing consistency for that topic. The findings may not transfer to searches for qualitative research in mental health, health services research, chronic disease management, or other domains. Breast-feeding support may have unique characteristics in terms of vocabulary, publication patterns, or methodological approaches that influence search strategy effectiveness.
3. Single time point and database version [Temporal Validity]: The searches were conducted at one point in time (the article was published in 2004, searches likely conducted 2003 or earlier). Database indexing practices, coverage, and thesaurus structures evolve over time. For example, the MeSH term "Qualitative Research" was only introduced in 2003, and indexing practices have likely improved since this study was conducted. The findings may not reflect current database performance. Additionally, the study does not capture the lag time in indexing – recently published articles may not yet be properly indexed, affecting the performance of thesaurus-based strategies in particular.
4. Unclear inter-rater reliability in relevance judgments [Measurement Bias]: While the authors state that relevance judgments were made by experts and difficult cases were resolved by consensus, they do not report any formal assessment of inter-rater reliability. Seven different individuals (RS, TM, JS, BY, DJ, MDW and SB) made relevance judgments, but it is not reported what proportion of records were double-screened, what the agreement rate was before consensus, what criteria guided the consensus process, or whether screeners were blinded to study source or database. Given that more than 7,000 records needed to be screened, inconsistency in applying eligibility criteria could significantly affect the precision estimates. The binary judgment of "qualitative methodology" may be particularly challenging given the diversity of methods that fall under this umbrella.
5. Incomplete reporting of methodology [Transparency]: Several methodological details are not fully reported. The paper does not specify: the exact dates of the searches, whether searches were conducted independently by multiple searchers to assess reproducibility, how the 15 "key papers" for supplementary citation searching were identified, the full details of the search strategies for each database (provided in supplementary files not readily accessible in the main text), or how the authors handled edge cases such as mixed methods studies. The paper states that "any ambiguities or difficult cases were settled by consensus" but does not describe how many such cases arose or document the decision-making process, limiting transparency and reproducibility.
6. Limited analysis of sources of poor precision [Incomplete Analysis]: While the study documents that precision was poor (96% false positives), there is limited systematic analysis of why the strategies failed. The authors identify 2,608 records relevant to breast-feeding but not qualitative, and 26 qualitative records not about breast-feeding, but do not provide detailed analysis of: what types of studies were falsely identified as qualitative (e.g., quantitative surveys misindexed, mixed methods studies, theoretical papers), which specific search terms contributed most to false positives, or whether certain databases were more prone to poor precision. A systematic analysis of a sample of false positives could have provided more actionable recommendations for strategy refinement.
7. No assessment of search strategy reproducibility [Reliability]: The study does not evaluate whether different searchers would construct and execute the same strategies consistently. Search strategy development involves judgment calls about which terms to include, how to combine them, and how to translate strategies across databases with different interfaces and vocabularies. The study presents the strategies as used by this team but doesn't assess whether other experienced searchers would arrive at the same strategies or achieve similar results, limiting confidence in the reproducibility of the approach.
8. Potential for verification bias in assessing relevance [Verification Bias]: The relevance of 23% of records (those without abstracts) was assessed using full-text articles, while 77% were assessed using abstracts alone. These two assessment methods may not be equivalent. Full-text review might identify methodological details not apparent in abstracts, potentially leading to more nuanced judgments. This differential verification could introduce bias if, for example, records without abstracts were more likely to be older studies with different reporting standards, or published in journals with different editorial policies. The authors do not analyze whether relevance assessment differed systematically between abstract-only and full-text assessments.
9. Limited exploration of database-specific performance [Incomplete Analysis]: While the study used six databases, the results are reported in aggregate across databases rather than providing database-specific performance metrics. Different databases have different strengths: specialized databases like CINAHL may have better coverage and indexing of qualitative nursing research, while general medical databases like MEDLINE may have broader coverage but less specific indexing. Understanding which databases performed best for which types of strategies would provide more nuanced guidance for searchers but this analysis is not presented.
10. No cost-effectiveness analysis [Practical Limitations]: While the study demonstrates that the broad-based strategy had highest recall, it required screening 3,912 potentially relevant records to identify 187 actually relevant ones (a 4.7% yield). The authors do not discuss the resource implications (time, cost, personnel) of screening such large volumes of records, or provide guidance on when the additional effort required for higher recall might be justified versus when a more precise but less comprehensive strategy might be acceptable. This practical consideration is important for researchers with limited resources or tight timelines.
11. Search strategies designed for maximum sensitivity [Design Choice]: The authors explicitly state they purposively chose terms to maximize comprehensiveness/sensitivity. This design choice was appropriate for the research question but means the strategies tested may be more sensitive (and less precise) than strategies that reviewers might use in practice when they need to balance sensitivity with feasibility. The study doesn't explore whether more selective strategies might achieve better balance between recall and precision, or provide guidance on how reviewers should calibrate their strategies based on review scope and resources.
12. Potential language and publication bias [Selection Bias]: The study does not explicitly state whether searches were limited to English language publications, or whether grey literature, conference proceedings, or dissertations were included. The finding that relevant studies were published across over 100 different journals suggests good scatter, but the extent to which the strategies would identify research published in non-English journals, books, or non-indexed sources remains unclear. Given that qualitative research may be more likely than quantitative research to be published in diverse venues including books and local journals, this could be a significant limitation.
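As flagged under weakness 1, the reported recall figures are relative rather than absolute. A purely hypothetical calculation shows how much this could matter; the number of missed studies below is invented for illustration only:

    # Hypothetical only: effect of an incomplete denominator on recall.
    found = 191            # approx. relevant records retrieved by the thesaurus strategy
    known_relevant = 262   # relevant records identified by the three strategies combined
    missed = 100           # supposed additional relevant studies missed by all three strategies
    print(found / known_relevant)              # ~0.73, the relative recall actually reported
    print(found / (known_relevant + missed))   # ~0.53, recall against this larger hypothetical population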
Strengths:
Despite these limitations, the study has important strengths: it provides empirical evidence rather than opinion on search strategy performance; it evaluates multiple strategies across multiple databases; it uses appropriate metrics (recall and precision) borrowed from diagnostic test evaluation; the research team included both information retrieval specialists and qualitative methods experts; and the findings, though now over 20 years old, were influential in highlighting systematic problems with qualitative research indexing that have driven subsequent improvements in database providers' practices. The study appropriately acknowledges that perfect recall with high precision may not be achievable and that trade-offs are necessary, which remains an important message for systematic reviewers.
Study Type:
Comparative evaluation study / Methodological study
Related Chapters:
B. Designing strategies - general
E. Quality assurance and reporting
Tags:
Qualitative research
Search strategies
Search filters
Database searching
Thesaurus terms
MeSH terms
Free-text searching
Recall
Precision
Sensitivity
Positive predictive value
MEDLINE
EMBASE
CINAHL
Breast-feeding
Systematic reviews
Information retrieval
Search evaluation