are large numbers of such instruments from which to choose for any given health problem or context and insufficient guidance to inform choice (Garratt et al., 2002). Such instruments generally take the form of questionnaires containing several items reflecting the broad nature of health status, disease, or injury, which are most often summed to give a total score. The term ‘patient-reported outcome measure’ will be used throughout this review to refer to patient-completed instruments.
There are two broad categories of PROM: generic and specific. Generic instruments are not age-, disease-, or treatment-specific and contain multiple concepts intended to be relevant to a wide range of patients and the general population. Specific instruments may be specific to a particular condition (for example, diabetes), a particular intervention, or a particular patient population. Disease-specific instruments may have greater clinical appeal because of the specificity of their content and the associated increase in responsiveness to specific changes in condition. The broad content of generic instruments enables the identification of comorbid features and unanticipated treatment side-effects that may not be captured by specific instruments, which suggests they may be useful in assessing the impact of new health-care technologies where the therapeutic effects are uncertain. However, the broad content may reduce responsiveness to small but important changes. It has therefore been recommended that a combination of generic and specific measures be used in the assessment of health outcomes.
PROMs have increasingly been applied in a range of settings including routine patient care, clinical research, audit and quality assurance, population surveys, and resource allocation. However, consensus is often lacking as to which instrument to use; this has important implications for the evaluation of clinical effectiveness.
Structured reviews of measurement properties are a prerequisite for instrument selection and standardisation, and instruments with measurement properties that support their application in specific populations and across a range of evaluation settings need to be identified. Selection criteria have been defined for assessing the quality of patient-reported health instruments (Streiner & Norman, 2008; McDowell & Newell, 2006; Fitzpatrick et al., 1998). These include measurement issues, such as reliability, validity, responsiveness, and precision, as well as practical issues, such as acceptability and feasibility. Such criteria are now regarded as essential by regulatory bodies such as the United States Food & Drug Administration (FDA). Additionally, current FDA guidance places patients at the centre of the development process of PROMs (Food & Drug Administration, Department of Health and Human Services, 2009). These criteria are now briefly summarised since they directly inform the review reported here.
Criteria for assessing PROMs
Reliability is concerned with whether measurement is reproducible over time and, for multi-item instruments, whether the items are internally consistent. Test-retest reliability usually involves instrument self-completion on two occasions separated by a suitable time period and, assuming no change in the underlying health state, measures the temporal stability of the score (Fitzpatrick et al., 1998). A test-retest period of between two days and two weeks has been recommended for most conditions (Streiner & Norman, 2008). Too short a period may be associated with patient recall of answers, which may artificially inflate reliability (Nunnally & Bernstein, 1994; Streiner & Norman, 2008); too long a period may be associated with actual change in health. Health transition questions, which invite patients to indicate whether their general or specific health has changed between instrument administrations, are often included in evaluations.
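The test-retest design described above can be illustrated with a minimal sketch. It is not taken from any of the cited sources. It assumes that each respondent has a score from two administrations plus an answer to a health transition question, and it computes one common form of the intra-class correlation coefficient (two-way random effects, absolute agreement, single measurement) for respondents who report no change. The field names `time1`, `time2`, and `transition`, and the label `"unchanged"`, are hypothetical.

```python
from statistics import mean


def icc_agreement(scores):
    """ICC for absolute agreement (two-way random effects, single measurement).

    `scores` is a list of rows, one per respondent, each holding one score
    per administration. The choice of this ICC form is an assumption made
    for illustration; other forms exist and may suit other designs.
    """
    n = len(scores)       # number of respondents
    k = len(scores[0])    # number of administrations
    if n < 2 or k < 2:
        raise ValueError("need at least two respondents and two administrations")

    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-respondent mean square
    msc = ss_cols / (k - 1)              # between-administration mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def stable_test_retest(records, stable_label="unchanged"):
    """Test-retest ICC restricted to respondents reporting no health change.

    `records` is a list of dicts with hypothetical keys `time1`, `time2`,
    and `transition` (the answer to the health transition question).
    """
    stable = [[r["time1"], r["time2"]] for r in records
              if r["transition"] == stable_label]
    return icc_agreement(stable)


if __name__ == "__main__":
    # Invented example data, for illustration only.
    records = [
        {"time1": 50, "time2": 52, "transition": "unchanged"},
        {"time1": 60, "time2": 59, "transition": "unchanged"},
        {"time1": 70, "time2": 73, "transition": "unchanged"},
        {"time1": 40, "time2": 41, "transition": "unchanged"},
        {"time1": 55, "time2": 30, "transition": "worse"},
    ]
    print(round(stable_test_retest(records), 3))
```

Restricting the calculation to respondents who answer the health transition question with "unchanged" keeps genuine change in health from being counted as measurement error.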
Such questions allow stable respondents to be identified, in whom intra-class correlations between scores at different administrations may be high. The correlation coefficient is the most frequently used method for calculating estimates of test-retest reliability; the intra-class correlation coefficient (ICC), which can also reflect a systematic shift in group scores over time, is used as a further measure of reliability (Streiner & Norman, 2008). For group comparisons, levels of reliability over 0.70 are required (Streiner & Norman, 2008; Fitzpatrick et al., 1998). For the evaluation of individuals, levels above 0.90 have been recommended (Nunnally & Bernstein, 1994; Fitzpatrick et al., 1998).
Internal consistency reliability of multi-item instruments that adopt a traditional summated rating scale format is tested following a single application. The relationship between all items and their ability to measure a single underlying domain is assessed using Cronbach's alpha: alpha levels of between 0.70 and 0.90 have been recommended (Streiner & Norman, 2008; Scientific Advisory Committee of the Medical Outcomes Trust, 2002; Garratt et al., 2001). Homogeneity at the item level can be assessed using item-total correlation: levels above 0.40 have been recommended (Ware, 1997).
Validity assesses whether an instrument measures what is intended in the different settings in which it may be applied (McHorney, 1996; Fitzpatrick et al., 1998). Instrument validity is not a fixed property. The process of validity testing is ongoing, informing instrument application and interpretation in