Assessment refers to the process of drawing inferences about a student’s knowledge and abilities based on a sample of the student’s work. The product of assessment certainly influences admission to higher education and career opportunities, but assessment goes well beyond just assigning grades. When used successfully, assessment results can provide valuable information about students’ achievements and motivations to teachers, parents, students, and educational administrators as well as information about the success of the teacher in meeting his or her personal and professional goals.
Effective teachers consider assessment when first planning a lesson, and many make the assessment tools (e.g., rubrics, scoring systems) available to students at the start of the lesson to help them focus their attention on the learning goals, identify the cognitive processes necessary to achieve those goals, and provide ongoing feedback about what has and has not yet been learned. To be useful, assessments must be clearly tied to learning objectives and any content standards, but they should also be flexible enough that the teacher doesn’t end up “teaching to the test.” Wise teachers also realize that no single assessment measure should serve to determine a student's achievements, aptitudes, or abilities.
Assessments come in many forms and can be used to serve many purposes. This chapter presents commonly used assessment formats and measures along with a discussion of how and when they are best applied. What’s critical is that the proper assessment be selected for the intended purpose. For example, if the instructional objective involves students’ ability to use various scientific formulas successfully in hypothesis testing, then the assessment should measure those procedures rather than mere recall of the formulas themselves. If the instructional goal includes students’ use of proper spelling in an essay, the assessment should require students to write an essay rather than identify misspelled words, either in or out of context.
The Purpose of Assessment
Assessment can be used for formative, summative, or diagnostic purposes. Formative evaluations are designed to provide information regarding what students know and can do before or during instruction. For example, a geography teacher may interrupt a lecture to say, “Please make a list of all the state capitals you can think of right now,” or an English teacher might say, “Take out your journals and summarize the main points we’ve been discussing today.” Use of formative assessment helps to drive the direction of the lesson — allowing teachers and students to judge whether the lesson is going well, whether students need additional practice, or whether more information needs to be provided by the teacher. Teachers engaging in reflective practice often use formative assessments to evaluate their own achievements in light of student performance as well (e.g., how well am I getting the point across? am I providing enough opportunities for critical thinking? etc.).
Whereas formative assessments generally address the question, “What are students learning?” summative evaluations address the question, “What have students learned?” In other words, summative assessments provide information regarding what students know or have achieved following instruction. Final exams are summative assessments; so are high-stakes achievement tests. As with all assessments, it’s critical that any measure designed for summative assessment adhere closely to the learning objectives.
Diagnostic assessments are intended to identify what students know before instruction. Although sometimes used as pretests at the start of a term or unit, diagnostic tests are more commonly used to identify exceptionalities in learning, including disabilities and giftedness (see Chapter 4). IQ tests, for example, were originally designed for diagnostic purposes. Diagnostic assessments are frequently conducted outside the classroom by education specialists or school psychologists.
Assessment Measures
Informal assessments are spontaneous measures of student achievement. For example, teachers who listen to the types of questions students ask during a lesson are informally assessing the degree to which they comprehend the lesson. Similarly, when teachers observe children during daily tasks — at play, with peers, at their desks, during routines — they are informally assessing them. Informal assessments are not graded and are primarily used for formative purposes: Data collected via informal assessments offer continuous feedback regarding the daily lessons, classroom experience, student motivation, and so on.
Formal assessments, in contrast, are planned and structured, although they can be used for formative as well as summative evaluation. For example, a pop quiz can be formative if the teacher uses the information not as an assessment of each student’s final knowledge but rather to identify areas that need additional instruction, whereas the same quiz at the end of a unit may be summative. A variety of measures can be used for formal assessments.
Testing
Written tests may be the most commonly used assessment measures. Objective tests (i.e., selected-response tests) include multiple-choice and matching tests. These tests are popular for many reasons: they can be scored easily and objectively, and they are efficient and usually inexpensive to administer. Because many objective tests require students only to recognize a correct answer, they are best used to assess lesson content that is highly structured or concrete; however, well-designed objective tests can also be used to assess higher-level thinking, such as application or analogical reasoning.
Essay tests, also known as free-response tests, are an alternative to objective tests; essays require students to create their own answers, rather than select from a set of possible responses. Essays can be quick to construct, although they can be challenging to grade fairly. They are, however, a good teaching tool as well as an assessment measure: When students create responses on essay tests, they are likely to engage in higher-level thinking skills and are better able to transfer their knowledge to other situations outside the testing environment.
Standardized tests are developed by test construction experts and are used in many different schools or settings — in this case, standardized means that everyone takes the same test in the same way. Standardized tests can include both objective and essay components — the SAT is an example of a standardized test with both selected-response items (i.e., the verbal and math sections) and a free-response section (i.e., the writing test). In the era of high-stakes testing, standardized tests are becoming more common.
Standardized tests can be used to measure achievement, aptitude, or ability. Table 9.1 provides descriptions and comparisons of these types of standardized tests.
Table 9.1, Commonly used standardized tests, is adapted from J. Ormrod (2009), Essentials of Educational Psychology (2nd ed., p. 391). Columbus, OH: Merrill.
Alternatives to Testing
Alternative assessment approaches include observations, performance evaluations, portfolios, and conferencing, among others. Used alone or in combination with testing, these qualitative assessments can provide a more complete picture of a student and his or her achievements and abilities.
Direct observations of what students say and do in the classroom can be recorded as anecdotal or running records, which qualitatively capture the flavor of the behaviors, or they can be guided by checklists or rating scales that allow teachers to quantify the observations. These techniques can be used individually or in combination. For example, a teacher may keep a detailed account of behavior as it occurs (e.g., a running record of playground activity that includes aggression among children) and may then use a rating scale to evaluate a particular behavior that occurs during that interval (e.g., very aggressive, moderately aggressive, relatively neutral). Direct observations can be especially useful when both verbal and nonverbal behaviors are recorded (e.g., one observation during free play can provide information about both physical coordination and social skills). Teachers can also review observational records to identify patterns of behavior over time. Note, however, that when observing, teachers must strive for objectivity, and that can sometimes be difficult when the teacher has formed expectations for and relationships with the students.
A performance assessment is a specific type of observation frequently used for assessment of procedural knowledge (e.g., skills; see Chapter 3 of this tutorial). In some cases, students are assessed as they perform a particular procedure (e.g., mixing chemicals in the lab, playing piano); in other cases, the product is assessed (e.g., the color of the liquid in the test tube after mixing, the piano concerto composed by the student). Performance assessments are well suited for the arts and for laboratory sciences; they are also useful as authentic assessments that emphasize skills used outside the classroom in the “real world.” For example, a chemistry performance assessment may include the prompt, “Is this sample of water safe to drink?” Because of their relevance and hands-on characteristics, performance assessments may increase student motivation, especially when used formatively.
A portfolio is a collection of a student’s work systematically collected over a lengthy time period. Portfolios can include any number of different items — writing samples, constructions or inventions, photographs, audiotapes, videotapes, and so on. They also frequently include reflections, which are the students’ own evaluations and descriptions of their work and their feelings about their achievements. Because of their diversity, portfolios can capture a broad picture of the student’s interests, achievements, and abilities and are best used for summative purposes. Student selection of portfolio content and the reflection process both encourage critical thinking, self-regulation and self-evaluation, and metacognitive skills. In addition, students’ pride in their work, when collected and displayed in their portfolios, may increase self-esteem and motivation.
Finally, assessments can take the form of one-to-one conferences between a student and the teacher. Conferences need not be oral exams; they can be an informal method for learning more about what the student knows, thinks, or feels and how the student processes learning. (See Chapter 8.) Teachers should take care to ensure that conferences are nonthreatening to students, keeping in mind that they must also be focused to yield useful results. Note that a conference may or may not include feedback — when a conference is used just for assessment, the teacher is collecting information about the student but not offering conclusions based on that information.
Self-Assessment and Peer Assessment
Teachers who focus on self-directed learning often encourage students to engage in self-assessment, in which students have input in determining their grades based on reflection and objective evaluation of their work. In other situations, students evaluate each other’s work (peer assessment); in this case, students should have an opportunity to challenge or discuss a peer-assigned grade.
In general, self-assessment and peer assessment allow students to serve as agents of their own learning and can lead to increased motivation for schoolwork. However, it’s necessary that the teacher guide the process, sometimes by providing standards for evaluation and other times by facilitating a discussion in which students come to agreement regarding those standards and the procedures to follow. Once standards are developed, students can use checklists, rubrics, rating scales, observations, or any of the other tools described in this chapter to identify the extent to which they, or their peers, have met those standards. Journals can be particularly effective for encouraging less formal and more reflective, qualitative assessments.
Evaluating the Quality of Assessment Measures
Each assessment format just discussed has specific strengths and limitations; the choice of format ultimately depends on the specific educational context and instructional objectives. To make the determination, teachers rely on four characteristics, which also help determine the quality of any particular assessment tool. The acronym RSVP is used to help recall these characteristics: reliability, standardization, validity, and practicality. Table 9.2 shows how the RSVP characteristics are applied in evaluating assessment measures.
The reliability of an assessment instrument refers to its consistency in measurement. In other words, an instrument is highly reliable if the same person, taking the same test more than once under the same conditions, would receive very similar scores each time. If an assessment instrument is not reliable, teachers cannot use the results to draw inferences about students’ achievement or abilities.
Standardization refers to uniformity in the content and administration of an assessment measure. In other words, standardized measures have similar content and format and are administered and scored in the same way for everyone. When tests are standardized, teachers have a way to compare the results from diverse populations or different age groups. For example, if a child takes the same standardized achievement test in both the third and fourth grades, a teacher (or the parents) can compare the results to determine how much the child learned in the intervening time. Using measures that are standardized reduces bias in testing and scoring.
The validity of an assessment instrument refers to how well it measures what it is intended to measure. For example, a final exam with only 10 multiple-choice questions is probably not a valid measure of the amount of information a student has learned in an entire term, nor is it likely a valid measure of the skills a student has learned during that same time period. Note that the validity of any measure depends on the purpose and context of its intended use. The same assessment instrument may be valid for some purposes and less valid for others. For example, a performance assessment may be a valid measure of laboratory skills but not a valid measure of content learned in science class. Measures that are not valid should not be used.
Practicality refers, broadly, to ease of use. For example, when evaluating practicality, teachers may ask, Is the measure affordable given the budget? Can it be administered by current staff, or with little training? Is special equipment needed? Can it be completed in the time allotted? Sometimes, measures that are standardized, reliable, and valid are simply impractical given the circumstances.
Table 9.2, Evaluating RSVP characteristics of different kinds of assessments, is used with permission from J. Ormrod (2008), Educational Psychology: Developing Learners (6th ed., p. 579). Columbus, OH: Merrill.
Recognize, too, that assessment measures, and the teachers who use them, need to be fair and unbiased. When creating or selecting assessments, teachers must look for bias in content (e.g., material familiar to only one cultural group) and in administration (e.g., a test administered in English to students with limited English proficiency). Teachers need to remain aware of the diversity of the student population when considering standardization, validity, and practicality.
Scoring Assessment Measures
Scoring selected-response measures can be quite easy, especially if the test is carefully constructed. In general, the person scoring the test need only identify whether the test taker selected the correct response for each item; scoring of this sort is objective and fast. Evaluating and grading alternative assessments objectively can be much more challenging, especially when a holistic scoring system is used. Holistic scoring refers to an assessment method in which an overall score is determined based on the teacher’s impression of the quality of the work. Performance assessments, essays, and portfolios are frequently scored holistically.
In contrast, an analytic scoring system is a quantitative approach in which the components of a project, portfolio, performance, or essay are scored separately and then the scores are added together to form an overall score or grade. Rubrics, for example, are often used to score alternative assessment measures. In general, a rubric lists the characteristics that responses may include and that will be considered in the evaluation. More specifically, rubrics stipulate the scoring dimensions in terms of content or process (e.g., writing style, introduction, required facts) and a scale of values for evaluating each dimension (e.g., beginner, developing, advanced; some rubrics use a point scale or letter grade). Good rubrics also include clear explanations and examples of expected responses at each level of the scale. Additionally, the individual dimensions in a rubric are often weighted; content, for example, may count more heavily than writing style. Note that, as mentioned previously, instructors can give scoring rubrics to students at the start of a lesson to help students identify and work toward optimal performance.
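To make the arithmetic of analytic scoring concrete, here is a minimal Python sketch of a weighted rubric; the dimensions, weights, and 0–4 scale are hypothetical illustrations rather than part of any published instrument.

```python
# Hypothetical analytic rubric: each dimension is scored on a 0-4 scale
# (0 = beginner ... 4 = advanced) and carries its own weight.
RUBRIC_WEIGHTS = {
    "content": 0.5,        # content counts more than the other dimensions
    "organization": 0.3,
    "writing_style": 0.2,
}

def analytic_score(dimension_scores, weights=RUBRIC_WEIGHTS, scale_max=4):
    """Combine per-dimension rubric scores into one overall percentage."""
    weighted_total = sum(weights[dim] * score for dim, score in dimension_scores.items())
    return 100 * weighted_total / scale_max  # weights sum to 1, so the maximum is scale_max

# One student's essay, scored dimension by dimension:
essay = {"content": 3, "organization": 4, "writing_style": 2}
print(analytic_score(essay))  # 77.5
```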
Types of Scores
The most common type of score, used most frequently for classroom (nonstandardized) tests, is the raw score. A raw score indicates the number of correct responses on a particular assessment measure. For example, on a quiz with 10 points, a student can earn 0, 1, 2, 3 points, and so on up to a raw score of 10. Interpreting a raw score requires knowledge of the test — for example, a score of 3 is only useful to someone who knows the total number of questions. For that reason, raw scores are often transformed into criterion-referenced or norm-referenced scores, especially when grades are attached.
Criterion-referenced scores specify how one student’s raw score compares with an absolute standard based on the specific instructional objectives. For example, if a 100-point test is constructed to sample the content of one semester, then a student who has a raw score of 58 can be said to have mastered approximately 58% of the course material. Note that when a criterion-referenced scoring system is used, each student is evaluated against the standard (i.e., the criterion), not against other students. In many classrooms, teachers assign letter grades based on criterion-referenced scores (e.g., 90% or above = A). Note that criterion-referenced scoring systems need not be point totals; rubrics that provide detailed descriptions of expected performance at each scoring level are also criterion-referenced.
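As a rough sketch of the conversion just described, the following Python snippet turns a criterion-referenced raw score into a percentage and a letter grade; the cutoffs shown are illustrative, not a recommended grading scale.

```python
# Illustrative criterion-referenced conversion: raw score -> percent mastery -> letter grade.
GRADE_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D"), (0, "F")]  # hypothetical bands

def criterion_referenced_grade(raw_score, points_possible):
    percent = 100 * raw_score / points_possible
    letter = next(grade for cutoff, grade in GRADE_CUTOFFS if percent >= cutoff)
    return percent, letter

print(criterion_referenced_grade(58, 100))  # (58.0, 'F'): about 58% of the material mastered
print(criterion_referenced_grade(93, 100))  # (93.0, 'A')
```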
In contrast, a norm-referenced score is determined by comparing a student’s performance with the performance of others. For example, teachers using a norm-referenced scoring system may determine that the top 10% of scores earn As, the next 10% earn Bs, and so on, regardless of the students’ raw scores. Many teachers (usually incorrectly) refer to norm-referenced scoring as “grading on a curve.” Norm-referenced scoring is most common in standardized testing but can also be used in other classroom settings (e.g., some instructors grade holistically rather than using a rubric: “This essay is the best in the class and thus earns an A; these two are almost as good and thus earn A-”; note the subjectivity in this type of grading).

Two common types of norm-referenced scores used in standardized testing are grade-equivalent scores and age-equivalent scores. Grade-equivalent scores are generally computed by comparing one person’s performance with the average score of all students in the same grade taking the same test, and age-equivalent scores are computed by comparing one person’s performance with the average score of all individuals of the same age taking the same test. For example, if the average (raw) score for all eighth graders taking a reading achievement test in the first semester is 72, then any student who scores 72 has an eighth-grade equivalent score; students scoring above 72 are performing comparably to students in a later semester or a higher grade.
Percentile rankings are one type of norm-referenced score. In percentile ranking, each student’s individual score is compared with the individual scores of other students taking the same test at the same time. The percentile rank shows the percentage of students in the group who scored at or below a particular raw score, not the percentage of correct answers. For example, if a student correctly answered 83% of the questions on a test and had the highest score in the class, the student would have a percentile rank of 100, because 100% of the group scored at or below 83% correct. Teachers would describe this student as scoring in the highest percentile of the class.
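The percentile-rank calculation described above (the percentage of scores at or below a given raw score) can be written in a few lines of Python; the class scores below are invented for illustration.

```python
def percentile_rank(score, all_scores):
    """Percentage of scores in the group at or below the given score."""
    at_or_below = sum(1 for s in all_scores if s <= score)
    return 100 * at_or_below / len(all_scores)

class_scores = [42, 55, 61, 61, 70, 74, 78, 80, 83]  # hypothetical raw scores; 83 is the highest
print(percentile_rank(83, class_scores))  # 100.0 -> the top scorer
print(percentile_rank(61, class_scores))  # 44.4... -> about 44% scored at or below 61
```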
Other types of norm-referenced scoring require an understanding of basic descriptive statistics, including measures of central tendency and variability. Measures of central tendency indicate the score that is typical of, or representative of, the entire distribution of scores. The most frequently used measure of central tendency is the mean, which is simply the arithmetic average of a group of scores. When there are a few very high or very low scores, the median may be a better representation of the central tendency of a group. The median is the middle score in a ranked list of scores; by definition, half the scores are larger than the median and half are smaller. The mode, the score that occurs most often, is another measure of central tendency, although it is less often used to characterize student performance on an assessment measure.
Variability refers to the amount of spread among scores. The most frequently used measure of variability is the standard deviation, a measure of how much the scores differ from the mean. The larger the standard deviation, the more spread out the scores are in the distribution; the smaller the standard deviation, the more the scores are clustered around the mean. For example, if everyone scores 50% on a test, the mean is 50 and the standard deviation is zero: there is no variability in the scores. If one person scores 52 and one person scores 48 on a test with a mean of 50, the standard deviation is greater than zero but still small; if many students score above 65 and below 40, the standard deviation will be relatively large.
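For readers who want to see the calculations, Python’s standard statistics module computes these descriptive statistics directly; the score lists below are hypothetical, but they reproduce the patterns just described (identical scores give a standard deviation of zero, and widely spread scores give a larger one).

```python
from statistics import mean, median, mode, pstdev  # pstdev = population standard deviation

scores = [48, 50, 50, 50, 52]              # hypothetical quiz scores
print(mean(scores))                        # 50
print(median(scores))                      # 50
print(mode(scores))                        # 50 (the most frequent score)
print(round(pstdev(scores), 2))            # 1.26 -> scores cluster tightly around the mean

print(pstdev([50, 50, 50, 50]))            # 0.0 -> everyone scored the same
print(round(pstdev([38, 40, 66, 70]), 2))  # 14.59 -> widely spread scores, larger SD
```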
When instructors know both the mean and standard deviation for a group of scores, they can easily determine how any individual compares with the larger group. Sometimes, however, scores are reported as standard scores, which are derived from the standard deviation. A z-score, for example, indicates how many standard deviations above or below the mean a particular score falls. If the mean of a test is 60 and the standard deviation is 5, a student scoring 55 will have a z-score of –1 and a student scoring 70 will have a z-score of 2.
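A one-line function makes the z-score arithmetic explicit and reproduces the example above (a mean of 60 and a standard deviation of 5).

```python
def z_score(raw, mean_score, sd):
    """Number of standard deviations a raw score falls above (+) or below (-) the mean."""
    return (raw - mean_score) / sd

print(z_score(55, mean_score=60, sd=5))  # -1.0 -> one standard deviation below the mean
print(z_score(70, mean_score=60, sd=5))  #  2.0 -> two standard deviations above the mean
```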
Interpreting and Communicating Test Scores
Effective teachers can accurately interpret assessment measures — they understand the connections between objectives and measures, and they can make valid inferences from the data regarding a student’s ability, aptitude, or performance. Furthermore, effective teachers must explain results of assessments using language appropriate for the audience, whether that audience includes the students themselves, the parents, school administrators, or government officials. Some general guidelines to keep in mind:
1. In the United States, test scores are confidential information under the Family Educational Rights and Privacy Act (FERPA). Teachers can share assessment results with the student, that student’s parents or guardians, and any school personnel directly involved with the student’s education. Teachers cannot post scores publicly or in a fashion that allows identification (e.g., by social security number), nor can teachers leave a stack of graded papers for students to pick up.
2. Teachers should be well informed about the test when communicating results to parents or students. Sometimes it’s best to use general statements when communicating assessment results (e.g., “your child is on target for children of her age”), but if a parent asks for more detailed or specific information, FERPA requires that it be given. For example, if a student scores at the “proficient” level on a standardized achievement test, a teacher should be able to explain the various other levels, the percentage of students achieving this score, and the reliability of the test.
3. Be attentive to the feelings of the students and/or the families involved. “Your child scored significantly below the rest of the class” may be truthful, but an effective teacher should communicate in a positive, encouraging fashion.
4. Attend to differences in language and culture when discussing assessment results. Be sure that everyone understands the data and the implications.
Important Terms that Relate to Assessment
The following terms relate to assessment; many of the definitions come from several texts, including: Essentials of Educational Psychology (2nd ed., pp. G1–G5), by J. Ormrod, 2009, Columbus, OH: Merrill; Educational Psychology: Developing Learners (6th ed., pp. G1–G8), by J. Ormrod, 2008, Columbus, OH: Merrill; Educational Psychology: Windows on Classrooms (8th ed., pp. G-1–G-8), by P. Eggen & D. Kauchak, 2010, Columbus, OH: Merrill; Educational Psychology (10th ed., pp. 613–622), by A. Woolfolk, 2007, Boston, MA: Allyn & Bacon; and Child Development and Education (3rd ed., p. 2), by T. M. McDevitt & J. E. Ormrod, 2007, Columbus, OH: Merrill.
Accountability. Mandated obligation of teachers and other school personnel to accept responsibility for students’ performance on high-stakes assessments.
Achievement tests. Standardized tests measuring how much students have learned in a given content area.
Age-equivalent score. Test score indicating the age level of students whose performance most closely matches the test taker’s.
Analytic scoring. Scoring a student’s performance on an assessment by evaluating various aspects of it separately.
Anecdotal records. Narrative accounts of observed student behavior or performance.
Aptitude tests. Standardized tests designed to predict the potential for future learning and measure general abilities developed over long periods of time.
Assessment. Process of observing a sample of a student’s behavior and drawing inferences about the student’s knowledge and abilities.
Authentic assessment. Assessment of students’ knowledge and skills in a “real-life” context.
Central tendency. Typical score for a group of scores.
Checklist. Assessment tool with which a teacher evaluates student performance by indicating whether specific behaviors or qualities are present or absent.
Conferences. Face-to-face interactions between teachers and students or between teachers and parents to communicate strengths in student learning or areas that need improvement.
Content validity. Extent to which an assessment includes a representative sample of tasks within the domain being assessed.
Criterion-referenced score. Assessment score that specifically indicates what a student knows or can do.
Diagnostic assessment. Highly specialized, comprehensive and detailed procedures used to uncover persistent or recurring learning difficulties that require specially prepared diagnostic tests as well as various observational techniques.
Dynamic assessment. Systematic examination of how easily a student can acquire new knowledge or skills, perhaps with an adult’s assistance.
Essay tests. An assessment format that requires students to make extended written responses to questions or problems.
ETS score. Standard score with a mean of 500 and a standard deviation of 100.
Formal assessment. Preplanned, systematic attempt to ascertain what students have learned.
Formative evaluation. Evaluation conducted before or during instruction to facilitate instructional planning and enhance students’ learning.
Grade-equivalent score. Test score indicating the grade level of students whose performance most closely matches the test taker’s.
High-stakes testing. Practice of using students’ performance on a single assessment instrument to make major decisions about students or school personnel.
Holistic scoring. Summarizing a student’s performance on an assessment with a single score.
Informal assessment. Assessment that results from a teacher’s spontaneous, day-to-day observations of how students behave and perform in class.
Mean (M). Mathematical average of a set of scores.
Median. Middle score in a group of scores.
Mode. Most frequently occurring score.
Normal distribution (normal curve). Theoretical pattern of educational and psychological characteristics in which most individuals lie somewhere in the middle range and only a few lie at either extreme.
Norm-referenced score. Assessment score that indicates how a student’s performance on an assessment compares with the average performance of others.
Objective testing. Multiple-choice, matching, true/false, short-answer, and fill-in tests; scoring answers does not require interpretation.
Paper-pencil assessment. Assessment in which students provide written responses to written items.
Percentile ranking. Test score indicating the percentage of people in the norm group getting a raw score less than or equal to a particular student's raw score.
Performance assessment. Assessment in which students demonstrate their knowledge and skills in a nonwritten fashion.
Portfolio. Collection of a student’s work systematically compiled over a lengthy time period.
Practicality. Extent to which an assessment instrument or procedure is inexpensive and easy to use and takes only a small amount of time to administer and score.
Rating scale. Assessment tool with which a teacher evaluates student performance by rating aspects of the performance on one or more continua.
Raw score. Assessment score based solely on the number or point value of correctly answered items.
Reflections. Students’ own evaluations and descriptions of their work and their feelings about their achievements.
Reliability. Extent to which an assessment instrument yields consistent information about the knowledge, skills, or characteristics being assessed.
Rubric. List of components that a student’s performance on an assessment task should ideally include.
Running record. Narrative record of a child’s activities during a single period of time.
Standard deviation (SD). Statistic that reflects how close together or far apart a set of scores is and thereby indicates the variability of the scores.
Standard score. Test score indicating how far a student’s performance is from the mean with respect to standard deviation units.
Standardization. Extent to which assessments involve similar content and format and are administered and scored similarly for everyone.
Standardized test. Test developed by test-construction experts and published for use in many different schools and classrooms.
Stanine. Standard score with a mean of 5 and a standard deviation of 2; it is always reported as a whole number.
Summative evaluation. Evaluation conducted after instruction to assess students’ final achievement.
Validity. Extent to which an assessment instrument actually measures what it is intended to measure and allows appropriate inferences about the characteristic or ability in question.
Variability or variance. Degree of difference among scores or deviation from the mean.
z-score. Standard score with a mean of 0 and a standard deviation of 1.