UNIT III - Designing and Developing Assessments
A. Characteristics of Quality Assessment Tools
Reliability - a reliable assessment tool produces the same results over time, so its results are consistent. Here, you should consider whether the test can replicate results whenever it is used.
For example, if students who take the same test on two separate occasions obtain similar results, the assessment can be considered reliable.
Validity - the validity of an assessment boils down to how well it measures the different criteria being tested. In other words, it is the idea that the test measures what it is intended to measure.
This means your assessment method should be relevant to the specific context. For example, if you’re testing physical strength, you shouldn’t send out a written test. Instead, your tests should include physical exercises like pushups and weightlifting.
Equitable - a good assessment tool is equitable, which means it doesn’t favor or disfavor any participant. Fair assessments imply that students are tested using methods and procedures most appropriate to them. Every participant must be familiar with the test context so they can put up an acceptable performance.
Standardization - means applying consistency to your testing methods. For example, if you’re sending out a questionnaire, it should have the same set of questions for all participants, and all the answers should be graded using the same criteria.
Other characteristics of assessment tools include:
A good assessment tool should provide a window for high-quality feedback.
It should be feasible and account for equivalence.
It should motivate participants to be involved in the testing.
It should be transparent, non-discriminatory, and match expectations.
B. Types of Teacher-made Tests
Three reasons for Teacher-Made Tests:
They are consistent with classroom goals and objectives.
They present the same questions to all students under nearly identical conditions.
They generate a product that can be evaluated and stored for later use—for example, parent conferences.
Three Alternatives or Types of Teacher-Made Tests:
Objective Test: alternate-choice, multiple-choice, matching, and completion tests.
Essay Test: brief or extended.
Combination of the two.
Points to Consider about Testing:
Tests should be written at the taxonomical level of the objectives covered by the exam.
Instructional objectives suggest the best type of test item.
The purpose of tests is to check student mastery of stated objectives.
Every test item should separate those who have mastered the objectives from those who have not—prevent guessing/ offset test wiseness.
Types of Teacher-made Tests
1. Alternate-Choice Items are:
True/False.
Yes/No.
Right/Wrong.
Agree/Disagree.
Key points about alternate-choice items:
Use simple declarative sentences.
Must be stated clearly to avoid ambiguity.
Have low reliability and validity.
Guidelines for creating alternate-choice items:
Avoid using negative statements and double negatives.
Ask something important and worth remembering.
Don’t make false items longer than true items.
Watch for item response patterns.
Be clear and concise.
Limit each statement to only one central idea.
Avoid using words—all, none, sometimes and usually—that can divulge the correct response.
Don’t use exact quotes from textbooks—can have different meaning when taken out of context.
2. Multiple-Choice Items:
Can cover many objectives.
Measure different cognitive behaviors, from factual recall to the analysis of complex data.
Extremely versatile and easy to score.
Must be written in a straightforward, clear and concise way.
Can be modified after being administered.
Relatively insensitive to guessing—BUT more sensitive to guessing than supply items.
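The effect of guessing can be made concrete with a little arithmetic. The sketch below assumes a student guesses blindly on every item, so that with k equally likely options the chance of a correct guess is 1/k and the expected chance score on n items is n/k. The function name, the 20-item test, and the cut-off of 12 correct answers are hypothetical values chosen only for illustration.

```python
from math import comb

def chance_of_at_least(n_items, n_options, min_correct):
    """Probability of getting at least min_correct items right by blind guessing."""
    p = 1 / n_options  # chance of guessing one item correctly
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )

n_items = 20  # hypothetical test length
for label, n_options in [("alternate-choice", 2), ("multiple-choice, 4 options", 4)]:
    expected = n_items / n_options          # expected score from guessing alone
    p_pass = chance_of_at_least(n_items, n_options, 12)  # 12 of 20 = 60%
    print(f"{label}: expected chance score = {expected:.1f}, "
          f"P(score >= 12 by guessing) = {p_pass:.4f}")
```

Supply items such as completions give the student no options to pick from, which is why they are less sensitive to guessing than either selection format.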
Guidelines for Creating Multiple-Choice Items:
Avoid providing grammatical/contextual clues to the correct answer.
Utilize language that even the most unskilled readers will understand; write concise stems and precise choices.
Avoid absolute terms (always, never, and none) in the stem and alternatives.
Stem should contain the central issue.
Alternatives should be grammatically correct.
Avoid the use of negatives.
Avoid giving structural clues.
Use all of the above and none of the above with care.
Avoid pulling statements directly from the textbook.
Alternatives should be plausible to less knowledgeable students.
3. Matching:
Designed to measure students’ ability to recall a large amount of factual information—verbal, associative knowledge.
Two lists of items are presented, and students select an item from one list that closely relates to an item from the second list.
Intended for lower-level learning.
Guidelines for creating matching columns:
Indicate basis for matching the premises with the responses.
Matching columns should be contained on one page.
Keep the number of items to be matched short.
Put premises and responses in logical order.
Premises and responses should fall in the same general topic/ category.
Make the length of statements consistent.
Use complete names if names are to be matched.
4. Completions:
Require that students write responses in their own handwriting supplying a recalled word/ phrase.
Difficult to write.
Excellent for subjects that require the recall of unambiguous facts or the performance of certain calculations.
Guidelines for creating completions:
Give clear directions.
Be definite enough so that only one correct answer is possible.
Do not utilize direct statements from textbooks—it might encourage memorization.
Ensure that all blanks are of equal length and correspond to the lengths of desired responses.
Items should be completed with a single word/ brief phrase.
5. Essay
Permits students to formulate answers to questions in their own words.
Measures what students know because they utilize their own storehouse of knowledge to answer a question.
Determines students’ ability to: analyze, synthesize, evaluate and solve problems.
Two basic forms are:
Brief — requires a short answer solution of a problem.
Extended — requires several paragraphs of writing.
Guidelines for Creating Essays:
Make directions clear and specific.
Allow ample time for the completion of essays, and suggest a time allotment for each question.
Provide a choice of questions.
The worth of each question should be identified in the test instructions.
Explain scoring technique to students before the exam—it makes explicit what you are looking for.
Guidelines for offsetting low reliability and validity of essays:
Before exam—write a sample answer and assign points to the various components of the answer.
Skim the exam and identify a model paper—the anchor paper for grading.
Grade each question for all students before proceeding to the next question.
Grade papers blindly.
Establish page limit and time limit for each essay item.
If possible—read student responses several times.
C. Learning Target and Assessment Method
Table of Specification
Table of specification (TOS) is a chart or table that details the content and level of cognitive domain assessed on a test as well as the types and emphases of test items (Gareis and Grant, 2008).
TOS is very important in addressing the validity and reliability of the test items. Validity means that appropriate conclusions can be drawn from the assessment results because the assessment guards against systematic error.
TOS provides the test constructor a way to ensure that the assessment is based on the intended learning outcomes.
It is also a way of ensuring that the number of questions on the test is adequate to ensure dependable results that are not likely caused by chance.
It is also a useful guide in constructing a test and in determining the type of test items that you need to construct.
Different Formats of Table of Specification
Format 1 of a Table of Specification.
This format is composed of the specific objectives, the cognitive level, type of test used, the item number, and the total points needed in each item.
Specific Objectives refer to the intended learning outcomes stated as specific instructional objectives covering a particular test topic.
Cognitive Level pertains to the intellectual skill or ability needed to answer a test item correctly, classified using Bloom’s taxonomy of educational objectives. We sometimes refer to this as the cognitive domain of a test item. Thus, entries in this column could be knowledge, comprehension, application, analysis, synthesis, and evaluation.
Type of Test Item identifies the type or kind of test a test item belongs to. Examples of entries in this column could be multiple-choice, true or false, or even essay.
Item Number simply identifies the question number as it appears in the test.
Total Points summarizes the score given to a particular test item.
FORMAT 1
FORMAT 2
FORMAT 3
PREPARING A TABLE OF SPECIFICATIONS
Select the learning outcomes to be measured. Identify the instructional objectives needed to answer the test items correctly. The list of instructional objectives will include the learning outcomes in the areas of knowledge, intellectual skills or abilities, general skills, attitudes, interest, and appreciation. Use Bloom’s taxonomy or Krathwohl’s 2001 revised taxonomy of the cognitive domain as a guide.
Make an outline of the subject matter to be covered in the test. The length of the test will depend on the areas covered in its content and the time needed to answer.
Decide on the number of items per subtopic. Use the following formula to determine the number of items to construct for each subtopic, so that the number of items in each topic is proportional to the number of class sessions spent on it: Number of items = (number of class sessions for the subtopic × desired total number of items) ÷ total number of class sessions.
Make the two-way chart as shown in Format 2 and Format 3 of a Table of Specification.
Construct the test items. A classroom teacher should always follow the general principles of constructing test items. Each test item should always correspond with a learning outcome so that it serves its intended purpose.
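The step of deciding the number of items per subtopic can be sketched in Python. This is only an illustration, assuming the allocation rule is the usual proportional one: items for a subtopic = (class sessions for that subtopic × desired total items) ÷ total class sessions. The function name allocate_items, the topic names, and the session counts are hypothetical, and the rounding rule (largest remainder) is one reasonable choice among several.

```python
def allocate_items(sessions_per_topic, total_items):
    """Allocate test items to topics in proportion to class sessions.

    Uses largest-remainder rounding so the allocations add up
    exactly to total_items.
    """
    total_sessions = sum(sessions_per_topic.values())
    if total_sessions <= 0:
        raise ValueError("total class sessions must be positive")

    # Exact (possibly fractional) share of items for each topic.
    exact = {
        topic: sessions * total_items / total_sessions
        for topic, sessions in sessions_per_topic.items()
    }
    # Start from the whole-number part of each share.
    items = {topic: int(share) for topic, share in exact.items()}

    # Give any leftover items to the topics with the largest fractional parts.
    leftover = total_items - sum(items.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - items[t], reverse=True)
    for topic in by_remainder[:leftover]:
        items[topic] += 1
    return items

# Hypothetical course: 20 class sessions in total, 50-item test.
sessions = {"Topic A": 4, "Topic B": 6, "Topic C": 10}
print(allocate_items(sessions, 50))
# {'Topic A': 10, 'Topic B': 15, 'Topic C': 25}
```

When the shares do not come out as whole numbers, the teacher still has to decide which topics receive the extra items; the rounding used here is just one defensible convention.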
Types of Assessment Methods
Direct observation - assessed in real time in the workplace.
assessed in a simulated off-the-job situation that reflects the workplace.
Product based method - structured assessment activities such as reports, displays, work samples, role plays, and presentations.
Portfolio - a purposeful collection of work samples of annotated and validated pieces of evidence compiled by the learner.
evidence could include written documents, photographs, videos, or logbooks.
Questioning - generally more applicable to the assessment of knowledge evidence.
assessment could be by written or oral questioning, conducting interviews and questionnaires.
D. Assessment Tools Development
1. Assessment development cycle
The following graphic shows how the Assessment Cycle is built on four distinct but interrelated actions: Plan, Do, Check, and Act. Results at one stage guide activity at the following stage. Clearly articulated outcome statements guide course design, course activities yield data that measure student learning, and evaluation of this data informs course and program revision. The graphic also demonstrates that assessment is a continuous process. Revised outcomes are implemented and student learning evaluated, following a process that may lead to further revision.
PLAN (What do I want students to learn?) - Good assessment planning begins by identifying learning outcomes for students. Planning then involves building programs and courses that provide students with opportunities to achieve these learning outcomes.
Alignment and integration of learning outcomes are the keys to successful assessment planning. Learning outcomes identified at the institutional level must be integrated at the program and course level. Conversely, course outcomes must align with program outcomes, which in turn must align with institutional outcomes.
Effective planning and integration depend on clearly articulated goals for student learning. Outcome statements must also be measurable and must target various skill levels within the cognitive domain.
DO (How do I teach effectively?) - Assessment gathers data on what students do (what is learned) not on what instructors do (what is taught). However, the "DO" stage of the assessment cycle begins with instructors and with the question, "How do I teach effectively?"
Effective teaching provides continuity between the Plan and Check stages of the assessment cycle. Effective teachers implement program outcomes at the course level in ways that facilitate student learning. That is, they design learning activities that help students achieve what is developed in the Plan stage. The range of possible learning activities is wide and varied: projects, papers, performances, presentations, and exams are the most familiar direct measurements of student learning used at the course level.
Learning activities must be designed to stimulate learning and to yield assessment data for the evaluation that follows in the Check stage. In addition to relying on data gathered within particular courses, program evaluation is also based on other sources of assessment data, including direct measures such as comprehensive and standardized exams and indirect measures such as course evaluations and alumni surveys. The development of these assessment measures is also part of the Do stage.
CHECK (Are my outcomes being met?) - The previous stage concludes with students "doing" activities designed to help them achieve learning objectives developed at the "planning" stage. Effectively designed activities generate assessment data that is "checked" at this stage of the assessment cycle.
Checking should occur at both the course and program levels. Instructors check the array of activities students complete to fulfill course requirements. But if checking stops with the individual instructor, then program assessment will necessarily be limited. Effective program assessment requires that participants gather and share data on student achievement of program outcomes. Some of this data may come from assessments not limited to a particular course (such as surveys and competency exams). Other data will come from student performance within the courses that constitute the academic program.
Checking seeks to determine the extent to which students are achieving each outcome. Thus, a global measure of student success, such as a course grade, is not likely to provide sufficient assessment data. Effective course and program evaluation requires that student performance on individual outcomes be reported as specifically as possible.
ACT (How do I use what I've learned?) - Good instructors constantly act on the results of assessment. When students don't seem to be achieving desired outcomes, instructors make adjustments. Such a process is continuous and includes both reinforcement and revision. The things that work, stay; the things that don't, go.
When the above process is followed within an individual course, the assessment cycle is complete and able to repeat. Instructors can improve at each stage of the process, but the minimum requirements of assessment are being met and modifications (based on assessment data) can be made to improve student learning.
Action can be taken at the program level provided sufficient data have been gathered and checked. If the steps described in the Check stage have been followed, those involved in designing the program can take needed action.
At both the course and program levels, the results of "checking" identify "actions" that will form the basis for subsequent "planning." Action thus allows the Plan-Do-Check-Act cycle to continue.
2. Test item formulation
3. Item analysis - is a process that examines student responses to individual test items (questions) in order to assess the quality of those items and of the test as a whole. Item analysis is especially valuable for improving items that will be used again in later tests, but it can also be used to eliminate ambiguous or misleading items in a single test administration. In addition, item analysis is valuable for increasing instructors’ skills in test construction and identifying specific areas of course content that need greater emphasis or clarity.
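One common way to carry out item analysis, not named in the text above and shown here only as an illustrative sketch, is to compute two statistics per item: a difficulty index (the proportion of students who answered correctly) and a discrimination index (the proportion correct in the upper-scoring group minus the proportion correct in the lower-scoring group). The 27% group size is a conventional choice, and the function name and response data are hypothetical.

```python
def item_analysis(responses, group_fraction=0.27):
    """Return a (difficulty, discrimination) pair for each item.

    responses: one list per student of 0/1 item scores (1 = correct).
    """
    n_students = len(responses)
    n_items = len(responses[0])

    # Rank students by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    group_size = max(1, round(n_students * group_fraction))
    upper, lower = ranked[:group_size], ranked[-group_size:]

    results = []
    for i in range(n_items):
        difficulty = sum(r[i] for r in responses) / n_students
        p_upper = sum(r[i] for r in upper) / group_size
        p_lower = sum(r[i] for r in lower) / group_size
        results.append((difficulty, p_upper - p_lower))
    return results

# Hypothetical responses: 10 students, 4 items.
scores = [
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 0], [1, 0, 1, 0],
    [1, 1, 0, 0], [1, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0],
]
for number, (difficulty, discrimination) in enumerate(item_analysis(scores), start=1):
    print(f"Item {number}: difficulty = {difficulty:.2f}, "
          f"discrimination = {discrimination:.2f}")
```

An item that nearly everyone answers correctly, or one that high and low scorers answer correctly at about the same rate, is a candidate for revision before it is reused.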
4. Reliability - refers to the consistency with which a test yields the same rank for individuals who take it more than once (Kubiszyn and Borich, 2007). That is, it describes how consistent test results or other assessment results are from one measurement to another. A test is reliable when it yields practically the same scores when administered twice to the same group of students and has a reliability index of 0.60 or above. The reliability of a test can be determined by means of the Pearson product-moment correlation, the Spearman-Brown formula, the Kuder-Richardson formulas, Cronbach’s alpha, etc.
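Three of the reliability estimates named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a full psychometric routine: the test-retest estimate is the Pearson correlation between two administrations, the Spearman-Brown formula steps a half-test correlation up to full test length, and Cronbach’s alpha is computed from item and total-score variances. All function names and score data are hypothetical.

```python
from statistics import mean, pvariance

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)

def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores holds one list of item scores per student.

    With 0/1 items and population variances this equals KR-20.
    """
    k = len(item_scores[0])
    item_vars = [pvariance([s[i] for s in item_scores]) for i in range(k)]
    total_var = pvariance([sum(s) for s in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical test-retest data: the same five students, two administrations.
first = [10, 12, 15, 18, 20]
second = [11, 13, 14, 19, 20]
r = pearson_r(first, second)
print(f"test-retest r = {r:.2f}, meets the 0.60 threshold: {r >= 0.60}")

# Hypothetical half-test correlation of 0.60 stepped up to full length.
print(f"Spearman-Brown estimate = {spearman_brown(0.60):.2f}")  # 0.75

# Hypothetical 0/1 item scores for five students on four items.
items = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")
```

Each estimate answers a slightly different question (stability over time versus internal consistency), so the choice depends on whether the test can be administered twice.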
5. Validity - is concerned with whether the information obtained from an assessment permits the teacher to make a correct decision about a student’s learning. It refers to the appropriateness of the score-based inferences or decisions made from the students’ test results. Validity is the extent to which a test measures what it is supposed to measure.
Types of Validity
Face Validity. It is the extent to which a measurement method appears “on its face” to measure the construct of interest. Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behaviour, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity.
Content Validity. A type of validation that refers to the relationship between the test and the instructional objectives; it establishes that the test content measures what it is supposed to measure. Things to remember about content validity:
The evidence of the content validity of a test is found in the Table of specification.
This is the most important type of validity for a classroom teacher.
There is no coefficient for content validity. It is determined by experts judgmentally, not empirically.
Criterion-related Validity. A type of validation that refers to the extent to which scores from a test relate to theoretically similar measures. It is a measure of how accurately a student’s current test score can be used to estimate a score on a criterion measure, such as performance in courses, classes, or another measurement instrument. For example, classroom reading grades should indicate similar levels of performance as standardized reading test scores.
a. Concurrent validity. The criterion and the predictor data are collected at the same time. This type of validity is appropriate for tests designed to assess a student’s current criterion status or to diagnose a student’s status, as in a diagnostic screening test. It is established by correlating the criterion and the predictor using the Pearson product-moment correlation coefficient or other correlational statistics.
b. Predictive validity. A type of validation that refers to the extent to which a student’s current test result can be used to estimate accurately the student’s performance at a later time. It is appropriate for tests designed to assess students’ future status on a criterion. Regression analysis can be used to predict the criterion from a single predictor or from multiple predictors.
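The use of regression for predictive validity can be illustrated with a single predictor. The sketch below fits an ordinary least-squares line that predicts a later criterion score from a current test score. It is a simplified example, assuming one predictor and a linear relationship; the function name, the entrance-test scores, and the later grades are hypothetical.

```python
from statistics import mean

def fit_line(predictor, criterion):
    """Least-squares slope and intercept for predicting criterion from predictor."""
    mx, my = mean(predictor), mean(criterion)
    sxx = sum((x - mx) ** 2 for x in predictor)
    sxy = sum((x - mx) * (y - my) for x, y in zip(predictor, criterion))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical data: entrance-test scores and grades earned a year later.
entrance_scores = [60, 70, 80, 90]
later_grades = [65, 72, 85, 88]

slope, intercept = fit_line(entrance_scores, later_grades)
new_score = 85  # a new student's entrance-test score
predicted = intercept + slope * new_score
print(f"predicted later grade for a score of {new_score}: {predicted:.1f}")
```

How closely such predictions match the grades students actually earn later is the evidence of predictive validity; with several predictors, the same idea extends to multiple regression.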
Construct Validity. A type of validation that refers to the extent to which a test measures theoretical, unobservable qualities such as intelligence, math achievement, performance anxiety, and the like, as judged over a period of time on the basis of accumulated evidence. It is established through intensive study of the test or measurement instrument using convergent/divergent validation and factor analysis. Other ways of assessing construct validity include examining the test’s internal consistency, developmental change, and experimental intervention.
Convergent validity is a type of construct validation wherein a test has a high correlation with another test that measures the same construct.
Divergent validity is a type of construct validation wherein a test has a low correlation with a test that measures a different construct. In this case, a high validity occurs only when there is a low correlation coefficient between the tests that measure different traits.
Factor analysis assesses the construct validity of a test using complex statistical procedures that can be carried out in different ways.
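The convergent and divergent checks described above come down to comparing correlations. As a rough sketch, assume a new anxiety scale, an established anxiety scale (same construct), and a vocabulary test (different construct) given to the same six students. All scores and names are hypothetical, and a real study would need a far larger sample.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical scores for the same six students on three instruments.
new_anxiety_scale = [12, 15, 9, 20, 17, 11]
established_anxiety_scale = [13, 16, 10, 22, 17, 12]   # same construct
vocabulary_test = [30, 22, 27, 25, 31, 24]             # different construct

r_convergent = pearson_r(new_anxiety_scale, established_anxiety_scale)
r_divergent = pearson_r(new_anxiety_scale, vocabulary_test)

print(f"convergent r = {r_convergent:.2f}")  # expected to be high
print(f"divergent r = {r_divergent:.2f}")    # expected to be near zero
```

A high first correlation together with a second correlation near zero is the pattern that supports construct validity.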