"We have been trained to be good researchers, and we often teach students how to do research well. It’s time to teach our colleagues. This is a matter of equity in higher education." [Spr] (emphasis added)

The following essay is compiled from some of my previous writings (2010 - 2018) on SET.

Introduction

Student Evaluation of Teaching (SET) has been used at many universities for many years. However, a large body of research shows that SET fails so badly at evaluating the quality of teaching that its use is harmful to a university.

Teaching is a fundamentally optimistic endeavor. I don’t believe one can teach effectively if one believes students always get it wrong. The views expressed here are not intended to denigrate those who receive high ratings from students. I have heard several winners of excellence in teaching awards speak; in most cases, I believe they are indeed effective in the classroom, and have perhaps earned their high ratings. Rather, my views are intended to point out that student ratings are highly unreliable, often – not always, but far too often – rewarding entertainers and grade inflators, and punishing effective teaching. Our efforts ought to be directed at improving our universities; that’s what’s really pro-student.

It is saddening that some continue to tweak the use of student evaluations, in light of the large body of research proving conclusively that it is superstition to believe that student evaluations are anything but harmful to an educational institution. If, as I hope, you take pride in teaching, why would you want your career evaluated by methods that reward grade inflation and punish honesty, conscience, and professionalism? To what is this comparable? Should we expend tuition revenues to improve our transportation system by breeding better mules to pull our buggies? Should we issue the sturdiest, lightest-weight model of abacus to all faculty as replacements for laptop computers? Perhaps we can bring back the barter system, allowing students to pay their bills in eggs or hand-knit sweaters, to be passed on to faculty in lieu of salaries; perhaps TIAA-CREF can be induced to accept such contributions to our retirement investments. A university should respect research.

There is strong evidence that student responses to questions of “effectiveness” do not measure teaching effectiveness…. We do not measure teaching effectiveness. We measure what students say, and pretend it’s the same thing. (emphasis added) – [StaFr]

… our so-called course evaluations are not evaluations at all. Administrators interpret (and often use) the scores as measuring instructor effectiveness. Yet such scores cannot reflect how much students have learned. The scores encourage us to become popular teachers rather than good educators. – [Fried]

Student evaluation of teaching ratings and student learning are not related. – [UtWG]

We cite some of the research showing the deficiencies of SET, and propose a better way to evaluate teaching.

SET doesn’t work

A long list of the flaws of SET includes the following.

· Anonymous students often write false statements. Why should a university permit a professor’s career to be judged by lies or egregious errors?

· Many students who complained about a course or professor being too difficult should never have taken the course in the first place. Data that could be used for advisement is often ignored – I have often learned, after the passage of the drop/add period, that a student in one of my courses is unqualified for the course, by virtue of not having taken a prerequisite, or by virtue of reliable predictors of failure such as low SAT scores or low grades in previous indicator courses. Why should a lack of advisement or a third party’s poor advisement reflect negatively on a teaching evaluation?

· So many students evaluate teaching based on their expected grades [BrPaPe, CarW, C&S, Murray, N&Z, Red, V&S], rather than based on the quality of instruction, that faculty have incentive to “dumb down” courses and inflate grades.

The effect of grading on SET ratings is substantial enough that an instructor can move from being in the bottom third to the top third of instructors at their university simply by grading more leniently…. SETs … fuel grade inflation … a primary contributor to declining academic standards and the ‘dumbing down’ of the curriculum. – [Red]

· Other factors that should be irrelevant can be determinants of SET:

James Felton, a professor of finance and law at Central Michigan University, and colleagues looked at ratings for nearly 7,000 faculty members from 370 institutions in the United States and Canada, and his verdict is: the hotter and easier professors are, the more likely they’ll get rated as a good teacher…. – [Eps]

… overall results suggest that vacuous but animated, charismatic, and amusing lectures yield significantly higher student ratings than substantive but less animated lectures. – [Da]

· Although some students “get it right” by rating highly instructors who are excellent teachers, multiple studies have found that the most effective teachers, as determined by their students’ relative success on common exams or in sequel courses, get the worst SET ratings, while those who “dumb down” their courses and inflate their grades are rewarded in SET [A&L, BrPaPe, CarW, Henr, StaFr]. Several of these studies’ authors surmise that this is because effective teachers often assign more work and/or hold students to high standards, and students use SET to push back against the greater difficulty of obtaining a high grade.

At the University level in first-year economics neither the quality of the instructors nor that of the course as evaluated by the students had any significant effect on performance on the posttest. At the school level student opinion was a significant variable with a surprising negative sign; the poorer the student considered his teacher to be, the more economics he understood. – [A&L]

…teachers who are more effective in promoting future performance receive worse evaluations from their students. – [BrPaPe]

We … present evidence that professors who excel at promoting contemporaneous student achievement teach in ways that improve their student evaluations but harm the follow-on achievement of their students in more advanced classes…. Student evaluations are positively correlated with contemporaneous professor value added and negatively correlated with follow-on student achievement. That is, students appear to reward higher grades in the introductory course, but punish professors who increase deep learning…. – [CarW]

By making student evaluation … the principal way of deciding what constitutes good teaching, we end up by punishing those who try to teach students how to learn while rewarding those who make them feel content with old bad habits that turn a college “education” into an amusing game of little value. – [Henr]

· Many students have yet to develop an adult perspective on learning, failing to appreciate that they must learn faster and more independently than they did in high school. Psychologists classify the college years as “emerging adulthood” and note that students emerge into adulthood at varying rates. Many, especially at the freshman and sophomore levels, “have a foot, or even their head, in high school; they tend to view the instructor as the primary source of their learning, like a high school teacher…. the way the surveys (may) get used pushes us to dumb down our courses.” – [Z] How is it appropriate to have a professor’s career depend on immature judgment?

… teacher ratings are detrimental to students because they are a signal that responsibility for learning lies not with them, but with teachers and administrators. Studies … suggest that when people do not accept responsibility for their learning, they are not very successful. – [Arm]

· Often, only a minority of students write subjective comments. Since a satisfied student is typically less motivated to comment than a dissatisfied one, these comments are typically biased toward complaints, and may be mistakenly or deliberately treated as representative by readers of the SET.

…people tend to be motivated to act (e.g., fill out an online evaluation) more by anger than by satisfaction. Have you ever seen a public demonstration where people screamed, ‘we’re content!’ …. ? – [StaFr]

· All student responses have the same weight in anonymous SETs. A student with chronic absenteeism and tardiness, missing homework assignments, and skipped exams has the same standing as a diligent student to judge a professor’s teaching proficiency. Why should a student who made no effort have the right to judge the instructor as too hard or boring?

· When SET is done by having students fill out hardcopy forms and a student collects them for submission, we have no guarantee that extra forms were not used to give someone extra votes.

· When SET is done by having students fill out electronic forms, the response rate is low, suggesting that the responses received may be unrepresentative. It also indicates that many students do not take SET seriously; this conclusion is confirmed by websites such as ratemyprofessors.com, where all the world can see plenty of ratings based on non-academic factors.
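The distortion caused by self-selected responses can be sketched with a small back-of-the-envelope calculation. The class size and response rates below are hypothetical, chosen only to illustrate the mechanism, not drawn from any cited study:

```python
# Hypothetical class of 100 students: 80 would rate the course 5, 20 would rate it 2.
satisfied, dissatisfied = 80, 20
true_mean = (satisfied * 5 + dissatisfied * 2) / (satisfied + dissatisfied)  # 4.40

# Assumed (illustrative) response rates: dissatisfied students, being angrier,
# are far more motivated to fill out the electronic form than satisfied ones.
n_sat = round(satisfied * 0.2)        # 16 satisfied students respond
n_dis = round(dissatisfied * 0.7)     # 14 dissatisfied students respond
set_mean = (n_sat * 5 + n_dis * 2) / (n_sat + n_dis)  # 3.60

print(f"true mean rating: {true_mean:.2f}")
print(f"SET mean rating:  {set_mean:.2f} (response rate {n_sat + n_dis}%)")
```

Under these assumed rates, a class whose true average opinion is 4.4 reports a SET mean of 3.6, on a 30% response rate of the sort typical of electronic forms.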

· Studies have found gender bias in SET [BOS0, BOS1, Spr, Wie].

In two very different universities and in a broad range of course topics, SET measure students’ gender biases better than they measure the instructor’s teaching effectiveness. Overall, SET disadvantage female instructors. – [BOS1]

… the data indicate that it would be nearly impossible for a physically unattractive female instructor teaching a large required introductory physics course to receive as high an evaluation as that of an attractive male instructor teaching a small fourth-year elective course for physics majors, regardless of how well either teaches. – [Wie]

· Studies have found bias in SET favoring non-quantitative courses over quantitative courses [R&S, Ut&S, UtWG].

Professors teaching quantitative courses are far less likely to be tenured, promoted, and/or given merit pay when their class summary ratings are evaluated against common standards, that is, when the field one is assigned to teach is disregarded. They are also far less likely to receive teaching awards based on their class summary SET ratings. – [Ut&S]

Why should a professor of mathematics or accounting start the evaluation process at a disadvantage, as a result of his/her field, relative to a professor of humanities? Why should a computer science professor teaching mathematical foundations of computing start the evaluation process at a disadvantage relative to a departmental colleague whose courses have less abstract material?

A Catholic university encouraged this bias one semester by seeking, on its electronic evaluation form, the extent of agreement with “This course helped further the University mission by doing one or more of the following: …. (b.) extending a gospel-based, value-centered education, (c.) encouraging service to society, especially the poor and oppressed, and/or (d.) developing the whole person, mind, body, heart & soul.”

  • Item (b) is prejudicial regarding a faculty member known, or believed, not to be Christian.

  • Item (c) is prejudicial, as many students see courses in Philosophy, Religious Studies, and the social sciences as doing more to further these goals than courses that are quantitative-based.

  • Item (d) is inappropriate for multiple reasons. Of course a faculty member should contribute to the development of students' minds. But are we seriously going to require professors to lift weights, spin, or engage in dance-exercise fads with students? Are we going to expect faculty to be involved with students' romantic lives or religious beliefs?

  • It is obvious from years of grading students’ essays that many students do not carefully distinguish OR and AND, hence will confuse “one or more of the following” with “all of the following”.

An implicit premise of SET is that good teaching results in contented students. This is an invalid premise.

As teachers, we often question students’ fundamental assumptions. Students cannot reasonably expect to broaden their awareness and deepen their comprehension without being disquieted. Nor can they reasonably expect these changes to leave their most cherished beliefs and deepest sentiments undisturbed. Education is transformative: it does not leave you as it found you. Some students recoil from this in anger…. it is a direct result of doing my job well. Teachers do not aim at the kinds of outcomes that drive businesses. Sometimes a dissatisfied student is the direct consequence of excellent teaching. – [Tierno, p. 14]

Consequences of SET

The flaws discussed above lead us to conclude that SET fails miserably as a method of evaluating instructional quality. Other consequences of the use of SET are discussed below.

Rather than unfettered excellence in post secondary education, the overarching institutional agenda revealed by such practices is classroom marketability, elevated enrollments, and very high consumer satisfaction.… while such a stratagem may produce contented students, it essentially forsakes responsibility for educational leadership …. it is difficult to imagine a practice more harmful to a community that is ostensibly committed to instructional effectiveness…. And ironically, because student instructional ratings are poorly correlated with instructional products, the changes instructors make to their teaching routines to elevate student ratings are more likely to compromise than improve teaching effectiveness. – [Da] (emphasis added)

… society must hold higher education to much higher expectations or risk national decline…. Establishing higher expectations [for higher education], however, will require that students and parents rethink what too many seem to want from education: the credential without the content, the degree without the knowledge, and the effort it implies…. The simple fact is that some faculties and institutions certify for graduation too many students who cannot read and write very well, too many whose intellectual depth and breadth are unimpressive, and too many whose skills are inadequate in the face of the demands of contemporary life. – [Wing]

Learning, not student satisfaction, should be the teacher’s primary goal.

Good teaching is known primarily by the quality of student learning--an outcome that is difficult to observe and often not immediately evident. A successful career in teaching is marked by the teacher's reputation in propelling students to notable accomplishments. The level of accomplishment is more important than the number of students; the cases of individual students whose accomplishments are truly exceptional are of special importance. A teacher distinguished only by service to a large number of students is rarely considered exceptional. The job performance of teachers is directly enabled by factors such as adequately prepared and highly motivated students and, more than anything, by the presence of an organizational environment in which the achievement of academic excellence is accorded an unrivaled priority…. The institutionally emphasized concept that quality learning is a product of faculty commitment to "customer service" encourages waste by lessening student responsibility for learning…. The administrative penchant for presuming that the student customer is always right not only undermines student motivation, it makes teaching unattractive…. – [Stone]

The kindest thing you can do for a student is to be honest with him/her in evaluating his/her work. Students need to know how they measure up against "World Class" standards, not just against their classmates; not every class is an average class. If, as occasionally happens, I have a weak class, how many of my students should I lie to in order to achieve a B- average grade? In my courses for the major, how many students should I encourage to pursue careers for which I know they are academically unsuited by awarding higher grades than the students merit? How many recruiters should I lie to by inflating grades, risking that years down the road, the same recruiters will boycott my university as a den of grade inflation? Faculty are paid for expertise in our fields, including expertise in assessing students’ work. Shouldn’t we use the latter honestly, even at the risk of disapproval in student evaluations?

Fear of SET makes some faculty reluctant to oppose academic dishonesty:

…. A recurring theme emerged: some faculty members were unwilling to report cases of academic dishonesty to the dean of students. They were concerned about the negative effect such action might have on their end-of-semester Course Instructor Survey (CIS), which may be used for promotions, tenure decisions, and merit reviews…. These faculty members believed that a “whistleblower effect” would affect student evaluations of teaching. Many viewed it as professionally advantageous not to report academic violations and believed that reporting students would lead to lower course-evaluation ratings that could impede professional advancement. – [AB]

Faculty careers and the quality of a college education are powerfully affected by reliance on SET. When tenure, promotion, and post-tenure reviews rely heavily on SET, the research cited above argues that we evaluate very badly, basing evaluations on inappropriate assumptions and data of dubious reliability. SET measures popularity, not pedagogical effectiveness. While professor A becomes popular, correctly recognized as an outstanding teacher, professor B becomes popular for “dumbing down,” inflating grades, and entertaining rather than educating. A favorable tenure decision for B can result in a university being stuck with faculty deadwood for decades. While professor C becomes unpopular, correctly recognized as an ineffective teacher, professor D becomes unpopular for excellently challenging students to learn and holding them accountable for doing so. An unfavorable tenure decision for D strips the university of the services of a professor who could uphold the university’s excellence for decades, and may do a major injustice to the professor.

The ethical argument

If you do data-based research and submit for publication a manuscript based on data and assumptions that you know are flawed, when your fraud is discovered, your blatantly unethical conduct is likely to cost you your career, even if you have tenure. So how can it be ethical for a university, knowing SET data and underlying assumptions are horribly flawed, to judge faculty careers based on this data? How can a university claim to teach its students to pursue justice while clinging to methods of evaluation guaranteed to create many injustices?

Better way

One of the questions raised in discussion of doing away with SET is what sort of evaluation system should replace it. The paper [Wie] offers a proposal that is more appropriate to large research universities than to institutions that put a greater emphasis on teaching. Here, we propose an evaluation system suitable for teaching universities.

Let’s replace SET with evaluations done by faculty from a consortium of universities formed for the purpose of evaluating teaching. In the age of Zoom and livestreaming, it would not be necessary to restrict evaluators to a small radius around the faculty member being evaluated. Faculty should be evaluated by colleagues from other universities as well as their own. Advantages of such a system include:

· Disinterested evaluations based on professional knowledge of teaching, as opposed to student reactions to irrelevances such as “hotness,” “coolness,” fashionable garb, and entertainment skills, or to counterproductive measures such as “easiness” of grading, none of which should have any part in the evaluation of teaching.

· Decreased likelihood of biases based on race, gender, nationality, religion, accent, or subject matter.

· Small schools have many departments in which a few faculty cover a breadth of specializations. Consequently, there may be no member of the department other than the instructor being evaluated who is sufficiently knowledgeable to evaluate a course properly. E.g., in a department of computer and information systems, a programming specialist and an information security specialist might have little knowledge of how each other’s courses should be taught; in a foreign languages department, instructors of Spanish and Chinese might have insufficient knowledge of how to teach each other’s subjects. Participation in an evaluation consortium would make available a larger pool of evaluation expertise.

· Disinterested outside opinion can balance against the following.

o Colleagues within a department might be reluctant to offer useful criticism, as they may wish to avoid ill will for their own future evaluations.

o Where there is friction among department members, unfair criticism can result.

The proposed system could add modestly to the expense of evaluation; e.g., an outside evaluator who travels to observe a class should be compensated for expenses. The additional expense might be limited by requiring outside evaluation less frequently than annually and by using livestreaming or Zoom instead of in-person class visits by the evaluator. Such expenses would be worthwhile if they helped make better decisions based on evaluation of teaching. In the long run, a university’s reputation for excellence in teaching would be fortified by an evaluation process that results in better tenure, promotion, and post-tenure evaluation decisions; it is reasonable to assume this would have a positive effect on enrollments. Also, it might be possible to find funding for such a process from agencies such as the Department of Education or from private foundations that focus on education.

Members of a professor’s department and the professor’s dean could still participate in evaluating the professor’s teaching, as appropriate.

A password-protected website, regulated by an administrator with access to its various sections, could be established. Such a website would make it possible for on-campus and off-campus evaluators to give anonymous reviews of classroom observations, syllabi, grade distributions, grading methods, quality of assignments, and quality of exams.

Further remarks

Many universities proclaim dedication to the pursuit of knowledge, excellence, and integrity. The research cited in this document shows that the practice of SET mocks these values.

“Remember the days of old, consider the years of ages past; ask your father, he will inform you; your elders, they will tell you” (Deut. 32:7). This verse should be understood as more than poetry; it teaches us to learn from history and research. History and research teach us to evaluate teaching by a system other than SET, which once was an interesting innovation but is now known to be destructive of a university. An institution of higher learning that refuses to honor and learn from history and research concerning the conduct of its core activities risks degrading itself from a university to a mere “university.”

References

[AB] Mihran Aroian and Raymond Brown, “The Whistleblower Effect,” Academe 101 (5), Sept.-Oct., 2015, 16-20, http://www.aaup.org/article/whistleblower-effect#.VfrnkH2hvVY

[Arm] J. Scott Armstrong, “Are student ratings of instruction useful?”, American Psychologist 53 (11) (1998), 1223-1224 (abstract online at http://psycnet.apa.org/journals/amp/53/11/)

[A&L] Richard Attiyeh and Keith G. Lumsden, “Some Modern Myths in Teaching Economics: The U. K. Experience,” The American Economic Review 62 (1972), 429-433 (article online at http://links.jstor.org/stable/1821578?seq=4)

[ACS] American Chemical Society, “Academic Professional Guidelines,” http://portal.acs.org/portal/acs/corg/content?_nfpb=true&_pageLabel=PP_ARTICLEMAIN&node_id=1095&content_id=CNBP_023288&use_sec=true&sec_url_var=region1&__uuid=5dc5614f-1974-45f1-927c-8324f87d44a3

[BOS0] Anne Boring, Kellie Ottoboni, Philip B. Stark, “Student evaluations of teaching are not only unreliable, they are significantly biased against female instructors,” London School of Economics and Political Science Impact Blog, http://blogs.lse.ac.uk/impactofsocialsciences/2016/02/04/student-evaluations-of-teaching-gender-bias/

[BOS1] Anne Boring, Kellie Ottoboni, Philip B. Stark, “Student evaluations of teaching (mostly) do not measure teaching effectiveness,” scienceOpen Research 2016 (DOI: 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1), https://www.scienceopen.com/document_file/25ff22be-8a1b-4c97-9d88-084c8d98187a/ScienceOpen/3507_XE6680747344554310733.pdf

[BrPaPe] Michela Braga, Marco Paccagnella, Michele Pellizzari, “Evaluating students’ evaluations of professors,” Economics of Education Review 41 (2014), 71-88

[B&R] Nancy W. Burton and Leonard Ramist, “Predicting Success in College: SAT® Studies of Classes Graduating Since 1980,” The College Board Research Report No. 2001-2, available at http://professionals.collegeboard.com/profdownload/pdf/rdreport200_3919.pdf

[CarW] Scott E. Carrell and James E. West, “Does Professor Quality Matter? Evidence from Random Assignment of Students to Professors,” National Bureau of Economic Research Working Paper No. 14081 – online at http://www.nber.org/papers/w14081.pdf

[Cash] William Cashin, “Students do rate different academic fields differently,” New directions for teaching and learning 43 (1990), 113-121

[C&S] B.A. Chambers and N. Schmitt, “Inequity in the Performance Evaluation Process: How You Rate Me Affects How I Rate You,” Journal of Personnel Evaluation in Education 16 (2) (2002), 103-112

[Da] John C. Damron, “Instructor Personality and the Politics of the Classroom,” 1996, http://krypton.mnsu.edu/~br8520zh/Damron_politics.html

[Eps] David Epstein, “‘Hotness’ and Quality,” Inside Higher Ed, May 8, 2006 – https://www.insidehighered.com/news/2006/05/08/rateprof

[Fried] Mike Fried, “Classroom Assessment vs. Student Satisfaction,” Notices of the AMS 58 (2) (2011), 229 – online at http://www.ams.org/notices/201102/rtx110200229p.pdf

[Henr] Melvin Henriksen, Letter to the Editor, Notices of the American Mathematical Society, November, 1996, 1325 – online at http://www.ams.org/notices/199611/letters.pdf

[Murray] Sean Murray, “Teaching And Tenure In The Vocationalized University,” Workplace: A Journal for Academic Labor 21 (2012), 53-60 – online at http://ojs.library.ubc.ca/index.php/workplace/article/view/182518/183707

[N&Z] Jill M. Norvilitis and Jie Zhang, “The effect of perceived class mean on the evaluation of instruction,” Educational Assessment, Evaluation and Accountability 21 (2009), 299-311

[Red] Richard E. Redding, “Students’ evaluations of teaching fuel grade inflation,” American Psychologist 53 (11) (1998), 1227-1228 (abstract online at http://psycnet.apa.org/journals/amp/53/11/)

[R&S] Kenneth D. Royal and Myrah R. Stockdale, “Are Teacher Course Evaluations Biased Against Faculty That Teach Quantitative Methods Courses?”, International Journal of Higher Education 4 (1) (2015), 217-224

[Sob] E.K. Sobel, “The Modern Educator,” IEEE Computer, June, 2013, 82-83 (online at http://online.qmags.com/CMG0613?pg=1&mode=2#pg86&mode2)

[Spr] Joey Sprague, “The Bias in Student Course Evaluations,” Inside Higher Ed, June 17, 2016 - https://www.insidehighered.com/advice/2016/06/17/removing-bias-student-evaluations-faculty-members-essay

[StaFr] Philip B. Stark and Richard Freishtat, “An Evaluation of Course Evaluations,” ScienceOpen Research, https://www.scienceopen.com/document_file/ad8a9ac9-8c60-432a-ba20-4402a2a38df4/ScienceOpen/1826_XE9106672292100478299.pdf

[Stone] J.E. Stone, “Inflated Grades, Inflated Enrollment, and Inflated Budgets: An Analysis and Call for Review at the State Level,” Education Policy Analysis Archives 3 (11), June 26, 1995

[Tierno] J.T. Tierno, “How Many Ways Must We Say It?” Academe 100 (6) (2014), 11-14 – http://www.aaup.org/article/how-many-ways-must-we-say-it#.VNfoySyhtsk

[Ut&S] Bob Uttl and Dylan Smibert, “Student evaluations of teaching: teaching quantitative courses can be hazardous to one’s career,” PeerJ 5:e3299 (2017), DOI 10.7717/peerj.3299

[UtWG] Bob Uttl, Carmela A. White, and Daniela Wong Gonzalez, “Meta-analysis of faculty’s teaching effectiveness: Student evaluation of teaching ratings and student learning are not related,” Studies in Educational Evaluation (2016), http://dx.doi.org/10.1016/j.stueduc.2016.08.007

[V&S] R. Vasta and R.F. Sarmiento, “Liberal grading improves evaluations but not performance,” Journal of Educational Psychology 71 (2) (1979), 207-211

[Wie] Carl Wieman, “A Better Way to Evaluate Undergraduate Teaching,” Change: The Magazine of Higher Learning 47 (1) (2015), 6-15 – http://www.tandfonline.com/doi/abs/10.1080/00091383.2015.996077

[Wing] Wingspread Group on Higher Education, “An American Imperative: Higher Expectations for Higher Education,” Racine, WI: The Johnson Foundation, Inc., 1993

[Z] Steven Zucker, “Evaluation of Our Courses,” Notices of the American Mathematical Society 57 (7) (2010), p. 821 (online at http://www.ams.org/notices/201007/rtx100700821p.pdf)