Investigation of Rater Tendencies and Reliability in Different Assessment Methods with Many Facet Rasch Model

Duygu Koçak

Vol. 12 No. 4 (2020) IEJEE

Published April 12, 2020 | Pages: 349-358

Abstract

Open-ended items are among the most commonly used methods for measuring higher-order thinking skills such as problem-solving and written expression. Three main approaches are used to evaluate responses to open-ended items: general evaluation, rating scales, and rubrics. To measure and improve students' problem-solving skills, an error-free measurement process is needed first. Rater-related error is a common problem in the evaluation of open-ended items. Rater errors, such as bias and a tendency to score too high or too low, adversely affect the accuracy of the decisions based on the scores. This study examines raters' tendencies under the general evaluation, rating scale, and rubric conditions used to evaluate open-ended items. Rater behaviors in each assessment method and the raters' opinions about the assessment methods were determined. The participants were 12 mathematics teachers, and the analyses were based on the Many Facet Rasch Model. The scoring reliability of each method was estimated. It was concluded that the raters had a more homogeneous scoring tendency when using the rating scale. In addition, although the majority of raters stated that they preferred to use the rubric, they also stated that it was the most difficult method to use.

Keywords

Problem-solving, Rubric, Rating scale, Rater tendency, Rater reliability, Many Facet Rasch Model.

How to Cite

Koçak, D. (2020). Investigation of Rater Tendencies and Reliability in Different Assessment Methods with Many Facet Rasch Model. International Electronic Journal of Elementary Education, 12(4), 349–358. Retrieved from https://iejee.com/index.php/IEJEE/article/view/1024