Problem Statement: There have been many attempts to research the effective assessment of writing ability, and many proposals for how this might be done. In this context, rater reliability plays a crucial role in making vital decisions about testees at different turning points of both educational and professional life. The intra-rater and inter-rater reliability of essay assessments made with different assessment tools should therefore be discussed alongside the assessment processes.
Purpose of Study: The purpose of the study is to reveal possible variation or consistency in the grading of EFL writers' essay writing ability by the same or different raters using general impression marking (GIM), an essay criteria checklist (ECC), and an essay assessment scale (ESAS), and to discuss rater reliability.
Methods: Quantitative and qualitative data were used to present the discussion and implications for the reliability of ratings and the consistency of the measurement results. The assessment tools were applied to 44 EFL university students, and 10 graders assessed the students' essay writing ability using GIM, ECC, and ESAS on different occasions.
Findings and Results: The findings and results of the analyses indicated that general impression marking is evidently not reliable for assessing essays. The correlation coefficients, estimated variance components, and generalizability coefficients obtained from the checklist and scale assessments provide valuable information and clearly show that there is always variation among the results.
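As an illustration of the kinds of statistics named above, the following is a minimal sketch of how an inter-rater correlation and a generalizability coefficient could be computed. It assumes a fully crossed essays-by-raters design with one score per cell and estimates variance components from the two-way ANOVA mean squares. The function names (`pearson`, `g_study`) and the example scores are hypothetical and are not taken from the study, whose actual analysis design may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two raters' scores for the same essays."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def g_study(scores):
    """Variance components and relative G coefficient.

    `scores[i][j]` is the score rater j gave to essay i (assumed design:
    every rater scores every essay exactly once).
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]

    # Mean squares for essays (persons), raters, and the residual.
    ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in r_means) / (n_r - 1)
    ss_res = sum(
        (scores[i][j] - p_means[i] - r_means[j] + grand) ** 2
        for i in range(n_p)
        for j in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Estimated variance components; negative estimates are set to zero.
    var_p = max(0.0, (ms_p - ms_res) / n_r)
    var_r = max(0.0, (ms_r - ms_res) / n_p)
    var_res = ms_res

    # Relative G coefficient for the mean over n_r raters.
    denom = var_p + var_res / n_r
    g = var_p / denom if denom > 0 else float("nan")
    return {"var_p": var_p, "var_r": var_r, "var_res": var_res, "g": g}


# Hypothetical scores: 4 essays (rows) rated by 3 raters (columns).
example_scores = [
    [70, 75, 68],
    [55, 60, 58],
    [85, 80, 88],
    [62, 70, 65],
]

if __name__ == "__main__":
    rater_a = [row[0] for row in example_scores]
    rater_b = [row[1] for row in example_scores]
    print("inter-rater r:", round(pearson(rater_a, rater_b), 3))
    print("G study:", g_study(example_scores))
```

In this sketch a large rater component (`var_r`) signals that raters differ in overall severity, while a large residual (`var_res`) signals that they rank the essays differently; only the residual lowers the relative G coefficient.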
Conclusions and Recommendations: When the total scores and the rater consensus results in this study are examined, it can be clearly seen that the scores are almost never identical. For this reason, contrary to the commonly held view, checklists or even scales may not be as reliable as expected, and they may not improve the inter-rater or intra-rater reliability of ratings unless the raters are very well trained and share strong agreement, or common inferences, on the performance indicators and descriptors, so that the criteria set is not open to ambiguous interpretation. The results might be more accurate and reliable if a correlation coefficient of at least .90 were accepted as the minimum evidence of reliable ratings for this kind of measurement. This would mean that the scores assigned to the same or independent essays would be closer to one another. Even so, scale use could still be emphasized as the more reliable option. A more elaborate and careful examination with more raters is still needed.
Keywords: Essay, assessment, intra-rater, inter-rater, reliability.