Journal of Technology and Science Education
SOME LIMITS IN PEER ASSESSMENT

Nowadays, the educational methodology known as ‘peer assessment’ constitutes one of the pillars of formative assessment at the different levels of the educational system, particularly at the university level. In fact, in recent years, it has been increasingly used to enhance students' meaningful learning, as it is considered to be an element of social learning, in which students benefit from the lessons learned by other classmates, and draw upon the ability to assess the quality of the learning, contrasting it with the level of knowledge that each has about the subject/course being evaluated, and using common evaluation criteria. In this regard, this paper presents the experience of two groups of students. It allows us to determine how many peer assessments should be required of students in a particular course in order to constitute a serious, reliable activity. On the other hand, from the point of view of the student, the assessments are evaluated to the extent that they are seen as a required and mandatory exercise that must be carried out by students simply to pass the course. In the latter case, the activity can become extremely trivial and banal. Statistical analysis of the results indicates that three peer assessments per student appraised represents an adequate number. On the other hand, more than thirty peer assessments fail to contribute to learning, nor do they represent serious activities.


INTRODUCTION
Authentic, real learning always occurs as the result of reflection (Cowan, 2006), or in other words, awareness of learning and its implications for the personal structure of knowledge of each individual. In addition, the ultimate objective of learning is frequently the ability to make good, correct decisions based on knowledge; i.e., the evaluation or assessment of a situation in order to reach a decision. Accordingly, the failure to reflect on learning results in "low-quality" learning. As a consequence, evaluation should not be considered a simple act of classification or grading, as it has further, very important dimensions.
Perhaps the most important, critical, and judgmental of the different kinds of assessment may be what is known as 'self-assessment', in which each student assesses himself/herself. This is understood as one of several "reflection tools" that are available. Moreover, it can also be useful for adjusting the scope of learning (Boud, 1995; Andrade & Du, 2007).
In order to facilitate self-assessment, it is also very important to consider the tool known as 'peer assessment', which is understood as the exercise of value judgments regarding the learning of others, who are presumed to be cognitively equal, and in the most practical of cases, learning peers (classmates). When students reflect on the product of the learning of peers (Keig & Waggoner, 1994), at the same time as they are also learning, this encourages an internal reflection on whether one's own learning is at the same, higher or lower level than that of others. Therefore, peer assessment posits the student as an observer and, at the same time, as an evaluator. Consequently, the student's own learning is, in turn, reinforced.
In terms of self-assessment, certain precise, external elements of control are required, according to which students can establish the authenticity of their knowledge, the understanding of concepts and, in general, of their learning. They provide reference models to the students, in order to compare the evolution of their own learning. If the learning is based on concepts, the references should be related to the students' ability to answer questions, make inferences, draw conclusions from situations, etc. These concepts should be presented in the texts or materials selected, or prepared by the course professors themselves. With regard to procedures, they should be oriented towards problems, situations, examples, etc. These should also be selected by the course professors. As far as attitudes (a much more complex competence to establish, since it is not restricted to a scientific approach or knowledge, rather it depends on the social and cultural needs of each student, among other things) are concerned, they are based on different factors, such as the attitudes of the professor and the educational center, the institution itself, appropriate readings, the proposal of situations, etc., trying to refrain from indoctrination.
Therefore, in self-assessment, there are many elements that promote authentic learning. It remains up to the professors to make proposals, giving the students the opportunity to engage in self-assessment so that they at least become aware of the learning that has taken place, of what remains to be learned, and the importance and status of such knowledge in the personal framework, under a "constructivist" approach to learning.
Without the ability to compare one's own learning to that of other classmates, the assessment process seriously lacks an element of reference. Furthermore, as Boud and Falchikov point out, "peer assessment requires students to provide either feedback or grades (or both) to their peers on a product or a performance, based on the criteria of excellence for that product or event which students may have been involved in determining" (2007, p. 132). Actually, it is not only the comparison of one's own learning in terms of formal or scientific knowledge about concepts, procedures and attitudes. An element of reference would also be involved: the comparison of our own knowledge to that of our peers; i.e., the knowledge exhibited by other peers (usually fellow students or classmates). This allows the positioning of each student in relation to the rest of his classmates. Without the possibility to assess the knowledge of others (peer assessment), the assessment triangle, formed by hetero-assessment (that carried out by the professor on his students) and self-assessment (that conducted by the student on his own performance), becomes faulty and weak. It would consist of an individual student who is presented with formal knowledge, but without the support of peers, and thus without the support and assistance provided by social learning.
There are many aspects of social learning. The clearest is that two students working together learn more and faster than when working alone (conventional wisdom has summarized this in the old adage "two heads are better than one"). It is also true that nowadays, in most areas of our daily life, social learning occurs more frequently than individual learning. People are continually asked how to do certain things, perform different activities, etc. (for example, sending an e-mail or a fax, how a smartphone or PC application works, which TV or radio channel broadcasts a certain program, the time you need to be somewhere, etc.). This is a dimension of human relationships in which the social learning component is evident in our daily lives. This daily occurrence is also very common in academic learning: How do you calculate...?, How do you program...?, How do you say...?, How do I mix...?, What have you done with...?, etc.
One aspect of social learning that is favored by belonging to a certain group or collective (Pigott, Fantuzzo & Clement, 1986) is learning from others and with others and from the academic products of others, derived from a particular learning process. This means that the professor's educational goal is for his students to achieve a certain goal as the result of this training. The instructor establishes a learning procedure in which a teaching and educational methodology is chosen (narrative or cooperative, based on projects, problems or cases, portfolios, etc.), and students are required to provide a result of this learning in the form of a product that can be analyzed in terms of quality, having previously established quality criteria for said product. When students engage in the learning process established by the professor simply for the sake of carrying it out, they have already learned certain things (Cowan, 2006). However, if they do not reflect on what they have done and what they have learned, the learning may contribute little to the building of knowledge. This is the situation that occurs in the scaffolding of the concepts, procedures and attitudes inherent in the course assignment.
The previous ideas have been described by Topping (1998), Race (2001) and Nilson (2003), who outline how professors can use peer feedback as an alternative method of evaluation, in order to help students to acquire important life skills. Thus, convinced of its benefit, and the advantages suggested above, the experience presented in this article is derived from this approach. During the course of this experience, the authors have attempted to analyze some of the limits of peer assessment, an aspect which has been neglected in the previously consulted literature, and in particular the amount of work peer evaluation represents for the students. It would stand to reason that it is not the same thing to assess products made by only two classmates as it is to evaluate the activities of five, or even fifty, peers. The reader will probably perceive that fifty is a large number, but the questions remain: How many is too many? and How many activity assessments are enough? Lin (2001), for example, describes a relatively good experience with six reviewers, in an effort to decrease bias due to assessments with fewer participants. On the other hand, it is worth examining whether students are really interested in doing peer assessments, even when this process is strongly beneficial and advantageous for their learning.
These questions are beyond the scope of this paper, because the answers depend on a variety of aspects: the group subjected to the experience, whether students are first-year university students or are in final or intermediate courses, etc. We understand that, of course, answers could also vary, as in the case of students subjected to higher or lower overall academic pressure, or the period within the course in which the activity is proposed. Therefore, in this article, we focus on the experience in terms of the amount of peer assessment required, leaving the question of interest and benefits for further study.

RESEARCH QUESTIONS
This study attempted to answer the following research questions:
• Can differences be found among assessments made by different numbers of reviewers?
• Is there any way to process the information generated by the evaluators that is not excessively onerous in terms of time?
• Is there much difference between the results of peer assessment and those of professor assessment?
• What does the grade distribution look like? In other words, how are the marks distributed for each assignment?
We have restricted our investigation to reaching a final determination: Is there an optimal number of assessments per person in order to obtain reliable results in terms of grading?

EXPERIENCE
The peer assessment experience was carried out within the context of an Industrial Engineering course. More specifically, the course was entitled Control and Industrial Automation, and forms part of the framework of the European Higher Education Area (EHEA) or Bologna Process. The course is common to six different engineering degrees (Electronic, Electrical, Mechanical, Chemical, Biomedical and Energy Engineering) taught at the Barcelona College of Industrial Engineering (EUETIB) at the Technical University of Catalonia (UPC). The course was taught to six different groups of approximately 45 students each. Four of these groups received classroom instruction in the morning, and the other two in the afternoon. Students from the six different degrees were combined in the course groups, since it is a core course in the engineering program. With regard to assessing larger groups, the important thing to recognize is that they may require strategic solutions, which can only be implemented at the departmental or even institutional level, and which are beyond the control of individual tutors (Rust, 2001). In order to implement our experience, only two of these six groups were considered, one from the morning schedule and the other from the afternoon, because the overall performance of the students in the groups differed from morning to afternoon. Generally speaking, the students in the afternoon groups work and study at the same time. On the other hand, the students in the morning groups are dedicated to a single activity, i.e., studying. These two groups (the morning and the afternoon groups) have been used as samples representing the remaining groups. This allowed us to prevent the transfer of opinion between students of both groups, and thus better isolate the two populations.
One of the skills that a university graduate should possess is explicitly set out in Spanish law: "Students can communicate information, ideas, problems and solutions to both specialist and non-specialist audiences" (BOE, 2007). This competence is developed by all students and must be assessed by their classmates. Thus, it is critical for students to be understood. This, in turn, is one of the characteristics of oral and written expression, which, without a doubt, is a fundamental competence for any university graduate.
Specifically, the assigned task required students to give a simple explanation of a technological issue of a certain degree of complexity, with the premise that anyone (a non-expert or layperson in the topic) could understand it. It should be considered that simple questions normally have simple explanations, while complex issues rarely have a simple explanation. Surely, for instance, it is not easy to explain in a nutshell the splitting of the atom to an audience that has absolutely no knowledge of what matter is made of. However, it is always possible to at least use comprehensible terminology, give examples and analogies, and make use of explanatory resources that can palliate the inherent difficulty of complex concepts.
In this work, the activity assigned to one of the groups consisted of describing how an ionic smoke detector works. It is based on the principle of the emission of ionizing radiation by certain chemical elements, such as americium-241. This is a radioactive material that emits alpha particles and ionizes the air around it. This enables an electric current to flow between two electrodes. Thus, when smoke particles fill the air around the material, the electrical current decreases and an electronic circuit detects the presence of these smoke particles. The topic assigned to the second group was the description of the term "phantom". In this case, this term applies to the power supply for capacitor-based microphones. In both cases, we need to understand both concepts very well in order to give a simple and concise explanation. As a matter of fact, it is only possible to explain something correctly, concisely, and completely if it is well known. Consequently, the aim of the proposed activity is for students to study these elements in detail. Only then can they give a competent and relevant explanation to a non-specialist audience.
Notwithstanding the difficulties already described, and others listed in Rust (2001), including the problem of a small number of assessments, the nature of the activity is rooted in simplification, as the students in the course had to make an effort to simplify the discourse and explanations. Thus, the activity was carried out using the social network Twitter. Accordingly, a limit of 140 characters was imposed to determine the student's comprehension (competence) to an even greater degree.
Therefore, the first part of the activity required the students to post their explanations on Twitter. Next, the second part of the activity involved the peer assessment of this explanation by some of their classmates. Over the course of a week, they scored each explanation, awarding between zero and ten points, based on whether it would be understood by a layperson.
In order to carry out the assessment activity, two patterns were established, the first with a relatively low number of peer assessments, consisting of only three randomly chosen assessments. According to Race (2001), student peer assessment can be anonymous, with assessors randomly chosen so that friendship factors are less likely to distort the results. In our case, however, we established a public list of assessments. Nevertheless, since the students were coded, it would have been very difficult for anyone to find out the identity of the students evaluated or those that evaluated them. Thus, for all practical purposes, this implied randomness in selecting the reviewers.
On the other hand, the second assessment activity was massive, using the entire group of 37 students. Of course, it was assumed that assessment based on only three peers would yield different results than when the number was significantly larger, performed by 37 students in our case. Furthermore, in this second case, there was also a self-assessment component. When many assessments are made, this allows us to see whether the assessment carried out on oneself differs greatly from that performed by one's classmates. On the other hand, it should be noted that in the first case, we preferred to limit the activity to peer assessment (with no self-assessment) because, based on the authors' teaching experience, we believe that self-evaluation combined with only a few samples of peer assessment (only three) could generate a bias effect on the final outcome. Data obtained from Twitter were processed using Microsoft Excel.

Peer Assessment Carried Out By 3 Classmates (3-peer Assessment)
In one of the groups studied, it was established that the peer assessment of each student who had posted an explanation was to be carried out by three different classmates. The percentage of people posting an explanation was 89% (33 of 37). Thus, in principle, these were the students who could take part in the peer assessment. In the end, the population that took part in the peer assessment consisted of 29 out of the 33 students; i.e., 88%. Therefore, from the standpoint of participation, the minimum number of participants required to carry out the experience was exceeded. Thus, sufficient data were obtained for an accurate and reliable analysis.
The calculation of the results was based on the average of the marks given to each student by their peers, according to the method indicated by Brown, Bull and Pendlebury (1997): "An average for each student can be generated from the range of marks their peers give them". Since not every member of the population took part in the peer assessment, some students were assessed only once or twice: 2 of the 33 were assessed only once, and 11 of the 33, only twice. The remaining 21 students received three peer assessments, and the additional figures below relate to them. In addition, in this first case, the course professors analyzed all the given feedback, made any adjustments they considered necessary (Davies, 2006) and added an additional grade (the professor's own score) for these students.
The average difference between the marks assigned by the course professors and the average grade given by students amounted to around ±1 point, as seen in Figure 1. It is curious to note that the marks assigned by the professor were almost always more favorable (that is, the grades given by the professors are, in general, higher than those given by the course peers).
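The comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch: the study processed its data in Microsoft Excel, the function names `peer_average` and `professor_gap` are hypothetical, and the marks used are invented for the example rather than taken from the study.

```python
from statistics import mean


def peer_average(peer_marks):
    """Average of the marks (0-10) that one student received from peers."""
    return mean(peer_marks)


def professor_gap(professor_mark, peer_marks):
    """Professor's mark minus the peer average.

    A positive value means the professor graded more favorably than the peers.
    """
    return professor_mark - peer_average(peer_marks)


# Hypothetical marks for one explanation (not data from the study).
example_peer_marks = [7, 8, 6]
example_professor_mark = 8

gap = professor_gap(example_professor_mark, example_peer_marks)
print(f"peer average = {peer_average(example_peer_marks):.2f}, gap = {gap:+.2f}")
```

Repeating this calculation for every explanation would give one signed difference per student, which is the quantity plotted in Figure 1.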

Figure 1. Difference between the grades given by the course professors and the average grade given by students for each of the explanations given (out of a total of 10 points)
Another element in our investigation was the analysis of the standard deviation of the grades given by the students. Figure 2 shows the concentration or dispersion of grades around the mean. In one case (student #21), the discrepancy was somewhat higher. Figure 2 can be understood as a measure of agreement or disagreement among the three students in terms of the respective average score.
It is important to highlight that, for the purposes of calculating the deviation, Bessel's correction, which divides by N-1 instead of the number of samples N, was used to compensate for our small number of samples. An additional aspect of the study was the analysis of the grade distribution. Figure 3 shows a graph of all the grades and how their dispersion was distributed around the mean (standard deviation). A 6th-order polynomial interpolation determined that the maximum grades for the different explanations fell between 7.5 and 8.5.
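To make the effect of Bessel's correction concrete, the following minimal Python sketch compares the corrected (N-1) and uncorrected (N) standard deviations for three marks. The helper `bessel_sd` and the marks are hypothetical and serve only as an illustration; the standard library functions `statistics.stdev` and `statistics.pstdev` compute the corrected and uncorrected versions, respectively.

```python
import math
from statistics import pstdev, stdev


def bessel_sd(values):
    """Sample standard deviation: squared deviations are divided by N-1."""
    n = len(values)
    average = sum(values) / n
    squared_deviations = sum((x - average) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


# Hypothetical marks given by three peers to one explanation.
marks = [7, 8, 9]

print("with Bessel's correction (N-1):", bessel_sd(marks))
print("statistics.stdev (also N-1):   ", stdev(marks))
print("without correction (N):        ", pstdev(marks))
```

With only three marks per explanation, the corrected value is noticeably larger than the uncorrected one, which is why the choice of divisor matters for such small samples.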
Figure 3. Distribution of the grades given to each explanation, with a polynomial interpolation curve of order 6 (the grades exceed 10 in the figure to allow proper interpretation of the interpolating curve)

Whole-group Peer Assessment
In the other course group studied, the peer assessment was performed for each student who had posted an explanation. Thus, each student was required to assess all the submitted explanations, including his or her own activity (self-assessment). In this case, the authors did not take into account the difference between the grades given by the professor and the average grade given by the students for each of the given explanations. The reason is clear: the dispersion of results is so great that the professor's grade is quite insignificant in terms of the total.
Of the students, 86% (37 of 43) posted an explanation, and therefore they were considered to be the population that should be peer-assessed. In turn, 92% of these students (34 of 37) performed peer assessment. As before, from the standpoint of participation, the threshold for considering this a reliable amount of data for analysis was surpassed.
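The participation percentages reported for both groups can be checked with simple arithmetic. The short Python sketch below is only a verification aid; the function name `participation` is hypothetical, while the counts are those stated in the text.

```python
def participation(participants, total):
    """Participation as a percentage, rounded to the nearest whole number."""
    return round(100 * participants / total)


# Counts reported in the text for each group.
print("3-peer group, posted:       ", participation(33, 37), "%")
print("3-peer group, assessed:     ", participation(29, 33), "%")
print("Whole-group case, posted:   ", participation(37, 43), "%")
print("Whole-group case, assessed: ", participation(34, 37), "%")
```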
In this case, as the course professor also required each student to assign a self-grade (i.e., self-assessment), it was found that these marks were higher in almost all cases than the average of the grades given by the rest of their peers, as shown in Figure 4. Except for one student, whose self-assessment was half a point below the average grade, the rest of the students assigned themselves a higher mark, with more than four points of difference in some cases.
These errors in judgment lead us to suspect that the data should be considered, at best, doubtful, even when half of the students made an error of +1.5 points. This is somewhat reasonable, since one's appreciation of oneself (and, thus, "self-assessment") is usually more generous than that of one's peers.
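The self-assessment error discussed above can be expressed as the self-assigned mark minus the average of the marks given by the other classmates. The following Python sketch illustrates this under that assumption; the function name `self_assessment_error` and the marks are hypothetical and are not data from the study.

```python
from statistics import mean


def self_assessment_error(self_mark, peer_marks):
    """Self-assigned mark minus the average mark given by the other peers.

    A positive value means the student rated himself or herself
    more generously than the classmates did.
    """
    return self_mark - mean(peer_marks)


# Hypothetical example: a student who awards himself or herself an 8,
# while four classmates give marks that average 6.
error = self_assessment_error(8, [6, 7, 5, 6])
print(f"self-assessment error = {error:+.1f} points")
```

Note that the peer average here excludes the student's own mark, so that the self-grade does not influence the reference value it is compared against.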

Figure 4. Self-assessment error
The calculation results are based on the average of the grades given to each student by his or her peers. In this case, not all of the students studied took part in the peer assessment. However, there were enough data (grades) on the explanations given by the students in the class, and therefore this fact does not significantly influence the average results. One of the results examined was the variation range (i.e., the difference between the highest and lowest marks), which can be seen in Figure 5.
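The variation range is simply the highest mark minus the lowest mark received by one explanation. A minimal Python sketch follows; the function name `mark_range` and both sets of marks are hypothetical and only illustrate how the range can widen when many more reviewers are involved.

```python
def mark_range(marks):
    """Difference between the highest and the lowest mark for one explanation."""
    return max(marks) - min(marks)


# Hypothetical marks (not data from the study).
few_reviewers = [7, 8, 8]                # e.g. a 3-peer assessment
many_reviewers = [1, 4, 6, 6, 7, 8, 10]  # e.g. part of a whole-group assessment

print("range with few reviewers: ", mark_range(few_reviewers))
print("range with many reviewers:", mark_range(many_reviewers))
```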
Range of the difference between maximum and minimum grades
Regarding the analysis of the standard deviation of the grades given by the students, Figure 6 shows the concentration or dispersion of the grades around the average. Compared to the case of three-peer assessments presented in the previous section, it can be seen that, in this second case, the standard deviation increases when the entire group is considered. As before, it is important to note that Bessel's correction was also applied in the calculation of the standard deviation.

Figure 6. Standard deviation of the grades given by the students for both cases considered in the study
When the dispersion of the distributions is compared in the case of 3 and the case of 37 peer assessments, the global deviation tends to increase as the number of peer assessments increases.
The final element of this case study is the analysis of how the marks are distributed for each explanation. Accordingly, Figure 7 depicts four graphs (grouping students according to the dispersion of the grades given by their classmates) and how their dispersion is distributed around each mean. The highest grade obtained through mass peer assessment was 6.9. However, in the case of 3-peer assessments, it reached 10 points (see Figure 8). It should be noted that students in the course have primarily learned the topic addressed in the first part of the activity, where their classmates were required to give a clear, concise explanation of a complex concept to a non-expert audience. However, it is interesting to note that the task of simply reading the different explanations (from the rest of the classmates) for the same concept produced greater learning, as compared to the understanding that the student initially had. Therefore, from the point of view of a student, we can conclude that the learning is both greater and richer: on the one hand, thanks to the act of performing the task itself, and on the other hand, thanks to the task of reading (and assessing) the explanations given by many classmates about the same topic.

OBSERVATIONS AND DISCUSSION
It is interesting to note that, in the case of 3-peer assessments, it is publicly known who is evaluating whom. Thus, it could be suspected that students who completed the activity later than others might know the grade that the other peer reviewers had already assigned to the classmate being assessed. They therefore might have had a reference in order to determine their own marks for their other classmates. In our case, however, the authors do not believe that this happened, because the allocation of peer assessments follows no specific pattern, and logic would tell us that it would be more tiring to analyze the grades given by other students than to perform the assessment task assigned to oneself.
In the case of global peer assessments, where the allocation of grades was an assignment, it may have been the case that some students chose one of the previous entries in Twitter and made slight modifications to each grade. Nevertheless, if this had in fact occurred, we would have failed to observe the characteristic binomial distribution that was previously mentioned. Therefore, the authors believe that this effect has not taken place.
It is true, however, that if the professor passes around a sheet in class on which each student must write down a grade, a "memory" effect appears. Thus, the overall marks tend to resemble the first grade, since all the students know what the perceptions of the previous classmates are. In this way, classmate '2' assesses in a way similar to classmate '1'; classmate '3' assesses similarly to classmates '2' and '1', and so on. This means that peer assessment results obtained by means of public data are not very reliable or desirable. Fortunately, the authors have found that, with the use of social networks, this effect dissipates somewhat.
If there are a large number of evaluations per person, most likely more than five, a high dispersion of results is observed, even if the task is simple and easy to complete. It should be noted that there are differences between the point of view of the student evaluator (four different evaluations carried out by the same person) and that of the evaluated student (evaluated by four different people).
From the point of view of the student evaluators, in the first experience, the authors noticed that students spent a certain amount of time when carrying out their first assessment. However, subsequent assessments were faster, but less accurate. Therefore, a reasonable number of peer assessments that students should be asked to do in order to obtain reliable grading results is around three or four. We estimate that above this number, student evaluators will resort to random grading. In fact, in spite of its demonstrated virtues, peer assessment is limited by the number of persons engaged: the more people perform the evaluation, the less reliable the results are, which results in a greater dispersion of the ratings assigned by reviewers.
Conversely, from the point of view of the evaluated students, an optimal number of reviewers is not thought to exist. In the second experience, the four evaluations are more or less in agreement with one another in terms of the rating assigned by each evaluator, and the deviation is reasonable. Thus, the authors can conclude that between three and five evaluations generate reliable assessments and, above this number, the quality of the assessment progressively deteriorates.
Finally, in Figure 2, it is curious that the results are almost always positive; i.e., more favorable grades are assigned by the professor. Thus, this confirms that the grades given by the professors are, in general, higher than those given by classmates.

CONCLUSIONS
Based on the study results, the authors conclude the following:
a) If the number of peer assessments assigned to the students is reasonably low (around two or three), the students assign a grade for the task that is quite similar to that which would be assigned by the course professor.
b) As seen in Figure 2, if the professor's grade is added to the calculation (thus increasing the number of evaluators from 3 to 4), the overall distance from the mean decreases in most cases.
c) From Figure 3, we can infer that, given the low deviation in the grades given by the students and the low error rate of the professors, the resulting grade could be truly representative of what each student has learned on a scale of 0 to 10. In addition, we can conclude that there is not a clear binomial distribution, a significant sign that students did not carry out a random evaluation for each assessment completed.
d) From Figure 5, we conclude that assessments made by few students (the 3-peer assessment in our first case) for the same explanation result in less difference between the highest and lowest grades than those carried out by the whole group (the 37-peer assessment in our second case).
e) From Figure 7, we can infer that the deviations of the grades given by the students are quite high, and a clear binomial distribution is evident. This is a clear sign that, for each assessment process, the data collected came from the students' own random (or at least pseudorandom) assessments.
f) The remarkable dispersion of grades in the case of multiple peer assessments causes us to suspect that the students have not actually carried out the assessment activity; rather, they have simply recorded numbers instead of giving reasons for their marks following a detailed reading of the explanations given by their peers. This is the reason for the huge differences (as much as nine points) in grades given for the same explanation, resulting in an evaluation range from zero to ten. In addition, more than half of the explanations exhibit differences of up to six points that, on average, fall three points above and three below the average value.
g) With regard to the previous point, it is important to highlight that course students perceived the task of performing so many peer assessments to be excessive. Thus, they completed the task, but not in a serious manner, in terms of "scientific" peer assessment. They did not even use the evaluations previously conducted and published by other classmates as a reference.
h) In fact, the number of peer assessments that students can reasonably be asked to perform, producing "reliable" grades that can be taken into account, is about three.
i) By using statistical tools such as standard deviation, averages, variances, interpolations, etc., it is possible to determine the quality of the peer assessment carried out by students, especially in light of the impracticality of evaluating each on an individual basis and the fact that it has not been established as a general approach.
j) With more than 3 peer assessments, instead of carrying out the desired learning process, students tend to assign a simple sequence of numbers with little sense and no actual grading value.
k) In the case of mass peer assessments, the average grade assigned by student evaluators is 6.0. However, in the case of 3 peer assessments, this average mark is noticeably higher, i.e., 8.1.
l) In the case of mass peer assessments, the trend was towards fairly similar grades in all cases, which is cause to suspect that the applicable assumptions of the law of large numbers cannot be valid, and thus the results are not reliable.
m) The subject of peer assessment has been well documented, and the results reported in this article are predictable from a logical point of view.

Figure 2. Standard deviation of the grades given by students and the effect when the instructor's grade is added (only deviations >0 are shown)

Figure 5. Comparison of peer assessments carried out for either 3 or 37 peers (whole-class group)

Figure 7. Distribution of the grades assigned to each explanation. a) Large spread of scores, b), c) and d) Higher or lower dispersion

Figure 8. Marks obtained versus number of peer assessments