
ESTIMATING MEASUREMENT ERROR AND SCORE DEPENDABILITY IN EXAMINATIONS USING GENERALIZABILITY THEORY




ABSTRACT

This study investigated the estimation of measurement error and score dependability in examinations using Generalizability Theory. Because the scores obtained by students (the objects of measurement) in examinations were affected by multiple sources of error (facets), and because these scores were used in making relative and absolute decisions about the students, there was a need to estimate the measurement error and the dependability of the scores. This was to establish the extent of the contributions of these sources of error (facets) to examination scores. Four research questions and two hypotheses were posed to guide the study. The population of the study comprised 25,530 senior secondary three (SS3) students in public secondary schools in Rivers State for the 2011/2012 academic year. The sample consisted of 2,553 SS3 students selected through the proportionate stratified random sampling technique. A Mathematics Achievement Test with items drawn from past WAEC and NECO SSCE questions was used for data collection. EduG version 6.0-e, based on ANOVA and Generalizability Theory, was used to answer the four research questions. A 95% confidence interval was computed using the standard errors of the variance components to determine whether there was a significant difference in the contributions and effects of the facets and their interactions on measurement error and score dependability in examinations. The findings of the study revealed that some hidden sources of error were at play. The residual made the highest contribution to measurement error, followed by the student factor. Similarly, the residual and student variance components differed significantly (p < 0.05) in their contributions to measurement error in examination scores. Conversely, questions and invigilators did not differ significantly in their contributions and effects on measurement error and score dependability in examinations (p > 0.05).
Invigilators and the students × invigilators interaction had the highest effect on score dependability in examinations. The findings also revealed that increasing the number of invigilators to 90 increased the generalizability coefficient (Ep2) and the index of dependability (Ø), which rank-ordered students and classified them based on their performance, irrespective of the performance of other students. Generalizability theory therefore provided a framework for evaluating multiple sources of variability in examination scores and for deriving implications for test development and test-score interpretation. It is therefore recommended that enough invigilators be recruited when conducting examinations, so as to minimize error and maximize score reliability.

CHAPTER ONE

INTRODUCTION

Background of the Study

Measurement pervades almost every aspect of modern society. Nworgu (2003) described measurement as the process of assigning numerical values to describe features or characteristics of objects, persons or events in a systematic manner. Measurement involves assigning figures, numerical quantities or scores to variables or traits of interest. For example, a great variety of things about individuals (achievement, aptitude, intelligence, height, weight) are measured by various people, such as teachers and doctors, on a regular basis. At a glance, obtaining scores for these attributes seems quite simple, but unfortunately a major problem with many kinds of measurement is that there is often no basis to assume that the numerical value provided accurately and truthfully represents the underlying quantity of interest. Because the results of these measurements can have a profound influence on an individual’s life, it is important to understand how these scores are derived and the accuracy of the information they contain.

Teachers give tests to determine what students know and are able to do in a particular content area. If there is confidence in a test, the belief is that a student who scores high on the test knows more in that area than a student who scores low. In like manner, two students whose scores are similar probably have roughly the same level of ability in the area being tested. Two questions therefore arise: how much confidence should we have in any particular test, and how should our level of confidence in the test affect the way we think about students’ scores? No test, however well designed, can measure a student’s true ability, because numerous factors interfere with our ability to measure it accurately and precisely. Among these factors, questions, examiners, invigilators and factors related to students are sources of “measurement error” or “random error”. Every score obtained by a student has some amount of error in it. The goal is to create tests that have as little measurement error as possible. Measurement error is the result of random fluctuations due to the choice of a particular sample of questions or conditions of observation.

Measurement error is a situation in which a student’s true ability or achievement is either underestimated or overestimated (Johnson, Dulany & Banks, 2000). According to Hofman (2005), measurement error is the difference between the distorted information and the undistorted information about a measured product, expressed in its physical quantity. An error is defined as the real (untrue, wrong, false, no-go) value at the output of a measurement system minus the ideal (true, good, right, go) value at the input of a measurement system, mathematically expressed as:

X = Xr − Xi

where

X is the measurement error,

Xr is the real (untrue) measurement value, and

Xi is the ideal (true) measurement value.
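As a minimal numeric illustration of this definition (the scores here are hypothetical, not drawn from the study's data):

```python
# Hypothetical scores illustrating X = Xr - Xi.
x_real = 72    # Xr: the recorded (real, possibly distorted) score
x_ideal = 68   # Xi: the ideal (true) score
x_error = x_real - x_ideal  # X: the measurement error
print(x_error)  # 4
```

A positive error means the student's true achievement is overestimated; a negative error means it is underestimated.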

A measurement under ideal conditions has no error. Regarding measurement error, it is important to emphasize that it is scores, not tests, that are either reliable or unreliable (Thompson, 1994; Vacha-Haase, 1998). Test scores are not a definitive measure of a student’s knowledge or skills. An examinee’s score can be expected to vary across different versions of a test because of differences in the way graders evaluate students’ responses and differences in transitory factors such as the student’s attentiveness on the day the test was taken. For these reasons, no single test score can be a perfectly dependable indicator of a student’s performance. Measurement (random) errors can result from the way the test is designed, from factors related to individual students or the testing situation, and from many other sources, such as the mood of examiners, the time of the test (occasion), the test environment, invigilators, and the changing order of the questions, all of which may lead to higher or lower scores (Johnson, Dulany & Banks, 2000). Some test items or questions may be biased in favour of or against particular groups of students.

The need for estimating measurement error arises because of the inconsistencies in measurements, especially those involving multiple sources of error. The low performance of students in examinations calls for the estimation of multiple sources of error, so as to determine the contributions to error of the different facets in examinations, see how these errors can be minimized or eliminated, and hence increase the reliability of examination scores. In this study, the facet of interest is students, referred to as the differentiation facet because they are the objects of measurement. The other facets are the test questions (q) and the invigilators (i), referred to as instrumentation facets because they contribute to error variance.

In support of the reliability of test scores, Thompson and Vacha-Haase (2000) made a case for characterizing reliability in terms of scores, not tests, stressing that the argument that reliability is a function of scores, not of the test itself, is not a mere phrase. The following arguments gave rise to their position. In estimating measurement error, the emphasis is on whether scores, not tests, are reliable or unreliable (Thompson, 1994; Vacha-Haase, 1998). It is more appropriate to speak of the reliability of ‘test scores’ or of the ‘measurement’ than of the ‘test’ or the ‘instrument’ (Gronlund & Linn, 1990). Reinhardt (1996) emphasized the importance of reliability in discovering effects in substantive research, stressing that if the scores obtained from a dependent variable are perfectly unreliable, the effect size in the study will be zero, and hence the result will not be statistically significant at any sample size.

After instruction, students are given tests to find out if they have mastered what they were taught. This form of test is called an achievement test. It is a test to determine the cognitive ability of the student. Nworgu (2003) defined an achievement test as a test designed to measure the level of accomplishment in a specified programme of instruction in a subject area or occupation which a student had undertaken in the recent past. Similarly, Nwana (1979) described a test given to determine how much the pupils have learned as an achievement test. The daily class tests, the weekly tests, the end-of-year examinations, and end-of-programme examinations such as the First School Leaving Certificate Examination, the West African Senior School Certificate Examination and the National Examinations Council’s Senior School Certificate Examination are all examples of achievement tests. However, an achievement test is only relevant if it determines how much the students have learned after instruction.

When an achievement test is given to students, it is expected that the test scores will exhibit some degree of consistency or stability. In assessing the quality of data collected from an achievement test, a researcher is confronted with questions such as: Is the right thing being measured? How consistent is the measurement? This leads to the issue of the reliability of a test. Reliability is the property of a set of test scores that indicates the amount of error associated with the scores. According to Nworgu (2003), reliability is the proportion of the total variance of the test that is due to error variance, stressing that any condition which is irrelevant to the purpose of the test constitutes a source of error. Accordingly, poor reliability can reduce statistical power (Onwuegbuzie & Daniel, 2000) and potentially result in inappropriate conclusions concerning any research finding (Thompson, 1994). Dawis (1987) emphasized that reliability is a function of the sample of the population that took the test as well as of the instrument, insisting that reliability should be evaluated for both the sample from the intended target population (students) and the instrument, an obvious but sometimes overlooked point.

Reinhardt (1996) explained that both the characteristics of the sample of persons selected and the characteristics of the test items can affect reliability. Score reliability may therefore vary depending on the characteristics of the sample from which the scores are obtained, including differential impact from the same or different sample types. Because score reliability can vary from study to study, Vacha-Haase (1998) presented reliability generalization as a method for examining measurement error across studies. The consideration of what factors constitute error variance is specific to the particular test under consideration. Since examinations involve more than one major random factor (facet), a single reliability estimate is not adequate. Similarly, in using examinations to make decisions about the objects of measurement, emphasis is always on the interpretation of both norm- and criterion-referenced decisions about students. There is therefore a need to determine the reliability of the examinees (students), the test questions, the invigilators, and the overall reliability of examination scores. Students’ performances may vary across samples of assessment occasions, the attitude of invigilators, and the nature and changing nature of the test questions. Once the measurement error due to these sources of error is observed, the statistical framework of generalizability (G) theory can be brought to bear on the technical quality of performance assessment scores (Brennan, 1991). This context requires a generalizability (G) theory approach that can analyze more than one source of measurement error simultaneously, in addition to the object of measurement (Lee, 2005). The identification and reduction of measurement errors is a major challenge in psychological testing. Researchers still rely on classical test theory for assessing reliability, despite recommendations by some researchers that Generalizability Theory be used to estimate the various sources of error in examinations.

Feldt and Brennan (1998) observed that, in recognition of classical test theory’s view of error as undifferentiated, generalizability (G) theory provided an answer to the problem of multiple sources of measurement error. In contrast to test construction within the classical test theory framework, generalizability (G) theory offers new possibilities for evaluating test scores. Generalizability theory highlights both validity and reliability issues (Ødegard, Hagtvet & Bjørkly, 2008). This is in agreement with Kane’s (1982) treatment of a sampling model for validity, in which Kane made an explicit link between G theory and issues traditionally subsumed under validity; this is seen as a major contribution to the literature on G theory in the last 25 years.

Generalizability theory is a conceptual and statistical framework for analyzing more than one facet in investigations of measurement error and score dependability (Brennan, 2000, 2001; Shavelson & Webb, 1991). It involves a generalizability (G) study and a decision (D) study. Through this two-stage investigation, generalizability (G) theory enables researchers to disentangle multiple sources of error and investigate the impact of various changes in the measurement design on score reliabilities. In addition, researchers can evaluate the relative importance of various sources of measurement error and interpret score reliability from both norm- and criterion-referenced perspectives. With data collected in a generalizability (G) study, an observed measurement can be decomposed into a component, or effect, for the universe score and one or more error components. The effect of each of the independent variables, and of their interactions, can then be tested by an F-ratio, which is the mean square for the effect divided by the mean square for the appropriate error term. In generalizability theory, analysis of variance (ANOVA) is used to compute the mean squares used in estimating variance components. However, instead of using these to test hypotheses, a generalizability (G) study uses them to generate variance components (Rae & Hyland, 2001). Besides looking at the conventional F statistics to establish whether each facet makes a significant contribution to the scores, the analysis computes variance components, which reflect the size rather than the statistical significance of the contribution of each facet to the observed scores (Mitchell, 1979).
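The decomposition just described can be sketched in code. The following Python sketch (an illustration of the ANOVA expected-mean-squares method, not the EduG program used in the study; the function name and the simulated scores are the writer's own) estimates the seven variance components of a fully crossed students × questions × invigilators design with one observation per cell:

```python
import numpy as np

def g_study(X):
    """Estimate random-effects variance components for a fully crossed
    students x questions x invigilators (s x q x i) G-study design with
    one observation per cell, via the ANOVA (expected mean squares) method."""
    ns, nq, ni = X.shape
    grand = X.mean()

    # Marginal means for each facet and each pair of facets
    m_s = X.mean(axis=(1, 2))
    m_q = X.mean(axis=(0, 2))
    m_i = X.mean(axis=(0, 1))
    m_sq, m_si, m_qi = X.mean(axis=2), X.mean(axis=1), X.mean(axis=0)

    # Mean squares (sums of squares divided by their degrees of freedom)
    ms_s = nq * ni * ((m_s - grand) ** 2).sum() / (ns - 1)
    ms_q = ns * ni * ((m_q - grand) ** 2).sum() / (nq - 1)
    ms_i = ns * nq * ((m_i - grand) ** 2).sum() / (ni - 1)
    ms_sq = ni * ((m_sq - m_s[:, None] - m_q[None, :] + grand) ** 2).sum() \
        / ((ns - 1) * (nq - 1))
    ms_si = nq * ((m_si - m_s[:, None] - m_i[None, :] + grand) ** 2).sum() \
        / ((ns - 1) * (ni - 1))
    ms_qi = ns * ((m_qi - m_q[:, None] - m_i[None, :] + grand) ** 2).sum() \
        / ((nq - 1) * (ni - 1))
    resid = (X - m_sq[:, :, None] - m_si[:, None, :] - m_qi[None, :, :]
             + m_s[:, None, None] + m_q[None, :, None] + m_i[None, None, :]
             - grand)
    ms_res = (resid ** 2).sum() / ((ns - 1) * (nq - 1) * (ni - 1))

    # Solve the expected-mean-squares equations for the variance
    # components; negative estimates are truncated to zero as usual.
    return {
        'sqi,e': ms_res,                                  # residual
        'sq': max((ms_sq - ms_res) / ni, 0.0),
        'si': max((ms_si - ms_res) / nq, 0.0),
        'qi': max((ms_qi - ms_res) / ns, 0.0),
        's':  max((ms_s - ms_sq - ms_si + ms_res) / (nq * ni), 0.0),
        'q':  max((ms_q - ms_sq - ms_qi + ms_res) / (ns * ni), 0.0),
        'i':  max((ms_i - ms_si - ms_qi + ms_res) / (ns * nq), 0.0),
    }

# Demo on simulated scores with known components:
# sigma2(s) = 1.0, sigma2(q) = 0.5, residual = 2.0, all others 0.
rng = np.random.default_rng(0)
ns, nq, ni = 200, 30, 10
X = (50
     + rng.normal(0, 1.0, ns)[:, None, None]            # student effect
     + rng.normal(0, np.sqrt(0.5), nq)[None, :, None]   # question effect
     + rng.normal(0, np.sqrt(2.0), (ns, nq, ni)))       # residual
v = g_study(X)
```

On the simulated data, the recovered components fall close to the values built into the simulation, which is the sense in which the variance components "reflect the size" of each facet's contribution.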

Brennan (1997) observed that G theory has no substantive role for hypothesis testing; rather, it emphasizes the estimation of random-effects variance components. However, since this study uses sample scores to generalize to the entire population, two hypotheses are stated to determine whether a significant difference exists in the contributions of the facets to measurement error and score dependability in examinations. In G theory, the sources of variation are associated with the persons being measured (the objects of measurement) and with potential sources of error arising from the testing situation, such as questions, invigilators, examiners and occasions. These are called facets (rather than factors), and each facet is composed of one or more levels or conditions. Although the choice and number of facets in a G study may vary according to the interests of the researcher, the object of measurement is always included as a distinct facet. In G theory, the objects of measurement are the entities that are measured; in most testing contexts these are persons, but they could also be other entities such as classrooms or schools.

In the generalizability (G) study, the variances associated with the different facets of measurement (in this case questions (q), invigilators (i), and the object of measurement, students (s)) are estimated and evaluated in terms of their relative importance in contributing to the total score variance, given a universe of admissible observations (Brennan, 2001). The goal of a generalizability analysis is to estimate each of these variance components and their contributions to the total observed variance. Of the variance components estimated in this study, only the differentiation facet, the students, produces relevant variance; all other sources are extraneous to the purpose of testing. Replacing the “true score” with the “universe score” shows that the researcher is making inferences from a sample of possible observations; the choice of the universe emphasizes that there is more than one universe to which an investigator might wish to generalize. To be precise, if a student is tested on an achievement measure, he or she may be tested in a population of circumstances involving different question forms, times of the day (occasions), attitudes of invigilators, and moods of examiners. Thus, in generalizability (G) theory, any measurement of an attribute is considered to be a sample from some large set, or universe, of possible measurements: the universe of admissible observations, defined by observation in all possible combinations of circumstances. Circumstances of the same kind (a facet) may consist of two or more categories or conditions within the universe of admissible observations (Guion & Ironson, 1979).

According to Webb, Shavelson and Haertel (2007), in a generalizability (G) study a behavioural measurement (e.g. an achievement test score) is conceived of as a sample from a universe of admissible observations, which consists of all possible acceptable substitutes for the observation in hand. A researcher may wish to generalize the result of a measurement to only a limited portion of the overall universe defined by these facets; that portion is the universe of generalization (Guion & Ironson, 1979). In this study, which is on estimating measurement error and score dependability in examinations using Generalizability Theory, possible facets include question forms, time of the examination (occasion), invigilators, examiners, or combinations of these facets. In designing a study, all facets that might influence scores should be identified. An important feature of G theory is that the relative contributions of individual sources of measurement error to the overall error variance can be investigated, including the interactions between the sources (compounded in error). Generalizability theory is used to conduct a simultaneous analysis of multiple sources of measurement error and score dependability for a single test or examination (Ofqual, 2009).

G theory ensures that one can design a study to estimate the contribution to error that each of these different error sources makes and, moreover, to find a generalizability coefficient that specifically accounts for whichever source of error is of particular interest (Lang, 1978). This study will find a coefficient of generalizability and an index of dependability which will show how the evaluation of students’ performance in examinations generalizes over the number of invigilators and the number of questions. It will also help the researcher to decide whether the number and strictness of the invigilators are sufficient to obtain a dependable score, and how many questions or invigilators are needed to obtain a satisfactory generalizability coefficient (Ep2) and index of dependability (Ø).
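This projection step, the D-study, can be sketched as follows for a design in which students are crossed with random samples of questions and invigilators. The variance components below are made up for illustration (they are not the study's results), and the function name is the writer's own:

```python
def d_study(v, n_q, n_i):
    """Project the generalizability coefficient (Ep2) and the index of
    dependability (Phi, written O-slash in the text) for a D-study that
    averages over n_q questions and n_i invigilators, given G-study
    variance components v keyed 's', 'q', 'i', 'sq', 'si', 'qi', 'sqi,e'."""
    # Relative error: only components interacting with students disturb
    # the rank ordering used for norm-referenced (relative) decisions.
    var_rel = v['sq'] / n_q + v['si'] / n_i + v['sqi,e'] / (n_q * n_i)
    # Absolute error adds the instrumentation main effects and their
    # interaction, since criterion-referenced (absolute) decisions
    # depend on the absolute level of the score.
    var_abs = var_rel + v['q'] / n_q + v['i'] / n_i + v['qi'] / (n_q * n_i)
    ep2 = v['s'] / (v['s'] + var_rel)   # generalizability coefficient
    phi = v['s'] / (v['s'] + var_abs)   # index of dependability
    return ep2, phi

# Illustrative (made-up) variance components
v = {'s': 1.0, 'q': 0.3, 'i': 0.1, 'sq': 0.4, 'si': 0.2,
     'qi': 0.05, 'sqi,e': 2.0}
for n_i in (2, 30, 90):
    ep2, phi = d_study(v, n_q=30, n_i=n_i)
    print(f"invigilators={n_i}: Ep2={ep2:.3f}, Phi={phi:.3f}")
```

Running the loop shows the pattern the study reports: holding the number of questions fixed, increasing the number of invigilators shrinks both error variances and so raises both Ep2 and Ø, with Ø always at or below Ep2 because absolute error includes additional components.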

Statement of the Problem

In measuring students’ performance in a given examination, either relative to that of other students or in an absolute sense, there are characteristics other than the student factor that affect the scores students make in examinations. These characteristics, called sources of error, such as test questions and invigilators, contribute to error in the measurement of students’ achievement and affect the dependability of the scores from these measurements. There is a need to find out their contributions to measurement error in examination scores. Estimating measurement error and score dependability in examinations involves a multifacet approach; therefore, classical test theory, which addresses only one source of measurement error, is not fit to be used in assessing the effects of multiple sources of error. The advantages of G theory over classical reliability theory are most obvious when more than one random facet is involved. Observed scores in examinations are affected by factors other than the students. Specific factors (facets) such as test questions and invigilators, among others, are likely to affect the reliability of an observed score in examinations. The impact of these factors leads to questions about the accuracy, the precision and, ultimately, the fairness of the scores obtained by students in examinations.

Since the scores obtained by the objects of measurement, the students, in examinations are affected by multiple sources of error, and since scores from examinations are used in making relative and absolute decisions concerning students, there is a need to estimate the measurement error and score dependability of examinations using Generalizability (G) Theory, so as to determine the contributions to error of these facets in examination measurement situations, with a view to minimizing errors and maximizing the reliability of the scores. The problem of this study, therefore, is how to estimate measurement error and score dependability in examinations using Generalizability Theory.

Purpose of the Study

The purpose of the study was to estimate measurement error and score dependability in examinations using generalizability theory. Specifically, the study was designed to:

1.    ascertain the contributions of the facets students (s), questions (q) and invigilators (i), and their interactions, to measurement error in examination scores;

2.    determine the effects of the instrumentation facets questions (q) and invigilators (i), and their interactions, on score dependability in examinations;

3.    determine the extent to which generalizability coefficients show the degree to which students maintain their rank order across the facets questions (q) and invigilators (i) in examination scores;

4.    determine the extent to which the dependability index expresses the degree of performance of students, irrespective of the performance of others, across the facets questions (q) and invigilators (i) in examination scores.

Significance of the Study

The study is theoretically significant. Generalizability theory, anchored on the view that an observed score is the sum of an unobservable true score and multiple error components, emphasizes that the errors from the estimated variance components are of great importance. The results of the generalizability coefficients and the dependability index are used to make inferences about the objects of measurement, the students, such as selection, placement and certification. The result of the study showed that generalizability theory is a more reliable method of estimating measurement error than classical test theory: G theory estimates multiple sources of error simultaneously, in order to determine the contribution of each source to measurement error in examination scores, whereas CTT considers only one source of error at a time. To a large extent, generalizability theory provided a more unified approach for assessing the reliability of examination scores than classical test theory. The result of this study should encourage the application of generalizability theory for reliability estimation, against the continued use of classical test theory.

The practical significance of the study is that the findings are useful to public examination bodies, test item writers, research students and teachers. Public examination bodies such as the West African Examinations Council, the National Examinations Council and the Joint Admissions and Matriculation Board benefited from the findings of the study, in that they provide very specific information about measurement error and how to successfully design their examinations. Test item writers equally benefited from the findings, which enable them to write test items that are suitable for specific purposes with minimum measurement error, items that can be used to make decisions concerning students and that can distinguish between students of different achievement levels. Research students also benefited from the findings, which provide them with knowledge of the relative importance of the various sources of error and with information on how to design efficient measurement procedures; the study also opens an avenue for further research on generalizability theory. For classroom teachers, the findings acquaint them with the existence of multiple errors in examinations; hence, to maximize reliability and to reduce or eliminate error in examinations, there is a need to estimate as many sources of error as is economically viable, in order to determine the level of involvement of each source of error in the scores obtained in examinations. The study guided teachers on how to estimate multiple sources of error which are part of the examination process but not related to the construct being measured.

Scope of the study

The study was restricted to estimating measurement error and score dependability in examinations using the generalizability theory approach. The study specifically ascertained and determined the contributions and effects of the facets students, questions and invigilators, and their interactions, on measurement error and score dependability in examinations. It also determined how these facets were used in making relative and absolute decisions concerning the objects of measurement (students). Two facets, questions and invigilators, together with the object of measurement, students, were used for the study, despite the numerous facets mentioned in the background of the study. The justification for this was the researcher’s decision to use the most important facets over which one might wish to generalize, namely questions and invigilators. The area covered by the study was restricted to SS3 students in the 2011/2012 academic session in publicly owned secondary schools in Rivers State.

Research Questions

1.         What are the contributions of the facets students (s), questions (q) and invigilators (i), and their interactions, to measurement error in examination scores?

2.         What are the effects of the instrumentation facets questions (q) and invigilators (i), and their interactions, on score dependability in examinations?

3.         To what extent do the generalizability coefficients show the degree to which students maintain their rank order across the facets questions (q) and invigilators (i) in examination scores?

4.         To what extent does the dependability index express the degree of performance of students, irrespective of the performance of others, across the facets questions (q) and invigilators (i) in examination scores?

Hypotheses

Two hypotheses were formulated and tested for significance at the 0.05 level.

Ho1: There is no significant difference in the contributions of the facets students (s), questions (q) and invigilators (i), and their interactions, to measurement error in examination scores.

Ho2: There is no significant difference in the effects of the instrumentation facets questions (q) and invigilators (i), and their interactions, on score dependability in examinations.

