ABSTRACT
This study investigated the estimation of measurement error and score dependability in examinations using Generalizability Theory. Because the scores obtained by students (the objects of measurement) in examinations are affected by multiple sources of error (facets), and because these scores are used in making relative and absolute decisions about students, there was a need to estimate the measurement error and the dependability of the scores, and thereby to determine the extent of the contributions of these sources of error (facets) to examination scores. Four research questions and two hypotheses were posed to guide the study. The population of the study comprised 25,530 senior secondary three (SS3) students in public secondary schools in Rivers State for the 2011/2012 academic year. The sample consisted of 2,553 SS3 students selected through the proportionate stratified random sampling technique. A Mathematics Achievement Test with items drawn from past WAEC and NECO SSCE questions was used for data collection. EduG version 6.0-e, based on ANOVA and Generalizability Theory, was used to answer the four research questions. A 95% confidence interval was computed using the standard errors of the variance components to determine whether there was a significant difference in the contributions and effects of the facets and their interactions on measurement error and score dependability in examinations. The findings of the study revealed that some hidden sources of error were at play. The residual made the highest contribution to measurement error, followed by the student factor. Similarly, the residual and the student variance components differed significantly (p < 0.05) in their contributions to measurement error in examination scores. Conversely, questions and invigilators did not differ significantly in their contributions and effects on measurement error and score dependability in examinations (p > 0.05). Invigilators and the students × invigilators interaction had the highest effect on score dependability in examinations. The findings also revealed that increasing the number of invigilators to 90 increased the generalizability coefficient (Ep²) and the index of dependability (Φ), which rank-ordered students and classified them based on their performance, irrespective of the performance of other students. Generalizability theory therefore provided a framework for evaluating multiple sources of variability in examination scores and for deriving implications for test development and test score interpretation. It is therefore recommended that enough invigilators be recruited in conducting examinations, so as to minimize error and maximize score reliability.
CHAPTER ONE
INTRODUCTION
Background of the Study
Measurement pervades almost every aspect of modern society. Nworgu (2003) described measurement as the process of assigning numerical values to describe features or characteristics of objects, persons or events in a systematic manner. Measurement involves assigning figures, numerical quantities or scores to variables or traits of interest. For example, a great variety of things about individuals (achievement, aptitude, intelligence, height, weight) are measured by various people, such as teachers and doctors, on a regular basis. At a glance, obtaining scores for these attributes seems quite simple, but unfortunately, a major problem with many kinds of measurement is that there is often no basis to assume that the numerical value provided accurately and truthfully represents the underlying quantity of interest. Because the results of these measurements can have a profound influence on an individual’s life, it is important to understand how these scores are derived and the accuracy of the information they contain.
Teachers give tests to determine what students know and are able to do in a particular content area. If there is confidence in a test, the belief is that a student who scores high on the test knows more in that area than a student who scores low. In like manner, two students whose scores are similar probably have roughly the same level of ability in the area being tested. Two questions therefore arise: how much confidence should we have in any particular test, and how should our level of confidence in the test affect the way we think about students’ scores? Yet no test, however well designed, can measure a student’s true ability, because numerous factors interfere with our ability to measure it accurately and precisely. Among these factors are questions, examiners, invigilators and factors related to the students themselves; these are sources of “measurement error” or “random error”. Every score obtained by a student has some amount of error in it. The goal is to create tests that have as little measurement error in them as possible. Measurement error is the result of random fluctuations due to the choice of a particular sample of questions or conditions of observation.
Measurement error is a situation in which a student’s true ability or achievement is either underestimated or overestimated (Johnson, Dulany & Banks, 2000). According to Hofman (2005), measurement error is the difference between the distorted information and the undistorted information about a measured product, expressed in its physical quantity. An error is defined as the real (untrue, wrong, false, no-go) value at the output of a measurement system minus the ideal (true, good, right, go) value at the input of the measurement system, mathematically expressed as:
X = Xr − Xi
where:
X is the measurement error,
Xr is the real (untrue) measurement value, and
Xi is the ideal (true) measurement value.
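As a simple hypothetical illustration of this formula: if the ideal (true) value of a student’s achievement is Xi = 70 marks but the measurement system records Xr = 65 marks, then the measurement error is X = 65 − 70 = −5; the student’s true achievement has been underestimated by 5 marks.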
A measurement under ideal conditions has no error. Regarding measurement error, it is important to emphasize that it is test scores, not tests themselves, that are reliable or unreliable (Thompson, 1994; Vacha-Haase, 1998). Test scores are not a definitive measure of students’ knowledge or skills. An examinee’s score can be expected to vary across different versions of a test because of differences in the way graders evaluate students’ responses and differences in transitory factors, such as the student’s attentiveness on the day the test was taken. For these reasons, no single test score can be a perfectly dependable indicator of a student’s performance. Measurement (random) errors can result from the way the test is designed, from factors related to individual students or the testing situation, and from many other sources, such as the mood of examiners, the time of the test (occasion), the test environment, invigilators, and the changing order of the questions, all of which may lead to higher or lower scores (Johnson, Dulany & Banks, 2000). Some test items or questions may be biased in favour of or against particular groups of students.
The need for estimating measurement error arises because of the inconsistencies in measurements, especially those involving multiple sources of error. The low performance of students in examinations calls for the estimation of multiple sources of error, so as to determine the contributions to error of the different facets in examinations, and then to see how these errors can be minimized or eliminated and hence increase the reliability of examination scores. In this study, the facet of interest is students, referred to as the differentiation facet because students are the objects of measurement. The other facets are the test questions (q) and invigilators (i), referred to as instrumentation facets because they contribute to error variance.
In support of the reliability of test scores, Thompson and Vacha-Haase (2000) made a case for characterizing reliability in terms of scores, not tests, stressing that the argument that reliability is a function of scores rather than of the test itself is not a mere phrase. The following arguments gave rise to their position. In estimating measurement error, emphasis is on whether scores, not tests, are reliable or unreliable (Thompson, 1994; Vacha-Haase, 1998). It is more appropriate to speak of the reliability of ‘test scores’ or the ‘measurement’ than of the ‘tests’ or the ‘instrument’ (Gronlund & Linn, 1990). Reinhardt (1996) emphasized the importance of reliability in discovering effects in substantive research, stressing that if the scores obtained on a dependent variable are perfectly unreliable, the effect size in the study will be zero; hence the result will not be statistically significant at any sample size.
After instruction, students are given tests to find out if they have mastered what they were taught. This form of test is called an achievement test. It is a test to determine the cognitive ability of the student. Nworgu (2003) defined an achievement test as a test designed to measure the outcome, or level of accomplishment, in a specified programme of instruction in a subject area or occupation which a student had undertaken in the recent past. Similarly, Nwana (1979) described a test given to determine how much the pupils have learned as an achievement test. The daily class tests, the weekly tests, the end-of-year examinations and end-of-programme examinations, such as the First School Leaving Certificate Examination, the West African Senior School Certificate Examination and the National Examinations Council’s Senior School Certificate Examination, are all examples of achievement tests. However, an achievement test is only relevant if it determines how much the students have learned after instruction.
When an achievement test is given to students, it is expected that the test scores will exhibit some degree of consistency or stability. In assessing the quality of data collected from an achievement test, a researcher is confronted with questions such as: Is the right thing being measured? How consistent is the measurement? This leads to the issue of the reliability of a test. Reliability is the property of a set of test scores that indicates the amount of error associated with the scores. According to Nworgu (2003), reliability is the proportion of the total variance of the test that is due to error variance, stressing that any condition which is irrelevant to the purpose of the test constitutes a source of error. Accordingly, poor reliability can reduce statistical power (Onwuegbuzie & Daniel, 2000) and potentially result in inappropriate conclusions concerning research findings (Thompson, 1994). Dawis (1987) emphasized that reliability is a function of the sample of the population that took the test as well as of the instrument, insisting that reliability should be evaluated both for the sample from the intended target population (students) and for the instrument, an obvious but sometimes overlooked point.
Reinhardt (1996) explained that both the characteristics of the sample of persons selected and the characteristics of the test items can affect reliability. Score reliability may therefore vary depending on the characteristics of the sample from which the scores are obtained, including differential impact from the same or different sample types. Because score reliability can vary from study to study, Vacha-Haase (1998) presented reliability generalization as a method for examining measurement error across studies. The consideration of what factors constitute error variance is specific to the particular test under consideration. Since examinations involve more than one major random factor (facet), a single reliability estimate is not adequate. Similarly, in using examinations to make decisions about the objects of measurement, emphasis is always on the interpretation of both norm- and criterion-referenced decisions about students. Therefore, there is a need to determine the reliability of the examinees (students), the test questions, the invigilators, and the overall reliability of examination scores. Students’ performances may vary across samples of assessment occasions, the attitude of invigilators, and the changing nature of the test questions. Once the measurement error due to these sources of error is observed, the statistical framework of generalizability (G) theory can be brought to bear on the technical quality of performance assessment scores (Brennan, 1991). This context requires employing a generalizability (G) theory approach that can analyze more than one source of measurement error simultaneously, in addition to the object of measurement (Lee, 2005). The identification and reduction of measurement errors is a major challenge in psychological testing. Researchers still rely on classical test theory for assessing reliability, despite recommendations by some researchers that Generalizability Theory be used to estimate the various sources of error in examinations.
Feldt and Brennan (1998) observed that, in recognition of classical test theory’s view of error as undifferentiated, generalizability (G) theory provided an answer to the problem of multiple sources of measurement error. In contrast to test construction within the classical test theory framework, generalizability (G) theory gives new possibilities for evaluating test scores. Generalizability theory highlights both validity and reliability issues (Ødegård, Hagtvet & Bjørkly, 2008). This is in agreement with Kane’s (1982) treatment of a sampling model for validity. In that article, Kane made an explicit link between G theory and issues traditionally subsumed under validity, which is seen as a major contribution to the literature on G theory in the last 25 years.
Generalizability theory is a conceptual and statistical framework for analyzing more than one facet in investigations of measurement error and score dependability (Brennan, 2000, 2001; Shavelson & Webb, 1991). Generalizability theory involves a generalizability (G) study and a decision (D) study. Through a two-stage investigation that includes a generalizability (G) study and a decision (D) study, generalizability (G) theory enables researchers to disentangle multiple sources of error and investigate the impact of various changes in the measurement design on score reliabilities. In addition, researchers can evaluate the relative importance of various sources of measurement error and interpret score reliability from both norm- and criterion-referenced perspectives. With data collected in a generalizability (G) study, an observed measurement can be decomposed into a component or effect for the universe score and one or more error components. The effect of each of the independent variables, and of their interactions, can then be tested by an F-ratio, which is the mean square for the effect divided by the mean square for the appropriate error term. In generalizability theory, analysis of variance (ANOVA) is used to compute the mean squares used in estimating variance components. However, instead of using these to test hypotheses, they are used in a generalizability (G) study to generate ‘variance components’ (Rae & Hyland, 2001). Besides looking at the conventional F statistics to establish whether each facet makes a significant contribution to the scores, ANOVA is used to compute variance components. These variance components reflect the size, rather than the statistical significance, of the contributions of each facet to the observed scores (Mitchell, 1979).
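To make this mechanic concrete, the Python sketch below estimates variance components from ANOVA mean squares for a simplified one-facet crossed design (students × questions) with hypothetical scores. The matrix values are illustrative placeholders only; the full students × questions × invigilators design of this study follows the same logic with additional expected-mean-square equations.

```python
import numpy as np

# Hypothetical 5-students x 4-questions score matrix (illustrative values only).
scores = np.array([
    [8, 9, 6, 7],
    [5, 6, 4, 5],
    [9, 8, 8, 9],
    [4, 5, 3, 4],
    [7, 7, 6, 8],
], dtype=float)

n_s, n_q = scores.shape
grand = scores.mean()
s_means = scores.mean(axis=1)   # each student's mean across questions
q_means = scores.mean(axis=0)   # each question's mean across students

# Sums of squares for the crossed s x q design (one observation per cell).
ss_s = n_q * np.sum((s_means - grand) ** 2)
ss_q = n_s * np.sum((q_means - grand) ** 2)
ss_total = np.sum((scores - grand) ** 2)
ss_res = ss_total - ss_s - ss_q          # interaction confounded with error

# Mean squares.
ms_s = ss_s / (n_s - 1)
ms_q = ss_q / (n_q - 1)
ms_res = ss_res / ((n_s - 1) * (n_q - 1))

# Variance components from the expected-mean-square equations,
# e.g. E[MS_s] = sigma^2(sq,e) + n_q * sigma^2(s).
var_res = ms_res
var_s = (ms_s - ms_res) / n_q
var_q = (ms_q - ms_res) / n_s
print(f"sigma^2(s)={var_s:.3f}  sigma^2(q)={var_q:.3f}  sigma^2(sq,e)={var_res:.3f}")
```

These estimated components are the raw material of the G study: they quantify how much of the observed score variance each facet contributes, independently of any significance test.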
Brennan (1997) observed that G theory has no substantive role for hypothesis testing; rather, it emphasizes the estimation of random-effects variance components. However, in this study, since sample scores are used to generalize to the entire population, two hypotheses are stated to determine whether a significant difference exists in the contributions of the facets to measurement error and score dependability in examinations. In G theory, the sources of variation are associated with the persons being measured (the objects of measurement) and with potential sources of error arising from the testing situation, such as questions, invigilators, examiners and occasions. These are called facets (rather than factors), and each facet is composed of one or more levels or conditions. Although the choice and number of facets in a G study may vary according to the interests of the researcher, the object of measurement is always included as a distinct facet. In G theory, objects of measurement are the entities that are measured. In most testing contexts the objects of measurement are the persons being tested, though they could also be other entities, such as classrooms or schools.
In the generalizability (G) study, the variances associated with the different facets of measurement (in this case, questions (q) and invigilators (i), together with the object of measurement, students (s)) are estimated and evaluated in terms of their relative importance in contributing to the total score variance, given a universe of admissible observations (Brennan, 2001). The goal of generalizability analysis is to estimate each of these variance components and their contribution to the total observed variance. Of the variance components estimated in this study, only the differentiation facet, viz. students, produces relevant variance; all other sources are extraneous to the purpose of testing. Replacing the ‘true score’ with the ‘universe score’ shows that the researcher is making inferences from a sample of possible observations; the choice of the universe emphasizes that there is more than one universe to which an investigator might wish to generalize. To be precise, if a student is tested on an achievement measure, he or she may be tested under a population of circumstances involving different question forms, times of the day (occasions), attitudes of invigilators, and moods of examiners. Thus, in generalizability (G) theory, any measurement of an attribute is considered to be a sample from some large set or universe of possible measurements, the universe of admissible observations, defined by observations in all possible combinations of circumstances. Circumstances of the same kind (a facet) may consist of two or more categories or conditions within the universe of admissible observations (Guion & Ironson, 1979).
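For the fully crossed students × questions × invigilators design considered here, this decomposition of an observed score into a universe-score component and error components takes the standard G-theory form (stated here for clarity, using the conventional notation):

$$X_{sqi} = \mu + \nu_s + \nu_q + \nu_i + \nu_{sq} + \nu_{si} + \nu_{qi} + \nu_{sqi,e}$$

where $\mu$ is the grand mean, $\nu_s$ is the universe-score effect for student $s$, the remaining $\nu$ terms are the main and interaction error effects, and the final term is the three-way interaction confounded with residual error. The variance of each effect over the universe of admissible observations is the variance component for that source.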
According to Webb, Shavelson and Haertel (2007), in a generalizability (G) study a behavioural measurement (e.g. an achievement test score) is conceived of as a sample from a universe of admissible observations, which consists of all possible acceptable substitutes for the observation in hand. A researcher may wish to generalize the result of a measurement to only a limited portion of the overall universe defined by these facets; that portion is the universe of generalization (Guion & Ironson, 1979). In this study, which estimates measurement error and score dependability in examinations using Generalizability Theory, possible facets include question forms, the time of the examination (occasion), invigilators, examiners, or combinations of these facets. In designing a study, all facets that might influence scores should be identified. An important feature of G theory is that the relative contributions of the individual sources of measurement error to the overall error variance can be investigated, including the interactions between the sources (compounded in the error). Generalizability theory is used to conduct a simultaneous analysis of multiple sources of measurement error and score dependability for a single test or examination (Ofqual, 2009).
It ensures that one can design a study to estimate the contribution to error that each of these different error sources makes and, moreover, to find a generalizability coefficient which specifically accounts for whichever source of error is of particular interest (Lang, 1978). This study will find a coefficient of generalizability and an index of dependability which show how the evaluation of students’ performance in examinations generalizes over the number of invigilators and the number of questions. It will also help the researcher to decide whether the number and strictness of the invigilators are sufficient to obtain a dependable score, and how many questions or invigilators are needed to obtain a satisfactory generalizability coefficient (Ep²) and index of dependability (Φ).
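As a rough sketch of how such a decision (D) study works, the Python fragment below projects the generalizability coefficient (Ep²) and the index of dependability (Φ) for different numbers of questions and invigilators from a set of G-study variance components. All the component values here are hypothetical placeholders, not the estimates reported in this study; only the formulas for relative and absolute error in a crossed s × q × i random design are taken as given.

```python
# Hypothetical G-study variance components for the crossed s x q x i design
# (placeholder values, NOT the estimates reported in this study).
var = {
    "s": 2.10,      # students (universe-score variance)
    "q": 0.30,      # questions
    "i": 0.05,      # invigilators
    "sq": 0.80,     # students x questions
    "si": 0.15,     # students x invigilators
    "qi": 0.02,     # questions x invigilators
    "sqi,e": 1.40,  # residual: three-way interaction confounded with error
}

def d_study(n_q: int, n_i: int) -> tuple[float, float]:
    """Project Ep^2 and Phi for n_q questions and n_i invigilators."""
    # Relative error: only effects interacting with students disturb rank order.
    rel = var["sq"] / n_q + var["si"] / n_i + var["sqi,e"] / (n_q * n_i)
    # Absolute error also includes the main effects of the instrumentation
    # facets and their interaction.
    abs_err = rel + var["q"] / n_q + var["i"] / n_i + var["qi"] / (n_q * n_i)
    ep2 = var["s"] / (var["s"] + rel)       # generalizability coefficient
    phi = var["s"] / (var["s"] + abs_err)   # index of dependability
    return ep2, phi

for n_q, n_i in [(40, 2), (40, 30), (40, 90)]:
    ep2, phi = d_study(n_q, n_i)
    print(f"n_q={n_q:3d}  n_i={n_i:2d}  Ep^2={ep2:.3f}  Phi={phi:.3f}")
```

Running such a projection shows how both coefficients rise as the number of invigilators increases while the universe-score variance stays fixed, which is the logic behind the recommendation in this study to recruit enough invigilators.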
Statement of the Problem
In measuring students’ performances in a given examination, either relative to those of other students or in an absolute sense, there are characteristics other than the student factor that affect the scores students make in examinations. These characteristics, called sources of error, such as test questions and invigilators, contribute to error in the measurement of students’ achievement and affect the score dependability of these measurements. There is a need to find out their contributions to measurement error in examination scores. Estimating measurement error and score dependability in examinations involves a multi-facet approach; therefore, classical test theory, which addresses only one source of measurement error, is not fit for assessing the effects of multiple sources of error. The advantages of G theory over classical reliability theory are more obvious when more than one random facet is involved. Observed scores in examinations are affected by factors other than the students. Such specific factors (facets) as test questions and invigilators, among others, are likely to affect the reliability of an observed score in examinations. The impact of these factors leads to questions about the accuracy, precision and, ultimately, the fairness of the scores obtained by students in examinations.
Since the scores obtained by the objects of measurement, students (s), in examinations are affected by multiple sources of error, and since scores from examinations are used in making relative and absolute decisions concerning students, there is a need to estimate the measurement error and score dependability of examinations using Generalizability (G) Theory, so as to determine the contributions to error of these facets in examination measurement situations, with a view to minimizing errors and maximizing the reliability of the scores. The problem of this study, therefore, is how to estimate measurement error and score dependability in examinations using Generalizability Theory.
Purpose of the Study
The purpose of the study was to estimate measurement error and score dependability in examinations using generalizability theory. Specifically, the study was designed to:
1. ascertain the contributions of the facets: students (s), questions (q), invigilators (i) and their interactions to measurement error in examination scores.
2. determine the effects of the instrumentation facets: questions (q) and invigilators (i) and their interactions on score dependability in examinations.
3. determine the extent to which the generalizability coefficients show the degree to which students maintain their rank order across the facets: questions (q) and invigilators (i) in examination scores.
4. determine the extent to which the dependability index expresses the degree of performance of students, irrespective of the performance of others, across the facets: questions (q) and invigilators (i) in examination scores.
Significance of the Study
The study is theoretically significant. Generalizability theory, anchored on the view that an observed score is the sum of an unobservable true score and multiple error components, emphasizes that the errors from the estimated variance components are of great importance. The results of the generalizability coefficients and the dependability index are used to make inferences about the objects of measurement (students), such as selection, placement and certification. The results of the study showed that generalizability theory is a more reliable method of estimating measurement error than classical test theory. G theory estimates multiple sources of error simultaneously in order to determine the contribution of each source to measurement error in examination scores, whereas CTT considers only one source of error at a time. To a large extent, generalizability theory provided a more unified approach to assessing the reliability of examination scores than classical test theory. The results of this study will encourage the application of generalizability theory for reliability estimation in place of the continued use of classical test theory.
The practical significance of the study is that its findings are useful to public examination bodies, test item writers, research students and teachers. Public examination bodies, such as the West African Examinations Council, the National Examinations Council and the Joint Admissions and Matriculation Board, benefited from the findings of the study, in that they provide very specific information about measurement error and how to design examinations successfully. Test item writers equally benefited from the findings: they enable item writers to produce test items that are suitable for specific purposes with minimum measurement error, items that can be used to make decisions concerning students and that can distinguish between students of different achievement levels. Research students also benefited from the findings, which provide them with knowledge of the relative importance of the various sources of error and information on how to design efficient measurement procedures; the study also opens an avenue for further research on generalizability theory. The findings were likewise useful to classroom teachers, who were acquainted with the existence of multiple sources of error in examinations; hence, to maximize reliability and reduce or eliminate error in examinations, there is a need to estimate as many sources of error as is economically viable, in order to determine the level of involvement of each source of error in the scores obtained in examinations. The study guided teachers on how to estimate multiple sources of error which are part of the examination process but not related to the construct being measured (students).
Scope of the study
The study was restricted to estimating measurement error and score dependability in examinations using the generalizability theory approach. The study specifically ascertained and determined the contributions and effects of the facets (students, questions, invigilators) and their interactions on measurement error and score dependability in examinations. It also determined how these facets were used in making relative and absolute decisions concerning the objects of measurement (students). Two facets, questions and invigilators, together with the object of measurement, students, were used for the study, despite the numerous facets mentioned in the background of the study. The justification for this was the researcher’s decision to use the most important facets over which one might wish to generalize, namely questions and invigilators. The area covered by this study was restricted to SS3 students in the 2011/2012 academic session in publicly owned secondary schools in Rivers State.
Research Questions
1. What are the contributions of the facets: students (s), questions (q), invigilators (i), and their interactions to measurement error in examination scores?
2. What are the effects of the instrumentation facets: questions (q), and invigilators (i), and their interactions on score dependability in examinations?
3. To what extent do the generalizability coefficients show the degree to which students maintain their rank order across the facets: questions (q) and invigilators (i) in examination scores?
4. To what extent does the dependability index express the degree of performance of students, irrespective of the performance of others, across the facets: questions (q) and invigilators (i) in examination scores?
Hypotheses
Two hypotheses were formulated and tested for significance at the 0.05 level.
Ho1: There is no significant difference in the contributions of facets: students (s), questions (q), invigilators (i), and their interactions to measurement error in examination scores.
Ho2: There is no significant difference in the effects of the instrumentation facets: questions (q), and invigilators (i), and their interactions on score dependability in examinations.