The Official Journal of Global Chinese Society for Computers in Education (全球華人計算機教育應用學會)

Making a-Stratified Computerized Adaptive Testing Design More Practical: Imposing Non-statistical Constraints
加入非統計類限制以強化電腦自適測試「分層遞增a法」的實用性
March 2001

Chi-keung Leung, The Hong Kong Institute of Education
Hua-hua Chang, National Board of Medical Examiners
Kit-tai Hau, The Chinese University of Hong Kong
梁志強   香港教育學院
張華華   國立醫生考試委員會(美國)
侯傑泰   香港中文大學

Abstract

Computerized Adaptive Testing (CAT) is becoming a prevalent format for large-scale educational tests. In CAT, each student encounters a unique test in which items are adaptively selected based on his or her responses to previous questions. The traditional information-based item selection method has created a number of problems, including a high test-overlap rate and a substantially skewed item exposure distribution. Taking a different approach, Chang and Ying (1999) proposed the a-stratified design (STR) and advocated the use of low-discrimination items in the earlier stages of testing. Research findings have indicated that this method is effective in achieving a balanced utilization of the entire item pool and reducing the test-overlap rate, without sacrificing efficiency in ability estimation. Nevertheless, this new approach has not taken into consideration the many practical situations in which non-statistical constraints are necessary. This paper reviews existing models that tackle non-statistical constraints of various complexities. Building on these models, the paper proposes three approaches for incorporating non-statistical constraints into STR designs. The strengths and weaknesses of these methods, as well as problems in implementation, are also discussed.

摘要

電腦自適應考試(Computerized Adaptive Testing, CAT)已成為大型公開考試的一種流行模式。在考試時,電腦先透過考生已答的題目,不斷估計其能力,然後在題庫中選擇合宜的題目測試考生。傳統的理念是選擇那些判別度(a)最大的試題。結果,高判別度的題目最先被選用,留下低判別度的題目原封不動或只在測驗後段使用,形成極不平均的使用率分佈。有見及此,張及應(1999)發展了另一種選題策略「分層遞增a法」,並建議於尚未有充分有關考生能力的測驗初段,先選用低判別度的題目。其研究結果顯示此新策略更能善用試題庫、減少使用高曝光率的題目及兩卷試題重疊的情況。針對現時尚未有研究將此新方法廣泛應用於一些較複雜的考試類別,本文提出三個可行模式,讓「分層遞增a法」應用於有不同要求的考試,並分析它們的優點和限制。

 

The main purpose of an academic test is to generate information about the achievement of examinees in a curricular domain or in other, more general cognitive abilities (Millman & Greene, 1993). Generally, the information will be used to make decisions about individuals. If the information is sufficient, the ability levels of individuals can be estimated precisely, and the risk of making wrong decisions is consequently small. One of the many shortcomings of measurement practices under classical test theory is the inefficiency of fixed-length conventional paper-and-pencil (P&P) tests in assessing examinees at the two ends of the ability continuum. In a typical fixed-length P&P testing situation, competent examinees have to spend time on relatively easy items that provide very little information about their abilities. Low-ability examinees, on the other hand, are given very difficult items that may discourage them and that likewise provide little information about their abilities. Besides, classical test theory does not indicate how effectively each item in the pool measures at various ability levels, and it cannot predict the psychometric properties of a test when administered to a specific target group of examinees (Lord, 1980). For these reasons, researchers and practitioners have attempted to develop better systems in which examinees are given tests tailored to their ability levels.

Lord (1980), one of the pioneers who developed adaptive testing under the framework of Item Response Theory (IRT), believed that "an examinee is measured most effectively when the test items are neither too difficult nor too easy for him" (p. 150). This means that a typical adaptive test attempts to match the difficulties of test items to the examinee's ability during the test. Thus, if an examinee answers the current test item correctly, the next item to be administered should be more difficult. Otherwise, an easier item should be provided.

Understandably, an IRT-based adaptive testing procedure is difficult to realize in P&P form in large-scale examinations. When computers are programmed to carry out adaptive testing, the system is called computerized adaptive testing (CAT). Weiss and Kingsbury (1984) described CAT as a combination of IRT, adaptive testing, and interactive computer administration of tests. The increased availability of high-speed personal computers and the advent of the necessary technology have spurred the development of CAT systems in which computers and IRT-based item selection algorithms are used to select and present appropriate test items to individual test-takers. In CAT, each examinee is given a unique set of test items sequentially selected on the basis of the current estimated ability (Lord, 1980; Weiss, 1982). The administration procedure of a typical CAT generally consists of the following steps (Straetmans & Eggen, 1998):

  1. The computer selects an item from a pool of items.
  2. The item is displayed on the computer screen.
  3. The examinee responds to the question by typing or selecting an answer.
  4. The computer evaluates the response as correct or incorrect.
  5. If the answer is correct, the next item presented will be more difficult; otherwise an easier item will be administered.
  6. The computer terminates testing when a pre-specified stopping rule (e.g., when a certain number of items have been administered) is satisfied.

CAT offers much more efficient and precise ability estimation than P&P tests (Green, 1983; Lord, 1970; McBride & Martin, 1983; Owen, 1975; Wainer, 1990; Weiss, 1982). Other attractive advantages of CAT over P&P tests are summarized by Straetmans and Eggen (1998) as follows:

  1. Examinees can have more flexibility to schedule their time in taking the test as computers support on-demand test delivery.
  2. Alternate item forms that involve graphics, sounds, video and text are feasible.
  3. Teachers are freed from the laborious tasks of test construction and marking.
  4. Examinees can have better planning as they will be informed of the result by the computer immediately after the completion of the test.
  5. Measurement accuracy and efficiency can be improved because CAT can provide measurements of equal precision at all trait levels with fewer items than is possible with a paper-based test.

The appearance of CAT has had a great impact on educational measurement practice. Its appealing advantages have attracted increasing attention from the educational community. Numerous applications of CAT have emerged in testing programs, such as a French language proficiency test (Burston, Harfouch & Monville-Burston, 1995), a Japanese language proficiency test (Brown & Iwashita, 1996), and an ESL reading comprehension test (Young, Shermis, Brutten & Perkins, 1996). Weiss and Yoes (1991) predicted that "many more IRT-based computerized adaptive versions of existing instruments can be expected in years to come" (p. 91). In fact, some continuous large-scale tests such as the Graduate Management Admission Test (GMAT) and the Test of English as a Foreign Language (TOEFL) have been completely or partially converted into CAT (Educational Testing Service, 1998). Many other testing organizations have started or continue to put a substantial amount of resources into the research and development of CAT programs.

As CAT shows promising advantages over P&P tests, the number of CAT application programs is anticipated to grow rapidly along with the continuing evolution of computer technology. However, two important issues have to be addressed before CAT can be more widely accepted by users. The first concerns item exposure. Conventional P&P tests are administered to large groups of examinees on fixed dates using the same or equivalent test forms, whereas individual CAT tests are administered to small groups of examinees continuously on flexibly scheduled test dates. In such a continuous testing context, some popular items may become overly exposed, which leads to serious item and test security risks.

The second issue concerns face validity. If a CAT is going to replace a conventional P&P test, the proportions of items from different content areas in each unique test must parallel those of the conventional version. Otherwise, examinees will challenge the testing organization on both test equity and validity. For example, in a primary school test on arithmetic, if the conventional P&P test consists of addition and subtraction items in equal proportion, the validity of a prospective CAT will be challenged if it administers mainly difficult subtraction items to high-ability students. Thus, in CAT item selection, both psychometric and non-statistical properties of items have to be considered. Davey and Parshall (1995) pointed out that in practice, items are selected with regard to at least three conflicting goals:

  1. to maximize test efficiency by measuring examinees as quickly and accurately as possible;
  2. to protect the security of the item pool by controlling the rates at which popular items can be administered; and
  3. to assure that the test measures the same composite of multiple traits for each examinee by balancing the rates at which items with different content properties are administered.

Clearly CAT acquires its theoretical efficiency by successively selecting items that provide optimal information at each estimated level of ability. However, operational testing programs unavoidably have to consider additional factors in item selection. In fact, many methods have been proposed and developed during the last two decades for controlling item exposure, solving non-statistical constraints, or both.

Information-based Item Selection and Exposure Rate Control

The traditional wisdom and common practice in item selection is to choose the item with maximum Fisher information at the currently estimated ability, which is based on the test-taker's responses to previously administered items. Item information becomes larger when the item difficulty approaches the examinee's ability, when the discrimination parameter increases, and when the probability of guessing correctly approaches zero (Hambleton & Swaminathan, 1985, pp. 104-105). It has been noted that this selection criterion causes skewed item exposure. In particular, items with high discrimination may be overly exposed while others are rarely used (Chang & Ying, 1999; Mills & Stocking, 1996). If item content is leaked to some examinees before the test, the item can no longer provide valid measurement of the trait that it was developed to measure. This would subsequently impair test security and test validity. On the other hand, if too many items are under-utilized in actual testing, item bank development and maintenance would not be cost-effective, which raises concern about the cost of item pool management and utilization.

It is understandable, therefore, that the control of item exposure is an important issue in computerized adaptive testing designs (Mills & Stocking, 1996; Stocking & Swanson, 1998; Way, 1998). Remedies to control high exposure rates have been proposed by McBride and Martin (1983), Sympson and Hetter (1985), Stocking and Lewis (1995, 1998), Davey and Parshall (1995), Thomason (1995), and others. Among these methods, the most popular one is due to Sympson and Hetter (SH), and it can be applied regardless of the item selection method used. The general idea of the SH method is to put a "filter" between selection and administration: an item that is selected by the maximum information criterion may not be administered beyond a certain rate. The method rests on the conditional probability relation P(A) = P(A|S)*P(S), where S denotes the event that an item is selected and A the event that it is administered. Each item has an exposure control parameter, P(A|S), which is determined through a series of adjustment simulations so that the probability of administration is restricted to about the pre-specified maximum exposure rate (Hetter & Sympson, 1997). The extended method of Stocking and Lewis (1998) makes use of a matrix of item exposure parameters conditional on examinees' abilities, while the method proposed by Davey and Parshall (1995) restricts the frequency of item administration conditional on all items that have already been included in the test. These two methods are similar in involving conditional probability and the maximum information selection criterion, but each has its own underlying assumptions.
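The filter idea can be illustrated with a minimal Python sketch. It assumes the commonly described SH adjustment rule (set P(A|S) to the target rate divided by P(S) when P(S) exceeds the target, and to 1 otherwise). The function names, the dictionary of exposure parameters, and the fallback of returning `None` are hypothetical choices made for illustration and are not taken from the paper.

```python
import random

def sh_exposure_param(p_selected, r_max):
    """Adjustment step (assumed rule): cap P(A) = P(A|S) * P(S) at r_max."""
    return r_max / p_selected if p_selected > r_max else 1.0

def select_with_filter(ranked_items, p_admin_given_sel, rng=random):
    """Walk down a ranked candidate list until one item passes the filter.

    p_admin_given_sel maps item id -> P(A|S), assumed to be calibrated
    beforehand through simulation.
    """
    for item in ranked_items:
        if rng.random() < p_admin_given_sel[item]:
            return item
    return None  # every candidate was filtered out (hypothetical fallback)

# An item selected for half of all examinees, with a target rate of 0.2,
# receives P(A|S) = 0.2 / 0.5, so that P(A) = P(A|S) * P(S) stays at 0.2.
print(sh_exposure_param(0.5, 0.2))
```

In an operational program the parameters would be re-estimated over repeated simulations until the observed administration rates settle near the target; the single adjustment step above only shows the arithmetic.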

Most CAT designs are built on Item Response Theory (IRT; Lord, 1970). Suppose an item follows a three-parameter logistic (3-PL) IRT model. The probability that an examinee of ability θ answers the item correctly (Y = 1) is

$$P(Y = 1 \mid \theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}} \qquad (1)$$

where a, b and c are respectively the discrimination, difficulty and pseudo-guessing parameters (Birnbaum, 1968; Lord, 1980). In terms of the 3-PL model, the Fisher information (also known as item information) of the item is a function of the examinee's ability and is expressed as

$$I(\theta) = \frac{(1 - c)\, a^{2}\, e^{2a(\theta - b)}}{\left[c + e^{a(\theta - b)}\right]\left[1 + e^{a(\theta - b)}\right]^{2}} \qquad (2)$$

To understand the characteristics of item information, information is plotted against item discrimination for a fixed value of c and for fixed differences between the examinee's ability and the item difficulty. Figure 1 shows five information curves that share the same typical c of 0.2 but have different values of (θ − b): 0, 1, −1, 2 and −2 respectively.

Figure 1: Information curves (Information vs. Discrimination)

Note. c = 0.2; L1: θ − b = 0, L2: θ − b = 1, L3: θ − b = −1, L4: θ − b = 2, L5: θ − b = −2

In theory and practice, the maximum information method attempts to select an item with the largest value of a and with b closest to the examinee's ability θ. The benefit of this approach can be easily visualized in Figure 1. If the item difficulty is exactly the same as the examinee's ability, i.e. (θ − b) = 0, the information rises sharply as discrimination increases, following the trend of line L1. Thus, if the true ability θ₀ were known, the maximum information approach would lead to a substantial efficiency gain when an item with b close to θ₀ and with the largest possible a is chosen. However, θ₀ is unknown during testing and has to be estimated. Instead of θ₀, the estimated ability, θ̂, is used in the information calculation. During the initial stages of testing, the discrepancy between θ̂ and θ₀ is usually large, and it diminishes only gradually as testing continues. As a result, the b of the chosen item, which seems to be closest to θ̂, may actually be far from the true θ₀, and hence the expected efficiency gain of the maximum information approach cannot be realized. For example, line L3 of Figure 1 shows that an item with a of 3.0 actually provides less information than one with a of 1.0 when (θ₀ − b) is beyond −1.0. In fact, whenever θ₀ differs from b, the information goes to zero as a goes to infinity. The implication is that information-based selection procedures tend to over-expose items with high a values that may not be useful at early stages because of inaccurate θ estimation.
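The behaviour described for lines L1 and L3 can be checked numerically with a short Python sketch. It assumes the standard 3-PL form without a scaling constant and uses the algebraically equivalent expression I = a²·((1 − P)/P)·((P − c)/(1 − c))²; the function names are hypothetical.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3-PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Fisher information of a 3-PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

# Compare a = 1.0 and a = 3.0 with c = 0.2, first when the difficulty
# matches the ability (theta - b = 0), then when it is off by one unit.
for gap in (0.0, -1.0):
    for a in (1.0, 3.0):
        print(f"theta - b = {gap:+.1f}, a = {a}: I = {info_3pl(gap, a, 0.0, 0.2):.4f}")
```

With a perfect match the more discriminating item carries far more information, whereas at θ − b = −1 the ordering reverses, which is the pattern the text attributes to lines L1 and L3.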

Multistage a-Stratified Design

Chang and Ying (1996, 1999) have argued that during the early stages of adaptive testing, the information criterion is inefficient because the estimated trait will not be close to its true value, and therefore high-discrimination items may not be useful. Following a different line of thought, they have proposed a multi-stage a-stratified (STR) CAT design that partitions the item pool into levels and the test into stages according to item discrimination (Chang & Ying, 1999). They have also advocated that the less discriminating items be used in the earlier stages while the more discriminating ones be reserved for later stages. This approach leads to a more balanced item exposure distribution and thus significantly improves item pool utilization.

The steps in the multi-stage a-stratified design (Chang & Ying, 1999) are as follows:

  1. The item pool is partitioned into k subpools according to the ascending order of item discrimination (a-parameter). The first subpool contains items with lowest a parameter and each subsequent subpool has items of slightly higher a parameter until the last subpool with the largest a parameter items.
  2. Accordingly, an entire test is partitioned into k stages, with items administered from the jth subpool during the jth stage.
  3. At each stage, two items are chosen from the unadministered items of the corresponding level with difficulty matching as closely as possible to the current ability estimate. The item with difficulty closest to the estimated ability is administered if a random number from U(0, 1) is less than 0.5, otherwise the item with second closest difficulty parameter is displayed. This randomization procedure reduces the possibility of getting identical item sequences among different examinees. Testing moves to the next stage and item stratum when a prespecified number of items have been administered.
  4. The process continues until the last set of items has been administered from the last subpool.
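The steps above can be sketched in Python as follows. This is an illustrative sketch only: items are assumed to be dictionaries with hypothetical keys `id`, `a` and `b`, strata are assumed to be of (nearly) equal size, and the function names are not taken from Chang and Ying (1999).

```python
import random

def stratify(pool, k):
    """Split the pool into k strata by ascending a-parameter."""
    ordered = sorted(pool, key=lambda item: item["a"])
    size = len(ordered) // k
    strata = [ordered[j * size:(j + 1) * size] for j in range(k - 1)]
    strata.append(ordered[(k - 1) * size:])  # last stratum takes any remainder
    return strata

def pick_item(stratum, used, theta_hat, rng=random):
    """Choose between the two unused items whose b is closest to theta_hat.

    Assumes the stratum still holds at least one unused item.
    """
    candidates = sorted(
        (item for item in stratum if item["id"] not in used),
        key=lambda item: abs(item["b"] - theta_hat),
    )[:2]
    # Closest item with probability 0.5, otherwise the second closest.
    if len(candidates) == 2 and rng.random() >= 0.5:
        return candidates[1]
    return candidates[0]
```

A test driver would call `pick_item` on stratum j for the pre-specified number of items in stage j, update the ability estimate after each response, and then move on to stratum j + 1.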

Through simulation studies with simulated items and operational items, Chang and Ying (1999) have demonstrated that, by increasing the utilization of low-discrimination items, the a-stratified method outperformed many other information-based procedures, which seldom select less informative items. It has also been shown that this approach maintains high efficiency in terms of the mean squared error and average bias of ability estimates.

The traditional wisdom, associated with information-based selection, of using more discriminating items first has been challenged by Hau and Chang (in press). Their findings show that the maximum information approach and the STR have comparable measurement efficiency. However, the former method frequently requires the replenishment of items with high discrimination power, which are difficult to construct, and thus drastically increases the cost of test maintenance. In contrast, the STR tends to equalize item usage, which in turn enhances test security and maintains the stability of the item pool structure.

Though the a-stratified design is promising in balancing the item utilization, it also has several shortcomings. First, there is no guarantee on item exposure control as some observed exposure rates were much higher than the pre-specified limit. Recently, Leung, Chang and Hau (1999) showed that this problem could be solved when STR was used with the SH exposure control algorithm. In their study, the integrated method yielded a more balanced exposure distribution in which at most one item had exposure above the target rate, while high estimation efficiency was maintained.

The second shortcoming of STR is that it is not yet designed to deliver tests in which non-statistical constraints such as content balancing and item type specifications are necessary. With such a serious limitation, the improvement in pool utilization offered by the STR would be of theoretical importance but of little practical use to most operational testing programs. This paper attempts to provide some solutions to this important issue. Models for handling non-statistical constraints of different levels of complexity are reviewed first. Then some refinements that allow the STR to address non-statistical constraints are discussed.

Models for Solving Non-statistical Constraints

CAT has many advantages appealing to testing communities and organizations. Yet general acceptance of a CAT implementation depends on the conformance of the adaptive tests to the desired test specification. For example, a P&P mathematics test might consist of 50% pre-algebra and 50% algebra questions. If the item selection procedure of a CAT constructed from the same item bank does not incorporate content balancing, a student of low ability may be administered an adaptive test entirely of pre-algebra problems while another one of high ability may receive a test entirely of algebra questions. These two adaptive tests may provide the same amount of information concerning mathematics competence, but the users or practitioners may be reluctant to accept the results due to the lack of content comparability.

For a sophisticated test, item selection is usually subject to various rules, often called test specifications. In addition to constraints on statistical properties such as precision of estimate, test specifications may impose constraints on non-statistical aspects like intrinsic item properties, item features in relation to all other candidate items, and item features in relation to a subset of all other candidate items (Stocking & Swanson, 1993; Swanson & Stocking, 1993). To enhance face validity, tests have to be constructed to meet the structural specifications as far as possible. Three available models for solving various degrees of non-statistical constraints in item selection are described below.

Constrained CAT (CCAT)

Concerns about content control in CAT were raised by Green, Bock, Humphreys, Linn, and Reckase (1984), Thissen and Mislevy (1990), and Wainer et al. (1990). One of the earliest and most effective mechanisms was proposed by Kingsbury and Zara (1989). This content-balancing algorithm, which uses the maximum information criterion, is easy to implement. Basically, it selects the most informative unadministered item from the content area that is farthest below its ideal administration percentage for each examinee. To start, items are labeled according to the content areas identified, and content specifications are set (e.g., the percentage of a test to be drawn from each content area). Before item information is considered, a check is made to find the content area that is farthest below its ideal administration percentage for the current examinee. The maximum information search is then restricted to unadministered items within this content area, and the item with the largest information is administered.

This mechanism provides a convenient way for CAT to incorporate content balancing in the item-by-item selection process. When content coverage is the only non-statistical constraint to be attended to, the CCAT algorithm is certainly a good method to consider. Stocking and Swanson (1993) have pointed out that one disadvantage of this approach is its partitioning of the pool into mutually exclusive subsets according to the item features of interest. The number of items in each partition can become quite small when the test specialist is interested in many item features. The CCAT algorithm has also been criticized as inadequate for tackling other constraints such as item sets and overlapping items.
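The content-balancing check can be sketched in Python as follows. This is a minimal illustration, assuming items are dictionaries with hypothetical keys `id`, `area`, `a`, `b` and `c`, target shares given as proportions, and the 3-PL information function without a scaling constant; none of these names come from Kingsbury and Zara (1989).

```python
import math

def info_3pl(theta, a, b, c):
    """Fisher information of a 3-PL item at ability theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def next_content_area(targets, counts):
    """Content area whose administered share is farthest below its target."""
    total = sum(counts.values())
    def shortfall(area):
        share = counts.get(area, 0) / total if total else 0.0
        return targets[area] - share
    return max(targets, key=shortfall)

def ccat_select(pool, used, theta_hat, targets, counts):
    """Most informative unused item within the most under-represented area.

    Assumes that area still contains at least one unused item.
    """
    area = next_content_area(targets, counts)
    candidates = [it for it in pool if it["area"] == area and it["id"] not in used]
    return max(candidates, key=lambda it: info_3pl(theta_hat, it["a"], it["b"], it["c"]))
```

For example, with equal targets for two areas and counts of one and two items administered so far, the search is confined to the area with only one item, even if the other area holds a more informative item.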

Weighted Deviation Model (WDM)

Stocking and Swanson (1993) have proposed the Weighted Deviation Model for CAT to emulate the test construction practices of expert test specialists. The WDM treats test constraints as desired properties rather than strict requirements, because a pool structure may not be able to satisfy all constraints when test specifications are very complicated.

In the WDM, test constraints and maximum information are treated as desired properties. Thus, when there is no feasible solution to the system of the constraints, the objective function can still find an item yielding the minimal deviation from the constraints. The model is:

Minimize

(sum of weighted deviations) (3)

Subject to

(for lower bounds of constraints) (4)

(for upper bounds of constraints) (5)

(6)

and

(Stocking & Swanson, 1993, p. 280-281) (7)

where a_ij equals 1 if item i has property j and 0 otherwise; w_j is the weight assigned to constraint j; w_θ is the weight assigned to the information constraint; L_j and U_j are the lower bound and upper bound of constraint j; d_Lj and d_Uj represent the deficit from the lower bound and the surplus from the upper bound respectively; e_Lj and e_Uj represent the excess from the lower bound and the deficit from the upper bound respectively; and x_i equals 1 if the ith item is included in the test, or 0 otherwise.
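For orientation, a weighted deviation model of this kind is commonly written in the following general form. This is offered as a sketch under assumed notation, not as a transcription of Equations 3 to 7: the symbols (a_{ij} for the property indicator, w_j for the constraint weights, L_j and U_j for the bounds, d and e for the deviation variables, x_i for the selection indicator, J constraints and N items) are assumptions, and the way the information term and its weight enter the objective is not shown.

```latex
\begin{align*}
\text{minimize}\quad
  & \sum_{j=1}^{J} w_j\, d_{L_j} + \sum_{j=1}^{J} w_j\, d_{U_j} \\
\text{subject to}\quad
  & \sum_{i=1}^{N} a_{ij}\, x_i + d_{L_j} - e_{L_j} = L_j, \qquad j = 1, \dots, J \\
  & \sum_{i=1}^{N} a_{ij}\, x_i - d_{U_j} + e_{U_j} = U_j, \qquad j = 1, \dots, J \\
  & d_{L_j},\; d_{U_j},\; e_{L_j},\; e_{U_j} \ge 0, \qquad j = 1, \dots, J \\
  & x_i \in \{0, 1\}, \qquad i = 1, \dots, N
\end{align*}
```

Only the deficits below lower bounds and the surpluses above upper bounds are penalized, so a test that meets every bound has an objective value of zero.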

A heuristic for item selection using WDM has also been proposed:

(i) For every item not already in the test, compute the deviation for each of the constraints if the item were added to the test;

(ii) Sum the weighted deviations across all constraints;

(iii) Select the item with the smallest weighted sum of deviations (p. 281).
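The three-step heuristic can be sketched in Python as follows. This is a simplified illustration rather than the procedure of Stocking and Swanson (1993): deviations are computed on the partial test without projecting to the full test length, the information term is omitted, and the data layout (items with a hypothetical `props` list, constraints as `(lower, upper, weight)` tuples) is assumed.

```python
def weighted_deviation(counts, constraints):
    """Weighted sum of shortfalls below lower bounds and surpluses above upper bounds."""
    total = 0.0
    for prop, (lower, upper, weight) in constraints.items():
        n = counts.get(prop, 0)
        total += weight * (max(0, lower - n) + max(0, n - upper))
    return total

def wdm_select(pool, test, constraints):
    """Pick the unused item that yields the smallest weighted sum of deviations."""
    used = {it["id"] for it in test}

    def deviation_if_added(item):
        # Step (i): count properties as if the item were added to the test.
        counts = {}
        for it in test + [item]:
            for prop in it["props"]:
                counts[prop] = counts.get(prop, 0) + 1
        # Step (ii): sum the weighted deviations across all constraints.
        return weighted_deviation(counts, constraints)

    # Step (iii): smallest weighted sum wins.
    candidates = [it for it in pool if it["id"] not in used]
    return min(candidates, key=deviation_if_added)
```

Information could be brought in as one more weighted term inside `weighted_deviation`; how that term should be scaled against the other constraints is a design decision this sketch leaves open.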

One disadvantage of this sophisticated model is that selecting the 'best' item is extremely time-consuming because of the heavy computation involved. Besides, maximizing information is treated only as a desired property. Thus, even if an optimal test satisfying all constraints exists, the model is not guaranteed to administer it. Van der Linden and Reese (1998) have argued that adding weights to constraints increases complexity and may lead to unpredictable violations of the constraints and of the principle of maximum information. The appealing advantage of the WDM is that it constructs tests as an expert would and takes into account the number and complexity of constraints on item selection to a great extent.

The LP Optimal Model (LPM)

Van der Linden and Reese (1998) have proposed a linear programming (LP) model to tackle test constraints. The following example briefly explains how inequalities and equations are used to model the constraints.

Maximize (maximum information at θ̂) (8)

subject to

, (test length) (9)

, (number of item sets) (10)

(number of items in item set j) (11)

(number of items in item set j) (12)

(maximum item exposure) (13)

(number of item sets per content category) (14)

(number of item sets per content category) (15)

(mutually exclusive items) (16)

(mutually exclusive item sets) (17)

(domain of decision variables) (18)

and

(domain of decision variables) (19)

where the superscripts u and l represent the upper and lower bounds respectively (pp. 262-263).

In this model, the information is not considered as a psychometric constraint. Instead, it is the objective function to be maximized subject to a number of non-statistical constraints. An optimal adaptive test is constructed as follows:

Step 1: Initialize the model.

Step 2: Assemble an initial test according to the model.

Step 3: Administer the item with maximum information for the current ability estimate.

Step 4: Update the model after the response.

Step 5: Reassemble the remaining part of the test after returning the unadministered items to the bank.

Step 6: Repeat Steps 3-5 until a full test has been administered (p. 264).

The above adaptive implementation implies that at each item selection step, a full test is assembled to have maximum information at the current ability estimate, taking into consideration the set of items already administered. An unadministered item having the maximum information is then selected from this test. The LP model guarantees that each adaptive test is always optimal and meets the entire set of constraints.

In contrast to the WDM, in which information is treated as one of the weighted constraints, the LP model gives top priority to the non-statistical constraints. If feasible solutions exist for the system of constraints across the whole ability range, optimal tests will be adaptively constructed using the model. Otherwise, a CAT program adopting this item selection algorithm may fail to deliver a complete test.
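The assemble-then-administer cycle can be illustrated on a toy pool in Python. This sketch replaces the 0-1 linear programming solver with an exhaustive search over item combinations, which is only workable for very small pools, and it handles just two kinds of constraints (test length and per-area item counts). The names `assemble_test`, `next_item` and `bounds` are hypothetical.

```python
import math
from itertools import combinations

def info_3pl(theta, a, b, c):
    """Fisher information of a 3-PL item at ability theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def assemble_test(pool, administered, theta_hat, length, bounds):
    """Feasible full-length test with maximum information, or None if infeasible.

    bounds maps content area -> (min_items, max_items). Items already
    administered are forced into the test.
    """
    fixed = [it for it in pool if it["id"] in administered]
    free = [it for it in pool if it["id"] not in administered]
    best, best_info = None, float("-inf")
    for extra in combinations(free, length - len(fixed)):
        test = fixed + list(extra)
        counts = {}
        for it in test:
            counts[it["area"]] = counts.get(it["area"], 0) + 1
        if not all(lo <= counts.get(area, 0) <= hi for area, (lo, hi) in bounds.items()):
            continue
        total = sum(info_3pl(theta_hat, it["a"], it["b"], it["c"]) for it in test)
        if total > best_info:
            best, best_info = test, total
    return best

def next_item(pool, administered, theta_hat, length, bounds):
    """Most informative unadministered item of the reassembled test.

    Assumes fewer than `length` items have been administered so far.
    """
    test = assemble_test(pool, administered, theta_hat, length, bounds)
    if test is None:
        return None  # no feasible test: the program cannot continue
    remaining = [it for it in test if it["id"] not in administered]
    return max(remaining, key=lambda it: info_3pl(theta_hat, it["a"], it["b"], it["c"]))
```

Because every administered item comes from a full test that already satisfies the constraints, the finished test satisfies them too; the `None` return corresponds to the failure case in which no feasible test exists.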

Approaches to Imposing Non-statistical Constraints in STR

This section provides some suggestions on how to enhance the STR by incorporating non-statistical constraints. The direction for enhancement is based on the concept that an ideal item selection algorithm should be able to simultaneously satisfy the test specifications as far as possible, effectively control the exposure rates below a desired limit and fully utilize all items.

It has been pointed out that the efficiency gain from using highly discriminating items may not be realized during the early stages because of inaccurate ability estimation. In fact, the valuable highly discriminating items should be saved for later stages, where the ability estimate is normally very close to the true value. Chang and Ying (1999) recognized the importance of saving the highly discriminating items for later stages and developed the multistage a-stratified design. In the a-stratified design, less discriminating items are utilized in the early stages, when they can be used most effectively. This increases the exposure of items that would otherwise be underexposed in systems that adopt an information-based algorithm. The results of the simulation studies also demonstrated that the design had promising effects on improving pool utilization and reducing the test-overlap rate. However, a number of issues need further research. It has been noted in Chang and Ying's study that several items were overly exposed, with rates far above the desired target. This item security issue has been partially solved by incorporating the SH conditional exposure method into the a-stratified design (Leung, Chang, & Hau, 1999). In their simulation studies, the enhanced a-stratified method restricted all the exposure rates below the desired level, further reduced the test-overlap rates and improved pool utilization.

In general, a CAT needs to address the issue of non-statistical constraints, and the a-stratified design should be no exception. Looking back at the three models for handling various kinds of constraints, all of them adopt the maximum information criterion. Thus they tend to suffer from the same shortcomings mentioned earlier. That is, the highly discriminating items are likely to be selected so often that they become overly exposed, while the low-discrimination items are seldom touched, leading to an extremely skewed distribution of exposure rates. It should be of both research and practical interest to investigate how the STR could be integrated with the CCAT, WDM and LPM so that a more realistic and practical model close to the ideal design would emerge.

CCAT-STR

When content balancing is the only non-statistical requirement for adaptive tests, it is reasonable to develop a simple but better model by integrating the STR with the CCAT. The procedure is as follows. Items are labeled according to the content areas identified. Content specifications are set (e.g., the percentage of a test to be drawn from each content area). The item pool is partitioned into k strata in ascending order of the item a-parameter. Each test is divided into k stages. In the first stage, items are administered from the first stratum; in the second stage, the next group of items is administered from the second stratum; and so on, until the last group of items is administered from the k-th stratum. In each stage, the best unadministered item of the corresponding stratum, i.e. the one with difficulty closest to the currently estimated theta, is chosen. This item is then checked to see whether it belongs to a content area that has already used up its pre-specified quota. If not, it is administered. Otherwise, the next best item is identified and checked in the same way before administration. The process continues until a suitable item is administered. The test moves to the next stage when a pre-specified number of items have been administered from the current stratum, and the test stops when all stages are complete. If the range of a-parameters is very narrow for some content areas, it may happen that no suitable item can be found in the corresponding stratum at later stages, when most content specifications are fully satisfied. In such cases, a backward search for an appropriate item in previous strata is carried out so that a complete test can be assembled.
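A single selection step of this procedure, including the backward search, might look as follows in Python. This is an illustrative sketch: content quotas are assumed to be given as item counts per area, strata are assumed to be lists of item dictionaries with hypothetical keys `id`, `area` and `b`, and the backward search is assumed to visit earlier strata from the most recent to the first.

```python
def ccat_str_select(strata, stage, used, theta_hat, quota, counts):
    """Closest-difficulty unused item in the current stratum whose content
    quota is still open; earlier strata are searched if none is found.

    stage is the zero-based index of the current stratum. Returns None
    when no stratum up to the current one holds a suitable item.
    """
    for level in range(stage, -1, -1):
        candidates = [
            it for it in strata[level]
            if it["id"] not in used
            and counts.get(it["area"], 0) < quota[it["area"]]
        ]
        if candidates:
            return min(candidates, key=lambda it: abs(it["b"] - theta_hat))
    return None
```

Filtering by quota before ranking by difficulty gives the same item as ranking first and then skipping items from exhausted areas, which is how the procedure is worded above.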

WDM-STR

Sometimes, specifications set for a complex test are so complicated that the pool structure is unlikely to meet all requirements. In such circumstances, an integrated model of the STR and the WDM may be helpful. An integrated model can be constructed by the following steps:

(i) Test constraints are mathematically formulated as Equations 4 and 5.

(ii) The item pool is partitioned into k strata according to the ascending order of the a-parameter values of the items. Each test is divided into k stages. In the first stage, items are administered from the first stratum. In the second stage, the next group of items are administered from the second stratum,…, and the last group of items from the k-th stratum.

(iii) At the level of individual item selection in each stage, the three unadministered items whose difficulty parameters are closest to the current theta estimate are chosen. For each of them, the deviation from each of the constraints is computed as if the item were added to the test. The weighted deviations are summed across all constraints, and the item with the smallest weighted sum of deviations is administered.

This integrated model should exhibit the combined strengths of the STR and the WDM. In particular, during the early stages the WDM-STR uses low discriminating items that minimize the weighted sum of deviations, which is the time when these items make their best contribution to trait estimation.
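Step (iii) can be sketched in Python as follows. The deviation measure used here is a simplified stand-in, not the formulation of Equations 4 and 5: each constraint is assumed to be a count with a lower and an upper bound, and the deviation is the shortfall below the lower bound plus the excess above the upper bound after the candidate item is added. The names `Item`, `Constraint`, `weighted_deviation` and `select_item` are hypothetical, and ties are assumed to be broken in favor of the item closest to theta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: int
    a: float           # discrimination parameter
    b: float           # difficulty parameter
    attrs: frozenset   # names of the constraints this item counts toward


@dataclass(frozen=True)
class Constraint:
    name: str
    lower: int     # desired minimum number of items with this attribute
    upper: int     # desired maximum number of items with this attribute
    weight: float  # importance of the constraint


def weighted_deviation(item, counts, constraints):
    """Weighted sum of deviations if `item` were added to the test.

    Simplifying assumption: deviation = shortfall below the lower bound
    plus excess above the upper bound, evaluated on the updated counts.
    """
    total = 0.0
    for c in constraints:
        n = counts.get(c.name, 0) + (1 if c.name in item.attrs else 0)
        total += c.weight * (max(0, c.lower - n) + max(0, n - c.upper))
    return total


def select_item(stratum, theta, used_ids, counts, constraints, n_candidates=3):
    """Pick, among the n_candidates unadministered items of the current
    stratum closest in difficulty to theta, the one with the smallest
    weighted sum of deviations. Returns None if the stratum is exhausted.
    """
    available = [it for it in stratum if it.item_id not in used_ids]
    closest = sorted(available, key=lambda it: abs(it.b - theta))[:n_candidates]
    if not closest:
        return None
    # min() keeps the first minimal element, so ties go to the closest item.
    return min(closest, key=lambda it: weighted_deviation(it, counts, constraints))
```

Because the candidates are drawn only from the current stratum, the constraint weighting operates inside the stratification rather than replacing it.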

LPM-STR

An integrated model of the STR and the LPM may be suitable for the case in which the number of non-statistical constraints is moderate and the pool structure can satisfy all these constraints. The following steps show one way of constructing such an integrated model.

(i) Stratify the item pool and divide an adaptive test into stages as described in the STR.

(ii) Add constraints on the number of items to be administered from each stratum, say

(number of items to be administered from the 1st stratum)

(number of items to be administered from the 2nd stratum)

(number of items to be administered from the 3rd stratum)

(iii) At the item-selection level of each stage, apply the full LPM and test assembly procedure. The best item, that is, the one providing the largest information, is checked to see whether it belongs to the corresponding stratum or a previous stratum. If so, it is administered. Otherwise, the same check is applied to the next best item, and so on, until an item is administered.

(iv) The test moves to the next stage when the specified minimum number of items have been administered in the current stage.

Similar to CCAT-STR and WDM-STR, LPM-STR guarantees the utilization of low discriminating items during the early stages of testing. This guarantee is provided by Steps (ii) and (iii). This integrated method should exhibit the combined strengths of the STR and the LPM.
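The stratum check in Step (iii) and the stage transition in Step (iv) can be sketched in Python as below. The linear programming step itself is not shown: `feasible_items` is assumed to be the set of unadministered items that the LPM test assembly currently allows. Item information is computed under a two-parameter logistic model with scaling constant D = 1.7, which is an assumption made for illustration, and all function and field names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: int
    a: float       # discrimination parameter
    b: float       # difficulty parameter
    stratum: int   # 0-based index of the stratum the item belongs to


def information(item, theta, D=1.7):
    """Fisher information of an item at theta under an assumed 2PL model."""
    p = 1.0 / (1.0 + math.exp(-D * item.a * (theta - item.b)))
    return (D * item.a) ** 2 * p * (1.0 - p)


def select_item(feasible_items, stage, theta):
    """Step (iii): take the most informative item that belongs to the
    current stratum or a previous one. Returns None if there is none.
    """
    ranked = sorted(feasible_items,
                    key=lambda it: information(it, theta),
                    reverse=True)
    for it in ranked:
        if it.stratum <= stage:
            return it
    return None


def next_stage(stage, administered_in_stage, min_per_stratum):
    """Step (iv): advance once the minimum for the current stage is met.

    `min_per_stratum` holds the lower bounds set in Step (ii); the last
    stage never advances.
    """
    last = len(min_per_stratum) - 1
    if stage < last and administered_in_stage >= min_per_stratum[stage]:
        return stage + 1
    return stage
```

Filtering on `it.stratum <= stage` is what keeps highly discriminating items out of the early stages even though the ranking is information-based.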

Discussion

Computerized adaptive testing has many advantages that traditional paper-and-pencil tests do not possess, and it is one of the many practical applications of computers in education. In the past two decades, many researchers have worked on improving the design of item selection in CAT. Chang and Ying (1999) proposed the STR, an alternative selection method, on the basis of two observations: there is generally a large discrepancy between the true ability and its estimate in the early stages of adaptive testing, and information-based item selection algorithms tend to overexpose highly discriminating items while rarely selecting low discriminating ones. The new approach has been shown to be effective in improving item usage and reducing the test-overlap rate. However, it also has two major shortcomings. First, it offers no guarantee of item exposure control, as some observed exposure rates are still much higher than the desired target. This issue has been addressed by Leung, Chang and Hau (1999), who demonstrate that exposure rate control is guaranteed when the STR is used with the SH algorithm.

The second shortcoming of the STR is its limited applicability to general testing situations, which usually need to address content balancing and other non-statistical constraints. This paper reviews three models in the literature that can be combined with the STR. These models, CCAT, WDM and LPM, address non-statistical constraints of various complexities. In the previous sections, we have described how to integrate each of them with the STR, forming the integrated models CCAT-STR, WDM-STR and LPM-STR. The proposed models aim to extend the STR so that it can be applied to more general testing environments with various levels of non-statistical constraints. In theory, the integrated models should retain the advantages of the STR. That is, a well-balanced item exposure distribution should result, because the selection process is restricted to low discriminating items in the early stages and is free to use highly discriminating items in the later stages to pinpoint the ability estimate. Consequently, these models should be able to satisfy the test specifications as far as possible, control exposure rates effectively and make good use of the item pool, all at the same time. It is anticipated that many researchers and practitioners will be interested in these new models and will want to compare their performance with that of their counterparts through simulations using operational item parameters before applying them to real testing. As the strengths and weaknesses of any design may vary with factors such as item pool structure, examinee population, test length and the number of constraints, more research along these lines needs to be done before these theoretical models can be put into practice.

References

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord, & M.R. Novick, Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Brown, A., & Iwashita, N. (1996). Language background and item difficulty: The development of a computer-adaptive test of Japanese. System, 24(2), 199-206.

Burston, J., Harfouch, J., & Monville-Burston, M. (1995). The French CAT: An assessment of its empirical validity. Australian Review of Applied Linguistics, 18(1), 52-68.

Chang, H.H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20, 213-229.

Chang, H.H., & Ying, Z. (1999). A-stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23(3), 211-222.

Davey, T., & Parshall, C.G. (1995, April). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, USA.

Educational Testing Service (1998, July). Computer-based GMAT and TOEFL introduced as computer power continues to improve testing. http://www.ets.org/aboutets/zgmattfl.html.

Green, B.F. (1983). The promise of tailored tests. In H. Wainer & S. Messick (Eds.), Principles of modern psychological measurement. Hillsdale, NJ: Lawrence Erlbaum Associates.

Green, B.F., Bock, R.D., Humphreys, L.G., Linn, R.L., & Reckase, M.D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21, 347-360.

Hambleton, R.K., & Swaminathan, H. (1985). Item Response Theory. Principles and Applications. Boston: Kluwer-Nijhoff.

Hau, K.T., & Chang, H.H. (in press). Item Selection in Computerized Adaptive Testing: Should More Discriminating Items be Used First? Journal of Educational Measurement.

Hetter, R.D., & Sympson, J.B. (1997). Item Exposure Control in CAT-ASVAB. In W.A. Sands, B.K. Waters, & J.R. McBride (Eds.), CAT: From inquiry to operation. Washington, DC: American Psychological Association.

Kingsbury, G.G., & Zara, A.R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2, 359-375.

Leung, C.K., Chang, H.H., & Hau, K.T. (1999, April). An enhanced a-stratified computerized adaptive testing design. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada.

Lord, F.M. (1970). Some test theory for tailored testing. In W.H. Holtzman (Ed.), Computer Assisted Instruction, Testing, and Guidance. New York: Harper and Row.

Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

McBride, J.R., & Martin, J.T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D.J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing. New York: Academic Press.

Millman, J., & Greene, J. (1993). The specification and development of tests of achievement and ability. In R.L. Linn (Ed.), Educational measurement (3rd ed.). New York: Macmillan Publishing Company.

Mills, C.N., & Stocking, M.L. (1996). Practical issues in large-scale computerized adaptive testing. Applied Measurement in Education, 9, 287-304.

Owen, R.J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351-356.

Stocking, M.L., & Lewis, C. (1995). A new method of controlling item exposure in computerized adaptive testing. Research Report 95-25. Princeton, NJ: Educational Testing Service.

Stocking, M.L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23, 57-75.

Stocking, M.L., & Swanson, L. (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17, 277-292.

Stocking, M.L., & Swanson, L. (1998). Optimal design of item banks for computerized adaptive tests. Applied Psychological Measurement, 22, 271-279.

Straetmans, G.J.J.M., & Eggen, T.J.H.M. (1998). Computerized adaptive testing: What it is and how it works. Educational Technology, Jan.-Feb., 45-52.

Swanson, L., & Stocking, M.L. (1993). A model and heuristic for solving very large item selection problems. Applied Psychological Measurement, 17, 151-166.

Sympson, J.B., & Hetter, R.D. (1985). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th Annual Meeting of the Military Testing Association (pp. 973-977). San Diego, CA: Navy Personnel Research and Development Center.

Thissen, D., & Mislevy, R.J. (1990). Testing algorithms. In H. Wainer (Ed.), Computerized adaptive testing: A primer (pp. 103-135). Hillsdale, NJ: Lawrence Erlbaum Associates.

Thomasson, G.L. (1995, June). New item exposure control algorithms for computerized adaptive testing. Paper presented at the Annual Meeting of the Psychometric Society, Minneapolis.

van der Linden, W.J., & Reese, L.M. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22, 259-270.

Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J., Steinberg, L., & Thissen, D. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum.

Way, W.D. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement: Issues and Practice, 17, 17-27.

Weiss, D.J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473-492.

Weiss, D.J., & Kingsbury, G.G. (1984). Application of Computerized Adaptive Testing to Educational Problems. Journal of Educational Measurement, 21(4), 361-375.

Weiss, D.J., & Yoes, M.E. (1991). Item response theory. In R.K. Hambleton, & J.N. Zaal (Eds.), Advances in educational and psychological testing. Boston, MA: Kluwer Academic Publishers Group.

Young, R., Shermis, M.D., Brutten, S.R., & Perkins, K. (1996). From conventional to computer-adaptive testing of ESL reading comprehension. System, 24(1), 23-40.


Chi-keung Leung