METHOD FOR AUTOMATICALLY PRODUCING A COMPUTERIZED ADAPTIVE TESTING QUESTIONNAIRE
BACKGROUND OF THE INVENTION Field of the Invention
This invention relates to the field of computerized, interactive skills assessment and, in particular, to the statistical validation of adaptive questionnaires.
Description of Related Art
Computerized Adaptive Testing (CAT) refers to skill assessments that receive feedback from a test-taker and dynamically adapt the skill level, or difficulty, of subsequent questions put to the test-taker. Thus, for every question that the test-taker answers, a CAT system is able to calculate an approximation of the skill level of the test-taker in order to next ask the most relevant question available in a set of questions. CAT is a fast and accurate way of determining the proficiency of people in a given field, and yet it has been out of reach for almost all tests because of the difficulty and expense of transforming a regular questionnaire into a CAT questionnaire.
Typically, a CAT system is based on Item Response Theory (IRT). According to IRT, the probability that a test-taker of a certain proficiency will answer a particular question correctly is a mathematical function that can be described with a few parameters. The current state of the art is a three-parameter model, where the three parameters are indicative of: a difficulty level of the question; a discrimination of the question; and guessing. In order to implement a CAT test, then, some of the three IRT parameters for each question are needed.
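For illustration only, the three-parameter model just described can be written as a short function; the function name and signature below are assumptions of this sketch, not part of the specification:

```python
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic (3PL) IRT model: the probability that a
    test-taker of proficiency theta answers a question correctly, where
    a is the discrimination, b the difficulty, and c the guessing
    (lower-asymptote) parameter."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

For a four-choice question with c = 0.25, a test-taker whose proficiency equals the question's difficulty (theta = b) answers correctly with probability 0.625.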
In related art, a human specialist is required to begin with a set of questions and produce from it a CAT questionnaire. The human specialist, referred to as a psychometrician, reviews a set of questions for bias, irrelevancy, ambiguity and other factors and assists the questionnaire's author in an iterative improvement of the questionnaire. The psychometrician then gathers data by administering the questionnaire to a sample population and statistically analyzing the results using an IRT model to produce a CAT questionnaire. In these related methods, producing the CAT questionnaire requires a long time and is expensive, due to the involvement of human specialists such as psychometricians and statisticians.
What is needed instead is a method for automatically producing a CAT questionnaire from a set of questions with reduced involvement of human specialists.
SUMMARY OF INVENTION
This invention is a method for generating a statistically validated CAT questionnaire on a computer. An object of the invention is to enable a user to transform a set of questions into a CAT questionnaire. Another object of the invention is to enable a user with no special training to transform a set of questions into a CAT questionnaire.
These and other objects of the invention are achieved in an embodiment that calibrates questions in a set of questions with statistical modeling and supplies a user with information indicative of the appropriateness of the question to a CAT questionnaire. Based on the information provided by the statistical modeling, the user may amend the set of questions, or corresponding answers, or a sample population used in the statistical modeling. In an iterative manner, the user arrives at a statistically validated CAT questionnaire. A preferred embodiment guides the user with expert recommendations regarding improvements to the questionnaire and operates over a computer network.
BRIEF DESCRIPTION OF THE FIGURES
FIG. 1A illustrates the steps for transforming a regular questionnaire into a CAT questionnaire according to an embodiment.
FIG. 1B shows a process flow for transforming statistically non-calibrated questions into statistically calibrated questions according to an embodiment.
FIG. 2 illustrates calibration and statistical analysis steps according to an embodiment.
FIG. 3 is a flow chart summarizing a system embodiment for analyzing the global results of the calibration and statistical analysis for each question in a questionnaire.
FIG. 4 is a flow chart summarizing a system embodiment for diagnosing the results of the calibration and statistical analysis for each question in a questionnaire.
DETAILED DESCRIPTION
FIG. 1A illustrates the overall process flow in an embodiment where the method is used to build a skill-assessment questionnaire. However, the present invention is not limited to this application and may be applied to other applications. Alternate embodiments may be based on the same overall structure using different algorithms, formulas and/or different decision trees to produce similar types of results. Embodiments extend over fields where IRT may apply. For example, an alternate embodiment is a system to automatically build fast and efficient market-survey systems (computerized adaptive surveys) where IRT is relevant. Preferred embodiments have a 3-parameter model; however, other embodiments have a 2-parameter model or a 1-parameter model. A formulary for an exemplary IRT model is given below at the end of the description.
Referring to FIG. 1A, block 100, a user authors a set of questions. The only limitation placed on the type of questions is that the answer to a particular question must be either correct or incorrect. That is, there are no partially correct answers; partially correct answers must be treated as incorrect answers.
At block 101, the user builds a questionnaire. This comprises sub-steps (not shown) of gathering questions, some of which may already have been calibrated (statistically validated) inside another questionnaire. The questions must pertain to the same subject matter. If some of the questions have been previously calibrated, the calibration preferably will have been performed with a relevant sample population, that is, the same type of population as the one that will take the final CAT questionnaire produced by the method of this invention.
At block 102, an indication is supplied of the number of sample candidates needed to calibrate the questions that are, as yet, not calibrated. This indicative figure may be based on empirical laws that take into account the number of non-calibrated questions and the number of calibrated questions, as well as the previous results of the calibration. As described, the number of sample candidates given at this step is only an indication. If the calibration does not give good results, the program produces a new estimation of the number of candidates needed.
At block 103, the questionnaire is posted on the Internet or on an Intranet in order to gather data for the calibration and analysis. While such computer networks are preferred, other embodiments have an examinee take the questionnaire without using a computer network.
At block 104, sample candidates take the test, or questionnaire, as a regular sequential test. In differing embodiments, the test may be split into several parts with test candidates taking only a portion of the test.
At block 105, all data (i.e., the answers of the sample candidates to the questions) are collected, and calculations are performed to determine the IRT parameters of the questions and whether each question is suitable for a CAT or statistically valid test. These sub-steps are fully described below for an embodiment in FIG. 3.
At block 106, results of preceding statistical analyses and calibration are reviewed to determine which questions are not suitable. If particular questions seem unsuitable, possible reasons are determined. Optionally, a report containing the results of this post-calibration analysis may be generated.
At block 107, a generated report is given to the user. The report contains two types of information: first, a report concerning the overall calibration, and then several sub-reports concerning questions that showed a potential problem. Here, a user sees the report and decides on further action. Typical "global" actions are: having more sample candidates take the test; automatically removing the bad questions and creating a questionnaire from the remainder; and reviewing the per-question reports. Reviewing the questions may still imply making a later choice between the first two global actions, above. The last global action, above, may not be proposed to the user if the results show that the calibration globally failed. The user may be provided additional advice for that choice. At block 108, an evaluation for global failure of the calibration is made. If global failure occurred, more sample candidates should take the test. If there is no global failure and the user wants to review the questions, the system proceeds to block 109.
At block 109, a per-question review shows the problem that appeared for some of the questions, the possible causes for the problem, and the recommended actions in each case. Possible "per-question" actions include: removing the question; modifying the key (the correct answer to the question); and modifying the question (rephrasing it, for instance). If a user chooses to modify some of the questions or to add new questions (see block 110, FIG. 1A), the questionnaire may be recalibrated with new sample candidates. If the user modifies only the keys (correct answers) of some questions (see block 111, FIG. 1A), the calibration may be restarted and the statistical analysis continued with the same data.
Block 112 is reached only when no question was modified; possibly, some questions (those that showed a problem) were removed.
Finally, at block 113, the questionnaire is a CAT test or a statistically validated questionnaire. It may now be used in a CAT test-taking system, or in a sequential test-taking system or in other types of systems. In the case of a CAT test, the IRT parameters of each question, as calculated at block 105, may be imported into a CAT test-taking system.
FIG. 1B illustrates how questions are managed according to an embodiment. In FIG. 1B, an author creates questions at blocks 130, 135 and 140. The genesis of the questions may be direct, indirect, or by modification. Logically, questions have two states: calibrated and non-calibrated. In practice, this may be a Boolean flag stored with the question in a database. The only way a question can change from the non-calibrated to the calibrated state is through statistical analysis. Questions are created in a non-calibrated state, as illustrated by block 145. Questions are then calibrated by a statistical analysis at block 150, proceeding to a calibrated state, shown at block 160. Once calibrated, a question and its calculated IRT parameters may be used in a CAT questionnaire concerning the same field and designed for the same type of population as used for the calibration in block 150. If calibration at block 150 reveals problems with a question, at block 170, the question may be removed entirely, block 165, or modified. After any modification at blocks 175 or 185, a question is non-calibrated and should return to block 150 for calibration. Modification also invalidates all answers given to that question.
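The two-state question lifecycle of FIG. 1B can be sketched as a small data structure. This is an illustrative sketch only; the class and field names are assumptions, not taken from the specification:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Question:
    """Hypothetical record pairing a question with its calibration state,
    in the spirit of FIG. 1B: a Boolean flag stored with the question."""
    text: str
    key: str                                  # the correct answer
    calibrated: bool = False                  # non-calibrated by default
    irt_params: Optional[Tuple[float, float, float]] = None  # (a, b, c)

    def modify(self, new_text=None, new_key=None):
        """Any modification returns the question to the non-calibrated
        state and invalidates its previously calculated IRT parameters."""
        if new_text is not None:
            self.text = new_text
        if new_key is not None:
            self.key = new_key
        self.calibrated = False
        self.irt_params = None
```

Only the statistical analysis (block 150) would set `calibrated` back to True and fill in the IRT parameters.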
FIG. 2 is a description of sub-steps included in block 105 (see FIG. 1A) according to an embodiment. The formulae and algorithm involved are detailed below.
Initializations, including initialization of the discrete distribution of the levels (i.e., the proficiency variable), are performed at block 200. To set the scale of levels, this distribution is set to a standard Gaussian distribution (mean = 0, variance = 1). At block 201, all of the data (the answers of all sample candidates to each question) are imported from a database.
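The discrete level distribution initialized at block 200 can be sketched as follows; the number of levels and the [-4, 4] range are illustrative assumptions, not values from the specification:

```python
import math

def level_distribution(n_levels=40, lo=-4.0, hi=4.0):
    """Discrete proficiency levels q_k with weights G(q_k) drawn from a
    standard Gaussian (mean 0, variance 1), normalized to sum to 1."""
    qs = [lo + (hi - lo) * k / (n_levels - 1) for k in range(n_levels)]
    ws = [math.exp(-q * q / 2.0) for q in qs]   # unnormalized density
    total = sum(ws)
    return qs, [w / total for w in ws]
```

The symmetric grid gives a discrete distribution with mean 0 and variance close to 1, matching the scale-setting described above.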
At block 202, initial values of the 3 IRT parameters ("a," "b" and "c") are calculated using standard statistics, as described in detail below. A process loop including blocks 203, 204 and 205 obtains precise values for the IRT parameters of the questions by iteratively calculating a Bayes modal estimate (or maximum marginal likelihood estimate) of the three IRT parameters "a," "b" and "c."
As described above, an initial estimate of the IRT parameters "a," "b" and "c" for a question is accomplished using a standard statistical analysis of the answers given by candidates to this question at block 202, FIG. 2. For example, consider a particular question posed to candidates. Let "p" be the proportion of candidates who answered this question correctly. Let "r" be the bi-serial correlation between the total score of the candidates and the fact that they answered the particular question correctly (the score is simply computed as the proportion of questions answered correctly).
With the above, r is defined by

r = [ (1/N) Σ_{n=1..N} s_n x_n - p s̄ ] / ( σ_s sqrt(p(1 - p)) )

where N is the number of candidates that answered the question, s_n is the score of the nth candidate, s̄ and σ_s are the mean and standard deviation of the scores, and x_n is 1 if candidate number n answered the question correctly, 0 if not.

The IRT "c" parameter is defined as the reciprocal of the number of possible answers for this question.

The IRT difficulty parameter "d" is calculated to take the guessing into account:

d = (p - c) / (1 - c)   (10)

Then ρ, a corrected bi-serial correlation, is calculated:

ρ = r sqrt(d(1 - d)) / φ(z)   (11)

with z such that:

d = (1/sqrt(2π)) ∫_{-∞}^{z} e^(-t²/2) dt   (12)

where φ denotes the standard normal density. The initial value of "a" is

a = 1.702 ρ / sqrt(1 - ρ²)   (13)

The initial value of "b" is

b = -z / ρ   (14)

Further, "a" is bounded at 0.85 and 3.4 and "b" is bounded at -3 and 3.

The above formulae are applied as written, with the exception of z, which is defined implicitly. To determine z, approximating the Gaussian integral with the trapezium method is preferred. Determining the Gaussian reciprocal function with more precision is not necessary, since this first calculation of the IRT parameters is only approximate.
Referring again to the embodiment described in FIG. 2, the general algorithm used for the Bayes modal estimate in blocks 203-205 is an EM (Expectation-Maximization) algorithm. The first part of this algorithm is the estimation step at block 203. Here, an estimation of the number of candidates in each level is calculated, as well as the proportion of candidates in each level that answered each question correctly. The second part of the EM algorithm is the maximization step at block 204. Here, the parameters "a," "b" and "c" are calculated to maximize a complete likelihood function. Sub-steps in blocks 203-204 are detailed below.
Block 205 is a condition to end the EM algorithm. If the changes in the parameters calculated at block 204 are less than a test value, or if the maximum number of loops for the algorithm was reached, the program exits the EM algorithm and proceeds to block 206. Further detail regarding block 205 is set forth below after an explanation of the preceding blocks. At block 206, a very accurate estimation of the level of the sample candidates is calculated, using the IRT model and information function as in a CAT test-taking system.
At block 207, standardized residuals are calculated to determine whether the question truly fits the model. At block 208, an answer/level correlation is calculated for each proposed answer of each question. Preferably, the answer/level correlation is the bi-serial correlation between the estimated proficiency of the candidate and the fact that the candidate gave a particular answer to a given question or not. An endorsement rate calculated at block 209 is the proportion of candidates that gave a particular answer to a given question. Preferably, an endorsement rate is calculated for each proposed answer to each question.
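The per-answer diagnostics of blocks 208 and 209 can be sketched as follows. The function names are illustrative, and a point-biserial correlation is used as a simple stand-in for the bi-serial variant the description prefers:

```python
import math

def endorsement_rates(chosen_answers, options):
    """Endorsement rate (block 209): the proportion of candidates that
    gave each particular answer to a given question."""
    n = len(chosen_answers)
    return {opt: sum(1 for a in chosen_answers if a == opt) / n
            for opt in options}

def answer_level_correlation(chosen_answers, levels, option):
    """Answer/level correlation (block 208): point-biserial correlation
    between estimated proficiency and the indicator of giving a
    particular answer."""
    n = len(levels)
    x = [1.0 if a == option else 0.0 for a in chosen_answers]
    p = sum(x) / n
    if p in (0.0, 1.0):
        return 0.0          # degenerate: everyone (or no one) chose it
    mean_l = sum(levels) / n
    sd_l = math.sqrt(sum((l - mean_l) ** 2 for l in levels) / n)
    cov = sum(l * xi for l, xi in zip(levels, x)) / n - p * mean_l
    return cov / (sd_l * math.sqrt(p * (1.0 - p)))
```

A strongly negative correlation for the keyed answer, or a high endorsement rate for a distractor, would be the kind of symptom flagged in the per-question reports.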
As described above, the first part of an EM algorithm is an estimation step at block 203. Here, an estimation of the number of candidates in each level is calculated, as well as the proportion of candidates in each level that answered each question correctly. The following formulae and algorithms are those applied during the sub-steps at block 203 of FIG. 2. The goal of the estimation step is to find an estimation of the number of candidates for each level q_k and the number of candidates who have a q_k level and who answered question j correctly.

Defining n_k^(s) as the estimation at step (s) of the number of candidates among the sample candidates that have q_k as level, and r_jk^(s) as the number of candidates that have q_k as level and that answered question j correctly, the estimation step includes calculating those values using the following formulae:

n_k^(s) = Σ_{i=1..N} G(q_k) L_i(q_k) / Σ_{m=1..K} G(q_m) L_i(q_m)

r_jk^(s) = Σ_{i=1..N} y_ij G(q_k) L_i(q_k) / Σ_{m=1..K} G(q_m) L_i(q_m)

where y_ij is 1 if candidate i answered question j correctly and 0 otherwise, G(q_k) is the current weight of level q_k, and

L_i(q_k) = Π_{j=1..J} P_j(q_k)^(y_ij) (1 - P_j(q_k))^(1 - y_ij)

is the likelihood of candidate i's answer pattern at level q_k.

As an optimization to make this calculation in a reasonable time, first calculate the intermediate values:

v_i = Σ_{m=1..K} G(q_m) L_i(q_m)   (18)

Then the calculation of n_k^(s) and r_jk^(s) becomes

n_k^(s) = Σ_{i=1..N} G(q_k) L_i(q_k) / v_i,   r_jk^(s) = Σ_{i=1..N} y_ij G(q_k) L_i(q_k) / v_i
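The estimation step can be transcribed directly; this is a sketch with illustrative names, following the posterior-weighted counting scheme described above:

```python
import math

def e_step(y, qs, g, params):
    """One E step (block 203): expected counts n_k of candidates at level
    q_k, and r_jk of correct answers to question j at level q_k.
    y[i][j] is 1/0; g[k] is the current weight of level q_k; params[j] is
    (a, b, c) for question j."""
    N, J, K = len(y), len(params), len(qs)
    P = [[c + (1 - c) / (1 + math.exp(-a * (q - b))) for q in qs]
         for (a, b, c) in params]
    n = [0.0] * K
    r = [[0.0] * K for _ in range(J)]
    for i in range(N):
        # likelihood of candidate i's answer pattern at each level
        like = []
        for k in range(K):
            L = g[k]
            for j in range(J):
                L *= P[j][k] if y[i][j] else (1.0 - P[j][k])
            like.append(L)
        v = sum(like)                 # normalizing constant v_i
        for k in range(K):
            w = like[k] / v           # posterior weight of level q_k
            n[k] += w
            for j in range(J):
                if y[i][j]:
                    r[j][k] += w
    return n, r
```

Each candidate contributes a total posterior weight of 1, so the n_k sum to the number of candidates.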
As described above, the second part of the EM algorithm is the maximization step at block 204 of FIG. 2. Here, the IRT parameters "a," "b" and "c" are calculated to maximize a complete likelihood function. The following formulae and algorithms are those applied during the step noted 204 in FIG. 2.
At block 204, a new estimation of the three IRT parameters for each question is calculated by numerically solving a set of equations. The Bayesian estimation includes solving a set of equations to find the maximum of a likelihood function (denoted L). For this, the point where the derivative of L is null is determined. In an embodiment, the set of equations can be split into J simple sets of three equations, each set corresponding to one question.

In an embodiment, then, for each question (viz., for each j) the three equations are:

dL/da_j = 0,   dL/db_j = 0,   dL/dc_j = 0   (21)

where the three variables are a_j, b_j and c_j; g_a is the prior distribution of the a parameter; g_b is the prior distribution of the b parameter; and g_cj is the prior distribution of the c_j parameter. The three formulae above represent the derivatives of the likelihood with respect to a_j, b_j and c_j, which are null at the point that is the maximum for the complete likelihood.
To simplify the equations above, functions are defined as:

la_jk^(s) = [(r_jk^(s) - n_k^(s) P_j(q_k)) / (P_j(q_k)(1 - P_j(q_k)))] dP_j(q_k)/da_j

lb_jk^(s) = [(r_jk^(s) - n_k^(s) P_j(q_k)) / (P_j(q_k)(1 - P_j(q_k)))] dP_j(q_k)/db_j   (22)

lc_jk^(s) = [(r_jk^(s) - n_k^(s) P_j(q_k)) / (P_j(q_k)(1 - P_j(q_k)))] dP_j(q_k)/dc_j

la_jk^(s), lb_jk^(s), lc_jk^(s) are functions of a_j, b_j and c_j.

La_j^(s) = Σ_k la_jk^(s) + d ln g_a(a_j)/da_j
Lb_j^(s) = Σ_k lb_jk^(s) + d ln g_b(b_j)/db_j   (23)
Lc_j^(s) = Σ_k lc_jk^(s) + d ln g_cj(c_j)/dc_j

La_j^(s), Lb_j^(s), Lc_j^(s) are functions of a_j, b_j and c_j.

The second derivatives are defined similarly:

laa_jk^(s) = d la_jk^(s)/da_j,   lbb_jk^(s) = d lb_jk^(s)/db_j,   lcc_jk^(s) = d lc_jk^(s)/dc_j
lab_jk^(s) = d la_jk^(s)/db_j,   lac_jk^(s) = d la_jk^(s)/dc_j,   lbc_jk^(s) = d lb_jk^(s)/dc_j   (24)

laa_jk^(s), lbb_jk^(s), lcc_jk^(s), lab_jk^(s), lac_jk^(s), lbc_jk^(s) are functions of a_j, b_j and c_j.

Laa_j^(s) = Σ_k laa_jk^(s) + d² ln g_a(a_j)/da_j²
Lbb_j^(s) = Σ_k lbb_jk^(s) + d² ln g_b(b_j)/db_j²
Lcc_j^(s) = Σ_k lcc_jk^(s) + d² ln g_cj(c_j)/dc_j²   (25)
Lab_j^(s) = Σ_k lab_jk^(s),   Lac_j^(s) = Σ_k lac_jk^(s),   Lbc_j^(s) = Σ_k lbc_jk^(s)

(the mixed terms carry no prior contribution, since each prior depends on a single parameter). Laa_j^(s), Lbb_j^(s), Lcc_j^(s), Lab_j^(s), Lac_j^(s), Lbc_j^(s) are functions of a_j, b_j and c_j.

(La_j^(s), Lb_j^(s), Lc_j^(s))^T is therefore the gradient of the function L^(s)(a_j, b_j, c_j), and

| Laa_j^(s)  Lab_j^(s)  Lac_j^(s) |
| Lab_j^(s)  Lbb_j^(s)  Lbc_j^(s) |
| Lac_j^(s)  Lbc_j^(s)  Lcc_j^(s) |

is the Hessian of the same function.
In a preferred embodiment, all the analytical formulae for the derivatives and second derivatives, above, are factorized for computational efficiency. The following paragraphs give explicit formulas of the terms involving the prior distributions (g_a, g_b and g_cj) used in equations (21) and in definitions (23) and (25).
Knowing that one is in step (s), dealing with question number j, and with level number k (called q_k), the following simplification of notation is used: write n instead of n_k^(s), r instead of r_jk^(s), a for the current value of a_j, b for the current value of b_j, c for the current value of c_j, and q for q_k, and define

E = e^(-a(q-b))

P = c + (1 - c)/(1 + E)

Thus, the following are the formulas that are applied to get the first and second derivatives of the likelihood function in a preferred embodiment.

la_jk^(s) = (b - q)[(nc - r)E + n - r] / ((cE + 1)(E + 1))

lb_jk^(s) = a[(nc - r)E + n - r] / ((cE + 1)(E + 1))   (26)

lc_jk^(s) = [(nc - r)E + n - r] / ((cE + 1)(c - 1))   (27)

For the second derivatives, let D denote the common factor

D = E[(rc - n) - 2(n - r)cE - (nc - r)cE²] / ((cE + 1)(E + 1))²

Then

laa_jk^(s) = (b - q)² D

lbb_jk^(s) = a² D   (28)

lcc_jk^(s) = {[((2r - nc)c - r)E + 2(r - n)c]E + r - n} / ((cE + 1)(c - 1))²   (29)

lab_jk^(s) = [(nc - r)E + n - r] / ((cE + 1)(E + 1)) + a(b - q)D

lac_jk^(s) = Er(b - q) / (cE + 1)²   (30)

lbc_jk^(s) = Era / (cE + 1)²

For efficiency, in a preferred embodiment a number of intermediate values are calculated once and the expressions are replaced in the above formulas. More precisely, the intermediate values are:

I_1 = b - q
I_2 = a(b - q) = a I_1
E = e^(-a(q-b)) = e^(a I_1)
P = c + (1 - c)/(1 + E)
I_3 = (a(b - q) - 1)c = (I_2 - 1)c
I_4 = (nc - r)E + n - r
I_5 = cE + 1
I_6 = 1/((cE + 1)(E + 1)) = 1/((E + 1) I_5)
I_7 = 1/((cE + 1)(c - 1)) = 1/((c - 1) I_5)
I_11 = D (the common factor defined above)
I_12 = Er/(cE + 1)² = Er/I_5²   (31)

Using these intermediate values, formulae (26) through (30) become:

la_jk^(s) = I_4 I_6 I_1,   lb_jk^(s) = I_4 I_6 a,   lc_jk^(s) = I_4 I_7   (32)

laa_jk^(s) = I_11 I_1²,   lbb_jk^(s) = I_11 a²,   lcc_jk^(s) = {[((2r - nc)c - r)E + 2(r - n)c]E + r - n} I_7²   (33)

lab_jk^(s) = I_4 I_6 + I_11 I_2,   lac_jk^(s) = I_12 I_1,   lbc_jk^(s) = I_12 a   (34)
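The first-derivative terms in the simplified notation above can be transcribed directly. As a check, la, lb and lc equal the partial derivatives of r ln P + (n - r) ln(1 - P) with respect to a, b and c; the function and variable names below are illustrative:

```python
import math

def score_terms(n, r, a, b, c, q):
    """First-derivative terms la, lb, lc for one question and one level,
    using the factorized 3PL expressions in the simplified notation:
    n is the expected number of candidates at level q, r the expected
    number of correct answers among them."""
    E = math.exp(-a * (q - b))
    u = (n * c - r) * E + n - r          # the shared numerator
    w = (c * E + 1.0) * (E + 1.0)        # the shared denominator
    la = (b - q) * u / w
    lb = a * u / w
    lc = u / ((c * E + 1.0) * (c - 1.0))
    return la, lb, lc
```

Summing these terms over the levels (and adding the prior terms) gives the gradient components La, Lb, Lc used by the iterative solvers below.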
In a case when E, above, is too close to 0, to avoid numerical difficulties in computations, preferred embodiments use different formulae. In a preferred embodiment, these modified formulae are used when P > 0.9999999, that is, when E is almost 0. These formulae are more stable, avoid getting NaN (Not a Number) by dividing zeros or infinities, and give more accurate results as E gets closer to 0. These formulae are an asymptotic development of the formulae (26) through (30).

la_jk^(s) = (n - r)(b - q),   lb_jk^(s) = (n - r)a,   lc_jk^(s) = (n - r)/(c - 1)   (35)

laa_jk^(s) = E(rc - n)(b - q)²,   lbb_jk^(s) = E(rc - n)a²,   lcc_jk^(s) = (r - n)/(c - 1)²   (36)

lab_jk^(s) = n - r,   lac_jk^(s) = Er(b - q),   lbc_jk^(s) = Era   (37)
For efficiency, intermediate values are calculated and the above expressions are replaced in the formulas. More precisely, the intermediate values are:

I_1 = b - q
I_2 = a(b - q) = a I_1
E = e^(-a(q-b))
P = c + (1 - c)/(1 + E)
I_3 = n - r   (38)
I_4 = (rc - n)E
I_5 = Er
I_6 = I_3/(c - 1) = (n - r)/(c - 1)

Using these intermediate values, formulas (35), (36) and (37) become:

la_jk^(s) = I_3 I_1,   lb_jk^(s) = I_3 a,   lc_jk^(s) = I_6   (39)

laa_jk^(s) = I_4 I_1²,   lbb_jk^(s) = I_4 a²,   lcc_jk^(s) = -I_6/(c - 1)   (40)

lab_jk^(s) = I_3,   lac_jk^(s) = I_5 I_1,   lbc_jk^(s) = I_5 a   (41)
In a case when E is too close to positive infinity, to prevent numerical difficulties in a computation, preferred embodiments use different formulae. These modified formulae are used when (P - c) < 0.0000001, that is, when E is almost infinite. These formulae are more stable, avoiding getting NaN (Not a Number) by dividing zeros or infinities. Moreover, they give more accurate results as E gets greater. The formulae are an asymptotic development of the formulae (26) through (30).

la_jk^(s) = (nc - r)(b - q)/(cE),   lb_jk^(s) = (nc - r)a/(cE),   lc_jk^(s) = (nc - r)/(c(c - 1))   (42)

laa_jk^(s) = -(nc - r)(b - q)²/(cE),   lbb_jk^(s) = -(nc - r)a²/(cE),   lcc_jk^(s) = ((2r - nc)c - r)/(c(c - 1))²   (43)

lab_jk^(s) = -(nc - r)(a(b - q) - 1)/(cE),   lac_jk^(s) = r(b - q)/(c²E),   lbc_jk^(s) = ra/(c²E)   (44)

In a preferred embodiment, intermediate values are calculated and replace the expressions in the above formulas for efficiency. More precisely, the intermediate values are:

I_1 = b - q
I_2 = a(b - q) = a I_1
E = e^(-a(q-b))
P = c + (1 - c)/(1 + E)
I_3 = nc - r   (45)
I_4 = 1/(c(c - 1))
I_6 = I_3/(cE)
I_7 = r/(c²E)

Using these intermediate values, formulas (42), (43) and (44) become:

la_jk^(s) = I_6 I_1,   lb_jk^(s) = I_6 a,   lc_jk^(s) = I_3 I_4   (46)

laa_jk^(s) = -I_6 I_1²,   lbb_jk^(s) = -I_6 a²,   lcc_jk^(s) = ((2r - nc)c - r) I_4²   (47)

lab_jk^(s) = -I_6 (I_2 - 1),   lac_jk^(s) = I_7 I_1,   lbc_jk^(s) = I_7 a   (48)
The formulae, above, allow a computer program to calculate the gradient and the Hessian of the likelihood function at any given point. Since the equation systems (21) cannot be solved explicitly, the gradient and Hessian are used to iteratively compute approximations of the solutions to the equations. Let (t) be the step of this iterative calculation.
The gradient method includes modifying the parameters using a fraction of the gradient as increment. This method is relatively slow but is the most stable for finding the maximum. Precisely, the formulas are:

a_j^(t+1) = a_j^(t) + K1 La_j^(s)(a_j^(t), b_j^(t), c_j^(t))
b_j^(t+1) = b_j^(t) + K1 Lb_j^(s)(a_j^(t), b_j^(t), c_j^(t))   (49)
c_j^(t+1) = c_j^(t) + K1 Lc_j^(s)(a_j^(t), b_j^(t), c_j^(t))

In preferred embodiments, K1 may be either 0.0005 or 0.00025.
In a preferred embodiment, the parameters are modified using a fraction of the increment used for a Newton-Raphson method. This method is more stable than the normal Newton-Raphson method and less stable than the gradient method, but it is faster than the gradient method and slower than the normal Newton-Raphson method.

Precisely, the formulas are given by the following. Let A be the Hessian of L^(s) taken at the point (a_j^(t), b_j^(t), c_j^(t)):

    | Laa_j^(s)  Lab_j^(s)  Lac_j^(s) |
A = | Lab_j^(s)  Lbb_j^(s)  Lbc_j^(s) |   (50)
    | Lac_j^(s)  Lbc_j^(s)  Lcc_j^(s) |

each element being taken at (a_j^(t), b_j^(t), c_j^(t)). Inverting A, using a transpose-of-the-cofactors method, the new values of "a," "b" and "c" are obtained using:

(a_j^(t+1), b_j^(t+1), c_j^(t+1))^T = (a_j^(t), b_j^(t), c_j^(t))^T - K2 A^(-1) (La_j^(s), Lb_j^(s), Lc_j^(s))^T   (51)

where K2 is a real number. In a preferred embodiment, K2 is 0.1.
The Newton-Raphson method includes solving the equations with an order-one approximation of the (La, Lb, Lc) vector, that is, an order-two approximation of L. This method is the least stable of the three but the fastest.

Precisely, the formulas are given by the following. Let A be the Hessian of L^(s) taken at the point (a_j^(t), b_j^(t), c_j^(t)):

    | Laa_j^(s)  Lab_j^(s)  Lac_j^(s) |
A = | Lab_j^(s)  Lbb_j^(s)  Lbc_j^(s) |   (52)
    | Lac_j^(s)  Lbc_j^(s)  Lcc_j^(s) |

each element being taken at (a_j^(t), b_j^(t), c_j^(t)). Inverting A, using a transpose-of-the-cofactors method, the new values of "a," "b" and "c" are obtained using:

(a_j^(t+1), b_j^(t+1), c_j^(t+1))^T = (a_j^(t), b_j^(t), c_j^(t))^T - A^(-1) (La_j^(s), Lb_j^(s), Lc_j^(s))^T   (53)
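The transpose-of-the-cofactors inversion and the Newton-Raphson update can be sketched for the 3x3 case; the function names are assumptions of this sketch:

```python
def solve3(A, g):
    """Solve A x = g for a 3x3 Hessian A via the transpose-of-the-
    cofactors (adjugate) inverse: x = adj(A) g / det(A)."""
    (a11, a12, a13), (a21, a22, a23), (a31, a32, a33) = A
    det = (a11 * (a22 * a33 - a23 * a32)
           - a12 * (a21 * a33 - a23 * a31)
           + a13 * (a21 * a32 - a22 * a31))
    # cofactor matrix, transposed (the adjugate)
    adj = [
        [a22 * a33 - a23 * a32, a13 * a32 - a12 * a33, a12 * a23 - a13 * a22],
        [a23 * a31 - a21 * a33, a11 * a33 - a13 * a31, a13 * a21 - a11 * a23],
        [a21 * a32 - a22 * a31, a12 * a31 - a11 * a32, a11 * a22 - a12 * a21],
    ]
    return [sum(adj[i][k] * g[k] for k in range(3)) / det for i in range(3)]

def newton_step(point, grad, hess, k2=1.0):
    """One (possibly damped) Newton-Raphson step: x <- x - k2 * H^-1 g.
    k2 = 1 gives the plain method; k2 = 0.1 the damped 'modified' one."""
    delta = solve3(hess, grad)
    return [x - k2 * d for x, d in zip(point, delta)]
```

The same helper serves both the modified (K2 = 0.1) and the plain (K2 = 1) Newton-Raphson phases.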
In a solution procedure, an iterative process is performed for each question to get an approximation of the solution to the equations (21). The criterion for convergence is the norm of the gradient vector taken at (a_j^(t), b_j^(t), c_j^(t)), noted N_j^(t):

(N_j^(t))² = (La_j^(s))² + (Lb_j^(s))² + (Lc_j^(s))²,   each term taken at (a_j^(t), b_j^(t), c_j^(t))   (54)
The iterative process is done in two nested loops. The outer loop will be called the "trial" loop, and the inner loop will be called the "phase" loop.

Before starting the loops, the current values of "a," "b" and "c" (which were obtained at the previous M step) are retained in variables called A0, B0, C0.
At each step of the inner loop, La_j^(s), Lb_j^(s), Lc_j^(s), Laa_j^(s), Lbb_j^(s), Lcc_j^(s), Lab_j^(s), Lac_j^(s) and Lbc_j^(s) are calculated. Then the Hessian determinant, which is the determinant of the A matrix defined above, is calculated:

H_j^(t) = [Laa_j^(s) Lbb_j^(s) Lcc_j^(s) + 2 Lab_j^(s) Lbc_j^(s) Lac_j^(s) - Laa_j^(s) Lbc_j^(s) Lbc_j^(s) - Lbb_j^(s) Lac_j^(s) Lac_j^(s) - Lcc_j^(s) Lab_j^(s) Lab_j^(s)] (a_j^(t), b_j^(t), c_j^(t))   (55)

If at this point the calculation failed (viz., one of the values La_j^(s), Lb_j^(s), ... or H_j^(t) is either infinite or not a number), the process is stopped for this question, moving on to the next question. If (N_j^(t))² is sufficiently small (less than a fixed small threshold), it is considered that the maximum of L is close and the process for this question is stopped.
In the other cases, the values of "a," "b" and "c" are modified using one of the three methods described above. The values of "a," "b" and "c" are then bounded to prevent divergence. The maximum for "b" is 3.5 and the minimum -3.5; the maximum for "a" is 4.3 and the minimum 0.43; the maximum for "c" is μ_c + 3.5σ_c and the minimum is μ_c - 3.5σ_c.
In a preferred embodiment, this inner loop is executed at most a thousand times for each trial. If, at the end of the thousand loops, there is no convergence, another trial is commenced.
In a preferred embodiment, there are at most 3 trials for a question during an M step. The first trial is a "normal" trial. The initial values taken for a_j^(t), b_j^(t) and c_j^(t) are A0, B0 and C0. The value for K1 is 0.0005. For the second trial, the initial values taken for a_j^(t), b_j^(t) and c_j^(t) are:

a_j^(t) = μ_a + 1.5 σ_a random1
b_j^(t) = μ_b + 1.5 σ_b random2   (56)
c_j^(t) = μ_c + 1.5 σ_c random3

where random1, random2 and random3 are three uncorrelated random values between 0 and 1. The value for K1 is 0.00025. For the third trial, the initial values taken for a_j^(t), b_j^(t) and c_j^(t) are:

a_j^(t) = μ_a + 2 σ_a random1
b_j^(t) = μ_b + 2 σ_b random2   (57)
c_j^(t) = μ_c + 2 σ_c random3

where random1, random2 and random3 are three uncorrelated random values between 0 and 1. The value for K1 is 0.00025.
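The trial start values can be sketched as a small helper; the function name and the packing of the means and deviations into a tuple are assumptions of this sketch:

```python
import random

def trial_start_values(a0, b0, c0, stats, trial):
    """Start values for each of the (at most three) trials described
    above. Trial 1 reuses the previous M-step values; trials 2 and 3 draw
    perturbed starts around the parameter means with uncorrelated uniform
    randoms in [0, 1), per formulas (56)-(57).
    stats = (mu_a, sigma_a, mu_b, sigma_b, mu_c, sigma_c)."""
    mu_a, s_a, mu_b, s_b, mu_c, s_c = stats
    if trial == 1:
        return a0, b0, c0
    spread = 1.5 if trial == 2 else 2.0
    return (mu_a + spread * s_a * random.random(),
            mu_b + spread * s_b * random.random(),
            mu_c + spread * s_c * random.random())
```

Restarting from a perturbed point gives the solver a fresh chance when a trial fails to converge within its thousand inner-loop iterations.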
The inner loop is called the "phase" loop because there can be different phases in the process that use different methods to estimate the parameters. In a preferred embodiment, there are 3 different phases: a first phase using the gradient method to calculate the parameters; a second phase using the modified Newton-Raphson method; and a third phase using the Newton-Raphson method. Note that during one trial, the program can switch several times to the same phase.

In a preferred embodiment, a trial starts with the first phase. The start values A0, B0 and C0 are saved in the variables A1, B1 and C1. Preferably, as the iterative process proceeds, a switch to the second phase and then to the third phase should occur to find accurate results more quickly. However, there is a complete branching system to detect whether one phase is converging enough, or diverging, in order to switch from one phase to the other.

A typical procedure for an embodiment follows. Once La_j^(s), Lb_j^(s), Lc_j^(s), Laa_j^(s), Lbb_j^(s), Lcc_j^(s), Lab_j^(s), Lac_j^(s), Lbc_j^(s), N_j^(t) and H_j^(t) are calculated, the formula to calculate a, b and c that corresponds to the current phase is applied. Then, tests are made to detect whether a change of phase should occur. If the tests show that the current phase is diverging, a switch is made to a previous phase and the current values of a, b and c are replaced by the corresponding saved values A1, B1, C1. The criterion for determining when a switch from one phase to the other should occur is also changed. If the tests show that the current phase converged, a switch is made to the next phase and the current values of "a," "b" and "c" are saved in A1, B1, C1. In the case when the determinant is too small to apply the Newton-Raphson method safely (normal or modified), a switch is made directly to the first phase, even before doing the calculations corresponding to the current phase.
Table 1, below, summarizes the phase branching system. In Table 1, Limit is a variable which is compared to the square norm of the gradient as a criterion to switch from one phase to the other. The initial value of Limit is 100 in a preferred embodiment. Count is the number of loops spent in the same phase. In Table 1, a stands for a_j^(t), b for b_j^(t), c for c_j^(t), H for H_j^(t) and N for N_j^(t).

TABLE 1
Referring now to block 205 of FIG. 2: at each step of the main iterative process, an E step is performed, and then an M step. At the end of the M step, all a, b and c parameters are compared with the values they had at the previous step. The maximum of the absolute values of these differences is termed the maximum change in the parameters. In a preferred embodiment, if this maximum change is less than 0.05, the calibration is terminated because an adequate estimation of the parameters is complete. Whatever the changes in the parameters, however, after 12 loops the EM calibration is terminated in a preferred embodiment because continuing further will not bring additional precision.
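The outer EM loop with the block-205 stopping rule can be sketched as follows; the step functions are placeholders for the E and M steps described above:

```python
def run_em(params, e_step_fn, m_step_fn, tol=0.05, max_loops=12):
    """Outer EM loop: stop when the largest absolute parameter change
    falls below tol (0.05 in a preferred embodiment), or after max_loops
    (12) iterations. params is a list of (a, b, c) tuples per question."""
    for loop in range(max_loops):
        stats = e_step_fn(params)               # E step (block 203)
        new_params = m_step_fn(params, stats)   # M step (block 204)
        max_change = max(abs(n - o)
                         for new, old in zip(new_params, params)
                         for n, o in zip(new, old))
        params = new_params
        if max_change < tol:                    # block 205 condition
            break
    return params, loop + 1
```

Any E/M step pair with this interface can be plugged in; the loop itself carries only the stopping logic.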
At block 206 of FIG. 2, an estimate of the level of the test candidates is determined. In an embodiment, the following formulas and algorithms are applied to arrive at the determination.

For computational efficiency, preferred embodiments first calculate intermediate values:

w_i(k) = G(q_k) Π_{j=1..J} P_j(q_k)^(y_ij) (1 - P_j(q_k))^(1 - y_ij)

w_i, considered as a function of k, is called the information function. First, this function is only calculated for each q_k. A first estimate of the level is calculated by taking the q_k for which w_i(k) is maximal.

In fact, the information function is a function of a continuous variable and can be defined as follows:

w_i(θ) = G(θ) Π_{j=1..J} P_j(θ)^(y_ij) (1 - P_j(θ))^(1 - y_ij)

To have more precision in the level of the candidate, suppose that the θ variable is continuous and use a Newton-Raphson method to find the maximum of the information function, which is the level of the candidate. For that purpose, the first and second differentials of the information function with respect to θ are calculated. Since only the differentials of the information function are needed, the denominator of the information function, which is a constant, is not needed.

Therefore, define I_i(θ) as the modified information function (the logarithm of the real information function without the denominator). Let It_i(θ) be the first differential and Itt_i(θ) be the second differential of I_i(θ).

I_i(θ) = ln(G(θ)) + Σ_{j=1..J} [ y_ij ln(P_j(θ)) + (1 - y_ij) ln(1 - P_j(θ)) ]   (63)
Defining:

itr_j(θ) = d ln(P_j(θ))/dθ
itw_j(θ) = d ln(1 - P_j(θ))/dθ   (64)
ittr_j(θ) = d² ln(P_j(θ))/dθ²
ittw_j(θ) = d² ln(1 - P_j(θ))/dθ²

Therefore,

It_i(θ) = dI_i(θ)/dθ = -θ + Σ_{j=1..J} [ y_ij itr_j(θ) + (1 - y_ij) itw_j(θ) ]   (65)

Itt_i(θ) = d²I_i(θ)/dθ² = -1 + Σ_{j=1..J} [ y_ij ittr_j(θ) + (1 - y_ij) ittw_j(θ) ]   (66)

(The -θ and -1 terms are the first and second derivatives of ln(G(θ)) for the standard Gaussian level distribution.)
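The Newton-Raphson refinement of a candidate's level, using It and Itt from formulas (65) and (66), can be sketched compactly. This omits the asymptotic special cases and the full branching of the description; names are illustrative:

```python
import math

def estimate_level(y, params, theta0=0.0, iters=20):
    """Newton-Raphson climb of the modified information function
    I(theta) = ln G(theta) + sum of per-question log-probabilities.
    y[j] is the 0/1 answer to question j; params[j] is (a, b, c)."""
    theta = theta0
    for _ in range(iters):
        it, itt = -theta, -1.0   # derivatives of ln G (standard normal)
        for y_j, (a, b, c) in zip(y, params):
            E = math.exp(-a * (theta - b))
            if y_j:   # itr, ittr: derivatives of ln P_j
                it += (1 - c) * a * E / ((E + 1) * (c * E + 1))
                itt += ((1 - c) * a * a * E * (c * E * E - 1)
                        / ((E + 1) ** 2 * (c * E + 1) ** 2))
            else:     # itw, ittw: derivatives of ln(1 - P_j)
                it += -a / (E + 1)
                itt += -a * a * E / (E + 1) ** 2
        step = it / itt
        theta -= step            # Newton-Raphson update on It(theta) = 0
        if abs(step) < 1e-8:
            break
    return theta
```

With identical questions, an all-correct pattern lands above the prior mean and an all-wrong pattern below it, as expected.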
To calculate all the derivatives and second derivatives, above, preferred embodiments use analytical formulas. The following shows how such formulas are factorized for computational efficiency in an embodiment.

Knowing that one is in step (s), dealing with candidate number i, and with question number j, the following simplifications of notation are applied: write θ instead of θ_i^(s), a for a_j, b for b_j, c for c_j, and

E = e^(-a(θ-b))

P = c + (1 - c)/(1 + E)

The following are the formulae that are normally applied to get the first and second derivatives of the information function.
A first case is when yy is 1.Here, only calculate itrj(s) and ittr. (s)
itr. ( _ (l -c)aE
(E + l)(cE + l)
(67)
( \\ --cc))aa2ΔEE((ccE -\) ittr™ = - (E + \)2(cE + \)2
For efficiency, intermediate values are calculated and these expressions are replaced in the above formulae. More precisely, the intermediate values are:

I₁ = 1 / [(E + 1)(cE + 1)]
   (68)
I₂ = (1 − c)·a·E / [(E + 1)(cE + 1)] = (1 − c)·a·E·I₁
Using these intermediate values, formulas (67) become:
itr_j^(s) = I₂
   (69)
ittr_j^(s) = I₂·a·(cE² − 1)·I₁
A second case is when y_ij is 0. Here, only itw_j^(s) and ittw_j^(s) are calculated:

itw_j^(s) = −a / (E + 1)
   (70)
ittw_j^(s) = −a²·E / (E + 1)²
For efficiency, intermediate values are calculated and these expressions are replaced in the above formulae. More precisely, the intermediate values are:

I₁ = 1 / (E + 1)
   (71)
I₂ = −a / (E + 1) = −a·I₁
Using these intermediate values, formulas (70) become:

itw_j^(s) = I₂
   (72)
ittw_j^(s) = I₂·a·E·I₁
When E is too close to 0, different formulae are used. These modified formulae are used when P > 0.9999999, that is, when E is almost 0. These formulae are more stable: they prevent NaN (Not a Number) results from divisions of zeros or infinities, and they give more accurate results as E gets closer to 0. These formulae are an asymptotic development of the formulae (63).
The first case is when y_ij is 1. Here, only itr_j^(s) and ittr_j^(s) are calculated:

itr_j^(s) = (1 − c)·a·E
   (73)
ittr_j^(s) = −(1 − c)·a²·E
For efficiency, intermediate values are calculated and these expressions are replaced in the above formulae. More precisely, the intermediate values are:
I₁ = (1 − c)·a·E   (74)
Using these intermediate values, formulas (73) become:
itr_j^(s) = I₁
   (75)
ittr_j^(s) = −a·I₁
The second case is when y_ij is 0. Here, only itw_j^(s) and ittw_j^(s) are calculated:

itw_j^(s) = −a
   (76)
ittw_j^(s) = −a²·E
When E is too close to positive infinity, different formulae are used. These modified formulae are used when (P − c) < 0.0000001, that is, when E is almost infinite. These formulae are more stable: they prevent NaN (Not a Number) results from divisions of zeros or infinities, and they give more accurate results as E gets greater. These formulae are an asymptotic development of the formulae (63).
The first case is when y_ij is 1. Here, only itr_j^(s) and ittr_j^(s) are calculated:

itr_j^(s) = (1 − c)·a / (cE)
   (77)
ittr_j^(s) = (1 − c)·a² / (cE)
For efficiency, intermediate values are calculated and these expressions are placed in the above formulas. More precisely, the intermediate values are:
I₁ = (1 − c)·a / (cE)   (78)
Using these intermediate values, formulas (77) become:
itr_j^(s) = I₁
   (79)
ittr_j^(s) = a·I₁
The second case is when y_ij is 0. Here, only itw_j^(s) and ittw_j^(s) are calculated:

itw_j^(s) = −a / E
   (80)
ittw_j^(s) = −a² / E
For efficiency, intermediate values are calculated and these expressions are placed in the above formulae. More precisely, the intermediate values are:
I₁ = −a / E   (81)
Using these intermediate values, formulas (80) become:
itw_j^(s) = I₁
   (82)
ittw_j^(s) = a·I₁
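The branch selection above can be gathered into a single routine. The following Python sketch is illustrative (the function name and structure are not from the source); it assumes c > 0 in the large-E branch, as implied by the pseudo-guessing prior.

```python
import math

def item_derivs(theta, a, b, c, y):
    """First and second derivatives of ln P (y=1) or ln(1-P) (y=0)
    for the three-parameter model, with the asymptotic branches used
    when E is extreme.  A sketch of formulas (67)-(82)."""
    E = math.exp(-a * (theta - b))
    P = (c * E + 1.0) / (E + 1.0)   # same as c + (1-c)/(1+E)
    if y == 1:
        if P > 0.9999999:           # E almost 0: formulas (73)-(75)
            i1 = (1.0 - c) * a * E
            return i1, -a * i1
        if P - c < 0.0000001:       # E almost infinite: formulas (77)-(79)
            i1 = (1.0 - c) * a / (c * E)   # assumes c > 0
            return i1, a * i1
        i1 = 1.0 / ((E + 1.0) * (c * E + 1.0))   # intermediate values (68)
        i2 = (1.0 - c) * a * E * i1
        return i2, i2 * a * (c * E * E - 1.0) * i1   # formulas (69)
    else:
        if P > 0.9999999:           # E almost 0: formulas (76)
            return -a, -a * a * E
        if P - c < 0.0000001:       # E almost infinite: formulas (80)-(82)
            i1 = -a / E
            return i1, a * i1
        i1 = 1.0 / (E + 1.0)        # intermediate values (71)
        i2 = -a * i1
        return i2, i2 * a * E * i1  # formulas (72)
```

The regular-branch results agree with numerical differentiation of ln P_j(θ) and ln(1 − P_j(θ)).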
A solution algorithm for an embodiment follows. For each question, an iterative process is used to calculate the estimation of the level of the candidate. This calculation is based on a Newton-Raphson method. Let (s) be the index of the current step. At each step, the values of the first and second derivatives of I_i(θ), called respectively It_i(θ) and Itt_i(θ), are calculated at the point θ_i^(s)
using the formulae of the previous section. Then the value of θ is updated by applying the formula:
θ_i^(s+1) = θ_i^(s) − It_i(θ_i^(s)) / Itt_i(θ_i^(s))   (83)
The criterion for ending this iterative process is the absolute value of It_i(θ_i^(s)). If this value is less than 10⁻⁷, the θ_i^(s) sequence has converged and the last θ_i^(s) is taken as the value of θ_i. If, after 20 steps, the absolute value of It_i(θ_i^(s)) is still greater than 10⁻⁷, the θ_i^(s) sequence is considered to have not converged and θ_i^(0) is taken for the value of θ_i.
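The Newton-Raphson loop can be sketched as follows; the function and parameter names are illustrative, and the stable asymptotic branches are omitted for brevity.

```python
import math

def estimate_level(responses, params, theta0=0.0, tol=1e-7, max_steps=20):
    """Newton-Raphson estimate of a candidate's level theta (formula (83)).
    responses: list of 0/1 answers y_ij; params: list of (a, b, c) triples.
    Returns theta0 if the sequence does not converge within max_steps."""
    theta = theta0
    for _ in range(max_steps):
        it, itt = -theta, -1.0          # contributions of ln G(theta)
        for y, (a, b, c) in zip(responses, params):
            E = math.exp(-a * (theta - b))
            if y == 1:                  # itr, ittr: formulas (67)
                d1 = (1 - c) * a * E / ((E + 1) * (c * E + 1))
                d2 = (1 - c) * a * a * E * (c * E * E - 1) / ((E + 1) ** 2 * (c * E + 1) ** 2)
            else:                       # itw, ittw: formulas (70)
                d1 = -a / (E + 1)
                d2 = -a * a * E / (E + 1) ** 2
            it += d1
            itt += d2
        if abs(it) < tol:               # |It_i| below threshold: converged
            return theta
        theta -= it / itt               # Newton-Raphson update (83)
    return theta0                       # not converged: fall back to theta_i^(0)
```

Because the Gaussian prior term contributes −1 to the second derivative, the iteration is well conditioned near the maximum.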
At block 207 of FIG. 2, a residual is calculated. In an embodiment, the following formulae and algorithms are applied.
To estimate how well the data fit the model, standardized residuals are used. With the IRT parameters having been calculated for each question, the probability P_j(θ) is completely defined (see equation (1)). A residual (noted η_j) is then calculated for each question. Let S_k be the number of sample candidates whose estimated level is q_k, and let r_jk be the proportion of those candidates who answered question j correctly. The residual is calculated using the formula:

η_j = √[ (1/K)·Σ_{k=1}^{K} S_k·(r_jk − P_j(q_k))² / (P_j(q_k)·(1 − P_j(q_k))) ]   (84)
A solution algorithm of an embodiment first calculates the intermediate values Sk. Then, equation (84) is applied by calculating Pj(qk) for each value of k.
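One plausible reading of this residual computation groups candidates by estimated level and compares each group's observed proportion correct with the model probability. The grouping and aggregation details in this sketch are assumptions for illustration; the names are not from the source.

```python
import math

def standardized_residual(y_col, theta_hat, levels, prob):
    """Aggregate standardized residual for one question.
    y_col: 0/1 answers of the N candidates to this question;
    theta_hat: their estimated levels (values among `levels`);
    prob(theta): the model probability P_j(theta)."""
    total, used = 0.0, 0
    for qk in levels:
        group = [y for y, t in zip(y_col, theta_hat) if t == qk]
        sk = len(group)                  # intermediate value S_k
        if sk == 0:
            continue
        r = sum(group) / sk              # observed proportion correct at level q_k
        p = prob(qk)
        total += sk * (r - p) ** 2 / (p * (1 - p))
        used += 1
    return math.sqrt(total / used) if used else 0.0
```

When the observed proportions match the model exactly, the residual is 0; large values flag questions whose data do not fit the model.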
At block 208 of FIG. 2, a bi-serial correlation is determined. The answer/level bi-serial correlation calculation is an extra statistical analysis used
to determine how well a particular answer is correlated to the level of the candidate. Normally, the right answer's correlation should be the greatest and positive. Ideally, the other answers' correlations should all be negative. Those values are used to detect irrelevant questions or keying errors (when the right answer was not set correctly). In an embodiment, the following formulae and algorithms are applied.
Taking question number j, all the answers given by the sample candidates for that question are considered. They are sorted into classes of "equivalent" answers. Let n_j be the number of classes of answers for the jth question. Then z_ijn may be defined, which equals 1 if candidate i gave an answer of the nth class for question number j, and 0 otherwise. The item/level correlation for question j and answer n is defined by:
c_jn = Σ_{i=1}^{N} (z_ijn − z̄_jn)(θ_i − θ̄) / √[ Σ_{i=1}^{N} (z_ijn − z̄_jn)² · Σ_{i=1}^{N} (θ_i − θ̄)² ]   (85)
with
θ̄ = Σ_{i=1}^{N} θ_i / N
   (86)
z̄_jn = Σ_{i=1}^{N} z_ijn / N

A solution algorithm calculates the average value θ̄ of the level while calculating the level of each sample candidate. As well, the endorsement rate z̄_jn is calculated while building the classes of given answers. Then, equation (85) is applied for each class of answers for each question.
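Treating the answer/level correlation as the usual Pearson correlation between the 0/1 class indicator and the estimated levels (the denominator as the product of the two standard deviations is assumed here), the computation can be sketched as:

```python
import math

def answer_level_correlation(z_col, thetas):
    """Correlation between the indicator z_ijn (candidate i gave an answer
    of class n to question j) and the estimated levels theta_i.
    Returns 0.0 when the indicator column is constant."""
    n = len(z_col)
    zbar = sum(z_col) / n               # endorsement rate
    tbar = sum(thetas) / n              # average level
    num = sum((z - zbar) * (t - tbar) for z, t in zip(z_col, thetas))
    den = math.sqrt(sum((z - zbar) ** 2 for z in z_col)
                    * sum((t - tbar) ** 2 for t in thetas))
    return num / den if den else 0.0
```

A keying error shows up as the correct answer's correlation being low or negative while a distractor's correlation is high and positive.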
FIG. 3 is the schematic process of the global analysis of the calibration results and statistical analysis according to an embodiment. A preferred embodiment includes a series of conditional branches, corresponding to block
106 of FIG. 1A. Messages and recommended actions contained in FIG. 3 are detailed below for an embodiment, as are values for the tests at blocks 301, 303 and 304.
In FIG. 3, block 301 checks if any estimated "a," "b" or "c" parameter of any question is Not a Number (noted NaN). This occurs when a non-allowable numerical calculation occurs during the calculations (such as division of zeros or infinite values). At block 303, a condition for "too many questions showing a problem" for an embodiment is: the proportion of questions showing a problem is greater than 0.2. What is termed a "question showing a problem" is a question for which a report was generated, as described below. At block 304, questions showing no problem, as defined above, are classified in five groups according to difficulty levels. The intervals for the difficulty levels are, for an embodiment:

[-3; -0.8416[ first interval: low level
[-0.8416; -0.2533[ second interval: medium low level
[-0.2533; 0.2533] third interval: medium level
]0.2533; 0.8416] fourth interval: medium high level
]0.8416; 3] fifth interval: high level

These intervals correspond to an even partition of a Gaussian distribution into five parts, with the condition that the values are bounded by -3 and 3.
For each group, calculate:

M_i = Σ_{j | Q_j ∈ G_i} a_j

where G_i is group number i (i from 1 to 5) and Q_j is question number j (j from 1 to J). M_i can be interpreted as the total of the maximum information of group number i. At block 304, a condition for "Not enough discriminative questions for each level" is: ∃ i ∈ {1, 2, ..., 5} | M_i < 30.
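The grouping and the discriminativeness check at block 304 can be sketched as follows. Aggregating each group's information as the sum of its questions' discrimination parameters a_j is an assumption for illustration; the interval boundaries are the ones listed above.

```python
# Even five-way partition of the standard normal, bounded at +/-3 (block 304).
def group_index(b):
    """Difficulty group (0..4) of a question with level b, following the
    interval endpoints [-3;-0.8416[, [-0.8416;-0.2533[, [-0.2533;0.2533],
    ]0.2533;0.8416], ]0.8416;3]."""
    if b < -0.8416:
        return 0
    if b < -0.2533:
        return 1
    if b <= 0.2533:
        return 2
    if b <= 0.8416:
        return 3
    return 4

def discriminative_enough(questions, threshold=30.0):
    """Flag the 'not enough discriminative questions per level' condition:
    False if any group's total M_i falls below the threshold.
    `questions` is a list of (a, b) pairs."""
    m = [0.0] * 5
    for a, b in questions:
        m[group_index(b)] += a
    return all(mi >= threshold for mi in m)
```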
FIG. 4 is the schematic process of the per-question analysis of the calibration results and statistical analysis. It includes a series of conditional branches. This process is performed for each question and corresponds to the step 302 of FIG. 3. The messages and recommended actions of this figure are detailed below for an embodiment as are values for the tests at blocks 401 to
408 and 411 to 421.
The following tests and corresponding block numbers are those used during the process described in FIG. 4 for an embodiment. Test 401 means that for question number j, the answer that has the highest answer/level correlation
(c_jn) is not the answer the author entered as the correct answer. Test 402 means that for question number j, the highest answer/level correlation (c_jn) is negative. Remark that under "normal" circumstances this should never occur. Test 403 means that for question j, c_j > 0.4. Test 404 means that for question j, a_j < 0.51. Test 405 means that for question j, b_j < -3. Test 406 means that for question j, b_j > 3. Test 407 means that for question number j, the answer/level correlation (c_jn) of the right answer is less than 2 times the second highest answer/level correlation. Test 411 means that the answer that has the highest answer/level correlation (c_jn) is a partially correct answer (this can occur with multiple-response questions, for instance; remark that for the calibration, these answers were considered incorrect). Test 412 means that the answer that has the second highest answer/level correlation (c_jn) is a partially correct answer (this can occur with multiple-response questions, for instance; remark that for the calibration, these answers were considered incorrect). Tests 408 and 413 to 421 mean that for question j, η_j > 2.
Addendum 1: IRT Formulary
As detailed above, preferred embodiments include a three-parameter IRT model. The three parameters are: "a" referred to as the discrimination of the question; "b" referred to as the level of the question; and "c" referred to as the pseudo-guessing of the question.
The mathematical formula describing the three-parameter IRT model is:
P_j(θ) = c_j + (1 − c_j) / (1 + e^{−a_j(θ − b_j)})   (1)

where θ is the level of the candidate, and j is the index of the question, ranging between 1 and J, the number of questions.
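Transcribed into code, equation (1) becomes (the function name is illustrative):

```python
import math

def p3(theta, a, b, c):
    """Three-parameter IRT probability (equation (1)) that a candidate of
    level theta answers correctly a question with discrimination a,
    level b, and pseudo-guessing c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

The curve rises monotonically from the guessing floor c toward 1, passing through (1 + c)/2 at θ = b.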
Observed data come from people taking the questions in a test. N is the number of sample candidates that took the test. J is the number of questions.
The observed data, then, are the responses of the N candidates to the J questions, which are contained in an N by J matrix called Y. Thus, y_ij is 1 if candidate i answered question j correctly, and 0 if candidate i answered it incorrectly.
Each examinee has a level θ, which is missing data; θ is referred to as the latent variable. For the purpose of a statistical question calibration, the latent variable is considered to be a discrete variable that can take K known discrete values q_k, k ranging from 1 to K, with the q_k evenly distributed between -Max and +Max. Therefore, each θ_i can take any of the q_k values.
π_k is the probability that a candidate has q_k as his level, and π = (π₁, π₂, ..., π_K) is the distribution of the levels. To set the scale of the levels, π is taken as a normal Gaussian (with 0 as mean and 1 as standard deviation). As a result, the π_k won't be modified during the EM algorithm calculations. The distribution of the levels will sometimes be considered as a continuous variable; in that case, it
is called G(θ). This function is a Gaussian with 0 as mean and 1 as standard deviation.
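The discretization of the latent variable can be sketched as follows; the grid size K = 41 and Max = 3 are illustrative choices, not values taken from the source.

```python
import math

def level_grid(k=41, max_level=3.0):
    """Discrete levels q_k evenly spaced on [-Max, +Max] and standard
    normal weights pi_k, normalized to sum to 1."""
    qs = [-max_level + 2 * max_level * i / (k - 1) for i in range(k)]
    ws = [math.exp(-q * q / 2) for q in qs]     # unnormalized Gaussian
    s = sum(ws)
    return qs, [w / s for w in ws]
```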
Bayesian estimation of IRT parameters requires a prior distribution for each variable of the IRT model. A preferred embodiment uses the same distribution for all the "a" parameters of all the questions, the same distribution for all the "b" parameters, but a different distribution for each c_j parameter.
Preferred embodiments use a lognormal distribution for the IRT "a" parameter:

g_a(x) = (K_a / x)·e^{−(ln x − μ'_a)² / (2σ'_a²)}   (2)

where K_a is such that ∫ g_a(x)dx = 1. Note that an explicit value of K_a is not needed.
If μ_a is referred to as the mean for this distribution and σ_a the standard deviation, then μ'_a and σ'_a are defined by:

μ'_a = ln( μ_a² / √(μ_a² + σ_a²) )   (3)

σ'_a = √( ln(1 + σ_a²/μ_a²) )   (4)

In the above implementation, μ_a is 1.28 and σ_a is 0.2.
For the IRT "b" parameter, preferred embodiments use a normal distribution:

g_b(x) = K_b·e^{−(x − μ_b)² / (2σ_b²)}   (5)

where μ_b is the mean for this distribution, and σ_b the standard deviation. In the implementation, μ_b is 0 and σ_b is 2.
For the IRT "c" parameter, preferred embodiments use a normal distribution:

g_cj(x) = K'_cj·e^{−(x − μ_cj)² / (2σ_cj²)}   (6)

where μ_cj is the mean for this distribution, and σ_cj the standard deviation. For the "c" parameters, the distribution is different for each question; more precisely:
μ_cj = 1 / M_cj   (7)

σ_cj = K_c·μ_cj   (8)
where M_cj is the number of possible answers for question number j, and K_c is a constant (which is the same for all the questions). In the implementation, K_c is 0.25.
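As an illustrative sketch, the three priors can be evaluated up to their normalizing constants. The lognormal moment-matching conversion from (μ_a, σ_a) to (μ'_a, σ'_a) is a standard identity assumed here; the source does not spell it out.

```python
import math

def log_prior_a(a, mu_a=1.28, sigma_a=0.2):
    """Unnormalized log-density of the lognormal prior on 'a'.
    (mu_a, sigma_a) are the distribution's mean and standard deviation,
    converted to log-space parameters by moment matching (an assumption)."""
    mu_p = math.log(mu_a ** 2 / math.sqrt(mu_a ** 2 + sigma_a ** 2))
    s2_p = math.log(1.0 + (sigma_a / mu_a) ** 2)
    return -math.log(a) - (math.log(a) - mu_p) ** 2 / (2.0 * s2_p)

def log_prior_b(b, mu_b=0.0, sigma_b=2.0):
    """Unnormalized log-density of the normal prior on 'b'."""
    return -((b - mu_b) ** 2) / (2.0 * sigma_b ** 2)

def log_prior_c(c, m_answers, k_c=0.25):
    """Unnormalized log-density of the per-question normal prior on 'c',
    with mean 1/M_cj and standard deviation K_c/M_cj (formulas (7)-(8))."""
    mu = 1.0 / m_answers
    sigma = k_c * mu
    return -((c - mu) ** 2) / (2.0 * sigma ** 2)
```

Each prior peaks at its intended value: the "c" prior, for example, is maximal at 1/M_cj, the chance level for an M_cj-option question.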
Addendum 2: Post-calibration results and messages in FIGS. 3-4:
The following messages and recommended actions refer to the embodiment detailed in FIG. 3.
NOTE: Data are provided by the program to help a user to make decisions and to give advice about what to do. The interpretation of the data depends on the type of message and recommended action. It will be noted <data> in the following.
In addition, the advised number of sample candidates needed to perform calibration is always given. In the following, <additional number> will denote the number of additional sample candidates recommended for calibration.
Message 1 Questionnaire calibration failed. More sample candidates are needed to calibrate the questionnaire. Recommended Action 1 Have approximately <additional number> more sample candidates take the test.
Message 2 <data> percent of the questions showed a problem at calibration. Most probably the calibration failed because there were too few sample candidates. Recommended Action 2 Have approximately <additional number> more sample candidates take the test.
Message 3 Calibration succeeded, but figures show that there are too few questions for at least some levels of difficulty. This will result in a loss of efficiency if used in a
CAT. Recommended Action 3 First review per question reports. Then create new items as indicated in table <data> (Table shows an approximate number of questions needed for the different levels). Then restart a calibration.
Message 4 Calibration succeeded. <data> percent of the questions showed a problem and can be either corrected or removed. A questionnaire without those questions will still be suitable for a CAT. Recommended Action 4 Review per question reports. You can then use the remaining questions.
The following messages and recommended actions refer to FIG. 4.
NOTE: Data are provided by the program to help a user to make decisions and to give advice about what to do. The interpretation of the data depends on the type of message and recommended action. It will be noted <data> in the following.
Message 1 The calibration results are not relevant for this question. This can be due to a keying error. Candidates with high proficiency tend to give an answer that is not the one entered as the correct answer. The answer given by candidates with high proficiency is <data>.
However, this answer is partially correct, and this problem may occur without any keying error. Recommended Action 1 First, check if the answer given as the correct answer is really the right one. If there is no error, this might mean that the question is misleading. If you find what can be misleading, modify or rephrase the question. If not, this is probably because partially correct answers can be given to this question. If you can, modify the question in such a way that there are fewer possible partially correct answers. If you are not sure about what the problem is or if you don't want to recalibrate the questionnaire, remove this question.
Message 2 The calibration showed a problem. This can be due to a keying error. Candidates with high proficiency tend to give an answer that is not the one entered as the correct answer. The answer given by candidates with high proficiency is <data>. However, this answer is partially correct, and this problem may occur without any keying error. Recommended Action 2 First, check if the answer given as the correct answer is really the right one. If there is no error, this might mean that the question is misleading. If you find what can be misleading, modify or rephrase the question. If not, this is probably because partially correct answers can be given to this question. If you can, modify the question in such a way that there are fewer possible partially correct answers. If you are not sure about what the problem is or if you don't want to recalibrate the questionnaire, remove this question.
Message 3 The calibration results are not relevant for this question. This is apparently due to a keying error. Candidates with high proficiency tend to give an answer that is not the one entered as the correct answer. Actually, the answer given by candidates with high proficiency is <data>.
Recommended Action 3 Check if the answer given as the correct answer is really the right one. If there is no error, this means that the question is misleading. If you find what can be misleading, modify or rephrase the question. If not, or if you don't want to recalibrate the questionnaire, remove this question.
Message 4 The calibration shows a possible keying error.
Candidates with high proficiency tend to give an answer that is not the one entered as a correct answer. The answer given by candidates with high proficiency is <data>.
Recommended Action 4 Check if the answer given as the correct answer is really the right one. If there is no error, this means that the question is misleading. If you find what can be misleading, modify or rephrase the question. If not, or if you don't want to recalibrate the questionnaire, remove this question.
Message 5 The calibration results are not relevant for this question. This question must be misleading since candidates with high proficiency tend to give the right answer less often than candidates with a lower proficiency do. Recommended Action 5 If you find what in the question can be misleading, modify or rephrase it. If not or if you don't want to recalibrate the questionnaire, remove this question.
Message 6 The calibration shows that this question must be misleading since candidates with high proficiency tend to give the right answer less often than candidates with a lower proficiency.
Recommended Action 6 If you find what in the question can be misleading, modify or rephrase it. If not or if you don't want to recalibrate the questionnaire, remove this question.
Message 7 The calibration results are not relevant for this question. This is apparently due to a high guessing
parameter, which indicates that there is too high a probability (<data>) of guessing the right answer. Recommended Action 7 Remove this question or modify it in such a way that there are more plausible alternatives among the possible answers.
Message 8 The calibration shows a high guessing parameter for this question, which indicates that there is too high a probability (<data>) of guessing the right answer. Recommended Action 8 Modify this question in such a way that there are more plausible alternatives in the answers, or remove it.
Message 9 The calibration results are not relevant for this question. This is apparently due to a low discrimination, which indicates that candidates with high proficiency don't answer correctly significantly more often than candidates with a low proficiency (Ratio is <data>). Recommended Action 9 Remove this question.
Message 10 The calibration shows a low discrimination for this question, which indicates that candidates with high proficiency don't answer correctly significantly more often than candidates with a low proficiency (Ratio is <data>).
Recommended Action 10 Remove this question.
Message 11 The calibration results are not relevant for this question. This is apparently due to a very low difficulty level (the proportion of the candidates that answered correctly is indeed <data>).
Recommended Action 11 Remove this question or, if possible, have candidates with more extreme levels (both high and low level to prevent biases) take the test to recalibrate this question.
Message 12 The calibration shows a very low difficulty level for this question (the proportion of the candidates that answered correctly is indeed <data>).
Recommended Action 12 Remove this question or, if possible, have candidates with more extreme levels (both high and low level to prevent biases) take the test to recalibrate this question.
Message 13 The calibration results are not relevant for this question. This is apparently due to a very high difficulty level (the proportion of the candidates that answered correctly is only <data>). Recommended Action 13 Remove this question or, if possible, have candidates with more extreme levels (both high and low level to prevent biases) take the test to recalibrate this question.
Message 14 The calibration shows a very high difficulty level for this question (the proportion of the candidates that answered correctly is only <data>). Recommended Action 14 Remove this question or, if possible, have candidates with more extreme levels (both high and low level to prevent biases) take the test to recalibrate this question.
Message 15 The calibration results are not relevant for this question. This can be due to the following problem: candidates with high proficiency relatively often give an answer that is not the one entered as a correct answer. The second most given answer is <data>.
This may be because the question is misleading. However, this answer is partially correct, and this problem may occur even if the question is not misleading. Recommended Action 15 If you find what can be misleading, modify or rephrase the question. If not, this is probably because partially correct answers can be given to this question. If you can, modify the question in such a way that there are fewer possible partially correct answers. If you are not sure about what the problem is or if you don't want to recalibrate the questionnaire, remove this question.
Message 16 The calibration shows a problem for this question. Candidates with high proficiency relatively often give an answer that is not the one entered as the correct answer. The second most given answer is <data>. This may be because the question is misleading. However, this given answer is partially correct, and this problem may occur even if the question is not misleading. Recommended Action 16 If you find what can be misleading, modify or rephrase the question. If not, this is probably because partially correct answers can be given to this question. If you can, modify the question in such a way that there are fewer possible partially correct answers. If you are not sure about what the problem is
or if you don't want to recalibrate the questionnaire, remove this question.
Message 17 The calibration results are not relevant for this question. This is apparently because the question is relatively misleading. Candidates with high proficiency relatively often give an answer that is not the correct answer. The second most given answer is <data>. Recommended Action 17 If you find what can be misleading, modify or rephrase the question. If not or if you don't want to recalibrate the questionnaire, remove this question.
Message 18 The calibration shows that this question is relatively misleading. Candidates with high proficiency relatively often give an answer that is not the correct answer. The second most given answer is <data>. Recommended Action 18 If you find what can be misleading, modify or rephrase the question. If not, or if you don't want to recalibrate the questionnaire, remove this question.
Message 19 The calibration results are not relevant for this question. Probably, this question doesn't follow the general model (which assumes that the higher the proficiency of a candidate, the higher his probability of answering correctly).
Recommended Action 19 Remove this question.