

May 5, 2020

Assessing Reliability and Validity

Key Takeaway

Through careful consideration of the strengths and limitations of the CTT and IRT models, test developers can implement valid, reliable and fair testing contexts that control for as much error as possible.

The respective roles of Classical Test Theory (CTT) and Item Response Theory (IRT)


Evaluating the level of competence of an individual within an entry-to-practice pathway requires measurement. Any measurement is inherently imprecise and includes error: the difference between an expected true value and the value obtained by a measurement instrument. Statistical models are used to determine how reliable and valid exams are despite the challenges of precise measurement that all exams face. The psychometric statistics from these models can tell us whether the exam data are reliable, and therefore discriminate among examinees of different competency levels, and whether results are valid, meaning the exam is aligned with its intended purpose.


What do reliability and validity mean?


A practical definition considers the reliability of the data produced by the assessment; it is inappropriate to think of the reliability of an assessment as if it will never change or as if it can appropriately evaluate all aspects of individual competence (1). Reliability of the data is expressed as a coefficient between 0 and 1, and reflects a complex relationship between:

a. the examinee’s true ability level (which might be thought of as a holistic level of competence)
b. their ability to achieve that level on any given task (true variations in their performance due to relevant factors or a context specific level of competence)
c. any error related to the assessment process itself (such as the difficulty of a question or the subjective ratings of an examiner) – which is itself composed of multiple elements (some known and some unknown)


Practically speaking, the reliability of assessment data is an indicator of how well it can distinguish between examinees of different competence levels; if everyone gets the same score, the data have no reliability. Conversely, if everyone gets a vastly different score in no discernible pattern, then the assessment is not tapping into the holistic competency of the examinees and so the data will also have low reliability. Fairness in exam administration demands that decisions about someone’s competence be based on reliable data.

Sometimes, we might expect multiple examinees to have very similar competence, and consequently very similar assessment scores. For example, more advanced trainees might all achieve similar and very high scores on an exam that tests very basic skills. In this case, although we can explain the low reliability of the data, the exam itself may serve a limited purpose. An example of a useful assessment that may produce data of low reliability is one aimed at ensuring a minimal level for maintenance-of-competence standards among practicing health professionals.

Ensuring an exam achieves an acceptable level of reliability is necessary before establishing validity.



The validity of assessment data describes how well the data support the intended purpose of the assessment (1). Critically, validity is not a constant property of an assessment (1). Instead, it is more appropriate to consider the sufficiency of the evidence available to support the validity of the assessment. For example, licensure exams ensure that examinees are safe practitioners and can proceed independently; this implies a great deal of trust. Any evidence supporting the claim that examinees who pass an exam, and are consequently licensed, are also safe can support the validity of that exam. Ideally, there are multiple sources of validity evidence, which can include data from the past, present and future.


How can we ensure reliability and validity?

To help better understand the quality of measurement with its inherent level of measurement error, we are guided by the principles of two major test theory models, namely Classical Test Theory (CTT) and Item Response Theory (IRT) (2).

Careful consideration of each model’s strengths and limitations, together with the options each offers for statistical analysis, helps guide the implementation of reliable test development, administration and scoring procedures. These test theory models are discussed further below.


Classical Test Theory (CTT)

CTT can be considered the simpler of the two models and, for this reason, the most commonly adopted and implemented approach within testing contexts (3). The fundamental premise of CTT rests upon its formula, namely,
X = T + E
Essentially, this formula proposes that every observed test score (X), such as the number of correct responses on a multiple-choice exam or the average of all ratings allocated by examiners in an OSCE, comprises two major components: true ability (T) and measurement error (E).
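To make this decomposition concrete, the short simulation below (using invented numbers, not real exam data) generates observed scores as X = T + E and estimates reliability as the share of observed-score variance attributable to true-score variance:

```python
import random
import statistics

random.seed(7)

# Illustrative simulation: each of 200 hypothetical examinees has a
# true score T; their observed score X is T plus random error E.
n = 200
true_scores = [random.gauss(70, 8) for _ in range(n)]    # T
errors = [random.gauss(0, 4) for _ in range(n)]          # E
observed = [t + e for t, e in zip(true_scores, errors)]  # X = T + E

# Reliability: proportion of observed-score variance due to true scores.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
# In expectation this is 8**2 / (8**2 + 4**2) = 0.8; the sample value varies.
```

Shrinking the error variance pushes reliability toward 1, which is exactly what well-designed items, standardized administration and examiner training aim to do.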

True ability can only ever be estimated, and it is always impacted to some extent by random errors of measurement within the test environment (2). The goal of any measurement environment, therefore, is to control for as much of this error as possible so that examinees’ observed scores represent their true ability as precisely as possible. To do this, several test administration strategies are often implemented, including designing test content well from the start, administering tests within highly standardized environments and having examiners complete rigorous training exercises to ensure fair and consistent scoring throughout the OSCE process.

CTT offers simple and useful statistical methods to help monitor the impact of error on the precision of test scores, as well as the overall quality of the test and its items. For instance, the Cronbach’s alpha reliability coefficient provides an indication of internal consistency across items; the difficulty index (p-value) indicates how easy or difficult an item is; and the discrimination index (point-biserial) helps monitor the extent to which each item distinguishes between high and low performers. The standard error of measurement (SEM) is another vital statistic: it indicates the spread of examinees’ observed scores around their true scores and therefore the confidence intervals within which scores can be interpreted. Lastly, CTT also offers more sophisticated techniques, such as Generalizability Theory (GT), which allows for the examination of error across various facets of the test context, for instance examiner rating behaviour in OSCEs (1,2).
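As a rough illustration of these CTT statistics, the sketch below computes the difficulty index, point-biserial discrimination, Cronbach’s alpha and the SEM for a small hypothetical 0/1 response matrix. All values are invented for illustration; operational analyses typically use dedicated psychometric software and corrected item-total correlations.

```python
import statistics

# Hypothetical response matrix (rows = examinees, columns = items);
# 1 = correct, 0 = incorrect. Invented data for illustration only.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
k = len(responses[0])                     # number of items
n = len(responses)                        # number of examinees
totals = [sum(row) for row in responses]  # total score per examinee

# Difficulty index (p-value): proportion of examinees answering correctly.
p_values = [sum(row[i] for row in responses) / n for i in range(k)]

# Discrimination (point-biserial): correlation of each item with the
# total score. (Operational analyses often exclude the item from the
# total before correlating, the "corrected" item-total correlation.)
def point_biserial(i):
    item = [row[i] for row in responses]
    mi, mt = statistics.mean(item), statistics.mean(totals)
    cov = sum((a - mi) * (b - mt) for a, b in zip(item, totals)) / n
    return cov / (statistics.pstdev(item) * statistics.pstdev(totals))

# Cronbach's alpha: internal-consistency reliability of the total score.
item_vars = [statistics.pvariance([row[i] for row in responses]) for i in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / statistics.pvariance(totals))

# Standard error of measurement: spread of observed scores around true
# scores, used to build confidence intervals around an examinee's score.
sem = statistics.pstdev(totals) * (1 - alpha) ** 0.5
```

A score's confidence interval then follows directly, e.g. observed score ± 1.96 × SEM for an approximate 95% interval.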

The biggest strength of a CTT test development approach lies in its simplicity as a test model. Because it rests on relatively weak assumptions that are easily met with real data and modest sample sizes, its use is easily justified. Furthermore, its simple and intuitive procedures allow results to be communicated to a broader audience (3). CTT-generated estimates, however, are confined to the test administration and sample for which they were generated. This sample dependency is the model’s greatest weakness, as it makes comparisons across contexts, such as tracking proficiency over time or establishing equivalent difficulty across test versions, far more complicated.


Item Response Theory (IRT)

In comparison to CTT, IRT is a relatively newer test model, but one that provides sophisticated techniques that address the limitations of CTT.

IRT is a family of models that describe the relationship between the underlying trait being measured (such as clinical competence), the properties of the individual test items and the examinees’ responses to those items (4). For instance, the simplest IRT model describes the probability of a correct response to a specific test item as a function of the item’s difficulty and the examinee’s ability level on the measured trait.
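This simplest model, the one-parameter logistic (Rasch) model, can be sketched in a few lines; the ability and difficulty values below are illustrative only:

```python
import math

# Rasch (1PL) model: the probability of a correct response depends only
# on the gap between examinee ability (theta) and item difficulty (b),
# both expressed on the same logit scale.
def p_correct(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# An examinee whose ability matches the item's difficulty succeeds
# half the time:
print(p_correct(0.0, 0.0))            # 0.5
# A stronger examinee (theta = 1) on the same item does better:
print(round(p_correct(1.0, 0.0), 3))  # 0.731
```

The fact that ability and difficulty enter the model only through their difference is what places examinees and items on a single shared scale.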

IRT estimates item difficulty and examinee ability along the same invariant scale (1,2). This property of invariance is what makes IRT so appealing: it overcomes the sample-dependent limitation of CTT and allows convenient and direct comparisons along the same scale. For instance, examinee proficiency can be compared directly over time, and test versions of different difficulty levels can be equated with ease.

A further benefit of this model, especially within the context of OSCEs, is its sub-models that account for, and adjust scores based on, various sources of error. For instance, the Many-Facets Rasch Model (MFRM) allows for the inspection of significantly lenient or stringent scoring behaviour at the level of the individual examiner, as opposed to the overall level as in CTT’s Generalizability Theory, and adjusts examinee scores accordingly (5).

The benefits of adopting an IRT approach, however, can only be realized once the suitability of the data for the model has been confirmed. The large sample sizes required, the more stringent data assumptions that must be met for accurate results and the technical statistical procedures involved constitute the major limitations of this model (1,2). Overcoming these limitations, however, offers a complementary test model approach to CTT.



The principles of CTT and IRT offer useful insight into the measurement of human abilities and its inherent error. Through careful consideration of the strengths and limitations of each model, test developers can implement reliable and fair testing contexts that control for as much of this error as possible.



1. Streiner, D.L., Norman, G.R., & Cairney, J. (2016). Health measurement scales: a practical guide to their development and use. Care Management Journals, 17(3).
2. De Champlain, A.F. (2010). A primer on classical test theory and item response theory for assessments in medical education. Medical Education, 44, 109-117. doi: 10.1111/j.1365-2923.2009.03425.x
3. Downing, S.M. (2003). Item response theory: applications of modern test theory in medical education. Medical Education, 37, 739-745. doi: 10.1046/j.1365-2923.2003.01587.x
4. Yang, F.M., & Kao, S.T. (2014). Item response theory for measurement validity. Shanghai Archives of Psychiatry, 26, 171-177. doi: 10.3969/j.issn.1002-0829.2014.03.010
5. Iramaneerat, C., Yudkowsky, R., Myford, C.M., & Downing, S.M. (2008). Quality control of an OSCE using generalizability theory and many-faceted Rasch measurement. Advances in Health Sciences Education, 13, 479-493. doi: 10.1007/s10459-007-9060-8

