Test Theories: CTT and IRT

Tests are used in psychology as measuring instruments. As a rough analogy, just as we use a ruler to measure length, we can use a test to measure intelligence, memory, or attention. One difference between the two is that tests are much harder to construct, and also harder to administer.

Furthermore, just as a single measurement does not allow us to talk about the volume of an object, the administration of a single test does not allow us to give a diagnosis or propose an intervention. Tests are therefore important for evaluation, but they do not determine it on their own.

This is where the psychologist plays the most important role: he or she must somehow use the information obtained from the test, and other sources, to form a coherent assessment that leads to the planning of the intervention. In other words, it is when integrating the results from different sources that the quality of the professional is most evident. We are talking about an expertise that is achieved with knowledge, but also with years of experience.

A brief history of test theories

The origin of tests is often traced to the examinations carried out by Chinese emperors around 3000 BC, which were intended to assess the professional competence of the officials who were to enter their service. (1)

Modern tests have their origins in the tests carried out by Galton (1822-1911) in his laboratory. However, it was James Cattell who first used the term "mental test", in 1890. Since these first tests did not prove to be very predictive of human cognitive ability, researchers such as Binet and Simon (1905) introduced cognitive tasks into their new scale to assess aspects such as judgment, comprehension, and reasoning.

The Binet scale opened a tradition of individually administered scales. Alongside cognitive tests, personality tests also saw major advances.

Why are test theories necessary?

Given all these advances, measurement theories (test theories) began to be developed that bear directly on tests as instruments. Psychometrics emerged from the concern to build instruments that measure what we want them to measure and do so with the least possible error. It requires any test or measurement instrument that claims to be such to be valid and reliable.

Recall that reliability is understood as the stability or consistency of measurements when the measurement process is repeated. In other words, a test is more reliable the more closely it reproduces the same results for two subjects who have the same level of the trait being measured, or for the same subject on different occasions. Validity, on the other hand, refers to the degree to which empirical evidence and theory support the interpretation of the test scores.

There are two major test theories or approaches for analyzing and constructing this type of instrument: classical test theory (CTT) and item response theory (IRT).

Classical Test Theory (CTT)

This is the dominant theory in the construction and analysis of tests. The reason: it is relatively easy to construct tests that meet the minimum requirements of this paradigm. It is also relatively easy to evaluate the test itself in terms of the properties mentioned above: reliability and validity.

It has its origins in the work of Spearman at the beginning of the 20th century. Then, in 1968, Lord and Novick reformulated this theory and paved the way for the new IRT approach.

This theory is based on the classical linear model, proposed by Spearman. The model assumes that the score a person obtains on a test, called the empirical score and usually denoted by the letter X, is made up of two components. (2)

On the one hand, we find the subject’s true score on the test (V), and on the other, the error (e). It is expressed as follows: X = V + e.

Spearman adds three assumptions to this theory:

  • First, the true score (V) is defined as the mathematical expectation of the empirical score: the average score a person would obtain on a test if they took it an infinite number of times.
  • Second, there is no relationship between the size of the true scores and the size of the errors affecting those scores.
  • Finally, measurement errors in one test are not related to measurement errors in a different test.

To complete the theory, Spearman defines parallel tests as tests that measure the same thing but with different items.
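The linear model and its assumptions can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the normal distributions, the mean of 50, the standard deviations, and the function names are all hypothetical choices, not part of Spearman's formulation. It generates X = V + e for two parallel forms taken by the same people and checks that the errors are unrelated to the true scores and to each other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def simulate_parallel_forms(n_people=5000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate X = V + e for two parallel forms (all values are assumed)."""
    rng = random.Random(seed)
    # True scores V: one per person, identical across both forms.
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_people)]
    # Errors e: drawn independently of V and independently for each form.
    errors_a = [rng.gauss(0.0, error_sd) for _ in range(n_people)]
    errors_b = [rng.gauss(0.0, error_sd) for _ in range(n_people)]
    # Empirical scores X = V + e on each form.
    form_a = [v + e for v, e in zip(true_scores, errors_a)]
    form_b = [v + e for v, e in zip(true_scores, errors_b)]
    return true_scores, errors_a, errors_b, form_a, form_b


if __name__ == "__main__":
    v, e_a, e_b, x_a, x_b = simulate_parallel_forms()
    print("corr(V, e) on form A:", round(pearson(v, e_a), 3))
    print("corr(e_A, e_B):      ", round(pearson(e_a, e_b), 3))
    print("corr(X_A, X_B):      ", round(pearson(x_a, x_b), 3))
```

With these assumed values, the correlation between the two parallel forms should come out close to the ratio of true-score variance to empirical-score variance, 100 / (100 + 25) = 0.8, which is the usual way of expressing reliability in the classical model.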

Limitations of the classical approach

The first limitation is that, within this theory, measurements are not invariant with respect to the instrument used. This means that if a psychologist were to assess the intelligence of three people with a different test for each one, the results would not be comparable. But why does this happen?

Well, the results of the three measuring instruments are not on the same scale: each test has its own scale. To compare, for example, the intelligence of several people who have been evaluated with different intelligence tests, it is necessary to transform the scores obtained directly from each test into a common scale.

The problem with this is that by transforming the scores into scales we assume that the normative groups on which the scales of the different tests were built are comparable (same mean, same standard deviation), which is difficult to guarantee in practice. (1) The new IRT approach represented a great advance in this respect: it makes it possible for the results obtained with different instruments to be on the same scale.
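A minimal sketch of this kind of transformation is shown below. The function names, the normative means and standard deviations, and the target scale (mean 100, standard deviation 15) are assumptions chosen for illustration only. Each direct score is first standardized against the normative group of its own test and then re-expressed on the common scale.

```python
def to_z(raw, norm_mean, norm_sd):
    """Standardize a direct score against its test's normative group."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (raw - norm_mean) / norm_sd


def to_common_scale(raw, norm_mean, norm_sd, target_mean=100.0, target_sd=15.0):
    """Re-express a direct score on a shared scale (assumed: mean 100, SD 15)."""
    return target_mean + target_sd * to_z(raw, norm_mean, norm_sd)


if __name__ == "__main__":
    # Hypothetical data: two people assessed with two different tests.
    person_1 = to_common_scale(raw=30, norm_mean=25, norm_sd=5)   # test A
    person_2 = to_common_scale(raw=65, norm_mean=50, norm_sd=10)  # test B
    print(person_1, person_2)
```

The two transformed scores can only be compared meaningfully if the two normative groups are themselves comparable, because each score is expressed relative to its own group.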

The second limitation of this approach is the lack of invariance of the properties of the tests with respect to the people used to estimate them. Within the CTT framework, the important psychometric properties of a test depend on the type of sample used to calculate them. This problem also finds a solution, at least partially, in the IRT approach.
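One way to see this sample dependence is through the classical reliability coefficient, commonly defined as the proportion of empirical-score variance that is due to true scores. The sketch below assumes that definition and uses made-up standard deviations. The same test, with the same amount of measurement error, looks less reliable in a homogeneous sample than in a heterogeneous one.

```python
def reliability(true_sd, error_sd):
    """Classical reliability: true-score variance over empirical-score variance."""
    true_var = true_sd ** 2
    error_var = error_sd ** 2
    return true_var / (true_var + error_var)


if __name__ == "__main__":
    # Hypothetical values: identical error SD, different spread of true scores.
    print("heterogeneous sample:", reliability(true_sd=10.0, error_sd=5.0))
    print("homogeneous sample:  ", reliability(true_sd=5.0, error_sd=5.0))
```

Nothing about the instrument changes between the two calls. Only the variability of the people in the sample does, and the estimated property changes with it.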

Item Response Theory (IRT)

Item response theory (IRT) was born as a complement to classical test theory. In other words, CTT and IRT can both be used to evaluate the same test and to establish a score or weight for each of its items, which in turn may give a different result for each person. IRT yields a much better-calibrated instrument; the problem is that this paradigm comes with a much higher cost and requires the participation of specialized professionals.

IRT has several assumptions, but perhaps the most important one tells us that any measurement instrument should be consistent with one idea: there is a functional relationship between the values of the variable that the items measure and the probability of getting those items right. This function is called the Item Characteristic Curve (ICC). So what do we assume?

Well, something that from the outside may seem very logical and that CTT does not evaluate. For example, the most difficult items would be those that only the most intelligent people answer correctly. On the other hand, an item that everyone answers correctly would not be useful because it would have no power to discriminate. In other words, it would not provide any kind of information. This is just a small sketch of the revolution that IRT proposes.
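As an illustration, the sketch below uses the three-parameter logistic function, one common way of writing an ICC; the article itself does not commit to a particular form, so this choice and the parameter values are assumptions. Here theta is the person's level on the measured variable, a is the item's discrimination, b its difficulty, and c the probability of a correct answer by guessing.

```python
import math


def icc(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct answer under a three-parameter logistic ICC."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    levels = [-2.0, 0.0, 2.0]  # hypothetical trait levels
    # A difficult item (high b): mostly answered correctly at high levels.
    print("difficult item:", [round(icc(t, a=1.5, b=1.5), 3) for t in levels])
    # A very easy item (very low b): almost everyone answers it correctly,
    # so it barely separates people with different trait levels.
    print("very easy item:", [round(icc(t, a=1.5, b=-6.0), 3) for t in levels])
```

For the difficult item the probability rises steeply between low and high levels, while for the very easy item it stays near 1 across the whole range, which is what it means for an item to have no power to discriminate.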

To better understand the differences between one measurement model and another, we can use José Muñiz’s table (2010) as a reference:

Table 1. Differences between CTT and IRT (Muñiz, 2010)

Aspect | CTT | IRT
------ | --- | ---
Model | Linear | Non-linear
Assumptions | Weak (easily met by the data) | Strong (difficult for the data to meet)
Invariance of measurements | No | Yes
Invariance of test properties | No | Yes
Scoring scale | Between 0 and the maximum score on the test | Between −∞ and +∞
Emphasis | Test | Item
Item-test relationship | Unspecified | Item characteristic curve
Description of items | Difficulty and discrimination indices | Parameters a, b, c
Measurement errors | Standard error of measurement, common to the entire sample | Information function (varies by skill level)
Sample size | Can work well with samples of approximately 200 to 500 subjects | More than 500 subjects recommended
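The "Measurement errors" row can be made concrete with a short sketch. It assumes the usual textbook formulas: in the classical model, the standard error of measurement is the score standard deviation times the square root of one minus the reliability, and in a two-parameter logistic model (a simplification with c = 0) the information of an item is a² · P · (1 − P), with the standard error at a given level being one over the square root of the summed information. The item parameters and function names are hypothetical.

```python
import math


def ctt_sem(score_sd, reliability):
    """Classical standard error of measurement: one value for the whole sample."""
    return score_sd * math.sqrt(1.0 - reliability)


def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct answer (c assumed 0)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Information contributed by one item at trait level theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def irt_standard_error(theta, items):
    """Standard error at theta from the summed information of all items."""
    total_information = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(total_information)


if __name__ == "__main__":
    print("CTT SEM (same for everyone):", ctt_sem(score_sd=15.0, reliability=0.91))
    items = [(1.0, -1.0), (1.5, 0.0), (1.0, 1.0)]  # hypothetical (a, b) pairs
    for theta in (-3.0, 0.0, 3.0):
        print("IRT SE at theta =", theta, ":", round(irt_standard_error(theta, items), 3))
```

Because the hypothetical items are centered around theta = 0, the test is most informative there, and the standard error grows toward the extremes. The classical figure, by contrast, is the same for every person in the sample.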

This is how the two test theories are related. Although they are almost contemporaneous, IRT was born as a response to the limitations or problems that can arise with CTT. Even so, research still has a long way to go in this field of psychometrics.


  1. Muñiz Fernández, J. (2010). Test theories: classical theory and item response theory. Papeles del Psicólogo.
  2. Binet, A., & Simon, T. (1905). New methods for diagnosing the intellectual level of the abnormal. L'Année Psychologique, 11, 191-244.
  3. Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. New York: Addison-Wesley.

2024-09-28