Test Theories: CTT and IRT

Tests are used in psychology as measuring instruments. As a rough analogy, just as we use a meter stick to measure length, we can use a test to measure intelligence, memory, or attention. One difference between the two is that tests are harder to construct and harder to administer.

Furthermore, just as a single measurement does not tell us the volume of an object, the administration of a single test does not allow us to give a diagnosis or propose an intervention. Tests are therefore an important part of an assessment, but they do not determine it on their own.

This is where the psychologist plays the most important role: he or she must somehow use the information obtained from the test, and other sources, to form a coherent assessment that leads to the planning of the intervention. In other words, it is when integrating the results from different sources that the quality of the professional is most evident. We are talking about an expertise that is achieved with knowledge, but also with years of experience.

A brief history of test theories

The origin of tests is often traced to the examinations carried out by Chinese emperors around 3000 BC, which were intended to assess the professional competence of the officials who were to enter their service. (1)

Modern tests have their origins in the testing carried out by Galton (1822-1911) in his laboratory. However, it was James Cattell who first used the term mental test, in 1890. Since these first tests did not prove to be very predictive of human cognitive ability, researchers such as Binet and Simon (1905) introduced cognitive tasks into their new scale to assess aspects such as judgment, comprehension, and reasoning.

The Binet scale opened a tradition of individually administered scales. Alongside cognitive tests, personality tests also saw major advances.

Why are test theories necessary?

As these advances accumulated, measurement theories (test theories) began to be developed that bear directly on tests as instruments. Psychometrics arose from the concern to build instruments that measure what we want them to measure, and do so with the least possible error. It requires any test or measurement instrument worthy of the name to be valid and reliable.

Recall that reliability is understood as the stability or consistency of the measurements when the measurement process is repeated. In other words, a test is more reliable the better it reproduces the same results when it is taken by two subjects, or by the same subject on different occasions, who have the same level of the trait being measured. Validity, on the other hand, refers to the degree to which empirical evidence and theory support the interpretation of the test scores.

There are two major test theories or approaches for analyzing and constructing this type of instrument: classical test theory (CTT) and item response theory (IRT).

Classical Test Theory (CTT)

This is the dominant theory in the construction and analysis of tests. The reason is that it is relatively easy to construct tests that meet the minimum requirements of this paradigm. It is also relatively easy to evaluate the test itself in terms of the properties already mentioned: reliability and validity.

It has its origins in the work of Spearman at the beginning of the 20th century. Then, in 1968, Lord and Novick reformulated this theory and paved the way for the new approach of IRT.

This theory is based on the classical linear model, proposed by Spearman. The model assumes that the score a person obtains on a test, called the empirical score and usually denoted X, is made up of two components. (2)

On the one hand, we find the subject’s true score on the test (V), and on the other, the error (e). It is expressed as follows: X = V + e.

To complete the theory, Spearman defined parallel tests as tests that measure the same thing but with different items.
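The linear model and the idea of parallel tests can be illustrated with a small simulation. This is only a sketch under stated assumptions: the normal distributions, the means and standard deviations, and the function names are all hypothetical choices made for illustration. It also relies on a standard CTT result that the text above does not state: under the model's usual assumptions, the correlation between two parallel forms estimates the reliability, that is, the share of observed-score variance due to true scores.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_parallel_forms(n_people=5000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate X = V + e for two parallel forms taken by the same people.

    Assumption: true scores and errors are normally distributed, and both
    forms share the same true score and the same error variance.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_people)]
    form_a = [v + rng.gauss(0.0, error_sd) for v in true_scores]
    form_b = [v + rng.gauss(0.0, error_sd) for v in true_scores]
    return true_scores, form_a, form_b


def theoretical_reliability(true_sd, error_sd):
    """Share of observed-score variance that is due to true scores."""
    return true_sd ** 2 / (true_sd ** 2 + error_sd ** 2)


true_scores, form_a, form_b = simulate_parallel_forms()
observed = pearson(form_a, form_b)
expected = theoretical_reliability(10.0, 5.0)
print(f"correlation between parallel forms: {observed:.3f}")
print(f"var(V) / var(X):                    {expected:.3f}")
```

With a sample this large, the two printed values should be close to each other, which is why parallel forms are useful in practice: the true score V is never observed directly, but the correlation between forms is.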

Limitations of the classical approach

The first limitation is that, within this theory, measurements are not invariant with respect to the instrument used. This means that if a psychologist were to assess the intelligence of three people with a different test for each one, the results would not be directly comparable. But why does this happen?

Well, the results of the three measuring instruments are not on the same scale: each test has its own scale. To compare, for example, the intelligence of several people who have been evaluated with different intelligence tests, it is necessary to transform the raw scores obtained from each test into a common scale.

The problem is that, by transforming the scores in this way, we assume that the normative groups used to build the scales of the different tests are comparable (same mean, same standard deviation), which is difficult to guarantee in practice. (1) The IRT approach represented a great advance in this respect: it makes it possible for results obtained with different instruments to be expressed on the same scale.

The second limitation of this approach is the lack of invariance of the properties of the tests with respect to the people used to estimate them. In the framework of CTT, the important psychometric properties of a test depend on the type of sample used to calculate them. This problem also finds a solution, at least partially, in the IRT approach.
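The score transformation discussed above can be sketched in a few lines. This is a minimal illustration, not a procedure taken from any particular test manual: the two tests, their normative means and standard deviations, and the target scale (mean 100, standard deviation 15, a convention often used for intelligence scores) are hypothetical values chosen for the example.

```python
def z_score(raw, norm_mean, norm_sd):
    """Express a raw score as a distance from the normative mean, in SD units."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return (raw - norm_mean) / norm_sd


def to_common_scale(raw, norm_mean, norm_sd, target_mean=100.0, target_sd=15.0):
    """Map a raw score onto a common scale using the test's normative group."""
    return target_mean + target_sd * z_score(raw, norm_mean, norm_sd)


# Two hypothetical tests with different raw-score scales.
person_1 = to_common_scale(30, norm_mean=25, norm_sd=5)  # took test A
person_2 = to_common_scale(66, norm_mean=60, norm_sd=6)  # took test B
print(person_1, person_2)
```

Both people end up with the same transformed score because each is one standard deviation above the mean of their own normative group. The comparison is only meaningful if the two normative groups are themselves comparable, which is exactly the assumption the text warns about.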

Item Response Theory (IRT)

Item response theory (IRT) was born as a complement to classical test theory. In other words, CTT and IRT can both be used to evaluate the same test and to assign a score or weight to each item, which in turn can lead to a different result for each person. IRT yields a much better-calibrated instrument, but it comes at a much higher cost and requires the participation of specialized professionals.

IRT has several assumptions, but perhaps the most important one tells us that any measurement instrument should be consistent with one idea: there is a functional relationship between the values of the variable that the items measure and the probability of getting those items right. This function is called the Item Characteristic Curve (ICC). So what do we assume?
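As an illustration of what such a function can look like, here is a sketch of one common choice, the three-parameter logistic model. The text above does not commit to a specific functional form, so the logistic shape is an assumption; the parameter names a (discrimination), b (difficulty), and c (pseudo-guessing) follow the usual convention, and theta stands for the person's level on the measured variable.

```python
import math


def icc_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under a three-parameter logistic ICC.

    theta: person's level on the measured variable
    a: discrimination (how steeply the curve rises around b)
    b: difficulty (location of the curve on the theta scale)
    c: pseudo-guessing (lower asymptote of the curve)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# The probability of success rises with theta for a fixed item.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(icc_3pl(theta, a=1.2, b=0.5, c=0.2), 3))
```

When theta equals b and c is zero, the probability of a correct answer is exactly 0.5, which is why b is read as the item's difficulty. Some formulations also include a scaling constant in the exponent; it is left out here for simplicity.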

Well, something that from the outside may seem very logical and that CTT does not evaluate. For example, the most difficult items would be those that only the most intelligent people answer correctly. On the other hand, an item that everyone answers correctly would not be useful because it would have no power to discriminate. In other words, it would not provide any kind of information. This is just a small sketch of the revolution that IRT proposes.
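The claim that an item everyone answers correctly provides no information can be made concrete with the item information function. The sketch below assumes a two-parameter logistic model, for which the information at a given level theta is a² · P · (1 − P); the model choice and the item parameters are assumptions made for illustration.

```python
import math


def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Item information under the 2PL model: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical items evaluated for a person of average level (theta = 0).
well_targeted = item_information(0.0, a=1.0, b=0.0)   # difficulty matches the person
far_too_easy = item_information(0.0, a=1.0, b=-10.0)  # almost everyone gets it right
print(well_targeted, far_too_easy)
```

The well-targeted item reaches the maximum information for its discrimination, while the far-too-easy item contributes almost nothing, because the probability of success is close to 1 for practically everyone.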

To better understand the differences between one measurement model and another, we can use José Muñiz’s table (2010) as a reference:

Table 1. Differences between CTT and IRT (Muñiz, 2010)

Aspects	CTT	IRT
Model	Linear	Non-linear
Assumptions	Weak (easy for the data to meet)	Strong (hard for the data to meet)
Invariance of measurements	No	Yes
Invariance of test properties	No	Yes
Scoring scale	Between 0 and the maximum score on the test	Unbounded (between -∞ and +∞)
Emphasis	Test	Item
Item-test relationship	Unspecified	Item characteristic curve
Description of items	Difficulty and discrimination indices	Parameters a, b, c
Measurement errors	Standard error of measurement common to the entire sample	Information function (varies by ability level)
Sample size	Can work well with samples of approximately 200 to 500 subjects	More than 500 subjects recommended

This is how the two test theories are related. Although they are almost contemporaneous, IRT was born as a response to the limitations of CTT. Even so, research still has a long way to go in this field of psychometrics.