Bloom’s 2 Sigma Problem

It’s fascinating to me, to what degree my perception of the world is based on vague ideas that float around at the borders of my consciousness. Every once in a while I end up noticing such an idea in my head, and decide to take a closer look. Every time, I discover that the idea is a distorted, oversimplified version of reality - like a message at the end of a very long game of broken telephone.

A recent example was the 10,000 hours myth that I discovered in Book Review: The Sports Gene. Today, it is Bloom’s 2 Sigma Problem.

You have probably heard some version of this factoid - students who receive 1:1 tutoring achieve 2 standard deviations (or 2 sigmas) above students who receive typical classroom instruction. This places the average tutored student above roughly 98% of students in a traditional classroom.
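To see where the 98% figure comes from, here is a minimal sketch that converts an effect size in standard deviations into a percentile. It assumes that classroom scores are normally distributed, which is the usual reading of the claim but still an assumption; the function name `percentile_for_effect` is made up for this illustration.

```python
from statistics import NormalDist


def percentile_for_effect(effect_size_sd: float) -> float:
    """Fraction of the comparison group scoring below a student who sits
    `effect_size_sd` standard deviations above the comparison-group mean.

    Assumes scores follow a normal distribution (an assumption, not a fact
    taken from the studies).
    """
    return NormalDist(mu=0.0, sigma=1.0).cdf(effect_size_sd)


two_sigma = percentile_for_effect(2.0)
print(f"2 sigma above the mean beats {two_sigma:.1%} of the class")
```

Under that assumption the result is about 97.7%, which is usually rounded to 98%. The same function can be applied to any other effect size.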

Consider a class of 30 students, graded on a curve so the average grade for the class is a C. The 2-sigma claim is that you can take a C student from that class, give them some 1:1 tutoring, and they will be getting A+’s on par with the top students in the class.

As far as educational interventions go, this is a sensational effect. It means that nearly every student has the potential to perform at the highest levels of achievement, if given the appropriate resources and attention. It puts the entire institution of education in a new light.

What are we actually talking about here?

The 2 Sigma Problem is the title of an article published by educational psychologist Benjamin Bloom in 1984 [1].

In the article Bloom presents some research that was done by his graduate students, and documented in their dissertations [2, 3]. In the research, students were randomly assigned to one of three conditions: Conventional group instruction, Mastery Learning group instruction, and tutoring.

What is mastery learning? Bloom describes the following process:

- The students receive the same instruction as in the conventional group.
- The class periodically takes a teacher-constructed assessment on the material in the class.
- Topics that were missed by a significant portion of the class are re-taught from a new perspective.
- Topics that were missed by smaller groups of students are addressed through other means - small group instruction, peer instruction, or having the students revisit the relevant portions of the textbook.
- Then another “parallel” (covering the same topics) assessment is administered. All students are expected to get to an 80% “mastery” threshold on the topic. If there are some students who still aren’t at mastery level, additional attention is given to those students.

The tutoring group followed the Mastery Learning model, but the instruction was delivered either 1 on 1 with a tutor, or in small groups of 2 to 3 students. Bloom notes that formative assessments were also given to this group, but very little time was needed for “corrective work”.

At the end of the study, an overall assessment was given. The results? The tutored group outperformed the conventional (control) group by 2 standard deviations, while the ML group outperformed it by approximately 1 standard deviation.

Bloom concludes:

The tutoring process demonstrates that most of the students do have the potential to reach this high level of learning. [1]

The “problem” that Bloom is talking about is: how do we replicate the amazing results of 1:1 instruction while facing the economic realities of public education?

Bloom proposes group Mastery Learning as one solution, along with several other group instruction techniques that he covers in his article. Bloom suggests that such instructional modifications can provide approximately 1.5 standard deviations of improvement within the group instructional model.

OK, but what are we actually talking about here?

The dissertation studies covered 4th, 5th and 8th grade classes (N~75 for each). The researchers came up with their own Probability and Cartography curricula. Each curriculum covered 11 periods of instruction over 3 weeks, organized as three weekly units. In each of the first two weeks, the students were taught for four 40-minute sessions, and a formative assessment was given on the 5th day.

In the third week, the students received 3 days of instruction, then a formative assessment on the 4th day. A summative test was given on the last day.
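As a quick sanity check, the schedule described above can be tallied to confirm that it adds up to the 11 periods of study. This is only a sketch of my reading of the schedule; it assumes that the third-week sessions were also 40 minutes long, which the description does not state explicitly.

```python
PERIOD_MINUTES = 40  # assumed to apply to every instruction day

# One entry per school day, following the schedule described in the text.
schedule = {
    1: ["instruction"] * 4 + ["formative"],
    2: ["instruction"] * 4 + ["formative"],
    3: ["instruction"] * 3 + ["formative", "summative"],
}

instruction_periods = sum(days.count("instruction") for days in schedule.values())
instruction_minutes = instruction_periods * PERIOD_MINUTES

print(f"{instruction_periods} periods, {instruction_minutes} minutes "
      f"(about {instruction_minutes / 60:.1f} hours)")
```

So the scheduled instruction for the whole experiment comes to a little over seven hours per curriculum.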

After each formative assessment, the students in the ML and tutor groups would get corrective instruction addressing the mistakes in their formative assessments, and take a second formative test covering the same material. If students failed to get a sufficient score (80% for the ML group, 90% for the Tutor group), the process would repeat.
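The corrective cycle described above is essentially a loop that runs until the student clears the threshold. Here is a minimal sketch of that loop. The function `corrective_loop`, the `max_rounds` cap, and the toy improvement model (each round closes half of the remaining gap to 100) are all made up for illustration and are not taken from the studies.

```python
from typing import Callable, Tuple


def corrective_loop(
    first_score: float,
    threshold: float,
    retest: Callable[[float], float],
    max_rounds: int = 5,  # hypothetical safety cap, not from the studies
) -> Tuple[float, int]:
    """Repeat corrective instruction plus a parallel retest until the score
    reaches the mastery threshold. Returns (final score, rounds used)."""
    score = first_score
    rounds = 0
    while score < threshold and rounds < max_rounds:
        rounds += 1
        score = retest(score)  # corrective instruction, then a parallel test
    return score, rounds


def toy_retest(score: float) -> float:
    """Made-up model: each round closes half of the remaining gap to 100."""
    return score + (100 - score) * 0.5


ml_result = corrective_loop(60, threshold=80, retest=toy_retest)
tutor_result = corrective_loop(60, threshold=90, retest=toy_retest)
print("ML group (80% threshold):", ml_result)
print("Tutor group (90% threshold):", tutor_result)
```

The point of the sketch is that the number of corrective rounds is not fixed in advance: it depends on the student's starting score and on the threshold, so a higher threshold can mean more rounds for the same student.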

The tutors were undergrad education majors enrolled in a private college. The school was a middle-income parochial school on the Southwest side of Chicago.

Issues with drawing conclusions from this study

Instructional time

I think you can probably see the first and most glaring issue with this experimental design. When and how are all these students getting corrective instruction? It’s not during the 40-minute classes given to the control group. The students in the ML and tutoring conditions are getting additional support from the tutors after their formative assessment every week.

Neither Bloom nor the dissertation I found reported how much additional instructional time the students in the intervention groups received. We can make some inferences from the formative test scores for each week. For the fourth grade, this supplemental tutoring brought the formative scores of the 20 students in the tutoring group up from 78 to 95, from 69 to 94, and from 57 to 90 at the end of each of the three weeks. [2]
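To make the size of these jumps explicit, here is a small sketch that computes the gain for each week from the fourth-grade scores quoted above. I am treating the numbers as points on a 100-point scale, which is my assumption about the units.

```python
# Fourth-grade tutoring group, formative scores before and after
# corrective instruction, as quoted from the dissertation [2].
pre_scores = [78, 69, 57]
post_scores = [95, 94, 90]

gains = [post - pre for pre, post in zip(pre_scores, post_scores)]

for week, (pre, post, gain) in enumerate(zip(pre_scores, post_scores, gains), start=1):
    print(f"Week {week}: {pre} -> {post} (gain of {gain} points)")
```

The gains are 17, 25 and 33 points. The initial scores drop each week while the gains grow, which suggests that the amount of corrective work was not trivial, although the scores alone cannot be converted into minutes of extra instruction.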

Is this measuring learning?

The formative assessments are administered at the end of each week, and the summative assessment on the last day of the three-week period.

Is this measuring learning? I think these measures are missing some fundamental pieces that we usually associate with learning:

Retention. The students take the last formative assessment on a Thursday, get corrective instruction, take a formative post-test, and then take the summative assessment Friday and do pretty well. Will they still get the same score in a week? At the end of the quarter? When the students have to rely on this material next year? Cramming before a test is one thing, but getting things to stick around long-term is a much more challenging problem that is not addressed in this experimental design.

Transfer. Bloom is the creator of “Bloom’s taxonomy”, which has to do with the level of complexity of tasks, and he does address this in his article:

In the tutoring studies reported at the beginning of this paper, it was found that the tutored students’ Higher Mental Process achievement was 2.0 sigma above the control students… It should be noted that in these studies higher mental processes as well as lower mental process questions were included in the formative tests used… [1]

So they used a mix of assessment items - some lower-level tasks that required simply recalling and applying knowledge, and higher-level tasks that required analyzing, synthesizing and creating new knowledge.

By the end of week 3, the students in the intervention groups would have practiced on 6 formative assessments. How similar were these items to the summative assessment used on the final day? Unfortunately, I could not find any examples of assessment items.

Due to the short duration of the intervention, and the small number of topics covered in such a time, I think it’s likely that the summative assessment items were similar to the formative assessment, and so it’s likely that students would be able to pattern-match items from the formative assessments they’ve seen before.

Another aspect of transfer is the ability to choose an appropriate strategy from many options. In this case, there were only 3 weeks’ worth of material, so students only had a few possible topics they could be tested on. This is much easier than several months’ worth of material, as in a final exam; or a cumulative test that covers several years of material, as in a standardized test or graduation exam.

It should be noted, also, that the control group only got 3 formative assessments, and received no feedback other than their overall grade, or possibly whether they got each item correct or not.

I am certain that the intervention would look significantly less effective if the summative tests were administered a month after the intervention, evaluated knowledge transfer by using test items that were significantly different from the formative assessments, and covered a wider range of topics. I think that would be a much more appropriate measure of learning.

Is the curriculum representative?

Cartography and probability were chosen in order to avoid students having prerequisite knowledge or expectations about the material. Here’s the list of the objectives for the probability unit:

- Distinguishes between certain, possible, and impossible events.
- Identifies the set of possible outcomes of an experiment.
- Identifies equally likely outcomes of an experiment.
- Identifies unequally likely outcomes of an experiment.
- Writes and interprets statements of probability in symbolic form.
- Collects data about the frequency of events and interprets the results.
- Applies basic rules of probability.
- Determines experimental probabilities.
- Determines probabilities of simple and compound events.
- Compares experimental probabilities with theoretical probabilities.
- Applies the multiplication principle to determine the number of possible outcomes of a situation. [2]

The math skills required seem to be basic arithmetic - counting, addition, multiplication. “Determines experimental probabilities” may involve fractions, decimals or percentages, but it’s also possible to design the curriculum and test items to work around that by choosing friendly numbers.

This was set up this way deliberately by the experimenters in order to remove some confounding factors from the experiment. However, this makes these results less applicable to a typical math class.

Let’s say you’re teaching a standard probability class and you find that your students are missing a conceptual understanding of percentages, or fractions, or decimals. You wouldn’t be able to re-teach that material between a formative test Friday and class on the following Monday.

Is the Control Representative?

The “control” is a parochial school in 1984. Quizzes are given weekly to assign grades, but these tests have no impact whatsoever on future instruction.

Bloom’s observations of schools of the time:

Observations of teacher interaction with students in the classroom reveal that teachers frequently direct their teaching and explanations to some students and ignore others. They give much positive reinforcement and encouragement to some students but not to others, and they encourage active participation in the classroom from some students and discourage it from others. The studies find that typically teachers give students in the top third of the class the greatest attention and students in the bottom third of the class receive the least attention and support.

The legacy of this paper, and Bloom’s work in general, was an emphasis on differentiation and formative assessment in classrooms. These concepts are prevalent in public schools and teacher education programs today. When one considers the 2-sigma figure, one has to remember that this research is from a different era.

Broken telephone, revisited

Modern research on 1 on 1 instruction, tutoring, the use of formative assessments, and approaches like mastery learning does find statistically significant positive effects, but as one might expect, the effect sizes are more modest - akin to bumping a C grade to a C+ or a B-.

And yet, the 2-sigma figure lives on. I think largely this is because this figure is now used for marketing increased testing, tutoring (human or AI based), personalization and remediation ed-tech, and other education products.

This marketing tends to disparage public education. The 2-sigma figure creates the illusion that education is a solved problem; that if we just gave each child one-on-one instruction, all children would achieve at the highest level; that it is due to the lack of investment of our society, the lack of quality of our schools, and the lack of abilities of our teachers that students are not getting the education they deserve.

And of course the latest iteration of an AI tutor is here to fix all that for a reasonable monthly payment.

I hope that I have convinced you that the 2-sigma figure is not real, at least in the way that it is often presented today. Learning with retention and transfer is far from a solved problem, and schools and teaching programs are approaching this problem in ever-more sophisticated ways. And this probably doesn’t need saying, but one should generally be suspicious of any education intervention that claims such an extraordinary effect size.

References