Part 2, further comments on OfS grade-inflation report

2019-01-07

Update, 7 January: I am pleased to say that the online media article that I complained about in Sec 1 below has now been amended by its author(s), to correct the false attributions. I am grateful to Chris Parr for helping to sort this out.

In my post a few days ago (which I’ll now call “Part 1”) I looked at aspects of the statistical methods used in a report by the UK government’s Office for Students, about “grade inflation” in English universities. This second post continues on the same topic.

In this Part 2 I will do two things:

Set the record straight, in relation to some incorrect reporting of Part 1 in the specialist media.
Suggest a new statistical method that (in my opinion) is better than the one used in the OfS report.

The more substantial stuff will be the second bullet there (and of course I wish I didn’t need to do the first bullet at all). In this post (at section 2 below) I will just outline a better method, by using the same artificial example that I gave in Part 1: hopefully that will be enough to give the general idea, to both specialist and non-specialist readers. Later I will follow up (in my intended Part 3) with a more detailed description of the suggested better method; that Part 3 post will be suitable mainly for readers with more specialist background in Statistics.

1. For the record

I am aware of two places where the analysis I gave in Part 1 has been reported:

At https://www.researchprofessional.com/0/rr/he/agencies/ofs/2019/OfS-grade-inflation-analysis-not-fit-for-purpose–says-expert.html, an article entitled “OfS grade inflation analysis not fit for purpose, says expert”
At https://www.researchresearch.com/news/article/?articleId=1379083, which seems to be a straight copy of the same article (I have not checked in detail).

The first link there is to a paywalled site, I think. The second one appears to be in the public domain. I do not recommend following either of those links, though! If anyone reading this wants to know about what I wrote in Part 1, then my advice is just to read Part 1 directly.

Here I want to mention three specific ways in which that article misrepresents what I wrote in Part 1. Points 2 and 3 here are the more important ones, I think (but #1 is also slightly troubling, to me):

1. The article refers to my blog post as “a review commissioned by HE”. The reality is that a journalist called Chris Parr had emailed me just before Christmas. In the email Chris introduced himself as “I’m a journalist at Research Fortnight”, and the request he made in the email (in relation to the newly published OfS report) was “Would you or someone you know be interested in taking a look?”. I had heard of Research Fortnight. And I was indeed interested in taking a look at the methods used in the OfS report. But until the above-mentioned article came to my attention, I had never even heard of a publication named HE. Possibly I am mistaken in this, but to my mind the phrase “a review commissioned by HE” indicates some kind of formal arrangement between HE and me, with specified deliverables and perhaps even payment for the work. There was in fact no such “commission” for the work that I did. I merely spent some time during the Christmas break thinking about the methods used in the OfS report, and then I wrote a blog post (and told Chris Parr that I had done that). And let me repeat: I had never even heard of HE (nor of the article’s apparent author, which was not Chris Parr). No payment was offered or demanded. I mention all this here only in case anyone who has read that article got a wrong impression from it.
2. The article contains this false statement: “The data is too complex for a reliable statistical method to be used, he said”. The “he” there refers to me, David Firth. I said no such thing, neither in my blog post nor in any email correspondence with Chris Parr. Indeed, it is not something I ever would say: the phrase “data…too complex for a reliable statistical method” is a nonsense.
3. The article contains this false statement: “He calls the OfS analysis an example of Simpson’s paradox”. Again, the “he” in that statement refers to me. But I did not call the OfS analysis an example of Simpson’s paradox, either in my blog post or anywhere else. (And nor could I have, since I do not have access to the OfS dataset.) What I actually wrote in my blog post was that my own artificial, specially-constructed example was an instance of Simpson’s paradox, which is not even close to the same thing!

The article mentioned above seems to have had an agenda that was very different from giving a faithful and informative account of my comments on the OfS report. I suppose that’s journalistic license (although I would naively have expected better from a specialist publication to which my own university appears to subscribe). The false attribution of misleading statements is not something I can accept, though, and that is why I have written specifically about that here.

To be completely clear:

The article mentioned above is misleading. I do not recommend it to anyone.
All of my posts in this blog are my own work, not commissioned by anyone. In particular, none of what I’ll continue to write below (and also in Part 3 of this extended blog post, when I get to that), about the OfS report, was requested by any journalist.

2. Towards a better (statistical) measurement model

I have to admit that in Part 1 I ran out of steam at one point, specifically where — in response to my own question about what would be a better way than the method used in the OfS report — I wrote “I do not have an answer”. I could have and should have done better than that.

Below I will outline a fairly simple approach that overcomes the specific pitfall I identified in Part 1, i.e., the fact that measurement at too high a level of aggregation can give misleading answers. I will demonstrate my suggested new approach through the same, contrived example that I used in Part 1. This should be enough to convey the basic idea, I hope. [Full generality for the analysis of real data will demand a more detailed and more technical treatment of a hierarchical statistical model; I’ll do that later, when I come to write Part 3.]

On reflection, I think a lot of the criticism received by the OfS report since its publication relates to the use of the word “explain” in that report. And indeed, that was a factor also in my own (mentioned above) “I do not have an answer” comment. It seems obvious (to me, anyway) that any serious attempt to explain apparent increases in the awarding of First Class degrees would need to take account of a lot more than just the attributes of students when they enter university. With the data used in the OfS report I think the best that one can hope to do is to measure those apparent increases (or decreases), in such a way that the measurement is a “fair” one that appropriately takes account of incoming student attributes and their fluctuation over time. If we take that attitude (i.e., that the aim is only to measure things well, not to explain them) then I do think it is possible to devise a better statistical analysis, for that purpose, than the one that was used in the OfS report.

(I fully recognise that this actually was the attitude taken in the OfS work! It is just unfortunate that the OfS report’s use of the word “explain”, which I think was intended there mainly as a technical word with its meaning defined by a statistical regression model, inevitably leads readers of the report to think more broadly about substantive explanations for any apparent changes in degree-class distributions.)

2.1 Those “toy” data again, and a better statistical model

Recall the setup of the simple example from Part 1: Two academic years, two types of university, two types of student. The data are as follows:

2010-11
  University A           University B
    Firsts  Other          Firsts  Other
  h   1000      0        h    500    500 
  i      0   1000        i    500    500
2016-17
  University A          University B
    Firsts  Other          Firsts  Other 
  h   1800    200       h       0      0
  i      0      0       i     500   1500

Our measurement (of change) should reflect the fact that, for each type of student within each university, where information is available, the percentage awarded Firsts actually decreased (in this example).

Change in percent awarded firsts:
  University A, student type h:  100% --> 90%
  University A, student type i:   no data
  University B, student type h:   no data
  University B, student type i:   50% --> 25%
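
As a rough illustration of measuring at the lowest level of aggregation, here is a minimal Python sketch (not the R code that accompanies this post) that recomputes the stratum-level changes from the toy counts. The names `counts`, `first_rate` and `stratum_changes` are made up for this example.

```python
# Toy data from the tables above: (Firsts, Other) counts for each
# (year, university, student type) cell.
counts = {
    ("2010-11", "A", "h"): (1000, 0),
    ("2010-11", "A", "i"): (0, 1000),
    ("2010-11", "B", "h"): (500, 500),
    ("2010-11", "B", "i"): (500, 500),
    ("2016-17", "A", "h"): (1800, 200),
    ("2016-17", "A", "i"): (0, 0),
    ("2016-17", "B", "h"): (0, 0),
    ("2016-17", "B", "i"): (500, 1500),
}


def first_rate(year, university, student_type):
    """Proportion awarded Firsts in one cell, or None if the cell is empty."""
    firsts, other = counts[(year, university, student_type)]
    total = firsts + other
    return firsts / total if total > 0 else None


def stratum_changes():
    """Map each (university, student type) stratum to its pair of rates
    (2010-11, 2016-17), or to None when either year has no students."""
    changes = {}
    for university in ("A", "B"):
        for student_type in ("h", "i"):
            before = first_rate("2010-11", university, student_type)
            after = first_rate("2016-17", university, student_type)
            if before is None or after is None:
                changes[(university, student_type)] = None
            else:
                changes[(university, student_type)] = (before, after)
    return changes


for stratum, change in stratum_changes().items():
    label = "no data" if change is None else f"{change[0]:.0%} --> {change[1]:.0%}"
    print(stratum, label)
```

Only two of the four strata have students in both years, so only those two yield a measured change.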

This provides the key to specification of a suitable (statistical) measurement model:

measure the changes at the lowest level of aggregation possible;
then, if aggregate conclusions are wanted, combine the separate measurements in some sensible way.

In our simple example, “lowest level of aggregation possible” means that we should measure the change separately for each type of student within each university. (In the real OfS data, there’s a lower level of aggregation that will be more appropriate, since different degree courses within a university ought to be distinguished too — they have different student intakes, different teaching, different exam boards, etc.)

In Statistics this kind of analysis is often called a stratified analysis. The quantity of interest (which here is the change in % awarded Firsts) is measured separately in several pre-specified strata, and those measurements are then combined if needed (either through a formal statistical model, or less formally by simple or weighted averaging).

In our simple example above, there are 4 strata (corresponding to 2 types of student within each of 2 universities). In our specific dataset there is information about the change in just 2 of those strata, and we can summarize that information as follows:

in University A, student type h saw their percentage of Firsts reduced by 10% (from 100% to 90%);
in University B, student type i saw their percentage of Firsts reduced by 50% (from 50% to 25%).

That’s all the information in the data, about changes in the rate at which Firsts are awarded. (It was a deliberately small dataset!)

If a combined, “sector-wide” measure of change is wanted, then the separate, stratum-specific measures need to be combined somehow. To some extent this is arbitrary, and the choice of a combination method ought to depend on the purpose of such a sector-wide measure and (especially) on the interpretation desired for it. I might find time to write more about this later in Part 3.

For now, let me just recall what was the “sector-wide” measurement that resulted from analysis (shown in Part 1) of the above dataset using the OfS report’s method. The result obtained by that method was a sector-wide increase of 7.5% in the rate at which Firsts are awarded — which is plainly misleading in the face of data that shows substantial decreases in both universities. Whilst I do not much like the OfS Report’s “compare with 2010” approach, it does have the benefit of transparency and in my “toy” example it is easy to apply to the stratified analysis:

2016-17          Expected Firsts       Actual
                 based on 2010-11
  University A         2000             1800
  University B         1000              500
  ------------------------------------------
  Total                3000             2300

— from which we could report a sector-wide decrease of 700/3000 = 23.3% in the awarding of Firsts, once student attributes are taken properly into account. (This could be viewed as just a suitably weighted average of the 10% and 50% decreases seen in University A and University B respectively.)
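
The arithmetic behind that table can be sketched in a few lines of Python. This is only an illustration using the toy counts; the names `rate_2010`, `cohort_2016` and `stratified_summary` are invented here, and weighting by expected Firsts is one possible choice of weights among several.

```python
# 2010-11 proportion of Firsts in the two strata that have data in both years
# (values taken from the toy tables above).
rate_2010 = {("A", "h"): 1000 / 1000, ("B", "i"): 500 / 1000}

# 2016-17 cohort in the same strata: (number of students, actual Firsts).
cohort_2016 = {("A", "h"): (2000, 1800), ("B", "i"): (2000, 500)}


def stratified_summary():
    """Return (expected Firsts, actual Firsts, relative decrease, weighted average)."""
    # Expected Firsts: apply each stratum's 2010-11 rate to its 2016-17 cohort.
    expected = {s: rate_2010[s] * cohort_2016[s][0] for s in cohort_2016}
    actual = {s: cohort_2016[s][1] for s in cohort_2016}
    total_expected = sum(expected.values())
    total_actual = sum(actual.values())
    decrease = (total_expected - total_actual) / total_expected

    # The same figure, seen as an average of the stratum-level relative
    # decreases weighted by expected Firsts (an assumed choice of weights).
    per_stratum = {s: (expected[s] - actual[s]) / expected[s] for s in expected}
    weighted = sum(expected[s] * per_stratum[s] for s in expected) / total_expected
    return total_expected, total_actual, decrease, weighted


total_expected, total_actual, decrease, weighted = stratified_summary()
print(f"expected {total_expected:.0f}, actual {total_actual}, decrease {decrease:.1%}")
```

With weights proportional to expected Firsts, the weighted average of the 10% and 50% stratum decreases coincides with the 700/3000 figure.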

As before, I have made the full R code available (as an update to my earlier R Markdown document). For those who don’t use R, I attach here also a PDF copy of that: grade-inflation-example.pdf

2.2 Generalising the better model: More strata, more time-points

The essential idea of a better measurement model is presented above in the context of a small “toy” example, but the real data are of course much bigger and more complex.

The key to generalising the model will simply be to recognise that it can be expressed in the form of a logistic regression model (that’s the same kind of model that was used in the OfS report; but the “better” logistic regression model structure is different, in that it needs to include a term that defines the strata within which measurement takes place).

This will be developed further in Part 3, which will be more technical in flavour than Parts 1 and 2 of this blog-post thread have been. Just by way of a taster, let me show here the mathematical form of the logistic-regression representation of the “toy” data analysis shown above. With notation

u for providers (universities); u is either A or B in the toy example
t for type of student; t is either h or i in the toy example
y for years; y is either 2010-11 or 2016-17 in the toy example
\(\pi_{uty}\) for the probability of a First in year y, for students of type t in university u

the logistic regression model corresponding to the analysis above is

\(\log\left(\pi_{uty}\over 1-\pi_{uty}\right) = \alpha_{ut} + \beta_{uy}\).

This is readily generalized to situations involving more strata (more universities u and student types t, and also degree-courses within universities). There were just 4 stratum parameters \(\alpha_{Ah},\alpha_{Ai}, \alpha_{Bh}, \alpha_{Bi}\) in the above example, but more strata are easily accommodated.

The model is readily generalized also, in a similar way, to more than 2 years of data.
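
One way to read this model (a short derivation from the formula above, not something taken from the OfS report): within any single stratum, i.e., for fixed u and t, the stratum term \(\alpha_{ut}\) cancels when two years y and y' are compared, so that

\(\log\left(\pi_{uty'}\over 1-\pi_{uty'}\right) - \log\left(\pi_{uty}\over 1-\pi_{uty}\right) = \beta_{uy'} - \beta_{uy}\).

In words, the change measured for university u between years y and y' is a log odds ratio that compares like with like: students of the same type t in the same university u.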

For comparison, the corresponding logistic regression model as used in the OfS report looks like this:

\(\log\left(\pi_{uty}\over 1-\pi_{uty}\right) = \alpha_{t} + \beta_{uy}\).

So it is superficially very similar. But the all-important term \(\alpha_{ut}\) that determines the necessary strata within universities is missing from the OfS model.
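
To see what the missing strata do in the toy example, here is a hedged Python sketch of a "compare with 2010" calculation that adjusts for student type only, pooling the two universities. It is an assumption on my part that this simple calculation mirrors the analysis in Part 1; it does reproduce the 7.5% figure quoted earlier, but it is not a fit of the OfS logistic regression itself. The names `counts` and `type_only_excess` are hypothetical.

```python
# Toy data: (Firsts, Other) for each (year, university, student type) cell.
counts = {
    ("2010-11", "A", "h"): (1000, 0),
    ("2010-11", "A", "i"): (0, 1000),
    ("2010-11", "B", "h"): (500, 500),
    ("2010-11", "B", "i"): (500, 500),
    ("2016-17", "A", "h"): (1800, 200),
    ("2016-17", "A", "i"): (0, 0),
    ("2016-17", "B", "h"): (0, 0),
    ("2016-17", "B", "i"): (500, 1500),
}


def pooled(year, student_type):
    """Firsts and total students of one type in one year, pooled over universities."""
    firsts = sum(counts[(year, u, student_type)][0] for u in ("A", "B"))
    total = sum(sum(counts[(year, u, student_type)]) for u in ("A", "B"))
    return firsts, total


def type_only_excess():
    """Excess of actual over expected 2016-17 Firsts, as a share of all
    2016-17 students, when only student type is taken into account."""
    expected = actual = students = 0.0
    for student_type in ("h", "i"):
        firsts_2010, total_2010 = pooled("2010-11", student_type)
        firsts_2016, total_2016 = pooled("2016-17", student_type)
        expected += firsts_2010 / total_2010 * total_2016
        actual += firsts_2016
        students += total_2016
    return (actual - expected) / students


print(f"type-only excess: {type_only_excess():+.1%}")
```

Pooling across universities turns two within-stratum decreases into an apparent sector-wide increase, which is the aggregation pitfall that the stratum term is meant to avoid.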

I will aim to flesh this out a bit in a new Part 3 post within the next few days, if time permits. For now I suppose the model I’m suggesting here needs a name (i.e., a name that identifies it more clearly than just “my better model”!) Naming things is not my strong point, unfortunately! But, for now at least, I will term the analysis introduced above “stratified by available student attributes” — or “SASA model” for short.

(The key word there is “stratified”.)

Update, September 2021: Just to note that the “Part 3” never got written! As well as having too much else to do in 2019, I lost all confidence that any further work by me on this topic would actually influence anything.

To cite this entry: Firth, D (2019). Part 2, further comments on OfS grade-inflation report. Weblog entry at URL https://DavidFirth.github.io/blog/2019/01/07/part-2-further-comments-on-ofs-grade-inflation-report/

David Firth
