Let’s Look at The Figures

My best known work, fading citations and a bit of history

2023-01-05T00:00:00+00:00

This will be the second egocentric post in succession — for which, apologies in advance! (It’s a sign of age, I suspect.)

This year will see the 30th anniversary of the publication of my most-cited research paper, which developed a general method for bias reduction of maximum likelihood estimates. I’ll write a few notes below (section 2) about the paper’s history. But first, to the main reason for this blog post: something that seems odd in the recent citation data.

1. Why does citation growth seem to be slowing?

Here is today’s view of the paper in Google Scholar:

The graph there indicates that my 1993 paper is still cited — Google Scholar shows a steadily increasing annual count, reaching 494 citations in calendar year 2021. The count shown for 2022 is currently 491 citations, and I suppose that that figure might grow a bit in the early days of 2023 as Google Scholar catches up. But it does look as though the growth in citations is slowing down a bit.

Why does this seem odd? Well, I know from interacting with other researchers (at conferences for example) that usage of the method from my 1993 paper is still on the increase, as is the range of research disciplines in which it gets used. So the puzzle is: why does the Google Scholar citation count no longer seem to show sustained growth in the method’s use?

At one recent conference I attended, I met someone who told me that “Firth logistic regression” (as they called it) is now so standard in their field (a particular branch of medical research) that it often gets used without any explicit reference to my published work. For example, a published research article might say something like “we used Firth’s logistic regression” but then not cite my 1993 paper as the source of that method.

If that’s the reality, then I suppose I ought to be happy about it. I am guessing that the use of (for example) “Fisher information” and “Cox proportional hazards model” — each of which is often seen without reference to the source (Fisher 1925 or Cox 1972 respectively) — did not make Fisher or Cox very unhappy. If it turns out that I have published something in remotely the same league as those things — even if it’s in one of the lower divisions of the league! — then I should not be at all bothered about citation counts.

To see if there is any concrete evidence for what I had been told at the recent conference, I took a quick look today at the data for the current year, 2023. It’s actually only the 5th day of 2023 today, so there is not too much data — a big advantage! (While more data would likely yield a more reliable analysis, the fact is that much more data would have taken much more time to process, and that would have been prohibitive for me.)

The 2023 data, as at 5 January

Google Scholar records 7 citations of my 1993 paper so far, in the first 5 days of 2023.

But a Google Scholar search for any papers whose text includes the words “Firth” and “logistic” finds 10 papers published in 2023 to date. All ten of those papers do indeed use the method that was developed in my 1993 paper. But only two of the ten papers actually contain a citation to my published work (they both cite the 1993 paper, so they are included among the 7 mentioned above). The remaining eight papers are “non-citing”: they all describe their use of “Firth’s logistic regression” or “logistic regression with the Firth procedure” or suchlike, but with no reference provided to the published source of the method.

(The ten papers that were found in the just-mentioned search are listed here in search-results.txt, in case anyone is interested!)

To sum up this little data-collection exercise, then:

in total fifteen distinct papers were found that used my method;
seven of them cited my 1993 paper, and eight did not.
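The tally above is a simple inclusion-exclusion count, and it can be checked in a couple of lines of R. This is only a sketch: the variable names are invented for the illustration, and the counts are the ones reported above.

```r
## Counts as reported above (as at 5 January 2023)
cited_gs     <- 7    # papers that Google Scholar records as citing the 1993 paper
found_search <- 10   # papers found by the "Firth" + "logistic" text search
in_both      <- 2    # papers found by the search that also cite the 1993 paper

## Inclusion-exclusion: papers in either list, counted once each
distinct_papers <- cited_gs + found_search - in_both

## Papers that use the method but give no citation
non_citing <- found_search - in_both

c(distinct = distinct_papers, citing = cited_gs, non_citing = non_citing)
```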

Conclusion?

It’s a tiny amount of data that I have looked at here, but seemingly already enough to confirm what I had heard anecdotally at the recent conference: there appears to be plenty of published research that uses the method from my 1993 paper but without citing its source. Indeed the evidence just presented, small as it is, suggests that non-citations possibly outnumber formal citations at present.

There could perhaps be an interesting project for a data-science student in all this? For example, to look at citations of the famous Cox 1972 paper and the extent to which formal citation gets replaced by bare phrases such as “Cox proportional hazards model” or “Cox regression” or just “Cox model”. Such a project would demand/develop substantial skill in text processing, I think, and it might present some difficulties in terms of automated access to published works. But it could perhaps reveal some interesting patterns? (And similar patterns might even be found in other disciplines too?)
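To give a flavour of the text-processing side of such a project, here is a very rough sketch in R. Everything in it is an assumption made for illustration: the three snippets of text are made up, and the two regular expressions are crude proxies for "names the method" and "cites the source". A real study would need full texts and reference lists, and much more careful matching.

```r
## Hypothetical snippets of text, made up purely for illustration
texts <- c(
    "We fitted a Cox proportional hazards model (Cox, 1972).",
    "Survival was analysed by Cox regression.",
    "A Kaplan-Meier curve was plotted for each group."
)

## Does the text name the method by a bare phrase?
names_method <- grepl("\\bCox (proportional hazards|regression|model)\\b",
                      texts, perl = TRUE)

## Crude proxy for a formal citation: "Cox" followed closely by "1972"
## within the same sentence
cites_source <- grepl("\\bCox\\b[^.]{0,40}\\b1972\\b", texts, perl = TRUE)

## A text is "non-citing" if it names the method but gives no citation
data.frame(names_method, cites_source,
           non_citing = names_method & !cites_source)
```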

2. A bit of history

While writing the above I was reminded of various things connected with that 1993 paper. I’ll make a few notes here on some of them, in case anyone is interested.

Actually written in 1991

This I remember vividly. In May 1991 our first child was born; and her arrival was energising as well as exhausting. I wrote “Bias reduction of maximum likelihood estimates” in the late spring and summer of 1991. I think the sleep-deprived nights in that period might have caused me to think differently about things; I doubt that I would normally have had the confidence to write such a paper, at such speed. (I am normally very, very slow!) Our newly arrived daughter certainly inspired me to finish the work and submit it to Biometrika for publication, by the end of the summer.

And then the paper was rejected, quite quickly as I recall (a “desk rejection”). The Associate Editor did not like it. The reason was that my paper had some superficial similarities with other work from the 1980s and early 1990s, work on “modified profile likelihood” methods for eliminating nuisance parameters in statistical models. I knew about those things, and had deliberately steered away from writing about them in my paper. My paper was an altogether more modest piece of work than the hugely impressive modified profile likelihood literature, and it could in no way be viewed as an alternative to all that. To me this was absolutely clear: the notion of bias is not even invariant to re-parameterization of the model, after all. But the Associate Editor seemed not to appreciate this, and so the paper was rejected.

Rejection led to dejection, and for some time afterwards I left the paper in a drawer untouched. But eventually I did something that I would not normally do: I wrote humbly to the Editor of Biometrika to ask them to reconsider. I knew that my paper was as good as anything I could ever produce; and I also knew that Biometrika was where I wanted my best work to appear. Fortunately, the Editor understood the point about bias and reparameterization, and I was permitted to submit a revised version of the paper that clarified the relationship with modified profile likelihood. That second version was enough, thankfully, to persuade the Associate Editor to recommend acceptance of the paper — but only after a further revision to remove the parts that I had added about the (non-)relationship with modified profile likelihood. The paper eventually appeared in the journal about 18 months after it was first submitted.

Roots: potatoes!

The idea for the paper came out of a coffee-time question that I had been asked by Sue Lewis, who was my colleague at Southampton and an expert on design of experiments. Sue was doing some consulting work with a company (Dalgety PLC), who had conducted a small experiment relating to the storage of potatoes. The problem was that their logistic regression analysis (performed in GLIM) had produced some enormous parameter estimates and standard errors, and their question to me was whether there was something wrong with the design of the experiment. I was interested, and found that the problem was not with the design but with the analysis: the well-known phenomenon of separation was causing maximum likelihood estimates to be “at infinity”. This led me to think about penalizing the likelihood to avoid the “estimates at infinity” problem; and then I made a connection with bias reduction, which led to the more general method that was developed in the paper.

I never actually wrote anything about the potato experiment, unfortunately. Several years later, though, the same dataset — with analysis by my method — appeared in a nice paper that Sue Lewis published with other members of her experimental-design group. It’s the “potato packing example” in their 2006 paper:

Woods, DC, Lewis, SM, Eccleston, JA & Russell, KG (2006). Designs for generalized linear models with several variables and model uncertainty. Technometrics, 48(2), 284–292.

[A note for younger readers: GLIM was ground-breaking software, first developed in the 1970s under the leadership of J A Nelder and sponsored by the Royal Statistical Society. It was the first “object-oriented” system for interactive statistical modelling, and it led to a revolution in statistical practice. Today’s glm() and other such modelling functions in R, for example, are direct descendants of GLIM.]

Surprisingly wide reach?

The paper went largely unnoticed for about 10 years after its publication. But then it started to be used across a fairly broad array of research disciplines. This was largely due, I think, to the appearance in 2002 of an influential Statistics in Medicine paper by G Heinze and M Schemper from the University of Vienna — and due especially to their having published widely useable software to accompany their paper.

Heinze, G & Schemper, M (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21(16), 2409–2419.

Papers in similar vein appeared also in some other disciplines. A notable example is this 2005 Political Analysis article by C Zorn:

Zorn, C (2005). A solution to separation in binary response models. Political Analysis, 13(2), 157–170.

By 2012 my paper was being cited more than 100 times per year, which was both pleasing and unexpected. In 2013 I made a poster, for a departmental event at Warwick to showcase the applied impact of Statistics research. The poster (PDF available via the thumbnail link below) gave examples of several research fields in which the method from my paper had been used.

(The subliminal bar chart in the poster shows the Google Scholar citation counts for years 1994 to 2012.)

Most of the applications seen in other disciplines have been in the context of binary and multi-category regression models, especially logistic regression and multinomial-logit models. This is unsurprising, given the existence of accessible works of advocacy such as the papers by Heinze & Schemper (2002) and by Zorn (2005), mentioned above. In addition, logistic regressions (and similar) are a win-win context for estimation via the maximum penalized likelihood method of my paper. Finite estimates are guaranteed; and importantly, bias reduction is always accompanied in such contexts by variance reduction — so there is no trade-off between bias and variance. These features of the method had been known about for several years through empirical studies and informal arguments, and they are established more systematically in a relatively recent paper written jointly with Ioannis Kosmidis.

Kosmidis, I & Firth, D (2021). Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models. Biometrika, 108(1), 71–82.
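For readers who would like to see the finiteness property in action, here is a minimal sketch in R. It is not the software from any of the papers cited here. It assumes a made-up toy dataset with complete separation, and it maximizes the Jeffreys-prior penalized log-likelihood (the log-likelihood plus half the log-determinant of the Fisher information) by brute force with optim(). The names x, y, X, pen_loglik and pen are invented for the illustration.

```r
## Toy data with complete separation (hypothetical, for illustration only):
## every y = 0 has x < 0, and every y = 1 has x > 0
x <- c(-3, -2, -1, 1, 2, 3)
y <- c( 0,  0,  0, 1, 1, 1)
X <- cbind(intercept = 1, slope = x)

## Ordinary maximum likelihood: glm() warns, and the slope estimate it
## reports is typically very large (the true maximum is "at infinity")
ml <- suppressWarnings(glm(y ~ x, family = binomial))

## Penalized log-likelihood: log-likelihood + 0.5 * log det(X' W X)
pen_loglik <- function(beta) {
    eta <- drop(X %*% beta)
    p <- plogis(eta)
    loglik <- sum(y * eta - log1p(exp(eta)))
    info <- crossprod(X, X * (p * (1 - p)))   # Fisher information X' W X
    loglik + 0.5 * as.numeric(determinant(info, logarithm = TRUE)$modulus)
}

## Maximize numerically (optim() minimizes, hence the sign change)
pen <- optim(c(0, 0), function(beta) -pen_loglik(beta), method = "BFGS")

## Compare the two sets of estimates
rbind(ML = coef(ml), penalized = pen$par)
```

The use of optim() here is purely for transparency; purpose-built implementations use more careful algorithms and also provide standard errors.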

And that brings us nicely back to citations. Of the “non-citing” publications, i.e., papers that use my 1993 method but don’t cite the 1993 paper, most (perhaps all?) are applications of logistic regression for binary-response data. And nowadays such applications really should be citing the newer work Kosmidis & Firth (2021). These things only happen very slowly of course, if at all. I’m not going to hold my breath…

To cite this entry: Firth, D (2023). My best known work, fading citations and a bit of history. Weblog entry at URL https://DavidFirth.github.io/blog/2023/01/05/f93-citations-and-history/

My academic ancestry: Cox,Daniels,,,,,,,,,,,,,Newton,,,Galileo!

2022-04-11T00:00:00+00:00

Recently I wrote a short article in memory of David Cox, who died in January aged 97. David had been my PhD adviser (at Imperial College London in the 1980s) and we became good friends.

While writing that article I discovered the magic of the online Mathematics Genealogy Project (MGP). And, in particular, I discovered that through David Cox I am a direct descendant of Isaac Newton and also of Galileo Galilei!

It seems that Galileo, whose doctorate was completed in Pisa in 1585, remarkably has more than 30,000 known academic descendants — so I must have a lot of academic cousins out there.

Ancestry in the MGP is traced through the relationship of PhD supervisor and student. David Cox was my academic father, Henry Daniels my grandfather, etc. Thirteen generations before Daniels we find Isaac Newton (Cambridge, 1668) and three generations before that is Galileo Galilei (Pisa, 1585).

This blog has now moved to GitHub

2021-09-10T00:00:00+00:00

After about 10 years of being hosted at wordpress.com, my personal blog “Let’s Look at the Figures” is now moving to GitHub. The reasons are various, including:

I am already using Jekyll with GitHub Pages for my other blog at alt-3.uk (about maths and football leagues) — and I like it! The simplicity of static pages maintained through a collection of plain-text files is very appealing to me.
It is part of an effort to simplify my digital life, which has become overwhelmingly complicated in recent years.

The stable URL has now changed to DavidFirth.github.io/blog.

Some things that broke or are changed:

The previous domain name statgeek.net will soon cease to work. I can only apologise for any inconvenience that this causes to anyone. My recent retirement from paid work means that, while it’s nice to have a custom domain for my blog, I can no longer thoil it.
All comments made on old blog posts via wordpress.com are being left behind — they are not being copied across to GitHub with the old blog posts. Likely that would be possible, but it does not seem easy. Those old posts and comments will still be viewable at statgeek.wordpress.com, for a while at least.
From now on, commenting will not be available directly on blog entries themselves. I was getting a lot of spam comments, which was a pain. In future I will encourage people to comment/discuss posts with me on Twitter or by email, instead.

For now, the new blog site at DavidFirth.github.io/blog is rather “bare bones”. It is likely to stay that way; but I am of course open to suggestions for improvement!

Why we should trust the exit poll — but not too much!

2019-12-09T00:00:00+00:00

I look ahead a few days here, to 10pm on UK General Election day. Polling stations will have just closed, and major broadcasters (BBC, ITV and Sky) will simultaneously announce the findings of their jointly-commissioned exit poll — the headline being always the predicted number of seats for the (predicted) largest party in the newly elected House of Commons.

The exit poll is by now a big part of election day/night. The expense of it is justified by the fact that the broadcasters get their all-time largest current-affairs viewing, listening and website-visiting figures during the first couple of hours after polls close on a General Election night; but in that specific couple of hours, almost all of the votes are still uncounted. So the broadcasters need something for TV/radio/web commentators and on-air politicians to talk about in that first couple of hours; and the exit poll is a major part of that.

The world’s financial markets take notice of the exit poll too — in a big way, as evidenced by the substantial movements usually seen moments after 10pm on election day in currency rates and other markets.

But how accurate is the exit poll?

The answer since 2005 has mostly been: very accurate indeed! The 2005 General Election saw the full-scale introduction of a completely new set of methods for designing and analysing a UK exit poll — methods that had been tested first by the BBC at the 2001 election, and found to work so well that they were adopted jointly by BBC and ITV for 2005 (with Sky News joining to make a 3-way consortium by 2010). In 2005 Labour’s reduced majority of 66 seats — which was surprisingly low to commentators who had all seen pre-election polls predicting a majority of over 100 seats — was predicted exactly by the exit poll. And then the same happened in 2010: the exit-poll prediction of 307 seats for the Conservatives, still some way short of an overall majority, turned out by the next day to be exactly correct. Especially when viewed against the historical backdrop of 1992, which will forever be remembered as the election where the BBC exit poll was quite spectacularly wrong, the “spot on” successes of 2005 and 2010 started to make the new exit-poll methods seem somehow magical!

But there is no magic — and I really can say this with some confidence, as co-inventor of the new methods (while working with John Curtice, for BBC election-night programmes in 1997, 2001 and 2005). The innovative use of statistical modelling is what transformed exit polling at UK general elections, from a rather hit-and-hope exercise (in the 1990s and earlier) to an activity whose on-the-night predictions are now much more likely to be fairly accurate. Still, any exact prediction of seats won by the largest party, such as seen in both 2005 and 2010, owes as much to luck as it does to sound statistical thinking. There is nothing in the new methods that guarantees such freakish accuracy! Indeed, even getting a prediction error as small as 4 seats — as seen at the most recent General Election in 2017 — has to be regarded as extraordinarily accurate.

More typically the exit poll ought to be expected to predict with an error in roughly the 5–15 seats range (for the main parties). Sometimes the error will be smaller than that (as seen in 2005, 2010 and 2017); and occasionally it might be larger.

For the full story of how well the exit poll has performed at successive UK general elections, and lots of background material: see the online exit poll explainer.

(And for a bit more insight into the history of my own involvement in the exit poll, see this recent Twitter thread.)

But the main point here is this: While better methodology has radically improved the chances of an accurate prediction from the exit poll at a UK General Election, the super-accurate predictions seen in 2005, 2010 and even 2017 were unwarrantably accurate. Such an astounding level of accuracy is not guaranteed by the statistical methods used, and it definitely should not be expected every time!

Update, 14 December 2019

In the text above, written before the election on 12 December, I wrote:

...the exit poll ought to be expected to predict with an error in roughly the 5–15 seats range (for the main parties). Sometimes the error will be smaller than that (as seen in 2005, 2010 and 2017)...

The 2019 election saw, for the Conservative party total, an exit-poll error of just 3 seats; and so in the statement quoted above we could now say instead “as seen in 2005, 2010, 2017 and 2019”.

I also wrote:

...the super-accurate predictions seen in 2005, 2010 and even 2017 were unwarrantably accurate...

and that can now be amended to “the super-accurate predictions seen in 2005, 2010, 2017 and 2019 were unwarrantably accurate”. I still do believe this to be the case.

To cite this entry: Firth, D (2019). Why we should trust the exit poll — but not too much! Weblog entry at URL https://DavidFirth.github.io/blog/2019/12/09/why-we-should-trust-the-exit-poll-but-not-too-much/

Robust measurement from a 2-way table

2019-04-26T00:00:00+00:00

I work in a university. My department runs degree courses that allow students a lot of flexibility in their choice of course “modules”. (A typical student takes 8-10 modules per year, and is assessed separately on each module.)

After the exams are finished each year, we promise our students to look carefully at the exam marks for each module — to ensure that students taking a “hard” module are not penalized for doing that, and that students taking an “easy” module are not unduly advantaged.

The challenge in this is to separate module difficulty from student ability: we need to be able to tell the difference between (for example) a hard module and a module that was chosen by weaker-than-average students. This necessitates analysis of the exam marks for all modules together, rather than separately.

The data to be analysed are each student’s score (expressed as a percentage) in each module they took. It is convenient to arrange those scores in a 2-way table, whose rows are indexed by student IDs, and whose columns correspond to all the different possible modules that were taken. The task is then to analyse the (typically incomplete) 2-way table, to determine a numerical “module effect” for each module (a relatively high number for each module that was found relatively “easy”, and lower numbers for modules that were relatively “hard”).

A standard method for doing this robustly (i.e., in such a way that the analysis is not influenced too strongly by the performance of a small number of students) is the clever median polish method due to J W Tukey. My university department has been using median polish now for several years, to identify any strong “module effects” that ought to be taken into account when assessing each student’s overall performance in their degree course.

Median polish works mostly OK, it seems: it gives answers that broadly make sense. But there are some well known problems, including that it matters which way round the table is presented (i.e., “rows are students”, versus “rows are modules”) — the answer will depend on that. So median polish is actually not just one method, but two.

When my university department asked me recently to implement its annual median-polish exercise in R, I could not resist thinking a bit about whether there might be something even better than median polish, for this specific purpose of identifying the column effects (module effects) robustly. This led me to look at some simple “toy” examples, to help understand the principles. I’ll just show one such example here, to illustrate how it’s possible to do better than median polish in this particular context.

Example: 5 modules, 3 students

My made-up “toy” data:

 > x
        module
 student  A  B  C  D  E
       i NA NA NA 45 60
       j NA NA NA 55 60
       k 10 20 30 NA 50

There were five modules (labelled A,B,C,D,E). Students i, j and k each took a selection of those modules. It’s a small dataset, but that is deliberate: we can see easily what’s going on in a table this small. Module E was easier than the others, for example; and student k looks to be the weakest student (since k was outperformed by the other two students in module E, the only one that they all took).

I will call the above table perfect, as far as the measurement of module effects is concerned. If we assign module effects (−20, −10, 0, 10, 20) to the five modules A,B,C,D,E respectively, then for every pair of modules the observed within-student differences are centered upon the relevant difference in those module effects. For example, look at modules D and E: student i scores 15 points more in E, while j scores 5 points more in E, and the median of those two differences is 10 — the same as the difference between the proposed “perfect” module effects for D and E.
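The “perfect” property can be checked directly for every module pair. The following is a small sketch in R: it re-creates the toy table shown above, and the helper names (effects, pairs, check) are invented for the illustration.

```r
## The toy table from above: rows are students, columns are modules
x <- matrix(c(NA, NA, NA, 45, 60,
              NA, NA, NA, 55, 60,
              10, 20, 30, NA, 50),
            nrow = 3, byrow = TRUE,
            dimnames = list(student = c("i", "j", "k"),
                            module  = c("A", "B", "C", "D", "E")))

## The proposed "perfect" module effects
effects <- c(A = -20, B = -10, C = 0, D = 10, E = 20)

## For each module pair, compare the median within-student difference
## with the corresponding difference in module effects
pairs <- combn(colnames(x), 2)
check <- apply(pairs, 2, function(p) {
    diffs <- x[, p[1]] - x[, p[2]]
    if (all(is.na(diffs))) return(NA)   # no student took both modules
    median(diffs, na.rm = TRUE) == effects[[p[1]]] - effects[[p[2]]]
})
names(check) <- apply(pairs, 2, paste, collapse = "-")
check
```

Pairs that no student took together (here, D with each of A, B and C) come out as NA; every other pair should show TRUE.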

When we perform median polish on this table, we get different answers depending on whether we apply the method to the table directly, or to its transpose:

 > medpolish(x, na.rm = TRUE, maxiter = 20)
 ...
 Median Polish Results (Dataset: "x")
 
 Overall: 38.75
 
 Row Effects:
     i     j     k 
  0.00  5.00 -8.75 
 
 Column Effects:
      A      B      C      D      E 
 -20.00 -10.00   0.00   8.75  20.00 
 
 Residuals:
        module
 student  A  B  C    D     E
       i NA NA NA -2.5  1.25
       j NA NA NA  2.5 -3.75
       k  0  0  0   NA  0.00
 
 > medpolish(t(x), na.rm = TRUE, maxiter = 20)
 ...
 Median Polish Results (Dataset: "t(x)") 
 
 Overall: 36.25
 
 Row Effects:
      A      B      C      D      E 
 -20.00 -10.00   0.00  11.25  20.00 
 
 Column Effects:
      i      j      k 
  0.625  5.625 -6.250 
 
 Residuals:
       student
 module      i      j  k
      A     NA     NA  0
      B     NA     NA  0
      C     NA     NA  0
      D -3.125  1.875 NA
      E  3.125 -1.875  0

Neither of those answers is the same as the “perfect” module-effect measurement that was mentioned above. The module effect for D as computed by median polish is either 8.75 or 11.25, depending on the orientation of the input table — but not the “perfect 10”.

A better method: Median difference analysis

I decided to implement, in place of median polish, a simple non-iterative method that targets directly the notion of “perfect” measurement that is mentioned above.

The method is in two stages.

Stage 1 computes within-student differences and takes the median of those, for each possible module pair. For our toy example:

 > md <- meddiff(x)
    A   B   C  D   E
 A NA -10 -20 NA -40
 B  1  NA -10 NA -30
 C  1   1  NA NA -20
 D  0   0   0 NA -10
 E  1   1   1  2  NA

The result here has all of the available median-difference values above the diagonal. Below the diagonal is the count of how many differences were used in computing each one of those medians. So, for example, the median difference between modules D and E is −10; and that was computed from 2 students’ exam scores.

Stage 2 then fits a linear model to the median-difference values, using weighted least squares. The linear model finds the vector of module effects that most closely approximates the available median differences (i.e., best approximates the numbers above the diagonal). The weights are simply the counts from the lower triangle of the above matrix.
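One way to write Stage 2 down in symbols is as follows. The notation is introduced only for this note and is not taken from any published source: d_{ab} is the median within-student difference between modules a and b, n_{ab} is the number of students contributing to it, and beta_a is the effect of module a, with the effect of one module (E in the toy example) fixed at zero as the reference.

```latex
\hat{\beta}
  = \arg\min_{\beta}
    \sum_{\substack{a < b \\ n_{ab} > 0}}
      n_{ab}\,\bigl\{ d_{ab} - (\beta_a - \beta_b) \bigr\}^{2},
\qquad \beta_E = 0 .
```

In the toy example the pair (D, E) contributes the term 2 {-10 - (beta_D - beta_E)}^2, which is zero when beta_D = -10 and beta_E = 0.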

In this “perfect” example, we achieve the desired perfect answer (which here is presented with E as the “reference” module):

 > fit(md)$coefficients
   A   B   C   D   E 
 -40 -30 -20 -10   0

My plan now is to make these simple R functions robust enough to use for our students’ actual exam marks, and to add also inference on the module-effect values (via a suitably designed bootstrap calculation).

For now, here are my prototype functions in case anyone else wants to play with them:

meddiff <- function(xmat) {
    ## rows are students, columns are modules
    S <- nrow(xmat)
    M <- ncol(xmat)
    result <- matrix(NA, M, M)
    rownames(result) <- colnames(result) <- colnames(xmat)
    for (m in 1:(M-1)) {
        for (mm in (m+1):M) {
            diffs <- xmat[, m] - xmat[, mm]
            ## upper triangle
            result[m, mm] <- median(diffs, na.rm = TRUE)
            ## lower triangle
            result[mm, m] <- sum(!is.na(diffs))
        }
    }
    return(result)
}

fit <- function(m) {
    ## matrix m needs to be fully connected above the diagonal
    upper <- upper.tri(m)
    diffs <- m[upper]
    weights <- t(m)[upper]
    rows <- factor(row(m)[upper])
    cols <- factor(col(m)[upper])
    X <- cbind(model.matrix(~ rows - 1), 0) - 
           cbind(0, model.matrix(~ cols - 1))
    colnames(X) <- colnames(m)
    rownames(X) <- paste0(colnames(m)[rows], "-", colnames(m)[cols])
    result <- lm.wfit(X, diffs, weights)
    result$coefficients[is.na(result$coefficients)] <- 0
    class(result) <- c("meddiff_fit", "list")
    return(result)
}

To cite this entry: Firth, D (2019). Robust measurement from a 2-way table. Weblog entry at URL https://DavidFirth.github.io/blog/2019/04/26/robust-measurement-from-a-2-way-table/

Part 2, further comments on OfS grade-inflation report

2019-01-07T00:00:00+00:00

Update, 7 January: I am pleased to say that the online media article that I complained about in Sec 1 below has now been amended by its author(s), to correct the false attributions. I am grateful to Chris Parr for helping to sort this out.

In my post a few days ago (which I’ll now call “Part 1”) I looked at aspects of the statistical methods used in a report by the UK government’s Office for Students, about “grade inflation” in English universities. This second post continues on the same topic.

In this Part 2 I will do two things:

Set the record straight, in relation to some incorrect reporting of Part 1 in the specialist media.
Suggest a new statistical method that (in my opinion) is better than the one used in the OfS report.

The more substantial stuff will be the second bullet there (and of course I wish I didn’t need to do the first bullet at all). In this post (at section 2 below) I will just outline a better method, by using the same artificial example that I gave in Part 1: hopefully that will be enough to give the general idea, to both specialist and non-specialist readers. Later I will follow up (in my intended Part 3) with a more detailed description of the suggested better method; that Part 3 post will be suitable mainly for readers with more specialist background in Statistics.

1. For the record

I am aware of two places where the analysis I gave in Part 1 has been reported:

At https://www.researchprofessional.com/0/rr/he/agencies/ofs/2019/OfS-grade-inflation-analysis-not-fit-for-purpose–says-expert.html, an article entitled “OfS grade inflation analysis not fit for purpose, says expert”
At https://www.researchresearch.com/news/article/?articleId=1379083, which seems to be a straight copy of the same article (I have not checked in detail).

The first link there is to a paywalled site, I think. The second one appears to be in the public domain. I do not recommend following either of those links, though! If anyone reading this wants to know about what I wrote in Part 1, then my advice is just to read Part 1 directly.

Here I want to mention three specific ways in which that article misrepresents what I wrote in Part 1. Points 2 and 3 here are the more important ones, I think (but #1 is also slightly troubling, to me):

1. The article refers to my blog post as “a review commissioned by HE”. The reality is that a journalist called Chris Parr had emailed me just before Christmas. In the email Chris introduced himself as “I’m a journalist at Research Fortnight”, and the request he made in the email (in relation to the newly published OfS report) was “Would you or someone you know be interested in taking a look?”. I had heard of Research Fortnight. And I was indeed interested in taking a look at the methods used in the OfS report. But until the above-mentioned article came to my attention, I had never even heard of a publication named HE. Possibly I am mistaken in this, but to my mind the phrase “a review commissioned by HE” indicates some kind of formal arrangement between HE and me, with specified deliverables and perhaps even payment for the work. There was in fact no such “commission” for the work that I did. I merely spent some time during the Christmas break thinking about the methods used in the OfS report, and then I wrote a blog post (and told Chris Parr that I had done that). And let me repeat: I had never even heard of HE (nor of the article’s apparent author, which was not Chris Parr). No payment was offered or demanded. I mention all this here only in case anyone who has read that article got a wrong impression from it.
2. The article contains this false statement: “The data is too complex for a reliable statistical method to be used, he said”. The “he” there refers to me, David Firth. I said no such thing, neither in my blog post nor in any email correspondence with Chris Parr. Indeed, it is not something I ever would say: the phrase “data…too complex for a reliable statistical method” is a nonsense.
3. The article contains this false statement: “He calls the OfS analysis an example of Simpson’s paradox”. Again, the “he” in that statement refers to me. But I did not call the OfS analysis an example of Simpson’s paradox, either in my blog post or anywhere else. (And nor could I have, since I do not have access to the OfS dataset.) What I actually wrote in my blog post was that my own artificial, specially-constructed example was an instance of Simpson’s paradox — which is not even close to the same thing!

The article mentioned above seems to have had an agenda that was very different from giving a faithful and informative account of my comments on the OfS report. I suppose that’s journalistic license (although I would naively have expected better from a specialist publication to which my own university appears to subscribe). The false attribution of misleading statements is not something I can accept, though, and that is why I have written specifically about that here.

To be completely clear:

The article mentioned above is misleading. I do not recommend it to anyone.
All of my posts in this blog are my own work, not commissioned by anyone. In particular, none of what I’ll continue to write below (and also in Part 3 of this extended blog post, when I get to that), about the OfS report, was requested by any journalist.

2. Towards a better (statistical) measurement model

I have to admit that in Part 1 I ran out of steam at one point, specifically where — in response to my own question about what would be a better way than the method used in the OfS report — I wrote “I do not have an answer”. I could have and should have done better than that.

Below I will outline a fairly simple approach that overcomes the specific pitfall I identified in Part 1, i.e., the fact that measurement at too high a level of aggregation can give misleading answers. I will demonstrate my suggested new approach through the same, contrived example that I used in Part 1. This should be enough to convey the basic idea, I hope. [Full generality for the analysis of real data will demand a more detailed and more technical treatment of a hierarchical statistical model; I’ll do that later, when I come to write Part 3.]

On reflection, I think a lot of the criticism received by the OfS report since its publication relates to the use of the word “explain” in that report. And indeed, that was a factor also in my own (mentioned above) “I do not have an answer” comment. It seems obvious — to me, anyway — that any serious attempt to explain apparent increases in the awarding of First Class degrees would need to take account of a lot more than just the attributes of students when they enter university. With the data used in the OfS report I think the best that one can hope to do is to measure those apparent increases (or decreases), in such a way that the measurement is a “fair” one that appropriately takes account of incoming student attributes and their fluctuation over time. If we take that attitude — i.e., that the aim is only to measure things well, not to explain them — then I do think it is possible to devise a better statistical analysis, for that purpose, than the one that was used in the OfS report.

(I fully recognise that this actually was the attitude taken in the OfS work! It is just unfortunate that the OfS report’s use of the word “explain”, which I think was intended there mainly as a technical word with its meaning defined by a statistical regression model, inevitably leads readers of the report to think more broadly about substantive explanations for any apparent changes in degree-class distributions.)

2.1 Those “toy” data again, and a better statistical model

Recall the setup of the simple example from Part 1: Two academic years, two types of university, two types of student. The data are as follows:

2010-11
  University A           University B
    Firsts  Other          Firsts  Other
  h   1000      0        h    500    500 
  i      0   1000        i    500    500
2016-17
  University A          University B
    Firsts  Other          Firsts  Other 
  h   1800    200       h       0      0
  i      0      0       i     500   1500

Our measurement (of change) should reflect the fact that, for each type of student within each university, where information is available, the percentage awarded Firsts actually decreased (in this example).

Change in percent awarded firsts:
  University A, student type h:  100% --> 90%
  University A, student type i:   no data
  University B, student type h:   no data
  University B, student type i:   50% --> 25%

This provides the key to specification of a suitable (statistical) measurement model:

measure the changes at the lowest level of aggregation possible;
then, if aggregate conclusions are wanted, combine the separate measurements in some sensible way.

In our simple example, “lowest level of aggregation possible” means that we should measure the change separately for each type of student within each university. (In the real OfS data, there’s a lower level of aggregation that will be more appropriate, since different degree courses within a university ought to be distinguished too — they have different student intakes, different teaching, different exam boards, etc.)

In Statistics this kind of analysis is often called a stratified analysis. The quantity of interest (which here is the change in % awarded Firsts) is measured separately in several pre-specified strata, and those measurements are then combined if needed (either through a formal statistical model, or less formally by simple or weighted averaging).

In our simple example above, there are 4 strata (corresponding to 2 types of student within each of 2 universities). In our specific dataset there is information about the change in just 2 of those strata, and we can summarize that information as follows:

in University A, student type h saw their percentage of Firsts reduced by 10%;
in University B, student type i saw their percentage of Firsts reduced by 50%.

That’s all the information in the data, about changes in the rate at which Firsts are awarded. (It was a deliberately small dataset!)

If a combined, “sector-wide” measure of change is wanted, then the separate, stratum-specific measures need to be combined somehow. To some extent this is arbitrary, and the choice of a combination method ought to depend on the purpose of such a sector-wide measure and (especially) on the interpretation desired for it. I might find time to write more about this later in Part 3.

For now, let me just recall what was the “sector-wide” measurement that resulted from analysis (shown in Part 1) of the above dataset using the OfS report’s method. The result obtained by that method was a sector-wide increase of 7.5 percentage points in the rate at which Firsts are awarded — which is plainly misleading in the face of data that shows substantial decreases in both universities. Whilst I do not much like the OfS Report’s “compare with 2010” approach, it does have the benefit of transparency and in my “toy” example it is easy to apply to the stratified analysis:

2016-17          Expected Firsts       Actual
                 based on 2010-11
  University A         2000             1800
  University B         1000              500
  ------------------------------------------
  Total                3000             2300

— from which we could report a sector-wide decrease of 700/3000 = 23.3% in the awarding of Firsts, once student attributes are taken properly into account. (This could be viewed as just a suitably weighted average of the 10% and 50% decreases seen in University A and University B respectively.)
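The arithmetic behind that table can be sketched in a few lines of base R. This is only a minimal illustration of the stratified "expected versus actual" calculation for the toy data; the object and column names (`toy`, `rate_2010`, `expected_2016` and so on) are assumptions made up for this sketch.

```r
## Toy data: one row per stratum (university x student type).
## Object and column names are illustrative assumptions.
toy <- data.frame(
    university  = c("A", "A", "B", "B"),
    type        = c("h", "i", "h", "i"),
    firsts_2010 = c(1000,    0,  500,  500),
    other_2010  = c(   0, 1000,  500,  500),
    firsts_2016 = c(1800,    0,    0,  500),
    other_2016  = c( 200,    0,    0, 1500)
)

## 2010-11 rate of Firsts within each stratum
toy$rate_2010 <- toy$firsts_2010 / (toy$firsts_2010 + toy$other_2010)

## 2016-17 Firsts "expected" if each stratum had kept its 2010-11 rate
toy$n_2016        <- toy$firsts_2016 + toy$other_2016
toy$expected_2016 <- toy$rate_2010 * toy$n_2016

## Totals by university, then sector-wide
by_university <- aggregate(cbind(expected_2016, firsts_2016) ~ university,
                           data = toy, FUN = sum)
total_expected <- sum(toy$expected_2016)
total_actual   <- sum(toy$firsts_2016)
sector_change  <- (total_actual - total_expected) / total_expected

## The same figure, as a weighted average of the university-level
## changes, with weights given by the expected numbers of Firsts
univ_change <- with(by_university,
                    (firsts_2016 - expected_2016) / expected_2016)
weighted_change <- weighted.mean(univ_change,
                                 w = by_university$expected_2016)
```

Here `sector_change` and `weighted_change` both come out at -700/3000, i.e., the 23.3% decrease reported above.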

As before, I have made the full R code available (as an update to my earlier R Markdown document). For those who don’t use R, I attach here also a PDF copy of that: grade-inflation-example.pdf

2.2 Generalising the better model: More strata, more time-points

The essential idea of a better measurement model is presented above in the context of a small “toy” example, but the real data are of course much bigger and more complex.

The key to generalising the model will simply be to recognise that it can be expressed in the form of a logistic regression model (that’s the same kind of model that was used in the OfS report; but the “better” logistic regression model structure is different, in that it needs to include a term that defines the strata within which measurement takes place).

This will be developed further in Part 3, which will be more technical in flavour than Parts 1 and 2 of this blog-post thread have been. Just by way of a taster, let me show here the mathematical form of the logistic-regression representation of the “toy” data analysis shown above. With notation

u for providers (universities); u is either A or B in the toy example
t for type of student; t is either h or i in the toy example
y for years; y is either 2010-11 or 2016-17 in the toy example
\(\pi_{uty}\) for the probability of a First in year y, for students of type t in university u

the logistic regression model corresponding to the analysis above is

\(\log\left(\pi_{uty}\over 1-\pi_{uty}\right) = \alpha_{ut} + \beta_{uy}\).

This is readily generalized to situations involving more strata (more universities u and student types t, and also degree-courses within universities). There were just 4 stratum parameters \(\alpha_{Ah},\alpha_{Ai}, \alpha_{Bh}, \alpha_{Bi}\) in the above example, but more strata are easily accommodated.

The model is readily generalized also, in a similar way, to more than 2 years of data.

For comparison, the corresponding logistic regression model as used in the OfS report looks like this:

\(\log\left(\pi_{uty}\over 1-\pi_{uty}\right) = \alpha_{t} + \beta_{uy}\).

So it is superficially very similar. But the all-important term \(\alpha_{ut}\) that determines the necessary strata within universities is missing from the OfS model.

I will aim to flesh this out a bit in a new Part 3 post within the next few days, if time permits. For now I suppose the model I’m suggesting here needs a name (i.e., a name that identifies it more clearly than just “my better model”!) Naming things is not my strong point, unfortunately! But, for now at least, I will term the analysis introduced above “stratified by available student attributes” — or “SASA model” for short.

(The key word there is “stratified”.)

Update, September 2021: Just to note that the “Part 3” never got written! As well as having too much else to do in 2019, I lost all confidence that any further work by me on this topic would actually influence anything.

To cite this entry: Firth, D (2019). Part 2, further comments on OfS grade-inflation report. Weblog entry at URL https://DavidFirth.github.io/blog/2019/01/07/part-2-further-comments-on-ofs-grade-inflation-report/

Office for Students report on “grade inflation”

2019-01-02T00:00:00+00:00

Chris Parr, a journalist for Research Professional, asked me to look at a recent report, Analysis of degree classifications over time: Changes in graduate attainment. The report was published by the UK government’s Office for Students (OfS) on 19 December 2018, along with a headline-grabbing press release:

The report uses a statistical method — the widely used method of logistic regression — to devise a yardstick by which each English university (and indeed the English university sector as a whole) is to be measured, in terms of their tendency to award the top degree classes (First Class and Upper Second Class honours degrees). The OfS report looks specifically at the extent to which apparent “grade inflation” in recent years can be explained by changes in student-attribute data available to OfS (which include grades in pre-university qualifications, and also some other characteristics such as gender and ethnicity).

I write here as an experienced academic, who has worked at the University of Warwick (in England) for the last 15 years. At the end, below, I will briefly express some opinions based upon that general experience (and it should be noted that everything I write here is my own — definitely not an official view from the University of Warwick!)

My specific expertise, though, is in statistical methods, and this post will focus mainly on that aspect of the OfS report. (For a more wide-ranging critique, see for example https://wonkhe.com/blogs/policy-watch-ofs-report-on-grade-inflation/)

Parts of what I say below will get a bit technical, but I will aim to write first in a non-technical way about the big issue here, which is just how difficult it is to devise a meaningful measurement of “grade inflation” from available data. My impression is that, unfortunately, the OfS report has either not recognised the difficulty or has chosen to neglect it. In my view the methods used in the report are not actually fit for their intended purpose.

1. Analysis of an idealized dataset

In much the same way as when I give a lecture, I will aim here to expose the key issue through a relatively simple, concocted example. The real data from all universities over several years are of course quite complex; but the essence can be captured in a much smaller set of idealized data, the advantage of which is that it allows a crucial difficulty to be seen quite directly.

An imagined setup: Two academic years, two types of university, two types of student

Suppose (purely for simplicity) that there are just two identifiable types of university (or, if you prefer, just two distinct universities) — let’s call them A and B.

Suppose also (purely for simplicity) that all of the measurable characteristics of students can be encapsulated in a single binary indicator: every student is known to be either of type h or of type i, say. (Maybe h for hardworking and i for idle?)

Now let’s imagine the data from two academic years — say the years 2010-11 and 2016-17 as in the OfS report — on the numbers of First Class and Other graduates.

The 2010-11 data looks like this, say:

  University A           University B
    Firsts  Other          Firsts  Other
  h   1000      0        h    500    500 
  i      0   1000        i    500    500

The two universities have identical intakes in 2010-11 (equal numbers of type h and type i students). Students of type h do a lot better at University A than do students of type i; whereas University B awards a First equally often to the two types of student.

Now let’s suppose that, in the years that follow 2010-11,

students (who all know which type they are) learn to target the “right” university for themselves
both universities A and B tighten their final degree criteria, so as to make it harder (for both student types h and i) to achieve a First.

As a result of those behavioural changes, the 2016-17 data might look like this:

  University A          University B
    Firsts  Other          Firsts Other 
  h   1800    200       h       0     0
  i      0      0       i     500  1500

Now we can combine the data from the two universities, so as to look at how degree classes across the whole university sector have changed over time:

  Combined data from both universities:
    2010-11                  2016-17
          Firsts  Other            Firsts  Other
        h   1500    500          h   1800    200
        i    500   1500          i    500   1500
          -------------            -------------
    Total   2000   2000      Total   2300   1700
        %     50     50              57.5   42.5

The conclusion (not!)

The last table shown above would be interpreted, according to the methodology of the OfS report, as showing an unexplained increase of 7.5 percentage points in the awarding of first-class degrees.

(It is 7.5 percentage points because that’s the difference between 50% Firsts in 2010-11 and 57.5% Firsts in 2016-17. And it is unexplained — in the OfS report’s terminology — because the composition of the student body was unchanged, with 50% of each type h and i in both years.)
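For readers who like to check such arithmetic, here is a minimal base R sketch of the aggregate calculation just described. The object and column names (`counts`, `sector`, `pct_type_h`) are assumptions chosen for this illustration only.

```r
## Counts of Firsts and Other, by year, university and student type
## (object and column names are illustrative assumptions)
counts <- data.frame(
    year       = rep(c("2010-11", "2016-17"), each = 4),
    university = rep(c("A", "A", "B", "B"), times = 2),
    type       = rep(c("h", "i"), times = 4),
    firsts     = c(1000,    0, 500, 500,   1800, 0, 0,  500),
    other      = c(   0, 1000, 500, 500,    200, 0, 0, 1500)
)
counts$n <- counts$firsts + counts$other

## Sector-wide percentage of Firsts in each year
sector <- aggregate(cbind(firsts, n) ~ year, data = counts, FUN = sum)
sector$pct_firsts <- 100 * sector$firsts / sector$n

## Percentage of all students who are of type h, in each year
type_h <- aggregate(n ~ year, data = counts[counts$type == "h", ],
                    FUN = sum)
pct_type_h <- 100 * type_h$n / sector$n

## Change in the sector-wide percentage of Firsts, in percentage points
increase_points <- diff(sector$pct_firsts)
```

The student composition (`pct_type_h`) is 50% in both years, while `increase_points` is 7.5: exactly the combination that the aggregate method would label as "unexplained".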

But such a conclusion would be completely misleading. In this constructed example, both universities actually made it harder for every type of student to get a First in 2016-17 than in 2010-11.

The real conclusion

The constructed example used above should be enough to demonstrate that the method developed in the OfS report does not necessarily measure what it intends to.

The constructed example was deliberately made both simple and quite extreme, in order to make the point as clearly as possible. The real data are of course more complex, and patterns such as shifts in the behaviour of students and/or institutions will usually be less severe (and will always be less obvious) than they were in my constructed example. The point of the constructed example is merely to demonstrate that any conclusions drawn from this kind of combined analysis of all universities will be unreliable, and such conclusions will often be incorrect (sometimes severely so).

That false conclusion is just an instance of Simpson’s Paradox, right?

Yes.

The phenomenon of analysing aggregate data to obtain (usually incorrect) conclusions about disaggregated behaviour is often (in Statistics) called ecological inference or the ecological fallacy. In extreme cases, even the direction of effects can be apparently reversed (as in the example above) — and in such cases the word “paradox” does seem merited.

Logistic regression

The simple example above was (deliberately) easy enough to understand without any fancy statistical methods. For more complex settings, especially when there are several “explanatory” variables to take into account, the method of logistic regression is a natural tool to choose (as indeed the authors of the OfS report did).

It might be thought that a relatively sophisticated tool such as logistic regression can solve the problem that was highlighted above. But that is not the case. The method of logistic regression, with its results aggregated as described in the OfS report, merely yields the same (incorrect) conclusions in the artificial example above.

For anyone reading this who wants to see the details: here is the full code in R, with some bits of commentary.

2. So, what is a better way?

The above has shown how the application of a statistical method can result in potentially very misleading results.

Unfortunately, it is hard for me (and perhaps just as hard for anyone else?) to come up with a purely statistical remedy — i.e., a better statistical method.

The problem of measuring “grade inflation” is an intrinsically difficult one to solve. Subject-specific Boards of Examiners — which is where the degree classification decisions are actually made within universities — work very hard (in my experience) to be fair to all students, including those students who have graduated with the same degree title in previous years or decades. This last point demands attention to the maintenance of standards through time. Undoubtedly, though, there are other pressures in play — pressures that might still result in “grade inflation” through a gradual lowering of standards, despite the efforts of exam boards to maintain those standards. (Such pressures could include the publication of %Firsts and similar summaries, in league tables of university courses for example.) And even if standards are successfully held constant, there could still be apparent grade-inflation wherever actual achievement of graduates is improving over time, due to such things as increased emphasis on high-quality teaching in universities, or improvements in the range of options and the information made available to students (who can then make better choices for their degree courses).

I should admit that I do not have an answer!

3. A few (more technical) notes

a. For the artificial example above, I focused on the difficulty caused by aggregating university-level data to draw a conclusion about the whole sector. But the problem does not go away if instead we want to draw conclusions about individual universities, because each university comprises several subject-specific exam boards (which is where the degree classification decisions are actually made). Any statistical model that aims to measure successfully an aspect of behaviour (such as grade inflation) would need to consider data at the right level of disaggregation — which in this instance would be the separate Boards of Examiners within each university.

b. Many (perhaps all?) of the reported standard errors attached to estimates in the OfS report seem, to my eye, unrealistically small. It is unclear how they were calculated, though, so I cannot judge this reliably. (A more general point related to this: It would be good if the OfS report’s authors could publish their complete code for the analysis, so that others can check it and understand fully what was done.)

c. In tables D2 and D3 of the OfS report, the model’s parameterization is not made clear enough to understand it fully. Specifically, how should the Year estimates be interpreted — do they, for example, relate to one specific university? (Again, giving access to the analysis code would help with understanding this in full detail.)

d. In equations E2 and E3 of the OfS report, it seems that some independence assumptions (or, at least, uncorrelatedness) have been made. I missed the justification for those; and it is unclear to me whether all of them are indeed justifiable.

e. The calculation of thresholds for “significance flags” as used in the OfS report is opaque. It is unclear to me how to interpret such statistical significance, in the present context.

4. Opinion

This topic seems to me to be a really important one for universities to be constantly aware of, both qualitatively and quantitatively.

Unfortunately I am unconvinced that the analysis presented in this OfS report contributes any reliable insights. This is worrying (to me, and probably to many others in academia) because the Office for Students is an important government body for the university sector.

It is especially troubling that the OfS appears to base aspects of its regulation of universities upon such a flawed approach to measurement. As someone who has served on many boards of examiners, at various different universities in the UK and abroad (including as an external examiner when called upon), I cannot help feeling that a lot of careful work by such exam boards is in danger of simply being dismissed as “unexplained”, on the basis of some well-intentioned but inadequate statistical analysis. The written reports of exam boards, and especially of the external examiners who moderate standards across the sector, would surely be a much better guide than that?

Update, 7 January: There’s now also Part 2 of this blog post, for those who are keen to know more!

To cite this entry: Firth, D (2019). Office for Students report on “grade inflation”. Weblog entry at URL https://DavidFirth.github.io/blog/2019/01/02/office-for-students-report-on-grade-inflation

Simple maths of a fairer USS deal

2018-03-16T00:00:00+00:00

In yesterday’s post I showed a graph, followed by some comments to suggest that future USS proposals with a flatter (or even increasing) “percent lost” curve would be fairer (and, as I argued earlier in my Robin Hood post, more affordable at the same time).

It’s now clear to me that my suggestion seemed a bit cryptic to many (maybe most!) who read it yesterday. So here I will try to show more specifically how to achieve a flat curve. (This is not because I think flat is optimal. It’s mainly because it’s easy to explain. As already mentioned, it might not be a bad idea if the curve was actually to increase a bit as salary levels increase; that would allow those with higher salaries to feel happy that they are doing their bit towards the sustainable future of USS.)

Flattening the curve

The graph below is the same as yesterday’s but with a flat (blue, dashed) line drawn at the level of 4% lost across all salary levels.

I drew the line at 4% here just as an example, to illustrate the calculation. The actual level needed — i.e., the “affordable” level for universities — would need to be determined by negotiation; but the maths is essentially the same, whatever the level (within reason).

Let’s suppose we want to adjust the USS contribution and benefits parameters to achieve just such a flat “percent lost” curve, at the 4% level. How is that done?

I will assume here the same adjustable parameters that UUK and UCU appear to have in mind, namely:

employee contribution rate E (as percentage of salary — currently 8; was 8.7 in the 12 March proposal; was 8 in the January proposal)
threshold salary T, over which defined benefit (DB) pension entitlement ceases (which is currently £55.55k; was £42k in the 12 March proposal; and was £0 in the January proposal)
accrual rate A, in the DB pension. Expressed here in percentage points (currently 100/75; was 100/85 in the 12 March proposal; and not relevant to the January proposal).
employer contribution rate (%) to the defined contribution (DC) part of USS pension. Let’s allow different rates \(C_1\) and \(C_2\) for, respectively, salaries between T and £55.55k, and salaries over £55.55k. (Currently \(C_1\) is irrelevant, and \(C_2\) is 13 (max); these were both set at 12 in the 12th March proposal; and were both 13.25 in the January proposal.)

I will assume also, as all the recent proposals do, that the 1% USS match possibility is lost to all members.

Then, to get to 4% lost across the board, we need simply to solve the following linear equations. (To see where these came from, please see this earlier post.)

For salary up to T:

\[(E - 8) + 19(100/75 - A) + 1 = 4.\]

For salary between T and £55.55k:

\[ -8 + 19(100/75) - C_1 + 1 = 4.\]

For salary over £55.55k:

\[13 - C_2 = 4.\]

Solving those last two equations is simple, and results in

\[C_1 = 14.33, \qquad C_2 = 9.\]

The first equation above clearly allows more freedom: it’s just one equation, with two unknowns, so there are many solutions available. Three example solutions, still based on the illustrative 4% loss level across all salary levels, are:

\[E=8, \qquad A = 1.175 = 100/85.1\] \[E = 8.7, \qquad A = 1.21 = 100/82.6\] \[E = 11, \qquad A = 100/75.\]
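As a worked step, the first equation above (for salary up to T) can be rearranged to give the accrual rate A that goes with any chosen value of E at the illustrative 4% level. This is just algebra on that equation, using only the symbols already defined:

\[19\left(\frac{100}{75} - A\right) = 4 - 1 - (E - 8) = 11 - E, \qquad \text{so} \qquad A = \frac{100}{75} - \frac{11 - E}{19}.\]

For example, E = 8 gives A = 100/75 - 3/19, which is approximately 1.175, and E = 11 gives A = 100/75 exactly.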

At the end here I’ll give code in R to do the above calculation quite generally, i.e., for any desired percentage loss level. First let me just make a few remarks relating to all this.

Remarks

Choice of threshold

Note that the value of T does not enter into the above calculation. Clearly there will be (negotiable) interplay between T and the required percentage loss, though, for a given level of affordability.

Choice of \(C_2\)

Much depends on the value of \(C_2\).

The calculation above gives the value of \(C_2\) needed for a flat “percent lost” curve, at any given level for the percent lost (which was 4% in the example above).

To achieve an increasing “percent lost” curve, we could simply reduce the value of \(C_2\) further than the answer given by the above calculation. Alternatively, as suggested in my earlier Robin Hood post, USS could apply a lower value of \(C_2\) only for salaries above some higher threshold — i.e., in much the same spirit as progressive taxation of income.

Just as with income tax, it would be important not to set \(C_2\) too small, otherwise the highest-paid members would quite likely want to leave USS. There is clearly a delicate balance to be struck, at the top end of the salary spectrum.

But it is clear that if the higher-paid were to sacrifice at least as much as everyone else, in proportion to their salary, then that would allow the overall level of “percent lost” to be appreciably reduced, which would benefit the vast majority of USS members.

Determination of the overall “percent lost”

Everything written here constitutes a methodology to help with finding a good solution. As mentioned at the top here, the actual solution — and in particular, the actual level of USS member pain (if any) deemed to be necessary to keep USS afloat — will be a matter for negotiation. The maths here can help inform that negotiation, though.

Code for solving the above equations

## Function to compute the USS parameters needed for a
## flat "percent lost" curve
##
## Function arguments are:
## loss: in percentage points, the constant loss desired
## E: employee contribution, in percentage points
## A: the DB accrual rate
##
## Exactly one of E and A must be specified (ie, not NULL).
##
## Example calls:
## flatcurve(4.0, A = 100/75)
## flatcurve(2.0, E = 10.5)
## flatcurve(1.0, A = 100/75)  # status quo, just 1% "match" lost

flatcurve <- function(loss, E = NULL, A = NULL){

    if (is.null(E) && is.null(A)) {
        stop("E and A can't both be NULL")}
    if (!is.null(E) && !is.null(A)) {
        stop("one of {E, A} must be NULL")}

    c1 <- 19 * (100/75) - (7 + loss)
    c2 <- 13 - loss

    if (is.null(E)) {
        E <- 7 + loss - (19 * (100/75 - A))
    }

    if (is.null(A)) {
        A <- (E - 7 - loss + (19 * 100/75)) / 19
    }

    return(list(loss_percent = loss,
                employee_contribution_percent = E,
                accrual_reciprocal = 100/A,
                DC_employer_rate_below_55.55k = c1,
                DC_employer_rate_above_55.55k = c2))
}

The above function will run in base R.

Here are three examples of its use (copied from an interactive session in R):

###  Specify 4% loss level, 
###  still using the current USS DB accrual rate

> flatcurve(4.0, A = 100/75)
$loss_percent
[1] 4

$employee_contribution_percent
[1] 11

$accrual_reciprocal
[1] 75

$DC_employer_rate_below_55.55k
[1] 14.33333

$DC_employer_rate_above_55.55k
[1] 9

#------------------------------------------------------------
###  This time for a smaller (2%) loss, 
###  with specified employee contribution

> flatcurve(2.0, E = 10.5)
$loss_percent
[1] 2

$employee_contribution_percent
[1] 10.5

$accrual_reciprocal
[1] 70.80745

$DC_employer_rate_below_55.55k
[1] 16.33333

$DC_employer_rate_above_55.55k
[1] 11

#------------------------------------------------------------
### Finally, my personal favourite:
### --- status quo with just the "match" lost

> flatcurve(1, A = 100/75)
$loss_percent
[1] 1

$employee_contribution_percent
[1] 8

$accrual_reciprocal
[1] 75

$DC_employer_rate_below_55.55k
[1] 17.33333

$DC_employer_rate_above_55.55k
[1] 12

To cite this entry: Firth, D (2018). Simple maths of a fairer USS deal. Weblog entry at URL https://DavidFirth.github.io/blog/2018/03/16/simple-maths-of-a-fairer-uss-deal/

USS proposals: Tail wagging the dog?

2018-03-15T00:00:00+00:00

In response to my previous post, “Latest USS proposal: Who would lose most?”, someone asked me about doing the same calculation for the USS JNC-supported proposals from January. For a summary of those January proposals and my comments about their fairness, please see my earlier post “USS pension scheme and fairness”.

Anyway, the calculation is quite simple, and it led to the following graph. The black curve is as in my previous post, and the red one is from the same calculation done for the January USS proposal.

The red curve shows just over 5% effective loss of salary for those below the current £55.55k USS threshold, and then a fairly sharp decline to less than 2% lost at the salaries of the very highest-paid professors, managers and administrators. Under the January proposals, higher-paid staff would contribute proportionately less to the “rescue package” for USS — less, even, than under the March proposals. (And if the salary axis were to be extended indefinitely, the red curve would actually cross the zero-line: that’s because in the January proposals the defined-contribution rate from employers would actually have increased from (max) 13% to 13.25%.)

In terms of unequal sharing of the “pain”, then, the January proposal was even worse than the March one.

At the bottom here I’ll give the R code and a few words of explanation for the calculation of the red curve above.

But the main topic of this post arises from a remarkable feature of the above graph! At the current USS threshold salary of £55.55k, the amount lost is the same — it’s 5.08% under both proposals. Which led me to wonder: is that a coincidence, or was it actually a (pretty weird!) constraint used in the recent UUK-UCU negotiations? And then to wonder: might the best solution (i.e., for the same cost) be to do something that gives a better graph than either of the two proposals seen so far?

Tail wagging the dog?

The fact that the loss under the March proposal tops out at 5.08%, exactly (to 2 decimals, anyway) the same as in the January proposal, seems unlikely to be a coincidence?
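The agreement between the two figures is easy to check directly. What follows is a minimal sketch in base R, using only the rates quoted in this post and the previous one; the variable names (jan_rate, mar_band1 and so on) are introduced here purely for illustration and do not appear in the original code.

```r
## Percent of salary lost at the current USS threshold (salary in thousands),
## under each proposal. All names below are illustrative only.
threshold <- 55.55

## January proposal: DB value (19/75) and the 1% match are lost,
## replaced by 13.25% + 8% defined contribution
jan_rate <- 19/75 + 1/100 - (13.25 + 8)/100
jan_pct <- 100 * jan_rate

## March proposal: two bands below the current threshold
mar_band1 <- (8.7 - 8)/100 + 19 * (1/75 - 1/85) + 1/100         # up to 42k
mar_band2 <- (8.7 - 8)/100 + (19/75 - (12 + 8.7)/100) + 1/100   # 42k to 55.55k
mar_pct <- 100 * (42 * mar_band1 + (threshold - 42) * mar_band2) / threshold

round(c(January = jan_pct, March = mar_pct), 2)
```

Both values round to 5.08. At more decimal places they differ very slightly (the January figure is a little smaller), so the match is "to 2 decimals" rather than exact.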

If it’s not a coincidence, then a plausible route to the March proposal, at the UUK-UCU negotiating table, could have been along the lines of:

How can we re-work the January proposal to

retain defined benefit, up to some (presumably reduced) threshold and with some (presumably reduced) accrual rate,

while at the same time

nobody loses more than the maximum 5.08% that’s in the January proposal

the employer contribution rate to the DC pots of high earners is not reduced below the current standard (i.e., without the “match”) level of 12%

?

Those constraints, coupled with total cost to employers, would lead naturally to a family of solutions indexed by just two adjustable constants, namely

the threshold salary up to which DB pension applies (previously £55.55k)
the DB accrual rate (previously 1/75)

— and it seems plausible that the suggested (12 March 2018) new threshold of £42k and accrual rate of 1/85 were simply selected as the preferred candidate (among many such potential solutions) to offer to UUK and UCU members.

But the curve ought to be flat, or even increasing!

The two constraints listed as second and third bullets in the above essentially fix the position of the part of the black curve that applies to salaries over £55.55k. That’s what I mean by “tail wagging the dog”. Those constraints inevitably result in a solution that implies substantial losses for those with low or moderate incomes.

Once this is recognised, it becomes natural to ask: what should the shape of that “percentage loss” curve be?

The answer is surely a matter of opinion.

Those wishing to preserve substantial pension contributions at high salary levels, at the expense of those at lower salary levels, would want a curve that decreases to the right — as seen in the above curves for the January and March proposals.

For myself, I would argue the opposite: The “percent lost” curve should either be roughly constant, or might reasonably even increase as salary increases. (The obvious parallel being progressive rates of income tax: those who can afford to pay more, pay more.)

I had made a specific suggestion along these lines, in this earlier post:

Future USS: Robin Hood can help?

The details of any solution that satisfies the “percent loss roughly constant, or even increasing” requirement clearly would need to depend on data that’s not so widely available (mainly, the distribution of all salaries for USS members).

But first the principle of fairness needs to be recognised. And once that is accepted, the constraints underlying future UUK-UCU negotiations would need to change radically — i.e., definitely away from those last two bullets in the above display.

Calculation of the red curve

In the previous post I gave R code for the black curve. Here is the corresponding calculation behind the red curve:

sacrifice.Jan <- function(salary) { # salary in thousands
    old_threshold <- 55.55
    s <- salary

## sacrifice arising from income up to old_threshold
    s2 <- min(s, old_threshold)
    r2 <- s2 * (19/75 + 1/100 - (13.25 + 8)/100)

## sacrifice (max) arising from income over the old threshold
## -- note that this is negative
    r3 <- (s > old_threshold) * (s - old_threshold) * 
                (13 - 13.25)/100

    return(r2 + r3)
}

## A vector of salary values up to £150k
salaries <- (1:1500) / 10

## Compute percent of salary that would be lost, 
## at each salary level
sacrifices <- 100 * sapply(salaries, sacrifice.Jan) / salaries

In essence:

salary under £55.55k would lose the defined benefit (that’s the 19/75 part) and the 1% “match”, and in its place would get 21.25% as defined contribution. The sum of these parts is the computed loss r2.
salary over £55.55k would gain the difference between potential 13% employer contribution and the proposed new rate of 13.25% (that’s the negative value r3 in the code).
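On the earlier remark that the red curve would eventually cross the zero-line: here is a rough sketch of where that would happen, assuming the same simple rates as in sacrifice.Jan continue to apply at any salary level (which ignores HMRC allowance limits and other complications). The helper name loss_jan_high is hypothetical; it just restates the part of sacrifice.Jan that applies above the threshold.

```r
## Total January-proposal sacrifice (in thousands of pounds) for salaries
## above the current threshold. A restatement of sacrifice.Jan for that
## range, under the hypothetical name loss_jan_high.
loss_jan_high <- function(salary) {   # salary in thousands, above 55.55
    old_threshold <- 55.55
    below <- old_threshold * (19/75 + 1/100 - (13.25 + 8)/100)
    above <- (salary - old_threshold) * (13 - 13.25)/100
    below + above
}

## Salary (in thousands) at which the total loss reaches zero
crossing <- uniroot(loss_jan_high, interval = c(55.55, 5000))$root
crossing
```

Under these assumptions the crossing comes at a salary a little under £1.2 million, far beyond the right-hand edge of the graph.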

Update, 16 March: There’s now a follow-up post to this one, which gives more detail on how (mathematically) to achieve a fairer sharing-out of whatever level of USS member pain might ultimately be deemed necessary. See Simple maths of a fairer USS deal (but ideally only after reading the necessary background, above!).

To cite this entry: Firth, D (2018). USS proposals: Tail wagging the dog? Weblog entry at URL https://DavidFirth.github.io/blog/2018/03/15/uss-proposals-tail-wagging-the-dog/

Latest USS proposal: Who would lose most?

2018-03-13T00:00:00+00:00

Yesterday (March 12th) the UUK/UCU negotiations at ACAS concluded with an agreement document.

In this post I’ll look at the numbers in those proposed interim changes to the Universities Superannuation Scheme, to work out how much money would effectively be lost by USS members at each salary level.

This is inevitably a fairly rough calculation, but its results don’t really demand more precision. The picture is very clear: the cost of “saving” USS would be felt most by USS members with low or moderate incomes.

The effective marginal rates at which money is lost by members are (as calculated below):

4.7% on salary up to £42k
6.3% on salary between £42k and the current USS threshold salary of £55.55k
1.0% (at most) on salary over £55.55k

This translates into the following relationship between salary and the percentage of total salary lost:

The two “kinks” in that graph reflect the discontinuities in marginal rates, at £42k and at £55.55k.

The vertical lines drawn in green are current full-time pay grades at a typical university (with no London allowance or other extras): grade 6 is the pay of many Research Associates and Teaching Fellows, for example; grade 7 is the pay of most Lecturers; grade 8 is the pay of Senior Lecturers and Readers; and grade 9 is the pay of Professors and other senior staff. (I have mentioned only academic and research staff here, but the same grades apply also to administrative and technical staff in UK universities.)

The long decay to the right continues indefinitely, ultimately approaching an asymptote at 1% lost, i.e., for those with absolutely stratospheric salaries (if such people are actually members of USS, still, that is — though I would guess that many are not).
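To see how slowly that decay approaches 1%, here is a small sketch in base R using the per-band rates from the calculation further down this post. The names fixed_loss and pct_lost_high are illustrative only, and the formula applies only to salaries above £55.55k.

```r
## Loss on the first 55.55k of salary under the 12 March agreement
## (in thousands of pounds); names here are illustrative only
fixed_loss <- 42 * ((8.7 - 8)/100 + 19 * (1/75 - 1/85) + 1/100) +
    (55.55 - 42) * ((8.7 - 8)/100 + (19/75 - (12 + 8.7)/100) + 1/100)

## Percent of salary lost, for salaries (in thousands) above 55.55
pct_lost_high <- function(salary) {
    100 * (fixed_loss + (salary - 55.55) * (1/100)) / salary
}

round(pct_lost_high(c(100, 150, 500, 1000, 10000)), 2)
```

Algebraically this is 1 + 100 × (fixed_loss − 0.5555) / salary, so the excess over 1% shrinks in proportion to 1/salary.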

In the rest of this post I’ll give the details of the calculation that leads to the above numbers and graph. (For people who prefer a list of numbers to a graphical display, I have also added the numbers as an Appendix at the bottom of this post.)

Just here, though, let me again comment on how unfair this “remedy” would be. The unfairness should be obvious from the above graph: those who are paid most, and would stand to benefit most from being in USS, would contribute least, in percentage terms, in this proposed move towards the future sustainability of USS. For a more general view on this unfairness, see also my previous two posts in this “USS” category:

The calculation

It suffices to consider salaries in three distinct bands. In each salary band, we can calculate how much is lost, per unit of salary.

The following code in R reproduces the graph drawn above. A brief explanation is then given, beneath the displayed code.

## This code runs in base R.

## Function to compute the amount that would be lost annually (£k)
## at any given salary level
sacrifice <- function(salary) { # salary in thousands
    old_threshold <- 55.55
    new_threshold <- 42
    s <- salary

## sacrifice arising from income up to the new threshold
    r1 <- min(s, new_threshold) * ((8.7 - 8)/100 +
                                    19 * (1/75 - 1/85) +
                                    1/100)

## sacrifice arising from income between the thresholds
    s2 <- (s > new_threshold) * (min(s, old_threshold) - 
                                            new_threshold)
    r2 <- s2 * ((8.7 - 8)/100 + (19/75 - (12 + 8.7)/100) + 1/100)

## sacrifice (max) arising from income over the old threshold
    r3 <- (s > old_threshold) * (s - old_threshold) * (1/100)

    return(r1 + r2 + r3)
}

## A vector of salary values up to £150k
salaries <- (1:1500) / 10

## Compute percent of salary that would be lost, 
## at each salary level
sacrifices <- 100 * sapply(salaries, sacrifice) / salaries

## Plot the result
svg(filename = "lost.svg", width = 8, height = 4)
plot(salaries, sacrifices, type = "l",
 xlab = "salary (thousands)", ylab = "percent lost",
 main = "Percent of salary lost under UUK-UCU agreement 2018-03-12")
abline(v = c(29, 39, 48, 61), col = "green")
text(x = c(34, 44, 54, 75), y = 2.8,
 labels = c("6", "7", "8", "9"), col = "green")
dev.off()

Band 1: Salary up to £42k

Most contributions from this part of salary go to the “defined benefit” part of USS. The new proposal would see 8.7% of a member’s salary up to £42k going into this, as opposed to 8.0% at present. The return (i.e., the value of the defined-benefit pension) can readily be calculated using the standard HMRC formula, the one that is used for Annual Allowance purposes. Under current USS, the value of this part is 19 times (s/75), where s is either £42k or the member’s salary if the salary is less than £42k. Under yesterday’s proposals, the value of this part would fall to 19 times (s/85). Under yesterday’s proposals, USS members would also lose the possibility to add 1% “matching” employer contribution to an additional, defined-contribution pension pot. The amount lost to each member, relating to salary in this first band, is then the sum of the additional contribution made and the amount of pension value lost: that is r1 in the above code.

Band 2: Salary between £42k and £55.55k

Now, for salaries greater than £42k, let s2 be the smaller of (salary minus £42k) and (£55.55k minus £42k). Then current USS has members contributing 8% of s2 in the defined-benefit part, for a return of 19 times s2/75. Yesterday’s proposal would change the contribution to 8.7% of s2, for a return of s2 times (12% + 8.7%). And again, the possibility of 1% matching employer contribution to the defined-contribution pot would be lost. The amount lost to each member, relating to salary in this second band, is again just the sum of the additional contribution made and the amount of pension value lost: that is r2 in the above code.

Band 3: Salary over £55.55k

Relating to salary above the current £55.55k threshold, the loss would be limited to loss of the 1% matching employer contribution. This is computed as r3 in the above code. (In practice this will be an upper bound on what is lost. Those USS members with the very highest salaries are likely also to face issues relating to the HMRC Annual Allowance and Lifetime Allowance limits, in which case the loss of the matching employer contribution could be worth substantially less than 1% to them.)
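The three marginal rates quoted near the top of this post can be read off from the per-band arithmetic just described. The following is a minimal sketch that pulls them out of the sacrifice() calculation; the variable names are illustrative and are not part of the original code.

```r
## Marginal loss rates in each salary band; names are illustrative only
extra_contribution <- (8.7 - 8)/100   # member pays 8.7% instead of 8%
match_lost <- 1/100                   # 1% employer "match" withdrawn

## Band 1: DB accrual falls from 1/75 to 1/85, valued at 19 times pension
band1 <- extra_contribution + 19 * (1/75 - 1/85) + match_lost

## Band 2: DB value 19/75 replaced by (12 + 8.7)% defined contribution
band2 <- extra_contribution + (19/75 - (12 + 8.7)/100) + match_lost

## Band 3: only the match is lost
band3 <- match_lost

round(100 * c(band1 = band1, band2 = band2, band3 = band3), 1)
```

Rounded to one decimal place, these are the 4.7%, 6.3% and 1.0% figures listed earlier.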

Conclusion

I have reproduced the full calculation here, with code, because I found the result of the calculation so shocking! If anyone reading this thinks I have made a mistake in the calculation, please do let me know. If it is correct — and right now I have no reason to suspect otherwise — then I confess I’m alarmed that this is actually being proposed as a potential solution, even as an interim solution for the next 3 years, to the perceived problems with USS. It shakes my faith in those who have been involved in negotiating it. With seemingly intelligent people on both sides of the table, how could they possibly come up with something as bad as this?

Update, 14 March: Some details in the original post yesterday were not quite right, and so the graph/numbers that appear in the now-corrected version above are different in detail from yesterday’s. But the overall picture is unchanged. (If you really want to know about those changes in detail, please see my note in Appendix 2 at the bottom of the post about that.)

Update, 16 March: After reading this post, you might perhaps be interested in these follow-ups:

To cite this entry: Firth, D (2018). Latest USS proposal: Who would lose most? Weblog entry at URL https://DavidFirth.github.io/blog/2018/03/13/latest-uss-proposal-who-would-lose-most/.

Appendix 1: A tabular view of what’s in the graph

## Make a table for anyone who wants more detail than the graph
salary <- c(10:55, 55.55, 56:100, 150)
percent_lost <- round(100 * sapply(salary, sacrifice) / salary, 2)
salary <- 1000 * salary
my_table <- data.frame(salary, percent_lost)

That’s the code for making a little table, showing the same numbers as those in the above graph.

Here is the resulting table:

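salary    %
 10000 4.68 -- I started the table at £10k for no good reason
 11000 4.68
   ...
 41000 4.68
 42000 4.68 -- the proposed new threshold
 43000 4.72
 44000 4.76
 45000 4.79
 46000 4.82
 47000 4.86
 48000 4.89
 49000 4.92
 50000 4.94
 51000 4.97
 52000 5.00
 53000 5.02
 54000 5.05
 55000 5.07
 55550 5.08 -- current USS threshold, highest % of salary lost
 56000 5.05
 57000 4.98
 58000 4.91
 59000 4.84
 60000 4.78
 61000 4.72
 62000 4.66
 63000 4.60
 64000 4.54
 65000 4.49
 66000 4.44
 67000 4.39
 68000 4.34
 69000 4.29
 70000 4.24
 71000 4.19
 72000 4.15
 73000 4.11
 74000 4.07
 75000 4.02
 76000 3.98
 77000 3.95
 78000 3.91
 79000 3.87
 80000 3.84
 81000 3.80
 82000 3.77
 83000 3.73
 84000 3.70
 85000 3.67
 86000 3.64
 87000 3.61
 88000 3.58
 89000 3.55
 90000 3.52
 91000 3.49
 92000 3.47
 93000 3.44
 94000 3.41
 95000 3.39
 96000 3.36
 97000 3.34
 98000 3.31
 99000 3.29
100000 3.27
150000 2.51 -- possibly there are even some salaries this high?!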

Appendix 2: Details of the update made on 14 March

Many thanks to all who gave feedback on the original posting, yesterday (13 March).

In response to that feedback, I made two substantive changes to the calculation. This Appendix gives details of those changes, for those who are interested (and for the record).

Neither change affects the story qualitatively: only the detailed numbers have changed a bit.

Change 1: Use of HMRC multiplier 19 rather than 23

The HMRC calculations for Annual Allowance and Lifetime Allowance purposes are different in detail: the former uses a multiplier of 19 times pension to value USS defined benefits, while the latter uses 23 (i.e., in place of 19). In yesterday’s post I had used 23. The updated figures calculated above use multiplier 19 instead.

Mainly I decided to use the smaller figure as it’s a bit more conservative, in relation to the value lost through the proposed reduction of defined benefits. (I certainly don’t want to be accused of bias in the other direction, through having picked the larger multiplier.)

The effect on the calculated numbers is mainly to reduce the height of the “spike” that appears in the graph, around the £55k salary level. The spike is still there; it’s just a bit smaller.
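To illustrate the size of that effect, here is a sketch that recomputes the peak of the curve (at the £55.55k threshold) with each multiplier. The helper peak_loss is hypothetical and not part of the original code. It holds everything else as in the corrected calculation above, so its result for multiplier 23 is not the number from the original 13 March post, which also differed in its treatment of the "match".

```r
## Peak percent loss (at the 55.55k threshold) as a function of the
## multiplier used to value defined-benefit pension.
## peak_loss is a hypothetical helper, not part of the original code.
peak_loss <- function(multiplier) {
    old_threshold <- 55.55
    new_threshold <- 42
    ## loss on salary up to the new threshold
    r1 <- new_threshold * ((8.7 - 8)/100 +
                           multiplier * (1/75 - 1/85) + 1/100)
    ## loss on salary between the two thresholds
    r2 <- (old_threshold - new_threshold) *
        ((8.7 - 8)/100 + (multiplier/75 - (12 + 8.7)/100) + 1/100)
    100 * (r1 + r2) / old_threshold
}

round(c(m19 = peak_loss(19), m23 = peak_loss(23)), 2)
```

With multiplier 19 the peak is the 5.08% shown in the table; with 23 it is noticeably higher, which is the taller "spike" referred to above.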

My friend Jon commented that the actual value of a defined-benefit pension is harder to quantify than the HMRC formula would suggest — and that it’s likely to be dependent on age and perhaps other factors. This is undoubtedly true, and certainly I would not suggest that anyone should use the above numbers for their own financial planning! Rather, the aim here was (only) to show through a simple, transparent calculation how the losses arising from current proposals would differ — in rough, average terms — between pay levels.

Since writing my post yesterday I found that I am not alone in having done a calculation like this: see also http://brianosmith.blogspot.co.uk/ (and maybe there are others too?).

Change 2: Inclusion of the USS “Match” at all salary levels

Several people pointed out to me that the USS “Match” possibility is available at all salary levels. So it’s a benefit that would be lost at all salary levels, under the 12 March agreement. In yesterday’s post I had taken it into account only at salaries over £55.55k: that (relatively minor) error is now corrected, in the revised figures shown above.