I spent a fair amount of effort exploring various model architectures and configurations, all of which tend to affect performance only modestly relative to the configuration used for the reported results. However, last week I found code that was accidentally breaking important parts of the training data. So I decided to rebuild the data pipeline, line by line, from raw data to training data. I also enhanced some of the training features. I then re-trained models using this refreshed data.
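For readers curious what "breaking the training data" looks like in practice, the failure mode is usually silent: nothing errors out, the model just trains on degraded inputs. Below is a minimal sketch of the kind of sanity checks that catch this early. The column names, label vocabulary, and thresholds are hypothetical illustrations, not the actual pipeline's schema.

```python
import pandas as pd

def validate_training_data(df: pd.DataFrame) -> pd.DataFrame:
    """Basic integrity checks to run before training (hypothetical schema)."""
    # Hypothetical required columns for a vote-level dataset.
    required = {"case_id", "justice", "term", "vote", "argument_text"}
    missing = required - set(df.columns)
    assert not missing, f"missing columns: {missing}"

    # Each case should have at most nine participating justices.
    votes_per_case = df.groupby("case_id")["justice"].nunique()
    assert votes_per_case.le(9).all(), "more than nine justices in a case"

    # Text features should not be silently empty -- the failure mode that
    # quietly degrades a model without raising an error.
    empty_frac = df["argument_text"].fillna("").str.len().eq(0).mean()
    assert empty_frac < 0.05, f"{empty_frac:.1%} of rows have empty argument text"

    # Labels should stay within the expected vocabulary.
    assert df["vote"].isin({"affirm", "reverse"}).all(), "unexpected vote labels"
    return df
```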
Retraining on the refreshed data boosted performance by a few percentage points. When we include oral argument data, the model beats algorithmic SOTA on held-out data (OT2024), at both the vote level and the case level; it beats human crowds at the case level. The vote-level accuracy is 74 percent, and the case-level accuracy is 81 percent. This is about 2-4 percentage points better than the earlier model trained on the broken data. The algorithmic benchmarks again tend to perform in the low 70s for both votes and cases; human crowds perform around 80 percent for votes and mid-70s for cases.
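For context on the two metrics: vote-level accuracy scores each justice's predicted vote individually, while case-level accuracy scores the predicted disposition of the case as a whole. Here is a minimal sketch of how the two relate, assuming the case outcome is simply the majority of the predicted votes (the actual aggregation rule behind the reported numbers may differ):

```python
from collections import Counter

def case_level_accuracy(vote_preds: dict, vote_true: dict) -> float:
    """Aggregate per-justice vote predictions to case outcomes by majority,
    then score case-level accuracy.

    vote_preds / vote_true: hypothetical layout mapping case_id -> list of
    'affirm'/'reverse' strings, one per participating justice.
    """
    correct = 0
    for case_id, preds in vote_preds.items():
        pred_outcome = Counter(preds).most_common(1)[0][0]
        true_outcome = Counter(vote_true[case_id]).most_common(1)[0][0]
        correct += pred_outcome == true_outcome
    return correct / len(vote_preds)
```

One reason case-level accuracy can exceed vote-level accuracy under this kind of aggregation is that a majority rule tolerates a few wrong individual votes per case, as long as the predicted margin lands on the right side.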
Without oral argument, the results are weaker but still respectable, roughly matching the algorithmic SOTA results at the vote level: 71 percent vote-level accuracy. At the case level, however, even this low-information model outperforms human crowds and algorithmic SOTA, with case-level accuracy approaching 80 percent.
I updated the methods page to reflect this newly trained model. I will soon update the predictions page, moving older results down to an archive section.
The exercise suggests that, for this task, data is the hard part and likely the binding constraint on performance (not models). As I noted earlier, a main reason this is a notoriously difficult problem is that law is high dimensional, and a big part of the game is capturing enough relevant information in the training data without also including distracting noise. That depends on domain expertise, but also on your code not breaking.