don't worry, it's probably fine

Legislation as Code

24 Apr 2019

civic tech python

Given the polling day of an election in the UK, when should the Statement of Persons Nominated (SoPN) be published?

I built a small Python library, sopn-publish-date, to answer this question. This post shares what I learned along the way.

🗳️ Why?

Elections have become interesting to me over the last couple of years.

I spend some of my free time volunteering my software-engineering skills at Democracy Club, and the different dates for candidate lists tickled my curiosity.

Why do some places publish their candidate lists earlier or later than others?

With my software developer hat on, I also had several things I wanted to do with this project:

  • Build a small, very specific library and avoid the temptation of feature creep
  • Use proper documentation tooling
  • Improve my Python skillset
  • Use modern Python features such as typing

⚖️ Identifying test-cases and edge-cases

Each type of election has a corresponding piece of legislation which sets out the election timetable.

The responsible electoral body should publish SoPNs a fixed number of working days before polling day.

This problem has at least three axes:

  • Type of election
  • Country in which the election takes place
  • Public holiday calendar of the country/region in which the election takes place

That leaves plenty of room for edge cases, which the tests need to capture.
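The "fixed number of working days before polling day" rule can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the library's actual implementation: the function name `working_days_before` is hypothetical, and the figure of 18 working days is used purely as an example, so check the relevant legislation for the real number for each election type.

```python
from datetime import date, timedelta


def working_days_before(poll_date, days, holidays=frozenset()):
    """Step back from poll_date until `days` working days have been counted.

    A working day here is any Monday to Friday that is not in `holidays`.
    (Hypothetical helper, not part of sopn-publish-date.)
    """
    current = poll_date
    remaining = days
    while remaining > 0:
        current -= timedelta(days=1)
        # weekday() returns 0-4 for Monday-Friday, 5-6 for the weekend
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


# Good Friday and Easter Monday 2019 (bank holidays in England and Wales)
EASTER_2019 = frozenset({date(2019, 4, 19), date(2019, 4, 22)})

# 18 is an illustrative offset, not a statement of what any legislation says
publish_by = working_days_before(date(2019, 5, 2), 18, EASTER_2019)
print(publish_by)
```

The same polling day gives a different answer as soon as the holiday calendar changes, which is why the country or region matters.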

🔨 Assembling tools

I picked pandas as it comes with very good support for date calculations. The core of the project relies on date arithmetic - a notoriously annoying problem.

Calculating working days requires knowledge of public/bank holidays. I could have built this list myself, but I opted to use GOV.UK’s API as a canonical source instead.
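As a rough sketch of what consuming that source might look like: GOV.UK publishes bank holidays as JSON at https://www.gov.uk/bank-holidays.json, keyed by division, each with a list of events. The snippet below parses a small hand-written sample in that shape instead of calling the network. The function name `holidays_by_division` is hypothetical, and the sample contains only two dates.

```python
import json
from datetime import date

# Hand-written sample in the shape of https://www.gov.uk/bank-holidays.json
# (the real payload has many more events and a "northern-ireland" division).
SAMPLE_PAYLOAD = """
{
  "england-and-wales": {
    "division": "england-and-wales",
    "events": [
      {"title": "Good Friday", "date": "2019-04-19", "notes": "", "bunting": false},
      {"title": "Easter Monday", "date": "2019-04-22", "notes": "", "bunting": true}
    ]
  },
  "scotland": {
    "division": "scotland",
    "events": [
      {"title": "Good Friday", "date": "2019-04-19", "notes": "", "bunting": false}
    ]
  }
}
"""


def holidays_by_division(payload):
    """Map each division name to the set of its bank holiday dates."""
    return {
        division: {date.fromisoformat(event["date"]) for event in body["events"]}
        for division, body in payload.items()
    }


holidays = holidays_by_division(json.loads(SAMPLE_PAYLOAD))
print(sorted(holidays["england-and-wales"]))
```

Keeping the calendars separate per division matters because the same day can be a holiday in one part of the UK and a working day in another.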

Finally, I decided to use black for the first time to improve my Python formatting.

⚙️ Evolving an API

The library started off with a single entry-point that took an election-id.

This approach quickly became unworkable, as discussed in the project’s first issue.

This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. - Doug McIlroy

I opted to redesign the API to have an entry point for every type of election. The project now contains a separate package for working with the election-id format.

Each entry point has a similar type-signature, taking a date (date of poll) and returning a date (date of candidate list publication).

Election types which need more information, like country or region, have additional parameters.
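To make the shape of that API concrete, here is a hedged sketch with type hints. Every name in it (`Country`, `fixed_calendar_election`, `country_dependent_election`, `ILLUSTRATIVE_WORKING_DAYS`) is made up for illustration and is not the library's real interface, and the offset of 18 working days is a placeholder rather than a figure taken from any legislation.

```python
from datetime import date, timedelta
from enum import Enum
from typing import AbstractSet

# Placeholder offset for illustration only; real values come from legislation.
ILLUSTRATIVE_WORKING_DAYS = 18


class Country(Enum):
    ENGLAND = "england"
    SCOTLAND = "scotland"


# Easter 2019: Good Friday is a bank holiday in both, Easter Monday only in England.
HOLIDAYS = {
    Country.ENGLAND: frozenset({date(2019, 4, 19), date(2019, 4, 22)}),
    Country.SCOTLAND: frozenset({date(2019, 4, 19)}),
}


def _working_days_before(poll_date: date, days: int, holidays: AbstractSet[date]) -> date:
    """Count back `days` working days (Mon-Fri, excluding holidays)."""
    current = poll_date
    while days > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5 and current not in holidays:
            days -= 1
    return current


def fixed_calendar_election(poll_date: date) -> date:
    """Hypothetical entry point: date in, date out, one fixed calendar."""
    return _working_days_before(
        poll_date, ILLUSTRATIVE_WORKING_DAYS, HOLIDAYS[Country.ENGLAND]
    )


def country_dependent_election(poll_date: date, country: Country) -> date:
    """Hypothetical entry point that needs an extra `country` parameter."""
    return _working_days_before(
        poll_date, ILLUSTRATIVE_WORKING_DAYS, HOLIDAYS[country]
    )


print(country_dependent_election(date(2019, 5, 2), Country.SCOTLAND))
```

Each function stays small and does one thing, and callers only supply the extra information when the election type genuinely depends on it.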

📖 Documenting the project

I wanted to document this library clearly, so I chose Sphinx to auto-generate documentation from code comments and hosted the result on ReadTheDocs.org.

I am really pleased with how this turned out, and love that the generated docs stay in sync with the code and comments. You can check out the docs here.

🔍 Collecting test data

There is a lot of test data out there in the wild: conceivably, every election held since the corresponding legislation was last amended is a valid test case.

Small problem: There is no single official place to find these test-cases.

With that in mind, I decided to split my test cases into two categories: single-election “unit” tests which I could source from a single specific election, and a large number of historic “regression” tests.

✅ Unit Tests

I used Democracy Club’s election database to find reference elections for the unit tests.

Sourcing a bulk of historic elections for regression testing was far trickier.

🐍 Regression Tests and Heuristics

It was a perfect storm. SoPNs are published as PDFs 99% of the time, and there is no requirement to archive or keep them.

Luckily, Democracy Club came to the rescue. They have an archived ZIP file of thousands of SoPNs dating back to 2015.

They were still PDFs, so I wrote some Horrible Python to:

  1. Parse the PDF to text (using pdftotext)
  2. Regex out dates in multiple formats (ended up with ((\d{1,2})(\S{2})? ([A-Za-z]+)[,]? (\d{4})))
  3. Dedupe the dates and pick a pair such that date_1 < date_2
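Steps 2 and 3 above can be sketched as follows, using the regular expression quoted in the list. This is a tidied-up illustration and not the original script: the function names are hypothetical, the input is assumed to be text already extracted from the PDF, and choosing the earliest and latest dates as the pair is an assumption about how date_1 and date_2 were picked.

```python
import re
from datetime import datetime

# The pattern quoted above: day, optional two-character suffix (st/nd/rd/th),
# month name, optional comma, four-digit year.
DATE_PATTERN = re.compile(r"(\d{1,2})(\S{2})? ([A-Za-z]+)[,]? (\d{4})")


def extract_dates(text):
    """Return the distinct, valid dates found in `text`, sorted ascending."""
    found = set()
    for day, _suffix, month, year in DATE_PATTERN.findall(text):
        # Try full month names first ("April"), then abbreviations ("Apr")
        for fmt in ("%d %B %Y", "%d %b %Y"):
            try:
                found.add(datetime.strptime(f"{day} {month} {year}", fmt).date())
                break
            except ValueError:
                continue  # not a real date in this format, e.g. "4 Road 2019"
    return sorted(found)


def pick_pair(dates):
    """Assumption: the earliest date is publication, the latest is polling day."""
    if len(dates) < 2:
        return None
    return dates[0], dates[-1]


sample = (
    "Dated Thursday 4th April 2019. "
    "Date of election: Thursday 2 May 2019. Poll on 2nd May, 2019."
)
print(pick_pair(extract_dates(sample)))
```

Matches that are not real dates are silently dropped, which is part of why the output is a heuristic and not a guarantee.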

These turned out to be reasonable heuristics and generated a CSV with over a thousand tests.

The test file reads in the CSV and uses pytest’s parametrize to generate test cases, but they still require a little bit of fudge. Sometimes the candidate list is delayed because a candidate withdraws their nomination, and sometimes the legislation is phrased as “no later than” (so the previous day is fair game).

I wrote the assertions in the tests to bear this in mind, allowing same_or_next_day or within_one_day depending on the type of legislation.
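A minimal sketch of what those tolerant assertions might look like is below. The helper names `same_or_next_day` and `within_one_day` come from the post, but their bodies here are an assumption about their meaning, and the CSV column names and election ids are invented for the example.

```python
import csv
import io
from datetime import date, timedelta


def same_or_next_day(actual, expected):
    """True if `actual` is the expected date or the day after (a delayed list)."""
    return expected <= actual <= expected + timedelta(days=1)


def within_one_day(actual, expected):
    """True if `actual` is at most one day either side ("no later than" rules)."""
    return abs((actual - expected).days) <= 1


def load_cases(csv_text):
    """Read (election_id, poll_date, sopn_date) tuples from CSV text.

    The column names are hypothetical, not the project's real CSV layout.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (
            row["election_id"],
            date.fromisoformat(row["poll_date"]),
            date.fromisoformat(row["sopn_date"]),
        )
        for row in reader
    ]


SAMPLE_CSV = """election_id,poll_date,sopn_date
example-1,2019-05-02,2019-04-04
example-2,2019-05-02,2019-04-05
"""

cases = load_cases(SAMPLE_CSV)
print(len(cases))
```

In a pytest suite, the list returned by `load_cases` could be passed to `@pytest.mark.parametrize` so that each CSV row becomes its own test case, with the assertion choosing one of the two helpers depending on the type of legislation.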

On a first run-through of the 1000+ tests, the project had a success rate of >99% against the historic data, which I’m happy with.

I did discover 14 cases where the rules don’t apply, one of which was a typo claiming the candidate list had been published more than a year earlier!

🎉 Conclusion

I really enjoyed building a small composable library to address a very niche problem, and I will take these lessons with me into the future.

It gave me a much deeper appreciation of just how big this problem space is, having spent a fair bit of time digging out the correct bits of legislation.

🇪🇺😱 Epilogue: Extensibility and the European Parliament

Just as I’d released v1.0.0 to PyPI, it was announced that the UK might be contesting elections to the European Parliament.

I decided to improve the library by adding better support for these elections, only to open the Electoral Commission timetable and find this:

As a result of a bank holiday in Gibraltar on 29 April and on 1 May 2019, some electoral deadlines in the South West electoral region are different to the deadlines elsewhere in Great Britain.

Another edge-case, how exciting!

This functionality was released in v1.1.0.

The library doesn’t support parish or City of London elections yet, but there’s plenty of room for future extensions.