Data Linkage and Anonymisation Workshop

Just got back from a workshop held at the Turing Gateway to Mathematics in Cambridge. The event had a range of fantastic speakers discussing issues around data linkage and data privacy.

Data Linkage

Chris Dibben (University of Edinburgh) introduced the idea of linking different sets of data about subjects from multiple sources. He gave the example of how data is collected from pre-birth (pregnancy records) all the way through a person's life until death (or just after). All of this data is stored in different locations with no unique identifier.

Data Linkage Example based on Living in Scotland (Source: Chris Dibben, University of Edinburgh)

Chris’s talk also incorporated Paul Burton's (University of Bristol) presentation, and between the two they discussed how these records can be linked using probabilistic matching. By further linking parents to children, we can build a full data picture of a whole family or group of people.
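To make the idea of probabilistic matching a little more concrete, here is a minimal sketch of agreement weights in the style of the Fellegi-Sunter model (the model named in reference [2]). It is not the speakers' own method, which the talks did not spell out in this form. The field names and the m and u probabilities are hypothetical values chosen purely for illustration.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not from the talks):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "forename": (0.90, 0.02),
    "birth_year": (0.98, 0.05),
}


def field_weight(agrees: bool, m: float, u: float) -> float:
    """Log-likelihood-ratio weight contributed by one compared field."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(rec_a: dict, rec_b: dict, params: dict = FIELD_PARAMS) -> float:
    """Sum the field weights; a higher total means a more likely match."""
    return sum(
        field_weight(rec_a[field] == rec_b[field], m, u)
        for field, (m, u) in params.items()
    )


a = {"surname": "Dibben", "forename": "Chris", "birth_year": 1970}
b = {"surname": "Dibben", "forename": "Chris", "birth_year": 1971}
print(match_weight(a, a))  # all fields agree: large positive weight
print(match_weight(a, b))  # one disagreement lowers the total
```

In practice a pair is declared a link, a non-link, or sent for clerical review by comparing the total against two thresholds. A value that occurs very often has a higher u, so agreeing on it adds less weight.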

Data linkage has its limits. One is high-frequency values: common surnames such as Smith are an example. Others are mistyped data entry or, as we use older records, text that was scanned incorrectly by OCR. Later on, Natalie Shlomo (University of Manchester) demonstrated her approach to data linkage and gave a good example of matching data records by weighting the probability through the use of string comparators. Natalie went on to give examples using Jaro [1] and Jaro-Winkler [2].
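For readers who have not met these comparators, below is a rough sketch of the two in plain Python. It follows the commonly published definitions rather than anything shown in Natalie's slides. The parameter names `prefix_scale` and `max_prefix` are my own, set to the usual defaults of 0.1 and 4.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two equal characters only count as a match within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)


for pair in [("SMITH", "SMYTH"), ("MARTHA", "MARHTA"), ("DWAYNE", "DUANE")]:
    print(pair, round(jaro(*pair), 4), round(jaro_winkler(*pair), 4))
```

A score like this can stand in for the simple agree/disagree decision on a field, so that a near miss such as a typo or an OCR error still earns partial credit instead of counting as a full disagreement.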

Commercial Applications

The workshop also saw talks from Clive Humby (Starcount) and Mary Gregory (Department of Energy and Climate Change).

Clive talked about the commercial aspects of Big Data and gave examples of how Tesco uses its data. He explained how commercial analysts want to model the behaviour of groups in relation to sales, suggesting that organisations can test whether various indicators change in response to business activities, e.g. a new product launch or a price adjustment. Clive also discussed how consumers give away their personal data, yet are protective of other data (e.g. medical).

It appears no-one actually wants personal information: they don’t want to look after it, and they don’t want the responsibility. Genuine business use is for behaviour matching and has nothing to do with what Joe Bloggs does on his weekend.

Red Blue Data Separation (Source: Clive Humby, Starcount)

Mary highlighted how government initiatives such as Open Data policies are leading to more data being made publicly available. She gave a fantastic example of the processes and steps her team went through to anonymise some of their data, even to the point of offering prizes to analysts who tried to ‘crack’ the data.


[1] Jaro, M. A. 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84:414–420.

[2] Winkler, W. E. 1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association:354–359.

Jaro, M. A. 1995. Probabilistic linkage of large public health data files (disc: P687-689). Statistics in Medicine 14:491–498.