Entity Resolution with Dedupe Package and SQLite

In [1]:
import dedupe
import os
import itertools
import time
import logging
import optparse
import locale
import pickle
import multiprocessing
import sqlite3 as lite
import pandas as pd
import unicodedata
import csv
import sys

pd.options.display.max_columns = 999

Original data and cleaning

In [2]:
data = pd.read_csv('../data/companies-with-controlling-entities v6.csv', encoding='utf-8')
/Users/nedyoxall/anaconda/envs/datakind3/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [3]:
data.head(2)
Out[3]:
company_number company_name jurisdiction_code Controlling Entity SKey Controlling Entity Name opencorporates_url address_care_of address_country address_locality address_postal_code address_region address_street controlling_company_type country_of_residence dob_month dob_year family_name given_name middle_name nationality po_box title uid ultimate_entity_company_number controlling_entity_company_number
0 452 VERNON INVESTMENTS (1856) LIMITED gb 0 Mrs Mary Jane Ursula White https://opencorporates.com/companies/gb/00000452 NaN United Kingdom High Wycombe HP10 9QN Bucks. The Mill House Boundary Road\nLoudwater NaN United Kingdom 8.0 1961.0 White Mary Jane Ursula British NaN Mrs /company/00000452/persons-with-significant-con... NaN NaN
1 452 VERNON INVESTMENTS (1856) LIMITED gb 1 Mr Andrew Gwynne Haydon White https://opencorporates.com/companies/gb/00000452 NaN United Kingdom High Wycombe HP10 9QN Bucks. The Mill House Boundary Road\nLoudwater NaN United Kingdom 2.0 1960.0 White Andrew Gwynne Haydon British NaN Mr /company/00000452/persons-with-significant-con... NaN NaN
In [4]:
# TODO - trim left and right of name for spaces? 
# TODO - check that the index is unique, or will cause problems at a later date!

Select just the columns that are useful for entity resolution. Then:

  1. Remove rows with nulls in the family name and given name columns.
  2. Fill NaNs in the other columns with None (required for the dedupe package to work properly on String types).
  3. Convert numbers to strings - possibly the best way of dealing with human-input text? Open question!
In [5]:
to_dedupe = data[['family_name', 'given_name', 'middle_name', 'dob_year', 'dob_month']].copy()
In [6]:
# Some utility functions for getting dataframes ready for deduping
def remove_nulls_from_these_columns(df, columns, verbose=False):
    """ Utility function to remove nulls from the specified columns of a dataframe.
        
        Inputs: - df -> dataframe to remove nulls from
                - columns -> list of columns which should have no nulls in
                - verbose -> bool to print row counts as each column is processed
                
        Outputs: - new dataframe without nulls
    """
    if verbose:
        print("Number of rows before removing nulls: {}".format(len(df)))
    to_remove = df
    for col in columns:
        to_remove = to_remove[to_remove[col].notnull()]
        if verbose:
            print("Number of rows after removing " + col + ": {}".format(len(to_remove)))
        
    return to_remove

def fill_nulls_with_none(df):
    """ Fills nulls in a dataframe with None.
        This is required for the Dedupe package to work properly.
        
        Input: - dataframe with nulls as NaN
        
        Output: - new dataframe with nulls as None
    """
    new_df = df.copy()
    for col in df.columns:
        new_df[col] = new_df[col].where(new_df[col].notnull(), None)
    return new_df

def convert_numbers_to_strings(df, cols_to_convert, remove_point_zero=True):
    """ Convert number types to strings in a dataframe.
        This is convoluted as need to keep NoneTypes as NoneTypes for what comes next!
        
        Inputs: - df -> dataframe to convert number types
                - cols_to_convert -> list of columns to convert
                - remove_point_zero -> bool to say whether you want '.0' removed from number
        
        Ouputs: - dataframe with converted number types
    """
    new_df = df.copy()
    for col in cols_to_convert:
        if remove_point_zero:
            new_df[col] = new_df[col].apply(lambda x: str(x).replace('.0','')\
                                            if not isinstance(x, type(None)) else x)
        else:
            new_df[col] = new_df[col].apply(lambda x: str(x)\
                                            if not isinstance(x, type(None)) else x)
    return new_df
In [7]:
to_dedupe = remove_nulls_from_these_columns(to_dedupe, ['family_name','given_name'], verbose=True)
to_dedupe = fill_nulls_with_none(to_dedupe)
to_dedupe = convert_numbers_to_strings(to_dedupe, ['dob_year','dob_month'])
Number of rows before removing nulls: 759447
Number of rows after removing family_name: 706156
Number of rows after removing given_name: 706151
In [8]:
to_dedupe.head()
Out[8]:
family_name given_name middle_name dob_year dob_month
0 White Mary Jane Ursula 1961 8
1 White Andrew Gwynne Haydon 1960 2
7 Cooke Bernard Patrick 1939 3
8 Schofield Peter Malcolm 1944 2
9 Howells Peter John 1951 3

Dedupe requires that the input data be a dictionary of records, keyed by a unique record id:

In [9]:
to_dedupe_dict = to_dedupe.to_dict(orient = 'index')

This is what the entries of the dictionary look like:

In [10]:
to_dedupe_dict[0]
Out[10]:
{'dob_month': '8',
 'dob_year': '1961',
 'family_name': 'White',
 'given_name': 'Mary',
 'middle_name': 'Jane Ursula'}

For convenience while testing, store the to_dedupe_dict object as a pickle file:

In [11]:
with open('to_dedupe_dict.pkl', 'wb') as f:
    pickle.dump(to_dedupe_dict, f)
In [12]:
with open('to_dedupe_dict.pkl', 'rb') as f:
    to_dedupe_dict = pickle.load(f)

Setting up the Dedupe object

Each field (or, in other words, column) has to be declared and given a type. You can read about the different options in the dedupe documentation on variable definitions:

In [13]:
# declare the fields to use for matching, and their types
fields = [{'field' : 'family_name', 'type' : 'String'},
          {'field' : 'given_name', 'type': 'String'}, 
          {'field' : 'middle_name', 'type': 'String', 'has_missing' : True},
          {'field' : 'dob_year', 'type': 'String', 'has_missing' : True},
          {'field' : 'dob_month', 'type': 'String', 'has_missing' : True}]
In [14]:
# There is a bug later on that requires num_cores to be 1, but we can make use of
# multiple cores in the meantime
deduper = dedupe.Dedupe(fields, num_cores=4)

We then sample the data to get something to train on. More information about all of the deduper object's methods can be found in the dedupe API documentation. Note that training is done by a human - pairs of records are presented and you're asked whether they refer to the same thing or not (we're coming to this bit shortly!)

In [15]:
deduper.sample(to_dedupe_dict, sample_size=1000000)

Training dedupe with human-matched records

If you've already done the training, load the file that contains the record pairs marked as duplicates or not (it gets saved as a JSON file).

In [16]:
training_file = 'controlling_entities_dedupe_training.json'
if os.path.exists(training_file):
    print('reading labeled examples from ', training_file)
    with open(training_file, 'rb') as tf:
        deduper.readTraining(tf)
else:
    print('no training file found - will create new one: ' + training_file)
INFO:dedupe.api:reading training from file
reading labeled examples from  controlling_entities_dedupe_training.json
In [17]:
dedupe.convenience.consoleLabel(deduper)
family_name : Hilton
given_name : Paul
middle_name : Christopher
dob_year : 1967
dob_month : 6

family_name : Hilton
given_name : Paul
middle_name : None
dob_year : 1967
dob_month : 10

21/10 positive, 27/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Watts
given_name : Stephen
middle_name : Patrick James
dob_year : 1951
dob_month : 3

family_name : Watt
given_name : Stephen
middle_name : None
dob_year : 1959
dob_month : 3

21/10 positive, 28/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Perkins
given_name : Mary
middle_name : Lesley
dob_year : 1943
dob_month : 2

family_name : Perkins
given_name : Mary
middle_name : Lesley
dob_year : 1944
dob_month : 2

21/10 positive, 29/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Brown
given_name : Dominic
middle_name : None
dob_year : 1973
dob_month : 8

family_name : Brownlee
given_name : Dominic
middle_name : None
dob_year : 1978
dob_month : 8

22/10 positive, 29/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Kirby
given_name : Hugo
middle_name : Giles Stephen Astley
dob_year : 1953
dob_month : 5

family_name : Kirby
given_name : Hugo
middle_name : Giles Stephen Astley
dob_year : 1958
dob_month : 5

22/10 positive, 30/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Chen
given_name : Hongbing
middle_name : None
dob_year : 1971
dob_month : 12

family_name : Chen
given_name : Hongbin
middle_name : None
dob_year : 1970
dob_month : 12

23/10 positive, 30/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : White
given_name : June
middle_name : Elizabeth
dob_year : 1949
dob_month : 6

family_name : Whitehouse
given_name : June
middle_name : None
dob_year : 1941
dob_month : 6

24/10 positive, 30/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Greaves
given_name : Justin
middle_name : Piers Leonard
dob_year : 1947
dob_month : 12

family_name : Greaves
given_name : Justin
middle_name : Piers Leonard
dob_year : 1957
dob_month : 12

24/10 positive, 31/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Grant
given_name : Jo
middle_name : None
dob_year : 1970
dob_month : 8

family_name : Grant
given_name : John
middle_name : Kenneth
dob_year : 1960
dob_month : 8

25/10 positive, 31/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Quinn
given_name : Paul
middle_name : Lawrence Patrick
dob_year : 1971
dob_month : 6

family_name : Quinn
given_name : Paul
middle_name : Patrick
dob_year : 1979
dob_month : 6

25/10 positive, 32/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Kirby
given_name : Hugo
middle_name : Giles Stephen Astley
dob_year : 1958
dob_month : 5

family_name : Kirby
given_name : Hugo
middle_name : Giles Stephen Astley
dob_year : 1953
dob_month : 5

25/10 positive, 33/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Meek
given_name : Jeffrey
middle_name : Andrew Callaghan
dob_year : 1963
dob_month : 4

family_name : Meek
given_name : Jeffrey
middle_name : Andrew Callaghan
dob_year : 1962
dob_month : 4

26/10 positive, 33/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Dalton
given_name : Richard
middle_name : Alistair
dob_year : 1969
dob_month : 1

family_name : Dalton
given_name : Richard
middle_name : None
dob_year : 1960
dob_month : 12

27/10 positive, 33/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Harris
given_name : Jarod
middle_name : Thomas Chamberlain
dob_year : 1971
dob_month : 2

family_name : Harris
given_name : Jarod
middle_name : Thomas Chamberlain
dob_year : 1972
dob_month : 2

27/10 positive, 34/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Parsons
given_name : Andrew
middle_name : None
dob_year : 1964
dob_month : 1

family_name : Parsons
given_name : Andrew
middle_name : Keith
dob_year : 1961
dob_month : 11

28/10 positive, 34/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Prabhaker
given_name : Sangeet
middle_name : Alejanro Luis
dob_year : 1979
dob_month : 3

family_name : Prabhaker
given_name : Sangeet
middle_name : Alejanro Luis
dob_year : 1976
dob_month : 3

28/10 positive, 35/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Griffiths
given_name : Peter
middle_name : David
dob_year : 1959
dob_month : 12

family_name : Griffiths
given_name : Peter
middle_name : None
dob_year : 1979
dob_month : 1

29/10 positive, 35/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Davies
given_name : Peter
middle_name : Joseph Mansel
dob_year : 1969
dob_month : 9

family_name : Davies
given_name : Peter
middle_name : Joseph Mansel
dob_year : 1962
dob_month : 9

29/10 positive, 36/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Noble
given_name : David
middle_name : None
dob_year : 1956
dob_month : 1

family_name : Noble
given_name : David
middle_name : Clive
dob_year : 1955
dob_month : 11

30/10 positive, 36/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Beever Estate
given_name : Executors
middle_name : Of Edward
dob_year : 2015
dob_month : 9

family_name : Beever
given_name : The
middle_name : Executors Of Edward
dob_year : 2015
dob_month : 9

30/10 positive, 37/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Jones
given_name : Darren
middle_name : Raymond
dob_year : 1971
dob_month : 5

family_name : Jones
given_name : Darren
middle_name : Paul
dob_year : 1976
dob_month : 5

31/10 positive, 37/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Solanki
given_name : Nimesh
middle_name : None
dob_year : 1980
dob_month : 4

family_name : Solanki
given_name : Kailesh
middle_name : None
dob_year : 1980
dob_month : 4

31/10 positive, 38/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Anstruther Gough Calthorpe Bt
given_name : Euan
middle_name : Hamilton
dob_year : 1966
dob_month : 6

family_name : Anstruther Gough Calthorpe Bt
given_name : Euan
middle_name : Hamilton
dob_year : 1996
dob_month : 6

31/10 positive, 39/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Thomson
given_name : William
middle_name : George Ritchie
dob_year : 1951
dob_month : 5

family_name : Thomson
given_name : William
middle_name : George Ritchie
dob_year : 1950
dob_month : 5

32/10 positive, 39/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Hall
given_name : Richard
middle_name : Bruce
dob_year : 1952
dob_month : 10

family_name : Billingsley
given_name : Richard
middle_name : Albert
dob_year : 1952
dob_month : 10

33/10 positive, 39/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Shiekh
given_name : Amar
middle_name : None
dob_year : 1961
dob_month : 1

family_name : Sheikh
given_name : Amar
middle_name : None
dob_year : 1961
dob_month : 1

33/10 positive, 40/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : De Pellette
given_name : Alan
middle_name : James
dob_year : 1968
dob_month : 12

family_name : De Pellette
given_name : Alan
middle_name : None
dob_year : 1968
dob_month : 2

34/10 positive, 40/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Harris
given_name : Anthony
middle_name : None
dob_year : 1968
dob_month : 10

family_name : Harris
given_name : Anthony
middle_name : John
dob_year : 1968
dob_month : 5

35/10 positive, 40/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Collison
given_name : Christian
middle_name : David Ross
dob_year : 1984
dob_month : 11

family_name : Collison
given_name : Christian
middle_name : David Ross
dob_year : 1984
dob_month : 8

35/10 positive, 41/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Newett
given_name : David
middle_name : Ian
dob_year : 1959
dob_month : 2

family_name : Newett
given_name : David
middle_name : Ian
dob_year : 1959
dob_month : 12

36/10 positive, 41/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Farrugia
given_name : Paul
middle_name : John
dob_year : 1957
dob_month : 11

family_name : Farrugia
given_name : Paula
middle_name : None
dob_year : 1957
dob_month : 2

37/10 positive, 41/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Paterson
given_name : Thomas
middle_name : James
dob_year : 1979
dob_month : 9

family_name : Paterson
given_name : Thomas
middle_name : James
dob_year : 1979
dob_month : 10

37/10 positive, 42/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Chesterfield
given_name : Philip
middle_name : Markham
dob_year : 1965
dob_month : 12

family_name : Chesterfield
given_name : Philip
middle_name : Markham
dob_year : 1965
dob_month : 2

38/10 positive, 42/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Sanders
given_name : Justin
middle_name : David Cavania
dob_year : 1972
dob_month : 6

family_name : Sanders
given_name : Justin
middle_name : David Cavania
dob_year : 1976
dob_month : 6

39/10 positive, 42/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Anstruther Gough Calthorpe Bt
given_name : Euan
middle_name : Hamilton
dob_year : 1996
dob_month : 6

family_name : Anstruther Gough Calthorpe Bt
given_name : Euan
middle_name : Hamilton
dob_year : 1966
dob_month : 6

40/10 positive, 42/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Beattie
given_name : Will
middle_name : Eric Macdonald
dob_year : 1973
dob_month : 3

family_name : Beattie
given_name : William
middle_name : Brian
dob_year : 1943
dob_month : 3

41/10 positive, 42/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
n
family_name : Meath Baker
given_name : Prescilla
middle_name : Ann
dob_year : 1937
dob_month : 5

family_name : Meath Baker
given_name : Prescilla
middle_name : Ann
dob_year : 1938
dob_month : 5

41/10 positive, 43/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
family_name : Millar
given_name : Lynne
middle_name : Patricia
dob_year : 1959
dob_month : 3

family_name : Wall
given_name : Lynne
middle_name : None
dob_year : 1959
dob_month : 3

42/10 positive, 43/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
f
Finished labeling

Once you've provided the training data, you can train the model.

In [19]:
# learn both the classifier and the blocking rules
# maximum_comparisons seems to be SUPER important in finding good predicates
deduper.train(recall=0.9, maximum_comparisons=1000000000)
INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
INFO:rlr.crossvalidation:optimum alpha: 0.001000
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.4, given_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.6, given_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.2, given_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.8, given_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (3, given_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (2, given_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (1, given_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (4, given_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.6, given_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.2, given_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.4, given_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.8, given_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.8, dob_month)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.4, dob_month)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.2, dob_month)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.6, dob_month)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (2, dob_month)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (4, dob_month)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (3, dob_month)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (1, dob_month)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.8, dob_month)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.2, dob_month)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.6, dob_month)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.4, dob_month)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.6, middle_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.4, middle_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.2, middle_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.8, middle_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (4, middle_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (3, middle_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (2, middle_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (1, middle_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.6, middle_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.2, middle_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.4, middle_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.8, middle_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.2, family_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.4, family_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.8, family_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.6, family_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (2, family_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (1, family_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (4, family_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (3, family_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.4, family_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.8, family_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.6, family_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.2, family_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.8, dob_year)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.6, dob_year)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.2, dob_year)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.4, dob_year)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (4, dob_year)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (1, dob_year)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (3, dob_year)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (2, dob_year)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.2, dob_year)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.6, dob_year)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.8, dob_year)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.4, dob_year)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.4, given_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.6, given_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.2, given_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.8, given_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (3, given_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (2, given_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (1, given_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (4, given_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.6, given_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.2, given_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.4, given_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.8, given_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.8, dob_month)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.4, dob_month)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.2, dob_month)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.6, dob_month)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (2, dob_month)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (4, dob_month)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (3, dob_month)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (1, dob_month)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.8, dob_month)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.2, dob_month)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.6, dob_month)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.4, dob_month)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.6, middle_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.4, middle_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.2, middle_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.8, middle_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (4, middle_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (3, middle_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (2, middle_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (1, middle_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.6, middle_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.2, middle_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.4, middle_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.8, middle_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.2, family_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.4, family_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.8, family_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.6, family_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (2, family_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (1, family_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (4, family_name)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (3, family_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.4, family_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.8, family_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.6, family_name)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.2, family_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.8, dob_year)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.6, dob_year)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.2, dob_year)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.4, dob_year)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (4, dob_year)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (1, dob_year)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (3, dob_year)
INFO:dedupe.blocking:Canopy: LevenshteinCanopyPredicate: (2, dob_year)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.2, dob_year)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.6, dob_year)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.8, dob_year)
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.4, dob_year)
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, family_name), TfidfTextCanopyPredicate: (0.2, middle_name))
INFO:dedupe.training:(SimplePredicate: (commonSixGram, family_name), SimplePredicate: (commonThreeTokens, middle_name))
INFO:dedupe.training:(SimplePredicate: (commonTwoTokens, family_name), SimplePredicate: (twoGramFingerprint, family_name))
INFO:dedupe.training:(TfidfTextCanopyPredicate: (0.4, family_name), TfidfTextCanopyPredicate: (0.8, middle_name))
INFO:dedupe.training:(SimplePredicate: (fingerprint, family_name), TfidfNGramCanopyPredicate: (0.4, dob_month))

The blocking rules have now been found, and the classifier has also been trained. The classifier (which determines whether two records are the same or not) is an L2-regularized logistic regression. Any scikit-learn model with fit and predict_proba methods could be used in its place.
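For instance, here is a minimal sketch (not from the original notebook) of swapping in a different scikit-learn classifier before training - dedupe only requires the fit/predict_proba interface:

from sklearn.ensemble import RandomForestClassifier

# hypothetical substitution: any estimator exposing fit() and predict_proba()
# should slot in for the default L2-regularized logistic regression
deduper.classifier = RandomForestClassifier(n_estimators=100)
# a subsequent deduper.train(...) would then fit the forest on the labelled pairs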

Just to demonstrate the inputs to the model, we select two records from the input data:

In [20]:
record_pairs = ((to_dedupe_dict[0], to_dedupe_dict[1]),)

The model features are given by:

In [21]:
print(deduper.data_model._field_comparators)
[('family_name', <built-in function normalizedAffineGapDistance>, 0, 1), ('given_name', <built-in function normalizedAffineGapDistance>, 1, 2), ('middle_name', <built-in function normalizedAffineGapDistance>, 2, 3), ('dob_year', <built-in function normalizedAffineGapDistance>, 3, 4), ('dob_month', <built-in function normalizedAffineGapDistance>, 4, 5)]

And the input values for this particular pair of records by:

In [22]:
print(deduper.data_model.distances(record_pairs=record_pairs))
[[ 0.5         4.69999981  4.79166651  1.75        5.5       ]]

The coefficients in the model itself are given by:

In [23]:
print(deduper.classifier.weights)
[-11.37927761  -7.32188256  -0.63712457 -14.30634516  -4.50698292]
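To make the link between the distances and the coefficients explicit, here is a rough sketch of how the logistic regression turns them into a match probability (assuming the classifier exposes a bias term alongside its weights, as rlr's regularized logistic regression does):

import numpy as np

# one row of field-comparator outputs per record pair
distances = deduper.data_model.distances(record_pairs=record_pairs)
# logistic regression: P(match) = sigmoid(w . x + b)
scores = 1 / (1 + np.exp(-(distances @ deduper.classifier.weights
                           + deduper.classifier.bias)))
print(scores)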

Anyway, after training we can save the data we labelled, as well as the settings for this particular run:

In [24]:
settings_file = 'dedupe_settings'
with open(settings_file, 'wb') as f:
    deduper.writeSettings(f)
with open(training_file, 'w') as f:
    deduper.writeTraining(f)

The training finds both the predicates for creating 'blocks' (or 'canopies' - for a good explanation, see the canopy clustering paper by McCallum, Nigam and Ungar) and the classifier for actually determining whether two records refer to the same entity.

The documentation provides a good overview on why one might want to use blocks/canopies.

My understanding is that blocks/canopies are created for every single predicate - in other words, blocks/canopies are created on the entire dataset as many times as there are predicates.
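As an illustration of what a single predicate does, here is a tiny sketch using dedupe's predicates module directly (illustrative only - the trained predicates above combine these building blocks in pairs):

from dedupe import predicates

# a simple predicate maps a record to one or more block keys;
# records sharing a key end up in the same block
pred = predicates.SimplePredicate(predicates.wholeFieldPredicate, 'family_name')
print(pred({'family_name': 'White'}))  # e.g. ('White',)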

In [25]:
deduper.blocker.predicates
Out[25]:
((SimplePredicate: (commonThreeTokens, family_name),
  TfidfTextCanopyPredicate: (0.2, middle_name)),
 (SimplePredicate: (commonSixGram, family_name),
  SimplePredicate: (commonThreeTokens, middle_name)),
 (SimplePredicate: (commonTwoTokens, family_name),
  SimplePredicate: (twoGramFingerprint, family_name)),
 (TfidfTextCanopyPredicate: (0.4, family_name),
  TfidfTextCanopyPredicate: (0.8, middle_name)),
 (SimplePredicate: (fingerprint, family_name),
  TfidfNGramCanopyPredicate: (0.4, dob_month)))

Some predicates rely on an 'inverted index': these are the CanopyPredicates and the SearchPredicates. You can avoid predicates that rely on inverted indexes by setting index_predicates to False in the train method.

Inverted indexes are used to answer queries like "find the documents where the word X occurs". Unfortunately it seems quite challenging to print out the inverted index itself to understand exactly what it's doing. Never mind!

Note that canopies are used by predicates that require an inverted index, while blocks are used by simple predicates. The same row can appear in multiple canopies (i.e. canopies can overlap), while the same row only ever appears in one block.
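To see what an inverted index does conceptually, here is a toy one built from scratch (nothing to do with dedupe's internal implementation):

from collections import defaultdict

# map each token to the set of record ids whose field contains it
docs = {0: 'Jane Ursula', 1: 'Gwynne Haydon', 2: 'Jane'}
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        inverted_index[token].add(doc_id)

print(inverted_index['jane'])  # {0, 2}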

In [26]:
for field in deduper.blocker.index_fields:
    print(field)
dob_month
middle_name
family_name

With large data sets, the data_sample, training_pairs, training_data, and activeLearner objects can take up a lot of memory. This command deletes them to free it up.

In [27]:
deduper.cleanupTraining()

Storing the raw data in a database

For portability, it's convenient to use SQLite rather than anything bigger and grander requiring a database server. You probably wouldn't want to deal with hundreds of millions of rows like this, though!

In [29]:
database_name = 'controlling_entities_dedupe_database.db'
if os.path.exists(database_name):
    print("database '{}' already exists. deleting it and starting from scratch!".format(database_name))
    os.remove(database_name)

conn = lite.connect(database_name)
database 'controlling_entities_dedupe_database.db' already exists. deleting it and starting from scratch!

As a reminder, the raw, un-deduplicated data looks like this:

In [30]:
to_dedupe.head()
Out[30]:
family_name given_name middle_name dob_year dob_month
0 White Mary Jane Ursula 1961 8
1 White Andrew Gwynne Haydon 1960 2
7 Cooke Bernard Patrick 1939 3
8 Schofield Peter Malcolm 1944 2
9 Howells Peter John 1951 3

It's convenient to use pandas to create a table in the database. We store the raw data in a table called actual_data:

In [31]:
c1 = conn.cursor()
c1.execute("DROP TABLE IF EXISTS actual_data")
to_dedupe.to_sql('actual_data', conn, if_exists='fail', index=True, index_label = 'id')
c1.close()

Creating inverted indexes for the fields that require it

i.e. those which have a CanopyPredicate or SearchPredicate associated with them

In [32]:
# A dictionary of the Index Predicates that will be used for blocking.
# The keys are the fields the predicates will operate on.
deduper.blocker.index_fields
Out[32]:
defaultdict(<function dedupe.blocking.Blocker.__init__.<locals>.<lambda>>,
            {'dob_month': defaultdict(list,
                         {'TfidfNGramCanopyPredicate': [TfidfNGramCanopyPredicate: (0.4, dob_month)]}),
             'family_name': defaultdict(list,
                         {'TfidfTextCanopyPredicate': [TfidfTextCanopyPredicate: (0.4, family_name)]}),
             'middle_name': defaultdict(list,
                         {'TfidfTextCanopyPredicate': [TfidfTextCanopyPredicate: (0.2, middle_name),
                           TfidfTextCanopyPredicate: (0.8, middle_name)]})})
In [33]:
# this does not add an index if there are no canopy predicates in the index_fields
c3 = conn.cursor()
for field in deduper.blocker.index_fields:
    print('creating inverted index on: ' + field)
    c3.execute("SELECT DISTINCT {field} FROM actual_data "
               "WHERE {field} IS NOT NULL".format(field = field))
    field_data = (row[0] for row in c3)
    # Indexes the data from a field for use in an index predicate/canopy
    # This goes through row-by-row
    deduper.blocker.index(field_data, field)

c3.close()
creating inverted index on: dob_month
INFO:dedupe.blocking:Canopy: TfidfNGramCanopyPredicate: (0.4, dob_month)
creating inverted index on: middle_name
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.2, middle_name)
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.8, middle_name)
creating inverted index on: family_name
INFO:dedupe.blocking:Canopy: TfidfTextCanopyPredicate: (0.4, family_name)

Creating the blocks/canopies

Start by creating the (empty) table in the database that will contain the mapping between the blocks and the row_ids.

In [34]:
c2 = conn.cursor()
c2.execute("DROP TABLE IF EXISTS blocking_map")
c2.execute("CREATE TABLE blocking_map "
          "(block_key VARCHAR(200), id INTEGER) ")
c2.close()
In [35]:
SELECT_STATEMENT = "SELECT id, family_name, given_name, middle_name, " \
                   "dob_year, dob_month FROM actual_data"

c4 = conn.cursor()
print('writing blocking map')
c4.execute(SELECT_STATEMENT)
# the data should be fed to the blocker as an iterable of the form:
# data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'}), ... ]
full_data = ((row[0], {'id': row[0],
                       'family_name' : row[1],
                       'given_name' : row[2],
                       'middle_name' : row[3],
                       'dob_year' : row[4],
                       'dob_month' : row[5]}) for row in c4)
# Generate the predicates for records. Yields tuples of (predicate, record_id)
b_data = deduper.blocker(full_data)
# b_data has the form:
# [('foo:1', 1), ..., ('bar:1', 100)]
writing blocking map
In [36]:
# b_data is a generator so will yield output as looped over
c5 = conn.cursor()
for row in b_data:
    # parameterized query: block keys containing quotes would break string formatting
    c5.execute("INSERT INTO blocking_map (block_key, id) VALUES (?, ?)", row)
c5.close()
c4.close()
conn.commit()
INFO:dedupe.blocking:10000, 0.8866982 seconds
INFO:dedupe.blocking:20000, 1.6555902 seconds
INFO:dedupe.blocking:30000, 2.4064102 seconds
INFO:dedupe.blocking:40000, 3.1403292 seconds
INFO:dedupe.blocking:50000, 3.8846132 seconds
INFO:dedupe.blocking:60000, 4.6088152 seconds
INFO:dedupe.blocking:70000, 5.3391912 seconds
INFO:dedupe.blocking:80000, 6.0645912 seconds
INFO:dedupe.blocking:90000, 6.7555952 seconds
INFO:dedupe.blocking:100000, 7.4369202 seconds
INFO:dedupe.blocking:110000, 8.1320572 seconds
INFO:dedupe.blocking:120000, 8.8197112 seconds
INFO:dedupe.blocking:130000, 9.5520212 seconds
INFO:dedupe.blocking:140000, 10.3426402 seconds
INFO:dedupe.blocking:150000, 11.1764612 seconds
INFO:dedupe.blocking:160000, 12.0623922 seconds
INFO:dedupe.blocking:170000, 12.9916722 seconds
INFO:dedupe.blocking:180000, 13.9297212 seconds
INFO:dedupe.blocking:190000, 14.8699092 seconds
INFO:dedupe.blocking:200000, 15.7996292 seconds
INFO:dedupe.blocking:210000, 16.5510012 seconds
INFO:dedupe.blocking:220000, 17.2484762 seconds
INFO:dedupe.blocking:230000, 17.9182622 seconds
INFO:dedupe.blocking:240000, 18.6009632 seconds
INFO:dedupe.blocking:250000, 19.2736792 seconds
INFO:dedupe.blocking:260000, 19.9467552 seconds
INFO:dedupe.blocking:270000, 20.6309332 seconds
INFO:dedupe.blocking:280000, 21.2957982 seconds
INFO:dedupe.blocking:290000, 21.9544192 seconds
INFO:dedupe.blocking:300000, 22.6275562 seconds
INFO:dedupe.blocking:310000, 23.4300932 seconds
INFO:dedupe.blocking:320000, 24.1403062 seconds
INFO:dedupe.blocking:330000, 24.8197872 seconds
INFO:dedupe.blocking:340000, 25.4823882 seconds
INFO:dedupe.blocking:350000, 26.1421512 seconds
INFO:dedupe.blocking:360000, 26.8031332 seconds
INFO:dedupe.blocking:370000, 27.4551432 seconds
INFO:dedupe.blocking:380000, 28.1198822 seconds
INFO:dedupe.blocking:390000, 28.7666722 seconds
INFO:dedupe.blocking:400000, 29.4340742 seconds
INFO:dedupe.blocking:410000, 30.0859422 seconds
INFO:dedupe.blocking:420000, 30.7262992 seconds
INFO:dedupe.blocking:430000, 31.3874802 seconds
INFO:dedupe.blocking:440000, 32.0366312 seconds
INFO:dedupe.blocking:450000, 32.7143132 seconds
INFO:dedupe.blocking:460000, 33.3956372 seconds
INFO:dedupe.blocking:470000, 34.1253152 seconds
INFO:dedupe.blocking:480000, 34.8132422 seconds
INFO:dedupe.blocking:490000, 35.4847342 seconds
INFO:dedupe.blocking:500000, 36.1423512 seconds
INFO:dedupe.blocking:510000, 36.7974212 seconds
INFO:dedupe.blocking:520000, 37.4679592 seconds
INFO:dedupe.blocking:530000, 38.1166822 seconds
INFO:dedupe.blocking:540000, 38.8171572 seconds
INFO:dedupe.blocking:550000, 39.5315302 seconds
INFO:dedupe.blocking:560000, 40.2375672 seconds
INFO:dedupe.blocking:570000, 40.9275792 seconds
INFO:dedupe.blocking:580000, 41.7327812 seconds
INFO:dedupe.blocking:590000, 42.4457382 seconds
INFO:dedupe.blocking:600000, 43.1246422 seconds
INFO:dedupe.blocking:610000, 43.7684642 seconds
INFO:dedupe.blocking:620000, 44.5120412 seconds
INFO:dedupe.blocking:630000, 45.2681682 seconds
INFO:dedupe.blocking:640000, 45.9224352 seconds
INFO:dedupe.blocking:650000, 46.5498092 seconds
INFO:dedupe.blocking:660000, 47.2548972 seconds
INFO:dedupe.blocking:670000, 48.0567672 seconds
INFO:dedupe.blocking:680000, 48.7506512 seconds
INFO:dedupe.blocking:690000, 49.6441622 seconds
INFO:dedupe.blocking:700000, 50.6131992 seconds

The blocking_map actually looks like the following:

In [37]:
# How well has the blocking worked? check that there are not loads of blocks with single counts
blocking_map_df = pd.read_sql("SELECT * FROM blocking_map", conn)
blocking_map_df.head()
Out[37]:
block_key id
0 471:299:3 0
1 471:300:3 1
2 814:206:3 7
3 217:301:3 8
4 815:131:3 9

Conceptually, you can think of the blocks/canopies as looking something like this (note that you end up with this situation for each Predicate dedupe finds):

In [2]:
from IPython.display import Image
Image(filename="./diagrams/blocking_map.png",width=400,height=400)
Out[2]:

And here is a list of the ten largest blocks:

In [38]:
blocking_map_df.groupby('block_key')['block_key'].agg('count').sort_values(ascending = False).head(10)
Out[38]:
block_key
Valaitis:3:4    668
Smith:4:4       470
Smith:1:4       459
3591:233:3      448
Smith:3:4       419
Jones:1:4       347
Jones:4:4       346
Jones:3:4       306
75:663:3        297
75:19:3         291
Name: block_key, dtype: int64

Note - when rows do not fall into any block they are excluded (because they can only be singletons), so the length of the blocking_map might be shorter than the overall data.

It may, of course, also be longer as the same row can appear in multiple blocks/canopies.
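As a quick sanity check on coverage, this sketch (using the dataframes already in memory) counts the records that fall into no block at all:

# records whose index never appears in the blocking map can only be singletons
uncovered = to_dedupe.index.difference(blocking_map_df['id'])
print(len(uncovered), "records fall into no block")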

In [39]:
print(len(blocking_map_df))
print(len(to_dedupe))
570464
706151

If we merge the blocking map back onto the original data, we can see what has been lumped together:

In [40]:
merged = pd.merge(blocking_map_df, to_dedupe, left_on = 'id', right_index = True, how = 'right')
In [41]:
merged.sort_values(by='block_key').head()
Out[41]:
block_key id family_name given_name middle_name dob_year dob_month
526870 02893324178:3:4 699883 02893324178 Derek Higgins 1949 11
255126 08022036:4:4 320653 08022036 Nicola None 1968 10
137923 0Brien:1:4 171840 0'Brien Ivana Maria 1962 12
443993 100003:36449:3 572372 Semerean Ovidiu Severian 1976 11
274820 10000:264:3 346236 Meachin Robert James 1982 5
In [42]:
merged[merged.family_name == 'Loch']
Out[42]:
block_key id family_name given_name middle_name dob_year dob_month
187114 42572:35:3 233786 Loch Pamela Ann 1967 1
373228 42572:35:3 477017 Loch Pamela Ann 1967 1
512293 42572:35:3 677794 Loch Pam Ann 1967 1
536487 42572:18:3 713316 Loch Elizabeth Mary 1963 2
570463 NaN 237119 Loch Richard None 1968 8
570463 NaN 759367 Loch Pam None 1967 1

Finally, we create an index on the block_key for faster queries, and free up memory used by the inverted indexes:

In [43]:
c = conn.cursor()
logging.info("indexing block_key")
c.execute("CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)")
INFO:root:indexing block_key
Out[43]:
<sqlite3.Cursor at 0x1358b8500>
In [44]:
deduper.blocker.resetIndices()

SQL manipulations to prepare blocks for matching

The next steps require some data manipulations to prepare the blocks into the format required by the matchBlocks method. See the matchBlocks section in the documentation for exactly what that format is.

A MySQL version of these steps can be found in the mysql_example of the dedupe-examples repository.

These are the tables that will be created shortly:

In [45]:
c.execute("DROP TABLE IF EXISTS plural_key")
c.execute("DROP TABLE IF EXISTS plural_block")
c.execute("DROP TABLE IF EXISTS covered_blocks")
c.execute("DROP TABLE IF EXISTS smaller_coverage")
Out[45]:
<sqlite3.Cursor at 0x1358b8500>

Plural Key table

Here, the block_keys (which are fairly long varchar types) are reassigned an integer id, imaginatively called block_id. Further, if several block_keys form identical blocks, we keep only one of them. Finally, we keep only blocks containing more than one record (a block with a single record obviously can't yield a match) - this is why the word 'plural' is used!
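Before running it on the real table, here is a toy demonstration of the query's logic on a throwaway in-memory database (made-up keys k1-k3):

import sqlite3

demo = sqlite3.connect(':memory:')
demo.executescript("""
    CREATE TABLE blocking_map (block_key TEXT, id INTEGER);
    INSERT INTO blocking_map VALUES
        ('k1', 3), ('k1', 7),   -- k1 and k2 cover identical records...
        ('k2', 3), ('k2', 7),   -- ...so only one of them should survive
        ('k3', 5);              -- singleton block, dropped entirely
""")
rows = demo.execute(
    " SELECT MIN(block_key) FROM "
    "    (SELECT block_key, GROUP_CONCAT(id) AS block FROM "
    "       (SELECT block_key, id FROM blocking_map ORDER BY block_key, id) AS a "
    "    GROUP BY block_key "
    "    HAVING COUNT(*) > 1) AS b "
    " GROUP BY block ").fetchall()
print(rows)  # [('k1',)] - k2 collapsed into k1, k3 dropped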

In [46]:
logging.info("calculating plural_key")
c.execute(" DROP TABLE IF EXISTS plural_key")
c.execute("CREATE TABLE plural_key "
          "(block_key VARCHAR(200), "
          " block_id INTEGER PRIMARY KEY AUTOINCREMENT)  ")
INFO:root:calculating plural_key
Out[46]:
<sqlite3.Cursor at 0x1358b8500>
In [47]:
c.execute(" INSERT INTO plural_key (block_key) "
          " SELECT MIN(block_key) FROM "
          "    (SELECT block_key, GROUP_CONCAT(id) AS block FROM "
          "       (SELECT block_key, id FROM blocking_map ORDER BY block_key, id) AS a "
          "    GROUP BY block_key "
          "    HAVING COUNT(*) > 1) AS b "
          " GROUP BY block  ")
Out[47]:
<sqlite3.Cursor at 0x1358b8500>

... and this is what plural_key actually looks like:

In [48]:
plural_key_df = pd.read_sql("SELECT * FROM plural_key", conn)
plural_key_df.head()
Out[48]:
block_key block_id
0 471:299:3 1
1 471:300:3 2
2 Osborne:4:4 3
3 Folan:1:4 4
4 Folan:3:4 5

The length of plural_key is the number of blocks within which records will be matched later:

In [49]:
print(len(plural_key_df))
66318

As a sanity check, there should be no duplicated rows in plural_key:

In [50]:
try:
    assert len(plural_key_df[plural_key_df.duplicated(keep = False)]) == 0
    print("no duplicated rows in plural_key")
except AssertionError:
    print("somehow duplicated rows have crept into plural_key - fix before continuing!!")
no duplicated rows in plural_key

And finally we index the table appropriately:

In [51]:
logging.info("creating block_key index")
c.execute("CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)")
INFO:root:creating block_key index
Out[51]:
<sqlite3.Cursor at 0x1358b8500>

Plural Block table

This simply links block_id calculated in plural_key above back to the records that each block contains (as represented by id).

In [52]:
logging.info("calculating plural_block")
c.execute(" DROP TABLE IF EXISTS plural_block ")
c.execute(" CREATE TABLE plural_block AS "
          " SELECT block_id, id "
          " FROM blocking_map AS a "
          " INNER JOIN plural_key AS b "
          " ON a.block_key = b.block_key "
          " ORDER BY block_id, id")
INFO:root:calculating plural_block
Out[52]:
<sqlite3.Cursor at 0x1358b8500>
In [53]:
plural_block_df = pd.read_sql("SELECT * FROM plural_block", conn)
plural_block_df.head()
Out[53]:
block_id id
0 1 0
1 1 605691
2 2 1
3 2 605692
4 3 1000
In [54]:
print(len(plural_block_df))
294466
In [55]:
try:
    assert len(plural_block_df[plural_block_df.duplicated(keep = False)]) == 0
    print("no duplicated rows in plural_block")
except AssertionError:
    print("somehow duplicated rows have crept into plural_block - fix before continuing!!")
no duplicated rows in plural_block
In [56]:
logging.info("adding id index and sorting index")
c.execute("CREATE INDEX plural_block_id_idx ON plural_block (id)")
c.execute("CREATE UNIQUE INDEX plural_block_block_id_id_uniq "
          " ON plural_block (block_id, id)")
INFO:root:adding id index and sorting index
Out[56]:
<sqlite3.Cursor at 0x1358b8500>

Covered Blocks table

This table maps records to the blocks they appear in (if more than one, all blocks are concatenated with a comma separator). Remember that, at this stage, blocks with just one record in ("singleton" blocks) have been excluded.

In [57]:
logging.info("creating covered_blocks")
c.execute(" CREATE TABLE covered_blocks AS "
          " SELECT id, GROUP_CONCAT(block_id) AS sorted_ids"
          " FROM (SELECT id, block_id FROM plural_block ORDER BY id, block_id) AS a "
          " GROUP BY id")
INFO:root:creating covered_blocks
Out[57]:
<sqlite3.Cursor at 0x1358b8500>
In [58]:
covered_blocks_df = pd.read_sql("SELECT * FROM covered_blocks", conn)
covered_blocks_df.head()
Out[58]:
id sorted_ids
0 0 1
1 1 2
2 7 57783
3 9 64129
4 13 6112

Number of rows in covered_blocks:

In [59]:
print(len(covered_blocks_df))
258376

Examples of rows which appear in more than one block:

In [60]:
covered_blocks_df[covered_blocks_df.sorted_ids.str.contains(',')].head()
Out[60]:
id sorted_ids
9 27 27254,27255
19 75 60564,60565
20 80 61869,61870
24 92 64591,64592
41 197 17381,17382
In [61]:
c.execute("CREATE UNIQUE INDEX covered_blocks_id_idx "
          "ON covered_blocks (id)")
Out[61]:
<sqlite3.Cursor at 0x1358b8500>
In [62]:
conn.commit()

Smaller Coverage table

This table is a PITA to think about. Essentially, it says: for each record, which block_id is it in; is it in any other block_ids; and if so, which of those other block_ids come before this one? The purpose is to avoid comparing the same pair of records more than once: if two records both fall into the same two blocks, for example, we only want to compare them in one of them.
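A worked toy example of the intent, assuming a record that sits in blocks 7, 9 and 12 (so its sorted_ids is '7,9,12'):

# for each block the record belongs to, smaller_ids lists the blocks that come
# earlier in sorted_ids; matchBlocks then skips comparing a pair of records in
# a block if the pair already co-occurred in one of the smaller blocks
sorted_ids = '7,9,12'
blocks = sorted_ids.split(',')
for i, block_id in enumerate(blocks):
    print(block_id, '->', repr(','.join(blocks[:i])))
# 7 -> '', 9 -> '7', 12 -> '7,9'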

In [63]:
logging.info("creating smaller_coverage")
c.execute(" DROP TABLE IF EXISTS smaller_coverage ")
c.execute(" CREATE TABLE smaller_coverage "
          " AS SELECT a.id, block_id, sorted_ids, "
          "           RTRIM(SUBSTR(sorted_ids, 1, INSTR(sorted_ids, block_id) - 1), ',') as smaller_ids "
          " FROM plural_block AS a "
          " INNER JOIN covered_blocks AS b "
          " ON a.id = b.id")
INFO:root:creating smaller_coverage
Out[63]:
<sqlite3.Cursor at 0x1358b8500>
In [64]:
smaller_coverage_df = pd.read_sql("SELECT * FROM smaller_coverage", conn)
smaller_coverage_df.head()
Out[64]:
id block_id sorted_ids smaller_ids
0 0 1 1
1 605691 1 1
2 1 2 2
3 605692 2 2
4 1000 3 3
In [65]:
print(len(smaller_coverage_df))
294466

Examples of where smaller_ids is not empty:

In [66]:
smaller_coverage_df[smaller_coverage_df.smaller_ids.str.contains(',')].head()
Out[66]:
id block_id sorted_ids smaller_ids
69828 178568 14600 14598,14599,14600 14598,14599
69829 223486 14600 14598,14599,14600 14598,14599
69830 245362 14600 14598,14599,14600 14598,14599
69936 178720 14628 7265,14627,14628 7265,14627
69937 241185 14628 7265,14627,14628 7265,14627

Clustering within the blocks/canopies

First create the final table that is fed to matchBlocks:

In [67]:
logging.info("creating final table")
c.execute("CREATE TABLE final AS "
          "SELECT a.id, family_name,given_name, middle_name, dob_year, dob_month, "
          "block_id, smaller_ids "
          "FROM smaller_coverage AS a "
          "INNER JOIN actual_data AS b "
          "ON a.id = b.id "
          "ORDER BY (block_id)")
INFO:root:creating final table
Out[67]:
<sqlite3.Cursor at 0x1358b8500>
In [68]:
conn.commit()
In [69]:
to_match = pd.read_sql("SELECT * FROM final", conn)
to_match.head()
Out[69]:
id family_name given_name middle_name dob_year dob_month block_id smaller_ids
0 0 White Mary Jane Ursula 1961 8 1
1 605691 White Mary Jane Ursula 1961 8 1
2 1 White Andrew Gwynne Haydon 1960 2 2
3 605692 White Andrew Gwynne Haydon 1960 2 2
4 1000 Osborne Jill Penelope 1951 10 3

Write a generator function that feeds the blocks to matchBlocks in the correct format:

In [70]:
start_time = time.time()

# Generator that walks the rows of the 'final' table (sorted by block_id)
# and yields one block at a time, in the format matchBlocks expects
def candidates_gen(result_set):
    block_id = None
    records = []
    i = 0
    for row in result_set:
        # convert the sqlite row tuple into the dict dedupe expects
        row = {'id': row[0],
               'family_name': row[1],
               'given_name': row[2],
               'middle_name': row[3],
               'dob_year': row[4],
               'dob_month': row[5],
               'block_id': row[6],
               'smaller_ids': row[7]}

        # when a new block_id appears, yield the block just finished
        if row['block_id'] != block_id:
            if records:
                yield records

            block_id = row['block_id']
            records = []
            i += 1

            if i % 10000 == 0:
                print(i, "blocks")
                print(time.time() - start_time, "seconds")

        smaller_ids = row['smaller_ids']

        if smaller_ids:
            smaller_ids = set(smaller_ids.split(','))
        else:
            smaller_ids = set()

        records.append((row['id'], row, smaller_ids))

    # after the loop, yield the final block
    if records:
        yield records

Deal with the bug where matchBlocks hangs if it is using more than one core:

In [71]:
# this has to be set to 1 or matchBlocks fails
deduper.num_cores = 1
In [72]:
c.execute("SELECT * FROM final")
# cands is the iterable
cands = candidates_gen(c)

Finally - we're at the stage where we can perform the comparison between all the records in a block:

In [73]:
print('clustering...')
clustered_dupes = deduper.matchBlocks(cands)
clustering...
10000 blocks
11.290481805801392 seconds
20000 blocks
15.696646928787231 seconds
30000 blocks
20.057891845703125 seconds
40000 blocks
23.997675895690918 seconds
50000 blocks
28.32711100578308 seconds
60000 blocks
32.93247079849243 seconds

Saving the results

We store the results of matchBlocks in a table called entity_map:

In [74]:
c.execute("DROP TABLE IF EXISTS entity_map")

print('creating entity_map database')
c.execute("CREATE TABLE entity_map "
          "(id INTEGER, canon_id INTEGER, "
          " cluster_score FLOAT, PRIMARY KEY(id))")
creating entity_map database
Out[74]:
<sqlite3.Cursor at 0x1358b8500>
In [75]:
for cluster, scores in clustered_dupes:
    cluster_id = cluster[0]
    for id, score in zip(cluster, scores):
        # parameterized insert; cast numpy scalars to plain Python types for sqlite
        c.execute('INSERT INTO entity_map VALUES (?, ?, ?)',
                  (int(id), int(cluster_id), float(score)))
In [76]:
c.execute("CREATE INDEX head_index ON entity_map (canon_id)")
conn.commit()
In [77]:
entity_map_df = pd.read_sql("SELECT * FROM entity_map", conn)
In [78]:
entity_map_df.head()
Out[78]:
id canon_id cluster_score
0 0 0 1.0
1 1 1 1.0
2 9 9 1.0
3 13 13 1.0
4 24 24 1.0

Print the number of duplicates found:

In [79]:
print('# duplicate sets')
print(len(clustered_dupes))
# duplicate sets
32805

Combining the results with the original data

In [80]:
output = pd.merge(to_dedupe, entity_map_df, left_index = True, right_on = 'id', how = 'left')
In [81]:
output.sort_values(by = 'canon_id').head(10)
Out[81]:
family_name given_name middle_name dob_year dob_month id canon_id cluster_score
0 White Mary Jane Ursula 1961 8 0 0.0 1.0
65881 White Mary Jane Ursula 1961 8 605691 0.0 1.0
1 White Andrew Gwynne Haydon 1960 2 1 1.0 1.0
65882 White Andrew Gwynne Haydon 1960 2 605692 1.0 1.0
2 Howells Peter John 1951 3 9 9.0 1.0
4703 Howells Peter John 1951 3 33099 9.0 1.0
120 Sanderson John Francis 1941 9 895 13.0 1.0
21394 Sanderson John Francis 1941 9 173796 13.0 1.0
3 Sanderson John Francis 1941 9 13 13.0 1.0
4 Richardson Eric Keith 1939 1 24 24.0 1.0

An example of success - for this person, the dob_year differs, but each row almost certainly refers to the same person, so both rows get the same canon_id:

In [86]:
output[output.family_name == 'Meath Baker']
Out[86]:
family_name given_name middle_name dob_year dob_month id canon_id cluster_score
46 Meath Baker Prescilla Ann 1937 5 361 361.0 0.764748
81 Meath Baker Prescilla Ann 1938 5 610 361.0 0.764748

Rows with the same canon_id are the rows which dedupe believes refer to the same thing. The cluster_score is a measure of how sure dedupe is about the match. If a row has no canon_id at all, dedupe doesn't believe that row has a pair.

By playing with the precision/recall trade-off in the train (and related) methods, one can move the metaphorical slider between capturing all of the duplicated records at the expense of including some pairs which aren't really the same person (high recall), and being certain that the records you say are pairs really are pairs at the expense of missing some that should be lumped together (high precision).
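As a sketch of one such knob, assuming the dedupe 1.x API used in this notebook (thresholdBlocks and the threshold argument to matchBlocks - check the docs for your version):

# recall_weight > 1 leans towards recall, < 1 towards precision; the
# candidates generator gets consumed, so the SELECT has to be re-run each time
c.execute("SELECT * FROM final")
threshold = deduper.thresholdBlocks(candidates_gen(c), recall_weight=2)

c.execute("SELECT * FROM final")
clustered_dupes = deduper.matchBlocks(candidates_gen(c), threshold=threshold)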

Much more information about entity resolution can be found in Mikhail Bilenko's PhD thesis. Indeed, the dedupe package itself is based on this work!

In [87]:
# save output to a csv
output.to_csv('entity_resolved_controlling_entities.csv')
