import dedupe
import os
import itertools
import time
import logging
import optparse
import locale
import pickle
import multiprocessing
import sqlite3 as lite
import pandas as pd
import unicodedata
import csv
import sys
pd.options.display.max_columns = 999
Original data and cleaning¶
data = pd.read_csv('../data/companies-with-controlling-entities v6.csv', encoding='utf-8')
data.head(2)
# TODO - trim left and right of name for spaces?
# TODO - check that the index is unique, or will cause problems at a later date!
Select just the columns that are useful for entity resolution. Then:
- Remove rows with nulls in for family name and given name columns.
- Fill NaNs in other columns with None (required for dedupe package to work properly on String types).
- Convert numbers to strings - maybe the best way of dealing with human-input text? Open question!
to_dedupe = data[['family_name', 'given_name', 'middle_name', 'dob_year', 'dob_month']].copy()
# Some utility functions for getting dataframes ready for deduping
def remove_nulls_from_these_columns(df, columns, verbose=False):
""" Utility function to remove nulls from the specified columns of a dataframe.
Inputs: - df -> dataframe to remove nulls from
- columns -> list of columns which should have no nulls in
Outputs: - new dataframe without nulls
"""
if verbose:
print("Number of rows before removing nulls: {}".format(len(df)))
to_remove = df
for col in columns:
no_null = to_remove[to_remove[col].notnull()]
if verbose:
print("Number of rows after removing " + col + ": {}".format(len(no_null)))
to_remove = no_null
return no_null
def fill_nulls_with_none(df):
""" Fills nulls in a dataframe with None.
This is required for the Dedupe package to work properly.
Input: - dataframe with nulls as NaN
Output: - new dataframe with nulls as None
"""
new_df = df.copy()
for col in df.columns:
new_df[col] = new_df[col].where(new_df[col].notnull(), None)
return new_df
def convert_numbers_to_strings(df, cols_to_convert, remove_point_zero=True):
""" Convert number types to strings in a dataframe.
This is convoluted, as we need to keep NoneTypes as NoneTypes for what comes next!
Inputs: - df -> dataframe to convert number types
- cols_to_convert -> list of columns to convert
- remove_point_zero -> bool to say whether you want '.0' removed from number
Outputs: - dataframe with converted number types
"""
new_df = df.copy()
for col in cols_to_convert:
if remove_point_zero:
new_df[col] = new_df[col].apply(lambda x: str(x).replace('.0','')\
if not isinstance(x, type(None)) else x)
else:
new_df[col] = new_df[col].apply(lambda x: str(x)\
if not isinstance(x, type(None)) else x)
return new_df
to_dedupe = remove_nulls_from_these_columns(to_dedupe, ['family_name','given_name'], verbose=True)
to_dedupe = fill_nulls_with_none(to_dedupe)
to_dedupe = convert_numbers_to_strings(to_dedupe, ['dob_year','dob_month'])
to_dedupe.head()
Dedupe requires that the input data be a dictionary of records, where each key is a record id and each value is a dictionary of field values:
to_dedupe_dict = to_dedupe.to_dict(orient = 'index')
This is what an entry of the dictionary looks like:
to_dedupe_dict[0]
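To make the expected structure concrete, this is a minimal hand-written illustration of the same shape (hypothetical values, not taken from the data):
example_records = {
    0: {'family_name': 'Smith', 'given_name': 'John', 'middle_name': None,
        'dob_year': '1970', 'dob_month': '1'},
    1: {'family_name': 'Smyth', 'given_name': 'Jon', 'middle_name': None,
        'dob_year': '1970', 'dob_month': '1'},
}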
For convenience while testing, store the to_dedupe_dict object as a pickle file:
with open('to_dedupe_dict.pkl', 'wb') as f:
pickle.dump(to_dedupe_dict, f)
with open('to_dedupe_dict.pkl', 'rb') as f:
to_dedupe_dict = pickle.load(f)
# the dedupe documentation describes how to define these field variables
fields = [{'field' : 'family_name', 'type' : 'String'},
{'field' : 'given_name', 'type': 'String'},
{'field' : 'middle_name', 'type': 'String', 'has_missing' : True},
{'field' : 'dob_year', 'type': 'String', 'has_missing' : True},
{'field' : 'dob_month', 'type': 'String', 'has_missing' : True}]
# There is a bug later on that requires num_cores to be 1, but we can make use of
# multi-threaded processes in the meantime
deduper = dedupe.Dedupe(fields, num_cores=4)
We then sample the data to get something to train on. More information about all of the deduper object methods can be found here. Note that training is done by a human: pairs of records are presented and you're asked whether they refer to the same thing or not (coming to this bit shortly!).
deduper.sample(to_dedupe_dict, sample_size=1000000)
Training dedupe with human-matched records¶
If you've already done the training, load the file that contains the record pairs marked as being duplicates or not. (It gets saved as a json file).
training_file = 'controlling_entities_dedupe_training.json'
if os.path.exists(training_file):
print('reading labeled examples from ', training_file)
with open(training_file, 'rb') as tf:
deduper.readTraining(tf)
else:
print('no training file found - will create new one: ' + training_file)
dedupe.convenience.consoleLabel(deduper)
Once you've provided the training data, you can train the model.
# learn both the classifier and blocking rules
# maximum_comparisons seems to be SUPER important in finding good predicates
deduper.train(recall=0.9, maximum_comparisons=1000000000)
The blocking rules have now been found, and the classifier has also been trained. The classifier (which determines whether two records are the same or not) is an L2-regularized logistic regression. Any model in scikit-learn with fit and predict_proba methods could be used in its place.
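Since any estimator with those two methods will do, you could in principle swap in a different classifier before calling train - a sketch only, not something done in this notebook:
# Sketch (assumes scikit-learn is installed): replace the default classifier
# with a random forest. This must be assigned before deduper.train() is called.
# from sklearn.ensemble import RandomForestClassifier
# deduper.classifier = RandomForestClassifier(n_estimators=100)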
Just to demonstrate the inputs to the model, we select 2 records from the inputs:
record_pairs = ((to_dedupe_dict[0], to_dedupe_dict[1]),)
The model features are given by:
print(deduper.data_model._field_comparators)
And the input values for this particular pair of records by:
print(deduper.data_model.distances(record_pairs=record_pairs))
The coefficients in the model itself are given by:
print(deduper.classifier.weights)
Anyway, after training we can save the data we labelled, as well as the settings for this particular run:
settings_file = 'dedupe_settings'
with open(settings_file, 'wb') as f:
deduper.writeSettings(f)
with open(training_file, 'w') as f:
deduper.writeTraining(f)
The training finds both the Predicates for creating 'blocks' (or 'canopies' - for a good explanation, see this excellent paper) and the classifier for actually determining whether two records are referring to the same entity.
The documentation provides a good overview on why one might want to use blocks/canopies.
My understanding is that blocks/canopies are created for every single predicate - in other words, blocks/canopies are created on the entire dataset as many times as there are predicates.
deduper.blocker.predicates
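As a purely illustrative sketch of what one of these predicates does (this is not dedupe's internal code), imagine a toy predicate that keys every record on the first character of family_name; dedupe builds a mapping like this once per learned predicate:
# Illustration only: a toy 'predicate' and the blocks it induces over the data.
from collections import defaultdict
toy_blocks = defaultdict(list)
for rec_id, rec in to_dedupe_dict.items():
    toy_blocks[(rec['family_name'] or '')[:1].lower()].append(rec_id)
# Only blocks containing more than one record can generate candidate pairs.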
Some predicates rely on an 'inverted index'; these are the CanopyPredicates and the SearchPredicates. You can avoid predicates that rely on inverted indexes by setting index_predicates to False in the train method.
Inverted indexes are used to "find the documents where the word X occurs". Unfortunately it seems quite challenging to print out the inverted index itself to understand exactly what it's doing. Nevermind!
Note that canopies are used by predicates that require an inverted index, and blocks are used by simple predicates. The same row can appear in multiple canopies (i.e. canopies can overlap), while the same row only ever appears in one block.
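As a minimal sketch of that index_predicates option (not run here, since training has already been done above):
# Sketch only: disabling index predicates would mean no canopies / inverted
# indexes are built, leaving just the simple blocking predicates.
# deduper.train(recall=0.9, maximum_comparisons=1000000000, index_predicates=False)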
for field in deduper.blocker.index_fields:
print(field)
With large data sets, the data_sample, training_pairs, training_data, and activeLearner objects can take up a lot of memory. The command below deletes them to free that memory.
deduper.cleanupTraining()
Storing the raw data in a database¶
For portability, it's convenient to use SQLite rather than anything bigger and grander requiring a database server. You probably wouldn't want to deal with hundreds of millions of rows like this though!
database_name = 'controlling_entities_dedupe_database.db'
if os.path.exists(database_name):
print("database '{}' already exists. deleting it and starting from scratch!".format(database_name))
os.remove(database_name)
conn = lite.connect(database_name)
As a reminder, the raw, un-deduplicated data looks like this:
to_dedupe.head()
It's convenient to use pandas to create a table in the database. We store the raw data in a table called actual_data:
c1 = conn.cursor()
c1.execute("DROP TABLE IF EXISTS actual_data")
to_dedupe.to_sql('actual_data', conn, if_exists='fail', index=True, index_label = 'id')
c1.close()
Creating inverted indexes for the fields that require it¶
i.e. those which have a CanopyPredicate or SearchPredicate associated with them
# A dictionary of the Index Predicates that will be used for blocking.
# The keys are the fields the predicates will operate on.
deduper.blocker.index_fields
# this does not add an index if there are no canopy predicates in the index_fields
c3 = conn.cursor()
for field in deduper.blocker.index_fields:
print('creating inverted index on: ' + field)
c3.execute("SELECT DISTINCT {field} FROM actual_data "
"WHERE {field} IS NOT NULL".format(field = field))
field_data = (row[0] for row in c3)
# Indexes the data from a field for use in an index predicate/canopy
# This goes through row-by-row
deduper.blocker.index(field_data, field)
c3.close()
Creating the blocks/canopies¶
Start by creating the (empty) table in the database that will contain the mapping between the blocks and the row_ids.
c2 = conn.cursor()
c2.execute("DROP TABLE IF EXISTS blocking_map")
c2.execute("CREATE TABLE blocking_map "
"(block_key VARCHAR(200), id INTEGER) ")
c2.close()
SELECT_STATEMENT = "SELECT id, family_name, given_name, middle_name, " \
"dob_year, dob_month FROM actual_data"
c4 = conn.cursor()
print('writing blocking map')
c4.execute(SELECT_STATEMENT)
# the data should be fed to the blocker as an iterable of the form:
# data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'}), ... ]
full_data = ((row[0], {'id': row[0],
'family_name' : row[1],
'given_name' : row[2],
'middle_name' : row[3],
'dob_year' : row[4],
'dob_month' : row[5]}) for row in c4)
# Generate the predicates for records. Yields tuples of (predicate, record_id)
b_data = deduper.blocker(full_data)
# b_data has the form:
# [('foo:1', 1), ..., ('bar:1', 100)]
# b_data is a generator so will yield output as looped over
c5 = conn.cursor()
for row in b_data:
c5.execute("INSERT INTO blocking_map (block_key, id) VALUES ('{block_key}',{id})".\
format(block_key=row[0],id=row[1]))
c5.close()
c4.close()
conn.commit()
The blocking_map actually looks like the following:
# How well has the blocking worked? check that there are not loads of blocks with single counts
blocking_map_df = pd.read_sql("SELECT * FROM blocking_map", conn)
blocking_map_df.head()
Conceptually, you can think of the blocks/canopies as looking something like this (note that you end up with this situation for each Predicate dedupe finds):
from IPython.display import Image
Image(filename="./diagrams/blocking_map.png",width=400,height=400)
And here is a list of the ten largest blocks:
blocking_map_df.groupby('block_key')['block_key'].agg('count').sort_values(ascending = False).head(10)
Note - when rows do not fall into any block they are excluded (because they can only be singletons), so the length of the blocking_map might be shorter than the overall data. It may, of course, also be longer, as the same row can appear in multiple blocks/canopies.
print(len(blocking_map_df))
print(len(to_dedupe))
If we merge the blocking map back onto the original data, we can see what has been lumped together:
merged = pd.merge(blocking_map_df, to_dedupe, left_on = 'id', right_index = True, how = 'right')
merged.sort_values(by='block_key').head()
merged[merged.family_name == 'Loch']
Finally, we create an index on the block_key for faster queries, and free up the memory used by the inverted indexes:
c = conn.cursor()
logging.info("indexing block_key")
c.execute("CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)")
deduper.blocker.resetIndices()
SQL manipulations to prepare blocks for matching¶
The next steps require some data manipulation to prepare the blocks in the format required by the matchBlocks method. See the matchBlocks section in the documentation for exactly what that format is. A MySQL version of these steps can be found here.
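To make that format concrete before building it in SQL: each block fed to matchBlocks is a list of (record_id, record_dict, smaller_block_ids) tuples. A hand-written sketch with made-up ids and values:
# Sketch only (hypothetical ids/values): one block of candidate records.
# The final element is the set of 'smaller' blocks that already contain the
# record, so the same pair is never compared twice.
example_block = [
    (12, {'id': 12, 'family_name': 'Loch', 'given_name': 'Anna'}, set()),
    (57, {'id': 57, 'family_name': 'Loch', 'given_name': 'Anne'}, {3}),
]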
These are the tables that will be created shortly:
c.execute("DROP TABLE IF EXISTS plural_key")
c.execute("DROP TABLE IF EXISTS plural_block")
c.execute("DROP TABLE IF EXISTS covered_blocks")
c.execute("DROP TABLE IF EXISTS smaller_coverage")
Plural Key table¶
Here, the block_keys (which are fairly long varchar types) are reassigned an integer id, imaginatively called block_id. Further, if two block_keys form identical blocks, we keep only one of them. Further still, we keep only blocks that contain more than one record (a singleton record obviously cannot be matched to anything else within its block) - this is why the word 'plural' is used!
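A tiny worked illustration with made-up keys and ids:
# block_key 'foo:1' -> block {4, 9}
# block_key 'bar:2' -> block {4, 9}   (identical block -> only one key is kept)
# block_key 'baz:3' -> block {7}      (singleton block -> dropped entirely)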
logging.info("calculating plural_key")
c.execute(" DROP TABLE IF EXISTS plural_key")
c.execute("CREATE TABLE plural_key "
"(block_key VARCHAR(200), "
" block_id INTEGER PRIMARY KEY AUTOINCREMENT) ")
c.execute(" INSERT INTO plural_key (block_key) "
" SELECT MIN(block_key) FROM "
" (SELECT block_key, GROUP_CONCAT(id) AS block FROM "
" (SELECT block_key, id FROM blocking_map ORDER BY block_key, id) AS a "
" GROUP BY block_key "
" HAVING COUNT(*) > 1) AS b "
" GROUP BY block ")
... and this is what plural_key actually looks like:
plural_key_df = pd.read_sql("SELECT * FROM plural_key", conn)
plural_key_df.head()
The length of plural_key is the number of blocks within which records will be matched later:
print(len(plural_key_df))
As a sanity check, there should be no duplicated rows in plural_key:
try:
assert len(plural_key_df[plural_key_df.duplicated(keep = False)]) == 0
print("no duplicated rows in plural_key")
except AssertionError:
print("somehow duplicated rows have crept into plural_key - fix before continuing!!")
And finally we index the table appropriately:
logging.info("creating block_key index")
c.execute("CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)")
Plural Block table¶
This simply links the block_id calculated in plural_key above back to the records that each block contains (as represented by id).
logging.info("calculating plural_block")
c.execute(" DROP TABLE IF EXISTS plural_block ")
c.execute(" CREATE TABLE plural_block AS "
" SELECT block_id, id "
" FROM blocking_map AS a "
" INNER JOIN plural_key AS b "
" ON a.block_key = b.block_key "
" ORDER BY block_id, id")
plural_block_df = pd.read_sql("SELECT * FROM plural_block", conn)
plural_block_df.head()
print(len(plural_block_df))
try:
assert len(plural_block_df[plural_block_df.duplicated(keep = False)]) == 0
print("no duplicated rows in plural_block")
except AssertionError:
print("somehow duplicated rows have crept into plural_block - fix before continuing!!")
logging.info("adding id index and sorting index")
c.execute("CREATE INDEX plural_block_id_idx ON plural_block (id)")
c.execute("CREATE UNIQUE INDEX plural_block_block_id_id_uniq "
" ON plural_block (block_id, id)")
Covered Blocks table¶
This table maps records to the blocks they appear in (if more than one, all blocks are concatenated with a comma separator). Remember that, at this stage, blocks with just one record in ("singleton" blocks) have been excluded.
logging.info("creating covered_blocks")
c.execute(" CREATE TABLE covered_blocks AS "
" SELECT id, GROUP_CONCAT(block_id) AS sorted_ids"
" FROM (SELECT id, block_id FROM plural_block ORDER BY id, block_id) AS a "
" GROUP BY id")
covered_blocks_df = pd.read_sql("SELECT * FROM covered_blocks", conn)
covered_blocks_df.head()
Number of rows in covered_blocks:
print(len(covered_blocks_df))
Examples of rows which appear in more than one block:
covered_blocks_df[covered_blocks_df.sorted_ids.str.contains(',')].head()
c.execute("CREATE UNIQUE INDEX covered_blocks_id_idx "
"ON covered_blocks (id)")
conn.commit()
Smaller Coverage table¶
This table is a PITA to think about. Essentially, it says: for each record, which block_id is it in; is it in any other block_ids; and if so, which of those block_ids are smaller than this one? The purpose is to avoid comparing the same pair of records more than once: if two records both fall into the same two blocks, for example, we only want to compare them once.
logging.info("creating smaller_coverage")
c.execute(" DROP TABLE IF EXISTS smaller_coverage ")
c.execute(" CREATE TABLE smaller_coverage "
" AS SELECT a.id, block_id, sorted_ids, "
" RTRIM(SUBSTR(sorted_ids, 1, INSTR(sorted_ids, block_id) - 1), ',') as smaller_ids "
" FROM plural_block AS a "
" INNER JOIN covered_blocks AS b "
" ON a.id = b.id")
smaller_coverage_df = pd.read_sql("SELECT * FROM smaller_coverage", conn)
smaller_coverage_df.head()
print(len(smaller_coverage_df))
Examples of rows whose smaller_ids contains more than one block id:
smaller_coverage_df[smaller_coverage_df.smaller_ids.str.contains(',')].head()
Clustering within the blocks/canopies¶
First create the final table that is fed to matchBlocks:
logging.info("creating final table")
c.execute("CREATE TABLE final AS "
"SELECT a.id, family_name,given_name, middle_name, dob_year, dob_month, "
"block_id, smaller_ids "
"FROM smaller_coverage AS a "
"INNER JOIN actual_data AS b "
"ON a.id = b.id "
"ORDER BY (block_id)")
conn.commit()
to_match = pd.read_sql("SELECT * FROM final", conn)
to_match.head()
Write a generator function that feeds the blocks to matchBlocks in the correct format:
start_time = time.time()
# Clustering function
def candidates_gen(result_set) :
lset = set
block_id = None
records = []
i = 0
for row in result_set :
# the sqlite cursor yields plain tuples - convert each row to a dict keyed by field name
row = {'id': row[0],
'family_name' : row[1],
'given_name' : row[2],
'middle_name' : row[3],
'dob_year' : row[4],
'dob_month' : row[5],
'block_id' : row[6],
'smaller_ids' : row[7]}
# if there's a new block id then yield the old block
if row['block_id'] != block_id :
#print("done with block id: {}".format(row['block_id']))
if records :
#print(len(records))
yield records
block_id = row['block_id']
records = []
i += 1
if i % 10000 == 0 :
print(i, "blocks")
print(time.time() - start_time, "seconds")
smaller_ids = row['smaller_ids']
if smaller_ids :
smaller_ids = lset(smaller_ids.split(','))
else :
smaller_ids = lset([])
records.append((row['id'], row, smaller_ids))
# once gone through the loop need to yield the last block
if records :
yield records
Deal with the bug where matchBlocks hangs if it is using more than one core:
# this has to be set to 1 or matchBlocks fails
deduper.num_cores = 1
c.execute("SELECT * FROM final")
# cands is the iterable
cands = candidates_gen(c)
Finally, we're at the stage where we can compare all the records within each block:
print('clustering...')
clustered_dupes = deduper.matchBlocks(cands)
Saving the results¶
We store the results of matchBlocks in a table called entity_map:
c.execute("DROP TABLE IF EXISTS entity_map")
print('creating entity_map database')
c.execute("CREATE TABLE entity_map "
"(id INTEGER, canon_id INTEGER, "
" cluster_score FLOAT, PRIMARY KEY(id))")
for cluster, scores in clustered_dupes :
cluster_id = cluster[0]
for id, score in zip(cluster, scores) :
c.execute('INSERT INTO entity_map VALUES ({}, {}, {})'.format(id, cluster_id, score))
c.execute("CREATE INDEX head_index ON entity_map (canon_id)")
conn.commit()
entity_map_df = pd.read_sql("SELECT * FROM entity_map", conn)
entity_map_df.head()
Print the number of duplicates found:
print('# duplicate sets')
print(len(clustered_dupes))
Combining the results with the original data¶
output = pd.merge(to_dedupe, entity_map_df, left_index = True, right_on = 'id', how = 'left')
output.sort_values(by = 'canon_id').head(10)
An example of success - for this person, the dob_year is different, but each row fairly surely refers to the same person, so they have the same canon_id:
output[output.family_name == 'Meath Baker']
Rows with the same canon_id are the rows which dedupe believes refer to the same entity. The cluster_score is a measure of how confident dedupe is about that assignment. If a row has no canon_id at all, dedupe doesn't believe that row has a pair.
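As a usage sketch (the cut-off value here is arbitrary), you might keep only the matches dedupe is fairly confident about:
# Sketch only: filter to clusters whose score exceeds an arbitrary cut-off.
confident = output[output.cluster_score.notnull() & (output.cluster_score > 0.9)]
confident.sort_values(by = 'canon_id').head()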
By playing with the precision/recall trade-off in the train (and other?) methods, one can move the metaphorical slider between capturing all of the duplicated records (at the expense of including some pairs which aren't referring to the same person - high recall), and being certain that the records you say are pairs really are pairs (at the expense of missing some of the pairs that should be lumped together - high precision).
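One concrete lever (a sketch, not run above) is the threshold argument to matchBlocks, which defaults to 0.5 - raising it favours precision, lowering it favours recall:
# Sketch only: a higher threshold keeps only higher-confidence matches.
# clustered_dupes = deduper.matchBlocks(cands, threshold=0.75)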
Much more information about entity resolution can be found in this PhD thesis. Indeed the dedupe package itself is based on this work!
# save output to a csv
output.to_csv('entity_resolved_controlling_entities.csv')