Fuzzy match two hash tables?
I am looking for ideas on how to best match two hash tables containing string / value pairs.
Here is the real problem I am facing: I have structured data flowing in that is imported into the database. I need to UPDATE records that are already in the DB; however, ANY value in the source might change, so I don't have a reliable ID.
I am thinking of fuzzy matching the two records, source and DB, and making an "educated" guess as to whether the record should be updated or inserted.
Any ideas would be greatly appreciated.
Solution
The solution is based on Ben Robinson's answer. It works really well: it tolerates slight inconsistencies here and there and supports customizable per-key weights.
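require 'rubygems'
require 'amatch'

# amatch provides levenshtein_similar through Amatch::StringMethods,
# so mix it into String before calling it on plain strings.
class String
  include Amatch::StringMethods
end

class Hash
  # Returns a similarity score between 0.0 and 1.0 for two hashes.
  # key_weights maps a key to its relative importance (default 1; 0 skips the key).
  def fuzzy_match(hash, key_weights = {})
    sum_total = 0.0
    sum_weights = 0.0
    keys.each do |key|
      weight = key_weights[key] || 1
      next if weight == 0
      sum_total += self[key].to_s.levenshtein_similar(hash[key].to_s) * weight
      sum_weights += weight
    end
    # Avoid 0.0 / 0.0 (NaN) when the hash is empty or every key is skipped.
    sum_weights.zero? ? 0.0 : sum_total / sum_weights
  end
end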
I've been using Levenshtein distance to do fuzzy matching lately. I calculate the distance between two candidate strings and give the match a score that is the inverse of the distance. Then I take a weighted average across all fields to get a score for the record, so that more important fields count for more than less important ones. This was used in a CRM application where records were coming in from different sources and the client had to match them against existing entries (clients, resellers, etc.). The threshold separating a match from a non-match needed some tuning. We ended up with about 1% false positives, which is pretty good in my opinion.
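The approach described above could be sketched roughly as follows, in plain Ruby with no gems. This is only an illustration under assumptions: the method names (`levenshtein`, `similarity`, `record_score`, `best_match`), the default threshold of 0.85, and the sample fields are all hypothetical, and the similarity is normalized as 1 - distance / longest length rather than a literal inverse of the distance.

```ruby
# Plain-Ruby Levenshtein distance (dynamic programming, two rows at a time).
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Map a distance to a similarity in 0.0..1.0 (1.0 means identical strings).
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end

# Weighted average of per-field similarities; a weight defaults to 1,
# and a weight of 0 excludes the field.
def record_score(source, candidate, weights = {})
  total = 0.0
  weight_sum = 0.0
  source.each do |field, value|
    w = weights.fetch(field, 1)
    next if w.zero?
    total += similarity(value.to_s, candidate[field].to_s) * w
    weight_sum += w
  end
  weight_sum.zero? ? 0.0 : total / weight_sum
end

# Best-scoring existing record, or nil when nothing clears the threshold.
# nil would mean "insert"; a returned record would mean "update that one".
# The 0.85 default is an arbitrary placeholder and needs tuning on real data.
def best_match(source, existing, weights = {}, threshold = 0.85)
  best = existing.max_by { |rec| record_score(source, rec, weights) }
  return nil if best.nil?
  record_score(source, best, weights) >= threshold ? best : nil
end

db = [
  { name: "Jon Smith",  email: "jon@example.com" },
  { name: "Alice Wong", email: "alice@example.com" }
]
incoming = { name: "John Smith", email: "jon@example.com" }

match = best_match(incoming, db, { email: 3 })
puts match ? "update #{match[:name]}" : "insert new record"
```

Weighting `email` three times as heavily as `name` lets a small spelling change in the name still resolve to the existing record, while a record with nothing in common falls below the threshold and would be inserted.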