Fuzzy match two hash tables?

I am looking for ideas on how to best match two hash tables containing string / value pairs.

Here is the real problem I am facing: I have structured data coming in that gets imported into a database. I need to UPDATE records that are already in the DB; however, ANY value in the source might change, so I don't have a reliable ID.

I am thinking of fuzzy matching the two strings, source and DB, and making an "educated" guess as to whether the record should be updated or inserted.

Any ideas would be greatly appreciated.

Solution

The solution is based on Ben Robinson's answer. It works really well, tolerates slight inconsistencies here and there, and supports customizable key-based weights.

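require 'rubygems'
require 'amatch'

class Hash
  # Returns a similarity score between 0.0 and 1.0 for two hashes.
  # key_weights maps a key to its relative importance (default 1);
  # a weight of 0 excludes that key from the comparison.
  def fuzzy_match(hash, key_weights = {})
    sum_total = 0
    sum_weights = 0

    self.keys.each do |key|
      weight = key_weights[key] || 1
      next if weight == 0

      weight *= 10000
      # Per-key similarity (0.0 to 1.0), scaled by the key's weight.
      match = self[key].to_s.levenshtein_similar(hash[key].to_s) * weight
      sum_total += match
      sum_weights += weight
    end

    # Avoid 0/0 (NaN) when the hash is empty or every key has weight 0.
    return 0.0 if sum_weights == 0

    sum_total.to_f / sum_weights.to_f
  end
end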
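
For the update-or-insert decision itself, something like the following might work. It is only a sketch: `best_match`, the 0.85 default threshold and the sample rows are all made up for illustration, and the scoring block is a simple stand-in so the snippet runs without the amatch gem.

```ruby
# Hypothetical helper (not part of amatch): returns the best-scoring
# candidate, or nil when nothing reaches the threshold (meaning INSERT).
# The block receives (source, candidate) and must return a score in 0.0..1.0.
def best_match(source, candidates, threshold: 0.85)
  best, score = candidates
    .map { |candidate| [candidate, yield(source, candidate)] }
    .max_by { |_, s| s }
  score && score >= threshold ? best : nil
end

# Stand-in scorer so the sketch runs on its own:
# the fraction of keys whose values are exactly equal.
exact_ratio = lambda do |a, b|
  a.keys.count { |k| a[k].to_s == b[k].to_s } / a.size.to_f
end

# Made-up sample data.
rows = [
  { name: "Acme Corp", city: "Berlin", phone: "555-0100" },
  { name: "Globex", city: "Paris", phone: "555-0199" }
]
incoming = { name: "Acme Corp", city: "Berlin", phone: "555-0101" }

match = best_match(incoming, rows, threshold: 0.6, &exact_ratio)
puts match ? "UPDATE #{match[:name]}" : "INSERT"
```

With the `fuzzy_match` method above, the block would presumably become `{ |a, b| a.fuzzy_match(b, weights) }`; the threshold has to be tuned against real data.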

      

2 answers


I've been using Levenshtein distance for fuzzy matching lately. I calculate the distance between two potentially matching strings and give the match a score, which is the inverse of the distance. Then I take a weighted average across all fields to get the score for the record, so that more important fields count more than less important ones. This is used in a CRM application where leads come in from different sources. The client needed to match these against existing leads, opportunities, customers, resellers, etc. It took some tweaking of the thresholds for which score counted as a match and which did not. We ended up with about 1% false positives, which is pretty good in my opinion.
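
A rough pure-Ruby sketch of that scoring scheme, for illustration only. The names `levenshtein`, `similarity` and `record_score` are made up here, and "inverse of the distance" is assumed to mean 1 minus the distance divided by the longer string's length; the actual CRM code may have normalized differently.

```ruby
# Classic dynamic-programming Levenshtein distance (two-row variant).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumption: score = 1 - distance / longer_length, so 1.0 means
# identical strings and 0.0 means nothing in common.
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end

# Weighted average of per-field similarities. Fields not listed in
# weights count as 1; a weight of 0 skips the field entirely.
def record_score(source, existing, weights = {})
  total = 0.0
  weight_sum = 0.0
  source.each do |field, value|
    w = weights.fetch(field, 1)
    next if w.zero?
    total += similarity(value.to_s, existing[field].to_s) * w
    weight_sum += w
  end
  weight_sum.zero? ? 0.0 : total / weight_sum
end
```

A record would then count as a match when `record_score` exceeds some threshold, which has to be tuned on real data as described above.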





If you are importing the data into SQL Server, SSIS has a fuzzy match task. Try it and see if you like the results. We have found it useful in situations like this.


