Comparing Records in Files and Reports - Scenario 1

Requirements:

Fact 1: We have some legacy system data files

Fact 2: We have several data files created by the new system that should eventually replace the obsolete one

Fact 3:

  • Both files are text / ASCII files, with records consisting of multiple lines.
  • Each line within a record consists of a field name and a field value.
  • The format of the lines differs between the legacy files (Fact 1) and the new files (Fact 2), but the field name and field value can be extracted from each line using a regular expression.
  • The field names can differ between the legacy and new files, but we have a mapping that links them.
  • Each record has a unique identifier that lets us associate a legacy record with a new record, since the order of records in the output file is not the same in both systems.
  • Each file to compare is at least 10 MB, and 30-35 MB in the average case.
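
To make the extraction and mapping steps concrete, here is a minimal sketch. The two line formats (`NAME=value` for legacy, `name: value` for new), the regexes, and the entries in `%field_map` are all invented for the example, since the real formats are not shown here.

```perl
use strict;
use warnings;

# Hypothetical line formats: legacy lines look like "CUST_NM=Alice",
# new-system lines look like "customerName: Alice".
my %line_regex = (
    legacy => qr/^\s*(\w+)\s*=\s*(.*?)\s*$/,
    new    => qr/^\s*(\w+)\s*:\s*(.*?)\s*$/,
);

# Hypothetical mapping from new-system field names to legacy names.
my %field_map = (
    customerName => 'CUST_NM',
    accountId    => 'ACCT_ID',
);

# Return (common_name, value) for a line, or an empty list if it does not match.
sub parse_line {
    my ($system, $line) = @_;
    my ($name, $value) = $line =~ $line_regex{$system} or return;
    $name = $field_map{$name} // $name if $system eq 'new';
    return ($name, $value);
}

my @legacy = parse_line(legacy => 'CUST_NM=Alice');
my @new    = parse_line(new    => 'customerName: Alice');
print "@legacy\n";   # CUST_NM Alice
print "@new\n";      # CUST_NM Alice
```

Normalizing names at parse time means everything downstream only ever sees one set of field names.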

Fact 4: As and when we iterate on building the new system, we will need to compare the files created by both systems under the same conditions and reconcile the differences.

Fact 5: This comparison is done manually using an expensive visual diff tool. To help with this, I wrote a tool that maps the two different field names to a common name and then sorts the field names within each record in each file so that they line up in the same order (new files may have additional fields, which are ignored in the visual diff).

Fact 6: Because the comparison is done manually by people, and people make mistakes, we get false positives and false negatives that significantly affect our timelines.

The obvious question is: what should the algorithm ("ALG") and the data structure ("DS") be?

While people keep checking the diffs visually, the performance of the existing script is dismal. Most of the processing time seems to go into sorting the array of lines in lexicographic order: reading an array element (Tie::File::FETCH, Tie::File::Cache::lookup) and putting it in its correct sorted position (Tie::File::Cache::insert, Tie::File::Heap::insert).

The script I have so far:

use strict;
use warnings;

use Tie::File;
use Fcntl 'O_RDONLY';    # lets us tie the input file in read-only mode

die "Usage: $0 <unsorted input filename> <sorted output filename>\n" if @ARGV < 2;

my $recordsWrittenCount = 0;
my $fieldsSorted        = 0;

# Tie the input file to an array, one element per line
tie my @array, 'Tie::File', $ARGV[0], memory => 50_000_000, mode => O_RDONLY
    or die "Cannot open $ARGV[0]: $!";

open(my $out, '>', $ARGV[1]) or die "Cannot open $ARGV[1]: $!";

# Now read in the EL6 file

my $numberOfLines = @array; # accessing @array in a loop might be expensive as it is tied??

for (my $dx = 0; $dx < $numberOfLines; ++$dx)
{
    if ($array[$dx] eq 'RECORD')
    {
        ++$recordsWrittenCount;

        my @tempRecordStorage;
        my $endOfRecord = $dx;

        # Collect field lines up to the '.' terminator (or the end of the file)
        while (++$endOfRecord < $numberOfLines and $array[$endOfRecord] ne '.')
        {
            push @tempRecordStorage, $array[$endOfRecord];
            ++$fieldsSorted;
        }

        # Write the record with its field lines in lexicographic order
        print $out "RECORD\n";
        print $out "$_\n" for sort @tempRecordStorage;
        print $out ".\n";

        $dx = $endOfRecord;
    }
}

close($out) or die "Cannot close $ARGV[1]: $!";

# Display results to user

print "\n[*] Done: $fieldsSorted fields sorted from $recordsWrittenCount records written.\n";

      

So I thought about this, and I believe something like a trie, maybe a suffix trie or a PATRICIA trie, would help, so that the fields in each record are already sorted at insertion time. I would then not have to sort the final array in one go, and the cost would be amortized (speculation on my part).

In that case another problem arises: Tie::File uses an array to abstract the lines in the file, so reading the lines into a tree and then converting them back into an array would require additional memory and processing.

The question is: would that cost more than the current cost of sorting the tied array?

1 answer


Tie::File is very slow. There are two reasons for this. First, tied variables are significantly slower than standard variables. Second, in the case of Tie::File, the data in your array is stored on disk and not in memory, which slows down access significantly. Tie::File's cache can help performance in some circumstances, but not when you just loop through the array one element at a time, as you do here. (The cache only helps when you revisit the same index.) The time to use Tie::File is when you have an algorithm that requires all the data to be in memory at once, but you don't have enough memory to do so. Since you are only processing the file one line at a time, using Tie::File is not only pointless but harmful.
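
Since the script only walks the file from front to back, a plain filehandle read is enough. Below is a minimal sketch, assuming the same `RECORD` ... `.` framing as the script in the question. The sub name `sort_record_fields` and the in-memory sample data are made up for illustration.

```perl
use strict;
use warnings;

# Copy records from $in to $out, sorting the field lines inside each record.
# Assumes each record starts with a line "RECORD" and ends with a line ".".
sub sort_record_fields {
    my ($in, $out) = @_;
    my ($records, $fields) = (0, 0);
    my $current;                       # array ref while inside a record

    while (my $line = <$in>) {
        chomp $line;
        if ($line eq 'RECORD') {
            $current = [];
        }
        elsif ($line eq '.' and $current) {
            print {$out} "RECORD\n";
            print {$out} "$_\n" for sort @$current;
            print {$out} ".\n";
            ++$records;
            $fields += @$current;
            undef $current;
        }
        elsif ($current) {
            push @$current, $line;
        }
    }
    return ($records, $fields);
}

# Demo on in-memory data; with real files, use
# open(my $in, '<', $path) and open(my $out, '>', $path) instead.
my $input  = "RECORD\nzip=12345\nname=Alice\n.\nRECORD\nid=7\ncity=Oslo\n.\n";
my $output = '';
open(my $in,  '<', \$input)  or die "Cannot open input: $!";
open(my $out, '>', \$output) or die "Cannot open output: $!";
my ($records, $fields) = sort_record_fields($in, $out);
close($out);
print $output;
print "[*] Done: $fields fields sorted from $records records written.\n";
```

Only one record is held in memory at a time, so memory use stays small regardless of file size. Note that this sketch silently drops a record that is missing its closing `.` line.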

I don't think a trie is the right choice here. I would use a simple HoH (hash of hashes) instead. Your files are small enough that you can fit everything in memory at once. I recommend parsing each file and building a hash that looks like this:

my %data = (
  id1 => {
    field1 => 'value1',
    field2 => 'value2',
  },
  id2 => {
    field1 => 'value1',
    field2 => 'value2',
  },
);

      



If you use your mappings to normalize field names when constructing your data structure, this will make the comparison easier.
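
A minimal sketch of building such a structure follows. It assumes the `RECORD` ... `.` framing from the question, and it further assumes (hypothetically) that field lines look like `name=value` and that the identifier field is called `ID` after normalization. The mapping entries and the sub name `load_records` are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical mapping from new-system field names to the legacy names.
my %field_map = (customerName => 'CUST_NM', accountId => 'ID');

# Read records from $fh into { id => { field => value, ... }, ... }.
# Assumptions: records are framed by "RECORD" and ".", each field line
# looks like "name=value", and the identifier field is "ID" once mapped.
sub load_records {
    my ($fh, $map) = @_;
    my (%data, $record);

    while (my $line = <$fh>) {
        chomp $line;
        if ($line eq 'RECORD') {
            $record = {};
        }
        elsif ($line eq '.' and $record) {
            my $id = $record->{ID};
            die "Record without an ID near line $.\n" unless defined $id;
            $data{$id} = $record;
            undef $record;
        }
        elsif ($record and $line =~ /^(\w+)=(.*)$/) {
            my ($name, $value) = ($1, $2);
            # Normalize the field name while building the structure
            $record->{ $map->{$name} // $name } = $value;
        }
    }
    return \%data;
}

my $new_text = "RECORD\ncustomerName=Alice\naccountId=42\n.\n";
open(my $fh, '<', \$new_text) or die "Cannot open sample data: $!";
my $new = load_records($fh, \%field_map);
print "$new->{42}{CUST_NM}\n";   # Alice
```

For the legacy file, the same sub could be called with an empty mapping (`{}`), so both hashes end up keyed by the same field names.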

To compare data, do the following:

  • Compare the keys of the two hashes. This should produce three lists: identifiers present only in the legacy data, identifiers present only in the new data, and identifiers present in both.
  • Report the lists of identifiers that appear in only one dataset. These are records that have no matching record in the other dataset.
  • For identifiers present in both datasets, compare the data for each identifier field by field and report any differences.
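
The steps above can be sketched roughly as follows. The sample data and the sub name `compare_datasets` are made up, and the choice to ignore fields that exist only in the new data follows the note in the question about the visual diff.

```perl
use strict;
use warnings;

# Compare two hash-of-hashes structures keyed by record identifier.
# Returns identifiers only in legacy, identifiers only in new, and a list of
# [id, field, legacy_value, new_value] entries for fields that differ.
# Fields present only in the new data are ignored.
sub compare_datasets {
    my ($legacy, $new) = @_;

    my @only_legacy = sort grep { !exists $new->{$_} }    keys %$legacy;
    my @only_new    = sort grep { !exists $legacy->{$_} } keys %$new;
    my @in_both     = sort grep { exists $new->{$_} }     keys %$legacy;

    my @diffs;
    for my $id (@in_both) {
        for my $field (sort keys %{ $legacy->{$id} }) {
            my $old = $legacy->{$id}{$field};
            my $cur = $new->{$id}{$field};
            next if defined $cur and $cur eq $old;
            push @diffs, [ $id, $field, $old, $cur ];
        }
    }
    return (\@only_legacy, \@only_new, \@diffs);
}

# Made-up sample data.
my %legacy = (
    id1 => { name => 'Alice', city => 'Oslo' },
    id2 => { name => 'Bob' },
);
my %new = (
    id1 => { name => 'Alice', city => 'Bergen', extra => 'x' },
    id3 => { name => 'Carol' },
);

my ($only_legacy, $only_new, $diffs) = compare_datasets(\%legacy, \%new);
print "Only in legacy: @$only_legacy\n";
print "Only in new:    @$only_new\n";
for my $d (@$diffs) {
    my ($id, $field, $old, $cur) = @$d;
    printf "%s.%s: legacy='%s' new='%s'\n", $id, $field, $old, $cur // '<missing>';
}
```

Because hash lookups replace the per-record sort, record order in the files no longer matters, and the report can replace the manual visual comparison for the fields that are mapped.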