Comparing Records in Files and Reports - Scenario 2
Requirements:
Fact 1: We have some legacy system data files
Fact 2: We have several data files created by the new system that should eventually replace the obsolete one
Fact 3:
- Both files are text / ASCII files, with records consisting of multiple lines.
- Each line within a record consists of a field name and a field value.
- The format in which the strings are represented is different between 1 and 2, but the field name and field value can be extracted from each string using a regular expression
- The field names can change between 1 and 2, but we have a mapping that links them
- Each record has a unique identifier that helps us associate a legacy record with a new record as the order of records in the output file will not be the same in both systems.
- Each file for comparison is at least 10 MB to an average case of 30-35 MB
Fact 4: How and when we iterate to build a new system, we will need to compare the files created by both systems under the same conditions and reconcile the differences.
Fact 5: This comparison is done manually using an expensive visual delineation tool. To help with this, I wrote a tool that brings two different field names to a common name and then sorts the field names in each entry in each file so that they sync in order (new files may have additional fields that are ignored in the visual diff )
Fact 6: Due to the fact that comparison is done manually by people and people are wrong, we get false positions and negatives that significantly affect our timing.
The obvious question is, what should be "ALG" and "DS"?
The script I have to point out:
I want to create a PERL program that will
- read the relevant information from both files into the "DS" data structure
- and find the differences using the "ALG" algorithm between records from DS
- Display / report statistics to the end user, for example, how many lines (values) were different between records, where they are different or the values are completely different, are the lines missing (files from the new system may have additional fields, but they MUST contain all the lines that are in files created by legacy system)
My suggestions for:
DS: Multiple nested hash mapped to disk.
Looks like:
$namedHash { unique field value across both records } = {
legacy_system => {
'goodField' => 'I am good!',
'firstField' => 1,
'secondField' => 3
},
new_system => {
'firstField' => 11,
'secondField' => 33,
'goodField' => 'I am good!'
}
};
ALG: Custom key - by comparing keys between anonymous hashes referenced by the legacy_system and new_system keys. Any differences will be noted by adding a new "difference" key, which will be an array of field names that differ between the legacy and the new system.
Hence, for this example, the output of my ALG would be:
$namedHash { unique field value across both records } = {
legacy_system => {
'goodField' => 'I am good!',
'firstField' => 1,
'secondField' => 3
},
new_system => {
'firstField' => 11,
'secondField' => 33,
'goodField' => 'I am good!'
},
differences => [firstField, secondField];
};
What would you do / suggest in this scenario?
a source to share
Why not import all data into a SQLite database. You only need one table with one primary key corresponding to a unique identifier common to both systems. Columns should be a concatenation of old and new fields.
First, import one dataset, say the new system generated one. Then, for each member of the legacy set, try UPDATE on the appropriate record in the table: If the UPDATE fails, you know that the new dataset is missing those records that previously existed on the old system.
If any of the columns corresponding to the stale data are NULL, then you know the records in the new system that were not in the old system.
Then you can select rows where any column from the new system does not match the corresponding column from the old system.
IMHO, this is more flexible than a hash table system.
a source to share