Excel CSV into nested dictionary; list comprehensions
I have Excel CSV files with employee records in them, something like this:
mail,first_name,surname,employee_id,manager_id,telephone_number
blah@blah.com,john,smith,503422,503423,+65(2)3423-2433
foo@blah.com,george,brown,503097,503098,+65(2)3423-9782
....
I am using DictReader to put this in a nested dictionary:
import csv
gd_extract = csv.DictReader(open('filename 20100331 original.csv'), dialect='excel')
employees = dict([(row['employee_id'], row) for row in gd_extract])
Is this the right way to do it? It works, but is there something more efficient? Also, oddly, if I try to print employees in the IDLE shell, it seems to make IDLE crash (there are about 1051 rows).
2. Remove employee_id from internal dict
Second, I put the data in a dictionary indexed by employee_id, with a nested dictionary of all the values as the value. However, employee_id is then also a key/value pair inside the nested dictionary, which is a little redundant. Is there a way to exclude it from the inner dictionary?
3. Manipulating data in a comprehension
Third, we need to do some manipulation of the imported data - for example, all phone numbers are in the wrong format, so we need to do some regex. Also, we need to convert manager_id to the actual manager's name and email address. Most of the managers are in the same file, while others are in the external_contractors CSV file, which is similar, but not exactly the same format. I can import this into a separate dict though.
Can these two things be done within a single list comprehension, or should I use a for loop? Or would multiple comprehensions work? (Sample code would be really awesome here.) Or is there a smarter way in Python?
Cheers, Victor
Your first part has one simple problem (which may not even be a problem). You don't handle key collisions at all (unless you intend to just overwrite).
>>> dict([('a', 'b'), ('a', 'c')])
{'a': 'c'}
If employee_id is guaranteed to be unique, the problem does not arise.
2) You can of course exclude it, but leaving it in does no real harm. In fact, in Python, the key of the outer dict and the employee_id value in the inner dict refer to the same object (here, a string read by DictReader), so they both point to the same place in memory. The only thing duplicated is the reference, which is small. If you're concerned about memory consumption, you probably don't need to be.
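If you do want to drop it anyway, one option is to pop the key out of each row while building the outer dict. This is only a minimal sketch, assuming Python 3; the in-memory sample and io.StringIO are stand-ins for your real file, not part of your original code:

```python
import csv
import io

# Stand-in for the real CSV file; the columns follow the sample in the question.
sample = io.StringIO(
    "mail,first_name,surname,employee_id,manager_id,telephone_number\n"
    "blah@blah.com,john,smith,503422,503423,+65(2)3423-2433\n"
    "foo@blah.com,george,brown,503097,503098,+65(2)3423-9782\n"
)

employees = {}
for row in csv.DictReader(sample, dialect='excel'):
    # pop() removes employee_id from the inner dict and returns its value,
    # which then becomes the key of the outer dict.
    emp_id = row.pop('employee_id')
    employees[emp_id] = row

print(sorted(employees))
print('employee_id' in employees['503422'])
```

The trade-off is that a row no longer knows its own ID once you pass it around on its own, which is one more reason leaving the key in place is usually fine.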
3) Don't try to do too much in one list comprehension. Just use a for loop after the initial comprehension.
To summarize, it sounds like you are really worried about the performance of looping over the data twice. Don't worry about performance in the beginning. Performance problems usually arise from algorithmic issues, not from specific language constructs like loops and list comprehensions.
If you are familiar with Big O notation, the list comprehension and the loop after it (if you add one) are each O(n). Add them together and you get O(2n), but as we know from Big O notation, this simplifies to O(n). I've simplified a lot here, but the point is that you really don't need to worry.
If performance does become a concern, address it after you have written the code, and confirm it with a code profiler.
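For what it's worth, here is roughly what checking with a profiler could look like, using cProfile and pstats from the standard library. It is a sketch under assumptions: build_index is a hypothetical stand-in for your import step, and the rows are synthetic, sized to match the roughly 1051 rows you mentioned:

```python
import cProfile
import io
import pstats


def build_index(rows):
    """Hypothetical stand-in for the import step: index rows by employee_id."""
    return {row['employee_id']: row for row in rows}


# Synthetic rows, roughly the size mentioned in the question (about 1051).
rows = [{'employee_id': str(i), 'surname': 'x'} for i in range(1051)]

profiler = cProfile.Profile()
profiler.enable()
employees = build_index(rows)
profiler.disable()

# Collect the most expensive calls, sorted by cumulative time, into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```

With a data set this small, the report will most likely show times too short to matter, which is the point.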
Reply to comments
As for your reply to #2, Python doesn't really have many mechanisms for writing nice, extra-snazzy one-liners. It is meant to push you to just write the code out, rather than cram it into one line. That being said, you can still do a fair amount of work on one line. My suggestion is not to worry about how much code you can fit on one line. Python looks much prettier (IMO) when written out rather than stuck on one line.
Regarding your reply to #1, you can try something like this:
employees = {}
for row in gd_extract:
    if row['employee_id'] in employees:
        pass  # ... handle duplicates in employees dictionary ...
    else:
        employees[row['employee_id']] = row
Regarding your reply to #3, I'm not sure what you are looking for or what exactly you want to fix about the phone numbers, but this might get you started:
import re
retelephone = re.compile(r'[-\(\)\s]')  # remove dashes, open/close parens, and spaces
for empid, row in employees.items():
    # sub() returns a new string, so store the result back in the row
    row['telephone_number'] = retelephone.sub('', row['telephone_number'])
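The manager lookup can go in the same kind of loop. The following is only a sketch under assumptions: the data is made up, the contractors dict is assumed to have been loaded from your external_contractors file and keyed by ID in the same way, and manager_name and manager_mail are hypothetical field names. Your contractor file has a different format, so its column names will probably need adjusting:

```python
# Hypothetical data shaped like the rows in the question.
employees = {
    '503422': {'mail': 'blah@blah.com', 'first_name': 'john',
               'surname': 'smith', 'manager_id': '503097'},
    '503097': {'mail': 'foo@blah.com', 'first_name': 'george',
               'surname': 'brown', 'manager_id': '900001'},
}
# Assumed: the external_contractors file was loaded into a similar dict.
contractors = {
    '900001': {'mail': 'ext@blah.com', 'first_name': 'ann', 'surname': 'lee'},
}

for emp_id, row in employees.items():
    # Look in the employee file first, then fall back to the contractors.
    manager = (employees.get(row['manager_id'])
               or contractors.get(row['manager_id']))
    if manager is None:
        # Manager not found in either file; leave the new fields empty.
        row['manager_name'] = row['manager_mail'] = None
        continue
    row['manager_name'] = '{} {}'.format(manager['first_name'],
                                         manager['surname'])
    row['manager_mail'] = manager['mail']
```

Only the inner dicts are modified while looping, so iterating over employees.items() at the same time is safe. A plain for loop like this also leaves room to log or handle the rows whose manager is missing, which would be awkward inside a comprehension.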