An efficient way in Python to remove an item from a comma separated string
I'm looking for the most efficient way to remove an element from a comma-separated string while preserving the alphabetical order of the words.
For instance:
string = 'Apples, Bananas, Grapes, Oranges'
subtraction = 'Bananas'
result = 'Apples, Grapes, Oranges'
I'd also like a way to do the same thing when each item carries an ID:
string = '1:Apples, 4:Bananas, 6:Grapes, 23:Oranges'
subtraction = '4:Bananas'
result = '1:Apples, 6:Grapes, 23:Oranges'
Sample code would be much appreciated. Thank you very much.
Ideally, something like:
input_str = '1:Apples, 4:Bananas, 6:Grapes, 23:Oranges'
removal_str = '4:Bananas'
sep = ", "
print(sep.join(input_str.split(sep).remove(removal_str)))
would work. But list.remove() modifies the list in place and returns None rather than a new list, so you cannot do everything on one line and need a temporary variable. A similar solution that does work is:
input_str = '1:Apples, 4:Bananas, 6:Grapes, 23:Oranges'
removal_str = '4:Bananas'
sep = ", "
print(sep.join([ i for i in input_str.split(sep) if i != removal_str ]))
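For reuse, the same filter can be wrapped in a small function. This is only a minimal sketch: the name remove_item and its sep parameter are illustrative and do not come from the question. The snippet also demonstrates why chaining on list.remove() fails.

```python
def remove_item(csv_string, item, sep=", "):
    """Return csv_string without any field equal to item (hypothetical helper)."""
    # The generator filters fields in their original order, so ordering is kept.
    return sep.join(field for field in csv_string.split(sep) if field != item)


fields = "Apples, Bananas, Grapes, Oranges".split(", ")
# list.remove() mutates the list in place and returns None.
returned = fields.remove("Bananas")

print(returned)
print(remove_item("1:Apples, 4:Bananas, 6:Grapes, 23:Oranges", "4:Bananas"))
```

Unlike list.remove(), this filter drops every matching field and does not raise an error when the item is absent.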
However, to be as correct as possible, unless you have a guarantee that all items are valid, you need to check that each item matches the format you expect, namely number:identifier. The easiest way to do this is to use the re module to search for that format, collect all the matches, and skip anything that does not match. Written compactly, this gives a fairly short solution with a reasonable check:
import re

def str_to_dictlist(inp_str):
    # Return an (id, name) tuple for every "number:identifier" item found.
    regexp = r"(?P<id>[0-9]+):(?P<name>[a-zA-Z0-9_]+)"
    return [ x.groups() for x in re.finditer(regexp, inp_str) ]

input_str = '1:Apples, 4:Bananas, 6:Grapes, 23:Oranges'
subtraction_str = "4:Bananas"
sep = ", "
input_items = str_to_dictlist(input_str)
removal_items = str_to_dictlist(subtraction_str)
final_items = [ "%s:%s" % (x,y) for x,y in input_items if (x,y) not in removal_items ]
print(sep.join(final_items))
It also has the advantage of handling multiple deletions at once. Since the removal string uses the same format as the input, and the input can hold several elements, it makes sense for the removal string to support several elements too, or at least it is useful to have that option.
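Removing several items in one call might look like the following sketch. It assumes the same number:name format; the names parse_items and remove_items are hypothetical, and the set is just one way to keep membership tests cheap for long removal lists.

```python
import re

ITEM_RE = re.compile(r"(?P<id>[0-9]+):(?P<name>[a-zA-Z0-9_]+)")


def parse_items(text):
    """Return (id, name) string tuples for every well-formed item in text."""
    return [match.groups() for match in ITEM_RE.finditer(text)]


def remove_items(input_str, removal_str, sep=", "):
    """Remove every item listed in removal_str from input_str."""
    to_remove = set(parse_items(removal_str))  # constant-time membership tests
    kept = [pair for pair in parse_items(input_str) if pair not in to_remove]
    return sep.join("%s:%s" % pair for pair in kept)


print(remove_items("1:Apples, 4:Bananas, 6:Grapes, 23:Oranges",
                   "4:Bananas, 23:Oranges"))
```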
Note that doing this (using re to search) makes it hard to detect items that did NOT match; the search silently skips anything that does not fit the pattern. As a hack, you can count the commas in the input string and print a warning when something could not be parsed:
num_commas = input_str.count(",")
items_found = len(input_items)
if items_found < (num_commas + 1):
    print("Warning: some items could not be parsed")
It would also warn about commas without spaces.
To parse more complex input strings correctly, you need to break the input into individual tokens, keep track of line and column positions while parsing, print errors for anything unexpected, and maybe even handle things like backtracking and graph building for more complex input such as source code. For this sort of thing, take a look at the pyparsing module (a third-party package, not part of the Python standard library).
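Short of a full parser, a stricter check can at least report which field is malformed. The following is a minimal sketch that uses only the standard library rather than pyparsing; the function name parse_strict and the error message format are assumptions made for illustration.

```python
import re

ITEM_RE = re.compile(r"([0-9]+):([a-zA-Z0-9_]+)")


def parse_strict(text):
    """Split on commas and require every field to look like number:name."""
    items = []
    for position, field in enumerate(text.split(","), start=1):
        field = field.strip()
        match = ITEM_RE.fullmatch(field)
        if match is None:
            # Report the field number so the caller can locate the problem.
            raise ValueError("field %d is malformed: %r" % (position, field))
        items.append((int(match.group(1)), match.group(2)))
    return items


print(parse_strict("1:Apples, 4:Bananas,6:Grapes"))
try:
    parse_strict("1:Apples, Bananas, 6:Grapes")
except ValueError as error:
    print(error)
```

Because every field must match in full, nothing is skipped silently, and a comma without a following space is accepted rather than flagged.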
Matthew's comment above is the correct approach, but if you are sure that ", " (a comma followed by a space) occurs only as a delimiter, then something like this will work:
def remove(s, element):
    items = s.split(", ")
    items.remove(element)  # raises ValueError if element is not present
    return ", ".join(items)
I would not recommend using strings as lists. They are for a different purpose and following Matthew's advice is the right thing to do.
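As a rough illustration of that advice, the items can live in a real data structure and be turned into a string only for output. The dictionary below, keyed by ID, is an assumed representation and not something the question specifies.

```python
# Keep the items in a real data structure and build the string only for output.
fruits = {1: "Apples", 4: "Bananas", 6: "Grapes", 23: "Oranges"}

# Removing by ID is a dictionary operation; no string surgery is needed.
del fruits[4]

# Sort by ID when rendering so the output order is deterministic.
result = ", ".join("%d:%s" % (key, name) for key, name in sorted(fruits.items()))
print(result)
```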
>>> import re
>>> re.sub("Bananas, |, Bananas$", "", "Apples, Bananas, Grapes, Oranges")
'Apples, Grapes, Oranges'
or
import re
strng = '1:Apples, 4:Bananas, 6:Grapes, 23:Oranges'
subtraction = '4:Bananas'
result = re.sub(subtraction + ", |, " + subtraction, "", strng)
print(result)
This works for your examples, but will need to be changed (for example by passing the strings through re.escape()) if the subtraction strings can contain regex metacharacters such as [].*?{}\.
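One way to handle metacharacters is sketched below, under the assumption that the separator is a fixed literal string. The helper name remove_with_regex is hypothetical. The pattern is also anchored to field boundaries, which goes slightly beyond the snippets above, so that an item cannot match in the middle of a longer field.

```python
import re


def remove_with_regex(text, item, sep=", "):
    """Remove the first field equal to item, treating item and sep literally."""
    literal = re.escape(item)
    delimiter = re.escape(sep)
    pattern = "|".join([
        # Item at a field boundary, followed by a separator.
        "(?:^|(?<=" + delimiter + "))" + literal + delimiter,
        # Item as the last field, preceded by a separator.
        delimiter + literal + "$",
        # Item as the only field.
        "^" + literal + "$",
    ])
    return re.sub(pattern, "", text, count=1)


print(remove_with_regex("1:Apples, 4:Bananas, 6:Grapes, 23:Oranges", "4:Bananas"))
print(remove_with_regex("axb, a.b", "a.b"))
```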
This is, as one commenter noted, a low-level string operation. It might work, but an approach that takes your data structure into account should be more robust. Whether splitting on commas or spaces is sufficient, or whether you need the robustness of the csv module, depends on the input strings you expect.
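If quoted fields that contain commas are a possibility, a csv-based version might look like this sketch. The helper name remove_field is hypothetical, and the result is re-joined with a plain ", ", so any quoting in the input is not reproduced in the output.

```python
import csv
import io


def remove_field(line, item):
    """Parse one CSV line, drop fields equal to item, and re-join with ', '."""
    # skipinitialspace=True ignores the space that follows each comma.
    reader = csv.reader(io.StringIO(line), skipinitialspace=True)
    fields = next(reader)
    return ", ".join(field for field in fields if field != item)


print(remove_field("1:Apples, 4:Bananas, 6:Grapes, 23:Oranges", "4:Bananas"))
```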