Identifying similarities and patterns within a string - Python

Question

This is the use case I am trying to understand.

I have a list of spam subscriptions to a service, and they are killing conversion rates and skewing other usability studies.

The email addresses look like this:

rogerep_dyeepvu@hotmail.com

rogeram_ingramameb@hotmail.com

rogerew_jonesewct@hotmail.com

roger[...]_surname[...]@hotmail.com

What would be your suggestions for identifying these records with an automated script? It feels a little more complicated than it looks.

Help would be much appreciated!

+1

python string pattern-matching spam-prevention

RadiantHex May 13 '10 at 9:06

3 answers

I don't think you can do more than flag it as a potential problem by checking against:

^roger([a-z]{2})_([a-z]+)@hotmail\.com

with a regular expression, if it is a pattern that the spammer uses repeatedly.

It looks like after roger they use 2 lowercase alphabetic characters, so I've built that in. I'm not sure how you would match against the dictionary of names they use, so matching the last part (which appears to be a surname followed by 4 lowercase alphabetic characters) may be difficult, although you could possibly do:

^roger([a-z]{2})_([a-z]{5,})@hotmail\.com

which assumes that all of their surnames have at least one character.
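
A minimal sketch of how the second pattern might be applied in Python. The helper name `looks_like_roger_spam` and the sample list are hypothetical. The pattern below escapes the dot and adds a trailing `$` so that nothing may follow the domain; both are assumptions about the intent rather than part of the pattern as posted.

```python
import re

# Pattern from the answer: "roger", 2 lowercase letters, an underscore,
# then at least 5 lowercase letters (surname plus 4 trailing letters).
# Assumption: the dot is escaped and "$" is added to anchor the end.
ROGER_PATTERN = re.compile(r"^roger([a-z]{2})_([a-z]{5,})@hotmail\.com$")


def looks_like_roger_spam(email_address):
    """Return True if the address fits the suspected spammer's shape."""
    return ROGER_PATTERN.match(email_address) is not None


# Hypothetical sample data: two addresses from the question, one plain one.
samples = [
    "rogerep_dyeepvu@hotmail.com",
    "rogeram_ingramameb@hotmail.com",
    "roger@hotmail.com",
]
flagged = [address for address in samples if looks_like_roger_spam(address)]
print(flagged)
```

A match only flags an address for review; it does not prove the address is spam.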

+1

Dominic Rodger May 13 '10 at 9:14

Sounds like a job for regular expressions:

import re
if re.match(r"^roger[a-z]+_[a-z]+@hotmail\.com$", email_address):
    # might be your spammer
    print("suspicious:", email_address)

(If you've never used regular expressions, here's a rundown of what that means: ^ matches the beginning of a string and $ matches the end, so we require everything in between those characters to be a pattern that describes the entire string. [a-z] matches any lowercase letter and + means "one or more times", so [a-z]+ matches one or more lowercase letters. Putting it all together, our regular expression matches if the string can be described as "the beginning of the string, followed by the letters roger, followed by one or more lowercase letters, followed by an underscore, followed by one or more lowercase letters, followed by @hotmail.com, and then the end of the string." If the regex matches, the email address fits the pattern you described in your question.)

Of course, if the spammer catches on and changes the pattern (for example, by switching first names), this method will fail and you will have to fall back on more traditional spam prevention methods such as CAPTCHAs.

+1

Etaoin May 13 '10 at 9:17

Kylotan · Accepted Answer · 2010-05-13T09:50:49+0000

I don't think you can check this easily. This is not likely to be a simple string-matching problem that you can use a regex for, because I am assuming that your use of the name "Roger" was just an example, and that any number of names could appear in that position. You could also run one of the regular expressions provided by the other posters, parameterizing it with every permutation of plausible first and last names. It will probably take somewhere between "too long" and "forever", though, and will flag a lot of false positives.

Another approach that works with the pattern you posted above would be to take the last 4 letters of the username and compare them to something. Spotting characters that are random rather than sensibly organized (given a specific language) can be done by training a Markov chain on legitimate text, which will then let you calculate the probability that the given 4 letters appear in that order in that language. For random letters, this probability will usually be much lower than for a legitimate name (although if there are special characters or digits in there, all bets are off).
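
A rough sketch of the Markov chain idea, assuming a first-order (letter-to-letter) model with add-one smoothing. The function names and the tiny list of surnames are hypothetical; a real model would need a much larger body of legitimate text, and a cut-off score would have to be tuned on real data.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def train_bigram_model(words):
    """Count letter-to-letter transitions in a list of lowercase words."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        for first, second in zip(word, word[1:]):
            counts[first][second] += 1
    return counts


def log_probability(model, letters):
    """Sum of log transition probabilities, using add-one smoothing."""
    total = 0.0
    for first, second in zip(letters, letters[1:]):
        row = model.get(first, {})
        row_total = sum(row.values())
        total += math.log((row.get(second, 0) + 1) / (row_total + len(ALPHABET)))
    return total


# Hypothetical training data; far too small for real use.
legitimate_names = [
    "jones", "smith", "johnson", "williams", "brown", "miller",
    "davis", "wilson", "anderson", "thomas", "jackson", "martin",
]
model = train_bigram_model(legitimate_names)

# Higher (less negative) scores mean the letters look more like the training text.
print(log_probability(model, "ones"))  # tail of a real surname
print(log_probability(model, "epvu"))  # tail of rogerep_dyeepvu
```

With this toy data the name-like tail scores higher than the random-looking one, which is the gap the approach relies on.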

Another way would be to use a Bayesian filter (like Reverend in Python, although there are others) trained on the last 4 letters of legitimate email addresses. This would probably catch 95% of those that were merely random, provided you feed in the data usefully, e.g. by sending in not only the 4 letters but also each of the 2-letter and 3-letter substrings within them, to capture the context of each letter. I don't think this will work as well as the Markov-style method, though.
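
A small sketch of the feature preparation only, assuming the tokens would then be passed to whichever Bayesian filter is used. It deliberately does not call Reverend or any other library, and the function name `substring_features` is hypothetical.

```python
def substring_features(letters, sizes=(2, 3, 4)):
    """Return every contiguous substring of the given sizes, in order."""
    features = []
    for size in sizes:
        for start in range(len(letters) - size + 1):
            features.append(letters[start:start + size])
    return features


username = "rogerep_dyeepvu"
tokens = substring_features(username[-4:])
print(tokens)
# prints ['ep', 'pv', 'vu', 'epv', 'pvu', 'epvu']
```

Feeding the overlapping substrings alongside the full 4 letters gives the filter some context for each letter rather than a single opaque token.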

Whatever you do, you can cut down on false positives by only sending certain email addresses to it (for example, only those that contain an underscore, with at least 3 characters before the underscore and 5 characters after it).
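
The pre-filter described above could look something like this. It is a sketch under the assumption that the character counts apply to the part before the @ and that the first underscore is the one that matters; `worth_checking` is a hypothetical name.

```python
def worth_checking(email_address):
    """Cheap pre-filter: underscore with >= 3 chars before and >= 5 after."""
    username = email_address.split("@", 1)[0]
    before, underscore, after = username.partition("_")
    return bool(underscore) and len(before) >= 3 and len(after) >= 5


print(worth_checking("rogerep_dyeepvu@hotmail.com"))
print(worth_checking("jane.doe@example.com"))
```

Only addresses that pass this check would then go on to the slower Markov or Bayesian scoring.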

But ultimately, you can never know whether this is a spam address or a real one until it is used for some purpose. Therefore, if possible, I suggest that you give up trying to analyze the content and fix the problem elsewhere. How are they killing conversion rates? If you are counting these bogus accounts in some kind of metric, your best bet is to add a verification stage first and only count metrics for accounts that have been verified. Some people do have addresses like rogerep_dyeepvu@hotmail.com after all.
