Calculating the probability that a string was randomized? - Python

This follows on from a question I asked earlier (question).

I have a list of handcrafted strings like:

lucy87

gordan_king

fancy_unicorn77

joplucky_kanga90

base_belong_to_narwhals

and a list of randomized strings:

johnkdf

pancake90kgjd

fancy_jagookfk

manhattanljg


What gives the last set of strings away as randomized are sequences like 'kjg', 'jgf', 'lkd', ...

Is there any clever way to separate the strings containing these seemingly randomized sequences from the crowd?

I suppose this relies heavily on the fact that some characters are more likely to appear next to others (for example, "co", "ka", "ja", ...).


Any ideas on this? Kylotan mentioned Rev, but I'm not sure if it can be used for this purpose.

Help would be greatly appreciated!

+2




5 answers


It's just a thought. I've never tried this myself ...

Build a Bloom filter by hashing each (overlapping) 4-letter sequence found in a dictionary. Check a string by counting how many of its 4-letter sequences are missing from the filter. The more misses, the more likely the string contains random junk.

Try adjusting the Bloom filter size and the number of letters in each sequence.

Also note (thanks @MihaiD) that you should include a dictionary of names, preferably from multiple languages, in the Bloom filter to minimize false positives.
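
A rough sketch of the idea above, assuming a hand-rolled Bloom filter built on `hashlib`. The `BloomFilter` class, `ngrams`, `build_filter`, `miss_ratio`, the bit-array size, and the tiny `WORDS` list are all made up for illustration; a real run would load a proper dictionary plus name lists.

```python
import hashlib
import re


class BloomFilter:
    """A tiny Bloom filter backed by a bytearray (sizes are illustrative)."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text, n=4):
    """Overlapping n-letter sequences taken from each run of letters."""
    grams = []
    for run in re.findall(r"[a-z]+", text.lower()):
        grams.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return grams


def build_filter(words, n=4):
    bloom = BloomFilter()
    for word in words:
        for gram in ngrams(word, n):
            bloom.add(gram)
    return bloom


def miss_ratio(text, bloom, n=4):
    """Fraction of the string's n-grams that are absent from the filter."""
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    return sum(gram not in bloom for gram in grams) / len(grams)


# Placeholder word list (assumption); use a real dictionary in practice.
WORDS = ["fancy", "unicorn", "pancake", "manhattan", "lucky", "gordon", "king"]
bloom = build_filter(WORDS)
for name in ["fancy_unicorn77", "fancy_jagookfk", "manhattanljg"]:
    print(name, miss_ratio(name, bloom))
```

Digits and underscores are treated as separators here, so only the letter runs are scored. Where to put the cut-off on the miss ratio is something you would have to tune against your own data.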

+4




What results do you get if you run the strings through something like TextCat? (I've seen several different implementations of TextCat; there may already be one for Python. If not, it is not a hard algorithm; it's the data that matters.)

I think that if you strip the numbers, the first set of strings will come out closer to "English" in TextCat than the ones with random stuff.

How much closer, and whether you can use the TextCat data (which is based on which letters tend to appear next to each other in particular languages) to "pass" or "fail" a string, will take some experimentation though ...
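
To make that concrete, here is a minimal sketch of a TextCat-style comparison: rank character n-grams by frequency and sum how far each n-gram of the sample sits from its rank in a reference profile. This is not the TextCat tool itself. The function names, the `max_n` and `top` values, and the one-sentence `REFERENCE_TEXT` are assumptions for illustration; a usable profile needs a large English corpus.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Map each of the most frequent n-grams (1..max_n) to its rank."""
    counts = Counter()
    # Digits and underscores act as separators, so numbers are stripped.
    for token in re.findall(r"[a-z]+", text.lower()):
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(sample_profile, reference_profile):
    """Sum of rank differences; unseen n-grams get the maximum penalty."""
    max_penalty = len(reference_profile)
    total = 0
    for gram, rank in sample_profile.items():
        ref_rank = reference_profile.get(gram)
        if ref_rank is None:
            total += max_penalty
        else:
            total += min(abs(rank - ref_rank), max_penalty)
    return total


def distance(text, reference_profile, max_n=3):
    """Out-of-place score scaled to the range 0..1 (0 = identical ranks)."""
    sample = ngram_profile(text, max_n)
    if not sample or not reference_profile:
        return 0.0
    worst = len(sample) * len(reference_profile)
    return out_of_place(sample, reference_profile) / worst


# Placeholder reference text (assumption); use a real corpus in practice.
REFERENCE_TEXT = (
    "the quick brown fox jumps over the lazy dog and the fancy unicorn "
    "eats a pancake in manhattan"
)
english = ngram_profile(REFERENCE_TEXT)
for name in ["fancy_unicorn77", "fancy_jagookfk"]:
    print(name, distance(name, english))
```

Short strings give very noisy rank profiles, so expect to experiment with the n-gram lengths and with where you draw the pass/fail line.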

+2




Try using a vanilla Bayes classifier. It should be sufficient for the general case.
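
Reading this as a plain naive Bayes classifier over character bigrams, a minimal sketch might look like the following. The `NaiveBayes` class, the `bigrams` helper, and the label names are hypothetical, and the training set is just the handful of strings from the question, which is far too small for real use.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Overlapping letter pairs from each run of letters."""
    grams = []
    for run in re.findall(r"[a-z]+", text.lower()):
        grams.extend(run[i:i + 2] for i in range(len(run) - 1))
    return grams


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.counts = {}       # label -> Counter of bigrams
        self.docs = Counter()  # label -> number of training strings
        self.vocab = set()

    def train(self, text, label):
        grams = bigrams(text)
        self.counts.setdefault(label, Counter()).update(grams)
        self.docs[label] += 1
        self.vocab.update(grams)

    def log_scores(self, text):
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen bigrams
        grams = bigrams(text)
        scores = {}
        for label, counter in self.counts.items():
            total = sum(counter.values())
            score = math.log(self.docs[label] / total_docs)
            for gram in grams:
                score += math.log((counter[gram] + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Training strings taken from the question; labels are made up here.
nb = NaiveBayes()
for name in ["lucy87", "gordan_king", "fancy_unicorn77",
             "joplucky_kanga90", "base_belong_to_narwhals"]:
    nb.train(name, "handcrafted")
for name in ["johnkdf", "pancake90kgjd", "fancy_jagookfk", "manhattanljg"]:
    nb.train(name, "randomized")

print(nb.log_scores("lucky_narwhal"))
print(nb.classify("lucky_narwhal"))
```

Note that the randomized examples still contain real words ("pancake", "manhattan"), so training on whole strings blurs the two classes. Training the "randomized" class on the junk part only would probably separate them better.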

+1




It looks to me like you are trying to write code to detect some set of tweaks that a spammer applies to a string to get past your filters. What I don't see is what stops them, after all your hard work, from making a 10-second tweak to their algorithm and beating your new filter.

+1




Some time ago I read a short article on generating random names where they did the following: they built a table containing the information you already pointed to: "I suppose this relies heavily on the fact that some characters are more likely to appear next to others."

So what they did was read a whole dictionary and determine which letters were more likely to follow others. I don't know how many letters in a row they looked at. Maybe you should try more than just two consecutive letters, say something between 3 and 6.

Now I suggest that you build a table like this (perhaps in a better data structure) that contains all the "valid" consecutive letter combinations (and possibly their probabilities), and check whether your name contains (almost) only such "valid" consecutive letters.
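
As a sketch of such a table, assuming the simplest case of two consecutive letters: count which letter follows which in a word list, then score a name by its average log-probability per transition. The function names and the short `WORDS` list are placeholders, and the add-one smoothing over 26 letters is my own choice, not something from the article.

```python
import math
import re
from collections import Counter, defaultdict


def train_transitions(words):
    """Count how often each letter follows another across the word list."""
    table = defaultdict(Counter)
    for word in words:
        for run in re.findall(r"[a-z]+", word.lower()):
            for prev, nxt in zip(run, run[1:]):
                table[prev][nxt] += 1
    return table


def avg_log_prob(text, table, alphabet_size=26):
    """Average log-probability per letter transition (add-one smoothing)."""
    log_probs = []
    for run in re.findall(r"[a-z]+", text.lower()):
        for prev, nxt in zip(run, run[1:]):
            followers = table.get(prev, Counter())
            total = sum(followers.values())
            prob = (followers[nxt] + 1) / (total + alphabet_size)
            log_probs.append(math.log(prob))
    if not log_probs:
        return 0.0
    return sum(log_probs) / len(log_probs)


# Placeholder vocabulary (assumption); read a full dictionary in practice.
WORDS = ["fancy", "unicorn", "pancake", "manhattan", "lucky", "kangaroo",
         "narwhal", "belong", "base", "king", "john"]
table = train_transitions(WORDS)
for name in ["fancy_unicorn77", "fancy_jagookfk", "manhattanljg"]:
    print(name, avg_log_prob(name, table))
```

Lower (more negative) averages mean the name contains more unlikely letter pairs. Extending the context from one preceding letter to two or more is the natural next thing to try, at the cost of needing a much larger vocabulary.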

+1

