Searching a 25 GB Corpus for a Single Word

I need to search a 25 GB Wikipedia corpus for a single word. I used grep, but it takes a long time. Is there an effective yet simple approach that runs quickly? I also want to find exact matches only.

Thanks.

+2




4 answers


You probably want to build an index mapping each word to a list of locations (byte offsets). The word list would be sorted alphabetically. On top of that you might keep a secondary index recording where each letter prefix starts in that big word list.

Prefix index          |   Word index               |  Corpus
aaa   starts at X     |   aaa                      |  lorem ipsum dolor
aab   starts at Y     |   ...                      |  sit amet .....
aac   ...             |   and  486, 549, 684, ...  |  ...
...   ...             |                            |
zzz   ...             |                            |
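A minimal sketch of this two-level index in Python; the function and variable names are my own, not from the answer:

```python
from collections import defaultdict

def build_indexes(corpus: str, prefix_len: int = 3):
    """Build a word -> byte-offsets index, plus a secondary index mapping
    each prefix to where it starts in the sorted word list (illustrative)."""
    word_index = defaultdict(list)
    offset = 0
    for word in corpus.split(" "):
        word_index[word].append(offset)
        offset += len(word) + 1  # +1 for the space separator

    # Secondary index: first position of each prefix in the sorted word list.
    sorted_words = sorted(word_index)
    prefix_index = {}
    for i, word in enumerate(sorted_words):
        prefix_index.setdefault(word[:prefix_len], i)
    return word_index, prefix_index

word_index, prefix_index = build_indexes("and lorem and ipsum")
# word_index["and"] -> [0, 10], the byte offsets of both occurrences
```

The secondary index lets a lookup jump straight to the right region of the sorted word list instead of scanning it from the top.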

This is the approach the natural-language-processing professor in my department recommends (we did this exercise as a lab in the algorithms course).

+3




Have you tried using an indexing engine, say Lucene with Nutch? Lucene is an indexing engine; Nutch is a web crawler. Combine the two!



I forgot to mention ... CouchDB ( http://couchdb.apache.org/ )

+3




I have had success with the Boyer-Moore algorithm and its simplified variant. Implementations for various languages are floating around the net.
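For reference, here is a compact sketch of the simplified (Horspool) variant in Python; the names are my own:

```python
def horspool(text: str, pattern: str) -> int:
    """Boyer-Moore-Horspool: return the first index of pattern in text,
    or -1. Uses the bad-character rule on the last aligned character
    to skip ahead by more than one position at a time."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    # How far we may shift when character c aligns with the pattern's last slot.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        i += shift.get(text[i + m - 1], m)
    return -1

horspool("sit amet best", "best")  # -> 9
```

For a one-off exact-match scan of a large file this is much faster than naive search, though still linear in the corpus size, which is why the index-based answers scale better for repeated queries.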

+2




@aloobe answered with the idea of an index file that maps words to locations. I just want to expand on that, although I think the answer the OP is looking for may simply be Boyer-Moore.

The index file would look like this (simplified to 2-digit offsets for readability):

53 17 89 03
77 79 29 39
88 01 05 15
...


Each entry above is the byte offset of a word or letter you consider important enough to index. In practice you would not index individual letters, since the index file would end up larger than the corpus!

The trick is that if you replace the offsets with the words found at those locations, the index file reads as an alphabetized version of the corpus:

and and are as
ate bad bat bay
bear best bin binge


This lets you do a binary search on the corpus via the index file. Say you are looking for the word best. You grab the middle record in the index file: 79. Then you go to byte 79 in the corpus and see which word is there: bad. Since the index is in alphabetical order and best > bad, the match must lie in the second half of the index file.

So we grab the middle index between 79 (entry 6) and 15 (entry 12), which in my example is 88 (entry 9). We look at byte 88 in the corpus and find bear. best > bear, so we try again: the middle index is now 01 (entry 10) or 05 (entry 11), depending on how you round. Either way we find best within 1 or 2 more lookups. With 12 words, as in the example, the worst case is 4 lookups. For a 25 GB file with an average word length of 5 letters plus a space in between, that's ~4 billion words, yet in the worst case you search only about 32 times. At that point your program spends more time on disk seeks and input buffering than on the actual search!

This method also handles repeated words. To find every location of the word the, do a binary search for the until you hit a matching index entry. Then step backwards one entry at a time in the index file, using each offset to look at the corpus; as long as the word at that offset is still the, keep going. When you finally stop, you have the earliest entry in the index file that maps to the.
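That backward walk can be sketched like this, reusing the same hypothetical corpus and offset layout as before:

```python
def word_at(corpus: str, offset: int) -> str:
    """Read the space-delimited word starting at the given byte offset."""
    end = corpus.find(" ", offset)
    return corpus[offset:] if end == -1 else corpus[offset:end]

def all_offsets(corpus: str, offsets: list, target: str) -> list:
    """Return every byte offset of target: binary-search to any matching
    index entry, walk back to the earliest one, then collect forward."""
    lo, hi = 0, len(offsets) - 1
    hit = -1
    while lo <= hi:
        mid = (lo + hi) // 2
        word = word_at(corpus, offsets[mid])
        if word == target:
            hit = mid
            break
        if word < target:
            lo = mid + 1
        else:
            hi = mid - 1
    if hit == -1:
        return []
    while hit > 0 and word_at(corpus, offsets[hit - 1]) == target:
        hit -= 1  # step back to the earliest matching index entry
    result = []
    while hit < len(offsets) and word_at(corpus, offsets[hit]) == target:
        result.append(offsets[hit])
        hit += 1
    return result

corpus = "best bad are binge and bat bay bear bin as ate and"
offsets = [19, 47, 9, 40, 43, 5, 23, 27, 31, 0, 36, 13]
all_offsets(corpus, offsets, "and")  # -> [19, 47]
```

Because duplicate words occupy adjacent entries in the sorted index, all occurrences come out as one contiguous run.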

Building the index file is the only difficult part. You walk through each word in the corpus, building up a data structure of words and their offsets. Along the way, skip words that are too common or too short to index, such as "a", "I", "the", "and", "is", etc. When you are done, turn that data structure into the index file. Unfortunately, for a 25 GB file your offsets must be more than 32 bits wide, so store them as long (in Java) or long long (in C). There is no reason for the file to be human-readable, so write the offsets as 64-bit values, not strings.

The structure I would recommend is a self-balancing binary search tree. Each node holds a string (the word) and an offset, but the tree compares nodes by the string alone. If you do this, an in-order traversal (left, node, right) yields exactly the index file.
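A sketch of the construction step in Python. I use a sorted list in place of the self-balancing tree (its in-order traversal gives the same result), and the stop-word list is illustrative:

```python
import re
import struct

STOPWORDS = {"a", "i", "the", "and", "is"}  # illustrative skip list

def build_index_bytes(corpus: str) -> bytes:
    """Collect (word, offset) pairs, skip stop words, sort by word, and
    emit each offset as a signed 64-bit big-endian value (not a string)."""
    entries = []
    for match in re.finditer(r"\S+", corpus):
        word = match.group().lower()
        if word in STOPWORDS:
            continue
        entries.append((word, match.start()))
    entries.sort()  # stands in for the in-order traversal of a balanced BST
    return b"".join(struct.pack(">q", offset) for _, offset in entries)

index = build_index_bytes("the cat sat")
struct.unpack(">2q", index)  # -> (4, 8): offsets of "cat" and "sat"
```

In practice you would stream the corpus from disk rather than hold it in a string, but the word-splitting, filtering, sorting, and 64-bit packing steps are the same.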

Hope this helps! An example I used many years ago to create a dictionary for mobile phones is Jim Breen's EDICT. It can be tricky to work with because of the EUC encoding of the Japanese text, but the intent is the same.

0








