Performing SVD on tweets. Memory problem

Question

EDIT: The size of the vocabulary list is 10-20 times larger than what I wrote down. I simply forgot a zero.

EDIT2: I'll take a look at SVDLIBC and also look into how to reduce the matrix to its dense version, which might help too.

I generated a huge CSV file as the output of my markup step. It looks like this:

        word1, word2, word3, ..., word 150.000
person1   1      2      0            1
person2   0      0      1            0
...
person650

It contains the word counts for each person. This way I get a characteristic vector for each person.

I want to run SVD on this beast, but the matrix seems to be too big to fit in memory for the operation. My questions are:

Should I reduce the number of columns by removing words whose column sum is, for example, 1, which means they were used only once? Would I bias the data too much by doing this?
I tried the quickminer approach: loading the CSV into a database and then reading it back sequentially in batches for processing, as quickminer suggests. But MySQL cannot store that many columns in a table. If I transpose the data and then transpose it back on import, that also takes a very long time.

-> In general, I am asking for advice on how to perform an SVD on a data set of this size.
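The column pruning asked about in the first question can be sketched roughly as follows. This is only an illustration under assumptions: each person is held as a dict of nonzero word counts (so zeros are never stored), and the function name `prune_rare_words` and the threshold parameter `min_total` are made up for this example.

```python
from collections import Counter


def prune_rare_words(rows, min_total=2):
    """Drop words whose total count over all persons is below min_total.

    rows: list of {word: count} dicts, one per person (hypothetical layout).
    Returns the pruned rows and the set of words that were kept.
    """
    totals = Counter()
    for row in rows:
        totals.update(row)  # adds each word's count to its column sum
    keep = {word for word, total in totals.items() if total >= min_total}
    pruned = [{w: c for w, c in row.items() if w in keep} for row in rows]
    return pruned, keep


# Toy data: "rare" and "word3" each have a column sum of 1.
rows = [
    {"word1": 1, "word2": 2, "rare": 1},
    {"word1": 3, "word3": 1},
]
pruned, vocab = prune_rare_words(rows)
```

Whether this biases the result depends on the data; the sketch only shows the mechanics of the filter.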


plotti May 12 '10 at 12:23

2 answers

Yin Zhu · Answer 1 · 2010-05-15T01:07:40+0000

As a dense matrix this is large. Stored as a sparse matrix, however, it is quite small.

Using a sparse-matrix SVD algorithm is sufficient, for example here.
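As one possible illustration (not necessarily what the link above pointed to), SciPy's `scipy.sparse.linalg.svds` computes a truncated SVD directly on a sparse matrix. The sketch below assumes Python with NumPy and SciPy; the matrix dimensions, the random toy counts, and the choice of `k` are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy stand-in for the person-by-word count matrix (random, hypothetical data).
rng = np.random.default_rng(0)
n_people, n_words, n_entries = 650, 5000, 20000
row_idx = rng.integers(0, n_people, size=n_entries)
col_idx = rng.integers(0, n_words, size=n_entries)
counts = rng.integers(1, 5, size=n_entries).astype(np.float64)

# Only nonzero counts are stored; duplicate (row, col) pairs are summed.
A = csr_matrix((counts, (row_idx, col_idx)), shape=(n_people, n_words))

k = 10  # number of singular triplets to keep (assumption)
U, s, Vt = svds(A, k=k)

# Sort explicitly so the largest singular value comes first.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

# One k-dimensional vector per person.
person_vectors = U * s
```

Only the `k` largest singular triplets are computed, and the dense 650-by-vocabulary array is never built.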

Steve severance · Answer 2 · 2010-05-15T01:15:04+0000

SVD is limited by the size of your memory. See:

Folding In: a paper on partial matrix updates.
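For orientation, folding-in as it is commonly described for latent semantic indexing projects new rows into an already computed rank-k space instead of recomputing the SVD. The sketch below is a minimal NumPy version under that assumption; it is not taken from the paper mentioned above, and the function name `fold_in` and the toy data are hypothetical.

```python
import numpy as np

# Toy person-by-word count matrix (random, hypothetical data).
rng = np.random.default_rng(1)
A = rng.poisson(0.3, size=(20, 40)).astype(float)

# Rank-k SVD of the data already seen: A is approximately U_k diag(s_k) Vt_k.
k = 5
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]


def fold_in(new_rows, s_k, Vt_k):
    """Project new person rows into the existing space: d V_k S_k^-1."""
    return (new_rows @ Vt_k.T) / s_k


# A new person arrives; the existing factors are left untouched.
new_person = rng.poisson(0.3, size=(1, 40)).astype(float)
u_new = fold_in(new_person, s_k, Vt_k)
U_extended = np.vstack([U_k, u_new])
```

Folding in a row that was already part of `A` reproduces its row of `U_k`. For genuinely new rows the result is an approximation, because the singular values and word vectors are not updated.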

Apache Mahout is a distributed data mining library that runs on Hadoop and includes a parallel SVD.