Performing SVD on tweets. Memory problem

EDIT: The vocabulary list is 10-20 times larger than what I wrote down. I just forgot a zero.

EDIT2: I'll take a look at SVDLIBC and also see how to reduce the matrix to its dense version, so that might help too.

I generated a huge CSV file as the output of my text preprocessing. It looks like this:

           word1  word2  word3  ...  word 150.000
person1        1      2      0  ...             1
person2        0      0      1  ...             0
...
person650

      

It contains the word counts for each person. This way, I get a characteristic vector for each person.

I want to run SVD on this beast, but it seems the matrix is too big to be held in memory for the operation. My questions are:

  • Should I reduce the number of columns by removing words whose column sum is, for example, 1, which means they were used only once? Would I bias the data too much by doing this?

  • I tried the RapidMiner approach: loading the CSV into a database and then reading it back sequentially in batches for processing, as RapidMiner suggests. But MySQL cannot store that many columns in a table. If I transpose the data and then transpose it back on import, it also takes a very long time.

-> In general, I am asking for advice on how to perform an SVD on a data set like this.
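
For the first bullet, here is a minimal sketch of the pruning step, assuming the counts are held in a SciPy sparse matrix rather than a CSV table. The names `drop_rare_columns` and `min_total` are made up for illustration, and the toy matrix is not real data. A word used only once appears in a single person's row, so it carries no information shared between people. Whether dropping such words biases a particular analysis is still a judgment call that this sketch does not settle.

```python
import numpy as np
from scipy import sparse


def drop_rare_columns(counts, min_total=2):
    """Keep only the columns (words) whose total count is at least min_total.

    Returns the pruned matrix and the indices of the columns that were kept.
    """
    counts = sparse.csc_matrix(counts)  # CSC makes column selection cheap
    totals = np.asarray(counts.sum(axis=0)).ravel()
    keep = np.flatnonzero(totals >= min_total)
    return counts[:, keep], keep


# Toy person-by-word count matrix: 3 people, 4 words (illustrative only).
toy = sparse.csr_matrix(np.array([
    [1, 2, 0, 1],
    [0, 0, 1, 0],
    [0, 3, 0, 1],
]))

# Words 0 and 2 occur only once in total, so they are dropped.
pruned, kept = drop_rare_columns(toy, min_total=2)
```

Keeping the returned column indices around makes it possible to map the remaining columns back to the original vocabulary.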

+2




2 answers


Stored as a dense matrix, this is large. Stored as a sparse matrix, however, it is quite small.



Using an SVD algorithm for sparse matrices is sufficient, for example here.
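
As a rough illustration of the point, a 650 x 150,000 matrix of 8-byte floats takes 650 * 150,000 * 8 = 780,000,000 bytes (about 780 MB) when dense, while a sparse format stores only the nonzero entries. The sketch below uses `scipy.sparse.linalg.svds`, which is one sparse SVD routine and not necessarily the one linked above. The data is synthetic, and the number of nonzeros and the choice of `k = 10` are assumptions made for the example.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Shape taken from the question; n_nonzero is an assumed, made-up value.
n_people, n_words, n_nonzero = 650, 150_000, 100_000

# Synthetic stand-in for the word counts: random positions, counts from 1 to 4.
rng = np.random.default_rng(0)
rows = rng.integers(0, n_people, size=n_nonzero)
cols = rng.integers(0, n_words, size=n_nonzero)
vals = rng.integers(1, 5, size=n_nonzero).astype(np.float64)

# Duplicate (row, col) pairs are summed when converting COO to CSR.
counts = sparse.coo_matrix((vals, (rows, cols)), shape=(n_people, n_words)).tocsr()

# Truncated SVD: only the k largest singular triplets are computed.
k = 10
u, s, vt = svds(counts, k=k)

# Sort the triplets so the largest singular value comes first.
order = np.argsort(s)[::-1]
u, s, vt = u[:, order], s[order], vt[order, :]
```

The matrix is never converted to a dense array here. Only `u` (650 x k), `s` (k) and `vt` (k x 150,000) are dense.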

+1




SVD is limited by the size of your memory. See:

Folding In: a paper on partial matrix updates.
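
A minimal sketch of the folding-in idea as it is commonly described for latent semantic indexing, which may differ in detail from the linked paper. Assuming a truncated SVD A ≈ U S Vᵀ with people as rows, a new person's count vector a is projected into the existing space as a V S⁻¹ without recomputing the decomposition. The helper name `fold_in_rows` and the random data are illustrative assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds


def fold_in_rows(new_rows, s, vt):
    """Project new rows into an existing latent space: new_rows @ V @ diag(1/s)."""
    return (new_rows @ vt.T) / s


# Small synthetic matrix standing in for the person-by-word counts.
base = sparse.random(40, 200, density=0.05, format="csr", random_state=0)
u, s, vt = svds(base, k=5)

# Sanity check: folding in rows that were already part of the matrix
# should reproduce the corresponding rows of U.
folded = fold_in_rows(base[:3], s, vt)
```

Folded-in rows do not change the existing factors, so the approximation can degrade as more new data is added this way.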



Apache Mahout is a distributed data mining library that runs on Hadoop and includes a parallel SVD.

-1








