Performing SVD on tweets. Memory problem
EDIT: The vocabulary list is 10-20 times larger than what I originally wrote down. I simply forgot a zero.
EDIT2: I will take a look at SVDLIBC and also see how to compress the matrix into its dense version, which might help too.
I generated a huge CSV file as the output of my markup step. It looks like this:
word1, word2, word3, ..., word150000
person1 1 2 0 1
person2 0 0 1 0
...
person650
It contains the word counts for each person. This way, I get a characteristic vector for each person.
I want to run SVD on this beast, but it seems the matrix is too big to be held in memory for the operation. My questions are:
- Should I reduce the number of columns by removing words whose column sum is, for example, 1, which means they were used only once? Would I bias the data too much by doing this?
- I tried the RapidMiner approach: loading the CSV into a database and then reading it back sequentially in batches for processing, as RapidMiner suggests. But MySQL cannot store that many columns in a table. If I transpose the data and then transpose it back on import, that also takes a very long time.
-> In general, I am asking for advice on how to perform SVD on a dataset like this.
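To make the first idea concrete, this is roughly the pruning I have in mind. It is only a minimal sketch under assumptions: it assumes the file is a regular comma-separated file whose first column holds the person id, and the names `load_sparse_counts`, `prune_rare_words` and `min_total` are made up for illustration. It reads the file row by row and keeps only the non-zero counts, so the full matrix is never held in memory.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for the real file (assumed layout: person id, then one count per word).
SAMPLE = """person,word1,word2,word3,word4
person1,1,2,0,1
person2,0,0,1,0
person3,0,1,0,3
"""

def load_sparse_counts(lines):
    """Read the CSV row by row, storing only non-zero counts per person."""
    reader = csv.reader(lines)
    vocab = next(reader)[1:]  # skip the header cell of the person column
    rows = {}
    for record in reader:
        person, counts = record[0], record[1:]
        rows[person] = {j: int(c) for j, c in enumerate(counts) if int(c) != 0}
    return vocab, rows

def prune_rare_words(vocab, rows, min_total=2):
    """Drop every word whose total count over all persons is below min_total."""
    totals = Counter()
    for counts in rows.values():
        totals.update(counts)  # adds the counts per column index
    keep = [j for j in range(len(vocab)) if totals[j] >= min_total]
    new_index = {old: new for new, old in enumerate(keep)}
    pruned = {
        person: {new_index[j]: c for j, c in counts.items() if j in new_index}
        for person, counts in rows.items()
    }
    return [vocab[j] for j in keep], pruned

if __name__ == "__main__":
    vocab, rows = load_sparse_counts(io.StringIO(SAMPLE))
    kept_vocab, pruned_rows = prune_rare_words(vocab, rows, min_total=2)
    print(kept_vocab)
    print(pruned_rows)
```

With `min_total=2` this removes exactly the words that were used only once. Note that a person who used only such words ends up with an empty vector, which is part of why I am unsure about the bias.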
SVD is limited by the size of your memory. See:
Folding In: a paper on partial matrix updates.
Apache Mahout: a distributed data mining library that runs on Hadoop and includes a parallel SVD.
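Since the matrix has only about 650 rows, there is also a route that avoids decomposing the wide matrix directly. This is a sketch of a different technique from the two links above, not their method: eigen-decompose the small Gram matrix A Aᵀ (rows × rows) and recover the right singular vectors from it. The function names `thin_svd_via_gram` and `fold_in` are hypothetical, NumPy is assumed to be available, and the demo data is random. `fold_in` shows the usual folding-in projection of a new row, d V Σ⁻¹, which may differ in detail from what the linked paper describes.

```python
import numpy as np

def thin_svd_via_gram(A, k):
    """Top-k SVD of a wide matrix A (few rows, many columns) via A @ A.T."""
    gram = A @ A.T                                   # rows x rows, small
    eigvals, eigvecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]            # indices of the k largest
    s = np.sqrt(np.clip(eigvals[order], 0.0, None))  # singular values
    U = eigvecs[:, order]                            # left singular vectors
    Vt = (U / s).T @ A                               # Sigma^-1 U^T A, assumes s > 0
    return U, s, Vt

def fold_in(new_row, s, Vt):
    """Project a new person's count vector into the existing k-dimensional space."""
    return (new_row @ Vt.T) / s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.poisson(1.0, size=(6, 40)).astype(float)  # stand-in for 650 x vocabulary
    U, s, Vt = thin_svd_via_gram(A, k=3)
    print(s)
    print(fold_in(A[0], s, Vt))  # coordinates of the first person in the reduced space
```

Folding in a row that was already part of the matrix reproduces its row of `U`, which is a quick sanity check. Squaring the matrix also squares its condition number, so small singular values lose accuracy; for the leading components of a count matrix this is usually acceptable. If the data is held in a SciPy sparse matrix, the Gram product would have to be converted to a dense array before calling `np.linalg.eigh`.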