Radim Řehůřek | 6 Jun 2012 10:09

[gensim:1123] Re: LSA: from cosine to probability

Hi Joris,

it looks like you're passing your LSA objects (document history_lsa,
corpus dict_lsa) as if they have dimensionality `len(dictionary)`. But
that's the dimensionality of the bag-of-words representation (sparse,
before the transform). After the LSA transformation, the
dimensionality is only a few hundred, typically.

If this is really the cause of your problems, the fix is to pass the
LSA dimensionality instead, i.e. `num_features=self.model.num_topics`,
when you create the `MatrixSimilarity` (or `Similarity`) index.
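
A rough back-of-the-envelope sketch of why the `num_features` value matters so much for memory. It assumes the 80k term vectors end up in a dense float32 array with `num_features` columns; the 200 topics below are a hypothetical stand-in for "a few hundred", not a number from this thread.

```python
# Rough memory estimate for a dense float32 matrix (assumption: one row per
# term vector, one column per feature, 4 bytes per value).
BYTES_PER_FLOAT32 = 4


def dense_matrix_bytes(num_rows, num_features):
    """Bytes needed to hold num_rows x num_features float32 values."""
    return num_rows * num_features * BYTES_PER_FLOAT32


vocab_size = 80_000  # dictionary size mentioned in the original question
num_topics = 200     # hypothetical: "a few hundred" LSA dimensions

# Vectors declared as bag-of-words sized vs. LSA sized.
bow_sized = dense_matrix_bytes(vocab_size, vocab_size)
lsa_sized = dense_matrix_bytes(vocab_size, num_topics)

print(f"num_features=len(dictionary): {bow_sized / 1e9:.1f} GB")
print(f"num_features=num_topics:      {lsa_sized / 1e6:.1f} MB")
```

Under these assumptions, the same 80k vectors take a few hundred times less space once they are declared with the LSA dimensionality.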

Best,
Radim

On Jun 6, 12:05 am, niefpaarschoenen <joris.pelem...@...> wrote:
> Hi all,
>
> I'm currently using gensim to calculate LSA term-document similarities
> expressed as a probability, based on a paper by Bellegarda*. To go
> from a cosine similarity between a term and a document ([-1:1]) to a
> probability ([0:1]), he calculates the cossim between all terms in the
> dictionary and the document, subtracts the minimum cossim and
> renormalizes by dividing with the sum of all the cossims. What I
> implemented so far should do just that, but calculating all the
> similarities takes either a long time or a lot of memory or simply
> doesn't work, so I guess I'm doing something wrong.
>
> Here's my code so far:
>
>         def calcProb(self, word, history):
>                 history_bow = self.dictionary.doc2bow(history.lower().split())
>                 history_lsa = self.model[history_bow]
>                 # here we want to compare EACH word with the history. this can be done in two ways:
>                 # 1) use the rows of the lsa to convert words to latent space (more correct and probably faster, but deeper into gensim code) TODO
>                 # 2) transform each word as if it was a document to the latent space (probably easiest solution)
>                 def dict_bow():
>                         # add the word first, so we know its index
>                         yield self.dictionary.doc2bow([word])
>                         for k in self.dictionary.token2id.keys():
>                                 if k != word:
>                                         yield self.dictionary.doc2bow([k])
>                 dict_lsa = self.model[dict_bow()]
>
>                 # comparing all the terms with the history:
>                 # MEMORY EATING VERSION
>                 #index = similarities.MatrixSimilarity([history_lsa],num_features=len(self.dictionary))
>                 #sims = index[dict_lsa]
>                 # SLOW VERSION (3 mins)
>                 #index = similarities.Similarity('tst', [history_lsa],num_features=len(self.dictionary))
>                 #sims = index[dict_lsa]
>                 # REVERSE COMPARISON = FASTER VERSION (1.5 mins)
>                 index = similarities.Similarity('tst', [dict_lsa],num_features=len(self.dictionary))
>                 sims = index[history_lsa]
>
>                 return self.cosToProb(sims, 0)
>
>         def cosToProb(self, cos, index, gamma=1):
>                 cos_min = min(cos)
>                 cos_shifted = (cos - cos_min)**gamma
>                 sum_cos_shifted = sum(cos_shifted)
>                 return cos_shifted[index] / sum_cos_shifted
>
> Since my data set is rather small (80k words), I would have thought
> using the simple MatrixSimilarity was no problem, but this uses a lot
> of memory. Switching to Similarity (in which case gensim uses
> SparseMatrixSimilarity) helps, but is still quite slow. Just for
> kicks, I thought about reversing the calculation and to my surprise
> this sped up things considerably (although I still wouldn't mind it
> being a bit faster, since I will have to do this a lot).
>
> So I guess my questions are:
> 1) Is my fastest implementation the best way to do this?
> 2) Why does reversing matter?
> 3) Why does MatrixSimilarity take up so much memory here?
>
> Thanks in advance,
>
> Joris
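
For reference, the cosine-to-probability step from the quoted message can be sketched without any library. This is a minimal plain-Python version of the quoted `cosToProb` (shift by the minimum, raise to `gamma`, divide by the sum of the shifted values); the uniform fallback for identical cosines and the sample values are assumptions added for illustration, not part of the original code.

```python
def cos_to_prob(cosines, index, gamma=1.0):
    """Map the cosine at `index` to a probability over all terms."""
    cos_min = min(cosines)
    # Shift so the smallest similarity becomes 0, then sharpen with gamma.
    shifted = [(c - cos_min) ** gamma for c in cosines]
    total = sum(shifted)
    if total == 0:
        # Assumption: if every cosine is identical, fall back to uniform.
        return 1.0 / len(shifted)
    return shifted[index] / total


# Hypothetical cosines between four terms and one history document;
# the target word sits at index 0, as in the quoted code.
sims = [0.5, -0.5, 0.0, 1.0]
probs = [cos_to_prob(sims, i) for i in range(len(sims))]
print(probs)
```

Note that the term with the minimum cosine always receives probability 0 under this scheme.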

