6 Jun 2012 10:09
[gensim:1123] Re: LSA: from cosine to probability
Hi Joris,

it looks like you're passing your LSA objects (document history_lsa, corpus dict_lsa) as if they have dimensionality `len(dictionary)`. But that's the dimensionality of the bag-of-words representation (sparse, before the transform). After the LSA transformation, the dimensionality is only a few hundred, typically.

If this is really the cause of your problems, just fix it by `MatrixSimilarity(num_features=self.model.num_topics)`.

Best,
Radim

On Jun 6, 12:05 am, niefpaarschoenen <joris.pelem...@...> wrote:
> Hi all,
>
> I'm currently using gensim to calculate LSA term-document similarities
> expressed as a probability, based on a paper by Bellegarda*. To go
> from a cosine similarity between a term and a document ([-1:1]) to a
> probability ([0:1]), he calculates the cossim between all terms in the
> dictionary and the document, subtracts the minimum cossim and
> renormalizes by dividing with the sum of all the cossims. What I
> implemented so far should do just that, but calculating all the
> similarities takes either a long time or a lot of memory or simply
> doesn't work, so I guess I'm doing something wrong.
>
> Here's my code so far:
>
> def calcProb(self, word, history):
>     history_bow = self.dictionary.doc2bow(history.lower().split())
>     history_lsa = self.model[history_bow]
>     # here we want to compare EACH word with the history. this can be
>     # done in two ways:
>     # 1) use the rows of the lsa to convert words to latent space (more
>     #    correct and probably faster, but deeper into gensim code) TODO
>     # 2) transform each word as if it was a document to the latent space
>     #    (probably easiest solution)
>     def dict_bow():
>         # add the word first, so we know its index
>         yield self.dictionary.doc2bow([word])
>         for k in self.dictionary.token2id.keys():
>             if k != word:
>                 yield self.dictionary.doc2bow([k])
>     dict_lsa = self.model[dict_bow()]
>
>     # comparing all the terms with the history:
>     # MEMORY EATING VERSION
>     #index = similarities.MatrixSimilarity([history_lsa],num_features=len(self.dictionary))
>     #sims = index[dict_lsa]
>     # SLOW VERSION (3 mins)
>     #index = similarities.Similarity('tst', [history_lsa],num_features=len(self.dictionary))
>     #sims = index[dict_lsa]
>     # REVERSE COMPARISON = FASTER VERSION (1.5 mins)
>     index = similarities.Similarity('tst', [dict_lsa],num_features=len(self.dictionary))
>     sims = index[history_lsa]
>
>     return self.cosToProb(sims, 0)
>
> def cosToProb(self, cos, index, gamma=1):
>     cos_min = min(cos)
>     cos_shifted = (cos - cos_min)**gamma
>     sum_cos_shifted = sum(cos_shifted)
>     return cos_shifted[index] / sum_cos_shifted
>
> Since my data set is rather small (80k words), I would have thought
> using the simple MatrixSimilarity was no problem, but this uses a lot
> of memory. Switching to Similarity (in which case gensim uses
> SparseMatrixSimilarity) helps, but is still quite slow. Just for
> kicks, I thought about reversing the calculation and to my surprise
> this sped up things considerably (although I still wouldn't mind it
> being a bit faster, since I will have to do this a lot).
>
> So I guess my questions are:
> 1) Is my fastest implementation the best way to do this?
> 2) Why does reversing matter?
> 3) Why does MatrixSimilarity take up so much memory here?
>
> Thanks in advance,
>
> Joris
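
To put rough numbers on the dimensionality point, here is a back-of-the-envelope sketch in plain Python. It is only an estimate under stated assumptions: the 80k vocabulary comes from the question, but the 200 topics are a hypothetical stand-in for "a few hundred", and the 4 bytes per value assume the vectors end up as dense float32 arrays. `dense_bytes` is a helper made up for this illustration, not part of gensim.

```python
# Rough memory estimate for holding LSA term vectors as dense arrays.
# Assumptions (not from the thread): 200 topics, float32 (4 bytes per value).

BYTES_PER_VALUE = 4  # assumed float32 storage


def dense_bytes(num_vectors, num_features, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed for num_vectors dense vectors of length num_features."""
    return num_vectors * num_features * bytes_per_value


vocab_size = 80_000  # "80k words" from the question
num_topics = 200     # hypothetical value for "a few hundred"

# One vector per dictionary term, padded out to bag-of-words width.
padded = dense_bytes(vocab_size, vocab_size)
# The same vectors at their actual LSA width.
compact = dense_bytes(vocab_size, num_topics)

print(f"num_features=len(dictionary): {padded / 1e9:.1f} GB")
print(f"num_features=num_topics:      {compact / 1e6:.1f} MB")
print(f"ratio: {padded // compact}x")
```

With these assumed numbers the padded layout needs about 25.6 GB against 64 MB for the compact one, a factor of 400, which would be consistent with the memory blow-up described in the question.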