6 Jun 2012 16:37
[gensim:1128] Re: LSA: from cosine to probability
Oh, and about the memory -- I don't think the 1GB comes from Similarity. You can have a look at the `tst.*.npy` files that are created on disk -- the sum of their sizes will be the size of MatrixSimilarity in memory.

My guess is that some documents (text) don't get garbage collected, because of dangling references. But hard to say where, from the log.

-rr

On Jun 6, 4:08 pm, Radim Řehůřek <m...@...> wrote:
> Looks good to me Joris!
>
> Transforming 10k docs with LSA and indexing them takes about 7s. All
> 80k docs take ~1 minute. That sounds reasonable to me.
>
> Then the query at the end takes ~280ms. To make querying faster, you
> can:
>
> 1. use MatrixSimilarity
> 2. keep Similarity, but increase `shardsize` to ~80k (so all documents
>    land in a single shard)
> 3. use the "latest" gensim from github (there have been some
>    optimizations to Similarity lately)
>
> I think option 1) is the easiest :)
>
> Best,
> Radim
>
> On Jun 6, 3:34 pm, niefpaarschoenen <joris.pelem...@...> wrote:
>
> > Hi Radim,
>
> > On Jun 6, 12:37 pm, Radim Řehůřek <m... <at> radimrehurek.com> wrote:
>
> > > Do you keep other (large) objects in memory?
>
> > Well, of course I have the LSA model and the dictionary, but reading
> > them never increases memory usage over 200MB. I'm measuring this with
> > top by the way (copy paste below), so I don't know how accurate this
> > is.
>
> > 28890 jpeleman 20 0 1268m 1.1g 69m R 100.0 6.8 3:36.85 python
>
> > > > - Similarity with history index: the same
> > > > - Similarity with terms index: 1m04, 340MB
>
> > > 1 minute for a single query?? That's not normal :)
>
> > Well, it's not the query that takes a long time. It's 1min in its
> > entirety, with the bulk used for building the index.
>
> > > Going by what you describe as the goal, I'd index the `dict_lsa`
> > > corpus with MatrixSimilarity once, and then run individual queries
> > > (=`history_lsa`) against this index.
>
> > That's exactly what I'm doing (I think :-D) in the "fast version".
> > First I build the index of the 80k 1-word documents dict_lsa with
> > index = similarities.Similarity('tst', [dict_lsa],num_features=len(self.dictionary)),
> > then I query the 1-document corpus history_lsa against this index.
> > I hope this is somewhat clear?
>
> > Ah, but maybe you mean when I want to run calcProb more than once?
> > That's actually a very logical and good idea, yes. So far I was
> > only concerned with running it once and optimizing this already, but I
> > guess I should have looked at the big picture immediately.
>
> > > If you paste a log of your run, at DEBUG level and including
> > > timestamps, I will check which parts seem fishy or what to change:
>
> > OK, thanks, here it is (for the "fast version"):
>
> > <jpeleman <at> spchcl11:~/docs/lm/jppl> time test.py
> > reading term-topic matrix from file /users/spraak/jpeleman/docs/lm/semantics/results/bizz_docsize30_topics100.lsi
> > 2012-06-06 15:29:47,209 : INFO : loading Dictionary object from /users/spraak/jpeleman/docs/lm/semantics/results/bizz.dict
> > 2012-06-06 15:29:47,362 : DEBUG : loading LsiModel object from /users/spraak/jpeleman/docs/lm/semantics/results/bizz_docsize30_topics100.lsi
> > calculating LSA probability for 'wagen auto'
> > 2012-06-06 15:29:54,852 : INFO : PROGRESS: fresh_shard size=10000
> > 2012-06-06 15:30:02,035 : INFO : PROGRESS: fresh_shard size=20000
> > /usr/lib64/python2.6/site-packages/scipy/sparse/compressed.py:129: UserWarning: indices array has non-integer dtype (float64)
> >   % self.indices.dtype.name )
> > 2012-06-06 15:30:09,163 : INFO : PROGRESS: fresh_shard size=30000
> > 2012-06-06 15:30:11,133 : INFO : creating matrix for 32768 documents and 100 features
> > 2012-06-06 15:30:11,134 : DEBUG : PROGRESS: at document #0/32768
> > 2012-06-06 15:30:11,137 : DEBUG : PROGRESS: at document #1000/32768
> > 2012-06-06 15:30:11,140 : DEBUG : PROGRESS: at document #2000/32768
> > 2012-06-06 15:30:11,144 : DEBUG : PROGRESS: at document #3000/32768
> > 2012-06-06 15:30:11,147 : DEBUG : PROGRESS: at document #4000/32768
> > 2012-06-06 15:30:11,151 : DEBUG : PROGRESS: at document #5000/32768
> > 2012-06-06 15:30:11,154 : DEBUG : PROGRESS: at document #6000/32768
> > 2012-06-06 15:30:11,157 : DEBUG : PROGRESS: at document #7000/32768
> > 2012-06-06 15:30:11,160 : DEBUG : PROGRESS: at document #8000/32768
> > 2012-06-06 15:30:11,163 : DEBUG : PROGRESS: at document #9000/32768
> > 2012-06-06 15:30:11,166 : DEBUG : PROGRESS: at document #10000/32768
> > 2012-06-06 15:30:11,169 : DEBUG : PROGRESS: at document #11000/32768
> > 2012-06-06 15:30:11,172 : DEBUG : PROGRESS: at document #12000/32768
> > 2012-06-06 15:30:11,175 : DEBUG : PROGRESS: at document #13000/32768
> > 2012-06-06 15:30:11,178 : DEBUG : PROGRESS: at document #14000/32768
> > 2012-06-06 15:30:11,182 : DEBUG : PROGRESS: at document #15000/32768
> > 2012-06-06 15:30:11,185 : DEBUG : PROGRESS: at document #16000/32768
> > 2012-06-06 15:30:11,188 : DEBUG : PROGRESS: at document #17000/32768
> > 2012-06-06 15:30:11,191 : DEBUG : PROGRESS: at document #18000/32768
> > 2012-06-06 15:30:11,194 : DEBUG : PROGRESS: at document #19000/32768
> > 2012-06-06 15:30:11,196 : DEBUG : PROGRESS: at document #20000/32768
> > 2012-06-06 15:30:11,201 : DEBUG : PROGRESS: at document #21000/32768
> > 2012-06-06 15:30:11,203 : DEBUG : PROGRESS: at document #22000/32768
> > 2012-06-06 15:30:11,206 : DEBUG : PROGRESS: at document #23000/32768
> > 2012-06-06 15:30:11,209 : DEBUG : PROGRESS: at document #24000/32768
> > 2012-06-06 15:30:11,212 : DEBUG : PROGRESS: at document #25000/32768
> > 2012-06-06 15:30:11,215 : DEBUG : PROGRESS: at document #26000/32768
> > 2012-06-06 15:30:11,218 : DEBUG : PROGRESS: at document #27000/32768
> > 2012-06-06 15:30:11,221 : DEBUG : PROGRESS: at document #28000/32768
> > 2012-06-06 15:30:11,224 : DEBUG : PROGRESS: at document #29000/32768
> > 2012-06-06 15:30:11,226 : DEBUG : PROGRESS: at document #30000/32768
> > 2012-06-06 15:30:11,229 : DEBUG : PROGRESS: at document #31000/32768
> > 2012-06-06 15:30:11,232 : DEBUG : PROGRESS: at document #32000/32768
> > 2012-06-06 15:30:11,235 : INFO : creating dense shard #0
> > 2012-06-06 15:30:11,235 : INFO : saving index shard to tst.0
> > 2012-06-06 15:30:11,235 : INFO : storing MatrixSimilarity object to tst.0 and tst.0.npy
> > 2012-06-06 15:30:11,458 : INFO : PROGRESS: fresh_shard size=0
> > 2012-06-06 15:30:18,651 : INFO : PROGRESS: fresh_shard size=10000
> > 2012-06-06 15:30:25,837 : INFO : PROGRESS: fresh_shard size=20000
> > 2012-06-06 15:30:33,027 : INFO : PROGRESS: fresh_shard size=30000
> > 2012-06-06 15:30:35,013 : INFO : creating matrix for 32768 documents and 100 features
> > 2012-06-06 15:30:35,014 : DEBUG : PROGRESS: at document #0/32768
> > 2012-06-06 15:30:35,016 : DEBUG : PROGRESS: at document #1000/32768
> > 2012-06-06 15:30:35,019 : DEBUG : PROGRESS: at document #2000/32768
> > 2012-06-06 15:30:35,022 : DEBUG : PROGRESS: at document #3000/32768
> > 2012-06-06 15:30:35,026 : DEBUG : PROGRESS: at document #4000/32768
> > 2012-06-06 15:30:35,029 : DEBUG : PROGRESS: at document #5000/32768
> > 2012-06-06 15:30:35,032 : DEBUG : PROGRESS: at document #6000/32768
> > 2012-06-06 15:30:35,035 : DEBUG : PROGRESS: at document #7000/32768
> > 2012-06-06 15:30:35,037 : DEBUG : PROGRESS: at document #8000/32768
> > 2012-06-06 15:30:35,040 : DEBUG : PROGRESS: at document #9000/32768
> > 2012-06-06 15:30:35,044 : DEBUG : PROGRESS: at document #10000/32768
> > 2012-06-06 15:30:35,046 : DEBUG : PROGRESS: at document #11000/32768
> > 2012-06-06 15:30:35,049 : DEBUG : PROGRESS: at document #12000/32768
> > 2012-06-06 15:30:35,051 : DEBUG : PROGRESS: at document #13000/32768
> > 2012-06-06 15:30:35,054 : DEBUG : PROGRESS: at document #14000/32768
> > 2012-06-06 15:30:35,058 : DEBUG : PROGRESS: at document #15000/32768
> > 2012-06-06 15:30:35,060 : DEBUG : PROGRESS: at document #16000/32768
> > 2012-06-06 15:30:35,063 : DEBUG : PROGRESS: at document #17000/32768
> > 2012-06-06 15:30:35,066 : DEBUG : PROGRESS: at document #18000/32768
> > 2012-06-06 15:30:35,068 : DEBUG : PROGRESS: at document #19000/32768
> > 2012-06-06 15:30:35,072 : DEBUG : PROGRESS: at document #20000/32768
> > 2012-06-06 15:30:35,074 : DEBUG : PROGRESS: at document #21000/32768
> > 2012-06-06 15:30:35,077 : DEBUG : PROGRESS: at document #22000/32768
> > 2012-06-06 15:30:35,080 : DEBUG : PROGRESS: at document #23000/32768
> > 2012-06-06 15:30:35,083 : DEBUG : PROGRESS: at document #24000/32768
> > 2012-06-06 15:30:35,086 : DEBUG : PROGRESS: at document #25000/32768
> > 2012-06-06 15:30:35,089 : DEBUG : PROGRESS: at document #26000/32768
> > 2012-06-06 15:30:35,092 : DEBUG : PROGRESS: at document #27000/32768
> > 2012-06-06 15:30:35,095 : DEBUG : PROGRESS: at document #28000/32768
> > 2012-06-06 15:30:35,097 : DEBUG : PROGRESS: at document #29000/32768
> > 2012-06-06 15:30:35,100 : DEBUG : PROGRESS: at document #30000/32768
> > 2012-06-06 15:30:35,103 : DEBUG : PROGRESS: at document #31000/32768
> > 2012-06-06 15:30:35,106 : DEBUG : PROGRESS: at document #32000/32768
> > 2012-06-06 15:30:35,109 : INFO : creating dense shard #1
> > 2012-06-06 15:30:35,109 : INFO : saving index shard to tst.1
> > 2012-06-06 15:30:35,109 : INFO : storing MatrixSimilarity object to tst.1 and tst.1.npy
> > 2012-06-06 15:30:35,305 : INFO : PROGRESS: fresh_shard size=0
> > 2012-06-06 15:30:42,454 : INFO : PROGRESS: fresh_shard size=10000
> > 2012-06-06 15:30:47,893 : INFO : creating matrix for 17606 documents and 100 features
> > 2012-06-06 15:30:47,893 : DEBUG : PROGRESS: at document #0/17606
> > 2012-06-06 15:30:47,896 : DEBUG : PROGRESS: at document #1000/17606
> > 2012-06-06 15:30:47,899 : DEBUG : PROGRESS: at document #2000/17606
> > 2012-06-06 15:30:47,902 : DEBUG : PROGRESS: at document #3000/17606
> > 2012-06-06 15:30:47,905 : DEBUG : PROGRESS: at document #4000/17606
> > 2012-06-06 15:30:47,908 : DEBUG : PROGRESS: at document #5000/17606
> > 2012-06-06 15:30:47,911 : DEBUG : PROGRESS: at document #6000/17606
> > 2012-06-06 15:30:47,914 : DEBUG : PROGRESS: at document #7000/17606
> > 2012-06-06 15:30:47,917 : DEBUG : PROGRESS: at document #8000/17606
>
> ...
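
The on-disk check suggested at the top of this message can be sketched in a few lines of standard-library Python. This is only an illustration: the helper name `shard_bytes` is made up here, and the only thing taken from the thread is the `tst` prefix and the `tst.N.npy` naming visible in the log.

```python
import glob
import os


def shard_bytes(prefix):
    """Return the total size in bytes of all `<prefix>.*.npy` files.

    `shard_bytes` is a hypothetical helper, not part of gensim.
    """
    # The log shows each shard being stored as tst.0.npy, tst.1.npy, ...
    # so we match every file of the form <prefix>.<anything>.npy.
    paths = glob.glob(glob.escape(prefix) + ".*.npy")
    return sum(os.path.getsize(path) for path in paths)


if __name__ == "__main__":
    # "tst" is the output prefix used in the thread; adjust as needed.
    total = shard_bytes("tst")
    print("index shards on disk: %.1f MB" % (total / 1024.0 / 1024.0))
```

For scale, and assuming 4-byte floats: the three shards in the log (32768 + 32768 + 17606 documents, 100 features each) would come to roughly 33 MB, far below the ~1.1g that top reports.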