Radim Řehůřek | 1 Apr 2012 13:04

[gensim:1040] Re: online algorithm for generating lsi model

Hello dhruvg,

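Correct, the algorithm is online. The model does not update itself when
the corpus changes, though; you pass the new documents to it explicitly.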

The method that updates the model is this one:
http://radimrehurek.com/gensim/models/lsimodel.html#gensim.models.lsimodel.LsiModel.add_documents

It's also in the documentation of the class:
>>> import gensim
>>> help(gensim.models.LsiModel)

Note that adding many documents in one call is (much) more efficient
than adding them one at a time. It is possibly also more accurate:
if you add 1 document 1000 times, the online update machinery runs
1000 times, which gives floating point rounding errors more chances
to accumulate than a single update with all 1000 documents does.
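
For a stream, one way to follow this advice is to buffer incoming
documents and update once per batch. The helper below is a sketch, not
part of gensim: update_in_batches, its parameters and the stand-in
stream are hypothetical names chosen for illustration.

```python
from itertools import islice


def update_in_batches(stream, update, batch_size=1000):
    """Call update(batch) with lists of up to batch_size items from stream.

    Returns the number of update calls made.
    """
    iterator = iter(stream)
    calls = 0
    while True:
        # Pull the next batch; the last one may be shorter.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        update(batch)
        calls += 1
    return calls


# Stand-in demo: integers play the role of documents, and a list
# collects the batches instead of a real model.
received = []
calls = update_in_batches(range(2500), received.append, batch_size=1000)
batch_sizes = [len(batch) for batch in received]
```

With a real model, update would be lsi.add_documents and the stream
would yield bag-of-words vectors, so 2500 documents cost 3 updates
rather than 2500. add_documents also takes its own chunksize argument,
which controls how a corpus is split up within a single call.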

Best,
Radim

On Apr 1, 7:33 am, dhruvg <dhruv.g...@...> wrote:
> Since the input to LSI contains a reference to mm (the corpus), does the
> LSI model automatically update itself when mm is updated upon getting a new
> document from the stream?
>
> On Sunday, April 1, 2012 1:30:49 AM UTC-4, dhruvg wrote:
>
> > Upon further reading, the documentation says the LSA algorithm is online,
> > which answers my question. But I am unable to find the code which
> > demonstrates the online nature of LSA (i.e. the code which updates an
> > existing LSI model). Could someone point me to it?
>
> > Thanks.
>
> > On Sunday, April 1, 2012 1:23:43 AM UTC-4, dhruvg wrote:
>
> >> Hey -- I am new to gensim and am planning to use it for processing a
> >> busy stream of text documents. If I understand the documentation
> >> correctly, gensim allows the creation of a corpus from a non-repeating
> >> stream of documents. So this is good. My plan is to use the corpus to
> >> build an LSI model. From what I understand, there are two approaches
> >> to do this: one-pass and distributed. I am wondering whether it is
> >> possible to update this LSI model efficiently when I get a new
> >> document. Or do I have to re-create the entire LSI model every time I
> >> get a new document?
>
> >> For example, let's say I have 1000 documents. I create an LSI model
> >> called A. Then my stream reads another document. The corpus gets
> >> updated. Now do I have to see all 1001 documents again to create the
> >> LSI model? Or is there a smarter way where I can just update the
> >> existing LSI model to incorporate the new document?
>
> >> Thanks,
> >> Dhruv

