Radim Řehůřek | 1 Apr 2012 13:36

[gensim:1041] Re: online algorithm for generating lsi model

More concretely: a practical number of documents to add to the model
at a time is at least "requested dimensionality (=num_topics)" +
"internal oversampling (=100)". I'll add a notice to that effect to
the docs.

But if this is not possible for you, adding one document at a time
will work, too.
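
A minimal sketch of the batching rule above, in case it helps: it groups an incoming stream into batches of "num_topics + oversampling" documents and passes each batch to the model's add_documents() method. The helper names (batches, min_practical_batch, update_in_batches) are made up for illustration and are not part of gensim; the default of 100 is the oversampling figure mentioned above, and the model is assumed to be an already trained gensim.models.LsiModel (any object with an add_documents() method will do here).

```python
from itertools import islice


def batches(stream, batch_size):
    """Yield lists of up to batch_size documents from an iterable stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def min_practical_batch(num_topics, extra_samples=100):
    """Smallest practical batch: requested dimensionality + oversampling."""
    return num_topics + extra_samples


def update_in_batches(lsi, stream, num_topics, extra_samples=100):
    """Feed the stream to lsi.add_documents() one batch at a time.

    `lsi` is assumed to be an existing model exposing add_documents(),
    e.g. a gensim LsiModel. Returns the number of update calls made.
    The last batch may be smaller than the practical minimum.
    """
    size = min_practical_batch(num_topics, extra_samples)
    updates = 0
    for batch in batches(stream, size):
        lsi.add_documents(batch)
        updates += 1
    return updates
```

With a model built for 200 topics, update_in_batches(lsi, new_docs, num_topics=200) would call add_documents() once per 300 new documents rather than once per document.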

Best,
Radim

On Apr 1, 1:04 pm, Radim Řehůřek <m...@...> wrote:
> Hello dhruvg,
>
> correct, the algo is online.
>
> The method that updates the model is this one: http://radimrehurek.com/gensim/models/lsimodel.html#gensim.models.lsi...
>
> It's also in the documentation of the class:
>
> >>> import gensim
> >>> help(gensim.models.LsiModel)
>
> Note that adding more documents at once will be (much) more efficient
> than adding one document at a time. It's possibly also more accurate,
> because when you add 1 document 1000 times, the online algo machinery
> is run 1000x, giving it more chance to accumulate floating point
> rounding errors at machine precision, as opposed to adding 1000
> documents at once.
>
> Best,
> Radim
>
> On Apr 1, 7:33 am, dhruvg <dhruv.g...@...> wrote:
>
> > Since the input to LSI contains a reference to mm (the corpus), does the
> > LSI model automatically update itself when mm is updated upon getting a new
> > document from the stream?
>
> > On Sunday, April 1, 2012 1:30:49 AM UTC-4, dhruvg wrote:
>
> > > Upon further reading, the documentation says the LSA algorithm is online,
> > > which answers my question. But I am unable to find the code which
> > > demonstrates the online nature of LSA (i.e. the code which updates an
> > > existing LSI model). Could someone point me to it?
>
> > > Thanks.
>
> > > On Sunday, April 1, 2012 1:23:43 AM UTC-4, dhruvg wrote:
>
> > >> Hey -- I am new to gensim and am planning to use it for processing a
> > >> busy stream of text documents. If I understand the documentation
> > >> correctly, gensim allows the creation of a corpus from a non-repeating
> > >> stream of documents. So this is good. My plan is to use the corpus to
> > >> build an LSI model. From what I understand, there are two approaches
> > >> to do this: one-pass and distributed. I am wondering whether it is
> > >> possible to update this LSI model efficiently when I get a new
> > >> document. Or do I have to re-create the entire LSI model every time I
> > >> get a new document?
>
> > >> For example, let's say I have 1000 documents. I create an LSI model
> > >> called A. Then my stream reads another document. The corpus gets
> > >> updated. Now do I have to see all 1001 documents again to create the
> > >> LSI model? Or is there a smarter way where I can just update the
> > >> existing LSI model to incorporate the new document?
>
> > >> Thanks,
> > >> Dhruv

