1 Apr 2012 13:36
[gensim:1041] Re: online algorithm for generating lsi model
More concretely: a practical number of documents to add to the model at a time is at least "requested dimensionality (=num_topics)" + "internal oversampling (=100)". I'll add a notice to that effect to the docs.

But if this is not possible for you, adding one document at a time will work, too.

Best,
Radim

On Apr 1, 1:04 pm, Radim Řehůřek <m...@...> wrote:
> Hello dhruvg,
>
> correct, the algo is online.
>
> The method that updates the model is this one: http://radimrehurek.com/gensim/models/lsimodel.html#gensim.models.lsi...
>
> It's also in the documentation of the class:
>
> >>> import gensim
> >>> help(gensim.models.LsiModel)
>
> Note that adding more documents at once will be (much) more efficient
> than adding one document at a time. It's possibly also more accurate,
> because when you add 1 document 1000 times, the online algo machinery
> is run 1000x, giving it more chance to accumulate floating point
> rounding errors at machine precision, as opposed to adding 1000
> documents at once.
>
> Best,
> Radim
>
> On Apr 1, 7:33 am, dhruvg <dhruv.g...@...> wrote:
>
> > Since the input to LSI contains a reference to mm (the corpus), does the
> > LSI model automatically update itself when mm is updated upon getting a new
> > document from the stream?
>
> > On Sunday, April 1, 2012 1:30:49 AM UTC-4, dhruvg wrote:
>
> > > Upon further reading, the documentation says the LSA algorithm is online,
> > > which answers my question. But I am unable to find the code which
> > > demonstrates the online nature of LSA (i.e. the code which updates an
> > > existing LSI model). Could someone point me to it?
>
> > > Thanks.
>
> > > On Sunday, April 1, 2012 1:23:43 AM UTC-4, dhruvg wrote:
>
> > >> Hey -- I am new to gensim and am planning to use it for processing a
> > >> busy stream of text documents. If I understand the documentation
> > >> correctly, gensim allows the creation of a corpus from a non-repeating
> > >> stream of documents. So this is good. My plan is to use the corpus to
> > >> build an LSI model. From what I understand, there are two approaches
> > >> to do this: one-pass and distributed. I am wondering whether it is
> > >> possible to update this LSI model efficiently when I get a new
> > >> document. Or do I have to re-create the entire LSI model every time I
> > >> get a new document?
>
> > >> For example, let's say I have 1000 documents. I create an LSI model
> > >> called A. Then my stream reads another document. The corpus gets
> > >> updated. Now do I have to see all 1001 documents again to create the
> > >> LSI model? Or is there a smarter way where I can just update the
> > >> existing LSI model to incorporate the new document?
>
> > >> Thanks,
> > >> Dhruv
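
The batching advice in this thread (add at least num_topics + 100 documents per update rather than one at a time) could be sketched roughly as below. This is only an illustration, not gensim code: `feed_in_chunks` and its `update` callback are hypothetical names introduced here, and the default of 100 simply mirrors the "internal oversampling" figure quoted in the thread.

```python
from itertools import islice


def feed_in_chunks(stream, update, num_topics, oversampling=100):
    """Buffer a document stream and pass it to `update` in chunks.

    Hypothetical helper: each chunk holds num_topics + oversampling
    documents, the minimum practical batch size suggested in the thread.
    The final chunk may be smaller if the stream runs out.
    Returns the number of times `update` was called.
    """
    chunk_size = num_topics + oversampling
    documents = iter(stream)
    calls = 0
    while True:
        chunk = list(islice(documents, chunk_size))
        if not chunk:
            break
        update(chunk)  # with gensim this could be lsi.add_documents
        calls += 1
    return calls


# Stand-in for a real model update: just record each chunk's size.
sizes = []
calls = feed_in_chunks(range(1001), lambda chunk: sizes.append(len(chunk)),
                       num_topics=200)
print(calls, sizes)  # 4 [300, 300, 300, 101]
```

With a real model, `update` would presumably be the `add_documents` method of an existing `gensim.models.LsiModel` instance, and the stream would yield bag-of-words vectors instead of integers. The 1001 documents from the original question then cost four incremental updates rather than a rebuild over the whole corpus.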