paritosh ranjan | 17 Oct 11:10 2012

Re: Using model of mahout 0.7

Your first confusion matrix looks too good to be true, which suggests that there
may be a target leak or some other problem in the model.
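
One thing I notice in your commands below is that "testnb" was run on
data-train-vectors, i.e. on the training data itself. Evaluating on the held-out
split that your "split" command already produced would give a more realistic
picture; a rough sketch, reusing the paths from your own commands:

bin/mahout testnb -i ${WORK_DIR}/data-test-vectors -m ${WORK_DIR}/model \
  -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data-testing $c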

I wanted to suggest the ModelDissector, which you could use for analyzing the
NaiveBayes model; however, I just learned that the ModelDissector in Mahout does
not work for NaiveBayes, only for SGD. You might want to read this:
http://lucene.472066.n3.nabble.com/Training-vectors-for-classification-td2080345.html

As Ted suggests in the discussion at the link I have shared, creating a
ModelDissector for Naive Bayes would be a good place to contribute to Mahout,
and you could also use it to solve your problem.

On Wed, Oct 17, 2012 at 1:18 PM, Priyadarshan Raj <darshan786.iitk <at> gmail.com
> wrote:

> Hi paritosh,
>
> As you suggested, I ran seq2sparse with the following arguments:
>
> bin/mahout seq2sparse -i ${user-dir}/fact-seq -o ${user-dir}/fact-vectors \
>   -lnorm -nv -wt tfidf --maxDFSigma 3.0 --maxDFPercent 100 --minSupport 5
>
> but I am still getting the same result.
>
> You also suggested using --analyzerName, but I think Mahout itself uses
> "DefaultAnalyzer", which by default assigns StandardAnalyzer(Version.LUCENE_36)
> as its analyzer. I think there is a problem with the seq2sparse command. When I
> create a training and testing set from the same set of vectors using the "split"
> command and then run "testnb" on the test set, I get a correct confusion matrix.
> But when I separately create vectors from a subset of the training data, I get
> that "vertically aligned", entirely wrong confusion matrix.
> Thanks
>
> On Tue, Oct 16, 2012 at 6:39 PM, paritosh ranjan
> <paritoshranjan5 <at> gmail.com>wrote:
>
> > I am not an expert on Mahout's Naive Bayes, but since everyone seems to be
> > busy, I would like to point you toward a few things that you have not tried
> > yet and might want to try.
> >
> > In the seq2sparse command, try out --maxDFPercent and --minSupport. Both of
> > these options drop terms that are either too frequent (max) or not frequent
> > enough (min) across the collection of documents. They are useful for
> > automatically dropping common or very infrequent terms that add little value
> > to the calculation.
> >
> > Also try --analyzerName, an Apache Lucene analyzer class that can be used to
> > tokenize, stem, remove, or otherwise change the words in the document. This
> > helps get rid of common words and stems words so that similar words are
> > converted into the same form.
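> >
> > For example, a seq2sparse run with those options might look like this (just a
> > sketch; the threshold values are placeholders you would need to tune for your
> > data):
> >
> > bin/mahout seq2sparse -i ${WORK_DIR}/data-seq -o ${WORK_DIR}/data-vectors \
> >   -lnorm -nv -wt tfidf --maxDFPercent 85 --minSupport 5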
> >
> > Creating a model is a lot of trial and error, in my opinion. I suggest
> > exploring the different parameters provided by each Mahout command. I am sure
> > you will be able to move ahead.
> >
> > Good luck.
> >
> > On Tue, Oct 16, 2012 at 12:47 PM, rdarshan <darshan786.iitk <at> gmail.com
> > >wrote:
> >
> > > Hi,
> > > I am working on sentiment analysis of tweets, and I am using the Mahout
> > > Naive Bayes classifier for it. I created a directory "data", and inside
> > > "data" I created three more directories named "positive", "negative", and
> > > "uncertain". Then I put 151 files (151 MB total) in each of the positive,
> > > negative, and uncertain directories, and put the data directory into HDFS.
> > > Below is the set of commands I ran to generate the model and the labelindex
> > > from it:
> > >
> > > bin/mahout seqdirectory -i ${WORK_DIR}/data -o ${WORK_DIR}/data-seq
> > > bin/mahout seq2sparse -i ${WORK_DIR}/data-seq -o ${WORK_DIR}/data-vectors \
> > >   -lnorm -nv -wt tfidf
> > > bin/mahout split -i ${WORK_DIR}/data-vectors/tfidf-vectors \
> > >   --trainingOutput ${WORK_DIR}/data-train-vectors \
> > >   --testOutput ${WORK_DIR}/data-test-vectors \
> > >   --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
> > > bin/mahout trainnb -i ${WORK_DIR}/data-train-vectors -el -o ${WORK_DIR}/model \
> > >   -li ${WORK_DIR}/labelindex -ow $c
> > >
> > > I get the confusion matrix below after testing on the same set of data
> > > using the "testnb" command:
> > >
> > > bin/mahout testnb -i ${WORK_DIR}/data-train-vectors -m ${WORK_DIR}/model \
> > >   -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data-testing $c
> > >
> > > Confusion Matrix
> > > -------------------------------------------------------
> > > a       b       c       <--Classified as
> > > 151     0       0       |  151   a = negative
> > > 0       151     0       |  151   b = positive
> > > 0       0       151     |  151   c = uncertain
> > >
> > >
> > > Then I created another directory, "data2", in the same way and put some
> > > random data (a subset of the training data, 30 files totalling 30 MB in
> > > each) into the positive, negative, and uncertain directories inside it.
> > > Then I created vectors from it using the "seqdirectory" and "seq2sparse"
> > > commands given below:
> > >
> > > bin/mahout seqdirectory -i ${WORK_DIR}/data2 -o ${WORK_DIR}/data2-seq
> > > bin/mahout seq2sparse -i ${WORK_DIR}/data2-seq -o ${WORK_DIR}/data2-vectors \
> > >   -lnorm -nv -wt tfidf
> > >
> > > When I run "testnb" using the model/labelindex created from the previous
> > > set of data, with the command given below:
> > >
> > > bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors/part-r-00000 \
> > >   -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing $c
> > >
> > > I get a confusion matrix like this:
> > >
> > > Confusion Matrix
> > > -------------------------------------------------------
> > > a       b       c       <--Classified as
> > > 0       30      0       |  30    a = negative
> > > 0       30      0       |  30    b = positive
> > > 0       30      0       |  30    c = uncertain
> > >
> > > Can anyone tell me why this is happening? Am I using the correct way to
> > > test the model, or is it a bug in Mahout 0.7? If it is not the correct way,
> > > please suggest a way out of it.
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Using-model-of-mahout-0-7-tp4013891.html
> > > Sent from the Mahout User List mailing list archive at Nabble.com.
> > >
> >
>
