Brian Campbell | 7 Oct 19:18

Re: Clustering large data


I've recently been doing some exploratory data analysis that also involves cluster analysis, albeit on a
much smaller dataset.  Quite a few packages (e.g. ecodist, vegan, pvclust) include
functions for cluster analysis, but have you, or anyone else on here, looked at alternative
clustering methods with bootstrap permutation tests of the nodes?  I've done this with pvclust, but I don't
recall the function having an argument of the form method="average".
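
For what it's worth, the pvclust documentation describes the linkage as being set through method.hclust rather than method, with "average" among the accepted values. A minimal sketch follows, assuming the pvclust package is installed; the random matrix, the column names, and the small nboot are purely illustrative.

```r
# Toy data: pvclust clusters the COLUMNS of its input, so the objects
# to be clustered (here ten hypothetical sites) go in columns.
set.seed(1)
x <- matrix(rnorm(200), nrow = 20,
            dimnames = list(NULL, paste0("site", 1:10)))

# Run only if pvclust is available; nboot is kept small for speed.
if (requireNamespace("pvclust", quietly = TRUE)) {
  fit <- pvclust::pvclust(x,
                          method.hclust = "average",    # average linkage (UPGMA)
                          method.dist   = "correlation",
                          nboot         = 100)
  print(fit)
}
```

Calling plot(fit) afterwards should draw the dendrogram with the bootstrap values at each node.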

Brian

> To: tyler.smith@...
> From: Farrar.David@...
> Date: Tue, 7 Oct 2008 09:56:15 -0400
> CC: r-sig-ecology-bounces@...; r-sig-ecology@...
> Subject: Re: [R-sig-eco] Clustering large data
> 
> Thierry, 
> 
>  Search of CRAN with "sparse clustering" yielded cluster.dist {cba}, 
> defined as "Clustering a Sparse Symmetric Distance Matrix".  There were 
> also sparse PCA packages and sparse matrix classes.  I have no experience 
> with these procedures. 
> 
> As additional background, you might like to say what kind of clustering 
> you want to do and whether some particular similarity/distance will be 
> involved. 
> Does your cluster analysis program take a data frame as input? 
> 
> However, it sounds like you are having problems with preliminary data 
> processing, and may not yet know whether some cluster analysis procedure 
> or other would choke on your matrix, once it is computed. 
> 
> It does seem surprising that you are having problems with a problem of 
> this size.  I assume you have checked that you have a couple G or so free, 
> at least. 
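
A quick back-of-the-envelope check of the size in question, as a sketch: it counts only the raw storage of one dense matrix and ignores any temporary copies made while reshaping.

```r
n_sites   <- 6354
n_species <- 1381

n_cells   <- n_sites * n_species   # cells in the dense site-by-species matrix
bytes_dbl <- n_cells * 8           # stored as numeric (double)
bytes_int <- n_cells * 4           # stored as integer

# Approximate size in MiB for each storage mode
size_mib <- round(c(double = bytes_dbl, integer = bytes_int) / 1024^2, 1)
print(size_mib)
```

Either way the finished matrix is on the order of tens of megabytes, which suggests that the intermediate objects created during the reshape, not the result itself, are what exhaust memory.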
> 
> Farrar 
> 
> 
> 
> 
> r-sig-ecology-bounces@... wrote on 10/07/2008 08:35:39 AM:
> 
> > "ONKELINX, Thierry" <Thierry.ONKELINX@...>
> > writes:
> > 
> > > Dear all,
> > >
> > > We have a problem with a large dataset that we want to cluster. The
> > > dataset is in a long format: 1154024 rows with presence data. Each row
> > > has the name of the species and the location. We have 1381 species and
> > > 6354 locations.
> > > The main problem is that we need the data in wide format (one row for
> > > each location, one column for each species) for the clustering
> > > algorithms. But the 6354 x 1381 dataframe is too big to fit into the
> > > memory. At least when we use cast from the reshape package to convert
> > > the dataframe from a long to a wide format.
> > >
> > > Are there any clustering tools available that can work with the data in
> > > a long format or with sparse matrices (only 13% of the matrix is
> > > non-zero)? If they work with sparse matrices: how to convert our dataset
> > > to a sparse matrix? Other suggestions are welcome.
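
On the conversion question, one possible route is sparseMatrix() from the Matrix package, which builds the matrix directly from row and column indices. The sketch below uses a toy data frame; the column names location and species are assumptions, not the names in the real dataset.

```r
library(Matrix)

# Toy stand-in for the long-format presence data (hypothetical column names).
long <- data.frame(
  location = c("A", "A", "B", "C", "C", "C"),
  species  = c("sp1", "sp3", "sp2", "sp1", "sp2", "sp3"),
  stringsAsFactors = FALSE
)
long <- unique(long)  # duplicate (location, species) pairs would otherwise be summed

loc <- factor(long$location)
spp <- factor(long$species)

# One stored entry per presence record; absent cells take no memory.
m <- sparseMatrix(
  i        = as.integer(loc),
  j        = as.integer(spp),
  x        = 1,
  dims     = c(nlevels(loc), nlevels(spp)),
  dimnames = list(levels(loc), levels(spp))
)
print(m)
```

Many clustering functions still expect a dense matrix or a dist object, so this mainly sidesteps the reshape step; as.matrix(m) gives the dense form when a function needs it.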
> > >
> > 
> > 6354 x 1381 should be well within your memory limit, so I assume it's
> > the intermediate steps that are fouling you up. Maybe you can do it in
> > pieces: 
> > 
> > 1. subset the original two-column matrix to include only the first 100 sites
> > 2. convert this subset to wide form
> > 3. repeat 63 times for different subsets
> > 4. rbind the resulting matrices
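
The steps above can be sketched in base R roughly as follows. The toy data, the column names location and species, and the chunk size of 2 are all illustrative assumptions; the real data would use chunks of 100 sites. Fixing the species levels before splitting keeps the columns identical across chunks, so the pieces can be stacked with rbind.

```r
# Toy long-format data (hypothetical column names).
long <- data.frame(
  location = c("A", "A", "B", "C", "C", "D"),
  species  = c("sp1", "sp3", "sp2", "sp1", "sp2", "sp3"),
  stringsAsFactors = FALSE
)

# Fix the species levels up front so every chunk has the same columns.
species_levels <- sort(unique(long$species))
sites          <- sort(unique(long$location))

chunk_size <- 2  # 100 in the suggestion above
chunks <- split(sites, ceiling(seq_along(sites) / chunk_size))

pieces <- lapply(chunks, function(s) {
  sub <- long[long$location %in% s, ]
  tab <- table(
    factor(sub$location, levels = s),
    factor(sub$species,  levels = species_levels)
  )
  # Convert the contingency table to a plain 0/1 integer matrix.
  matrix(as.integer(tab > 0), nrow = nrow(tab),
         dimnames = list(s, species_levels))
})

# Stack the per-chunk matrices into one site-by-species matrix.
wide <- do.call(rbind, unname(pieces))
print(wide)
```

Because table() works on the two columns directly, a single call on the full dataset may well fit in memory too, in which case the chunking is unnecessary.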
> > 
> > Good luck,
> > 
> > Tyler
> > 
> > -- 
> > Watching a recorded television broadcast more than once will be illegal
> > under Bill C-61. 
> > 
> > http://www.michaelgeist.ca/content/view/3046/125/
> > 
> > _______________________________________________
> > R-sig-ecology mailing list
> > R-sig-ecology@...
> > https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
