7 Dec 21:06 2012

## Re: Jaccard distance?

Jake Vanderplas <vanderplas <at> astro.washington.edu>

2012-12-07 20:06:07 GMT

2012-12-07 20:06:07 GMT

Hi, Jaccard distance is a dissimilarity metric between two sets. I think the confusion here is how the sets are specified. The definition of Jaccard distance is: J(A, B) = 1 - |A intersect B| / |A union B| so if A = {1, 2, 3, 4} and B = {1, 2, 4, 3}, then J(A, B) = 0. Recall that order does not matter for sets. Sets can also be encoded as an ordered list of binary variables: the above sets (zero-indexed) could be represented by A = [0, 1, 1, 1, 1]; B = [0, 1, 1, 1, 1]. In this case, order matters, and the distance can be specified by J_bin(A, B) = N_unequal(A, B) / N_nonzero(A, B) and we recover J_bin(A, B) = 0 as above. As a more complicated example, if A = [1, 0, 0, 1] and B = [1, 0, 1, 0], then N_unequal(A, B) = 2 N_nonzero(A, B) = 3 (only a single index has zero for both A and B) and J_bin(A, B) = 2/3 Where things get a bit messy is that scipy & octave extend this binary notion of the Jaccard distance to arbitrary numbers. So if A = [1, 2, 3, 4] and B = [1, 2, 4, 3], then N_unequal(A, B) = 2 N_nonzero(A, B) = 4 and J_ext(A, B) = 1/2 I'm not sure whether this results in a true metric - that would be interesting to figure out. This seems to be the issue in ticket 1774: the user expected the metric to operate as if A and B are unordered sets, while the function actually operates as if they're ordered lists of (extended) binary variables. Hope that helps, Jake On 12/07/2012 11:36 AM, Pauli Virtanen wrote: > Hi, > > Does someone know about what is the "correct" definition for the Jaccard > distance? > > There's a bug report that claims that the current behavior is wrong: > > http://projects.scipy.org/scipy/ticket/1774 > > However, as far as I see, the result is exactly what the docstring says > and our result agrees with Octave. >