Jake Vanderplas | 7 Dec 21:06 2012

Re: Jaccard distance?

Hi,
Jaccard distance is a dissimilarity metric between two sets.  I think 
the confusion here is how the sets are specified.  The definition of 
Jaccard distance is:

   J(A, B) = 1 -  |A intersect B| / |A union B|

so if A = {1, 2, 3, 4} and B = {1, 2, 4, 3},
then J(A, B) = 0.  Recall that order does not matter for sets.

Sets can also be encoded as an ordered list of binary variables: the 
above sets (zero-indexed) could
be represented by A = [0, 1, 1, 1, 1]; B = [0, 1, 1, 1, 1].  In this 
case, order matters, and the distance can be specified by

   J_bin(A, B) = N_unequal(A, B) / N_nonzero(A, B)

and we recover J_bin(A, B) = 0 as above.

As a more complicated example, if A = [1, 0, 0, 1] and B = [1, 0, 1, 0], 
then
   N_unequal(A, B) = 2
   N_nonzero(A, B) = 3 (only a single index has zero for both A and B)
and
   J_bin(A, B) = 2/3

Where things get a bit messy is that scipy & octave extend this binary 
notion of the Jaccard distance to arbitrary numbers. So if A = [1, 2, 3, 
4] and B = [1, 2, 4, 3], then
   N_unequal(A, B) = 2
   N_nonzero(A, B) = 4
and
   J_ext(A, B) = 1/2

I'm not sure whether this results in a true metric - that would be 
interesting to figure out.

This seems to be the issue in ticket 1774: the user expected the metric 
to operate as if A and B are unordered sets, while the function actually 
operates as if they're ordered lists of (extended) binary variables.

Hope that helps,
    Jake

On 12/07/2012 11:36 AM, Pauli Virtanen wrote:
> Hi,
>
> Does someone know about what is the "correct" definition for the Jaccard
> distance?
>
> There's a bug report that claims that the current behavior is wrong:
>
>      http://projects.scipy.org/scipy/ticket/1774
>
> However, as far as I see, the result is exactly what the docstring says
> and our result agrees with Octave.
>

Gmane