Paul Cowan (JIRA) | 1 Sep 04:31 2008

[jira] Updated: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Paul Cowan updated LUCENE-1372:

    Attachment: lucene-multisort.patch

Attached is a patch that handles this for Strings, with a test case. This is a proof of concept; if people are
happy with the approach, I'll implement it for the other types (float, int, etc.), as I think it makes sense there too.

> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>                 Key: LUCENE-1372
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.2
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: lucene-multisort.patch
> At the moment, FieldCacheImpl produces somewhat disconcerting results when sorting on a field for which
multiple values exist for one document. For example, imagine a field "fruit" which is added to a document
multiple times, with the values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> If one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue() (and
similarly for the other methods in the various FieldCacheImpl caches) does the following:
>           while (termDocs.next()) {
>             retArray[termDocs.doc()] = t;
>           }
> which means that we loop over the terms in their natural order and, for each one, overwrite retArray[doc]
for every document containing that term. Effectively, this overwriting means that a string sort in
this circumstance will sort by the LAST term lexicographically, so the docs above will effectively be
sorted as if they had the single values ("apple", "banana", "banana", "zebra"), which is nonintuitive.
Changing this to sort on the first term in the TermEnum seems relatively trivial and low-overhead; while
it's not perfect (it's not locale-aware, for example), the behaviour seems much more sensible to me.
Interested to see what people think.
> Patch to follow.
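
To make the difference concrete, here is a minimal, self-contained sketch of the two behaviours described above. It does not use Lucene and is not the attached patch: the class MultiValueSortSketch, the postings() map and the sortKeys() helper are hypothetical stand-ins for the TermEnum/TermDocs loop, assuming terms are visited in their natural (sorted) order.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class MultiValueSortSketch {

    // Hypothetical stand-in for the index: term -> ids of docs containing it.
    // A TreeMap iterates terms in natural order, as the term enumeration does.
    static SortedMap<String, int[]> postings() {
        SortedMap<String, int[]> p = new TreeMap<String, int[]>();
        p.put("apple", new int[] {1, 3, 4});
        p.put("banana", new int[] {2, 3});
        p.put("zebra", new int[] {4});
        return p;
    }

    // Returns the effective sort key per doc id (index 0 is unused).
    // firstTermWins == false mimics the unconditional overwrite in the loop
    // quoted above; true keeps the first term seen for each doc instead.
    static String[] sortKeys(SortedMap<String, int[]> postings, int maxDoc,
                             boolean firstTermWins) {
        String[] keys = new String[maxDoc + 1];
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                if (firstTermWins && keys[doc] != null) {
                    continue; // doc already has a key from an earlier term
                }
                keys[doc] = entry.getKey();
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        String[] last = sortKeys(postings(), 4, false);
        String[] first = sortKeys(postings(), 4, true);
        for (int doc = 1; doc <= 4; doc++) {
            System.out.println("doc " + doc + ": last-term-wins=" + last[doc]
                    + ", first-term-wins=" + first[doc]);
        }
    }
}
```

With the four example docs, the overwrite variant yields the keys apple, banana, banana, zebra, while the first-term variant yields apple, banana, apple, apple, so docs 3 and 4 sort alongside doc 1.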


This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.