From: Paul Cowan (JIRA) <jira <at> apache.org>
Subject: [jira] Updated: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term
Newsgroups: gmane.comp.jakarta.lucene.devel
Date: Monday 1st September 2008 02:31:44 UTC
[ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Cowan updated LUCENE-1372:
-------------------------------

    Attachment: lucene-multisort.patch

Patch which deals with this in the case of Strings, with a test case. This
is a proof of concept; if people are happy with the approach I'll implement
it for the other types (float, int, etc.), as I think it makes sense there
too.

> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1372
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1372
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.2
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl behaves somewhat disconcertingly when
> sorting on a field for which one document has multiple values. For
> example, imagine a field "fruit" which is added to a document multiple
> times, with the values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> If one sorts on the field "fruit", the loop in
> FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the
> other methods in the various FieldCacheImpl caches) does the following:
>           while (termDocs.next()) {
>             retArray[termDocs.doc()] = t;
>           }
> which means that we loop over the terms in their natural order and, for
> each one, overwrite retArray[doc] with that term for every document
> containing it. Effectively, this overwriting means that a string sort in
> this circumstance will sort by the LAST term lexicographically, so the
> docs above will effectively be sorted as if they had the single values
> ("apple", "banana", "banana", "zebra"), which is nonintuitive. Changing
> this to sort on the first term in the TermEnum seems relatively trivial
> and low-overhead; while it's not perfect (it's not locale-aware, for
> example), the behaviour seems much more sensible to me. Interested to see
> what people think.
> Patch to follow.
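
To make the overwrite behaviour described above concrete, here is a small
standalone simulation. It is only a sketch and not the attached patch: it
does not use any Lucene classes, and the names MultiValueSortSketch,
lastTermWins, firstTermWins and fruitPostings are hypothetical. A TreeMap
stands in for the TermEnum (terms in natural order) and each int[] stands
in for the TermDocs of that term.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the FieldCacheImpl loop; no Lucene classes are used.
public class MultiValueSortSketch {

    // Current behaviour: every term unconditionally overwrites the slot,
    // so the lexicographically LAST term of each document survives.
    static String[] lastTermWins(TreeMap<String, int[]> postings, int maxDoc) {
        String[] retArray = new String[maxDoc];
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                retArray[doc] = e.getKey();
            }
        }
        return retArray;
    }

    // Proposed behaviour (assumed shape of the fix): only fill an empty slot,
    // so the lexicographically FIRST term of each document survives.
    static String[] firstTermWins(TreeMap<String, int[]> postings, int maxDoc) {
        String[] retArray = new String[maxDoc];
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                if (retArray[doc] == null) {
                    retArray[doc] = e.getKey();
                }
            }
        }
        return retArray;
    }

    // The "fruit" example from the issue: term -> ids of the docs containing it.
    static TreeMap<String, int[]> fruitPostings() {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {1, 3, 4});
        postings.put("banana", new int[] {2, 3});
        postings.put("zebra", new int[] {4});
        return postings;
    }

    public static void main(String[] args) {
        // Index 0 is unused because the example numbers its docs from 1.
        System.out.println(Arrays.toString(lastTermWins(fruitPostings(), 5)));
        System.out.println(Arrays.toString(firstTermWins(fruitPostings(), 5)));
    }
}
```

With the unconditional overwrite, docs 1 to 4 end up keyed on "apple",
"banana", "banana", "zebra", as in the description; with the guarded
assignment they are keyed on "apple", "banana", "apple", "apple".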

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
 