2 Mar 2009 06:27
Re: Tag-based filesystem with xapian, advice?
Olly Betts <olly <at> survex.com>
2009-03-02 05:27:07 GMT
On Sat, Feb 28, 2009 at 10:21:15PM +0100, Karel Marissens wrote:
> Now my 1st question is, what is the path of a file? The content of a
> document? A term? A value? I need to be able to use the path when
> searching as I need to be able to limit the file-results to files in a
> certain directory. Thus for example, only files that have a path of
> /photo's/2008/*. Or do I have to work with a relevance-set or something?

I would put the path in the document data for reading when you get
results, and also index all the directories which the file is in as
terms (e.g. P/photo's and P/photo's/2008 for a file in /photo's/2008).

> I tried using the path as a tag itself, but when I do a query for
> "/photo's/2008/*", it is automatically translated to 2 separate terms I
> think? (a file tagged as 2008 also showed up for example)

I don't think you want to use QueryParser here - just build your Query
objects up by hand.

If you want to allow "free text queries" after +FIND, then you can parse
that part with QueryParser and then filter the result using the
appropriate "P"-prefixed term, e.g. in C++:

    Xapian::Query q = qp.parse_query(query_string);
    q = Xapian::Query(Xapian::Query::OP_FILTER, q, Xapian::Query("P/photo's"));

> My 2nd question is, what is the easiest way to get a list of all the
> tags associated with files in the resultset? I want to have a list of
> all tags associated with files in /photo's/2008. One method would be
> to do a search for all files in /photo's/2008, or any subdirectory,
> loop all the results, and per document, loop the terms associated with
> it and add these to a list.

You can add all documents in the MSet to an RSet and use
Enquire::get_eset() to get a set of all the terms in all the documents.
That's not so different to what you describe, though Xapian does most of
the work for you, including eliminating duplicates.

If you just want the "tag" terms (and not P/photo's, etc), you can use
an "ExpandDecider" to only pick out those.
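
To make the term scheme above concrete, here is a minimal sketch in plain
C++ with no Xapian calls. Both function names are hypothetical, and it
assumes absolute, normalised paths (no doubled or trailing slashes) and
that no tag ever starts with "P/". directory_terms() builds the
"P"-prefixed directory terms to add to a document at index time, and
is_tag_term() is the kind of test an ExpandDecider subclass could
delegate to when picking out only the tags.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: one "P"-prefixed term per ancestor directory of an
// absolute file path. A file directly in "/" yields no terms.
std::vector<std::string> directory_terms(const std::string& path) {
    std::vector<std::string> terms;
    // Start at index 1 to skip the leading '/'; every later '/' marks the
    // end of a directory name, so the text before it is a directory path.
    for (std::string::size_type pos = path.find('/', 1);
         pos != std::string::npos;
         pos = path.find('/', pos + 1)) {
        terms.push_back("P" + path.substr(0, pos));
    }
    return terms;
}

// Hypothetical predicate: true for terms that are not directory terms.
// Assumes tags never begin with "P/".
bool is_tag_term(const std::string& term) {
    return term.compare(0, 2, "P/") != 0;
}
```

For a file at /photo's/2008/a.jpg, directory_terms() gives P/photo's and
P/photo's/2008, matching the example above. A file directly in "/" gets
no directory terms at all, so a "list everything" request has to be
handled separately rather than by filtering on a "P" term.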

I'd suggest for efficiency that you might want to consider adding a
special case for "/" and use Database::allterms_begin() to iterate over
all the terms in the database.

> My 3rd question is how I can get ALL results? get_mset() requires a
> maximum number of results. Do I just set it to an extremely big number
> and see it as a safety-limitation that shouldn't be reached?

If you can handle result sets of any size, just pass db.get_doccount() -
there can't be more matching documents than there are documents in the
database.

I'll add a note to the documentation comment for Enquire::get_mset() as
this has been asked a few times before.

Cheers,
    Olly