2 Mar 2009 22:50
Re: Tag-based filesystem with xapian, advice?
Karel Marissens <karel.marissens <at> gmail.com>
2009-03-02 21:50:34 GMT
Olly,

Thank you for your answers. I haven't had time to test it all yet, but I
looked at the API for your answer to my 2nd question and it's not
entirely clear to me yet.

First of all, how do I go from an MSet to an RSet? Is there a built-in
method I'm overlooking?

As you guessed, I want to eliminate path terms from my tag lists. I see
that the ExpandDecider can accept a term to ignore, so do I need to loop
over all terms, check for a '/' and, if found, add the term to the
decider? Or is the last paragraph of your answer to this question not
directly related? I'm a little confused.

I'm also curious how you think this search for available tags would
perform compared to an RDBMS solution. In an RDBMS solution with 3 tables
(files, files_tags and tags), there could be a LIKE '/path/%' on the
files table to find relevant files (I believe an index can be used for
the LIKE), a join with the files_tags table, a join with the tags table
and finally a GROUP BY on the found tags. But I have no idea whether that
performs better or worse than the Xapian way.

Another requirement that I (probably) have is to be able to add a term to
the database without actually adding it to a file yet. Is this possible?
Of course, I can always use an empty document which has all terms
(except paths)...

Lastly, do you have any idea if there's Python documentation similar to
the API documentation for C++ (see link below)? Or can it be generated
somehow? I did find the Python bindings page and everything seems to be
about the same as the C++ API, but still it would be handy...

http://xapian.org/docs/apidoc/html/classes.html

Thanks in advance,
Karel

On 02 Mar 2009, at 06:27, Olly Betts wrote:

> On Sat, Feb 28, 2009 at 10:21:15PM +0100, Karel Marissens wrote:
>> Now my 1st question is, what is the path of a file? The content of a
>> document? A term? A value? I need to be able to use the path when
>> searching as I need to be able to limit the file-results to files in a
>> certain directory.
>> Thus for example, only files that have a path of
>> /photo's/2008/*. Or do I have to work with a relevance-set or
>> something?
>
> I would put the path in the document data for reading when you get
> results, and also index all the directories which the file is in as
> terms (e.g. P/photo's and P/photo's/2008 for a file in /photo's/2008).
>
>> I tried using the path as a tag itself, but when I do a query for
>> "/photo's/2008/*", it is automatically translated to 2 separate
>> terms I think? (a file tagged as 2008 also showed up for example)
>
> I don't think you want to use QueryParser here - just build your Query
> objects up by hand.
>
> If you want to allow "free text queries" after +FIND, then you can parse
> that part with QueryParser and then filter the result using the
> appropriate "P"-prefixed term, e.g. in C++:
>
>     Xapian::Query q = qp.parse_query(query_string);
>     q = Xapian::Query(Xapian::Query::OP_FILTER, q,
>                       Xapian::Query("P/photo's"));
>
>> My 2nd question is, what is the easiest way to get a list of all the
>> tags associated with files in the resultset? I want to have a list of
>> all tags associated with files in /photo's/2008. One method would be
>> to do a search for all files in /photo's/2008, or any subdirectory,
>> loop all the results, and per document, loop the terms associated with
>> it and add these to a list.
>
> You can add all documents in the MSet to an RSet and use
> Enquire::get_eset() to get a set of all the terms in all the documents.
> That's not so different to what you describe, though Xapian does most
> of the work for you, including eliminating duplicates.
>
> If you just want the "tag" terms (and not P/photo's, etc), you can use
> an "ExpandDecider" to only pick out those.
>
> I'd suggest for efficiency that you might want to consider adding a
> special case for "/" and use Database::allterms_begin() to iterate over
> all the terms in the database.
>
>> My 3rd question is how I can get ALL results?
>> get_mset() requires a maximum number of results. Do I just set it to
>> an extremely big number and see it as a safety-limitation that
>> shouldn't be reached?
>
> If you can handle result sets of any size, just pass db.get_doccount() -
> there can't be more matching documents than there are documents in the
> database.
>
> I'll add a note to the documentation comment for Enquire::get_mset() as
> this has been asked a few times before.
>
> Cheers,
> Olly
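
The directory-term scheme from the quoted reply (P/photo's and
P/photo's/2008 for a file in /photo's/2008) could be sketched roughly as
below. This is only an illustration in plain C++ with no Xapian calls.
The helper names directory_terms and is_path_term are made up here, and
the sketch assumes absolute paths and that no tag starts with "P/".

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Xapian): returns one "P"-prefixed term
// for every directory that contains the file at `path`.
std::vector<std::string> directory_terms(const std::string& path) {
    std::vector<std::string> terms;
    // Start searching at index 1 so the leading '/' of an absolute path
    // does not produce an empty directory name.
    for (std::string::size_type pos = path.find('/', 1);
         pos != std::string::npos;
         pos = path.find('/', pos + 1)) {
        terms.push_back("P" + path.substr(0, pos));
    }
    return terms;
}

// Hypothetical helper: true if `term` looks like one of the path terms
// generated above rather than a tag. Assumes no tag begins with "P/".
bool is_path_term(const std::string& term) {
    return term.size() >= 2 && term[0] == 'P' && term[1] == '/';
}
```

For a made-up file /photo's/2008/beach.jpg, directory_terms yields
P/photo's and P/photo's/2008, which would be added to the document as
terms at index time. A check along the lines of is_path_term is the kind
of test an ExpandDecider subclass might apply to keep path terms out of
the tag list.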