Monday, March 23, 2009

Running a text search engine on the N810: CLucene

A search engine is, from a high-level view, a computer program that comprises at least two tools: an indexer and a "searcher", both working over some given content. Going a bit deeper, the indexer is responsible for identifying all of the terms in a document or corpus and building a table (or some similar structure) that indicates where the terms are used. Basically, it maps each term to the documents in which it appears. The searcher, on the other hand, is what makes it possible to perform quick queries over this indexed content.
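
To make the indexer/searcher split concrete, here is a minimal sketch of a term-to-documents table (an inverted index) in Python. It is only an illustration of the idea, not how CLucene is implemented; the function names, the tokenizer and the sample documents are all made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Indexer: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Searcher: return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample corpus, just for illustration.
docs = {
    1: "Maemo runs on the N810",
    2: "CLucene is a C++ port of Lucene",
    3: "Lucene powers search on many platforms",
}
index = build_index(docs)
print(sorted(search(index, "lucene")))         # [2, 3]
print(sorted(search(index, "lucene search")))  # [3]
```

The point of the table is that a query only touches the entries for its own terms, instead of scanning every document again.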

Well, what about all this in a maemo context?

Scenario

Lately, André Pedralho (a co-worker at INdT) and I were asked, for a product we are working on, to benchmark a text-based search engine on an ARM platform, with the final goal of finding a feasible candidate to be embedded into a web browser.

After some research into good, stable and open source search engines, CLucene (a C++ port of the popular Java Lucene) emerged as a promising option, so we ran a couple of tests against it to check its feasibility, as described below.

Environment and requirements:
  • scratchbox set up with a rootstrap and compilers for building for maemo platforms, as described here.
  • a patch to be applied against CLucene (only some files in clucene/src/demo change), which makes it possible to run some stress tests on it.

Benchmark (stress test)

To perform the tests, a 16 MB text corpus was indexed (using CLucene itself), resulting in an 18 MB index ready to use.

In the patched CLucene demo, there are 850 words, each of which is used programmatically to query the indexed base. We measured the time taken to complete the batch, as well as the reliability, memory leakage, and CPU and RAM usage of CLucene during this stress test.
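
The shape of such a batch test can be sketched roughly as below. This is not the patched demo itself (that one is C++ and talks to CLucene directly); `run_batch` and `fake_search` are hypothetical names, and counting a query as "failed" when the search call raises an exception is an assumption made only for this sketch. Leak and CPU/RAM observation are left out, since those were done with external tools.

```python
import time


def run_batch(search_fn, queries):
    """Run every query through search_fn, timing the whole batch.

    Assumption for this sketch: a query "fails" when search_fn raises.
    """
    succeeded = 0
    failed = 0
    start = time.perf_counter()
    for query in queries:
        try:
            search_fn(query)
            succeeded += 1
        except Exception:
            failed += 1
    elapsed = time.perf_counter() - start
    average = elapsed / len(queries) if queries else 0.0
    return {
        "succeeded": succeeded,
        "failed": failed,
        "elapsed_seconds": elapsed,
        "average_seconds": average,
    }


def fake_search(query):
    """Hypothetical stand-in for a real search call; fails on one word."""
    if query == "boom":
        raise ValueError("simulated query failure")
    return [query]


report = run_batch(fake_search, ["maemo", "lucene", "boom"])
print(report["succeeded"], report["failed"])  # 2 1
```

Timing the batch as a whole, rather than each query separately, keeps the measurement overhead small compared to the work being measured.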

Results (from an N810 device running chinook):
  1. 850 queries were executed in around 45 seconds (so it took on average about 0.05 seconds per query). It is also interesting to mention that the run on ARM was 5 seconds slower than on my desktop machine, which has a dual-core processor, 4 GB of RAM and so on.
  2. Out of 850 queries, 80 failed (for some reason) and 770 succeeded.
  3. No crashes observed.
  4. Minor leaks (a few kilobytes) observed in valgrind during the tests.
  5. Acceptable memory and CPU usage (given the number of queries performed in such a short period), as shown in the figures below.

Although CLucene has shown itself to be a promising option, and others (e.g. the Mozilla-based browser Flock) have successfully embedded it as a content search engine, it is still too early to draw a more concrete conclusion about it. A next round of tests will be performed in the coming days, including a measure of how fast it updates its index on the go and how well it handles concurrency, all running over a larger indexed content.

Please comment if you have any ideas.

--Antonio Gomes
tonikitoo at gmail dot com