Splitting a Lucene index into two halves

Question


What's the best way to split an existing Lucene index into two halves, i.e. so that each half contains half of the total number of documents in the original index?

+2

lucene

Akhil May 19 '10 at 13:37


4 answers

A fairly reliable mechanism is to use the checksum of the document modulo the number of indexes to determine which index it will go to.
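A minimal sketch of that routing, assuming each document has some stable unique key to checksum. The ChecksumRouter class and the docKey parameter are hypothetical names for illustration, and CRC32 is just one possible checksum. Note that this yields roughly even parts, not exactly equal halves.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: routes a document to one of N target indexes by checksum.
final class ChecksumRouter {

    // Returns a target index number in [0, numIndexes) for the given document key.
    static int targetIndex(String docKey, int numIndexes) {
        if (numIndexes <= 0) {
            throw new IllegalArgumentException("numIndexes must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(docKey.getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is a non-negative long, so the remainder is never negative.
        return (int) (crc.getValue() % numIndexes);
    }

    public static void main(String[] args) {
        // Count how many made-up keys land in each of two target indexes.
        int[] counts = new int[2];
        for (int i = 0; i < 10_000; i++) {
            counts[targetIndex("doc-" + i, 2)]++;
        }
        System.out.println("index 0: " + counts[0] + ", index 1: " + counts[1]);
    }
}
```

Because the same key always maps to the same target, the same function can be reused later to decide which index to search or update for a given document.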

+1

Marcelo Cantos May 19 '10 at 13:42


Recent versions of Lucene have dedicated tools for this (IndexSplitter and MultiPassIndexSplitter, under contrib/misc).

+1

Michael McCandless May 13 '11 at 19:18


This question was one of the first results when I was searching for answers to this problem, so I'm leaving my solution here for posterity. In my case, I needed to split my index along specific lines, not arbitrarily down the middle or into thirds or whatever. This is a C# solution using Lucene 3.0.3.

My app's index is over 300GB, which had gotten a bit unmanageable. Each document in the index is associated with one of the manufacturing plants that use the application. There is no business reason for one plant to ever search another plant's data, so I needed to split the index along those lines. Here's the code I wrote to do this:

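// Namespaces used: Lucene.Net.Analysis, Lucene.Net.Analysis.Standard, Lucene.Net.Index, Lucene.Net.Search
// (Version here is Lucene.Net.Util.Version).
var distinctPlantIDs = databaseRepo.GetDistinctPlantIDs();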
var sourceDir = GetOldIndexDir();
foreach (var plantID in distinctPlantIDs)
{
    var query = new TermQuery(new Term("PlantID", plantID.ToString()));
    var targetDir = GetNewIndexDirForPlant(plantID); //returns a unique directory where this plant index will go

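    // Read this plant's documents and write them to the new index.
    // Note: Doc() returns only stored fields, so fields that were indexed but not stored are not carried over.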
    using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
    using (var sourceSearcher = new IndexSearcher(sourceDir, true))
    using (var destWriter = new IndexWriter(targetDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var numHits = sourceSearcher.DocFreq(query.Term);
        if (numHits <= 0) continue;
        var hits = sourceSearcher.Search(query, numHits).ScoreDocs;
        foreach (var hit in hits)
        {
            var doc = sourceSearcher.Doc(hit.Doc);
            destWriter.AddDocument(doc);
        }
        destWriter.Optimize();
        destWriter.Commit();
    }

    // Delete this plant's documents from the old index.
    using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
    using (var sourceWriter = new IndexWriter(sourceDir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        sourceWriter.DeleteDocuments(query);
        sourceWriter.Commit();
    }
}

The part that deletes documents from the old index exists because, in my case, one plant's records made up the majority of the index (more than two-thirds). My real version has some extra code to process that plant last; instead of splitting it out like the others, it optimizes the remaining index (which by then contains only that plant) and then moves it to a new directory.

Anyway, hope this helps someone out there.

0

lettucemode May 04 '18 at 21:50


bajafresh4life · Accepted Answer · 2010-05-20T13:51:54+0000

The easiest way to split an existing index (without reindexing all documents) is:

1. Make another copy of the existing index (e.g. cp -r myindex mycopy).
2. Open the first index and delete half of the documents (doc IDs 0 up to, but not including, maxDoc / 2).
3. Open the second index and delete the other half (doc IDs maxDoc / 2 up to maxDoc).
4. Optimize both indices.

This is probably not the most efficient way, but it requires very little coding.
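To make the boundaries in the steps above concrete, here is a small sketch of the range arithmetic only; it does not touch Lucene. The HalfSplitPlan class and its method names are hypothetical. Two assumptions worth stating: the ranges are treated as half-open so no document is deleted from both copies or kept in both, and since maxDoc also counts documents already marked as deleted, the halves are only exactly equal if the index has no pending deletions. As far as I know, in Lucene 3.x the deletion by document number itself would go through IndexReader.deleteDocument(int), which later versions removed.

```java
// Hypothetical helper: computes which doc-ID ranges to delete from each copy of the index.
final class HalfSplitPlan {

    // Half-open doc-ID range [start, end).
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }

        int size() {
            return end - start;
        }
    }

    // Doc IDs to delete from the first index: the lower half.
    static Range deleteFromFirst(int maxDoc) {
        return new Range(0, maxDoc / 2);
    }

    // Doc IDs to delete from the second index (the copy): the upper half.
    static Range deleteFromSecond(int maxDoc) {
        return new Range(maxDoc / 2, maxDoc);
    }

    public static void main(String[] args) {
        int maxDoc = 7; // made-up value; with an odd maxDoc the halves differ by one document
        Range first = deleteFromFirst(maxDoc);
        Range second = deleteFromSecond(maxDoc);
        System.out.println("first index: delete [" + first.start + ", " + first.end
                + "), keeps " + (maxDoc - first.size()));
        System.out.println("second index: delete [" + second.start + ", " + second.end
                + "), keeps " + (maxDoc - second.size()));
    }
}
```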
