Wikipedia example

This section describes how to index a Wikipedia dump with Cockatrice.

Downloading Wikipedia dump

$ curl -o ~/tmp/enwiki-20190101-pages-articles.xml.bz2 https://dumps.wikimedia.org/enwiki/20190101/enwiki-20190101-pages-articles.xml.bz2

Installing Wikiextractor

$ git clone https://github.com/attardi/wikiextractor.git
$ cd wikiextractor

Extracting Wikipedia data

$ ./WikiExtractor.py --output ~/tmp/enwiki --bytes 200K --json ~/tmp/enwiki-20190101-pages-articles.xml.bz2
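
Before indexing, it can help to confirm that the extraction actually produced output. The following is a minimal sketch, assuming that WikiExtractor with --json writes one JSON document per line into the files under the output directory. The function name count_extracted is hypothetical and is not part of WikiExtractor or Cockatrice.

```shell
# count_extracted DIR: print how many files exist under DIR and how many
# lines they contain in total (assumed: one line per extracted document).
count_extracted() {
  dir="$1"
  # Number of regular files anywhere below DIR.
  files=$(find "$dir" -type f | wc -l | tr -d '[:space:]')
  # Total number of lines across all of those files.
  docs=$(find "$dir" -type f -exec cat {} + | wc -l | tr -d '[:space:]')
  echo "files=${files} documents=${docs}"
}
```

After the extraction step it could be called as `count_extracted ~/tmp/enwiki`. If either number is 0, the extraction probably did not complete.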

Starting Cockatrice

$ cockatrice start indexer

Creating index

$ curl -s -X GET https://raw.githubusercontent.com/mosuka/cockatrice/master/example/enwiki_schema.yaml | xargs -0 cockatrice create index enwiki

Indexing Wikipedia

$ for FILE in $(find ~/tmp/enwiki -type f | sort)
  do
    echo "${FILE}"
    cat "${FILE}" | jq . | jq -s '.' | xargs -0 cockatrice put documents enwiki
  done
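
The jq -s step in the loop above is what turns each extracted file into a single JSON array before it is passed to cockatrice. The following sketch isolates that step so it can be tried on one file. It assumes jq is installed and that the input holds one JSON object per line. The helper name to_json_array is hypothetical.

```shell
# to_json_array FILE: read FILE as a stream of JSON objects and print them
# as one compact JSON array (-s slurps all inputs, -c prints on one line).
to_json_array() {
  jq -c -s '.' "$1"
}
```

For a made-up file containing the two lines `{"id":"1","title":"A"}` and `{"id":"2","title":"B"}`, this prints `[{"id":"1","title":"A"},{"id":"2","title":"B"}]`.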