Schema management
First, you need to create a schema definition. Cockatrice fully supports the field types, analyzers, tokenizers and filters provided by Whoosh. This section explains how to write a schema definition.
Schema Design
Cockatrice defines the schema in YAML format. YAML is a human-friendly data serialization standard for all programming languages.
The following items are defined in YAML:
- schema
- default_search_field
- field_types
- analyzers
- tokenizers
- filters
Schema
The schema is the place where you tell Cockatrice how it should build indexes from input documents.
    schema:
      <FIELD_NAME>:
        field_type: <FIELD_TYPE>
        args:
          <ARG_NAME>: <ARG_VALUE>
          ...
<FIELD_NAME>
: The field name in the document.
<FIELD_TYPE>
: The field type used in this field.
<ARG_NAME>
: The argument name to use when constructing the field.
<ARG_VALUE>
: The argument value to use when constructing the field.
For example, an id field used as a unique key is defined as follows:
    schema:
      id:
        field_type: id
        args:
          unique: true
          stored: true
Default Search Field
The query parser uses this as the field for any terms without an explicit field.
    default_search_field: <FIELD_NAME>
<FIELD_NAME>
: The field name used for any terms without an explicit field name.
For example, the following uses the text field as the default search field:
    default_search_field: text
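To illustrate what a default search field does, here is a minimal sketch in plain Python. It is not Cockatrice's or Whoosh's query parser: the function name `assign_fields` and the whitespace-based splitting are assumptions made purely for illustration.

```python
def assign_fields(query, default_field):
    """Pair each whitespace-separated term with a field name.

    Terms written as ``field:term`` keep their explicit field; all other
    terms fall back to ``default_field``. (Hypothetical helper.)
    """
    pairs = []
    for token in query.split():
        field, sep, term = token.partition(":")
        if sep and field and term:
            pairs.append((field, term))
        else:
            pairs.append((default_field, token))
    return pairs


# With ``default_search_field: text``, unprefixed terms are searched in "text".
pairs = assign_fields("title:search engine", default_field="text")
```

Here `search` stays attached to the explicit `title` field, while `engine` has no prefix and is assigned to `text`.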
Field Types
The field type defines how Cockatrice should interpret data in a field and how the field can be queried. There are many field types included with Whoosh by default, and they can also be defined directly in YAML.
    field_types:
      <FIELD_TYPE>:
        class: <FIELD_TYPE_CLASS>
        args:
          <ARG_NAME>: <ARG_VALUE>
<FIELD_TYPE>
: The field type name.
<FIELD_TYPE_CLASS>
: The field type class.
<ARG_NAME>
: The argument name to use when constructing the field type.
<ARG_VALUE>
: The argument value to use when constructing the field type.
For example, the following defines a text field type:
    field_types:
      text:
        class: whoosh.fields.TEXT
        args:
          analyzer:
          phrase: true
          chars: false
          stored: false
          field_boost: 1.0
          multitoken_query: default
          spelling: false
          sortable: false
          lang: null
          vector: null
          spelling_prefix: spell_
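The class and args keys suggest a simple loading pattern: import the class by its dotted path and call it with the arguments as keyword arguments. The sketch below shows that pattern using only the standard library. It is an assumption about how such a definition could be resolved, not Cockatrice's actual loader. The helper name `instantiate` is hypothetical, and `datetime.timedelta` stands in for a Whoosh class so that the snippet runs without Whoosh installed.

```python
import importlib


def instantiate(definition):
    """Build an object from a {"class": ..., "args": {...}} mapping."""
    module_name, _, class_name = definition["class"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    # A missing or empty ``args`` block means "use the class defaults".
    kwargs = definition.get("args") or {}
    return cls(**kwargs)


# Stand-in for an entry such as {"class": "whoosh.fields.TEXT", "args": {...}}.
definition = {"class": "datetime.timedelta", "args": {"hours": 1, "minutes": 30}}
obj = instantiate(definition)
```

The same mapping shape (a class path plus an args dictionary) appears in the analyzer, tokenizer and filter definitions, so one helper of this kind could in principle serve all of them.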
Analyzers
An analyzer can be defined in one of two ways. The first is a class element whose value is a fully qualified Python class name. The second is an analyzer chain, where you specify a tokenizer and filters to use, in the order you want them to run.
    analyzers:
      <ANALYZER_NAME>:
        class: <ANALYZER_CLASS>
        args:
          <ARG_NAME>: <ARG_VALUE>
      <ANALYZER_NAME>:
        tokenizer: <TOKENIZER_NAME>
        filters:
          - <FILTER_NAME>
<ANALYZER_NAME>
: The analyzer name.
<ANALYZER_CLASS>
: The analyzer class.
<ARG_NAME>
: The argument name to use when constructing the analyzer.
<ARG_VALUE>
: The argument value to use when constructing the analyzer.
<TOKENIZER_NAME>
: The tokenizer name to use in the analyzer chain.
<FILTER_NAME>
: The filter name to use in the analyzer chain.
For example, the following defines one analyzer using class and another using tokenizer and filters:
    analyzers:
      simple:
        class: whoosh.analysis.SimpleAnalyzer
        args:
          expression: "\\w+(\\.?\\w+)*"
          gaps: false
      ngram:
        tokenizer: ngram
        filters:
          - lowercase
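The tokenizer-plus-filters form can be pictured as a pipeline: the tokenizer produces tokens, and each filter transforms the stream in the listed order. The following is a minimal sketch in plain Python, not Whoosh's implementation. `build_analyzer`, `whitespace_tokenizer` and `lowercase_filter` are hypothetical names, and tokens are plain strings here rather than Whoosh token objects.

```python
def whitespace_tokenizer(text):
    """Yield whitespace-separated tokens."""
    yield from text.split()


def lowercase_filter(tokens):
    """Replace each token with its lowercase form."""
    for token in tokens:
        yield token.lower()


def build_analyzer(tokenizer, filters):
    """Return a function that runs the tokenizer, then each filter in order."""
    def analyze(text):
        stream = tokenizer(text)
        for token_filter in filters:
            stream = token_filter(stream)
        return list(stream)
    return analyze


analyze = build_analyzer(whitespace_tokenizer, [lowercase_filter])
tokens = analyze("Hello Full-Text Search")
```

Because each filter wraps the stream produced by the previous step, the order of the filters list determines the order in which the transformations are applied.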
Tokenizers
The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a sub-sequence of the characters in the text.
    tokenizers:
      <TOKENIZER_NAME>:
        class: <TOKENIZER_CLASS>
        args:
          <ARG_NAME>: <ARG_VALUE>
<TOKENIZER_NAME>
: The tokenizer name.
<TOKENIZER_CLASS>
: The tokenizer class.
<ARG_NAME>
: The argument name to use when constructing the tokenizer.
<ARG_VALUE>
: The argument value to use when constructing the tokenizer.
For example, the following defines an n-gram tokenizer:
    tokenizers:
      ngram:
        class: whoosh.analysis.NgramTokenizer
        args:
          minsize: 2
          maxsize: null
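As an illustration of what an n-gram tokenizer with these arguments produces, here is a small sketch in plain Python. It is not Whoosh's NgramTokenizer: the function `ngrams` is hypothetical, and treating a null maxsize as equal to minsize is an assumption about the Whoosh default.

```python
def ngrams(text, minsize, maxsize=None):
    """Yield every substring of length minsize..maxsize, scanning left to right."""
    # Assumption: an omitted (null) maxsize means "same as minsize".
    maxsize = maxsize or minsize
    for start in range(len(text) - minsize + 1):
        for size in range(minsize, maxsize + 1):
            end = start + size
            if end <= len(text):
                yield text[start:end]


bigrams = list(ngrams("whoosh", minsize=2))
mixed = list(ngrams("abcd", minsize=2, maxsize=3))
```

With `minsize=2` and no maxsize, `"whoosh"` is split into overlapping two-character tokens. Raising maxsize adds the longer substrings at each position as well.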
Filters
The job of a filter is usually easier than that of a tokenizer since in most cases a filter looks at each token in the stream sequentially and decides whether to pass it along, replace it or discard it.
    filters:
      <FILTER_NAME>:
        class: <FILTER_CLASS>
        args:
          <ARG_NAME>: <ARG_VALUE>
<FILTER_NAME>
: The filter name.
<FILTER_CLASS>
: The filter class.
<ARG_NAME>
: The argument name to use when constructing the filter.
<ARG_VALUE>
: The argument value to use when constructing the filter.
For example, the following defines a stemming filter:
    filters:
      stem:
        class: whoosh.analysis.StemFilter
        args:
          lang: en
          ignore: null
          cachesize: 50000
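The three possible outcomes for a token (pass along, replace, discard) can be shown with a short generator. This is a sketch in plain Python rather than a Whoosh filter class: `example_filter`, the stop-word set and the replacement table are made up for illustration.

```python
def example_filter(tokens, stop_words, replacements):
    """Look at each token in turn and pass, replace or discard it."""
    for token in tokens:
        if token in stop_words:
            continue  # discard
        # Replace when a substitute exists, otherwise pass along unchanged.
        yield replacements.get(token, token)


filtered = list(
    example_filter(
        ["the", "colour", "of", "magic"],
        stop_words={"the", "of"},
        replacements={"colour": "color"},
    )
)
```

Here `the` and `of` are discarded, `colour` is replaced by `color`, and `magic` is passed along untouched.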
Example
Refer to the example schema for how to define a schema:
https://github.com/mosuka/cockatrice/blob/master/example/schema.yaml
More information
See documents for more information.