Token constraints

The simplest possible Odinson patterns consist of a single token constraint. A token constraint specifies what must be true of a token in order for it to be a valid extraction. These constraints are limited only by what you include in your index. For example, we commonly include part-of-speech tags, named entity (NER) labels, and chunk labels.

Example

If you write a query such as this:

dog

Odinson will look for any occurrence of the word dog. Unless specified otherwise, this will be case-insensitive and will normalize accents and Unicode characters. That is, this pattern will match: dog, DoG, and dög.
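If you need an exact, case-sensitive match instead, you can constrain a different field. For instance, assuming your index includes the default raw field (which stores the surface form as it appeared in the text), a sketch of such a query would be:

[raw=dog]

This would match only the lowercase form dog, not DoG or dög.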

Using the token fields

If you want to write a token constraint that uses the indexed fields, you can use this format:

[tag=/N.*/]

This pattern will match any token in a document whose part of speech tag begins with "N". Here, tag is the specified field of the constraint. Any token constraint with an unspecified field (e.g., dog in the above example) will be matched against the norm field, which is the default. That is, dog is equivalent to [norm=dog]. The available field names are specified in the Odinson configuration, and for most use cases they should not need to be modified.
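Any indexed field can be used in the same way. For example, assuming lemmas are included in your index under the lemma field, the following would match every inflected form of the verb "be" (is, are, was, been, etc.):

[lemma=be]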

Operators for token constraints

You can combine and nest constraints for a token using operators: & (for AND), | (for OR), and parentheses. For example, the following would match a token whose tag starts with "N" and which is either tagged as an organization by NER or is a proper noun.

[tag=/N.*/ & (entity=ORGANIZATION | tag=NNP)]
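Constraints can also be negated with !. As a sketch, and assuming NER labels are indexed under the entity field as above, the following would match a noun that was not tagged as a person:

[tag=/N.*/ & !entity=PERSON]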

Wildcards

Odinson supports wildcards for token constraints. Specifically, [] matches any token.
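The wildcard is useful as a placeholder between more specific constraints. For example, using the entity field as in the example above, the following would match a token tagged as a person, followed by any single token, followed by a token tagged as an organization:

[entity=PERSON] [] [entity=ORGANIZATION]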

Quantifiers

Like any other pattern component, token constraints (as well as these wildcards) can be combined with quantifiers, e.g.,

[chunk=B-NP] [chunk=I-NP]*
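This pattern matches a token that begins a noun-phrase chunk (B-NP) followed by zero or more tokens that continue the chunk (I-NP), i.e., a complete base noun phrase. The quantifiers follow familiar regular-expression conventions: ? for optional, * for zero or more, and + for one or more. As a sketch, and assuming your Odinson version also supports ranged repetition of the form {n,m}, the following would match one to three adjectives followed by one or more nouns:

[tag=JJ]{1,3} [tag=/N.*/]+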