Metadata Query Language

Sometimes when we perform queries in Odinson we are only interested in a subset of the results. Maybe we only want extractions from documents written by a particular author or published in a specific venue, or maybe we only care about documents published recently. Odinson can index metadata associated with each document, and a metadata filter query can then use that metadata to filter the results of a query.

Odinson supports two main types of metadata: numeric and textual.

Numeric Metadata

Numeric metadata can be compared using the common comparison operators used in most programming languages: equals (==), not equals (!=), less than (<), greater than (>), less than or equals (<=), and greater than or equals (>=). One of the elements being compared must be an indexed metadata field, and the other must be a numeric value, e.g., citations > 5. These comparisons can be chained together, allowing us to express ranges in a more concise way, e.g., 1 < citations < 10.

  • Field type to use: ai.lum.odinson.NumberField
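To make the semantics of a chained comparison concrete, here is a minimal Python sketch that models documents as plain dictionaries. It is not Odinson code: the sample documents and the helper name `in_range` are made up for illustration, and Python's own chained comparison happens to read the same way as the filter.

```python
# Hypothetical documents, modeled as plain dicts (not Odinson objects).
docs = [
    {"id": "a", "citations": 0},
    {"id": "b", "citations": 5},
    {"id": "c", "citations": 10},
]

def in_range(doc, low, high):
    """Model of the filter `low < citations < high` (both bounds exclusive)."""
    return low < doc["citations"] < high

# Same result as the unchained form `citations > 1 && citations < 10`.
matches = [d["id"] for d in docs if in_range(d, 1, 10)]
```

Only document "b" passes, because both bounds are exclusive and document "c" sits exactly on the upper bound.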

Textual Metadata

Of those comparison operators, textual metadata supports only equals (==) and not equals (!=). Equals checks for an exact match, e.g., publisher == 'mit press'. Note that the textual content is wrapped in quotation marks.

Sometimes exact textual matching can be too stringent, and a containment check can be more appropriate. This can be accomplished with the contains operator, e.g., venue contains 'language'. To specify that a metadata field should not contain a given text, you can use: venue not contains 'language'

To make textual comparison more robust, we normalize the textual metadata fields as follows:

  • Unicode-aware case folding
  • NFKC Unicode normalization
  • transformation of some Unicode characters into ASCII equivalents (e.g., arrows and ligatures)
  • removal of diacritics

  • Field type to use: ai.lum.odinson.TokensField (Note, assumes the text is tokenized.)
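The normalization steps can be approximated with Python's standard `unicodedata` module. This is only a sketch of the idea, not Odinson's implementation: the function name `normalize` is hypothetical, the order of the steps is an assumption, and the mapping of symbols such as arrows to ASCII is left out.

```python
import unicodedata

def normalize(text: str) -> str:
    """Approximate the normalization steps described in this section."""
    # NFKC normalization; this also expands compatibility characters
    # such as the "fi" ligature into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Unicode-aware case folding (more thorough than lower()).
    text = text.casefold()
    # Remove diacritics: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Under this sketch, a value indexed as "MIT Press" and a query for 'mit press' normalize to the same string, which is why the exact-match example in this section is written in lowercase.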

Combining Filters

The metadata query language supports the and (&&), or (||), and not (!) operators to combine individual field constraints into a more complex filter, e.g., 1 < citations < 10 && venue contains 'language'.
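One way to picture how individual constraints combine is to treat each one as a predicate over a document and build larger predicates from smaller ones. The Python sketch below does that with plain dictionaries. The helper names (`and_`, `or_`, `not_`) are hypothetical, and modeling `contains` as membership in a whitespace-tokenized string is an assumption made for illustration, not a description of how Odinson evaluates it.

```python
from typing import Any, Callable, Dict

Doc = Dict[str, Any]
Filter = Callable[[Doc], bool]

def and_(a: Filter, b: Filter) -> Filter:
    """Model of `a && b`."""
    return lambda doc: a(doc) and b(doc)

def or_(a: Filter, b: Filter) -> Filter:
    """Model of `a || b`."""
    return lambda doc: a(doc) or b(doc)

def not_(a: Filter) -> Filter:
    """Model of `!a`."""
    return lambda doc: not a(doc)

# Model of `1 < citations < 10`.
citations_in_range: Filter = lambda doc: 1 < doc["citations"] < 10
# Model of `venue contains 'language'` (token membership is an assumption).
venue_has_language: Filter = lambda doc: "language" in doc["venue"].split()

# Model of `1 < citations < 10 && venue contains 'language'`.
combined = and_(citations_in_range, venue_has_language)
```

Calling `combined({"citations": 5, "venue": "natural language engineering"})` returns `True`, while the same venue with 12 citations is rejected by the range constraint.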

Dates

The metadata query language offers special support for dates through the date function, which returns a numeric representation of a date. As such, you can write queries against dates as you would against other numeric metadata. For example, date(2020, 'Jan', 1) <= pub_date <= date(2020, 'Dec', 31) would return documents published in 2020. This provides a lot of flexibility when checking for specific dates, but since checking for years is a common use case, we provide a shortcut: the same query can also be expressed as pub_date.year == 2020. Months can be given as full names, common abbreviations, or numbers (e.g., "August", "Aug", or 8).

  • Field type to use: ai.lum.odinson.DateField
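The idea of turning a date into a number so that ordinary comparisons apply can be sketched in a few lines of Python. The YYYYMMDD encoding, the function names, and the restriction to three-letter abbreviations are all assumptions for illustration; the source does not say which numeric representation Odinson uses internally.

```python
from datetime import date

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]

def parse_month(month) -> int:
    """Accept a month number, a full English name, or a three-letter abbreviation."""
    if isinstance(month, int):
        return month
    key = month.strip().lower()
    for number, name in enumerate(MONTHS, start=1):
        if key == name or key == name[:3]:
            return number
    raise ValueError(f"unknown month: {month!r}")

def date_number(year: int, month, day: int) -> int:
    """Hypothetical numeric encoding (YYYYMMDD) that sorts chronologically."""
    d = date(year, parse_month(month), day)  # also validates the calendar date
    return d.year * 10000 + d.month * 100 + d.day

def in_year_2020(pub_date: int) -> bool:
    """Inclusive range check covering every day of 2020."""
    return date_number(2020, "Jan", 1) <= pub_date <= date_number(2020, "Dec", 31)
```

With inclusive bounds on both ends, the range check accepts exactly the dates whose year is 2020, which is what the year shortcut expresses more directly.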

Nested Fields

Another capability of the metadata query language is its support for nested metadata fields. For example, authors are usually indexed as nested fields so that each author's first and last names are associated with each other, and not with those of other authors. To query a nested field, we specify the name of the field and the query to be performed on it. For example, if we had a document with two authors named Jane Smith and John Doe, then we could match this document with the queries author{first=='jane' && last=='smith'} or author{first=='john' && last=='doe'}, but not author{first=='jane' && last=='doe'}. Not all attributes of a nested field must be specified. For example, given the authors above, this would also match: author{first=='jane'}

  • Field type to use: ai.lum.odinson.NestedField
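The difference between nested and flat matching is easy to see in a small Python model. The sketch below uses plain dictionaries for the Jane Smith and John Doe example. The function names are hypothetical, and this models only the matching semantics described in this section, not how Odinson indexes nested fields.

```python
# The two authors from the example, one dict per nested entry.
authors = [
    {"first": "jane", "last": "smith"},
    {"first": "john", "last": "doe"},
]

def nested_match(entries, **constraints) -> bool:
    """True if a single entry satisfies every constraint at once."""
    return any(
        all(entry.get(key) == value for key, value in constraints.items())
        for entry in entries
    )

def flat_match(entries, **constraints) -> bool:
    """What would happen without nesting: each constraint may be
    satisfied by a different entry."""
    return all(
        any(entry.get(key) == value for entry in entries)
        for key, value in constraints.items()
    )
```

Here `nested_match(authors, first="jane", last="doe")` is `False`, while `flat_match` with the same arguments is `True`, which is the false positive that nested fields are meant to avoid.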

Regular Expressions

The metadata query language also supports Lucene regular expressions, such as:

  • author{first=='/j.*/' && last=='/d.*/'}
  • keywords contains '/bio.*/'
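As a rough illustration of how such patterns behave, the Python sketch below uses `re.fullmatch`, since Lucene regular expressions match against the whole term rather than searching inside it. Python's regex syntax differs from Lucene's in general, so this holds only for simple patterns like the ones shown. The sample values and the helper name `field_matches` are made up for the example.

```python
import re

def field_matches(pattern: str, value: str) -> bool:
    """Whole-value match, mirroring the anchored behavior of Lucene regexps."""
    return re.fullmatch(pattern, value) is not None

# Model of author{first=='/j.*/' && last=='/d.*/'} for one author.
author = {"first": "john", "last": "doe"}
both = field_matches(r"j.*", author["first"]) and field_matches(r"d.*", author["last"])

# Model of keywords contains '/bio.*/' over hypothetical keyword tokens.
keywords = ["biology", "symbiosis"]
bio_keywords = [k for k in keywords if field_matches(r"bio.*", k)]
```

Because the match is anchored, "symbiosis" is not selected by `bio.*` even though it contains "bio".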

Adding Metadata to Existing Odinson Documents

As of PR #319, there are some utilities and an app to help you add metadata to an Odinson document:

  • an addMetadata method in ai.lum.odinson.Document that takes a sequence of Fields and adds them as metadata
  • a MetadataWrapper case class that can serialize itself in a compatible JSON format
  • an app in the extra/ subproject (ai.lum.odinson.extra.AddMetadataToDocuments) that loads a set of metadata JSON files and a set of Document JSON files, and adds any metadata it finds to the corresponding Document (as indicated by the docID in the metadata file)