Annotating text
Before an Odinson index can be created, the text needs to be annotated. You may use your own annotation tools, as long as you convert your annotated output to Odinson Documents.
However, we also provide an App for annotating free text and producing this format, which makes use of the clulab Processors library.
Configuration
The configuration is specified in extra/src/main/resources/application.conf.
- First, decide which Processor you'd like to use to annotate the text by specifying a value for odinson.extra.processorType. Available options are FastNLPProcessor and CluProcessor. For more information about these, see clulab Processors.
- Ensure odinson.textDir and odinson.docDir are set as intended. Text will be read from odinson.textDir, annotated, and serialized to odinson.docDir.
NOTE: We recommend a directory structure where you will have a data folder with subdirs text, docs, and index. If you do this, you can simply specify odinson.dataDir = path/to/your/dataDir, and the subfolders will be handled.
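The recommended layout can be created up front with a few shell commands. This is only a minimal sketch: DATA_DIR is a hypothetical variable name and ./data a placeholder location, so substitute the path you plan to use for odinson.dataDir.

```shell
# DATA_DIR is a hypothetical placeholder; point it at your own data folder.
DATA_DIR="${DATA_DIR:-./data}"

# Create the three subdirectories named in the note above.
mkdir -p "$DATA_DIR/text" "$DATA_DIR/docs" "$DATA_DIR/index"

# List the result so the layout can be checked by eye.
ls "$DATA_DIR"
```

After this, place your raw text files in the text subdirectory and set odinson.dataDir to the same path in application.conf.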
Memory Usage
Depending on the number and size of the documents you are annotating, this step can be memory intensive. We recommend allocating at least 8 GB of JVM heap; with more memory it will run faster. You can set this with the following command:
export SBT_OPTS="-Xmx8g"
Command
sbt "extra/runMain ai.lum.odinson.extra.AnnotateText"
This step may take a while; the running time depends heavily on the length of your documents and the size of your corpus.
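The memory setting and the command above can be wrapped in a small helper that also reports how many files sit in the input and output folders once the run finishes. This is a sketch under stated assumptions, not part of Odinson: annotate_corpus, count_files, and DATA_DIR are hypothetical names, the data folder is assumed to follow the recommended text/docs layout, and comparing the two counts is only a rough sanity check that assumes one serialized document per input file.

```shell
# DATA_DIR is a hypothetical placeholder; it should match odinson.dataDir.
DATA_DIR="${DATA_DIR:-./data}"

# Print the number of regular files under directory $1 (0 if it is missing).
count_files() {
  if [ -d "$1" ]; then
    find "$1" -type f | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# Run the annotation step with an 8 GB heap, then report file counts.
annotate_corpus() {
  export SBT_OPTS="-Xmx8g"
  sbt "extra/runMain ai.lum.odinson.extra.AnnotateText" || return 1
  echo "text files: $(count_files "$DATA_DIR/text")"
  echo "documents:  $(count_files "$DATA_DIR/docs")"
}
```

To use it, call annotate_corpus from the root of the Odinson repository after editing application.conf.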