
This guide describes how natural language processing (NLP) is incorporated into EMERSE, how to configure the system properly, and how to make changes so that you can incorporate your own annotations into the search index.

Background

EMERSE uses NLP to enhance basic keyword search by allowing for more semantic search. The most basic form of search is when a term matches only exactly that word in documents. This can find too little: you might want to find that word regardless of case, misspellings, or conjugations, or to find other words that mean the same thing. It can also find too much: you may not want to find places where the word refers to something else, where it is not in reference to the patient, or where it appears in a certain phrase. EMERSE already provides search limiters to do much of this, but NLP is in a position to provide more nuanced solutions. It can selectively tag words or phrases as referring to abstract concepts, disambiguating word or acronym usage, tag words to indicate when they are talking about the patient, and more.

We’ve integrated the open-source libraries OpenNLP and CTAKES for our NLP; however, it is possible to disable them, or to add the output of another NLP process to theirs.

NLP Implementation

Solr

Our NLP system is implemented as part of Solr. This means that when you send a document to Solr, Solr tokenizes the document and indexes those tokens, making the document searchable. Documents in Solr consist of many fields, only one of which is the text of the document; the others are metadata fields, such as when the document was written, the patient it is for, etc. Each such field needs a description of how Solr should process it, in particular, how it should be tokenized. This is what the field’s type is for.

The fields a document can have, plus their types, are all defined in the managed-schema XML file in the core’s conf directory in the index. A declaration of a field type looks like this:

<fieldType
  name="SOME NAME"
  class="org.example.SomeClass"
  other-options-possible>
  <analyzer type="index">...</analyzer>
  <analyzer type="query">...</analyzer>
</fieldType>

Once you have the type, you can make zero or more fields actually use that type, like so:

<field name="SOME FIELD NAME" type="SOME FIELD TYPE NAME"/>

The main work in running NLP in Solr is defining the <analyzer type="index"> element in the field type. The analyzer with type="index", called the index analyzer, is used when indexing documents and is the main subject of this document. The other analyzer is the query analyzer, which is covered in the Query Analyzer section.

However, the class of the field type is important too: you must use org.emerse.solr.TextWithHeaderFieldType instead of the more normal solr.TextField if your tokenizer has hasHeader="true" set. (See below for more info.)

When you send a header with the document text, you are giving the analyzer additional information it needs for tokenization and indexing. However, that header is not really part of the document, and so shouldn’t be stored and presented as if it were. In order to present the header to the analyzer but not store it, we needed a custom class for the field type.

However, we do actually need to store the header in order to support document updates, where you may update the metadata of the document, but not change the text. This is implemented as a delete-and-insert in Solr, meaning it synthesizes a new document based on the existing document and the update you wish to make to fields of the document. Thus, we need a place to store the header, so we can reconstruct the header-plus-document-text to present it to the analyzer. This place is a field named NLP_HEADER and must have exactly the definition:

<field name="NLP_HEADER" type="string" indexed="false" stored="true" docValues="true"/>

Finally, and importantly, if the field type has this custom class, you must always have a header in every document.

For quick reference, a document with a header attached looks like:

RU1FUlNFX0g=1|5|
Mrs. Anderson was seen in our clinic today. She is complaining of mild shortness of breath, as well as moderate fatigue and hair loss. She denies fevers, loss of appetite, or chest pain. Her mother had Hashimoto's thyroiditis. She has a history of type 2 diabetes and myopia. Mrs. Anderson does not have a history of thyroid cancer. She should be tested for a possible autoimmune disorder.

The header is the first line. See here for more details. It’s possible to extend the header to include custom NLP tokens, as covered below.

One of the simpler examples of a field type of the document text field is:

<fieldType name="text_with_header_post_analysis"
  class="org.emerse.solr.TextWithHeaderFieldType"
  termPositions="true" termVectors="true" termOffsets="true" termPayloads="true"
>
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
      standardTokenizer="true"
      hasHeader="true"
      tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
      sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
      numOfNewLineSymbols="2"
    />
    <filter class="org.emerse.nlp.PostAnalysisFilterFactory" allowMisalignment="true"/>
  </analyzer>
</fieldType>

In the rest of the document, whenever you see an <analyzer type="index"> element or when we talk about an analyzer, analysis chain, or a pipeline, it is this index analyzer element in the field type for the document text’s field that we are talking about.

Tokens

EMERSE uses a custom Lucene Analyzer to run NLP over the text, and produce tokens for indexing. Ordinarily, tokens are just words of the text, sometimes stemmed, lowercased, etc, depending on how you want search terms to match the text. However in EMERSE, we have multiple tokens per word of text, with different tokens conveying different information about the text. So as to prevent ambiguity, tokens that convey the words of the text itself are prefixed with either TX_ or tx_. Tokens that convey other things are not allowed to start with these prefixes.

NLP Tokens

Other than tokens that refer to the words of the text itself, there are two broad categories of tokens:

Entities

An entity is a token that refers to a specific concept, such as one given in the UMLS. It makes sense for users to search directly for an entity within the text. An example entity could be CUI_2395, which refers to Alzheimer’s disease.

Annotations

An annotation is a token that adds auxiliary information to entities and text tokens. It doesn’t make sense for a user to search for an annotation directly; instead, it is used to modify a search for another term. An example is POL_NEG, which indicates that a word is used in a negated sense, such as alzheimers disease in the phrase does not have alzheimers disease.

The inclusion of these two types of tokens is the contribution of the NLP system to the search index, and enables all the NLP search features in EMERSE to work.

Token Attributes

Each token has a range of characters it refers to in the text. Similarly, each token has a range of "positions" it refers to. These "positions" are conceptually positions of the words of the text, with the first word in the text at position one, the second word at position two, etc. Tokens that refer to words of the text itself span only one position, but NLP tokens can span multiple positions.

Positions determine how search phrases match. If a phrase such as alzheimers disease is to match, the end position of the token tx_alzheimers must be one less than the start position of the token tx_disease.
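The adjacency rule above can be illustrated with a minimal Python sketch. The Token type and the phrase_matches helper are hypothetical names chosen for this illustration; they are not part of EMERSE or Lucene, and the positions used are made up.

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    start_pos: int  # first word position covered (1-based, as described above)
    end_pos: int    # last word position covered


def phrase_matches(first: Token, second: Token) -> bool:
    """Two tokens form a phrase when the second starts right after the first ends."""
    return first.end_pos == second.start_pos - 1


# Hypothetical positions: "alzheimers" is word 5, "disease" is word 6.
alz = Token("tx_alzheimers", 5, 5)
dis = Token("tx_disease", 6, 6)
far = Token("tx_disease", 9, 9)

print(phrase_matches(alz, dis))  # adjacent words
print(phrase_matches(alz, far))  # other words in between
```

Because the test uses the end position of the first token, the same comparison would also work when the first token spans several positions, as an entity does.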

Token Attribute Reconciliation

Words determine what the positions are, and those same words have character ranges as well, so we want any token that spans a certain set of positions to have the corresponding range of characters. However, NLP systems generally only output character ranges, and they don’t necessarily correspond exactly with the way we break up words in our system. For this reason, we reconcile the character offsets of the output of NLP with the tokenization of words in Solr, assigning position ranges based on the character ranges, and clipping character ranges to the character ranges of those corresponding positions. (Not all characters need to be in the character range of a position; whitespace or punctuation is often not part of any "word" and hence not part of character ranges of tokens, unless those tokens span multiple words and those characters are between those words.)

However, in assigning positions to NLP tokens based on their character ranges, we can make a choice: we can either have a single token in the index associated with multiple positions, or we can duplicate that token once for each such position, so that each token spans only a single position. Conventionally, we duplicate the token for annotations, but have the token span many positions for entities.
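As an illustration of the reconciliation just described, here is a small Python sketch. It is not the EMERSE implementation: the regular expression \w+ stands in for the real tokenizer, and the function names word_spans and reconcile are assumptions made for this example.

```python
import re


def word_spans(text):
    """Return (start, end) character ranges of words; position i is spans[i - 1]."""
    return [m.span() for m in re.finditer(r"\w+", text)]


def reconcile(text, start, end, token, duplicate):
    """Map a character range onto word positions and clip it to those words.

    Returns a list of (token, first_pos, last_pos, char_start, char_end).
    """
    spans = word_spans(text)
    # Positions whose word overlaps the requested character range.
    hits = [i + 1 for i, (s, e) in enumerate(spans) if s < end and e > start]
    if not hits:
        return []
    if duplicate:
        # Annotation style: one copy of the token per position.
        return [(token, p, p, *spans[p - 1]) for p in hits]
    # Entity style: one token spanning every position, clipped to word edges.
    first, last = hits[0], hits[-1]
    return [(token, first, last, spans[first - 1][0], spans[last - 1][1])]


text = "This is not a teddy bear."
# An NLP range that sloppily includes a leading space and the final period.
print(reconcile(text, 13, 25, "CUI_TEDDY_BEAR", duplicate=False))
print(reconcile(text, 12, 24, "POL_NEG", duplicate=True))
```

In the first call the character range is clipped from 13..25 to 14..24, the extent of the words teddy and bear; in the second, the annotation is copied onto each of the three positions it covers.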

Token-Aligned Layered Index Structure

Here is an example to conceptually illustrate the Token-Aligned Layered Index using a sentence: This is not a teddy bear.

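Layer 1 (text tokens): This | is | not | a | teddy | bear
Layer 2 (annotations): PPOL_NEG | POL_NEG | POL_NEG | POL_NEG | POL_NEG
Layer 3 (UMLS concept): CUI_TEDDY_BEAR
Layer 4 (semantic group): SMG_STUFFED_ANIMALS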

There are four layers in this index. Text tokens are in the first row (with the prefix omitted). The following rows contain NLP tokens that are aligned with text tokens. The second row holds the annotations, the third row has an entity that is a concept from UMLS, and the fourth row has an entity that is a semantic group. Each concept is assigned one or more semantic groups, making semantic groups a very large category of concepts. More information on semantic groups can be found here.

NLP Pipeline

A Lucene analysis chain, or the "NLP pipeline" as we call it, comprises:

  • zero or more CharFilters,

  • a single Tokenizer, and

  • zero or more TokenFilters.

CharFilters work mainly at the character level of the text, effectively pre-processing the text so that what is seen down the line is in more of a normal form. For instance, they can cause the pipeline to treat special characters as other, more normal characters, or to ignore parts of the text that correspond to HTML markup.

The tokenizer decides what the words and sentences are in the text.

The token filters do the NLP work, generally adding additional tokens to the stream, though it is possible for them to remove tokens as well. The order of token filters doesn’t matter, unless we otherwise note that it does.

Here is a possible implementation of the EMERSE NLP pipeline:

<analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
    <tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
        sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
        hasHeader="true"
        usingLuceneTokenizer="true"
    />
    <filter class="org.emerse.ctakes.CTAKESFilterFactory"
        excludeTypes="nlp/pos_exclude.txt"
        prefix="CUI"
        dictionary="nlp/dictionary/custom.script"
        posModel="nlp/models/mayo-pos.zip"
    />
    <filter class="org.emerse.nlp.TaggerFilterFactory"
        addTokenPrefix="false"
        delimiters="nlp/delimiters.txt"
        negations="nlp/negations.txt"
        negationIgnore="nlp/negationIgnore.txt"
        certainty="nlp/probability_resource_patterns.txt"
        family="nlp/family_history_resource_patterns_A.txt"
        familyTemplates="nlp/family_history_resource_patterns_B.txt"
        familyMembers="nlp/family_history_resource_patterns_C.txt"
        history="nlp/history.txt"
        historyIgnore="nlp/historyIgnore.txt"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
    />
    <filter class="org.emerse.nlp.PostAnalysisFilterFactory"/>
</analyzer>

We’ll go over the options for each of these.

HTML Strip CharFilter

<charFilter class="solr.HTMLStripCharFilterFactory"/>

The HTML strip CharFilter causes the rest of the system to ignore HTML tags. These are just the tags themselves, not the children of those tags. For instance, in the text <b>Alzheimer’s</b> the text <b> and </b> will be ignored, but not the word Alzheimer’s. The offsets of the tokens do account for the tags, so the start of the word Alzheimer’s is character three, not zero, within this snippet.
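The offset behavior can be mimicked with a short Python sketch. It only illustrates the idea of skipping tags while keeping offsets relative to the raw input; it does not reproduce the actual behavior of Solr’s HTMLStripCharFilterFactory (for example, it ignores HTML entities and comments), and the helper names are made up.

```python
import re

TAG = re.compile(r"<[^>]+>")    # naive tag matcher, an assumption of this sketch
WORD = re.compile(r"[^\s<>]+")  # whitespace-delimited "words"


def _words(text, lo, hi):
    # Search only text[lo:hi], but report offsets in the full string.
    return [(m.group(), m.start(), m.end()) for m in WORD.finditer(text, lo, hi)]


def words_with_original_offsets(html):
    """Return (word, start, end) for text outside tags, offsets relative to the raw input."""
    result = []
    cursor = 0
    for tag in TAG.finditer(html):
        result.extend(_words(html, cursor, tag.start()))
        cursor = tag.end()
    result.extend(_words(html, cursor, len(html)))
    return result


print(words_with_original_offsets("<b>Alzheimer's</b>"))
```

The single word found starts at offset 3, directly after the three characters of the opening tag.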

Mapping CharFilter

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>

The Mapping CharFilter replaces every occurrence of one character with another, at the direction of the given file, mapping-delimiters.txt. We use this to normalize certain punctuation characters, such as "smart" or "curly" quote marks, which differ depending on whether they are opening or closing quotes, to straight ones that don’t differ.
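The effect of such a mapping can be sketched in Python. The four mappings below are an assumed example of quote normalization; the actual contents of mapping-delimiters.txt are not shown in this guide and may differ.

```python
# Assumed example mappings: curly quotes to straight quotes.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_quotes(text):
    """Replace each mapped character with its straight equivalent."""
    return text.translate(QUOTE_MAP)


original = "\u201cAlzheimer\u2019s\u201d"
print(normalize_quotes(original))
```

Since each mapping here replaces one character with exactly one character, the text keeps its length and character offsets are unaffected.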

OpenNLP Tokenizer

<tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
    tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
    sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
    hasHeader="true"
    usingLuceneTokenizer="true"
/>

The OpenNLP tokenizer decides what the "words" of the text are. Though it may seem obvious what these are, there are edge cases around numbers, hyphenation, etc., where there are choices to be made that affect search. It also decides what the sentences of the text are.

Table 1. Options
Argument Name | Required | Default Value | Description
sentenceModel | Yes | N/A | String. The file path of the pre-trained sentence detection model for OpenNLP. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
tokenizerModel | Yes | N/A | String. The file path of the pre-trained tokenization model for OpenNLP. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
usingLuceneTokenizer | No | False | True or False. If True, tokens are reprocessed using Lucene’s StandardTokenizer, which ensures that the indexed tokens adhere to the behavior of the StandardTokenizer.
hasHeader | No | False | True or False. If True, the tokenizer expects a single-line header at the beginning of the document. If the document has a header, the field type for storing the text in the Solr index has to be org.emerse.solr.TextWithHeaderFieldType
numOfNewLineSymbols | No | 1 | Numeric. The number of consecutive newline symbols (carriage return and line feed combined) that end a sentence. This is a hint to help the tokenizer decide the boundaries of sentences.
numOfSpaces | No | 5 | Numeric. The number of consecutive spaces that end a sentence. This is a hint to help the tokenizer decide the boundaries of sentences.

The NLP Header

The OpenNLP tokenizer expects documents to have a header attached to them if the attribute hasHeader is set to true (as shown in the example). The header is indicated by a kind of file signature or "magic number": the text RU1FUlNFX0g= should appear at the very beginning of the document. This text is unlikely to occur in any real document, and hence is a good indication that it is intended to be the header. After this signature, there are three fields separated by pipes (|):

  1. the number of newlines that should end a sentence,

  2. the number of spaces that should end a sentence, and

  3. boundary text, which indicates the end of the custom token TSV and start of the actual document.

The boundary is all the text from the last pipe character to the next newline. If the boundary text is empty, the document starts immediately. Otherwise, the custom token TSV starts immediately, and continues until a newline followed by the boundary text, and then another newline is seen, at which point the document starts.

Thus, a header can look like:

RU1FUlNFX0g=1|5|

or:

RU1FUlNFX0g=2|5|XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==

The OpenNLP tokenizer does not work correctly with a custom token TSV. Instead, use the Hybrid Tokenizer.
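The header layout described in this section can be sketched as a small Python parser. This is an illustration under stated assumptions, not the EMERSE tokenizer: the function name parse_document is hypothetical, a single line feed is assumed after the header line, and the handling of an empty TSV is a guess.

```python
SIGNATURE = "RU1FUlNFX0g="


def parse_document(raw):
    """Split raw input into (newlines, spaces, tsv_rows, document_text)."""
    if not raw.startswith(SIGNATURE):
        raise ValueError("missing NLP header signature")
    header, _, rest = raw.partition("\n")
    newlines, spaces, boundary = header[len(SIGNATURE):].split("|", 2)
    if boundary == "":
        # Empty boundary: the document starts right after the header line.
        return int(newlines), int(spaces), [], rest
    if rest.startswith(boundary + "\n"):
        # Assumption: a boundary line directly after the header means an empty TSV.
        return int(newlines), int(spaces), [], rest[len(boundary) + 1:]
    # The TSV runs until a newline, the boundary text, and another newline.
    tsv, found, document = rest.partition("\n" + boundary + "\n")
    if not found:
        raise ValueError("boundary text not found")
    rows = [line.split("\t") for line in tsv.split("\n") if line]
    return int(newlines), int(spaces), rows, document


BOUNDARY = "XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA=="
plain = SIGNATURE + "1|5|\nMrs. Anderson was seen in our clinic today."
custom = (SIGNATURE + "2|5|" + BOUNDARY + "\n"
          + "37\t42\tCUI_310367\tR\n"
          + BOUNDARY + "\nMrs. Anderson was seen in our clinic today.")

print(parse_document(plain))
print(parse_document(custom))
```

The first input uses an empty boundary, so no custom tokens are read; the second carries one TSV row between the header line and the boundary line.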

Hybrid Tokenizer

<tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
    tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
    sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
    hasHeader="true"
    usingLuceneTokenizer="true"
/>

The hybrid tokenizer is basically the same as the OpenNLP tokenizer, except that it works with the custom token TSV, introducing the tokens stated there into the token stream it produces. See here for more information on the format of that TSV, and how to include your custom NLP tokens into documents.

Table 2. Options
Argument Name | Required | Default Value | Description
standardTokenizer | No | False | True or False. If True, the HybridTokenizer is wrapped with Lucene’s StandardTokenizer; if False, it is wrapped with the OpenNLPTokenizer. If you want to take any NLP artifacts from the EMERSE NLP filters into the Solr index, the OpenNLPTokenizer should be used.
maxTokenCount | No | 255 | Numeric. This applies to the StandardTokenizer only. It defines the maximum token length. If a token exceeds this length, it is split at maxTokenCount intervals.
lineFeedWidth | No | 1 | Numeric. The length of the line feed at the end of the header row. If nothing is specified, EMERSE expects a single line feed at the end of the header row.

CTAKES TokenFilter

<filter class="org.emerse.ctakes.CTAKESFilterFactory"
    excludeTypes="nlp/pos_exclude.txt"
    prefix="CUI"
    dictionary="nlp/dictionary/custom.script"
    posModel="nlp/models/mayo-pos.zip"
/>

The CTAKES TokenFilter produces entity tokens for UMLS concepts and semantic groups.

Table 3. Options
Argument Name | Required | Default Value | Description
dictionary | Yes | N/A | String. The file path of the dictionary. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
prefix | Yes | N/A | String. The prefix that is prepended to the entity ID that will be inserted into the index.
longestMatch | No | True | True or False. Applies when multiple entities are recognized and one of them encompasses all the others. If True, only the longest entity is included in the index; if False, all identified entities are added to the index.
posModel | No | N/A | String. The file path of the POS (part-of-speech) model. The POS model, combined with an excludeTypes list, is used to exclude tokens with certain POS types from being identified as entities. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
excludeTypes | No | N/A | String. The file path of the list of POS (part-of-speech) types. If a token has been identified with any POS type in this list, no entity extraction is conducted against this token. This list only takes effect if a posModel is given.
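The effect of longestMatch can be illustrated with a Python sketch that filters overlapping spans. This is one plausible reading of the option, not the actual CTAKES filter logic; the function name keep_longest is hypothetical, and the offsets and CUIs are borrowed from the custom token example later in this guide purely as sample data.

```python
def keep_longest(entities):
    """entities: list of (start, end, token). Drop spans enclosed by a strictly longer span."""
    kept = []
    for start, end, token in entities:
        enclosed = any(
            a <= start and end <= b and (b - a) > (end - start)
            for a, b, _ in entities
        )
        if not enclosed:
            kept.append((start, end, token))
    return kept


entities = [
    (66, 90, "CUI_4054523"),  # "mild shortness of breath"
    (71, 90, "CUI_13404"),    # "shortness of breath"
    (84, 90, "CUI_225386"),   # "breath"
    (95, 99, "CUI_5575035"),  # "well", not enclosed by anything
]
print(keep_longest(entities))
```

With the filter applied, only the 66..90 span and the unrelated 95..99 span remain; without it, all four entities would be indexed.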

Tagger TokenFilter

<filter class="org.emerse.nlp.TaggerFilterFactory"
    addTokenPrefix="false"
    delimiters="nlp/delimiters.txt"
    negations="nlp/negations.txt"
    negationIgnore="nlp/negationIgnore.txt"
    certainty="nlp/probability_resource_patterns.txt"
    family="nlp/family_history_resource_patterns_A.txt"
    familyTemplates="nlp/family_history_resource_patterns_B.txt"
    familyMembers="nlp/family_history_resource_patterns_C.txt"
    history="nlp/history.txt"
    historyIgnore="nlp/historyIgnore.txt"
    tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
/>

The Tagger TokenFilter uses a series of rules to produce annotation tokens for:

  • negation

  • uncertainty

  • non-patient mentions, and

  • patient history.

Table 4. Options
Argument Name | Required | Default Value | Description
delimiters | No | N/A | String. The file path for the delimiters. Delimiters divide a sentence into distinct sections, which prevents the detection of mentions from extending beyond the section in which they occur and avoids inaccuracies towards the end of the sentence. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
negations | No | N/A | String. The file path for the negation rules. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
negationIgnore | No | N/A | String. The file path for the negation ignore rules. This is useful for cases like can not be ignored: if can not is a negation rule, then without this ignore rule, can not be ignored could be wrongly identified as a negation. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
certainty | No | N/A | String. The file path for the uncertainty rules. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
family | No | N/A | String. The file path for the family rules that are in static mode. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
familyTemplates | No | N/A | String. The file path for the family rules that are in dynamic mode. This argument works together with familyMembers: the rules in familyTemplates act as format strings, and the placeholders in them are replaced by each of the entries defined in familyMembers. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
familyMembers | No | N/A | String. The file path for the family member definitions. This argument works together with familyTemplates. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
history | No | N/A | String. The file path for the patient history mention rules. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
historyIgnore | No | N/A | String. The file path for the history ignore rules. This is useful for cases like history on admit: if history is a history rule, then without this ignore rule, history on admit could be wrongly identified as a patient history mention. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
addTokenPrefix | No | True | True or False. If True, this filter adds the prefixes TX_ and tx_ to tokens. Tokens with the TX_ prefix are case-sensitive, while tokens with the tx_ prefix are case-insensitive. If False, no prefix is added and the tokens are sent downstream as they are.

Post Analysis TokenFilter

<filter class="org.emerse.nlp.PostAnalysisFilterFactory"/>

The Post analysis TokenFilter does token reconciliation. It must be the last token filter in the pipeline.

Table 5. Options
Argument Name | Required | Default Value | Description
addTokenPrefix | No | True | True or False. If True, a token will be provided in both case-sensitive and case-insensitive forms and be prefixed with TX_ and tx_ respectively.

Query Analyzer

So far, we’ve discussed the index analyzer. What about the query analyzer?

The query analyzer is used when analyzing the text of users’ queries, so that the words users type into term bundles in EMERSE are broken into the same tokens as for indexing; that way they match up and documents can be found. However, the query analyzer doesn’t do NLP, since the user isn’t entering general text, so the analysis chain needs to be different, which is why there are separate analyzers for index and query.

You should use the same char filters and tokenizer as the main index analyzer, but use none of the token filters. Instead, use only the org.emerse.nlp.TokenAdjusterFilterFactory token filter. The token adjuster filter prepends the TX_ or tx_ prefixes to words, except when the word already appears to have one of the reserved prefixes, as given in the reservedPrefix attribute.

<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
  <tokenizer class="solr.OpenNLPTokenizerFactory"
    sentenceModel="nlp/models/sd-med-model.zip"
    tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"/>
  <filter class="org.emerse.nlp.TokenAdjusterFilterFactory"
      reservedPrefix="CUI_,POL_NEG,PPOL_NEG,APOL_NEG,TG_FAMILY,PTG_FAMILY,ATG_FAMILY,CT_UNCERTAIN,PCT_UNCERTAIN,ACT_UNCERTAIN,TG_HISTORY,PTG_HISTORY,ATG_HISTORY"/>
</analyzer>
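The prefixing rule can be approximated with the following Python sketch. It is a guess at the observable behavior, not the TokenAdjusterFilterFactory source: the guide does not say which of TX_ or tx_ is chosen for a given query word, so this sketch assumes the case-insensitive tx_ form with lowercasing, and the function name adjust is hypothetical.

```python
# The reserved prefixes from the reservedPrefix attribute in the analyzer above.
RESERVED = (
    "CUI_,POL_NEG,PPOL_NEG,APOL_NEG,TG_FAMILY,PTG_FAMILY,ATG_FAMILY,"
    "CT_UNCERTAIN,PCT_UNCERTAIN,ACT_UNCERTAIN,TG_HISTORY,PTG_HISTORY,ATG_HISTORY"
).split(",")


def adjust(token, reserved=RESERVED):
    """Leave NLP tokens alone; mark everything else as a text token."""
    if token.startswith(tuple(reserved)):
        return token
    # Assumption: plain words become case-insensitive text tokens.
    return "tx_" + token.lower()


for word in ["Diabetes", "CUI_2395", "POL_NEG"]:
    print(word, "->", adjust(word))
```

Under these assumptions, a query for an entity such as CUI_2395 passes through unchanged and can match the entity tokens in the index, while an ordinary word is rewritten to the prefixed form used for text tokens.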

Custom NLP

If you want to provide your own entities or annotations in EMERSE, you must use the Hybrid Tokenizer with hasHeader="true", and provide the custom NLP tokens in the TSV part of that header.

The TSV has four columns. The first two give the character range in the text that the token is for. Offsets are counts of characters (in UTF-16, with a surrogate pair counting as two characters) starting from the beginning of the document, that is, from the first character after the boundary line, and its trailing newline, that follows the TSV. The third column gives the token. The fourth column gives the type of reconciliation to be done, as shown in this table:

Flag | Reconciliation Method
T | Duplicate the token once per position in the character range (like annotations)
R | Generate only one token per row in the TSV, and assign it all positions in the character range (like entities)

Again, "position" here means a position of a word, counting words from the beginning of the document.
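Because the offsets are counted in UTF-16 code units, tools that index strings by code point (as Python does) need a conversion step when the text contains characters outside the Basic Multilingual Plane. A minimal sketch, with hypothetical helper names:

```python
def utf16_offset(text, index):
    """Convert a Python string index (code points) to a UTF-16 code-unit offset."""
    return len(text[:index].encode("utf-16-le")) // 2


def utf16_range(text, needle):
    """Return the UTF-16 (start, end) range of the first occurrence of needle."""
    start = text.index(needle)
    return utf16_offset(text, start), utf16_offset(text, start + len(needle))


ascii_text = "She denies fevers"
emoji_text = "\U0001F600 fever"  # the emoji is one code point but two UTF-16 units

print(utf16_range(ascii_text, "fevers"))
print(utf16_range(emoji_text, "fever"))
```

For plain ASCII text the two ways of counting agree; in the second string, Python finds the word at index 2, whereas its UTF-16 offset is 3 because the emoji occupies a surrogate pair.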

Here is an example custom token TSV, complete with the header:

RU1FUlNFX0g=1|5|XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==
37	42	CUI_310367	R
37	42	SMG_Drug	R
66	90	CUI_4054523	R
66	90	SMG_Finding	R
71	90	CUI_13404	R
71	90	SMG_Finding	R
84	90	CUI_225386	R
84	90	SMG_Finding	R
95	99	CUI_5575035	R
95	99	SMG_Finding	R
103	111	CUI_5201148	R
103	111	SMG_Finding	R
112	119	CUI_15672	R
112	119	SMG_Finding	R
124	128	CUI_18494	R
124	128	SMG_Anatomy	R
124	133	CUI_2170	R
124	133	SMG_Disorder	R
135	138	PPOL_NEG	T
139	145	PPOL_NEG	T
146	152	POL_NEG	T
146	152	CUI_15967	R
146	152	SMG_Finding	R
154	158	POL_NEG	T
154	170	CUI_1971624	R
154	170	SMG_Finding	R
159	161	POL_NEG	T
162	170	POL_NEG	T
162	170	CUI_3618	R
162	170	SMG_Finding	R
172	174	POL_NEG	T
175	180	POL_NEG	T
175	180	CUI_817096	R
175	180	SMG_Anatomy	R
175	185	CUI_8031	R
175	185	SMG_Finding	R
181	185	POL_NEG	T
181	185	CUI_30193	R
181	185	SMG_Finding	R
191	197	PTG_FAMILY	T
198	201	PTG_FAMILY	T
202	211	TG_FAMILY	T
202	211	CUI_677607	R
202	211	SMG_Disorder	R
212	213	TG_FAMILY	T
214	225	TG_FAMILY	T
214	225	CUI_40147	R
214	225	SMG_Disorder	R
237	244	PTG_HISTORY	T
237	244	CUI_262926	R
237	244	SMG_Finding	R
245	247	TG_HISTORY	T
248	252	TG_HISTORY	T
253	254	TG_HISTORY	T
255	263	TG_HISTORY	T
255	263	CUI_11849	R
255	263	SMG_Disorder	R
255	263	CUI_11847	R
255	263	SMG_Disorder	R
264	267	TG_HISTORY	T
268	274	TG_HISTORY	T
268	274	CUI_27092	R
268	274	SMG_Disorder	R
290	294	PPOL_NEG	T
295	298	PPOL_NEG	T
299	303	PPOL_NEG	T
304	305	PPOL_NEG	T
306	313	PPOL_NEG	T
306	313	PTG_HISTORY	T
306	313	CUI_262926	R
306	313	SMG_Finding	R
314	316	PPOL_NEG	T
314	316	TG_HISTORY	T
317	324	POL_NEG	T
317	324	TG_HISTORY	T
317	324	CUI_40132	R
317	324	SMG_Anatomy	R
317	331	CUI_7115	R
317	331	SMG_Disorder	R
325	331	POL_NEG	T
325	331	TG_HISTORY	T
325	331	CUI_6826	R
325	331	SMG_Disorder	R
347	353	CUI_39593	R
347	353	SMG_Finding	R
360	368	PCT_UNCERTAIN	T
360	368	CUI_332149	R
360	368	SMG_Finding	R
369	379	CT_UNCERTAIN	T
369	379	CUI_4551524	R
369	379	SMG_Finding	R
369	388	CUI_4364	R
369	388	SMG_Disorder	R
380	388	CT_UNCERTAIN	T
380	388	CUI_12634	R
380	388	SMG_Disorder	R
XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==
Mrs. Anderson was seen in our clinic today. She is complaining of mild shortness of breath, as well as moderate fatigue and hair loss. She denies fevers, loss of appetite, or chest pain. Her mother had Hashimoto's thyroiditis. She has a history of type 2 diabetes and myopia. Mrs. Anderson does not have a history of thyroid cancer. She should be tested for a possible autoimmune disorder.

In this file, the single-line header is where arguments are passed to the tokenizer. The header is followed by the NLP artifact section, in which each row contains the start offset, end offset, artifact ID, and reconciliation flag, delimited by tabs. A separator line (the boundary text) marks the end of the NLP artifact section, and the original document follows it.
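A file of this shape could be assembled with a few lines of Python. The sketch below is only an illustration of the layout described above; build_custom_document is a hypothetical helper, and the defaults of 1 and 5 are simply copied from the example header.

```python
SIGNATURE = "RU1FUlNFX0g="


def build_custom_document(text, rows, boundary, newlines=1, spaces=5):
    """rows: iterable of (start, end, token, flag), with flag 'T' or 'R'."""
    if not boundary or "|" in boundary or "\n" in boundary:
        raise ValueError("boundary must be non-empty, without pipes or newlines")
    lines = [f"{SIGNATURE}{newlines}|{spaces}|{boundary}"]
    for start, end, token, flag in rows:
        if flag not in ("T", "R"):
            raise ValueError(f"unknown reconciliation flag: {flag}")
        lines.append(f"{start}\t{end}\t{token}\t{flag}")
    lines.append(boundary)  # separator line closing the TSV section
    lines.append(text)      # the original document follows unchanged
    return "\n".join(lines)


BOUNDARY = "XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA=="
text = "Mrs. Anderson was seen in our clinic today."
rows = [(37, 42, "CUI_310367", "R"), (37, 42, "SMG_Drug", "R")]

output = build_custom_document(text, rows, BOUNDARY)
print(output)
print(text[37:42])  # the characters the first two rows refer to
```

Slicing the document text with a row’s offsets, as in the last line, is a quick way to check that the offsets point at the intended words before indexing.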

Here is a possible implementation of the EMERSE custom NLP pipeline:

<analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
        sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
        hasHeader="true"
        usingLuceneTokenizer="true"
    />
    <filter class="org.emerse.nlp.PostAnalysisFilterFactory" allowMisalignment="true"/>
</analyzer>

Query Analyzer

Solr indexing requires an analyzer chain (pipeline), and an analyzer chain must likewise be defined for Solr queries. Because EMERSE lets you customize the NLP pipeline, it is crucial that the tokens extracted from a query phrase precisely match those generated by the analyzer during the Solr indexing process.

Here is the query analyzer that EMERSE recommends:

<analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="nlp/models/sd-med-model.zip" tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"/>
    <filter class="org.emerse.nlp.TokenAdjusterFilterFactory" reservedPrefix="CUI_,POL_NEG,PPOL_NEG,APOL_NEG,TG_FAMILY,PTG_FAMILY,ATG_FAMILY,CT_UNCERTAIN,PCT_UNCERTAIN,ACT_UNCERTAIN,TG_HISTORY,PTG_HISTORY,ATG_HISTORY"/>
</analyzer>

org.emerse.nlp.TokenAdjusterFilterFactory is designed to work with the OpenNLPTokenizer created by solr.OpenNLPTokenizerFactory only. Failing to use the specified tokenizer will lead to unexpected results, which may impact the functionality of EMERSE.

Alternatively:

<analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
        sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
        usingLuceneTokenizer="true"
    />
    <filter class="org.emerse.nlp.PostAnalysisFilterFactory" addTokenPrefix="false"/>
</analyzer>

NLP Configuration Files

NLP configuration files, including the mapping file for the CharFilter, the model files for the standard NLP tokenizer, the UMLS dictionary for NER, and the pattern files for detection of negation, uncertainty, etc., are all located under DOCUMENT_INDEX_ROOT/config/nlp.

Delimiter Mapping

EMERSE versions prior to 7 include MappingCharFilter in both the indexing pipeline and the query pipeline. The mapping file, mapping-delimiters.txt, may contain mappings that remove dots in various character sets, for example:

"。" => ""

This type of mapping should be removed in version 7, as detecting a dot is a common way for NLP to identify sentence boundaries.

OpenNLP Tokenizer Model

Model files are in a subdirectory named models. By default, it includes OpenNLP-trained token, sentence, and POS models for English. If your documents are in a different language, models may be available on the OpenNLP website.

Pattern Files for Detection of Negation, Uncertainty and More

Sentiment detection in EMERSE relies on pattern matching. Newer technologies, such as convolutional neural networks (CNNs), were not adopted primarily due to their impact on indexing performance. To efficiently index large volumes of notes within a reasonable timeframe, pattern matching remains a viable solution.

EMERSE can detect negation, uncertainty, non-patient mentions, and patient history. Some detection types involve two pattern files: one for the detection patterns and another for overriding these patterns to address false positives. Examples are negations.txt with negationIgnore.txt, and history.txt with historyIgnore.txt. Here is a snippet of negation patterns:

denied symptoms of *****
***** couldn't be

The text in each line, e.g. couldn't be, is a trigger phrase that activates negation detection. The asterisks (*****) are a placeholder for whatever is being negated; they can be placed either before or after the trigger.

The negation-ignore file may include:

couldn't be excluded

Given the negation pattern ***** couldn't be and the sentence Pneumonia couldn't be excluded, negation will not be concluded, because the negation-ignore trigger couldn't be excluded overrides the match. Be aware that an ignore file contains triggers only, with no placeholder.
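The interaction between a pattern file and its ignore file can be sketched as follows. This is a simplified illustration using case-insensitive substring matching, not the EMERSE implementation (which works on tokens); the function names are hypothetical.

```python
PLACEHOLDER = "*****"


def parse_pattern(line):
    """Split a pattern line into (trigger, direction of the negated text)."""
    text = line.strip()
    if text.startswith(PLACEHOLDER):
        return text[len(PLACEHOLDER):].strip(), "before"
    if text.endswith(PLACEHOLDER):
        return text[:-len(PLACEHOLDER)].strip(), "after"
    raise ValueError(f"no placeholder in pattern: {line!r}")


def detect_negation(sentence, pattern_lines, ignore_lines):
    """Return (trigger, direction) for the first matching pattern, or None."""
    lowered = sentence.lower()
    ignores = [line.strip().lower() for line in ignore_lines if line.strip()]
    for line in pattern_lines:
        trigger, direction = parse_pattern(line)
        trigger = trigger.lower()
        if trigger not in lowered:
            continue
        # An ignore phrase that contains the trigger and occurs in the
        # sentence overrides the match.
        if any(trigger in phrase and phrase in lowered for phrase in ignores):
            continue
        return trigger, direction
    return None


patterns = ["denied symptoms of *****", "***** couldn't be"]
ignores = ["couldn't be excluded"]
print(detect_negation("Pneumonia couldn't be excluded", patterns, ignores))
print(detect_negation("Pneumonia couldn't be confirmed", patterns, ignores))
```

The first call finds no negation because the ignore trigger applies; the second reports the trigger with the negated text lying before it.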

Uncertainty uses a single pattern file, probability_resource_patterns.txt. Non-patient mention, on the other hand, has three pattern files: family_history_resource_patterns_A.txt is a regular pattern file, and family_history_resource_patterns_B.txt is a template file in which @@@@@ on each line is replaced by the familial terms defined in family_history_resource_patterns_C.txt.
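The template expansion for the B and C files might look like the sketch below. The template line and the familial terms are made-up examples, and the function name is hypothetical; only the @@@@@ slot comes from the description above.

```python
def expand_templates(template_lines, family_terms, slot="@@@@@"):
    """Produce one concrete pattern per (template, familial term) pair."""
    patterns = []
    for template in template_lines:
        template = template.strip()
        if not template:
            continue  # ignore blank lines
        for term in family_terms:
            patterns.append(template.replace(slot, term))
    return patterns


# Hypothetical contents of the B (template) and C (familial terms) files.
templates = ["@@@@@ was diagnosed with *****"]
terms = ["mother", "father"]
print(expand_templates(templates, terms))
```

Each expanded line then behaves like a regular pattern, with ***** still marking the text the rule applies to.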

If a mention is identified through a given rule, EMERSE tags the tokens before and/or after the rule's token pattern with the corresponding attribute ID, continuing until the sentence boundary is reached. The direction is defined by where ***** sits in the rule: a leading ***** extends the tagging towards the beginning of the sentence, and a trailing ***** extends it towards the end. Additionally, the tokens matching the rule itself are tagged with a distinct attribute ID identifying the rule. For example:

This is not a teddy.

If a negation rule is given as not *****, every token after not is tagged as negated (attribute ID POL_NEG), and not itself is tagged as the "trigger" for this negation (attribute ID PPOL_NEG). The tokens This and is are left untouched because ***** comes after not, which means only the tokens after the trigger are considered negated.

Token | This | is | not | a | teddy
Attribute ID | (none) | (none) | PPOL_NEG | POL_NEG | POL_NEG

From the above example, it is evident that a set of attributes is required to describe the occurrence of a mention.
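The tagging behaviour in this example can be sketched in a few lines. This is an illustration only: the token list is assumed to hold exactly one sentence, the trigger is given as lowercase tokens, and the function name is hypothetical.

```python
def tag_negation(tokens, trigger, direction):
    """Return one attribute ID (or None) per token of a single sentence.

    trigger is a list of lowercase tokens; direction is "after" when the
    placeholder follows the trigger and "before" when it precedes it.
    """
    tags = [None] * len(tokens)
    lowered = [token.lower() for token in tokens]
    width = len(trigger)
    for start in range(len(tokens) - width + 1):
        if lowered[start:start + width] == trigger:
            # The trigger tokens get the trigger attribute ID.
            for i in range(start, start + width):
                tags[i] = "PPOL_NEG"
            # Tokens on the placeholder side, up to the sentence boundary,
            # get the negation attribute ID.
            if direction == "after":
                scope = range(start + width, len(tokens))
            else:
                scope = range(0, start)
            for i in scope:
                tags[i] = "POL_NEG"
            break
    return tags


tokens = ["This", "is", "not", "a", "teddy"]
print(tag_negation(tokens, ["not"], "after"))
# [None, None, 'PPOL_NEG', 'POL_NEG', 'POL_NEG']
```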

UMLS Dictionary

The dictionary subdirectory contains the UMLS dictionary tailored by EMERSE using the CTAKES UI. The dictionary file is a SQL file, since that is the default output from CTAKES. EMERSE parses this SQL file and extracts all concepts together with their groups. Entity recognition through the CTAKES token filter requires a script file built with the CTAKES Dictionary Creator. (As of 2024, this program still required Java 8 to run properly.)

The EMERSE team has a default dictionary file to distribute to sites installing EMERSE or upgrading from prior versions that are not NLP-enabled. Sites simply need to provide us with a copy of a valid UMLS license first.

Sites wishing to make their own dictionary can create a custom one and specify the path of the script file for OpenNLPTokenizerFactory using its dictionary attribute.

By default, this dictionary resides at:

SOLR_HOME/DOCUMENT_CORE/conf/nlp/dictionary

UMLS Source Data

At the time of this writing (May 2024), the data used to match concepts in clinical notes comes from UMLS version 2023AB. A subset of source vocabularies and semantic types is included in the default EMERSE distribution, as specified in the tables below. These were selected because they are the most likely to have clinical relevance and to appear in EHR notes. Each semantic type is identified by a Type Unique Identifier (TUI).

Table 6. List of all UMLS source vocabularies, including those in the default EMERSE distribution, derived from UMLS version 2023AB
Abbreviation | Vocabulary Name | Included in EMERSE?
AIR | AI/RHEUM |
ALT | Alternative Billing Concepts |
AOD | Alcohol and Other Drug Thesaurus | YES
AOT | Authorized Osteopathic Thesaurus | YES
ATC | Anatomical Therapeutic Chemical Classification System |
BI | Beth Israel Problem List |
CCC | Clinical Care Classification |
CCPSS | Clinical Problem Statements |
CCS | Clinical Classifications Software |
CCSR_ICD10CM | Clinical Classifications Software Refined for ICD-10-CM |
CCSR_ICD10PCS | Clinical Classifications Software Refined for ICD-10-PCS |
CDCREC | Race & Ethnicity - CDC |
CDT | CDT |
CHV | Consumer Health Vocabulary | YES
COSTAR | COSTAR |
CPM | Medical Entities Dictionary |
CPT | CPT - Current Procedural Terminology | YES
CPTSP | CPT Spanish |
CSP | CRISP Thesaurus |
CST | COSTART |
CVX | Vaccines Administered |
DDB | Diseases Database | YES
DMDICD10 | ICD-10 German |
DMDUMD | UMDNS German |
DRUGBANK | DrugBank | YES
DSM-5 | Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition | YES
DXP | DXplain |
FMA | Foundational Model of Anatomy |
GO | Gene Ontology |
GS | Gold Standard Drug Database | YES
HCDT | CDT in HCPCS |
HCPCS | HCPCS - Healthcare Common Procedure Coding System | YES
HCPT | CPT in HCPCS |
HGNC | HUGO Gene Nomenclature Committee |
HL7V2.5 | HL7 Version 2.5 |
HL7V3.0 | HL7 Version 3.0 |
HLREL | ICPC2E ICD10 Relationships |
HPO | Human Phenotype Ontology | YES
ICD10 | International Classification of Diseases and Related Health Problems, Tenth Revision | YES
ICD10AE | ICD-10, American English Equivalents | YES
ICD10AM | ICD-10, Australian Modification |
ICD10AMAE | ICD-10, Australian Modification, Americanized English Equivalents |
ICD10CM | International Classification of Diseases, Tenth Revision, Clinical Modification | YES
ICD10DUT | ICD10, Dutch Translation |
ICD10PCS | ICD-10 Procedure Coding System | YES
ICD9CM | International Classification of Diseases, Ninth Revision, Clinical Modification |
ICF | International Classification of Functioning, Disability and Health | YES
ICF-CY | International Classification of Functioning, Disability and Health for Children and Youth | YES
ICNP | International Classification for Nursing Practice |
ICPC | International Classification of Primary Care |
ICPC2EDUT | ICPC2E Dutch |
ICPC2EENG | International Classification of Primary Care, 2nd Edition, Electronic |
ICPC2ICD10DUT | ICPC2-ICD10 Thesaurus, Dutch Translation |
ICPC2ICD10ENG | ICPC2-ICD10 Thesaurus |
ICPC2P | ICPC-2 PLUS |
ICPCBAQ | ICPC Basque |
ICPCDAN | ICPC Danish |
ICPCDUT | ICPC Dutch |
ICPCFIN | ICPC Finnish |
ICPCFRE | ICPC French |
ICPCGER | ICPC German |
ICPCHEB | ICPC Hebrew |
ICPCHUN | ICPC Hungarian |
ICPCITA | ICPC Italian |
ICPCNOR | ICPC Norwegian |
ICPCPOR | ICPC Portuguese |
ICPCSPA | ICPC Spanish |
ICPCSWE | ICPC Swedish |
JABL | Congenital Mental Retardation Syndromes | YES
KCD5 | Korean Standard Classification of Disease Version 5 |
LCH | Library of Congress Subject Headings |
LCH_NW | Library of Congress Subject Headings, Northwestern University subset |
LNC | LOINC | YES
LNC-DE-AT | LOINC Linguistic Variant - German, Austria |
LNC-DE-DE | LOINC Linguistic Variant - German, Germany |
LNC-EL-GR | LOINC Linguistic Variant - Greek, Greece |
LNC-ES-AR | LOINC Linguistic Variant - Spanish, Argentina |
LNC-ES-ES | LOINC Linguistic Variant - Spanish, Spain |
LNC-ES-MX | LOINC Linguistic Variant - Spanish, Mexico |
LNC-ET-EE | LOINC Linguistic Variant - Estonian, Estonia |
LNC-FR-BE | LOINC Linguistic Variant - French, Belgium |
LNC-FR-CA | LOINC Linguistic Variant - French, Canada |
LNC-FR-FR | LOINC Linguistic Variant - French, France |
LNC-IT-IT | LOINC Linguistic Variant - Italian, Italy |
LNC-KO-KR | LOINC Linguistic Variant - Korea, Korean |
LNC-NL-NL | LOINC Linguistic Variant - Dutch, Netherlands |
LNC-PL-PL | LOINC Linguistic Variant - Polish, Poland |
LNC-PT-BR | LOINC Linguistic Variant - Portuguese, Brazil |
LNC-RU-RU | LOINC Linguistic Variant - Russian, Russia |
LNC-TR-TR | LOINC Linguistic Variant - Turkish, Turkey |
LNC-UK-UA | LOINC Linguistic Variant - Ukrainian, Ukraine |
LNC-ZH-CN | LOINC Linguistic Variant - Chinese, China |
MCM | Glossary of Clinical Epidemiologic Terms |
MDR | MedDRA | YES
MDRARA | MedDRA Arabic |
MDRBPO | MedDRA Brazilian Portuguese |
MDRCZE | MedDRA Czech |
MDRDUT | MedDRA Dutch |
MDRFRE | MedDRA French |
MDRGER | MedDRA German |
MDRGRE | MedDRA Greek |
MDRHUN | MedDRA Hungarian |
MDRITA | MedDRA Italian |
MDRJPN | MedDRA Japanese |
MDRKOR | MedDRA Korean |
MDRLAV | MedDRA Latvian |
MDRPOL | MedDRA Polish |
MDRPOR | MedDRA Portuguese |
MDRRUS | MedDRA Russian |
MDRSPA | MedDRA Spanish |
MDRSWE | MedDRA Swedish |
MED-RT | Medication Reference Terminology |
MEDCIN | MEDCIN |
MEDLINEPLUS | MedlinePlus Health Topics | YES
MEDLINEPLUS_SPA | MedlinePlus Spanish Health Topics |
MMSL | Multum | YES
MMX | Micromedex | YES
MSH | MeSH | YES
MSHCZE | MeSH Czech |
MSHDUT | MeSH Dutch |
MSHFIN | MeSH Finnish |
MSHFRE | MeSH French |
MSHGER | MeSH German |
MSHITA | MeSH Italian |
MSHJPN | MeSH Japanese |
MSHLAV | MeSH Latvian |
MSHNOR | MeSH Norwegian |
MSHPOL | MeSH Polish |
MSHPOR | MeSH Portuguese |
MSHRUS | MeSH Russian |
MSHSCR | MeSH Croatian |
MSHSPA | MeSH Spanish |
MSHSWE | MeSH Swedish |
MTH | Metathesaurus Names |
MTHCMSFRF | Metathesaurus CMS Formulary Reference File |
MTHICD9 | ICD-9-CM Entry Terms |
MTHICPC2EAE | ICPC2E American English Equivalents |
MTHICPC2ICD10AE | ICPC2E-ICD10 Thesaurus, American English Equivalents |
MTHMST | Minimal Standard Terminology (UMLS) | YES
MTHMSTFRE | Minimal Standard Terminology French (UMLS) |
MTHMSTITA | Minimal Standard Terminology Italian (UMLS) |
MTHSPL | FDA Structured Product Labels |
MVX | Manufacturers of Vaccines | YES
NANDA-I | NANDA-I Taxonomy |
NCBI | NCBI Taxonomy |
NCI | NCI Thesaurus | YES
NCISEER | NCI SEER ICD Mappings |
NDDF | FDB MedKnowledge |
NEU | Neuronames Brain Hierarchy |
NIC | Nursing Interventions Classification |
NOC | Nursing Outcomes Classification |
NUCCHCPT | National Uniform Claim Committee - Health Care Provider Taxonomy |
OMIM | Online Mendelian Inheritance in Man |
OMS | Omaha System |
ORPHANET | ORPHANET | YES
PCDS | Patient Care Data Set |
PDQ | Physician Data Query |
PNDS | Perioperative Nursing Data Set | YES
PPAC | Pharmacy Practice Activity Classification |
PSY | Psychological Index Terms | YES
QMR | Quick Medical Reference |
RAM | Clinical Concepts by R A Miller |
RCD | Read Codes |
RCDAE | Read Codes Am Engl |
RCDSA | Read Codes Am Synth |
RCDSY | Read Codes Synth |
RXNORM | RXNORM | YES
SCTSPA | SNOMED CT Spanish Edition |
SNM | SNOMED 1982 |
SNMI | SNOMED Intl 1998 |
SNOMEDCT_US | SNOMED CT, US Edition | YES
SNOMEDCT_VET | SNOMED CT, Veterinary Extension |
SOP | Source of Payment Typology |
SPN | Standard Product Nomenclature |
SRC | Source Terminology Names (UMLS) |
TKMT | Traditional Korean Medical Terms |
ULT | UltraSTAR |
UMD | UMDNS |
USP | USP Compendial Nomenclature |
USPMG | USP Model Guidelines |
UWDA | Digital Anatomist |
VANDF | National Drug File | YES
WHO | WHOART |
WHOFRE | WHOART French |
WHOGER | WHOART German |
WHOPOR | WHOART Portuguese |
WHOSPA | WHOART Spanish |

Table 7. List of all UMLS Semantic Types, including those in the default EMERSE distribution, derived from UMLS version 2023AB
Type Unique Identifier (TUI) | Semantic Type Name | Included in EMERSE?
T001 | Organism |
T002 | Plant |
T004 | Fungus |
T005 | Virus | YES
T007 | Bacterium |
T008 | Animal |
T010 | Vertebrate |
T011 | Amphibian |
T012 | Bird |
T013 | Fish |
T014 | Reptile |
T015 | Mammal |
T016 | Human | YES
T017 | Anatomical Structure | YES
T018 | Embryonic Structure |
T019 | Congenital Abnormality | YES
T020 | Acquired Abnormality | YES
T021 | Fully Formed Anatomical Structure | YES
T022 | Body System | YES
T023 | Body Part, Organ, or Organ Component | YES
T024 | Tissue | YES
T025 | Cell | YES
T026 | Cell Component | YES
T028 | Gene or Genome |
T029 | Body Location or Region | YES
T030 | Body Space or Junction | YES
T031 | Body Substance | YES
T032 | Organism Attribute |
T033 | Finding | YES
T034 | Laboratory or Test Result | YES
T037 | Injury or Poisoning | YES
T038 | Biologic Function |
T039 | Physiologic Function | YES
T040 | Organism Function | YES
T041 | Mental Process | YES
T042 | Organ or Tissue Function | YES
T043 | Cell Function | YES
T044 | Molecular Function | YES
T045 | Genetic Function | YES
T046 | Pathologic Function | YES
T047 | Disease or Syndrome | YES
T048 | Mental or Behavioral Dysfunction | YES
T049 | Cell or Molecular Dysfunction | YES
T050 | Experimental Model of Disease | YES
T051 | Event |
T052 | Activity | YES
T053 | Behavior | YES
T054 | Social Behavior | YES
T055 | Individual Behavior | YES
T056 | Daily or Recreational Activity | YES
T057 | Occupational Activity | YES
T058 | Health Care Activity | YES
T059 | Laboratory Procedure | YES
T060 | Diagnostic Procedure | YES
T061 | Therapeutic or Preventive Procedure | YES
T062 | Research Activity |
T063 | Molecular Biology Research Technique |
T064 | Governmental or Regulatory Activity |
T065 | Educational Activity |
T066 | Machine Activity |
T067 | Phenomenon or Process |
T068 | Human-caused Phenomenon or Process |
T069 | Environmental Effect of Humans |
T070 | Natural Phenomenon or Process |
T071 | Entity |
T072 | Physical Object |
T073 | Manufactured Object |
T074 | Medical Device | YES
T075 | Research Device | YES
T077 | Conceptual Entity |
T078 | Idea or Concept |
T079 | Temporal Concept |
T080 | Qualitative Concept |
T081 | Quantitative Concept |
T082 | Spatial Concept |
T083 | Geographic Area |
T085 | Molecular Sequence |
T086 | Nucleotide Sequence |
T087 | Amino Acid Sequence |
T088 | Carbohydrate Sequence |
T089 | Regulation or Law |
T090 | Occupation or Discipline | YES
T091 | Biomedical Occupation or Discipline | YES
T092 | Organization |
T093 | Health Care Related Organization |
T094 | Professional Society |
T095 | Self-help or Relief Organization |
T096 | Group |
T097 | Professional or Occupational Group | YES
T098 | Population Group | YES
T099 | Family Group |
T100 | Age Group |
T101 | Patient or Disabled Group | YES
T102 | Group Attribute |
T103 | Chemical |
T104 | Chemical Viewed Structurally |
T109 | Organic Chemical | YES
T114 | Nucleic Acid, Nucleoside, or Nucleotide | YES
T116 | Amino Acid, Peptide, or Protein | YES
T120 | Chemical Viewed Functionally |
T121 | Pharmacologic Substance | YES
T122 | Biomedical or Dental Material | YES
T123 | Biologically Active Substance | YES
T125 | Hormone | YES
T126 | Enzyme | YES
T127 | Vitamin | YES
T129 | Immunologic Factor | YES
T130 | Indicator, Reagent, or Diagnostic Aid | YES
T131 | Hazardous or Poisonous Substance | YES
T167 | Substance | YES
T168 | Food | YES
T169 | Functional Concept | YES
T170 | Intellectual Product |
T171 | Language |
T184 | Sign or Symptom | YES
T185 | Classification | YES
T190 | Anatomical Abnormality | YES
T191 | Neoplastic Process | YES
T192 | Receptor |
T194 | Archaeon |
T195 | Antibiotic | YES
T196 | Element, Ion, or Isotope | YES
T197 | Inorganic Chemical | YES
T200 | Clinical Drug | YES
T201 | Clinical Attribute |
T203 | Drug Delivery Device | YES
T204 | Eukaryote |

Document Fields Example

<schema name="example" version="1.5">

	<!-- You can use different field names for all fields.
	Just be sure they match what the EMERSE Admin app expects the fields to be named. -->

	<uniqueKey>ID</uniqueKey>

	<!-- unique across sources -->
	<field name="ID" type="string" required="true"/>
	<!-- unique within a source, optional.
	For use by a user if they want to cross-reference
	a document in EMERSE with another system
	without having to demangle the cross-source unique ID above -->
	<field name="RPT_ID" type="string"/>

	<!-- Required fields -->
	<field name="MRN" type="string" required="true"/>
	<field name="ENCOUNTER_DATE" type="date" required="true"/>
	<field name="DOC_TYPE" type="string" required="true"/>
	<field name="SOURCE" type="string" required="true"/>
	<field name="RPT_TEXT" type="text_with_header_post_analysis" required="true"/>

	<!-- Required fields which are calculated by
	 the ETL system based on the encounter date and the birthdate of the patient. -->
	<field name="AGE_DAYS" type="int" required="true"/>
	<field name="AGE_MONTHS" type="int" required="true"/>

	<!-- The encounter ID, if available -->
	<field name="CSN" type="string"/>

	<!-- Other fields describing metadata -->
	<field name="ADMIT_DATE" type="date"/>
	<field name="RPT_DATE" type="date"/>
	<field name="DEPT" type="string" default="unknown"/>
	<field name="SVC" type="string" default="unknown"/>
	<field name="CLINICIAN" type="string" default="unknown"/>
	<field name="CATEGORY" type="string" multiValued="true"/>

	<!-- The NLP_HEADER is optional but cannot be renamed -->
	<field name="NLP_HEADER" type="string" indexed="false"/>

	<!-- Metadata fields related to indexing itself, provided by the ETL system -->
	<field name="INDEX_DATE" type="date"/>
	<field name="LAST_UPDATED" type="date"/>

	<!-- _version_ must be a field since it's used by Solr internally.
	Solr fills in the values automatically. -->
	<field name="_version_" type="long"/>

	<!-- Field Types -->
	<fieldType name="date" class="solr.DatePointField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="long" class="solr.LongPointField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="int" class="solr.IntPointField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="string" class="solr.StrField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="boolean" class="solr.BoolField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
    ....
</schema>
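To illustrate how an ETL step might fill the required fields, including the calculated AGE_DAYS and AGE_MONTHS, here is a minimal sketch. The field names come from the schema above; the function names, the sample values, and the exact rule for counting months (completed calendar months) are assumptions rather than EMERSE's documented behaviour.

```python
from datetime import date


def age_in_days(birth, encounter):
    """Whole days between the birthdate and the encounter date."""
    return (encounter - birth).days


def age_in_months(birth, encounter):
    """Completed calendar months between the two dates (assumed rule)."""
    months = (encounter.year - birth.year) * 12 + (encounter.month - birth.month)
    if encounter.day < birth.day:
        months -= 1
    return months


def solr_date(day):
    """Format a date as a UTC timestamp string for a Solr date field."""
    return day.strftime("%Y-%m-%dT00:00:00Z")


def build_document(doc_id, mrn, birth, encounter, doc_type, source, text):
    """Assemble the required fields of one document as a plain dict."""
    return {
        "ID": doc_id,
        "MRN": mrn,
        "ENCOUNTER_DATE": solr_date(encounter),
        "DOC_TYPE": doc_type,
        "SOURCE": source,
        "RPT_TEXT": text,
        "AGE_DAYS": age_in_days(birth, encounter),
        "AGE_MONTHS": age_in_months(birth, encounter),
    }


# Hypothetical values, for illustration only.
doc = build_document(
    "src1-0001", "12345", date(1980, 5, 20), date(2024, 5, 19),
    "Progress Note", "src1", "This is not a teddy.",
)
print(doc["AGE_MONTHS"], doc["ENCOUNTER_DATE"])
```

The resulting dict can be serialized to JSON and sent to Solr by whatever indexing client the site already uses.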

EMERSE Database

EMERSE stores NLP artifacts in two database tables. By default, only EMERSE built-in artifacts are populated. To include custom NLP artifacts within the EMERSE application, it is necessary to populate them into these tables accordingly.

Table SEMANTIC_GROUP

This table stores NLP semantic groups, which are discussed in the Token-Aligned Layered Index Structure section.

Column Name | Type | Description
GROUP_ID | VARCHAR(25) | Semantic group ID. This is the name stored in the Solr index to represent an entity.
LABEL | VARCHAR(50) | Group name: a user-friendly name for the semantic group, displayed in the UI.
BORDER_CSS | VARCHAR(50) | CSS value for "border-bottom", e.g. "0.3em dotted red". When a concept in the semantic group is highlighted in a document, elements of this CSS property are used for rendering.
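As an illustration of registering a custom semantic group, the sketch below creates a table with the three columns described above and inserts one row. It uses an in-memory SQLite database purely as a stand-in for the EMERSE database, and the group ID, label, and CSS value are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the real EMERSE database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE SEMANTIC_GROUP (
        GROUP_ID   VARCHAR(25),
        LABEL      VARCHAR(50),
        BORDER_CSS VARCHAR(50)
    )
    """
)

# Hypothetical custom group; GROUP_ID must match the name used in the index.
conn.execute(
    "INSERT INTO SEMANTIC_GROUP (GROUP_ID, LABEL, BORDER_CSS) VALUES (?, ?, ?)",
    ("SMG_TOY", "Toys", "0.3em dotted red"),
)
conn.commit()

rows = conn.execute(
    "SELECT GROUP_ID, LABEL, BORDER_CSS FROM SEMANTIC_GROUP"
).fetchall()
print(rows)
```

Against the real database, only the INSERT statement would be needed, adapted to that database's client library.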

Table NLP_TAG

This table maintains NLP attribute groups. Each group is the set of attributes that together describe the occurrence of a mention, as introduced in the pattern-file discussion above.

Column Name | Type | Description
GROUP_ID | VARCHAR(25) | Unique ID for a set of attributes. This ID is used to reference the set of attribute IDs.
QUERY_TAG | VARCHAR(25) | The attribute ID in the group that represents the mention.
HIGHLIGHT_TAGS | VARCHAR(100) | All attribute IDs in the group that can be highlighted in the document.
TAG_INCLUDED_PROMPT | VARCHAR(300) | Text describing what it signifies when this attribute is included in the query, i.e. when +[ATTRIBUTE_ID] is selected.
TAG_EXCLUDED_PROMPT | VARCHAR(300) | Text describing what it signifies when this attribute is excluded from the query, i.e. when -[ATTRIBUTE_ID] is selected.
TAG_NEUTRAL_PROMPT | VARCHAR(300) | Text describing what it signifies when this attribute is not specified in the query.
LABEL | VARCHAR(50) | The display name for this attribute group in the UI.
BORDER_CSS | VARCHAR(50) | CSS value for "border-bottom", e.g. "0.15em solid #EA3223". When an attribute group is highlighted in a document, elements of this CSS property are used for rendering.
DISPLAY_ORDER | NUMERIC | The position of this attribute group relative to others when displayed in the UI.
SUBWAY_ICON | VARCHAR(1) | A single letter that represents this attribute group. The letter appears in various places in the UI to indicate the selection status of the group.
SUBWAY_INCLUDED_PROMPT | VARCHAR(100) | The tooltip shown when the cursor hovers over the icon while this attribute group is included in the query.
SUBWAY_EXCLUDED_PROMPT | VARCHAR(100) | The tooltip shown when the cursor hovers over the icon while this attribute group is excluded from the query.
SUBWAY_NEUTRAL_PROMPT | VARCHAR(100) | The tooltip shown when the cursor hovers over the icon while this attribute group is not in the query.
HIDDEN_ON_UI | BOOLEAN | Indicates whether this attribute group should be hidden in the UI.
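Before loading a custom attribute group, it may help to check a candidate row against the column widths listed above. The sketch below does only that; the sample values, the comma used to separate IDs in HIGHLIGHT_TAGS, and the function name are assumptions, not documented EMERSE conventions.

```python
# Maximum lengths of the character columns of NLP_TAG, as listed above.
NLP_TAG_LIMITS = {
    "GROUP_ID": 25,
    "QUERY_TAG": 25,
    "HIGHLIGHT_TAGS": 100,
    "TAG_INCLUDED_PROMPT": 300,
    "TAG_EXCLUDED_PROMPT": 300,
    "TAG_NEUTRAL_PROMPT": 300,
    "LABEL": 50,
    "BORDER_CSS": 50,
    "SUBWAY_ICON": 1,
    "SUBWAY_INCLUDED_PROMPT": 100,
    "SUBWAY_EXCLUDED_PROMPT": 100,
    "SUBWAY_NEUTRAL_PROMPT": 100,
}


def validate_nlp_tag(row):
    """Return a list of problems; an empty list means the row fits."""
    problems = []
    for column, limit in NLP_TAG_LIMITS.items():
        value = row.get(column)
        if value is not None and len(value) > limit:
            problems.append(f"{column} exceeds {limit} characters")
    return problems


# Hypothetical row; the separator inside HIGHLIGHT_TAGS is an assumption.
row = {
    "GROUP_ID": "NEGATION",
    "QUERY_TAG": "POL_NEG",
    "HIGHLIGHT_TAGS": "POL_NEG,PPOL_NEG",
    "LABEL": "Negation",
    "BORDER_CSS": "0.15em solid #EA3223",
    "SUBWAY_ICON": "N",
}
print(validate_nlp_tag(row))
```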

EMERSE Configuration

For Solr to treat UMLS concepts and semantic groups as query terms, add the following configuration line to the emerse.properties file:

nlp.artifcat.prefix=CUI_ SMG_

This tells Solr that any term starting with CUI_ or SMG_ should not be subject to tokenization. If a custom pipeline is used, replace CUI_ and SMG_ with your own concept and group prefixes.
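The effect of the prefix list can be pictured with a small sketch: a term is treated as an NLP artifact when it starts with one of the configured prefixes. The parsing of the space-separated value and the function names are assumptions made for illustration, not EMERSE code.

```python
def parse_prefixes(value):
    """Split a space-separated prefix list such as 'CUI_ SMG_'."""
    return tuple(value.split())


def is_nlp_artifact(term, prefixes):
    """True when the term starts with any configured prefix."""
    return term.startswith(prefixes)


prefixes = parse_prefixes("CUI_ SMG_")
print(is_nlp_artifact("CUI_TEDDY", prefixes))
print(is_nlp_artifact("bear", prefixes))
```

The first call returns True, so that term would bypass tokenization; the second returns False and is handled as ordinary text.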

EMERSE UI Consideration

EMERSE queries a Solr field named nlp_display within the index. When this field is populated, its contents are shown in the NLP Detail section. This is useful when a site's NLP tool generates supplementary content, such as summaries or abstracts, based on the original document.

Additionally, sites can define separate Solr fields to store NLP outputs. These fields can then be configured as query filters, allowing more refined searches.

Query Syntax

A full introduction to the EMERSE query syntax is outside the scope of this document. However, it is worth describing how NLP artifacts are used in a query.

For NLP attributes, such as negation, the + and - operators are used, followed by square brackets [] containing the attribute's ID. These notations indicate whether the query phrase must carry the attribute (+) or must not carry it (-). For example, to query smoke history in the negated form, the query might be written as:

+[POL_NEG]"smoke history"

given POL_NEG as the attribute ID for negation.

A phrase can also be constrained by multiple attributes. For example, to query smoke history in the non-negated form and for the patient only, the query might be written as:

-[POL_NEG]-[TG_FAMILY]"smoke history"

given POL_NEG as the attribute ID for negation and TG_FAMILY as the attribute ID for non-patient mention.
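Query strings of this shape can be assembled programmatically. The helper below is a hypothetical sketch that reproduces the two examples above; the ordering used when included and excluded attributes are mixed is an assumption.

```python
def attribute_query(phrase, include=(), exclude=()):
    """Prefix a quoted phrase with +[ID] and -[ID] attribute constraints."""
    prefix = "".join(f"+[{attribute}]" for attribute in include)
    prefix += "".join(f"-[{attribute}]" for attribute in exclude)
    return f'{prefix}"{phrase}"'


print(attribute_query("smoke history", include=["POL_NEG"]))
# +[POL_NEG]"smoke history"
print(attribute_query("smoke history", exclude=["POL_NEG", "TG_FAMILY"]))
# -[POL_NEG]-[TG_FAMILY]"smoke history"
```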

NLP entities, on the other hand, are treated as regular phrases in a query. They can be combined with regular text phrases in the query. For example:

"CUI_TEDDY bear"

This query looks for any token or phrase that is mapped to the entity with ID CUI_TEDDY and is followed by the token bear. If CUI_TEDDY is mapped to "Teddy", "A teddy", and "Teddies", the above query is equivalent to:

"Teddy bear" OR "A teddy bear" OR "Teddies bear"