EMERSE NLP Guide

This guide describes how natural language processing (NLP) is incorporated into EMERSE, how to configure the system properly, and how to make changes so that you can incorporate your own annotations into the search index.

Background

Free text is highly complex and diverse in structure, so no single approach is likely to yield optimal results consistently. For this reason, the NLP capabilities built into EMERSE, which run efficiently while documents are sent to Solr for indexing, are built on well-established open-source libraries such as OpenNLP and CTAKES. The core features are rule-based information extraction and text classification.

NLP Implementation

NLP Element

EMERSE requires a highly customized Lucene Analyzer to pre-compute NLP artifacts, which are subsequently integrated with standard tokens for indexing purposes. These artifacts are aligned with tokens based upon their positional information and length.

There are two categories of artifacts:

Attribute: Additional information linked to an individual token. Multiple attributes can be associated with a single token.
Entity: A concept extracted from the original document through NLP analysis. An entity may span multiple tokens and is positioned on the first token it covers. Note that an entity may include symbols that the tokenizer skips, so an entity might contain characters that are not present in the index.

Token-Aligned Layered Index Structure

Here is an example to conceptually illustrate the Token-Aligned Layered Index using a sentence: This is not a teddy bear.

This	is	not	a	teddy	bear
		PPOL_NEG	POL_NEG	POL_NEG	POL_NEG
				CUI_TEDDY_BEAR
				SMG_STUFFED_ANIMALS

There are four layers in this index. Text tokens are in the first row. The following rows contain NLP artifacts aligned with the text tokens. Attributes, in the second row, are always associated with individual tokens. Entities (concepts, in this case) appear in the third row and may span multiple tokens. The fourth row holds a different type of entity, the semantic group. Each concept is assigned one or more semantic groups. For more information on semantic groups, see UMLS Semantic Group and Concept.
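
The layering can also be pictured as a small data structure. The sketch below is purely illustrative and does not reflect how EMERSE or Lucene store the index internally; the names `IndexedToken` and `build_layers` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedToken:
    """One token position plus the NLP artifacts stacked on it (hypothetical model)."""
    text: str
    attributes: list = field(default_factory=list)  # second layer, one token each
    entities: list = field(default_factory=list)    # (artifact_id, number_of_tokens_spanned)


def build_layers():
    tokens = [IndexedToken(t) for t in ["This", "is", "not", "a", "teddy", "bear"]]
    tokens[2].attributes.append("PPOL_NEG")      # "not" is the negation trigger
    for tok in tokens[3:]:
        tok.attributes.append("POL_NEG")         # tokens after the trigger are negated
    # An entity sits on the first token it covers and records its span in tokens.
    tokens[4].entities.append(("CUI_TEDDY_BEAR", 2))
    tokens[4].entities.append(("SMG_STUFFED_ANIMALS", 2))
    return tokens


layers = build_layers()
for tok in layers:
    print(tok.text, tok.attributes, tok.entities)
```

Keeping entities on their first token, together with a span length, mirrors the description above of how a multi-token concept is positioned.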

NLP Pipeline

A typical Lucene Analysis Chain includes:

Zero, one or more CharFilters;
A single Tokenizer;
Zero, one or more TokenFilters;

Standard Pipeline

EMERSE takes solr.HTMLStripCharFilterFactory as the default CharFilter in order to pre-process documents in HTML format.

A custom tokenizer has been developed with the following capabilities:

Sentence Boundary Detection: It can effectively identify sentence boundaries within the text.
Token-Level Text Segmentation: It segments the text into tokens within each sentence, preparing it for more intricate NLP tasks.
Document Header Recognition: If a document includes a header, the tokenizer can recognize and interpret it. This header, when present, offers valuable hints and instructions on how to conduct the tokenization process.

Ideally, this tokenizer should also assume responsibility for more intricate NLP tasks, including entity identification and attribute discovery. Nonetheless, a monolithic implementation of such tasks would introduce unnecessary complexity and a lack of modularity, thereby hindering the ability to update or replace specific functionalities individually.

EMERSE has encapsulated these NLP tasks within filters that process a token stream originating from either the tokenizer or other filters. These filters operate autonomously and are versatile in their ability to work together in any desired sequence. For instance, when utilizing filters A and B, the pipeline’s execution as A → B will yield the same result as B → A.

Here is a possible implementation of the EMERSE NLP pipeline as a Solr analyzer:

<analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
    <tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
        sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
        hasHeader="true"
        usingLuceneTokenizer="true"
    />
    <filter class="org.emerse.ctakes.CTAKESFilterFactory"
        excludeTypes="nlp/pos_exclude.txt"
        prefix="CUI"
        dictionary="nlp/dictionary/custom.script"
        posModel="nlp/models/mayo-pos.zip"
    />
    <filter class="org.emerse.nlp.TaggerFilterFactory"
        addTokenPrefix="false"
        delimiters="nlp/delimiters.txt"
        negations="nlp/negations.txt"
        negationIgnore="nlp/negationIgnore.txt"
        certainty="nlp/probability_resource_patterns.txt"
        family="nlp/family_history_resource_patterns_A.txt"
        familyTemplates="nlp/family_history_resource_patterns_B.txt"
        familyMembers="nlp/family_history_resource_patterns_C.txt"
        history="nlp/history.txt"
        historyIgnore="nlp/historyIgnore.txt"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
    />
    <filter class="org.emerse.nlp.PostAnalysisFilterFactory"/>
</analyzer>

Custom Pipeline

NLP is complex, and it is unrealistic to expect the EMERSE NLP implementation to meet the needs of every user. Users commonly have NLP entities and attributes generated by software packages external to EMERSE, and they expect EMERSE to index the original documents along with those externally generated NLP objects. To support this, the Custom Pipeline lets users submit the original document together with its NLP artifacts in a single file. Here is an example:

RU1FUlNFX0g=1|5|XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==
37	42	CUI_310367	R
37	42	SMG_Drug	R
66	90	CUI_4054523	R
66	90	SMG_Finding	R
71	90	CUI_13404	R
71	90	SMG_Finding	R
84	90	CUI_225386	R
84	90	SMG_Finding	R
95	99	CUI_5575035	R
95	99	SMG_Finding	R
103	111	CUI_5201148	R
103	111	SMG_Finding	R
112	119	CUI_15672	R
112	119	SMG_Finding	R
124	128	CUI_18494	R
124	128	SMG_Anatomy	R
124	133	CUI_2170	R
124	133	SMG_Disorder	R
135	138	PPOL_NEG	T
139	145	PPOL_NEG	T
146	152	POL_NEG	T
146	152	CUI_15967	R
146	152	SMG_Finding	R
154	158	POL_NEG	T
154	170	CUI_1971624	R
154	170	SMG_Finding	R
159	161	POL_NEG	T
162	170	POL_NEG	T
162	170	CUI_3618	R
162	170	SMG_Finding	R
172	174	POL_NEG	T
175	180	POL_NEG	T
175	180	CUI_817096	R
175	180	SMG_Anatomy	R
175	185	CUI_8031	R
175	185	SMG_Finding	R
181	185	POL_NEG	T
181	185	CUI_30193	R
181	185	SMG_Finding	R
191	197	PTG_FAMILY	T
198	201	PTG_FAMILY	T
202	211	TG_FAMILY	T
202	211	CUI_677607	R
202	211	SMG_Disorder	R
212	213	TG_FAMILY	T
214	225	TG_FAMILY	T
214	225	CUI_40147	R
214	225	SMG_Disorder	R
237	244	PTG_HISTORY	T
237	244	CUI_262926	R
237	244	SMG_Finding	R
245	247	TG_HISTORY	T
248	252	TG_HISTORY	T
253	254	TG_HISTORY	T
255	263	TG_HISTORY	T
255	263	CUI_11849	R
255	263	SMG_Disorder	R
255	263	CUI_11847	R
255	263	SMG_Disorder	R
264	267	TG_HISTORY	T
268	274	TG_HISTORY	T
268	274	CUI_27092	R
268	274	SMG_Disorder	R
290	294	PPOL_NEG	T
295	298	PPOL_NEG	T
299	303	PPOL_NEG	T
304	305	PPOL_NEG	T
306	313	PPOL_NEG	T
306	313	PTG_HISTORY	T
306	313	CUI_262926	R
306	313	SMG_Finding	R
314	316	PPOL_NEG	T
314	316	TG_HISTORY	T
317	324	POL_NEG	T
317	324	TG_HISTORY	T
317	324	CUI_40132	R
317	324	SMG_Anatomy	R
317	331	CUI_7115	R
317	331	SMG_Disorder	R
325	331	POL_NEG	T
325	331	TG_HISTORY	T
325	331	CUI_6826	R
325	331	SMG_Disorder	R
347	353	CUI_39593	R
347	353	SMG_Finding	R
360	368	PCT_UNCERTAIN	T
360	368	CUI_332149	R
360	368	SMG_Finding	R
369	379	CT_UNCERTAIN	T
369	379	CUI_4551524	R
369	379	SMG_Finding	R
369	388	CUI_4364	R
369	388	SMG_Disorder	R
380	388	CT_UNCERTAIN	T
380	388	CUI_12634	R
380	388	SMG_Disorder	R
XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==
Mrs. Anderson was seen in our clinic today. She is complaining of mild shortness of breath, as well as moderate fatigue and hair loss. She denies fevers, loss of appetite, or chest pain. Her mother had Hashimoto's thyroiditis. She has a history of type 2 diabetes and myopia. Mrs. Anderson does not have a history of thyroid cancer. She should be tested for a possible autoimmune disorder.

In this file, the single line header serves as the designated location for providing arguments to the tokenizer. Following the header is a dedicated NLP artifact section, where each row contains the start offset, end offset, artifact ID, and artifact type, which are delimited by tabs. A separator demarcates the conclusion of the NLP artifact section. Subsequently, the original document is appended.
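
As a rough illustration of this layout, the following sketch assembles such a file in memory. It is not part of EMERSE; the function name `build_custom_document` and its parameters are hypothetical, and the signature and separator strings are simply copied from the example above.

```python
HEADER_SIGNATURE = "RU1FUlNFX0g="
SEPARATOR = "XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA=="


def build_custom_document(text, artifacts, newlines=1, spaces=5, separator=SEPARATOR):
    """Assemble header, artifact rows, separator line, and original text.

    `artifacts` is an iterable of (start, end, artifact_id, flag) tuples.
    """
    header = f"{HEADER_SIGNATURE}{newlines}|{spaces}|{separator}"
    rows = [f"{start}\t{end}\t{artifact_id}\t{flag}"
            for start, end, artifact_id, flag in artifacts]
    # Every section ends with a line feed; the original text comes last.
    return "\n".join([header, *rows, separator, text])


doc = build_custom_document(
    "This is not a teddy bear.",
    [(8, 11, "PPOL_NEG", "T"), (14, 24, "CUI_TEDDY_BEAR", "R")],
)
print(doc)
```

The sketch does not check that the separator line is absent from the artifact rows or the document text, which a real producer would probably want to verify.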

The artifact ID is a unique identifier and must not contain spaces.

The offsets of an NLP artifact are defined by two integers: a start offset and an end offset. The start offset is the character position in the original text where the first token begins; the end offset is the position where the last token ends, plus 1. For example:

This is a teddy bear.

If an artifact is aligned with teddy bear, the corresponding line in the file should look like this:

10	20	CUI_TEDDY_BEAR	R
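
The offset arithmetic can be checked with a few lines of Python. This is only a sketch: the helper `artifact_offsets` is hypothetical, and it assumes that offsets count characters the same way Python string indices do, which may not hold for every character set.

```python
def artifact_offsets(text, phrase):
    """Return (start, end) of the first occurrence of `phrase`; `end` is exclusive."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return start, start + len(phrase)


start, end = artifact_offsets("This is a teddy bear.", "teddy bear")
# Emit a tab-delimited artifact row for an entity (flag R).
print(f"{start}\t{end}\tCUI_TEDDY_BEAR\tR")
```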

If an artifact is a concept and has a group associated with it, the group must appear on the line immediately following the concept.

4	18	CUI_APPLE_BANANA	R
4	18	SMG_FRUITS	R

Artifact type, as discussed earlier, is either Attribute or Entity:

NLP Artifact Type	Flag
Attribute	T
Entity	R
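
A reader for these rows might look like the sketch below. The names `Artifact` and `parse_artifact_row` are hypothetical and only illustrate the four tab-delimited fields and the two flags.

```python
from typing import NamedTuple

FLAG_TO_TYPE = {"T": "Attribute", "R": "Entity"}


class Artifact(NamedTuple):
    start: int
    end: int
    artifact_id: str
    kind: str


def parse_artifact_row(row):
    """Parse one 'start<TAB>end<TAB>id<TAB>flag' row into an Artifact."""
    start, end, artifact_id, flag = row.rstrip("\n").split("\t")
    if flag not in FLAG_TO_TYPE:
        raise ValueError(f"unknown artifact flag: {flag!r}")
    if " " in artifact_id:
        raise ValueError("artifact ID must not contain spaces")
    return Artifact(int(start), int(end), artifact_id, FLAG_TO_TYPE[flag])


parsed = parse_artifact_row("135\t138\tPPOL_NEG\tT")
print(parsed)
```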

Here is a possible implementation of the EMERSE custom NLP pipeline:

<analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
        sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
        hasHeader="true"
        usingLuceneTokenizer="true"
    />
    <filter class="org.emerse.nlp.PostAnalysisFilterFactory" allowMisalignment="true"/>
</analyzer>

In a custom pipeline, the hasHeader attribute on the HybridTokenizer must be true.

EMERSE Document Header

The EMERSE header is a single line of text terminated by a newline symbol. EMERSE recognizes the header row by its prefix signature: RU1FUlNFX0g=. The header accepts three positional arguments, delimited by the pipe character |.

Position	Description
1	Numeric, the number of newline symbols after which a sequence can be regarded as the end of a sentence. This parameter is required and must be greater than 0.
2	Numeric, the number of spaces after which a sequence can be regarded as the end of a sentence. This parameter is required and must be greater than 0.
3	String, a line of text that marks the end of the NLP artifact section. This parameter is optional and is needed only for a Custom Pipeline. For a Standard Pipeline, leave this parameter empty; supplying a value will break the indexing process and truncate the document.

Here is an example of an EMERSE header row for the Standard Pipeline:

RU1FUlNFX0g=1|5|

Here is an example of an EMERSE header row for the Custom Pipeline:

RU1FUlNFX0g=2|5|XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==

A valid EMERSE Document Header must always end with a line feed (LF): '\n'. The behavior is undefined if no LF is found at the end of the header.
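
The rules above can be summarized in a small validator. This is a sketch only, not the parser EMERSE uses; `parse_header` is a hypothetical name, and rejecting a missing line feed is a choice made here because the actual behavior is undefined.

```python
HEADER_SIGNATURE = "RU1FUlNFX0g="


def parse_header(line):
    """Split an EMERSE header row into its three positional arguments."""
    if not line.endswith("\n"):
        raise ValueError("header must end with a line feed")
    if not line.startswith(HEADER_SIGNATURE):
        raise ValueError("missing header signature")
    fields = line[len(HEADER_SIGNATURE):-1].split("|")
    if len(fields) != 3:
        raise ValueError("expected 3 pipe-delimited arguments")
    newlines, spaces, separator = int(fields[0]), int(fields[1]), fields[2]
    if newlines <= 0 or spaces <= 0:
        raise ValueError("positions 1 and 2 must be greater than 0")
    # An empty third argument means a Standard Pipeline header.
    return newlines, spaces, separator or None


standard = parse_header("RU1FUlNFX0g=1|5|\n")
custom = parse_header("RU1FUlNFX0g=2|5|XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==\n")
print(standard, custom)
```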

EMERSE Field Type

TextWithHeaderFieldType is a field type for the Solr index. If a document has an EMERSE header, whether it uses the Standard Pipeline or the Custom Pipeline, the field type must be org.emerse.solr.TextWithHeaderFieldType.

Solr and Lucene do not provide native support for indexing only specific portions of a document. Consequently, EMERSE has to create a custom field type. This field type is designed to distinguish the sections within a document so that EMERSE can process each of them appropriately.

Here is an example of defining an analysis chain on this field type:

<fieldType name="text_with_header_post_analysis" class="org.emerse.solr.TextWithHeaderFieldType"
           termPositions="true" termVectors="true" termOffsets="true" termPayloads="true">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
            standardTokenizer="true"
            hasHeader="true"
            tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
            sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
            numOfNewLineSymbols="2"
        />
        <filter class="org.emerse.nlp.PostAnalysisFilterFactory" allowMisalignment="true"/>
    </analyzer>
</fieldType>

EMERSE Document Field

To use org.emerse.solr.TextWithHeaderFieldType, the Solr field that stores documents must be defined with a field type of that class, for example:

<field name="RPT_TEXT" type="text_with_header_post_analysis"/>

The field must be indexed and stored. Term vectors must be enabled, along with term positions, term offsets, and term payloads (termVectors, termPositions, termOffsets, and termPayloads).

EMERSE Extra NLP Field

The NLP document header is not stored in the EMERSE Document Field because it is not part of the original document. However, in certain scenarios, e.g., document updates, it may be necessary to keep the NLP header information available.

<field name="NLP_HEADER" type="string" indexed="false" stored="true" docValues="true"/>

This field is stored but not indexed.

Tokenizer & Filters

Standard NLP Tokenizer

OpenNLPTokenizer is the tokenizer for the standard EMERSE NLP pipeline. It segments the provided text on a per-sentence basis. Sentence detection and tokenization are built upon OpenNLP.

It can be declared using org.emerse.opennlp.OpenNLPTokenizerFactory, which supports the following arguments:

Argument Name	Required	Default Value	Description
sentenceModel	Yes	N/A	String, the file path of the pre-trained sentence detection model for OpenNLP. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
tokenizerModel	Yes	N/A	String, the file path of the pre-trained tokenization model for OpenNLP. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
usingLuceneTokenizer	No	False	True or False. If True, tokens are reprocessed with the Lucene StandardTokenizer so that the indexed tokens match the behavior of the StandardTokenizer.
hasHeader	No	False	True or False. If True, the tokenizer expects a single-line header at the beginning of the document. If the document has a header, the Solr field type that stores the text must be org.emerse.solr.TextWithHeaderFieldType.
numOfNewLineSymbols	No	1	Numeric, the number of newline symbols (carriage returns and line feeds combined) that mark the end of a sentence. This is a hint that helps the tokenizer decide sentence boundaries.
numOfSpaces	No	5	Numeric, the number of spaces that mark the end of a sentence. This is a hint that helps the tokenizer decide sentence boundaries.

Custom NLP Tokenizer

HybridTokenizer is the tokenizer for the Custom Pipeline.

It can be declared using org.emerse.nlp.HybridTokenizerFactory, which supports the following arguments:

Argument Name	Required	Default Value	Description
standardTokenizer	No	False	True or False. If True, HybridTokenizer will be wrapped with the Lucene StandardTokenizer; if False, with OpenNLPTokenizer. If any NLP artifacts from the EMERSE NLP filters are to be added to the Solr index, OpenNLPTokenizer should be used.
maxTokenCount	No	255	Numeric, for StandardTokenizer only. It defines the maximum token length. A token that exceeds this length is split at maxTokenCount intervals.
lineFeedWidth	No	1	Numeric, the length of the line feed at the end of the header row. If nothing is specified, EMERSE expects a single line feed at the end of the header row.

Named Entity Recognition (NER) Filter

CTAKESFilter is a TokenFilter that extracts NLP entities using a preloaded dictionary. The extraction implementation is derived from the CTAKES fast dictionary lookup method. The dictionary available from EMERSE is compiled from the Unified Medical Language System (UMLS).

It can be declared using org.emerse.ctakes.CTAKESFilterFactory, which supports the following arguments:

Argument Name	Required	Default Value	Description
dictionary	Yes	N/A	String, the file path of the dictionary. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
prefix	Yes	N/A	String, the prefix that is prepended to the entity ID inserted into the index.
longestMatch	No	True	True or False. Applies when multiple entities are recognized and one of them encompasses all the others. If True, only the longest entity is added to the index; if False, all identified entities are added.
posModel	No	N/A	String, the file path of the POS (part-of-speech) model. The POS model, combined with an excludeTypes list, is used to exclude tokens with certain POS types from being identified as entities. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
excludeTypes	No	N/A	String, the file path of the list of POS (part-of-speech) types. If a token is tagged with any POS type in this list, no entity extraction is performed on it. This list takes effect only if a posModel is given.
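
One possible reading of the longestMatch option is sketched below: a match is dropped when a strictly longer match fully covers it. This is an interpretation for illustration, not the CTAKESFilter implementation; `keep_longest` is a hypothetical helper and the spans are taken from the sample file shown earlier.

```python
def keep_longest(matches):
    """Drop any match whose span lies entirely inside a strictly longer match.

    `matches` is a list of (start, end, entity_id) tuples with exclusive ends.
    """
    kept = []
    for m in matches:
        covered = any(
            o[0] <= m[0] and m[1] <= o[1] and (o[1] - o[0]) > (m[1] - m[0])
            for o in matches
        )
        if not covered:
            kept.append(m)
    return kept


matches = [
    (66, 90, "CUI_4054523"),
    (71, 90, "CUI_13404"),
    (84, 90, "CUI_225386"),
    (95, 99, "CUI_5575035"),
]
longest = keep_longest(matches)
print(longest)
```

Matches with identical spans are both kept under this reading, since neither is strictly longer than the other.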

UMLS Semantic Group and Concept

The UMLS semantic network reduces the complexity of the Metathesaurus by grouping concepts according to the semantic types that have been assigned to them.

The EMERSE UI enables users to highlight a document by Semantic Group. When the filter identifies a concept, its associated semantic groups are automatically linked to the concept. An example is given in Token-Aligned Layered Index Structure.

Filter for Negation, Uncertainty and More

TaggerFilter is a TokenFilter that is designed for detecting the following 4 types of mentions in the text:

Negation
Uncertainty
Non-patient mentions
Patient history

This filter uses rule-based matching for mention detection. Users can customize the rules and can turn each type of mention detection on or off.

It can be declared using org.emerse.nlp.TaggerFilterFactory, which supports the following arguments:

Argument Name	Required	Default Value	Description
delimiters	No	N/A	String, the file path for the delimiters. These delimiters play a crucial role in dividing a sentence into distinct sections. By doing so, they prevent the detection of mentions from extending beyond the confines of each section, avoiding potential inaccuracies at the sentence’s end. If a relative path is given, the root is against SOLR_DATA_ROOT/CORE_NAME/conf/
negations	No	N/A	String, the file path for the negation rules. If a relative path is given, the root is against SOLR_DATA_ROOT/CORE_NAME/conf/
negationIgnore	No	N/A	String, the file path for the negation ignore rules. This is useful for a case like can not be ignored. If can not is a negation rule, then without an ignore rule, can not be ignored could be wrongly identified as a negation. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
certainty	No	N/A	String, the file path for the uncertainty rules. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
family	No	N/A	String, the file path for the family rules that are in static mode. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
familyTemplates	No	N/A	String, the file path for the family rules that are in dynamic mode. This argument works together with familyMembers. The rules in familyTemplates act as format strings, and the placeholders in each format string are replaced by every rule defined in familyMembers. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
familyMembers	No	N/A	String, the file path for the family member definitions. This argument works together with familyTemplates. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
history	No	N/A	String, the file path for the patient history mention rules. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
historyIgnore	No	N/A	String, the file path for the history ignore rules. This is useful for a case like history on admit. If history is a history rule, then without an ignore rule, history on admit could be wrongly identified as a patient history mention. A relative path is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
addTokenPrefix	No	True	True or False, if True, this filter will add a prefix of TX_ and tx_ to tokens. Tokens with the TX_ prefix are case-sensitive, while tokens with the tx_ prefix are case-insensitive; If False, no prefix will be added and the tokens are sent to downstream as they are.
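
The interplay of familyTemplates and familyMembers can be pictured as a simple expansion step. The placeholder syntax is not given in this guide, so the `{member}` marker and the sample rules below are assumptions made purely for illustration, as is the helper name `expand_family_templates`.

```python
def expand_family_templates(templates, members, placeholder="{member}"):
    """Substitute every family member into every template (hypothetical syntax)."""
    return [t.replace(placeholder, m) for t in templates for m in members]


rules = expand_family_templates(
    ["{member} had *****", "***** in {member}"],
    ["mother", "father"],
)
print(rules)
```

With two templates and two members the expansion yields four concrete rules, which is why a short template file can stand in for a long static rule list.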

Group of Attribute IDs

If a mention is identified through a given rule, EMERSE tags the tokens before and/or after the rule's token pattern with the corresponding attribute ID, continuing until the sentence boundary is reached. The direction is defined as part of the rule: a leading ***** extends the tagging towards the beginning of the sentence, and a trailing ***** extends it towards the end of the sentence. In addition, the tokens that match the rule itself are tagged with a distinct attribute ID that identifies the rule. For example:

This is not a teddy.

If a negation rule is given as not *****, every token after not is tagged as negated (with POL_NEG as the attribute ID), and not itself is tagged as the "trigger" for this negation (with PPOL_NEG as the attribute ID). The tokens This is are left untouched because ***** comes after not, which means only the tokens after the trigger are considered negated.

This	is	not	a	teddy
		PPOL_NEG	POL_NEG	POL_NEG

From the above example, it is evident that a set of attributes is required to describe the occurrence of a mention.
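
The directional tagging can be sketched in a few lines. This is an illustration only, not the TaggerFilter implementation: `tag_mentions` is a hypothetical helper, matching is assumed to be case-insensitive, delimiters and ignore rules are left out, and the attribute IDs are passed in by the caller.

```python
WILDCARD = "*****"


def tag_mentions(tokens, rule, trigger_id, scope_id):
    """Return one attribute ID (or None) per token for one rule in one sentence."""
    parts = rule.split()
    pattern = [p.lower() for p in parts if p != WILDCARD]
    tags = [None] * len(tokens)
    if not pattern:
        return tags
    tag_after = parts[-1] == WILDCARD   # trailing wildcard: tag towards sentence end
    tag_before = parts[0] == WILDCARD   # leading wildcard: tag towards sentence start
    lowered = [t.lower() for t in tokens]
    width = len(pattern)
    for i in range(len(tokens) - width + 1):
        if lowered[i:i + width] == pattern:
            tags[i:i + width] = [trigger_id] * width
            if tag_after:
                tags[i + width:] = [scope_id] * (len(tokens) - i - width)
            if tag_before:
                tags[:i] = [scope_id] * i
            break  # only the first match is handled in this sketch
    return tags


tags = tag_mentions(["This", "is", "not", "a", "teddy"], "not *****", "PPOL_NEG", "POL_NEG")
print(tags)
```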

Post Analysis Consolidation Filter

The PostAnalysisFilter is a TokenFilter that must be placed at the end of the pipeline. It processes tokens using the Lucene StandardTokenizer. If the output tokens from StandardTokenizer differ from the input token, the input token is discarded, and the NLP artifacts that overlap with it are realigned with the output tokens. The filter may also add the prefixes TX_ and tx_ to tokens. Tokens with the TX_ prefix are case-sensitive, while tokens with the tx_ prefix are case-insensitive.

This filter can be declared using org.emerse.nlp.PostAnalysisFilterFactory, which supports the following argument:

Argument Name	Required	Default Value	Description
addTokenPrefix	No	True	True or False, if True, a token will be provided in both case-sensitive and case-insensitive forms and be prefixed with TX_ and tx_ respectively.
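
As a rough picture of what addTokenPrefix produces, the sketch below emits both forms of a token. It assumes the case-insensitive form is obtained by lowercasing, which this guide does not state; `prefixed_forms` is a hypothetical helper.

```python
def prefixed_forms(token):
    """Return the case-sensitive and case-insensitive index forms of a token."""
    # Assumption: the case-insensitive form is the lowercased token.
    return [f"TX_{token}", f"tx_{token.lower()}"]


forms = prefixed_forms("Teddy")
print(forms)
```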

TokenAdjusterFilter is a TokenFilter that is only necessary in the scenario of defining a query analyzer with the utilization of OpenNLPTokenizer.

The default query analyzer typically comes equipped with Lucene StandardTokenizer. In cases where OpenNLPTokenizer is employed during Solr indexing, the resulting tokens may deviate from those produced by StandardTokenizer. In such instances, TokenAdjusterFilter steps in to harmonize the output tokens, ensuring an exact match with the tokens generated during the Solr indexing process.

There is no argument available for this filter.

Query Analyzer

To facilitate Solr indexing, a mandatory component is the Analyzer chain (pipeline). Similarly, for Solr queries, an analyzer chain must be defined. Since EMERSE offers flexibility to customize the NLP pipeline, it’s crucial to emphasize that the tokens extracted from a query phrase should precisely match those generated by the analyzer during the Solr indexing process.

Here is the query analyzer that EMERSE recommends:

<analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="nlp/models/sd-med-model.zip" tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"/>
    <filter class="org.emerse.nlp.TokenAdjusterFilterFactory" reservedPrefix="CUI_,POL_NEG,PPOL_NEG,APOL_NEG,TG_FAMILY,PTG_FAMILY,ATG_FAMILY,CT_UNCERTAIN,PCT_UNCERTAIN,ACT_UNCERTAIN,TG_HISTORY,PTG_HISTORY,ATG_HISTORY"/>
</analyzer>

org.emerse.nlp.TokenAdjusterFilterFactory is designed to work with the OpenNLPTokenizer created by solr.OpenNLPTokenizerFactory only. Failing to use the specified tokenizer will lead to unexpected results, which may impact the functionality of EMERSE.

Alternatively:

<analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
        sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
        usingLuceneTokenizer="true"
    />
    <filter class="org.emerse.nlp.PostAnalysisFilterFactory" addTokenPrefix="false"/>
</analyzer>

Query Syntax

A full introduction to the EMERSE query syntax is beyond the scope of this document. However, it is worth describing how NLP artifacts are used in a query.

For NLP attributes, such as negation, the + and - operators are used, each followed by square brackets [] containing the attribute ID. These notations indicate whether the attribute must or must not align with the query phrase. For example, to query smoke history in the negated form, the query might be written as:

+[POL_NEG]"smoke history"

where POL_NEG is the attribute ID for negation.

A phrase can also be constrained by multiple attributes at once. For example, to query smoke history in the non-negated form for the patient only, the query might be written as:

-[POL_NEG]-[TG_FAMILY]"smoke history"

given POL_NEG as the attribute ID for negation and TG_FAMILY as the attribute ID for non-patient mention.
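
The attribute notation above can also be assembled programmatically. The following is a minimal sketch in Python, not part of EMERSE: the helper name build_attribute_query and its parameters are hypothetical, and it only reproduces the two query strings shown above.

```python
def build_attribute_query(phrase, include=(), exclude=()):
    """Prefix a quoted phrase with +[ID] / -[ID] attribute markers.

    Hypothetical helper for illustration; not an EMERSE API.
    """
    if '"' in phrase:
        raise ValueError("phrase must not contain double quotes")
    markers = [f"+[{attr}]" for attr in include]
    markers += [f"-[{attr}]" for attr in exclude]
    return "".join(markers) + f'"{phrase}"'


# Negated mentions of the phrase.
negated = build_attribute_query("smoke history", include=["POL_NEG"])
# Non-negated mentions that refer to the patient only.
patient_only = build_attribute_query(
    "smoke history", exclude=["POL_NEG", "TG_FAMILY"]
)
print(negated)
print(patient_only)
```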

NLP entities, on the other hand, are treated as regular phrases in a query and can be combined with regular text phrases. For example:

"CUI_TEDDY bear"

This query looks for any token or phrase that is mapped to the entity with ID CUI_TEDDY and is followed by the token bear. If CUI_TEDDY is mapped to "Teddy", "A teddy", and "Teddies", the above query is equivalent to:

"Teddy bear" OR "A teddy bear" OR "Teddies bear"

NLP Configuration Files

NLP configuration files are all located under DOCUMENT_INDEX_ROOT/config/nlp. They include the mapping file for the CharFilter, model files for the standard NLP tokenizer, the UMLS dictionary for NER, and pattern files for detecting negation, uncertainty, and so on.

Delimiter Mapping

Prior to version 7, EMERSE included MappingCharFilter in both the indexing pipeline and the query pipeline. The mapping file, mapping-delimiters.txt, may contain rules that remove dots from various character sets, for example:

"。" => ""

This type of mapping should be removed in version 7, as detecting a dot is a common method for NLP to identify sentence boundaries.
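
When upgrading, it may help to scan the mapping file for such rules. The following Python sketch assumes the usual "source" => "target" rule syntax with # comment lines. The set of dot characters is an assumption to adjust for your data, and backslash escapes such as \uXXXX in the file are not decoded.

```python
import re

# Assumed set of sentence-ending dots; extend it for your character sets.
DOT_CHARS = {".", "\u3002", "\uff0e", "\uff61"}

# Mapping rules have the form: "source" => "target"
RULE = re.compile(r'^\s*"(.*)"\s*=>\s*"(.*)"\s*$')


def find_dot_removals(lines):
    """Return (line_number, source) for rules that map a dot to nothing."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith("#"):
            continue  # comment line
        match = RULE.match(line)
        if match and match.group(1) in DOT_CHARS and match.group(2) == "":
            hits.append((number, match.group(1)))
    return hits


# Sample content; in practice read the lines of mapping-delimiters.txt.
sample = ["# delimiters", '"\u3002" => ""', '"_" => " "']
print(find_dot_removals(sample))
```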

OpenNLP Tokenizer Model

Model files are in a subdirectory named models. By default, it includes OpenNLP-trained tokenizer, sentence, and POS models for English. If the documents are in a different language, models may be available on the OpenNLP website.

Pattern Files for Detection of Negation, Uncertainty and More

Sentiment detection in EMERSE relies on pattern matching. Newer technologies, such as convolutional neural networks (CNNs), were not adopted primarily due to their impact on indexing performance. To efficiently index large volumes of notes within a reasonable timeframe, pattern matching remains a viable solution.

EMERSE can detect negation, uncertainty, non-patient mentions, and patient history. Some detection types involve two pattern files: one for the detection patterns and another for overriding these patterns to address false positives. For example, negations.txt is paired with negationIgnore.txt, and history.txt with historyIgnore.txt. Here is a snippet of negation patterns:

denied symptoms of *****
***** couldn't be

The text in each line, e.g. couldn't be, is a trigger phrase that activates negation detection. The run of asterisks is a placeholder for whatever is being negated, and it can be placed either before or after the trigger.

The negation-ignore file may include:

couldn't be excluded

For the negation pattern above that uses the trigger couldn't be, if the sentence is: Pneumonia couldn't be excluded, negation is not concluded, because the negation-ignore trigger couldn't be excluded matches. Note that the ignore file contains triggers only, without placeholders.
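
The interplay between trigger patterns and ignore triggers can be sketched as follows. This is a simplified Python illustration, not EMERSE's actual matcher. It assumes case-insensitive substring matching, treats any leading or trailing run of asterisks as the placeholder, and lets an ignore trigger override a pattern when the two overlap in the sentence. All function names are hypothetical.

```python
def parse_pattern(line):
    """Split a pattern line into (trigger, side of the negated text)."""
    text = line.strip()
    if text.startswith("*"):
        return text.lstrip("* ").lower(), "before"
    return text.rstrip("* ").lower(), "after"


def overlaps_ignore(lowered, start, end, ignores):
    """True when an ignore trigger overlaps the span [start, end)."""
    for phrase in ignores:
        pos = lowered.find(phrase.lower())
        if pos >= 0 and pos < end and start < pos + len(phrase):
            return True
    return False


def find_negated(sentence, patterns, ignores):
    """Return the negated text for the first matching pattern, else None."""
    lowered = sentence.lower()
    for line in patterns:
        trigger, side = parse_pattern(line)
        start = lowered.find(trigger)
        if start < 0:
            continue  # trigger not present in this sentence
        end = start + len(trigger)
        if overlaps_ignore(lowered, start, end, ignores):
            continue  # false positive suppressed by the ignore file
        if side == "before":
            return sentence[:start].strip()
        return sentence[end:].strip(" .")
    return None


patterns = ["denied symptoms of *****", "***** couldn't be"]
ignores = ["couldn't be excluded"]
print(find_negated("Patient denied symptoms of nausea.", patterns, ignores))
print(find_negated("Pneumonia couldn't be excluded.", patterns, ignores))
```

The first call reports nausea as negated, while the second returns None because the ignore trigger overrides the pattern.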

Uncertainty detection uses a single pattern file, probability_resource_patterns.txt. Non-patient mention detection, on the other hand, has three pattern files: family_history_resource_patterns_A.txt is a regular pattern file, family_history_resource_patterns_B.txt is a template file, and the @@@@@ in each of its lines can be replaced by the familial terms defined in family_history_resource_patterns_C.txt.
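
The template mechanism for non-patient mentions can be illustrated with a short Python sketch. The template line and familial terms below are made up for illustration and are not taken from the distributed files. The function expand_templates is hypothetical and only shows how each @@@@@ placeholder could be replaced by each term.

```python
PLACEHOLDER = "@@@@@"


def expand_templates(templates, terms):
    """Substitute every familial term into every template line."""
    expanded = []
    for template in templates:
        if PLACEHOLDER not in template:
            expanded.append(template)  # nothing to substitute
            continue
        for term in terms:
            expanded.append(template.replace(PLACEHOLDER, term))
    return expanded


# Hypothetical contents standing in for the B (template) and C (terms) files.
templates_b = ["@@@@@ was diagnosed with"]
terms_c = ["mother", "father"]
print(expand_templates(templates_b, terms_c))
```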

UMLS Dictionary

The dictionary subdirectory contains the UMLS dictionary tailored by EMERSE using the CTAKES UI. The dictionary file is a SQL file, since that is the default output from CTAKES. EMERSE parses this SQL file and extracts all concepts together with their groups. Entity recognition through the Standard Pipeline requires a script file built with the CTAKES Dictionary Creator. (As of 2024, this program still required Java 8 to run properly.)

The EMERSE team has a default dictionary file to distribute to sites installing EMERSE or upgrading from prior versions that are not NLP-enabled. Sites simply need to provide us with a copy of a valid UMLS license first.

Sites wishing to make their own dictionary can create a custom one and specify the path of the script file for OpenNLPTokenizerFactory using its dictionary attribute.

By default, this dictionary resides at:

SOLR_HOME/DOCUMENT_CORE/conf/nlp/dictionary

UMLS Source Data

At the time of this writing (May 2024), the data used to match concepts in the clinical notes comes from UMLS version 2023AB. A subset of source vocabularies and semantic types is included in the default EMERSE distribution, as specified in the tables below. These were selected because they are the most likely to have clinical relevance and to appear in EHR notes. Each semantic type is mapped to a Type Unique Identifier (TUI).

Table 1. List of all UMLS source vocabularies, including those in the default EMERSE distribution, derived from UMLS version 2023AB
Abbreviation	Vocabulary Name	Included in EMERSE?
AIR	AI/RHEUM
ALT	Alternative Billing Concepts
AOD	Alcohol and Other Drug Thesaurus	YES
AOT	Authorized Osteopathic Thesaurus	YES
ATC	Anatomical Therapeutic Chemical Classification System
BI	Beth Israel Problem List
CCC	Clinical Care Classification
CCPSS	Clinical Problem Statements
CCS	Clinical Classifications Software
CCSR_ICD10CM	Clinical Classifications Software Refined for ICD-10-CM
CCSR_ICD10PCS	Clinical Classifications Software Refined for ICD-10-PCS
CDCREC	Race & Ethnicity - CDC
CDT	CDT
CHV	Consumer Health Vocabulary	YES
COSTAR	COSTAR
CPM	Medical Entities Dictionary
CPT	CPT - Current Procedural Terminology	YES
CPTSP	CPT Spanish
CSP	CRISP Thesaurus
CST	COSTART
CVX	Vaccines Administered
DDB	Diseases Database	YES
DMDICD10	ICD-10 German
DMDUMD	UMDNS German
DRUGBANK	DrugBank	YES
DSM-5	Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition	YES
DXP	DXplain
FMA	Foundational Model of Anatomy
GO	Gene Ontology
GS	Gold Standard Drug Database	YES
HCDT	CDT in HCPCS
HCPCS	HCPCS - Healthcare Common Procedure Coding System	YES
HCPT	CPT in HCPCS
HGNC	HUGO Gene Nomenclature Committee
HL7V2.5	HL7 Version 2.5
HL7V3.0	HL7 Version 3.0
HLREL	ICPC2E ICD10 Relationships
HPO	Human Phenotype Ontology	YES
ICD10	International Classification of Diseases and Related Health Problems, Tenth Revision	YES
ICD10AE	ICD-10, American English Equivalents	YES
ICD10AM	ICD-10, Australian Modification
ICD10AMAE	ICD-10, Australian Modification, Americanized English Equivalents
ICD10CM	International Classification of Diseases, Tenth Revision, Clinical Modification	YES
ICD10DUT	ICD10, Dutch Translation
ICD10PCS	ICD-10 Procedure Coding System	YES
ICD9CM	International Classification of Diseases, Ninth Revision, Clinical Modification
ICF	International Classification of Functioning, Disability and Health	YES
ICF-CY	International Classification of Functioning, Disability and Health for Children and Youth	YES
ICNP	International Classification for Nursing Practice
ICPC	International Classification of Primary Care
ICPC2EDUT	ICPC2E Dutch
ICPC2EENG	International Classification of Primary Care, 2nd Edition, Electronic
ICPC2ICD10DUT	ICPC2-ICD10 Thesaurus, Dutch Translation
ICPC2ICD10ENG	ICPC2-ICD10 Thesaurus
ICPC2P	ICPC-2 PLUS
ICPCBAQ	ICPC Basque
ICPCDAN	ICPC Danish
ICPCDUT	ICPC Dutch
ICPCFIN	ICPC Finnish
ICPCFRE	ICPC French
ICPCGER	ICPC German
ICPCHEB	ICPC Hebrew
ICPCHUN	ICPC Hungarian
ICPCITA	ICPC Italian
ICPCNOR	ICPC Norwegian
ICPCPOR	ICPC Portuguese
ICPCSPA	ICPC Spanish
ICPCSWE	ICPC Swedish
JABL	Congenital Mental Retardation Syndromes	YES
KCD5	Korean Standard Classification of Disease Version 5
LCH	Library of Congress Subject Headings
LCH_NW	Library of Congress Subject Headings, Northwestern University subset
LNC	LOINC	YES
LNC-DE-AT	LOINC Linguistic Variant - German, Austria
LNC-DE-DE	LOINC Linguistic Variant - German, Germany
LNC-EL-GR	LOINC Linguistic Variant - Greek, Greece
LNC-ES-AR	LOINC Linguistic Variant - Spanish, Argentina
LNC-ES-ES	LOINC Linguistic Variant - Spanish, Spain
LNC-ES-MX	LOINC Linguistic Variant - Spanish, Mexico
LNC-ET-EE	LOINC Linguistic Variant - Estonian, Estonia
LNC-FR-BE	LOINC Linguistic Variant - French, Belgium
LNC-FR-CA	LOINC Linguistic Variant - French, Canada
LNC-FR-FR	LOINC Linguistic Variant - French, France
LNC-IT-IT	LOINC Linguistic Variant - Italian, Italy
LNC-KO-KR	LOINC Linguistic Variant - Korea, Korean
LNC-NL-NL	LOINC Linguistic Variant - Dutch, Netherlands
LNC-PL-PL	LOINC Linguistic Variant - Polish, Poland
LNC-PT-BR	LOINC Linguistic Variant - Portuguese, Brazil
LNC-RU-RU	LOINC Linguistic Variant - Russian, Russia
LNC-TR-TR	LOINC Linguistic Variant - Turkish, Turkey
LNC-UK-UA	LOINC Linguistic Variant - Ukrainian, Ukraine
LNC-ZH-CN	LOINC Linguistic Variant - Chinese, China
MCM	Glossary of Clinical Epidemiologic Terms
MDR	MedDRA	YES
MDRARA	MedDRA Arabic
MDRBPO	MedDRA Brazilian Portuguese
MDRCZE	MedDRA Czech
MDRDUT	MedDRA Dutch
MDRFRE	MedDRA French
MDRGER	MedDRA German
MDRGRE	MedDRA Greek
MDRHUN	MedDRA Hungarian
MDRITA	MedDRA Italian
MDRJPN	MedDRA Japanese
MDRKOR	MedDRA Korean
MDRLAV	MedDRA Latvian
MDRPOL	MedDRA Polish
MDRPOR	MedDRA Portuguese
MDRRUS	MedDRA Russian
MDRSPA	MedDRA Spanish
MDRSWE	MedDRA Swedish
MED-RT	Medication Reference Terminology
MEDCIN	MEDCIN
MEDLINEPLUS	MedlinePlus Health Topics	YES
MEDLINEPLUS_SPA	MedlinePlus Spanish Health Topics
MMSL	Multum	YES
MMX	Micromedex	YES
MSH	MeSH	YES
MSHCZE	MeSH Czech
MSHDUT	MeSH Dutch
MSHFIN	MeSH Finnish
MSHFRE	MeSH French
MSHGER	MeSH German
MSHITA	MeSH Italian
MSHJPN	MeSH Japanese
MSHLAV	MeSH Latvian
MSHNOR	MeSH Norwegian
MSHPOL	MeSH Polish
MSHPOR	MeSH Portuguese
MSHRUS	MeSH Russian
MSHSCR	MeSH Croatian
MSHSPA	MeSH Spanish
MSHSWE	MeSH Swedish
MTH	Metathesaurus Names
MTHCMSFRF	Metathesaurus CMS Formulary Reference File
MTHICD9	ICD-9-CM Entry Terms
MTHICPC2EAE	ICPC2E American English Equivalents
MTHICPC2ICD10AE	ICPC2E-ICD10 Thesaurus, American English Equivalents
MTHMST	Minimal Standard Terminology (UMLS)	YES
MTHMSTFRE	Minimal Standard Terminology French (UMLS)
MTHMSTITA	Minimal Standard Terminology Italian (UMLS)
MTHSPL	FDA Structured Product Labels
MVX	Manufacturers of Vaccines	YES
NANDA-I	NANDA-I Taxonomy
NCBI	NCBI Taxonomy
NCI	NCI Thesaurus	YES
NCISEER	NCI SEER ICD Mappings
NDDF	FDB MedKnowledge
NEU	Neuronames Brain Hierarchy
NIC	Nursing Interventions Classification
NOC	Nursing Outcomes Classification
NUCCHCPT	National Uniform Claim Committee - Health Care Provider Taxonomy
OMIM	Online Mendelian Inheritance in Man
OMS	Omaha System
ORPHANET	ORPHANET	YES
PCDS	Patient Care Data Set
PDQ	Physician Data Query
PNDS	Perioperative Nursing Data Set	YES
PPAC	Pharmacy Practice Activity Classification
PSY	Psychological Index Terms	YES
QMR	Quick Medical Reference
RAM	Clinical Concepts by R A Miller
RCD	Read Codes
RCDAE	Read Codes Am Engl
RCDSA	Read Codes Am Synth
RCDSY	Read Codes Synth
RXNORM	RXNORM	YES
SCTSPA	SNOMED CT Spanish Edition
SNM	SNOMED 1982
SNMI	SNOMED Intl 1998
SNOMEDCT_US	SNOMED CT, US Edition	YES
SNOMEDCT_VET	SNOMED CT, Veterinary Extension
SOP	Source of Payment Typology
SPN	Standard Product Nomenclature
SRC	Source Terminology Names (UMLS)
TKMT	Traditional Korean Medical Terms
ULT	UltraSTAR
UMD	UMDNS
USP	USP Compendial Nomenclature
USPMG	USP Model Guidelines
UWDA	Digital Anatomist
VANDF	National Drug File	YES
WHO	WHOART
WHOFRE	WHOART French
WHOGER	WHOART German
WHOPOR	WHOART Portuguese
WHOSPA	WHOART Spanish

Table 2. List of all UMLS Semantic Types, including those in the default EMERSE distribution, derived from UMLS version 2023AB
Type Unique Identifier (TUI)	Semantic Type Name	Included in EMERSE?
T001	Organism
T002	Plant
T004	Fungus
T005	Virus	YES
T007	Bacterium
T008	Animal
T010	Vertebrate
T011	Amphibian
T012	Bird
T013	Fish
T014	Reptile
T015	Mammal
T016	Human	YES
T017	Anatomical Structure	YES
T018	Embryonic Structure
T019	Congenital Abnormality	YES
T020	Acquired Abnormality	YES
T021	Fully Formed Anatomical Structure	YES
T022	Body System	YES
T023	Body Part, Organ, or Organ Component	YES
T024	Tissue	YES
T025	Cell	YES
T026	Cell Component	YES
T028	Gene or Genome
T029	Body Location or Region	YES
T030	Body Space or Junction	YES
T031	Body Substance	YES
T032	Organism Attribute
T033	Finding	YES
T034	Laboratory or Test Result	YES
T037	Injury or Poisoning	YES
T038	Biologic Function
T039	Physiologic Function	YES
T040	Organism Function	YES
T041	Mental Process	YES
T042	Organ or Tissue Function	YES
T043	Cell Function	YES
T044	Molecular Function	YES
T045	Genetic Function	YES
T046	Pathologic Function	YES
T047	Disease or Syndrome	YES
T048	Mental or Behavioral Dysfunction	YES
T049	Cell or Molecular Dysfunction	YES
T050	Experimental Model of Disease	YES
T051	Event
T052	Activity	YES
T053	Behavior	YES
T054	Social Behavior	YES
T055	Individual Behavior	YES
T056	Daily or Recreational Activity	YES
T057	Occupational Activity	YES
T058	Health Care Activity	YES
T059	Laboratory Procedure	YES
T060	Diagnostic Procedure	YES
T061	Therapeutic or Preventive Procedure	YES
T062	Research Activity
T063	Molecular Biology Research Technique
T064	Governmental or Regulatory Activity
T065	Educational Activity
T066	Machine Activity
T067	Phenomenon or Process
T068	Human-caused Phenomenon or Process
T069	Environmental Effect of Humans
T070	Natural Phenomenon or Process
T071	Entity
T072	Physical Object
T073	Manufactured Object
T074	Medical Device	YES
T075	Research Device	YES
T077	Conceptual Entity
T078	Idea or Concept
T079	Temporal Concept
T080	Qualitative Concept
T081	Quantitative Concept
T082	Spatial Concept
T083	Geographic Area
T085	Molecular Sequence
T086	Nucleotide Sequence
T087	Amino Acid Sequence
T088	Carbohydrate Sequence
T089	Regulation or Law
T090	Occupation or Discipline	YES
T091	Biomedical Occupation or Discipline	YES
T092	Organization
T093	Health Care Related Organization
T094	Professional Society
T095	Self-help or Relief Organization
T096	Group
T097	Professional or Occupational Group	YES
T098	Population Group	YES
T099	Family Group
T100	Age Group
T101	Patient or Disabled Group	YES
T102	Group Attribute
T103	Chemical
T104	Chemical Viewed Structurally
T109	Organic Chemical	YES
T114	Nucleic Acid, Nucleoside, or Nucleotide	YES
T116	Amino Acid, Peptide, or Protein	YES
T120	Chemical Viewed Functionally
T121	Pharmacologic Substance	YES
T122	Biomedical or Dental Material	YES
T123	Biologically Active Substance	YES
T125	Hormone	YES
T126	Enzyme	YES
T127	Vitamin	YES
T129	Immunologic Factor	YES
T130	Indicator, Reagent, or Diagnostic Aid	YES
T131	Hazardous or Poisonous Substance	YES
T167	Substance	YES
T168	Food	YES
T169	Functional Concept	YES
T170	Intellectual Product
T171	Language
T184	Sign or Symptom	YES
T185	Classification	YES
T190	Anatomical Abnormality	YES
T191	Neoplastic Process	YES
T192	Receptor
T194	Archaeon
T195	Antibiotic	YES
T196	Element, Ion, or Isotope	YES
T197	Inorganic Chemical	YES
T200	Clinical Drug	YES
T201	Clinical Attribute
T203	Drug Delivery Device	YES
T204	Eukaryote

Document Fields Example

<schema name="example" version="1.5">

	<!-- You can use different field names for all fields.
	Just be sure they match what the EMERSE Admin app expects the fields to be named. -->

	<uniqueKey>ID</uniqueKey>

	<!-- unique across sources -->
	<field name="ID" type="string" required="true"/>
	<!-- unique within a source, optional.
	For use by a user if they want to cross-reference
	a document in EMERSE with another system
	without having to demangle the cross-source unique ID above -->
	<field name="RPT_ID" type="string"/>

  <!-- Required fields -->
	<field name="MRN" type="string" required="true"/>
	<field name="ENCOUNTER_DATE" type="date" required="true"/>
	<field name="DOC_TYPE" type="string" required="true"/>
	<field name="SOURCE" type="string" required="true"/>
	<field name="RPT_TEXT" type="text_with_header_post_analysis" required="true"/>

	<!-- Required fields which are calculated by
	 the ETL system based on the encounter date and the birthdate of the patient. -->
	<field name="AGE_DAYS" type="int" required="true"/>
	<field name="AGE_MONTHS" type="int" required="true"/>

	<!-- The encounter ID, if available -->
	<field name="CSN" type="string"/>

	<!-- Other fields describing metadata -->
	<field name="ADMIT_DATE" type="date"/>
	<field name="RPT_DATE" type="date"/>
	<field name="DEPT" type="string" default="unknown"/>
	<field name="SVC" type="string" default="unknown"/>
	<field name="CLINICIAN" type="string" default="unknown"/>
	<field name="CATEGORY" type="string" multiValued="true"/>

	<!-- The NLP_HEADER is optional but cannot be renamed -->
	<field name="NLP_HEADER" type="string" indexed="false"/>

	<!-- Metadata fields related to indexing itself, provided by the ETL system -->
	<field name="INDEX_DATE" type="date"/>
	<field name="LAST_UPDATED" type="date"/>

	<!-- _version_ must be a field since it's used by Solr internally.
	Solr fills in the values automatically. -->
	<field name="_version_" type="long"/>

	<!-- Field Types -->
	<fieldType name="date" class="solr.DatePointField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="long" class="solr.LongPointField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="int" class="solr.IntPointField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="string" class="solr.StrField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="boolean" class="solr.BoolField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
    ....
</schema>
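
The two calculated fields in the schema, AGE_DAYS and AGE_MONTHS, are supplied by the ETL system. As a rough Python sketch of how an ETL step might derive them, the function below takes a birthdate and an encounter date. Counting AGE_MONTHS as completed calendar months is an assumption, since the exact definition is not stated here, and the function name age_fields is hypothetical.

```python
from datetime import date


def age_fields(birthdate, encounter_date):
    """Compute AGE_DAYS and AGE_MONTHS for one document.

    AGE_MONTHS counts completed calendar months; this definition is an
    assumption, so confirm it against your own ETL conventions.
    """
    if encounter_date < birthdate:
        raise ValueError("encounter date precedes birthdate")
    age_days = (encounter_date - birthdate).days
    months = (encounter_date.year - birthdate.year) * 12
    months += encounter_date.month - birthdate.month
    if encounter_date.day < birthdate.day:
        months -= 1  # the current month is not yet complete
    return {"AGE_DAYS": age_days, "AGE_MONTHS": months}


fields = age_fields(date(2000, 1, 15), date(2000, 3, 14))
print(fields)
```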

EMERSE Database

EMERSE stores NLP artifacts in two database tables. By default, only EMERSE built-in artifacts are populated. To include custom NLP artifacts within the EMERSE application, it is necessary to populate them into these tables accordingly.

Table SEMANTIC_GROUP

This table stores NLP semantic groups, which are discussed in: Token-Aligned Layered Index Structure

Column Name	Type	Description
GROUP_ID	VARCHAR(25)	Semantic Group ID. This is the name stored in the Solr index to represent an entity.
LABEL	VARCHAR(50)	Group name. A user-friendly name for the semantic group, displayed on the UI.
BORDER_CSS	VARCHAR(50)	CSS value for "border-bottom", e.g. "0.3em dotted red". When a concept in the semantic group is highlighted in the document, elements of this CSS property are used for rendering.
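
Populating a custom semantic group amounts to inserting one row with these three columns. The Python sketch below uses an in-memory SQLite database purely as a stand-in for the EMERSE database. The row values are illustrative, and the real table should be populated with your database's own client and conventions.

```python
import sqlite3

# In-memory SQLite stands in for the EMERSE database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE SEMANTIC_GROUP (
           GROUP_ID   VARCHAR(25),
           LABEL      VARCHAR(50),
           BORDER_CSS VARCHAR(50)
       )"""
)

# Illustrative custom group; GROUP_ID must match the name stored in Solr.
row = ("SMG_STUFFED_ANIMALS", "Stuffed Animals", "0.3em dotted red")
conn.execute(
    "INSERT INTO SEMANTIC_GROUP (GROUP_ID, LABEL, BORDER_CSS) VALUES (?, ?, ?)",
    row,
)
conn.commit()

stored = conn.execute("SELECT * FROM SEMANTIC_GROUP").fetchall()
print(stored)
```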

Table NLP_TAG

This table maintains NLP attribute groups, which are described in: Group of Attribute IDs

Column Name	Type	Description
GROUP_ID	VARCHAR(25)	Unique ID for a Group of Attribute IDs. This ID is used to reference a set of attribute IDs.
QUERY_TAG	VARCHAR(25)	The attribute ID in the group that represents the mention.
HIGHLIGHT_TAGS	VARCHAR(100)	All attribute IDs in the group that can be highlighted in the document.
TAG_INCLUDED_PROMPT	VARCHAR(300)	Text describing what it means when this attribute is included in the query, that is, when +[ATTRIBUTE_ID] is selected.
TAG_EXCLUDED_PROMPT	VARCHAR(300)	Text describing what it means when this attribute is excluded from the query, that is, when -[ATTRIBUTE_ID] is selected.
TAG_NEUTRAL_PROMPT	VARCHAR(300)	Text describing what it means when this attribute is not specified in the query.
LABEL	VARCHAR(50)	The name displayed on the UI for this attribute.
BORDER_CSS	VARCHAR(50)	CSS value for "border-bottom", e.g. "0.15em solid #EA3223". When an attribute group is highlighted in the document, elements of this CSS property are used for rendering.
DISPLAY_ORDER	NUMERIC	The position of this attribute group relative to others when displayed on the UI.
SUBWAY_ICON	VARCHAR(1)	A single letter that represents this attribute group. This letter appears in various locations on the UI to indicate the selection status of this group.
SUBWAY_INCLUDED_PROMPT	VARCHAR(100)	The tooltip text shown when the cursor hovers over the icon while this attribute group is included in the query.
SUBWAY_EXCLUDED_PROMPT	VARCHAR(100)	The tooltip text shown when the cursor hovers over the icon while this attribute group is excluded from the query.
SUBWAY_NEUTRAL_PROMPT	VARCHAR(100)	The tooltip text shown when the cursor hovers over the icon while this attribute group is not in the query.
HIDDEN_ON_UI	BOOLEAN	Indicates whether this attribute group should be hidden on the UI.

EMERSE Configuration

In order for Solr to properly treat UMLS concepts and semantic groups as query terms, the following configuration line should be added to the emerse.properties file:

nlp.artifcat.prefix=CUI_ SMG_

This tells Solr that any term starting with CUI_ or SMG_ should not be subject to tokenization. If a custom pipeline is used, replace CUI_ and SMG_ with your own concept and group prefixes.
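
The intended effect of the prefix list can be sketched in a few lines of Python. This is only an illustration of the rule described above, not how EMERSE or Solr implements it. The function names are hypothetical, and splitting the property value on whitespace is an assumption based on the example line.

```python
def parse_prefixes(property_value):
    """Split the space-separated property value into a tuple of prefixes."""
    return tuple(property_value.split())


def is_nlp_artifact(term, prefixes):
    """True when the term starts with one of the reserved prefixes."""
    return term.startswith(prefixes)


prefixes = parse_prefixes("CUI_ SMG_")
for term in ["CUI_TEDDY_BEAR", "SMG_STUFFED_ANIMALS", "teddy"]:
    # Artifact terms are kept whole; other terms go through tokenization.
    print(term, is_nlp_artifact(term, prefixes))
```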

EMERSE UI Consideration

EMERSE queries a Solr field named nlp_display within the index. When this field is populated, its contents are shown in the NLP Detail section. This is useful when a site's NLP tool generates supplementary content, such as summaries or abstracts, based on the original document.

Additionally, users have the option to establish distinct Solr fields dedicated to storing NLP outputs. This enables the configuration of these fields as query filters, facilitating more refined search capabilities.