
This guide describes how natural language processing (NLP) is incorporated into EMERSE, how to configure the system properly, and how to make changes so that you can incorporate your own annotations into the search index.

Background

Free text exhibits a high degree of complexity in its structure and diversity, making it improbable that any single approach will consistently yield optimal results. Because of this inherent complexity, the NLP capabilities integrated into EMERSE, which can be executed efficiently while the documents are sent to Solr for indexing, are built on top of well-established open-source libraries such as OpenNLP and CTAKES. These fundamental features encompass rule-based information extraction and text classification.

NLP Implementation

NLP Element

EMERSE requires a highly customized Lucene Analyzer to pre-compute NLP artifacts, which are subsequently integrated with standard tokens for indexing purposes. These artifacts are aligned with tokens based upon their positional information and length.

There exist two distinct categories of artifacts:

  • Attribute: additional information that can be linked to an individual token. Multiple attributes can be associated with a single token.

  • Entity: a concept extracted from the original document through NLP analysis. A concept may span multiple tokens and is positioned on the first token it encompasses. Note that an entity may include symbols that are skipped by the tokenizer, meaning an entity might contain characters not present in the index.

Token-Aligned Layered Index Structure

Here is an example to conceptually illustrate the Token-Aligned Layered Index using a sentence: This is not a teddy bear.

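Token | This | is | not | a | teddy | bear
Attribute | | | PPOL_NEG | POL_NEG | POL_NEG | POL_NEG
Concept | | | | | CUI_TEDDY_BEAR (spans teddy bear) |
Semantic group | | | | | SMG_STUFFED_ANIMALS (spans teddy bear) |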

There are four layers in this index. Text tokens are in the first row. The following rows contain NLP artifacts that are aligned with the text tokens. Attributes, in the second row, are always associated with individual tokens. Entities (concepts, in this case) are in the third row and may span multiple tokens. The fourth row holds a different type of entity, the semantic group. Each concept is assigned one or more semantic groups. More information on semantic groups can be found in the section UMLS Semantic Group and Concept.
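
The layering described above can be sketched in a few lines of Python. This is only a conceptual model, not the EMERSE implementation: the Token class, the whitespace tokenizer, and the helper names add_attribute and add_entity are all hypothetical, and offsets are treated as end-exclusive character positions.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    start: int  # offset of the first character
    end: int    # offset one past the last character
    attributes: list = field(default_factory=list)  # per-token layer
    entities: list = field(default_factory=list)    # (id, tokens spanned)


def tokenize(text):
    """Whitespace tokenizer that records character offsets (illustration only)."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(Token(word, start, end))
        pos = end
    return tokens


def add_attribute(tokens, start, end, attr_id):
    """Attach an attribute to every token that lies inside [start, end)."""
    for tok in tokens:
        if tok.start >= start and tok.end <= end:
            tok.attributes.append(attr_id)


def add_entity(tokens, start, end, entity_id):
    """Place an entity on the first token it covers and record its token span."""
    covered = [t for t in tokens if t.start < end and t.end > start]
    if covered:
        covered[0].entities.append((entity_id, len(covered)))


tokens = tokenize("This is not a teddy bear")
add_attribute(tokens, 8, 11, "PPOL_NEG")            # the trigger "not"
add_attribute(tokens, 12, 24, "POL_NEG")            # "a teddy bear"
add_entity(tokens, 14, 24, "CUI_TEDDY_BEAR")        # concept over "teddy bear"
add_entity(tokens, 14, 24, "SMG_STUFFED_ANIMALS")   # its semantic group

for tok in tokens:
    print(tok.text, tok.attributes, tok.entities)
```

Each token carries its own attribute list, while a multi-token entity is stored once, on the first token it covers, together with the number of tokens it spans.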

NLP Pipeline

A typical Lucene Analysis Chain includes:

  • Zero, one or more CharFilters;

  • A single Tokenizer;

  • Zero, one or more TokenFilters;

Standard Pipeline

EMERSE takes solr.HTMLStripCharFilterFactory as the default CharFilter in order to pre-process documents in HTML format.

A custom tokenizer has been developed with the following capabilities:

  • Sentence Boundary Detection: It can effectively identify sentence boundaries within the text.

  • Token-Level Text Segmentation: It segments the text into tokens within each sentence, preparing it for more intricate NLP tasks.

  • Document Header Recognition: If a document includes a header, the tokenizer can recognize and interpret it. This header, when present, offers valuable hints and instructions on how to conduct the tokenization process.

Ideally, this tokenizer should also assume responsibility for more intricate NLP tasks, including entity identification and attribute discovery. Nonetheless, a monolithic implementation of such tasks would introduce unnecessary complexity and a lack of modularity, thereby hindering the ability to update or replace specific functionalities individually.

EMERSE has encapsulated these NLP tasks within filters that process a token stream originating from either the tokenizer or other filters. These filters operate autonomously and are versatile in their ability to work together in any desired sequence. For instance, when utilizing filters A and B, the pipeline’s execution as A → B will yield the same result as B → A.
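
The order independence described above holds whenever each filter reads only the token text and merely adds artifacts. The following Python sketch illustrates that property with two hypothetical filters; it does not reproduce the logic of the actual EMERSE filters, and the filter names and the CUI_TEDDY identifier are assumptions made for the illustration.

```python
def negation_filter(stream):
    """Hypothetical filter: tags every token after the word 'not' with POL_NEG."""
    out, triggered = [], False
    for text, artifacts in stream:
        if triggered:
            artifacts = artifacts | {"POL_NEG"}
        if text.lower() == "not":
            triggered = True
        out.append((text, artifacts))
    return out


def concept_filter(stream):
    """Hypothetical filter: tags the token 'teddy' with a concept ID."""
    return [
        (text, artifacts | {"CUI_TEDDY"} if text.lower() == "teddy" else artifacts)
        for text, artifacts in stream
    ]


# A token stream is modelled as (text, set of artifact IDs) pairs.
stream = [(word, frozenset()) for word in "This is not a teddy".split()]

a_then_b = concept_filter(negation_filter(stream))
b_then_a = negation_filter(concept_filter(stream))
print(a_then_b == b_then_a)
```

Because neither filter inspects the artifacts added by the other, both orders produce the same annotated stream.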

Here is a possible implementation of the EMERSE NLP pipeline as a Solr analyzer:

<analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
    <tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
        sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
        hasHeader="true"
        usingLuceneTokenizer="true"
    />
    <filter class="org.emerse.ctakes.CTAKESFilterFactory"
        excludeTypes="nlp/pos_exclude.txt"
        prefix="CUI"
        dictionary="nlp/dictionary/custom.script"
        posModel="nlp/models/mayo-pos.zip"
    />
    <filter class="org.emerse.nlp.TaggerFilterFactory"
        addTokenPrefix="false"
        delimiters="nlp/delimiters.txt"
        negations="nlp/negations.txt"
        negationIgnore="nlp/negationIgnore.txt"
        certainty="nlp/probability_resource_patterns.txt"
        family="nlp/family_history_resource_patterns_A.txt"
        familyTemplates="nlp/family_history_resource_patterns_B.txt"
        familyMembers="nlp/family_history_resource_patterns_C.txt"
        history="nlp/history.txt"
        historyIgnore="nlp/historyIgnore.txt"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
    />
    <filter class="org.emerse.nlp.PostAnalysisFilterFactory"/>
</analyzer>

Custom Pipeline

NLP is a convoluted field, and it is unrealistic to expect that the EMERSE NLP implementation will meet the needs of every user. It is quite common for users to have NLP entities and attributes generated by software packages external to EMERSE. In such cases, users expect EMERSE to index the original documents along with NLP artifacts that EMERSE did not generate itself. To address this requirement, EMERSE provides an additional implementation that lets users submit the original document along with its NLP artifacts in a single file. Here is an example:

THIS_IS_A_HEADER
4	18	CUI_APPLE_BANANA	R
4	18	SMG_FRUITS	R
5	17	CUI_APPLE_BANANA_1	R
4	18	SMG_FRUITS_1	R
21	36	POL_NEG	T
THIS_IS_A_SEPARATOR
not +apple-banana--. Second sentence. Third sentence.

In this file, the single line header serves as the designated location for providing arguments to the tokenizer. Following the header is a dedicated NLP artifact section, where each row contains the start offset, end offset, artifact ID, and artifact type, which are delimited by tabs. A separator demarcates the conclusion of the NLP artifact section. Subsequently, the original document is appended.
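
To make the file layout concrete, here is a minimal Python sketch that splits such a file into its three parts. It is an illustration under stated assumptions, not the parser used by EMERSE: the function name parse_custom_document is hypothetical, the separator is passed in explicitly rather than read from the header, and the sample uses the placeholder header and separator from the example above.

```python
def parse_custom_document(raw, separator):
    """Split a custom-pipeline file into header, artifact rows and original text."""
    header, _, rest = raw.partition("\n")
    artifact_block, found, body = rest.partition(separator + "\n")
    if not found:
        raise ValueError("separator line not found")
    artifacts = []
    for line in artifact_block.splitlines():
        if not line:
            continue
        # Each row: start offset, end offset, artifact ID, artifact type (tab-delimited).
        start, end, artifact_id, flag = line.split("\t")
        artifacts.append((int(start), int(end), artifact_id, flag))
    return header, artifacts, body


sample = "\n".join([
    "THIS_IS_A_HEADER",
    "4\t18\tCUI_APPLE_BANANA\tR",
    "4\t18\tSMG_FRUITS\tR",
    "21\t36\tPOL_NEG\tT",
    "THIS_IS_A_SEPARATOR",
    "not +apple-banana--. Second sentence. Third sentence.",
])

header, artifacts, body = parse_custom_document(sample, "THIS_IS_A_SEPARATOR")
for start, end, artifact_id, flag in artifacts:
    # Offsets index into the original document, with an exclusive end.
    print(artifact_id, flag, repr(body[start:end]))
```

Slicing the original text with each pair of offsets shows which characters an artifact covers, including symbols such as + and - that a tokenizer may skip.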

The artifact ID is a unique identifier and must not contain any spaces.

An NLP artifact's offsets are defined by two integers: the start offset and the end offset. These offsets are the character positions within the original text where the first token begins and where the last token ends, plus 1. For example:

This is a teddy bear.

If an artifact is aligned with teddy bear, the corresponding line in the file should look like:

10	20	CUI_TEDDY_BEAR	R

If an artifact is a concept with an associated group, the group must appear on the line immediately following the concept.

4	18	CUI_APPLE_BANANA	R
4	18	SMG_FRUITS	R
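
The offset arithmetic can be checked with a short Python sketch. The helper name concept_lines is hypothetical, and the sketch assumes the phrase occurs exactly once in the text; it simply emits a concept row followed by its group row in the tab-delimited layout shown above.

```python
def concept_lines(text, phrase, concept_id, group_id=None):
    """Build the artifact rows for a concept and, optionally, its semantic group."""
    start = text.index(phrase)   # position where the first token begins
    end = start + len(phrase)    # position where the last token ends, plus 1
    lines = [f"{start}\t{end}\t{concept_id}\tR"]
    if group_id is not None:
        # The group row must directly follow the concept row.
        lines.append(f"{start}\t{end}\t{group_id}\tR")
    return lines


rows = concept_lines(
    "This is a teddy bear.", "teddy bear", "CUI_TEDDY_BEAR", "SMG_STUFFED_ANIMALS"
)
print("\n".join(rows))
```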

Artifact type, as discussed earlier, is either Attribute or Entity:

NLP Artifact Type | Flag
Attribute | T
Entity | R

Here is a possible implementation of the EMERSE custom NLP pipeline:

<analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
        sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
        hasHeader="true"
        usingLuceneTokenizer="true"
    />
    <filter class="org.emerse.nlp.PostAnalysisFilterFactory"/>
</analyzer>

In a custom pipeline, the hasHeader attribute on the HybridTokenizer must be true.

EMERSE Document Header

The EMERSE header is a single line of text terminated by a newline symbol. EMERSE recognizes the header row by a prefix signature: RU1FUlNFX0g=. The header accepts three positional arguments, which are separated by the pipe character |.

Position | Description
1 | Numeric, the number of new line symbols after which a sequence can be regarded as the conclusion of a sentence. This parameter is required and has to be a numeric value greater than 0.
2 | Numeric, the number of spaces after which a sequence can be regarded as the conclusion of a sentence. This parameter is required and has to be a numeric value greater than 0.
3 | String, a line of text indicating the end of the NLP artifact section. This parameter is optional and is only needed for a Custom Pipeline. For a Standard Pipeline, leave this parameter empty; supplying a value will break the indexing process and truncate the document.

Here is an example of an EMERSE header row for the Standard Pipeline:

RU1FUlNFX0g=1|5|

Here is an example of an EMERSE header row for the Custom Pipeline:

RU1FUlNFX0g=2|5|XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==

A valid EMERSE Document Header must always end with a line feed (LF): '\n'. The behavior is undefined if no LF is found at the end of the header.
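
The header format can be sketched in Python as follows. The functions build_header and parse_header are hypothetical helpers written for this illustration, not part of EMERSE; they only mirror the rules stated above (signature prefix, three pipe-separated positions, trailing line feed).

```python
SIGNATURE = "RU1FUlNFX0g="  # prefix signature taken from this guide


def build_header(num_newlines, num_spaces, separator=""):
    """Compose a header row; leave separator empty for a Standard Pipeline."""
    if num_newlines < 1 or num_spaces < 1:
        raise ValueError("positions 1 and 2 must be greater than 0")
    return f"{SIGNATURE}{num_newlines}|{num_spaces}|{separator}\n"


def parse_header(line):
    """Return (num_newlines, num_spaces, separator or None) for a header row."""
    if not line.startswith(SIGNATURE):
        raise ValueError("missing header signature")
    if not line.endswith("\n"):
        raise ValueError("header must end with a line feed")
    fields = line[len(SIGNATURE):-1].split("|", 2)
    if len(fields) != 3:
        raise ValueError("expected three positional arguments")
    num_newlines, num_spaces = int(fields[0]), int(fields[1])
    if num_newlines < 1 or num_spaces < 1:
        raise ValueError("positions 1 and 2 must be greater than 0")
    return num_newlines, num_spaces, fields[2] or None


standard = build_header(1, 5)
custom = build_header(2, 5, "XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==")
print(repr(standard))
print(parse_header(custom))
```

Splitting with a limit of two keeps any pipe characters inside the third position intact, which is a design choice of this sketch rather than documented EMERSE behavior.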

EMERSE Field Type

TextWithHeaderFieldType is a field type for the Solr index. If the document has an EMERSE header defined, whether it uses the Standard Pipeline or the Custom Pipeline, org.emerse.solr.TextWithHeaderFieldType has to be the field type.

Solr and Lucene do not provide native support for indexing only specific portions of a document. Consequently, EMERSE has to create a custom field type. This field type is designed to discern distinct sections within a document, enabling EMERSE to process them appropriately.

Here is an example of defining an analysis chain on this field type:

<fieldType name="text_with_header_post_analysis" class="org.emerse.solr.TextWithHeaderFieldType"
           termPositions="true" termVectors="true" termOffsets="true" termPayloads="true">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
            standardTokenizer="true"
            hasHeader="true"
            tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
            sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
            numOfNewLineSymbols="2"
        />
    </analyzer>
</fieldType>

EMERSE Document Field

To use org.emerse.solr.TextWithHeaderFieldType, declare the Solr field that stores the documents with that field type:

<field name="RPT_TEXT" type="text_with_header_post_analysis"/>

The field has to be indexed and stored. termVectors has to be enabled, along with termPositions, termOffsets and termPayloads.

EMERSE Extra NLP Field

The NLP document header is not stored in the EMERSE Document Field, since it is not part of the original document. However, in certain scenarios, such as document updates, it may be necessary to keep the NLP header information available.

<field name="NLP_HEADER" type="string" indexed="false" stored="true" docValues="true"/>

This field is stored but not indexed.

Tokenizer & Filters

Standard NLP Tokenizer

OpenNLPTokenizer is the tokenizer for the standard EMERSE NLP pipeline. It segments the provided text on a per-sentence basis. Sentence detection and tokenization are built upon OpenNLP.

It can be declared using org.emerse.opennlp.OpenNLPTokenizerFactory, which supports the following arguments:

Argument Name | Required | Default Value | Description
sentenceModel | Yes | N/A | String, the file path of the pre-trained sentence detection model for OpenNLP. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
tokenizerModel | Yes | N/A | String, the file path of the pre-trained tokenization model for OpenNLP. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
usingLuceneTokenizer | No | False | True or False. If True, tokens are reprocessed using Lucene's StandardTokenizer, which ensures that the indexed tokens adhere to the behavior of the StandardTokenizer.
hasHeader | No | False | True or False. If True, the tokenizer expects a single-line header at the beginning of the document. If the document has a header, the field type for storing the text in the Solr index has to be org.emerse.solr.TextWithHeaderFieldType
numOfNewLineSymbols | No | 1 | Numeric, the number of new line symbols (carriage return and line feed combined) before the end of a sentence. This is a hint to help the tokenizer decide the boundaries of sentences.
numOfSpaces | No | 5 | Numeric, the number of spaces before the end of a sentence. This is a hint to help the tokenizer decide the boundaries of sentences.
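
One plausible reading of the numOfNewLineSymbols and numOfSpaces hints is sketched below in Python. This is an assumption about how such thresholds could be applied, not the actual tokenizer logic, which combines these hints with the OpenNLP sentence model; the function name split_on_layout is hypothetical.

```python
import re


def split_on_layout(text, num_newlines=1, num_spaces=5):
    """Split text wherever a run of line breaks or spaces reaches the threshold."""
    if num_newlines < 1 or num_spaces < 1:
        raise ValueError("both thresholds must be greater than 0")
    # A carriage return followed by a line feed counts as one new line symbol.
    boundary = re.compile(r"(?:\r?\n){%d,}| {%d,}" % (num_newlines, num_spaces))
    return [piece.strip() for piece in boundary.split(text) if piece.strip()]


note = "Chief complaint\n\ncough     fever"
print(split_on_layout(note, num_newlines=2, num_spaces=5))
print(split_on_layout("line one\nline two", num_newlines=2))
```

With a threshold of two new line symbols, a single line break inside a paragraph does not end the sentence, while a blank line does.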

Custom NLP Tokenizer

HybridTokenizer is the tokenizer for the Custom Pipeline.

It can be declared using org.emerse.nlp.HybridTokenizerFactory, which supports the following arguments:

Argument Name | Required | Default Value | Description
standardTokenizer | No | False | True or False. If True, HybridTokenizer wraps the Lucene StandardTokenizer; if False, it wraps OpenNLPTokenizer. If the user decides to take any NLP artifact from the EMERSE NLP filters into the Solr index, OpenNLPTokenizer should be used.
maxTokenCount | No | 255 | Numeric, for StandardTokenizer only. It defines the maximum token length. If a token exceeds this length, it is split at maxTokenCount intervals.
lineFeedWidth | No | 1 | Numeric, the length of the line feed at the end of the header row. If nothing is specified, EMERSE expects a single line feed at the end of the header row.

Named Entity Recognition (NER) Filter

CTAKESFilter is a TokenFilter that extracts NLP entities using a preloaded dictionary. The extraction implementation is derived from the CTAKES fast dictionary lookup method. The dictionary that may be obtained from EMERSE is compiled from the Unified Medical Language System (UMLS).

It can be declared using org.emerse.ctakes.CTAKESFilterFactory, which supports the following arguments:

Argument Name | Required | Default Value | Description
dictionary | Yes | N/A | String, the file path of the dictionary. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
prefix | Yes | N/A | String, the prefix that is prepended to the entity ID that will be inserted into the index
longestMatch | No | True | True or False. This applies when multiple entities are recognized and one of them encompasses all the others. If True, only the longest entity is included in the index; if False, all identified entities are added to the index.
posModel | No | N/A | String, the file path of the POS (Parts-Of-Speech) model. The POS model, combined with an excludeTypes list, is used to exclude tokens with certain POS types from being identified as entities. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
excludeTypes | No | N/A | String, the file path of the POS (Parts-Of-Speech) types. If a token has been identified with any POS type in this list, no entity extraction is conducted against this token. This list can only be activated if a posModel is given.
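
The effect of the longestMatch argument can be illustrated with a small Python sketch. It is a simplified model rather than the CTAKES lookup itself: entities are represented as hypothetical (start, end, id) tuples with an exclusive end offset, and the concept IDs are made up for the example.

```python
def select_entities(spans, longest_match=True):
    """Keep all spans, or drop spans that sit inside a strictly longer span."""
    if not longest_match:
        return list(spans)
    kept = []
    for span in spans:
        start, end, _ = span
        covered = any(
            other is not span
            and other[0] <= start
            and end <= other[1]
            and (other[1] - other[0]) > (end - start)
            for other in spans
        )
        if not covered:
            kept.append(span)
    return kept


found = [
    (10, 20, "CUI_TEDDY_BEAR"),  # "teddy bear"
    (10, 15, "CUI_TEDDY"),       # "teddy"
    (16, 20, "CUI_BEAR"),        # "bear"
]
print(select_entities(found, longest_match=True))
print(select_entities(found, longest_match=False))
```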

UMLS Semantic Group and Concept

The UMLS semantic network reduces the complexity of the Metathesaurus by grouping concepts according to the semantic types that have been assigned to them.

EMERSE UI enables users to accentuate a document through Semantic Groups. When a concept is identified by the filter, the associated semantic groups are automatically linked to the concept. An example is given at: Token-Aligned Layered Index Structure.

Filter for Negation, Uncertainty and More

TaggerFilter is a TokenFilter that is designed for detecting the following 4 types of mentions in the text:

  • Negation

  • Uncertainty

  • Non-patient mentions

  • Patient history

This filter uses rule-based matching for mention detection. Users can customize the rules and can turn each type of mention detection on or off.

It can be declared using org.emerse.nlp.TaggerFilterFactory, which supports the following arguments:

Argument Name | Required | Default Value | Description
delimiters | No | N/A | String, the file path for the delimiters. These delimiters divide a sentence into distinct sections, which prevents the detection of mentions from extending beyond the confines of each section and avoids potential inaccuracies at the end of the sentence. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
negations | No | N/A | String, the file path for the negation rules. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
negationIgnore | No | N/A | String, the file path for the negation ignore rules. This is useful for a phrase such as can not be ignored: if Can not is a negation rule, then without the ignore rule, can not be ignored could be wrongly identified as a negation. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
certainty | No | N/A | String, the file path for the uncertainty rules. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
family | No | N/A | String, the file path for the family rules that are in static mode. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
familyTemplates | No | N/A | String, the file path for the family rules that are in dynamic mode. This argument works together with the familyMembers argument. The rules in familyTemplates act as format strings, and the placeholders in each format string are replaced by all the rules defined in familyMembers. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
familyMembers | No | N/A | String, the file path for the family member definitions. This argument works together with the familyTemplates argument. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
history | No | N/A | String, the file path for the patient history mention rules. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
historyIgnore | No | N/A | String, the file path for the history ignore rules. This is useful for a phrase such as history on admit: if History is a history rule, then without the ignore rule, history on admit could be wrongly identified as a patient history mention. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/
addTokenPrefix | No | True | True or False. If True, this filter adds the prefixes TX_ and tx_ to tokens. Tokens with the TX_ prefix are case-sensitive, while tokens with the tx_ prefix are case-insensitive. If False, no prefix is added and the tokens are passed downstream as they are.

Group of Attribute IDs

If a mention is identified through a given rule, EMERSE will tag tokens with the corresponding attribute ID before and/or after the token pattern in the rule until the sentence boundary is reached. The direction is defined in the rule itself, either by prepending *****, which indicates movement towards the beginning of the sentence, or by appending *****, which indicates movement towards the end of the sentence. Additionally, tokens matching the rule will be tagged with a distinct attribute ID that marks them as the trigger. For example:

This is not a teddy.

If a negation is provided as not *****, every token after not will be tagged as negation (POL_NEG as the attribute ID) and not itself will be tagged as the "trigger" for this negation (PPOL_NEG as the attribute ID). Tokens This is will not be touched since ***** is after not, which means only the tokens after the trigger will be considered as negated.

Token | This | is | not | a | teddy
Attribute | | | PPOL_NEG | POL_NEG | POL_NEG

From the above example, it is evident that a set of attributes is required to describe the occurrence of a mention.
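
The tagging behavior described in this section can be approximated with the following Python sketch. It is a simplified illustration and not the TaggerFilter implementation: the function apply_rule is hypothetical, matching is assumed to be case-insensitive, the token list is assumed to be a single sentence, and the same attribute ID is assumed for both directions.

```python
PLACEHOLDER = "*****"


def apply_rule(tokens, rule, attr_id):
    """Return one list of attribute IDs per token for a single rule."""
    parts = rule.split()
    forward = parts[-1] == PLACEHOLDER   # tag towards the end of the sentence
    backward = parts[0] == PLACEHOLDER   # tag towards the beginning
    trigger = [p.lower() for p in parts if p != PLACEHOLDER]
    tags = [[] for _ in tokens]
    if not trigger:
        return tags
    lowered = [t.lower() for t in tokens]
    width = len(trigger)
    for i in range(len(tokens) - width + 1):
        if lowered[i:i + width] != trigger:
            continue
        for j in range(i, i + width):
            tags[j].append("P" + attr_id)      # the trigger itself
        if forward:
            for j in range(i + width, len(tokens)):
                tags[j].append(attr_id)
        if backward:
            for j in range(0, i):
                tags[j].append(attr_id)
        break  # only the first match is handled in this sketch
    return tags


tokens = "This is not a teddy".split()
for token, attrs in zip(tokens, apply_rule(tokens, "not *****", "POL_NEG")):
    print(token, attrs)
```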

Post Analysis Consolidation Filter

The PostAnalysisFilter is a TokenFilter that must be positioned at the end of the pipeline. It processes tokens using the Lucene StandardTokenizer. If the output tokens from StandardTokenizer differ from the input token, the input token is discarded, and the NLP artifacts overlapping with this token are realigned with the output tokens. It may also add the prefixes TX_ and tx_ to tokens. Tokens with the TX_ prefix are case-sensitive, while tokens with the tx_ prefix are case-insensitive.

This filter can be declared using org.emerse.nlp.PostAnalysisFilterFactory, which supports the following argument:

Argument Name | Required | Default Value | Description
addTokenPrefix | No | True | True or False. If True, a token will be provided in both case-sensitive and case-insensitive forms and be prefixed with TX_ and tx_ respectively.

TokenAdjusterFilter is a TokenFilter that is only necessary in the scenario of defining a query analyzer with the utilization of OpenNLPTokenizer.

The default query analyzer typically comes equipped with Lucene StandardTokenizer. In cases where OpenNLPTokenizer is employed during Solr indexing, the resulting tokens may deviate from those produced by StandardTokenizer. In such instances, TokenAdjusterFilter steps in to harmonize the output tokens, ensuring an exact match with the tokens generated during the Solr indexing process.

There is no argument available for this filter.

Query Analyzer

To facilitate Solr indexing, a mandatory component is the Analyzer chain (pipeline). Similarly, for Solr queries, an analyzer chain must be defined. Since EMERSE offers flexibility to customize the NLP pipeline, it’s crucial to emphasize that the tokens extracted from a query phrase should precisely match those generated by the analyzer during the Solr indexing process.

Here is the query analyzer that EMERSE recommends:

<analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="nlp/models/sd-med-model.zip" tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"/>
    <filter class="org.emerse.nlp.TokenAdjusterFilterFactory" reservedPrefix="CUI_,POL_NEG,PPOL_NEG,APOL_NEG,TG_FAMILY,PTG_FAMILY,ATG_FAMILY,CT_UNCERTAIN,PCT_UNCERTAIN,ACT_UNCERTAIN,TG_HISTORY,PTG_HISTORY,ATG_HISTORY"/>
</analyzer>

org.emerse.nlp.TokenAdjusterFilterFactory is designed to work with the OpenNLPTokenizer created by solr.OpenNLPTokenizerFactory only. Failing to use the specified tokenizer will lead to unexpected results, which may impact the functionality of EMERSE.

Alternatively:

<analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
        tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
        sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
        usingLuceneTokenizer="true"
    />
    <filter class="org.emerse.nlp.PostAnalysisFilterFactory" addTokenPrefix="false"/>
</analyzer>

Query Syntax

A full introduction to the EMERSE query syntax is beyond the scope of this document. However, the integration of NLP artifacts into queries is worth describing.

For NLP attributes, such as negation, the + and - operators are used, followed by square brackets [] containing the attribute's ID. These notations indicate whether the attribute should align with the query phrase. For example, in order to query smoke history in the negated form, the query might be written as:

+[POL_NEG]"smoke history"

given POL_NEG as the attribute ID for negation.

Querying alignment with a phrase for multiple attributes is also permissible. For example, in order to query smoke history in the non-negated form for the patient only, the query might be written as:

-[POL_NEG]-[TG_FAMILY]"smoke history"

given POL_NEG as the attribute ID for negation and TG_FAMILY as the attribute ID for non-patient mention.
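
Queries of this form are easy to assemble programmatically. The helper below is a hypothetical convenience function, not part of EMERSE; it only concatenates the notation described above and does no escaping of the phrase.

```python
def attribute_query(phrase, require=(), exclude=()):
    """Build a phrase query with required (+) and excluded (-) attribute IDs."""
    prefix = "".join(f"+[{attr}]" for attr in require)
    prefix += "".join(f"-[{attr}]" for attr in exclude)
    return f'{prefix}"{phrase}"'


negated = attribute_query("smoke history", require=["POL_NEG"])
patient_only = attribute_query("smoke history", exclude=["POL_NEG", "TG_FAMILY"])
print(negated)
print(patient_only)
```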

NLP entities, on the other hand, are treated as regular phrases in a query. They can be combined with regular text phrases in the query. For example:

"CUI_TEDDY bear"

This query looks for any token or phrase that is mapped to the entity with the ID CUI_TEDDY and is followed by the token bear. If CUI_TEDDY is mapped to "Teddy", "A teddy" and "Teddies", the above query is equivalent to:

"Teddy bear" OR "A teddy bear" OR "Teddies bear"

NLP Configuration Files

NLP configuration files, including mapping file for the CharFilter, model files for standard NLP tokenizer, UMLS dictionary for NER and pattern files for detection of negation, uncertainty etc., are all located under DOCUMENT_INDEX_ROOT/config/nlp.

Delimiter Mapping

Versions of EMERSE prior to version 7 include MappingCharFilter in both the indexing pipeline and the query pipeline. The mapping file, mapping-delimiters.txt, may contain mappings that remove dots from various character sets, for example:

"。" => ""

This type of mapping should be removed in version 7, because detecting a dot is a common way for NLP to identify sentence boundaries.

OpenNLP Tokenizer Model

Model files are in a subdirectory named models. By default, it includes OpenNLP-trained token, sentence and POS models for English. If the documents are in a different language, models may be available on the OpenNLP website.

Pattern Files for Detection of Negation, Uncertainty and More

Sentiment detection in EMERSE relies on pattern matching. Newer technologies, such as convolutional neural networks (CNNs), were not adopted primarily due to their impact on indexing performance. To efficiently index large volumes of notes within a reasonable timeframe, pattern matching remains a viable solution.

EMERSE can detect negation, uncertainty, non-patient mentions, and patient history. Some detection types involve two pattern files: one for the detection patterns and another for overriding these patterns to address false positives. For example, negations.txt versus negationIgnore.txt, history.txt versus historyIgnore.txt. Here is a snippet of negation patterns:

denied symptoms of *****
***** couldn't be

The text in each line, e.g. couldn't be, is a trigger phrase that activates negation detection. The run of asterisks (*****) is a placeholder for anything that is being negated. It can be placed either before or after the trigger.

The negation-ignore file may include:

couldn't be excluded

For the given negation pattern ***** couldn't be, if the sentence is Pneumonia couldn't be excluded, no negation will be detected because of the negation-ignore trigger couldn't be excluded. Note that the ignore file contains triggers only.
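
The interplay between a trigger and an ignore trigger can be sketched in Python as follows. This is a simplified illustration that matches on characters rather than tokens and is not the EMERSE matching algorithm; the function names are hypothetical.

```python
def find_all(haystack, needle):
    """Yield every start position of needle in haystack."""
    pos = haystack.find(needle)
    while pos != -1:
        yield pos
        pos = haystack.find(needle, pos + 1)


def negation_triggers(sentence, triggers, ignore_triggers):
    """Return (start, end, trigger) for triggers not covered by an ignore trigger."""
    lowered = sentence.lower()
    ignored = [
        (pos, pos + len(phrase))
        for phrase in ignore_triggers
        for pos in find_all(lowered, phrase.lower())
    ]
    hits = []
    for trigger in triggers:
        for start in find_all(lowered, trigger.lower()):
            end = start + len(trigger)
            # A trigger is suppressed when an ignore phrase fully covers it.
            if not any(s <= start and end <= e for s, e in ignored):
                hits.append((start, end, trigger))
    return hits


triggers = ["couldn't be"]
ignore = ["couldn't be excluded"]
print(negation_triggers("Pneumonia couldn't be excluded", triggers, ignore))
print(negation_triggers("Pneumonia couldn't be confirmed", triggers, ignore))
```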

Uncertainty uses a single pattern file, probability_resource_patterns.txt. Non-patient mention, on the other hand, has three pattern files: family_history_resource_patterns_A.txt is a regular pattern file, family_history_resource_patterns_B.txt is a template file, and the @@@@@ in each of its lines is replaced by the familial terms defined in family_history_resource_patterns_C.txt.
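
The template expansion can be pictured with the Python sketch below. The template lines and family terms are invented for the illustration and are not taken from the distributed pattern files; the function expand_family_templates is likewise hypothetical.

```python
def expand_family_templates(templates, members, placeholder="@@@@@"):
    """Replace the placeholder in each template with every family term."""
    rules = []
    for template in templates:
        if placeholder not in template:
            rules.append(template)  # nothing to expand, keep as a static rule
            continue
        for member in members:
            rules.append(template.replace(placeholder, member))
    return rules


templates = ["@@@@@ has a history of *****", "family history of *****"]  # hypothetical
members = ["mother", "father"]                                           # hypothetical
for rule in expand_family_templates(templates, members):
    print(rule)
```

Each template yields one rule per family term, so a short template file combined with a list of terms can stand in for a much longer static pattern file.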

UMLS Dictionary

The dictionary subdirectory contains the UMLS dictionary tailored by EMERSE using the CTAKES UI. The dictionary file is a SQL file, since that is the default output from CTAKES. EMERSE parses this SQL file and extracts all concepts together with their groups. Entity recognition through the Standard Pipeline requires a script file built with the CTAKES Dictionary Creator (as of 2024, this program still required Java 8 to run properly).

The EMERSE team has a default dictionary file to distribute to sites installing EMERSE, or upgrading from prior versions that are not NLP-enabled. Sites simply need to provide us with a copy of a valid UMLS license first.

Users who wish to make their own dictionary can create a custom dictionary and specify the path of the script file using the dictionary attribute of CTAKESFilterFactory.

By default, this dictionary resides at:

SOLR_HOME/DOCUMENT_CORE/conf/nlp/dictionary

UMLS Source Data

At the time of this writing (May 2024), the source data used to match concepts in the clinical notes comes from UMLS version 2023AB. A subset of source vocabularies and semantic types is included in the default EMERSE distribution, as specified in the tables below. These were selected because they are the most likely to have clinical relevance and to appear in EHR notes. Each semantic type is mapped to a Type Unique Identifier (TUI).

Table 1. List of the all UMLS source vocabularies, including those in the default EMERSE distribution, derived from UMLS version 2023AB
Abbreviation Vocabulary Name Included in EMERSE?

AIR

AI/RHEUM

ALT

Alternative Billing Concepts

AOD

Alcohol and Other Drug Thesaurus

YES

AOT

Authorized Osteopathic Thesaurus

YES

ATC

Anatomical Therapeutic Chemical Classification System

BI

Beth Israel Problem List

CCC

Clinical Care Classification

CCPSS

Clinical Problem Statements

CCS

Clinical Classifications Software

CCSR_ICD10CM

Clinical Classifications Software Refined for ICD-10-CM

CCSR_ICD10PCS

Clinical Classifications Software Refined for ICD-10-PCS

CDCREC

Race & Ethnicity - CDC

CDT

CDT

CHV

Consumer Health Vocabulary

YES

COSTAR

COSTAR

CPM

Medical Entities Dictionary

CPT

CPT - Current Procedural Terminology

YES

CPTSP

CPT Spanish

CSP

CRISP Thesaurus

CST

COSTART

CVX

Vaccines Administered

DDB

Diseases Database

YES

DMDICD10

ICD-10 German

DMDUMD

UMDNS German

DRUGBANK

DrugBank

YES

DSM-5

Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition

YES

DXP

DXplain

FMA

Foundational Model of Anatomy

GO

Gene Ontology

GS

Gold Standard Drug Database

YES

HCDT

CDT in HCPCS

HCPCS

HCPCS - Healthcare Common Procedure Coding System

YES

HCPT

CPT in HCPCS

HGNC

HUGO Gene Nomenclature Committee

HL7V2.5

HL7 Version 2.5

HL7V3.0

HL7 Version 3.0

HLREL

ICPC2E ICD10 Relationships

HPO

Human Phenotype Ontology

YES

ICD10

International Classification of Diseases and Related Health Problems, Tenth Revision

YES

ICD10AE

ICD-10, American English Equivalents

YES

ICD10AM

ICD-10, Australian Modification

ICD10AMAE

ICD-10, Australian Modification, Americanized English Equivalents

ICD10CM

International Classification of Diseases, Tenth Revision, Clinical Modification

YES

ICD10DUT

ICD10, Dutch Translation

ICD10PCS

ICD-10 Procedure Coding System

YES

ICD9CM

International Classification of Diseases, Ninth Revision, Clinical Modification

ICF

International Classification of Functioning, Disability and Health

YES

ICF-CY

International Classification of Functioning, Disability and Health for Children and Youth

YES

ICNP

International Classification for Nursing Practice

ICPC

International Classification of Primary Care

ICPC2EDUT

ICPC2E Dutch

ICPC2EENG

International Classification of Primary Care, 2nd Edition, Electronic

ICPC2ICD10DUT

ICPC2-ICD10 Thesaurus, Dutch Translation

ICPC2ICD10ENG

ICPC2-ICD10 Thesaurus

ICPC2P

ICPC-2 PLUS

ICPCBAQ

ICPC Basque

ICPCDAN

ICPC Danish

ICPCDUT

ICPC Dutch

ICPCFIN

ICPC Finnish

ICPCFRE

ICPC French

ICPCGER

ICPC German

ICPCHEB

ICPC Hebrew

ICPCHUN

ICPC Hungarian

ICPCITA

ICPC Italian

ICPCNOR

ICPC Norwegian

ICPCPOR

ICPC Portuguese

ICPCSPA

ICPC Spanish

ICPCSWE

ICPC Swedish

JABL

Congenital Mental Retardation Syndromes

YES

KCD5

Korean Standard Classification of Disease Version 5

LCH

Library of Congress Subject Headings

LCH_NW

Library of Congress Subject Headings, Northwestern University subset

LNC

LOINC

YES

LNC-DE-AT

LOINC Linguistic Variant - German, Austria

LNC-DE-DE

LOINC Linguistic Variant - German, Germany

LNC-EL-GR

LOINC Linguistic Variant - Greek, Greece

LNC-ES-AR

LOINC Linguistic Variant - Spanish, Argentina

LNC-ES-ES

LOINC Linguistic Variant - Spanish, Spain

LNC-ES-MX

LOINC Linguistic Variant - Spanish, Mexico

LNC-ET-EE

LOINC Linguistic Variant - Estonian, Estonia

LNC-FR-BE

LOINC Linguistic Variant - French, Belgium

LNC-FR-CA

LOINC Linguistic Variant - French, Canada

LNC-FR-FR

LOINC Linguistic Variant - French, France

LNC-IT-IT

LOINC Linguistic Variant - Italian, Italy

LNC-KO-KR

LOINC Linguistic Variant - Korean, Korea

LNC-NL-NL

LOINC Linguistic Variant - Dutch, Netherlands

LNC-PL-PL

LOINC Linguistic Variant - Polish, Poland

LNC-PT-BR

LOINC Linguistic Variant - Portuguese, Brazil

LNC-RU-RU

LOINC Linguistic Variant - Russian, Russia

LNC-TR-TR

LOINC Linguistic Variant - Turkish, Turkey

LNC-UK-UA

LOINC Linguistic Variant - Ukrainian, Ukraine

LNC-ZH-CN

LOINC Linguistic Variant - Chinese, China

MCM

Glossary of Clinical Epidemiologic Terms

MDR

MedDRA

YES

MDRARA

MedDRA Arabic

MDRBPO

MedDRA Brazilian Portuguese

MDRCZE

MedDRA Czech

MDRDUT

MedDRA Dutch

MDRFRE

MedDRA French

MDRGER

MedDRA German

MDRGRE

MedDRA Greek

MDRHUN

MedDRA Hungarian

MDRITA

MedDRA Italian

MDRJPN

MedDRA Japanese

MDRKOR

MedDRA Korean

MDRLAV

MedDRA Latvian

MDRPOL

MedDRA Polish

MDRPOR

MedDRA Portuguese

MDRRUS

MedDRA Russian

MDRSPA

MedDRA Spanish

MDRSWE

MedDRA Swedish

MED-RT

Medication Reference Terminology

MEDCIN

MEDCIN

MEDLINEPLUS

MedlinePlus Health Topics

YES

MEDLINEPLUS_SPA

MedlinePlus Spanish Health Topics

MMSL

Multum

YES

MMX

Micromedex

YES

MSH

MeSH

YES

MSHCZE

MeSH Czech

MSHDUT

MeSH Dutch

MSHFIN

MeSH Finnish

MSHFRE

MeSH French

MSHGER

MeSH German

MSHITA

MeSH Italian

MSHJPN

MeSH Japanese

MSHLAV

MeSH Latvian

MSHNOR

MeSH Norwegian

MSHPOL

MeSH Polish

MSHPOR

MeSH Portuguese

MSHRUS

MeSH Russian

MSHSCR

MeSH Croatian

MSHSPA

MeSH Spanish

MSHSWE

MeSH Swedish

MTH

Metathesaurus Names

MTHCMSFRF

Metathesaurus CMS Formulary Reference File

MTHICD9

ICD-9-CM Entry Terms

MTHICPC2EAE

ICPC2E American English Equivalents

MTHICPC2ICD10AE

ICPC2E-ICD10 Thesaurus, American English Equivalents

MTHMST

Minimal Standard Terminology (UMLS)

YES

MTHMSTFRE

Minimal Standard Terminology French (UMLS)

MTHMSTITA

Minimal Standard Terminology Italian (UMLS)

MTHSPL

FDA Structured Product Labels

MVX

Manufacturers of Vaccines

YES

NANDA-I

NANDA-I Taxonomy

NCBI

NCBI Taxonomy

NCI

NCI Thesaurus

YES

NCISEER

NCI SEER ICD Mappings

NDDF

FDB MedKnowledge

NEU

Neuronames Brain Hierarchy

NIC

Nursing Interventions Classification

NOC

Nursing Outcomes Classification

NUCCHCPT

National Uniform Claim Committee - Health Care Provider Taxonomy

OMIM

Online Mendelian Inheritance in Man

OMS

Omaha System

ORPHANET

ORPHANET

YES

PCDS

Patient Care Data Set

PDQ

Physician Data Query

PNDS

Perioperative Nursing Data Set

YES

PPAC

Pharmacy Practice Activity Classification

PSY

Psychological Index Terms

YES

QMR

Quick Medical Reference

RAM

Clinical Concepts by R A Miller

RCD

Read Codes

RCDAE

Read Codes Am Engl

RCDSA

Read Codes Am Synth

RCDSY

Read Codes Synth

RXNORM

RXNORM

YES

SCTSPA

SNOMED CT Spanish Edition

SNM

SNOMED 1982

SNMI

SNOMED Intl 1998

SNOMEDCT_US

SNOMED CT, US Edition

YES

SNOMEDCT_VET

SNOMED CT, Veterinary Extension

SOP

Source of Payment Typology

SPN

Standard Product Nomenclature

SRC

Source Terminology Names (UMLS)

TKMT

Traditional Korean Medical Terms

ULT

UltraSTAR

UMD

UMDNS

USP

USP Compendial Nomenclature

USPMG

USP Model Guidelines

UWDA

Digital Anatomist

VANDF

National Drug File

YES

WHO

WHOART

WHOFRE

WHOART French

WHOGER

WHOART German

WHOPOR

WHOART Portuguese

WHOSPA

WHOART Spanish

Table 2. List of all UMLS Semantic Types, derived from UMLS version 2023AB, indicating which are included in the default EMERSE distribution
| Type Unique Identifier (TUI) | Semantic Type Name | Included in EMERSE? |
| --- | --- | --- |
| T001 | Organism | |
| T002 | Plant | |
| T004 | Fungus | |
| T005 | Virus | YES |
| T007 | Bacterium | |
| T008 | Animal | |
| T010 | Vertebrate | |
| T011 | Amphibian | |
| T012 | Bird | |
| T013 | Fish | |
| T014 | Reptile | |
| T015 | Mammal | |
| T016 | Human | YES |
| T017 | Anatomical Structure | YES |
| T018 | Embryonic Structure | |
| T019 | Congenital Abnormality | YES |
| T020 | Acquired Abnormality | YES |
| T021 | Fully Formed Anatomical Structure | YES |
| T022 | Body System | YES |
| T023 | Body Part, Organ, or Organ Component | YES |
| T024 | Tissue | YES |
| T025 | Cell | YES |
| T026 | Cell Component | YES |
| T028 | Gene or Genome | |
| T029 | Body Location or Region | YES |
| T030 | Body Space or Junction | YES |
| T031 | Body Substance | YES |
| T032 | Organism Attribute | |
| T033 | Finding | YES |
| T034 | Laboratory or Test Result | YES |
| T037 | Injury or Poisoning | YES |
| T038 | Biologic Function | |
| T039 | Physiologic Function | YES |
| T040 | Organism Function | YES |
| T041 | Mental Process | YES |
| T042 | Organ or Tissue Function | YES |
| T043 | Cell Function | YES |
| T044 | Molecular Function | YES |
| T045 | Genetic Function | YES |
| T046 | Pathologic Function | YES |
| T047 | Disease or Syndrome | YES |
| T048 | Mental or Behavioral Dysfunction | YES |
| T049 | Cell or Molecular Dysfunction | YES |
| T050 | Experimental Model of Disease | YES |
| T051 | Event | |
| T052 | Activity | YES |
| T053 | Behavior | YES |
| T054 | Social Behavior | YES |
| T055 | Individual Behavior | YES |
| T056 | Daily or Recreational Activity | YES |
| T057 | Occupational Activity | YES |
| T058 | Health Care Activity | YES |
| T059 | Laboratory Procedure | YES |
| T060 | Diagnostic Procedure | YES |
| T061 | Therapeutic or Preventive Procedure | YES |
| T062 | Research Activity | |
| T063 | Molecular Biology Research Technique | |
| T064 | Governmental or Regulatory Activity | |
| T065 | Educational Activity | |
| T066 | Machine Activity | |
| T067 | Phenomenon or Process | |
| T068 | Human-caused Phenomenon or Process | |
| T069 | Environmental Effect of Humans | |
| T070 | Natural Phenomenon or Process | |
| T071 | Entity | |
| T072 | Physical Object | |
| T073 | Manufactured Object | |
| T074 | Medical Device | YES |
| T075 | Research Device | YES |
| T077 | Conceptual Entity | |
| T078 | Idea or Concept | |
| T079 | Temporal Concept | |
| T080 | Qualitative Concept | |
| T081 | Quantitative Concept | |
| T082 | Spatial Concept | |
| T083 | Geographic Area | |
| T085 | Molecular Sequence | |
| T086 | Nucleotide Sequence | |
| T087 | Amino Acid Sequence | |
| T088 | Carbohydrate Sequence | |
| T089 | Regulation or Law | |
| T090 | Occupation or Discipline | YES |
| T091 | Biomedical Occupation or Discipline | YES |
| T092 | Organization | |
| T093 | Health Care Related Organization | |
| T094 | Professional Society | |
| T095 | Self-help or Relief Organization | |
| T096 | Group | |
| T097 | Professional or Occupational Group | YES |
| T098 | Population Group | YES |
| T099 | Family Group | |
| T100 | Age Group | |
| T101 | Patient or Disabled Group | YES |
| T102 | Group Attribute | |
| T103 | Chemical | |
| T104 | Chemical Viewed Structurally | |
| T109 | Organic Chemical | YES |
| T114 | Nucleic Acid, Nucleoside, or Nucleotide | YES |
| T116 | Amino Acid, Peptide, or Protein | YES |
| T120 | Chemical Viewed Functionally | |
| T121 | Pharmacologic Substance | YES |
| T122 | Biomedical or Dental Material | YES |
| T123 | Biologically Active Substance | YES |
| T125 | Hormone | YES |
| T126 | Enzyme | YES |
| T127 | Vitamin | YES |
| T129 | Immunologic Factor | YES |
| T130 | Indicator, Reagent, or Diagnostic Aid | YES |
| T131 | Hazardous or Poisonous Substance | YES |
| T167 | Substance | YES |
| T168 | Food | YES |
| T169 | Functional Concept | YES |
| T170 | Intellectual Product | |
| T171 | Language | |
| T184 | Sign or Symptom | YES |
| T185 | Classification | YES |
| T190 | Anatomical Abnormality | YES |
| T191 | Neoplastic Process | YES |
| T192 | Receptor | |
| T194 | Archaeon | |
| T195 | Antibiotic | YES |
| T196 | Element, Ion, or Isotope | YES |
| T197 | Inorganic Chemical | YES |
| T200 | Clinical Drug | YES |
| T201 | Clinical Attribute | |
| T203 | Drug Delivery Device | YES |
| T204 | Eukaryote | |

Document Fields Example

<schema name="example" version="1.5">

	<!-- You can use different field names for all fields.
	Just be sure they match what the EMERSE Admin app expects the fields to be named. -->

	<uniqueKey>ID</uniqueKey>

	<!-- unique across sources -->
	<field name="ID" type="string" required="true"/>
	<!-- unique within a source, optional.
	For use by a user if they want to cross-reference
	a document in EMERSE with another system
	without having to demangle the cross-source unique ID above -->
	<field name="RPT_ID" type="string"/>

	<!-- Required fields -->
	<field name="MRN" type="string" required="true"/>
	<field name="ENCOUNTER_DATE" type="date" required="true"/>
	<field name="DOC_TYPE" type="string" required="true"/>
	<field name="SOURCE" type="string" required="true"/>
	<field name="RPT_TEXT" type="text_with_header_post_analysis" required="true"/>

	<!-- Required fields which are calculated by
	 the ETL system based on the encounter date and the birthdate of the patient. -->
	<field name="AGE_DAYS" type="int" required="true"/>
	<field name="AGE_MONTHS" type="int" required="true"/>

	<!-- The encounter ID, if available -->
	<field name="CSN" type="string"/>

	<!-- Other fields describing metadata -->
	<field name="ADMIT_DATE" type="date"/>
	<field name="RPT_DATE" type="date"/>
	<field name="DEPT" type="string" default="unknown"/>
	<field name="SVC" type="string" default="unknown"/>
	<field name="CLINICIAN" type="string" default="unknown"/>
	<field name="CATEGORY" type="string" multiValued="true"/>

	<!-- The NLP_HEADER is optional but cannot be renamed -->
	<field name="NLP_HEADER" type="string" indexed="false"/>

	<!-- Metadata fields related to indexing itself, provided by the ETL system -->
	<field name="INDEX_DATE" type="date"/>
	<field name="LAST_UPDATED" type="date"/>

	<!-- _version_ must be a field since it's used by Solr internally.
	Solr fills in the values automatically. -->
	<field name="_version_" type="long"/>

	<!-- Field Types -->
	<fieldType name="date" class="solr.DatePointField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="long" class="solr.LongPointField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="int" class="solr.IntPointField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="string" class="solr.StrField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
	<fieldType name="boolean" class="solr.BoolField"
		sortMissingLast="true" uninvertible="false" docValues="true"/>
    ....
</schema>
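
As a rough illustration of what an ETL step has to supply for the required fields above, the following sketch builds one document as a Python dictionary and derives AGE_DAYS and AGE_MONTHS from the birthdate and the encounter date. The function names and sample values are hypothetical, and the month calculation (completed calendar months) is an assumption; the schema only states that both ages are calculated by the ETL system.

```python
from datetime import date

def age_in_days(birthdate: date, encounter: date) -> int:
    """Whole days between birth and encounter."""
    return (encounter - birthdate).days

def age_in_months(birthdate: date, encounter: date) -> int:
    """Completed calendar months between birth and encounter (assumed definition)."""
    months = (encounter.year - birthdate.year) * 12 + (encounter.month - birthdate.month)
    if encounter.day < birthdate.day:
        months -= 1  # the current month is not yet complete
    return months

def build_document(doc_id, mrn, source, doc_type, text, birthdate, encounter):
    """Return a dict holding the fields marked required="true" in the schema."""
    return {
        "ID": doc_id,
        "MRN": mrn,
        # Solr date fields expect ISO 8601 timestamps in UTC.
        "ENCOUNTER_DATE": encounter.strftime("%Y-%m-%dT00:00:00Z"),
        "DOC_TYPE": doc_type,
        "SOURCE": source,
        "RPT_TEXT": text,
        "AGE_DAYS": age_in_days(birthdate, encounter),
        "AGE_MONTHS": age_in_months(birthdate, encounter),
    }

# Hypothetical sample values, for illustration only.
example_doc = build_document(
    doc_id="source1-12345",
    mrn="000001",
    source="source1",
    doc_type="Progress Note",
    text="This is not a teddy bear.",
    birthdate=date(2000, 1, 15),
    encounter=date(2000, 3, 14),
)
```

If your site renames fields, the dictionary keys would need to change to match the names configured in the schema and the EMERSE Admin app.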

EMERSE Database

EMERSE stores NLP artifacts in two database tables. By default, only the built-in EMERSE artifacts are populated. To make custom NLP artifacts available in the EMERSE application, add corresponding rows to these tables.

Table SEMANTIC_GROUP

This table stores NLP semantic groups, which are discussed in: Token-Aligned Layered Index Structure

| Column Name | Type | Description |
| --- | --- | --- |
| GROUP_ID | VARCHAR(25) | Semantic group ID. This is the name stored in the Solr index to represent an entity. |
| LABEL | VARCHAR(50) | Group name: a user-friendly name for the semantic group, displayed in the UI. |
| BORDER_CSS | VARCHAR(50) | CSS value for "border-bottom", e.g. "0.3em dotted red". When a concept in the semantic group is highlighted in the document, elements of this CSS value are used for rendering. |
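
The sketch below shows one way a custom semantic group row could be added. It uses an in-memory SQLite database purely as a stand-in so that the example is self-contained; the actual EMERSE database, its connection details, and any constraints beyond the three columns listed above are not covered here. The helper names and the label text are hypothetical, while the group ID reuses SMG_STUFFED_ANIMALS from the teddy bear example.

```python
import sqlite3

# Maximum lengths copied from the column types in the table above.
SEMANTIC_GROUP_WIDTHS = {"GROUP_ID": 25, "LABEL": 50, "BORDER_CSS": 50}

def check_widths(row, widths):
    """Raise ValueError if a value is longer than its VARCHAR limit."""
    for column, limit in widths.items():
        if len(row[column]) > limit:
            raise ValueError(f"{column} is longer than {limit} characters")

def add_semantic_group(conn, row):
    """Insert one semantic group after checking the column widths."""
    check_widths(row, SEMANTIC_GROUP_WIDTHS)
    conn.execute(
        "INSERT INTO SEMANTIC_GROUP (GROUP_ID, LABEL, BORDER_CSS) "
        "VALUES (:GROUP_ID, :LABEL, :BORDER_CSS)",
        row,
    )

# Stand-in table built only from the three documented columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SEMANTIC_GROUP ("
    "GROUP_ID VARCHAR(25), LABEL VARCHAR(50), BORDER_CSS VARCHAR(50))"
)

add_semantic_group(conn, {
    "GROUP_ID": "SMG_STUFFED_ANIMALS",  # must match the name stored in the Solr index
    "LABEL": "Stuffed Animals",         # hypothetical display name
    "BORDER_CSS": "0.3em dotted red",
})
rows = conn.execute("SELECT GROUP_ID, LABEL, BORDER_CSS FROM SEMANTIC_GROUP").fetchall()
```

The explicit width check is there because SQLite does not enforce VARCHAR lengths; a production database would typically reject or truncate oversized values on its own.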

Table NLP_TAG

This table stores NLP attribute groups, which are described in: Group of Attribute IDs

| Column Name | Type | Description |
| --- | --- | --- |
| GROUP_ID | VARCHAR(25) | Unique ID for a Group of Attribute IDs. This ID is used to reference a set of attribute IDs. |
| QUERY_TAG | VARCHAR(25) | The attribute ID in the group that represents the mention. |
| HIGHLIGHT_TAGS | VARCHAR(100) | All attribute IDs in the group that can be highlighted in the document. |
| TAG_INCLUDED_PROMPT | VARCHAR(300) | Text describing what it means when this attribute is included in the query, that is, when +[ATTRIBUTE_ID] is selected. |
| TAG_EXCLUDED_PROMPT | VARCHAR(300) | Text describing what it means when this attribute is excluded from the query, that is, when -[ATTRIBUTE_ID] is selected. |
| TAG_NEUTRAL_PROMPT | VARCHAR(300) | Text describing what it means when this attribute is not specified in the query. |
| LABEL | VARCHAR(50) | The display name for this attribute group in the UI. |
| BORDER_CSS | VARCHAR(50) | CSS value for "border-bottom", e.g. "0.15em solid #EA3223". When an attribute group is highlighted in the document, elements of this CSS value are used for rendering. |
| DISPLAY_ORDER | NUMERIC | The position of this attribute group relative to others when displayed in the UI. |
| SUBWAY_ICON | VARCHAR(1) | A single letter that represents this attribute group. The letter appears in various places in the UI to indicate the selection status of the group. |
| SUBWAY_INCLUDED_PROMPT | VARCHAR(100) | Tooltip text shown when the cursor hovers over the icon and this attribute group is included in the query. |
| SUBWAY_EXCLUDED_PROMPT | VARCHAR(100) | Tooltip text shown when the cursor hovers over the icon and this attribute group is excluded from the query. |
| SUBWAY_NEUTRAL_PROMPT | VARCHAR(100) | Tooltip text shown when the cursor hovers over the icon and this attribute group is not in the query. |
| HIDDEN_ON_UI | BOOLEAN | Indicates whether this attribute group should be hidden in the UI. |
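
Because NLP_TAG has many columns with different size limits, it can help to check a candidate row before inserting it. The following is a minimal sketch of such a check, based only on the column types listed above. The function name, the sample row, and the choice to accept missing string values are assumptions; nullability is not stated here. The sample uses the POL_NEG attribute from the teddy bear example and a hypothetical group ID.

```python
# Maximum lengths copied from the VARCHAR column types listed above.
NLP_TAG_MAX_LENGTHS = {
    "GROUP_ID": 25, "QUERY_TAG": 25, "HIGHLIGHT_TAGS": 100,
    "TAG_INCLUDED_PROMPT": 300, "TAG_EXCLUDED_PROMPT": 300,
    "TAG_NEUTRAL_PROMPT": 300, "LABEL": 50, "BORDER_CSS": 50,
    "SUBWAY_ICON": 1, "SUBWAY_INCLUDED_PROMPT": 100,
    "SUBWAY_EXCLUDED_PROMPT": 100, "SUBWAY_NEUTRAL_PROMPT": 100,
}

def validate_nlp_tag_row(row):
    """Return a list of problems found in a candidate NLP_TAG row."""
    problems = []
    for column, limit in NLP_TAG_MAX_LENGTHS.items():
        value = row.get(column)
        if isinstance(value, str) and len(value) > limit:
            problems.append(f"{column} exceeds {limit} characters")
    order = row.get("DISPLAY_ORDER")
    if isinstance(order, bool) or not isinstance(order, (int, float)):
        problems.append("DISPLAY_ORDER must be numeric")
    if not isinstance(row.get("HIDDEN_ON_UI"), bool):
        problems.append("HIDDEN_ON_UI must be a boolean")
    return problems

# Hypothetical row for a negation attribute group.
negation_row = {
    "GROUP_ID": "NEGATION",
    "QUERY_TAG": "POL_NEG",
    "HIGHLIGHT_TAGS": "POL_NEG",
    "TAG_INCLUDED_PROMPT": "Only match mentions that are negated.",
    "TAG_EXCLUDED_PROMPT": "Only match mentions that are not negated.",
    "TAG_NEUTRAL_PROMPT": "Match mentions regardless of negation.",
    "LABEL": "Negation",
    "BORDER_CSS": "0.15em solid #EA3223",
    "DISPLAY_ORDER": 1,
    "SUBWAY_ICON": "N",
    "SUBWAY_INCLUDED_PROMPT": "Negated mentions only",
    "SUBWAY_EXCLUDED_PROMPT": "Non-negated mentions only",
    "SUBWAY_NEUTRAL_PROMPT": "Negation ignored",
    "HIDDEN_ON_UI": False,
}
problems = validate_nlp_tag_row(negation_row)
```

An empty list means the row fits the documented column definitions; it does not confirm that the attribute IDs actually exist in the Solr index.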

EMERSE UI Consideration

EMERSE queries a Solr field named nlp_display within the index. When this field is populated, its contents are displayed in the NLP Detail section. This is useful when a user's NLP tool generates supplementary content, such as summaries or abstracts, based on the original document.

Additionally, users can define separate Solr fields to store NLP output. These fields can then be configured as query filters, which allows more refined searches.
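
To make these two options concrete, the sketch below attaches NLP output to a document before it is sent to Solr: free text goes into nlp_display, and a structured value goes into a dedicated field. The field name NLP_SMOKING_STATUS, the sample values, and the helper function are hypothetical, and such a field would also need to be declared in the Solr schema. The last line shows the general shape of a Solr filter query (the fq parameter) on that field; how EMERSE itself issues filter queries is configured in the application and is not shown here.

```python
import json
from urllib.parse import urlencode

def add_nlp_output(document, display_text, extra_fields):
    """Return a copy of the document with NLP output attached."""
    enriched = dict(document)
    enriched["nlp_display"] = display_text  # shown in the NLP Detail section
    enriched.update(extra_fields)           # dedicated fields usable as filters
    return enriched

# Hypothetical document and NLP results, for illustration only.
doc = add_nlp_output(
    {"ID": "source1-12345", "MRN": "000001"},
    display_text="Summary: patient reports no current tobacco use.",
    extra_fields={"NLP_SMOKING_STATUS": "never"},
)

# Solr's JSON update handler accepts a JSON array of documents.
update_body = json.dumps([doc])

# General shape of a Solr request that filters on the dedicated field.
query_string = urlencode([("q", "RPT_TEXT:cough"), ("fq", "NLP_SMOKING_STATUS:never")])
```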