This guide describes how natural language processing (NLP) is incorporated into EMERSE, how to configure the system properly, and how to make changes so that you can incorporate your own annotations into the search index.
Background
Free text is complex and diverse in structure, so no single approach is likely to yield optimal results consistently. For this reason, the NLP capabilities integrated into EMERSE, which run efficiently while documents are sent to Solr for indexing, are built on well-established open-source libraries such as OpenNLP and cTAKES. These core features include rule-based information extraction and text classification.
NLP Implementation
NLP Element
EMERSE requires a highly customized Lucene Analyzer to pre-compute NLP artifacts, which are subsequently integrated with standard tokens for indexing purposes. These artifacts are aligned with tokens based upon their positional information and length.
There exist two distinct categories of artifacts:
- Attribute: An attribute is additional information that can be linked to an individual token. Multiple attributes can be associated with a single token.
- Entity: An entity is a concept extracted from the original document through NLP analysis. A concept may span multiple tokens and is positioned on the first token it encompasses. Note that an entity may include symbols that are ignored by the tokenizer, so an entity might contain characters not present in the index.
Token-Aligned Layered Index Structure
Here is an example to conceptually illustrate the Token-Aligned Layered Index using a sentence: This is not a teddy bear.
| This | is | not | a | teddy | bear |
|---|---|---|---|---|---|
|  |  | PPOL_NEG | POL_NEG | POL_NEG | POL_NEG |
|  |  |  |  | CUI_TEDDY_BEAR |  |
|  |  |  |  | SMG_STUFFED_ANIMALS |  |
There are four layers in this index. Text tokens are in the first row. The rows that follow contain NLP artifacts aligned with the text tokens. Attributes, in the second row, are always associated with individual tokens. Entities (concepts, in this case) appear in the third row and may span multiple tokens. The fourth row holds a different type of entity, the semantic group. Each concept is assigned one or more semantic groups. More information on semantic groups can be found at: UMLS Semantic Group and Concept.
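The layered structure can be pictured as a mapping from token position to every term stored at that position. The following Python sketch is only a conceptual model: the `build_layers` helper and the in-memory layout are illustrative assumptions, not the format EMERSE or Lucene actually uses on disk.

```python
from collections import defaultdict


def build_layers(tokens, attributes, entities):
    """Group text tokens and NLP artifacts by token position.

    tokens:     list of token strings
    attributes: list of (position, attribute_id) pairs, one token each
    entities:   list of (first_position, entity_id) pairs; an entity is
                stored on the first token it covers
    """
    layers = defaultdict(list)
    for pos, tok in enumerate(tokens):
        layers[pos].append(tok)
    for pos, attr in attributes:
        layers[pos].append(attr)
    for pos, ent in entities:
        layers[pos].append(ent)
    return dict(layers)


tokens = ["This", "is", "not", "a", "teddy", "bear"]
attributes = [(2, "PPOL_NEG"), (3, "POL_NEG"), (4, "POL_NEG"), (5, "POL_NEG")]
entities = [(4, "CUI_TEDDY_BEAR"), (4, "SMG_STUFFED_ANIMALS")]

index = build_layers(tokens, attributes, entities)
print(index[4])  # every term aligned with the token "teddy"
```

Because all layers share the same position, a search for the concept and a search for the literal token both resolve to the same place in the document.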
NLP Pipeline
A typical Lucene Analysis Chain includes:
- Zero, one, or more CharFilters;
- A single Tokenizer;
- Zero, one, or more TokenFilters.
Standard Pipeline
EMERSE uses solr.HTMLStripCharFilterFactory as the default CharFilter in order to pre-process documents in HTML format.
A custom tokenizer has been developed with the following capabilities:
- Sentence Boundary Detection: It identifies sentence boundaries within the text.
- Token-Level Text Segmentation: It segments the text into tokens within each sentence, preparing it for more intricate NLP tasks.
- Document Header Recognition: If a document includes a header, the tokenizer can recognize and interpret it. The header, when present, provides hints and instructions on how to conduct the tokenization process.
Ideally, this tokenizer should also assume responsibility for more intricate NLP tasks, including entity identification and attribute discovery. Nonetheless, a monolithic implementation of such tasks would introduce unnecessary complexity and a lack of modularity, thereby hindering the ability to update or replace specific functionalities individually.
EMERSE has encapsulated these NLP tasks within filters that process a token stream originating from either the tokenizer or other filters. These filters operate autonomously and are versatile in their ability to work together in any desired sequence. For instance, when utilizing filters A and B, the pipeline’s execution as A → B will yield the same result as B → A.
Here is a possible implementation of the EMERSE NLP pipeline as a Solr analyzer:
<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
  <tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
             tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
             sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
             hasHeader="true"
             usingLuceneTokenizer="true"
  />
  <filter class="org.emerse.ctakes.CTAKESFilterFactory"
          excludeTypes="nlp/pos_exclude.txt"
          prefix="CUI"
          dictionary="nlp/dictionary/custom.script"
          posModel="nlp/models/mayo-pos.zip"
  />
  <filter class="org.emerse.nlp.TaggerFilterFactory"
          addTokenPrefix="false"
          delimiters="nlp/delimiters.txt"
          negations="nlp/negations.txt"
          negationIgnore="nlp/negationIgnore.txt"
          certainty="nlp/probability_resource_patterns.txt"
          family="nlp/family_history_resource_patterns_A.txt"
          familyTemplates="nlp/family_history_resource_patterns_B.txt"
          familyMembers="nlp/family_history_resource_patterns_C.txt"
          history="nlp/history.txt"
          historyIgnore="nlp/historyIgnore.txt"
          tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
  />
  <filter class="org.emerse.nlp.PostAnalysisFilterFactory"/>
</analyzer>
Custom Pipeline
NLP is complex, and it is unrealistic to expect that the EMERSE NLP implementation will meet the needs of every user. It is quite common for users to have NLP entities and attributes generated by software packages external to EMERSE. In such cases, users expect EMERSE to index the original documents along with NLP objects not generated by EMERSE itself. To address this requirement, EMERSE allows users to submit the original document along with its NLP artifacts in a single file. Here is an example:
THIS_IS_A_HEADER
4	18	CUI_APPLE_BANANA	R
4	18	SMG_FRUITS	R
5	17	CUI_APPLE_BANANA_1	R
4	18	SMG_FRUITS_1	R
21	36	POL_NEG	T
THIS_IS_A_SEPARATOR
not +apple-banana--. Second sentence. Third sentence.
In this file, the single-line header is the designated location for providing arguments to the tokenizer. Following the header is a dedicated NLP artifact section, where each row contains the start offset, end offset, artifact ID, and artifact type, delimited by tabs. A separator marks the end of the NLP artifact section. The original document is appended after the separator.
The artifact ID is a unique identifier and must not contain any spaces.
The offsets of an NLP artifact are defined by two integers: the start offset and the end offset. These are character positions within the original text. The start offset is where the first token begins, and the end offset is where the last token ends, plus 1 (in other words, the end offset is exclusive). For example:
This is a teddy bear.
If an artifact is aligned with teddy bear, the corresponding line in the file should look like:
10	20	CUI_TEDDY_BEAR	R
If an artifact is a concept with an associated group, the group must appear on the line immediately following the concept:
4	18	CUI_APPLE_BANANA	R
4	18	SMG_FRUITS	R
Artifact type, as discussed earlier, is either Attribute or Entity:
NLP Artifact Type | Flag |
---|---|
Attribute | T |
Entity | R |
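As a rough illustration of the file layout described above, the following Python sketch computes the offsets for a phrase and assembles a custom pipeline document. The helper names are hypothetical, and joining the header, artifact rows, separator, and document with single line feeds is an assumption based on the description in this section rather than a confirmed specification.

```python
def artifact_line(text, phrase, artifact_id, flag):
    """Return one tab-delimited artifact row for the first occurrence of phrase."""
    if flag not in ("R", "T"):
        raise ValueError("flag must be 'R' (entity) or 'T' (attribute)")
    if any(ch.isspace() for ch in artifact_id):
        raise ValueError("artifact ID must not contain whitespace")
    start = text.index(phrase)   # raises ValueError if the phrase is absent
    end = start + len(phrase)    # exclusive: position of the last character + 1
    return f"{start}\t{end}\t{artifact_id}\t{flag}"


def build_custom_document(header, separator, text, artifacts):
    """Assemble header, artifact rows, separator, and the original text."""
    rows = [artifact_line(text, phrase, aid, flag) for phrase, aid, flag in artifacts]
    return "\n".join([header, *rows, separator, text])


text = "This is a teddy bear."
doc = build_custom_document(
    "THIS_IS_A_HEADER",        # placeholder header, as in the example above
    "THIS_IS_A_SEPARATOR",     # placeholder separator, as in the example above
    text,
    [
        ("teddy bear", "CUI_TEDDY_BEAR", "R"),
        ("teddy bear", "SMG_STUFFED_ANIMALS", "R"),  # group follows its concept
    ],
)
print(doc)
```

Note that the offsets refer to positions in the original document text only, not to positions in the combined file.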
Here is a possible implementation of the EMERSE custom NLP pipeline:
<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
             tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
             sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
             hasHeader="true"
             usingLuceneTokenizer="true"
  />
  <filter class="org.emerse.nlp.PostAnalysisFilterFactory"/>
</analyzer>
In a custom pipeline, the hasHeader attribute on the HybridTokenizer must be true.
EMERSE Document Header
The EMERSE header is a single line of text terminated by a newline symbol. EMERSE recognizes the header row by its prefix signature: RU1FUlNFX0g=. The header accepts 3 positional arguments, which are separated by the pipe character |.
Position | Description |
---|---|
1 | Numeric. The number of newline symbols after which a sequence can be regarded as the end of a sentence. This parameter is required and must be a numeric value greater than 0. |
2 | Numeric. The number of spaces after which a sequence can be regarded as the end of a sentence. This parameter is required and must be a numeric value greater than 0. |
3 | String. A line of text indicating the end of the NLP artifact section. This parameter is optional and is only needed for a Custom Pipeline. For a Standard Pipeline, leave this parameter empty; supplying a value will break the indexing process and truncate the document. |
Here is an example of an EMERSE header row for the Standard Pipeline:
RU1FUlNFX0g=1|5|
Here is an example of an EMERSE header row for the Custom Pipeline:
RU1FUlNFX0g=2|5|XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==
A valid EMERSE Document Header must always end with a line feed (LF): '\n'. The behavior is undefined if no LF is found at the end of the header.
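To make the header layout concrete, here is a minimal Python sketch of a parser that follows the rules stated in this section (prefix signature, three pipe-delimited positional arguments, trailing LF). The function name and the error handling are illustrative assumptions; this is not the parser EMERSE itself uses.

```python
HEADER_PREFIX = "RU1FUlNFX0g="  # prefix signature given in this section


def parse_header(line):
    """Split an EMERSE header row into (newlines, spaces, separator)."""
    if not line.endswith("\n"):
        raise ValueError("header must end with a line feed")
    body = line[:-1]
    if not body.startswith(HEADER_PREFIX):
        raise ValueError("missing EMERSE header signature")
    parts = body[len(HEADER_PREFIX):].split("|", 2)
    if len(parts) != 3:
        raise ValueError("expected 3 pipe-delimited arguments")
    newlines, spaces = int(parts[0]), int(parts[1])
    if newlines <= 0 or spaces <= 0:
        raise ValueError("arguments 1 and 2 must be greater than 0")
    separator = parts[2]  # empty for a Standard Pipeline
    return newlines, spaces, separator


standard = parse_header("RU1FUlNFX0g=1|5|\n")
custom = parse_header("RU1FUlNFX0g=2|5|XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==\n")
print(standard, custom)
```

An empty third value indicates a Standard Pipeline document; a non-empty value is the separator line that ends the NLP artifact section.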
EMERSE Field Type
TextWithHeaderFieldType is a field type for the Solr index. If the document has an EMERSE header defined, whether it uses the Standard Pipeline or the Custom Pipeline, org.emerse.solr.TextWithHeaderFieldType must be the field type.
Solr and Lucene do not provide native support for indexing only specific portions of a document. Consequently, EMERSE provides a custom field type. This field type is designed to distinguish the sections within a document, enabling EMERSE to process them appropriately.
Here is an example of defining an analysis chain on this field type:
<fieldType name="text_with_header_post_analysis" class="org.emerse.solr.TextWithHeaderFieldType"
           termPositions="true" termVectors="true" termOffsets="true" termPayloads="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
               standardTokenizer="true"
               hasHeader="true"
               tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
               sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
               numOfNewLineSymbols="2"
    />
  </analyzer>
</fieldType>
EMERSE Document Field
In order to use org.emerse.solr.TextWithHeaderFieldType, the Solr field for storing documents must have the following attributes activated on the field.
<field name="RPT_TEXT" type="text_with_header_post_analysis"/>
The field must be indexed and stored. termVectors must be enabled, along with termPositions, termOffsets, and termPayloads.
EMERSE Extra NLP Field
The NLP document header is not stored in the EMERSE Document Field because it is not part of the original document. However, in certain scenarios, e.g., a document update, it may be necessary to keep the NLP header information available.
<field name="NLP_HEADER" type="string" indexed="false" stored="true" docValues="true"/>
This field is not indexed but stored.
Tokenizer & Filters
Standard NLP Tokenizer
OpenNLPTokenizer is the tokenizer for the standard EMERSE NLP pipeline. It segments the provided text on a per-sentence basis. Sentence detection and tokenization are built upon OpenNLP.
It can be declared using org.emerse.opennlp.OpenNLPTokenizerFactory, which supports the following arguments:
Argument Name | Required | Default Value | Description |
---|---|---|---|
sentenceModel | Yes | N/A | String. The file path of the pre-trained sentence detection model for OpenNLP. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
tokenizerModel | Yes | N/A | String. The file path of the pre-trained tokenization model for OpenNLP. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
usingLuceneTokenizer | No | False | True or False. If True, tokens are reprocessed using Lucene's StandardTokenizer, which ensures that the indexed tokens follow the behavior of the StandardTokenizer. |
hasHeader | No | False | True or False. If True, the tokenizer expects a single-line header at the beginning of the document. If the document has a header, the field type for storing the text in the Solr index must be org.emerse.solr.TextWithHeaderFieldType |
numOfNewLineSymbols | No | 1 | Numeric. The number of newline symbols (carriage return and line feed combined) after which a sequence is regarded as the end of a sentence. This is a hint to help the tokenizer decide sentence boundaries. |
numOfSpaces | No | 5 | Numeric. The number of spaces after which a sequence is regarded as the end of a sentence. This is a hint to help the tokenizer decide sentence boundaries. |
Custom NLP Tokenizer
HybridTokenizer is the tokenizer for the Custom Pipeline.
It can be declared using org.emerse.nlp.HybridTokenizerFactory, which supports the following arguments:
Argument Name | Required | Default Value | Description |
---|---|---|---|
standardTokenizer | No | False | True or False. If True, HybridTokenizer wraps the Lucene StandardTokenizer; if False, it wraps OpenNLPTokenizer. If any NLP artifacts from the EMERSE NLP filters are to be added to the Solr index, OpenNLPTokenizer should be used. |
maxTokenCount | No | 255 | Numeric. This applies to StandardTokenizer only. It defines the maximum token length. If a token exceeds this length, it is split at maxTokenCount intervals. |
lineFeedWidth | No | 1 | Numeric. The length of the line feed at the end of the header row. If nothing is specified, EMERSE expects a single line feed at the end of the header row. |
Named Entity Recognition (NER) Filter
CTAKESFilter is a TokenFilter that can extract NLP entities using a preloaded dictionary. The extraction implementation is derived from the cTAKES fast dictionary lookup method. The dictionary that may be obtained from the EMERSE team is compiled from the Unified Medical Language System (UMLS).
It can be declared using org.emerse.ctakes.CTAKESFilterFactory, which supports the following arguments:
Argument Name | Required | Default Value | Description |
---|---|---|---|
dictionary | Yes | N/A | String. The file path of the dictionary. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
prefix | Yes | N/A | String. The prefix that is prepended to the entity ID inserted into the index. |
longestMatch | No | True | True or False. This applies when multiple entities are recognized and one of them encompasses all the others. If True, only the longest entity is added to the index; if False, all identified entities are added to the index. |
posModel | No | N/A | String. The file path of the POS (part-of-speech) model. The POS model, combined with an excludeTypes list, is used to exclude tokens with certain POS types from being identified as entities. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
excludeTypes | No | N/A | String. The file path of the list of POS (part-of-speech) types. If a token has been identified with any POS type in this list, no entity extraction is conducted against this token. This list takes effect only if a posModel is given. |
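The effect of longestMatch can be sketched in a few lines of Python. This is a simplified model under stated assumptions: matches are given as token spans with an exclusive end, the entity IDs other than CUI_TEDDY_BEAR are hypothetical, and `select_entities` is an illustrative helper rather than part of CTAKESFilter.

```python
def select_entities(matches, longest_match=True):
    """Filter dictionary matches the way longestMatch is described.

    matches: list of (start_token, end_token_exclusive, entity_id)
    """
    if not longest_match:
        return list(matches)
    kept = []
    for start, end, entity_id in matches:
        # Drop a match when a strictly longer match fully covers it.
        covered = any(
            o_start <= start and end <= o_end and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in matches
        )
        if not covered:
            kept.append((start, end, entity_id))
    return kept


# Token positions refer to: This(0) is(1) not(2) a(3) teddy(4) bear(5)
matches = [
    (4, 6, "CUI_TEDDY_BEAR"),
    (4, 5, "CUI_TEDDY"),   # hypothetical shorter match
    (5, 6, "CUI_BEAR"),    # hypothetical shorter match
]
print(select_entities(matches))
print(select_entities(matches, longest_match=False))
```

With longest_match enabled only the two-token entity survives; with it disabled all three matches are kept.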
UMLS Semantic Group and Concept
The UMLS semantic network reduces the complexity of the Metathesaurus by grouping concepts according to the semantic types that have been assigned to them.
The EMERSE UI enables users to highlight a document by semantic group. When a concept is identified by the filter, its associated semantic groups are automatically linked to the concept. An example is given at: Token-Aligned Layered Index Structure.
Filter for Negation, Uncertainty and More
TaggerFilter is a TokenFilter designed to detect the following 4 types of mentions in the text:
- Negation
- Uncertainty
- Non-patient mentions
- Patient history
This filter uses rule-based matching for mention detection. Users can customize the rules and can turn each type of mention detection on or off.
It can be declared using org.emerse.nlp.TaggerFilterFactory, which supports the following arguments:
Argument Name | Required | Default Value | Description |
---|---|---|---|
delimiters | No | N/A | String. The file path for the delimiters. Delimiters divide a sentence into distinct sections and prevent mention detection from extending beyond the section in which it was triggered, which would otherwise cause inaccuracies toward the end of the sentence. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
negations | No | N/A | String. The file path for the negation rules. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
negationIgnore | No | N/A | String. The file path for the negation ignore rules. This is useful for cases such as "can not be ignored": if "can not" is a negation rule, then without this ignore rule "can not be ignored" could be wrongly identified as a negation. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
certainty | No | N/A | String. The file path for the uncertainty rules. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
family | No | N/A | String. The file path for the family rules that are in static mode. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
familyTemplates | No | N/A | String. The file path for the family rules that are in dynamic mode. This argument works together with familyMembers. The rules in familyTemplates act as format strings, and the placeholders in them are replaced by each of the terms defined in familyMembers. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
familyMembers | No | N/A | String. The file path for the family member definitions. This argument works together with familyTemplates. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
history | No | N/A | String. The file path for the patient history mention rules. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
historyIgnore | No | N/A | String. The file path for the history ignore rules. This is useful for cases such as "history on admit": if "history" is a history rule, then without this ignore rule "history on admit" could be wrongly identified as a patient history mention. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
addTokenPrefix | No | True | True or False. If True, this filter adds the prefixes TX_ and tx_ to tokens. Tokens with the TX_ prefix are case-sensitive, while tokens with the tx_ prefix are case-insensitive. If False, no prefix is added and the tokens are passed downstream as they are. |
Group of Attribute IDs
If a mention is identified by a rule, EMERSE tags the tokens before and/or after the rule's token pattern with the corresponding attribute ID until the sentence boundary is reached. The direction is defined in the rule itself: a leading ***** extends the tagging toward the beginning of the sentence, and a trailing ***** extends it toward the end of the sentence. In addition, the tokens that match the rule are tagged with a distinct attribute ID that identifies the trigger. For example:
This is not a teddy.
If a negation rule is given as not *****, every token after not is tagged as negated (attribute ID POL_NEG), and not itself is tagged as the "trigger" for this negation (attribute ID PPOL_NEG). The tokens This is are not touched because ***** comes after not, which means only the tokens after the trigger are considered negated.
| This | is | not | a | teddy |
|---|---|---|---|---|
|  |  | PPOL_NEG | POL_NEG | POL_NEG |
From the above example, it is evident that a set of attributes is required to describe the occurrence of a mention.
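The tagging behavior described in this section can be approximated with the Python sketch below. It is a simplified model, not the TaggerFilter implementation: it assumes case-insensitive matching on whole tokens, derives the trigger ID by prepending "P" to the attribute ID (as in the PPOL_NEG example), handles only the first match, and ignores delimiters and ignore rules.

```python
PLACEHOLDER = "*****"


def tag_tokens(tokens, rule, attr_id="POL_NEG"):
    """Return one tag (or None) per token for a single rule."""
    parts = rule.split()
    forward = parts[-1] == PLACEHOLDER   # tag tokens after the trigger
    backward = parts[0] == PLACEHOLDER   # tag tokens before the trigger
    trigger = [p.lower() for p in parts if p != PLACEHOLDER]
    tags = [None] * len(tokens)
    if not trigger:
        return tags
    lowered = [t.lower() for t in tokens]
    n = len(trigger)
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == trigger:
            for j in range(i, i + n):
                tags[j] = "P" + attr_id          # trigger tokens
            if forward:
                for j in range(i + n, len(tokens)):
                    tags[j] = attr_id
            if backward:
                for j in range(0, i):
                    tags[j] = attr_id
            break
    return tags


tokens = ["This", "is", "not", "a", "teddy"]
print(list(zip(tokens, tag_tokens(tokens, "not *****"))))
```

For the sentence above, the trigger not receives PPOL_NEG, the two tokens after it receive POL_NEG, and This is remain untagged.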
Post Analysis Consolidation Filter
The PostAnalysisFilter is a TokenFilter that belongs at the end of the pipeline. It reprocesses tokens using the Lucene StandardTokenizer. If the output tokens from StandardTokenizer differ from the input token, the input token is discarded, and the NLP artifacts overlapping with that token are realigned with the output tokens. It may also add the prefixes TX_ and tx_ to tokens. Tokens with the TX_ prefix are case-sensitive, while tokens with the tx_ prefix are case-insensitive.
This filter can be declared using org.emerse.nlp.PostAnalysisFilterFactory, which supports the following argument:
Argument Name | Required | Default Value | Description |
---|---|---|---|
addTokenPrefix | No | True | True or False. If True, each token is provided in both case-sensitive and case-insensitive forms, prefixed with TX_ and tx_ respectively. |
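As a small illustration of addTokenPrefix, the Python sketch below emits both forms of a token. Lowercasing the tx_ form is an assumption made here to model "case-insensitive"; the source does not state how the filter normalizes case, and `prefixed_forms` is a hypothetical helper.

```python
def prefixed_forms(token, add_token_prefix=True):
    """Return the terms that would be indexed for one token."""
    if not add_token_prefix:
        return [token]                   # token passes through unchanged
    return [
        "TX_" + token,                   # case-sensitive form
        "tx_" + token.lower(),           # case-insensitive form (assumed lowercased)
    ]


print(prefixed_forms("Teddy"))
print(prefixed_forms("Teddy", add_token_prefix=False))
```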
TokenAdjusterFilter is a TokenFilter that is only necessary when defining a query analyzer that uses OpenNLPTokenizer.
The default query analyzer typically comes equipped with Lucene StandardTokenizer. In cases where OpenNLPTokenizer is employed during Solr indexing, the resulting tokens may deviate from those produced by StandardTokenizer. In such instances, TokenAdjusterFilter steps in to harmonize the output tokens, ensuring an exact match with the tokens generated during the Solr indexing process.
There is no argument available for this filter.
Query Analyzer
To facilitate Solr indexing, a mandatory component is the Analyzer chain (pipeline). Similarly, for Solr queries, an analyzer chain must be defined. Since EMERSE offers flexibility to customize the NLP pipeline, it’s crucial to emphasize that the tokens extracted from a query phrase should precisely match those generated by the analyzer during the Solr indexing process.
Here is the query analyzer that EMERSE recommends:
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
  <tokenizer class="solr.OpenNLPTokenizerFactory"
             sentenceModel="nlp/models/sd-med-model.zip"
             tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"/>
  <filter class="org.emerse.nlp.TokenAdjusterFilterFactory"
          reservedPrefix="CUI_,POL_NEG,PPOL_NEG,APOL_NEG,TG_FAMILY,PTG_FAMILY,ATG_FAMILY,CT_UNCERTAIN,PCT_UNCERTAIN,ACT_UNCERTAIN,TG_HISTORY,PTG_HISTORY,ATG_HISTORY"/>
</analyzer>
org.emerse.nlp.TokenAdjusterFilterFactory is designed to work only with the OpenNLPTokenizer created by solr.OpenNLPTokenizerFactory. Failing to use the specified tokenizer will lead to unexpected results, which may impact the functionality of EMERSE.
Alternatively:
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
             tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
             sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
             usingLuceneTokenizer="true"
  />
  <filter class="org.emerse.nlp.PostAnalysisFilterFactory" addTokenPrefix="false"/>
</analyzer>
Query Syntax
A full introduction to the EMERSE query syntax is beyond the scope of this document. However, it is worth describing how NLP artifacts are integrated into a query.
For NLP attributes, such as negation, the + and - operators are used, followed by square brackets [] containing the attribute ID. These notations indicate whether or not the attribute should align with the query phrase. For example, to query smoke history in the negated form, the query might be written as:
+[POL_NEG]"smoke history"
given POL_NEG as the attribute ID for negation.
A phrase can also be constrained by multiple attributes. For example, to query smoke history in the non-negated form for the patient only, the query might be written as:
-[POL_NEG]-[TG_FAMILY]"smoke history"
given POL_NEG as the attribute ID for negation and TG_FAMILY as the attribute ID for non-patient mention.
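The attribute notation lends itself to a small helper when queries are generated programmatically. The Python sketch below only reproduces the two forms shown in this section; the function name is hypothetical, and placing all + constraints before all - constraints is an assumption, since the source does not say whether order matters.

```python
def attribute_query(phrase, require=(), exclude=()):
    """Build a phrase query constrained by NLP attribute IDs."""
    prefix = "".join(f"+[{attr}]" for attr in require)
    prefix += "".join(f"-[{attr}]" for attr in exclude)
    return f'{prefix}"{phrase}"'


negated = attribute_query("smoke history", require=["POL_NEG"])
patient_only = attribute_query("smoke history", exclude=["POL_NEG", "TG_FAMILY"])
print(negated)
print(patient_only)
```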
NLP entities, on the other hand, are treated as regular phrases in a query. They can be combined with regular text phrases in the query. For example:
"CUI_TEDDY bear"
This query looks for any token or phrase that is mapped to the entity with ID CUI_TEDDY and is followed by the token bear. If CUI_TEDDY is mapped to "Teddy", "A teddy", and "Teddies", the above query is equivalent to:
"Teddy bear" OR "A teddy bear" OR "Teddies bear"
NLP Configuration Files
NLP configuration files, including the mapping file for the CharFilter, the model files for the standard NLP tokenizer, the UMLS dictionary for NER, and the pattern files for detecting negation, uncertainty, etc., are all located under DOCUMENT_INDEX_ROOT/config/nlp.
Delimiter Mapping
Prior to version 7, EMERSE included MappingCharFilter in both the indexing pipeline and the query pipeline. The mapping file, mapping-delimiters.txt, may contain rules that remove dots from various character sets, for example:
"。" => ""
This type of mapping should be removed in version 7, as detecting a dot is a common method for NLP to identify sentence boundaries.
OpenNLP Tokenizer Model
Model files are in a subdirectory named models. By default, it includes OpenNLP-trained token, sentence, and POS models for English. If the documents are in a different language, models may be available on the OpenNLP website.
Pattern Files for Detection of Negation, Uncertainty and More
Sentiment detection in EMERSE relies on pattern matching. Newer technologies, such as convolutional neural networks (CNNs), were not adopted primarily due to their impact on indexing performance. To efficiently index large volumes of notes within a reasonable timeframe, pattern matching remains a viable solution.
EMERSE can detect negation, uncertainty, non-patient mentions, and patient history. Some detection types involve two pattern files: one for the detection patterns and another for overriding these patterns to address false positives, for example negations.txt versus negationIgnore.txt, and history.txt versus historyIgnore.txt. Here is a snippet of negation patterns:
denied symptoms of *****
***** couldn't be
The text in each line, e.g. couldn't be, is a trigger phrase that activates negation detection. The five asterisks (*****) are a placeholder for whatever is being negated. The placeholder can be placed either before or after the trigger.
A negation-ignore file may include:
couldn't be excluded
For the negation pattern ***** couldn't be, if the sentence is Pneumonia couldn't be excluded, negation is not concluded because of the negation-ignore trigger couldn't be excluded. Note that an ignore file contains triggers only.
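The interaction between a pattern file and its ignore file can be sketched as follows. This Python example is a simplification under stated assumptions: it matches trigger phrases as case-insensitive substrings rather than token sequences, takes triggers without the ***** placeholder, and the function names are hypothetical.

```python
def find_spans(text, phrase):
    """Return (start, end) character spans of every occurrence of phrase."""
    spans = []
    start = text.find(phrase)
    while start != -1:
        spans.append((start, start + len(phrase)))
        start = text.find(phrase, start + 1)
    return spans


def negation_fires(sentence, triggers, ignore_triggers):
    """True if a trigger occurs outside every ignore-trigger occurrence."""
    text = sentence.lower()
    ignored = [
        span for ign in ignore_triggers for span in find_spans(text, ign.lower())
    ]
    for trig in triggers:
        for t_start, t_end in find_spans(text, trig.lower()):
            overridden = any(
                i_start <= t_start and t_end <= i_end for i_start, i_end in ignored
            )
            if not overridden:
                return True
    return False


triggers = ["couldn't be"]
ignore = ["couldn't be excluded"]
print(negation_fires("Pneumonia couldn't be excluded", triggers, ignore))
print(negation_fires("Pneumonia couldn't be confirmed", triggers, ignore))
```

The first sentence is suppressed by the ignore trigger, while the second still activates negation detection.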
Uncertainty uses a single pattern file, probability_resource_patterns.txt. Non-patient mention, on the other hand, has 3 pattern files: family_history_resource_patterns_A.txt is a regular pattern file, family_history_resource_patterns_B.txt is a template file, and the @@@@@ placeholder in each of its lines is replaced by the familial terms defined in family_history_resource_patterns_C.txt.
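The template expansion can be pictured with the short Python sketch below. The template and member strings are made-up examples, not lines from the distributed pattern files, and `expand_family_templates` is a hypothetical helper that simply substitutes every familial term into every template.

```python
def expand_family_templates(templates, members, placeholder="@@@@@"):
    """Produce one concrete rule per (template, family member) pair."""
    return [
        template.replace(placeholder, member)
        for template in templates
        for member in members
    ]


templates = ["@@@@@ has a history of *****"]   # hypothetical template line
members = ["mother", "father"]                 # hypothetical familial terms
rules = expand_family_templates(templates, members)
print(rules)
```

Keeping the familial terms in a separate file means a new term only needs to be added once to take effect in every template.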
UMLS Dictionary
The dictionary subdirectory contains the UMLS dictionary tailored by EMERSE using the cTAKES UI. The dictionary file is a SQL file because that is the default output from cTAKES. EMERSE parses this SQL file and extracts all concepts together with their groups. Entity recognition through the Standard Pipeline requires a script file built with the cTAKES Dictionary Creator. (As of 2024, this program still required Java 8 to run properly.)
The EMERSE team has a default dictionary file to distribute to sites installing EMERSE or upgrading from prior versions that are not NLP-enabled. Sites simply need to provide us with a copy of a valid UMLS license first.
Users who wish to make their own dictionary can create a custom dictionary and specify the path of the script file for CTAKESFilterFactory using its dictionary attribute.
By default, this dictionary resides at:
SOLR_HOME/DOCUMENT_CORE/conf/nlp/dictionary
UMLS Source Data
At the time of this writing (May 2024), the source data used to match concepts in the clinical notes comes from UMLS version 2023AB. A subset of the source vocabularies and semantic types is included in the default EMERSE distribution, as specified in the tables below. These were selected because they are the most likely to have clinical relevance and to appear in EHR notes. Each semantic type is mapped to a Type Unique Identifier (TUI).
Abbreviation | Vocabulary Name | Included in EMERSE? |
---|---|---|
AIR |
AI/RHEUM |
|
ALT |
Alternative Billing Concepts |
|
AOD |
Alcohol and Other Drug Thesaurus |
YES |
AOT |
Authorized Osteopathic Thesaurus |
YES |
ATC |
Anatomical Therapeutic Chemical Classification System |
|
BI |
Beth Israel Problem List |
|
CCC |
Clinical Care Classification |
|
CCPSS |
Clinical Problem Statements |
|
CCS |
Clinical Classifications Software |
|
CCSR_ICD10CM |
Clinical Classifications Software Refined for ICD-10-CM |
|
CCSR_ICD10PCS |
Clinical Classifications Software Refined for ICD-10-PCS |
|
CDCREC |
Race & Ethnicity - CDC |
|
CDT |
CDT |
|
CHV |
Consumer Health Vocabulary |
YES |
COSTAR |
COSTAR |
|
CPM |
Medical Entities Dictionary |
|
CPT |
CPT - Current Procedural Terminology |
YES |
CPTSP |
CPT Spanish |
|
CSP |
CRISP Thesaurus |
|
CST |
COSTART |
|
CVX |
Vaccines Administered |
|
DDB |
Diseases Database |
YES |
DMDICD10 |
ICD-10 German |
|
DMDUMD |
UMDNS German |
|
DRUGBANK |
DrugBank |
YES |
DSM-5 |
Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition |
YES |
DXP |
DXplain |
|
FMA |
Foundational Model of Anatomy |
|
GO |
Gene Ontology |
|
GS |
Gold Standard Drug Database |
YES |
HCDT |
CDT in HCPCS |
|
HCPCS |
HCPCS - Healthcare Common Procedure Coding System |
YES |
HCPT |
CPT in HCPCS |
|
HGNC |
HUGO Gene Nomenclature Committee |
|
HL7V2.5 |
HL7 Version 2.5 |
|
HL7V3.0 |
HL7 Version 3.0 |
|
HLREL |
ICPC2E ICD10 Relationships |
|
HPO |
Human Phenotype Ontology |
YES |
ICD10 |
International Classification of Diseases and Related Health Problems, Tenth Revision |
YES |
ICD10AE |
ICD-10, American English Equivalents |
YES |
ICD10AM |
ICD-10, Australian Modification |
|
ICD10AMAE |
ICD-10, Australian Modification, Americanized English Equivalents |
|
ICD10CM |
International Classification of Diseases, Tenth Revision, Clinical Modification |
YES |
ICD10DUT |
ICD10, Dutch Translation |
|
ICD10PCS |
ICD-10 Procedure Coding System |
YES |
ICD9CM |
International Classification of Diseases, Ninth Revision, Clinical Modification |
|
ICF |
International Classification of Functioning, Disability and Health |
YES |
ICF-CY |
International Classification of Functioning, Disability and Health for Children and Youth |
YES |
ICNP |
International Classification for Nursing Practice |
|
ICPC |
International Classification of Primary Care |
|
ICPC2EDUT |
ICPC2E Dutch |
|
ICPC2EENG |
International Classification of Primary Care, 2nd Edition, Electronic |
|
ICPC2ICD10DUT |
ICPC2-ICD10 Thesaurus, Dutch Translation |
|
ICPC2ICD10ENG |
ICPC2-ICD10 Thesaurus |
|
ICPC2P |
ICPC-2 PLUS |
|
ICPCBAQ |
ICPC Basque |
|
ICPCDAN |
ICPC Danish |
|
ICPCDUT |
ICPC Dutch |
|
ICPCFIN |
ICPC Finnish |
|
ICPCFRE |
ICPC French |
|
ICPCGER |
ICPC German |
|
ICPCHEB |
ICPC Hebrew |
|
ICPCHUN |
ICPC Hungarian |
|
ICPCITA |
ICPC Italian |
|
ICPCNOR |
ICPC Norwegian |
|
ICPCPOR |
ICPC Portuguese |
|
ICPCSPA |
ICPC Spanish |
|
ICPCSWE |
ICPC Swedish |
|
JABL |
Congenital Mental Retardation Syndromes |
YES |
KCD5 |
Korean Standard Classification of Disease Version 5 |
|
LCH |
Library of Congress Subject Headings |
|
LCH_NW |
Library of Congress Subject Headings, Northwestern University subset |
|
LNC |
LOINC |
YES |
LNC-DE-AT |
LOINC Linguistic Variant - German, Austria |
|
LNC-DE-DE |
LOINC Linguistic Variant - German, Germany |
|
LNC-EL-GR |
LOINC Linguistic Variant - Greek, Greece |
|
LNC-ES-AR |
LOINC Linguistic Variant - Spanish, Argentina |
|
LNC-ES-ES |
LOINC Linguistic Variant - Spanish, Spain |
|
LNC-ES-MX |
LOINC Linguistic Variant - Spanish, Mexico |
|
LNC-ET-EE |
LOINC Linguistic Variant - Estonian, Estonia |
|
LNC-FR-BE |
LOINC Linguistic Variant - French, Belgium |
|
LNC-FR-CA |
LOINC Linguistic Variant - French, Canada |
|
LNC-FR-FR |
LOINC Linguistic Variant - French, France |
|
LNC-IT-IT |
LOINC Linguistic Variant - Italian, Italy |
|
LNC-KO-KR |
LOINC Linguistic Variant - Korean, Korea |
|
LNC-NL-NL |
LOINC Linguistic Variant - Dutch, Netherlands |
|
LNC-PL-PL |
LOINC Linguistic Variant - Polish, Poland |
|
LNC-PT-BR |
LOINC Linguistic Variant - Portuguese, Brazil |
|
LNC-RU-RU |
LOINC Linguistic Variant - Russian, Russia |
|
LNC-TR-TR |
LOINC Linguistic Variant - Turkish, Turkey |
|
LNC-UK-UA |
LOINC Linguistic Variant - Ukrainian, Ukraine |
|
LNC-ZH-CN |
LOINC Linguistic Variant - Chinese, China |
|
MCM |
Glossary of Clinical Epidemiologic Terms |
|
MDR |
MedDRA |
YES |
MDRARA |
MedDRA Arabic |
|
MDRBPO |
MedDRA Brazilian Portuguese |
|
MDRCZE |
MedDRA Czech |
|
MDRDUT |
MedDRA Dutch |
|
MDRFRE |
MedDRA French |
|
MDRGER |
MedDRA German |
|
MDRGRE |
MedDRA Greek |
|
MDRHUN |
MedDRA Hungarian |
|
MDRITA |
MedDRA Italian |
|
MDRJPN |
MedDRA Japanese |
|
MDRKOR |
MedDRA Korean |
|
MDRLAV |
MedDRA Latvian |
|
MDRPOL |
MedDRA Polish |
|
MDRPOR |
MedDRA Portuguese |
|
MDRRUS |
MedDRA Russian |
|
MDRSPA |
MedDRA Spanish |
|
MDRSWE |
MedDRA Swedish |
|
MED-RT |
Medication Reference Terminology |
|
MEDCIN |
MEDCIN |
|
MEDLINEPLUS |
MedlinePlus Health Topics |
YES |
MEDLINEPLUS_SPA |
MedlinePlus Spanish Health Topics |
|
MMSL |
Multum |
YES |
MMX |
Micromedex |
YES |
MSH |
MeSH |
YES |
MSHCZE |
MeSH Czech |
|
MSHDUT |
MeSH Dutch |
|
MSHFIN |
MeSH Finnish |
|
MSHFRE |
MeSH French |
|
MSHGER |
MeSH German |
|
MSHITA |
MeSH Italian |
|
MSHJPN |
MeSH Japanese |
|
MSHLAV |
MeSH Latvian |
|
MSHNOR |
MeSH Norwegian |
|
MSHPOL |
MeSH Polish |
|
MSHPOR |
MeSH Portuguese |
|
MSHRUS |
MeSH Russian |
|
MSHSCR |
MeSH Croatian |
|
MSHSPA |
MeSH Spanish |
|
MSHSWE |
MeSH Swedish |
|
MTH |
Metathesaurus Names |
|
MTHCMSFRF |
Metathesaurus CMS Formulary Reference File |
|
MTHICD9 |
ICD-9-CM Entry Terms |
|
MTHICPC2EAE |
ICPC2E American English Equivalents |
|
MTHICPC2ICD10AE |
ICPC2E-ICD10 Thesaurus, American English Equivalents |
|
MTHMST |
Minimal Standard Terminology (UMLS) |
YES |
MTHMSTFRE |
Minimal Standard Terminology French (UMLS) |
|
MTHMSTITA |
Minimal Standard Terminology Italian (UMLS) |
|
MTHSPL |
FDA Structured Product Labels |
|
MVX |
Manufacturers of Vaccines |
YES |
NANDA-I |
NANDA-I Taxonomy |
|
NCBI |
NCBI Taxonomy |
|
NCI |
NCI Thesaurus |
YES |
NCISEER |
NCI SEER ICD Mappings |
|
NDDF |
FDB MedKnowledge |
|
NEU |
Neuronames Brain Hierarchy |
|
NIC |
Nursing Interventions Classification |
|
NOC |
Nursing Outcomes Classification |
|
NUCCHCPT |
National Uniform Claim Committee - Health Care Provider Taxonomy |
|
OMIM |
Online Mendelian Inheritance in Man |
|
OMS |
Omaha System |
|
ORPHANET |
ORPHANET |
YES |
PCDS |
Patient Care Data Set |
|
PDQ |
Physician Data Query |
|
PNDS |
Perioperative Nursing Data Set |
YES |
PPAC |
Pharmacy Practice Activity Classification |
|
PSY |
Psychological Index Terms |
YES |
QMR |
Quick Medical Reference |
|
RAM |
Clinical Concepts by R A Miller |
|
RCD |
Read Codes |
|
RCDAE |
Read Codes Am Engl |
|
RCDSA |
Read Codes Am Synth |
|
RCDSY |
Read Codes Synth |
|
RXNORM |
RXNORM |
YES |
SCTSPA |
SNOMED CT Spanish Edition |
|
SNM |
SNOMED 1982 |
|
SNMI |
SNOMED Intl 1998 |
|
SNOMEDCT_US |
SNOMED CT, US Edition |
YES |
SNOMEDCT_VET |
SNOMED CT, Veterinary Extension |
|
SOP |
Source of Payment Typology |
|
SPN |
Standard Product Nomenclature |
|
SRC |
Source Terminology Names (UMLS) |
|
TKMT |
Traditional Korean Medical Terms |
|
ULT |
UltraSTAR |
|
UMD |
UMDNS |
|
USP |
USP Compendial Nomenclature |
|
USPMG |
USP Model Guidelines |
|
UWDA |
Digital Anatomist |
|
VANDF |
National Drug File |
YES |
WHO |
WHOART |
|
WHOFRE |
WHOART French |
|
WHOGER |
WHOART German |
|
WHOPOR |
WHOART Portuguese |
|
WHOSPA |
WHOART Spanish |
|
Type Unique Identifier (TUI) | Semantic Type Name | Included in EMERSE? |
---|---|---|
T001 | Organism | |
T002 | Plant | |
T004 | Fungus | |
T005 | Virus | YES |
T007 | Bacterium | |
T008 | Animal | |
T010 | Vertebrate | |
T011 | Amphibian | |
T012 | Bird | |
T013 | Fish | |
T014 | Reptile | |
T015 | Mammal | |
T016 | Human | YES |
T017 | Anatomical Structure | YES |
T018 | Embryonic Structure | |
T019 | Congenital Abnormality | YES |
T020 | Acquired Abnormality | YES |
T021 | Fully Formed Anatomical Structure | YES |
T022 | Body System | YES |
T023 | Body Part, Organ, or Organ Component | YES |
T024 | Tissue | YES |
T025 | Cell | YES |
T026 | Cell Component | YES |
T028 | Gene or Genome | |
T029 | Body Location or Region | YES |
T030 | Body Space or Junction | YES |
T031 | Body Substance | YES |
T032 | Organism Attribute | |
T033 | Finding | YES |
T034 | Laboratory or Test Result | YES |
T037 | Injury or Poisoning | YES |
T038 | Biologic Function | |
T039 | Physiologic Function | YES |
T040 | Organism Function | YES |
T041 | Mental Process | YES |
T042 | Organ or Tissue Function | YES |
T043 | Cell Function | YES |
T044 | Molecular Function | YES |
T045 | Genetic Function | YES |
T046 | Pathologic Function | YES |
T047 | Disease or Syndrome | YES |
T048 | Mental or Behavioral Dysfunction | YES |
T049 | Cell or Molecular Dysfunction | YES |
T050 | Experimental Model of Disease | YES |
T051 | Event | |
T052 | Activity | YES |
T053 | Behavior | YES |
T054 | Social Behavior | YES |
T055 | Individual Behavior | YES |
T056 | Daily or Recreational Activity | YES |
T057 | Occupational Activity | YES |
T058 | Health Care Activity | YES |
T059 | Laboratory Procedure | YES |
T060 | Diagnostic Procedure | YES |
T061 | Therapeutic or Preventive Procedure | YES |
T062 | Research Activity | |
T063 | Molecular Biology Research Technique | |
T064 | Governmental or Regulatory Activity | |
T065 | Educational Activity | |
T066 | Machine Activity | |
T067 | Phenomenon or Process | |
T068 | Human-caused Phenomenon or Process | |
T069 | Environmental Effect of Humans | |
T070 | Natural Phenomenon or Process | |
T071 | Entity | |
T072 | Physical Object | |
T073 | Manufactured Object | |
T074 | Medical Device | YES |
T075 | Research Device | YES |
T077 | Conceptual Entity | |
T078 | Idea or Concept | |
T079 | Temporal Concept | |
T080 | Qualitative Concept | |
T081 | Quantitative Concept | |
T082 | Spatial Concept | |
T083 | Geographic Area | |
T085 | Molecular Sequence | |
T086 | Nucleotide Sequence | |
T087 | Amino Acid Sequence | |
T088 | Carbohydrate Sequence | |
T089 | Regulation or Law | |
T090 | Occupation or Discipline | YES |
T091 | Biomedical Occupation or Discipline | YES |
T092 | Organization | |
T093 | Health Care Related Organization | |
T094 | Professional Society | |
T095 | Self-help or Relief Organization | |
T096 | Group | |
T097 | Professional or Occupational Group | YES |
T098 | Population Group | YES |
T099 | Family Group | |
T100 | Age Group | |
T101 | Patient or Disabled Group | YES |
T102 | Group Attribute | |
T103 | Chemical | |
T104 | Chemical Viewed Structurally | |
T109 | Organic Chemical | YES |
T114 | Nucleic Acid, Nucleoside, or Nucleotide | YES |
T116 | Amino Acid, Peptide, or Protein | YES |
T120 | Chemical Viewed Functionally | |
T121 | Pharmacologic Substance | YES |
T122 | Biomedical or Dental Material | YES |
T123 | Biologically Active Substance | YES |
T125 | Hormone | YES |
T126 | Enzyme | YES |
T127 | Vitamin | YES |
T129 | Immunologic Factor | YES |
T130 | Indicator, Reagent, or Diagnostic Aid | YES |
T131 | Hazardous or Poisonous Substance | YES |
T167 | Substance | YES |
T168 | Food | YES |
T169 | Functional Concept | YES |
T170 | Intellectual Product | |
T171 | Language | |
T184 | Sign or Symptom | YES |
T185 | Classification | YES |
T190 | Anatomical Abnormality | YES |
T191 | Neoplastic Process | YES |
T192 | Receptor | |
T194 | Archaeon | |
T195 | Antibiotic | YES |
T196 | Element, Ion, or Isotope | YES |
T197 | Inorganic Chemical | YES |
T200 | Clinical Drug | YES |
T201 | Clinical Attribute | |
T203 | Drug Delivery Device | YES |
T204 | Eukaryote | |
Document Fields Example
<schema name="example" version="1.5">
  <!-- You can use different field names for all fields.
       Just be sure they match what the EMERSE Admin app expects the fields to be named. -->
  <uniqueKey>ID</uniqueKey>

  <!-- Unique across sources -->
  <field name="ID" type="string" required="true"/>

  <!-- Unique within a source, optional.
       For use by a user if they want to cross-reference
       a document in EMERSE with another system
       without having to demangle the cross-source unique ID above -->
  <field name="RPT_ID" type="string"/>

  <!-- Required fields -->
  <field name="MRN" type="string" required="true"/>
  <field name="ENCOUNTER_DATE" type="date" required="true"/>
  <field name="DOC_TYPE" type="string" required="true"/>
  <field name="SOURCE" type="string" required="true"/>
  <field name="RPT_TEXT" type="text_with_header_post_analysis" required="true"/>

  <!-- Required fields which are calculated by the ETL system
       based on the encounter date and the birthdate of the patient. -->
  <field name="AGE_DAYS" type="int" required="true"/>
  <field name="AGE_MONTHS" type="int" required="true"/>

  <!-- The encounter ID, if available -->
  <field name="CSN" type="string"/>

  <!-- Other fields describing metadata -->
  <field name="ADMIT_DATE" type="date"/>
  <field name="RPT_DATE" type="date"/>
  <field name="DEPT" type="string" default="unknown"/>
  <field name="SVC" type="string" default="unknown"/>
  <field name="CLINICIAN" type="string" default="unknown"/>
  <field name="CATEGORY" type="string" multiValued="true"/>

  <!-- The NLP_HEADER is optional but cannot be renamed -->
  <field name="NLP_HEADER" type="string" indexed="false"/>

  <!-- Metadata fields related to indexing itself, provided by the ETL system -->
  <field name="INDEX_DATE" type="date"/>
  <field name="LAST_UPDATED" type="date"/>

  <!-- _version_ must be a field since it's used by Solr internally.
       Solr fills in the values automatically. -->
  <field name="_version_" type="long"/>

  <!-- Field Types -->
  <fieldType name="date" class="solr.DatePointField"
             sortMissingLast="true" uninvertible="false" docValues="true"/>
  <fieldType name="long" class="solr.LongPointField"
             sortMissingLast="true" uninvertible="false" docValues="true"/>
  <fieldType name="int" class="solr.IntPointField"
             sortMissingLast="true" uninvertible="false" docValues="true"/>
  <fieldType name="string" class="solr.StrField"
             sortMissingLast="true" uninvertible="false" docValues="true"/>
  <fieldType name="boolean" class="solr.BoolField"
             sortMissingLast="true" uninvertible="false" docValues="true"/>
  ....
</schema>
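The schema marks AGE_DAYS and AGE_MONTHS as required values that the ETL system derives from the encounter date and the patient's birthdate. The following Python sketch shows one way an ETL step might compute them and assemble the required fields. The function names, the sample values, and the choice to count completed calendar months are assumptions made for illustration; this guide does not specify the exact calculation EMERSE expects.

```python
from datetime import date

# Field names marked required="true" in the example schema above.
REQUIRED_FIELDS = (
    "ID", "MRN", "ENCOUNTER_DATE", "DOC_TYPE", "SOURCE",
    "RPT_TEXT", "AGE_DAYS", "AGE_MONTHS",
)


def age_in_days(birth, encounter):
    """Number of days between the birthdate and the encounter date."""
    return (encounter - birth).days


def age_in_months(birth, encounter):
    """Completed calendar months between the two dates (an assumed convention)."""
    months = (encounter.year - birth.year) * 12 + (encounter.month - birth.month)
    if encounter.day < birth.day:
        months -= 1  # the current month has not been completed yet
    return months


def solr_date(d):
    """Format a date as the UTC timestamp string that Solr date fields accept."""
    return d.strftime("%Y-%m-%dT00:00:00Z")


def build_document(doc_id, mrn, birth, encounter, doc_type, source, text):
    """Assemble a dict holding the required fields (hypothetical helper)."""
    doc = {
        "ID": doc_id,
        "MRN": mrn,
        "ENCOUNTER_DATE": solr_date(encounter),
        "DOC_TYPE": doc_type,
        "SOURCE": source,
        "RPT_TEXT": text,
        "AGE_DAYS": age_in_days(birth, encounter),
        "AGE_MONTHS": age_in_months(birth, encounter),
    }
    missing = [name for name in REQUIRED_FIELDS if doc[name] in (None, "")]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return doc


# All values below are made up for the example.
doc = build_document(
    "src1-0001", "12345", date(2000, 1, 31), date(2000, 3, 1),
    "note", "src1", "This is not a teddy bear.",
)
print(doc["AGE_DAYS"], doc["AGE_MONTHS"], doc["ENCOUNTER_DATE"])
```

If your site renames fields, the keys in the dictionary would need to match the names configured in the EMERSE Admin app.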
EMERSE Database
EMERSE stores NLP artifacts in two database tables. By default, only the artifacts built into EMERSE are populated. To make custom NLP artifacts available in the EMERSE application, add rows for them to these tables.
Table SEMANTIC_GROUP
This table stores NLP semantic groups, which are discussed in: Token-Aligned Layered Index Structure
Column Name | Type | Description |
---|---|---|
GROUP_ID | VARCHAR(25) | Semantic group ID. This is the name stored in the Solr index to represent an entity. |
LABEL | VARCHAR(50) | Group name: a user-friendly name for the semantic group that is displayed in the UI. |
BORDER_CSS | VARCHAR(50) | CSS value for "border-bottom", e.g. "0.3em dotted red". When a concept in the semantic group is highlighted in the document, parts of this CSS value are used to render the highlight. |
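As a rough illustration of registering a custom semantic group, the Python sketch below creates a table with the three columns listed above and inserts one row. It uses an in-memory SQLite database purely as a stand-in; a real EMERSE installation has its own database and DDL, which may include constraints not shown here. The group ID reuses the SMG_STUFFED_ANIMALS example from the Token-Aligned Layered Index Structure section, while the label, the CSS value, and the helper name `add_semantic_group` are made up for the example.

```python
import sqlite3

# Stand-in database for illustration only; the real EMERSE schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE SEMANTIC_GROUP (
        GROUP_ID   VARCHAR(25),
        LABEL      VARCHAR(50),
        BORDER_CSS VARCHAR(50)
    )
    """
)


def add_semantic_group(connection, group_id, label, border_css):
    """Insert one semantic group row using bound parameters (hypothetical helper)."""
    connection.execute(
        "INSERT INTO SEMANTIC_GROUP (GROUP_ID, LABEL, BORDER_CSS) VALUES (?, ?, ?)",
        (group_id, label, border_css),
    )
    connection.commit()


# GROUP_ID must match the name stored in the Solr index for the entity.
add_semantic_group(conn, "SMG_STUFFED_ANIMALS", "Stuffed Animals", "0.3em dotted red")

rows = conn.execute(
    "SELECT GROUP_ID, LABEL, BORDER_CSS FROM SEMANTIC_GROUP"
).fetchall()
print(rows)
```

The equivalent INSERT statement against the actual EMERSE database would use that database's own client and transaction handling.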
Table NLP_TAG
This table stores NLP attribute groups, which are described in: Group of Attribute IDs
Column Name | Type | Description |
---|---|---|
GROUP_ID | VARCHAR(25) | Unique ID for a group of attribute IDs. This ID is used to reference a set of attribute IDs. |
QUERY_TAG | VARCHAR(25) | The attribute ID in the group that represents the mention. |
HIGHLIGHT_TAGS | VARCHAR(100) | All attribute IDs in the group that can be highlighted in the document. |
TAG_INCLUDED_PROMPT | VARCHAR(300) | Text describing what it means when this attribute is included in the query, that is, when +[ATTRIBUTE_ID] is selected. |
TAG_EXCLUDED_PROMPT | VARCHAR(300) | Text describing what it means when this attribute is excluded from the query, that is, when -[ATTRIBUTE_ID] is selected. |
TAG_NEUTRAL_PROMPT | VARCHAR(300) | Text describing what it means when this attribute is not specified in the query. |
LABEL | VARCHAR(50) | The display name for this attribute group in the UI. |
BORDER_CSS | VARCHAR(50) | CSS value for "border-bottom", e.g. "0.15em solid #EA3223". When an attribute group is highlighted in the document, parts of this CSS value are used to render the highlight. |
DISPLAY_ORDER | NUMERIC | The position of this attribute group relative to others when displayed in the UI. |
SUBWAY_ICON | VARCHAR(1) | A single letter that represents this attribute group. The letter appears in various places in the UI to indicate the selection status of the group. |
SUBWAY_INCLUDED_PROMPT | VARCHAR(100) | Tooltip text shown when the cursor hovers over the icon while this attribute group is included in the query. |
SUBWAY_EXCLUDED_PROMPT | VARCHAR(100) | Tooltip text shown when the cursor hovers over the icon while this attribute group is excluded from the query. |
SUBWAY_NEUTRAL_PROMPT | VARCHAR(100) | Tooltip text shown when the cursor hovers over the icon while this attribute group is not in the query. |
HIDDEN_ON_UI | BOOLEAN | Whether this attribute group should be hidden in the UI. |
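Because most NLP_TAG columns have fixed maximum lengths, it can help to check a candidate row before inserting it. The Python sketch below encodes the column sizes from the table above and reports any values that would not fit. The function name and every value in the sample row are hypothetical (the attribute ID POL_NEG is borrowed from the earlier index example), and whether any column is nullable is not stated in this guide, so absent columns are simply skipped.

```python
# Maximum lengths taken from the NLP_TAG column definitions above.
NLP_TAG_MAX_LENGTHS = {
    "GROUP_ID": 25,
    "QUERY_TAG": 25,
    "HIGHLIGHT_TAGS": 100,
    "TAG_INCLUDED_PROMPT": 300,
    "TAG_EXCLUDED_PROMPT": 300,
    "TAG_NEUTRAL_PROMPT": 300,
    "LABEL": 50,
    "BORDER_CSS": 50,
    "SUBWAY_ICON": 1,
    "SUBWAY_INCLUDED_PROMPT": 100,
    "SUBWAY_EXCLUDED_PROMPT": 100,
    "SUBWAY_NEUTRAL_PROMPT": 100,
}


def validate_nlp_tag_row(row):
    """Return a list of problems found in a candidate NLP_TAG row (a dict)."""
    problems = []
    for column, limit in NLP_TAG_MAX_LENGTHS.items():
        value = row.get(column)
        if value is not None and len(str(value)) > limit:
            problems.append(f"{column} exceeds max length {limit}")
    order = row.get("DISPLAY_ORDER")
    if order is not None and (isinstance(order, bool) or not isinstance(order, (int, float))):
        problems.append("DISPLAY_ORDER must be numeric")
    hidden = row.get("HIDDEN_ON_UI")
    if hidden is not None and not isinstance(hidden, bool):
        problems.append("HIDDEN_ON_UI must be a boolean")
    return problems


# Hypothetical row for a negation attribute group; all values are examples.
negation_row = {
    "GROUP_ID": "NEGATION",
    "QUERY_TAG": "POL_NEG",
    "HIGHLIGHT_TAGS": "POL_NEG",
    "TAG_INCLUDED_PROMPT": "Only match negated mentions",
    "TAG_EXCLUDED_PROMPT": "Skip negated mentions",
    "TAG_NEUTRAL_PROMPT": "Match mentions regardless of negation",
    "LABEL": "Negation",
    "BORDER_CSS": "0.15em solid #EA3223",
    "DISPLAY_ORDER": 1,
    "SUBWAY_ICON": "N",
    "SUBWAY_INCLUDED_PROMPT": "Negated mentions only",
    "SUBWAY_EXCLUDED_PROMPT": "Negated mentions excluded",
    "SUBWAY_NEUTRAL_PROMPT": "Negation not considered",
    "HIDDEN_ON_UI": False,
}

print(validate_nlp_tag_row(negation_row))
```

A check like this only catches size and type mismatches; it does not verify that the attribute IDs actually exist in the Solr index.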
EMERSE UI Consideration
EMERSE queries a Solr field named nlp_display within the index. When this field is populated, its contents are shown in the NLP Detail section. This is useful when a user’s NLP tool generates supplementary content, such as summaries or abstracts, based on the original document.
Users can also define separate Solr fields to store NLP outputs. These fields can then be configured as query filters to refine searches.
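A minimal Python sketch of how an ETL step might attach such NLP outputs to a document before indexing is shown below. The field name nlp_display comes from this guide; the helper `with_nlp_outputs`, the extra field NLP_TOY_MENTIONED, and all sample values are hypothetical. Any extra field would also have to be defined in your Solr schema. The sketch only builds the JSON body (a list of documents, the shape Solr's JSON update handler accepts) and does not send it anywhere.

```python
import json


def with_nlp_outputs(doc, display_text, extra_fields=None):
    """Return a copy of doc with nlp_display and any extra NLP fields added."""
    enriched = dict(doc)  # copy so the caller's document is left untouched
    enriched["nlp_display"] = display_text
    enriched.update(extra_fields or {})
    return enriched


# Made-up document and NLP outputs for illustration.
base_doc = {"ID": "src1-0001", "RPT_TEXT": "This is not a teddy bear."}
doc = with_nlp_outputs(
    base_doc,
    "Summary: the note states that no teddy bear is present.",
    {"NLP_TOY_MENTIONED": "no"},  # hypothetical field usable as a query filter
)

# JSON body containing a single document.
payload = json.dumps([doc])
print(payload)
```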