This guide describes how natural language processing (NLP) is incorporated into EMERSE, how to configure the system properly, and how to make changes so that you can incorporate your own annotations into the search index.
Background
EMERSE uses NLP to enhance basic keyword search by allowing for more semantic search. The most basic form of search is when a term matches only exactly that word in documents. This can find too little: you might want to find that word regardless of case, misspellings, or conjugations, or find other words that mean the same thing. It can also find too much: you may not want to find places where the word refers to something else, when it is not in reference to the patient, or when it appears in a certain phrase. EMERSE already provides search limiters to do much of this, but NLP is in a position to provide more nuanced solutions. It can selectively tag words or phrases as referring to abstract concepts, disambiguating word or acronym usage, tag words to indicate when they are talking about the patient, and more.
We’ve integrated the open-source libraries OpenNLP and CTAKES for our NLP. However, it is possible to disable them, or to add the output of another NLP process to theirs.
NLP Implementation
Solr
Our NLP system is implemented as part of Solr. This means, when you send a document to Solr, Solr will tokenize the document, and index those tokens, thus making it searchable. Documents in Solr consist of many fields, only one of which is the text of the document, and the others are metadata fields, such as when the document was written, the patient it is for etc. Each such field needs a description of how Solr should process it, in particular, how it should be tokenized. This is what the field’s type is for.
The fields a document can have, plus their types, are all defined in the managed-schema XML file in the core’s conf directory in the index. A declaration of a field type looks like this:
<fieldType
name="SOME NAME"
class="org.example.SomeClass"
other-options-possible>
<analyzer type="index">...</analyzer>
<analyzer type="query">...</analyzer>
</fieldType>
Once you have the type, you can make zero or more fields actually use that type, like so:
<field name="SOME FIELD NAME" type="SOME FIELD TYPE NAME"/>
The main work in running NLP in Solr is defining the <analyzer type="index"> element in the field type. The analyzer with type="index", called the index analyzer, is used when indexing documents, and is what this document primarily covers. The other analyzer is the query analyzer, which is covered in the Query Analyzer section.
However, the class of the field type is important too: you must use org.emerse.solr.TextWithHeaderFieldType instead of the more normal solr.TextField if your tokenizer has hasHeader="true" set. (See below for more info.)
When you send a header with the document text, you are giving the analyzer additional information it needs for tokenization and indexing. However, that header is not really part of the document, and so shouldn’t be stored and presented as if it were. In order to present the header to the analyzer but not store it, we needed a custom class for the field type.
However, we do actually need to store the header in order to support document updates, where you may update the metadata of the document, but not change the text. This is implemented as a delete-and-insert in Solr, meaning it synthesizes a new document based on the existing document and the update you wish to make to fields of the document. Thus, we need a place to store the header, so we can reconstruct the header-plus-document-text to present it to the analyzer. This place is a field named NLP_HEADER and must have exactly the definition:
<field name="NLP_HEADER" type="string" indexed="false" stored="true" docValues="true"/>
Finally, and importantly, if the field type has this custom class, you must always have a header in every document.
For quick reference, a document with a header attached looks like:
RU1FUlNFX0g=1|5|
Mrs. Anderson was seen in our clinic today. She is complaining of mild shortness of breath, as well as moderate fatigue and hair loss. She denies fevers, loss of appetite, or chest pain. Her mother had Hashimoto's thyroiditis. She has a history of type 2 diabetes and myopia. Mrs. Anderson does not have a history of thyroid cancer. She should be tested for a possible autoimmune disorder.
The header is the first line. See the section The NLP Header for more details. It’s possible to extend the header to include custom NLP tokens, as covered below.
One of the simpler examples of a field type of the document text field is:
<fieldType name="text_with_header_post_analysis"
class="org.emerse.solr.TextWithHeaderFieldType"
termPositions="true" termVectors="true" termOffsets="true" termPayloads="true"
>
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
standardTokenizer="true"
hasHeader="true"
tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
numOfNewLineSymbols="2"
/>
<filter class="org.emerse.nlp.PostAnalysisFilterFactory" allowMisalignment="true"/>
</analyzer>
</fieldType>
In the rest of the document, whenever you see an <analyzer type="index"> element, or when we talk about an analyzer, analysis chain, or pipeline, it is this index analyzer element in the field type for the document text’s field that we are talking about.
Tokens
EMERSE uses a custom Lucene Analyzer to run NLP over the text, and produce tokens for indexing. Ordinarily, tokens are just words of the text, sometimes stemmed, lowercased, etc, depending on how you want search terms to match the text. However in EMERSE, we have multiple tokens per word of text, with different tokens conveying different information about the text. So as to prevent ambiguity, tokens that convey the words of the text itself are prefixed with either TX_ or tx_. Tokens that convey other things are not allowed to start with these prefixes.
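The prefixing convention can be sketched in a few lines of Python. This is only an illustration, not EMERSE's implementation: the function names are hypothetical, and lowercasing the tx_ form is an assumption about how case-insensitive matching is achieved.

```python
def text_tokens(word: str) -> list[str]:
    """Return the two text tokens assumed to be indexed for one word."""
    # TX_ keeps the original case (case-sensitive matching).
    # tx_ is lowercased here, which is an assumption about how the
    # case-insensitive form is normalized.
    return [f"TX_{word}", f"tx_{word.lower()}"]


def is_text_token(token: str) -> bool:
    """Text tokens are exactly those carrying one of the two prefixes."""
    return token.startswith(("TX_", "tx_"))
```

Because NLP tokens such as POL_NEG may not start with either prefix, a check like is_text_token is enough to tell the two kinds of token apart.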
NLP Tokens
Other than tokens that refer to the words of the text itself, there are two broad categories of tokens:
- Entities: An entity is a token that refers to a specific concept, such as one given in the UMLS. It makes sense for users to search directly for an entity within the text. An example entity could be CUI_2395, which refers to alzheimers disease.
- Annotations: An annotation is a token which adds auxiliary information to entities and text tokens. It doesn’t make sense for a user to search for it directly, but rather to use it to modify a search for another term. An example could be POL_NEG, which indicates the word is used in a negated sense within a sentence, such as the phrase "alzheimers disease" in the phrase "does not have alzheimers disease".
The inclusion of these two types of tokens is the contribution of the NLP system to the search index, and enables all the NLP search features in EMERSE to work.
Token Attributes
Each token has a range of characters it refers to in the text. Similarly, each token has a range of "positions" it refers to. These "positions" are conceptually positions of the words of the text, with the first word in the text at position one, the second word of the text at position two, etc. Tokens that refer to words of the text itself only span one position, but NLP tokens can span multiple positions.
Positions determine how search phrases match. If a phrase such as alzheimers disease is to match, the end position of the token tx_alzheimers must be one less than the start position of the token tx_disease.
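The adjacency rule for phrases can be illustrated with a small Python sketch. The Token class and phrase_matches function are hypothetical names used only for illustration; the real matching is done by the search engine, not by code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    start_pos: int  # first word position covered (1-based)
    end_pos: int    # last word position covered (inclusive)


def phrase_matches(tokens, phrase):
    """True if the phrase terms occur as adjacent tokens, in order."""
    if not phrase:
        return False
    by_text = {}
    for tok in tokens:
        by_text.setdefault(tok.text, []).append(tok)

    def extend(prev_end, remaining):
        if not remaining:
            return True
        # The next term must start exactly one position after the previous end.
        return any(
            tok.start_pos == prev_end + 1 and extend(tok.end_pos, remaining[1:])
            for tok in by_text.get(remaining[0], [])
        )

    return any(extend(tok.end_pos, phrase[1:]) for tok in by_text.get(phrase[0], []))
```

Since the rule compares the end of one token with the start of the next, a token spanning several positions can take part in a phrase the same way a single-position token does.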
Token Attribute Reconciliation
Words determine what the positions are, and those same words have character ranges as well, so we want any token that spans a certain set of positions to have the corresponding range of characters. However, NLP systems generally only output character ranges, and they don’t necessarily correspond exactly with the way we break up words in our system. For this reason, we reconcile the character offsets of the output of NLP with the tokenization of words in Solr, assigning position ranges based on the character ranges, and clipping character ranges to the character ranges of those corresponding positions. (Not all characters need to be in the character range of a position; whitespace or punctuation is often not part of any "word" and hence not part of character ranges of tokens, unless those tokens span multiple words and those characters are between those words.)
However, in assigning positions to NLP tokens based on their character ranges, we can make a choice: we can either have a single token in the index associated with multiple positions, or duplicate that token once for each such position, so that each token spans only a single position. Conventionally, we duplicate the token for annotations, but have the token span many positions for entities.
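The reconciliation step can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual filter: the reconcile function and its tuple layout are hypothetical, character ranges are treated as end-exclusive, and a partially covered word is assumed to count as covered, with the resulting range snapped to word boundaries.

```python
def reconcile(words, start, end, token, duplicate):
    """Assign word positions to an NLP token given only its character range.

    words: list of (char_start, char_end) per word, end exclusive, in text order.
    Returns tuples of (token, first_pos, last_pos, char_start, char_end).
    """
    # 1-based positions of the words that overlap the NLP character range.
    hits = [i + 1 for i, (ws, we) in enumerate(words) if ws < end and we > start]
    if not hits:
        return []
    if duplicate:
        # Annotation style: one copy of the token per covered position.
        return [(token, p, p, words[p - 1][0], words[p - 1][1]) for p in hits]
    # Entity style: a single token spanning all covered positions.
    first, last = hits[0], hits[-1]
    return [(token, first, last, words[first - 1][0], words[last - 1][1])]
```

For the words of "does not have alzheimers disease", an entity covering characters 14 to 32 becomes one token over positions 4 to 5, while an annotation over the same range becomes two tokens, one per position.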
Token-Aligned Layered Index Structure
Here is an example to conceptually illustrate the Token-Aligned Layered Index using a sentence: This is not a teddy bear.
| This | is | not | a | teddy | bear |
|---|---|---|---|---|---|
| | | PPOL_NEG | POL_NEG | POL_NEG | POL_NEG |
| | | | | CUI_TEDDY_BEAR | |
| | | | | SMG_STUFFED_ANIMALS | |
There are four layers in this index. Text tokens are in the first row (with the prefix omitted). The following rows contain NLP tokens that are aligned with the text tokens. The second row contains the annotations, the third row has an entity which is a concept from UMLS, and the fourth row has an entity which is a semantic group. Each concept is assigned one or more semantic groups, making semantic groups a very large category of concepts. More information on semantic groups can be found here.
NLP Pipeline
A Lucene analysis chain, or the "NLP pipeline" as we call it, comprises:
- zero or more CharFilters,
- a single Tokenizer, and
- zero or more TokenFilters.
CharFilters work mainly at the character level of the text, effectively pre-processing the text so what is seen down the line is in more of a normal form. For instance, it can cause the pipeline to effectively treat special characters as other more normal characters or to ignore parts of the text that correspond to HTML markup.
The tokenizer decides what the words and sentences are in the text.
The token filters do the NLP work, generally adding additional tokens to the stream, though it is possible for them to remove tokens as well. The order of token filters doesn’t matter, unless we otherwise note that it does.
Here is a possible implementation of the EMERSE NLP pipeline:
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
<tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
hasHeader="true"
usingLuceneTokenizer="true"
/>
<filter class="org.emerse.ctakes.CTAKESFilterFactory"
excludeTypes="nlp/pos_exclude.txt"
prefix="CUI"
dictionary="nlp/dictionary/custom.script"
posModel="nlp/models/mayo-pos.zip"
/>
<filter class="org.emerse.nlp.TaggerFilterFactory"
addTokenPrefix="false"
delimiters="nlp/delimiters.txt"
negations="nlp/negations.txt"
negationIgnore="nlp/negationIgnore.txt"
certainty="nlp/probability_resource_patterns.txt"
family="nlp/family_history_resource_patterns_A.txt"
familyTemplates="nlp/family_history_resource_patterns_B.txt"
familyMembers="nlp/family_history_resource_patterns_C.txt"
history="nlp/history.txt"
historyIgnore="nlp/historyIgnore.txt"
tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
/>
<filter class="org.emerse.nlp.PostAnalysisFilterFactory"/>
</analyzer>
We’ll go over the options for each of these.
HTML Strip CharFilter
<charFilter class="solr.HTMLStripCharFilterFactory"/>
The HTML strip CharFilter causes the rest of the system to ignore HTML tags. These are just the tags themselves, not the children of those tags. For instance, in the text <b>Alzheimer’s</b>, the tags <b> and </b> will be ignored, but not the word Alzheimer’s. The offsets of the tokens do account for the tags, so the start of the word Alzheimer’s is character three, not zero, within this snippet.
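The offset behavior can be demonstrated with a short Python sketch. It is not how Solr's HTMLStripCharFilter is implemented; it merely masks tags with spaces so that offsets into the original text are preserved. The regular expression and function name are assumptions made for illustration.

```python
import re

# Naive tag pattern, used only for illustration.
TAG = re.compile(r"<[^>]+>")


def words_outside_tags(text):
    """Return (word, start_offset) pairs with offsets into the original text."""
    # Replace each tag with spaces of equal length so offsets do not shift.
    masked = TAG.sub(lambda m: " " * len(m.group()), text)
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", masked)]
```

Running it on <b>Alzheimer's</b> yields the word at offset 3, which matches the behavior described above.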
Mapping CharFilter
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
The Mapping CharFilter replaces every occurrence of one character with another, at the direction of the given file, mapping-delimiters.txt. We use this to normalize certain punctuation characters, such as "smart" or "curly" quote marks which differ depending whether they are opening or closing quotes, to straight ones that don’t differ.
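As a rough illustration of this normalization, the sketch below maps curly quotes to straight ones in Python. The mapping table is hypothetical and is not the content of mapping-delimiters.txt.

```python
# Hypothetical mapping: curly quotes to their straight equivalents.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_quotes(text: str) -> str:
    """Replace each mapped character with its one-character replacement."""
    return text.translate(QUOTE_MAP)
```

Each replacement here has the same length as the character it replaces, so character offsets are unchanged in this sketch.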
OpenNLP Tokenizer
<tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
hasHeader="true"
usingLuceneTokenizer="true"
/>
The OpenNLP tokenizer decides what the "words" of the text are. Though it may seem obvious what these are, there are edge cases around numbers, hyphenation, etc., where there are choices to be made that affect search. It also decides what the sentences of the text are.
| Argument Name | Required | Default Value | Description |
|---|---|---|---|
| sentenceModel | Yes | N/A | String, the file path of the pre-trained sentence detection model for OpenNLP. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| tokenizerModel | Yes | N/A | String, the file path of the pre-trained tokenization model for OpenNLP. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| usingLuceneTokenizer | No | False | True or False. If True, tokens are reprocessed using Lucene’s StandardTokenizer, which ensures that the indexed tokens adhere to the behavior of the StandardTokenizer. |
| hasHeader | No | False | True or False. If True, the tokenizer expects a single-line header at the beginning of the document. If the document has a header, the field type for storing the text in the Solr index has to be org.emerse.solr.TextWithHeaderFieldType |
| numOfNewLineSymbols | No | 1 | Numeric, the number of newline symbols (carriage return and line feed combined) before the end of a sentence. This is a hint to help the tokenizer decide the boundaries of sentences. |
| numOfSpaces | No | 5 | Numeric, the number of spaces before the end of a sentence. This is a hint to help the tokenizer decide the boundaries of sentences. |
The NLP Header
The OpenNLP tokenizer expects documents to have a header attached to them if the attribute hasHeader is set to true (as shown in the example). The header is indicated by a kind of file signature or "magic number": the text RU1FUlNFX0g= should appear at the very beginning of the document. This text is unlikely to occur in any real document, and hence is a good indication that it is intended to be the header. After this signature, there are three fields separated by pipes (|):
- the number of newlines that should end a sentence,
- the number of spaces that should end a sentence, and
- boundary text, which indicates the end of the custom token TSV and start of the actual document.
The boundary is all the text from the last pipe character to the next newline. If the boundary text is empty, the document starts immediately. Otherwise, the custom token TSV starts immediately, and continues until a newline followed by the boundary text, and then another newline is seen, at which point the document starts.
Thus, a header can look like:
RU1FUlNFX0g=1|5|
or:
RU1FUlNFX0g=2|5|XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==
The OpenNLP tokenizer does not work correctly with a custom token TSV. Instead, use the Hybrid Tokenizer.
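To make the header layout concrete, here is a minimal Python sketch that splits a raw input into its header fields, custom token rows, and document text. It follows the description above but is not EMERSE's parser: the function name is hypothetical, only "\n" line endings are handled, and the boundary text is assumed not to contain a pipe.

```python
SIGNATURE = "RU1FUlNFX0g="  # header signature given in this guide


def parse_document(raw: str):
    """Split raw input into (newlines, spaces, tsv_rows, text)."""
    if not raw.startswith(SIGNATURE):
        raise ValueError("missing NLP header signature")
    # The header is everything up to the first newline.
    header, _, rest = raw[len(SIGNATURE):].partition("\n")
    newlines, spaces, boundary = header.split("|", 2)
    if not boundary:
        # Empty boundary: the document starts immediately after the header.
        return int(newlines), int(spaces), [], rest
    # The TSV ends at a newline, the boundary text, and another newline.
    marker = "\n" + boundary + "\n"
    tsv, found, text = ("\n" + rest).partition(marker)
    if not found:
        raise ValueError("boundary line not found")
    rows = [line.split("\t") for line in tsv.split("\n") if line]
    return int(newlines), int(spaces), rows, text
```

Prepending a newline to the remainder before searching lets the same marker also match when the TSV section is empty and the boundary line follows the header directly.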
Hybrid Tokenizer
<tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
hasHeader="true"
usingLuceneTokenizer="true"
/>
The hybrid tokenizer is basically the same as the OpenNLP tokenizer, except that it works with the custom token TSV, introducing the tokens stated there into the token stream it produces. See the Custom NLP section for more information on the format of that TSV, and how to include your custom NLP tokens in documents.
| Argument Name | Required | Default Value | Description |
|---|---|---|---|
| standardTokenizer | No | False | True or False. If True, HybridTokenizer will be wrapped with Lucene StandardTokenizer; if False, OpenNLPTokenizer will be wrapped. If the user decides to take any NLP artifact from the EMERSE NLP filters into the Solr index, OpenNLPTokenizer should be used. |
| maxTokenCount | No | 255 | Numeric, for StandardTokenizer only. It defines the maximum token length. If a token is seen that exceeds this length, it is split at maxTokenCount intervals. |
| lineFeedWidth | No | 1 | Numeric, the length of the line feed at the end of the header row. If nothing is specified, EMERSE expects a single line feed at the end of the header row. |
CTAKES TokenFilter
<filter class="org.emerse.ctakes.CTAKESFilterFactory"
excludeTypes="nlp/pos_exclude.txt"
prefix="CUI"
dictionary="nlp/dictionary/custom.script"
posModel="nlp/models/mayo-pos.zip"
/>
The CTAKES TokenFilter produces entity tokens for UMLS concepts and semantic groups.
| Argument Name | Required | Default Value | Description |
|---|---|---|---|
| dictionary | Yes | N/A | String, the file path of the dictionary. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| prefix | Yes | N/A | String, the prefix that is prepended to the entity ID that will be inserted into the index |
| longestMatch | No | True | True or False. This applies when multiple entities are recognized and one of them encompasses all the others. If True, only the longest entity is included in the index; if False, all identified entities are added to the index. |
| posModel | No | N/A | String, the file path of the POS (part-of-speech) model. The POS model combined with an excludeTypes list is used to exclude tokens with certain POS types from being identified as entities. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| excludeTypes | No | N/A | String, the file path of the list of POS (part-of-speech) types. If a token has been identified with any POS type in this list, no entity extraction will be conducted against this token. This list is only used if a posModel is given. |
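The effect of longestMatch can be sketched as a filter over recognized entities. This is an illustration of the idea only, assuming entities are given as (start, end, id) tuples with character offsets; the function name is hypothetical and the real filter may resolve overlaps differently.

```python
def keep_longest(entities):
    """Drop every entity whose range lies inside a strictly longer entity.

    entities: list of (start, end, entity_id) tuples.
    """
    def inside_longer(a, b):
        # True if a lies within b and b is strictly longer than a.
        return b[0] <= a[0] and a[1] <= b[1] and (b[1] - b[0]) > (a[1] - a[0])

    return [e for e in entities if not any(inside_longer(e, other) for other in entities)]
```

Entities with identical ranges are all kept in this sketch, since none of them is strictly longer than the others.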
Tagger TokenFilter
<filter class="org.emerse.nlp.TaggerFilterFactory"
addTokenPrefix="false"
delimiters="nlp/delimiters.txt"
negations="nlp/negations.txt"
negationIgnore="nlp/negationIgnore.txt"
certainty="nlp/probability_resource_patterns.txt"
family="nlp/family_history_resource_patterns_A.txt"
familyTemplates="nlp/family_history_resource_patterns_B.txt"
familyMembers="nlp/family_history_resource_patterns_C.txt"
history="nlp/history.txt"
historyIgnore="nlp/historyIgnore.txt"
tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
/>
The Tagger TokenFilter uses a series of rules to produce annotation tokens for
- negation,
- uncertainty,
- non-patient mentions, and
- patient history.
| Argument Name | Required | Default Value | Description |
|---|---|---|---|
| delimiters | No | N/A | String, the file path for the delimiters. Delimiters divide a sentence into distinct sections, so that the detection of a mention does not extend beyond the section it occurs in and cause inaccuracies towards the end of the sentence. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| negations | No | N/A | String, the file path for the negation rules. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| negationIgnore | No | N/A | String, the file path for the negation ignore rules. This is useful for cases like can not be ignored. If can not is a negation rule, then without this ignore rule, can not be ignored could be wrongly identified as negation. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| certainty | No | N/A | String, the file path for the uncertainty rules. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| family | No | N/A | String, the file path for the family rules that are in static mode. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| familyTemplates | No | N/A | String, the file path for the family rules that are in dynamic mode. This argument works together with the familyMembers argument. The rules in familyTemplates act as format strings, and the placeholders in them are replaced by all the rules defined in familyMembers. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| familyMembers | No | N/A | String, the file path for the family member definitions. This argument works together with the familyTemplates argument. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| history | No | N/A | String, the file path for the patient history mention rules. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| historyIgnore | No | N/A | String, the file path for the history ignore rules. This is useful for cases like history on admit. If history is a history rule, then without this ignore rule, history on admit could be wrongly identified as patient history. If a relative path is given, it is resolved against SOLR_DATA_ROOT/CORE_NAME/conf/ |
| addTokenPrefix | No | True | True or False. If True, this filter adds the prefixes TX_ and tx_ to tokens. Tokens with the TX_ prefix are case-sensitive, while tokens with the tx_ prefix are case-insensitive. If False, no prefix is added and the tokens are sent downstream as they are. |
Post Analysis TokenFilter
<filter class="org.emerse.nlp.PostAnalysisFilterFactory"/>
The Post analysis TokenFilter does token reconciliation. It must be the last token filter in the pipeline.
| Argument Name | Required | Default Value | Description |
|---|---|---|---|
| addTokenPrefix | No | True | True or False. If True, a token will be provided in both case-sensitive and case-insensitive forms, prefixed with TX_ and tx_ respectively. |
Query Analyzer
So far, we’ve discussed the index analyzer. What about the query analyzer?
The query analyzer is used when analyzing the text of users’ queries, so that the words users type into term bundles in EMERSE are broken into the same tokens as for indexing. That way they match up and documents can be found. However, the query analyzer doesn’t do NLP, since the user isn’t entering general text, so the analysis chain needs to be different, which is why there are separate analyzers for index and query.
You should use the same char filters and tokenizer as the main index analyzer, but use none of the token filters. Instead, use only the org.emerse.nlp.TokenAdjusterFilterFactory token filter. The token adjuster filter prepends the TX_ or tx_ prefix to words, except when the word already appears to have one of the reserved prefixes, as given in the reservedPrefix attribute.
<analyzer type="query">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
<tokenizer class="solr.OpenNLPTokenizerFactory"
sentenceModel="nlp/models/sd-med-model.zip"
tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"/>
<filter class="org.emerse.nlp.TokenAdjusterFilterFactory"
reservedPrefix="CUI_,POL_NEG,PPOL_NEG,APOL_NEG,TG_FAMILY,PTG_FAMILY,ATG_FAMILY,CT_UNCERTAIN,PCT_UNCERTAIN,ACT_UNCERTAIN,TG_HISTORY,PTG_HISTORY,ATG_HISTORY"/>
</analyzer>
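The behavior of the token adjuster can be approximated with the Python sketch below. It is an assumption-laden illustration, not the filter's source: the function name and the case_sensitive switch are hypothetical, the reserved prefix list is shortened, and lowercasing the tx_ form is an assumed normalization.

```python
# Shortened, illustrative subset of the reservedPrefix attribute value.
RESERVED_PREFIXES = ("CUI_", "POL_NEG", "PPOL_NEG", "TG_FAMILY", "CT_UNCERTAIN", "TG_HISTORY")


def adjust(term: str, case_sensitive: bool = False, reserved=RESERVED_PREFIXES) -> str:
    """Prefix a query term unless it already starts with a reserved prefix."""
    if term.startswith(tuple(reserved)):
        # Looks like an NLP token already, so pass it through untouched.
        return term
    if case_sensitive:
        return f"TX_{term}"
    # Assumed normalization for the case-insensitive form.
    return f"tx_{term.lower()}"
```

With this shape, a user-entered word such as Diabetes becomes a text token, while an entity such as CUI_2395 reaches the index lookup unchanged.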
Custom NLP
If you want to provide your own entities or annotations in EMERSE, you must use the Hybrid Tokenizer with hasHeader="true", and provide the custom NLP tokens in the TSV part of that header.
The TSV has four columns. The first two give the character range in the text that the token covers. Offsets count characters (in UTF-16 format, with a surrogate pair counting as two characters) from the beginning of the document, which starts after the boundary line and its trailing newline below the TSV. The third column gives the token. The fourth column gives the type of reconciliation to be done, as shown in this table:
| Flag | Reconciliation Method |
|---|---|
| T | Duplicate the token once per position in the character range (like annotations) |
| R | Generate only one token per row in the TSV, and assign it all positions in the character range (like entities) |
Again, "position" here means a position of a word, counting words from the beginning of the document.
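Because offsets are counted in UTF-16 code units, an NLP process that works with Unicode code points (as Python strings do) needs to convert its offsets before writing the TSV. The helpers below are a small sketch of that conversion; their names are hypothetical.

```python
def utf16_len(s: str) -> int:
    """Number of UTF-16 code units in s (a surrogate pair counts as two)."""
    return len(s.encode("utf-16-le")) // 2


def utf16_range(text: str, start: int, end: int) -> tuple[int, int]:
    """Convert a code point range [start, end) into UTF-16 offsets."""
    return utf16_len(text[:start]), utf16_len(text[:end])
```

For text made only of characters in the Basic Multilingual Plane, the two kinds of offset are identical, so the conversion only matters for characters such as emoji.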
Here is an example custom token TSV, complete with the header:
RU1FUlNFX0g=1|5|XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==
37	42	CUI_310367	R
37	42	SMG_Drug	R
66	90	CUI_4054523	R
66	90	SMG_Finding	R
71	90	CUI_13404	R
71	90	SMG_Finding	R
84	90	CUI_225386	R
84	90	SMG_Finding	R
95	99	CUI_5575035	R
95	99	SMG_Finding	R
103	111	CUI_5201148	R
103	111	SMG_Finding	R
112	119	CUI_15672	R
112	119	SMG_Finding	R
124	128	CUI_18494	R
124	128	SMG_Anatomy	R
124	133	CUI_2170	R
124	133	SMG_Disorder	R
135	138	PPOL_NEG	T
139	145	PPOL_NEG	T
146	152	POL_NEG	T
146	152	CUI_15967	R
146	152	SMG_Finding	R
154	158	POL_NEG	T
154	170	CUI_1971624	R
154	170	SMG_Finding	R
159	161	POL_NEG	T
162	170	POL_NEG	T
162	170	CUI_3618	R
162	170	SMG_Finding	R
172	174	POL_NEG	T
175	180	POL_NEG	T
175	180	CUI_817096	R
175	180	SMG_Anatomy	R
175	185	CUI_8031	R
175	185	SMG_Finding	R
181	185	POL_NEG	T
181	185	CUI_30193	R
181	185	SMG_Finding	R
191	197	PTG_FAMILY	T
198	201	PTG_FAMILY	T
202	211	TG_FAMILY	T
202	211	CUI_677607	R
202	211	SMG_Disorder	R
212	213	TG_FAMILY	T
214	225	TG_FAMILY	T
214	225	CUI_40147	R
214	225	SMG_Disorder	R
237	244	PTG_HISTORY	T
237	244	CUI_262926	R
237	244	SMG_Finding	R
245	247	TG_HISTORY	T
248	252	TG_HISTORY	T
253	254	TG_HISTORY	T
255	263	TG_HISTORY	T
255	263	CUI_11849	R
255	263	SMG_Disorder	R
255	263	CUI_11847	R
255	263	SMG_Disorder	R
264	267	TG_HISTORY	T
268	274	TG_HISTORY	T
268	274	CUI_27092	R
268	274	SMG_Disorder	R
290	294	PPOL_NEG	T
295	298	PPOL_NEG	T
299	303	PPOL_NEG	T
304	305	PPOL_NEG	T
306	313	PPOL_NEG	T
306	313	PTG_HISTORY	T
306	313	CUI_262926	R
306	313	SMG_Finding	R
314	316	PPOL_NEG	T
314	316	TG_HISTORY	T
317	324	POL_NEG	T
317	324	TG_HISTORY	T
317	324	CUI_40132	R
317	324	SMG_Anatomy	R
317	331	CUI_7115	R
317	331	SMG_Disorder	R
325	331	POL_NEG	T
325	331	TG_HISTORY	T
325	331	CUI_6826	R
325	331	SMG_Disorder	R
347	353	CUI_39593	R
347	353	SMG_Finding	R
360	368	PCT_UNCERTAIN	T
360	368	CUI_332149	R
360	368	SMG_Finding	R
369	379	CT_UNCERTAIN	T
369	379	CUI_4551524	R
369	379	SMG_Finding	R
369	388	CUI_4364	R
369	388	SMG_Disorder	R
380	388	CT_UNCERTAIN	T
380	388	CUI_12634	R
380	388	SMG_Disorder	R
XHRcdEVNRVJTRSoqKioqKioqRVNSRU1FXHRcdA==
Mrs. Anderson was seen in our clinic today. She is complaining of mild shortness of breath, as well as moderate fatigue and hair loss. She denies fevers, loss of appetite, or chest pain. Her mother had Hashimoto's thyroiditis. She has a history of type 2 diabetes and myopia. Mrs. Anderson does not have a history of thyroid cancer. She should be tested for a possible autoimmune disorder.
In this file, the single line header serves as the designated location for providing arguments to the tokenizer. Following the header is a dedicated NLP artifact section, where each row contains the start offset, end offset, artifact ID, and artifact type, which are delimited by tabs. A separator demarcates the conclusion of the NLP artifact section. Subsequently, the original document is appended.
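Assembling such a file from the output of your own NLP process could look like the Python sketch below. It is a minimal illustration of the layout just described, not a tool shipped with EMERSE: build_document and its parameters are hypothetical names, and the caller is assumed to supply offsets already counted in UTF-16 code units and a boundary that never occurs as a whole line in the TSV.

```python
SIGNATURE = "RU1FUlNFX0g="  # header signature given in this guide


def build_document(text, rows, boundary="", newlines=1, spaces=5):
    """Return header + custom token TSV + boundary + document text.

    rows: iterable of (start, end, token, flag), where flag is "T" or "R".
    """
    rows = list(rows)
    if not rows:
        # No custom tokens: empty boundary, the document follows the header.
        return f"{SIGNATURE}{newlines}|{spaces}|\n{text}"
    if not boundary or "\n" in boundary or "|" in boundary:
        raise ValueError("a non-empty, single-line boundary without pipes is required")
    for _, _, _, flag in rows:
        if flag not in ("T", "R"):
            raise ValueError(f"unknown reconciliation flag: {flag}")
    tsv = "\n".join(f"{s}\t{e}\t{tok}\t{flag}" for s, e, tok, flag in rows)
    return f"{SIGNATURE}{newlines}|{spaces}|{boundary}\n{tsv}\n{boundary}\n{text}"
```

The result is the string to send as the value of the document text field.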
Here is a possible implementation of the EMERSE custom NLP pipeline:
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="org.emerse.nlp.HybridTokenizerFactory"
tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
hasHeader="true"
usingLuceneTokenizer="true"
/>
<filter class="org.emerse.nlp.PostAnalysisFilterFactory" allowMisalignment="true"/>
</analyzer>
Query Analyzer
To facilitate Solr indexing, a mandatory component is the Analyzer chain (pipeline). Similarly, for Solr queries, an analyzer chain must be defined. Since EMERSE offers flexibility to customize the NLP pipeline, it’s crucial to emphasize that the tokens extracted from a query phrase should precisely match those generated by the analyzer during the Solr indexing process.
Here is the query analyzer that EMERSE recommends:
<analyzer type="query">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delimiters.txt"/>
<tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="nlp/models/sd-med-model.zip" tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"/>
<filter class="org.emerse.nlp.TokenAdjusterFilterFactory" reservedPrefix="CUI_,POL_NEG,PPOL_NEG,APOL_NEG,TG_FAMILY,PTG_FAMILY,ATG_FAMILY,CT_UNCERTAIN,PCT_UNCERTAIN,ACT_UNCERTAIN,TG_HISTORY,PTG_HISTORY,ATG_HISTORY"/>
</analyzer>
org.emerse.nlp.TokenAdjusterFilterFactory is designed to work with the OpenNLPTokenizer created by solr.OpenNLPTokenizerFactory only. Failing to use the specified tokenizer will lead to unexpected results, which may impact the functionality of EMERSE.
Alternatively:
<analyzer type="query">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
sentenceModel="nlp/models/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin"
usingLuceneTokenizer="true"
/>
<filter class="org.emerse.nlp.PostAnalysisFilterFactory" addTokenPrefix="false"/>
</analyzer>
NLP Configuration Files
NLP configuration files, including the mapping file for the CharFilter, the model files for the standard NLP tokenizer, the UMLS dictionary for NER, and the pattern files for detection of negation, uncertainty, etc., are all located under DOCUMENT_INDEX_ROOT/config/nlp.
Delimiter Mapping
Prior to version 7, EMERSE included MappingCharFilter in both the indexing pipeline and the query pipeline. The mapping file, mapping-delimiters.txt, may contain mappings that remove dots in various character sets, for example:
"。" => ""
This type of mapping should be removed in version 7, as detecting a dot is a common method for NLP to identify sentence boundaries.
OpenNLP Tokenizer Model
Model files are in a subdirectory named models. By default, it includes OpenNLP-trained token, sentence, and POS models for English. If the documents are in a different language, models may be available on the OpenNLP website.
Pattern Files for Detection of Negation, Uncertainty and More
Sentiment detection in EMERSE relies on pattern matching. Newer technologies, such as convolutional neural networks (CNNs), were not adopted primarily due to their impact on indexing performance. To efficiently index large volumes of notes within a reasonable timeframe, pattern matching remains a viable solution.
EMERSE can detect negation, uncertainty, non-patient mentions, and patient history. Some detection types involve two pattern files: one for the detection patterns and another for overriding these patterns to address false positives. For example, negations.txt versus negationIgnore.txt, history.txt versus historyIgnore.txt. Here is a snippet of negation patterns:
denied symptoms of *****
***** couldn't be
The text in each line, e.g. couldn’t be, is a trigger phrase that activates negation detection. The five asterisks (*****) are a placeholder for whatever is being negated. The placeholder can be placed either before or after the trigger.
The negation-ignore file may include:
couldn't be excluded
For the given negation pattern ***** couldn’t be, if the sentence is Pneumonia couldn’t be excluded, no negation is detected, because the negation-ignore trigger couldn’t be excluded matches. Be aware that the ignore file contains triggers only.
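The interplay between trigger and ignore phrases can be sketched in Python as follows. This is a simplified stand-in for the pattern matcher, assuming plain case-insensitive substring matching and assuming that an ignore phrase suppresses a trigger whenever it fully covers the trigger occurrence; the function names are hypothetical.

```python
def occurrences(text, phrase):
    """Return (start, end) spans of every occurrence of phrase in text."""
    spans, i = [], text.find(phrase)
    while i != -1:
        spans.append((i, i + len(phrase)))
        i = text.find(phrase, i + 1)
    return spans


def has_negation(sentence, triggers, ignores):
    """True if some trigger occurs outside every ignore phrase occurrence."""
    s = sentence.lower()
    ignored = [span for ig in ignores for span in occurrences(s, ig.lower())]
    for trig in triggers:
        for start, end in occurrences(s, trig.lower()):
            covered = any(a <= start and end <= b for a, b in ignored)
            if not covered:
                return True
    return False
```

With the trigger couldn't be and the ignore phrase couldn't be excluded, the sentence Pneumonia couldn't be excluded is not treated as negated, while Pneumonia couldn't be confirmed is.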
Uncertainty detection uses a single pattern file, probability_resource_patterns.txt. Non-patient mention detection, on the other hand, uses three pattern files: family_history_resource_patterns_A.txt is a regular pattern file, family_history_resource_patterns_B.txt is a template file, and the @@@@@ in each of its lines can be replaced by the familial terms defined in family_history_resource_patterns_C.txt.
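As a rough illustration of how a template file and a term file might combine, the sketch below expands each @@@@@ slot with every familial term. The template line and the terms are invented for the example and are not taken from the shipped files, and `expand_templates` is a hypothetical helper rather than an EMERSE function.

```python
TEMPLATE_SLOT = "@@@@@"


def expand_templates(template_lines, familial_terms):
    """Produce one concrete pattern per (template, term) pair."""
    expanded = []
    for template in template_lines:
        if TEMPLATE_SLOT not in template:
            # Lines without a slot pass through unchanged.
            expanded.append(template)
            continue
        for term in familial_terms:
            expanded.append(template.replace(TEMPLATE_SLOT, term))
    return expanded


# Hypothetical contents standing in for the _B (template) and _C (terms) files.
templates = ["@@@@@ has a history of *****"]
terms = ["mother", "father"]

for pattern in expand_templates(templates, terms):
    print(pattern)
```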
If a mention is identified through a given rule, EMERSE tags the tokens before and/or after the rule's token pattern with the corresponding attribute ID, until the sentence boundary is reached. The direction is defined in the rule itself: a leading ***** extends the tagging towards the beginning of the sentence, and a trailing ***** extends it towards the end of the sentence. Additionally, the tokens matching the rule are tagged with a distinct attribute ID identifying the rule. Consider the sentence:
This is not a teddy.
If a negation pattern is given as not *****, every token after not is tagged as negated (attribute ID POL_NEG), and not itself is tagged as the "trigger" for this negation (attribute ID PPOL_NEG). The tokens This and is are left untouched because ***** comes after not, so only the tokens after the trigger are considered negated.
| This | is | not | a | teddy |
|---|---|---|---|---|
| | | PPOL_NEG | POL_NEG | POL_NEG |
From the above example, it is evident that a set of attributes is required to describe the occurrence of a mention.
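The directional tagging described above can be sketched as follows. This is a simplified model under several assumptions: the sentence is already tokenized and contains no punctuation tokens, only one pattern is applied, and only the first occurrence of the trigger is handled. The function name `tag_sentence` is hypothetical.

```python
PLACEHOLDER = "*****"


def tag_sentence(tokens, pattern, scope_tag="POL_NEG", trigger_tag="PPOL_NEG"):
    """Return one attribute ID (or None) per token of a single sentence."""
    words = pattern.split()
    if words[0] == PLACEHOLDER:
        direction, trigger = "before", words[1:]
    elif words[-1] == PLACEHOLDER:
        direction, trigger = "after", words[:-1]
    else:
        raise ValueError("pattern needs a leading or trailing placeholder")
    if not trigger:
        raise ValueError("pattern has no trigger phrase")

    trigger = [w.lower() for w in trigger]
    lowered = [t.lower() for t in tokens]
    tags = [None] * len(tokens)
    n = len(trigger)

    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == trigger:
            # The trigger tokens get their own attribute ID.
            for j in range(i, i + n):
                tags[j] = trigger_tag
            # The scope runs from the trigger to the sentence boundary.
            if direction == "before":
                scope = range(0, i)
            else:
                scope = range(i + n, len(tokens))
            for j in scope:
                tags[j] = scope_tag
            break
    return tags


print(tag_sentence(["This", "is", "not", "a", "teddy"], "not *****"))
# [None, None, 'PPOL_NEG', 'POL_NEG', 'POL_NEG']
```

The output mirrors the table above: nothing on This and is, the trigger ID on not, and the negation ID on the remaining tokens.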
UMLS Dictionary
The dictionary subdirectory contains the UMLS dictionary tailored by EMERSE using the CTAKES UI. The dictionary file is a SQL file, since that is the default output from CTAKES. EMERSE parses this SQL file and extracts all concepts together with their groups. Entity recognition using the CTAKES token filter requires a script file built with the CTAKES Dictionary Creator. (As of 2024, this program still required Java 8 to run properly.)
The EMERSE team has a default dictionary file to distribute to sites installing EMERSE or upgrading from prior versions that are not NLP-enabled. Sites simply need to provide us with a copy of a valid UMLS license first.
Sites wishing to make their own dictionary can create a custom one and specify the path of the script file for OpenNLPTokenizerFactory using its dictionary attribute.
By default, this dictionary resides at:
SOLR_HOME/DOCUMENT_CORE/conf/nlp/dictionary
UMLS Source Data
At the time of this writing (May 2024), the data used to match concepts in the clinical notes comes from UMLS version 2023AB. A subset of source vocabularies and semantic types is included in the default EMERSE distribution, as specified in the tables below. These were selected because they are the most likely to have clinical relevance and to appear in EHR notes. Each semantic type is mapped to a Type Unique Identifier (TUI).
| Abbreviation | Vocabulary Name | Included in EMERSE? |
|---|---|---|
| AIR | AI/RHEUM | |
| ALT | Alternative Billing Concepts | |
| AOD | Alcohol and Other Drug Thesaurus | YES |
| AOT | Authorized Osteopathic Thesaurus | YES |
| ATC | Anatomical Therapeutic Chemical Classification System | |
| BI | Beth Israel Problem List | |
| CCC | Clinical Care Classification | |
| CCPSS | Clinical Problem Statements | |
| CCS | Clinical Classifications Software | |
| CCSR_ICD10CM | Clinical Classifications Software Refined for ICD-10-CM | |
| CCSR_ICD10PCS | Clinical Classifications Software Refined for ICD-10-PCS | |
| CDCREC | Race & Ethnicity - CDC | |
| CDT | CDT | |
| CHV | Consumer Health Vocabulary | YES |
| COSTAR | COSTAR | |
| CPM | Medical Entities Dictionary | |
| CPT | CPT - Current Procedural Terminology | YES |
| CPTSP | CPT Spanish | |
| CSP | CRISP Thesaurus | |
| CST | COSTART | |
| CVX | Vaccines Administered | |
| DDB | Diseases Database | YES |
| DMDICD10 | ICD-10 German | |
| DMDUMD | UMDNS German | |
| DRUGBANK | DrugBank | YES |
| DSM-5 | Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition | YES |
| DXP | DXplain | |
| FMA | Foundational Model of Anatomy | |
| GO | Gene Ontology | |
| GS | Gold Standard Drug Database | YES |
| HCDT | CDT in HCPCS | |
| HCPCS | HCPCS - Healthcare Common Procedure Coding System | YES |
| HCPT | CPT in HCPCS | |
| HGNC | HUGO Gene Nomenclature Committee | |
| HL7V2.5 | HL7 Version 2.5 | |
| HL7V3.0 | HL7 Version 3.0 | |
| HLREL | ICPC2E ICD10 Relationships | |
| HPO | Human Phenotype Ontology | YES |
| ICD10 | International Classification of Diseases and Related Health Problems, Tenth Revision | YES |
| ICD10AE | ICD-10, American English Equivalents | YES |
| ICD10AM | ICD-10, Australian Modification | |
| ICD10AMAE | ICD-10, Australian Modification, Americanized English Equivalents | |
| ICD10CM | International Classification of Diseases, Tenth Revision, Clinical Modification | YES |
| ICD10DUT | ICD10, Dutch Translation | |
| ICD10PCS | ICD-10 Procedure Coding System | YES |
| ICD9CM | International Classification of Diseases, Ninth Revision, Clinical Modification | |
| ICF | International Classification of Functioning, Disability and Health | YES |
| ICF-CY | International Classification of Functioning, Disability and Health for Children and Youth | YES |
| ICNP | International Classification for Nursing Practice | |
| ICPC | International Classification of Primary Care | |
| ICPC2EDUT | ICPC2E Dutch | |
| ICPC2EENG | International Classification of Primary Care, 2nd Edition, Electronic | |
| ICPC2ICD10DUT | ICPC2-ICD10 Thesaurus, Dutch Translation | |
| ICPC2ICD10ENG | ICPC2-ICD10 Thesaurus | |
| ICPC2P | ICPC-2 PLUS | |
| ICPCBAQ | ICPC Basque | |
| ICPCDAN | ICPC Danish | |
| ICPCDUT | ICPC Dutch | |
| ICPCFIN | ICPC Finnish | |
| ICPCFRE | ICPC French | |
| ICPCGER | ICPC German | |
| ICPCHEB | ICPC Hebrew | |
| ICPCHUN | ICPC Hungarian | |
| ICPCITA | ICPC Italian | |
| ICPCNOR | ICPC Norwegian | |
| ICPCPOR | ICPC Portuguese | |
| ICPCSPA | ICPC Spanish | |
| ICPCSWE | ICPC Swedish | |
| JABL | Congenital Mental Retardation Syndromes | YES |
| KCD5 | Korean Standard Classification of Disease Version 5 | |
| LCH | Library of Congress Subject Headings | |
| LCH_NW | Library of Congress Subject Headings, Northwestern University subset | |
| LNC | LOINC | YES |
| LNC-DE-AT | LOINC Linguistic Variant - German, Austria | |
| LNC-DE-DE | LOINC Linguistic Variant - German, Germany | |
| LNC-EL-GR | LOINC Linguistic Variant - Greek, Greece | |
| LNC-ES-AR | LOINC Linguistic Variant - Spanish, Argentina | |
| LNC-ES-ES | LOINC Linguistic Variant - Spanish, Spain | |
| LNC-ES-MX | LOINC Linguistic Variant - Spanish, Mexico | |
| LNC-ET-EE | LOINC Linguistic Variant - Estonian, Estonia | |
| LNC-FR-BE | LOINC Linguistic Variant - French, Belgium | |
| LNC-FR-CA | LOINC Linguistic Variant - French, Canada | |
| LNC-FR-FR | LOINC Linguistic Variant - French, France | |
| LNC-IT-IT | LOINC Linguistic Variant - Italian, Italy | |
| LNC-KO-KR | LOINC Linguistic Variant - Korea, Korean | |
| LNC-NL-NL | LOINC Linguistic Variant - Dutch, Netherlands | |
| LNC-PL-PL | LOINC Linguistic Variant - Polish, Poland | |
| LNC-PT-BR | LOINC Linguistic Variant - Portuguese, Brazil | |
| LNC-RU-RU | LOINC Linguistic Variant - Russian, Russia | |
| LNC-TR-TR | LOINC Linguistic Variant - Turkish, Turkey | |
| LNC-UK-UA | LOINC Linguistic Variant - Ukrainian, Ukraine | |
| LNC-ZH-CN | LOINC Linguistic Variant - Chinese, China | |
| MCM | Glossary of Clinical Epidemiologic Terms | |
| MDR | MedDRA | YES |
| MDRARA | MedDRA Arabic | |
| MDRBPO | MedDRA Brazilian Portuguese | |
| MDRCZE | MedDRA Czech | |
| MDRDUT | MedDRA Dutch | |
| MDRFRE | MedDRA French | |
| MDRGER | MedDRA German | |
| MDRGRE | MedDRA Greek | |
| MDRHUN | MedDRA Hungarian | |
| MDRITA | MedDRA Italian | |
| MDRJPN | MedDRA Japanese | |
| MDRKOR | MedDRA Korean | |
| MDRLAV | MedDRA Latvian | |
| MDRPOL | MedDRA Polish | |
| MDRPOR | MedDRA Portuguese | |
| MDRRUS | MedDRA Russian | |
| MDRSPA | MedDRA Spanish | |
| MDRSWE | MedDRA Swedish | |
| MED-RT | Medication Reference Terminology | |
| MEDCIN | MEDCIN | |
| MEDLINEPLUS | MedlinePlus Health Topics | YES |
| MEDLINEPLUS_SPA | MedlinePlus Spanish Health Topics | |
| MMSL | Multum | YES |
| MMX | Micromedex | YES |
| MSH | MeSH | YES |
| MSHCZE | MeSH Czech | |
| MSHDUT | MeSH Dutch | |
| MSHFIN | MeSH Finnish | |
| MSHFRE | MeSH French | |
| MSHGER | MeSH German | |
| MSHITA | MeSH Italian | |
| MSHJPN | MeSH Japanese | |
| MSHLAV | MeSH Latvian | |
| MSHNOR | MeSH Norwegian | |
| MSHPOL | MeSH Polish | |
| MSHPOR | MeSH Portuguese | |
| MSHRUS | MeSH Russian | |
| MSHSCR | MeSH Croatian | |
| MSHSPA | MeSH Spanish | |
| MSHSWE | MeSH Swedish | |
| MTH | Metathesaurus Names | |
| MTHCMSFRF | Metathesaurus CMS Formulary Reference File | |
| MTHICD9 | ICD-9-CM Entry Terms | |
| MTHICPC2EAE | ICPC2E American English Equivalents | |
| MTHICPC2ICD10AE | ICPC2E-ICD10 Thesaurus, American English Equivalents | |
| MTHMST | Minimal Standard Terminology (UMLS) | YES |
| MTHMSTFRE | Minimal Standard Terminology French (UMLS) | |
| MTHMSTITA | Minimal Standard Terminology Italian (UMLS) | |
| MTHSPL | FDA Structured Product Labels | |
| MVX | Manufacturers of Vaccines | YES |
| NANDA-I | NANDA-I Taxonomy | |
| NCBI | NCBI Taxonomy | |
| NCI | NCI Thesaurus | YES |
| NCISEER | NCI SEER ICD Mappings | |
| NDDF | FDB MedKnowledge | |
| NEU | Neuronames Brain Hierarchy | |
| NIC | Nursing Interventions Classification | |
| NOC | Nursing Outcomes Classification | |
| NUCCHCPT | National Uniform Claim Committee - Health Care Provider Taxonomy | |
| OMIM | Online Mendelian Inheritance in Man | |
| OMS | Omaha System | |
| ORPHANET | ORPHANET | YES |
| PCDS | Patient Care Data Set | |
| PDQ | Physician Data Query | |
| PNDS | Perioperative Nursing Data Set | YES |
| PPAC | Pharmacy Practice Activity Classification | |
| PSY | Psychological Index Terms | YES |
| QMR | Quick Medical Reference | |
| RAM | Clinical Concepts by R A Miller | |
| RCD | Read Codes | |
| RCDAE | Read Codes Am Engl | |
| RCDSA | Read Codes Am Synth | |
| RCDSY | Read Codes Synth | |
| RXNORM | RXNORM | YES |
| SCTSPA | SNOMED CT Spanish Edition | |
| SNM | SNOMED 1982 | |
| SNMI | SNOMED Intl 1998 | |
| SNOMEDCT_US | SNOMED CT, US Edition | YES |
| SNOMEDCT_VET | SNOMED CT, Veterinary Extension | |
| SOP | Source of Payment Typology | |
| SPN | Standard Product Nomenclature | |
| SRC | Source Terminology Names (UMLS) | |
| TKMT | Traditional Korean Medical Terms | |
| ULT | UltraSTAR | |
| UMD | UMDNS | |
| USP | USP Compendial Nomenclature | |
| USPMG | USP Model Guidelines | |
| UWDA | Digital Anatomist | |
| VANDF | National Drug File | YES |
| WHO | WHOART | |
| WHOFRE | WHOART French | |
| WHOGER | WHOART German | |
| WHOPOR | WHOART Portuguese | |
| WHOSPA | WHOART Spanish | |

| Type Unique Identifier (TUI) | Semantic Type Name | Included in EMERSE? |
|---|---|---|
| T001 | Organism | |
| T002 | Plant | |
| T004 | Fungus | |
| T005 | Virus | YES |
| T007 | Bacterium | |
| T008 | Animal | |
| T010 | Vertebrate | |
| T011 | Amphibian | |
| T012 | Bird | |
| T013 | Fish | |
| T014 | Reptile | |
| T015 | Mammal | |
| T016 | Human | YES |
| T017 | Anatomical Structure | YES |
| T018 | Embryonic Structure | |
| T019 | Congenital Abnormality | YES |
| T020 | Acquired Abnormality | YES |
| T021 | Fully Formed Anatomical Structure | YES |
| T022 | Body System | YES |
| T023 | Body Part, Organ, or Organ Component | YES |
| T024 | Tissue | YES |
| T025 | Cell | YES |
| T026 | Cell Component | YES |
| T028 | Gene or Genome | |
| T029 | Body Location or Region | YES |
| T030 | Body Space or Junction | YES |
| T031 | Body Substance | YES |
| T032 | Organism Attribute | |
| T033 | Finding | YES |
| T034 | Laboratory or Test Result | YES |
| T037 | Injury or Poisoning | YES |
| T038 | Biologic Function | |
| T039 | Physiologic Function | YES |
| T040 | Organism Function | YES |
| T041 | Mental Process | YES |
| T042 | Organ or Tissue Function | YES |
| T043 | Cell Function | YES |
| T044 | Molecular Function | YES |
| T045 | Genetic Function | YES |
| T046 | Pathologic Function | YES |
| T047 | Disease or Syndrome | YES |
| T048 | Mental or Behavioral Dysfunction | YES |
| T049 | Cell or Molecular Dysfunction | YES |
| T050 | Experimental Model of Disease | YES |
| T051 | Event | |
| T052 | Activity | YES |
| T053 | Behavior | YES |
| T054 | Social Behavior | YES |
| T055 | Individual Behavior | YES |
| T056 | Daily or Recreational Activity | YES |
| T057 | Occupational Activity | YES |
| T058 | Health Care Activity | YES |
| T059 | Laboratory Procedure | YES |
| T060 | Diagnostic Procedure | YES |
| T061 | Therapeutic or Preventive Procedure | YES |
| T062 | Research Activity | |
| T063 | Molecular Biology Research Technique | |
| T064 | Governmental or Regulatory Activity | |
| T065 | Educational Activity | |
| T066 | Machine Activity | |
| T067 | Phenomenon or Process | |
| T068 | Human-caused Phenomenon or Process | |
| T069 | Environmental Effect of Humans | |
| T070 | Natural Phenomenon or Process | |
| T071 | Entity | |
| T072 | Physical Object | |
| T073 | Manufactured Object | |
| T074 | Medical Device | YES |
| T075 | Research Device | YES |
| T077 | Conceptual Entity | |
| T078 | Idea or Concept | |
| T079 | Temporal Concept | |
| T080 | Qualitative Concept | |
| T081 | Quantitative Concept | |
| T082 | Spatial Concept | |
| T083 | Geographic Area | |
| T085 | Molecular Sequence | |
| T086 | Nucleotide Sequence | |
| T087 | Amino Acid Sequence | |
| T088 | Carbohydrate Sequence | |
| T089 | Regulation or Law | |
| T090 | Occupation or Discipline | YES |
| T091 | Biomedical Occupation or Discipline | YES |
| T092 | Organization | |
| T093 | Health Care Related Organization | |
| T094 | Professional Society | |
| T095 | Self-help or Relief Organization | |
| T096 | Group | |
| T097 | Professional or Occupational Group | YES |
| T098 | Population Group | YES |
| T099 | Family Group | |
| T100 | Age Group | |
| T101 | Patient or Disabled Group | YES |
| T102 | Group Attribute | |
| T103 | Chemical | |
| T104 | Chemical Viewed Structurally | |
| T109 | Organic Chemical | YES |
| T114 | Nucleic Acid, Nucleoside, or Nucleotide | YES |
| T116 | Amino Acid, Peptide, or Protein | YES |
| T120 | Chemical Viewed Functionally | |
| T121 | Pharmacologic Substance | YES |
| T122 | Biomedical or Dental Material | YES |
| T123 | Biologically Active Substance | YES |
| T125 | Hormone | YES |
| T126 | Enzyme | YES |
| T127 | Vitamin | YES |
| T129 | Immunologic Factor | YES |
| T130 | Indicator, Reagent, or Diagnostic Aid | YES |
| T131 | Hazardous or Poisonous Substance | YES |
| T167 | Substance | YES |
| T168 | Food | YES |
| T169 | Functional Concept | YES |
| T170 | Intellectual Product | |
| T171 | Language | |
| T184 | Sign or Symptom | YES |
| T185 | Classification | YES |
| T190 | Anatomical Abnormality | YES |
| T191 | Neoplastic Process | YES |
| T192 | Receptor | |
| T194 | Archaeon | |
| T195 | Antibiotic | YES |
| T196 | Element, Ion, or Isotope | YES |
| T197 | Inorganic Chemical | YES |
| T200 | Clinical Drug | YES |
| T201 | Clinical Attribute | |
| T203 | Drug Delivery Device | YES |
| T204 | Eukaryote | |

Document Fields Example
<schema name="example" version="1.5">
<!-- You can use different field names for all fields.
Just be sure they match what the EMERSE Admin app expects the fields to be named. -->
<uniqueKey>ID</uniqueKey>
<!-- unique across sources -->
<field name="ID" type="string" required="true"/>
<!-- unique within a source, optional.
For use by a user if they want to cross-reference
a document in EMERSE with another system
without having to demangle the cross-source unique ID above -->
<field name="RPT_ID" type="string"/>
<!-- Required fields -->
<field name="MRN" type="string" required="true"/>
<field name="ENCOUNTER_DATE" type="date" required="true"/>
<field name="DOC_TYPE" type="string" required="true"/>
<field name="SOURCE" type="string" required="true"/>
<field name="RPT_TEXT" type="text_with_header_post_analysis" required="true"/>
<!-- Required fields which are calculated by
the ETL system based on the encounter date and the birthdate of the patient. -->
<field name="AGE_DAYS" type="int" required="true"/>
<field name="AGE_MONTHS" type="int" required="true"/>
<!-- The encounter ID, if available -->
<field name="CSN" type="string"/>
<!-- Other fields describing metadata -->
<field name="ADMIT_DATE" type="date"/>
<field name="RPT_DATE" type="date"/>
<field name="DEPT" type="string" default="unknown"/>
<field name="SVC" type="string" default="unknown"/>
<field name="CLINICIAN" type="string" default="unknown"/>
<field name="CATEGORY" type="string" multiValued="true"/>
<!-- The NLP_HEADER is optional but cannot be renamed -->
<field name="NLP_HEADER" type="string" indexed="false"/>
<!-- Metadata fields related to indexing itself, provided by the ETL system -->
<field name="INDEX_DATE" type="date"/>
<field name="LAST_UPDATED" type="date"/>
<!-- _version_ must be a field since it's used by Solr internally.
Solr fills in the values automatically. -->
<field name="_version_" type="long"/>
<!-- Field Types -->
<fieldType name="date" class="solr.DatePointField"
sortMissingLast="true" uninvertible="false" docValues="true"/>
<fieldType name="long" class="solr.LongPointField"
sortMissingLast="true" uninvertible="false" docValues="true"/>
<fieldType name="int" class="solr.IntPointField"
sortMissingLast="true" uninvertible="false" docValues="true"/>
<fieldType name="string" class="solr.StrField"
sortMissingLast="true" uninvertible="false" docValues="true"/>
<fieldType name="boolean" class="solr.BoolField"
sortMissingLast="true" uninvertible="false" docValues="true"/>
....
</schema>
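The schema comments note that AGE_DAYS and AGE_MONTHS are calculated by the ETL system from the encounter date and the patient's birthdate. A minimal Python sketch of one way an ETL job might derive them is shown below. The exact definition EMERSE expects is not stated here, so counting AGE_MONTHS as completed calendar months is an assumption, and `age_fields` is a hypothetical helper.

```python
from datetime import date


def age_fields(birth_date, encounter_date):
    """Compute the two age fields for one document.

    Assumption: AGE_MONTHS counts whole calendar months completed
    between the birthdate and the encounter date.
    """
    if encounter_date < birth_date:
        raise ValueError("encounter date precedes birthdate")

    age_days = (encounter_date - birth_date).days

    months = ((encounter_date.year - birth_date.year) * 12
              + (encounter_date.month - birth_date.month))
    # The current month is not complete until the birth day-of-month is reached.
    if encounter_date.day < birth_date.day:
        months -= 1

    return {"AGE_DAYS": age_days, "AGE_MONTHS": months}


print(age_fields(date(2000, 1, 15), date(2000, 3, 14)))
# {'AGE_DAYS': 59, 'AGE_MONTHS': 1}
```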
EMERSE Database
EMERSE stores NLP artifacts in two database tables. By default, only EMERSE built-in artifacts are populated. To include custom NLP artifacts within the EMERSE application, it is necessary to populate them into these tables accordingly.
Table SEMANTIC_GROUP
This table stores NLP semantic groups, which are discussed in Token-Aligned Layered Index Structure.
| Column Name | Type | Description |
|---|---|---|
| GROUP_ID | VARCHAR(25) | Semantic Group ID; this is the name stored in the Solr index to represent an entity. |
| LABEL | VARCHAR(50) | Group Name, a user-friendly name for the semantic group that will be displayed on the UI. |
| BORDER_CSS | VARCHAR(50) | CSS value for "border-bottom", e.g. "0.3em dotted red". When a concept in the semantic group is highlighted in the document, specific elements from this CSS property will be employed for rendering. |
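To make the column constraints concrete, here is a small sketch that registers a custom semantic group. It uses an in-memory SQLite database purely as a stand-in for the real EMERSE database, and the group ID SMG_TOY, its label, and its CSS value are hypothetical. The helper `register_semantic_group` is not part of EMERSE.

```python
import sqlite3

# Maximum lengths taken from the column types in the table above.
MAX_LENGTHS = {"GROUP_ID": 25, "LABEL": 50, "BORDER_CSS": 50}


def register_semantic_group(conn, group_id, label, border_css):
    """Insert one row into SEMANTIC_GROUP after checking the length limits."""
    row = {"GROUP_ID": group_id, "LABEL": label, "BORDER_CSS": border_css}
    for column, value in row.items():
        if len(value) > MAX_LENGTHS[column]:
            raise ValueError(f"{column} exceeds {MAX_LENGTHS[column]} characters")
    conn.execute(
        "INSERT INTO SEMANTIC_GROUP (GROUP_ID, LABEL, BORDER_CSS) VALUES (?, ?, ?)",
        (group_id, label, border_css),
    )


# Stand-in database; a real installation would connect to the EMERSE schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SEMANTIC_GROUP ("
    "GROUP_ID VARCHAR(25), LABEL VARCHAR(50), BORDER_CSS VARCHAR(50))"
)

register_semantic_group(conn, "SMG_TOY", "Toy", "0.3em dotted red")
rows = conn.execute(
    "SELECT GROUP_ID, LABEL, BORDER_CSS FROM SEMANTIC_GROUP"
).fetchall()
print(rows)
```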
Table NLP_TAG
This table maintains NLP attribute groups, i.e. the sets of attributes introduced in Pattern Files for Detection of Negation, Uncertainty and More.
| Column Name | Type | Description |
|---|---|---|
| GROUP_ID | VARCHAR(25) | Unique ID for a set of attributes. This ID is used to reference a set of attribute IDs. |
| QUERY_TAG | VARCHAR(25) | The attribute ID in the group that represents the mention. |
| HIGHLIGHT_TAGS | VARCHAR(100) | All attribute IDs in the group that can be highlighted in the document. |
| TAG_INCLUDED_PROMPT | VARCHAR(300) | Text describing what it signifies when this attribute is included in the query, i.e. when +[ATTRIBUTE_ID] is selected. |
| TAG_EXCLUDED_PROMPT | VARCHAR(300) | Text describing what it signifies when this attribute is excluded in the query, i.e. when -[ATTRIBUTE_ID] is selected. |
| TAG_NEUTRAL_PROMPT | VARCHAR(300) | Text describing what it signifies when this attribute is not specified in the query. |
| LABEL | VARCHAR(50) | The display name for this attribute on the UI. |
| BORDER_CSS | VARCHAR(50) | CSS value for "border-bottom", e.g. "0.15em solid #EA3223". When an attribute group is highlighted in the document, specific elements from this CSS property will be employed for rendering. |
| DISPLAY_ORDER | NUMERIC | The position of this attribute group relative to others when displayed on the UI. |
| SUBWAY_ICON | VARCHAR(1) | A single letter that represents this attribute group. This letter appears in various locations on the UI to indicate the selection status of this group. |
| SUBWAY_INCLUDED_PROMPT | VARCHAR(100) | The tooltip text that appears when the cursor hovers over the icon while this attribute group is included in the query. |
| SUBWAY_EXCLUDED_PROMPT | VARCHAR(100) | The tooltip text that appears when the cursor hovers over the icon while this attribute group is excluded in the query. |
| SUBWAY_NEUTRAL_PROMPT | VARCHAR(100) | The tooltip text that appears when the cursor hovers over the icon while this attribute group is not in the query. |
| HIDDEN_ON_UI | BOOLEAN | Indicates whether this attribute group should be hidden on the UI. |
EMERSE configuration
In order for Solr to properly treat UMLS concepts and semantic groups as query terms, the following configuration line should be added to the emerse.properties file:
nlp.artifcat.prefix=CUI_ SMG_
This tells Solr that any term starting with CUI_ or SMG_ should not be subject to tokenization. If a custom pipeline is used, replace CUI_ and SMG_ with your own concept and group prefixes.
EMERSE UI Consideration
EMERSE queries a Solr field named nlp_display within the index. When this field is populated, its contents are shown in the NLP Detail section. This is useful when a user's NLP tool generates supplementary content, such as summaries or abstracts, based on the original document.
Additionally, users have the option to establish distinct Solr fields dedicated to storing NLP outputs. This enables the configuration of these fields as query filters, facilitating more refined search capabilities.
Query Syntax
A full introduction to the EMERSE query syntax is outside the scope of this document. However, it is worth describing how NLP artifacts are integrated into a query.
For NLP attributes, such as negation, the + and - operators are used, followed by square brackets [] containing the attribute's ID. These notations indicate whether or not the attribute should apply to the query phrase. For example, to query smoke history in the negated form, the query might be written as:
+[POL_NEG]"smoke history"
given POL_NEG as the attribute ID for negation.
Querying alignment with a phrase for multiple attributes is also permissible. For example, in order to query smoke history in the non-negated form for the patient only, the query might be written as:
-[POL_NEG]-[TG_FAMILY]"smoke history"
given POL_NEG as the attribute ID for negation and TG_FAMILY as the attribute ID for non-patient mention.
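The two queries above follow a regular shape, which the sketch below builds from a phrase and lists of attribute IDs. It is a convenience illustration only: `attribute_query` is a hypothetical helper, and the relative order of + and - prefixes when both are combined is an assumption, since only the single-operator forms are shown above.

```python
def attribute_query(phrase, include=(), exclude=()):
    """Build a phrase query prefixed with +[ID] and -[ID] attribute markers."""
    if '"' in phrase:
        raise ValueError("phrase must not contain double quotes")
    # Assumption: included attributes are written before excluded ones.
    prefix = "".join(f"+[{attr}]" for attr in include)
    prefix += "".join(f"-[{attr}]" for attr in exclude)
    return f'{prefix}"{phrase}"'


# Negated mentions of the phrase.
print(attribute_query("smoke history", include=["POL_NEG"]))
# +[POL_NEG]"smoke history"

# Non-negated mentions that refer to the patient.
print(attribute_query("smoke history", exclude=["POL_NEG", "TG_FAMILY"]))
# -[POL_NEG]-[TG_FAMILY]"smoke history"
```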
NLP entities, on the other hand, are treated as regular phrases in a query. They can be combined with regular text phrases. For example:
"CUI_TEDDY bear"
This query looks for any token or phrase that is mapped to the entity with ID CUI_TEDDY and is followed by the token bear. If CUI_TEDDY is mapped to "Teddy", "A teddy", and "Teddies", the above query is equivalent to:
"Teddy bear" OR "A teddy bear" OR "Teddies bear"

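The equivalence between an entity ID and its surface forms can be illustrated with a short expansion routine. EMERSE resolves entities through the index rather than by rewriting the query, so this is only a way to visualize what the entity query matches. The mapping passed in is the hypothetical CUI_TEDDY example from the text, and `expand_entity_phrase` is not an EMERSE function.

```python
from itertools import product


def expand_entity_phrase(query_phrase, entity_surface_forms):
    """Rewrite a phrase containing entity IDs as an OR of plain-text phrases."""
    options = []
    for token in query_phrase.split():
        # An entity ID expands to all of its surface forms; other tokens stay.
        options.append(entity_surface_forms.get(token, [token]))
    phrases = [" ".join(choice) for choice in product(*options)]
    return " OR ".join(f'"{phrase}"' for phrase in phrases)


mapping = {"CUI_TEDDY": ["Teddy", "A teddy", "Teddies"]}
print(expand_entity_phrase("CUI_TEDDY bear", mapping))
# "Teddy bear" OR "A teddy bear" OR "Teddies bear"
```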