EMERSE Additional Operational Considerations Guide

Overview

This document contains additional considerations that one might want to address regarding the ongoing operation of EMERSE. A few considerations regarding deployment are covered as well.

Determining how much data to load

One important consideration for EMERSE is how much data should be loaded into the system, or how far back in time the data should be loaded. This will depend entirely on the institutional requirements for its use, but a few things should be considered. First, EMERSE can organize data by source systems, so if you have older, legacy data, then those data can be loaded as its own source. Second, because it is a search engine, adding more data is easy because EMERSE should have no problem searching through the extra data—in fact, that is what a search engine is meant to do. Because it is hard to predict future data needs, and because of the small incremental effort for loading additional data, unless there are reasons not to load all of the available data, it would be a good idea to make it all available for the users.

Determining how much data for an initial pilot

Several sites have asked about how much data should be loaded for an initial pilot of EMERSE. For a variety of reasons, sites may not want to load all of the data at first, especially when including just a small subset of users to try out the system. In such a situation it may be important to first determine who those users might be and what their data needs are. For example, should all data for a small subset of patients be loaded into the system? Or should 1 or 2 years of data for all patients be loaded into the system? One important consideration if choosing the latter option is what the implications might be if older data are not loaded for a patient. In such a case, it might be a problem if a user searched a patient, found nothing, and then assumed the problem or condition didn’t exist for the patient because the older data mentioning the condition were not loaded. Whatever decision is made, it should be made clear to the users what the implications are based on the type, volume, and time frame of the data that were loaded (versus what was intentionally left out).

Signed/completed documents

An important decision to make regarding adding documents for indexing relates to whether or not incomplete, or unsigned, documents should be included. Generally, once a document is signed it can no longer be changed, although it can be addended. Unsigned, or incomplete, documents can be provisional and may require further editing or review by the clinician to make corrections or to include additional information. Including unsigned/incomplete documents means that they might be available to EMERSE users faster than signed/completed documents, but the trade-off is that the users may be looking at provisional notes that could contain inaccuracies that would later be corrected before the clinician finalized and signed the document. While there are no specific guidelines about the best approach, these trade-off should be considered carefully when making the decision on what notes to include or exclude. At Michigan Medicine, only signed documents are included within EMERSE.

Index update frequency

Users often want to know how quickly a document will be available in EMERSE once it has been created within the EHR. This will depend partly on your process for moving documents from the EHR to EMERSE, or from the EHR to an intermediate data repository, and then from the repository to EMERSE. At Michigan Medicine we update the index once each day, around 3 AM. This means that a document that arrives in our document repository shortly after the indexing process is complete might sit unindexed for almost a day before it was available within EMERSE. For some users this delay might not matter, but it will be important to make sure users are aware of possible delays in document availability within EMERSE, and to make sure that such delays are acceptable for the intended uses. There is nothing to prevent more frequent, incremental indexing from occurring, but it will be important to make sure that the hardware available can handle indexing updates and concurrent use during the day.

EMERSE currently relies on both Lucene and Solr. In general, Lucene has to close and then re-open indexes for newly indexed documents to become available for searching. Solr, depending on the configuration is more flexible in this regard. As EMERSE moves more towards Solr and becomes less reliant on Lucene, the issue of update frequency may become less relevant.

There is no separate command to index a document once its been sent to Solr. There is a concept of "commit" such that updates may not happen immediately unless a commit is added to the update. Generally Solr decides when either enough data or updates have been sent that it is worth a commit.

Merges/Splits

Patient/Document merges and splits occur routinely within medical record systems. Both involve re-assigning documents to a new (corrected) medical record number. A merge occurs when a patient has been assigned more than one medical record number, and thus the documents may be listed under more than one medical record number. To remedy this, the medical record numbers are merged into a single MRN and the documents then get reassigned to the single MRN. A split occurs when it turns out that a set of documuments for a single patient have been incorrectly assigned and actually belong to more than one patient. In this case the documents have to be re-assigned to the correct patient MRN, which can potentially be to more than one person.

In the world of Health information management, these concepts are sometimes referred to as Duplicates, Overlaps, and Overlays, especially as they related to a Master Patient Index (MPI).

Duplicate: One patient can have more than one MRN at the same facility.

Overlap: One patient can have more than one MRN between facilities that are part of the same overall organization.

Overlay: More than one patient shares the same MRN.

Source: AHIMA MPI Task Force. “Building an Enterprise Master Person Index.” Journal of AHIMA 75, no. 1 (Jan. 2004): 56A–D.

The implication of this for EMERSE is that these splits/merges have to be tracked in some manner to ensure that any changes to documents within the EHR get reflected back into EMERSE. For example, if a document repository is maintained for providing documents to EMERSE, that repository will need to keep track of these changes and subsequently submit these changes to Solr for re-indexing. This might include deleting existing documents within the Solr repository, or submitting the same document but with a diffrent MRN to replace its metadata.

This also has implications for sites that want to include older documents from a legacy medical record system, since those documents should also be subjects to splits/merges if the patients are still active within the newer electronic health record system and documents or MRNs are incorrectly assigned with respect to data in the legacy EHR.

Figure 1. Example of a patient merge

Figure 2. Example of a patient split

Images in documents

EMERSE does not currently support displaying images within documents. This should be taken into consideration when indexing documents from sources in which images are routinely embedded within a a document. Heavily impacted specialties may include dermatology (e.g., photos of skin lesions), or ophthalmology (e.g., drawings and retinal photos).

Although we have not tried this, it should theoretically be impossible to include such images, but this would likely require storing the images in a known location and then embedding a hyperlink to those images within the document being indexed. As far as we know, Solr itself would not store the images within its own data atore.

Non-text documents

Many sites have documents from outside sources, or potentially from internal sources, that are not in either a plain text (.txt) or standard web-based (.html) format which are the main two that Solr can natively handle. In such cases there are ways to extract the text and import it into EMERSE. Examples of these files types include PDF or Microsoft Word (.docx) documents. For these cases, additional components may be required to extract the text. Apache Tika is a tool that can help with this task.

https://tika.apache.org

Documents that are scanned images present additional challenges, and this is addressed in the section below on Optical Character Recognition.

Optical Character Recognition (OCR)

Documents that are images present a challenge for incorporating the text into EMERSE. A PDF file, for example, might contain actual text within the document structure, or it may contain an image of a document that has text, but that text cannot be not readily extracted. These images could arrive at a clinic through a variety of means (e.g., hand-delivered by patients, faxed, etc.), but if the long-term storage strategy for storing the documents is scanning and archiving the images, then the text will not be readily extractable. For example, our strategy at Michigan Medicine is to scan a paper document and then store the scanned image within an institutional document repository using OnBase. The paper version is then destroyed.

Optical Character Recognition (OCR) provides a means for essentially conducting image analysis on a document, and recognizing and extracting the text to make it capable of being indexed in a system such as EMERSE. While EMERSE itself does not have OCR capabilities built-in, OCR could be used as part of a document-processing pipeline that identifies and extracts texts prior to submitting the documents to EMERSE for indexing. We have conducted various proof-of-concepts to show that this is possible. Some details about this can be found in our Virtual Machine Guide.

We have had experience using both commercial (e.g., Nuance) and open source (e.g., Tesseract) OCR options.

https://github.com/tesseract-ocr
https://github.com/tesseract-ocr/tesseract
https://github.com/tesseract-ocr/tesseract/wiki

Most OCR software will recognize printed text, but will not recognize hand-written text. Nevertheless, we have found this can still be useful in cases where the main item of interest may be hand-written, but a search for related text in a document header can still rapidly bring the user to the right page among hundreds or thousands of potential documents.

Our experience with Tesseract has shown that it generally works well, but it performance depends heavily on the quality of the original document image, including how fuzzy the images its, its horizontal orientation, or other extraneous marks (such as handwriting) covering the text. A few examples of images and corresponding extracted text are provided below.

Figure 3. Example of reasonably high quality text

Most of the text is extracted correctly, but there is no space between "If" and "you" (i.e., "Ifyou"):

Voluntary Participation/Refusal to Sign:
Participating in this research project is completely voluntary. Ifyou decide not to participate, or ifyou decide to withdraw (end your participation), you will not suffer any penalty or loss of benefits to which you otherwise would be entitled. Your treatment at the University of Michigan will not be affected by your willingness or refusal to participate.

Duration of Study and Expiration Date or Event:
Unless you revoke your authorization, your permission for us to use your tissue and related information will not expire. This study is expected to continue indefinitely.

Figure 4. Example of reasonably high quality text with a number circled

The number 1 is circled, but the OCR software did not recognize that components as a number.

2.Frequency
Over the past month, how often have you had to urinate again less than two hours after you finished urinating?

0 (L 2 3 4 5

Figure 5. Example of a scanned radiology report

This text was extracted reasonably well. (Some components have been manually de-identified.). Some spaces are missing between words, which would affect the ability to find them due to the tokenization of terms within Solr.

CXR2, CHEST 2 VIEWS created X/XX/XXXX 10:33 during visit on X/XX/XXXX for (NoReason Given) as EMERC
EMERGENCY ROUTINE DISCHARGE
Final Report RadPlus as of X/XX/XXXX 15:26   RequestedbyXXXXX,XXXXX
<<Prev(CTThoraxw/ContrastforPE)

FINAL REPORT

CHEST TWO VIEWS:
REASON FOR EXAMINATION: Chest pain.
Two views were made. There are degenerative changes in the dorsal spine. The lung fields are clear. The heart size is normal.
IMPRESSION: Two view chest is considered negative.
\s Dr. XXXXX XXXXX

Figure 6. Example of a low quality image

This image shows text at an angle, with some hand-written text to the right and other writing over some of the printed text. Much of the text was not identified correctly, including spacing between words.

Upon examination today, findings are as follows:

L-Backgrounddiabeticretinopathy eogBq~tip t&-fmindaa
I requested thalthepysetum for fallow-up In: amos/ yeast sr

C] Iam referringthispatientto:

Integration within an EHR

At Michigan Medicine, EMERSE users can access the system via two primary routes. The first is logging directly into EMERSE, once an account has been created for the user. A second option has also been made available to users in order to streamline the process for conducting a search after reviewing a patient within the electronic health record (EHR), which is Epic at Michigan Medicine. In this latter case, a user must first open a patient tab (workspace) within Epic, which will reveal a set of options from a button in the lower-left corner of the workspace. At Michigan Medicine this button is named "More" and provides access to many other options within Epic itself but also links to outside systems such as EMERSE. We have a submenu set up called "Other Clinical Systems", and one of these systems is EMERSE.

Figure 7. Mockup screenshot showing the EMERSE menu option from within the EHR

When a user chooses EMERSE from this Epic Menu, a browser is automatically opened and the Epic user will automatically be logged into EMERSE and they will land at the Attestation page, which is where the user specifies the intended use of EMERSE for that session. Further, we have provided default text in the EMERSE Attestation text box for this specific scenario which essentially says "search from Epic". The user could change this Attestation reason, or just submit the default answer. Once the Attestation is completed, the user will move to the main work area of EMERSE, and the patient that was opened in Epic will also be loaded within EMERSE. This means that a search can be conducted on that patient right away. In other words, from the user’s perspective the steps from the EHR would be:

Open a patient in the EHR
Choose More → Other Clinical Systems → EMERSE
Complete the EMERSE attestation
Enter search terms for the pre-loaded patient

If a user does not have an EMERSE account and uses this option, an account is automatically created within EMERSE for that user. The logic behind this decision was that if the user already had access to the patient information within the EHR, then they should also have access to the same information with EMERSE.

Security details about how the data and user’s credentials are passed from Epic to EMERSE will vary for each institution. However, the documentation on the Epic site regarding "Integrating External Web Applications into Epic" is an ideal place to start learning about how to do such integrations. There may be other ways to provide such integration, but we describe our current setup to illustrate how this could be done.

Error logging

EMERSE system errors are logged, which may be helpful in tracking down issues. These EMERSE system errors can be found in the tomcat_install_dir/logs directory. They will be in a file called catalina.out as well as a file called emerse.log. Logging is controlled by a log4j2.xml file which can be found inside of the expanded emerse.war file.