This guide contains additional considerations that one might want to address regarding the ongoing operation of EMERSE. A few considerations regarding deployment are covered as well.

Determining how much data to load

One important consideration for EMERSE is how much data should be loaded into the system, or how far back in time the data should be loaded. This will depend entirely on the institutional requirements for its use, but a few things should be considered. First, EMERSE can organize data by source systems, so if you have older, legacy data, then those data can be loaded as its own source. Second, because it is a search engine, adding more data is easy because EMERSE should have no problem searching through the extra data—​in fact, that is what a search engine is meant to do. Because it is hard to predict future data needs, and because of the small incremental effort for loading additional data, unless there are reasons not to load all of the available data, it would be a good idea to make it all available for the users.

Determining how much data for an initial pilot

Several sites have asked about how much data should be loaded for an initial pilot of EMERSE. For a variety of reasons, sites may not want to load all of the data at first, especially when including just a small subset of users to try out the system. In such a situation it may be important to first determine who those users might be and what their data needs are. For example, should all data for a small subset of patients be loaded into the system? Or should 1 or 2 years of data for all patients be loaded into the system? One important consideration if choosing the latter option is what the implications might be if older data are not loaded for a patient. In such a case, it might be a problem if a user searched a patient, found nothing, and then assumed the problem or condition didn’t exist for the patient because the older data mentioning the condition were not loaded. Whatever decision is made, it should be made clear to the users what the implications are based on the type, volume, and time frame of the data that were loaded (versus what was intentionally left out).

Including document metadata or structured data

EMERSE is meant to search the free text of documents. With the Advanced Terms feature it may be possible to search additional metadata if the data are loaded properly into the Solr indexes. However, for many users it may be even more valuable to append additional metadata or structured data to the document itself, making those elements searchable even they were not initially created to be a part of the original document. For example, it might be useful to append the current medication list or problem list to the document before it is sent to Solr for indexing. It might also be useful to provide billing codes, the name of the clinician who authored the note, etc. By doing so this additional information becomes a searchable part of the document and thus easy for users to find with the standard EMERSE interface.

For example, in the text below all of the text preceding "Clinical Text:" could have come from structured data and just appended (pre-pended) to the beginning of the note to provide additional searchable data about the clinical encounter. Such choices will have to be institution-specific but additional data will likely be of great value to users.

Clinician: John A Doe
Document Type: Outpatient Progress Note

Current Medication List:
1. acetaminophen 200 mg every six hours
2. atorvastatin 20 mg every day
3. fluoxetine 40 mg daily in the morning

Current Problem List:
1. Hypertension
2. Hyperlipidemia
3. Depression
4. Back pain, chronic

Billing codes for this visit:
1. G89.29  (Other chronic pain)
2. E78.5: (Hyperlipidemia, unspecified)

Clinical Text:

John Doe is a 57 year old male with a history of hypertension, back pain, depression, and high cholesterol.  He came to clinic today to discuss his back pain that has been getting worse over the last two weeks...

Signed/completed documents

An important decision to make regarding adding documents for indexing relates to whether or not incomplete, or unsigned, documents should be included. Generally, once a document is signed it can no longer be changed, although it can be addended. Unsigned, or incomplete, documents can be provisional and may require further editing or review by the clinician to make corrections or to include additional information. Including unsigned/incomplete documents means that they might be available to EMERSE users faster than signed/completed documents, but the trade-off is that the users may be looking at provisional notes that could contain inaccuracies that would later be corrected before the clinician finalized and signed the document. While there are no specific guidelines about the best approach, these trade-off should be considered carefully when making the decision on what notes to include or exclude. At Michigan Medicine, only signed documents are included within EMERSE.

Index update frequency

Users often want to know how quickly a document will be available in EMERSE once it has been created within the EHR. This will depend partly on your process for moving documents from the EHR to EMERSE, or from the EHR to an intermediate data repository, and then from the repository to EMERSE. At Michigan Medicine we update the index once each day, around 3 AM. This means that a document that arrives in our document repository shortly after the indexing process is complete might sit unindexed for almost a day before it was available within EMERSE. For some users this delay might not matter, but it will be important to make sure users are aware of possible delays in document availability within EMERSE, and to make sure that such delays are acceptable for the intended uses. There is nothing to prevent more frequent, incremental indexing from occurring, but it will be important to make sure that the hardware available can handle indexing updates and concurrent use during the day.

EMERSE currently relies on both Lucene and Solr. In general, Lucene has to close and then re-open indexes for newly indexed documents to become available for searching. Solr, depending on the configuration is more flexible in this regard. As EMERSE moves more towards Solr and becomes less reliant on Lucene, the issue of update frequency may become less relevant.

There is no separate command to index a document once its been sent to Solr. There is a concept of "commit" such that updates may not happen immediately unless a commit is added to the update. Generally Solr decides when either enough data or updates have been sent that it is worth a commit.


Patient/Document merges and splits occur routinely within medical record systems. Both involve re-assigning documents to a new (corrected) medical record number. A merge occurs when a patient has been assigned more than one medical record number, and thus the documents may be listed under more than one medical record number. To remedy this, the medical record numbers are merged into a single MRN and the documents then get reassigned to the single MRN. A split occurs when it turns out that a set of documuments for a single patient have been incorrectly assigned and actually belong to more than one patient. In this case the documents have to be re-assigned to the correct patient MRN, which can potentially be to more than one person.

In the world of Health information management, these concepts are sometimes referred to as Duplicates, Overlaps, and Overlays, especially as they related to a Master Patient Index (MPI).

Duplicate: One patient can have more than one MRN at the same facility.

Overlap: One patient can have more than one MRN between facilities that are part of the same overall organization.

Overlay: More than one patient shares the same MRN.

Source: AHIMA MPI Task Force. “Building an Enterprise Master Person Index.” Journal of AHIMA 75, no. 1 (Jan. 2004): 56A–D.

The implication of this for EMERSE is that these splits/merges have to be tracked in some manner to ensure that any changes to documents within the EHR get reflected back into EMERSE. For example, if a document repository is maintained for providing documents to EMERSE, that repository will need to keep track of these changes and subsequently submit these changes to Solr for re-indexing. This might include deleting existing documents within the Solr repository, or submitting the same document but with a diffrent MRN to replace its metadata.

This also has implications for sites that want to include older documents from a legacy medical record system, since those documents should also be subjects to splits/merges if the patients are still active within the newer electronic health record system and documents or MRNs are incorrectly assigned with respect to data in the legacy EHR.

Example of a patient merge
Figure 1. Example of a patient merge


Example of a patient split
Figure 2. Example of a patient split

Dealing with splits/merges from an operational perspective will take some planning. With a large set of Solr documents, re-indexing all the documents regularly will not be practical.

When indexing documents incrementally, depending on the mechanism used to detect changed documents, it may not be possible to detect that there are documents whose patient MRNs have been modified. To overcome this challenge at Michigan Medicine, we have a Pentaho Data Integration (PDI) job that runs regularly to read the history of all MRN merges/combines, and updates any documents in the Solr index that exist on the patient’s prior MRN. Michigan Medicine can share the PDI job that does this, but the main points are:

  1. Finding a source for MRN merge history

  2. Finding documents in Solr for a specific MRN

  3. Using the update API in Solr to update each document found

In Epic’s Clarity datastore, the following SQL can be used to generate a list of MRN’s that have changed:

select mrn_hx.pat_id, mrn_hx as prior_hx, mrn_hx.mrn_hx_change_inst as merge_datetime, mrn_hx.mrn_hx_pat_name as patnamehx, mrnid.identity_id as MRN, oldid.identity_new_id as PRIOR_MRN , pt.pat_first_name, pt.pat_last_name, pt.pat_middle_name
from pat_mrn_hx mrn_hx
inner join patient pt on mrn_hx.pat_id=pt.pat_id
inner join identity_id mrnid on pt.pat_id=mrnid.pat_id and identity_type_id=14
inner join identity_id_hx oldid on mrn_hx.mrn_hx=oldid.old_pat_id and id_type_hx=14
Note that the identity_id type for an MRN may not be 14 at your organization.

Finding documents in Solr for a specific MRN:


Where unified is the name of the main Solr index and med_rec_num should be replaced with an actual medical record number.

To update a specific document, send an HTTP post to this URL:


Post body:

{ "ID”:”document_primary_key”,

The update statement will cause Solr to internally re-index the document with the new MRN value. The original document’s text and other fields will be preserved. In this example document_primary_key should be replaced with the actual document ID retrieved in the prior query, above ("Finding documents in Solr for a specific MRN").

Solr is protected by basic auth, so credentials are not in the URL

Images in documents

EMERSE does not currently support displaying images within documents. This should be taken into consideration when indexing documents from sources in which images are routinely embedded within a a document. Heavily impacted specialties may include dermatology (e.g., photos of skin lesions), or ophthalmology (e.g., drawings and retinal photos).

Although we have not tried this, it should theoretically be impossible to include such images, but this would likely require storing the images in a known location and then embedding a hyperlink to those images within the document being indexed. As far as we know, Solr itself would not store the images within its own data atore.

Non-text documents

Many sites have documents from outside sources, or potentially from internal sources, that are not in either a plain text (.txt) or standard web-based (.html) format which are the main two that Solr can natively handle. In such cases there are ways to extract the text and import it into EMERSE. Examples of these files types include PDF or Microsoft Word (.docx) documents. For these cases, additional components may be required to extract the text. Apache Tika is a tool that can help with this task.


Documents that are scanned images present additional challenges, and this is addressed in the section below on Optical Character Recognition.

Optical Character Recognition (OCR)

Documents that are images present a challenge for incorporating the text into EMERSE. A PDF file, for example, might contain actual text within the document structure, or it may contain an image of a document that has text, but that text cannot be not readily extracted. These images could arrive at a clinic through a variety of means (e.g., hand-delivered by patients, faxed, etc.), but if the long-term storage strategy for storing the documents is scanning and archiving the images, then the text will not be readily extractable. For example, our strategy at Michigan Medicine is to scan a paper document and then store the scanned image within an institutional document repository using OnBase. The paper version is then destroyed.

Optical Character Recognition (OCR) provides a means for essentially conducting image analysis on a document, and recognizing and extracting the text to make it capable of being indexed in a system such as EMERSE. While EMERSE itself does not have OCR capabilities built-in, OCR could be used as part of a document-processing pipeline that identifies and extracts texts prior to submitting the documents to EMERSE for indexing. We have conducted various proof-of-concepts to show that this is possible. Some details about this can be found in our Virtual Machine Guide.

We have had experience using both commercial (e.g., Nuance) and open source (e.g., Tesseract) OCR options.


Most OCR software will recognize printed text, but will not recognize hand-written text. Nevertheless, we have found this can still be useful in cases where the main item of interest may be hand-written, but a search for related text in a document header can still rapidly bring the user to the right page among hundreds or thousands of potential documents.

Our experience with Tesseract has shown that it generally works well, but it performance depends heavily on the quality of the original document image, including how fuzzy the images its, its horizontal orientation, or other extraneous marks (such as handwriting) covering the text. A few examples of images and corresponding extracted text are provided below.


OCR example 1
Figure 3. Example of reasonably high quality text
Most of the text is extracted correctly, but there is no space between "If" and "you" (i.e., "Ifyou"):
Voluntary Participation/Refusal to Sign:
Participating in this research project is completely voluntary. Ifyou decide not to participate, or ifyou decide to withdraw (end your participation), you will not suffer any penalty or loss of benefits to which you otherwise would be entitled. Your treatment at the University of Michigan will not be affected by your willingness or refusal to participate.

Duration of Study and Expiration Date or Event:
Unless you revoke your authorization, your permission for us to use your tissue and related information will not expire. This study is expected to continue indefinitely.


OCR example 2
Figure 4. Example of reasonably high quality text with a number circled
The number 1 is circled, but the OCR software did not recognize that components as a number.
Over the past month, how often have you had to urinate again less than two hours after you finished urinating?

0 (L 2 3 4 5


OCR example 2
Figure 5. Example of a scanned radiology report
This text was extracted reasonably well. (Some components have been manually de-identified.). Some spaces are missing between words, which would affect the ability to find them due to the tokenization of terms within Solr.
CXR2, CHEST 2 VIEWS created X/XX/XXXX 10:33 during visit on X/XX/XXXX for (NoReason Given) as EMERC
Final Report RadPlus as of X/XX/XXXX 15:26   RequestedbyXXXXX,XXXXX


Two views were made. There are degenerative changes in the dorsal spine. The lung fields are clear. The heart size is normal.
IMPRESSION: Two view chest is considered negative.


OCR example 2
Figure 6. Example of a low quality image
This image shows text at an angle, with some hand-written text to the right and other writing over some of the printed text. Much of the text was not identified correctly, including spacing between words.
Upon examination today, findings are as follows:

L-Backgrounddiabeticretinopathy eogBq~tip t&-fmindaa
I requested thalthepysetum for fallow-up In: amos/ yeast sr

C] Iam referringthispatientto:

Integration within an EHR

At Michigan Medicine, EMERSE users can access the system via two primary routes. The first is logging directly into EMERSE, once an account has been created for the user. A second option has also been made available to users in order to streamline the process for conducting a search after reviewing a patient within the electronic health record (EHR), which is Epic at Michigan Medicine. In this latter case, a user must first open a patient tab (workspace) within Epic, which will reveal a set of options from a button in the lower-left corner of the workspace. At Michigan Medicine this button is named "More" and provides access to many other options within Epic itself but also links to outside systems such as EMERSE. We have a submenu set up called "Other Clinical Systems", and one of these systems is EMERSE.

EMERSE menu option from within the EHR
Figure 7. Mockup screenshot showing the EMERSE menu option from within the EHR

When a user chooses EMERSE from this Epic Menu, a browser is automatically opened and the Epic user will automatically be logged into EMERSE and they will land at the Attestation page, which is where the user specifies the intended use of EMERSE for that session. Further, we have provided default text in the EMERSE Attestation text box for this specific scenario which essentially says "search from Epic". The user could change this Attestation reason, or just submit the default answer. Once the Attestation is completed, the user will move to the main work area of EMERSE, and the patient that was opened in Epic will also be loaded within EMERSE. This means that a search can be conducted on that patient right away. In other words, from the user’s perspective the steps from the EHR would be:

  1. Open a patient in the EHR

  2. Choose MoreOther Clinical SystemsEMERSE

  3. Complete the EMERSE attestation

  4. Enter search terms for the pre-loaded patient

If a user does not have an EMERSE account and uses this option, an account is automatically created within EMERSE for that user. The logic behind this decision was that if the user already had access to the patient information within the EHR, then they should also have access to the same information with EMERSE.

Security details about how the data and user’s credentials are passed from Epic to EMERSE will vary for each institution. However, the documentation on the Epic site regarding "Integrating External Web Applications into Epic" is an ideal place to start learning about how to do such integrations. There may be other ways to provide such integration, but we describe our current setup to illustrate how this could be done.

Error logging

EMERSE system errors are logged, which may be helpful in tracking down issues. These EMERSE system errors can be found in the tomcat_install_dir/logs directory. They will be in a file called catalina.out as well as a file called emerse.log. Logging is controlled by a log4j2.xml file which can be found inside of the expanded emerse.war file.

Some of the logs can be accessed through the EMERSE Administrator application or the diagnostics.html page. For more information about these options, please see the Troubleshooting Guide.

Guidelines for providing users access

Each site will have to determine their own local rules and guidelines for who should be permitted to gain access to EMERSE, especially since patient information can be viewed. At Michigan Medicine the following guidelines were implemented:

  • Users who already have acces to the electronic health record can have access to EMERSE through a secure link from the EHR to EMERSE, and this link can only be accessed by logging into the EHR. (See Integration within an EHR, above). Nevertheless, users must always complete the EMERSE Attestation page at login, and if used for research the specific IRB-approved study must be selected at the time of login.

  • A user who has EMERSE access because of EHR access will automatically have their EMERSE account revoked if the EHR access is revoked. This is done through a batch job that checks the EHR for any user who has had their access revoked.

  • If a user does not have access to the EHR, EMERSE access can be granted via an EMERSE administrator after reviewing the user’s IRB application to verify that access is permitted by the IRB, and after other privacy training has been completed and documented.

  • Periodic reviews are conduced of users granted access for research-only purposes to ensure that a user’s account is revoked when access is no longer required.

  • Users who have been granted administrator roles in EMERSE do not have their access revoked via a batch process. This is done manually by other EMERSE administrators.

  • It should be noted that any user who has had their general health system credentials revoked (by removing them from LDAP) will not be able to access EMERSE since LDAP authentication is used in addition to local user table data within EMERSE.