Overview

This guide provides information on integrating external data into the EMERSE application, as well as basics about the indexing process. In order for EMERSE to search and highlight documents, they must be submitted to the Solr instance to which EMERSE is attached. Additionally, patients who have indexed documents need their corresponding demographics loaded into EMERSE’s Patient table in the database. Optionally, research study and user (personnel) information can be loaded into a few EMERSE database tables that will allow EMERSE to check that authenticated users are research personnel for approved studies, which is used for determining research access and populating the audit tables at the time of login.

Data integration and indexing are tightly linked, which is why they are discussed together in this guide. Not covered in depth is Solr itself, as this is already well documented within the Apache Solr project itself. Note that knowledge of Solr will be crucial to indexing for EMERSE.

Integrating Documents

The EMERSE application uses Solr to search, highlight, and display documents. This section provides possible approaches for submitting documents to Solr for indexing, as well as information about conforming your documents to the Solr schema that EMERSE requires.

Document Metadata / Solr Schema

Each Solr index uses an XML configuration file (managed-schema) that describes the fields that exist for the documents that will be indexed. This file informs Solr of the datatypes of the fields (String, Date, etc), but is also used to configure numerous settings related to how documents are analyzed (tokenized). The EMERSE software is tightly coupled with a specific Solr Schema.

Whereas the fields of the documents index are highly customizable, the patient index and its replica should have exactly the schema defined in the example indexes; they should not be changed.

Finally, there are <fieldType>`s defined in the `managed-schema files. These should generally be left alone, but you can read up on their definition in the Solr documentation.

EMERSE Solr schema

EMERSE needs all documents to have five common fields. Each such field has a row in the DOC_FIELD_EMR_INTENT table, which maps the generic field name (called EMR intent) to the actual Solr field name (with one exception).

  1. unique document identifier (EMR intent named RPT_ID)

  2. patient MRN (EMR intent named MRN)

  3. encounter date (EMR intent named CLINICAL_DATE)

  4. document modified date (EMR intent named LAST_UPDATED)

  5. document text tokenized as lower case (EMR intent named RPT_TEXT)

  6. document text tokenized with case (EMR intent named RPT_TEXT_NOCI)

  7. document source (which must be the solr field SOURCE; there is no EMR intent)

Below is an excerpt from the managed-schema of the documents index. Field names can be changed as long as the SOLR_FIELD_NAME column in the DOC_FIELD_EMR_INTENT table matches. Note that additionally, whatever field uniquely identifies a document should also be set as the <uniqueKey> in the schema.

<uniqueKey>ID</uniqueKey>
<field name="MRN" type="string"/>
<field name="ID" type="string"/>
<field name="RPT_DATE" type="date"/>
<field name="SOURCE" type="string" docValues="true"/>
<field name="RPT_TEXT" type="text_general_htmlstrip" termVectors="true" termPositions="true" termOffsets="true" />
<field name="RPT_TEXT_NOIC" type="text_general_htmlstrip_nolowercase" stored="false" termVectors="true" termPositions="true" termOffsets="true" />
<field name="_version_" type="long"/>

We’ll go over the EMR intents here:

  • RPT_ID: This should hold a unique identifier for a document. When indexing from multiple sources, make sure that keys don’t overlap. At Michigan Medicine we prefix the source systems' primary key to ensure uniqueness between systems.

  • MRN: The unique identifier for the patient associated with the document, usually the medical record number.

  • CLINICAL_DATE: A date for the document, usually the creation date or date of service. Used in the UI to sort and limit results.

  • RPT_TEXT: The actual text of the note or document. This will ultimately be lower-cased and indexed for case-insensitive searches (the default).

  • RPT_TEXT_NOIC: This field is automatically copied from RPT_TEXT. It should not be submitted with indexed documents. It is used to create a copy of the document text that is case sensitive. This case-sensitive version is used in a separate index for searching terms in a case-sensitive manner. (NOIC = NO Ignore Case).

Finally, there are some Solr fields that must have exactly these names:

  • SOURCE (the solr field, not EMR intent): The name of the source, which should be the same as the SOURCE_KEY in the DOCUMENT_SOURCE table, described in more detail in the Data Guide

  • _version_ is not a field that we modify for EMERSE, but it must remain because it is required by the Solr schema.

Extra care must be taken to ensure that the Solr field defined by EMR intent RPT_ID (henceforth called its ID) is truly unique across all sources. It is often the case that document from different sources will have the same ID. If you submit a document to Solr that has the same ID as one already in Solr, it overwrites said document, thus at most one of the documents will appear in Solr. To solve this, it is best to assign a "code" to each source, where no source’s code is a prefix of any other source’s code, and prefix to each document from a source the source’s code. For instance, if you have two sources, and you code them as A and B, then an ID 123 present is both sources will be sent to solr as A123 if the document is from source A, and B123 if its from source B. Codes surgery and surgery_er would not work in general, since surgery is a prefix of surgery_er so that (for instance) a document with ID _er123 in source surgery and a document with ID 123 in source surgery_er would both become surgery_er123, and thus overwrite one another in Solr. Obviously, if you can guarantee no ID in surgery starts with _er then you should be fine.

Optional fields:

Additional metadata, such as a document type (e.g., “progress note”, “surgical note”) and clinical service (e.g., “general pediatrics”, “rheumatology”), can also be included. These are helpful for users, as they can be displayed when the application displays lists of documents. Additionally they may be used to sort documents and for filtering results.

If you want to show a field in the Solr schema in the EMERSE UI, you will need to add a row to the database table DOCUMENT_FIELDS to set which pages the field will be shown. See the Data Guide for more information.

Example:

Suppose you want to add in a new field related to the doctor’s ID number, which you want to internally name within Solr (DR_NUM). Simply add the line below to the Solr configuration file:

<field name="DR_NUM" type="string"/>

Technical approaches for Solr indexing

Getting documents from source systems will vary considerably at each site and will depend on multiple factors including the number of different sources, how the documents are stored, how they are formatted, the type of access or connections available, etc.

Three high level approaches are briefly described here, with details in additional sections that follow:

  1. Custom code with Solr Client: We have found this to be the ideal approach because of it’s speed, multi-threaded capabilities, and flexibility. Client libraries are available for most common programming languages. See section, below: Indexing programmatically with Solr Client

  2. Solr Data Import Handler (DIH): The Solr DIH ships with Solr and is relatively easy to use. It may be a good approach for getting started and learning Solr. However, while the DIH is useful to understand how the retrieval/indexing process works, the DIH is slow compared to other methods and is not multi-threaded so we do not recommend it for large-scale implementations. See: Indexing using Solr DIH (Data Import Handler)

  3. Data integration/ETL tools: Documents can also be presented to Solr using non-developer tools such as Pentaho Data Integration (PDI), which is free and open source, and is used on our demonstration EMERSE Virtual Machine for loading/indexing data. See: Indexing with ETL tools

Indexing programmatically with Solr Client

Our primary tool for working with Solr at Michigan Medicine has primarily been with SolrJ, a Java based client used to add/update documents in a Solr index. The remainder of this section references SolrJ, but is applicable to Solr clients in other languages.

There are Solr API’s/clients available for different programming languages should Java be less desired. This page provides numerous links to other language implementations:

At the University of Michigan we have developed a small amount of custom code to sit between our various source systems and Solr. This custom code is not part of the standard EMERSE release since the specific needs will vary for each institution. The code pulls data from our document repositories via SQL/JDBC and presents the documents to Solr for indexing via SolrJ. This approach also allows us to do additional transformations as needed such as converting documents in RTF format to HTML format which is ideal for displaying in EMERSE. Another possible type of transformation is concatenating pieces of a document from multiple database rows. This can be done through SQL or through custom code, but it needs to be done before presenting the document to Solr for indexing.

We have an example of a data integration process using a SolrJ application in our Setup Guide.

We are using Unix Cron to schedule a simple curl HTTP request to be made to the indexing code to start once per night. By using modification time information available in the source systems, and remembering the last time indexing occurred, we are able to index only documents changed since the last time the indexing jobs have run.

800
Figure 1. High-level overview of indexing using Apache SolrJ.
To get started programming with the SolrJ API see: http://www.solrtutorial.com/solrj-tutorial.html
Simple Java Example

Below is the source of a java program which can index a hierarchy of files. This example does not use SolrJ, mainly to keep it is as simple to run as possible, but you are strongly suggested to use SolrJ or related client libraries to simplify your code. This file assumes Solr is accessible at http://localhost:8983/solr, and that you have the default configuration for field names and index name. For the specific example in the code below, you would also likely need to include the name of the Solr index (also called the core or collection name) to which you are sending the documents, so the final URL may look something like http://localhost:8983/solr/documents. But you can also add the collection name in the Solr API and leave it as just /solr at the end of the URL.

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.CharBuffer;
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.function.Function;

/**
 * This directly hits Solr's REST API using only native Java.  See:
 *
 * http://lucene.apache.org/solr/guide/7_3/indexing-and-basic-data-operations.html
 *
 * For more details and more production-ready options.
 *
 * To run this file, compile it to a .class file, then run the class file by its class name, the
 * name without the extension:
 * <pre>
 *     javac Index.java
 *     java Index path-to-data-directory
 * </pre>
 * <p>
 *
 * The data directory should contain a hierarchy of files.  The first level of directories
 * should name the MRN number, the second level should name the source key, and the files
 * in those directories are the files that are indexed.  For example:
 *
 * 21-77/source1/visit1.html
 * 21-77/source1/visit2.html
 * 21-77/source2/visit3.html
 * 18-32/source3/visit1.html
 *
 * 21-77 is the MRN number, source1 names the source in the database, and visit1.txt
 * is just a file name used in the ID of the solr document created.  (The id is the
 * whole path from the data directory.)
 */
public class Index {

	private static final Function<String, String> NOW_AS_ISO;
	private static final int SOLR_PORT;
	private static final String INDEX_NAME;
	private static Writer writer;

	private static void printUsage() {
		System.err.println("Usage: java [ -DsolrPort=8983 ] ");
		System.err.println("            [ -DindexName=documents ]");
		System.err.println("            Index <data-directory>\n");

		System.err.println("       Options must be before the class file.  ");
		System.err.println("       Defaults are as listed.\n");

		System.err.println("       The first argument after the class file must be ");
		System.err.println("       the directory data is stored in.");
		System.err.println("       The first level of directories names the MRN,");
		System.err.println("       the second level of directories are the source keys,");
		System.err.println("       and the filename is included as part of the document id.");
		System.err.println("       So radiology files for patient 27-77 would appear like:");

		System.err.println("           27-77/source3/visit1.html");
		System.err.println("           27-77/source3/visit2.html");
		System.err.println("           27-77/source3/report1.html");
		System.err.println("           etc.\n");

		System.err.println("       (Recall that the radiology source has id \"source3\".)\n");

		System.err.println("       Each file can have a related properties file:\n");

		System.err.println("           27-77/Radiology/visit1.properties");
		System.err.println("           27-77/Radiology/visit2.properties");
		System.err.println("           27-77/Radiology/report1.properties");
		System.err.println("           etc.\n");

		System.err.println("       which populates additional document fields, such as");
		System.err.println("       document type (DOC_TYPE) or department (DEPT):\n");

		System.err.println("           DOC_TYPE=Discharge note");
		System.err.println("           DEPT=Internal medicine");


		System.exit(1);
	}

	static {
		SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
		formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
		NOW_AS_ISO = k -> formatter.format(new Date());
		SOLR_PORT = Integer.parseInt(System.getProperty("solrPort", "8983"));
		INDEX_NAME = System.getProperty("indexName", "documents");
	}

	public static void main(String[] args) {
		if (args.length != 1) {
			printUsage();
		}
		try {
			importData(args[0]);
		} catch (Exception e) {
			e.printStackTrace();
			System.err.println("\n\nEncountered error!\n\n");
			printUsage();
		}
	}

	private static void importData(String baseDir) throws Exception {
		@SuppressWarnings("unchecked")
		Map<String, String> p = (Map) new Properties();


		File[] patients = new File(baseDir).listFiles(File::isDirectory);
		if (patients == null) {
			return;
		}
		for (File patient : patients) {
			String mrn = patient.getName();
			File[] sources = patient.listFiles(File::isDirectory);
			if (sources == null) {
				return;
			}
			for (File source : sources) {
				String sourceKey = source.getName();
				File[] files =
						source.listFiles(f -> f.isFile() && !f.getName().endsWith("properties"));
				if (files == null) {
					continue;
				}
				for (File file : files) {
					String docId =
							mrn + File.separator + sourceKey + File.separator + file.getName();

					loadProperties((Properties) (Map) p, file);

					p.put("ID", docId);
					p.put("MRN", mrn);
					p.put("SOURCE", sourceKey);
					p.put("RPT_TEXT", getFileContents(file));
					p.computeIfAbsent("ENCOUNTER_DATE", NOW_AS_ISO);

					indexDocument(p, docId);
				}
			}
		}
	}

	private static void indexDocument(Map<String, String> map, String docId) throws Exception {
		HttpURLConnection connection = (HttpURLConnection) new URL(
				"http",
				"localhost",
				SOLR_PORT,
				"/solr/" + INDEX_NAME + "/update/json/docs").openConnection();
		connection.setRequestMethod("POST");
		connection.setRequestProperty("Content-Type", "application/json");
		connection.setDoOutput(true);
		writeMap(map, connection.getOutputStream());
		int status = connection.getResponseCode();
		System.out.println("Update to document " + docId + " returned status: " + status);
		if (status / 100 != 0) {
			try (Reader r = new InputStreamReader(connection.getInputStream())) {
				CharBuffer cbuf = CharBuffer.allocate(1000);
				while (r.read(cbuf) != -1) {
					System.err.append(cbuf);
					cbuf.clear();
				}
				System.err.flush();
			}
		}
	}


	private static void loadProperties(Properties p, File file) throws IOException {
		p.clear();

		String propFilename = file.getName();
		propFilename = propFilename.substring(0, propFilename.indexOf('.')) + ".properties";
		File propertiesFile = new File(file.getParentFile(), propFilename);
		if (propertiesFile.exists()) {
			p.load(new FileReader(propertiesFile));
		}
	}

	private static String getFileContents(File file) throws IOException {
		try (FileReader reader = new FileReader(file)) {
			StringBuilder sb = new StringBuilder();
			char[] buffy = new char[1000];
			int buffySize;
			while ((buffySize = reader.read(buffy)) != -1) {
				sb.append(buffy, 0, buffySize);
			}
			return sb.toString();
		}
	}

	private static void writeMap(
			Map<String, String> map, OutputStream outputStream) throws Exception {
		writer = new OutputStreamWriter(outputStream);
		writeMap(map);
	}


	private static void writeMap(Map<String, String> map) throws IOException {
		writer.write("{\n");
		Iterator<String> iterator = map.keySet().iterator();
		while (iterator.hasNext()) {
			String key = iterator.next();
			writer.write("  \"");
			writeStringEscaped(key);
			writer.write("\": \"");
			writeStringEscaped(map.get(key));
			writer.write('"');
			if (iterator.hasNext()) {
				writer.write(',');
			}
			writer.write('\n');
		}
		writer.write("}\n");
	}

	private static void writeStringEscaped(String s) throws IOException {
		for (int i = 0; i < s.length(); i++) {
			char c = s.charAt(i);
			int p = Character.codePointAt(s, i);
			if (p <= 31 || c == '"' || c == '\\') {
				writer.write(String.format("\\u%04x", p));
			} else {
				writer.write(c);
			}
		}
	}
}

Indexing with Python

Below is a sample script demonstrating how indexing can be done with Python. This is a simple skeleton script showing how to use pyodbc and pysolr together to upload notes via a generic database connection into Solr. This script assumes that what is needed to query/stage the data is done in advance on the database side or in a pipeline prior to this code being run. Note that Solr will not accept JDBC connections unless you are running the Solr cloud version which has not been tested with EMERSE.

# Script courtesy of Daniel Harris, University of Kentucky, August 2020

import pyodbc
import pysolr
from requests.auth import HTTPBasicAuth

cnxn = pyodbc.connect('DSN=<DSN>')
cursor = cnxn.cursor()

query = u'''
	<query>
'''
try:
	solr = pysolr.Solr('<SOLR_URL>:8983/solr/documents',always_commit=True,auth=HTTPBasicAuth('<SOLR USERNAME>','<SOLR PASSWORD>'), verify=False)
	with cnxn:
		try:
			cursor.execute(query)
			rows = cursor.fetchall()
			for row in rows:
				# fields will vary according to your query, replace X with column #
				enc_date = row[X].strftime('%Y-%m-%dT%H:%M:%SZ')
				update_date = row[X].strftime('%Y-%m-%dT%H:%M:%SZ')
				rpt_date = row[X].strftime('%Y-%m-%dT%H:%M:%SZ')
				solr.add([
					{
						"SOURCE": "<INSERT SOURCE ID>",
						"ID": row[X],
						"RPT_TEXT":row[X],
						"MRN":row[X],
						"LAST_UPDATED":update_date,
						"ENCOUNTER_DATE":enc_date,
						"RPT_DATE":rpt_date,
						"DOC_TYPE":row[X],
						"SRC_SYSTEM":"<SRC SYSTEM LABEL>"
					},
				])
		except Exception as e:
			print("Error:")
			print(e)
	cnxn.close()
except Exception as e:
	print(e)

Indexing using Solr DIH (Data Import Handler)

The Data Import Handler has been deprecated and will be removed from the Solr project as of Solr vesion 9.0 It is being moved to a third party plugin. The directions below are being left in this guide for now in case the tool is still useful for important data into EMERSE.

Probably the easiest way to get started with document indexing is to use the Solr Data Import Handler (DIH). This comes packaged with Solr and is easy to get up and running. While the DIH is useful to understand how the indexing process works, the DIH is slow compared to other methods and is not multi-threaded so we do not recommend it for large-scale implementations. The DIH can be configured to access a source system or database, with details about how to retrieve and potentially ‘transform’ the document constructed in SQL. For example, some documents may be in a database with a single document split across multiple rows. The document can be re-assembled into a single text field using SQL and then Solr can process it. Note that Solr expects a document to be ‘pre-assembled’ and does not itself concatenate fields to reconstruct a document from pieces, which is why this should be done in the SQL that retrieves the document. With the DIH you can define the query and the target source in its associated configuration file, and it will pull documents from that source and present them to Solr for indexing.

Getting started with DIH:

The main steps in using DIH are:

  1. Enable DIH in solr config (see below)

  2. Create an xml that will instruct SOLR how to access documents

  3. Install the JDBC driver and configure datasource

  1. Data Import Handler has to be configured in the solrconfig.xml (located in solr.home directory, for example SOLR_HOME/documents/conf/). Uncomment the lines below in solrconfig.xml to make them active.

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
            <str name="config">dataconfig.xml</str>
    </lst>
  </requestHandler>
  1. Database configuration information and the queries needed to retrieve documents are specified in DIH dataconfig.xml file. A sample configuration with mandatory fields required by EMERSE is shown below.

<dataConfig>
  <dataSource name="XE" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@localhost:1521:XE"
   user="system" password="abc" /> +
  <document name="rpta">
    <entity name="rptsa" pk="RPT_ID" query="select report_id,medical_record_num,cast(report_date as date) report_date,report_text from emerse.reports_ctsa where rpt_type=1"
     transformer="ClobTransformer,DateFormatTransformer">
      <field column="report_id" name="RPT_ID" />
      <field column="report_text" name="RPT_TEXT" clob="true"/>
      <field column="medical_record_num" name="MRN" />
      <field column="report_date" name="RPT_DATE" dateTimeFormat="yyyy-MM-dd HH:mm:ss.S" locale="en" />
    </entity>
  </document>
</dataConfig>
  1. A java based driver for the database (JDBC driver) should be installed in the Solr instance in the directory SOLR_HOME/server/lib. The first element in dataconfig.xml is the dataSource. The JDBC URL in the previous snippet is using Oracle. Other database types will require a different driver name.

The DIH can be run from the Solr web application. It is accessible under Core Admin/DataImport menu option. DIH commands can also be directly sent to Solr via an HTTP request. For example, the command to start a full-import is:

Indexing with ETL tools

Documents can be pushed to Solr in a number of ways. The Solr project provides a native Java API, a REST API that can be invoked via ETL tools, curl, or other simple HTTP utilities. At Michigan Medicine, we primarily use the Java API to index documents, but have also successfully used other ETL tools such as Pentaho Data Integrator (PDI).

PDI is currently used to load/index data on our Virtual Machine (VM) distribution of EMERSE.

Many possible implementations of PDI jobs could help with indexing documents, but one scenario we have experimented with is where PDI reads a database and for each row does a HTTP POST call to Solr with the document and metadata in a JSON type of structure.

When using PDI HTTP connectors, consider sending large batches of documents. This will prevent having to open a connection for each document. For further performance improvement, consider enable partitioning in transforms, so that multiple threads are enabled for at least the HTTP portion of this type of PDI job. http://wiki.pentaho.com/display/EAI/Partitioning+data+with+PDI

About Solr

Updates and Solr Documents

Solr determines if a document should be added versus updated/replaced based on a unique document key. If you send Solr a document with a previously used key, Solr will replace the older version of the document with the newer version, but the document will still exist, flagged as replaced, in the index files. This may be needed in cases where the document was updated (e.g., addended) or even where the metadata has changed (e.g., the document was previously assigned to the wrong patient MRN). Over time, Solr may contain numerous deleted documents that can degrade performance. Solr provides an Optimize command to rewrite the indexes to remove these. For more information see Solr Optimization in the Configuration and Optimization Guide.

Solr Configuration Details

EMERSE requires that several aspects of Solr be configured in a specific way. We recommend not changing these configuration settings since doing so might change the way the system performs. A few of these are explained below. Most details about the various configurable components are not described here, since they are already well described in the official Solr documentation. Many of the configuration details are defined in managed-schema. In general, changes to the Solr schema should be made through the admin UI. If they are made directly to managed-schema Solr should not be running or changes could be lost.

Stop Words

Stop words are usually ignored by search engines because they add little meaning to the phrases being searched, and they take up additional storage space. In the case of medical documents, many stop words are legitimate acronyms and abbreviations and, as a result, EMERSE does not use any stop words. Examples of stop words with additional meaning can be seen in the table below. There are tradeoffs with this approach, but by including these additional words it should help to ensure that users will be able to find terms that otherwise might be missed.

Traditional Stop Word Meaning of Acronym/Abbreviation

a

apnea
type A personality
type A aortic dissection

an

aconthosis nigricans

and

axillary node dissection

are

active resistance exercise

as

aortic stenosis

at

ataxia-telangiectasia

be

below elbow

but

breakup time

if

immunofluorescence

in

inch

is

incentive spirometry

it

intrathecal

no

nitric oxide

on

one nightly

or

operating room

the

trans-hiatal esophagectomy

to

tracheal occlusion

was

Wiskott-Aldrich syndrome

will

William (also used in the term 'legal will')

Analyzers

Analyzers are used within Solr to determine how to process the documents for indexing, as well as processing the query itself. This includes determining how the document or query should be tokenized, and how to deal with case. We define two field 'types'. One is Text_general_htmlstrip which is used by the RPT_TEXT field and Text_general_htmlstrip_nolowercase which is used by the RPT_TEXT_NOIC field, the latter of which is for keeping documents in a case-sensitive manner (needed for case-sensitive searches to help distinguish common English words such as "all" from acronyms such as "ALL", or acute lymphoblastic leukemia). These should generally not be modified, but we point them out here to highlight configuration settings that could affect search results.

Further, we define a character map in mapping-delimiters.txt to aid in tokenization and indexing. These can be found below, and essentially replace these characters with a space. This means that phrases such as "17-alpha-hydroxyprogesterone", "O157:H7", and "125.5" will be indexed as separate words: "17", "alpha", "hydroxyprogesterone", "O157", "H7", "125", "5".

"_" => " "
"." => " "
":" => " "

This character map also normalizes a few characters to help ensure consistency with searches. This includes mapping curly quotes (single and double) to straight quotes. A similar conversion happens when entering text into the user interface so that all quotes are automatically converted to the plain/straight variety.

Integrating Documents from the Epic EHR

Many academic medical centers use the Epic EHR. There are multiple ways in which to get documents out of Epic, each with its own advantages and disadvantages. Several of these options are described below. Regardless of how the documents are extracted, we recommend storing documents extracted from the EHR in a document repository. Further details about a repository can be found in the EMERSE Architecture Guide.

Regardless of what approach is used for extracting notes, it may be worth concatenating other data to the note to make a larger, more comprehensive, and self-contained note detailing additional aspects of the patient encounter. For example, if a clinical note from an encounters is extracted, it is possible through other queries to extract a medication list at the time of the encounter, append it to the note, and then store this larger, concatenated merged document in a document repository. Such additional data may be useful for users.

It is also possible to incorporate images or drawings within a document before indexing with Solr. Details about how to do this can be found in section on Images in Documents in the Additional Considerations Guide.

Various approach for obtaining and integrating documents from the Epic EHR are described below.

HL7 Interface

The current way in which EMERSE at Michican Medicine receives encounter summaries and notes from Epic is via an Epic Bridges outbound HL7 interface. Details about this interface can be found here: http://open.epic.com/Content/specs/OutgoingDocumentationInterfaceTechnicalSpecification.pdf (Epic account required)

There may be a fee required by Epic to turn on this interface. The emitted HL7 messages include a report, which includes the clinical note as well as other structured data from an encounter in RTF format, which (importantly) preserves the formatting and structure of the notes. The reports are highly configurable by combining Epic print groups, which dictate the content of the report/RTF file. This can make searching more comprehensive for users. The RTF files obtained via this interface are converted to HTML by the commercial software Aspose.Words API, and stored in our document repository for nightly indexing by Solr. For more details, see: RTF to HTML file conversion.

Available metadata in the HL7 message, such as Encounter Date, Department, and Edit Date, are also stored in the repository along with the note. We recommend that you try to only include finalized, or ‘signed’ notes in the repository to prevent the need to continually monitor for document changes and frequent re-processing and re-indexing. However, such decisions will need to be made locally depending on the local use cases, and needs/expectations of users.

One disadvantage of this approach is that it is not useful for loading/extracting historic documents. That is because this outbound document HL7 interface is triggered to send the document based on pre-specified actions (signing a note, for example) and cannot be called on an ad hoc basis to pull historic notes.

Epic EHR Notes Indexing - Overview
Figure 2. The approach used at the University of Michigan to obtain notes from the Epic EHR and make them searchable within EMERSE.

Epic Clarity

It is also possible to receive notes through the Epic Clarity database; however, at the time of this writing, at most institutions Clarity strips a large amount of the document formatting (i.e., basically converting the document from RTF to plain TXT), and thus display of the notes will not look as they appear in Epic. The original document formatting may be essential for understanding/interpreting the document once displayed within EMERSE, including the structure of tables, and other common markup such as bolding, italics, etc. Epic Clarity can preserve some formatting such as line breaks, but doesn’t include tables and other text formatting.

Many Epic instances may be using the default configuration for moving documents from Chronicles to Clarity, and this default configuration will strip out line feeds/carriage returns during the ETL process. However, this can be changed through configuration of the "Character Replacement Settings" which is part of the KB_SQL Generation Settings. Essentially, the ETL process cannot handle many control characters (such as line feeds), so the character replacement will translate the line feeds to to a different character that the ETL can handle, and then translate them back for storage in Clarity. This translation would have to be turned on for the NOTE_TEXT field, or something with a similar name.

If there is a desire to use Clarity to obtain notes, we suggest looking into the following tables:

HNO_INFO 	(note metadata)
HNO_ENC_INFO 	(notes /encounter linkage)
HNO_NOTE_TEXT 	(actual note text)

These tables don’t seem to be documented in Epic’s Clarity ambulatory care documentation, even though outpatient notes are stored in the same tables along with the inpatient notes as of Epic 2015. However, The inpatient Clarity “training companion” does have an overview of these tables, and they are mentioned in the Clarity data dictionary, so between these two references you should be able to build a query that collects Epic notes.

Additionally, it should also be possible to move the notes in their original RTF format from Chronicles to Clarity. This is advantageous since it will help preserve much of the formatting of the original notes (bold text, italics, font sizes, tables, etc). Assuming that the plain text documents would continue to be moved from Chronicles to Clarity (the default configuration for Epic), an additional custom table would have to be created within Clarity to hold the RTF-formatted notes. It will also be necessary to set up the ETL to extract HNO #41 (the location within Chronicles that holds the RTF-formatted notes) directly from the Chronicles database. When sending these RTF-formatted Clarity documents to EMERSE, the RTF would have to be converted to HTML, but various options are available to do this (see : RTF to HTML file conversion.).

Clarity might still be useful in order to keep track of what documents have been changed (for example, a note might be deleted or addended weeks after it has been stored and indexed in EMERSE).
Storing notes as RTF is advantageous to preserve formatting, but it will increase the storage requirements. We estimate that our RTF documents each take up about 22k of space, although each document does have additional data appended to each one, including the medication list and problem list at the time of the encounter. Further, we estimate that storage for the raw RTF might take up about 10% more space than their plain text (RTF-stripped) counterparts. As of 2021 we began to incorporate images into our documents (encoded in Base64) which further increased the size of the stored documents.
An interesting story about naming conventions: The name HNO for the note table in Epic came from the original phrase "Shared NOtes", but since HNO was already being used by the system, the second letter (H) in sHared was used instead.

Epic Web Services

It is possible to build a web service to extract documents that preserves the original formatting of the notes.

At the University of Michigan we have built a custom web service that extracts formatted notes directly from Chronicles (not Clarity). These notes still have their native formatting which is in RTF, but they are converted to HTML and embedded in a SOAP message during the web service call. This approach will require individuals to have experience with writing custom web services and Epic Cache code. Those interested in exploring this option can contact us at the University of Michigan and we can share the Cache code and the web service definition.

As of version Epic 2017, there are two 'out of the box' web services that can be used for extracting notes:

  • GetClinicalNotes

  • GetClinicalNoteDetails

Further information about these web services can be found within the Epic documentation.

Epic’s webservices are not used for EMERSE at Michican Medicine, but are being used elsewhere in our health system. The major challenge with using the webservices is that they require knowledge of the primary note identifier before making the webservice call. It should be possible, however, to use clarity to pull note identifiers and then make a webservice call to retrieve the note as RTF/html. Other approaches may work equally as well.
Depending on the integration approach, it may be useful to setup a staging database that sits between the source systems and the EMERSE Solr indexes. The primary use of this at Michigan Medicine is to store the documents as they are received from Epic as HL7 messages. They are not directly stored in Solr, even though Solr ultimately creates its own local copy once a documents is sent from the staging database to Solr.

Integrating Documents from EHRs more generally

Fast Healthcare Interoperability Resources (FHIR)

FHIR is an emerging standard that is meant to be EHR vendor-independent so that it can work across all EHR systems that support the standard. Using FHIR could potentially be a viable approach for nearly any modern EHR. With support from the NCI Informatics Technology for Cancer Research (ITCR), a team at the University of Utah developed a FHIR-based tool for bulk extracting documents in their native format based on an input set of medical record numbers (MRNs). The University of Michigan further modified the code, using FHIR R4, whereas the original Utah-developed code used STU3. These projects/code can be found on our EMERSE GitHub repository (contact us for access).

The University of Michigan has tested this software against a "support" Epic environment and not a true production environment. Nevertheless, our results should provide a guidepost for assessing whether this tool is appropriate for your use.

Through a collaboration with the University of Utah, we are able to provide a FHIR-based tool for bulk-extracting documents from EHRs. A description of our findings is below, so that it may serve as a useful guide for others considering using this tool.

  • With Epic, the software pulls documents directly from Chronicles, the transactional database used for active patient care (with other EHR systems it is unclear where the data would be pulled from since it will depend entirely on the vendor’s implementation of the standard). Because the code extracts data directly from the transactional system, it would be safe to assume that some limitations would be place on usage by each institution, especially during peak business hours. At the time of this writing (2022) Epic does not have a way to throttle usage.

  • The data are extracted in HTML format and contain the native formatting of the documents (e.g., bold, italics, tables, etc) which is ideal for viewing within EMERSE.

  • We have found that some images, such as those within educational materials are embedded within the HTML using Base64 encoding, but the patient-specific clinical images are not embedded.

  • FHIR is not designed to pull large amounts of information at one time. In our performance testing on a staging environment, we found the document extraction rate to be approximately 3 documents per second.

  • The only filter currently supported by the code is a start date (thus the date range would be the start date through the current date at the time of code execution). There is not a way to request specific document types, or to request metadata. Thus, if a document is edited, it may not be eay to identify documents that have been updated but were previously extraced.

  • Fo Epic systems, it is possible to submit a medical record number as an identifier using “identifier=MRN|000000000” as the request parameter on the FHIR interface directly to query patients. This is an extension from Epic implementation of FHIR. Then the FHIR identifier is retrieved from the result of the patient query. For all following queries like DocumentReference, Binary etc. the FHIR ID will be used.

We would like to acknowledge the NCI ITCR program (Grants 5U24CA204800 and 5U24CA204863) for supporting development of this tool, as well as the team at the University of Utah including: Douglas Martin, Guilherme Del Fiol, and Kensaku Kawamoto.

Patient Integration

EMERSE requires that all patients referenced in documents exist in the Patient database table. At the University of Michigan, the Patient table is updated once per day (overnight) which coincides with the timing of indexing all new/changed notes (also once per day). The Solr indexing process does not require that this table be up to date, but the EMERSE system will need the MRNs to match between the PATIENT database table and the Solr indexes for everything to work properly when users are on EMERSE.

Patient data should be loaded into the Patient database table, and then these data get copied over to a Solr index created to match the patient data with the search results. This copying process is handled by Solr. The default settings for when this copying occurs is configurable, which is detailed in the Configuration Guide.

Research Study Integration

EMERSE is often used to aid in research studies. While not required, there is a provision for users conducting research to select a research study they are associated with at the "Attestation Page" after successfully logging in. EMERSE then links that particular session with the study in the EMERSE audit logs. Loading your institutional IRB data into the research tables in EMERSE will enable it to show the studies that each user is associated with, that are not expired, and that have a valid status (e.g., Approved status).

At the University of Michigan we use a commercial IRB tracking system (Click Commerce). A subset of the data from this system are moved to a data warehouse every night, and we extract a subset of data from that warehouse to bring into EMERSE for validation of users to studies. A Pentaho Data Integrator job, external to EMERSE, copies this subset to the research study tables in EMERSE, nightly.

If there are large delays in updating the RESEARCH_STUDY tables, it can delay users access to EMERSE, since the study will not appear in the attestation screen for the user to select. There is also a chance that a user could still be permitted to be approved for use on a study even if the study was revoked (again because of the delay introduced by nightly refreshes).

LDAP integration

EMERSE can be configured to use LDAP for user authentication. At this time, however, EMERSE authorization can not be done exclusively with LDAP - only users in the LOGIN_ACCOUNT table can access EMERSE even when LDAP is enabled. This was done intentionally so that accounts could be provisioned/revoked on a manual basis, pending review. The Configuration Guide provides details on changing settings that will enable LDAP authentication.

Active Directory integration

EMERSE does not natively work with Active Directory (AD), but AD can be configured to support the LDAP protocol which EMERSE does support. See, the section on LDAP integration, above