
Overview

This guide provides information on integrating external data into the EMERSE application, as well as basics about the indexing process. In order for EMERSE to search and highlight documents, they must be submitted to the Solr instance to which EMERSE is attached. Additionally, patients who have indexed documents need their corresponding demographics loaded into EMERSE’s Patient table in the database. Optionally, research study and user (personnel) information can be loaded into a few EMERSE database tables that will allow EMERSE to check that authenticated users are research personnel for approved studies, which is used for determining research access and populating the audit tables at the time of login.

Data integration and indexing are tightly linked, which is why they are discussed together in this guide. Solr itself is not covered in depth, as it is already well documented by the Apache Solr project. Note that knowledge of Solr will be crucial to indexing for EMERSE.

Integrating Documents

The EMERSE application uses Solr to search, highlight, and display documents. This section provides possible approaches for submitting documents to Solr for indexing, as well as information about conforming your documents to the Solr schema that EMERSE requires.

When setting up a new system, it’s best to figure out what fields should exist in Solr, alter Solr’s managed-schema to reflect that, then set up EMERSE’s fields.

Don’t modify managed-schema or other Solr config files while Solr is running. Additionally, altering managed-schema will require a re-index, as Solr doesn’t migrate already-indexed data to reflect your changes to the schema.

Documents in Solr

A document is an object that associates values with fields. Each field must hold the same kind of value, whether it’s a string, number, or date. Solr also bundles up information about how to index the value of a field with the type of the field. These field types and the fields themselves are described in the managed-schema file in the conf directory; there is one for the patient index and one for the documents index. Here are the basic field types that come in the demo system:

<fieldType name="date" class="solr.DatePointField"
	sortMissingLast="true" uninvertible="false" docValues="true"/>
<fieldType name="long" class="solr.LongPointField"
	sortMissingLast="true" uninvertible="false" docValues="true"/>
<fieldType name="int" class="solr.IntPointField"
	sortMissingLast="true" uninvertible="false" docValues="true"/>
<fieldType name="string" class="solr.StrField"
	sortMissingLast="true" uninvertible="false" docValues="true"/>
<fieldType name="boolean" class="solr.BoolField"
	sortMissingLast="true" uninvertible="false" docValues="true"/>

<fieldType name="text_cs" class="solr.TextField"
	sortMissingLast="true" uninvertible="false" docValues="true"
>
	<analyzer>
		<tokenizer name="standard"/>
	</analyzer>
</fieldType>

<fieldType name="text_ci" class="solr.TextField"
	sortMissingLast="true" uninvertible="false" docValues="true"
>
	<analyzer>
		<tokenizer name="standard"/>
		<filter name="lowercase"/>
	</analyzer>
</fieldType>

You can read up on types in the Solr documentation. There are other types you could possibly use, but EMERSE has only really been tested with these.

These field types index the entire value of the field as a single value. So, for instance, if the value of a string field describing the clinician is "David Hanauer", then only a search for exactly that string will find the document; a search for "David", or even "david" in lower case, will not locate the document. This is typically fine, since fields describing metadata such as the clinician will be entered by users via filters, where we list the actual values and let users select among them.
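
For example (using the CLINICIAN field from the demo schema; the values are illustrative), a query against such a string field must match the stored value exactly:

	CLINICIAN:"David Hanauer"     (finds the document)
	CLINICIAN:"David"             (does not match)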

However, for the main body text, you will need to use a different type that tokenizes by words, along with adding additional NLP tokens for use in EMERSE. The type used in the demo system looks like:

<fieldType name="text_with_header_post_analysis" class="org.emerse.solr.TextWithHeaderFieldType"
	termPositions="true" termVectors="true" termOffsets="true" termPayloads="true"
>
	<analyzer type="index">
		<charFilter class="solr.HTMLStripCharFilterFactory"/>
		<tokenizer class="org.emerse.opennlp.OpenNLPTokenizerFactory"
			tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
			sentenceModel="nlp/models/sd-med-model.zip"
			hasHeader="true"
			usingLuceneTokenizer="true"
		/>
		<filter class="org.emerse.ctakes.CTAKESFilterFactory"
			excludeTypes="nlp/pos_exclude.txt"
			prefix="CUI"
			dictionary="nlp/dictionary/custom.script"
			longestMatch="false"
		/>
		<filter class="org.emerse.nlp.TaggerFilterFactory"
			addTokenPrefix="false"
			delimiters="nlp/delimiters.txt"
			negations="nlp/negations.txt"
			ignore="nlp/ignore.txt"
			certainty="nlp/probability_resource_patterns.txt"
			family="nlp/family_history_resource_patterns_A.txt"
			familyTemplates="nlp/family_history_resource_patterns_B.txt"
			familyMembers="nlp/family_history_resource_patterns_C.txt"
			tokenizerModel="nlp/models/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin"
		/>
		<filter class="org.emerse.nlp.PostAnalysisFilterFactory"/>
	</analyzer>
	<analyzer type="query">
		<charFilter class="solr.HTMLStripCharFilterFactory"/>
		<tokenizer class="org.apache.lucene.analysis.standard.StandardTokenizerFactory"/>
	</analyzer>
</fieldType>

More information on exactly what is going on in this type can be found in the NLP Guide. Once you understand these types, you can decide what fields your documents should have. The demo system’s fields are defined like this:

<uniqueKey>ID</uniqueKey>
<field name="ID" type="string" required="true"/>

<field name="SOURCE" type="string"/>
<field name="MRN" type="string"/>
<field name="CSN" type="string"/>
<field name="ENCOUNTER_DATE" type="date"/>
<field name="DOC_TYPE" type="string" default="unknown"/>

<field name="RPT_TEXT" type="text_with_header_post_analysis"/>
<field name="NLP_HEADER" type="string" indexed="false"/>

<field name="CASE_ACCN_NBR" type="string"/>
<field name="ADMIT_DATE" type="date"/>
<field name="CLINICIAN" type="string" default="unknown"/>
<field name="DEPT" type="string" default="unknown"/>
<field name="INDEX_DATE" type="date"/>
<field name="LAST_UPDATED" type="date"/>
<field name="RPT_DATE" type="date"/>
<field name="RPT_ID" type="string"/>
<field name="SVC" type="string" default="unknown"/>

<field name="CATEGORY" type="string" multiValued="true"/>

<field name="AGE_DAYS" type="int"/>
<field name="AGE_MONTHS" type="int"/>

Each Solr document must be uniquely identified by the value of a single field, called the unique key or ID. The <uniqueKey> element defines what field that is. Typically we keep this as ID. When you submit a document to the index, it will replace any document that was previously indexed that has the same ID. In this way, you can "update" a document.

A field can also be multi-valued by setting multiValued="true" on the field’s element. A multi-valued field contains an array or list of values instead of a single one. Each value is indexed independently, and the document can be retrieved by any of them; you cannot search for a document that contains a particular list of values in a multi-valued field. For instance, if clinician were a multi-valued string field with the value ["David Hanauer", "John Smith"], you could locate the document by searching for David Hanauer or John Smith, but not David, Smith, or David Hanauer John Smith.
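
For illustration, a document conforming to the demo schema might be submitted as the following JSON (the values are made up); submitting another document later with the same ID would replace this one entirely:

{
  "ID": "A123",
  "MRN": "27-77",
  "SOURCE": "source3",
  "ENCOUNTER_DATE": "2021-03-15T00:00:00Z",
  "CATEGORY": ["cardiology", "outpatient"]
}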

You can set a default value for a field, which is used when no value is given for the field in a document. By default, if a field has no value, it cannot be found by any search on that field. This isn’t necessarily a problem, but if some documents don’t have a value for a given field and there is a filter set up in EMERSE on that field, then not all documents will be represented among the choices for that filter in the UI, which can be confusing for users and administrators. It can also result in users missing documents when doing their work. For instance, if a user was looking for documents from a certain department, they would select that department in a filter. However, if some documents don’t have a department, perhaps because of some data-quality issue, users will not be able to find those documents with their search unless they remove their filter on department. If instead documents got a default value of "unknown" on department, then "unknown" would be a filter option, and users could search for both their target department and the "unknown" department.

Finally, the document text field (RPT_TEXT in the demo system) uses a custom type we built to do NLP. This field expects the text of the document to be prefixed with something called the NLP header. See the NLP Guide for more info, but for indexing documents using the default configuration as in the demo system, you should prepend a line with RU1FUlNFX0g=1|5| to HTML documents, or a line with RU1FUlNFX0g=2|5| to non-HTML documents. You should also include this prepended line as the value of the NLP_HEADER field.
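
As a minimal sketch of this step (the header strings are those quoted above for the default configuration; the class, method, and map of field names to values are otherwise hypothetical):

import java.util.Map;

public class NlpHeaderExample
{
	// Set RPT_TEXT and NLP_HEADER for one document, per the default demo configuration.
	// "doc" is the map of Solr field names to values being assembled for indexing.
	static void addNlpHeader(Map<String, String> doc, String documentText, boolean isHtml)
	{
		// The header must appear as the first line of the indexed text...
		String header = isHtml ? "RU1FUlNFX0g=1|5|" : "RU1FUlNFX0g=2|5|";
		doc.put("NLP_HEADER", header);                      // ...and also as its own field
		doc.put("RPT_TEXT", header + "\n" + documentText);
	}
}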

Whereas the fields of the documents index are highly customizable, the patient index should use exactly the schema defined in the demo indexes; it should not be changed.

Documents in EMERSE

EMERSE models documents in a very similar fashion to Solr: as an association of fields with values. However, EMERSE partitions the documents in Solr into groups called sources. A document belongs to the source named in its SOURCE field. EMERSE then allows a field to be either specific to a single source or general to all sources.

You can view the fields defined in EMERSE in the Fields tab in the admin app. Each field has a few attributes:

  1. a label, which is the value that will be shown to users of EMERSE

  2. a Solr field the EMERSE field is mapped to

  3. a source, which determines if it’s specific to a source, or general to all sources

  4. a type, which in EMERSE is TEXT, DATE, or NUMBER. Typically EMERSE can look at the underlying Solr type and uniquely determine the appropriate type from that, so often there is only one option here

  5. a filter UI, which, if set, will allow users to set a filter on the field

  6. display settings related to where, if anywhere, the value of the field should be shown to users to help describe documents, such as in the document list or a single-document view

Finally, each field can define a field value grouping. If a grouping is defined, then each value in Solr can be placed into zero, one, or more groups. Each group is defined only by its label. The group labels can then be shown to users instead of the underlying values in Solr. This allows you to store very short codes in Solr but provide nice names in the EMERSE UI. Similarly, it allows you to put multiple Solr values in the same group so they are presented the same way to users. This can be useful if you want a more coarse grouping of documents than the underlying values indexed into Solr, or if, say, the codes have changed over the years and you want a consistent set of labels for users to choose from.

There is no rule that each Solr field is used by only a single EMERSE field; you can have multiple EMERSE fields mapped to the same Solr field. Together with the value grouping feature, this allows you to present multiple levels of hierarchy from the values of a single field. For instance, if you have a Solr field that stores the lab, you can create two fields in EMERSE: one that groups labs into departments, and one that remains at the granularity of labs. It can also decouple compound information. For instance, if you have a field whose values contain information on both what body part was imaged and what type of imaging was done, you can define two fields in EMERSE that map to that single value in Solr, and set up the groups to be either the imaging types or the body parts.

Though the field value grouping system allows these flexible features, EMERSE ultimately must query Solr using the values stored in Solr. That means that if a user sets a single field value group as a filter, EMERSE will end up querying Solr with all the values in that group; if the group contains many values, the query will be large and slow. Thus, these features are mainly intended to allow flexibility in customizing the system after data has been indexed. When planning a system, it’s best to break down documents in every way you think users will need, keeping the queries as small as possible so they are as fast as possible.
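
For example (the field and values are illustrative), if a single group labeled "Emergency" covers several historical department codes, a filter on that group is sent to Solr as a disjunction over all of them:

	DEPT:(ED OR ER OR EM OR EMERG)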

Special Fields in EMERSE

EMERSE has a number of general fields that are required. These are listed as "special roles" in the UI. (This only appears for fields listed in the general section.) A single field can play multiple special roles, but typically different fields do. Fields that play a special role generally should have the string type in Solr, with some obvious exceptions, as listed below.

CLINICAL_DATE

the date of the patient encounter or other medically relevant date. This should be a date type in Solr.

DOC_CONTEXT

the field used to title patient snippets in all-patient search. Often this is something like the document type.

MRN

the patient’s unique identifier, grouping documents by patient.

PATIENT_ENCOUNTER_ID

the patient encounter ID, which links multiple documents to a single visit from the patient. This role is optional.

RPT_ID

a unique identifier for the document across all sources. This should be the ID field in Solr, or whatever is marked as <uniqueKey> in the managed-schema file.

RPT_TEXT

the field that contains the document text.

SOURCE

the field that tells what source the document is from. The values of this field are the sources of documents.

One field, called _version_, must be present in Solr, since it is used internally by Solr. Its definition should be:

<field name="_version_" type="long"/>

<fieldType name="long" class="solr.LongPointField" docValues="true" invertible="false"/>

Extra care must be taken to ensure that the Solr field with special role RPT_ID (henceforth called its ID) is truly unique across all sources. It is often the case that documents from different sources will have the same ID. If you submit a document to Solr that has the same ID as one already in Solr, it overwrites said document, so at most one of the documents will appear in Solr. To solve this, it is best to assign a "code" to each source, where no source’s code is a prefix of any other source’s code, and prefix each document’s ID with its source’s code. For instance, if you have two sources, and you code them as A and B, then an ID 123 present in both sources will be sent to Solr as A123 if the document is from source A, and B123 if it’s from source B. Codes surgery and surgery_er would not work in general, since surgery is a prefix of surgery_er, so that (for instance) a document with ID _er123 in source surgery and a document with ID 123 in source surgery_er would both become surgery_er123, and thus overwrite one another in Solr. Obviously, if you can guarantee that no ID in surgery starts with _er, then you should be fine.
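
As a minimal sketch of this scheme (the class and method names are hypothetical):

import java.util.List;

public class SourceCodes
{
	// Build a globally unique Solr ID by prefixing the source-local document ID
	// with the source's code, as described above.
	static String solrId(String sourceCode, String localId)
	{
		return sourceCode + localId;   // e.g. "A" + "123" -> "A123"
	}

	// Guard: verify that no source code is a prefix of another source's code,
	// which would allow IDs from different sources to collide.
	static void checkPrefixFree(List<String> codes)
	{
		for (String a : codes)
			for (String b : codes)
				if (!a.equals(b) && b.startsWith(a))
					throw new IllegalArgumentException(
						"code '" + a + "' is a prefix of code '" + b + "'");
	}
}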

Technical approaches for Solr indexing

Getting documents from source systems will vary considerably at each site and will depend on multiple factors including the number of different sources, how the documents are stored, how they are formatted, the type of access or connections available, etc.

Three high level approaches are briefly described here, with details in additional sections that follow:

Custom code

We have found this to be the ideal approach because of its speed, multi-threaded capabilities, and flexibility. Client libraries are available for most common programming languages. See the section below: Indexing Programmatically

Solr Data Import Handler (DIH)

The Solr DIH ships with Solr and is relatively easy to use. It may be a good approach for getting started and learning Solr. However, while the DIH is useful for understanding how the retrieval/indexing process works, it is slow compared to other methods and is not multi-threaded, so we do not recommend it for large-scale implementations. See: Indexing using Solr DIH (Data Import Handler)

Data integration/ETL tools

Documents can also be presented to Solr using non-developer tools such as Pentaho Data Integration (PDI), which is free and open source, and is used on our demonstration EMERSE Virtual Machine for loading/indexing data. See: Indexing with ETL tools

Indexing Programmatically

We have developed an indexing application to pull data from our databases and push them into Solr via Solr’s JSON update handler endpoints. This is not part of EMERSE itself, but part of what we call the "data pipeline" to EMERSE. The data pipeline also includes additional transformations before the documents reach the database, such as converting documents from RTF to HTML, which is ideal for display in EMERSE.

Though we directly index in our code, it is possible to use a Solr client library to help.

Figure 1. High-level overview of indexing using Apache SolrJ.

It’s best to send multiple documents to Solr in a single update request. This can be done by simply sending a JSON array of documents instead of a single document in the request body. You can also send documents to Solr in multiple threads simultaneously.
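
For example (values are illustrative), a batched request body is simply a JSON array of documents:

[
  { "ID": "A123", "MRN": "27-77", "SOURCE": "source3", "RPT_TEXT": "..." },
  { "ID": "A124", "MRN": "27-77", "SOURCE": "source3", "RPT_TEXT": "..." }
]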

On the other hand, it is possible to send documents so fast that Solr cannot process them, returning an error. If you see such an error, it’s best to simply make the request again until it succeeds, waiting a second or more between attempts to give Solr some time to catch up. The Java code below shows how this can be done with a simple loop. (See the indexDocument method.)

Java Example

Below is the source of a Java program which can index a hierarchy of files. This example does not use SolrJ, mainly to keep it as simple to run as possible. This file assumes Solr is accessible at http://localhost:8983/solr, and that you have the demo system’s configuration for field names and the index name. For the specific example in the code below, you would also likely need to include the name of the Solr index (also called the core or collection name) to which you are sending the documents, so the final URL may look something like http://localhost:8983/solr/documents. Alternatively, some Solr APIs accept the collection name in the request itself, in which case the URL can end at just /solr.

To run it, simply copy and paste it into a file named Index.java, then do:

$ java Index.java

It’ll give more instructions on how it should be run.

Index.java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.util.*;
import java.util.function.*;

/**
 * This directly hits Solr's REST API using only native Java.  See:
 * <p>
 * http://lucene.apache.org/solr/guide/7_3/indexing-and-basic-data-operations.html
 * <p>
 * For more details and more production-ready options.
 * <p>
 * To run this file:
 * <pre>
 *     java Index.java
 * </pre>
 * This will print a help message with more usage information.
 */
public class Index
{

	private static final Function<Object, String> NOW_AS_ISO;
	private static final String INDEX_URL;

	private static void printUsage()
	{
		System.err.print("""
			Usage: java [ -DsolrPort=8983 ]
			            [ -DindexName=documents ]
			            Index <data-directory>

			       Options must be before the class file.
			       Defaults are as listed.

			       The first argument after the class file must be
			       the directory data is stored in.
			       The first level of directories names the MRN,
			       the second level of directories are the source keys,
			       and the filename is included as part of the document id.
			       So radiology files for patient 27-77 would appear like:

			           27-77/source3/visit1.html
			           27-77/source3/visit2.html
			           27-77/source3/report1.html
			           etc.

			       (Recall that the radiology source has id "source3".)

			       Each file can have a related properties file:

			           27-77/Radiology/visit1.properties
			           27-77/Radiology/visit2.properties
			           27-77/Radiology/report1.properties
			           etc.

			       which populates additional document fields, such as
			       document type (DOC_TYPE) or department (DEPT):

			           DOC_TYPE=Discharge note
			           DEPT=Internal medicine
			""");
		System.exit(1);
	}

	static
	{
		NOW_AS_ISO = k -> DateTimeFormatter.ISO_INSTANT.format(Instant.now());
		var port = Integer.parseInt(System.getProperty("solrPort", "8983"));
		var indexName = System.getProperty("indexName", "documents");
		INDEX_URL = "http://localhost:" + port + "/solr/" + indexName + "/update/json/docs";
	}

	public static void main(String[] args)
	{
		if (args.length != 1)
		{
			printUsage();
		}
		try
		{
			importData(args[0]);
		}
		catch (Exception e)
		{
			e.printStackTrace();
			System.err.println("\n\nEncountered error!\n\n");
		}
	}

	private static void importData(String baseDir) throws Exception
	{
		var docProps = new Properties();

		File[] patients = new File(baseDir).listFiles(File::isDirectory);
		if (patients == null) return;
		for (var patient : patients)
		{
			String mrn = patient.getName();
			File[] sources = patient.listFiles(File::isDirectory);
			if (sources == null) continue;
			for (var source : sources)
			{
				String sourceKey = source.getName();
				File[] docs = source.listFiles(f -> f.isFile() && !f.getName().endsWith("properties"));
				if (docs == null) continue;
				for (var doc : docs)
				{
					String docId = mrn + File.separator + sourceKey + File.separator + doc.getName();

					loadPropertiesOfDoc(docProps, doc);

					docProps.put("ID", docId);
					docProps.put("MRN", mrn);
					docProps.put("SOURCE", sourceKey);
					docProps.put("RPT_TEXT", Files.readString(doc.toPath()));
					docProps.computeIfAbsent("ENCOUNTER_DATE", NOW_AS_ISO);

					indexDocument(docProps);
				}
			}
		}
	}

	private static void indexDocument(Map<Object, Object> doc) throws Exception
	{
		var maxAttempts = 3;
		for (var attempts = 0; attempts < maxAttempts; attempts++)
		{
			var connection = (HttpURLConnection) new URI(INDEX_URL).toURL().openConnection();
			connection.setRequestMethod("POST");
			connection.setRequestProperty("Content-Type", "application/json");
			connection.setDoOutput(true);
			writeDocToSolr(doc, connection.getOutputStream());
			int status = connection.getResponseCode();
			if (status / 100 == 2) return;
			else
			{
				// On non-2xx responses, the body is on the error stream, not the input stream.
				var err = connection.getErrorStream();
				if (err != null) err.transferTo(System.err);
				System.err.printf("Got status %d from Solr: waiting one second to retry\n", status);
				System.err.flush();
				Thread.sleep(1000);
			}
		}
		System.err.printf(
			"Attempted sending %s %d times but failed each time; moving on\n",
			doc.get("ID"),
			maxAttempts
		);
	}

	private static void loadPropertiesOfDoc(Properties docProps, File doc) throws IOException
	{
		docProps.clear();

		var propFilename = doc.getName();
		propFilename = propFilename.substring(0, propFilename.indexOf('.')) + ".properties";
		var propertiesFile = new File(doc.getParentFile(), propFilename);
		if (propertiesFile.exists()) docProps.load(new FileReader(propertiesFile));
	}

	private static void writeDocToSolr(
		Map<Object, Object> map, OutputStream outputStream
	) throws Exception
	{
		var writer = new OutputStreamWriter(outputStream);
		writer.write("{\n");
		Iterator<Object> iterator = map.keySet().iterator();
		while (iterator.hasNext())
		{
			var key = (String) iterator.next();
			var value = (String) map.get(key);

			writer.write("  \"");
			writeStringEscaped(key, writer);
			writer.write("\": \"");
			writeStringEscaped(value, writer);
			writer.write('"');
			if (iterator.hasNext()) writer.write(',');
			writer.write('\n');
		}
		writer.write("}\n");
	}

	private static void writeStringEscaped(String s, Writer writer) throws IOException
	{
		for (int i = 0; i < s.length(); i++)
		{
			char c = s.charAt(i);
			int p = Character.codePointAt(s, i);
			if (p <= 31 || c == '"' || c == '\\') writer.write(String.format("\\u%04x", p));
			else writer.write(c);
		}
	}
}

Indexing with Python

Below is a sample script demonstrating how indexing can be done with Python. This is a simple skeleton script showing how to use pyodbc and pysolr together to upload notes into Solr via a generic database connection. The script assumes that whatever is needed to query/stage the data has been done in advance on the database side, or in a pipeline prior to this code being run. Note that Solr will not accept JDBC connections unless you are running the Solr cloud version, which has not been tested with EMERSE.

# Script courtesy of Daniel Harris, University of Kentucky, August 2020

import pyodbc
import pysolr
from requests.auth import HTTPBasicAuth

cnxn = pyodbc.connect('DSN=<DSN>')
cursor = cnxn.cursor()

query = u'''
	<query>
'''
try:
	solr = pysolr.Solr('<SOLR_URL>:8983/solr/documents',always_commit=True,auth=HTTPBasicAuth('<SOLR USERNAME>','<SOLR PASSWORD>'), verify=False)
	with cnxn:
		try:
			cursor.execute(query)
			rows = cursor.fetchall()
			for row in rows:
				# fields will vary according to your query, replace X with column #
				enc_date = row[X].strftime('%Y-%m-%dT%H:%M:%SZ')
				update_date = row[X].strftime('%Y-%m-%dT%H:%M:%SZ')
				rpt_date = row[X].strftime('%Y-%m-%dT%H:%M:%SZ')
				solr.add([
					{
						"SOURCE": "<INSERT SOURCE ID>",
						"ID": row[X],
						"RPT_TEXT":row[X],
						"MRN":row[X],
						"LAST_UPDATED":update_date,
						"ENCOUNTER_DATE":enc_date,
						"RPT_DATE":rpt_date,
						"DOC_TYPE":row[X],
						"SRC_SYSTEM":"<SRC SYSTEM LABEL>"
					},
				])
		except Exception as e:
			print("Error:")
			print(e)
	cnxn.close()
except Exception as e:
	print(e)

Indexing using Solr DIH (Data Import Handler)

The Data Import Handler has been deprecated and will be removed from the Solr project as of Solr version 9.0; it is being moved to a third-party plugin. The directions below are being left in this guide for now in case the tool is still useful for importing data into EMERSE.

While the DIH is useful to understand how the indexing process works, the DIH is slow compared to other methods and is not multi-threaded so we do not recommend it for large-scale implementations. The DIH can be configured to access a source system or database, with details about how to retrieve and potentially ‘transform’ the document constructed in SQL. For example, some documents may be in a database with a single document split across multiple rows. The document can be re-assembled into a single text field using SQL and then Solr can process it. Note that Solr expects a document to be ‘pre-assembled’ and does not itself concatenate fields to reconstruct a document from pieces, which is why this should be done in the SQL that retrieves the document. With the DIH you can define the query and the target source in its associated configuration file, and it will pull documents from that source and present them to Solr for indexing.

To get started with DIH, look at the GitHub repository for the project.

Indexing with ETL tools

Documents can be pushed to Solr using ETL tools. In the past, we have used Pentaho Data Integrator (PDI) to do our indexing and other data-pipeline work.

Many possible implementations of PDI jobs could help with indexing documents, but one scenario we have experimented with is where PDI reads a database and, for each row, makes an HTTP POST call to Solr with the document and metadata in a JSON type of structure.

When using PDI HTTP connectors, consider sending large batches of documents. This will avoid having to open a connection for each document. For a further performance improvement, consider enabling partitioning in transforms, so that multiple threads are used for at least the HTTP portion of this type of PDI job. http://wiki.pentaho.com/display/EAI/Partitioning+data+with+PDI

About Solr

Updates and Solr Documents

Solr determines whether a document should be added versus updated/replaced based on a unique document key. If you send Solr a document with a previously used key, Solr will replace the older version of the document with the newer version, but the old version will still exist, flagged as deleted, in the index files. Replacing a document may be needed in cases where the document was updated (e.g., addended) or even where the metadata has changed (e.g., the document was previously assigned to the wrong patient MRN). Over time, Solr may contain numerous deleted documents that can degrade performance. Solr provides an Optimize command to rewrite the indexes and remove these. For more information see Solr Optimization in the Configuration and Optimization Guide.
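
For example (assuming the demo index name and Solr’s default port), an optimize can be triggered through the update handler:

$ curl 'http://localhost:8983/solr/documents/update?optimize=true'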

Solr Configuration Details

EMERSE requires that several aspects of Solr be configured in a specific way. We recommend not changing these configuration settings, since doing so might change the way the system performs. A few of these are explained below. Most details about the various configurable components are not described here, since they are already well described in the official Solr documentation. Many of the configuration details are defined in managed-schema. In general, changes to the Solr schema should be made through the admin UI; if changes are made directly to managed-schema, Solr should not be running, or the changes could be lost.

Stop Words

Stop words are usually ignored by search engines because they add little meaning to the phrases being searched, and they take up additional storage space. In the case of medical documents, many stop words are legitimate acronyms and abbreviations and, as a result, EMERSE does not use any stop words. Examples of stop words with additional meaning can be seen in the table below. There are tradeoffs with this approach, but by including these additional words it should help to ensure that users will be able to find terms that otherwise might be missed.

Traditional Stop Word   Meaning of Acronym/Abbreviation

a                       apnea; type A personality; type A aortic dissection
an                      acanthosis nigricans
and                     axillary node dissection
are                     active resistance exercise
as                      aortic stenosis
at                      ataxia-telangiectasia
be                      below elbow
but                     breakup time
if                      immunofluorescence
in                      inch
is                      incentive spirometry
it                      intrathecal
no                      nitric oxide
on                      one nightly
or                      operating room
the                     trans-hiatal esophagectomy
to                      tracheal occlusion
was                     Wiskott-Aldrich syndrome
will                    William (also used in the term 'legal will')

Analyzers

Analyzers are used within Solr to determine how to process documents for indexing, as well as how to process the query itself. This includes determining how the document or query should be tokenized, and how to deal with case. We define two field 'types': Text_general_htmlstrip, used by the RPT_TEXT field, and Text_general_htmlstrip_nolowercase, used by the RPT_TEXT_NOIC field. The latter keeps documents in a case-sensitive form (needed for case-sensitive searches to help distinguish common English words such as "all" from acronyms such as "ALL", or acute lymphoblastic leukemia). These should generally not be modified, but we point them out here to highlight configuration settings that could affect search results.

Further, we define a character map in mapping-delimiters.txt to aid in tokenization and indexing. The mappings can be found below; they essentially replace each of these characters with a space. This means that phrases such as "17-alpha-hydroxyprogesterone", "O157:H7", and "125.5" will be indexed as separate words: "17", "alpha", "hydroxyprogesterone", "O157", "H7", "125", "5".

"_" => " "
"." => " "
":" => " "

This character map also normalizes a few characters to help ensure consistency with searches. This includes mapping curly quotes (single and double) to straight quotes. A similar conversion happens when entering text into the user interface so that all quotes are automatically converted to the plain/straight variety.
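
For illustration, such normalization entries use the standard Solr mapping-file syntax and would look something like the following (shown with Unicode escapes; the exact entries in mapping-delimiters.txt may differ):

"\u201C" => "\""
"\u201D" => "\""
"\u2018" => "'"
"\u2019" => "'"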

Integrating Documents from the Epic EHR

Many academic medical centers use the Epic EHR. There are multiple ways in which to get documents out of Epic, each with its own advantages and disadvantages. Several of these options are described below. Regardless of how the documents are extracted, we recommend storing documents extracted from the EHR in a document repository. Further details about a repository can be found in the EMERSE Architecture Guide.

Regardless of what approach is used for extracting notes, it may be worth concatenating other data to the note to make a larger, more comprehensive, and self-contained note detailing additional aspects of the patient encounter. For example, if a clinical note from an encounter is extracted, it is possible through other queries to extract the medication list at the time of the encounter, append it to the note, and then store this larger, merged document in a document repository. Such additional data may be useful for users.

It is also possible to incorporate images or drawings within a document before indexing with Solr. Details about how to do this can be found in the section on Images in Documents in the Additional Considerations Guide.

Various approaches for obtaining and integrating documents from the Epic EHR are described below.

HL7 Interface

The current way in which EMERSE at Michigan Medicine receives encounter summaries and notes from Epic is via an Epic Bridges outbound HL7 interface. Details about this interface can be found here: http://open.epic.com/Content/specs/OutgoingDocumentationInterfaceTechnicalSpecification.pdf (Epic account required)

There may be a fee required by Epic to turn on this interface. The emitted HL7 messages include a report containing the clinical note as well as other structured data from the encounter, in RTF format, which (importantly) preserves the formatting and structure of the notes. The reports are highly configurable by combining Epic print groups, which dictate the content of the report/RTF file. This can make searching more comprehensive for users. The RTF files obtained via this interface are converted to HTML by the commercial Aspose.Words API, and stored in our document repository for nightly indexing by Solr. For more details, see: RTF to HTML file conversion.

Available metadata in the HL7 message, such as Encounter Date, Department, and Edit Date, are also stored in the repository along with the note. We recommend that you try to only include finalized, or ‘signed’ notes in the repository to prevent the need to continually monitor for document changes and frequent re-processing and re-indexing. However, such decisions will need to be made locally depending on the local use cases, and needs/expectations of users.

One disadvantage of this approach is that it is not useful for loading/extracting historic documents. That is because this outbound document HL7 interface is triggered to send the document based on pre-specified actions (signing a note, for example) and cannot be called on an ad hoc basis to pull historic notes.

Figure 2. The approach used at the University of Michigan to obtain notes from the Epic EHR and make them searchable within EMERSE.

Epic Clarity

It is also possible to receive notes through the Epic Clarity database; however, at the time of this writing, at most institutions Clarity strips a large amount of the document formatting (basically converting the document from RTF to plain text), and thus the notes will not display as they appear in Epic. The original document formatting may be essential for understanding/interpreting the document once displayed within EMERSE, including the structure of tables and other common markup such as bolding, italics, etc. Epic Clarity can preserve some formatting such as line breaks, but doesn’t include tables and other text formatting.

Many Epic instances may be using the default configuration for moving documents from Chronicles to Clarity, and this default configuration will strip out line feeds/carriage returns during the ETL process. However, this can be changed through configuration of the "Character Replacement Settings", which is part of the KB_SQL Generation Settings. Essentially, the ETL process cannot handle many control characters (such as line feeds), so the character replacement will translate the line feeds to a different character that the ETL can handle, and then translate them back for storage in Clarity. This translation would have to be turned on for the NOTE_TEXT field, or something with a similar name.

If there is a desire to use Clarity to obtain notes, we suggest looking into the following tables:

HNO_INFO 	(note metadata)
HNO_ENC_INFO 	(notes /encounter linkage)
HNO_NOTE_TEXT 	(actual note text)

These tables don’t seem to be documented in Epic’s Clarity ambulatory care documentation, even though outpatient notes are stored in the same tables as the inpatient notes as of Epic 2015. However, the inpatient Clarity “training companion” does have an overview of these tables, and they are mentioned in the Clarity data dictionary, so between these two references you should be able to build a query that collects Epic notes.

Additionally, it should also be possible to move the notes in their original RTF format from Chronicles to Clarity. This is advantageous since it will help preserve much of the formatting of the original notes (bold text, italics, font sizes, tables, etc.). Assuming that the plain text documents would continue to be moved from Chronicles to Clarity (the default configuration for Epic), an additional custom table would have to be created within Clarity to hold the RTF-formatted notes. It will also be necessary to set up the ETL to extract HNO #41 (the location within Chronicles that holds the RTF-formatted notes) directly from the Chronicles database. When sending these RTF-formatted Clarity documents to EMERSE, the RTF would have to be converted to HTML, but various options are available to do this (see: RTF to HTML file conversion).

Clarity might still be useful in order to keep track of what documents have been changed (for example, a note might be deleted or addended weeks after it has been stored and indexed in EMERSE).
Storing notes as RTF is advantageous to preserve formatting, but it will increase the storage requirements. We estimate that our RTF documents each take up about 22k of space, although each document does have additional data appended to each one, including the medication list and problem list at the time of the encounter. Further, we estimate that storage for the raw RTF might take up about 10% more space than their plain text (RTF-stripped) counterparts. As of 2021 we began to incorporate images into our documents (encoded in Base64) which further increased the size of the stored documents.
An interesting story about naming conventions: the name HNO for the note table in Epic came from the original phrase "Shared NOtes", but since SNO was already being used by the system, the second letter (H) in sHared was used instead.

Epic Web Services

It is possible to build a web service that extracts documents while preserving the original formatting of the notes.

At the University of Michigan we have built a custom web service that extracts formatted notes directly from Chronicles (not Clarity). These notes still have their native formatting which is in RTF, but they are converted to HTML and embedded in a SOAP message during the web service call. This approach will require individuals to have experience with writing custom web services and Epic Cache code. Those interested in exploring this option can contact us at the University of Michigan and we can share the Cache code and the web service definition.

As of Epic 2017, there are two 'out of the box' web services that can be used for extracting notes:

  • GetClinicalNotes

  • GetClinicalNoteDetails

Further information about these web services can be found within the Epic documentation.

Epic’s web services are not used for EMERSE at Michigan Medicine, but are being used elsewhere in our health system. The major challenge with using the web services is that they require knowledge of the primary note identifier before making the web service call. It should be possible, however, to use Clarity to pull note identifiers and then make a web service call to retrieve the note as RTF/HTML. Other approaches may work equally well.
Depending on the integration approach, it may be useful to set up a staging database that sits between the source systems and the EMERSE Solr indexes. The primary use of this at Michigan Medicine is to store the documents as they are received from Epic as HL7 messages. They are not directly stored in Solr, even though Solr ultimately creates its own local copy once a document is sent from the staging database to Solr.

Integrating Documents from EHRs more generally

Fast Healthcare Interoperability Resources (FHIR)

FHIR is an emerging standard that is meant to be EHR vendor-independent so that it can work across all EHR systems that support the standard. Using FHIR could potentially be a viable approach for nearly any modern EHR. With support from the NCI Informatics Technology for Cancer Research (ITCR), a team at the University of Utah developed a FHIR-based tool for bulk extracting documents in their native format based on an input set of medical record numbers (MRNs). The University of Michigan further modified the code, using FHIR R4, whereas the original Utah-developed code used STU3. These projects/code can be found on our EMERSE GitHub repository (contact us for access).

The University of Michigan has tested this software against a "support" Epic environment and not a true production environment. Nevertheless, our results should provide a guidepost for assessing whether this tool is appropriate for your use.

Through a collaboration with the University of Utah, we are able to provide a FHIR-based tool for bulk-extracting documents from EHRs. A description of our findings is below, so that it may serve as a useful guide for others considering using this tool.

  • With Epic, the software pulls documents directly from Chronicles, the transactional database used for active patient care (with other EHR systems it is unclear where the data would be pulled from, since it will depend entirely on the vendor’s implementation of the standard). Because the code extracts data directly from the transactional system, it would be safe to assume that some limitations would be placed on usage by each institution, especially during peak business hours. At the time of this writing (2022), Epic does not have a way to throttle usage.

  • The data are extracted in HTML format and contain the native formatting of the documents (e.g., bold, italics, tables, etc) which is ideal for viewing within EMERSE.

  • We have found that some images, such as those within educational materials, are embedded within the HTML using Base64 encoding, but patient-specific clinical images are not embedded.

  • FHIR is not designed to pull large amounts of information at one time. In our performance testing on a staging environment, we found the document extraction rate to be approximately 3 documents per second.

  • The only filter currently supported by the code is a start date (thus the date range would be the start date through the current date at the time of code execution). There is not a way to request specific document types, or to request metadata. Thus, if a document is edited, it may not be easy to identify documents that have been updated but were previously extracted.

  • For Epic systems, it is possible to submit a medical record number as an identifier using “identifier=MRN|000000000” as the request parameter on the FHIR interface directly to query patients. This is an extension of Epic’s implementation of FHIR. The FHIR identifier is then retrieved from the result of the patient query and used for all subsequent queries, such as DocumentReference, Binary, etc. (see the example below).
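
    For example (the base URL is illustrative, and <FHIR-ID> is a placeholder for the identifier returned by the first call), the patient lookup and a follow-on document query might look like:

    GET https://ehr.example.org/fhir/R4/Patient?identifier=MRN|000000000
    GET https://ehr.example.org/fhir/R4/DocumentReference?patient=<FHIR-ID>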

We would like to acknowledge the NCI ITCR program (Grants 5U24CA204800 and 5U24CA204863) for supporting development of this tool, as well as the team at the University of Utah including: Douglas Martin, Guilherme Del Fiol, and Kensaku Kawamoto.

Patient Integration

EMERSE requires that all patients referenced in documents exist in the Patient database table. At the University of Michigan, the Patient table is updated once per day (overnight) which coincides with the timing of indexing all new/changed notes (also once per day). The Solr indexing process does not require that this table be up to date, but the EMERSE system will need the MRNs to match between the PATIENT database table and the Solr indexes for everything to work properly when users are on EMERSE.

Patient data should be loaded into the Patient database table, and these data then get copied over to a Solr index created to match the patient data with the search results. This copying process is handled by Solr. The default settings for when this copying occurs are configurable, which is detailed in the Configuration Guide.

PATIENT table

Table: PATIENT
Population: From external source (such as EHR, or a staging document repository)
Population Frequency: Can be variable, but once per day is reasonable

The EMERSE schema includes a patient table with medical record number (MRN), name, date of birth, and other demographic information which is displayed in the search results. Data in this table are used to display the patient name, validate user-entered or uploaded MRNs, and calculate current ages of the patients. Other demographic data are used to summarize the populations found in a search.

Although the coded demographic information is not required, some features such as the demographics breakdowns within the All Patients search feature will not work if sex, race, and ethnicity are not populated. Further, EMERSE can filter results based on these demographic data.
For all documents indexed, there must be a corresponding patient in the patient table with a medical record number (MRN) that matches the document. This should be taken into consideration when determining the frequency for updating this table.
Column name        Description                                              Required or Optional

id                 Primary Key                                              Required
external_id        Medical Record Number                                    Required
first_name         First Name                                               Required
middle_name        Middle Name                                              Optional
last_name          Last Name                                                Required
birth_date         Birth Date (used to calculate current age)               Required
sex_cd             Sex                                                      Optional (but highly recommended)
language_cd        Language (currently not used; can be populated in the    Optional
                   system and used for some advanced filtering, but will
                   not be displayed to users)
race_cd            Race                                                     Optional (but highly recommended)
marital_status_cd  Marital Status (currently not used; can be populated     Optional
                   in the system and used for some advanced filtering,
                   but will not be displayed to users)
religion_cd        Religion (currently not used; can be populated in the    Optional
                   system and used for some advanced filtering, but will
                   not be displayed to users)
zip_cd             ZIP code                                                 Optional
create_date        Date the row was created (can be used to track           Optional
                   changes to the table)
update_date        Date the row was updated (can be used to track           Optional
                   changes to the table)
ethnicity_cd       Ethnicity                                                Optional (but highly recommended)
deleted_flag       Logical delete flag; useful for merged patient MRNs      Required
                   that result in obsolete MRNs (see: Merges/Splits).
                   Valid values: 1 = yes, deleted; 0 = no, not deleted
deceased_flag      Currently not used. Valid values: 1 = yes, deceased;     Required
                   0 = no, not deceased
updated_by         To keep track of 'updated by' details                    Optional
created_by         To keep track of 'created by' details                    Optional

Obsolete/outdated MRNs should never be deleted from the table. Rather, they must remain and be flagged with the deleted_flag. Otherwise the functionality to warn users about obsolete MRNs will not work. Further, if the MRN is simply updated in the table rather than marking the older MRN as deleted and adding a new row, users will not be warned that the MRN has changed. This could cause issues if the MRNs are also stored in other data capture systems since the user would not know that the MRN in the external system is no longer valid.
The data in the Patient table are automatically copied to a Solr index, by default once per day. This is detailed in the Configuration Guide.
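
As a minimal sketch of loading one patient row (assuming a JDBC connection to the EMERSE database; the class and method names are hypothetical, and id generation will depend on your schema):

import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class PatientLoad
{
	// Insert one row into the PATIENT table, populating the required columns
	// listed above; deleted_flag and deceased_flag are set here to 0
	// (not deleted, not deceased).
	static void insertPatient(
		Connection conn, long id, String mrn, String firstName, String lastName, Date birthDate
	) throws SQLException
	{
		String sql = "INSERT INTO patient"
			+ " (id, external_id, first_name, last_name, birth_date, deleted_flag, deceased_flag)"
			+ " VALUES (?, ?, ?, ?, ?, 0, 0)";
		try (PreparedStatement ps = conn.prepareStatement(sql))
		{
			ps.setLong(1, id);
			ps.setString(2, mrn);
			ps.setString(3, firstName);
			ps.setString(4, lastName);
			ps.setDate(5, birthDate);
			ps.executeUpdate();
		}
	}
}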

Research Study Integration

EMERSE is often used to aid in research studies. While not required, there is a provision for users conducting research to select a research study they are associated with at the "Attestation Page" after successfully logging in. EMERSE then links that particular session with the study in the EMERSE audit logs. Loading your institutional IRB data into the research tables in EMERSE will enable it to show the studies that each user is associated with, that are not expired, and that have a valid status (e.g., Approved status).

At the University of Michigan we use a commercial IRB tracking system (Click Commerce). A subset of the data from this system are moved to a data warehouse every night, and we extract a subset of data from that warehouse to bring into EMERSE for validation of users to studies. A Pentaho Data Integrator job, external to EMERSE, copies this subset to the research study tables in EMERSE, nightly.

If there are large delays in updating the RESEARCH_STUDY tables, users’ access to EMERSE can be delayed, since their study will not appear in the attestation screen for them to select. There is also a chance that a user could still attest to a study even if the study’s approval had been revoked (again because of the delay introduced by nightly refreshes).

Research Studies and Attestation

Immediately after each login, every user is required to ‘attest’ to their use of EMERSE for that session by specifying their reason for using the system. This is called the Attestation page, and the results are stored in the SESSION_ATTESTATION table. EMERSE provides three basic options for this attestation (configurable by a system administrator; these options can be disabled): (1) a free text field, (2) pre-populated Common Use Cases (for example, “Quality Improvement”, “Patient Care”, “Infection Control”, etc.), and (3) a list of research studies with which a user is associated. The additional tables RESEARCH_STUDY_ATTESTATION and ATTESTATION_OTHER may contain further information depending on whether the user specifies a research study or another reason. The free text option would be used when no other attestation choices are reasonable. Additionally, previously used entries from the free text box will appear in the table, along with any IRB-approved studies, for the user’s convenience.

For our implementation at Michigan Medicine, we pull data on all studies in the IRB system, even if the study/person is not currently a part of EMERSE. This is because the dataset is generally small, and it makes it easier for users to validate their studies if the data are already populated once the user is given an EMERSE account.

Figure 3. Entity relationship diagram of some tables related to capturing attestation data, which occurs immediately after a user logs in.

RESEARCH_STUDY table

Table: RESEARCH_STUDY
Population: Populated from external source such as an electronic IRB system
Population Frequency: Can be variable, but once per day is reasonable

If a user is required to select his/her study from the table, then delays in moving IRB data to EMERSE after IRB approval can result in delayed access for that user.

This table contains information about research studies. Using this table and RESEARCH_STUDY_MEMBER allows EMERSE to show a list of studies the end user is associated with.

Column name                  Description                                       Required or Optional

id                           Primary Key                                       Required
external_id                  IRB study number; used to link specific studies   Required
                             to usage, and very helpful for tracking
                             research usage
study_name                   Name of the study                                 Required
principal_investigator_name  Name of the principal investigator                Required
prin_invest_org_id           ID of the principal investigator; not currently   Optional
                             used by EMERSE. This could be a user id or
                             email, but it is a good idea to ensure it is
                             unique.
expiration_date              Expiration date of the study; used to determine   Required
                             if a user should be allowed to proceed. If the
                             expiration date is older than the current date,
                             the user will not be able to select the study
                             in the attestation GUI.
project_status               Current project status; used to track where a     Required
                             study is in the review and approval process.
                             Only certain study statuses allow access to
                             EMERSE for research; the statuses that allow a
                             study to be selected during attestation are
                             defined in the VALID_RES_STUDY_STATUS table.
last_updated                 Not used by EMERSE, but can be useful for         Optional
                             troubleshooting and tracking changes to the
                             table
begin_date                   Originally referred to the date the study began   Optional
                             or should be allowed to begin; can be used for
                             tracking and troubleshooting

VALID_RES_STUDY_STATUS table

Table: VALID_RES_STUDY_STATUS
Population: By System Admin.
Population Frequency: May only need to be done once, at the time of system setup. May need periodic updates if the source data (such as from IRB system) defining study status is changed.

EMERSE contains a simple table defining study statuses. The statuses that are initially populated in the system (loaded up in the build script) are unique to Michigan Medicine (that is, they were developed locally and are implemented in our separate electronic IRB tracking system), and other implementations would have to define their own set of valid statuses if these were to be used to validate and approve usage for research. If the status of a research study is not in this table, EMERSE will not allow the study to be used for attestation; in other words, the study would not even be displayed to the user to select.

Column name  Description                                               Required or Optional

status       A list of study statuses that EMERSE considers valid in   Required
             terms of allowing a user to proceed. These statuses are
             generally defined by the IRB and are universal across
             studies.

VALID_RES_STUDY_STATUS Table Example:

Status

Exempt Approved - Initial
Approved
Not Regulated
Exempt Approved - Transitional

RESEARCH_STUDY_MEMBER table

Table: RESEARCH_STUDY_MEMBER
Population: Populated from external source such as an electronic IRB system
Population Frequency: Can be variable, but once per day is reasonable

This table contains information about study team members, and is related to the RESEARCH_STUDY table, described above. Each study can have one or many study team members.

This table at Michigan Medicine contains information on all study team members for all studies, whether they have an EMERSE account or not.
Column name        Description                                            Required or Optional

RESEARCH_STUDY_ID  Foreign key reference to row id in RESEARCH_STUDY      Required
                   table
USER_ID            Foreign key reference to row in LOGIN_ACCOUNT table    Required
ROLE_NAME          A string describing a person's role on the study       Optional
                   team, e.g. "PI", "Staff", "Study Coordinator"; can
                   be useful when generating usage reports
FIRST_NAME         First name of the user who is on the study; likely     Optional
                   populated from the source IRB system. Not used by
                   EMERSE, but may be useful when generating reports.
LAST_NAME          Last name of the user who is on the study; likely      Optional
                   populated from the source IRB system. Not used by
                   EMERSE, but may be useful when generating reports.
BEGIN_DATE         Not currently used by EMERSE                           Optional
LAST_UPDATED       Date row was last updated                              Optional
DELETED            Flag to indicate if the record has been logically      Required
                   deleted. 0 = false, not deleted; 1 = true, deleted

SESSION_ATTESTATION table

Table: SESSION_ATTESTATION
Population: Used internally by EMERSE
Population Frequency: In real time by EMERSE

Each time a user attests to why they are using EMERSE, a row is inserted into this table, which is one of the audit tables. Attestations related to research can be joined to the RESEARCH_ATTESTATION table; non-research uses can be joined to ATTESTATION_OTHER.

Column name      Description

(All columns in this table are populated internally by EMERSE.)

id               Primary Key
type             A string indicating the top-level category of attestation.
                 RSA indicates the session is used for research; OTH means
                 other usage. Research attestations will have an associated
                 row in RESEARCH_ATTESTATION; if the type is OTH, a row
                 will exist in ATTESTATION_OTHER.
user_session_id  A foreign key reference to the USER_SESSION table

OTHER_ATTESTATION_REASON table

Table: OTHER_ATTESTATION_REASON
Population: By System Admin. Used if Common Use Cases for using EMERSE are desired to ensure consistency of attestation selections.
Population Frequency: May only need to be done once, at the time of system setup.

For non-research attestations, there is a lookup table called OTHER_ATTESTATION_REASON that lists available options. These can be configured by each institution, and may include commonly used access reasons that don’t involve research (such as quality improvement, patient care, etc). These options provide a simple way for a user to click on one of the pre-populated common reasons for use.

Column name    Description                                             Required or Optional

USER_KEY       Text-based primary key of this table; the column name   Required
               might better be thought of as a 'reason key'
DESCRIPTION    The text description that will be displayed in the      Required
               table of options on the Attestation page
DELETED_FLAG   Has this reason been deleted? (0 = no; 1 = yes)         Required
DISPLAY_ORDER  Order of display in the UI. Can be any integer, but     Optional
               should be unique per row; the buttons are ordered by
               this column via SQL sort. Generally start with 0, 1,
               2, etc. No longer used as of version 6.0.

OTHER_ATTESTATION_REASON Table Example:

USER_KEY   DESCRIPTION                                DELETED_FLAG  DISPLAY_ORDER *

QI         Quality Improvement                        0             0
RVPREPRES  Review Preparatory to Research             0             1
STDYDESC   Study involving only decedents (deceased   0             2
           patients)

* This column is no longer used as of version 6.0.

ATTESTATION_OTHER table

Table: ATTESTATION_OTHER
Population: Used internally by EMERSE
Population Frequency: Application dependent

The free text reasons that users enter are stored in a table called ATTESTATION_OTHER. This is populated by EMERSE and is not customizable by users.

Column name              Description                                     Required or Optional

SESSION_ATTESTATION_ID   A unique ID for the session attestation; used   Required
                         for audit logging
FREE_TEXT_REASON         The free text reason that a user entered        Required
OTHER_ATTEST_REASON_KEY  Currently only populated by the system with     Required
                         FRETXT

ATTESTATION_OTHER Table Example:

SESSION_ATTESTATION_ID  FREE_TEXT_REASON                  OTHER_ATTEST_REASON_KEY

50208                   Testing out the system            FRETXT
52060                   Testing out the system            FRETXT
46051                   Looking up a patient in clinic    FRETXT
71052                   infection control monitoring      FRETXT
74107                   cancer registry operational work  FRETXT

LDAP integration

EMERSE can be configured to use LDAP for user authentication. At this time, however, EMERSE authorization cannot be done exclusively with LDAP: only users in the LOGIN_ACCOUNT table can access EMERSE, even when LDAP is enabled. This was done intentionally so that accounts could be provisioned/revoked on a manual basis, pending review. The Configuration Guide provides details on changing settings that will enable LDAP authentication.

Active Directory integration

EMERSE does not natively work with Active Directory (AD), but AD can be configured to support the LDAP protocol, which EMERSE does support. See the section on LDAP integration, above.