After working through the Installation Guide, you’ll have a working instance of EMERSE on a subset of PubMed abstracts and case reports (used in place of real clinical notes, also called documents). To get EMERSE working with your local health record data, you’ll need to do further customization work. In this guide, we’ll show you the steps you’ll have to implement, along with some checkpoints to verify that you’ve done things right.

At a high level, you’ll have to:

  1. Gain access to patient demographic data such as name, date of birth, sex, race, etc. We’ll use these data to fill the PATIENT database table.

  2. Build a process to pull patient data into EMERSE’s PATIENT table as it is added or changed.

  3. Gain access to the free text patient notes. There may be multiple systems storing patient notes. EMERSE can be loaded with all of them.

  4. Decide on the schema of the Solr documents used to store patient notes. This means deciding what field names you will use, and whether those fields will store dates, text, or other metadata (such as note type, clinical department, etc.).

  5. Update the schema in your Solr index to reflect your decision.

  6. Update the DOCUMENT_FIELDS table and related tables to reflect your Solr schema, and decide what fields are visible on what screens in EMERSE.

  7. Build a process to push (or pull) patient notes into Solr, as new notes are created, and old notes are amended or corrected.

Since we don’t have a real source of patient information in this guide, we’ll use the website https://www.epubbooks.com as a source. Authors will act as our "patients", and the books they have available on that site will be our "medical documents." We’ll use a tool called "pandoc" to convert the books from ePUB format to HTML, which preserves the formatting of the book and makes it display correctly in EMERSE (pandoc isn’t actually needed for EMERSE, just for this setup demonstration). Another source of similar data for testing purposes is the Project Gutenberg website, which has books whose copyright has expired.

To pull the data from the website, we developed a sample Java application built on top of the SolrJ library. It scrapes the website for author information, downloads the books, converts them to HTML via pandoc, pushes them into Solr, and inserts the patients (i.e., authors) into the PATIENT table in the database.

Contact the EMERSE team for the download links to this sample application.

To integrate your own EHR system into EMERSE, you will have to build a similar application, or collection of applications, that pulls from your EHR instead. In reality you may or may not be pulling directly from the EHR: many sites have intermediate systems (e.g., a research data warehouse) that contain a copy of the data you will need to access. You don’t have to develop these applications in Java, but if you do use another language, it may have a library similar to SolrJ that facilitates talking to Solr. See the Client API Lineup on the Solr website.

There are a few major issues you will have to plan for beyond what the sample application does:

  1. Your EHR data may change, so you will have to account for modifications to documents or patient data that you have already imported.

  2. Multiple patients may "merge" into a single patient because they were discovered to be registered twice in the system. See Merges/Splits in the Additional Considerations Guide.

  3. You likely will want to log the operations of the data-loading applications, so that if there are issues with the loaded data you can track what happened. This may involve saving documents in their original format before conversion, and recording events such as MRN splits/merges and which documents and patients were updated or added. You can also add extra fields to the Solr documents index specifically for tracking this information; EMERSE ignores fields it doesn’t know about.

Loading Patients

Looking at the author information available on https://www.epubbooks.com, we can see it’s pretty limited. Still, we strive to fill out as much of the PATIENT table as possible. (More details about the PATIENT table can be found in the Data Guide.) Here are the columns that we will use:


ID

This is auto-generated when we insert rows into the table.


EXTERNAL_ID

This is the medical record number, or MRN. We’ll put the URL of the author’s page in here. For technical reasons, the MRN cannot be longer than 20 characters, and should not contain characters that are special in URLs. So, we’ll include only the last 20 characters of the URL, and replace slashes with exclamation marks.
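As a concrete sketch of this rule in Java (the class and method names here are our own, not necessarily the sample application’s):

```java
// Sketch: derive the 20-character "MRN" from an author page URL by keeping
// the last 20 characters and replacing slashes with exclamation marks.
public class MrnFromUrl {
    static String mrn(String authorUrl) {
        int start = Math.max(0, authorUrl.length() - 20);
        return authorUrl.substring(start).replace('/', '!');
    }

    public static void main(String[] args) {
        System.out.println(mrn("https://www.epubbooks.com/authors/edwin-abbott-abbott"));
        // prints "!edwin-abbott-abbott"
    }
}
```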


FIRST_NAME

We’ll put the first name of the author in here.


LAST_NAME

We’ll put the last name of the author in here.


BIRTH_DATE

We’ll put January 1 of the author’s birth year in here.


DECEASED_FLAG

We’ll put a 1 in here if we found a death year for the author.

We don’t have information for the other columns, so we’ll leave them blank. To see what the other columns mean (especially the ones ending in _CD), look at the Data Guide.

Now that we know what information we’re going to put in the patient table and where it’s coming from, we need to write a small program that periodically loads the patient information into the PATIENT table. Approaches for doing this can look very different depending on the source of information. See the Integration Guide for a short discussion on the different ways the PATIENT table may be integrated.

Since our source of "patient" information is a website, our only real option is to write and then run a batch job to scrape the site every night, loading it into the PATIENT table.

You can run the author scraper with the following command:

java -cp epubbooks-indexer.jar LoadPatients --url jdbc:oracle:thin:@//localhost:1521/xepdb1 --user emerse --password

You may need to change the Oracle URL, user, or password if you configured them differently.

This will scrape the website and check whether each author is already in the PATIENT table. If so, it updates the patient data (it doesn’t check whether any of it changed); if not, it inserts a new row. It does not clear out prior patients, which shows that you may have multiple patient loaders, loading from different sources, without conflict.
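The loader’s decision can be sketched in a few lines of Java. Here a HashMap stands in for the PATIENT table (keyed by MRN); the real loader uses JDBC against Oracle, and the names here are our own:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the loader's upsert rule: update if the MRN exists, insert otherwise.
public class PatientUpsert {
    final Map<String, String> patients = new HashMap<>(); // MRN -> patient data

    /** Applies the change and reports which operation happened. */
    String upsert(String mrn, String name) {
        String op = patients.containsKey(mrn) ? "update" : "insert";
        patients.put(mrn, name); // overwrites unconditionally, without checking for changes
        return op;
    }
}
```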

Even though patients can be loaded from multiple sources, each patient must be unique (based on medical record number, or MRN); there is a constraint on the table to ensure that only unique MRNs can be added.

At this point, if you restart EMERSE, it will not report the new patients. This is because a periodic process scans the patient table and updates statistics about it, and those statistics are what is displayed in the UI. The patients in the database table are also copied to a Solr index, separate from the document index. This Solr patient index is used within the system to rapidly group documents by patient and to summarize data about populations retrieved with an All Patient Search (Find Patients).

To trigger this update process manually (which is something you’d only need to do while setting things up), and to also copy these patients into Solr, you’ll need to go into the admin page within EMERSE, which can be accessed via the dropdown in the top right of EMERSE once you’ve logged in. In the admin page, there’ll be a "System" tab, which will allow you to force a synchronization. After clicking that, you can visit Solr’s admin page, and double check that the new patients are loaded into Solr.

Still, though, the EMERSE UI will not show the new patient count. This is because EMERSE only counts patients that actually have Solr documents, and these new patients have no documents yet.

Loading Documents

Now that we have our authors in the PATIENT table and have replicated that data to the Solr patient index using the admin page, we just need to build a process for loading documents: in our case, the books.

Documents only need to be sent to Solr; they don’t need to go to a database. However, it is common for patient documents to be stored in a format other than HTML, such as RTF. As it happens, our books are not in HTML format either; they are in ePUB format. We must use a tool to convert ePUB to HTML and then send the HTML text to Solr, since neither EMERSE nor Solr can read the ePUB format (or RTF). We could convert to plain text instead, but it’s much nicer for users to see richly formatted HTML that is closer to the original document, and the HTML format doesn’t hurt indexing since Solr ignores HTML tags and attributes when indexing words.

Details on how to perform an RTF to HTML conversion can be found in the Integration Guide.
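In Java, conversion amounts to running pandoc as an external process. A sketch of just the command-line construction (the file paths are hypothetical, and a real loader would hand this list to a ProcessBuilder and wait for the exit code):

```java
import java.util.List;

// Sketch: the pandoc command line for converting one ePUB file to HTML.
public class PandocCommand {
    static List<String> convertCommand(String epubPath, String htmlPath) {
        // pandoc can infer formats from file extensions, but we state them explicitly
        return List.of("pandoc", epubPath, "-f", "epub", "-t", "html", "-o", htmlPath);
    }
}
```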

Let’s look at the book pages in https://www.epubbooks.com. We can see we have things like:

  • the title

  • the author

  • the link to the actual content of the book (which you have to be logged into the site to get)

  • the release date (which we can find in a <meta/> tag not visible in the interface but still in the page)

  • number of pages

  • description

Solr Core Directories

We’ll rework the fields and index settings for our documents index, but first let’s go over how Solr finds its indexes and how those indexes are structured. When you start Solr, it looks through the file hierarchy under the SOLR_HOME directory for indexes. You can set SOLR_HOME either with a command-line option when you start Solr, solr start -s /path/to/solr/home, or by setting the environment variable SOLR_HOME, export SOLR_HOME=/path/to/solr/home, before starting Solr.

Any directory underneath SOLR_HOME that contains a file named core.properties makes that directory a Solr core. The name of the core is not the name of the directory, but the value of the name property inside the core.properties file. Cores cannot appear inside of other cores, but cores don’t have to be immediate subdirectories of SOLR_HOME; they can appear at an arbitrary depth. See Figure 1, below, for an illustration of where cores may be placed.

Figure 1. Cores in a hierarchy

Inside a core, files are organized into two directories: data and conf. The data directory contains the binary Lucene files making up the index itself, and conf contains additional configuration files beyond core.properties. The most important configuration file in that directory is the conf/managed-schema XML file, which determines what fields may appear in documents for that index, and how those fields are tokenized, indexed, and stored.
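For illustration, the smallest possible core.properties just names the core; a hypothetical documents core could contain only:

```properties
# Marks this directory as a Solr core and names it "documents".
name=documents
```

Solr treats any directory containing such a file as a core, named by this property rather than by the directory name.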

To begin customizing our index, let’s copy the existing indexes and then clear out the index data to start fresh. Follow the commands in Figure 2, below, which assume your pubmed indexes are in a directory named pubmed-indexes.

Figure 2. Copying the indexes
# make a copy
mkdir custom-indexes
for index in documents patient patient-slave solr.xml; do
   cp -R pubmed-indexes/$index custom-indexes/$index
done

Next, we’ll clear out the index data. An easy way to do this is to delete the data directory in each core, as in Figure 3. You may also want to do this later if you have trouble loading data into the indexes and want to start fresh again.

Figure 3. Deleting all index data
rm -r custom-indexes/*/data
It is also possible to delete a single document, which is described in the Additional Operational Considerations Guide.

Now, let’s remove any extraneous config files in our custom-indexes/documents/conf directory. (These files are either old, or copies of the managed-schema file, which is the one actually used.) See Figure 4, below.

Figure 4. Cleaning up the custom-indexes/documents/conf directory
cd custom-indexes/documents/conf
rm unifiedschema.xml* schema.xml *~

Configuring the Solr documents Index

Now, let’s edit the documents/conf/managed-schema file to add the following fields:

  • ID

  • title

  • description

  • releaseDate

  • numberOfPages

  • authorUrl


  • SOURCE

  • text

  • text_cs (for case-sensitive text searches)

In the managed-schema file we shouldn’t need to change any of the field type definitions, only the <field/> elements at the end of the file. For this example, make those elements look like those in Figure 5.

The managed-schema file should not be changed while Solr is running, since Solr may keep a copy of it in memory, and write that copy to the file when Solr is shut down, thus overwriting any changes you made to it while Solr was running.
Figure 5. The end of the documents/conf/managed-schema file
  <uniqueKey>ID</uniqueKey> <!-- This element is actually at the top of the file; change it there. -->
  <field name="ID" type="string" indexed="true" required="true" stored="true"/>

  <field name="title"       type="string" indexed="false" stored="true"/>
  <field name="description" type="string" indexed="false" stored="true"/>

  <field name="releaseDate" type="tdate" indexed="true" stored="true"/>

  <field name="numberOfPages" type="long" indexed="true" stored="true"/>

  <field name="authorUrl" type="string" indexed="true" stored="true" docValues="true"/>

  <field name="SOURCE" type="string" docValues="true" indexed="true" stored="true"/>

  <field name="text"    type="text_general_htmlstrip"             termPositions="true" termVectors="true" indexed="true" termOffsets="true" stored="true"/>
  <field name="text_cs" type="text_general_htmlstrip_nolowercase" termPositions="true" termVectors="true" indexed="true" termOffsets="true" stored="true"/>

  <copyField source="text" dest="text_cs"/>

  <field name="_version_" type="long" indexed="true" stored="true"/>
1 This is the unique key of the document in Solr. If you index a document with the same value in the ID field, Solr will replace the old document with the new one. This becomes important if (1) the document gets updated or (2) the document gets 're-assigned' to a different patient because it was incorrectly assigned previously. The <uniqueKey/> element already exists at the top of the file, so you don’t actually need to include it at the bottom here, but if you want to change the field name, you’ll need to change it in both places.
2 These are unindexed (hence not searchable) text fields which are stored so we can retrieve them from Solr and show them in EMERSE when a user views the book. Even though the contents of these fields are not searchable, the fields themselves can be used to limit a search using the Advanced Search feature, and can also be used for display to the user, sorting in the UI, etc.
3 This is a date (hence type is tdate) which is indexed so EMERSE can filter on it with its date function. (We’ll tell EMERSE about it later.)
4 This is a numeric field which is indexed so we can search it in Advanced Search in EMERSE. If we didn’t want this possibility, we could have left it unindexed like title or description, as an unindexed string.
5 This is the EXTERNAL_ID of the "patient" (which we made the author URL, but which would really be the medical record number, or MRN). This needs to be indexed and have "doc values" so that certain "joins" between the document index and patient index will work. (We use this to generate patient demographic charts, for instance.)
6 This is the field that EMERSE looks at to know what source the document is from. We’ll set up the sources later, but this field name cannot be changed, and must have exactly these settings.
7 The text and text_cs fields store the main body of text that will be searched in EMERSE. It will contain HTML, so when indexing, we want to be sure to ignore the HTML tags, which is what the type of text_general_htmlstrip means. This type also lowercases the text so a search for a lowercase term will find uppercase terms as well. For text_cs we use the text_general_htmlstrip_nolowercase type which is the same except it does not also lowercase the text before indexing, thus searching for a lowercase term only finds lowercase terms in the text (this allows for case-sensitive text searches). The other attributes on these tags are needed to do (or help optimize) various kinds of searches we do on these fields.
8 This element relates to the item above: it says any text that appears in the "text" field of an input Solr document should also be copied to the "text_cs" field. This means we don’t have to send the (possibly very large) text of the book in both the "text" field and the "text_cs" field; Solr does the copying for us.
9 This field needs to exist with these settings and this name for technical reasons related to Solr. Do not modify.
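The case-sensitivity distinction between the two text fields can be illustrated with a toy comparison in Java. This is not Solr code, just a simplified model of what lowercasing at index time buys you:

```java
// Illustration: why two fields give case-insensitive vs. case-sensitive search.
// The "text" field's analyzer lowercases tokens; "text_cs"'s does not.
public class CaseFields {
    static boolean matches(String indexedToken, String queryToken, boolean lowercased) {
        if (lowercased) { // simplified stand-in for the lowercasing analyzer
            return indexedToken.toLowerCase().equals(queryToken.toLowerCase());
        }
        return indexedToken.equals(queryToken); // text_cs: exact case required
    }
}
```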

See the integration guide for more details on analyzers, Solr, indexing, and different techniques to push documents to your index.

Once we have this set up, we can start Solr and point it to the new index files:

solr start -s path/to/custom-indexes

Navigate to the Solr admin page to make sure Solr came up and loaded the documents index properly. If there are errors, there should be a banner indicating so on the Solr admin home page, or on the documents core home page. (To get to the core home page, just select it in the core selector dropdown on the left.) If there are errors, you can get more information on them in the Solr log file, located inside the Solr installation directory at solr-X.Y.Z/server/logs/solr.log.

Once it’s up, and if it has no errors, we can index documents:

java -cp epubbooks-indexer.jar IndexDocuments \
  --solr http://localhost:8983/solr/documents \
  --cookie 'your cookie from being logged into epubbooks.com'
To find the cookie needed in the above command, log into epubbooks.com and open the network tab in the developer tools. Clear the list of requests (if there are any) and reload the page. Select the "HTML" filter for the request list, and select the first request listed. (It should be the request for the epubbooks.com page you are on; any request to the host epubbooks.com will actually do, but this is an easy way to find one.) Look at the request headers and copy the entire value of the "Cookie" header. It can be tough to copy the entire value in some browsers, since in that portion of the UI it isn’t easy to see what text you are selecting. To make it easier, you can try changing the settings to show raw headers, or right-click on the request in the list, save the headers or the request in some other format, paste that into a text editor, and copy the cookie out of that. If the format is something like raw headers or the raw HTTP request, you want the entire "Cookie:" line, excluding the text "Cookie:" itself. If you’re looking at a cURL command, it should be the text between "Cookie: " and the next single quote character. If you selected HAR, there should be a JSON object with a "name" key whose value is "Cookie", and the cookie itself is the string in the "value" field of that same object. (Don’t include the double quotes around the string.)

This will scrape the website to find the links to the actual ePUB books and download them. It will then invoke pandoc to convert the ePUB books to HTML. (We assume pandoc has already been installed and is on the PATH; see pandoc’s website to download and install it.) It’ll then submit those books, with the associated metadata, as Solr documents to the Solr core specified on the command line. To download books from the website, you’ll need an account and to be logged in. You can do that with your browser, and then (once you’re logged in) take the value of the cookie for the website and paste it into the --cookie parameter. This enables the application to download books as if authenticated with your account (for as long as your session is active).

The indexer application does a bit of caching: if a book has already been downloaded (to a file in a subdirectory of the current directory), it won’t download the book again. Similarly, it doesn’t scrape the website if there’s a saved copy of the scraped data in the current directory, under the files authors.bin and books.bin.

Now you should see some documents in your index if you do a search in the Solr admin page, http://localhost:8983/solr/#/documents/query. (If you don’t change anything in the page, it’ll return everything.)
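If you prefer to check from code rather than the admin UI, the same query is just an HTTP GET against Solr’s select handler. A sketch of building the URL (host and port are this guide’s defaults; q=*:* matches everything, mirroring the admin page’s default query):

```java
// Sketch: the query URL behind the admin page's default search.
public class SolrQueryUrl {
    static String selectAll(String core) {
        return "http://localhost:8983/solr/" + core + "/select?q=*:*&wt=json";
    }

    public static void main(String[] args) {
        System.out.println(selectAll("documents"));
    }
}
```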

The document we are sending to Solr from Java looks like:

        "authorUrl":"!edwin-abbott-abbott", (1)
        "SOURCE":"epubbooks", (2)
        "description":"How would a creature limited to two dimensions be able to grasp the possibility of a third? …",
        "releaseDate":"2010-01-11T05:00:00Z", (3)
        "text":"…" (4)
1 Recall that, for technical reasons, the "MRN" should not contain characters special in URLs, and cannot be longer than 20 characters.
2 This is the "source" of the document. All documents in this example will have the same source, but if we get documents from another website, or another system, it would make sense to have a different value here. This will also allow ancillary fields (such as "title" or "description" but not essential ones such as "text") to vary from source to source.
3 In Java, we actually hand SolrJ a date object, and it converts it to a string.
4 Notice that we do not send a "text_cs" field with the same value as the "text" field; we configured Solr to copy the value of the "text" field to "text_cs". Also notice that other fields, such as _version_, are not supplied; Solr gives them a value internally during indexing.
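A sketch of assembling such a document in Java, with a plain Map standing in for SolrJ’s SolrInputDocument (the field names come from the schema above; the helper names are our own):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: the fields sent for one book. The real application builds a
// SolrJ SolrInputDocument; a Map shows the same shape.
public class BookDoc {
    static Map<String, Object> bookDoc(String mrn, String source, String htmlText) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("authorUrl", mrn);     // the patient's EXTERNAL_ID / "MRN"
        doc.put("SOURCE", source);     // must match a row in DOCUMENT_SOURCE
        doc.put("text", htmlText);     // HTML body; Solr copies it to text_cs
        return doc;
    }
}
```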

Configuring EMERSE for the new documents index schema

At this point the Solr schema for documents has been set up, but the EMERSE application itself knows nothing about the changes.

We need to tell EMERSE about two things:

  1. our epubbooks source, and

  2. its fields and their types.

To do that, we’ll configure three tables:
  • DOCUMENT_SOURCE

  • DOCUMENT_FIELDS

  • DOC_FIELD_EMR_INTENT
See the Data Guide for a description of the columns in these tables, but for the most part they are self-explanatory.

First, we’ll clear out DOCUMENT_SOURCE, but since there are some foreign keys that reference this table, we’ll have to clear out one of the other tables first:

delete from document_fields;
delete from document_source;
insert into document_source
values (
  'epubbooks',                                (1)
  'Books from https://www.epubbooks.com/',
1 This must be the same value as in the SOURCE field in the documents we submit to Solr.

Now, we’ll tell EMERSE about the fields of this source in the DOCUMENT_FIELDS table. We’ve already cleared out that table, so we don’t need to do it now. At the same time, we’ll make some adjustments to the DOC_FIELD_EMR_INTENT table where it makes sense.

insert into DOCUMENT_FIELDS values
('epubbooks', 0, 'Title',
 1, 1,                                                     (1)
 'title', 'TEXT');

insert into DOCUMENT_FIELDS values
('epubbooks', 1, 'Description',
 1, 0,                                                     (2)
 'description', 'TEXT');

insert into DOCUMENT_FIELDS values
('epubbooks', 2, 'Number of Pages',
 1, 0,
 'numberOfPages', 'TEXT');                                (3)

insert into DOCUMENT_FIELDS values
('epubbooks', 3, 'Release Date',
 1, 1,
 null, null);

update DOC_FIELD_EMR_INTENT
   set SOLR_FIELD_NAME = 'releaseDate'                     (4)
 where NAME = 'CLINICAL_DATE';

insert into DOCUMENT_FIELDS values
('epubbooks', 4, 'Author Url',
 0, 0,
 null, null);

update DOC_FIELD_EMR_INTENT
   set SOLR_FIELD_NAME = 'authorUrl'
 where NAME = 'MRN';

insert into DOCUMENT_FIELDS values
('epubbooks', 5, 'Document Id',
 0, 0,                                                     (6)
 null, null);

update DOC_FIELD_EMR_INTENT
   set SOLR_FIELD_NAME = 'ID'
 where NAME = 'RPT_ID';

update DOC_FIELD_EMR_INTENT                                (5)
   set SOLR_FIELD_NAME = 'text'
 where NAME = 'RPT_TEXT';

update DOC_FIELD_EMR_INTENT
   set SOLR_FIELD_NAME = 'text_cs'
 where NAME = 'RPT_TEXT_NOIC';

update DOC_FIELD_EMR_INTENT
   set SOLR_FIELD_NAME = 'releaseDate'                     (7)
 where NAME = 'LAST_UPDATED';
1 We set the title to appear both when viewing the document, but also in the summary table as a column.
2 We set the description to only show when viewing the document; it is too long to show in a column in the summary table.
3 We use the type TEXT here even though in the Solr index it is a number. This doesn’t cause any problems.
4 We make the release date a standard field mapped to the CLINICAL_DATE EMR intent. This means EMERSE will use this Solr field to do date filtering for every source. We specify the Solr field name for all sources in the DOC_FIELD_EMR_INTENT table. We do the same for the authorUrl and ID fields as well.
5 Similarly, the 'text' and 'text_cs' Solr fields are shared across sources, and only need to be configured in the DOC_FIELD_EMR_INTENT table.
6 Even though the ID field isn’t shown anywhere in the UI, for technical reasons there should still be a row in this table for the EMR intent RPT_ID.
7 The Solr field mapped from the "LAST_UPDATED" intent is used to find the minimum and maximum dates of all documents in the index. We’ll map it to our only date field, releaseDate.

Now, before we start EMERSE again, be sure to change the emerse.properties file to point to the new indexes by changing the lucene.indexPath property there.

Checking Configuration

Next, we should check the configuration of EMERSE to make sure everything looks good. To do that, visit the diagnostics page, http://localhost:8080/emerse/diagnostics.html. Click Check Application Health to make EMERSE run through a number of checks. If there are errors, they will be displayed in the table with some help. Next, refresh the error log below the table to see a list of errors that have been logged to the catalina.out file since startup. (You must click refresh to see the errors; they are not fetched when the page is first loaded.)

If there are no problems on this page, then we can move on to telling EMERSE to "synchronize" with Solr and any changes in the database. EMERSE periodically does batch work throughout the day and at night. If you’ve done a big reload of data (as we just have) and want to see if everything is working, it’s best to force EMERSE to do all these batch tasks at once. They would all be completed within 24 hours if you didn’t force synchronization, but we don’t want to wait that long.

To force synchronization, we’ll need to log into the admin app, http://localhost:8080/emerse/admin2/. Unlike the diagnostics page, you must be logged in to an administrator account to see this page. A user is an administrator if they have the ACCESS_ADMIN permission; see the Administrator Guide for more information. The emerse user should be an administrator by default. Once you log in as an administrator, you should see an Admin menu item in the application menu (top right) which will take you to the admin app. (Or you can just enter the URL above; be sure to include the trailing slash.)

Once in the admin app, move to the System tab. If you don’t see tabs, try refreshing the page. On the System tab, click the Synchronize button. This will pop up a progress bar and may take a while.

When synchronization finishes, go back to the main application, refresh the page, and try a variety of queries. If you have any trouble, you can go back and check the diagnostics page, in particular the error log there. (There’s also a log tab in the admin app which shows the same thing.)