Overview
This guide describes the EMERSE API, an HTTP-based programmatic interface to EMERSE's third-party search engine, Solr. The main use case of the API is to access the documents stored in Solr for machine learning. The main reason to use the EMERSE API instead of granting direct access to Solr is that EMERSE performs authentication, authorization, and audit logging on the queries sent to Solr. In all other ways, the EMERSE API is a pass-through to Solr.
Any user of the EMERSE API will need an account in EMERSE with the API_ACCESS privilege granted. Once that privilege is granted, the user will be able to make a POST to /emerse/springmvc/api/search. The body of the POST should be the query parameters to send to Solr, so the body effectively has a content type of application/x-www-form-urlencoded, though the Content-Type header is not checked and doesn't need to be passed. The response is an exact copy of what Solr sent. Only the JSON format is supported: asking Solr to use a different response writer with &wt will likely cause a 500, since we internally parse the JSON response and record the MRNs and document IDs returned in the audit log.
Authentication
Users must also authenticate as part of their POST. This can be done using basic authentication or cookies. To use basic auth, send the `Authorization` header containing `Basic` followed by the base64 encoding of your username and password separated by a colon, as per RFC 7617.
Such a request may look like the following, where we've used the username emerse and password demouser; the text ZW1lcnNlOmRlbW91c2Vy is the base64 encoding of emerse:demouser. We also sent the Solr search parameters q=*:*&rows=0 to search for all documents while returning none of them, in order to get a count of all documents in the index.
POST /emerse/springmvc/api/search HTTP/1.1
Host: localhost:8080
Authorization: Basic ZW1lcnNlOmRlbW91c2Vy
Content-Length: 12
Content-Type: application/x-www-form-urlencoded
q=*:*&rows=0
A response to this request could look like:
HTTP/1.1 200
Content-Type: application/json;charset=UTF-8
Content-Length: 213
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"*:*",
      "rows":"0"
    }
  },
  "response":{
    "numFound":635861,
    "start":0,
    "numFoundExact":true,
    "docs":[ ]
  }
}
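If you're scripting these requests, you don't need to compute the base64 text by hand. The following sketch builds the header value with Python's standard library and reproduces the one shown above:

```python
from base64 import b64encode

def basic_auth_header(username, password):
    # RFC 7617: "Basic " + base64("username:password")
    token = b64encode(f'{username}:{password}'.encode('utf-8')).decode('ascii')
    return 'Basic ' + token

print(basic_auth_header('emerse', 'demouser'))  # Basic ZW1lcnNlOmRlbW91c2Vy
```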
To use cookie-based authentication, issue a POST to /emerse/login with the JSON body {"username":"USERNAME","password":"PASSWORD"}, replacing the uppercase words with your actual username and password. Then, send back the JSESSIONID cookie from the Set-Cookie response header on subsequent requests.
POST /emerse/login HTTP/1.1
Host: localhost:8080
Accept: */*
Content-Length: 43
Content-Type: application/json
{"username":"emerse","password":"demouser"}
A possible response would be:
HTTP/1.1 200
Set-Cookie: JSESSIONID=B7F7756715D9165F7ABC26FA0E67D4F4; Path=/emerse; HttpOnly
Content-Type: application/json;charset=UTF-8
Content-Length: 17
{ "success":true}
Then subsequent requests should look like:
POST /emerse/springmvc/api/search HTTP/1.1
Host: localhost:8080
Cookie: JSESSIONID=B7F7756715D9165F7ABC26FA0E67D4F4
Content-Length: 12
Content-Type: application/x-www-form-urlencoded
q=*:*&rows=0
Don’t use the basic authentication header on the /emerse/login page; using basic auth prevents a session (and hence cookie) from being used.
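When scripting the login, the value to send back in the Cookie header is everything in Set-Cookie before the first semicolon. A minimal sketch of that extraction:

```python
def session_cookie(set_cookie_header):
    # Keep only the "JSESSIONID=..." pair; the attributes after the first
    # semicolon (Path, HttpOnly, ...) are not sent back to the server.
    return set_cookie_header.split(';', 1)[0]

print(session_cookie('JSESSIONID=B7F7756715D9165F7ABC26FA0E67D4F4; Path=/emerse; HttpOnly'))
# JSESSIONID=B7F7756715D9165F7ABC26FA0E67D4F4
```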
Querying
Since the API is a pass-through to Solr, you should consult Solr's documentation on how to query. However, we will go over some basics here.
First, EMERSE has two collections of "documents". The DOCUMENT collection holds the medical notes, and the PATIENT collection holds "documents" each representing a patient. You can query either, and EMERSE has a special query parameter &collection= that allows you to specify which, defaulting to the DOCUMENT collection if not specified. Collections are also called "indexes" in Solr parlance.
A document in Solr is like a JSON object. Each field holds a single value or, rarely, an array of values (but never an array of arrays). Values are text, an ISO-8601 formatted date in the UTC timezone (a string in JSON), a number, or a boolean. For example, here is a document from the PATIENT collection of the EMERSE demo system:
{
  "ID":"7195",
  "MRN":"100007195",
  "FIRSTNAME":"Amari",
  "LASTNAME":"Christian",
  "BIRTHDATE":"1938-06-11T05:00:00Z",
  "SEX_CD":"F",
  "RACE_CD":"B",
  "ETHNICITY_CD":"Pt Refused",
  "RELIGION_CD":"U",
  "LANGUAGE_CD":"U",
  "MARITAL_STATUS_CD":"U",
  "ZIP_CD":"",
  "DECEASED_FLAG":false,
  "DELETED_FLAG":false
}
and here is a document from the DOCUMENT collection of the EMERSE demo system:
{
  "ID":"docid_3_73635",
  "LAST_UPDATED":"1996-03-20T05:00:00Z",
  "ENCOUNTER_DATE":"1996-03-17T05:00:00Z",
  "DOC_TYPE":"Consent Form",
  "DEPT":"Neurosurgery",
  "SOURCE":"source3",
  "RPT_TEXT":"<html><body>Industrial chokeberry pomace...",
  "MRN":"100004264",
  "CSN":"41770",
  "CATEGORY":["278","279","280"],
  "AGE_MONTHS":2,
  "AGE_DAYS":61,
  "_version_":1793068510900912128
}
The fields of the PATIENT index won't change, but each institution may change the number, names, and types of the fields in the DOCUMENT index, so to properly query Solr, users of the API will need to ask your administrators which fields hold what information. Currently, there is no way to determine the field names from the EMERSE UI, though the values of a field can be seen in the filter for the field, if it's set up to show the raw Solr values.
There are three major query parameters every user will probably use:

- &q= gives the query to Solr. Conceptually, Solr will construct an ordered list of all documents that match this query.
- &start= gives the index of the first document to return from the conceptual list of results built by Solr. The first document is at index zero.
- &rows= gives the number of documents to send back. This defaults to 10 if unspecified.

Thus, by passing &start= and &rows= you can paginate through the result set. If you don't want to paginate, you can set &rows= to a very high number, such as 1 billion, to get all results, but this might place a large load on Solr.
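The pagination described above amounts to advancing &start= by &rows= until numFound is exhausted. Here is a small sketch that only builds the form-encoded body for each page; actually POSTing each body to the API is elided:

```python
from urllib.parse import urlencode

def page_bodies(num_found, rows=100):
    # Yield one form-encoded request body per page of the result set.
    for start in range(0, num_found, rows):
        yield urlencode({'q': '*:*', 'start': start, 'rows': rows})

for body in page_bodies(250, rows=100):
    print(body)
# q=%2A%3A%2A&start=0&rows=100
# q=%2A%3A%2A&start=100&rows=100
# q=%2A%3A%2A&start=200&rows=100
```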
Query Syntax
The query syntax that goes into the &q= parameter is thoroughly described on Solr's website. However, as a short introduction: a query consists of a collection of terms, each term being an optional field prefix followed by a quoted phrase. The phrase doesn't need quotes if it doesn't contain a space or other special characters, and the term doesn't need a field prefix if it targets the default field, which is RPT_TEXT. Such a term might look like cancer, which is short for RPT_TEXT:cancer.
The text of a document is broken up into a series of words, defined as sequences of letters or digits separated by punctuation, whitespace, etc., including hyphens and apostrophes. The exact rules for what delimits a word are a bit complex. Each word has one or more tokens associated with it. For instance, the sentence:
The patient denies chest pain.
generates the tokens in the following table. The first row is the sentence itself, for reference. The second row and beyond are the tokens generated. Tokens are positioned at words in the document, so tokens in the same column are positioned at the word in the top row. For instance, the token TX_The is positioned where the word "The" is in the sentence.
| The    | patient     | denies    | chest       | pain        |
| TX_The | TX_patient  | TX_denies | TX_chest    | TX_pain     |
| tx_the | tx_patient  | tx_denies | tx_chest    | tx_pain     |
|        |             |           | POL_NEG     | POL_NEG     |
|        | CUI_1550655 |           | CUI_817096  | CUI_30193   |
|        | SMG_Finding |           | SMG_Anatomy | SMG_Finding |
|        |             |           | CUI_8031    |             |
|        |             |           | SMG_Finding |             |
As you can see, there are a lot of tokens per word, and tokens have prefixes. These prefixes are:
TX_ - Tokens with this prefix exactly match the case of the word they correspond to.

tx_ - Tokens with this prefix are lowercased. This allows for case-insensitive matching, since searching for the token tx_abc would find the words "ABC", "abc", or even "aBc" in documents.

POL_ - Tokens with this prefix indicate "polarity", meaning whether or not the words are in a semantically negated part of the sentence (such as the object of the verb "denies"). Only the token POL_NEG currently has this prefix. The absence of this token is usually enough for a "positive" mention.

CUI_ - Tokens with this prefix indicate that the corresponding word is part of a semantic concept in the UMLS given by the given id. "CUI" stands for concept unique identifier.

SMG_ - Tokens with this prefix indicate groupings of UMLS CUIs.

TG_ - Tokens with this prefix indicate who the words are referring to. The only real token with this prefix is TG_FAMILY, meaning the words are referring to the patient's family as opposed to the patient themself. For instance, if a document said "The patient's mother had breast cancer", the words "breast cancer" would get TG_FAMILY tokens at their positions.

CT_ - Tokens with this prefix indicate with what certainty the words are affirmed. The only real token with this prefix is CT_UNCERTAIN.
Terms of your query are just these tokens. Thus, searching for tx_the will find all documents containing that token. However, if you only ever want the tx_ prefix, you can use the PLUCENE query parser, which does the prepending for you. You can activate that parser by passing the defType=PLUCENE parameter in your search. (There is an example below.) We'll assume we're using this query parser most of the time, but our examples should also work with the CUI_ and SMG_ prefixes. The other prefixes aren't easily accessible right now.
A term can be prefixed with a plus sign to require that the term appear in the document, or a minus sign to require that it not appear. Unprefixed terms are optional. If there are no required terms, then at least one optional term must match. Otherwise, optional terms only change the ranking of a result. (Results are ordered from most matches to least, weighting each match using TF-IDF ranking or something similar.)
Finally, parentheses create sub-queries, which can implement something close to boolean logic.
Here are some example queries, which may or may not match the above document.
- q=industrial
  Matches.
- q="industrial chokeberry"
  Matches.
- q="industrial pomace"
  Does not match, since the two words are not adjacent, and this is a single phrase.
- q=industrial pomace
  Matches, since merely one term is required.
- q=+industrial +pomace
  Matches, because both terms appear.
- q=AGE_DAYS:61
  Matches.
- q=AGE_DAYS:12
  Does not match.
- q=AGE_DAYS:12 pomace
  Matches, since only one term is required and pomace appears.
- q=CATEGORY:278
  Matches. You cannot make a query requiring that a document contain a specific (ordered) list of values.
- q=DOC_TYPE:"Consent Form"
  Matches.
- q=DOC_TYPE:Consent
  Does not match. Only the RPT_TEXT field (or whatever field stores the main body of text of the document) is tokenized, meaning searches match on the words internal to the text. Other fields are generally not tokenized, meaning the whole string must match the phrase, with spacing and other special characters appearing in the phrase exactly as they are in the field of the document. Of course, your administrators may set up certain fields to be tokenized (such as doctors' names), so consult the administrators of your installation to determine whether a field is tokenized.
- q=(DOC_TYPE:"abc" DOC_TYPE:"def") pomace
  Matches, since pomace appears. This query would match any document which contains one of the three terms.
- q=+(DOC_TYPE:"abc" DOC_TYPE:"def") pomace
  Does not match, since the subquery must match, and in order to match it, DOC_TYPE must be either abc or def. This query would match any document which has either value for the DOC_TYPE field. If the word pomace appears, it only increases the ranking.
- q=+(DOC_TYPE:"abc" DOC_TYPE:"def") +(industrial pomace)
  This query would match any document which has either value for the DOC_TYPE field and contains either the word industrial or pomace.
- q="main focus" (+(DOC_TYPE:"abc" DOC_TYPE:"def") +(industrial pomace))
  This query would match any document that contains the phrase "main focus", or any document which matches the previous query. This is because the "top-level" query just has two optional subqueries, so only one needs to match. The first subquery is just a phrase term, and the second is exactly the query given above. In other words, the pluses and minuses are "scoped" to the parenthetical they appear in.
Note that you must percent-encode your query, so if your query is +industrial +pomace, you must encode it as %2bindustrial%20%2bpomace when sending it as the argument to q=. There is some leniency in what you send; just know that it will be percent-decoded before use. In particular, the + sign decodes to a space, which is why you must encode it as %2b. The space character, on the other hand, doesn't really need to be percent-encoded, since the query goes into the body of a POST rather than an actual URL, and a literal space decodes to itself, so it is safe. A full request for such a query would look like:
POST /emerse/springmvc/api/search HTTP/1.1
Host: localhost:8080
Authorization: Basic ZW1lcnNlOmRlbW91c2Vy
Content-Length: 49
Content-Type: application/x-www-form-urlencoded

defType=PLUCENE&rows=10&q=%2bindustrial %2bpomace
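Rather than percent-encoding by hand, you can let Python's standard library build the body. Note that urlencode emits the uppercase hex form %2B and encodes the space as +, both of which decode the same way as the hand-encoded versions above:

```python
from urllib.parse import urlencode

body = urlencode({'defType': 'PLUCENE', 'rows': 10, 'q': '+industrial +pomace'})
print(body)  # defType=PLUCENE&rows=10&q=%2Bindustrial+%2Bpomace
```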
A typical response from Solr will look like the following:
{
  "responseHeader":{
    "status":0,
    "QTime":18,
    "params":{
      "q":"*:*",
      "rows":"2"
    }
  },
  "response":{
    "numFound":635861,
    "start":0,
    "numFoundExact":true,
    "docs":[
      {
        "ID":"docid_3_73635",
        "LAST_UPDATED":"1996-03-20T05:00:00Z",
        "ENCOUNTER_DATE":"1996-03-17T05:00:00Z",
        "DOC_TYPE":"Consent Form",
        "DEPT":"Neurosurgery",
        "SOURCE":"source3",
        ...
      },
      {
        "ID":"docid_3_88677",
        "LAST_UPDATED":"1996-03-21T05:00:00Z",
        "ENCOUNTER_DATE":"1996-03-18T05:00:00Z",
        "DOC_TYPE":"Consent Form",
        "SVC":"General",
        "DEPT":"Emergency Medicine",
        "CLINICIAN":"Page, Dixie",
        "SOURCE":"source2",
        ...
      },
      ...
    ]
  }
}
Coding
Ultimately, you'll want to write code that makes these queries and processes the results. You can use any programming language to do so, as long as it can speak HTTP.

Here is an example Python program that queries a hypothetical EMERSE instance. This code is for demonstration purposes and may not work exactly as written, though it is based on an actual Python script that does run:
import json
import ssl
from http.client import HTTPSConnection

# Fill in these variables with your username and password,
# or obtain them some other way...
username = ...
password = ...

sslContext = ssl.create_default_context()
# At UM, we have a custom SSL certificate authority which is often not
# recognized. We can skip verification of the cert like this:
sslContext.check_hostname = False
sslContext.verify_mode = ssl.CERT_NONE

client = HTTPSConnection('example.org', 443, context=sslContext)

# Make the login request to get the session cookie:
body = json.dumps({'username': username, 'password': password}).encode('utf-8')
client.putrequest('POST', '/emerse/login')
client.putheader('Content-Type', 'application/json')
client.putheader('Content-Length', str(len(body)))
client.endheaders()
client.send(body)
with client.getresponse() as httpResponse:
    httpResponse.read()  # drain the body so the connection can be reused
    rawCookie = httpResponse.getheader('set-cookie')
    cookie = rawCookie[:rawCookie.index(';')]

# Make the search request, sending the session cookie back:
query = 'q=*:*&rows=2'.encode('ascii')
client.putrequest('POST', '/emerse/springmvc/api/search')
client.putheader('Cookie', cookie)
client.putheader('Content-Type', 'application/x-www-form-urlencoded')
client.putheader('Content-Length', str(len(query)))
client.endheaders()
client.send(query)
with client.getresponse() as httpResponse:
    response = json.loads(httpResponse.read())

# Iterate through the documents found
for doc in response['response']['docs']:
    print(doc['ID'])
    print(doc['RPT_TEXT'])