Overview
This guide describes the EMERSE API, an HTTP-based programmatic interface to EMERSE's third-party search engine, Solr. The main use case of the API is to access the documents stored in Solr for machine learning. The main reason to use the EMERSE API instead of granting direct access to Solr is that EMERSE performs authentication, authorization, and audit logging on the queries sent to Solr. In all other ways, the EMERSE API is a pass-through to Solr.
Any user of the EMERSE API will need an account in EMERSE with the API_ACCESS privilege granted. Once that privilege is granted, the user will be able to make a POST to /emerse/springmvc/api/search. The body of the POST should be the query parameters to send to Solr, so the body effectively has a content type of application/x-www-form-urlencoded, though the Content-Type header is not checked and doesn't need to be passed. The response is an exact copy of what Solr sent. Only the JSON format is supported: asking Solr to use a different response writer with &wt will likely cause a 500, since we internally parse the JSON response and record the MRNs and document IDs returned in the audit log.
Authentication
Users must also authenticate as part of their POST. This can be done using basic authentication or cookies. To use basic auth, send the `Authorization` header containing `Basic` followed by the base64 encoding of your username and password separated by a colon, as per RFC 7617.
Such a request may look like the following, where we've used the username emerse and password demouser; the text ZW1lcnNlOmRlbW91c2Vy is the base64 encoding of emerse:demouser. We also sent the Solr search parameters q=*:*&rows=0 to search for all documents while returning none of them, in order to get a count of all documents in the index.
POST /emerse/springmvc/api/search HTTP/1.1
Host: localhost:8080
Authorization: Basic ZW1lcnNlOmRlbW91c2Vy
Content-Length: 12
Content-Type: application/x-www-form-urlencoded
q=*:*&rows=0
A response to this request could look like:
HTTP/1.1 200
Content-Type: application/json;charset=UTF-8
Content-Length: 213
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"*:*",
      "rows":"0"
    }
  },
  "response":{
    "numFound":635861,
    "start":0,
    "numFoundExact":true,
    "docs":[ ]
  }
}
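If you're scripting these requests, you don't need to compute the base64 text by hand. The following sketch builds the header value with Python's standard library and reproduces the one shown above:

```python
from base64 import b64encode

def basic_auth_header(username, password):
    # RFC 7617: "Basic " + base64("username:password")
    token = b64encode(f'{username}:{password}'.encode('utf-8')).decode('ascii')
    return 'Basic ' + token

print(basic_auth_header('emerse', 'demouser'))  # Basic ZW1lcnNlOmRlbW91c2Vy
```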
To use cookie-based authentication, issue a POST to /emerse/login with the JSON body {"username":"USERNAME","password":"PASSWORD"}, replacing the uppercase words with your actual username and password. Then, send back the JSESSIONID cookie from the Set-Cookie response header on subsequent requests.
POST /emerse/login HTTP/1.1
Host: localhost:8080
Accept: */*
Content-Length: 43
Content-Type: application/json
{"username":"emerse","password":"demouser"}
A possible response would be:
HTTP/1.1 200
Set-Cookie: JSESSIONID=B7F7756715D9165F7ABC26FA0E67D4F4; Path=/emerse; HttpOnly
Content-Type: application/json;charset=UTF-8
Content-Length: 17
{ "success":true}
Then subsequent requests should look like:
POST /emerse/springmvc/api/search HTTP/1.1
Host: localhost:8080
Cookie: JSESSIONID=B7F7756715D9165F7ABC26FA0E67D4F4
Content-Length: 12
Content-Type: application/x-www-form-urlencoded
q=*:*&rows=0
Don’t use the basic authentication header on the /emerse/login page; using basic auth prevents a session (and hence cookie) from being used.
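When scripting the login, the value to send back in the Cookie header is everything in Set-Cookie before the first semicolon. A minimal sketch of that extraction:

```python
def session_cookie(set_cookie_header):
    # Keep only the "JSESSIONID=..." pair; the attributes after the first
    # semicolon (Path, HttpOnly, ...) are not sent back to the server.
    return set_cookie_header.split(';', 1)[0]

print(session_cookie('JSESSIONID=B7F7756715D9165F7ABC26FA0E67D4F4; Path=/emerse; HttpOnly'))
# JSESSIONID=B7F7756715D9165F7ABC26FA0E67D4F4
```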
Querying
Since the API is a pass-through to Solr, you should consult Solr's documentation on how to query. However, we will go over some basics here.
First, EMERSE has two collections of "documents". The DOCUMENT collection holds the medical notes, and the PATIENT collection holds "documents" each representing a patient. You can query either, and EMERSE has a special query parameter &collection= that allows you to specify which, defaulting to the DOCUMENT collection if not specified. Collections are also called "indexes" in Solr parlance.
A document in Solr is like a JSON object. Each field holds a single value or, rarely, an array of values (but never an array of arrays). Values are text, an ISO-8601 formatted date in the UTC timezone (a string in JSON), a number, or a boolean. For example, here is a document from the PATIENT collection of the EMERSE demo system:
{
  "ID":"7195",
  "MRN":"100007195",
  "FIRSTNAME":"Amari",
  "LASTNAME":"Christian",
  "BIRTHDATE":"1938-06-11T05:00:00Z",
  "SEX_CD":"F",
  "RACE_CD":"B",
  "ETHNICITY_CD":"Pt Refused",
  "RELIGION_CD":"U",
  "LANGUAGE_CD":"U",
  "MARITAL_STATUS_CD":"U",
  "ZIP_CD":"",
  "DECEASED_FLAG":false,
  "DELETED_FLAG":false
}
and here is a document from the DOCUMENT collection of the EMERSE demo system:
{
  "ID":"docid_3_73635",
  "LAST_UPDATED":"1996-03-20T05:00:00Z",
  "ENCOUNTER_DATE":"1996-03-17T05:00:00Z",
  "DOC_TYPE":"Consent Form",
  "DEPT":"Neurosurgery",
  "SOURCE":"source3",
  "RPT_TEXT":"<html><body>Industrial chokeberry pomace...",
  "MRN":"100004264",
  "CSN":"41770",
  "CATEGORY":["278","279","280"],
  "AGE_MONTHS":2,
  "AGE_DAYS":61,
  "_version_":1793068510900912128
}
The fields of the PATIENT index won't change, but each institution may change the number, names, and types of the fields in the DOCUMENT index, so to properly query Solr, users of the API will need to ask your administrators which fields hold what information. Currently, there is no way to determine the field names from the EMERSE UI, though the values of a field can be seen in the filter for the field, if it's set up to show the raw Solr values.
There are three major query parameters every user will probably use:

- &q= gives the query to Solr. Conceptually, Solr will construct an ordered list of all documents that match this query.
- &start= gives the index of the first document to return from the conceptual list of results built by Solr. The first document is at index zero.
- &rows= gives the number of documents to send back. This defaults to 10 if unspecified.

Thus, by passing &start= and &rows= you can paginate through the result set. If you don't want to paginate, you can set &rows= to a very high number, such as 1 billion, to get all results, but this might place a large load on Solr.
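The pagination described above amounts to advancing &start= by &rows= until numFound is exhausted. Here is a small sketch that only builds the form-encoded body for each page; actually POSTing each body to the API is elided:

```python
from urllib.parse import urlencode

def page_bodies(num_found, rows=100):
    # Yield one form-encoded request body per page of the result set.
    for start in range(0, num_found, rows):
        yield urlencode({'q': '*:*', 'start': start, 'rows': rows})

for body in page_bodies(250, rows=100):
    print(body)
# q=%2A%3A%2A&start=0&rows=100
# q=%2A%3A%2A&start=100&rows=100
# q=%2A%3A%2A&start=200&rows=100
```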
Query Syntax
The query syntax that goes into the &q= parameter is thoroughly described on Solr's website. However, as a short introduction: a query consists of a collection of terms, each term being an optional field prefix followed by a quoted phrase. The phrase doesn't need quotes if it doesn't contain a space or other special characters, and the term doesn't need a field prefix if it targets the default field, which is RPT_TEXT. Such a term might look like cancer, which is short for RPT_TEXT:cancer.
The text of a document is broken up into a series of words, defined as sequences of letters or digits separated by punctuation, whitespace, etc., including hyphens and apostrophes. The exact rules for what delimits a word are a bit complex. Each word has one or more tokens associated with it. For instance, the sentence:
The patient denies chest pain.
generates the tokens in the following table. The first row is the sentence itself, for reference. The second row and beyond are the tokens generated. Tokens are positioned at words in the document, so tokens in the same column are positioned at the word in the top row. For instance, the token TX_The is positioned where the word "The" is in the sentence.
| The    | patient     | denies    | chest       | pain        |
| TX_The | TX_patient  | TX_denies | TX_chest    | TX_pain     |
| tx_the | tx_patient  | tx_denies | tx_chest    | tx_pain     |
|        |             |           | POL_NEG     | POL_NEG     |
|        | CUI_1550655 |           | CUI_817096  | CUI_30193   |
|        | SMG_Finding |           | SMG_Anatomy | SMG_Finding |
|        |             |           | CUI_8031    |             |
|        |             |           | SMG_Finding |             |
As you can see, there are a lot of tokens per word, and tokens have prefixes. These prefixes are:
TX_ - Tokens with this prefix exactly match the case of the word they correspond to.

tx_ - Tokens with this prefix are lowercased. This allows for case-insensitive matching, since searching for the token tx_abc would find the words "ABC", "abc", or even "aBc" in documents.

POL_ - Tokens with this prefix indicate "polarity", meaning whether or not the words are in a semantically negated part of the sentence (such as the object of the verb "denies"). Only the token POL_NEG currently has this prefix. The absence of this token is usually enough for a "positive" mention.

CUI_ - Tokens with this prefix indicate that the corresponding word is part of a semantic concept in the UMLS given by the given id. "CUI" stands for concept unique identifier.

SMG_ - Tokens with this prefix indicate groupings of UMLS CUIs.

TG_ - Tokens with this prefix indicate who the words are referring to. The only real token with this prefix is TG_FAMILY, meaning the words are referring to the patient's family as opposed to the patient themself. For instance, if a document said "The patient's mother had breast cancer", the words "breast cancer" would get TG_FAMILY tokens at their positions.

CT_ - Tokens with this prefix indicate with what certainty the words are affirmed. The only real token with this prefix is CT_UNCERTAIN.
Terms of your query are just these tokens. Thus, searching for tx_the will find all documents containing that token. However, if you only ever want the tx_ prefix, you can use the PLUCENE query parser, which does the prepending for you. You can activate that parser by passing the defType=PLUCENE parameter in your search. (There is an example below.) We'll assume we're using this query parser most of the time, but our examples should also work with the CUI_ and SMG_ prefixes. The other prefixes aren't easily accessible right now.
A term can be prefixed with a plus sign to require that the term appear in the document, or a minus sign to require that it not appear. Unprefixed terms are optional. If there are no required terms, then at least one optional term must match. Otherwise, optional terms only change the ranking of a result. (Results are ordered from most matches to least, weighting each match using TF-IDF ranking or something similar.)
Finally, parentheses create sub-queries, which can implement something close to boolean logic.
Here are some example queries, which may or may not match the above document.
- q=industrial
  Matches.
- q="industrial chokeberry"
  Matches.
- q="industrial pomace"
  Does not match, since the two words are not adjacent, and this is a single phrase.
- q=industrial pomace
  Matches, since merely one term is required.
- q=+industrial +pomace
  Matches, because both terms appear.
- q=AGE_DAYS:61
  Matches.
- q=AGE_DAYS:12
  Does not match.
- q=AGE_DAYS:12 pomace
  Matches, since only one term is required and pomace appears.
- q=CATEGORY:278
  Matches. You cannot make a query requiring that a document contain a specific (ordered) list of values.
- q=DOC_TYPE:"Consent Form"
  Matches.
- q=DOC_TYPE:Consent
  Does not match. Only the RPT_TEXT field (or whatever field stores the main body of text of the document) is tokenized, meaning searches match on the words internal to the text. Other fields are generally not tokenized, meaning the whole string must match the phrase, with spacing and other special characters appearing in the phrase exactly as they are in the field of the document. Of course, your administrators may set up certain fields to be tokenized (such as doctors' names), so consult the administrators of your installation to determine whether a field is tokenized.
- q=(DOC_TYPE:"abc" DOC_TYPE:"def") pomace
  Matches, since pomace appears. This query would match any document which contains one of the three terms.
- q=+(DOC_TYPE:"abc" DOC_TYPE:"def") pomace
  Does not match, since the subquery must match, and in order to match it, DOC_TYPE must be either abc or def. This query would match any document which has either value for the DOC_TYPE field. If the word pomace appears, it only increases the ranking.
- q=+(DOC_TYPE:"abc" DOC_TYPE:"def") +(industrial pomace)
  This query would match any document which has either value for the DOC_TYPE field and contains either the word industrial or pomace.
- q="main focus" (+(DOC_TYPE:"abc" DOC_TYPE:"def") +(industrial pomace))
  This query would match any document that contains the phrase "main focus", or any document which matches the previous query. This is because the "top-level" query just has two optional subqueries, so only one needs to match. The first subquery is just a phrase term, and the second is exactly the query given above. In other words, the pluses and minuses are "scoped" to the parenthetical they appear in.
Note that you must percent-encode your query, so if your query is +industrial +pomace, you must encode it as %2bindustrial%20%2bpomace when sending it as the argument to q=. There is some leniency in what you send; just know that it will be percent-decoded before use. In particular, the + sign decodes to a space, which is why you must encode it as %2b. The space character, on the other hand, doesn't really need to be percent-encoded, since the query goes into the body of a POST rather than an actual URL, and a literal space decodes to itself, so it is safe. A full request for such a query would look like:
POST /emerse/springmvc/api/search HTTP/1.1
Host: localhost:8080
Authorization: Basic ZW1lcnNlOmRlbW91c2Vy
Content-Length: 49
Content-Type: application/x-www-form-urlencoded

defType=PLUCENE&rows=10&q=%2bindustrial %2bpomace
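Rather than percent-encoding by hand, you can let Python's standard library build the body. Note that urlencode emits the uppercase hex form %2B and encodes the space as +, both of which decode the same way as the hand-encoded versions above:

```python
from urllib.parse import urlencode

body = urlencode({'defType': 'PLUCENE', 'rows': 10, 'q': '+industrial +pomace'})
print(body)  # defType=PLUCENE&rows=10&q=%2Bindustrial+%2Bpomace
```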
A typical response from Solr will look like the following:
{
  "responseHeader":{
    "status":0,
    "QTime":18,
    "params":{
      "q":"*:*",
      "rows":"2"
    }
  },
  "response":{
    "numFound":635861,
    "start":0,
    "numFoundExact":true,
    "docs":[
      {
        "ID":"docid_3_73635",
        "LAST_UPDATED":"1996-03-20T05:00:00Z",
        "ENCOUNTER_DATE":"1996-03-17T05:00:00Z",
        "DOC_TYPE":"Consent Form",
        "DEPT":"Neurosurgery",
        "SOURCE":"source3",
        ...
      },
      {
        "ID":"docid_3_88677",
        "LAST_UPDATED":"1996-03-21T05:00:00Z",
        "ENCOUNTER_DATE":"1996-03-18T05:00:00Z",
        "DOC_TYPE":"Consent Form",
        "SVC":"General",
        "DEPT":"Emergency Medicine",
        "CLINICIAN":"Page, Dixie",
        "SOURCE":"source2",
        ...
      },
      ...
    ]
  }
}
Coding
Ultimately, you'll want to write code that makes these queries and processes the results. You can use any programming language to do so, as long as it can speak HTTP.

Here is an example Python program that queries a hypothetical EMERSE instance. This code is for demonstration purposes and may not work exactly as written, though it is based on an actual Python script that does run:
import json
import ssl
from http.client import HTTPSConnection

# Fill in these variables with your username and password,
# or obtain them some other way...
username = ...
password = ...

sslContext = ssl.create_default_context()
# At UM, we have a custom SSL certificate authority which is often not
# recognized. We can skip verification of the cert like this:
sslContext.check_hostname = False
sslContext.verify_mode = ssl.CERT_NONE

client = HTTPSConnection('example.org', 443, context=sslContext)

# Make the login request to get the session cookie:
body = json.dumps({'username': username, 'password': password}).encode('utf-8')
client.putrequest('POST', '/emerse/login')
client.putheader('Content-Type', 'application/json')
client.putheader('Content-Length', str(len(body)))
client.endheaders()
client.send(body)
with client.getresponse() as httpResponse:
    httpResponse.read()  # drain the body so the connection can be reused
    rawCookie = httpResponse.getheader('set-cookie')
    cookie = rawCookie[:rawCookie.index(';')]

# Make the search request, sending the session cookie back:
query = 'q=*:*&rows=2'.encode('ascii')
client.putrequest('POST', '/emerse/springmvc/api/search')
client.putheader('Cookie', cookie)
client.putheader('Content-Type', 'application/x-www-form-urlencoded')
client.putheader('Content-Length', str(len(query)))
client.endheaders()
client.send(query)
with client.getresponse() as httpResponse:
    response = json.loads(httpResponse.read())

# Iterate through the documents found
for doc in response['response']['docs']:
    print(doc['ID'])
    print(doc['RPT_TEXT'])