EMERSE Networking Guide

Overview

This guide describes topics related to the networking of EMERSE instances across sites, including basic operation and configuration. In general, there will be no need to set up a hub; only local configuration of an existing EMERSE instance is required to join the network. A new hub instance is only needed for a group of EMERSE sites that want to run their own network.

Importantly, to join the network, nothing else needs to be installed. Everything required is already built-into the baseline EMERSE system. Further, joining the network is completely optional. If a site chooses not to join the network that site can still use the basic functionality of EMERSE. The network capability is turned off by default, and can only be turned on by a local system administrator (with some additional configuration required, and described in this guide).

Background

EMERSE was initially designed to be a standalone system at a single site. With the introduction of networking, it is now possible to connect these sites to form an EMERSE network. However, for reasons of trust, most functions of EMERESE cannot be used cross-site. Currently, the most a user can do is run an search against all patients of another site and get an obfuscated count. However, even this small degree of functionality can be helpful.

A site does not have to join any EMERSE network, and it could join many or host its own network (by setting up a hub). In addition, each site has control over how its information is obfuscated.

Network Architecture

Rather than being a completely de-centralized network, EMERSE networks operate using a hub-and-spoke model. This makes it easier to operate and reduces the number of connections required as more sites are added to the network. Each site can connect to multiple hubs, and all sites connected to a hub are connected to one another through the hub (although any site can still limit access to another site, or even specific users, with the Firewall, described below.) Connecting to a hub requires only an outgoing connection, so no special firewall rules need to be added to join a network, assuming the server can reach the Internet. However, to setup a network hub, incoming connections must be allowed, which might require changes to firewall rules and NAT rules. A typical site participating in the network would not need to operate a network hub.

Figure 1. Conceptual diagram of the hub-and-spoke model of the EMERSE network.

The hub consists of three parts: a message broker called Apache ActiveMQ, a piece of custom software called the Query Coordinator, and a database. EMERSE at each site connects to ActiveMQ, as does the Query Coordinator, and they communicate with one another through messages sent to the broker. The Query Coordinator stores in the database some centralized auditing of messages, and some caches of information from the sites, such as total patient count, or whether the site appears to be "online."

Communication between EMERSE instances and the Network Hub is done over a secure, encrypted TLS connection (even though no PHI travels across the network), using the OpenWire Protocol. A connection can only be made from an EMERSE instance outbound to the Network Hub, and not the other way around. Not requiring inbound connections helps ensure security since the EMERSE instance does not accept connections from the outside world. The connection between the EMERSE instance and the Network Hub can persist indefinitely because the OpenWire protocol supports a lightweight 'Keep-Alive Mechanism' wherein the client and server regularly send small 'keep-alive' messages or 'heartbeats' back and forth to each other. These signals check whether the connection is still alive and functioning. If a connection is lost, the EMERSE instance can detect the loss of a connection and attempt to re-connect to the Hub.

Examples

The following are screen shots of the networking functionality from the perspective of a user to show the type of results that would be seen.

Figure 2. Results from a network search involving one other site for the term "myocarditis". In this demonstration example, both sites (nodes) have the exact same test data in the system, but the numbers reported differ to due obfuscation of the results.

Network Results for Nephrectomy and Myocarditis

Figure 3. Results from a network search involving one other site for the terms "nephrectomy" and "myocarditis". In this demonstration example, both sites (nodes) have the exact same test data in the system, but these terms do not appear together often. In fact, in both systems the total count is below the allowable thresholds. Note that one site has a threshold of <15 and another has a threshold of <20. Each site can independently set their own threshold for not reporting back small counts. (Very small counts could potentially identify individuals even though this is banned by the Terms of Use).

Security Considerations

Only counts are reported across the network to other sites.

No documents, free text, patient identifiers, or any other protected health information (PHI) is ever transmitted across the network.

The network has multiple safeguards, both technical and administrative, that are described here at a high-level. Additional details, including configuration of some of these features, are described later in this guide.

High level overview of security features:

No PHI is ever sent across the network
Only obfuscated counts are sent across the network
Even though no PHI is sent across the network, all traffic between sites is encrypted
To ensure attackers don’t try to directly connect to the message broker, or that one site doesn’t attempt to impersonate another, each site authenticates to the broker by a username and password, over a secure TLS connection
Each site has local control over the type and degree of obfuscation using configuration settings
Each site controls the threshold under which small counts from their site are not reported (e.g., small counts might be reported as "less than 10")
The hub limits the number of queries a single user can make in a defined time window
All queries are audited and stored locally with each participating site; these audit logs can be viewed within the Admin application
Query response times are randomized to prevent estimating count sizes based on response times
Each site has access their own Firewall within the admin app that enables blocking a single user from another site or even another site overall from querying the local system
The network architecture, the query coordinator, and EMERSE itself have undergone multiple rounds of security assessments including code reviews, penetration testing, vulnerability scans, and more
All users must agree to the Terms of Use for each user session in which the Network is used; this is built into the user-interface and cannot be bypassed. These terms of use include:

1. Conduct searches only for valid purposes which at this time includes estimating cohort sizes and demonstrating the network capabilities to others
2. Never attempt to identify individuals
3. Never use the counts to gain a "market" advantage over other Participating  sites
4. Never share your login credentials with others

Audit Logs

Each site’s EMERSE instance tracks what queries are run by which network users (both the user name and the site that user is from), plus the count reported back to them. This is kept in the table SEARCH_AUDIT, with the REMOTE_USER and NODE_INDICATOR columns filled out.

Audit logs can also be viewed from within the Admin application.

Figure 4. The Network Audit logs, as seen in the admin application. From these logs it is possible to view the site, the user who ran the query, whether the query was incoming from another site or outgoing from the local site, the results (counts) returned, and the terms used.

Firewall

The EMERSE Admin application also includes a firewall feature to give absolute and granular control to each site (node) regarding who has access to the network feature. The firewall allows an administrator to block incoming traffic from a specific user at a specific site, or even an entire site. It also permits the administrator to block a local user from querying a specific node in the network, or from using the network at all. Note that blocking local users from the network functionality can also be done at a higher level using the Privileges feature that is connected to Roles. Changes to the firewall rules take effect immediately, since there is no need for this information to be transmitted to other sites on the network.

While we do not anticipate the firewall from ever being needed, it does empower a site to feel comfortable that they are in full control over access and can turn aspects of the network off immediately if, for example, the logs reveal suspicious activity from a specific user.

Figure 5. The Network Firewall in the admin application. Both incoming and outgoing network traffic can be blocked at a granular (user) level, or at the level of a site.

When blocking a specific user, that user’s ID must be entered by the administrator. For external users (users at another site), the user ID is validated against the user IDs in the audit logs. For local users, the user ID is validated against the local user table in the database. Note that each instance of EMERSE only has access to its own local user information—the only way to know about a user from another site is to look through the audit logs for those who have already queried the network.

Configuration

Starting with version 6.0, all EMERSE instances are capable of connecting to networks. To configure a network, create a file named emerse-network.properties in the home directory of the user running the EMERSE application, or set the system property emerse.network.properties.filepath to the location of the properties file. See the installation guide for how to set system properties.

This properties file configures all networks the EMERSE instance connects to. For each network, choose a short identifier for the network, such as UM if you are connecting to the network hosted by University of Michigan. This will be a prefix of the property names that configure the network in the emerse-network.properties file. (XXX is a placeholder in this documentation for the network identifier you chose.) This prefix has no meaning other than to group the properties.

XXX.name: This is the name of the network as it will appear in the UI. Currently, there is no place in the UI where network names appear, but that may change in the future.

To configure the name of your site, as it appears in the list of networked sites, you can set a property in emerse.properties:

emerse.properties

node.name: The name of your node in the network in the network page. By default, this is Your Site.

Configuring the Connection

The central hub runs an instance of a message broker called ActiveMQ. It is this broker that EMERSE actually connects to to communicate with the other instances on the network. All of this information should be given to you by the hub owner.

XXX.node.id: This is the name of this EMERSE instance on this network. This should be the same as the username below, though this restriction may change in the future.
XXX.url: This is the URL of the ActiveMQ instance that the hub is running. It should be typcially of the form ssl://hostname:port.
XXX.username: This is the username your EMERSE instances uses to authenticate to the network.
XXX.password: The password used to authenticate this EMERSE instance to the network.

Term bundles can be shared across the network. When network bundle sharing is enabled on at least one network, it will alter the bundle sharing screen to add an option to share the bundle with all sites on all networks where network bundle sharing is enabled, and also allow you to share the bundle with individual people from the sites of that network which have bundle sharing enabled to. If you disable bundle sharing on a network, other sites on that network cannot share bundles with you, nor can you share them with them.

XXX.bundles.enabled: true (the default) to enable bundles for the network, or false to disabled it for the network.

Configuring Queues

The following properties configure what queues this EMERSE instance should listen to and send requests to so it can communicate with the query coordinator.

XXX.emerse.in: This is the name of the queue that all EMERSE instances on the network listen on to receive requests. In other words, it’s EMERSE’s "inbox." Which EMERSE instance a message is "to" is determined by the nodeId header on the message.
XXX.emerse.out: This is the name of the queue that all EMERSE instances use to send responses to requests. In other words, it’s EMERESE’s "outbox." Only the query coordinator dequeues messages from this queue. Which EMERES instance a message is "from" is determined by the JMSXUserID header on the message.

In all cases, a node in the network is identified by it’s XXX.node.id which should be the same as the username of its account in ActiveMQ.

Count Obfuscation

Your EMERSE instance by default is configured to report an obfuscated count of the patients that match queries originating from the network, meaning a random number chosen around the exact number. This prevents a user from tweaking a query to positively identify a patient based on an exact result returned.

There are a number of ways to customize or disable this obfuscation. The first is that counts below a certain threshold can be merely reported as "less than N" for a number N that you configure. Second, you can choose the type and parameters of the random distribution used to draw a number around the exact count. This means you can have very wide spread of numbers reported for a single query, or a very narrow spread. Finally, you can round the reported number to the nearest multiple of M, for a number M you configure.

Much of the approach for obfuscation was modeled after other systems such as the i2b2 Workbench.

The general code for obfuscation looks like the following:

	public int obfuscate(int count)
	{
		double sample = sample() - mean();

		double obfuscatedValue = count + sample;
		if (obfuscatedValue <= zeroThreshold)
		{
			return 0;
		}
		long roundedValue = Math.round(obfuscatedValue / roundToNearest) * roundToNearest;
		if (roundedValue <= 0)
		{
			return 0;
		}
		return (int) roundedValue;
	}

For the code above, sample() generates a random number from the configured distribution, and mean() returns the mean of that distribution. Details about the options for selection a distribution are below.

For the properties defined below, the XXX represents the name of an individual site in the network, and each site will have its own set of obfuscations that can be configured, in case that differs between sites.

XXX.obfuscate.count.zeroThreshold	This is the lower limit for which no count will be returned. If this parameter is not defined, it will default to 10. Example: XXX.obfuscate.count.zeroThreshold=10
XXX.obfuscate.count.roundToNearest	Round counts to the nearest multiple of this number. By default this is one (so effectively no rounding is performed). Example: XXX.obfuscate.count.roundToNearest=1
XXX.obfuscate.count.distribution	This is the statistical distribution for how the obfuscation will be made. The default if no parameter is defined is `normal`. Example: XXX.obfuscate.count.distribution=normal

Additional obfuscation options related to the distribution are defined in the table below.

Distribution value	Description	Additional Parameters	Parameter description
normal	The normal distribution	XXX.obfuscate.count.distribution.normal.s=2	The standard deviation
binomial	The binomial distribution	XXX.obfuscate.count.distribution.binomial.n=6 XXX.obfuscate.count.distribution.binomial.p=0.5	Number of "trials", a choice between 0 or 1. Probability of a "1"
uniform	The uniform distribution	XXX.obfuscate.count.distribution.uniform.scale=6	This is the uniform distribution on [0, S], where S is this parameter.
disabled	A uniform distribution on [0,0]	(none)	This distribution effectively disables the random variation of results.

Distribution value

Description

Additional Parameters

Parameter description

normal

The normal distribution

XXX.obfuscate.count.distribution.normal.s=2

The standard deviation

binomial

The binomial distribution

XXX.obfuscate.count.distribution.binomial.n=6
XXX.obfuscate.count.distribution.binomial.p=0.5

Number of "trials", a choice between 0 or 1.
Probability of a "1"

uniform

The uniform distribution

XXX.obfuscate.count.distribution.uniform.scale=6

This is the uniform distribution on [0, S], where S is this parameter.

disabled

A uniform distribution on [0,0]

(none)

This distribution effectively disables the random variation of results.

The distributions are the standard statistical distributions. We "sample" from the distribution, meaning we generate a random number with the probability described by the distribution, and we modify the actual count based on that random number.

Response Time Obfuscation

It may be possible to glean some information from the amount of time it takes for a query to respond. For instance, if a query cannot possibly match because a term queried for doesn’t appear in the index, then the response to the query my be received very quickly, but if the term does appear, it may take much longer. This may then be used to infer that the count is zero versus non-zero, even though the count is itself reported the same after obfuscation (as less than ten say).

To prevent this, a response is delivered at least after some small random amount of time. This is less configurable than the count obfuscation; it simply waits a random amount of time uniformly randomly picked between a configurable min and max delay.

XXX.obfuscate.time.minDelayMillis	The minimum number of milliseconds to wait before responding. Default is zero.
XXX.obfuscate.time.maxDelayMillis	The maximum number of milliseconds to wait before responding. Default is 1000 (a single second).

Query Limits

EMERSE throttles requests from users on the network to prevent them from executing a large number of queries to try to deduce the exact count for a query result. This can be specified using two parameters which should be configured in the usual emerse.properties file, not the network properties file.

Namely, when a request from a user comes in, it will be rejected if in the last X minutes, the number of queries run by that user (from that remote site) is at or exceeding the threshold configured.

Queries that are rejected are not logged in the audit system at this point.

remoteUserQueryThreshold	This defines the total number of queries a user can execute at the local site within a certain time frame, which is defined using `remoteUserQueryIntervalInMins`. Default is 10. Example: remoteUserQueryThreshold=10
remoteUserQueryIntervalInMins	This defines the time span in which the `remoteUserQueryThreshold` is permitted. Default is 10. Example: remoteUserQueryIntervalInMins=30

The two parameters, as shown above, would allow for no more than 10 queries within a 30 minute time period from a single user.