Introducing the Trusted Research Environment

 


Dr Olly Butters

What’s in a name?

First of all, what’s with the name? Let’s dig through the names we’ve had in CHC NENC (Connected Health Cities North East and North Cumbria) for this thing. It started off as the ARK because, y’know, there was going to be a flood of data (not our idea). Then when we started to talk about the regional Health Information Exchange (HIE) we realised that if we called it the ARC then the whole thing could be called ARCHIE (we liked this a lot). Then the National Institute for Health Research (NIHR) came out with their ARC programme, which caused so much confusion that we moved away from calling it the ARC and instead call it the Trusted Research Environment (TRE). Some people call it a Trustworthy Research Environment, but they are wrong. There is now a suggestion that we should call it a Trusted Analytics Platform (TAP). I am going to stick with TRE here, but you might hear others use any of the above.

So what is it?

It’s a place you can log into to do analysis of sensitive data. That’s it.

Who’s it for?

At this stage we are prototyping what a regional TRE might look like; that means it would be open to anyone who needs to analyse the sensitive data we hold in the North East and North Cumbria. From the CHC/GNCR perspective that could be a clinician trying to calculate some statistics about their clinic. It could be a hospital manager looking to audit disparate data the hospital holds. It could be a university researcher looking for trends in combined data from GPs and hospitals. It could be a local council trying to find out how many people in the region have dementia. You get the idea.

Why bother?

The world is changing, and with it so are data and how we do analysis. The days of downloading a snapshot of some data to your laptop and spending the next year analysing it are ebbing away. There are many reasons for this; two big ones are that it is risky to let people keep local snapshots of data (all too often they end up on a laptop left on the bus/train/tram/hire car/rental bicycle/etc.), and that we live in a big data world now. For us the pertinent aspects of ‘big’ are that the data volume is likely to be large and the velocity (i.e. how quickly it changes) is likely to be high. This makes downloading a local copy difficult in the first place, and it will probably be out of date very soon afterwards.

With this in mind, many places offer an environment you can log into to do analysis on data, e.g. the ONS Secure Research Service, the UK Data Service Secure Lab and the SAIL Databank. We hope to learn from these projects in the design of our TRE.

What are we actually doing?

We are working with AIMES to prototype an environment that users can securely log into (via a VPN) to do their analysis. At a basic level it is a Windows 10 remote desktop hosted in a secure data centre that is compliant with ISO 27001, the IG Toolkit and so on, and connected to the Health and Social Care Network (HSCN). Within this environment, network folders containing data can be shared amongst the relevant people, and they can do their analysis with R. So far this is similar to what you would expect from an off-the-shelf enterprise-hosted Windows environment (plus a bit of HSCN and IG Toolkit).
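To make that concrete, here is a rough sketch of the kind of thing an analyst might do once logged in. The shared drive letter, file name and column names are all made up for illustration.

```r
# A minimal sketch, assuming a hypothetical shared network folder mapped as S:
# and a made-up extract called gp_attendances.csv with a practice_id column.
library(readr)
library(dplyr)

attendances <- read_csv("S:/shared/unplanned-care/gp_attendances.csv")

# A simple summary that never leaves the environment: attendances per practice
attendances %>%
  group_by(practice_id) %>%
  summarise(n_attendances = n())
```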

The environment in this phase of the project has already been put to good use: the predictive modelling of unplanned care project, led out of Durham, has used it to analyse GP attendance records. They are now starting to look at Early Warning Score data, which will use SFTP to transfer data directly from a hospital into the TRE. Neither of these projects would have been possible in this way without the TRE.
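For the Early Warning Score work, pulling a file over SFTP from inside the TRE could look something like the sketch below. The hostname, paths and credentials are placeholders, it assumes the libcurl the TRE uses was built with SFTP support, and in practice a standalone SFTP client (or a push from the hospital end) would do the same job.

```r
# A hedged sketch only: hostname, paths and credentials are hypothetical,
# and this assumes libcurl was built with SFTP support.
library(curl)

h <- new_handle(userpwd = "ews_user:CHANGE_ME")  # credentials would really come from somewhere safer

# Pull the hypothetical Early Warning Score extract straight into the TRE
curl_download(
  url      = "sftp://sftp.hospital.example.nhs.uk/outgoing/ews_extract.csv",
  destfile = "S:/projects/unplanned-care/ews_extract.csv",
  handle   = h
)
```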

Where this starts to get really exciting (at least to us) is the next phase of the project, where we can begin to look at extra tools to put in and around the environment. Let’s draw an analogy: if the TRE is a house, the Windows remote desktop part is just the hallway. Hallways are boring, but they do open up a route to other rooms that are usually less boring. With this in mind, some of the other rooms leading off this hallway that we are currently exploring are:

Data flows from the Health Information Exchange (HIE)

As the regional HIE starts to connect everything up in the North East and enables front-line care to see relevant data, it opens up an opportunity to develop integrations with the TRE. If every piece of information that flowed through the HIE were potentially available in the TRE, then all the folk mentioned above could get access to the data and do their work much more easily.

We are actively working on this right now. We have a sandbox HIE from Tiani-Spirit and are exploring how best to connect it to the TRE, and we hope to have a proof of concept to show off in the coming months.

I should emphasise that I am in no way suggesting a free-for-all on data access here; only people who have a legitimate right to access data would be given it. We just want to make it easier for people to do things they should already be able to do, but can’t due to technical barriers.

Metadata

The previous point about data flows into the TRE is a big one. Imagine any bit of data in the region being within reach. How would we know what is available? We’d need a search engine for data, and that’s where a metadata catalogue comes in. By capturing information about what data is available, we can open up new ways of finding it.

This could be a hospital manager looking for ways of exploring what is going on in the hospital: by providing a search/browse interface over what data exists (but not the data itself), they might be able to find useful data from other departments that can inform decision making. Perhaps a local council might want to find out how many people in their area have dementia; a metadata catalogue might show that the ICD-10 code for dementia is being collected in clinic X at one of the local hospitals.
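As a toy illustration of the idea, searching metadata rather than data might feel something like the snippet below. The catalogue, its fields and the datasets are entirely made up.

```r
# A toy illustration: the catalogue and its contents are entirely hypothetical.
# Note it describes the datasets; it holds no patient-level data itself.
catalogue <- data.frame(
  dataset      = c("memory_clinic_assessments", "gp_attendances", "ed_attendances"),
  organisation = c("Hospital Trust X", "GP Federation Y", "Hospital Trust Z"),
  fields       = c("nhs_number, icd10_code, assessment_date",
                   "nhs_number, read_code, attendance_date",
                   "nhs_number, presenting_complaint, arrival_time"),
  stringsAsFactors = FALSE
)

# Which datasets record ICD-10 codes (and so might hold the dementia codes)?
catalogue[grepl("icd10", catalogue$fields), c("dataset", "organisation")]
```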

Just knowing this data exists is a huge step forwards. Obviously all the relevant governance processes must then be worked through before any data can flow, but now they would know who to ask!

Data analysis

We are pretty easy-going about which data analysis tools people might want to use. We like R, but we accept others might want to use tools like SPSS or Stata. We’d like to go down the route of offering a self-service software catalogue, where users pick the software they want from a catalogue and it gets installed automatically. We are a way off this at the minute, but it is where we want to be.

DataSHIELD

One of our pet projects that we get very excited about is DataSHIELD. Its tag line is ‘taking the analysis to the data, not the data to the analysis’, and it is all about enabling analysis of sensitive data without ever giving an analyst access to the individual-level data (a.k.a. microdata). It does this by having a client-server pair for every statistical function: the client part simply sends a message to the server part, and the server part does the computation. While the server part is calculating values it also applies statistical disclosure control, so, for example, if someone asked for the average of a column of data with only one row in it, the DataSHIELD server would not return an answer to the client. By separating the analyst from the individual-level data, and putting in checks to stop them ever even inferring it, we can make sensitive data available to people in a way we would not have been comfortable with before.

Another feature of DataSHIELD is that it allows federated analysis. In the example above, if instead of one source of data there were multiple sources (e.g. two hospitals with the same data structures), then when an analyst asks for the average of a column the DataSHIELD client runs the corresponding server commands on both hospitals; each returns its non-disclosive summary values, and the client combines them to get exactly the same answer as if the data had been in one place to begin with.
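To give a flavour of what this feels like from the analyst’s side, here is a rough sketch using the DataSHIELD R client packages (DSI, DSOpal and dsBaseClient). The server URLs, credentials and table names are all made up, and the exact setup will vary between data sources.

```r
# A sketch only: URLs, credentials and table names are hypothetical.
library(DSI)
library(DSOpal)
library(dsBaseClient)

builder <- DSI::newDSLoginBuilder()
builder$append(server = "hospital_a", url = "https://opal.hospital-a.example.nhs.uk",
               user = "analyst", password = "CHANGE_ME",
               table = "ews.admissions", driver = "OpalDriver")
builder$append(server = "hospital_b", url = "https://opal.hospital-b.example.nhs.uk",
               user = "analyst", password = "CHANGE_ME",
               table = "ews.admissions", driver = "OpalDriver")

# Log in to both servers; each assigns its own data to the symbol D server-side
connections <- DSI::datashield.login(logins = builder$build(), assign = TRUE, symbol = "D")

# The client only ever receives non-disclosive summaries, which it pools
ds.mean(x = "D$age", type = "combine", datasources = connections)

DSI::datashield.logout(connections)
```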

Calculating the average is quite simple, but we’ve ways of doing many in-depth statistical methods.

Provenance

Provenance is a really important part of this: it is all about making sure we know how data was collected and how it has been processed. Without this information it is easy to lose important context on data – ‘prescribed aspirin’ as an isolated piece of information isn’t as useful as knowing it was part of a wider blood-thinning regime.

Find out more and help guide us

That’s some of the work going on around the TRE at the minute. If you want to find out more, you can watch this webinar we hosted on 28 March 2019, which goes into more detail.

 

 

 

 

 

 
