National Center for Health Statistics targeting fall launch of virtual data enclave
The National Center for Health Statistics is testing a virtual data enclave internally to make its sensitive data available to more researchers, with plans to onboard select pilot projects in the fall, according to its Research Data Center director.
Speaking at the Joint Statistical Meetings in Washington, Neil Russell said researchers will be able to use the virtual data enclave (VDE) from wherever they are to find and request data from NCHS.
The launch of the enclave represents a culture shift for a “fairly conservative” federal statistical agency, in response to the Foundations for Evidence-Based Policymaking Act of 2018 encouraging increased data releases, Russell said. NCHS, the Centers for Disease Control and Prevention center that tracks vital statistics on births, deaths, diseases and conditions to inform public health decisions, recognized that requiring researchers to travel to one of four secure research data centers (RDCs) or 32 federal statistical RDCs (FSRDCs) nationwide to access its restricted-use data was impractical.
“There is a definite financial hurdle to accessing restricted-use data through the physical data enclave model,” Russell said. “And we’re hopeful that a whole new group, or cohort, of researchers may be motivated to access the restricted-use data through a virtual data enclave.”
A researcher in New Mexico, which lacks any RDCs or FSRDCs, will no longer need to travel to Texas, Colorado or Utah to obtain the restricted-use data they need for their work. And no background investigations will be required of researchers at NCHS, which sponsored the VDE.
RDCs closed at the height of the COVID-19 pandemic, but the VDE can operate 24/7 in theory.
The VDE is 99% built, Russell said. To be customer friendly, it is Windows-based and offers familiar software, namely SAS, so researchers can write code to generate outputs that they then request from NCHS.
NCHS’s sister agency, the National Institute for Occupational Safety and Health, already had an operational VDE, so NCHS didn’t require a contract. Instead, in September NCHS sent NIOSH its enclave requirements, which were designed for data scientists, along with payment, which came out of CDC Data Modernization Initiative funds.
NIOSH had no way of handling non-CDC employees logging into the VDE, so the General Services Administration’s Login.gov service was used. Outside researchers must show their driver’s license to create an account, and NCHS conducts virtual inspections of their offsite locations.
NCHS further had NIOSH build a tracking system to create an audit trail for all data released.
NIOSH’s VDE already had an authority to operate at the Federal Information Security Management Act moderate level; encrypted researchers’ connections; required two-factor authentication; and prevented downloading, copy-pasting, printing and emailing of data.
To address the remaining risk of data exfiltration, NCHS requires researchers and, in some cases, their employers to sign data-use agreements specifying the location from which they will access the data via a secure server.
While NCHS can’t prevent every violation of that agreement, such as a researcher taking a photo of their output prior to submitting it to NCHS for review, violators can be caught.
“I’ve seen journal articles produced through restricted use data that we didn’t know where they got it from; we know it happens,” Russell said. “Your access to the data will be terminated and your employer notified.”
Researchers still must pay a data access fee, and NCHS hasn’t calculated the true operational cost of the VDE just yet.
If more researchers seek VDE access than NCHS can handle, which seems likely, Russell will have to ask the CDC for additional funding to scale the environment.
“It is possible that the demand for this mode of access will outstrip our supply,” Russell said. “Currently I only have approval to stand up 10 virtual machines, which seems ridiculous.”