
PARADIM Data Collective: Secure Open-Source Data Streaming

PARADIM Highlight #76—Education and Outreach (2023)

Tyrel M. McQueen, Margaret Eminizer, David Elbert (JHU)

Since 2019, PARADIM has deployed a streaming data architecture for real-time collection, collation, analysis, storage, and access of data from research instrumentation. In parallel, a DMREF project on data-driven discovery has developed OpenMSIStream, a software toolkit that makes it easier to add data streaming to commercial instruments.


An important consideration in both efforts is how to protect the confidentiality, integrity, and verifiability of streamed data while ensuring that the data become accessible at the appropriate time, for example alongside publications.

KafkaCrypto was developed by PARADIM to provide end-to-end encryption and data assurance for streaming materials data. It relies on a core symmetric cryptographic ratchet to provide forward secrecy and to make verifiable what data was produced and consumed, when, and by what. Together with an asynchronous key exchange based on the Noise Protocol Framework, it meets users' needs for data confidentiality, integrity, and verifiability.
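To make the idea of a symmetric ratchet concrete, here is a minimal sketch using only the Python standard library. It is not KafkaCrypto's implementation or API: the function names, the HMAC-SHA256 derivation, and the `b"chain"` / `b"message"` labels are all assumptions chosen for illustration. The point it shows is that each step derives a fresh message key and then replaces the chain key through a one-way function, so a key captured later cannot be used to recover earlier message keys.

```python
import hashlib
import hmac


def ratchet_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """Advance the ratchet once (hypothetical helper, not KafkaCrypto's API).

    Returns (next_chain_key, message_key). Both are derived from the current
    chain key with HMAC-SHA256 under different labels, so neither reveals
    the other, and the old chain key cannot be computed from the new one.
    """
    next_chain_key = hmac.new(chain_key, b"chain", hashlib.sha256).digest()
    message_key = hmac.new(chain_key, b"message", hashlib.sha256).digest()
    return next_chain_key, message_key


def derive_message_keys(initial_key: bytes, count: int) -> list[bytes]:
    """Derive `count` successive message keys from a shared initial key."""
    keys = []
    chain_key = initial_key
    for _ in range(count):
        # The previous chain key is overwritten here; a real implementation
        # would also take care to erase it from memory.
        chain_key, message_key = ratchet_step(chain_key)
        keys.append(message_key)
    return keys
```

A producer and a consumer that share the same initial key (established by some separate key exchange, which this sketch does not cover) would derive the same sequence of message keys independently, one per message or per time window.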

OpenMSIStream now incorporates KafkaCrypto and both are now available to the community for adoption and further development.

What has been achieved:

Flexible, open-source code release providing any laboratory easy access to secure streaming data.

Access to growing tools for auto-curation of experimental and computational data to create a backbone for automated FAIR data.

An extendable framework to create automated data analysis and deployment of ML models in laboratory settings.

Importance of the Achievement:

Streaming experimental data provides a foundation for automated curation of FAIR data releases and for accelerated extraction of information through AI/ML. PARADIM contributed basic data producers, data consumers, and an end-to-end cryptographic layer to the recently released, open-source Python library OpenMSIStream. OpenMSIStream provides the reliable and durable data confidentiality needed by PARADIM users focused on proprietary or otherwise confidential research. OpenMSIStream with KafkaCrypto makes secure data streaming available in PARADIM and across science.

Full reference:

M. Eminizer, S. Tabrisky, A. Sharifzadeh, C. DiMarco, J.M. Diamond, K.T. Ramesh, T.C. Hufnagel, T.M. McQueen, D. Elbert, “OpenMSIStream: A Python package for facilitating integration of streaming data in diverse laboratory environments,” Journal of Open Source Software 8, 4896 (2023) (https://doi.org/10.21105/joss.04896)

Acknowledgments:

The development of OpenMSIStream has been financially supported by NSF Awards #1921959 and #2129051. Tyrel M. McQueen and the development of KafkaCrypto were supported by the Platform for the Accelerated Realization, Analysis, and Discovery of Interface Materials (PARADIM), an NSF Materials Innovation Platform, under cooperative agreement #1539918.

Additional Information

Code: Access to OpenMSIStream is available at https://github.com/openmsi/openmsistream. KafkaCrypto is available at https://github.com/tmcqueen-materials/kafkacrypto.

Broader Impacts: This work included training and contributions by students (S. Tabrisky, undergraduate at Dartmouth College; J.M. Diamond, graduate student at JHU) and a postdoctoral fellow (C. DiMarco at JHU).

Contributions Statement: OpenMSIStream was originally developed as part of DMREF #1921959 to support development of a materials design loop centered on data flow and instantiated for creation of spall-resistant aluminum alloys. VariMat, #2129051, provided deployment of OpenMSIStream to a broad array of scientific equipment and associated use for automated semantic curation and consumption to object stores. PARADIM provided initial work on basic data producers and consumers; improved serialization/deserialization; enhanced reliability for large data streams; and the seamless end-to-end encryption layer providing data confidentiality, assurance, and verifiability.
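As a rough illustration of what reliable handling of large data streams can involve, the sketch below splits a payload into fixed-size chunks that each carry enough metadata to be reordered and checked on the receiving side. This is not OpenMSIStream's message format or API: the dictionary fields, the function names, and the use of SHA-256 are assumptions made for this example, and no message broker is involved.

```python
import hashlib


def split_into_chunks(data: bytes, chunk_size: int) -> list[dict]:
    """Split `data` into chunk messages (hypothetical format for illustration)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    file_digest = hashlib.sha256(data).hexdigest()
    payloads = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [
        {
            "index": i,                      # position of this chunk in the file
            "total": len(payloads),          # how many chunks make up the file
            "file_sha256": file_digest,      # digest of the whole payload
            "chunk_sha256": hashlib.sha256(p).hexdigest(),
            "payload": p,
        }
        for i, p in enumerate(payloads)
    ]


def reassemble(chunks: list[dict]) -> bytes:
    """Rebuild the payload from chunks that may arrive in any order."""
    if not chunks:
        raise ValueError("no chunks received")
    ordered = sorted(chunks, key=lambda c: c["index"])
    total = ordered[0]["total"]
    # Every index from 0 to total - 1 must be present exactly once.
    if [c["index"] for c in ordered] != list(range(total)):
        raise ValueError("missing or duplicate chunks")
    for c in ordered:
        if hashlib.sha256(c["payload"]).hexdigest() != c["chunk_sha256"]:
            raise ValueError(f"chunk {c['index']} failed its integrity check")
    data = b"".join(c["payload"] for c in ordered)
    if hashlib.sha256(data).hexdigest() != ordered[0]["file_sha256"]:
        raise ValueError("reassembled data does not match the file digest")
    return data
```

Plain hashes that travel with the data catch accidental corruption and missing chunks only. Protection against deliberate tampering needs keyed cryptography, which is the role the encryption layer described above plays.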
