PARADIM has taken a leading role in community efforts to harness the revolution in the scale and scope of data in the materials domain.
The 2D Data Framework (2DDF) provides a structure to develop and disseminate data-intensive approaches and skills throughout our user base and beyond.
The first training workshop engaged 26 students and postdocs from across DMREF and EFRI2-DARE projects focused on two-dimensional materials. Participants received 4.5 days of instruction and problem solving led by the Hopkins and NIST team. The curriculum covered the basics of data-science environments using Python and SQL, delivered through the PARADIM Data Collective containerized environment, which is built on SciServer, the NSF-funded big-data architecture. Topics included data wrangling, use of common materials APIs, and an introduction to atomistic calculations from the notebook environment.
Dissemination of the training tools and computing environments created for the workshop has set the stage for a 2019 PARADIM Summer School on Materials Growth and Design: Discovery in the Era of Big (Materials) Data.
This first-of-its-kind training workshop was targeted to groups that had received data supplements to their original grants. A condition of those supplements was agreement to send at least one graduate student or postdoc to a training workshop, and PARADIM stepped in to create and deliver it.
The curriculum itself was designed to take beginning students through basic tools including:
- Terminal Shell for work in Linux environments and with HPC partners
- Git/GitHub for version control and collaborative development
- Python and Jupyter Notebooks for a modern data science experience
- Databases (SQL/NoSQL) to provide a more than routine understanding of why databases are not just storage options, and of what structured versus unstructured data means
- Basic Data Wrangling in Python to give a practical introduction to using packages like Pandas to pull data together and process it for visualization
- Data visualization, which was introduced only at a fairly basic level due to time constraints.
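To make the databases item concrete, here is a minimal sketch of the structured versus unstructured distinction. It is not taken from the workshop materials. It uses only the Python standard library, and every name and value in it (the `growth` table, the `RECORDS` list, the sample labels, the thickness numbers) is a hypothetical placeholder, not real data. The first function shows a database doing more than storing rows by aggregating them with a query. The second shows document-style records where a nested field may or may not be present.

```python
import json
import sqlite3

# Hypothetical, made-up records for illustration only (not real measurements).
RECORDS = [
    {"sample": "sample_A", "method": "MBE", "thickness_nm": 1.2},
    {"sample": "sample_B", "method": "CVD", "thickness_nm": 0.8},
    {"sample": "sample_C", "method": "MBE", "thickness_nm": 2.0,
     "notes": {"substrate": "sapphire"}},
]


def mean_thickness_sql(records):
    """Structured route: fixed schema, and the database does the aggregation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE growth ("
        "sample TEXT PRIMARY KEY, method TEXT, thickness_nm REAL)"
    )
    # Only the fields that fit the schema are loaded; extras are dropped.
    conn.executemany(
        "INSERT INTO growth VALUES (?, ?, ?)",
        [(r["sample"], r["method"], r["thickness_nm"]) for r in records],
    )
    rows = conn.execute(
        "SELECT method, AVG(thickness_nm) FROM growth "
        "GROUP BY method ORDER BY method"
    ).fetchall()
    conn.close()
    return dict(rows)


def substrates_from_documents(records):
    """Document route: each record is free-form, so optional fields need checks."""
    # JSON strings stand in for documents held in a NoSQL-style store.
    docs = [json.dumps(r) for r in records]
    found = {}
    for doc in docs:
        item = json.loads(doc)
        substrate = item.get("notes", {}).get("substrate")
        if substrate is not None:
            found[item["sample"]] = substrate
    return found


if __name__ == "__main__":
    print(mean_thickness_sql(RECORDS))
    print(substrates_from_documents(RECORDS))
```

The trade-off the sketch is meant to show: the fixed schema makes grouping and averaging a one-line query but silently has no place for the nested `notes` field, while the document form keeps that field at the cost of checking for it on every record.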
We also spent time on more advanced topics, such as how to use APIs in Python notebooks to access the breadth of available MGI resources, and how to interact with HPC resources to run atomistic calculations while visualizing the data in the same PDC environment.
Feedback from participants was strongly positive, with most feeling we met our goals:
- New skills to work with materials data
- Better appreciation of MGI-related, materials data resources
- Motivation to expand your data science skill set
- New friends and potential colleagues
Students involved in computational/theory research were somewhat out of place, and future workshops will focus on one target audience or the other. We need training for beginners and training for sophisticated users, but not in one workshop.
The 2D Data Framework has become the organizing framework for this community work. Moving forward, PARADIM and 2DDC will work together to make the 2DDF helpful across the 2D domain.
It’s worth noting that this effort has substantially helped our broader efforts to move the materials community further into data-centric work. Several things are in the planning stage that will reach across the domain.