Setting up a sustainable data project

Presenter: Asura Enkhbayar, Research Data Analyst at Invest in Open Infrastructure
Moderators: Josh Weiss, Director of Digital Learning Solutions; Mae Bethel, Academic Technology Specialist

Central Questions:

Why sustainable data science?
What are some key questions to ask at the beginning of a data project?
What are useful tools or processes for sustainable data practices?

Key quotes:

[There are] different kinds of costs that occur with open and reproducible practices – for instance, the cost of labor. (11:50)

An example of a project [is one] that started off as an individual exploration, turned into a software project, and, based on that, we then published a group of papers and [did] different smaller research investigations. That’s a very common model. (31:19)

[You have] important considerations when making hiring decisions. Just setting up a budget for a project or even doing everyday engagement with your team. If we’re aiming for reproducible and open science, there is a cost associated that needs to be balanced with it. (35:28)

Since explorative work ends up not thinking about the long-term and rather getting the data right away, a lot of folks end up revisiting project structure at a later point. But in my experience, giving it a little bit more thought in the beginning can be helpful. (45:46)

Take-aways:

What’s wrong with reproducible science?

There are hidden costs in labor, training, education, and maintenance, as well as opportunity costs in timing.
Be mindful that there is an extra burden on the person trying to learn the tool.
Be careful about committing to certain tools. Each comes with trade-offs in the features you have access to, the analyses you can run, and the ease of data transfer.
Some management styles ignore the technical limitations of open-source tools and rely on individual commitment, which can lead to burnout.

Project Considerations

Qualitative data is different from quantitative data. Think about which you are using for the project.
Will we collect the data ourselves? Will we keep it, or is it just a snapshot?
Do we need to develop tools and infrastructure?
Try to think about all possible scenarios. Do you talk to a research analyst? Do you have collaborators who need clean data? Would you ever have to revert the data to its original state?
Look at the humans doing the work. They all have affordances and constraints. There may be limitations in people having time and energy for tasks.
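
One way to make the question about reverting data to its original state concrete is to keep raw files separate from processed ones and record a checksum for each raw file. The sketch below is an illustration, not something shown in the session: the function names, the data/raw folder, and the manifest file are all hypothetical choices.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(raw_dir: Path, manifest_path: Path) -> dict:
    """Record a checksum for every file under raw_dir (hypothetical layout).

    The manifest should live outside raw_dir so it is not checksummed itself.
    """
    manifest = {
        p.relative_to(raw_dir).as_posix(): file_sha256(p)
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file()
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest


def changed_files(raw_dir: Path, manifest_path: Path) -> list:
    """List raw files that are missing or no longer match the manifest."""
    manifest = json.loads(manifest_path.read_text())
    changed = []
    for name, digest in manifest.items():
        path = raw_dir / name
        if not path.is_file() or file_sha256(path) != digest:
            changed.append(name)
    return changed
```

Calling write_manifest once when the data arrives, and changed_files before each analysis, gives a cheap signal that the raw data is still the original. It does not restore anything on its own, so a backup or version-controlled copy is still needed to actually revert.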

Process Considerations

Is it FOSS (free and open-source software)?
Is there a steep learning curve or is it easy to learn?
Is there a community of users who can support and help?
Seek tools that are versatile so that you can reuse them across other projects.
You will have to negotiate between reproducibility and openness.

How to balance hidden costs, project structure, and reproducibility across the lifetime of a data project