IT Teaching Resources

Setting up a sustainable data project

How to balance hidden costs, project structure, and reproducibility across the lifetime of a data project


Presenter: Asura Enkhbayar, Research Data Analyst at Invest in Open Infrastructure
Moderators: Josh Weiss, Director of Digital Learning Solutions; Mae Bethel, Academic Technology Specialist
Recording of the session:

Central Questions: 

  • Why sustainable data science?
  • What are some key questions to ask at the beginning of a data project?
  • What are useful tools or processes for sustainable data practices?

Key quotes: 

[There are] different kinds of costs that occur with open and reproducible practices – for instance, the cost of labor. (11:50)

An example of a project [is one] that started off as an individual exploration, turned into a software project, and, based on that, we then published a group of papers and [did] different smaller research investigations. That’s a very common model. (31:19)

[You have] important considerations when making hiring decisions. Just setting up a budget for a project or even doing everyday engagement with your team. If we’re aiming for reproducible and open science, there is a cost associated that needs to be balanced with it. (35:28)

Since explorative work ends up not thinking about the long-term and rather getting the data right away, a lot of folks end up revisiting project structure at a later point. But in my experience, giving it a little bit more thought in the beginning can be helpful. (45:46)

Take-aways: 

What’s wrong with reproducible science?

  • There are hidden costs: labor, training, education, maintenance, and the opportunity cost of time.
  • Be mindful that there is an extra burden on the person trying to learn the tool.
  • Be careful about committing to certain tools. Each comes with trade-offs in the features you have access to, the analysis you can do, and the ease of data transfer.
  • Some management styles ignore the technical limitations of open-source tools and rely on individual commitment, which can lead to burnout.

Project Considerations

  • Qualitative data is different from quantitative data. Think about which you are using for the project. 
  • Will we collect the data ourselves? Will we keep it, or is it just a snapshot?
  • Do we need to develop tools and infrastructure?
  • Try to think about all possible scenarios. Do you talk to a research analyst? Do you have collaborators who need clean data? Would you ever have to revert the data to its original state?
  • Look at the humans doing the work. They all have affordances and constraints, and they may have limited time and energy for tasks.
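
One way to keep the option of reverting data to its original state is to store raw data separately from processed data and record a checksum for each raw file. The sketch below is only an illustration under assumptions, not something shown in the session: the folder names and function names are hypothetical, and real projects may prefer a dedicated data-versioning tool.

```python
import hashlib
from pathlib import Path

# Hypothetical folder names; adjust them to the project.
FOLDERS = ("data/raw", "data/processed", "notebooks", "src")


def create_layout(root: Path) -> None:
    """Create the project folders if they do not exist yet."""
    for name in FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_raw_checksums(root: Path) -> dict[str, str]:
    """Map each file under data/raw to its checksum."""
    raw_dir = root / "data" / "raw"
    return {
        p.relative_to(raw_dir).as_posix(): checksum(p)
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, recorded: dict[str, str]) -> list[str]:
    """List recorded raw files that are now missing or have different contents."""
    current = record_raw_checksums(root)
    return sorted(
        name for name, digest in recorded.items() if current.get(name) != digest
    )
```

Recording the checksums once, right after the data is collected, and comparing against them later shows whether the files in data/raw are still the originals. Cleaning steps would then write only to data/processed.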

Process Considerations

  • Is it FOSS (free and open-source software)?
  • Is there a steep learning curve or is it easy to learn?
  • Is there a community of users who can support and help?
  • Seek versatile tools that you can reuse across other projects.
  • In the end, you have to negotiate a balance between reproducibility and openness.