Presenter: Asura Enkhbayar, Research Data Analyst at Invest in Open Infrastructure
Moderators: Josh Weiss, Director of Digital Learning Solutions; Mae Bethel, Academic Technology Specialist
Recording of the session:
Central Questions:
- Why sustainable data science?
- What are some key questions to ask at the beginning of a data project?
- What are useful tools or processes for sustainable data practices?
Key quotes:
[There are] different kinds of costs that occur with open and reproducible practices – for instance, the cost of labor. (11:50)
An example of a project [is one] that started off as an individual exploration, turned into a software project, and, based on that, we then published a group of papers and [did] different smaller research investigations. That’s a very common model. (31:19)
[You have] important considerations when making hiring decisions. Just setting up a budget for a project or even doing everyday engagement with your team. If we’re aiming for reproducible and open science, there is a cost associated that needs to be balanced with it. (35:28)
Since explorative work ends up not thinking about the long-term and rather getting the data right away, a lot of folks end up revisiting project structure at a later point. But in my experience, giving it a little bit more thought in the beginning can be helpful. (45:46)
Take-aways:
What’s wrong with reproducible science?
- There are hidden costs: labor, training, education, maintenance, and the opportunity cost of the time involved.
- Be mindful that there is an extra burden on the person trying to learn the tool.
- Be careful about committing to certain tools. Each comes with trade-offs in the features you have access to, the analyses it supports, and the ease of data transfer.
- Some management styles ignore the technical limitations of open-source tools and rely on individual commitment, which can lead to burnout.
Project Considerations
- Qualitative data is different from quantitative data. Think about which you are using for the project.
- Will we collect the data ourselves? Will we keep it, or is it just a snapshot?
- Do we need to develop tools and infrastructure?
- Try to think about all possible scenarios. Do you talk to a research analyst? Do you have collaborators who need clean data? Would you ever have to revert the data to its original state?
- Look at the humans doing the work. They all have affordances and constraints. There may be limitations in people having time and energy for tasks.
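
One way to keep the option of reverting data to its original state is to store an untouched raw copy with a checksum and write cleaned versions elsewhere. The following is a minimal sketch of that idea, not something shown in the session: the function names `snapshot_raw` and `raw_is_unchanged` and the `raw/` and `processed/` folder names are hypothetical, and only the Python standard library is used.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_raw(source: Path, raw_dir: Path) -> str:
    """Copy `source` into `raw_dir` and record its checksum next to it.

    Hypothetical helper: the raw copy is never edited afterwards.
    """
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / source.name
    shutil.copy2(source, target)
    digest = sha256_of(target)
    (raw_dir / (source.name + ".sha256")).write_text(digest)
    return digest


def raw_is_unchanged(raw_file: Path) -> bool:
    """Compare the raw copy with the checksum recorded at snapshot time."""
    checksum_file = raw_file.parent / (raw_file.name + ".sha256")
    recorded = checksum_file.read_text().strip()
    return sha256_of(raw_file) == recorded


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        download = project / "survey.csv"
        download.write_text("id,answer\n1, yes \n2,no\n")

        # Keep an untouched copy in raw/ (assumed folder name).
        snapshot_raw(download, project / "raw")

        # Cleaning writes to processed/ and leaves raw/ alone.
        processed = project / "processed"
        processed.mkdir()
        lines = (project / "raw" / "survey.csv").read_text().splitlines()
        cleaned = [",".join(cell.strip() for cell in line.split(",")) for line in lines]
        (processed / "survey_clean.csv").write_text("\n".join(cleaned) + "\n")

        print(raw_is_unchanged(project / "raw" / "survey.csv"))
```

With this layout, collaborators who need clean data read from the processed folder, while the raw copy and its checksum remain available if the cleaning steps ever need to be redone.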
Process Considerations
- Is it FOSS (free and open-source software)?
- Is there a steep learning curve or is it easy to learn?
- Is there a community of users who can support and help?
- Seek tools that are versatile so that you can reuse them across other projects.
- You have to negotiate between reproducibility and openness.