Commit c950fc44 authored by Philipp Cimiano

minor edits

parent 739c590b
The reproduction of analyses still involves substantial effort.
In addition to the effort devoted to the reproduction itself, the Conquaire project held a number of workshops with the researchers from all eight use cases to introduce them to the goals of the project and to tools such as Git.
As a conclusion, we can say that the success rate for reproduction was very high: we were able to reproduce the results within all case studies. Yet, the level of reproducibility was not the same for all projects. According to the taxonomy of levels of reproducibility introduced in chapter \ref{conquaire_book_intro}, we have one clear case of full analytical reproducibility and three further cases that reached the category of full analytical reproducibility by the end of the project after recoding analytical workflows in open and free programming languages. Four case studies have the status of \emph{at least} limited reproducibility, as the reproduction of their work (still) involves obtaining third-party commercial licenses for tools. Only a minimal further investment would be required to bring these cases to the level of full analytical reproducibility.
In our view, this is a clear success, showing that analytical reproducibility is feasible.
The main obstacles to analytical reproducibility we found were i) a lack of documentation and thus reliance on guidance by the original authors, ii) reliance on manual steps in the analytical workflow (e.g., clicking through a GUI), iii) reliance on non-open, commercial software, and iv) a lack of information about which particular version of the software and/or data was used to generate a specific result.
An institutional policy and infrastructure can alleviate most of the problems mentioned above. Our experience shows that using a distributed version control system is a best practice and a basic step towards reproducibility. Scientists in any field can quickly learn to work with Git, in particular if web interfaces such as GitLab are provided. Most of the scientists involved in the Conquaire case studies had no issues uploading their data to a Git repository.
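Obstacle iv) above, not knowing which version of code and data produced a result, is straightforward to address once a version control system is in place: each generated artifact can be stored together with a small provenance record. The following is a minimal sketch of that idea, not tooling from the case studies; the function names and the file name \texttt{figure3.png} are purely illustrative.

```python
import json
import platform
import subprocess
from datetime import datetime, timezone

def current_commit(path="."):
    """Return the Git commit hash of the working copy, or None if Git is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=path, capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None

def provenance_record(result_file):
    """Collect minimal provenance to store next to a generated result."""
    return {
        "result": result_file,
        "commit": current_commit(),           # which code/data version produced it
        "python": platform.python_version(),  # which interpreter version was used
        "generated": datetime.now(timezone.utc).isoformat(),
    }

# Example: write the record as JSON next to the generated figure.
print(json.dumps(provenance_record("figure3.png"), indent=2))
```

Committing such a record alongside each result makes it possible to later check out exactly the state of the repository that produced it.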
Our experience also shows that scientists are deeply motivated to make their results reproducible, even if this entails a level of exposure at which errors might be discovered. In some cases we discovered minor errors in plots, scripts, etc., and the scientists involved were more than happy to correct them. The exposure and independent validation bring benefits that are generally appreciated. This is indeed an important conclusion from Conquaire. At the beginning of the project we were sceptical about how willing scientists would be to make their research artifacts available and to support reproduction. At the end of the project we can corroborate that there is a strong culture within science of being as open as possible to ensure external scrutiny and validation of scientific results.
Our experience has thus been positive, and we would like to encourage research organizations worldwide to set up policies encouraging their researchers to make their results analytically reproducible. On the basis of the results of Conquaire, Bielefeld University is working towards the establishment of such policies.
We would like to end this book with a number of clear recommendations to research institutions wanting to support their scientists in making their results reproducible:
\begin{itemize}
\item \textbf{Organization-wide version control system:} Rolling out an organization-wide version control system is the basis for reproducibility. It makes transparent when and by whom data and scripts were collected or created, and allows a particular version of the data and code to be uniquely referenced. Such a system can also support persistent storage of data and serves as a back-up for researchers. We recommend using Git.
\item \textbf{Committing scripts before data collection:} When using a Git repository, our recommendation is to develop policies that encourage scientists to commit their analytical scripts before they collect data. After committing the scripts, dummy data can be committed to check that the scripts work and produce results on an independent server that is not under the control of the scientist. After data collection, the data can be committed and the results generated automatically on the server in a continuous-integration-like manner. This reduces the possibilities for tampering with data to produce a desired result, or at least makes post-data-collection modifications transparent.
\item \textbf{Creating incentives for providing documentation:} Organizations should create incentives to foster the documentation of datasets and analytical workflows, and should adopt and enforce standards for describing author metadata, licensing information, etc.
\item \textbf{Independent code execution / result validation:} Organizations should implement services and infrastructure that support the independent execution of software and code to reproduce a given result. Continuous integration servers fulfill this purpose.
\item \textbf{Gamification:} Principles of gamification might create incentives for ensuring high data quality. We have had positive experience with introducing a badge system; yet, more investigation and experimentation is needed here.
\item \textbf{Open software:} We clearly recommend setting up policies that encourage researchers to rely on open, free, non-commercial software, so that results can be reproduced on independent machines without the need to install commercial software and pay high license fees.
\item \textbf{Metadata:} Organizations should train and support researchers in creating high-quality metadata for their data, and also train them in selecting and specifying under which licenses their data can be used. Consulting on data exploitation and re-use while taking privacy aspects into account is crucial. Bielefeld University has created a center for research data management with the mission of consulting and training researchers on such dimensions.
\end{itemize}
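The recommendation on independent result validation can be made concrete with a very small building block: a continuous integration server re-runs the analytical workflow and compares the regenerated artifact against a checksum committed by the original author. The sketch below illustrates this under that assumption; the file name \texttt{result.csv} and the function names are illustrative, not part of any Conquaire infrastructure.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    """Checksum of a result file, suitable for committing next to the data."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def validate_result(generated, reference_checksum):
    """True if the independently regenerated artifact matches the committed checksum."""
    return sha256_of(generated) == reference_checksum

# Illustrative run: create a small artifact, record its checksum once
# (as the original author would), then validate it as an independent
# server would after re-running the workflow.
workdir = Path(tempfile.mkdtemp())
artifact = workdir / "result.csv"
artifact.write_text("x,y\n1,2\n")
reference = sha256_of(artifact)  # this value would be committed to the repository

print("result reproduced:", validate_result(artifact, reference))  # prints True
```

Byte-for-byte comparison is deliberately strict; workflows whose outputs contain timestamps or platform-dependent floating-point values would need a tolerance-based comparison instead.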
However, the most important lesson learned is that analytical reproducibility should not be treated as an afterthought and delayed until the end of a research project. Analytical reproducibility is easy to achieve if one designs experiments and software environments from the start with the goal of making analytical workflows executable on any server by a third party. This minimizes the effort needed, as workflows are not disrupted in the middle of a project, and it minimizes the opportunity to modify data and results after the fact, thus creating transparency. Applying continuous integration principles from the start, taking data quality into account, publishing data and scripts early in the research process, specifying tests that monitor data quality, running analytical workflows independently of the researchers carrying out the research, and publishing results continuously and transparently in a repository together form an effective way of fostering analytical reproducibility.