Commit c125da10 authored by Cord Wiljes

Corrected typos

parent f8979d25
@@ -62,11 +62,11 @@ For an academic institution, it is important to have an infrastructure-based app
%\end{wrapfigure}
%%%%%%%%%%%%
Conquaire envisions that scientists commit their data and scripts early in the research cycle into a distributed version control system (DVCS) \index{distributed version control system} such as Git, a file system built on a content-addressable key-value data store. A university-wide installation offers various advantages for collaboration: regular, version-controlled data backups from which any earlier state can be retrieved, and integrity guarantees that protect the stored data against silent corruption.
The Conquaire project decided to adopt Git as the DVCS \index{DVCS}, largely to take advantage of features that ensure a distributed collaborative environment. GitHub\index{GitHub}, a social site for software development, uses the Git DVCS as the underlying technology to create a cloud-hosted platform for sharing program code and related technical artifacts. With several collaborative features, the site is free for open-source projects, and the intrinsic social features make it very popular among programmers, scientists and technical people wanting to share their work and collaborate. A Stack Overflow survey\footnote{\url{https://insights.stackoverflow.com/survey/2015}} ranked Git usage at 69.3\%, almost double that of the runner-up, SVN, at 36.9\%, making Git the front runner among version control systems.
Since GitHub is a cloud-hosted platform, we looked for alternative free and open source software (FOSS) implementations that could be installed on the University infrastructure. We found GitLab\index{GitLab}, a free-software implementation of a web-based Git repository manager that can be self-hosted and offers features similar to GitHub\footnote{\url{https://conquaire.uni-bielefeld.de/2018/04/17/Git/}}, i.e. an issue tracker, a wiki, a CI/CD pipeline, etc., all governed by per-user permission levels. These permission levels for different feature access play an important role in collaborating and sharing knowledge across physical boundaries. Like GitHub, the collaborative features of GitLab allow a user to make \emph{commits} and \emph{pull requests}, edit their documents, create forks or branches, revert to an old version, and merge changes into the \emph{master} branch.
When a user makes a Git commit, Git performs three steps: (i) it creates a tree graph to represent the content of the files being committed to the project, (ii) it stores a commit object in the \textbf{.git/objects} folder, and (iii) it points the current branch at the new commit object.
To record the current state of the repository, Git creates a tree graph from the index, which records the location and content of every file within the project repository. The tree graph is composed of two types of objects: \textit{blobs} and \textit{trees}. The command \textit{git add} stores \textit{blobs}, which represent the content of files; \textit{trees}, each of which represents a directory in the working copy, are stored when a \textit{commit} is made.
Thus, the distributed features of the key-value data store ensure that the Git history stores the old version, the new version, and any interim versions that the user keeps in their fork (or working copy). The Git project environment aids data sharing and reproducibility when a user checks research data into a version-controlled repository, by ensuring that the exact state of the project can be reproduced at any point in its timeline. Multiple users can thus easily collaborate without fear of their work being erased or overwritten, thanks to the many Git collaboration features such as merging, fetching, pulling changes to a local branch, branching, stashing, pushing changes, tagging objects, etc.
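To make the content-addressable nature of the object store concrete, the following minimal Python sketch (not part of the original Conquaire tooling, and assuming an existing local repository) computes a blob identifier the way Git does and reads a loose object back from the \textbf{.git/objects} folder:
\begin{verbatim}
import hashlib, pathlib, zlib

def blob_id(content: bytes) -> str:
    # Git hashes "blob <size>\0<content>" with SHA-1; the digest becomes the
    # key under which the content is stored in .git/objects.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def read_loose_object(repo: str, sha: str):
    # Loose objects live in .git/objects/<first 2 hex chars>/<remaining 38>,
    # zlib-compressed, with a "<type> <size>\0" header in front of the body.
    path = pathlib.Path(repo, ".git", "objects", sha[:2], sha[2:])
    raw = zlib.decompress(path.read_bytes())
    header, _, body = raw.partition(b"\x00")
    return header.decode(), body

# Example (hypothetical repository path and file content):
# print(blob_id(b"temperature,frozen\n-5.0,0\n"))
# print(read_loose_object("my-repo", "<sha printed above>"))
\end{verbatim}
Because the key under which content is stored is derived from the content itself, any modification or corruption of a stored file changes its hash and is immediately detectable, which is the integrity property referred to above.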
@@ -88,7 +88,7 @@ The architecture of the Conquaire quality control system is depicted in Figure \
\paragraph{A. Data preparation and quality checking (marked red)}
\begin{itemize}
\item \textbf{Step 1:} The researcher uploads data to the version control system server. This can be done via the GitLab browser-based frontend, from the shell using Git commands, or with any other available Git GUI (e.g. GitHub Desktop, TortoiseGit).
\item \textbf{Step 2:} Uploading one or more files onto the Git server automatically triggers the GitLab CI runner, which executes the quality checking procedures on the Conquaire quality checking server. These fetch the necessary files from the Git repository and perform quality checks (a minimal sketch of such a check is shown after this list).
\item \textbf{Step 3:} The result of the quality check is returned to the researcher. It gives a detailed analysis of all files that were committed and provides a report on which tests were passed or failed by the data. The researcher may then correct the data according to the test results and resubmit it to the Git repository. This cycle can be iterated as often as necessary.
\end{itemize}
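To illustrate Step 2, a quality checking procedure of this kind can be as simple as a short Python script, executed by the CI runner, that validates the committed data files and fails the pipeline when a check does not pass. The directory layout and checking rules below are hypothetical and stand in for the project-specific checks:
\begin{verbatim}
import csv
import pathlib
import sys

def check_csv(path: pathlib.Path) -> list:
    """Return a list of problems found in one CSV file (illustrative rules only)."""
    problems = []
    with path.open(newline="") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return ["%s: file is empty" % path]
    width = len(rows[0])
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append("%s, line %d: expected %d columns, got %d"
                            % (path, number, width, len(row)))
        elif any(cell.strip() == "" for cell in row):
            problems.append("%s, line %d: empty cell" % (path, number))
    return problems

if __name__ == "__main__":
    report = []
    for csv_file in sorted(pathlib.Path("data").glob("**/*.csv")):
        report.extend(check_csv(csv_file))
    for line in report:
        print(line)
    # A non-zero exit status marks the CI job, and hence the quality check, as failed.
    sys.exit(1 if report else 0)
\end{verbatim}
The report produced by such a job is what the researcher receives in Step 3.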
@@ -3,10 +3,10 @@
The DFG-funded Conquaire project has been concerned with investigating the feasibility of reproducing the analytical phase of research in experimental sciences. We have conducted eight case studies in various areas such as biology, linguistics, psychology, robotics, economics and chemistry as a basis to understand obstacles and best practices towards ensuring reproducibility of scientific results.
The reproduction of analyses still involves substantial effort. Originally, we had set ourselves the goal of investing a full working week (40 hours) into the reproduction of each of these case studies. In many cases, the time needed to reproduce a result exceeded this amount by a factor of three, because, while data and scripts were available, the documentation was not sufficient to reproduce the analyses without step-by-step guidance from the authors of the original publication that we set out to reproduce.
In addition to the effort devoted to the reproduction itself, the Conquaire project has performed a number of workshops with all the researchers from the eight use cases to introduce them to the goals of the project, to introduce Git, etc.
As a conclusion, we can say that the success rate for reproduction was very high. We were able to reproduce the results within all case studies. Yet, the level of reproducibility was not the same for all projects. According to the taxonomy of levels of reproducibility introduced in chapter \ref{conquaire_book_intro}, we have one clear case of full analytical reproducibility and three further projects that reached the category of full analytical reproducibility by the end of the project after recoding analytical workflows using open and free programming languages. Four case studies have the status of \emph{at least} limited reproducibility, as the reproduction of their work (still) involves obtaining third-party commercial licenses for tools. Only a minimal further investment is required to bring these cases to the level of full analytical reproducibility.
This is a clear success in our view, showing that analytical reproducibility is feasible.
The main obstacles to analytical reproducibility that we found were i) the lack of documentation and thus reliance on guidance by the original authors, ii) the reliance on some manual steps in the analytical workflow (e.g. clicking on a GUI), iii) the reliance on non-open and commercial software, and iv) the lack of information about which particular version of software and/or data was used to generate a specific result.
An institutional policy and infrastructure can alleviate most of the problems mentioned above. Our experience shows that using a distributed version control system is a best practice to be followed and a basic step towards reproducibility. Scientists in any field can quickly learn to work with Git, in particular if web frontends such as GitLab are provided. Most of the scientists involved in the Conquaire case studies had no issues in uploading their data to a Git repository.
Our experience also shows that scientists are deeply motivated to make their results reproducible, even if this leads to a level of exposure that might lead to errors being discovered. In some cases we discovered minor errors in plots, scripts etc., and the scientists involved were more than happy to correct these issues. The exposure and independent validation bring benefits that are generally appreciated. This is indeed an important conclusion from Conquaire. While at the beginning of the project we were sceptical about how willing scientists would be to make their research artifacts available and support reproduction, we are now more than convinced that there is a strong culture within science of being as open as possible to ensure external scrutiny and validation of scientific results.
Our experience has thus been positive, and we would like to encourage research organizations worldwide to set up policies encouraging their researchers to make their results analytically reproducible. On the basis of the results of Conquaire, Bielefeld University is working towards the establishment of policies in this respect.
We would like to end this book with a number of clear recommendations to research institutions wanting to support their scientists in making their results reproducible:
@@ -20,8 +20,4 @@ We would like to end this book with a number of clear recommendations to researc
\item \textbf{Open software:} We clearly recommend setting up policies that encourage researchers to rely on open, free and non-commercial software to facilitate reproduction of results on independent machines without the need to install commercial software and pay high license fees.
\item \textbf{Metadata:} Organizations should train and support researchers in creating high-quality metadata for their data and also train them in selecting and specifying under which licenses their data can be used. Consulting on data exploitation and use while taking into account privacy aspects is crucial. Bielefeld University has created a center for research data management with the mission of consulting and training researchers on such dimensions.
\end{itemize}
However, the most important lesson learned is that analytical reproducibility should not be considered an afterthought and delayed to the end of a research project. Analytical reproducibility is easy to achieve if one designs experiments and software environments from the start with the goal of making analytical workflows executable on any server by a third party. This minimizes the effort needed, as workflows are not disrupted in the middle of a project, and it minimizes the opportunity to post-modify data and results, thus creating transparency. Applying continuous integration principles from the start, taking data quality into account, publishing data and scripts early in the research process, specifying tests that monitor data quality and run analytical workflows independently of the researchers carrying out the research, and publishing results continuously and transparently in some repository are effective ways of fostering analytical reproducibility.
\ No newline at end of file
@@ -56,7 +56,7 @@ Accordingly, the main objective of that study was to relate inter-species differ
%SS-3.2
\subsection{Data acquisition: Experimental procedure} \label{expProc}
%SS-3.2
For the experiments described in \citep{Theunissen_EtAl_2015}, adult stick insects of the species \textit{Carausius morosus} (de Sinéty 1901), \textit{Aretaon asperrimus} (Brunner von Wattenwyl 1907) and \textit{Medauroidea extradentata} (Redtenbacher 1906) were used. Animals were bred in a laboratory culture at Bielefeld University.
In each experimental trial, an animal was placed on a horizontal walkway (40 x 490 mm), along which it walked freely. There were four walking/climbing conditions as characterised by the height of two stairs placed on the walkway: in the flat (walking) condition, the walkway was used without stairs; in the climbing conditions low, middle and high, a staircase with two stairs of step height, h, was placed at the end of the walkway (40 x 200 mm; low: h = 8 mm, middle: h = 24 mm, high: h = 48 mm). The flat walking condition served as the reference condition. The four conditions were presented in a randomised sequence of at least 40 trials, resulting in approximately ten trials per condition per animal. The whole setup was painted in opaque black and was surrounded by black drapery in order to minimise visual contrast. The room was darkened and illuminated only by red light LEDs of the Vicon cameras (see below) and indirect light emanating from a TFT computer monitor.
@@ -138,7 +138,7 @@ On the other hand, if the C3Dserver is installed on a computer with a 64-bit ope
Scientific research groups use a variety of file formats, with various machines using standard formats to read in and output data. Here, the captured data is stored in a \textit{.c3d} file that can be exchanged and accessed via the C3Dserver, which is predominantly supported on the Windows platform only. The C3D file format is a public domain file format for storing motion and other 3D data recorded in various laboratory settings. The C3Dserver includes several MATLAB support functions that allow the files to be analysed, with additional MATLAB functions being written to perform operations on the data in the \textit{.c3d} file.
The biggest challenge we thus faced was the requirement of the proprietary C3Dserver for data processing, analysis and visualisation, which was only available for machines running the Windows operating system. Since there was no software support for Linux to read the motion tracking data into MATLAB, we could not recreate the full pipeline on a Linux machine. The Library maintains the infrastructure for research data management (RDM); hence, it would have the additional work of installing both MATLAB and the Windows server, patching and updating them regularly, and maintaining licensed version upgrades, which can become expensive over time.
The kinematic reconstruction was achieved in MATLAB by combining marker trajectories with the body documentation. The resulting processed data, i.e., joint angle time courses, gait pattern, and velocity, were saved as another \textit{.mat}-file.
Another problem was related to the backslash used in paths on the Windows machine. All relative paths in the code were written for Windows, which uses a backslash instead of the forward slash used on *nix machines. While analysing the motion data with the C3Dserver and MATLAB on Windows, this is not an issue. However, a user trying to run the MATLAB code on a *nix machine would have to replace and correct all the paths before running the code to reproduce the figures from that point onwards. A small illustration of platform-independent path handling is given below.
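Such problems can be avoided by building paths with platform-independent helpers (e.g. \texttt{fullfile} in MATLAB) instead of hard-coding separators. A minimal sketch in Python, the language used for other Conquaire reproductions, with a made-up directory layout:
\begin{verbatim}
from pathlib import Path

# The path separator is inserted by the library, so the same code runs
# unchanged on Windows and on *nix machines (directory names are illustrative).
trial_file = Path("experiments") / "climbing" / "trial_042.c3d"
print(trial_file)  # experiments\climbing\trial_042.c3d on Windows,
                   # experiments/climbing/trial_042.c3d elsewhere
\end{verbatim}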
@@ -142,7 +142,7 @@ As a main objective of this study, we defined the goal of being able to independ
%SS-4.1
\subsection{Research Data - Primary} \label{RDprimary}
%SS-4.1
The data was read off the BINARY experiment setup and then processed entirely with OriginPro, proprietary software from OriginLab Corporation that is mainly used for interactive scientific graphing and data analysis on the Microsoft Windows platform only. It is a GUI application with a spreadsheet-like front end which uses a column-oriented data processing approach for calculations. It has its own file format, \textbf{.OPJ}, for project files, which are directly processed by the system for statistics, data analysis and visualization.
The group uses OriginPro along with a scripting language known as \textbf{LabTalk}, which allows finer control by writing small macros that automate the data analysis process for the experiment data. With LabTalk the group programs routine operations, including batch operations, with customizable graph templates and analysis dialog box themes. Various features exist to save a collection of operations within the workbook, viz. saving a suite of operations, automatic recalculation on changes to data or analysis parameters, and different analysis templates.
@@ -153,7 +153,7 @@ The group uses OriginPro along with a scripting language known as \textbf{LabTal
The Snomax\textsuperscript{\textregistered} data file contains the data from the OPJ project file that is read into the Origin software system. The data was exported from OriginPro into tab-separated *.txt files with six TAB-delimited columns. The calibration data numbers start from line four, with the headers confined to the first three lines; viz. the first line has the column names, the second line contains the data description or unit, and the third line contains information about the substance.
For the computational reproducibility experiment, we used Python to process these text files for data analysis and visualization based on the validated raw data. After calibrating the temperature, the Python script binned the data: it grouped the data for all columns by (decreasing) concentration into different bins, and within each concentration bin the data was sorted by (decreasing) calibrated temperature $T_{cal}$. Afterwards, $f_{ice}(T)$ was determined for each temperature value in each bin. In the last step the mass concentration of Snomax\textsuperscript{\textregistered} and the volume of the droplets were converted into the active site density per unit mass, $n_{m}(T)$.
After tabulating $f_{ice}(T)$ and $n_{m}(T)$ for each concentration bin, the resulting data was stored in a CSV file that became the input data for reproducing the plot from the original paper shown in Figure \ref{binary_plot}.
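A compact sketch of this processing chain in Python/pandas is shown below. The column names, the droplet volume and the file name are illustrative assumptions, and the conversion to $n_{m}(T)$ uses the standard cumulative (Vali-type) relation $n_{m}(T) = -\ln(1 - f_{ice}(T)) / (c_{m} V_{drop})$, which we take to correspond to the conversion described above:
\begin{verbatim}
import numpy as np
import pandas as pd

V_DROP = 1.0e-9   # droplet volume in litres (assumed constant per experiment)

# Assumed layout: TAB-separated export, three header lines, columns including
# the calibrated temperature and the Snomax mass concentration of each droplet.
data = pd.read_csv("snomax_export.txt", sep="\t", skiprows=3,
                   names=["T_cal", "c_m", "col3", "col4", "col5", "col6"])

tables = []
for c_m, group in data.groupby("c_m", sort=False):
    group = group.sort_values("T_cal", ascending=False)
    # f_ice(T): cumulative fraction of droplets frozen at or above temperature T
    f_ice = np.arange(1, len(group) + 1) / float(len(group))
    # Active site density per unit mass (assumed Vali-type cumulative formula)
    n_m = -np.log(1.0 - np.clip(f_ice, 0.0, 1.0 - 1e-12)) / (c_m * V_DROP)
    tables.append(pd.DataFrame({"c_m": c_m, "T_cal": group["T_cal"].values,
                                "f_ice": f_ice, "n_m": n_m}))

# The tabulated spectra become the input for reproducing the original plot.
pd.concat(tables).to_csv("snomax_nm_vs_T.csv", index=False)
\end{verbatim}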
@@ -185,7 +185,7 @@ With the given raw data the results from the original experiment could be succes
\subsection{Summary of Reproducibility Experiment} \label{ReXStatus}
We reproduced the results of the analyses from the original paper, as shown in Figure \ref{fig6-cqr-sonomaxvstemp}, by plotting $n_{m}(T)$, the cumulative number of IN per dry mass of Snomax\textsuperscript{\textregistered}, as a function of calibrated temperature. Origin is a proprietary analysis toolbox with no equivalent libre software alternative; hence, the original OPJ data files can only be read by the Origin software system. The system allows data to be exported into tab-separated files with delimited columns. Due to the complexity and time associated with learning to use a new system like Origin, we opted to use Python to code the formulae and analyze the exported data files.
In addition, Python is open source and supported on many platforms.
\begin{figure}[!ht]
@@ -246,7 +246,7 @@ The data has been uploaded to the DFG FOR1525 project website (https://www.ice-n
%%sss-5.1.4
%\subsubsection{Data should be Reusable} \label{fairReuse}
%%sss-5.1.4
%Data reuse is an expensive option due to the existence of paid software in the researchers workflow, greatly limiting non-domain users interested in reproducible software. The researchers can compartmentalize the tasks of data acquisition, data processing management, data analysis and visualization while extending the Python code used in this reproducibility. It would ensure a higher rate of data reuse.
%%S-6
@@ -82,7 +82,7 @@ Using the process detailed above the scripts produced tables of variables ready
%S-4
\section{Analytical Reproducibility} \label{ReX}
%S-4
Computational reproducibility experiments were conducted with the psycholinguistics research group at Paderborn University at the paper-publishing stage: the data analysis scripts were modified to produce the results and to implement visualizations with Pandas and Matplotlib, and the resulting code was later stored in GitLab under continuous integration. To facilitate team collaboration on porting and refactoring the code, the Python scripts and extracted (TSV format) files for data analysis are available at the following Git repository: \url{https://gitlab.ub.uni-bielefeld.de/conquaire/psycholinguistics}.
%%SS-4.1
%\subsection{Research Data} \label{researchData}
@@ -114,7 +114,7 @@ The research data workflow lifecycle diagram in Figure \ref{fig2-dataworkflow-2}
\caption{Data Workflow}
\label{fig2-dataworkflow-2}
\end{figure}
The research project used Free \& Open Source Software (FOSS), which increased the prospect of cross-platform availability of processing tools, as the Python programming language and visualization packages (such as Pandas and Matplotlib) are freely available for multiple platforms.
The old data analysis scripts, written in Python 2.x, were ported to version 3.6 because Python 2.x had reached its end of life. The old scripts were also refactored from a complex mass of conditional loops into a simplified, modular, callable program to ease maintenance; the sketch below illustrates the kind of structure this refactoring aimed for.
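For illustration, the target structure looked roughly like the following sketch; the file name, column names and aggregation are hypothetical and not taken from the actual psycholinguistics scripts:
\begin{verbatim}
import pandas as pd
import matplotlib.pyplot as plt

def load_results(tsv_path):
    """Read one extracted TSV file of trial results (hypothetical columns)."""
    return pd.read_csv(tsv_path, sep="\t")

def summarise(data, group_column, value_column):
    """Aggregate the value of interest per experimental condition."""
    return data.groupby(group_column)[value_column].mean()

def plot_summary(summary, output_path):
    """Write a simple bar chart of the per-condition means."""
    axis = summary.plot(kind="bar")
    axis.set_ylabel(summary.name)
    plt.tight_layout()
    plt.savefig(output_path)

if __name__ == "__main__":
    results = load_results("extracted_trials.tsv")       # hypothetical file name
    summary = summarise(results, "condition", "reaction_time")
    plot_summary(summary, "reaction_time_by_condition.png")
\end{verbatim}
Each step is a small callable function, which makes the pipeline easier to maintain and to run under continuous integration than a single block of nested conditional loops.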
@@ -427,7 +427,7 @@ Other parts of the original experiments were then repeated. The authors fortunat
All experiments come with a configuration entry of hyperparameters in the \texttt{experiment\_configs.csv} file. This file not only controls the neural network architecture used in an experiment, it also documents important details such as hidden layer sizes and learning rates. This level of documentation and parametrization is highly conducive to replaying experiments the way they were originally performed.
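A minimal sketch of how such a configuration file supports replaying an experiment is shown below; the experiment identifier and column names are hypothetical and do not reflect the actual schema of \texttt{experiment\_configs.csv}:
\begin{verbatim}
import csv

# Each row fully describes one experiment, so replaying it amounts to looking
# up that row and feeding the recorded hyperparameters back into the training code.
with open("experiment_configs.csv", newline="") as handle:
    configs = {row["experiment_id"]: row for row in csv.DictReader(handle)}

config = configs["baseline_lstm"]                # hypothetical experiment id
hidden_size = int(config["hidden_layer_size"])   # hypothetical column names
learning_rate = float(config["learning_rate"])
print("Rebuilding model with %d hidden units, learning rate %g"
      % (hidden_size, learning_rate))
\end{verbatim}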
Since the authors included the best performing epochs of their original training, we opted to rerun the evaluation on the test set of the SWdA transcription corpus. Both programs involved worked out of the box, since the whole codebase makes an effort to use relative paths when referring to data files or cached models. This made switching the Jupyter Notebook used for analysis a matter of pointing a single directory from the original repository data to that of our new run. We then compared parts of this output with the original outcome.
%\begin{table}[h!]
%\begin{tabularx}{\linewidth}{p{3.1cm}|XXXXXXXXX}
@@ -507,7 +507,7 @@ The library and data for this project were generally very accessible. The resear
\subsubsection{Discussion of the reproducibility experiment} \label{FAIRdat}
% ss-5.1
Through the public GitHub repository and requirements documentation within the Python ecosystem we were able to reproduce most of the software environment that was used in the original experiments. Some details, such as GPU acceleration and other hardware dependent factors are subject to continuous improvement and cannot be reliably reproduced. By using compatible versions, a best effort was made to get close to the original setup within the reproduction setting.
All major parts of the analytical pipeline were well documented and the authors made visible efforts to comply with many principles of good scientific data management: Findability, Accessibility, Interoperability, and Reusability (FAIR) \footnote{\url{https://www.go-fair.org/fair-principles/}} \cite{wilkinson2016fair}. The system can be found in a public GitHub repository that presents an aggregation of all the necessary source code, documentation and most of the underlying research data that allows others to use and analyse the system. By packaging their resulting models and exposing a concise Application Programming Interface (API) to their library, the project facilitates reuse of the system as a whole in follow-up and related tasks. The project bundles sufficient instructions and programs to download all external data researchers might need in the context of the original experiments.
Much of the raw data that forms the basis of the experiments is widely available. While licensing prevents the project from including the raw voice recordings used to create the ASR models, the dataset is obtainable through reliable sources, and the extensive research that has already been performed on it indicates that it will likely remain accessible in the foreseeable future. The authors also provided trained models and the intermediate results they used at the time of publishing, which, in terms of reproducibility, might even be preferable over the raw data due to possible changes in the external ASR system that was used at the time. Additional research is encouraged by maintaining a copy of the Switchboard SWdA corpus itself in a separate repository\footnote{\url{https://github.com/julianhough/swda}}, without having to incorporate the full disfluency system as a dependency.
The system allowed us to set up a development environment in a short time and enabled us to independently reevaluate the models that were generated in the original experiments. The documentation, along with the scientific paper itself, provides enough information to gain familiarity with the codebase. While the project does not currently include an explicit description of semantic metadata, the library provides enough of an abstraction to be interoperable with any external data source.