Building reproducible analytical pipelines with R

Build reproducible analytical pipelines to output consistent, high-quality data products using R, Github and Docker. Lea


English Pages 522 Year 2023



Table of contents:
Welcome!
How using a few ideas from software engineering can help data scientists, analysts and researchers write reliable code
Preface
1 Introduction
1.1 Who is this book for?
1.2 What is the aim of this book?
1.3 Prerequisites
1.4 What actually is reproducibility?
1.4.1 Using open-source tools to build a RAP is a hard requirement
1.4.2 There are hidden dependencies that can hinder the reproducibility of a project
1.4.3 The requirements of a RAP
1.5 Are there different types of reproducibility?
2 Before we start
2.1 Essential knowledge
3 Project start
3.1 Housing in Luxembourg
3.2 Saving trapped data from Excel
3.3 Analysing the data
3.4 Your project is not done
3.4.1 How easy would it be for someone else to rerun the analysis?
3.4.2 How easy would it be to update the project?
3.4.3 How easy would it be to reuse this code for another project?
3.4.4 What guarantee do we have that the output is stable through time?
3.5 Conclusion
4 Version control with Git
4.1 Installing Git and opening a Github account
4.2 Git superbasics
4.3 Git and Github
4.4 Getting to know Github
4.5 Conclusion
5 Collaborating using Trunk-based development
5.1 Collaborating as a team
5.1.1 TBD basics
5.1.2 Handling conflicts
5.1.3 Make sure you blame the right person
5.1.4 Simplified trunk-based development
5.1.5 Conclusion
5.2 Contributing to public repositories
5.3 Further reading
6 Functional programming
6.1 Introduction
6.1.1 The state of your program
6.1.2 Predictable functions
6.1.3 Referentially transparent and pure functions
6.2 Writing good functions
6.2.1 Functions are first-class objects
6.2.2 Optional arguments
6.2.3 Safe functions
6.2.4 Recursive functions
6.2.5 Anonymous functions
6.2.6 The Unix philosophy applied to R
6.3 Lists: a powerful data-structure
6.3.1 Lists all the way down
6.3.2 Lists can hold many things
6.3.3 Lists as the cure to loops
6.3.4 Data frames
6.4 Functional programming in R
6.4.1 Base capabilities
6.4.2 purrr
6.4.3 withr
6.5 Conclusion
7 Literate programming
7.1 A quick history of literate programming
7.2 {knitr} basics
7.2.1 Set up
7.2.2 Markdown ultrabasics
7.3 Keeping it DRY
7.3.1 Generating R Markdown code from code
7.3.2 Tables in R Markdown documents
7.3.3 Parametrized reports
7.4 Conclusion
8 Conclusion of part 1
9 Rewriting our project
9.1 An Rmd for cleaning the data
9.2 An Rmd for analysing the data
9.3 Conclusion
10 Basic reproducibility: freezing packages
10.1 Recording packages’ version with {renv}
10.1.1 Daily {renv} usage
10.1.2 Collaborating with {renv}
10.1.3 {renv}’s shortcomings
10.2 Becoming an R-cheologist
10.3 Conclusion
11 Packaging your code
11.1 Benefits of packages
11.2 {fusen} quickstart
11.3 Turning our Rmds into a package
11.4 Including datasets
11.5 Installing and sharing the package
11.5.1 Code is hosted
11.5.2 Code cannot be hosted
11.5.3 Marketing your work
11.6 Conclusion
12 Testing your code
12.1 Unit testing
12.2 Assertive programming
12.3 Test-driven development
12.4 Code coverage
12.5 Conclusion
13 Build automation with targets
13.1 Introduction
13.2 {targets} quick-start
13.2.1 _targets.R’s anatomy
13.3 A pipeline is a composition of pure functions
13.4 Handling files
13.5 The dependency graph
13.6 Running the pipeline in parallel
13.7 {targets} and RMarkdown (or Quarto)
13.8 Rewriting our project as a pipeline and {renv} redux
13.9 Some little tips before concluding
13.9.1 Load every target at once
13.9.2 Get metadata information on your pipeline
13.9.3 Make a target (or the whole pipeline) outdated
13.9.4 Customize the network’s visualisation
13.9.5 Use targets from one pipeline in another project
13.9.6 Understanding this cryptic error message
13.10 Conclusion
14 Reproducible analytical pipelines with Docker
14.1 What is Docker?
14.2 A primer on Linux
14.3 First steps with Docker
14.4 The Rocker project
14.5 Dockerizing projects
14.6 Dockerizing development environments
14.6.1 Creating a base image for development
14.6.2 Sharing images through Docker Hub
14.6.3 Sharing a compressed archive of your image
14.7 Some issues of relying on Docker
14.7.1 The problems of relying so much on Docker
14.7.2 Is Docker enough?
14.8 Conclusion
15 Continuous integration and continuous deployment
15.1 CI/CD quickstart for R programmers (and others)
15.2 Running a RAP using Github Actions
15.3 Craft a dockerized dev env with GA
15.4 Run a RAP using a dockerized dev env on GA
15.5 Conclusion
16 Conclusion of part 2
17 The end
“So what?”
References
