Building reproducible analytical pipelines with R

Build reproducible analytical pipelines to output consistent, high-quality data products using R, Github and Docker. Lea

Language: English. Pages: 522. Year: 2023.

Author: Bruno Rodrigues

Table of contents:
Welcome!
How using a few ideas from software engineering can help data scientists, analysts and researchers write reliable code
Preface
1 Introduction
1.1 Who is this book for?
1.2 What is the aim of this book?
1.3 Prerequisites
1.4 What actually is reproducibility?
1.4.1 Using open-source tools to build a RAP is a hard requirement
1.4.2 There are hidden dependencies that can hinder the reproducibility of a project
1.4.3 The requirements of a RAP
1.5 Are there different types of reproducibility?
2 Before we start
2.1 Essential knowledge
3 Project start
3.1 Housing in Luxembourg
3.2 Saving trapped data from Excel
3.3 Analysing the data
3.4 Your project is not done
3.4.1 How easy would it be for someone else to rerun the analysis?
3.4.2 How easy would it be to update the project?
3.4.3 How easy would it be to reuse this code for another project?
3.4.4 What guarantee do we have that the output is stable through time?
3.5 Conclusion
4 Version control with Git
4.1 Installing Git and opening a Github account
4.2 Git superbasics
4.3 Git and Github
4.4 Getting to know Github
4.5 Conclusion
5 Collaborating using Trunk-based development
5.1 Collaborating as a team
5.1.1 TBD basics
5.1.2 Handling conflicts
5.1.3 Make sure you blame the right person
5.1.4 Simplified trunk-based development
5.1.5 Conclusion
5.2 Contributing to public repositories
5.3 Further reading
6 Functional programming
6.1 Introduction
6.1.1 The state of your program
6.1.2 Predictable functions
6.1.3 Referentially transparent and pure functions
6.2 Writing good functions
6.2.1 Functions are first-class objects
6.2.2 Optional arguments
6.2.3 Safe functions
6.2.4 Recursive functions
6.2.5 Anonymous functions
6.2.6 The Unix philosophy applied to R
6.3 Lists: a powerful data-structure
6.3.1 Lists all the way down
6.3.2 Lists can hold many things
6.3.3 Lists as the cure to loops
6.3.4 Data frames
6.4 Functional programming in R
6.4.1 Base capabilities
6.4.2 purrr
6.4.3 withr
6.5 Conclusion
7 Literate programming
7.1 A quick history of literate programming
7.2 {knitr} basics
7.2.1 Set up
7.2.2 Markdown ultrabasics
7.3 Keeping it DRY
7.3.1 Generating R Markdown code from code
7.3.2 Tables in R Markdown documents
7.3.3 Parametrized reports
7.4 Conclusion
8 Conclusion of part 1
9 Rewriting our project
9.1 An Rmd for cleaning the data
9.2 An Rmd for analysing the data
9.3 Conclusion
10 Basic reproducibility: freezing packages
10.1 Recording packages’ version with {renv}
10.1.1 Daily {renv} usage
10.1.2 Collaborating with {renv}
10.1.3 {renv}’s shortcomings
10.2 Becoming an R-cheologist
10.3 Conclusion
11 Packaging your code
11.1 Benefits of packages
11.2 {fusen} quickstart
11.3 Turning our Rmds into a package
11.4 Including datasets
11.5 Installing and sharing the package
11.5.1 Code is hosted
11.5.2 Code cannot be hosted
11.5.3 Marketing your work
11.6 Conclusion
12 Testing your code
12.1 Unit testing
12.2 Assertive programming
12.3 Test-driven development
12.4 Code coverage
12.5 Conclusion
13 Build automation with targets
13.1 Introduction
13.2 {targets} quick-start
13.2.1 _targets.R’s anatomy
13.3 A pipeline is a composition of pure functions
13.4 Handling files
13.5 The dependency graph
13.6 Running the pipeline in parallel
13.7 {targets} and RMarkdown (or Quarto)
13.8 Rewriting our project as a pipeline and {renv} redux
13.9 Some little tips before concluding
13.9.1 Load every target at once
13.9.2 Get metadata information on your pipeline
13.9.3 Make a target (or the whole pipeline) outdated
13.9.4 Customize the network’s visualisation
13.9.5 Use targets from one pipeline in another project
13.9.6 Understanding this cryptic error message
13.10 Conclusion
14 Reproducible analytical pipelines with Docker
14.1 What is Docker?
14.2 A primer on Linux
14.3 First steps with Docker
14.4 The Rocker project
14.5 Dockerizing projects
14.6 Dockerizing development environments
14.6.1 Creating a base image for development
14.6.2 Sharing images through Docker Hub
14.6.3 Sharing a compressed archive of your image
14.7 Some issues of relying on Docker
14.7.1 The problems of relying so much on Docker
14.7.2 Is Docker enough?
14.8 Conclusion
15 Continuous integration and continuous deployment
15.1 CI/CD quickstart for R programmers (and others)
15.2 Running a RAP using Github Actions
15.3 Craft a dockerized dev env with GA
15.4 Run a RAP using a dockerized dev env on GA
15.5 Conclusion
16 Conclusion of part 2
17 The end
“So what?”
References
