At the CAA international I gave a workshop on R package development together with Clemens Schmid (@nevromeCS) and Petr Pajdla (@petrpajdla). Here is a “short” run-down and some relevant links.
Clemens, Petr and I designed the workshop in cooperation with the Special Interest Group Scientific Scripting Languages in Archaeology of the CAA international, of which all three of us are members. All our slides (as *.Rmd) and material are available in the SIG’s GitHub repo: https://github.com/sslarch/caa2021_Rpackage_workshop.
Why is package development a good idea?
Packages are the best way to share R code and functions not just with other people, but also with our future selves. By creating a package we are forcing ourselves to work cleanly and to document our code. Using the “standard way” of an R package, other people know what to expect and how to handle our code. This speeds up our own future process and the whole process of science, as others won’t need to “re-invent our wheel” (I wrote a paper on this together with Ben Marwick: Tool-Driven Revolutions in Archaeological Science). R packages are also a great way to create reproducible workflows (see more or less anything from Ben on that topic 😉 , especially though the rrtools package).
Function writing for package development
We started with an introduction about function writing. “Why functions?” you might ask, “this is supposed to be about packages, right?” Well, usually you create a package, because you want to share your functions with other people. So, creating nice and sensible functions is the first step towards that.
I’m not gonna repeat Clemens’ excellent talk about the topic. He went into detail about the different ways functions can be created and how to think about creating the best functions for your needs. I’m just going to highlight three points that are (imho) most important for package development:
- If you repeat a task three times, create a separate function for it!
- Give functions default input values if there are sensible expectations.
- Always use the package::function() notation if you use a function from a different package (this is very important, library() won’t work!)
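To make the last two points a bit more tangible, here is a minimal sketch. The function name coef_var and its arguments are made up for illustration and are not from the workshop material.

```r
# Hypothetical example function: coefficient of variation of a numeric vector.
coef_var <- function(x, na.rm = TRUE) {
  # na.rm has a sensible default, so most callers never need to set it.
  # sd() lives in the 'stats' package, so it is called as stats::sd()
  # instead of relying on library(stats) having been run beforehand.
  stats::sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Works without setting na.rm, because the default ignores missing values.
coef_var(c(1, 2, 3, NA))
```

Inside a package, the package named before the :: would also be listed as a dependency in the DESCRIPTION file.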
In the discussion we agreed that there is no script or function small enough that it isn’t worth sharing with others. If you put the functions in an R package, we — your potential users — will know how to handle your contribution.
The R package structure
This was “my big part” in the workshop (see the slides here). A package can live in different stages, two of which you will already know for sure: An installed package is different than an attached package, because it is not yet loaded into your memory. Three other stages exist: source packages, binary packages and bundled packages.
A source package is “in development state”, it is actually nothing more than a folder directory in a certain structure. I will focus on this in a moment.
A bundled package is a source package that’s been compressed into a single file. The ending is *.tar.gz and it is sometimes called a “tarball” for this reason. Linux users see these a lot.
A binary package is also a single compressed file. This time though, the package has already been built for a specific operating system. For Windows: *.zip and for Mac: *.tgz.
install.packages() usually uses binary packages on Windows and Mac, and bundled packages on Linux.
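If the difference between “installed” and “attached” from the start of this section feels abstract, it can be checked in base R. This is just a small sketch; the two helper names is_installed and is_attached are my own and not part of any package.

```r
# Is the package present in one of the library folders on disk?
is_installed <- function(pkg) nzchar(system.file(package = pkg))

# Is the package on the search path, i.e. attached via library()?
is_attached <- function(pkg) paste0("package:", pkg) %in% search()

# 'tools' ships with every R installation, so it is always installed.
is_installed("tools")

# Attaching it puts "package:tools" on the search path.
library(tools)
is_attached("tools")
```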
R source package structure
Now, let’s get started! What does a source package look like?
You need, at the very least, three ingredients for your package: an R/ folder, a DESCRIPTION file and a NAMESPACE file.
The R/ folder is where your functions live. You will create R scripts in which you define the functions and put them in here. This is where everyone, including R, will know to find them.
The DESCRIPTION file is a plain-text file which gives all the metadata about the package: name, description, authors, the packages your package relies on (think dependencies!), … This file makes your package a package. It is really important!
The NAMESPACE file defines the functions you want to export, i.e. the ones your users will want to use. This file makes your package usable. You usually don’t need to write in there by hand, but will add to it via some documentation functions (see below).
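To show how little is behind these three ingredients, here is a sketch that writes them by hand with base R into a temporary folder. The package name mypkg, the function hello and all field values are invented for this example, and a real DESCRIPTION would also name the authors.

```r
# Build the skeleton in a temporary folder so nothing on disk is overwritten.
pkg_dir <- file.path(tempdir(), "mypkg")  # "mypkg" is a hypothetical name
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# R/: one script defining one (hypothetical) function.
writeLines(
  'hello <- function(name = "world") paste0("Hello, ", name, "!")',
  file.path(pkg_dir, "R", "hello.R")
)

# DESCRIPTION: plain-text metadata in "Field: value" form.
writeLines(c(
  "Package: mypkg",
  "Title: A Minimal Example Package",
  "Version: 0.0.1",
  "Description: Demonstrates the smallest possible package structure.",
  "License: GPL-3"
), file.path(pkg_dir, "DESCRIPTION"))

# NAMESPACE: the functions users are allowed to call.
writeLines("export(hello)", file.path(pkg_dir, "NAMESPACE"))

# Inspect the result: the folder tree and the parsed metadata.
list.files(pkg_dir, recursive = TRUE)
read.dcf(file.path(pkg_dir, "DESCRIPTION"))
```

In practice you would let helper functions create these files rather than typing them yourself, but it helps to know that a source package is nothing more than this folder.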
For your package development you will want to use some mighty wizards that will help in setting up the folder structure and some important basics. In the tutorial we explain functions from packages such as devtools and roxygen2 that are really helpful.
The workflow of creating an R package
The workflow of creating an R package is relatively simple, definitely easier than I would have thought. Check out the slides for the details, here are just the rough steps:
1. Create the basic structure with one of the helper functions mentioned above
2. Go to your package: set your working directory there or open the Rproj if you use RStudio
3. Edit the DESCRIPTION file in there
4. Create some functions, put them in the R folder
5. Add documentation (see below)
6. Run devtools::load_all(), which simulates installing and attaching your package
7. Test your functions
8. Find a problem
9. Tweak the functions
10. Repeat 6-9 … until you’re happy with the result
11. Run devtools::check() to check dependency issues, syntax in functions, package structure… a lot!
12. If no errors exist (otherwise fix those first), build or install your package using:
    - devtools::build(binary = FALSE) --> tar.gz (should be usable by anyone)
    - devtools::build(binary = TRUE) --> platform specific (*.zip or *.tgz) for your own platform
    - devtools::install() --> (re-)installs your package right away on your system, ready to be attached with library()
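Condensed into code, the workflow could look roughly like the sketch below. It assumes the usethis and devtools packages are installed. The wrapper function develop_package is hypothetical, and usethis::create_package() for the first step is my assumption here, since the text above does not name the function used in the workshop.

```r
# Sketch only: assumes the 'usethis' and 'devtools' packages are installed.
# Nothing runs until you call the function with a path of your choice.
develop_package <- function(path) {
  # Create the basic folder structure (assumed helper, see lead-in).
  usethis::create_package(path, open = FALSE)

  # ... edit DESCRIPTION and add your functions to the R/ folder here ...

  devtools::document(path)               # write help files and NAMESPACE
  devtools::load_all(path)               # simulate installing and attaching
  devtools::check(path)                  # dependencies, syntax, structure
  devtools::build(path, binary = FALSE)  # bundled package (*.tar.gz)
  devtools::install(path)                # install on your own system
}
```

In real life you would not run this in one go, of course: the load, test and tweak loop happens interactively, many times, before check() and build() come into play.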
And that’s it. You now have a package. It only works, though, if you remember to add some documentation. Generating the documentation will automatically write the information about which functions to export to your NAMESPACE file. So let’s have a quick look at the documentation.
Documenting an R package
I’m glad Petr took over this part, because he really had a clear and concise way to explain this. Check out his slides!
There are several aspects to the documentation of a package. It is, for example, highly recommended you create a README file. This is the usual place people will look to get some more information on your package. But it isn’t strictly necessary for the package to work. It is necessary though, that you document the functions, otherwise R will have trouble “finding” them.
We use the roxygen2 package for this purpose. It enables you to use a simple syntax for your documentation and then “transports” all the important information to the places where it belongs for your package to work, as well as to the help files. Let’s have a look at only the most important things.
After writing a function (or maybe even while?) you add roxygen comments in the R script with the special tag #' . Everything written on a line after this #' tag is recognised as a roxygen comment. You put these comments directly above your function definition, and the placement plays a role: the first line is the title of the documentation file. The second paragraph is a short description of the function.
With extra tags you will then define some important information:
- @export marks the functions you want to export (this is important for the NAMESPACE file!)
- @param gives information on the parameters you defined for your function. Very helpful for people to know, e.g., whether the input should be a data frame or a matrix or something else.
- @return describes the output of your function
There are some more useful tags, but these might be the most important ones. Now, after defining them, you run devtools::document() to generate the documentation. You can preview the documentation with ? followed by the function name to check whether all the information turns out right. Also check whether the functions you want to export have been transferred into the NAMESPACE file.
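Put together, a documented function might look like the sketch below. The function rescale01 and its parameters are invented for illustration. Because roxygen comments are ordinary R comments, the script runs fine on its own too.

```r
#' Rescale a numeric vector to the range 0 to 1
#'
#' Subtracts the minimum and divides by the range, so that the smallest
#' value becomes 0 and the largest becomes 1.
#'
#' @param x A numeric vector.
#' @param na.rm Logical; should missing values be ignored? Defaults to TRUE.
#' @return A numeric vector of the same length as x.
#' @export
rescale01 <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, 10))
```

With this file in the R/ folder of a package, running devtools::document() should add the line export(rescale01) to the NAMESPACE file and create the help page you can then open with ?rescale01.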
Cool, eh? You will now even have a help file for your function! But seriously, this is important. I don’t know how often I check the help for functions. And if you don’t want to be bombarded with questions about how to use your package and functions, write it down. People will need some guidance. You can also create a vignette to show some workflows around your package or give example code.
These are the most important steps.
There is a lot more you can dive into: we had a section on “fluffy context”, data in packages and some more advanced topics. But for this blog post, the “bare bones” of package development are enough. I recommend having a look at these topics nonetheless, if you want to develop your own package. And of course, don’t just read this blog post. Check out the workshop slides and maybe these sources:
We highly recommend the R Packages book by Hadley Wickham and Jenny Bryan and its online version at https://r-pkgs.org/ for some more context.
Also, there are the package development cheat sheets. You don’t need to remember everything by heart, I for sure don’t.
If you are ambitious, check out CRAN’s “Writing R Extensions” manual. It is really detailed and will help you if you want to release your package on CRAN one day. You don’t necessarily need to, though. Just uploading a package to a citable repository (maybe zenodo.org or OSF.io) will enable others to use it easily. Or use GitHub, but that’s another topic…
I hope this blog post inspired you to try and create packages from your scripts. It is not that difficult! I created my first package for the percolation analysis. Of course it took me some time, but I didn’t have this nice workshop I could follow… so my hope is, it made the process easier for you. Good luck!