7 Data, descriptors, details

7.1 What’s the pattern?

Function arguments should always come in the same order: data, then descriptors, then details.

  • Data arguments provide the core data. They are required, and are usually vectors and often determine the type and size of the output. Data arguments are often called data, x, or y.

  • Descriptor arguments describe essential details of the operation, and are usually required.

  • Details arguments control the details of the function. These arguments are optional (because they have default values), and are typically scalars (e.g. na.rm = TRUE, n = 10, prop = 0.1).

A standard argument order makes it easier to understand a function at a glance, and this order implies that required arguments always come before optional arguments.

Related patterns:

  • ... can play the role of the data argument (i.e. when there are an arbitrary number of inputs), as in paste(). This pattern is best using sparingly, and is described in more detail in Chapter 19.

  • ... can also be used to capture details arguments and pass them on to other functions. See Chapters 18 and 21 to how to use ... as safely as possible in this situation.

  • If the descriptor has a default value, I think you should inform the user about it, as in Chapter 15.

7.2 What are some examples?

  • mean() has one data argument (x) and two details (trim and na.rm).

  • The mathematical (+, -, *, /, …) and comparison (<, >, ==, …) operators have two data arguments.

  • ifelse() has three data arguments (test, yes, no).

  • merge() has two data arguments (x, y), one descriptor (by), and a number of details (all, no.dups, sort, …).

  • rnorm() has no data arguments and three descriptors (n, mean, sd). mean and sd default to 0 and 1 respectively, which makes them feel more like details. I’d argue that they shouldn’t have defaults to make it more clear that they’re descriptors. This would have the side-effect of making rnrorm() more consistent with the other RNGs.

    In rt(n, df, ncp), however, I think ncp should default to 0 to make it clear that the non-centrality parameter is detail of the t-distribution, not a core part.

  • grepl() has one data argument (x), one descriptor (pattern), and a number of details (fixed, perl, ignore.case, …).

  • stringr::str_detect() has one data argument (string), one descriptor (pattern), and one detail argument (negate).

  • stringr::str_sub() has three data arguments (string, start, and end). You might wonder what makes start and end data arguments, and I admit it took me a while to figure this out too, but I think the crucial factor is that you can give a single string and multiple start/end positions:

    stringr::str_sub("Hello", 1:5, -1)
    #> [1] "Hello" "ello"  "llo"   "lo"    "o"

    If I was to write str_sub() today, I’d call the first argument x, and I wouldn’t give start and end default values.

  • ggplot2::ggplot() has one data argument (data) and one descriptor (mapping).

  • lm() has one data argument (data), one descriptor (formula), and many details (weights, na.action, method, …). Unfortunately formula comes before data. This is a historical accident, because putting all model variables into a data frame is a relatively recent innovation in the long life cycle of lm().

  • purrr::map() has one data argument (.x) and one descriptor (.f). purrr::map2() has two data arguments (.x, .y) and one descrptor (.f).

  • mapply() has any number of data arguments (…), one descriptor (FUN), and a number of details (SIMPLIFY, USE.NAMES, …). The descriptor comes before the data arguments.

  • At first glance it looks like the ggplot2 layer functions, like geom_point(), don’t obey this principle because the first argument is mapping (a descriptor) and the second is data (presumably a data argument). However, this is because ggplot2 doesn’t use the pipe. If it did (like ggplot1), the first argument would be the plot to modify, which is the data object in this case, because the output is also a plot. Here data acts a descriptor, because it modifies the behaviour of the layer.

    The argument order differs between the layers and ggplot(), because you more commonly specify the data for the plot, and the aesthetic mappings for the layers. This is a little confusing, but I think time has shown it to be a reasonable design decision.

7.3 Why is it important?

This convention makes it easy to understand the structure of a function at a glance: the most important arguments are always on the left hand side, and it’s obvious what arguments most affect the shape of the output. Strongly connecting the shape of the first argument to the shape of output is what makes dplyr (data frames), stringr (character vectors), and the map family (vectors) easier to learn. These families of functions represent transformations that preserve the shape while modifying the value. When combined with the pipe, this leads to code that focusses on the transformations, not the objects being transformed.

These argument types as also affect how you call a function. As discussed in Chapter 4, you should never name data arguments, and always name details arguments. This convention balances concision with readability.

7.4 How do I avoid the problem?

To avoid the problem, you have to carefully analyse the arguments to ensure that you correctly categorise each argument. It’s generally easy to tell the difference between a data argument and a details argument, particularly because data arguments are required and details arguments are optional. But it can be harder to distinguish between data and descriptor, or descriptor and details. This is partly because my categorisation is false trichotomy: there’s really more of a continuous gradient from absolutely required to totally optional than discrete steps. Nevertheless, I think these three categories are useful, and even if you don’t get it absolutely right every time, this framework will help you do better on average.

There are a couple of heuristics that you can also check for:

  • Are the arguments generally ordered from most important to least important? If an important argument comes before an unimportant argument, you may have assigned an argument to the wrong category. (Note that this ordering isn’t strict: sometimes it’s more important to organise related arguments together than to precisely order by importance.)

  • Do any arguments with defaults come before any arguments without defaults? This may be a sign that the argument order is wrong, or that you’ve assigned a default value to an required argument (See Chapter @ref(#def-required) for more details.)

7.5 How do I remediate past mistakes?

Generally, it is not possible to fix an exported function preserving both old behaviour and new behaviour. Typically, you will need to perform major surgery on the function arguments, and it will convey different conventions about which arguments should be named. This implies that you should deprecate the entire existing function and replace it with a new alternative. Because this is invasive to the user, it’s best to do sparingly: if the mistake is minor, you’re better off waiting until you’ve collected other problems before fixing it.

Take tidyr::gather(), for example. It has a number of problems with its design that made them hard to use. Relevant to this chapter, is that the argument order is wrong.