1 Unifying principles

The tidyverse is a language for solving data science challenges with R code. Its primary goal is to facilitate the conversation that a human has with a dataset via the medium of code, and we want to help dig a “pit of success” where the least-effort path trends towards a positive outcome. The primary tool to dig the pit is API design: by carefully considering the external interface to a function, we can help guide the user towards success. But it’s also necessary to have some high-level principles that guide how we think broadly about APIs, principles that we can use to “break ties” when other factors are balanced.

The tidyverse has four guiding principles:

  • It is human centered, i.e. the tidyverse is designed specifically to support the activities of a human data analyst.

  • It is consistent, so that what you learn about one function or package can be applied to another, and the number of special cases that you need to remember is as small as possible.

  • It is composable, allowing you to solve complex problems by breaking them down into small pieces, supporting a rapid cycle of exploratory iteration to find the best solution.

  • It is inclusive, because the tidyverse is not just the collection of packages, but it is also the community of people who use them.

These guiding principles are aspirational; they’re not always fully realised in current tidyverse packages, but we strive to make them so.

1.1 Human centered

Programs must be written for people to read, and only incidentally for machines to execute.

— Hal Abelson

Programming is a task performed by humans. To create effective programming tools we must explicitly recognise and acknowledge the role played by cognitive psychology. This is particularly important for R, because it’s a language that’s used primarily by non-programmers, and we want to make it as easy as possible for first-time programmers to learn the tidyverse. A particularly useful tool is “cognitive load theory” [1]: we have a limited working memory, and anything we can do to reduce extraneous cognitive load helps the learner and user of the tidyverse. This motivates the next two principles:

  • By being consistent you only need to learn and internalise one expression of an idea, and then you can apply that many times.

  • By being composable you can break down complex problems into bite sized pieces that you can easily hold in your head.

1.2 Consistent

A system should be built with a minimum set of unchangeable parts; those parts should be as general as possible; and all parts of the system should be held in a uniform framework.

— Daniel H. H. Ingalls

If there’s one overarching goal of the tidyverse, it’s to be consistent. We want to find the smallest possible set of key ideas and use them again and again. This is important because it makes the tidyverse easier to learn and remember.

(Another framing of this principle is “Less Volume, More Creativity”, which comes from Mike McCarthy, the head coach of the Green Bay Packers, and was popularised in statistics education by Randall Pruim.)

This is related to one of my favourite sayings from the Python community:

There should be one—and preferably only one—obvious way to do it.

— Zen of Python

This is a fairly restrictive philosophy, but it’s workable with the tidyverse because the tidyverse is not R; you can always step outside the tidyverse and attack the problem a different way.

The principle of consistency reveals itself in two primary ways: in function APIs and in data structures. The API of a function defines its external interface (independent of its internal implementation). Having consistent APIs means that each time you learn a function, learning the next function is a little easier; once you’ve mastered one package, mastering the next is easier.

There are two ways of making functions consistent that are so important that they’re explicitly pulled out as high-level principles below:

  • Functions should be composable: each individual function should tackle one well contained problem, and you solve complex real-world problems by composing many individual functions.

  • Overall, the API should feel “functional”, which is a technical term for the programming paradigm favoured by the tidyverse.

But consistency also applies to data structures: we want to ensure we use the same data structures again and again and again. Principally, we expect data to be stored in tidy data frames or tibbles. This means that tools for converting other formats can be centralised in one place, and that package development is simplified by assuming that data is already in a standard format.

We want to avoid “Norman doors” where the exterior clues and cues point you in the wrong direction.

1.3 Composable

No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system.

— Hal Abelson

A powerful strategy for solving complex problems is to combine many simple pieces. Each piece should be easily understood in isolation, and have a standard way of combining with other pieces.

Within the tidyverse, we prefer to compose functions using a single tool: the pipe, %>%. There are two notable exceptions to this principle: ggplot2 composes graphical elements with +, and httr composes requests primarily through .... These are not bad techniques in isolation, and they are well suited to the domains in which they are used, but the disadvantages of inconsistency outweigh any local advantages.
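As a minimal sketch of composing with a pipe, the example below chains three small steps over the built-in mtcars dataset. It uses base R's native pipe |> (available from R 4.1) as a stand-in for %>% so that it runs without any packages. The helper names keep_efficient(), add_kpl(), and top_by_kpl() are hypothetical, made up for this illustration.

```r
# Each helper takes a data frame as its first argument and returns a
# data frame, so any of them can follow any other in a pipeline.
keep_efficient <- function(df, min_mpg) {
  df[df$mpg >= min_mpg, , drop = FALSE]
}

add_kpl <- function(df) {
  # 1 mile per US gallon is roughly 0.425 km per litre
  df$kpl <- df$mpg * 0.425144
  df
}

top_by_kpl <- function(df, n) {
  head(df[order(df$kpl, decreasing = TRUE), , drop = FALSE], n)
}

# Read top to bottom: filter, then add a column, then keep the top rows.
piped <- mtcars |>
  keep_efficient(min_mpg = 25) |>
  add_kpl() |>
  top_by_kpl(n = 3)

# The same computation without the pipe has to be read inside-out.
nested <- top_by_kpl(add_kpl(keep_efficient(mtcars, min_mpg = 25)), n = 3)

identical(piped, nested)
```

Because every step shares the same shape (data frame in, data frame out), steps can be added, removed, or reordered without touching the others.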

For smaller domains, this means carefully designing functions so that the inputs and outputs align (e.g. the output from stringr::str_locate() can easily be fed into str_sub()). For middling domains, this means drawing many feature matrices and ensuring that they are dense (e.g. consider the map family in purrr). For larger domains, this means carefully thinking about algebras and grammars, identifying the atoms of a problem and the ways in which they might be composed to solve bigger problems.
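The alignment between str_locate() and str_sub() can be sketched as follows, assuming the stringr package is installed. str_locate() returns a two-column matrix of start and end positions, and str_sub() accepts such a matrix as its start argument, so the output of one feeds the other without any reshaping. The sample strings are made up for illustration.

```r
library(stringr)

words <- c("stream", "queue", "tidy")

# Locate the first run of vowels in each string: the result is an
# integer matrix with one row per string and columns "start" and "end".
loc <- str_locate(words, "[aeiou]+")

# Feed the matrix straight back in to extract the located spans.
first_vowels <- str_sub(words, loc)
first_vowels
```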

We decompose large problems into smaller, more tractable ones by creating and combining functions that transform data rather than by creating objects whose state changes over time.

Other techniques that tend to facilitate composability:

  • Functions are data: this leads to some of the most impactful techniques for functional programming, which allow you to reduce code duplication.

  • Immutable objects: these enforce independence between components.

  • Partition side-effects.

  • Type-stable.
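Some of these techniques can be illustrated in base R, with no packages assumed. The sketch below shows a function passed around as data to avoid duplication, a transformation that returns a new value and leaves its input untouched, and the difference between a type-stable and a type-unstable function. The names rescale01() and scores are hypothetical.

```r
# Functions are data: pass rescale01 to lapply() instead of
# copy-pasting the same expression once per column.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

scores <- data.frame(a = c(1, 5, 9), b = c(10, 20, 40))
rescaled <- as.data.frame(lapply(scores, rescale01))

# Immutability in practice: the transformation produced a new data
# frame, and the original scores object is unchanged.
scores$a
rescaled$a

# Type stability: vapply() always returns the declared type, whereas
# sapply() changes its return type when the input is empty.
empty <- list()
stable <- vapply(empty, length, integer(1))
unstable <- sapply(empty, length)
class(stable)
class(unstable)
```

Here the stable result is an integer vector of length zero, while the unstable one is an empty list, which is the kind of surprise that makes a function harder to compose.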

1.4 Inclusive

We value not just the interface between the human and the computer, but also the interface between humans. We want the tidyverse to be a diverse, inclusive, and welcoming community.

  • We develop educational materials that are accessible to people with many different skill levels.

  • We prefer explicit codes of conduct.

  • We create safe and friendly communities. We believe that kindness should be a core value of communities.

  • We think about how we can help others who are not like us (they may be visually impaired or may not speak English).

We also appreciate the paradox of tolerance: the only people that we do not welcome are the intolerant.

  1. A good practical introduction is Cognitive load theory in practice (PDF).