-- [1] "a is letter #1 of the alphabet"
-- [2] "b is letter #2 of the alphabet"
-- [3] "c is letter #3 of the alphabet"
-- [4] "d is letter #4 of the alphabet"
-- [5] "e is letter #5 of the alphabet"
-- [6] "f is letter #6 of the alphabet"
-- [7] "g is letter #7 of the alphabet"
-- [8] "h is letter #8 of the alphabet"
-- [9] "i is letter #9 of the alphabet"
-- [10] "j is letter #10 of the alphabet"
-- [11] "k is letter #11 of the alphabet"
-- [12] "l is letter #12 of the alphabet"
-- [13] "m is letter #13 of the alphabet"
-- [14] "n is letter #14 of the alphabet"
-- [15] "o is letter #15 of the alphabet"
-- [16] "p is letter #16 of the alphabet"
-- [17] "q is letter #17 of the alphabet"
-- [18] "r is letter #18 of the alphabet"
-- [19] "s is letter #19 of the alphabet"
-- [20] "t is letter #20 of the alphabet"
-- [21] "u is letter #21 of the alphabet"
-- [22] "v is letter #22 of the alphabet"
-- [23] "w is letter #23 of the alphabet"
-- [24] "x is letter #24 of the alphabet"
-- [25] "y is letter #25 of the alphabet"
-- [26] "z is letter #26 of the alphabet"
R, for all its warts, has most of the features I want from a data science language. It’s powerful, surprisingly versatile, and usually fun to use. But, like all languages, it is neither perfect nor likely to be widely used forever. (I doubt it will enjoy - if that is the right word - the endless afterlife of COBOL and its ilk). So I hope the (distant!) future will see statistical languages that replicate and refine R’s strengths while improving its weaknesses. What should those languages look like?
To discuss a next-generation language, we need to establish what makes R so great to begin with. On reflection, I identified three key ingredients:
- Vector types and vectorized functions. As John Chambers says, if it exists in R, it’s a vector. R doesn’t have any true scalar types; there are only vectors of varying lengths. The rationale is obvious: converting between scalar and vector types would add complexity for little gain and make analysis and data tidying a pain. Anyone who’s ever spent an hour deriving the ordinary least squares estimators by elementary algebra and calculus, and then done it in a few lines with linear algebra, will know what I mean.
But vectorization has benefits beyond mathematical convenience. (For now, let’s use Hadley Wickham’s working definition of a vectorized function: \(f(x[[i]]) = f(x)[[i]]\)). It abstracts away the iteration involved in operations, freeing you to think of functions as acting on each element independently. This results in compact, readable code:
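# one way to produce the labelled strings below (the exact call is a reconstruction from its output)
paste0(letters, " is letter #", seq_along(letters), " of the alphabet")
-- [1] "a is letter #1 of the alphabet"
-- [2] "b is letter #2 of the alphabet"
-- [3] "c is letter #3 of the alphabet"
-- [4] "d is letter #4 of the alphabet"
-- [5] "e is letter #5 of the alphabet"
-- [6] "f is letter #6 of the alphabet"
-- [7] "g is letter #7 of the alphabet"
-- [8] "h is letter #8 of the alphabet"
-- [9] "i is letter #9 of the alphabet"
-- [10] "j is letter #10 of the alphabet"
-- [11] "k is letter #11 of the alphabet"
-- [12] "l is letter #12 of the alphabet"
-- [13] "m is letter #13 of the alphabet"
-- [14] "n is letter #14 of the alphabet"
-- [15] "o is letter #15 of the alphabet"
-- [16] "p is letter #16 of the alphabet"
-- [17] "q is letter #17 of the alphabet"
-- [18] "r is letter #18 of the alphabet"
-- [19] "s is letter #19 of the alphabet"
-- [20] "t is letter #20 of the alphabet"
-- [21] "u is letter #21 of the alphabet"
-- [22] "v is letter #22 of the alphabet"
-- [23] "w is letter #23 of the alphabet"
-- [24] "x is letter #24 of the alphabet"
-- [25] "y is letter #25 of the alphabet"
-- [26] "z is letter #26 of the alphabet"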
In base Python or most other languages, this would require a for
loop that kept track of letters and indices, resulting in less readable code and a greater likelihood of mistakes. Better still, R features convenience functions like colMeans
that operate at a higher level of abstraction: data frames or arrays, which are versatile generalizations of simple atomic vectors. These capabilities let you ignore implementation details of iteration and write nicely abstract code.
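For instance, a single call summarizes whole columns at once (a small sketch of my own; output omitted):
colMeans(mtcars[, c("mpg", "hp", "wt")])   # one mean per column, with no explicit loop or index bookkeeping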
Vectorization is hardly unique to R, but I don’t know of another language
as fundamentally vector-oriented. Our ideal successor language should emulate R in this area.
- Expressive data manipulation
Too often, the actual “science” of data science, like dessert after a big feast, is dwarfed by what came before: data tidying, missing value imputation, transformation, and
everything else required to get messy input into a form that can be analyzed. If a data pipeline doesn’t exist, this can become far more daunting than the analysis itself. No language is better suited for the job than R. A skilled user can achieve even elaborate transformations in ten or twenty lines. With practice, the feeling of power becomes almost addictive. Using another language feels like putting on heavy gloves before tying your shoelaces.
R’s expressive, powerful data manipulation interface grants it this power. It also makes R hard to learn. You can often find five or six obvious, correct ways to do even a simple task, like obtaining the fourth element of the mtcars column cyl:
mtcars$cyl[[4]]
-- [1] 6
mtcars[[c(2, 4)]]
-- [1] 6
mtcars[4, "cyl"]
-- [1] 6
mtcars[["cyl"]][[4]]
-- [1] 6
mtcars[rownames(mtcars)[[4]], "cyl"]
-- [1] 6
A successor to R might develop a smaller set of operators and smooth out some oddities (like drop = FALSE). But it should not go too far in this direction. Emphasizing readability and separating tasks into different functions, as dplyr has done, would make code more readable and easier to debug, but also more verbose (compare the sketch below). Too radical a departure from R’s approach would fail to replicate what makes it special.
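To see the tradeoff, compare a terse base idiom with a dplyr-style equivalent (a sketch of mine; it assumes dplyr is installed, and the native |> pipe needs R >= 4.1):
mtcars[mtcars$cyl == 6 & mtcars$mpg > 20, c("mpg", "wt")]               # base: compact, a little cryptic
mtcars |> dplyr::filter(cyl == 6, mpg > 20) |> dplyr::select(mpg, wt)   # dplyr: readable, more verbose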
- Metaprogramming
The other two areas I identified are widely cited as strengths of R. This one, though, is more esoteric. Almost all R users take advantage of the features that power metaprogramming, many without knowing it, but few use them extensively, and it’s easy (and sometimes advisable) even for experienced users to avoid invoking them directly. Still, metaprogramming distinguishes R from most other languages, and rests on bold design decisions made long before the language’s inception.
“Metaprogramming”, as used in the R community, means writing programs that treat R code as data - programming on programs, in other words. It utilizes R’s highly developed capabilities for partial expression substitution, controlled evaluation, and environment manipulation. Books could be written about this topic, and Advanced R covers it in detail.
As a basic example, have you ever wondered why most calls to library
in R scripts look like library(package)
, not library("package")
? The latter is legal, but seldom used. Most functions will throw an error if passed the name of a nonexistent object:
c("a", "b", "c", d)
-- Error in eval(expr, envir, enclos): object 'd' not found
But certain functions capture their inputs directly, without evaluating them, and then evaluate them in a different context. This is called “quoting”, since it captures the syntax of code while ignoring its semantics, much as quoting does in natural language. The implementation, known as non-standard evaluation, powers much of R’s interface. One prominent example is formulas: a compact mini-language for specifying a statistical relationship to modeling functions. Because the formula is quoted and evaluated in the context of a data frame, the user can provide bare variable names, making for a clean, simple interface:
lm(mpg ~ wt + cyl * disp, data = mtcars)
--
-- Call:
-- lm(formula = mpg ~ wt + cyl * disp, data = mtcars)
--
-- Coefficients:
-- (Intercept) wt cyl
-- 49.55195 -2.73695 -3.00543
-- disp cyl:disp
-- -0.08670 0.01107
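The same machinery is available to ordinary code. A minimal sketch using base quote() and eval() (the expression here is my own example):
e <- quote(mpg / wt)   # capture the expression itself, unevaluated
eval(e, mtcars)        # evaluate it with the data frame's columns standing in as variables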
The tidyverse
takes this idea much further. Its functions rely on tidy evaluation, an elaborate framework for selecting and modifying variables within “data masks.” In the end, R is really a statistics-oriented descendant of Lisp with more conventional syntax. Many of these ideas - expressions as data, expression substitution, and even optional prefix syntax - come from that immortal language.
`+`(2, 2)
-- [1] 4
All this power comes with serious drawbacks - serious enough that it can reasonably be argued that non-standard evaluation is a bad paradigm. Manipulating expressions means code loses referential transparency (the guarantee that behavior stays the same when a variable is replaced by its value or renamed). Controlled evaluation requires programmers to think about environment inheritance, creating the potential for a host of subtle bugs. Functions that quote some of their arguments but not others, or that accept quoted and unquoted forms of the same argument (like library), are harder to use. In the end, all this indirection makes code harder to write and reason about (hence the need for an entire vignette on programming with dplyr). I think the tradeoff is worthwhile; the convenience and flexibility of non-standard evaluation are too valuable to abandon. But unlike the other two characteristics I outlined above, a strong case can be made otherwise.
In short, a successor to R should contain R’s most powerful features: vector types and vectorized functions, a terse but expressive subsetting syntax, and support for expression manipulation and controlled evaluation.
Improving on R’s Weaknesses
R is not without faults. The problems listed below are more annoying than serious, but they stem from design decisions made long ago that can no longer be easily reversed. A successor language should avoid those mistakes.
Finicky Interface
R’s user interface is, in places, harder to learn and use than necessary. It applies conventions inconsistently, exposes too much detail to the user, and contains too many “gotchas” that cause confusing errors you can only avoid with experience.
One of the unwritten rules of programming is that inconsistency should not exist without reason. If you write a class Foo
with methods called bar_bar
, baz_baz
, and quxQux
, your users will wonder why you used camelCase for just one method every time they try to call the logically expected but nonexistent qux_qux
. If you put a data frame argument at the head of one function’s argument list but the tail of another’s, they will wonder why every time they forget which is which. Avoiding inconsistencies like these takes constant attention during design, but the best designs manage it.
R violates the principle in many places. One trivial but well-known example is the way S3 methods are written generic.class
(e.g., mean.default
), yet dots are used all the time in the names of functions, including S3 generics. The many exceptions (t.test
, all.vars
, …) thwart a potentially useful convention. Unlike the other functionals, mapply
has the function as the first argument, not the second, and the simplify
and use.names
arguments are
actually SIMPLIFY
and USE.NAMES
(not without reason, but good luck remembering). ave
and tapply
do similar things, but ave
uses ...
for grouping factors, while tapply
reserves it for arguments passed on to FUN. Once you notice one of these seams in the design, you can’t unsee it.
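A few quick illustrations of these seams (my own examples):
mapply(rep, 1:3, 3:1, SIMPLIFY = FALSE)           # the function comes first here, and the options are uppercase
ave(mtcars$mpg, mtcars$cyl)                       # in ave, ... holds the grouping factors
tapply(mtcars$mpg, mtcars$cyl, mean, trim = 0.1)  # in tapply, ... is passed on to FUN instead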
R sometimes contains unnecessary complexity. Interfaces often have complicated semantics, and functions sometimes feature multiple operating modes. For instance, there are two slightly different functions for principal components analysis (prcomp and princomp), differing in the algorithm used. The function diag
has four distinct uses (five, if you count diag<-
as part of the same interface). Most troubling to me are the heavily overloaded arguments of certain functions. Consider this passage from the help for get
:
The ‘pos’ argument can specify the environment in which to look
for the object in any of several ways: as a positive integer (the
position in the ‘search’ list); as the character string name of an
element in the search list; or as an ‘environment’ (including
using ‘sys.frame’ to access the currently active function calls).
The default of ‘-1’ indicates the current environment of the call
to ‘get’. The ‘envir’ argument is an alternative way to specify an
environment.
I count three possible types for pos
, all with different meanings, a default value with a special meaning, and another argument that does exactly the same thing for one type. (Plus a suggestion to use call stack introspection, which I’ll leave to braver programmers than me).
Trying to memorize the intricacies of an interface like this is a fool’s errand: at some point, you’ll get it wrong and cause a nasty bug. That leaves no recourse but referring to the documentation each time you use the function, and nothing makes an interface more annoying to use.
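To make that concrete, here are the overlapping ways to point get at the same place (a sketch of mine, assuming it runs at the top level):
x <- 42
get("x", pos = 1)              # a positive integer: position 1 in search() is the global environment
get("x", pos = ".GlobalEnv")   # the character name of that same search-list entry
get("x", envir = globalenv())  # or skip pos entirely and hand get an environment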
Another offender is factors. Factors represent categorical variables by mapping integer codes to levels. Simple idea, but so many potential errors come from this fact. Something as simple as naively concatenating a factor causes disaster:
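x <- factor(c("a", "b", "c"))   # a small factor (a reconstructed example, reused below)
c(x, "d")                       # before R 4.1, the factor silently decays to its integer codes
-- [1] "1" "2" "3" "d"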
Attempting factor arithmetic only triggers a warning, despite being nonsense (note also that the factor warning preempts the “mismatched object lengths” warning this would normally trigger):
x + 3:6
-- [1] NA NA NA NA
Worst of all, and not widely known: R’s lexical sort order differs by system locale. (See here for an example). When creating a factor, R defaults to ordering the levels lexically. Good luck with that reproducible research!
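For instance (locale names vary by platform, so treat this as a sketch):
x <- c("a", "B", "c")
Sys.setlocale("LC_COLLATE", "C")
sort(x)      # "B" "a" "c" - plain byte order puts capitals first
Sys.setlocale("LC_COLLATE", "en_US.UTF-8")
sort(x)      # "a" "B" "c" - linguistic collation interleaves cases
factor(x)    # the default level order inherits whichever collation happens to be active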
Individually, these criticisms are trivial. I don’t mean to cast them as evidence of incompetence or carelessness by the language designers. I have written much worse interfaces to far simpler programs, so I know from experience how hard it is to implement and maintain a good one. But our successor language can do better by following the tidyverse
and making “design for human users” a core principle.
Very Weak Typing
Our new language should have dynamic typing. Static typing makes code easier to reason about and debug, especially in large applications, but it would be awkward to explore or transform data without quick, easy type conversions that can be done interactively. In its present form, I think R makes these conversions a little too easy. R is a weakly typed language: instead of disallowing operations with objects of disparate types, it casts them to a common type. Sometimes the result is predictable:
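# the benign kind of coercion (examples of my own choosing)
TRUE + 1L                      # logical quietly promoted to integer: 2
paste(1:3, c("a", "b", "c"))   # numbers become strings: "1 a" "2 b" "3 c"
c(1, "2")                      # the whole vector is promoted to character: "1" "2"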
But sometimes R will allow operations that have no sensible result:
paste0(mtcars, "abc")
-- [1] "c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7, 15, 21.4)abc"
-- [2] "c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4)abc"
-- [3] "c(160, 160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6, 167.6, 275.8, 275.8, 275.8, 472, 460, 440, 78.7, 75.7, 71.1, 120.1, 318, 304, 350, 400, 79, 120.3, 95.1, 351, 145, 301, 121)abc"
-- [4] "c(110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, 150, 150, 245, 175, 66, 91, 113, 264, 175, 335, 109)abc"
-- [5] "c(3.9, 3.9, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.07, 2.93, 3, 3.23, 4.08, 4.93, 4.22, 3.7, 2.76, 3.15, 3.73, 3.08, 4.08, 4.43, 3.77, 4.22, 3.62, 3.54, 4.11)abc"
-- [6] "c(2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44, 3.44, 4.07, 3.73, 3.78, 5.25, 5.424, 5.345, 2.2, 1.615, 1.835, 2.465, 3.52, 3.435, 3.84, 3.845, 1.935, 2.14, 1.513, 3.17, 2.77, 3.57, 2.78)abc"
-- [7] "c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3, 18.9, 17.4, 17.6, 18, 17.98, 17.82, 17.42, 19.47, 18.52, 19.9, 20.01, 16.87, 17.3, 15.41, 17.05, 18.9, 16.7, 16.9, 14.5, 15.5, 14.6, 18.6)abc"
-- [8] "c(0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1)abc"
-- [9] "c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)abc"
-- [10] "c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4)abc"
-- [11] "c(4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8, 2)abc"
Moreover, R has no equivalent of Python’s type hinting system. If you want to enforce a specific type for function arguments, you have to do it manually:
foo <- function(x, y, z) {
  if (!is.character(x)) {
    stop("x must be character")
  }
}
Many of the type-checking helpers like is.character
have surprisingly complex behaviors that make them dangerous to rely on.
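Two examples of the kind of surprise I mean (my picks):
is.vector(structure(1:3, units = "cm"))   # FALSE - any attribute other than names disqualifies it
is.numeric(factor(1:3))                   # FALSE, even though factors are stored as integers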
R functions also do not always have stable return types. sapply
, for example, can return a list, an array, an atomic vector, or even an empty list, depending on the input. Programming guides often recommend lapply
or vapply
instead, since they enforce stable return types, but many users who did not know this (including me, at various times) have written subtly buggy code.
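A quick demonstration of how the input drives the return type (my own examples):
sapply(1:3, function(i) rep(i, i))          # ragged results: a list
sapply(1:3, function(i) rep(i, 2))          # rectangular results: a 2 x 3 matrix
sapply(1:3, identity)                       # scalar results: an atomic vector
sapply(integer(0), identity)                # zero-length input: list()
vapply(integer(0), identity, integer(1))    # vapply keeps its promise: integer(0)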
R’s very weak typing accounts for much of the unpredictable behavior that makes it challenging to use in large applications. I think strict typing like Python’s would be excessive; operations like paste(1:10, letters[1:10])
are too convenient to part with. But our successor language will dispense with some of the crazier implicit coercions R allows.
String Manipulation
R’s string manipulation facilities leave something to be desired. In other languages, strings are array types or feature array-like subsetting. R, however, handles strings (i.e., the raw character data that make up the elements of character vectors) with an internal type. You can’t extract string elements the way you can in Python:
= "A typical string"
x 0] x[
-- 'A'
You have to use substr
or substring
(barely distinguishable functions again!)
x <- "A typical string"
substr(x, 1, 1)
-- [1] "A"
The rationale is obvious - the unpalatable alternative would be to implement character vectors as list-like recursive vectors - but it has annoying consequences for the interface, such as strsplit
returning a list:
strsplit(c("This is a typical", "character vector"), " ")
-- [[1]]
-- [1] "This" "is" "a" "typical"
--
-- [[2]]
-- [1] "character" "vector"
But these are quibbles. The real problem is the regular expression interface. This is the only part of base R I actively dislike. There are too many functions with terse, barely distinguishable names. (If you can remember the difference between gregexpr
and regexec
without looking it up, please teach me your secrets). Functions don’t use PCRE by default, a fact I never remember until it causes an error. They return match data in awkward formats; gregexpr
, for instance, returns a list of match start positions and lengths, making it difficult to extract the actual match data.
Put together, these issues make working with regular expressions much more verbose and painful than necessary. The convoluted snippet below, copied from the documentation, does nothing more than create a matrix with the text from two capture groups. For comparison, Python’s re module gives each match object a groupdict method that stores named-group matches in an appropriate data structure automatically.
notables <- c(
" Ben Franklin and Jefferson Davis",
"\tMillard Fillmore"
)
# name groups 'first' and 'last'
name.rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
(parsed <- regexpr(name.rex, notables, perl = TRUE))
-- [1] 3 2
-- attr(,"match.length")
-- [1] 12 16
-- attr(,"index.type")
-- [1] "chars"
-- attr(,"useBytes")
-- [1] TRUE
-- attr(,"capture.start")
-- first last
-- [1,] 3 7
-- [2,] 2 10
-- attr(,"capture.length")
-- first last
-- [1,] 3 8
-- [2,] 7 8
-- attr(,"capture.names")
-- [1] "first" "last"
gregexpr(name.rex, notables, perl = TRUE)[[2]]
-- [1] 2
-- attr(,"match.length")
-- [1] 16
-- attr(,"index.type")
-- [1] "chars"
-- attr(,"useBytes")
-- [1] TRUE
-- attr(,"capture.start")
-- first last
-- [1,] 2 10
-- attr(,"capture.length")
-- first last
-- [1,] 7 8
-- attr(,"capture.names")
-- [1] "first" "last"
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if (result[i] == -1) {
return("")
}
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
parse.one(notables, parsed)
-- first last
-- [1,] "Ben" "Franklin"
-- [2,] "Millard" "Fillmore"
import re
= ["Ben Franklin and Jefferson Davis", "\tMillard Fillmore"]
notables ".*(?P<first>[A-Z][a-z]+).*(?P<last>[A-Z][a-z]+)", x).groupdict() for x in notables] [re.match(
-- [{'first': 'Jefferson', 'last': 'Davis'}, {'first': 'Millard', 'last': 'Fillmore'}]
(Is Fillmore’s inclusion a sly joke? He is chiefly notable for being a bottom-tier president).
The excellent stringr
package provides functions that fix all of these problems. But R users shouldn’t have to choose between taking on a major dependency and forgoing easy string processing.
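For contrast, the capture-group example above collapses to a single call (assuming stringr is available):
stringr::str_match(notables, name.rex)   # a character matrix: the full match plus one column per capture group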
Summing Up
You should have a clear idea by now of the language I want. It relies on vector types and makes it easy to manipulate data. It uses some form of non-standard evaluation and offers powerful metaprogramming tools to interested users. Its interface judiciously hides complexity and contains few discrepancies and special cases. With an easy-to-use package system and thorough documentation, it will rapidly gain users and establish a productive, long-lasting community.
That language sounds a lot like what the people behind tidyverse
have already created. tidyverse
expands and enhances R’s data manipulation capabilities, with particular attention to ease of use and rigorous implementation of non-standard evaluation. Perhaps most importantly, its developers update aggressively; they have made several complete overhauls of dplyr
’s interface over the past few years. This means lots of breaking changes that make tidyverse
infamously dangerous to use in production, but tidyverse
advances and develops new ideas much more quickly than R itself. I think the tradeoff is worthwhile.
It also sounds a little like Julia, a newer statistical language with metaprogramming support, vector types, and an emphasis on performance that is lacking in R. That emphasis, some have observed, gives it the potential to eliminate the “prototype in R/Python, program in C/C++” cycle that plagues machine learning research today. It has nowhere near R’s popularity or anything like its mature ecosystem, but users I’ve encountered speak highly of it. Will I be writing Julia ten years from now? Perhaps. But for now, R reigns supreme.
# TODO: Update this in 4 years to see how things shook out