---
title: "Ghost in the Machine: The Remnant of R's Past That Haunts it Still"
author: "Ryan Heslin"
date: "2022-06-12"
categories: ["R"]
params:
title: "Ghost in the Machine: The Remnant of R's Past That Haunts it Still"
---
As programming languages go, R isn't particularly old: its first stable release, version 1.0.0, came in early 2000 (see [https://www.stat.auckland.ac.nz/~ihaka/downloads/Massey.pdf](https://www.stat.auckland.ac.nz/~ihaka/downloads/Massey.pdf) for more details).
But as many users know, its roots
go back further. R was developed from the language S, created in the 1970s by a team led by John Chambers at Bell Labs.
Those were the glory days of Bell Labs, when the language C and the Unix ecosystem were developed. Like a modern palace built on the foundations of an ancient one,
R bears many traces of its lineage. Its syntax is very similar, many features are backward-compatible,
and the documentation for some functions even refers to resources about S rather than R
(try `?sum` for one example).
I can't help but pause here to relay the account the linked presentation gives of R's origins.
It all began with a hallway conversation between Ross Ihaka and Robert Gentleman at the University of Auckland around 1990:
<blockquote cite="https://www.stat.auckland.ac.nz/~ihaka/downloads/Massey.pdf">
Gentleman: “Let’s write some software.”
Ihaka: “Sure, that sounds like fun.”
</blockquote>
One of those traces, harder to observe but certainly still present, is also one of R's
most unusual (and, in some quarters, derided) features: an emphasis on convenience
in interactive use. Interpreted languages typically support interactivity in some way, since the ability to run a snippet of code and instantly get results is
one of their greatest advantages over compiled languages. But S was designed primarily for interactive data exploration, and R has retained that capability as a design focus.
In areas great and small, from core design
choices to implementation quirks, R makes it
as easy as possible to bang out code in the console and see what happens. That makes it a fast, flexible tool for exploring data and following hunches. It also strews mines in the path of anyone programming in the language without detailed knowledge of its nuances.
A few examples will make this painfully clear.
# Partial Matching, Complete Headache
Can you spot the problem with this call? It runs correctly:
```{r}
rep(1:3, length = 10)
```
but is missing something. The relevant argument of `rep` is actually called `length.out`, not `length`, but R's partial argument matching saves us, since `length` is a shortened form of `length.out`.
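For comparison, the fully spelled-out call gives the same result:

```{r}
rep(1:3, length.out = 10) # no partial matching needed
```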
This is nice to have when typing code in the console. But relying on partial argument matching in scripts is a _very_ bad idea.
Suppose you're working with a package that includes some functions with annoyingly long argument names. All that typing is tedious, so you decide you may as well save some keystrokes:
```{r}
foo <- function(xyzabc = 0, abcxyz) {
  rnorm(100, mean = xyzabc, sd = abcxyz)
}
foo(abc = 2)
```
All seems well. But then a version update adds a new argument:
```{r, error = TRUE}
foo <- function(abcabc = 100, xyzabc = 0, abcxyz) {
  rnorm(abcabc, mean = xyzabc, sd = abcxyz)
}
foo(abc = 2)
```
R throws an error, unable to find an unambiguous match. (Imagine how painful this would be to debug if R defaulted to the first match instead.) The way to avoid this scenario is simple: never rely on partial argument matching in permanent code. Nonetheless, many packages do. You can identify offenders yourself by setting the `warnPartialMatchArgs` option:
```{r, warning = TRUE}
options(warnPartialMatchArgs = TRUE)
foo <- function(xyzabc = 0, abcxyz) {
  rnorm(100, mean = xyzabc, sd = abcxyz)
}
foo(abc = 2)
```
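Function arguments aren't the only place R matches names partially. The `$` operator abbreviates list element names the same way, and the companion options `warnPartialMatchDollar` and `warnPartialMatchAttr` flag those cases. A quick illustration:

```{r, warning = TRUE}
options(warnPartialMatchDollar = TRUE)
settings <- list(length.out = 10)
settings$len # partially matches 'length.out' -- silently, unless the option is set
```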
# When Simplification Complicates
R is a dynamically and weakly typed language: data types are known only at runtime, and rather than
throw an error when types clash, R tries to coerce them to a common type.
So the interpreter will happily run code like
```{r}
paste(mtcars, 1)
```
`paste` just coerces everything to character, no matter how ludicrous the results. This behavior can trip you up, but it's not truly insidious.
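The coercion rules themselves are simple: types are promoted along a fixed hierarchy, from logical through integer and double to character.

```{r}
c(TRUE, 1L, 2.5) # promoted to double: 1.0 1.0 2.5
c(TRUE, 1L, 2.5, "a") # promoted to character: "TRUE" "1" "2.5" "a"
```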
Unfortunately, R sometimes changes types under your nose. Suppose we write a function, `subset2`. It takes a data frame and two functions, each of which itself takes a data frame as argument. It filters the data column-wise using `col_f`, then row-wise using `row_f`.
```{r}
subset2 <- function(df, col_f, row_f) {
  df <- df[, col_f(df)]
  df[row_f(df), ]
}
subset2(mtcars, \(x) colSums(x) > 500, \(x) rowSums(x) > 500)
```
That seems to work. (Deadly words!) But what if my finger had slipped when I typed `500`?
```{r, error = TRUE}
subset2 <- function(df, col_f, row_f) {
  df <- df[, col_f(df)]
  df[row_f(df), ]
}
subset2(mtcars, \(x) colSums(x) > 5000, \(x) rowSums(x) > 500)
```
What happened? Only one column of `mtcars`, `disp`, has a column sum greater than 5000. And what happens if you select a single column with array-style indexing?
```{r}
mtcars[, "disp"]
```
R helpfully simplifies to an atomic vector. We can fix our function by disabling this behavior:
```{r}
subset3 <- function(df, col_f, row_f) {
  df <- df[, col_f(df), drop = FALSE]
  df[row_f(df), ]
}
subset3(mtcars, \(x) colSums(x) > 5000, \(x) rowSums(x) > 500)
```
or, even more sensibly, using list subsetting (single brackets, no comma), which never simplifies.
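A quick comparison, using `str` to expose the difference:

```{r}
str(mtcars["disp"]) # list subsetting: still a data frame
str(mtcars[, "disp"]) # matrix-style subsetting: simplified to a bare vector
```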
This behavior isn't indefensible. It's consistent with how subsetting works on matrices and arrays (which are usually atomic vectors). In interactive use, it's convenient, since you're usually interested in the data a column contains, not the object containing it. But automatic simplification is easily missed and potentially destructive, and the way to avoid it can be found only by carefully reading the documentation.
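Here, for comparison, is the matrix behavior it mirrors:

```{r}
m <- matrix(1:6, nrow = 2)
m[, 1] # a single column drops to a plain vector here too
m[, 1, drop = FALSE] # drop = FALSE preserves the 2 x 1 matrix
```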
# Brevity is the Soul of Bugs
Suppose you have the following vector:
```{r}
x <- c(1, 4, 7, NA, -9, NA)
```
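Its mean, computed naively, is just `NA`:

```{r}
mean(x) # missing values propagate by default
```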
R is strict about missing values, but not about logical constants. `T` and `F` can be used as abbreviations for `TRUE` and `FALSE`, respectively. The following is a valid way of taking the mean:
```{r}
mean(x, na.rm = T)
```
Likewise, with `F` for `FALSE`:
```{r}
mtcars[1:5, "cyl", drop = F]
```
What's the harm in this? While `TRUE` and `FALSE` are reserved words, the abbreviations _aren't_. Let's say your colleague creates a variable `T`, making sure to use uppercase to avoid masking the `t` function:
```{r}
T <- pt(2, df = 10)
```
This code now fails in a confusing way:
```{r}
mean(x, na.rm = T)
```
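What went wrong: `T` is now bound to `pt(2, df = 10)`, roughly 0.96 rather than `TRUE`, so the missing values are never removed and the result is `NA`. Deleting the masking binding restores the base constant:

```{r}
rm(T) # remove the global binding that shadows base's T
mean(x, na.rm = T) # T resolves to TRUE again
```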
The reason for this feature, as before, is clear: it's convenient in interactive use. The problem with it is equally clear: it's suicidal in programmatic use.
# Conclusion
The theme here is obvious: features that save a few keystrokes in interactive use can cause maddening bugs if carelessly used in production code. You need familiarity with the language and some degree of vigilance to avoid the pitfalls, and everyone slips now and again.
The longer I've spent with R, the more convinced I've become that R has outgrown these features. R was designed as an environment for interactive data exploration, statistical testing, and graphical display, but today it can do so much more: serve web apps, query remote databases, and render just about any document (even this one) with R Markdown or Quarto, among many other uses.
But to fulfill these sophisticated use cases, you have to carefully avoid traps like the ones discussed here. Some organizations have no doubt sidestepped the problem entirely by switching to Python. So R's design emphasis on interactivity may limit its growth.
Moreover, the benefits these features deliver are scant. The three behaviors I describe - partial argument matching, abbreviated logical constants, and automatic dimension dropping - save a bit of typing (or, in the last case, an extra step of data manipulation). A few keystrokes saved here and there add up quickly, and the savings
may have been significant in the days when users were limited to R's basic `readline` prompt. But that doesn't balance the potential harm these features can cause in production code today, especially when modern IDEs (and Vim and Emacs, of course) support autocompletion, obviating the need for abbreviated code.
Don't get me wrong. R remains a powerful, expressive language built on solid design principles. It's my first choice for any kind of data manipulation, and I still find it fun and satisfying to use. But some of its behaviors are more at home in its past than its future.