Living Systems_

Why is R so confusing? Because it is so hard to inspect

Confused man trying to run some R code

I’m not the only one who thinks the R language can be pretty frustrating at times.

In fact, although I have used it a dozen times over my career, every time I have picked it up again after a hiatus, I have been completely lost.

Many people have written about various aspects of why it is so confusing, and many of the common complaints are about surface-level or syntax-level quirks.

I’m personally not very bothered by those differences, though. They are things I can learn and simply accept.

After thinking quite a bit about it, I think what instead frustrates me the most is how hard it is to inspect your objects and datatypes.

That is, understanding how my data actually looks and how it is structured. This is what I increasingly find to be the most important thing to be completely clear about, in order to verify that your or others’ code does what you expect it to, or to understand how some new code operates.

When you are not able to look at exactly what goes in and out of a function, you are working very much in the dark, and what you are doing is not so much computing as alchemy.

To give an example of a language that does this well: in Python you can inspect objects very deeply, for instance by doing:

l = [1,2,3]
l.__dir__()

… to show a list of all the “public” and “internal” (starting with __) methods and fields that are available for this object.

In this case, what you get is:

>>> l.__dir__()
> ['__new__', '__repr__', '__hash__', '__getattribute__', '__lt__', '__le__',
> '__eq__', '__ne__', '__gt__', '__ge__', '__iter__', '__init__', '__len__',
> '__getitem__', '__setitem__', '__delitem__', '__add__', '__mul__', '__rmul__',
> '__contains__', '__iadd__', '__imul__', '__reversed__', '__sizeof__', 'clear',
> 'copy', 'append', 'insert', 'extend', 'pop', 'remove', 'index', 'count',
> 'reverse', 'sort', '__class_getitem__', '__doc__', '__str__', '__setattr__',
> '__delattr__', '__reduce_ex__', '__reduce__', '__subclasshook__',
> '__init_subclass__', '__format__', '__dir__', '__class__']

Here you can see that this list has a number of normal methods:

copy
append
insert
extend
pop
remove
index
count
reverse
sort

… as well as a longer list of internal functions.

In Python I feel that I can always use a few very simple techniques in order to inspect any data type.

Granted, not all languages are as good in this regard as Python, but I feel that R resides at the total opposite end of the spectrum.

Below I will go into a few specific frustrations I have run into, while also providing some tips for working around them, based on a combination of techniques I’ve learned from others and a few small tricks of my own.

Inconsistent and hard-to-read printing methods

The various ways of peeking into the data of objects are:

  1. Not good at showing an overview of the data.
  2. Inconsistent and hard to read.

For number one, note that R shows all the columns by default, breaking them up across multiple rows. This means that even a simple command such as head(data) gives you a quite unreadable mess of wrapped rows, like this:

<... clipped, as the output runs many pages beyond the window ...>
ENSG00000000460       9      15      17       8      28      10       5
ENSG00000000938      73      12     100      71      55      28      35
NA19201 NA19203 NA19204 NA19209 NA19210 NA19222 NA19225
ENSG00000000003       2       0       0       0       0       0       0
ENSG00000000005       0       0       0       0       0       0       0
ENSG00000000419      77      82      57      63      89      60      76
ENSG00000000457      96      58      86     113      48      71      81
ENSG00000000460      16      10      19      17      12      12       7
ENSG00000000938      50      14      59      27      32      25      19
NA19238 NA19239 NA19257
ENSG00000000003       0       0       0
ENSG00000000005       0       0       0
ENSG00000000419      69      84      76
ENSG00000000457      73      87      81
ENSG00000000460      21      35      11
ENSG00000000938      59      58      34

Then there is the fact that despite a flurry of methods for printing overview information about objects, these all tend to show different information, and few of them are really that useful for showing the structure of the data. (Well, str() does actually do something useful, but it has a pretty confusing name: it is not for printing the object “as string” as in Python. Rather, it shows its “structure”. Good on you if you remember that!)
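To see the difference concretely, here is a minimal sketch comparing what the overview functions show. The toy data frame is my own example, not the expression data used later in this post:

```r
# A toy data frame, made up just for illustration
df <- data.frame(gene = c("A", "B", "C"), count = c(10, 0, 7))

str(df)      # shows the *structure*: column types plus a few values each
summary(df)  # shows per-column summary statistics instead
head(df, 2)  # shows the first rows, with no type information
```

Each of the three prints something different, which is exactly the inconsistency I mean.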

The glimpse() function from the tidyverse does a somewhat better job, but the tidyverse is a rather big installation and, while worth it, not something you should need to install just for a basic operation like looking at your data.

What to do about it

What I found useful is to create a small custom function that shows only the top N rows and columns, with N set to a number that keeps the output from wrapping, such as this:

p <- function(data) {
    # Show at most the first 7 rows and columns, without failing
    # on data that has fewer rows or columns than that
    data[1:min(7, nrow(data)), 1:min(7, ncol(data))]
}

Then you can use it on data types like ExpressionSets (from Bioconductor), and get output like this:

> p(edata)
>                 ERS025098 ERS025092 ERS025085 ERS025088 ERS025089 ERS025082 ERS025081
> ENSG00000000003      1354       216       215       924       725       125       796
> ENSG00000000005       712       134         4      1495       119        20         7
> ENSG00000000419       450       547       516       529       808       680       744
> ENSG00000000457       188       368       196       386       156       259       436
> ENSG00000000460        66        29         1        26        11         9        25
> ENSG00000000938       104        79         7        29         0         3         1
> ENSG00000000971         0         0         0         0         0         0         0

I personally find that this does a better job of looking into typical matrix-shaped data structures than something like head().

Of course, adjust the number of rows/columns to your liking.
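If you adjust the sizes often, a variant of the helper with the dimensions as parameters might look like this (the parameter names are my own choice):

```r
# Like p() above, but with adjustable dimensions, defaulting to 7x7
p <- function(data, nrows = 7, ncols = nrows) {
    data[1:min(nrows, nrow(data)), 1:min(ncols, ncol(data))]
}

m <- matrix(1:100, nrow = 10)
dim(p(m))        # 7 7
dim(p(m, 3, 5))  # 3 5
```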

Lack of discoverability for functions that operate on data types

In R, the methods that you can apply to an object are in general not bound to that object. That is, you cannot type the_object.<TAB> to get an auto-completion of the available methods for that object. And this goes even for methods that merely expose parts of the datatype for display and inspection.

Typically you can find these methods mentioned in the help page of the package that implements the data type, but getting there is a multi-step process: you might not immediately see which package implements a specific class, so you first have to figure that out, and only then can you sit down and start reading, which can take at least a few minutes.
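One way to shorten that lookup, at least for S4 objects, is that class() carries a "package" attribute pointing at the defining package. A self-contained sketch (the Demo class is made up just for illustration; for a real ExpressionSet the attribute would be "Biobase"):

```r
# Define a toy S4 class so the example is self-contained
setClass("Demo", representation(x = "numeric"))
obj <- new("Demo", x = 1)

cls <- class(obj)
attr(cls, "package")  # ".GlobalEnv" for a class defined in your script;
                      # the package name for classes from a package
```

Knowing the package, you can then go straight to its documentation, e.g. help(package = "Biobase").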

Needing to do this every time you want to access data from an object is simply outrageous.

What to do about it

Somebody (reference to be added) pointed out that you can do the following to list the available methods of an object:

# Show methods (avoid naming the variable `methods`,
# which would shadow the function of the same name)
obj_methods <- methods(class = class(your_object))
obj_methods

# Show non-visible methods
attr(obj_methods, "info")

If the above does not work, there is also showMethods(), which does the same thing for so-called S4 classes, as opposed to S3 classes (I won’t go into the details here, as I’m not an expert on R, and there are many other sources that cover this much better anyway).
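To see both listings in action on a self-contained toy S4 class (the Point class and norm2 generic are made up for this sketch):

```r
# A minimal S4 class with one generic and one method
setClass("Point", representation(x = "numeric", y = "numeric"))
setGeneric("norm2", function(p) standardGeneric("norm2"))
setMethod("norm2", "Point", function(p) sqrt(p@x^2 + p@y^2))

showMethods(classes = "Point")   # the S4 listing
methods(class = "Point")         # also picks up S4 methods in modern R

norm2(new("Point", x = 3, y = 4))  # 5
```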

Then of course you can create a small utility function for quickly showing the methods of objects too:

m <- function(data) { methods(class=class(data)) }

…which you can then use like this (again with an ExpressionSet object as example):

> m(edata)
>  [1] annotatedDataFrameFrom anyDuplicated          as_tibble              as.data.frame
>  [5] as.raster              boxplot                coerce                 combine
>  [9] determinant            duplicated             edit                   ExpressionSet
> [13] exprs<-                head                   initialize             isSymmetric
> [17] Math                   Math2                  Ops                    relist
> [21] rowMedians             rowQ                   snpCall<-              snpCallProbability<-
> [25] subset                 summary                tail                   unique
> see '?methods' for accessing help and source code

This is actually quite useful, I think! Here, exprs (listed above in its assignment form, exprs<-) is for example one of the core methods you will need to get your hands on the actual data.

In summary

So, in summary: yes, R is very, very confusing. But there are also definitely some techniques you can put under your belt to overcome some of its worst points of confusion.

Update: Some feedback

John MacKintosh suggested the skimr::skim() method, which I found produces pretty nice output:

> skimr::skim(edata)
> ── Data Summary ────────────────────────
>                            Values
> Name                       edata
> Number of rows             52580
> Number of columns          129
> _______________________
> Column type frequency:
>   numeric                  129
> ________________________
> Group variables            None
>
> ── Variable type: numeric ────────────────────────────────────────────────
>   skim_variable n_missing complete_rate  mean    sd p0 p25 p50 p75  p100 hist
> 1 NA06985               0             1 21.5  232.   0   0   0   0 36671 ▇▁▁▁▁
> 2 NA06986               0             1 22.3  227.   0   0   0   0 36065 ▇▁▁▁▁
> 3 NA06994               0             1 17.0  187.   0   0   0   0 30852 ▇▁▁▁▁
> <snip>