Recently, I have moved away from text files as data storage and started using MongoDB. While there are already two R packages (RMongo and rmongodb) interfacing with MongoDB, I was never completely satisfied with them - especially in comparison to the excellent PyMongo. A couple of days ago, a new package, mongolite, was released and seems very promising.
Here, I quickly want to showcase some of the functions of mongolite, using Fisher's Iris data set.
data(iris)
# remove . from the column names to avoid problems with MongoDB's field naming rules
names(iris)[1:4] = gsub("[.]", "", names(iris)[1:4])
head(iris)
## SepalLength SepalWidth PetalLength PetalWidth Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
First, we need to insert the data from R into a new collection in MongoDB. This is done by first establishing a connection to the collection in the database, and then calling the insert function on the connection handler.
library(mongolite)
c = mongo(collection = "iris", db = "tutorials")
c$insert(iris)
## Complete! Processed total of 150 rows.
## [1] TRUE
We get some nice feedback from the function, showing us that we inserted 150 rows (which is the total number of rows in the Iris data set). Let's start exploring the data with mongolite. You can easily get the total number of rows, as well as the unique values within the Species column/field:
c$count()
## [1] 150
c$distinct("Species")
## [1] "setosa" "versicolor" "virginica"
It's also possible to utilize MongoDB's aggregate function through $aggregate:
c$aggregate('[{"$group": {
"_id":"$Species",
"count": {"$sum":1},
"avgPetalLength":{"$avg":"$PetalLength"}
}}]')
## Found 3 records...
## Imported 3 records. Simplifying into dataframe...
## _id count avgPetalLength
## 1 virginica 50 5.552
## 2 versicolor 50 4.260
## 3 setosa 50 1.462
Of course, with a small data set such as the Iris data, there is no drawback in simply doing the aggregation in dplyr, but if you are dealing with a big data set, querying the MongoDB database directly might give you some performance benefits, as not all the data has to be loaded into memory.
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarise(count = n(), avg = mean(PetalLength))
## Source: local data frame [3 x 3]
##
## Species count avg
## (fctr) (int) (dbl)
## 1 setosa 50 1.462
## 2 versicolor 50 4.260
## 3 virginica 50 5.552
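The same idea applies to filtering rows: rather than loading all 150 documents into R and subsetting afterwards, you can pass a JSON query to find and let MongoDB return only the matching documents. A small sketch:
# only fetch the setosa documents; the filtering happens server-side
setosa = c$find('{"Species": "setosa"}')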
Similarly, let's suppose we don't need all columns/fields for an analysis. For example, maybe we're only interested in the sepal width. Instead of loading all the data into memory, we can use mongolite to only return chosen fields:
sw = c$find('{}', '{"SepalWidth": 1, "Species": 1, "_id": 0}')
## Found 150 records...
## Imported 150 records. Simplifying into dataframe...
library(ggplot2)
ggplot(sw, aes(x = Species, y = SepalWidth)) +
  geom_boxplot()
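find can also sort and limit the results on the server side, which is handy if you only need, say, the top few rows. A sketch, assuming the sort and limit arguments of find (sort takes a JSON document, with -1 meaning descending):
# fetch the five documents with the longest sepals, sorted by MongoDB itself
top5 = c$find('{}', sort = '{"SepalLength": -1}', limit = 5)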
Of course, you can also update and remove entries in your database, as well as perform other operations; see the sketch below. If you'd like to know more, have a look at the package on GitHub and CRAN.
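As a rough sketch of what updating and removing look like (both take JSON query strings; the checked flag below is just a made-up field for illustration):
# set a hypothetical 'checked' flag on the first matching virginica document
c$update('{"Species": "virginica"}', '{"$set": {"checked": true}}')
# remove the setosa documents from the collection
c$remove('{"Species": "setosa"}')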