Friday, April 29, 2016

Microsoft-R_IntroductionToRopenAndRserver

Introduction to MRO and MRS

Special thanks to Lixun Zhang https://github.com/lixzhang
April 26, 2016
Microsoft recently announced that it now offers Microsoft R Server (formerly Revolution R Enterprise) free for developers and Microsoft R Open (formerly Revolution R Open) free for everyone.
The purpose of this write-up is to help data scientists who are familiar with R better understand the commonalities and differences among R, Microsoft R Open (MRO) and Microsoft R Server (MRS).
A quick summary of the content can be found at my Revolutions blog post. The same summary can aslo be found at R bloggers.

1 Setup

The links for downloading MRO and MRS can be found here.
You can also provision a Microsoft Data Science Virutal Machine (DSVM), which comes pre-installed with Microsoft R Server Developer Edition. All tests below were completed on a Microsoft DSVM with 4 virtual processors (2.40 GHz) and 7 GB RAM.

2 Three Perspectives

We can compare R, MRO and MRS from three perspectives: functions, packages, and platforms.

2.1 Functions

Since MRO and MRS are based on R, you can continue doing everything in MRO and MRS just like in R. For example, you can continue using your favorate development environment like RStudio, installing packages from CRAN, or fitting models with functions like lm() and glm(). As an example, with the following command we fit a logistic regression model with the glm() function, using either R, or MRO, or MRS.
In [1]:
# check the data
head(mtcars)
# fit a model with glm(), this can be run on R, MRO, or MRS
# predict V engine vs straight engine with weight and displacement
logistic1 <- glm(vs ~ wt + disp, data = mtcars, family = binomial)
# summary(logistic1)
Out[1]:
mpgcyldisphpdratwtqsecvsamgearcarb
Mazda RX42161601103.92.6216.460144
Mazda RX4 Wag2161601103.92.87517.020144
Datsun 71022.84108933.852.3218.611141
Hornet 4 Drive21.462581103.083.21519.441031
Hornet Sportabout18.783601753.153.4417.020032
Valiant18.162251052.763.4620.221031
MRS has a high performance analysis (HPA) version of the glm() function called rxGlm(). This function can be used only with MRS and not R or MRO.
In [2]:
rxOptions(reportProgress = 0) # don't report progress
# fit the same model with rxGlm(), this can be run on MRS only
# predict V engine vs straight engine with weight and displacement
logistic2 <- rxGlm(vs ~ wt + disp, data = mtcars, family = binomial)
# summary(logistic2)
As another example, MRS has an HPA version of the kmeans() function called rxKmeans(). This function can be used only with MRS and not R or MRO.
Below we simulate data that can be analyzed by kmeans() and rxKmeans().
In [3]:
library("MASS")
library("ggplot2")

# show images inline
options(jupyter.plot_mimetypes = 'image/png')

# change size of plot
options(repr.plot.width = 5, repr.plot.height = 4)

# make sure the results can be replicated
set.seed(112)

# function to simulate data 
simulCluster <- function(nsamples, mean, dimension, group)
{
  Sigma <- diag(1, dimension, dimension)
  x <- mvrnorm(n = nsamples, rep(mean, dimension), Sigma)
  z <- as.data.frame(x)
  z$group = group
  z
}

# simulate data with 2 clusters
nsamples <- 1000
group_a <- simulCluster(nsamples, -1, 2, "a")
group_b <- simulCluster(nsamples, 1, 2, "b")
group_all <- rbind(group_a, group_b)

nclusters <- 2

# plot data 
ggplot(group_all, aes(x = V1, y = V2)) +
  geom_point(aes(colour = group)) +
  geom_point(data = data.frame(V1 = c(-1, 1), V2 = c(-1, 1)), size = 5) +
  xlim(-5, 5) + ylim(-5, 5) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  ggtitle("Simulated data in two overlapping groups")

# assign data 
mydata = group_all[,1:2]
Then we can use kmeans() in R/MRO/MRS or rxKmeans() in MRS for the analysis.
In [4]:
# cluster analysis with kmeans(), it works on R, MRO, or MRS
fit_kmeans <- kmeans(mydata, nclusters, iter.max = 1000, algorithm = "Lloyd")

# cluster analysis with rxKmeans(), it works on MRS only
fit_rxKmeans <- rxKmeans( ~ V1 + V2, data = mydata,
                          numClusters = nclusters, algorithm = "lloyd")
The following code uses the XDF format data in rxKmeans() and compares the cluster assignment between kmeans() and rxKmeans(). The cluster assignments are the same between these two functions.
In [5]:
# save a dataset in XDF format
dataXDF = tempfile(fileext = ".xdf")
rxImport(inData = mydata, outFile = dataXDF, overwrite = TRUE)
# rxKmeans
fit_rxKmeans_2 <- rxKmeans(~ V1 + V2, data = dataXDF, 
                          numClusters = nclusters, algorithm = "lloyd",
                          outFile = dataXDF, outColName = "cluster", 
                          overwrite = TRUE)

# convert to dataframe and keep one variable only
cluster_rxKmeans <- rxDataStep(dataXDF, varsToKeep = "cluster")

# append cluster assignment from kmeans and rxKmeans
mydata_clusters <- cbind(
  group_all,
  cluster_kmeans = factor(fit_kmeans$cluster),
  cluster_rxKmeans = factor(cluster_rxKmeans$cluster))

# compare the cluster assignments between kmeans and rxKmeans
with(mydata_clusters, table(cluster_kmeans, cluster_rxKmeans))

# get cluster centroids 
cluster_centroid_kmeans <- fit_kmeans$centers
cluster_centroid_rxKmeans <- fit_rxKmeans_2$centers

# plot clusters from kmeans
ggplot(mydata_clusters, aes(x = V1, y = V2)) +
  geom_point(aes(colour = cluster_kmeans)) +
  geom_point(data = as.data.frame(cluster_centroid_kmeans), size = 5) +
  xlim(-5, 5) + ylim(-5, 5) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  ggtitle("Clusters found by kmeans()") + 
  theme(legend.title=element_blank())

# plot clusters from rxKmeans
ggplot(mydata_clusters, aes(x = V1, y = V2)) +
  geom_point(aes(colour = cluster_rxKmeans)) +
  geom_point(data = as.data.frame(cluster_centroid_rxKmeans), size = 5) +
  xlim(-5, 5) + ylim(-5, 5) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  ggtitle("Clusters found by rxKmeans()") +
  theme(legend.title=element_blank())
blocksPerRead must be 1 if predictions or residuals are be being written to the input data source.
Out[5]:
              cluster_rxKmeans
cluster_kmeans    1    2
             1  991    0
             2    0 1009
MRS has the HPA version of many other R functions. These MRS functions usually start with "rx" and are included in the RevoScaleR package. Figure 1 gives 3 three examples. More information on available HPA functions can be found here
Figure 1 Figure 1

2.2 Packages

Data scientists who use R are familiar with the use of Packages. Figure 2 compares R, MRO and MRS by checking the packages installed by default. For example, MRO with MKL has two additional packages compared with base R: RevoUtilsMath and checkpoint. While some of the additional packages in MRS are available on CRAN, most of them (e.g., RevoScaleR) are not and can be found only after installing MRS.
Figure 2 Figure 2

2.3 Platforms

Figure 3 from a webinar's slides compares R, MRO and MRS at the platform level. For example, the functions in the ScaleR module are included in the RevoScaleR package in Figure 2.
Figure 3 Figure 3

3 Performance Comparison

MRS solves two problems associated with using R: capacity (handling the size of datasets and models) and speed. Figure 4 compares R, MRO and MRS in these two areas. In terms of capacity, using HPA in MRS increases the size of data that can be analyzed. In the speed perspective, certain matrix related base-R functions can perform better in MRO and MRS than base-R due to MKL. The HPA functions in MRS perform better than their base-R counterparts for large datasets. We'll look into these two areas in this section.
Figure 4 Figure 4

3.1 Capacity

Both R and MRO can only process data in-memory. So if the data size is larger than memory size, they cannot load the data successfully and thus cannot process them. This also applies to base R functions under the MRS environment. Even when the data size is less than memory they could fail for more complicated analyses.
The HPA functions in MRS, on the other hand, allows processing data block by block. To do this it uses a data file format (.xdf) that is extremely efficient for both reading and writing data. These functions can still process in-memory data. But they allow the data to be disk-based as well.
For example, the following example shows that for a large dataset rxKmeans() runs successfully but kmeans() does not.
We start by simulating data with 60,000,000 rows and 2 columns, which takes up 2.2 GB disk space.
In [6]:
library("MASS")

# make sure the results can be replicated
set.seed(0)

# simulate data and append
nsamples <- 3*10^7 # use this to generate a large dataset
# nsamples <- 10^3 # for testing purpose
group_a <- simulCluster(nsamples, -1, 2, "a")
group_b <- simulCluster(nsamples, 1, 2, "b")
group_all <- rbind(group_a, group_b)

nclusters <- 2

# save data
mydata = group_all[, 1:2]
dataCSV = tempfile(fileext = ".csv")
dataXDF = tempfile(fileext = ".xdf")
write.csv(group_all, dataCSV, row.names = FALSE)
rxImport(inData = dataCSV, outFile = dataXDF, overwrite = TRUE)
When the following code uses kmeans() to analyze the data, an error message - “cannot allocate vector of size 915.5 Mb … Reached total allocation of 7167Mb: see help(memory.size)” - was returned.
In [7]:
# cluster analysis with kmeans
system_time_R <- 
  system.time(
    {
      fit <- kmeans(mydata, nclusters,
                    iter.max = 1000,
                    algorithm = "Lloyd")
    })
Warning message:
In unlist(X, recursive = FALSE, use.names = FALSE): Reached total allocation of 7167Mb: see help(memory.size)Warning message:
In unlist(X, recursive = FALSE, use.names = FALSE): Reached total allocation of 7167Mb: see help(memory.size)Warning message:
In unlist(X, recursive = FALSE, use.names = FALSE): Reached total allocation of 7167Mb: see help(memory.size)Warning message:
In unlist(X, recursive = FALSE, use.names = FALSE): Reached total allocation of 7167Mb: see help(memory.size)
Error: cannot allocate vector of size 915.5 Mb
Timing stopped at: 257.39 6.12 319.03 
The following code uses the rxKmeans() function and returned results after 4 minutes.
In [8]:
# cluster analysis with rxKmeans
system_time_MRS <- 
  system.time(
    {
      clust <- rxKmeans( ~ V1 + V2, data = dataXDF,
                         numClusters = nclusters,
                         algorithm = "lloyd",
                         outFile = dataXDF,
                         outColName = "cluster",
                         overwrite = TRUE)
    })
blocksPerRead must be 1 if predictions or residuals are be being written to the input data source.

3.2 Speed

3.2.1 Base R Functions

With the RevoUtilsMath package on multi-core computers, MRO performs better than base R for many matrix related functions. For example, running the following code from MRAN's Performance Benchmarks page on R, MRO, and MRS shows that MRO and MRS performed better than R.
In [9]:
if(require("RevoUtilsMath")){
  print(paste("The number of threads is:", getMKLthreads()))
  # setMKLthreads(4) # set 4 threads
  # setMKLthreads(1) # set 1 thread
}

# Initialization
set.seed (1)
m <- 10000
n <-  5000
A <- matrix (runif (m*n),m,n)

# Matrix multiply
system.time (B <- crossprod(A))

# Cholesky Factorization
system.time (C <- chol(B))

# Singular Value Decomposition
m <- 10000
n <- 2000
A <- matrix (runif (m*n),m,n)
system.time (S <- svd (A,nu=0,nv=0))

# Principal Components Analysis
m <- 10000
n <- 2000
A <- matrix (runif (m*n),m,n)
system.time (P <- prcomp(A))

# Linear Discriminant Analysis
library('MASS')
g <- 5
k <- round (m/2)
A <- data.frame (A, fac=sample (LETTERS[1:g],m,replace=TRUE))
train <- sample(1:m, k)
system.time (L <- lda(fac ~., data=A, prior=rep(1,g)/g, subset=train))
[1] "The number of threads is: 4"
Out[9]:
   user  system elapsed 
 148.54  176.76  235.01 
Out[9]:
   user  system elapsed 
   7.98    0.56    4.40 
Out[9]:
   user  system elapsed 
  52.35    4.50   18.12 
Out[9]:
   user  system elapsed 
 114.34   10.45   37.82 
Out[9]:
   user  system elapsed 
 165.66   13.20   98.02 
Figure 5 summarizes the speed comparison on 4 platforms:
  • Base R
  • MRO without MKL
  • MRO with MKL
  • MRS
From Figure 5 we can make several observations:
  • There is little performance difference between R and MRO when MKL (the RevoUtilsMath package) is not installed.
  • MRO has significant better performance than R when MKL is installed.
  • The performance is similar between MRS and MRO when MKL is installed.
So the installation of MKL helped speed up the calculations related to matrix. MKL can be used for MRO but not for base R.
Figure 5 Figure 5
When the calculation does not involve matrix, however, MKL will not help much for standard R functions. For example, the following code can be run on the same four platforms as shown above.
In [10]:
# simulate data 
nsamples <- 10 ^ 7 # this was used on different platforms
# nsamples <- 1000 # for testing purpose
group_a <- simulCluster(nsamples, -1, 2, "a")
group_b <- simulCluster(nsamples, 1, 2, "b")
group_all <- rbind(group_a, group_b)

nclusters <- 2

mydata = group_all[, 1:2]
# K-Means Cluster Analysis
system_time_r <- system.time(fit <- kmeans(mydata, nclusters,
                                           iter.max = 1000,
                                           algorithm = "Lloyd"))
Figure 6 compares the speed across the platforms for the kmeans() function. It can be seen that the speed of running kmeans() is similar between R, MRO, and MRS.
Figure 6 Figure 6

3.2.2 Base R vs HPA Functions

By changing the sample size of the simulated data we can compare the speed of using kmeans() and rxKmeans() for different data sizes. The code in the following block compares the performance. The results show that the HPA function rxKmeans() runs faster than kmeans() when the size of datasets gets large and that their performance is similar for smaller datasets.
In [11]:
# to save timing results
myresult <- data.frame(nsamples = integer(), time_r = double(),
                       time_rre = double())

# list of sample sizes
nsamples_list <- c(5 * 10 ^ 2, 10 ^ 3, 5 * 10 ^ 3, 10 ^ 4, 5 * 10 ^ 4, 10 ^ 5,
                   5 * 10 ^ 5, 10 ^ 6, 5 * 10 ^ 6, 10 ^ 7)

for (nsamples in nsamples_list)
{
  # simulate data and append
  group_a <- simulCluster(nsamples, -1, 2, "a")
  group_b <- simulCluster(nsamples, 1, 2, "b")
  group_all <- rbind(group_a, group_b)
  mydata = group_all[, 1:2]

  nclusters <- 2

  # kmeans with R
  system_time_r <- system.time(fit <- kmeans(mydata, nclusters,
                                               iter.max = 1000,
                                               algorithm = "Lloyd"))

  # kmeans with MRS
  system_time_rre <- system.time(clust <- rxKmeans( ~ V1 + V2, data = mydata,
                                                     numClusters = nclusters,
                                                     algorithm = "lloyd"))

  # combine
  newrow <- data.frame(nsamples = nsamples,
                         time_r = as.numeric(system_time_r[3]),
                         time_rre = as.numeric(system_time_rre[3]))
  myresult <- rbind(myresult, newrow)

}

myresult$nsamples <- 2 * myresult$nsamples
mydata <- myresult
mydata$nsamples_log <- log10(mydata$nsamples)
mydata
Out[11]:
nsamplestime_rtime_rrensamples_log
110000.010.023
2200000.023.30103
31000000.034
4200000.010.074.30103
51e+050.070.065
62e+050.260.15.30103
71e+061.580.566
82e+062.111.136.30103
91e+0715.146.077
102e+0726.0815.397.30103
In [12]:
# generate plot
ggplot(data = mydata, aes(x = nsamples_log)) +
    geom_point(aes(y = time_r, colour = "kmeans")) +
    geom_line(aes(y = time_r, colour = "kmeans")) +
    geom_point(aes(y = time_rre, colour = "rxKmeans")) +
    geom_line(aes(y = time_rre, colour = "rxKmeans")) +
    scale_x_continuous(breaks = seq(2, 8, by = 1)) +
    scale_colour_manual("Function", values = c(kmeans = "red", rxKmeans = "blue")) +
    xlab("log10(number of samples)") +
    ylab("time in seconds") +
    ggtitle("If data fits in memory, \nkmeans() and rxKmeans() are equally performant")

4 Conclusion

Figure 7 summarize scenarios where R, MRO, and MRS can be used depending on the amount of data and the availability of MRS's HPA functions. Whenever R can be used, MRO can be used with the additional benefit of multi-thread computation for certain matrix related computations. MRS can be used whenever R or MRO can be used and it allows the possibility of using HPA functions that provide better performance in terms of both speed and capacity.
Figure 7 Figure 7

5 Additional Resources

This write-up should have helped you understand how MRS and MRO relate to R. Below are some additional resources you can refer to.

Special thanks to https://github.com/lixzhang