R involves a lot of programming

R developer guide for Azure


Many data analysts have to process ever-increasing amounts of data and are looking for ways to use the power of cloud computing for their analyses. This article provides an overview of the different ways data analysts can apply their existing R programming skills in Azure.

Microsoft has fully embraced the R programming language as a first-class tool for data scientists. By providing many different options for R developers to run their code in Azure, the company enables data scientists to extend their workloads into the cloud when they tackle large projects.

Let's look at the different options and the most suitable scenarios for each.

Azure services that support the R language

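This article covers the following Azure services that support the R language:

  • Data Science Virtual Machine
  • ML Services in HDInsight
  • Azure Databricks
  • Azure Machine Learning
  • Azure Machine Learning Studio (classic)
  • Azure Batch
  • Azure SQL Managed Instance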

Data Science Virtual Machine

The Data Science Virtual Machine (DSVM) is a customized VM image on the Microsoft Azure cloud platform built specifically for data science. The image comes with many common data science tools preinstalled.

DSVM can be deployed with either Windows or Linux as the operating system. You can use DSVM in two different ways: as an interactive workstation or as a computing platform for a custom cluster.

As a workstation

If you want to get started with R in the cloud quickly and easily, this is the best solution. The environment is familiar to anyone who has worked with R on a local workstation. Instead of using local resources, however, the R environment runs on a virtual machine in the cloud. If your data is already stored in Azure, this has the added benefit of allowing your R scripts to run closer to the data. Instead of transferring the data over the Internet, access can take place via the internal Azure network, which offers much faster access times.

The DSVM can be especially useful for small teams of R developers. Instead of investing in high-performance workstations for each developer and requiring team members to agree on which versions of the various software packages to use, each developer can set up an instance of the DSVM whenever one is needed.

As a compute platform

In addition to being used as a workstation, the DSVM can also be used as an elastically scalable compute platform for R projects. The -R package allows you to programmatically manage the creation and deletion of DSVM instances. You can group the instances into a cluster and deploy a distributed analysis to run in the cloud. This entire process can be controlled by R code running on your local workstation.

For more information about DSVM, see Introduction to Azure Data Science Virtual Machine for Linux and Windows.

ML Services in HDInsight

Microsoft ML Services gives data scientists, statisticians, and R programmers on-demand access to scalable, distributed analysis methods in HDInsight. This solution provides the latest capabilities for R-based analysis of datasets of virtually any size, loaded into either Azure Blob storage or Data Lake Storage.

This is an enterprise-grade solution that allows you to scale your R code across a cluster. Using functions in the Microsoft package, your R scripts for HDInsight can run computing functions in parallel on many nodes in a cluster. This allows R to process data on a much larger scale than single-threaded R can on a workstation.
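The idea of splitting one computation across several workers can be tried out locally. The following is only a sketch of the pattern, using base R's parallel package as a stand-in: it is not the Microsoft package mentioned above, and the workload (`sum_sq`) and the chunk count are illustrative assumptions.

```r
library(parallel)

# Hypothetical workload: the sum of squares of a numeric vector.
sum_sq <- function(x) sum(x^2)

values <- as.numeric(1:100000)

# Split the data into four chunks of equal size.
chunk_id <- rep(1:4, each = length(values) / 4)
chunks <- split(values, chunk_id)

# Start two local worker processes; on a real cluster these would be nodes.
cl <- makeCluster(2)
partial <- parLapply(cl, chunks, sum_sq)
stopCluster(cl)

# Combine the partial results and compare them with the single-process answer.
total_parallel <- sum(unlist(partial))
total_serial <- sum_sq(values)
print(isTRUE(all.equal(total_parallel, total_serial)))
```

Each worker only sees its own chunk, so the same structure carries over when the chunks live on different machines.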

This scalability makes ML Services in HDInsight a great option for R developers with huge datasets. It provides a flexible and scalable platform for running your R scripts in the cloud.

For a walkthrough of creating an ML Services cluster, see Getting started with ML Services on Azure HDInsight.

Azure Databricks

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Databricks was designed with the founders of Apache Spark and is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace in which data scientists, data engineers, and business analysts can collaborate.

Collaboration in Databricks is made possible by the platform's notebook system. Users can create, share, and edit notebooks with other users of the system. These notebooks allow users to write code that runs on Spark clusters managed in the Databricks environment. These notebooks have full support for R and give users access to Spark through the packages and.
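As a rough illustration of R talking to Spark from a notebook, the sketch below uses the sparklyr package. It assumes that sparklyr and dplyr are installed on the cluster and that the code runs inside a Databricks R notebook attached to a running cluster; the table name `mtcars_spark` is a hypothetical choice, and the example will not run on a plain workstation.

```r
library(sparklyr)
library(dplyr)

# Connect to the Spark cluster the notebook is attached to
# (assumption: this runs inside a Databricks notebook).
sc <- spark_connect(method = "databricks")

# Copy a small built-in data frame to Spark; "mtcars_spark" is a hypothetical name.
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# The dplyr verbs are translated to Spark SQL and run on the cluster;
# collect() brings the small summarized result back into R.
cars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()
```

Only the summarized rows travel back to the R session, which is what makes this style workable for data that would not fit in local memory.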

Because Databricks is based on Spark and has a strong focus on collaboration, the platform is often used by teams of data scientists working together on complex analyses of large datasets. Since the notebooks in Databricks also support other languages in addition to R, this solution is particularly suitable for teams in which the analysts use different languages for their primary work.

For more details about the platform and help with getting started, see What is Azure Databricks?

Azure Machine Learning

Azure Machine Learning can be used for all types of machine learning, from classic machine learning to deep learning, and for both supervised and unsupervised learning. Whether you prefer to write Python or R code or to use low-code and no-code options such as the designer, you can use an Azure Machine Learning workspace to build, train, and track highly accurate machine learning and deep learning models.

Start training on your local computer, then scale up to the cloud. Train your first model in R with Azure Machine Learning today.

Azure Machine Learning Studio (classic)

Azure Machine Learning Studio (classic) is a collaborative drag-and-drop tool that you can use to build, test, and deploy predictive analytics solutions in the cloud. It enables less experienced data scientists to create and deploy machine learning models without writing a lot of code.

Azure Machine Learning Studio (classic) supports both R and Python.

Customers who are currently using or evaluating Azure Machine Learning Studio (classic) are encouraged to try the designer in Azure Machine Learning. It offers drag-and-drop ML modules as well as scalability, version control, and enterprise security.

Azure Batch

You can use Azure Batch for large R jobs. This service provides cloud-scale job scheduling and compute management so you can scale your R workload to tens, hundreds, or thousands of virtual machines. Because it is a generalized computing platform, there are a few options for running R jobs in Azure Batch.

One option for running an R script in Azure Batch is to bundle your code with RScript.exe as a Batch app in the Azure portal. For a detailed walkthrough, see R Workloads on Azure Batch.
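A script that is launched non-interactively usually takes its parameters from the command line rather than from an interactive session. The sketch below shows one way such a task script might look. The file name `task.R`, the two arguments, the helper names, and the output file `result.csv` are all hypothetical; nothing here is specific to Azure Batch itself.

```r
# Hypothetical task script, for example saved as task.R and started with:
#   Rscript task.R 1 10
# The two arguments (an illustrative choice) are the first and last seed to simulate.

parse_range <- function(args) {
  if (length(args) != 2) stop("expected two arguments: <first> <last>")
  as.integer(args)
}

run_task <- function(first, last) {
  # One small simulation per seed: the mean of 100 random normal draws.
  vapply(seq(first, last), function(seed) {
    set.seed(seed)
    mean(rnorm(100))
  }, numeric(1))
}

# Only act when arguments were supplied, so the file can also be sourced safely.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) > 0) {
  rng <- parse_range(args)
  result <- run_task(rng[1], rng[2])
  write.csv(data.frame(seed = seq(rng[1], rng[2]), mean = result),
            "result.csv", row.names = FALSE)
}
```

Giving each task its own argument range means many copies of the same script can run on different machines without coordinating with each other.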

Another option is to use the Azure Distributed Data Engineering Toolkit (AZTK), which provisions on-demand Spark clusters using Docker containers in Azure Batch. This provides an inexpensive way to run Spark jobs in Azure. By using sparklyr with AZTK, your R scripts can be scaled out in the cloud easily and cost-effectively.

Azure SQL Managed Instance

Azure SQL Managed Instance is the intelligent, scalable cloud database service from Microsoft. It allows you to use the full potential of SQL Server without having to set up the infrastructure. This includes Machine Learning Services, which contains Microsoft R and Python packages for powerful predictive analytics and machine learning.

Machine Learning Services provides an embedded predictive analytics and data science engine that can execute R/Python code within a SQL Server database. Instead of extracting data from the database and loading it into the R/Python environment, you load your R/Python code directly into the database and have it run alongside the data. Relational data can be used in stored procedures, as T-SQL scripts containing R/Python statements, or as R/Python code containing T-SQL.
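To give a feel for what in-database R looks like, here is a small sketch of the R side only. In SQL Server Machine Learning Services, the `sp_execute_external_script` stored procedure hands the result of a T-SQL query to the script as a data frame named `InputDataSet` and returns the data frame named `OutputDataSet` by default. The sample rows and column names below are made up so that the snippet can be run locally.

```r
# Stand-in for the rows a T-SQL query would supply. Inside SQL Server,
# sp_execute_external_script creates InputDataSet for you.
InputDataSet <- data.frame(
  region = c("north", "north", "south", "south"),
  sales  = c(10, 20, 5, 15)
)

# The part that would go into the script text: mean sales per region.
OutputDataSet <- aggregate(sales ~ region, data = InputDataSet, FUN = mean)
print(OutputDataSet)
```

Only the aggregation line would be passed to the database; the data frame definition exists here purely as a local substitute for the query result.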

Machine Learning Services has been part of on-premises SQL Server since 2016, but was only recently introduced in Azure SQL Managed Instance.

Next Steps

The R logo is © 2016 The R Foundation. It is used under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license.