DAT213 - Analyzing Big Data with Microsoft R Server

MPPDS - This article is part of a series.
Part 10: This Article

This course teaches exploratory data analysis skills using RevoScaleR, the package that ships with Microsoft R Server. In most respects the product is functionally equivalent to open-source CRAN-R. RevoScaleR offers three significant benefits over its open-source counterpart: the ability to run analyses in parallel across different servers, the ability to “chunk” data for evaluation and so bypass R’s in-memory limitation, and the ability to read more natively from data sources like SQL Server, Hadoop, and Spark. The course explains these benefits and lets a new user become familiar with the RevoScaleR tool.
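To make the “chunking” idea concrete, here is a minimal sketch in plain Python rather than R. It is not RevoScaleR code and does not claim to mirror its internals. It only shows how a summary statistic can be computed while holding one block of rows in memory at a time. The helper names (`iter_chunks`, `chunked_mean`), the column name `fare_amount`, and the tiny sample data are all made up for illustration.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(rows, column, chunk_size=2):
    """Compute the mean of one column, holding only one chunk in memory."""
    total = 0.0
    count = 0
    for chunk in iter_chunks(rows, chunk_size):
        # Only running totals survive between chunks, never the rows.
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical sample standing in for a file too large to load at once.
SAMPLE_CSV = "fare_amount,payment_type\n10.0,card\n20.0,cash\n30.0,card\n"

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
mean_fare = chunked_mean(reader, "fare_amount", chunk_size=2)
print(mean_fare)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, and the chunk size would be chosen to fit comfortably in memory.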

Analyzing Big Data with Microsoft R Server

The course is divided into 4 segments:

  • Reading and Preparing Data
  • Examining and Visualizing Data
  • Clustering and Modeling
  • Deploying and Scaling

The edX course provides a useful written summary of the content covered in each video, along with suggested online locations for additional reading. I copied a fair bit of that content into the wiki I’m using for notes. As an example, here is some information about how to install the free version of RevoScaleR:

The RevoScaleR package is a Microsoft offering and it is not available on CRAN. To get RevoScaleR and its full functionality, we need a license for Microsoft R Server. However, we can get a free, light-weight version of Microsoft R Server by installing the Microsoft R Client. Microsoft R Client gives us an installation of R which has the RevoScaleR library built into it, and loaded by default when we start an R session. Once we install it all we need to do is point our IDE (Visual Studio or RStudio) to the Microsoft R Client installation instead of any other installation of R (if there is no other installation of R, the IDE usually automatically detects the R Client installation). Using Microsoft R Client, we can develop code in R that leverages the RevoScaleR functions.

The course works with New York City taxi data, a 12 GB data set you must download to complete the labs. These are transaction records of taxi fares over a six-month period, including date/time, start location, end location, fare paid, the method of payment, etc. Your analysis involves identifying geographic patterns in the data and predicting fares based on date/time and locations. The instructor guides you through importing the original data into a Microsoft proprietary external data frame (XDF) format that is optimized for working in RevoScaleR. Generally speaking, this course highlights how Microsoft R Server is different from and complementary to the open-source CRAN-R.
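As a rough illustration of why a model can be fit on data that never fits in memory at once, the sketch below accumulates the sums needed for a one-predictor least-squares line block by block. It is plain Python, not RevoScaleR, and I am not claiming it is how R Server fits models internally. The function name `fit_line_in_chunks`, the single numeric predictor, and the sample values are hypothetical.

```python
def fit_line_in_chunks(chunks):
    """Fit y = intercept + slope * x by accumulating sums chunk by chunk."""
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Made-up (predictor, fare) pairs, already split into two blocks.
chunks = [
    [(1.0, 4.5), (2.0, 6.5)],
    [(3.0, 8.5), (4.0, 10.5)],
]
intercept, slope = fit_line_in_chunks(chunks)
print(intercept, slope)
```

Only five running totals are kept between blocks, so memory use does not grow with the number of rows.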

This is the 9th course in the sequence, and the last class before taking the capstone course. As much as I enjoyed the demonstration of the AzureML tools in earlier courses, going through this class makes me realize CRAN-R and Microsoft R Server are the preferred tools for hardcore data analysis. The AzureML GUI makes data modeling accessible and visually intuitive for cleaning data and constructing models. But in job environments where you are using the tools all day, the point-and-click requirements of the GUI and the limits on code reuse make AzureML rather tedious. In contrast, the ability to combine R code and markdown documentation in a single “.Rmd” file that can be easily tracked via Git is a huge advantage the R environment has over AzureML.

I really enjoyed this course, as the topic is tremendously interesting to me and the videos and labs gave me a good understanding of the tools and process for evaluating large datasets. The only real complaint I have concerns the presenter’s vocal pauses (use of “um”, “uh”, etc.). About a third of the way through the course I began counting how many times he used these in each sentence. In the big scheme of things, this is a petty complaint about a highly informative course – but you have been warned…

There are a total of 51 videos with playing time of about 6.5 hours. The course requirements include:

  • Knowledge Checks (aka Quizzes) with 5-8 multiple choice questions that are together worth 30% of the grade,
  • Labs with data sets provided on the edx.org website and 3-6 multiple choice questions that are worth 30%,
  • Final exam worth 40%, composed of approximately 30 fairly challenging multiple choice questions. They typically ask you to choose among different commands and to identify the proper syntax of R Server / RevoScaleR commands. There is no time limit on the exam.

All told, this course required about 35 hours for me to complete.

While taking the R courses in this sequence, I found two supporting references particularly helpful:

Jonathan Bartleson
Author
