DAT102 Data Science Capstone

With both pride and relief, I get to share the news that I have completed the Capstone course for the Microsoft Professional Program in Data Science. Over the past six months, I’ve systematically worked through the nine courses leading up to this tenth and final course in the series. See this index page for my observations on the other courses, along with some general advice for those considering taking them.

The blog entry you are reading now is specific to the Capstone course. It explains the hoops you need to jump through, the time you will likely need to complete it, and a variety of other observations about the Capstone.

Data Science DAT102 Hoops

The Capstone is composed of three separate steps, all of which center on evaluating data from the U.S. Department of Education College Scorecard data set.

Easy Opening Quiz (25%)

The first part of the course walks you through a set of easy questions, such as “Which of the following histograms best depict the feature GraduateIncome?” There are perhaps a dozen questions that let you demonstrate that you restored the data successfully and that you can navigate a data analysis tool well enough to generate basic descriptive statistics and correlations. This opening quiz is a great opportunity to pick up 100% on what constitutes a quarter of your final grade.
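As a rough illustration of the kind of descriptive statistics and correlations the quiz asks for, here is a minimal Python sketch using only the standard library. The numbers are made-up toy values, not real College Scorecard data, and the `tuition` column is a hypothetical second feature chosen only for the example.

```python
import math
import statistics

# Hypothetical toy values standing in for two numeric columns.
# These are NOT real College Scorecard figures.
graduate_income = [32.0, 41.5, 28.0, 55.0, 47.5, 38.0]
tuition = [12.0, 18.0, 9.5, 30.0, 22.0, 15.0]


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


summary = describe(graduate_income)
r = pearson(graduate_income, tuition)
print(summary)
print(f"correlation: {r:.3f}")
```

In R, Excel or AzureML the same numbers come from built-in summary and correlation functions; the point is only that this step needs nothing more than the basics.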

Incidentally, at no point does the course dictate what analysis tool you use. You can choose from R, Python, AzureML, Excel, etc. It’s whatever tool you’re most comfortable with.

Predictive Model (50%)

The second part of this course is the real heart of the matter. It is where you demonstrate your ability to:

Complete exploratory data analysis on a fairly wide data set (300 columns).
Create a predictive model from a 17,000-record training data file.
Apply that predictive model to a 10,000-record test data file.
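The train-then-predict workflow in the list above can be sketched in a few lines of Python. This is a deliberately tiny example under stated assumptions: one made-up feature, a single-variable least-squares line, and invented values. It is not the model I or anyone else submitted, and the function names are hypothetical.

```python
import statistics


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, xs):
    """Apply a fitted (slope, intercept) model to new feature values."""
    slope, intercept = model
    return [intercept + slope * x for x in xs]


# Invented training data that follows y = 2x + 1 exactly.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.0, 5.0, 7.0, 9.0, 11.0]

# Invented test features; in the Capstone the test file has no target column.
test_x = [6.0, 7.0]

model = fit_line(train_x, train_y)
predictions = predict(model, test_x)
print(model, predictions)
```

The shape is the same at full scale: fit on the training file, predict on the test file, and submit only the predicted values.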

You submit your predicted values to the Capstone Challenge website, and the site immediately tells you how accurately you predicted them. You are allowed to submit up to three times a day, and you can follow your progress against your peers. During the October 2017 run, there were over five hundred participants in the Capstone.

To get a competitive score, you’ll need to transform the data, identify outliers, treat missing data and experiment a fair bit with predictive models. If you stick with simple linear regression models (lm in R), you’ll likely limit yourself to a 60-70% grade on this step. I believe the top scorers used boosted or ensemble models.
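Two of the cleanup steps mentioned above, treating missing data and identifying outliers, might look like the following in Python. The choices here are assumptions made for illustration (median imputation and the common 1.5 × IQR rule), not a description of what the top scorers did, and the values are invented.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Invented column with two missing entries and one suspicious value.
raw = [10.0, 12.0, None, 11.0, 13.0, 95.0, 12.5, None, 11.5]

filled = impute_median(raw)
flagged = iqr_outliers(filled)
print(filled)
print("outlier positions:", flagged)
```

Whether a flagged value should be dropped, capped or kept is a judgment call that depends on the column, which is where much of the experimentation time goes.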

Final Report (25%)

The final report is your chance to summarize your findings up to this point. The edX organizer provides a generic example report and a grading rubric to make clear what is expected. I recommend scrutinizing these resources closely. Be sure to start your document with an executive summary and state your conclusions in layman’s terms.

Your report is reviewed by three other individuals who have also written a report and gone through the Capstone. Your score is the median of their evaluations. The way edX.org has structured this course is clever: it requires no real review of papers on their part, because the grading is done entirely by peers.
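To make the arithmetic concrete, here is a small Python sketch of how the median of three peer evaluations and the 25/50/25 weights described above could combine into a final grade. The 0-100 scale, the simple weighted sum and all of the scores are assumptions for illustration; the course’s exact formula may differ.

```python
import statistics

# Section weights as described in this post.
WEIGHTS = {"quiz": 0.25, "model": 0.50, "report": 0.25}


def report_score(peer_scores):
    """Report score taken as the median of the peer evaluations."""
    return statistics.median(peer_scores)


def final_grade(quiz, model, report):
    """Weighted sum of the three section scores (assumed 0-100 scale)."""
    return (
        WEIGHTS["quiz"] * quiz
        + WEIGHTS["model"] * model
        + WEIGHTS["report"] * report
    )


# Hypothetical scores, not my actual results.
peer_scores = [80.0, 90.0, 100.0]
report = report_score(peer_scores)
grade = final_grade(quiz=100.0, model=70.0, report=report)
print(report, grade)
```

Using the median means a single unusually harsh or generous reviewer does not move your report score.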

In my experience, the other reviewers were pretty generous in their grading and offered nice encouragement about my writing. It was quite a bit of fun to read others’ papers. There are some really smart people going through this course. If you took reasonably good notes while doing your analysis, you can probably convert those findings into a final document in five or six hours.

Deadlines

You will be warned at the beginning of the course that the paper is due a few days before the end of the course. This arrangement gives your peers time to grade your report. From the time you submit your report, you have four days to review three other papers. Grading a peer paper involves answering 8-9 multiple-choice questions and optionally filling in some comment sections with your reflections on the paper.

Once you start viewing peer papers, you are no longer able to submit your predictive model results for a higher score. You can expect your final paper grade within 48-60 hours of submitting your paper, after which you will be issued your certificate if you pass.

Time spent

I spent a lot of time on this course: over 120 hours. But the majority of this time (80%) was an investment in learning R. If I had been focused purely on getting the course done quickly, I would have used AzureML. The AzureML environment is ideally suited for doing a quick analysis of this sort of data. But from prior courses, I realized AzureML was not an ideal tool for my long-term use. AzureML doesn’t allow using Git source code control, the whole mouse-driven GUI aspect of it seems tedious, and there seemed little likelihood of reusing code in AzureML. In that context, the R language is a lot more appealing to me, though it required upping my game a fair bit when trying to explore this 300-column dataset. Here is a related blog entry about the sorts of R coding done as part of this project.

Community

There is a forum available for registered class participants. You are invited to introduce yourself and can pose questions to other participants and to the monitors of the course. It was fun to follow this and see how helpful others are. As one example, others asked the lead scorer of the course, “what secret methods are you using?” He was kind enough to post his full project on GitHub at the conclusion of the course.

I got the sense that a number of people had competed in prior competitions and were using the Capstone course to improve their data analysis skills. The course is set up in a neat way: there’s no cost to audit it, and as an auditor you can participate fully. That includes having your paper peer-reviewed.

Conclusion

Considering the six months of study, the money out of pocket and the stress of test taking, I’m genuinely glad to have taken these courses. To me, it was very much worth the effort. It has given me practical skills and the confidence to apply these techniques for my customers.

Given the sheer breadth of topics covered, I realize that completing a sequence like this marks the beginning of my learning path, not the end. For someone who loves to learn, that is truly a good thing!

I hope this summary is useful to others. Please visit my overview page for general suggestions for maximizing your experience in the Microsoft Professional Program for Data Science.