Unit 02 | Data Mining & CRISP-DM - DA5030 | Machine Learning & Data Mining

Unit 02 | Data Mining & CRISP-DM | 9.0 hrs

Upon completion of this module, you will be able to:

differentiate among the phases of the CRISP-DM framework
load data into R and perform basic data exploration and preparation
differentiate among several data imputation strategies
split a data set into training and validation subsets

The CRISP-DM Model

Slide Deck

Required Work

Read the following article for an overview of the CRISP-DM model for data mining. It outlines the key steps a data scientist generally takes when mining data and building a predictive model.
Watch the lecture (21m) providing an overview of CRISP-DM

Additional Resources

Article: What is the CRISP-DM Methodology by SV Europe
Video Tutorial: Data Mining Process and CRISP-DM (7m 45s)
Video Tutorial: Basics of Data Mining (9m 49s)

Data Quality and Missing Data

Required Work

Read this post on Data Science Central to get an overview of strategies for dealing with missing data. The post also summarizes various approaches to imputing missing values.
Watch the lecture on data imputation methods.
Read this article on data preparation. We will build on it with many more advanced techniques later.
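
The readings above discuss several ways to handle missing values. As a minimal sketch of three of the simplest options (listwise deletion, mean imputation, and median imputation), the base R code below works on a small made-up data frame. The data and the column names `age` and `income` are hypothetical and chosen only for illustration; which strategy is appropriate depends on why the values are missing.

```r
# Hypothetical toy data; values and column names are for illustration only.
df <- data.frame(
  age    = c(23, NA, 35, 41, NA, 29),
  income = c(52000, 48000, NA, 61000, 58000, NA)
)

# Count the missing values in each column before choosing a strategy.
colSums(is.na(df))

# Strategy 1: listwise deletion, keeping only rows with no missing values.
complete_rows <- df[complete.cases(df), ]

# Strategy 2: mean imputation, replacing each missing age with the mean
# of the observed ages.
df$age_imputed <- ifelse(is.na(df$age), mean(df$age, na.rm = TRUE), df$age)

# Strategy 3: median imputation, which is less sensitive to outliers.
df$income_imputed <- ifelse(is.na(df$income),
                            median(df$income, na.rm = TRUE),
                            df$income)

df
```

Listwise deletion discards data, while mean and median imputation keep every row but shrink the variance of the imputed column, so it is worth comparing results under more than one strategy.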

Exploring a Data Set

Required Work

Read the following chapters from Manas A. Pathak's Beginning Data Science with R. You don't have to read them deeply; instead, follow along in R and try out the different techniques for exploring a data set and becoming familiar with it. This exploration is essential for determining which data mining or machine learning techniques should be (and can be) applied.
Chapter 3 Getting Data into R
Chapter 4 Data Visualization
Chapter 5 Exploratory Data Analysis

This book is available for free through the Northeastern network. If you are off campus, you can "tunnel" into the library to access it. You do not have to purchase the book; a copy has also been uploaded to our LMS.
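
As a rough companion to the chapters listed above, the sketch below loads a CSV file into R and runs a few basic exploration steps. It is not taken from the book. To stay self-contained it first writes R's built-in `mtcars` data set to a temporary file; in practice `csv_path` would point to your own data.

```r
# Write a built-in data set to a temporary CSV so the example runs anywhere.
# With real data, csv_path would simply be the path to your file.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Getting data into R
cars <- read.csv(csv_path, stringsAsFactors = FALSE)

# Basic exploration
dim(cars)              # number of rows and columns
str(cars)              # type of each column
head(cars)             # first few rows
summary(cars)          # five-number summary and mean per column
colSums(is.na(cars))   # missing values per column

# Simple visualizations
hist(cars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
plot(cars$wt, cars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

Checking dimensions, column types, and missing values first helps catch import problems before any modeling starts.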

Training vs Validation Data

To avoid overfitting a model to the data, fit the model on a training data set only. To judge whether the model is good and to compare it with other models, apply it to a separate validation data set that was not used for fitting, and compare the predicted values with the actual values.
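
A minimal sketch of this idea in base R follows. The 70/30 split ratio, the seed, the built-in `mtcars` data, and the simple linear model are all assumptions made for illustration, not course requirements.

```r
set.seed(42)  # arbitrary seed so the random split is reproducible

n <- nrow(mtcars)

# Randomly choose 70% of the row indices for training (ratio is an assumption).
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train <- mtcars[train_idx, ]
valid <- mtcars[-train_idx, ]

# Fit the model on the training data only.
fit <- lm(mpg ~ wt, data = train)

# Apply the model to the validation data and compare predicted with actual.
pred <- predict(fit, newdata = valid)
rmse <- sqrt(mean((valid$mpg - pred)^2))
rmse
```

The root mean squared error (RMSE) computed on `valid` can then be compared across candidate models, since none of them saw those rows during fitting.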

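K-fold cross-validation is a common, albeit more advanced, strategy: the data are split into k subsets (folds), and each fold serves once as the validation set while the model is trained on the remaining k - 1 folds. Averaging the results across folds gives a more robust performance estimate than a single training/validation split.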
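
To make k-fold cross-validation concrete, here is a small base R sketch. The choice of k = 5, the seed, the built-in `mtcars` data, and the linear model are assumptions for illustration; packages such as caret offer ready-made helpers, but the loop below shows the mechanics.

```r
set.seed(42)  # arbitrary seed for reproducibility

k <- 5        # number of folds (an assumption; 5 and 10 are common choices)
n <- nrow(mtcars)

# Assign every row to one of k folds at random, with nearly equal fold sizes.
folds <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- numeric(k)
for (i in seq_len(k)) {
  train <- mtcars[folds != i, ]   # all folds except fold i
  valid <- mtcars[folds == i, ]   # fold i is held out for validation

  fit  <- lm(mpg ~ wt, data = train)
  pred <- predict(fit, newdata = valid)

  fold_rmse[i] <- sqrt(mean((valid$mpg - pred)^2))
}

fold_rmse        # validation error for each fold
mean(fold_rmse)  # averaged estimate across the k folds
```

Every row is used for validation exactly once, so the averaged error depends less on one lucky or unlucky split.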