Unit 02 | Data Mining & CRISP-DM | 9.0 hrs
Upon completion of this module, you will be able to:
- distinguish among the phases of the CRISP-DM framework
- load data into R and perform basic data exploration and preparation
- differentiate among several data imputation strategies
- split a data set into training and validation subsets
The CRISP-DM Model
Required Work
Additional Resources
Data Quality and Missing Data
Required Work
Additional Resources
Exploring a Data Set
Required Work
This book is available for free through the Northeastern network. If you are off campus, you can "tunnel" into the library to access it. You do not need to purchase the book; a copy has also been uploaded to our LMS.
Training vs Validation Data
To avoid overfitting a model to the data, fit every model on a training data set only. To judge whether a model is good, and to compare it with other models, apply it to a separate validation data set and compare the predicted values with the actual values.
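The workflow described above can be sketched in a few lines. This is only an illustration under stated assumptions, written in plain Python rather than R so that it has no package dependencies: the helper names `train_validation_split` and `rmse`, the 70/30 split, the fixed seed, and the toy data are all hypothetical choices, not part of the course materials. A straight line is fitted on the training rows only and then scored on the held-out validation rows.

```python
import math
import random


def train_validation_split(rows, validation_fraction=0.3, seed=42):
    """Shuffle the rows and return (training, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: (x, y) pairs that follow y = 2x + 1 exactly.
data = [(x, 2 * x + 1) for x in range(20)]
training, validation = train_validation_split(data)

# Fit a least-squares line using the training rows only.
n = len(training)
mean_x = sum(x for x, _ in training) / n
mean_y = sum(y for _, y in training) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in training)
         / sum((x - mean_x) ** 2 for x, _ in training))
intercept = mean_y - slope * mean_x

# Evaluate on the validation rows: predicted versus actual values.
predicted = [slope * x + intercept for x, _ in validation]
actual = [y for _, y in validation]
validation_rmse = rmse(actual, predicted)
print(len(training), len(validation), validation_rmse)
```

Because the toy data lie exactly on a line, the validation error is essentially zero; with real data, a validation error much larger than the training error is the usual sign of overfitting.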
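K-fold cross-validation is a common, albeit more advanced, strategy. The data are divided into k folds, and each fold serves once as the validation set while the remaining folds are used for training. This gives a more robust estimate of model performance than a single training/validation split.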
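As a rough sketch of how k-fold cross-validation partitions the rows, the generator below yields one (training, validation) pair of row indices per fold. The function name `k_fold_splits`, the default of five folds, and the fixed seed are assumptions made for illustration. It is written in plain Python; the same logic carries over to R.

```python
import random


def k_fold_splits(n_rows, k=5, seed=42):
    """Yield (training_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # seeded for reproducibility
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [idx
                    for j, fold in enumerate(folds) if j != held_out
                    for idx in fold]
        yield training, validation


splits = list(k_fold_splits(10, k=5))
for fold_number, (training_idx, validation_idx) in enumerate(splits, start=1):
    print(fold_number, sorted(validation_idx))
```

Each row appears in exactly one validation fold, so a model is fitted k times and every observation is used for validation once; the k validation scores are then typically averaged.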