Project (Signature Assignment)
The signature project provides you with an opportunity to complete a substantial effort where you showcase your understanding of the machine learning and data mining techniques studied in the course. It is an individual project and must be completed without external help.
- Identify a data set that you want to explore and for which you can build a minimum of three appropriate and useful machine learning or data mining models.
- Your effort must follow the CRISP-DM process and address business understanding, definition of the problem to be solved, data sources, data cleaning efforts, assessment of data quality, exploration of the data, transformations, imputation, case elimination, training vs validation data set division strategy, model selection, model tuning, evaluation, accuracy, etc. This resource summarizes many of the techniques and can be helpful.
- Explain in detail what you did, show your R code in an R Notebook, and explain why you chose this approach versus other possible approaches.
- Your models must be evaluated and compared using appropriate methods, e.g., MAD, MSE, RMSE, accuracy, AUC, R-Squared, etc.
- You must construct an ensemble learner from your models and also evaluate the ensemble.
- You must provide justification, interpretation, evaluation, and evidence for all your work. Don't simply provide the R output - you must interpret the output and evaluate it.
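As a rough illustration of the last few requirements (several models, comparison with MAD/MSE/RMSE/R-Squared, and an ensemble that is evaluated the same way), here is a minimal sketch in base R. The built-in `mtcars` data set, the three model formulas, the 70/30 split, and the helper name `eval_metrics` are assumptions chosen for the example, not part of the assignment. MAD is taken here to mean the mean absolute deviation of the prediction errors.

```r
# Illustrative only: mtcars, the formulas, and the split are assumptions.
set.seed(5030)
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
valid <- mtcars[-train_idx, ]

# Three simple regression models predicting mpg
models <- list(
  weight     = lm(mpg ~ wt, data = train),
  power      = lm(mpg ~ hp + disp, data = train),
  log_weight = lm(mpg ~ log(wt) + qsec, data = train)
)

# One column of validation-set predictions per model
preds <- sapply(models, predict, newdata = valid)

# Simple averaging ensemble: the mean of the three predictions per row
preds <- cbind(preds, ensemble = rowMeans(preds))

# Hypothetical helper that computes the comparison metrics
eval_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(MAD  = mean(abs(err)),
    MSE  = mean(err^2),
    RMSE = sqrt(mean(err^2)),
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2))
}

# One row per model (plus the ensemble), one column per metric
results <- t(apply(preds, 2, eval_metrics, actual = valid$mpg))
print(round(results, 3))
```

Printing the table is only the starting point: the notebook should still explain which model performs best on the validation data, whether the ensemble improves on its members, and why.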
Here are key things to keep in mind for the project; overlooking them will lose you points. Use this rubric for self-evaluation.
- where does the data come from?
- how do you plan on assessing data quality and deal with missing values?
- what strategy are you using for data imputation and why? if the data set has no missing data, can you randomly remove data and then impute the data and compare performance of your algorithms with imputed vs full data? why do they differ? how do they differ?
- how do you assess normality, distribution, skew -- and does it matter for your algorithms?
- what kinds of exploratory data analysis and visualization do you plan on doing?
- what kind of normalization, standardization, regularization, or transformation do you plan on using and why?
- how will you select the features? will you use PCA?
- what kind of feature engineering will you use? will you add new derived features?
- what do you plan on predicting?
- which algorithms will you use and why? naive bayes, knn, decision trees, rules, log regression, multi regression, lasso, ridge, neural net, svm, clustering
- how do you compare and evaluate the performance of the algorithms? R-Squared? MAD, MSE, RMSE? AIC? AUC? why are they that way?
- how do you choose training vs validation data? why?
- can you and should you use k-fold cross-validation?
- how would you build a stacked ensemble model? is it a better model? can you use boosting or bagging? or build a stacked learner?
- how will you communicate the results of your algorithms?
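Two of the questions above (randomly removing data and imputing it, and using k-fold cross-validation) can be combined into one small experiment. The sketch below is one possible way to set it up in base R, not a prescribed method. The `mtcars` data set, the 15% missingness rate, median imputation, the model formula, k = 5, and the helper name `cv_rmse` are all assumptions made for the example.

```r
# Illustrative only: data set, missingness rate, imputation method,
# formula, and k are assumptions.
set.seed(5030)
full <- mtcars[, c("mpg", "wt", "hp")]

# Randomly blank out 15% of the hp values, then impute with the median
holed <- full
missing_rows <- sample(nrow(holed), size = round(0.15 * nrow(holed)))
holed$hp[missing_rows] <- NA
imputed <- holed
imputed$hp[is.na(imputed$hp)] <- median(imputed$hp, na.rm = TRUE)

# Hypothetical helper: k-fold cross-validated RMSE for a linear model
cv_rmse <- function(data, formula, folds) {
  response <- all.vars(formula)[1]
  fold_rmse <- sapply(seq_len(max(folds)), function(i) {
    fit <- lm(formula, data = data[folds != i, ])
    held_out <- data[folds == i, ]
    sqrt(mean((held_out[[response]] - predict(fit, newdata = held_out))^2))
  })
  mean(fold_rmse)
}

# Assign folds once so both data sets are split identically
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(full)))

rmse_full <- cv_rmse(full, mpg ~ wt + hp, folds)
rmse_imputed <- cv_rmse(imputed, mpg ~ wt + hp, folds)
print(c(full = rmse_full, imputed = rmse_imputed))
```

For simplicity, this sketch imputes once before splitting, which lets information from the held-out rows influence the imputed values. A more careful version would compute the imputation value inside each training fold. Whichever variant you use, the write-up should discuss how and why the two error estimates differ.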
Submission Details -- Read This in Full!
- Your submission must contain two files and three links (add the links in the Comments section of the Assignment submission on Blackboard): the .Rmd notebook and a knitted PDF or HTML (from the notebook). The three links are described below (presentation, rubric, data set). Name your files and all links with the pattern, DA5030.Proj.LastName.{Rmd,[pdf,html]}, where LastName is *your* last name.
- The .Rmd file must contain fully commented and properly "chunked" R code and detailed explanations. Make sure that it is easy to recognize which question you answer and that your code runs from beginning to end (because that is how we will test it). Code that doesn't execute, stops, or throws errors will -- naturally -- receive no points. If the graders have to "debug" your code or spend any effort getting it to run, substantial points will be deducted. All of your code and explanations must be structured using the CRISP-DM model.
- In case it is not obvious, clickable means that one can click on a link and a new tab opens with the content.
- Not submitting a knitted PDF or HTML will result in a reduction of 30 points.
- Not submitting the .Rmd file (or both) will result in a score of 0.
- Upload this along with your data set (or a link) to Blackboard under Assignments/Project.
- In the comments section on Blackboard under Assignments/Project, submit a (clickable) link to a video presentation that demonstrates your working code and explains your project, your data set, and (optionally) slides summarizing your findings. Upload the video to YouTube or another video sharing site and submit a link, not the actual video. Below are some tools and strategies for creating screen narrations. Keep the presentation to a maximum of 10 minutes.
- Submit a (public and clickable) link to the completed rubric for self-evaluation (make a copy of the rubric on Google as a Google Sheet, do not request access to the rubric via Google and do not convert the file to Excel, CSV, or another format; to make a copy click on File/Make A Copy... in Google Sheet) in the comments section on Blackboard under Assignments/Project so we can see exactly what you did. Fill in the column for the points you "deserve" based on the work you've done. Tell us where in your R Notebook we can find that work - and note in the comments anything that does not work. Explain in the comments section the work that you did. If you place points into a column in the rubric for work that you did not actually complete, we consider this to be an academic integrity violation, which results in a 0 for the project and an F for the course. So, complete the rubric properly, honestly, and defensibly. Name the Google Sheet for the rubric LastName.DA5030.Rubric.Term, where Term is the term in which you are taking DA5030, e.g., Su19 for Summer 2019 or F19 for Fall 2019 or S20 for Spring 2020. Here's a video tutorial that shows you how to make and submit a public and clickable link on Blackboard.
- Failure to submit a public link to your presentation and the rubric as part of the submission comments (not embedded within your file) will result in a deduction of 10 points.
- Submit a (public) link to your data set.
- Post a (clickable) link to your video only to the group discussion board in your group's thread so your peers can see what you did. Provide peer feedback to everyone in your group. Feedback is required and if not provided will result in a grade of 0 and an F for the course.
- Note that a number of randomly selected students will be required to provide a live demonstration of their project via Skype, join.me, or Google Hangouts with question and answer. If you are selected, you are required to do a live demonstration - failure to do so will result in a grade of 0 and an F for the course.
- If your files are too big to upload to Blackboard, zip (not rar) them and provide a link to your work on Google Drive, OneDrive, or some other file sharing site.
- A video presentation is required; failure to submit a video presentation results in a 50% reduction in the final grade for the project.
- We expect to find copious comments, justifications, evidence, references, and interpretations in your R Notebooks. Remember, it's not just about getting some function or package to work -- you must tell us what the results mean.
Submission Checklist
☐ R Notebook(s) and knitted PDFs or HTMLs
☐ link to video, (link to) data set, and (public) link to completed copy of self-evaluation rubric provided in comments
☐ public and clickable link to video posted to BB Discussion Board in Project Peer Review forum under group thread
☐ peer reviews posted under group thread
Presentation Best Practices
A small slide deck may help summarize key points. Use it in conjunction with your R Notebook - in the presentation switch between them as needed. Be sure to pause recording when switching, setting up, or when long computations are performed. Here's an outline for a 10-minute presentation with timing in minutes and seconds:
- (0:30) overview of the data and where you obtained it
- (0:30) your business problems, i.e., what you are trying to predict
- (1:00) how you explored the data
- (1:30) what kind of transformation you needed to do - briefly show code
- (1:00) what models you built and why
- (1:30) how the models performed - briefly show running code
- (1:30) how you evaluated and validated the models
- (1:30) how you built an ensemble model and how well your ensemble performed
- (1:00) summary and key lessons learned
Presentation Recording Tools
After you have recorded the presentation, upload it to a video sharing site such as YouTube. If you do not wish to make the video public, then mark it as unlisted.
- join.me: Free; Browser
- MovAVI: Free; Windows
- TinyTake: Free; Windows
- VoiceThread: Free; Windows
- FlashBack Express: Free; Browser
- QuickTime Player: Free; Mac
- Win-Alt-R: Free; Windows
- SnagIt: Trial; Windows & Mac (15-day Trial is free)
- Screen-cast-o-matic: Free; Windows, Mac
Sources for Data Sets and Presentations
Resources
- https://machinelearningmastery.com/framework-for-data-preparation-for-machine-learning/
- https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
- https://machinelearningmastery.com/feature-selection-with-categorical-data/
- https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
Grading
- The project is graded out of 100 points. It is possible to get more than 100 points and thus additional points can be earned towards the final grade.
- It is not possible to earn all possible points as many activities are orthogonal. However, the rubric is designed such that every student has the opportunity to earn 100 points regardless of the data set chosen.