Assignment 3 - Machine Learning In Action
Assignment Objectives
Experience the machine learning workflow first-hand
Apply knowledge about regression and classification algorithms on real data!
Learn the basics of the pandas and sklearn python libraries
Learn how to find an interesting dataset from a popular online repository (Kaggle)
Learn how to analyze a dataset with a Jupyter Notebook and share your insights with others
Pre-Requisites
Completion of Assignment 2 - Setup Jupyter is required. Knowledge of the basic syntax of Python is expected, as is background knowledge of the algorithms you will use in this assignment. If any part of this assignment seems unclear, please send an email to wmcnichols@ccny.cuny.edu or message me on the course Slack.
Part 0: Understand the assignment
In this assignment you will be building a respectably sized Jupyter notebook report in which you use machine learning libraries to make predictions on datasets of your choosing. The report will have 4 sections outlined below.
Before you get started, read through the instructions carefully and understand the requirements of the assignment. I'd recommend experimenting with your data before you start building the report, to ensure it's a good fit for the algorithms you're using.
You can view a sample notebook submission here. Feel free to use this notebook as a starting template for your report and fill in the missing code sections as you go.
Abstract (5 points)
- Talk about what datasets you've selected for your notebook
- Include why you chose these datasets
- Talk about the classification algorithm you plan to use
- Mention what you hope to find over the course of this notebook

Section 1: Regression Dataset Prep (10 points)
- Select your datasets in the Google Sheet by 4/22/2020 (5 points)
- At a high level, discuss what columns are included in the data
- Load the dataset you've selected using pandas
- Show the head of the data
- For the columns you will be using in your regression, describe their setup in more detail
- Clean the data if needed using pandas

Section 2: Regression (15 points)
- Split your cleaned dataset into a training set and a test set using sklearn
- Fit a linear regression to your training set
- Visualize the regressor using matplotlib
- Report the error rate on your test set
- Perform a k-fold cross-validation on the dataset and report the mean error rate (K >= 5)

Section 3: Classification Dataset Prep (5 points)
- At a high level, discuss what columns are included in the data
- Load the dataset you've selected using pandas
- Show the head of the data
- For the columns you will be using in your classification, describe the range of values
- Map those columns' values to a set of integers (if they aren't already)
- Clean the data using pandas if needed

Section 4: Classification (15 points)
- You can use any of the classifier algorithms we talked about in class, namely:
  - Support Vector Machines
  - k-Nearest Neighbor
  - Multi-Layer Perceptron (Neural Networks)
- Important: if you use an algorithm outside the ones discussed in class, you must discuss why you selected it
- Split your cleaned dataset into a training set and a test set using sklearn
- (Situational) Scale your data to prevent overfitting
- Fit a classifier to your training set
- (Optional) Visualize the classifier on a data plot
- Report the error rate on your test set
- Perform a k-fold cross-validation on the dataset and report the mean error rate (K >= 5)

Conclusions (5 points)
- Summarize the findings of your report
- Restate your methodology and key findings for each model
- Highlight what you found interesting
- Discuss what you would do to extend the project further

Max points: 55
Part 1 : Gather your dataset(s)
Kaggle.com is a web platform whose mission is to create a thriving data science community. The company, a subsidiary of Google, is a great community for anyone who wants to learn more about data science. The website also hosts open competitions in which participants compete for cash prizes by submitting analyses of released datasets.
For the purposes of this assignment Kaggle is also a repository of a large number of free datasets! You will get to choose your own datasets for this project which you will then apply a handful of machine learning algorithms on.
First, navigate to https://www.kaggle.com and create an account (or log in if you have one already).
Once you're logged in, navigate to https://www.kaggle.com/datasets and explore the available datasets to find a few of interest. I'd recommend finding a handful that look promising and saving them to your favorites. See below for some tips on finding a good dataset. In this assignment you will be performing both classification and regression on labeled data. You may be able to do this all with a single dataset, but most likely you will need multiple.
Dataset Requirements:
You cannot use the iris or digits datasets
Your data must have a minimum of 50 entries (rows)
For regression, you will need at least 2 quantitative (numeric) columns
For classification, you will need at least 1 qualitative (categorical) feature and at least 2 quantitative features for prediction (or qualitative features that can be mapped)
For the feature you're trying to predict, your data must be labeled (no unsupervised learning)
Tips on finding good data:
I would strongly advise picking a dataset with high usability. The goal of this assignment is to get you comfortable with the tools used by data scientists and to have fun, not to pull your hair out with messy data. You can sort datasets on Kaggle by usability score; I wouldn't recommend using a dataset with a usability score of less than 9.0.
That said, find something you're interested in! You will be spending a good bit of time with this data and it will be much more enjoyable if you are interested in the output.
Once you've found your dataset, claim it! To prevent duplicate submissions, everyone will need to work with a unique dataset. Once you've selected the datasets you want to use, add them along with your name to the Google Sheet posted in Slack: https://docs.google.com/spreadsheets/d/1ERNU86Z1cbp9kI3QOK9FQqI1s0vvH5exwOT9LMhfdxM/edit#gid=0
Part 2: Build Your Notebook
Below is some guidance and tips for completing the requirements of each section.
Prepping and Analyzing your data
There are many different data loading/analysis libraries out there for Python, but don't reinvent the wheel. pandas is by far the most widely used library for manipulating datasets. It includes tools for loading datasets, slicing/combining the data, and easily converting back and forth to numpy primitives. The following tutorials should cover all the tools you will need to complete this assignment.
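As a minimal sketch of the load/inspect/clean loop (in practice you'd pass `pd.read_csv` the path of the CSV you downloaded from Kaggle; the inline sample and column names below are placeholders):

```python
import io
import pandas as pd

# Stand-in for a downloaded file, e.g. pd.read_csv("my_dataset.csv").
csv_data = io.StringIO(
    "height,weight,label\n"
    "1.7,65,a\n"
    "1.8,,b\n"
    "1.6,50,a\n"
)
df = pd.read_csv(csv_data)

print(df.head())    # first rows of the data
print(df.dtypes)    # column types pandas inferred

# Drop rows with missing values in the columns you plan to use.
clean = df.dropna(subset=["weight"])
print(len(clean))   # 2
```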
The following function will also be helpful for any data mapping you need to do in the classification section.
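One pandas tool that covers this kind of mapping is `Series.map`, which replaces each value using a dictionary; a small sketch (the column and codes below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# Map each category to an integer code with Series.map.
color_codes = {"red": 0, "green": 1, "blue": 2}
df["color_num"] = df["color"].map(color_codes)
print(df["color_num"].tolist())  # [0, 1, 0, 2]
```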
Splitting and Cross-Validating Data
SciKit Learn is a popular and easy-to-use machine learning library for Python, and its documentation is very thorough as well. These pages should get you off the ground for setting up your datasets:
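The two calls you'll need are `train_test_split` (for the hold-out split) and `cross_val_score` (for the k-fold requirement). A sketch on toy, exactly-linear data standing in for your dataset's columns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Toy feature/target arrays standing in for your dataset's columns.
X = np.arange(60, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

# Hold out 20% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 48 12

# 5-fold cross-validation (K >= 5 per the requirements);
# the default score for a regressor is R^2.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean())
```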
Performing a regression
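A minimal end-to-end regression sketch, assuming synthetic near-linear data in place of your dataset's two numeric columns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y ~ 2x plus noise, standing in for your columns.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)
print("test MSE:", mean_squared_error(y_test, pred))

# To visualize the regressor with matplotlib:
# import matplotlib.pyplot as plt
# plt.scatter(X_test, y_test)
# plt.plot(X_test, pred, color="red")
# plt.show()
```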
Performing a classification
We've discussed 3 different classifiers in class, namely:
Support Vector Machines
k-Nearest Neighbor
Multi-Layer Perceptron (Neural Networks)
Important: If you use an algorithm outside the ones discussed in class you must discuss why you thought it was appropriate in your notebook.
SciKit Learn documentation for each classifier follows:
Support Vector Machine
k-Nearest Neighbor Classifier
Multi-Layer Perceptron (Neural Networks)
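A sketch of the classification workflow using k-Nearest Neighbor (the synthetic data stands in for your dataset; putting the scaler in a pipeline keeps the test set out of the scaler's fit):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic labeled data standing in for your features/labels.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, then classify; swap in SVC or MLPClassifier the same way.
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)

print("test error rate:", 1 - clf.score(X_test, y_test))
print("5-fold mean accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```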
Part 3: Submit your notebook
Once you've completed your jupyter report you will need to upload it to both Blackboard and to the shared Google Cloud Compute Instance.
Blackboard Submission
Zip your notebook (.ipynb file) and your datasets together into a directory with the following naming convention:
familyName_GivenName_reportTitle. Here's a one-liner to zip it up (on unix machines):
Once compressed, upload your archive to our course blackboard under content > assignments > Assignment 3
Google Cloud Submission
Navigate to the class's shared notebook which is hosted on google cloud at http://35.245.58.149:8888/
The password is shared on Slack; if you can't find it, feel free to drop a message in #ai_chat.
Note: this address is different from the one shared in Lecture 7.
Once in the shared notebook navigate to the Assignment 3 directory and create a folder for your submission with the same naming conventions as the Blackboard submission.
Upload your notebook (.ipynb file) and all of its dependencies to that folder, then test your notebook.
It is critical that you test your notebook in this remote environment, as the server may not have all the dependencies your machine does. If your notebook does not run properly due to missing packages, reach out to me to add them, and note the missing dependencies in the "Abstract" section of your notebook.
Once tested, you're all done! I hope you enjoyed doing some real machine learning 🤖