Assignment 3 - Machine Learning In Action

Assignment Objectives

  • Experience the machine learning workflow first-hand

  • Apply knowledge about regression and classification algorithms on real data!

  • Learn the basics of the pandas and sklearn Python libraries

  • Learn how to find an interesting dataset from a popular online repository (Kaggle)

  • Learn how to analyze a dataset with a Jupyter Notebook and share your insights with others

Pre-Requisites

Completion of Assignment 2 - Setup Jupyter is required. Knowledge of the basic syntax of Python is expected, as is background knowledge of the algorithms you will use in this assignment. If any part of this assignment seems unclear, please send an email to wmcnichols@ccny.cuny.edu or message me on the course Slack.

Part 0: Understand the assignment

In this assignment you will be building a respectably sized Jupyter notebook report in which you use machine learning libraries to make predictions on datasets of your choosing. The report will have four main sections, plus an abstract and conclusions, outlined below.

Before you get started, read through the instructions carefully and understand the requirements of the assignment. I'd also recommend experimenting with your data before you start building the report, to ensure it's a good fit for the algorithms you're using.

You can view a sample notebook submission here. Feel free to use this notebook as a starting template for your report and fill in the missing code sections as you go.

Download: Cats Notebook (13 KB)

Each notebook section, its possible points, and its requirements are listed below.

Abstract (5 points)

  • Talk about what datasets you've selected for your notebook

  • Include why you chose these datasets

  • Talk about the classification algorithm you plan to use

  • Mention what you hope to find over the course of this notebook

Section 1: Regression Dataset Prep (10 points)

Selecting your datasets in the Google Sheet by 4/22/2020 (5 points)

  • At a high level, discuss what columns are included in the data

  • Load the dataset you've selected using pandas

  • Show the head of the data

  • For the columns you will be using in your regression, describe their setup in more detail

  • Clean the data if needed using pandas
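A minimal sketch of the prep steps above, using an inline CSV in place of a real Kaggle download (the column names and values here are made up for illustration; in your notebook you would call `pd.read_csv("your_dataset.csv")`):

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a downloaded Kaggle file.
csv_text = """weight_kg,height_cm,age_years
4.2,25,3
5.1,27,5
,30,7
3.8,24,2
"""

df = pd.read_csv(io.StringIO(csv_text))

# Show the first rows of the data (just `df.head()` in a notebook cell).
print(df.head())

# Basic cleaning: drop rows with missing values in the columns you'll use.
clean = df.dropna(subset=["weight_kg", "height_cm"])
print(len(df), "rows before cleaning,", len(clean), "after")
```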

Section 2: Regression (15 points)

  • Split your cleaned dataset using sklearn into a training and test set

  • Fit a linear regression to your training set

  • Visualize the regressor using matplotlib

  • Report on the error rate for your test set

  • Perform a k-fold cross validation on the dataset and report on the mean error rate (K >= 5)
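The steps above can be sketched as follows. Synthetic data stands in for your own quantitative columns (the variable names, the 2.5 slope, and the noise level are illustrative assumptions, not part of the assignment):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in for two quantitative columns of a cleaned DataFrame,
# e.g. X = clean[["height_cm"]] and y = clean["weight_kg"].
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = 2.5 * X.ravel() + rng.normal(0, 1, size=60)

# Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit a linear regression to the training set.
reg = LinearRegression().fit(X_train, y_train)

# Report the error on the held-out test set (R^2 here; you could also
# use sklearn.metrics.mean_squared_error).
print("test R^2:", reg.score(X_test, y_test))

# To visualize with matplotlib:
#   import matplotlib.pyplot as plt
#   plt.scatter(X, y); plt.plot(X, reg.predict(X)); plt.show()

# K-fold cross-validation with K >= 5, reporting the mean score.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("mean CV R^2:", scores.mean())
```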

Section 3: Classification Dataset Prep (5 points)

  • At a high level, discuss what columns are included in the data

  • Load the dataset you've selected using pandas

  • Show the head of the data

  • For the columns you will be using in your classification, describe the range of values

  • For those columns map the values to a set of integers (if they aren't already)

  • Clean the data using pandas if needed

Section 4: Classification (15 points)

You can use any of the classifier algorithms we talked about in class for this section, namely:

  1. Support Vector Machines

  2. k-Nearest Neighbor

  3. Multi-Layer Perceptron (Neural Networks)

Important: If you use an algorithm outside the ones discussed in class, you must discuss why you selected it.

  • Split your cleaned dataset using sklearn into a training and test set

  • (Situational) Scale your data if your classifier is sensitive to feature magnitudes (e.g. SVM, k-NN, MLP)

  • Fit a classifier to your training set

  • (Optional) visualize the classifier on a data plot

  • Report on the error rate for your test set

  • Perform a k-fold cross validation on the dataset and report on the mean error rate (K >= 5)
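One possible sketch of this section's steps, with k-NN chosen arbitrarily (any of the three classifiers would do) and synthetic two-cluster data standing in for your mapped features:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and integer class labels standing in for your
# mapped classification columns: two well-separated 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(40, 2)), rng.normal(4, 1, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

# Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The situational scaling step, folded into a pipeline so the scaler is
# fit only on training data; swap in SVC or MLPClassifier as you prefer.
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)

# Error rate on the test set.
print("test error rate:", 1 - clf.score(X_test, y_test))

# K-fold cross-validation (K >= 5), reporting the mean error rate.
scores = cross_val_score(clf, X, y, cv=5)
print("mean CV error rate:", 1 - scores.mean())
```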

Conclusions (5 points)

  • Summarize the findings of your report

  • Repeat your methodology and key findings for each model

  • Highlight what you found interesting

  • Discuss what you would do to extend the project further

Max points: 55

Part 1 : Gather your dataset(s)

Kaggle.com is a web platform whose mission is to create a thriving data science community. The company, a subsidiary of Google, is a great community for anyone who wants to learn more about data science. The website also hosts open competitions in which participants compete for cash prizes by submitting analyses of released datasets.

For the purposes of this assignment Kaggle is also a repository of a large number of free datasets! You will get to choose your own datasets for this project which you will then apply a handful of machine learning algorithms on.

First, navigate to https://www.kaggle.com and create an account (or log in if you have one already).

Once you're logged in, navigate to https://www.kaggle.com/datasets and explore the available datasets to find a few of interest. I'd recommend saving a handful that look promising to your favorites. See below for tips on finding a good dataset. In this assignment you will be performing both classification and regression on labeled data. You may be able to do this all with a single dataset, but you will most likely need multiple.

Dataset Requirements:

  1. You cannot use the iris or digits datasets

  2. Your data must have a minimum of 50 entries (rows)

  3. For regression, you will need at least 2 quantitative (numeric) columns

  4. For classification, you will need at least 1 qualitative (categorical) feature and at least 2 quantitative features for prediction (or qualitative features that can be mapped)

  5. For the feature you're trying to predict, your data must be labeled (no unsupervised learning)

Tips on finding good data:


I would strongly advise picking a dataset with high usability. The goal of this assignment is to get you comfortable with the tools used by data scientists and to have fun, not to pull your hair out with messy data. You can sort datasets on Kaggle by usability score; I wouldn't recommend using a dataset with a usability score of less than 9.0.

That said, find something you're interested in! You will be spending a good bit of time with this data and it will be much more enjoyable if you are interested in the output.

Once you've found your dataset, claim it! To prevent duplicate submissions, everyone will need to work with a unique dataset. Once you've selected the datasets you want to use, add them along with your name to the Google Sheet posted in Slack: https://docs.google.com/spreadsheets/d/1ERNU86Z1cbp9kI3QOK9FQqI1s0vvH5exwOT9LMhfdxM/edit#gid=0

Part 2: Build Your Notebook

Below is some guidance and tips for completing the requirements of each section.

Prepping and Analyzing your data

There are many different data loading/analysis libraries out there for Python, but don't reinvent the wheel. Pandas is by far the most widely used library for manipulating datasets. It includes tools for loading datasets, slicing/combining data, and easily converting back and forth to NumPy primitives. The following tutorials should cover all the tools you will need to complete this assignment.

The following function will also be helpful for any data mapping you need to do in the classification section.
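The original helper did not survive export; below is a minimal stand-in, assuming the goal is to map a qualitative column's values onto integer codes (the function name and example column are my own, not from the assignment):

```python
import pandas as pd

def map_to_ints(df, column):
    """Replace each distinct value in `column` with an integer code and
    return the DataFrame along with the value -> integer mapping."""
    mapping = {value: code for code, value in enumerate(sorted(df[column].unique()))}
    df[column] = df[column].map(mapping)
    return df, mapping

# Example usage on a hypothetical qualitative column.
df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
df, mapping = map_to_ints(df, "color")
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
```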

Splitting and Cross-Validating Data

SciKit Learn is a popular and easy-to-use machine learning library for Python, and its documentation is very thorough. These pages should get you off the ground for setting up your datasets:

Performing a regression

Performing a classification

We've discussed 3 different classifiers in class, namely:

  1. Support Vector Machines

  2. k-Nearest Neighbor

  3. Multi-Layer Perceptron (Neural Networks)

    Important: If you use an algorithm outside the ones discussed in class you must discuss why you thought it was appropriate in your notebook.

SciKit Learn documentation for each classifier follows:

  1. Support Vector Machine

  2. k-Nearest Neighbor Classifier

  3. Multi-Layer Perceptron (Neural Networks)

Part 3: Submit your notebook

Once you've completed your Jupyter report, you will need to upload it to both Blackboard and the shared Google Cloud Compute instance.

Blackboard Submission

Zip your notebook (.ipynb file) and your datasets together into a directory with the following naming convention: familyName_GivenName_reportTitle. Here's a one-liner to zip it (on Unix machines):

Once compressed, upload your archive to our course Blackboard under Content > Assignments > Assignment 3.

Google Cloud Submission

Navigate to the class's shared notebook, which is hosted on Google Cloud at http://35.245.58.149:8888/

The password is shared on Slack; if you can't find it, feel free to drop a message in #ai_chat.


Note: this address is different from the one shared in Lecture 7.

Once in the shared notebook, navigate to the Assignment 3 directory and create a folder for your submission with the same naming convention as the Blackboard submission.

Upload your notebook (.ipynb file) and all the data files it depends on, then test the notebook.


It is critical that you test your notebook in this remote environment, as the server may not have all the dependencies that your machine does. If your notebook does not run properly due to missing packages, reach out to me to add them, and note the missing dependencies in the "Abstract" section of your notebook.

Once tested you're all done! I hope you enjoyed doing some real machine learning πŸ€–
