import pandas as pd
import numpy as np
Each February, thousands of factory owners and manufacturers around the world compete for contracts to produce consumer goods for large European retailers.
In January, the regulatory body DIOGENES assigns a d_score to each manufacturer that registers for bidding. A manufacturer's d_score is intended to be an indicator of general productivity/efficiency, and is assigned by analysts based on:
Possible d_score values are A, B, C, D, or F.
In other words, DIOGENES releases a table of manufacturers' DPI scores each March, June, September, and December. Analysts combine these DPI values with what they observe during inspection to derive the final d_score, which is published in January just before bidding begins. Since d_score has a large effect on who wins the most lucrative contracts, a reliable method for predicting it would be quite valuable.
We have been given a data set containing historical DPI values and the resulting d_scores for the years 1995 through 2015. It is now July of 2016, and we want to predict the upcoming 2017 d_scores using a Python program. Given that it's July, we can only use the March and June DPI values for prediction.
You are provided with three files: training_set.csv, test_set.csv, and sample_submission.csv. The file training_set.csv contains historical DPI values and d_scores for 1995 through 2015. Here are the first few lines of this file:
historical = pd.read_csv('../data/final/training_set.csv', index_col=0)
historical.head()
There are 30271 data points. This data can be used to train a model for predicting d_score.
The file test_set.csv contains the 2016 DPI values for March and June. This is the data you will use to predict the 2017 d_scores. The first few lines of test_set.csv look like:
test_set = pd.read_csv('../data/final/test_set.csv', index_col=0)
test_set.head()
There are 3363 lines.
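Because only the March and June values are available at prediction time, a model should be trained on inputs that also exist in the test set. The following is a minimal sketch of one way to do that, using small toy frames in place of the real files. All column names (`dpi_march`, `dpi_june`, `dpi_september`, `dpi_december`, `d_score`) are hypothetical; check the actual headers before reusing this.

```python
import pandas as pd

# Toy stand-ins for training_set.csv and test_set.csv.
# Every column name below is hypothetical, not taken from the real files.
train = pd.DataFrame({
    "dpi_march": [0.61, 0.42, 0.77],
    "dpi_june": [0.58, 0.40, 0.80],
    "dpi_september": [0.60, 0.39, 0.79],
    "dpi_december": [0.63, 0.41, 0.82],
    "d_score": ["B", "D", "A"],
})
test = pd.DataFrame({
    "dpi_march": [0.55, 0.70],
    "dpi_june": [0.52, 0.73],
})

# Keep only the inputs that are also present at prediction time.
feature_cols = [c for c in train.columns if c in test.columns]
X_train = train[feature_cols]
y_train = train["d_score"]

print(feature_cols)
```

Selecting the shared columns programmatically avoids accidentally training on values (such as later-quarter DPI scores) that the 2016 data cannot supply.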
You will have to generate a prediction for each entry in the test data set. Your prediction should be a probability distribution over the possible d_scores. For instance, a prediction of [0.1, 0.5, 0.2, 0.1, 0.1] corresponds to a distribution over ['A', 'B', 'C', 'D', 'F'] that assigns probabilities 'A' -> 0.1, 'B' -> 0.5, 'C' -> 0.2, etc.
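As a small illustration of this format, the sketch below pairs a list of probabilities with the labels and runs basic sanity checks. The helper name `as_distribution` is made up for this example and is not part of any provided code.

```python
import math

LABELS = ["A", "B", "C", "D", "F"]

def as_distribution(probs):
    """Pair each probability with its label after basic sanity checks."""
    if len(probs) != len(LABELS):
        raise ValueError("expected one probability per label")
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return dict(zip(LABELS, probs))

dist = as_distribution([0.1, 0.5, 0.2, 0.1, 0.1])
most_likely = max(dist, key=dist.get)  # label with the highest probability
print(dist, most_likely)
```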
Your predictions for all data points in the test set should be exported in a single .csv file. Predictions must be stored in the same order as they occur in the test_set.csv file. That is, the prediction on row N of your submission file should correspond to the DPI values on row N in test_set.csv. The sample_submission.csv file contains an example of a properly formatted submission.
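A hedged sketch of exporting predictions follows. The exact layout (header, index column, column names) must be copied from sample_submission.csv; the layout below, with one column per label and an index named `id`, is only an assumption for illustration. The example writes to an in-memory buffer so it runs without touching the disk; in practice a file path would be passed to `to_csv` instead.

```python
import io

import numpy as np
import pandas as pd

LABELS = ["A", "B", "C", "D", "F"]

# Hypothetical test-set index and model output, one row per test entry,
# kept in the same order as the test set.
test_index = pd.Index([101, 102, 103], name="id")
probs = np.array([
    [0.10, 0.50, 0.20, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.05, 0.05],
    [0.20, 0.20, 0.20, 0.20, 0.20],
])

# Reusing the test index keeps row N of the submission aligned with row N
# of the test set.
submission = pd.DataFrame(probs, index=test_index, columns=LABELS)

# Sanity checks before exporting.
assert len(submission) == len(test_index)
assert np.allclose(submission.sum(axis=1), 1.0)

buffer = io.StringIO()
submission.to_csv(buffer)  # assumed layout; compare with sample_submission.csv
csv_text = buffer.getvalue()
print(csv_text)
```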
There are three components to the project: exploring the data, building a model that predicts d_scores in Python, and generating predictions of the 2017 d_score values.
Before constructing your model, you should explore the data. Some good questions to ask are: What ranges do the values lie in? Is there missing data? How are the input variables related to the target variable? It is often useful to do some preprocessing or simplification of the data in this step. pandas is a useful Python library for handling data, while matplotlib is standard for basic 2D and 3D plotting. These libraries may come in handy, but you are not required to use them.
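The exploration questions above can be sketched with a few pandas calls. This uses a toy frame in place of training_set.csv, and the column names (`dpi_march`, `dpi_june`, `d_score`) are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; column names are hypothetical.
df = pd.DataFrame({
    "dpi_march": [0.61, 0.42, np.nan, 0.77, 0.30, 0.68],
    "dpi_june": [0.58, 0.40, 0.51, 0.80, np.nan, 0.71],
    "d_score": ["B", "D", "C", "A", "F", "B"],
})

# What ranges do the values lie in? (numeric columns only)
summary = df.describe()

# Is there missing data? Count NaN entries per column.
missing = df.isna().sum()

# How is the target distributed?
class_counts = df["d_score"].value_counts()

# How are the inputs related to the target? Mean input per d_score.
means_by_score = df.groupby("d_score")[["dpi_march", "dpi_june"]].mean()

print(summary, missing, class_counts, means_by_score, sep="\n\n")
```

On the real data, the same per-class summary could be plotted (for example as box plots with matplotlib) to see how well the DPI values separate the grades.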
In this step you write a Python program that generates d_score predictions based on DPI values. Training data for your model is made available in training_set.csv. See section 2 above for further details. Useful libraries may include scikit-learn, xgboost, tflearn, and of course numpy/scipy.
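As one possible starting point (not a prescribed method), the sketch below fits a scikit-learn `LogisticRegression` and produces a probability distribution per row with `predict_proba`. The data is synthetic, and the relationship between the two inputs and the grade is invented purely so the example runs; it says nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
labels = np.array(["A", "B", "C", "D", "F"])

# Synthetic inputs standing in for two DPI columns (invented data).
n = 500
X = rng.uniform(0.0, 1.0, size=(n, 2))

# Invented rule: a higher average input gives a better grade.
bins = np.digitize(X.mean(axis=1), [0.2, 0.4, 0.6, 0.8])  # 0 (low) .. 4 (high)
y = labels[4 - bins]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# scikit-learn sorts class labels, so the probability columns follow
# model.classes_, which here is A, B, C, D, F.
X_new = np.array([[0.90, 0.85], [0.10, 0.20]])
proba = model.predict_proba(X_new)

print(model.classes_)
print(proba.round(3))
```

Checking `model.classes_` before exporting matters: if a grade were absent from the training data, the probability matrix would have fewer than five columns and would need to be padded to match the submission format.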
Once you've devised, implemented, and trained a model, you should generate predictions for the data in test_set.csv. Before the assignment is due, you will be able to make 5 preliminary submissions to see how your predictions score relative to the actual 2017 d_scores. See section 2 above and the sample submission file for further details, or email Gideon Providence at g@math.utoronto.ca. The preliminary submissions are a useful way of gauging your model's accuracy.
Your final mark will be based on: