In [43]:
import pandas as pd
import numpy as np

Background

Each February thousands of factory owners and manufacturers around the world compete for contracts to produce consumer goods for large European retailers.

In January, the regulatory body DIOGENES assigns a d_score to each manufacturer that registers for bidding. A manufacturer's d_score is intended to be an indicator of general productivity/efficiency, and is assigned by analysts based on:

  • the results of a physical inspection of the manufacturing plant in early January,
  • a quarterly productivity index (the DPI) published by DIOGENES in the year before bidding.

Possible d_score values are A, B, C, D, or F.

In other words, DIOGENES releases a table of manufacturers' DPI scores each March, June, September, and December. Analysts combine these DPI values with what they observe during inspection to derive the final d_score, which is published in January just before bidding begins. Since d_score has a large effect on who wins the most lucrative contracts, a reliable method for predicting it would be quite valuable.

Data Description

We have been given a data set containing historical DPI values and resulting d_scores for the years 1995 through 2015. It is now July of 2016, and we want to predict the upcoming 2017 d_scores using a python program. Given that it's July, we can only use the March and June DPI values for prediction.

Training data

You are provided with three files: training_set.csv, test_set.csv, and sample_submission.csv. The file training_set.csv contains historical DPI values and d_scores for 1995 through 2015. Here are the first few lines of this file:

In [45]:
historical = pd.read_csv('../data/final/training_set.csv', index_col=0)
historical.head()
Out[45]:
   YEAR  DPI_MAR  DPI_JUN  DPI_SEP  DPI_DEC  D_SCORE
0  2002     0.76     0.39     0.35     0.50        D
1  2002     0.56     0.49     0.47     0.46        F
2  2002     0.92     0.76     0.95     0.84        A
3  2002     0.64     0.37     0.33     0.00        F
4  2002     0.40     0.00     0.00     0.00        F

There are 30271 data points. This data can be used to train a model for predicting d_score.
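Because only the March and June values are known at prediction time, a model should be fit on those two columns alone, even though the training file also contains September and December. A minimal sketch, using a small hand-made frame in place of training_set.csv (the rows are copied from the preview above; FEATURES and TARGET are illustrative names, not part of the assignment):

```python
import pandas as pd

# Stand-in for training_set.csv: the five rows shown in the preview above.
historical = pd.DataFrame({
    "YEAR": [2002] * 5,
    "DPI_MAR": [0.76, 0.56, 0.92, 0.64, 0.40],
    "DPI_JUN": [0.39, 0.49, 0.76, 0.37, 0.00],
    "DPI_SEP": [0.35, 0.47, 0.95, 0.33, 0.00],
    "DPI_DEC": [0.50, 0.46, 0.84, 0.00, 0.00],
    "D_SCORE": ["D", "F", "A", "F", "F"],
})

FEATURES = ["DPI_MAR", "DPI_JUN"]  # the only DPI columns present in test_set.csv
TARGET = "D_SCORE"

X = historical[FEATURES]  # model inputs
y = historical[TARGET]    # grades to predict
print(X.shape, y.value_counts().to_dict())
```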

Test data

The file test_set.csv contains the 2016 DPI values for March and June. This is the data you will use to predict the 2017 d_scores. The first few lines of test_set.csv look like:

In [48]:
test_set = pd.read_csv('../data/final/test_set.csv', index_col=0)
test_set.head()
Out[48]:
   YEAR  DPI_MAR  DPI_JUN
0  2016     0.73     0.73
1  2016     0.68     0.70
2  2016     0.79     0.70
3  2016     0.69     0.85
4  2016     0.88     0.77

There are 3363 lines.

Prediction data

You will have to generate a prediction for each entry in the test data set. Your prediction should be a probability distribution over the possible d_scores. For instance, a prediction of [0.1, 0.5, 0.2, 0.1, 0.1] corresponds to a distribution over ['A', 'B', 'C', 'D', 'F'] that assigns probabilities 'A' -> 0.1, 'B' -> 0.5, 'C' -> 0.2, etc.
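A quick sanity check on a single prediction might look like the following sketch. The helper name check_distribution and the tolerance are assumptions made for illustration; the grade order is the one used in the example above.

```python
import numpy as np

GRADES = ["A", "B", "C", "D", "F"]  # order used in the example above

def check_distribution(probs, tol=1e-8):
    """Return True if probs has one non-negative entry per grade and sums to 1."""
    p = np.asarray(probs, dtype=float)
    return bool(
        p.shape == (len(GRADES),)
        and np.all(p >= 0)
        and abs(p.sum() - 1.0) <= tol
    )

prediction = [0.1, 0.5, 0.2, 0.1, 0.1]
as_mapping = dict(zip(GRADES, prediction))        # grade -> probability
most_likely = GRADES[int(np.argmax(prediction))]  # grade with the highest probability
```

Running a check like this over every row before exporting catches rows that were left unnormalised or that have the wrong number of entries.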

Your predictions for all data points in the test set should be exported in a single .csv file. Predictions must be stored in the same order as they occur in test_set.csv. That is, the prediction on row N of your submission file should correspond to the DPI values on row N of test_set.csv. The sample_submission.csv file contains an example of a properly formatted submission.
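The exact column layout must be taken from sample_submission.csv, which is not reproduced here. The sketch below assumes one column per grade and one row per test entry, in test-set order. The column names and the file name submission.csv are assumptions and should be checked against the sample file.

```python
import io

import numpy as np
import pandas as pd

GRADES = ["A", "B", "C", "D", "F"]

# Hypothetical model output: one probability row per test-set row, same order.
probs = np.array([
    [0.10, 0.50, 0.20, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.05, 0.05],
])

# The default 0..N-1 index mirrors the row order of test_set.csv.
submission = pd.DataFrame(probs, columns=GRADES)

# Written to an in-memory buffer here; pass a path such as "submission.csv"
# (hypothetical name) to to_csv for a real export.
buffer = io.StringIO()
submission.to_csv(buffer)
csv_text = buffer.getvalue()
```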

Assignment details

There are three components to the project:

  • exploratory data analysis (EDA),
  • implementing a mathematical/statistical predictive model for d_scores in python,
  • generating and submitting predicted 2017 d_score values.

EDA

Before constructing your model, you should explore the data. Some good questions to ask are: what ranges do the values lie in? Is there missing data? How are the input variables related to the target variable? Often it is useful to do some preprocessing or simplification of the data in this step. pandas is a useful python library for handling data, while matplotlib is standard for basic 2D and 3D plotting. These libraries may come in handy, but you are not required to use them.
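The questions above translate fairly directly into pandas calls. A minimal sketch, again using the five preview rows as a stand-in for the full training file (the variable names are illustrative):

```python
import pandas as pd

# Stand-in rows; in practice this would be the frame read from training_set.csv.
df = pd.DataFrame({
    "DPI_MAR": [0.76, 0.56, 0.92, 0.64, 0.40],
    "DPI_JUN": [0.39, 0.49, 0.76, 0.37, 0.00],
    "D_SCORE": ["D", "F", "A", "F", "F"],
})
dpi_cols = ["DPI_MAR", "DPI_JUN"]

# What ranges do the values lie in?
value_ranges = df[dpi_cols].agg(["min", "max"])

# Is there missing data? Count NaNs per column.
missing_counts = df.isna().sum()

# Exact zeros appear in the preview; worth checking how common they are
# and whether they are genuine readings.
zero_counts = (df[dpi_cols] == 0).sum()

# How are the inputs related to the target? Mean DPI per grade.
mean_by_grade = df.groupby("D_SCORE")[dpi_cols].mean()
```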

Modelling

In this step you write a python program that generates d_score predictions based on DPI values. Training data for your model is made available in training_set.csv. See the Data Description section above for further details. Useful libraries may include scikit-learn, xgboost, tflearn, and of course numpy/scipy.
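As a point of reference, here is a deliberately crude baseline, not a recommended model. It bins the June DPI into five equal-width bins and predicts the add-one-smoothed grade frequencies observed in each bin. The bin edges, the assumption that DPI lies in [0, 1], and the function names are all choices made for this sketch. A scikit-learn classifier such as LogisticRegression, which exposes predict_proba, could play the same role.

```python
import numpy as np
import pandas as pd

GRADES = ["A", "B", "C", "D", "F"]
N_BINS = 5
INNER_EDGES = np.linspace(0.0, 1.0, N_BINS + 1)[1:-1]  # assumes DPI lies in [0, 1]

def to_bin(values):
    """Map DPI values to bin indices 0..N_BINS-1."""
    return np.digitize(np.asarray(values, dtype=float), INNER_EDGES)

def fit(train):
    """Return an (N_BINS, 5) table of smoothed grade probabilities per June-DPI bin."""
    counts = np.ones((N_BINS, len(GRADES)))  # add-one smoothing
    bins = to_bin(train["DPI_JUN"])
    grade_idx = np.array([GRADES.index(g) for g in train["D_SCORE"]])
    np.add.at(counts, (bins, grade_idx), 1)  # accumulate one count per training row
    return counts / counts.sum(axis=1, keepdims=True)

def predict_proba(table, test):
    """Look up the grade distribution for each test row's bin."""
    return table[to_bin(test["DPI_JUN"])]

# Rows copied from the previews above, standing in for the real files.
train = pd.DataFrame({
    "DPI_JUN": [0.39, 0.49, 0.76, 0.37, 0.00],
    "D_SCORE": ["D", "F", "A", "F", "F"],
})
test = pd.DataFrame({"DPI_JUN": [0.73, 0.70]})

table = fit(train)
probs = predict_proba(table, test)  # one row of five probabilities per test row
```

The smoothing keeps every probability strictly positive, which matters for a log-based score: a zero assigned to the true grade would be penalised without bound.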

Generating and submitting predictions

Once you've devised, implemented, and trained a model, you should generate predictions for the data in test_set.csv. Before the assignment is due, you will be able to make 5 preliminary submissions to see how your predictions score relative to the actual 2017 d_scores. For further details, see the Data Description section above and the sample submission file, or email Gideon Providence at g@math.utoronto.ca. The preliminary submissions are a useful way of gauging your model's accuracy.
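With only five preliminary submissions available, it may help to estimate performance locally before using one. One option, offered here as a suggestion rather than a required procedure, is to hold out the most recent training years and score the model on them. The function name, the cutoff year, and the toy values below are all made up for illustration.

```python
import pandas as pd

def split_by_year(frame, first_holdout_year):
    """Split into (train, holdout): rows before the cutoff year vs. rows from it onward."""
    is_holdout = frame["YEAR"] >= first_holdout_year
    return frame[~is_holdout], frame[is_holdout]

# Toy frame with invented values; YEAR spans the 1995-2015 range of the training file.
toy = pd.DataFrame({
    "YEAR": [1995, 2002, 2013, 2014, 2015],
    "DPI_MAR": [0.50, 0.76, 0.60, 0.70, 0.80],
    "DPI_JUN": [0.40, 0.39, 0.60, 0.70, 0.90],
    "D_SCORE": ["D", "D", "C", "B", "A"],
})

train_part, holdout_part = split_by_year(toy, first_holdout_year=2014)
```

Splitting by year rather than at random mimics the real task, where the model is fit on past years and applied to a later one.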

Assessment

Your final mark will be based on:

  • Your presentation explaining (a) the results of your exploratory data analysis, (b) a breakdown of how your model works, (c) the reasoning behind its construction
  • The accuracy of your predictions (scores will be computed using modified cross-entropy)
  • A review of your python code
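The brief does not say how the cross-entropy is modified, so the sketch below shows only the standard multiclass cross-entropy (log loss), with probabilities clipped away from zero. The clipping constant and the function name are assumptions, and the official scorer may differ.

```python
import numpy as np

GRADES = ["A", "B", "C", "D", "F"]

def cross_entropy(probs, true_grades, eps=1e-15):
    """Mean negative log-probability assigned to the true grade of each row."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0)  # avoid log(0)
    true_idx = np.array([GRADES.index(g) for g in true_grades])
    picked = p[np.arange(len(true_idx)), true_idx]  # probability given to the true grade
    return float(-np.mean(np.log(picked)))

probs = [
    [0.1, 0.5, 0.2, 0.1, 0.1],
    [0.2, 0.2, 0.2, 0.2, 0.2],
]
score = cross_entropy(probs, ["B", "F"])  # lower is better
```

Under this measure a confident wrong prediction costs far more than a hedged one, so well-calibrated probabilities matter as much as picking the most likely grade.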