import pandas as pd
import numpy as np
Each February, thousands of factory owners and manufacturers around the world compete for contracts to produce consumer goods for large European retailers.
In January, the regulatory body DIOGENES assigns a d_score to each manufacturer that registers for bidding. A manufacturer's d_score is intended to be an indicator of general productivity/efficiency, and is assigned by analysts based on:
d_score values are A, B, C, D, or F.
In other words, DIOGENES releases a table of manufacturers' DPI scores each March, June, September, and December. Analysts combine these DPI values with what they observe during inspection to derive the final d_score, which is published in January just before bidding begins. Since d_score has a large effect on who wins the most lucrative contracts, a reliable method for predicting it would be quite valuable.
We have been given a data set containing historical DPI values and resulting d_scores for the years 1995 through 2015. It is now July of 2016, and we want to predict the upcoming 2017 d_scores using a Python program. Given that it's July, we can only use the March and June DPI values for prediction.
You are provided with three files: training_set.csv, test_set.csv, and sample_submission.csv. The file training_set.csv contains historical DPI values and d_scores for 1995 - 2015. Here are the first few lines of this file:
historical = pd.read_csv('../data/final/training_set.csv', index_col=0)
historical.head()
There are 30271 data points. This data can be used to train a model for predicting d_scores. The file test_set.csv contains the 2016 DPI values for March and June. This is the data you will use to predict the 2017 d_scores. The first few lines of test_set.csv look like:
test_set = pd.read_csv('../data/final/test_set.csv', index_col=0)
test_set.head()
There are 3363 lines.
You will have to generate a prediction for each entry in the test data set. Your prediction should be a probability distribution over the possible d_scores. For instance, a prediction of [0.1, 0.5, 0.2, 0.1, 0.1] corresponds to a distribution over ['A', 'B', 'C', 'D', 'F'] that assigns probabilities 'A' -> 0.1, 'B' -> 0.5, 'C' -> 0.2, etc.
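A quick sanity check on this format can be sketched as follows. The helper name `check_prediction` and the tolerance are illustrative assumptions, not part of the assignment; the label order is the one used in the example above.

```python
import math

# Label order taken from the example above.
LABELS = ['A', 'B', 'C', 'D', 'F']


def check_prediction(probs, tol=1e-6):
    """Return a label -> probability dict, or raise ValueError if
    `probs` is not a valid distribution over LABELS.

    Hypothetical helper for illustration only.
    """
    if len(probs) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} values, got {len(probs)}")
    if any(p < 0 or p > 1 for p in probs):
        raise ValueError("each probability must lie in [0, 1]")
    if not math.isclose(sum(probs), 1.0, abs_tol=tol):
        raise ValueError("probabilities must sum to 1")
    return dict(zip(LABELS, probs))


prediction = check_prediction([0.1, 0.5, 0.2, 0.1, 0.1])
print(prediction)
```

Running a check like this over every row before exporting catches rows that are the wrong length or do not sum to 1.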
Your predictions for all data points in the test set should be exported in a single .csv file. Predictions must be stored in the same order as they occur in the test_set.csv file, i.e. the prediction on row N of your submission file should correspond to the DPI values on row N in the test_set.csv file. The sample_submission.csv file contains an example of a properly formatted submission.
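One way to keep the submission rows aligned with the test set is to reuse the test set's index when building the output table. The sketch below is only an assumption about the layout: the column names, the helper `build_submission`, and the toy index are made up for illustration, and the real format should be copied from sample_submission.csv.

```python
import io

import numpy as np
import pandas as pd

LABELS = ['A', 'B', 'C', 'D', 'F']  # assumed column order


def build_submission(test_index, probabilities):
    """Build a table with one row per test entry, in test-set order.

    Hypothetical helper; check sample_submission.csv for the real layout.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    expected_shape = (len(test_index), len(LABELS))
    if probabilities.shape != expected_shape:
        raise ValueError(f"expected shape {expected_shape}, got {probabilities.shape}")
    if not np.allclose(probabilities.sum(axis=1), 1.0):
        raise ValueError("every row must sum to 1")
    return pd.DataFrame(probabilities, index=test_index, columns=LABELS)


# Made-up stand-in for test_set.index and for a model's output.
toy_index = pd.Index([101, 102, 103])
toy_probs = [
    [0.1, 0.5, 0.2, 0.1, 0.1],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.6, 0.1, 0.1, 0.1, 0.1],
]

submission = build_submission(toy_index, toy_probs)

# Written to an in-memory buffer here; pass a file path to to_csv instead
# to produce the actual submission file.
buffer = io.StringIO()
submission.to_csv(buffer)
print(buffer.getvalue())
```

With the real data, `test_set.index` would take the place of `toy_index`, so the row order is inherited from test_set.csv rather than reconstructed by hand.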
There are three components to the project:
Before constructing your model, you should explore the data. Some good questions to ask are: what ranges do the values lie in? Is there missing data? How are the input variables related to the target variable? Often it is useful to do some preprocessing or simplification of the data in this step.
pandas is a useful Python library for handling data, while matplotlib is standard for basic 2D and 3D plotting. These libraries may come in handy, but you are not required to use them.
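The exploration questions above (value ranges, missing data, relation between inputs and target) can be answered with a few pandas calls. The following is a minimal sketch on a made-up toy frame: the column names `dpi_march`, `dpi_june`, and `d_score` and the helper `summarize` are assumptions, so substitute the actual columns of training_set.csv.

```python
import numpy as np
import pandas as pd


def summarize(df, target='d_score'):
    """Collect basic exploration tables for a frame with one target column.

    Hypothetical helper; `target` is an assumed column name.
    """
    numeric = df.drop(columns=[target]).select_dtypes(include='number')
    return {
        # Minimum and maximum of every numeric input column.
        'ranges': numeric.agg(['min', 'max']),
        # Number of missing values per column, including the target.
        'missing': df.isna().sum(),
        # How often each target value occurs.
        'target_counts': df[target].value_counts(dropna=False),
        # Mean of each input within each target class.
        'means_by_target': numeric.groupby(df[target]).mean(),
    }


# Made-up data for illustration only.
toy = pd.DataFrame({
    'dpi_march': [0.2, 0.9, np.nan, 0.5],
    'dpi_june': [0.3, 0.8, 0.4, 0.6],
    'd_score': ['C', 'A', 'C', 'B'],
})

report = summarize(toy)
for name, table in report.items():
    print(name)
    print(table)
```

Calling `summarize(historical)` on the real training data would give a first view of class balance and of how the DPI values differ between d_score classes, provided the target column name matches.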
In this step you write a Python program that generates d_score predictions based on DPI values. Training data for your model is made available in training_set.csv. See section 2 above for further details. Useful libraries may include
tflearn, and of course
Once you've devised, implemented, and trained a model, you should generate predictions for the data in test_set.csv. Before the assignment is due, you will be able to make 5 preliminary submissions to see how your predictions score relative to the actual 2017 d_scores. For further details, see section 2 above and the sample submission file, or email Gideon Providence at email@example.com. The preliminary submissions are a useful way of gauging your model's accuracy.
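Because only 5 preliminary submissions are available, it may help to score candidate models locally on held-out historical data first. The sketch below makes two assumptions that the assignment does not confirm: that multiclass log loss is a reasonable stand-in for the official scoring, and that a smoothed class-frequency prior is an acceptable baseline to compare against. The function names and toy labels are made up for illustration.

```python
import numpy as np

LABELS = ['A', 'B', 'C', 'D', 'F']


def class_prior(train_labels):
    """Baseline: the same distribution for every entry, estimated from
    label frequencies with add-one smoothing."""
    counts = np.array(
        [sum(1 for y in train_labels if y == label) for label in LABELS],
        dtype=float,
    )
    counts += 1.0  # smoothing so unseen labels keep non-zero probability
    return counts / counts.sum()


def log_loss(true_labels, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true labels.

    Assumed metric; the official scoring may differ.
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    idx = [LABELS.index(y) for y in true_labels]
    return float(-np.mean(np.log(probs[np.arange(len(idx)), idx])))


# Made-up labels: earlier years for fitting, a later year held out.
train_labels = ['A', 'B', 'B', 'C', 'B']
held_out_labels = ['B', 'C']

prior = class_prior(train_labels)
baseline_probs = np.tile(prior, (len(held_out_labels), 1))
score = log_loss(held_out_labels, baseline_probs)
print(prior, score)
```

A model that cannot beat this baseline on held-out years is unlikely to be worth one of the preliminary submissions.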
Your final mark will be based on: