In [43]:

```
import pandas as pd
import numpy as np
```

Each February thousands of factory owners and manufacturers around the world compete for contracts to produce consumer goods for large European retailers.

In January, the regulatory body DIOGENES assigns a `d_score` to each manufacturer that registers for bidding. A manufacturer's `d_score` is intended to be an indicator of general productivity/efficiency, and is assigned by analysts based on:

- the results of a physical inspection of the manufacturing plant in early January,
- a quarterly productivity index (the DPI) published by DIOGENES in the year before bidding.

Possible `d_score` values are A, B, C, D, or F.

In other words, DIOGENES releases a table of manufacturers' DPI scores each March, June, September, and December. Analysts combine these DPI values with what they observe during inspection to derive the final `d_score`, which is published in January just before bidding begins. Since `d_score` has a large effect on who wins the most lucrative contracts, a reliable method for predicting it would be quite valuable.

We have been given a data set containing historical DPI values and resulting `d_scores` for the years 1995 through 2015. It is now July of 2016, and we want to predict the upcoming 2017 `d_scores` using a Python program. Given that it's July, we can only use the March and June DPI values for prediction.

You are provided with three files, `training_set.csv`, `test_set.csv`, and `sample_submission.csv`. The file `training_set.csv` contains historical DPI values and `d_scores` for 1995 - 2015. Here are the first few lines of this file:

In [45]:

```
historical = pd.read_csv('../data/final/training_set.csv', index_col=0)
historical.head()
```

Out[45]:

There are 30271 data points. This data can be used to train a model for predicting `d_score`.

The file `test_set.csv` contains the 2016 DPI values for March and June. This is the data you will use to predict the 2017 `d_score`s. The first few lines of `test_set.csv` look like:

In [48]:

```
test_set = pd.read_csv('../data/final/test_set.csv', index_col=0)
test_set.head()
```

Out[48]:

There are 3363 lines.

You will have to generate a prediction for each entry in the test data set. Your prediction should be a probability distribution over the possible `d_score`s, stored as a list of five probabilities. For instance, a prediction of `[0.1, 0.5, 0.2, 0.1, 0.1]` corresponds to a distribution over `['A', 'B', 'C', 'D', 'F']` that assigns probabilities `'A' -> 0.1`, `'B' -> 0.5`, `'C' -> 0.2`, etc.
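As a quick illustration of this representation, the sketch below pairs the example vector with its labels and checks that it is a valid distribution. The names `D_SCORES`, `distribution`, and `most_likely` are introduced here for illustration only and are not part of the assignment files.

```python
import numpy as np

# Label order used in the example above.
D_SCORES = ['A', 'B', 'C', 'D', 'F']

# The example prediction from the text.
prediction = np.array([0.1, 0.5, 0.2, 0.1, 0.1])

# A valid distribution is non-negative and sums to 1.
assert np.all(prediction >= 0)
assert np.isclose(prediction.sum(), 1.0)

# Pair each label with its probability.
distribution = dict(zip(D_SCORES, prediction.tolist()))

# The single most likely d_score under this prediction.
most_likely = D_SCORES[int(np.argmax(prediction))]

print(distribution)
print(most_likely)
```

Checking that every row sums to 1 before submitting is cheap and catches most formatting mistakes.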

Your predictions for all data points in the test set should be exported in a single `.csv` file. Predictions must be stored in the same order as they occur in the `test_set.csv` file; that is, the prediction on row `N` of your submission file should correspond to the DPI values on row `N` in `test_set.csv`. The `sample_submission.csv` file contains an example of a properly formatted submission.
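One possible way to assemble such a file with `pandas` is sketched below. The column layout (one column per `d_score`, indexed like the test set) is an assumption made for illustration; `sample_submission.csv` is the authoritative reference for the required format. The helper `build_submission` and the toy inputs are hypothetical.

```python
import io

import numpy as np
import pandas as pd

D_SCORES = ['A', 'B', 'C', 'D', 'F']


def build_submission(test_index, probabilities):
    """Return a DataFrame with one row of class probabilities per test row."""
    probabilities = np.asarray(probabilities, dtype=float)
    if probabilities.shape != (len(test_index), len(D_SCORES)):
        raise ValueError('expected one probability per d_score for every test row')
    if not np.allclose(probabilities.sum(axis=1), 1.0):
        raise ValueError('each row must sum to 1')
    # Reusing the test index keeps rows in the same order as test_set.csv.
    return pd.DataFrame(probabilities, index=test_index, columns=D_SCORES)


# Stand-in for test_set.index; the real index comes from test_set.csv.
toy_index = pd.Index([0, 1, 2])
toy_probs = [[0.1, 0.5, 0.2, 0.1, 0.1],
             [0.2, 0.2, 0.2, 0.2, 0.2],
             [0.7, 0.1, 0.1, 0.05, 0.05]]

submission = build_submission(toy_index, toy_probs)

# Written to an in-memory buffer here; pass a file path to to_csv instead
# to produce an actual submission file.
buffer = io.StringIO()
submission.to_csv(buffer)
print(buffer.getvalue())
```

Building the frame from the test set's own index, rather than re-sorting, is what preserves the row-`N` correspondence.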

There are three components to the project:

- exploratory data analysis (EDA),
- implementing a mathematical/statistical predictive model for `d_scores` in Python,
- generating and submitting predicted 2017 `d_score` values.

Before constructing your model, you should explore the data. Some good questions to ask are: what ranges do the values lie in? Is there missing data? How are the input variables related to the target variable? Often it is useful to do some preprocessing or simplification of the data in this step. `pandas` is a useful Python library for handling data, while `matplotlib` is standard for basic 2D and 3D plotting. These libraries may come in handy, but you are not required to use them.
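The questions above translate fairly directly into a few `pandas` calls. The sketch below runs them on a small made-up frame; the column names `dpi_march` and `dpi_june` are hypothetical stand-ins, since the real column names come from `training_set.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for training_set.csv; column names and values are made up.
toy = pd.DataFrame({
    'dpi_march': [0.9, 0.4, np.nan, 0.7, 0.2, 0.8],
    'dpi_june':  [0.8, 0.5, 0.6, np.nan, 0.1, 0.9],
    'd_score':   ['A', 'C', 'B', 'B', 'F', 'A'],
})

# What ranges do the values lie in?
value_ranges = toy[['dpi_march', 'dpi_june']].agg(['min', 'max'])

# Is there missing data?
missing_counts = toy.isna().sum()

# How balanced are the target classes?
class_counts = toy['d_score'].value_counts().sort_index()

# How are the inputs related to the target? Mean DPI per d_score is a first look.
mean_dpi_by_score = toy.groupby('d_score')[['dpi_march', 'dpi_june']].mean()

print(value_ranges, missing_counts, class_counts, mean_dpi_by_score, sep='\n\n')
```

The same calls can be applied to the `historical` frame loaded earlier once the column names are substituted.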

In this step you write a Python program that generates `d_score` predictions based on DPI values. Training data for your model is made available in `training_set.csv`. See section 2 above for further details. Useful libraries may include `scikit-learn`, `xgboost`, `tflearn`, and of course `numpy`/`scipy`.
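To make the expected input and output shapes concrete, here is a deliberately simple baseline written with `numpy` only. It bins a single DPI value and estimates the `d_score` distribution in each bin from smoothed counts. This is an illustrative sketch, not a recommended model: the bin edges, the smoothing constant, the toy data, and the function names are all assumptions, and missing values are not handled. A model from one of the libraries above would take its place in practice.

```python
import numpy as np

D_SCORES = ['A', 'B', 'C', 'D', 'F']


def fit_frequency_model(dpi, scores, bin_edges, smoothing=1.0):
    """Estimate P(d_score | DPI bin) from counts with additive smoothing."""
    bins = np.digitize(dpi, bin_edges)  # bin id in 0..len(bin_edges)
    n_bins = len(bin_edges) + 1
    # Start every cell at `smoothing` so no class gets probability zero.
    counts = np.full((n_bins, len(D_SCORES)), smoothing, dtype=float)
    for b, s in zip(bins, scores):
        counts[b, D_SCORES.index(s)] += 1
    # Normalise each bin's counts into a probability distribution.
    return counts / counts.sum(axis=1, keepdims=True)


def predict_proba(model, dpi, bin_edges):
    """Look up the fitted distribution for each DPI value's bin."""
    return model[np.digitize(dpi, bin_edges)]


# Made-up training data: one DPI value and one d_score per manufacturer.
train_dpi = np.array([0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 0.95])
train_scores = ['F', 'D', 'C', 'C', 'B', 'A', 'A', 'B']
edges = [0.33, 0.66]

model = fit_frequency_model(train_dpi, train_scores, edges)
probs = predict_proba(model, np.array([0.15, 0.85]), edges)
print(probs)
```

Each row of `probs` is a distribution over `['A', 'B', 'C', 'D', 'F']`, which is the shape the submission expects.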

Once you've devised, implemented, and trained a model, you should generate predictions for the data in `test_set.csv`. Before the assignment is due, you will be able to make 5 preliminary submissions to see how your predictions score relative to the actual 2017 `d_score`s. For further details, see section 2 above and the sample submission file, or email Gideon Providence at g@math.utoronto.ca. The preliminary submissions are a useful way of gauging your model accuracy.

Your final mark will be based on:

- Your presentation explaining (a) the results of your exploratory data analysis, (b) a breakdown of how your model works, (c) the reasoning behind its construction
- The accuracy of your predictions (scores will be computed using modified cross-entropy)
- A review of your python code
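
The exact modification used for scoring is not described here, so the sketch below shows only the standard multi-class cross-entropy (log loss), which may be handy for evaluating a model on held-out training data. The clipping constant `eps` and the function name are assumptions; the official score may be computed differently.

```python
import numpy as np

D_SCORES = ['A', 'B', 'C', 'D', 'F']


def cross_entropy(true_scores, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to the true d_score."""
    # Clip so that a predicted probability of 0 does not produce log(0).
    probabilities = np.clip(np.asarray(probabilities, dtype=float), eps, 1.0)
    true_idx = np.array([D_SCORES.index(s) for s in true_scores])
    # Pick, for every row, the probability given to the true class.
    picked = probabilities[np.arange(len(true_idx)), true_idx]
    return float(-np.mean(np.log(picked)))


truth = ['B', 'A']
preds = [[0.1, 0.5, 0.2, 0.1, 0.1],
         [0.2, 0.2, 0.2, 0.2, 0.2]]

loss = cross_entropy(truth, preds)
print(loss)
```

Lower values are better: a confident wrong prediction is penalised much more heavily than a hedged one.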