Homework 7

Homework Submission Workflow

When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.

Important: Failure to do any of the following will result in lost points:

  • Submit one PDF file and one notebook per group

  • Enter all the group members in your Gradescope PDF submission, not just in your documents. See this Piazza post if you don't know how to do this

  • Match each answer (not question!) with the appropriate page in Gradescope

  • Avoid large blank spaces in your PDF file

Part 1: Random Guesses

It is often said that a binary classifier that ignores its input and makes a random guess at the output has a 50 percent accuracy. The term ''accuracy'' denotes the probability that the classifier's output is correct, that is, 1 minus the error rate. The statement about random guesses above, however, rests on two assumptions: (i) the labels are equi-probable and (ii) the classifier assumes that they are equi-probable. Let us say that if both assumptions hold, we are in the symmetric case.

Then, if the label set is $Y = \{1, 2\}$, the argument goes that the true label $y$ is either $k=1$ or $k=2$ with probability $\mathbb{P}[y = k] = 1/2$, and that the classifier chooses to output a label $\widehat{y}$ equal to either $\widehat{k} = 1$ or $\widehat{k} = 2$ with probability $\mathbb{P}[\widehat{y} = \widehat{k}\ |\ y = k]$ for all combinations of $k$ and $\widehat{k}$. Since the classifier ignores the input $\mathbf{x}$, the two events $y = k$ and $\widehat{y} = \widehat{k}$ are independent, so that

$$ \mathbb{P}[\widehat{y} = \widehat{k}\ |\ y = k] \;=\; \mathbb{P}[\widehat{y} = \widehat{k}] \;=\; 1/2 $$

and the probability that the classifier is correct is then

$$ \begin{eqnarray*} \mathbb{P}[y = \widehat{y}] &\;=\;& \mathbb{P}[\widehat{y} = 1\ \text{and}\ y = 1] \;+\; \mathbb{P}[\widehat{y} = 2\ \text{and}\ y = 2] \\ &\;=\;& \mathbb{P}[\widehat{y} = 1\ |\ y = 1]\ \mathbb{P}[y = 1] \;+\; \mathbb{P}[\widehat{y} = 2\ |\ y = 2]\ \mathbb{P}[y = 2] \\ &\;=\;& \mathbb{P}[\widehat{y} = 1]\ \mathbb{P}[y = 1] \;+\; \mathbb{P}[\widehat{y} = 2]\ \mathbb{P}[y = 2] \\ &\;=\;& \frac{1}{2}\, \frac{1}{2} \;+\; \frac{1}{2}\, \frac{1}{2} \\ &\;=\;& \frac{1}{2} \;. \end{eqnarray*} $$
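As a sanity check, this symmetric binary case is easy to simulate. Below is a minimal sketch in numpy; the sample size of one million and the use of default_rng are our own choices:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # number of simulated classification attempts

# True labels and random guesses are independent and uniform over {1, 2}
y = rng.integers(1, 3, size=n)
y_hat = rng.integers(1, 3, size=n)

# The empirical accuracy should be close to 1/2
print('empirical accuracy:', np.mean(y == y_hat))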

Problem 1.1

What is the accuracy of a random-guess $K$-class classifier in the symmetric case? No need to prove your answer if you are sure of it. Also give a numerical value, as an approximate percentage with two decimal digits after the period, for $K=7$.

Problem 1.2

Consider now the general, zero-knowledge case in which assumption (ii) still holds, but assumption (i) does not: The classifier assumes that the labels are equi-probable, while in truth they have probabilities

$$ \mathbb{P}[y = k] = p_k $$

where the $p_k$ are nonnegative real numbers that add up to $1$.

This situation is ''general'' in that it does not restrict the label probabilities in any way. It is ''zero-knowledge'' in that the (possibly incorrect) assumption that the labels are equi-probable is the most reasonable assumption for the classifier to make in the absence of knowledge.

When you are asked for numerical values in the following questions of this part, use $K = 7$ and

$$ \begin{eqnarray*} (p_1,\ldots, p_7) &=& \frac{1}{10} (1, 2, 4, 1, 0, 1, 1) \\ (q_1,\ldots, q_7) &=& \frac{1}{10} (2, 1, 0, 0, 3, 0, 4) \end{eqnarray*} $$

(the $q_k$ values will come up later). Report accuracies as approximate percentages (not fractions) with two decimal digits after the period. It is not necessary to prove your answers if you are sure about them.
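If you want to check your numerical answers in numpy, the given distributions can be entered as follows (a convenience snippet only; the names p and q are our choice, and no particular code is required for these answers):

import numpy as np

p = np.array([1, 2, 4, 1, 0, 1, 1]) / 10
q = np.array([2, 1, 0, 0, 3, 0, 4]) / 10

# Both vectors are valid probability distributions
assert np.isclose(p.sum(), 1) and np.isclose(q.sum(), 1)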

Write a formula for the accuracy of a random-guess $K$-class classifier in the general, zero-knowledge case. Also report a numerical value.

Problem 1.3

Consider now the fully general case: The probabilities of the true labels are

$$ \mathbb{P}[y = k] = p_k $$

while the classifier outputs label $\widehat{k}$ with probability

$$ \mathbb{P}[\widehat{y} = \widehat{k}] = q_{\widehat{k}} \;. $$

Write a formula for the accuracy of a random-guess $K$-class classifier in the fully general case. Also report a numerical value.

Problem 1.4

Finally, let us consider the general, perfect-knowledge case in which the probabilities of the true labels are

$$ \mathbb{P}[y = k] = p_k $$

and the classifier outputs label $\widehat{k}$ with probability

$$ \mathbb{P}[\widehat{y} = \widehat{k}] = p_{\widehat{k}} \;. $$

Write a formula for the accuracy of a random-guess $K$-class classifier in the general, perfect-knowledge case. Also report a numerical value.

Problem 1.5

Is the classifier better off having perfect knowledge rather than assuming equi-probable labels in the numerical examples given above? The formula for accuracy in the perfect-knowledge case is related to one of the measures of impurity we discussed in the context of decision trees. Which measure, and how?

Part 2: Multiclass Classifiers

The goal of the problems in this part is for you to gain experience using a machine learning package. The only way to do so effectively is to first understand what you are doing, that is, to study the theory. Make sure that you understand both multi-class linear predictors and decision trees before you attempt these problems.

You may use any function in sklearn or numpy for this part. Make sure you learn how to navigate the user manuals for these packages.

The code below loads the Cover Type data set, which is described here, and splits the data into a training set T and a test set S. No separate validation set is given. Spend some time understanding the structure and nature of the data set.

Some of the features (entries of data vectors $\mathbf{x}$) are categorical. However, they are all binary, and it is therefore OK to think of them as real-valued.

The first run of the code may take quite a while, because the fetch_covtype command downloads a database from a remote site. Subsequent runs are much faster, because the data is cached on your disk.

To make our grading easier, do not change anything in this code.

In [1]:
import numpy as np
from sklearn.datasets import fetch_covtype

# So we all look at the data the same way
seed = 3
np.random.seed(seed)

data = fetch_covtype(random_state=seed, shuffle=True)

# Random training/test data partition
from sklearn.model_selection import train_test_split
testFraction = 0.2
T, S = {}, {}
T['x'], S['x'], T['y'], S['y'] = train_test_split(data.data, data.target,
 test_size=testFraction, random_state=seed)

Problem 2.1

Use the function LogisticRegression from sklearn.linear_model to train a logistic-regression classifier on T. Use the following keyword parameters:

C=1e5, solver='lbfgs', multi_class='multinomial', random_state=seed

If you want to know what these mean (a good idea), read the manual.

Show your code and report the training accuracy and the test accuracy as a percentage (not a fraction) with two decimal digits after the period.

Programming Notes

You will be asked to report training and test accuracy again below. Because of this, the best use of your time (and the best way to minimize the possibility of mistakes) is to write a function that takes a classifier and two data sets, and possibly some informative text strings, and prints out a meaningful sentence that describes the result of evaluating the classifier on the sets. You will reuse that function later. You are free to write that function any (correct) way you want. Include that function in the code you show.
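For instance, such a function might look like the following minimal sketch; the name report_accuracy and the message format are our own choices, and the sets argument is any sequence of dictionaries with keys 'x' and 'y', like T and S above:

def report_accuracy(classifier, sets, names=('training', 'test')):
    # classifier.score returns the mean accuracy as a fraction;
    # we convert it to a percentage with two decimal digits
    for data, name in zip(sets, names):
        accuracy = 100 * classifier.score(data['x'], data['y'])
        print('{} accuracy: {:.2f} percent'.format(name, accuracy))

A call would then look like report_accuracy(h, (T, S)) for a trained classifier h.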

Training takes some time.

Problem 2.2

Comment on the results you obtained in the previous problem:

  • Is performance good?
  • What may be a plausible reason for this level of performance?
  • Is there overfitting, and how can you tell?
  • Is performance better than random guessing, with or without knowledge of the label distribution?

Assume that the label distribution estimated from T constitutes ''perfect knowledge.'' Include accuracies as percentages (with two decimal digits after the period) in your arguments. Be sure to answer all of the questions, in bullet form, in the order they are asked.

Problem 2.3

Now try a DecisionTreeClassifier from sklearn.tree on the same problem. Show your code and report training and test accuracy. Use default parameters.

Problem 2.4

Discuss the decision-tree results in the same way you did for the linear classifier. Be sure to answer all of the questions, in bullet form, in the order they were asked.

Part 3: Cross-Validation

The outcome of the experiment in this part may be disappointing, depending on your expectations. The goal is to understand how to do cross-validation and where it can help, and to realize how costly it can be.

Problem 3.1

Use GridSearchCV from sklearn.model_selection to perform $10$-fold cross-validation in order to determine good hyper-parameters for the decision tree you developed in the previous part, and with the same data. Find the best out of all possible combinations of the following parameters:

'criterion': ['gini', 'entropy']
'max_depth': range(30, 50, 5)
'min_samples_leaf': [1, 10, 100, 1000]

Report the values of the best parameters found and the training and test accuracy as before.

Programming Note

Be patient; this will take quite a while.
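
For reference, the search can be wired up roughly as follows. This is a sketch, not required code: it assumes the data cell above has run, and passing random_state=seed to the tree and n_jobs=-1 (which parallelizes the search over CPU cores) are optional choices of ours:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

parameters = {'criterion': ['gini', 'entropy'],
              'max_depth': range(30, 50, 5),
              'min_samples_leaf': [1, 10, 100, 1000]}

search = GridSearchCV(DecisionTreeClassifier(random_state=seed),
                      parameters, cv=10, n_jobs=-1)
search.fit(T['x'], T['y'])
print('best parameters:', search.best_params_)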

Problem 3.2

Discuss your results:

  • Is there significant improvement, or any improvement at all?
  • Give a plausible explanation for the result.