How does the classification using the 0-1 loss matrix method work?



























In this machine learning lecture the professor says:



Suppose $\mathbf{X}\in\Bbb R^p$ and $g\in G$ where $G$ is a discrete
space. We have a joint probability distribution $\Pr(\mathbf{X},g)$.



Our training data has some points like:



$(\mathbf{x_1},g_1)$, $(\mathbf{x_2},g_2)$, $(\mathbf{x_3},g_3)$, ..., $(\mathbf{x_n},g_n)$



We now define a function $f(\mathbf{X}):\Bbb R^p \to G$.



The loss $L$ is defined as a $K\times K$ matrix where $K$ is the
cardinality of $G$. It has only zeroes along the main diagonal.



$L(k,l)$ is basically the cost of classifying $k$ as $l$.



An example of a $0$-$1$ loss function:



$$\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$



$\text{EPE}(\hat{f}) = \operatorname{E}[L(G,\hat{f}(\mathbf{X}))]$ (where EPE = Expected Prediction Error)



$=\operatorname{E}_{\mathbf{X}} \operatorname{E}_{G\mid\mathbf{X}} \{L[G,\hat{f}(\mathbf{X})]\mid\mathbf{X}\}$



$\hat{f}(\mathbf{x})=\operatorname{argmin}_g\sum_{k=1}^{K}L(k,g)\Pr(k\mid\mathbf{X}=\mathbf{x})=\operatorname{argmax}_g\Pr(g\mid\mathbf{X}=\mathbf{x})$



$\hat{f}(\mathbf{x})$ is the Bayes optimal classifier.
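For concreteness, the decision rule above can be sketched numerically. This is a toy illustration (not from the lecture) with invented posterior probabilities: given a loss matrix and estimates of $\Pr(k\mid\mathbf{x})$, the rule picks the class with the smallest expected loss.

```python
import numpy as np

# 0-1 loss matrix for K = 3 classes: L[k, l] = cost of predicting l when the truth is k
L = np.ones((3, 3)) - np.eye(3)

# Hypothetical posterior probabilities Pr(k | X = x) for a single input x
posterior = np.array([0.2, 0.5, 0.3])

# Expected loss of predicting class g: sum over k of L[k, g] * Pr(k | x)
expected_loss = posterior @ L          # one entry per candidate class g
f_hat = int(np.argmin(expected_loss))  # the rule's prediction

# Under 0-1 loss, expected_loss[g] = 1 - Pr(g | x),
# so minimizing it is the same as maximizing the posterior
assert f_hat == int(np.argmax(posterior))
print(f_hat)  # prints 1
```

Here `expected_loss` comes out as `[0.8, 0.5, 0.7]`, which makes the argmin/argmax equivalence visible.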




I couldn't really follow what the professor was trying to say in some of the steps.



My questions are:




  • Suppose our loss matrix is indeed $\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$. What is the use of this matrix? What does classifying $k$ as $l$ even mean? And how do we read off the loss for (say) a certain input $\mathbf{x_i}$ from the matrix?


  • I couldn't understand what $\hat{f}$ and $\text{EPE}(\hat{f})$ stand for. Could someone please explain them with a simple example?




























      probability matrices statistics machine-learning






      asked Jan 27 '18 at 6:05







      user400242





























2 Answers





































Consider a random variable $X$ and a random variable $g$. $X$ is uniformly distributed over $[0,1]$ if $g=1$ and uniformly distributed over $[0.5,1.5]$ if $g=2$. $g$ takes the value $1$ with probability $0.1$. This specifies the joint distribution of $(X,g)$.



Assume that we have to guess $g$ after having observed $X$. If we guess wrong, we pay $1$ rupee; if we guess right, the penalty is $0$.



It is clear that if $X$ is below $0.5$ then our guess for $g$ is $1$, and if $X$ is above $1$ then our guess for $g$ is $2$. But what do we do if $X$ falls between $0.5$ and $1$?



Now, define $\hat f$ the following way: let it be $1$ if $X<0.7$ and $2$ otherwise.



Could we do any better?





          Can you identify the current loss matrix?



          Can you compute the expected loss belonging to the method given above?



          Can you create a similar problem so that the loss matrix is the one given in the OP?
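As a quick numeric check of these hints (my computation, not part of the original answer), the expected $0$-$1$ loss of the threshold-at-$0.7$ rule can be estimated by sampling the joint distribution described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sample the joint distribution: g = 1 with probability 0.1, else g = 2
g = np.where(rng.random(n) < 0.1, 1, 2)
# X | g=1 ~ Uniform[0, 1],  X | g=2 ~ Uniform[0.5, 1.5]
x = np.where(g == 1, rng.random(n), 0.5 + rng.random(n))

# The rule from the answer: guess 1 if X < 0.7, else guess 2
guess = np.where(x < 0.7, 1, 2)

# Monte Carlo estimate of the expected 0-1 loss
# Exact value: 0.1*P(X >= 0.7 | g=1) + 0.9*P(X < 0.7 | g=2) = 0.03 + 0.18 = 0.21
print((guess != g).mean())  # ~0.21
```

On the overlap $[0.5,1]$ the weighted density of $g=2$ ($0.9 \cdot 1$) dominates that of $g=1$ ($0.1 \cdot 1$), so it appears the best rule guesses $2$ whenever $X \ge 0.5$, with expected loss $0.1 \cdot 0.5 = 0.05$.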



















  • Not convinced by your answer - could you help out some more? I am also struggling with this question. – Xavier Bourret Sicotte, Jun 21 '18 at 14:07
































I think the loss matrix (for example, the zero-one loss matrix on page 20 of the ESL book) can be treated as a square look-up table. The number of rows and the number of columns are both the number of possible classes; call it $K \times K$.



In your example, $K = 3$, so there are three possible levels; let's say they are lvl-1, lvl-2, and lvl-3. Suppose you have an observation whose true level is lvl-2, but your estimator says lvl-1. Then we incur a penalty, since this is a misclassification, and (lvl-2, lvl-1) corresponds to the entry in the 2nd row and 1st column, which is $1$ in your table. You asked what classifying $k$ as $l$ means: here $k$ is the truth (lvl-2 in my example) and $l$ is the erroneous prediction (lvl-1).



In the ESL book, the EPE (Expected Prediction Error) is $$EPE = E[L(G,\hat{G}(X))],$$ where $G$ and $\hat{G}(X)$ within $L(G, \hat{G}(X))$ can be treated as the row and column indices. Therefore $L(G, \hat{G}(X))$ can be considered a random variable; the sources of its randomness are $G$ and $X$.



Let's assume we observe an $X$ and its corresponding class $G$; then an estimate of $G$, namely $\hat{G}(X)$, can be computed by some technique. We then go to the look-up table, find the row index from the known $G$ and the column index from the estimate $\hat{G}(X)$. If $G = \hat{G}(X)$ the penalty is $0$, and $1$ otherwise. This value is deterministic, because we have already observed both $X$ and $G$; no randomness remains.



If we observe only $X$, not $G$, we can still compute $\hat{G}(X)$ (the column index), since the technique can be applied to any observed $X$. But we don't know the penalty, because that requires both the row and the column index. Therefore we compute the expectation of this random penalty, conditioning on the observed $X$: it is the weighted average of all possible penalties, each weighted by the conditional probability of its occurrence. The weight for the $k^{th}$ class is $\Pr(g_{k}\mid X)$, where $k \in \{1, 2, \ldots, K\}$. Now we have:



$$\tag{*} E[L(G, \hat{G}(X))\mid X]=\sum_{k=1}^K L(g_{k}, \hat{G}(X))\Pr(g_{k}\mid X)$$



In "$\mid X$" above, $X$ is treated as known; but $X$ is itself random, unknown until we observe it, and we want the EPE, not the conditional expectation (*). Therefore we take the expectation of that conditional expectation:
$$EPE = E_{X}\left(\sum_{k=1}^K L(g_{k}, \hat{G}(X))\Pr(g_{k}\mid X)\right)$$



So we should pick the estimator $\hat{G}(X)$ that minimizes the above EPE.
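The look-up-table reading above can be sketched in code. This is a toy example with invented labels: each observed pair $(G_i, \hat{G}(X_i))$ indexes one penalty in the table, and averaging the penalties gives the empirical analogue of the EPE.

```python
import numpy as np

# 0-1 loss look-up table for K = 3 classes (rows: truth G, columns: prediction)
L = np.ones((3, 3), dtype=int) - np.eye(3, dtype=int)

# Hypothetical observed classes and predictions (0-based class indices)
G     = np.array([1, 0, 2, 1, 1, 0])
G_hat = np.array([1, 0, 1, 1, 2, 0])

# Each pair (G_i, G_hat_i) looks up one penalty in the table
penalties = L[G, G_hat]   # -> [0, 0, 1, 0, 1, 0]

# Empirical average loss: 2 misclassifications out of 6
print(penalties.mean())   # prints 0.3333...
```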



This is my understanding; I am sure there are more mathematically rigorous answers.

























































































            edited Jan 27 '18 at 13:34

























            answered Jan 27 '18 at 12:52
            zolizoli













                edited Dec 27 '18 at 17:33

























                answered Dec 27 '18 at 17:18









cs_snake






































































