How does the classification using the 0-1 loss matrix method work?
In this machine learning lecture the professor says:
Suppose $\mathbf{X}\in\Bbb R^p$ and $g\in G$, where $G$ is a discrete space. We have a joint probability distribution $\Pr(\mathbf{X},g)$.
Our training data consists of points $(\mathbf{x_1},g_1), (\mathbf{x_2},g_2), (\mathbf{x_3},g_3), \dots, (\mathbf{x_n},g_n)$.
We now define a function $f:\Bbb R^p \to G$.
The loss $L$ is defined as a $K\times K$ matrix, where $K$ is the cardinality of $G$. It has zeroes along the main diagonal, and $L(k,l)$ is the cost of classifying $k$ as $l$.
An example of the $0$-$1$ loss function:
$$\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$
$\text{EPE}(\hat{f}) = \mathrm{E}\,[L(G,\hat{f}(\mathbf{X}))]$ (where EPE stands for Expected Prediction Error)
$= \mathrm{E}_{\mathbf{X}}\, \mathrm{E}_{G\mid\mathbf{X}}\{L[G,\hat{f}(\mathbf{X})]\mid\mathbf{X}\}$
$\hat{f}(\mathbf{x})=\operatorname{argmin}_g \sum_{k=1}^{K} L(k,g)\Pr(k\mid\mathbf{X}=\mathbf{x})=\operatorname{argmax}_g \Pr(g\mid\mathbf{X}=\mathbf{x})$
$\hat{f}(\mathbf{x})$ is the Bayes optimal classifier.
I couldn't really follow what the professor was trying to say in some of the steps.
My questions are:
Suppose our loss matrix is indeed $\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$. What is the use of this matrix? What does classifying $k$ as $l$ even mean? And how do we read off the loss for (say) a certain input $\mathbf{x_i}$ from the matrix?
I couldn't understand what $\hat{f}$ and $\text{EPE}(\hat{f}(\mathbf{x}))$ stand for. Could someone please explain them with a simple example?
probability matrices statistics machine-learning
asked Jan 27 '18 at 6:05
user400242
2 Answers
Consider a random variable $X$ and a random variable $g$. $X$ is uniformly distributed over $[0,1]$ if $g=1$ and uniformly distributed over $[0.5,1.5]$ if $g=2$. $g$ takes the value $1$ with probability $0.1$. This specifies the joint distribution of $(X,g)$.
Assume that we have to guess $g$ after having observed $X$. If we miss, we pay $1$ rupee; if we don't, the penalty is $0$.
It is clear that if $X$ is below $0.5$ then our guess for $g$ is $1$, and if $X$ is above $1$ then our guess is $2$. But what do we do if $X$ falls between $0.5$ and $1$?
Now, define $\hat f$ the following way: let it be $1$ if $X<0.7$ and $2$ otherwise.
Could we do any better?
Can you identify the loss matrix here?
Can you compute the expected loss of the rule given above?
Can you construct a similar problem whose loss matrix is the one given in the OP?
answered Jan 27 '18 at 12:52 (edited Jan 27 '18 at 13:34) by zoli
Not convinced by your answer - could you help out some more? I am also struggling with this question.
– Xavier Bourret Sicotte
Jun 21 '18 at 14:07
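To make the example concrete, here is a minimal Monte Carlo sketch (my own illustration in Python/NumPy, not part of the original answer) that estimates the expected $0$-$1$ loss of the threshold rule above and compares it with the threshold $0.5$, which is what the argmax-posterior rule uses here, since $\Pr(g=1\mid x)=0.1$ throughout the overlap $[0.5,1]$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# zoli's joint distribution: g = 1 w.p. 0.1 with X ~ U[0, 1]; g = 2 w.p. 0.9 with X ~ U[0.5, 1.5].
g = np.where(rng.random(n) < 0.1, 1, 2)
x = np.where(g == 1, rng.uniform(0.0, 1.0, n), rng.uniform(0.5, 1.5, n))

def expected_loss(threshold):
    """Expected 0-1 loss of the rule: guess 1 if X < threshold, else guess 2."""
    guess = np.where(x < threshold, 1, 2)
    return np.mean(guess != g)

print(expected_loss(0.7))  # about 0.21 = 0.1*0.3 + 0.9*0.2
print(expected_loss(0.5))  # about 0.05 = 0.1*0.5 (always guess 2 on the overlap)
```

No threshold does better than $0.5$ in this setup, because the posterior probability of class $1$ never exceeds $0.5$ on the overlap.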
I think the loss matrix (for example, the zero-one loss matrix on page 20 of the ESL book) can be treated as a square look-up table. The number of rows and the number of columns are both the number of possible classes; say it is $K \times K$.
In your example, $K = 3$, so there are three possible levels, say lvl-1, lvl-2 and lvl-3. If you have an observation whose true level is lvl-2 but your estimator says lvl-1, then we incur a penalty, since this is a misclassification: (lvl-2, lvl-1) corresponds to the entry in the $2^{\text{nd}}$ row and the $1^{\text{st}}$ column, which is $1$ in your table. You asked what classifying $k$ as $l$ means: here $k$ is the truth, lvl-2, and $l$ is the (erroneous) prediction, lvl-1.
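A sketch of that look-up (my own, in Python/NumPy; the 1-based labels are just the "levels" above):

```python
import numpy as np

# The 0-1 loss matrix from the question: row = true class, column = predicted class.
L = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

truth, pred = 2, 1                 # the true class is lvl-2, the classifier says lvl-1
print(L[truth - 1, pred - 1])      # 1: a misclassification costs one unit
print(L[2 - 1, 2 - 1])             # 0: predicting the true class costs nothing
```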
In the ESL book, the EPE (Expected Prediction Error) is $$EPE = E[L(G,\hat{G}(X))],$$
where $G$ and $\hat{G}(X)$ inside $L(G, \hat{G}(X))$ can be treated as the row and column indices. Therefore $L(G, \hat{G}(X))$ is itself a random variable; the sources of its randomness are $G$ and $X$.
Suppose we observe an $X$ and the corresponding class $G$. An estimate of $G$, namely $\hat{G}(X)$, can be computed by some technique. We then go to the look-up table, take the row index from the known $G$ and the column index from the estimate $\hat{G}(X)$. If $G = \hat{G}(X)$ the penalty is $0$, and $1$ otherwise. This value is fully determined, because we have already observed both $X$ and $G$; there is no randomness left.
If we observe only $X$ and not $G$, we can still compute $\hat{G}(X)$, the column index, since the technique needs only $X$. But we do not know the penalty, because that requires both indices. So we take the expectation of this random penalty, conditioning on the observed $X$: it is the weighted average of all the penalties that could occur, each weighted by the conditional probability of the corresponding class, i.e. the weight for the $k^{\text{th}}$ class is $\Pr(g_{k}\mid X)$, for $k \in \{1, 2, \dots, K\}$. Now we have
$$\tag{*} E[L(G, \hat{G}(X))\mid X]=\sum_{k=1}^K L(g_{k}, \hat{G}(X))\Pr(g_{k}\mid X).$$
The "$\mid X$" above means that $X$ is treated as known. But $X$ is itself random, i.e. unknown until we observe it, and we want the EPE, not a conditional expectation like (*). So we take the expectation of this conditional expectation:
$$EPE = E_{X}\left[\sum_{k=1}^K L(g_{k}, \hat{G}(X))\Pr(g_{k}\mid X)\right].$$
So we should pick the estimator $\hat{G}(X)$ that minimizes the above EPE; since the outer expectation averages nonnegative terms, it is enough to minimize the conditional expectation (*) separately at each observed $x$.
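Here is a minimal sketch of that pointwise minimization (my own illustration in Python/NumPy, with made-up conditional probabilities $\Pr(g_k\mid X=x)$); with the $0$-$1$ loss it reduces to picking the most probable class, as in the question's last formula:

```python
import numpy as np

# 0-1 loss matrix: rows = true class, columns = predicted class.
L = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])

# Hypothetical conditional class probabilities Pr(g_k | X = x) for one observed x.
p = np.array([0.2, 0.5, 0.3])

# Expected conditional loss of predicting each class g: sum_k L[k, g] * Pr(g_k | x).
cond_loss = p @ L                  # -> [0.8, 0.5, 0.7]

print(np.argmin(cond_loss) + 1)    # class 2 minimizes the conditional expected loss
print(np.argmax(p) + 1)            # same class: with 0-1 loss, argmin of (*) = argmax posterior
```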
This is my understanding; I think there must be a more mathematically solid answer.
answered Dec 27 '18 at 17:18 (edited Dec 27 '18 at 17:33) by cs_snake