How does the classification using the 0-1 loss matrix method work?



























In this machine learning lecture the professor says:



Suppose $\mathbf{X}\in\Bbb R^p$ and $g\in G$ where $G$ is a discrete
space. We have a joint probability distribution $\Pr(\mathbf{X},g)$.



Our training data has some points like:



$(\mathbf{x_1},g_1)$, $(\mathbf{x_2},g_2)$, $(\mathbf{x_3},g_3)$, ..., $(\mathbf{x_n},g_n)$



We now define a function $f(\mathbf{X}):\Bbb R^p \to G$.



The loss $L$ is defined as a $K\times K$ matrix where $K$ is the
cardinality of $G$. It has only zeroes along the main diagonal.



$L(k,l)$ is basically the cost of classifying $k$ as $l$.



An example of a $0$-$1$ loss function:



$$\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$



$\text{EPE}(\hat{f}) = \operatorname{E}[L(G,\hat{f}(\mathbf{X}))]$ (where EPE = Expected Prediction Error)



$=\operatorname{E}_{\mathbf{X}} \operatorname{E}_{G\mid\mathbf{X}} \{L[G,\hat{f}(\mathbf{X})]\mid\mathbf{X}\}$



$\hat{f}(\mathbf{x})=\operatorname{argmin}_g\sum_{k=1}^{K}L(k,g)\Pr(k\mid\mathbf{X}=\mathbf{x})=\operatorname{argmax}_g\Pr(g\mid\mathbf{X}=\mathbf{x})$



$\hat{f}(\mathbf{x})$ is the Bayes optimal classifier.
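For concreteness, the decision rule above can be sketched numerically. This is a toy illustration (not from the lecture) with invented posterior probabilities: given a loss matrix and estimates of $\Pr(k\mid\mathbf{x})$, the rule picks the class with the smallest expected loss.

```python
import numpy as np

# 0-1 loss matrix for K = 3 classes: L[k, l] = cost of predicting l when the truth is k
L = np.ones((3, 3)) - np.eye(3)

# Hypothetical posterior probabilities Pr(k | X = x) for a single input x
posterior = np.array([0.2, 0.5, 0.3])

# Expected loss of predicting class g: sum over k of L[k, g] * Pr(k | x)
expected_loss = posterior @ L          # one entry per candidate class g
f_hat = int(np.argmin(expected_loss))  # the rule's prediction

# Under 0-1 loss, expected_loss[g] = 1 - Pr(g | x),
# so minimizing it is the same as maximizing the posterior
assert f_hat == int(np.argmax(posterior))
print(f_hat)  # prints 1
```

Here `expected_loss` comes out as `[0.8, 0.5, 0.7]`, which makes the argmin/argmax equivalence visible.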




I couldn't really follow what the professor was trying to say in some of the steps.



My questions are:




  • Suppose our loss matrix is indeed $\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$. What is the use of this matrix? What does classifying $k$ as $l$ even mean? And how do we read off the loss for (say) a certain input $\mathbf{x_i}$ from the matrix?


  • I couldn't understand what $\hat{f}$ and $\text{EPE}(\hat{f})$ stand for. Could someone please explain them with a simple example?




























      probability matrices statistics machine-learning






      asked Jan 27 '18 at 6:05







      user400242





























2 Answers





































Consider a random variable $X$ and a random variable $g$. $X$ is uniformly distributed over $[0,1]$ if $g=1$ and uniformly distributed over $[0.5,1.5]$ if $g=2$. $g$ takes the value $1$ with probability $0.1$. This specifies the joint distribution of $(X,g)$.



Assume that we have to guess $g$ after having observed $X$. If we guess wrong, we pay $1$ rupee; if we guess right, the penalty is $0$.



It is clear that if $X$ is below $0.5$ then our guess for $g$ is $1$, and if $X$ is above $1$ then our guess for $g$ is $2$. But what do we do if $X$ falls between $0.5$ and $1$?



Now, define $\hat f$ the following way: let it be $1$ if $X<0.7$ and $2$ otherwise.



Could we do any better?





          Can you identify the current loss matrix?



          Can you compute the expected loss belonging to the method given above?



          Can you create a similar problem so that the loss matrix is the one given in the OP?
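As a quick numeric check of these hints (my computation, not part of the original answer), the expected $0$-$1$ loss of the threshold-at-$0.7$ rule can be estimated by sampling the joint distribution described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sample the joint distribution: g = 1 with probability 0.1, else g = 2
g = np.where(rng.random(n) < 0.1, 1, 2)
# X | g=1 ~ Uniform[0, 1],  X | g=2 ~ Uniform[0.5, 1.5]
x = np.where(g == 1, rng.random(n), 0.5 + rng.random(n))

# The rule from the answer: guess 1 if X < 0.7, else guess 2
guess = np.where(x < 0.7, 1, 2)

# Monte Carlo estimate of the expected 0-1 loss
# Exact value: 0.1*P(X >= 0.7 | g=1) + 0.9*P(X < 0.7 | g=2) = 0.03 + 0.18 = 0.21
print((guess != g).mean())  # ~0.21
```

On the overlap $[0.5,1]$ the weighted density of $g=2$ ($0.9 \cdot 1$) dominates that of $g=1$ ($0.1 \cdot 1$), so it appears the best rule guesses $2$ whenever $X \ge 0.5$, with expected loss $0.1 \cdot 0.5 = 0.05$.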



















  • Not convinced by your answer - could you help out some more? I am also struggling with this question. – Xavier Bourret Sicotte, Jun 21 '18 at 14:07
































I think the loss matrix (for example, the zero-one loss matrix on page 20 of the ESL book) can be treated as a square look-up table. The number of rows and the number of columns are both the number of possible classes; call it $K \times K$.



In your example, $K = 3$, so there are three possible levels; let's say they are lvl-1, lvl-2, and lvl-3. Suppose you have an observation whose true level is lvl-2, but your estimator says lvl-1. Then we incur a penalty, since this is a misclassification, and (lvl-2, lvl-1) corresponds to the entry in the 2nd row and 1st column, which is $1$ in your table. You asked what classifying $k$ as $l$ means: here $k$ is the truth (lvl-2 in my example) and $l$ is the erroneous prediction (lvl-1).



In the ESL book, the EPE (Expected Prediction Error) is $$EPE = E[L(G,\hat{G}(X))],$$ where $G$ and $\hat{G}(X)$ within $L(G, \hat{G}(X))$ can be treated as the row and column indices. Therefore $L(G, \hat{G}(X))$ can be considered a random variable; the sources of its randomness are $G$ and $X$.



Let's assume we observe an $X$ and its corresponding class $G$; then an estimate of $G$, namely $\hat{G}(X)$, can be computed by some technique. We then go to the look-up table, find the row index from the known $G$ and the column index from the estimate $\hat{G}(X)$. If $G = \hat{G}(X)$ the penalty is $0$, and $1$ otherwise. This value is deterministic, because we have already observed both $X$ and $G$; no randomness remains.



If we observe only $X$, not $G$, we can still compute $\hat{G}(X)$ (the column index), since the technique can be applied to any observed $X$. But we don't know the penalty, because that requires both the row and the column index. Therefore we compute the expectation of this random penalty, conditioning on the observed $X$: it is the weighted average of all possible penalties, each weighted by the conditional probability of its occurrence. The weight for the $k^{th}$ class is $\Pr(g_{k}\mid X)$, where $k \in \{1, 2, \ldots, K\}$. Now we have:



$$\tag{*} E[L(G, \hat{G}(X))\mid X]=\sum_{k=1}^K L(g_{k}, \hat{G}(X))\Pr(g_{k}\mid X)$$



In "$\mid X$" above, $X$ is treated as known; but $X$ is itself random, unknown until we observe it, and we want the EPE, not the conditional expectation (*). Therefore we take the expectation of that conditional expectation:
$$EPE = E_{X}\left(\sum_{k=1}^K L(g_{k}, \hat{G}(X))\Pr(g_{k}\mid X)\right)$$



So we should pick the estimator $\hat{G}(X)$ that minimizes the above EPE.
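The look-up-table reading above can be sketched in code. This is a toy example with invented labels: each observed pair $(G_i, \hat{G}(X_i))$ indexes one penalty in the table, and averaging the penalties gives the empirical analogue of the EPE.

```python
import numpy as np

# 0-1 loss look-up table for K = 3 classes (rows: truth G, columns: prediction)
L = np.ones((3, 3), dtype=int) - np.eye(3, dtype=int)

# Hypothetical observed classes and predictions (0-based class indices)
G     = np.array([1, 0, 2, 1, 1, 0])
G_hat = np.array([1, 0, 1, 1, 2, 0])

# Each pair (G_i, G_hat_i) looks up one penalty in the table
penalties = L[G, G_hat]   # -> [0, 0, 1, 0, 1, 0]

# Empirical average loss: 2 misclassifications out of 6
print(penalties.mean())   # prints 0.3333...
```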



This is my understanding; I am sure there are more mathematically rigorous answers.

























































































            edited Jan 27 '18 at 13:34

























            answered Jan 27 '18 at 12:52
            zolizoli













                edited Dec 27 '18 at 17:33

























                answered Dec 27 '18 at 17:18









cs_snake






































































