Gradient checking in neural network with dot product
















I am taking the second course of the deeplearning.ai specialization on Coursera and was watching a video on gradient checking for neural networks. After we compute the gradient vector and the approximated gradient vector as shown here, why is the strange formula
$$difference = \frac{\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2} \tag{3}$$
used to measure how similar the two vectors are?
Why not use cosine similarity?
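For readers who have not seen the video, the check being asked about can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy cost `cost`, its hand-written gradient `analytic_grad`, the helper names, and the step size `eps=1e-7` are all made up for the example and are not taken from the course material. Only the `difference` function corresponds to formula (3).

```python
import math

def norm(v):
    """Euclidean (L2) norm of a vector stored as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def cost(theta):
    """Toy cost J(theta) = sum(theta_i^2); a hypothetical stand-in for a network's loss."""
    return sum(t * t for t in theta)

def analytic_grad(theta):
    """Exact gradient of the toy cost: dJ/dtheta_i = 2 * theta_i."""
    return [2.0 * t for t in theta]

def numeric_grad(f, theta, eps=1e-7):
    """Centered finite difference, perturbing one coordinate at a time."""
    approx = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        approx.append((f(plus) - f(minus)) / (2.0 * eps))
    return approx

def difference(grad, gradapprox):
    """Formula (3): ||grad - gradapprox||_2 / (||grad||_2 + ||gradapprox||_2)."""
    diff = [g - a for g, a in zip(grad, gradapprox)]
    return norm(diff) / (norm(grad) + norm(gradapprox))

theta = [1.0, -2.0, 3.0]
gradapprox = numeric_grad(cost, theta)

good = difference(analytic_grad(theta), gradapprox)  # tiny: the gradients agree
buggy_grad = [3.0 * t for t in theta]                # deliberately wrong factor
bad = difference(buggy_grad, gradapprox)             # about 0.2: clearly off
```

With a correct gradient the value is many orders of magnitude below the value obtained from the deliberately wrong one, which is what makes a fixed threshold usable.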







      vectors neural-networks














      edited Dec 4 '18 at 16:36







      KAY_YAK

















      asked Dec 4 '18 at 15:39









KAY_YAK






















          2 Answers



Probably to avoid division-by-zero errors. As we approach a point where the gradient is zero, one of the two vectors may have its length round to $0$. Cosine similarity divides by the product of the two lengths, so it breaks as soon as either one is zero; the formula in the question divides by their sum, so it is not a problem as long as only one of them is. You can of course write the formula in terms of the usual cosine similarity (I'll leave that as an exercise). It's also natural to subtract one vector from the other elsewhere in gradient descent, so you can recycle a cached value.
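The zero-length case can be illustrated with a small sketch. The helper names and the sample vectors are assumptions made for this example, not part of the original answer. The last few lines also check numerically one way of relating the two measures for nonzero vectors, namely $\|a-b\|_2^2 = \|a\|_2^2 + \|b\|_2^2 - 2\|a\|_2\|b\|_2\cos\theta$, on a single pair of vectors.

```python
import math

def norm(v):
    """Euclidean (L2) norm of a vector stored as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def relative_difference(a, b):
    """||a - b||_2 / (||a||_2 + ||b||_2); undefined only if both vectors are zero."""
    diff = [x - y for x, y in zip(a, b)]
    return norm(diff) / (norm(a) + norm(b))

def cosine_similarity(a, b):
    """a.b / (||a||_2 * ||b||_2); undefined as soon as either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

grad = [0.0, 0.0]         # hypothetical gradient that rounded to zero
gradapprox = [1e-9, 0.0]  # hypothetical estimate that did not

rel = relative_difference(grad, gradapprox)  # 1.0, still well defined
try:
    cos = cosine_similarity(grad, gradapprox)
except ZeroDivisionError:
    cos = None  # 0.0 / 0.0: cosine similarity has no value here

# For nonzero vectors, the relative difference can be rebuilt from the cosine.
a, b = [3.0, 4.0], [4.0, 3.0]
na, nb = norm(a), norm(b)
c = cosine_similarity(a, b)
via_cosine = math.sqrt(na**2 + nb**2 - 2.0 * na * nb * c) / (na + nb)
direct = relative_difference(a, b)
```

Note that the rebuilt expression still needs both lengths in addition to the cosine, so the cosine alone does not determine the relative difference.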

















            answered Dec 4 '18 at 16:06









J.G.












• Well, I can just place some checks on the lengths. Also, this form is not so intuitive.
  – KAY_YAK
  Dec 4 '18 at 16:39










• Firstly, that makes for messier code. Secondly, caverac's answer points out that we want to check whether the vectors themselves are close together, not whether their scaled-to-length-$1$ counterparts are. Thirdly, although learning about the dot product may make cosine similarity intuitive in certain contexts, the formula you've asked about is, I'd argue, more intuitive to someone with limited geometric knowledge, and it is also easier to motivate in the context of gradient descent.
  – J.G.
  Dec 4 '18 at 17:18
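The second point in the comment above can be made concrete with a short sketch. The vectors and helper names are assumptions chosen for illustration: an approximate gradient that points in exactly the right direction but is twice as long gets a perfect cosine similarity, while the relative difference reports the mismatch.

```python
import math

def norm(v):
    """Euclidean (L2) norm of a vector stored as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def relative_difference(a, b):
    """||a - b||_2 / (||a||_2 + ||b||_2)."""
    diff = [x - y for x, y in zip(a, b)]
    return norm(diff) / (norm(a) + norm(b))

def cosine_similarity(a, b):
    """a.b / (||a||_2 * ||b||_2); compares directions only."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

grad = [3.0, 4.0]        # hypothetical analytic gradient
gradapprox = [6.0, 8.0]  # same direction, but off by a factor of 2

cos = cosine_similarity(grad, gradapprox)    # direction agrees perfectly
rel = relative_difference(grad, gradapprox)  # 5 / 15: magnitude error is visible
```

A backpropagation bug that only mis-scales the gradient (for example a forgotten constant factor) would therefore pass a cosine-based check and fail this one.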




































The idea is that you want to know when the update is small, so that you can stop the iterations. The problem is: what does it mean to be small?

One option is to calculate the distance $\|{\bf a} - {\bf b}\|_2$ and compare it against $0$, or rather against a very small number $\epsilon$: if the distance is below $\epsilon$, stop. But here is the problem: imagine you multiply the cost function by a factor $k > 0$ (arbitrary, e.g. the size of the problem, or $1/2$, ...). Then each vector is scaled by the same factor:

$$
\| k {\bf a} - k {\bf b} \|_2 = k \|{\bf a} - {\bf b}\|_2
$$

For example, imagine $k = 10^3$: what value should you now compare against to stop? If you don't change $\epsilon$, the same threshold has silently become a thousand times stricter; with $k = 10^{-3}$ it would instead let the algorithm stop even though it has not converged.

To avoid this problem, divide by the lengths of the vectors:

$$
\frac{\| k {\bf a} - k {\bf b} \|_2}{k\|{\bf a}\|_2 + k\|{\bf b}\|_2} = \frac{k\|{\bf a} - {\bf b}\|_2}{k\left(\|{\bf a}\|_2 + \|{\bf b}\|_2\right)} = \frac{\|{\bf a} - {\bf b}\|_2}{\|{\bf a}\|_2 + \|{\bf b}\|_2}
$$

which clearly does not depend on the scale of the problem, so $\epsilon$ is now meaningful.
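The scaling argument can be checked numerically with a minimal sketch. The vectors `a` and `b`, the factor `k`, and the helper names are arbitrary choices made for this illustration.

```python
import math

def norm(v):
    """Euclidean (L2) norm of a vector stored as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def distance(a, b):
    """Plain distance ||a - b||_2, which scales with the cost function."""
    return norm([x - y for x, y in zip(a, b)])

def relative_difference(a, b):
    """Normalized distance ||a - b||_2 / (||a||_2 + ||b||_2)."""
    return distance(a, b) / (norm(a) + norm(b))

a = [1.0, 2.0]
b = [1.0, 2.5]
k = 1000.0  # pretend the cost function was multiplied by k

ka = [k * x for x in a]
kb = [k * x for x in b]

raw = distance(a, b)                      # 0.5
raw_scaled = distance(ka, kb)             # k times larger
rel = relative_difference(a, b)
rel_scaled = relative_difference(ka, kb)  # unchanged up to rounding
```

The plain distance grows by the factor `k`, so a fixed threshold would have to be retuned, while the normalized quantity stays the same.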

















                answered Dec 4 '18 at 16:40









caverac





























