Is Gradient Descent central to every optimizer?












I want to know whether gradient descent is the main algorithm used in optimizers like Adam, Adagrad, RMSProp, and several others.










Tags: machine-learning, neural-network, deep-learning, optimization, gradient-descent






12 votes. Asked Mar 12 at 10:04 by InAFlash; edited Mar 12 at 14:03 by doppelgreener.

Comment from frank (Mar 12 at 12:58): I'm surprised no one has mentioned "coordinate descent" or "coordinate ascent." en.wikipedia.org/wiki/Coordinate_descent
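
For readers unfamiliar with the term, a minimal sketch of the idea on a made-up convex quadratic: coordinate descent repeatedly minimizes along one coordinate at a time while the others are held fixed, rather than taking a full gradient step.

```python
# Illustrative coordinate descent on f(x, y) = (x - 3)^2 + 2*(y + 1)^2 + x*y (a hypothetical objective).

def f(x, y):
    return (x - 3) ** 2 + 2 * (y + 1) ** 2 + x * y

x, y = 0.0, 0.0
for _ in range(50):
    x = 3 - y / 2        # exact argmin over x with y fixed: 2(x - 3) + y = 0
    y = -1 - x / 4       # exact argmin over y with x fixed: 4(y + 1) + x = 0

print(x, y, f(x, y))     # converges to (4, -2), the unique minimizer of this quadratic
```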














4 Answers


















Answer by jeb02 (26 votes), answered Mar 12 at 13:38

No. Gradient descent is used in optimization algorithms that take the gradient as the basis of their step movement. Adam, Adagrad, and RMSProp all use some form of gradient descent, but they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena and do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.



Check out a visualization of Bayesian Optimization in action.



There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.



Non-derivative-based optimization algorithms can be especially useful for irregular non-convex cost functions, non-differentiable cost functions, or cost functions whose left and right derivatives differ.



To understand why one might choose a non-derivative-based optimization algorithm, take a look at the Rastrigin benchmark function. Gradient-based optimization is not well suited to functions with so many local minima.



[Figure: the Rastrigin benchmark function]
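
A minimal sketch of that failure mode, assuming NumPy and a small illustrative step size: plain gradient descent on the 2-D Rastrigin function typically settles into whichever local minimum is closest to its starting point rather than the global minimum at the origin.

```python
import numpy as np

def rastrigin(x, A=10.0):
    # Global minimum is 0 at x = 0; the cosine term creates a grid of local minima.
    return A * x.size + np.sum(x ** 2 - A * np.cos(2 * np.pi * x))

def rastrigin_grad(x, A=10.0):
    return 2 * x + 2 * np.pi * A * np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
x = rng.uniform(-4.5, 4.5, size=2)      # random start away from the origin
for _ in range(2000):
    x -= 0.001 * rastrigin_grad(x)      # plain gradient descent step

# x usually ends up near one of the many local minima, not at (0, 0).
print(x, rastrigin(x))
```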






Comment from InAFlash (Mar 12 at 17:32): Thank you very much. Loved your answer.



















Answer by Esmailian (8 votes), answered Mar 12 at 10:25, edited Mar 12 at 17:10

According to the title:

No. Only specific types of optimizers are based on gradient descent. A straightforward counterexample is optimization over a discrete space, where the gradient is undefined.



According to the body:

Yes. Adam, Adagrad, RMSProp, and other similar optimizers (Nesterov, Nadam, etc.) all try to propose an adaptive step size (learning rate) for gradient descent, to improve convergence speed without sacrificing performance (i.e. without ending up in a worse local minimum/maximum).
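
A rough sketch of that shared structure (standard textbook update rules; the hyperparameter defaults and example values below are only illustrative): every one of these methods consumes the same gradient and differs only in how it rescales the step.

```python
import numpy as np

def sgd_step(w, g, lr=0.01):
    # Plain gradient descent: step straight down the gradient.
    return w - lr * g

def rmsprop_step(w, g, state, lr=0.001, rho=0.9, eps=1e-8):
    # Same gradient, but the step is divided by a running RMS of past gradients.
    state["v"] = rho * state["v"] + (1 - rho) * g ** 2
    return w - lr * g / (np.sqrt(state["v"]) + eps)

def adam_step(w, g, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Same gradient again, with momentum (m) and a per-parameter scale (v), bias-corrected.
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

w = np.array([1.0, -2.0])
g = np.array([0.3, -0.1])   # stand-in for the gradient of some loss at w
adam_state = {"m": np.zeros_like(w), "v": np.zeros_like(w), "t": 0}
print(sgd_step(w, g), adam_step(w, g, adam_state))
```

Strip away the running averages and the bias correction and each of these collapses back to the plain gradient step at the top.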



It is worth noting that there are also Newton methods, and similarly quasi-Newton methods, that work with the second-order derivatives of the loss function (gradient descent works with the first-order derivative). These methods have lost the speed-scalability trade-off to gradient descent because of the large number of model parameters in practical problems.



Some extra notes




1. The shape of the loss function depends on both the model parameters and the data, so choosing the best method is always task-dependent and needs trial and error.


2. The stochastic part of gradient descent is achieved by using a batch of data rather than the complete data. This technique is orthogonal to all the mentioned methods, meaning any of them can be stochastic (using a batch of data) or deterministic (using the whole data).


3. Projected gradient descent is used when some regions of the parameter space are infeasible (invalid, not allowed), so we bring a parameter back to (project it onto) the feasible region whenever it steps into an infeasible one. For example, suppose we only allow $\|w\|_2 \leq 1$; when the parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible point, depending on the trajectory and the specific method. This technique is also orthogonal to the mentioned methods; we could have projected stochastic Adam (a sketch of the projection step follows below).
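
A minimal sketch of that projection step, assuming NumPy and the $\|w\|_2 \leq 1$ feasible set from the example:

```python
import numpy as np

def project_onto_unit_ball(w):
    # Euclidean projection onto {w : ||w||_2 <= 1}: rescale only if the constraint is violated.
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

def projected_gd_step(w, grad, lr=0.1):
    # Ordinary gradient step, followed by projection back into the feasible region.
    return project_onto_unit_ball(w - lr * grad)

print(project_onto_unit_ball(np.array([0.0, 1.1])))   # -> [0. 1.], back on the boundary
```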







Answer by Vaalizaadeh (3 votes), answered Mar 12 at 11:51, edited Mar 12 at 12:13

The answer to the question may be no. The reason is simply that numerous optimisation algorithms are available, and choosing one highly depends on the context and the time you have for optimising. For instance, the genetic algorithm is a well-known optimisation approach that does not have any gradient descent inside it. There are also other approaches, like backtracking, in some contexts. All of these can be used without leveraging gradient descent at any step.



On the other hand, for tasks like regression you can find a closed-form solution for locating the extrema, but the point is that, depending on the feature space and the number of inputs, you may choose either the closed-form equation or gradient descent to decrease the number of calculations.



While there are many optimisation algorithms, gradient-descent-based approaches are used more often in neural networks, for multiple reasons. First of all, they are very fast. In deep learning, you have to provide so much data that it cannot all be loaded into memory simultaneously, so you have to apply mini-batch gradient methods for optimisation. This is slightly statistical, but each batch you bring to your network has roughly the same distribution as the real data and is representative enough to yield a gradient that is close to the real gradient of the cost function, which would be constructed using all of the data in hand.
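
A small sketch of that claim on made-up synthetic data (the sizes and the batch size are arbitrary): the mean-squared-error gradient computed on one random mini-batch points almost the same way as the gradient computed on the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100_000)
w = np.zeros(10)                                   # current parameters

def mse_grad(Xb, yb, w):
    # Gradient of 0.5 * mean((Xb @ w - yb)^2) with respect to w.
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = mse_grad(X, y, w)
idx = rng.choice(len(X), size=256, replace=False)  # one random mini-batch
batch_grad = mse_grad(X[idx], y[idx], w)

cos = batch_grad @ full_grad / (np.linalg.norm(batch_grad) * np.linalg.norm(full_grad))
print(cos)   # typically well above 0.9: the mini-batch gradient points almost the same way
```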



Second, the complexity of finding extrema using matrices and their inverses is $O(n^3)$ for a simple regression task, whose parameters can be found using $w = (X^TX)^{-1}X^Ty$. It turns out that simple gradient-based methods can perform better. It should also be mentioned that in the former case you have to bring all the data into memory simultaneously, which is not possible when you deal with big-data tasks.
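
A hedged sketch of that comparison on a hypothetical small least-squares problem (the sizes and step size are illustrative): the normal equation forms and solves a d x d system from the whole dataset at once, while gradient descent only ever needs gradients, and both land on the same solution.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=5000)

# Closed form (normal equation): build X^T X and solve the d x d system (the cubic-cost part).
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Full-batch gradient descent on the same least-squares objective.
w = np.zeros(20)
for _ in range(500):
    w -= 0.1 * X.T @ (X @ w - y) / len(y)

print(np.max(np.abs(w - w_closed)))   # the two solutions agree to high precision
```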



Third, there are optimisation problems that do not necessarily have a closed-form solution. Logistic regression is one of them.






Answer by Christian (3 votes), answered Mar 12 at 17:01

Well, you picked optimizers that are used in neural networks, and those optimizers do use gradient-based algorithms. Most of the time, gradient-based algorithms are used in neural networks. Why is that? Well, would you prefer to search for the minimum knowing the slope of the curve, or without knowing it?
When you can't compute the gradient, you fall back to derivative-free optimization.
That being said, there are cases where, even though you have information about the gradient, it is better to use a gradient-free method. This is usually the case with functions that have a lot of local minima. Population-based algorithms such as evolution strategies and genetic algorithms have the upper hand here.
And there is also the branch of combinatorial optimization, where a whole different set of tools is used.
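
A bare-bones sketch of the derivative-free idea (a hypothetical (1+1)-style evolution strategy on the Rastrigin function mentioned in another answer; all settings are illustrative): it only ever evaluates the objective and never touches a gradient.

```python
import numpy as np

def rastrigin(x, A=10.0):
    # Highly multimodal benchmark; global minimum 0 at the origin.
    return A * x.size + np.sum(x ** 2 - A * np.cos(2 * np.pi * x))

rng = np.random.default_rng(0)
x = rng.uniform(-4.5, 4.5, size=2)
fx = rastrigin(x)
sigma = 1.0                                           # mutation step size

for _ in range(5000):
    candidate = x + sigma * rng.normal(size=2)        # mutate the current point
    fc = rastrigin(candidate)
    improved = fc < fx
    if improved:
        x, fx = candidate, fc                         # keep the candidate only if it is better
    sigma *= 1.1 if improved else 0.995               # crude 1/5th-rule-style step-size control

print(x, fx)   # often a much better basin than plain gradient descent finds, with no gradients used
```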





