Is Gradient Descent central to every optimizer?












12












$begingroup$


I want to know whether gradient descent is the main algorithm used in optimizers like Adam, Adagrad, RMSProp, and several other optimizers.










$endgroup$

machine-learning neural-network deep-learning optimization gradient-descent

edited Mar 12 at 14:03 by doppelgreener
asked Mar 12 at 10:04 by InAFlash
  • 1




    $begingroup$
    I'm surprised no one has mentioned "coordinate descent" or "coordinate ascent." en.wikipedia.org/wiki/Coordinate_descent .
    $endgroup$
    – frank
    Mar 12 at 12:58
















4 Answers


















26












$begingroup$

No. Gradient descent is used in optimization algorithms that use the gradient as the basis of their step direction. Adam, Adagrad, and RMSProp all use some form of gradient descent; however, they do not make up every optimizer. Evolutionary algorithms such as Particle Swarm Optimization and Genetic Algorithms are inspired by natural phenomena and do not use gradients. Other algorithms, such as Bayesian Optimization, draw inspiration from statistics.



Check out this visualization of Bayesian Optimization in action.



There are also a few algorithms that combine concepts from evolutionary and gradient-based optimization.



Non-derivative-based optimization algorithms can be especially useful for irregular non-convex cost functions, non-differentiable cost functions, or cost functions that have different left and right derivatives.



To understand why one might choose a non-derivative-based optimization algorithm, take a look at the Rastrigin benchmark function. Gradient-based optimization is not well suited to functions with so many local minima.



[Image: Rastrigin benchmark function]
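To make the contrast concrete, here is a minimal sketch (the step size, iteration budget, and search range are arbitrary choices) that runs plain gradient descent and a simple derivative-free random search on the 2-D Rastrigin function; gradient descent tends to stall in whichever local minimum is nearest the starting point, while the gradient-free search is not tied to that basin:

    import numpy as np

    def rastrigin(x):
        # Rastrigin function: global minimum 0 at x = 0, with many local minima.
        return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

    def rastrigin_grad(x):
        # Analytic gradient of the Rastrigin function.
        return 2 * x + 20 * np.pi * np.sin(2 * np.pi * x)

    rng = np.random.default_rng(0)
    x0 = rng.uniform(-4.0, 4.0, size=2)      # shared random starting point

    # Gradient descent: follows the local slope into a nearby basin.
    x = x0.copy()
    for _ in range(2000):
        x -= 0.001 * rastrigin_grad(x)

    # Derivative-free random search: keeps the best of many random samples.
    best = x0.copy()
    for _ in range(2000):
        cand = rng.uniform(-4.0, 4.0, size=2)
        if rastrigin(cand) < rastrigin(best):
            best = cand

    print("gradient descent:", x, rastrigin(x))
    print("random search:   ", best, rastrigin(best))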






$endgroup$

answered Mar 12 at 13:38 by jeb02













  • $begingroup$
    thank you very much. Loved your answer
    $endgroup$
    – InAFlash
    Mar 12 at 17:32



















8












$begingroup$

According to the title:

No. Only specific types of optimizers are based on gradient descent. A straightforward counterexample is optimization over a discrete space, where the gradient is undefined.



According to the body:

Yes. Adam, Adagrad, RMSProp, and other similar optimizers (Nesterov, Nadam, etc.) all try to provide an adaptive step size (learning rate) for gradient descent in order to improve convergence speed without sacrificing performance (i.e. without ending up in a worse local minimum/maximum).
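To see what "an adaptive step size for gradient descent" means mechanically, here is a minimal sketch of the standard Adam update (the toy objective, learning rate, and iteration count are arbitrary); the core of the loop is still a step along the gradient, just rescaled per parameter:

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        # One Adam update: a gradient step with an adaptive per-parameter scale.
        m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
        v = beta2 * v + (1 - beta2) * grad**2         # second-moment estimate
        m_hat = m / (1 - beta1**t)                    # bias corrections
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # still a descent step
        return w, m, v

    # Toy objective f(w) = ||w||^2 with gradient 2w, used only for illustration.
    w = np.array([3.0, -2.0])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, 2001):
        w, m, v = adam_step(w, 2 * w, m, v, t)
    print(w)   # driven close to the minimizer [0, 0]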



It is worth noting that there are also Newton and quasi-Newton methods, which work with the second-order derivatives of the loss function (gradient descent works with the first-order derivative). These methods have lost out to gradient descent on the speed-scalability trade-off because of the large number of model parameters in practical problems.



Some extra notes




  1. The shape of the loss function depends on both the model parameters and the data, so choosing the best method is always task-dependent and requires trial and error.


  2. The stochastic part of gradient descent comes from using a batch of data rather than the complete data set. This choice is orthogonal to all the methods mentioned above: any of them can be run stochastically (on a batch of data) or deterministically (on the whole data set).


  3. Projected gradient descent is used when some regions of the parameter space are infeasible (invalid, not allowed): whenever a parameter moves into an infeasible region, we bring it back (project it) onto the feasible region. For example, suppose we only allow $\left\| w \right\|_2 \leq 1$; when the parameter goes to $(0, 1.1)$, we bring it back to $(0, 1)$, or $(0.43, 0.9)$, or some other feasible point, depending on the trajectory and the specific method. This technique is also orthogonal to the methods mentioned above, so we could have a projected stochastic Adam (a minimal sketch follows this list).
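A minimal sketch of that projection step, using the $\left\| w \right\|_2 \leq 1$ constraint from the example and a toy quadratic loss whose unconstrained minimum lies outside the feasible region (the loss, learning rate, and iteration count are placeholders):

    import numpy as np

    def project_to_unit_ball(w):
        # Projection onto ||w||_2 <= 1, e.g. (0, 1.1) -> (0, 1).
        norm = np.linalg.norm(w)
        return w if norm <= 1.0 else w / norm

    target = np.array([2.0, 0.0])               # unconstrained minimum, infeasible
    grad = lambda w: 2 * (w - target)           # gradient of ||w - target||^2

    w = np.zeros(2)
    for _ in range(100):
        w = w - 0.1 * grad(w)                   # ordinary gradient step
        w = project_to_unit_ball(w)             # bring w back to the feasible region
    print(w)                                    # settles at the boundary point [1, 0]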







$endgroup$

edited Mar 12 at 17:10
answered Mar 12 at 10:25 by Esmailian





















3


$begingroup$

The answer to the question may be no. The reason is simply that there are numerous optimisation algorithms available, and choosing one depends heavily on the context and the time you have for optimising. For instance, the genetic algorithm is a well-known optimisation approach that does not involve gradient descent at all. There are also other approaches, like backtracking, in some contexts. All of these can be used without leveraging gradient descent step by step.



On the other hand, for tasks like regression you can find a closed-form solution for the extremum, but the point is that, depending on the feature space and the number of inputs, you may choose either the closed-form equation or gradient descent to reduce the amount of computation.



While there are many optimisation algorithms, gradient-descent-based approaches are used more in neural networks, for several reasons. First of all, they are very fast. In deep learning there is so much data that it cannot all be loaded into memory at once, so you have to apply mini-batch gradient methods. This is a statistical point: each batch you feed to your network has roughly the same distribution as the real data and is representative enough that its gradient is close to the true gradient of the cost function, which would be constructed using all the data at hand.



Second, the complexity of finding the extremum using matrices and their inverse is $O(n^3)$ for a simple regression task in which the parameters can be found using $w = (X^tX)^{-1}(X^ty)$. It turns out that simple gradient-based methods can perform better. It should also be mentioned that in the former case you have to bring all the data into memory at once, which is not possible when you deal with big-data tasks.
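As a toy illustration of the two routes (the data here are made up, and the learning rate and iteration count are arbitrary), both of the following recover essentially the same least-squares weights; the closed form solves a linear system built from $X^tX$, while gradient descent only needs matrix-vector products and could be run on mini-batches instead:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                    # toy design matrix
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=200)      # noisy targets

    # Closed form (normal equation): solving X^t X w = X^t y, O(n^3) in the features.
    w_closed = np.linalg.solve(X.T @ X, X.T @ y)

    # Gradient descent on the mean squared error: only matrix-vector products.
    w_gd = np.zeros(3)
    lr = 0.1
    for _ in range(1000):
        grad = (2 / len(y)) * X.T @ (X @ w_gd - y)
        w_gd -= lr * grad

    print(w_closed)
    print(w_gd)    # the two estimates agree to several decimals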



Third, there are optimisation problems that do not necessarily have a closed-form solution; logistic regression is one of them.






$endgroup$

edited Mar 12 at 12:13
answered Mar 12 at 11:51 by Vaalizaadeh





















3


$begingroup$

Well, you picked optimizers that are used in neural networks, and those optimizers do use gradient-based algorithms. Most of the time, gradient-based algorithms are used in neural networks. Why is that? Well, would you rather try to find the minimum knowing the slope of the curve, or without knowing it?
When you can't compute the gradient, you fall back to derivative-free optimization.
That being said, there are cases where, even though you have information about the gradient, it's better to use a gradient-free method. This is usually the case for functions with a lot of local minima. Population-based algorithms such as evolution strategies and genetic algorithms have the upper hand here (see the sketch below).
And there's also the branch of combinatorial optimization, where a whole different set of tools is used.
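As a concrete example of the population-based, derivative-free idea, here is a sketch of a (1+1) evolution strategy on the multimodal Rastrigin function (the mutation scale, adaptation factors, and budget are arbitrary choices); it only ever compares function values and never touches a gradient:

    import numpy as np

    def rastrigin(x):
        # A standard multimodal benchmark with many local minima.
        return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

    rng = np.random.default_rng(1)
    x = rng.uniform(-4.0, 4.0, size=2)           # current parent
    fx = rastrigin(x)
    sigma = 1.0                                  # mutation step size

    for _ in range(5000):
        child = x + sigma * rng.normal(size=2)   # mutate: no gradient needed
        f_child = rastrigin(child)
        if f_child < fx:                         # keep the better of parent and child
            x, fx = child, f_child
            sigma *= 1.1                         # widen the search after a success
        else:
            sigma *= 0.98                        # narrow it slowly after a failure

    print(x, fx)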






$endgroup$

answered Mar 12 at 17:01 by Christian













