Gradient estimates using a candidate
Suppose one has a differentiable function $f:\Bbb R^n\to\Bbb R$ that one can evaluate, but for which one has no expression for the derivative. There exist several procedures for obtaining an estimate of the gradient; for example, one could evaluate
$$\nabla f(x)\approx\sum_i \frac{f(x+\epsilon\, e_i)-f(x-\epsilon\, e_i)}{2\epsilon}\, e_i,\tag{1}$$
or
$$\nabla f(x) \approx \left\langle\sum_i \frac{f(x+\Delta)-f(x-\Delta)}{2\Delta_i}\,e_i\right\rangle_{p(\Delta)},\tag{2}$$
where the brackets denote the expectation value in which $\Delta$ is drawn from some symmetric probability distribution $p(\Delta)$ centered around $0$ (this is known as SPSA).
These methods are generic in the sense that for $\epsilon\to0$ or $\mathrm{Var}(p(\Delta))\to0$ they always converge to the gradient of $f$. However, they may be slow; for example, $(1)$ requires $2n$ function evaluations per estimate, which is very many in high dimensions.
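For concreteness, a minimal sketch of both estimators in Python (NumPy); here `f` is just a placeholder for any callable taking a vector, and the $\pm c$ Rademacher perturbation is one common, but not the only, choice of $p(\Delta)$:

```python
import numpy as np

def grad_central_diff(f, x, eps=1e-5):
    """Central differences, eq. (1): 2n evaluations of f per gradient estimate."""
    n = x.size
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def grad_spsa(f, x, c=1e-2, num_samples=10, rng=None):
    """SPSA, eq. (2): 2 evaluations of f per sample, averaged over num_samples
    draws of Delta from a symmetric (here Rademacher) distribution."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.zeros(x.size)
    for _ in range(num_samples):
        delta = c * rng.choice([-1.0, 1.0], size=x.size)
        g += (f(x + delta) - f(x - delta)) / (2 * delta)
    return g / num_samples
```

Here `grad_spsa` trades the $2n$ evaluations of $(1)$ for `2 * num_samples` evaluations, at the cost of noise in the estimate.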
Sometimes one already has a candidate function $\overline{\nabla f}$ for $\nabla f$. This candidate will in general be wrong, but may, for example, lie in the correct quadrant. My question:
Is there a method for estimating the gradient of $f$ that uses the candidate $\overline{\nabla f}$ as a suggestion?
The expected benefit of using $\overline{\nabla f}$ would be a speed-up of the procedure. I'm sure such a thing exists, but I can't find any literature because I do not know what to search for.
I'm not sure if Math Stack Exchange is the right place to ask such a question, so I would also appreciate it if you could point me in a direction where I am more likely to get an answer.
optimization numerical-methods numerical-optimization gradient-descent
edited Nov 21 at 9:28
asked May 17 '17 at 11:46
s.harp
You mean like using the SVD to determine the most important directions and then using Broyden updates to improve the Jacobian in these directions?
– LutzL
May 17 '17 at 12:00
I'm thinking concretely about situations where one has a model which gives you such gradient candidates, for example via an approximation of $f$ by some function $\bar f$ which has a tractable differential. I don't know for sure what the Broyden method is, but from what I just read it seems to be an update method for gradient descent that uses something like inertial terms to get better gradient estimates.
– s.harp
May 17 '17 at 12:47
What about some kind of importance sampling, with bias based on $\overline{\nabla f}$?
– user3658307
May 17 '17 at 21:37
1 Answer
For the case where $\overline{\nabla f} = \nabla \overline{f}$ for some $\overline{f}$, you can use control variates.
In particular, let $\Delta$ be drawn from a symmetric probability distribution $p(\Delta)$, and let
$$X = \sum_i \frac{f(x+\Delta)-f(x-\Delta)}{2\Delta_i}\,e_i,$$
$$\overline{X} = \sum_i \frac{\overline{f}(x+\Delta)-\overline{f}(x-\Delta)}{2\Delta_i}\,e_i.$$
If $\overline{\nabla f}$ is a good candidate, $X$ and $\overline{X}$ will be correlated, and we can exploit that.
Knowing that $\mathbb{E}[\overline{X}] = \overline{\nabla f}$, for any constant $c$ we can write
$$\nabla f(x) = \mathbb{E}[X] = \mathbb{E}[X + c(\overline{X} - \overline{\nabla f})].$$
This effectively adds to each of your Monte Carlo terms the correction that would make it accurate for the candidate. If the candidate is correlated with the actual gradient, the real accuracy also improves, which means you need fewer terms to get the same level of accuracy.
The optimal choice for the constant is $c = -\frac{\text{Cov}(X, \overline{X})}{\text{Var}(\overline{X})}$. This is typically not known in advance, but it can itself be estimated by random sampling (you can either sample separately, or start with some hand-picked initial value and improve it as you keep sampling).
If we don't have $\overline{f}$, we can build a local approximation to it from $\overline{\nabla f}$ around $x$. If $\Delta$ is unlikely to be large, using the first two terms of a Taylor series should be quite accurate.
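A minimal Python sketch of the scheme, assuming an SPSA-style $\pm$`pert` perturbation; the names `f`, `f_bar` and `grad_f_bar` are placeholders for your function, the candidate $\overline f$ and $\overline{\nabla f}$, and $c$ is estimated per component from the same batch of samples, which is one simple variant of the estimation strategies mentioned above:

```python
import numpy as np

def grad_control_variate(f, f_bar, grad_f_bar, x, pert=1e-2, num_samples=20, rng=None):
    """SPSA-style estimate of grad f(x), variance-reduced by the control variate
    built from the candidate f_bar, whose gradient grad_f_bar(x) is known exactly."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    gbar = grad_f_bar(x)                      # known mean of X_bar
    X = np.zeros((num_samples, n))
    Xbar = np.zeros((num_samples, n))
    for k in range(num_samples):
        delta = pert * rng.choice([-1.0, 1.0], size=n)
        X[k] = (f(x + delta) - f(x - delta)) / (2 * delta)
        Xbar[k] = (f_bar(x + delta) - f_bar(x - delta)) / (2 * delta)
    # Estimate the optimal coefficient c = -Cov(X, X_bar) / Var(X_bar), per component.
    cov = ((X - X.mean(0)) * (Xbar - Xbar.mean(0))).mean(0)
    c = -cov / (Xbar.var(0) + 1e-12)
    # Control-variate estimate: mean of X + c * (X_bar - grad_f_bar(x)).
    return (X + c * (Xbar - gbar)).mean(0)
```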
edited Nov 28 at 8:22
answered Nov 22 at 9:59
Todor Markov
I like where you are going, but I don't fully understand some things. When we are at $$\nabla f(x)=\Bbb E[X + c(\overline{X}-\overline{\nabla f})],$$ what is the thing we are computing? If I understand you correctly, you are saying that $X+c(\overline{X}-\overline{\nabla f})$ will be a better estimator for $\nabla f$ than $X$? In the case $\overline{\nabla f}=\nabla f$ (from which the optimal $c=-1$ follows) it is clear to me that this is a better estimator, so I am extrapolating from that.
– s.harp
Nov 23 at 10:36
As a further remark, the situation of the question is no longer relevant to me, but back then I did not have any approximation $\overline f$ of $f$ whose gradients were easy to compute. The situation was quite interesting actually: I had a machine with parameters $\theta$, outputs $y$ and intermediate parameters $z$, together with a model $y(\theta), z(\theta)$ for the machine and an optimisation goal $F(y)$. One uses the model to get expressions for the gradient involving $y, z, \theta$, into which one then plugs the actually realised values of the machine, retrieving a candidate.
– s.harp
Nov 23 at 10:44
The idea was that using actually observed behaviour together with "inspiration" from the model in tuning the parameters is more robust than using a pure model term, and far faster than using a generic optimisation method on the machine itself.
– s.harp
Nov 23 at 10:50
@s.harp What we're computing here, $X + c(\overline{X} - \overline{\nabla f})$, is a bit abstract, but you can think of it this way: $X$ (a function of $\Delta$) is an unbiased estimator for $\nabla f$, but for a particular realization of $\Delta$, $X$ will have an error. Similarly, $\overline{X}$ is an estimator of $\nabla \overline{f}$. But for $\overline{X}$ we also know the true value being estimated, so we can compute what correction is needed. We assume that the two errors are correlated, so we apply the same correction (times a coefficient) to $X$.
– Todor Markov
Nov 23 at 12:00
@s.harp The above is the idea. The specifics are then computed analytically: we simply pick $c$ to minimize the variance of $X+c(\overline{X}-\overline{\nabla f})$. The lower that variance is, the fewer samples we need to reach a desired accuracy. Note that $c=0$ simply gives $X$, so the optimal $c$ can't possibly do worse than $X$. How much better it does depends on the correlation between $X$ and $\overline{X}$.
– Todor Markov
Nov 23 at 12:02