Gradient estimates using a candidate











Suppose one has a differentiable function $f:\Bbb R^n\to\Bbb R$ that one can evaluate but for which one has no expression for the derivative. There exist several procedures for obtaining an estimate of the gradient; for example, one could evaluate
$$\nabla f(x)\approx\sum_i \frac{f(x+\epsilon\, e_i)-f(x-\epsilon\, e_i)}{2\epsilon}\, e_i,\tag{1}$$
or
$$\nabla f(x) \approx \left\langle\sum_i \frac{f(x+\Delta)-f(x-\Delta)}{2\Delta_i}\,e_i\right\rangle_{p(\Delta)},\tag{2}$$
where the brackets denote the expectation value in which $\Delta$ is drawn from some symmetric probability distribution $p(\Delta)$ centered around $0$ (this is known as SPSA).



These methods are generic in the sense that for $\epsilon\to0$ or $\mathrm{Var}(p(\Delta))\to0$ they will always converge to the gradient of $f$. However, they may be slow; for example, $(1)$ requires $2n$ function evaluations per gradient estimate, which becomes expensive in high dimensions.
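
For concreteness, here is a minimal Python sketch of both estimators; the toy function `f`, the step size `eps`, and the perturbation scale `scale` are arbitrary illustrative choices.

```python
import numpy as np

def central_difference_grad(f, x, eps=1e-5):
    """Estimator (1): coordinate-wise central differences, costing 2n evaluations of f."""
    n = len(x)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        g[i] = (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    return g

def spsa_grad(f, x, n_samples=10, scale=1e-3, seed=0):
    """Estimator (2): SPSA, averaging over symmetric random perturbations.
    Each sample costs only 2 evaluations of f, independent of the dimension."""
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = []
    for _ in range(n_samples):
        delta = scale * rng.choice([-1.0, 1.0], size=n)  # symmetric perturbation
        df = f(x + delta) - f(x - delta)
        estimates.append(df / (2.0 * delta))              # componentwise division
    return np.mean(estimates, axis=0)

# toy example: f(x) = sum(x_i^2), true gradient 2x
f = lambda x: np.sum(x**2)
x = np.array([1.0, -2.0, 0.5])
print(central_difference_grad(f, x))  # approximately [ 2., -4.,  1.]
print(spsa_grad(f, x))                # noisy estimate of the same gradient
```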



Sometimes one already has a candidate function $\overline{\nabla f}$ for $\nabla f$. This candidate will in general be wrong, but may, for example, lie in the correct quadrant. My question:




Is there a method for estimating the gradient of $f$ that uses the candidate $\overline{\nabla f}$ as a suggestion?




The expected benefit of using $\overline{\nabla f}$ would be a speed-up of the procedure. I'm sure such a thing exists, but I can't find any literature because I do not know what to search for.



I'm not sure if math stack exchange is the right place to ask such a question, so I would also appreciate it if you could point me in a direction in which I am more likely to get an answer.










Tags: optimization, numerical-methods, numerical-optimization, gradient-descent






asked May 17 '17 at 11:46 by s.harp












  • You mean like using the SVD to determine the most important directions and then using Broyden updates to improve the Jacobian in these directions?
    – LutzL
    May 17 '17 at 12:00










  • I'm thinking concretely about situations where one has a model which gives you such gradient candidates, for example via an approximation of $f$ by some function $\bar f$ which has a tractable differential. I don't know for sure what the Broyden method is, but it seems from what I just read that it is an update method for gradient descent that uses something like inertial terms to get better gradient estimates.
    – s.harp
    May 17 '17 at 12:47










  • What about some kind of importance sampling, with bias based on $\overline{\nabla f}$?
    – user3658307
    May 17 '17 at 21:37

1 Answer
For the case where $\overline{\nabla f} = \nabla \overline{f}$ for some $\overline{f}$, you can use control variates.



In particular, let $\Delta$ be drawn from a symmetric probability distribution $p(\Delta)$, and let
$$X = \sum_i \frac{f(x+\Delta)-f(x-\Delta)}{2\Delta_i}\,e_i,$$
$$\overline{X} = \sum_i \frac{\overline{f}(x+\Delta)-\overline{f}(x-\Delta)}{2\Delta_i}\,e_i.$$



If $\overline{\nabla f}$ is a good candidate, $X$ and $\overline{X}$ will be correlated, and we can exploit that.



Knowing that $\mathbb{E}[\overline{X}] = \overline{\nabla f}$, for any constant $c$ we can write



$$\nabla f(x) = \mathbb{E}[X] = \mathbb{E}[X + c(\overline{X} - \overline{\nabla f})].$$



This effectively adds a correction to each of your Monte Carlo terms to make it more accurate with respect to the candidate. If the candidate is correlated with the actual gradient, the real accuracy will also improve. This means you need fewer samples to get the same level of accuracy.
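
A minimal sketch of such a corrected estimator, assuming access to a surrogate $\overline f$ together with its gradient $\overline{\nabla f}$; the toy functions and the default coefficient $c$ below are illustrative placeholders:

```python
import numpy as np

def cv_spsa_grad(f, f_bar, grad_bar, x, c=-1.0, n_samples=10, scale=1e-3, seed=1):
    """SPSA estimate of grad f(x), corrected with the control variate X_bar
    built from the surrogate f_bar.  Each sample contributes
    X + c * (X_bar - grad_bar(x)); the correction has (approximately) zero mean,
    so the estimate is unchanged in expectation, while its variance shrinks
    whenever X and X_bar are correlated."""
    rng = np.random.default_rng(seed)
    n = len(x)
    gb = grad_bar(x)                       # known expectation of the control variate
    estimates = []
    for _ in range(n_samples):
        delta = scale * rng.choice([-1.0, 1.0], size=n)
        X = (f(x + delta) - f(x - delta)) / (2.0 * delta)
        X_bar = (f_bar(x + delta) - f_bar(x - delta)) / (2.0 * delta)
        estimates.append(X + c * (X_bar - gb))
    return np.mean(estimates, axis=0)

# toy example: f is the "black box", f_bar is a crude model of it
f = lambda x: np.sum(x**2) + 0.1 * np.sum(np.sin(5.0 * x))
f_bar = lambda x: np.sum(x**2)
grad_bar = lambda x: 2.0 * x
x = np.array([1.0, -2.0, 0.5])
print(cv_spsa_grad(f, f_bar, grad_bar, x))
```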



The optimal choice for the constant is $c = -\frac{\text{Cov}(X, \overline{X})}{\text{Var}(\overline{X})}$. This is typically not known in advance, but can itself be estimated using random sampling (you can either sample separately, or you can start with some hand-picked initial value and improve it as you keep sampling).
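
For a vector-valued estimate, one reasonable reading is a separate coefficient per component; here is a small sketch of estimating it from a pilot batch of paired samples of $X$ and $\overline{X}$ (the random arrays below merely stand in for such samples):

```python
import numpy as np

def estimate_c(X_samples, Xbar_samples):
    """Componentwise c_i = -Cov(X_i, Xbar_i) / Var(Xbar_i),
    estimated from paired pilot samples of shape (m, n)."""
    Xc = X_samples - X_samples.mean(axis=0)
    Xbc = Xbar_samples - Xbar_samples.mean(axis=0)
    cov = (Xc * Xbc).mean(axis=0)
    var = (Xbc ** 2).mean(axis=0)
    return -cov / var

# synthetic pilot batch: Xbar is X plus independent noise, so they are strongly correlated
rng = np.random.default_rng(2)
X_samples = rng.normal(size=(50, 3))
Xbar_samples = X_samples + 0.3 * rng.normal(size=(50, 3))
print(estimate_c(X_samples, Xbar_samples))   # entries close to -1
```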



If we don't have $\overline{f}$, we can try to use a local approximation to it using $\overline{\nabla f}$ around $x$. If $\Delta$ is unlikely to be large, using the first two terms of a Taylor series should be quite accurate.
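
Concretely, under the two-term Taylor model $\overline f(x\pm\Delta)\approx \overline f(x)\pm\overline{\nabla f}(x)\cdot\Delta$, the control variate needs no surrogate function evaluations at all,
$$\overline{X} \approx \sum_i \frac{\overline{\nabla f}(x)\cdot\Delta}{\Delta_i}\,e_i,$$
and for the usual SPSA perturbations (e.g. independent, symmetric $\pm$ components of $\Delta$) this still has expectation $\overline{\nabla f}(x)$.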






answered Nov 22 at 9:59 by Todor Markov
  • I like where you are going, but I don't fully understand some things. When we are at $$\nabla f(x)=\Bbb E[X + c(\overline{X}-\overline{\nabla f})]$$ what is the thing we are computing? If I understand you correctly you are saying that $X+c(\overline{X}-\overline{\nabla f})$ will be a better estimator for $\nabla f$ than $X$? In the case $\overline{\nabla f}=\nabla f$ (from which optimal $c=-1$ follows) it is clear to me that this is a better estimator, so I am extrapolating from that.
    – s.harp
    Nov 23 at 10:36












  • As a further remark, the situation of the question is no longer relevant to me, but back then I did not have any approximation $\overline f$ of $f$ whose gradients were easy to compute. The situation was quite interesting actually: I had a machine with parameters $\theta$, outputs $y$, and intermediate parameters $z$, together with a model $y(\theta), z(\theta)$ for the machine and an optimisation goal $F(y)$. One uses the model to get expressions for the gradient involving $y, z, \theta$, into which one then plugs the actually realised values of the machine, retrieving a candidate.
    – s.harp
    Nov 23 at 10:44










  • The idea was that using actually observed behaviour together with "inspiration" from the model in tuning the parameters is more robust than using a pure model term, and far faster than using a generic optimisation method on the machine itself.
    – s.harp
    Nov 23 at 10:50










  • @s.harp What we're computing here, $X + c(\overline{X} - \overline{\nabla f})$, is a bit abstract, but you can think of it that way: $X$ (a function of $\Delta$) is an unbiased estimator for $\nabla f$, but for a particular realization of $\Delta$, $X$ will have an error. Similarly, $\overline{X}$ is an estimator of $\nabla \overline{f}$. But for $\overline{X}$ we also know the true value being estimated, so we can compute what correction is needed. We assume that their errors are correlated, so we try to apply the same correction to $X$ (times a coefficient).
    – Todor Markov
    Nov 23 at 12:00










  • @s.harp The above is the idea. The specifics are then computed analytically: we simply pick $c$ to minimize the variance of $X+c(\overline{X}-\overline{\nabla f})$. The lower that variance is, the fewer samples we need to get a desired accuracy. Note that $c=0$ will simply give $X$, so the optimal $c$ can't possibly give anything worse than $X$. How much better it will be depends on the correlation between $X$ and $\overline{X}$.
    – Todor Markov
    Nov 23 at 12:02










