Relation between Shannon Entropy and Total Variation distance












Let $p_1(\cdot), p_2(\cdot)$ be two discrete distributions on $\mathbb{Z}$. The total variation distance is defined as $d_{TV}(p_1,p_2)= \frac{1}{2} \displaystyle\sum_{k \in \mathbb{Z}}|p_1(k)-p_2(k)|$ and the Shannon entropy is defined the usual way, i.e.
$$
H(p_1)=\sum_k p_1(k) \log\left(\frac{1}{p_1(k)}\right).
$$
The binary entropy function $h(\cdot)$ is defined by $h(x)=x \log(1/x)+(1-x)\log(1/(1-x))$ for all $x \in (0,1)$.

I am trying to prove that $H\left(\frac{p_1+p_2}{2}\right)-\frac{1}{2}H(p_1)-\frac{1}{2}H(p_2) \leq h\left(d_{TV}(p_1,p_2)/2\right)$. Can anyone guide me in this direction?










probability-theory measure-theory information-theory






asked Oct 15 '15 at 23:27 by pikachuchameleon · edited Oct 17 '15 at 2:41
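For a quick sanity check of the inequality in the question, both sides can be compared numerically on random finite-support distributions. The following is only a minimal sketch (plain Python, natural logarithms; the support sizes and the number of trials are arbitrary choices):

    import math
    import random

    def entropy(p):
        """Shannon entropy (natural log) of a probability vector."""
        return sum(-x * math.log(x) for x in p if x > 0)

    def tv(p, q):
        """Total variation distance, (1/2) * sum_k |p_k - q_k|."""
        return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

    def h(x):
        """Binary entropy function, with h(0) = h(1) = 0."""
        return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

    def random_dist(n):
        w = [random.random() for _ in range(n)]
        s = sum(w)
        return [x / s for x in w]

    random.seed(0)
    for _ in range(10000):                      # number of trials: arbitrary
        n = random.randint(2, 8)                # support size: arbitrary
        p1, p2 = random_dist(n), random_dist(n)
        mix = [(a + b) / 2 for a, b in zip(p1, p2)]
        lhs = entropy(mix) - 0.5 * entropy(p1) - 0.5 * entropy(p2)
        rhs = h(tv(p1, p2) / 2)
        assert lhs <= rhs + 1e-12, (p1, p2)
    print("no counterexample found")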












  • Out of curiosity, where did that question arise?
    – Clement C. Oct 15 '15 at 23:47










  • I would write of a function $h$ rather than of a function $h(\cdot)$, reserving the parentheses to express a value of the function at some argument, as in “$h(x)=\text{some expression depending on }x$”. However, if you feel strongly that you need the parentheses, the proper notation is $h(\cdot)$ rather than $h(.)$. I edited accordingly.
    – Michael Hardy Oct 15 '15 at 23:53












  • @ClementC.: The exact problem statement is as follows: $X \sim \mathrm{Bern}(0.5)$, $\mathbb{P}(Y=k \mid X=0)=p_1(k)$, $\mathbb{P}(Y=k \mid X=1)=p_2(k)$. I am trying to prove $I(X;Y) \leq h(d_{TV}(p_1,p_2))$.
    – pikachuchameleon Oct 16 '15 at 2:37












  • @AshokVardhan I am deleting my previous comments, since they are no longer relevant to the question after the correction/edit you made. On a side note, I wonder if looking at the other expression of TV, namely $\sup_S (p_1(S) - p_2(S))$, would help as a first step.
    – Clement C. Oct 17 '15 at 14:22








  • Without some assumptions on the entropies of $p_1,p_2$, it seems that what you are trying to prove may lead into trouble, because the right-hand side of the inequality, namely $h(d_{TV}(p_1,p_2)/2)$, is always finite (clearly $d_{TV}(p_1,p_2)\leq 1$), but the left-hand side can be infinite. For example, take $p_1$ with $H(p_1)=\infty$. Now choose a second distribution $p_2$ for which $H(p_2)$ is finite. You get $$\infty-\frac{1}{2}\infty-\frac{1}{2}H(p_2)\leq C$$ for some positive number $C>0$. This does not make much sense.
    – uniquesolution Oct 23 '15 at 21:44
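A standard example that makes the comment's hypothesis concrete: the distribution $p_1(k) = \frac{c}{k \log^2 k}$ on $k \ge 2$, with $c$ the normalizing constant, has infinite Shannon entropy. Indeed, $\log\frac{1}{p_1(k)} = \log k + 2\log\log k - \log c \ge \log k - C$ for some constant $C$ and all $k \ge 2$, so that
$$
H(p_1) \;\ge\; \sum_{k \ge 2} \frac{c}{k \log^2 k}\,(\log k - C)
\;=\; c\sum_{k \ge 2} \frac{1}{k \log k} \;-\; C \sum_{k \ge 2} \frac{c}{k \log^2 k} \;=\; \infty,
$$
because the first series diverges while the second converges.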




















1 Answer






The second part below is rather inelegant, and I think it can be improved. Suggestions are welcome.





Note that the LHS is the Jensen-Shannon divergence ($\mathrm{JSD}$) between $P_1$ and $P_2$, and that $\mathrm{JSD}$ is an $f$-divergence. For $f$-divergences generated by $f, g$, the joint ranges of $D_f, D_g$ are defined as
\begin{align}
\mathbb{R}^2 \supset \mathcal{R} :=& \{ (D_f(P\|Q), D_g(P\|Q)): P, Q \textrm{ are distributions on some measurable space} \} \\
\mathcal{R}_k :=& \{ (D_f(P\|Q), D_g(P\|Q)): P, Q \textrm{ are distributions on } ([1:k], 2^{[1:k]} ) \}
\end{align}
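As a concrete illustration of the $f$-divergence framing, both $d_{TV}$ and $\mathrm{JSD}$ can be written as $D_f(P\|Q) = \sum_k q_k\, f(p_k/q_k)$ and checked against their direct definitions. The sketch below is only illustrative; the generators used, $f_{TV}(t) = \tfrac12|t-1|$ and $f_{JSD}(t) = \tfrac{t}{2}\log\tfrac{2t}{1+t} + \tfrac12\log\tfrac{2}{1+t}$, are one standard choice, and the example distributions are arbitrary.

    import math

    def f_div(p, q, f):
        """D_f(P || Q) = sum_k q_k * f(p_k / q_k); assumes q_k > 0 for all k."""
        return sum(qk * f(pk / qk) for pk, qk in zip(p, q))

    def f_tv(t):
        # generator of total variation: f(t) = |t - 1| / 2
        return 0.5 * abs(t - 1)

    def f_jsd(t):
        # one common generator of the Jensen-Shannon divergence (natural logs)
        if t == 0:
            return 0.5 * math.log(2)
        return 0.5 * t * math.log(2 * t / (1 + t)) + 0.5 * math.log(2 / (1 + t))

    # arbitrary example distributions on four symbols
    p = [0.1, 0.2, 0.3, 0.4]
    q = [0.4, 0.3, 0.2, 0.1]

    # direct definitions, for comparison
    tv_direct = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    jsd_direct = 0.5 * kl(p, m) + 0.5 * kl(q, m)

    print(abs(f_div(p, q, f_tv) - tv_direct))    # should be ~0
    print(abs(f_div(p, q, f_jsd) - jsd_direct))  # should be ~0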



A remarkable theorem of Harremoës & Vajda (see also these notes by Wu) states that for any pair of $f$-divergences, $$\mathcal{R} = \mathrm{co}(\mathcal{R}_2),$$ where $\mathrm{co}$ is the convex hull operator.



Now, we want to show the relation $\mathrm{JSD} \le h(d_{TV}/2).$ Since both $\mathrm{JSD}$ and $d_{TV}$ are $f$-divergences, and since the set $\mathcal{S} := \{ (x,y) : y - h(x/2) \le 0 \}$ is convex in $\mathbb{R}^2$ (it is the region below the graph of the concave function $x \mapsto h(x/2)$), it suffices to show this inequality for distributions on $2$ symbols: by the convexity we have $\mathcal{R}_2 \subset \mathcal{S} \implies \mathrm{co}(\mathcal{R}_2) \subset \mathcal{S},$ as the convex hull of a set is the intersection of all convex sets containing it. The remainder of this answer will thus concentrate on showing $\mathcal{R}_2 \subset \mathcal{S}$.





Let $p := \pi + \delta,\ q := \pi - \delta,$ where $\delta \in [0,1/2]$ and $\pi \in [\delta, 1- \delta].$ We will show that $$ \mathrm{JSD}(\mathrm{Bern}(p)\,\|\,\mathrm{Bern}(q) ) \le h\left(\frac{1}{2}d_{TV}(\mathrm{Bern}(p)\,\|\,\mathrm{Bern}(q) )\right) = h(\delta), \tag{1}$$ which suffices to show the relation on $2$-letter distributions, since $d_{TV}(\mathrm{Bern}(p)\,\|\,\mathrm{Bern}(q)) = \frac{1}{2}\big(|p-q| + |(1-p)-(1-q)|\big) = 2\delta$. Note that above $p\ge q$ always, but this doesn't matter since both $\mathrm{JSD}$ and $d_{TV}$ are symmetric in their arguments.



For conciseness I'll represent the $\mathrm{JSD}$ above by $J$. All '$\log$'s in the following are natural, and we will make use of the simple identities, for $p \in (0,1)$: $$ \frac{\mathrm{d}}{\mathrm{d}p} h(p) = \log \frac{1-p}{p}, \qquad \frac{\mathrm{d}^2}{\mathrm{d}p^2} h(p) = -\frac{1}{p} - \frac{1}{1-p}. $$
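These identities are easy to confirm numerically; a minimal sketch using central finite differences (the step size and sample points are arbitrary choices):

    import math

    def h(p):
        """Binary entropy with natural logarithms."""
        return p * math.log(1 / p) + (1 - p) * math.log(1 / (1 - p))

    eps = 1e-5
    for p in (0.1, 0.25, 0.5, 0.7, 0.9):                      # arbitrary points in (0,1)
        d1 = (h(p + eps) - h(p - eps)) / (2 * eps)            # central difference for h'
        d2 = (h(p + eps) - 2 * h(p) + h(p - eps)) / eps**2    # central difference for h''
        assert abs(d1 - math.log((1 - p) / p)) < 1e-6
        assert abs(d2 - (-1 / p - 1 / (1 - p))) < 1e-3
    print("derivative identities confirmed at the sample points")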



By the expansion in the question (the equal mixture of $\mathrm{Bern}(\pi+\delta)$ and $\mathrm{Bern}(\pi-\delta)$ is $\mathrm{Bern}(\pi)$), $$J(\pi, \delta) = h( \pi) - \frac{1}{2} h(\pi + \delta) - \frac{1}{2} h(\pi - \delta).$$



It is trivial to see that the relation $(1)$ holds if $\delta = 0$. Let us thus assume that $\delta > 0.$ For $\pi \in (\delta, 1-\delta),$ we have



\begin{align} \frac{\partial}{\partial \pi} J &= \log \frac{1-\pi}{\pi} - \frac{1}{2} \left( \log \frac{1 - \pi - \delta}{\pi + \delta} + \log \frac{1 - \pi +\delta}{\pi - \delta}\right) \end{align} and \begin{align} \frac{\partial^2}{\partial \pi^2} J &= \frac{1}{2} \left( \frac{1}{\pi + \delta} + \frac{1}{\pi - \delta} + \frac{1}{1 - \pi - \delta} + \frac{1}{1 - \pi + \delta} \right) - \frac{1}{\pi} - \frac{1}{1-\pi} \\
&= \frac{\pi}{\pi^2 - \delta^2} - \frac{1}{\pi} + \frac{1 - \pi}{( 1-\pi)^2 - \delta^2} - \frac{1}{1-\pi} \\
&= \frac{\delta^2}{\pi(\pi^2 - \delta^2)} + \frac{\delta^2}{(1-\pi)( (1-\pi)^2 - \delta^2)} > 0,
\end{align}



where the final inequality uses $\delta > 0,$ and that $\pi \in (\delta, 1-\delta).$



As a consequence, for every fixed $\delta >0,$ $J$ is strictly convex in $\pi$ on $(\delta, 1-\delta).$ Since $J(\cdot, \delta)$ extends continuously to $[\delta, 1-\delta]$ and the maximum of a continuous convex function on a closed interval is attained at an endpoint, we have $$ J(\pi ,\delta) \le \max( J(\delta, \delta), J(1- \delta, \delta) ).$$



But $$J(\delta, \delta) = h(\delta) - \frac{1}{2} (h(2\delta) + h(0) ) = h(\delta) - \frac{1}{2} h(2\delta),$$ and similarly $$J(1-\delta, \delta) = h(\delta) - \frac{1}{2} h(1-2\delta) = h(\delta) - \frac{1}{2} h(2\delta),$$ by the symmetry of $h$. We immediately get that for every $\delta \in [0,1/2]$ and $\pi \in [\delta, 1-\delta],$ $$J(\pi, \delta) \le h(\delta) - \frac{1}{2} h(2\delta) \le h(\delta),$$ finishing the argument.





Note that the last line indicates something stronger for $2$-symbol distributions: $J(\pi, \delta) \le h(\delta) - h(2\delta)/2$. Unfortunately the RHS is a convex function of $\delta$, so this doesn't directly extend to all alphabets. It'd be interesting if a bound that has such an advantage can be shown in general.
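For what it's worth, the two-symbol chain $J(\pi, \delta) \le h(\delta) - \tfrac{1}{2}h(2\delta) \le h(\delta)$ above can also be spot-checked numerically; a minimal sketch (natural logarithms; the grid resolution is an arbitrary choice):

    import math

    def h(x):
        """Binary entropy, natural log, with h(0) = h(1) = 0."""
        return 0.0 if x <= 0.0 or x >= 1.0 else -x * math.log(x) - (1 - x) * math.log(1 - x)

    def J(pi, delta):
        """J(pi, delta) = h(pi) - h(pi + delta)/2 - h(pi - delta)/2."""
        return h(pi) - 0.5 * h(pi + delta) - 0.5 * h(pi - delta)

    N = 400                                    # grid resolution: arbitrary
    for i in range(N + 1):
        delta = 0.5 * i / N
        for j in range(N + 1):
            pi = delta + (1 - 2 * delta) * j / N
            assert J(pi, delta) <= h(delta) - 0.5 * h(2 * delta) + 1e-12
            assert h(delta) - 0.5 * h(2 * delta) <= h(delta) + 1e-12
    print("two-symbol bound holds on the grid")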






answered Dec 10 '18 at 20:45 by stochasticboy321 · edited Dec 10 '18 at 21:50





























