Is decision threshold a hyperparameter in logistic regression?












12














Predicted classes from (binary) logistic regression are determined by using a threshold on the class membership probabilities generated by the model. As I understand it, typically 0.5 is used by default.



But varying the threshold will change the predicted classifications. Does this mean the threshold is a hyperparameter? If so, why is it (for example) not possible to easily search over a grid of thresholds using scikit-learn's GridSearchCV method (as you would do for the regularisation parameter C)?
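For concreteness, here is a minimal sketch of the kind of grid search the question refers to: tuning the genuine hyperparameter C with GridSearchCV (the dataset and grid values are illustrative, not from the question).

```python
# Illustrative grid search over C, the regularisation hyperparameter
# of logistic regression, using cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)
search = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # the C value with the best cross-validated score
```

No analogous `threshold` entry can go in `param_grid`, which is what the question is probing.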


















  • 1
    "As I understand it, typically 0.5 is used by default." Depends on the meaning of the word "typical". In practice, no one should be doing this.
    – Matthew Drury
    Jan 31 at 17:27






  • 3
    Very much related: Classification probability threshold
    – Stephan Kolassa
    Jan 31 at 18:17










  • Strictly you don't mean logistic regression, you mean using one logistic regressor with a threshold for binary classification (you could also train one regressor for each of the two classes, with a little seeded randomness or weighting to avoid them being linearly dependent).
    – smci
    Jan 31 at 19:53























machine-learning logistic scikit-learn hyperparameter






asked Jan 31 at 17:17









Nick
886


























2 Answers


















11













The decision threshold creates a trade-off between the number of positives that you predict and the number of negatives that you predict -- because, tautologically, increasing the decision threshold will decrease the number of positives that you predict and increase the number of negatives that you predict.



The decision threshold is not a hyper-parameter in the sense of model tuning because it doesn't change the flexibility of the model.



The way you're thinking about the word "tune" in the context of the decision threshold is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN. However, the model remains the same, because this doesn't change the coefficients. (The same is true for models which do not have coefficients, such as random forests: changing the threshold doesn't change anything about the trees.) So in a narrow sense, you're correct that finding the best trade-off among errors is "tuning," but you're wrong in thinking that changing the threshold is linked to other model hyper-parameters in a way that is optimized by GridSearchCV.



Stated another way, changing the decision threshold reflects a choice on your part about how many False Positives and False Negatives that you want to have. Consider the hypothetical that you set the decision threshold to a completely implausible value like -1. All probabilities are non-negative, so with this threshold you will predict "positive" for every observation. From a certain perspective, this is great, because your false negative rate is 0.0. However, your false positive rate is also at the extreme of 1.0, so in that sense your choice of threshold at -1 is terrible.



The ideal, of course, is to have a TPR of 1.0, an FPR of 0.0, and an FNR of 0.0. But this is usually impossible in real-world applications, so the question then becomes "how much FPR am I willing to accept for how much TPR?" And this trade-off is the motivation for ROC curves.
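A minimal sketch of this point (dataset and threshold values are illustrative): sweeping the threshold changes which observations are predicted positive, but leaves the fitted coefficients untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)
coef_before = model.coef_.copy()

probs = model.predict_proba(X)[:, 1]
for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    # Raising the threshold can only shrink the set of predicted positives.
    print(threshold, preds.sum())

# Thresholding is post-hoc decision-making, not model fitting:
# the coefficients never changed.
assert np.array_equal(coef_before, model.coef_)
```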






edited Feb 1 at 15:34
answered Jan 31 at 17:27

Sycorax
40.7k













  • Thanks for the answer @Sycorax. You have almost convinced me. But can't we formalise the idea of "how much FPR am I willing to accept for how much TPR"? e.g. using a cost matrix. If we have a cost matrix, would it not be desirable to find the optimal threshold via tuning, as you would tune a hyperparameter? Or is there a better way to find the optimal threshold?
    – Nick
    Feb 1 at 8:32

  • 1
    The way you're using the word "tune" here is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN (but the model remains the same -- same coefficients, etc.). You're right that you want to find the best trade-off among errors, but you're wrong that such tuning takes place inside GridSearchCV.
    – Sycorax
    Feb 1 at 13:49

  • @Sycorax Aren't the threshold and the intercept (bias term) doing basically the same thing? I.e. you can keep the threshold fixed at 0.5 but change the intercept accordingly; this will "change the model" (as per your last comment) but will have the identical effect in terms of binary predictions. Is this correct? If so, I am not sure the strict distinction between "changing the model" and "changing the decision rule" is so meaningful in this case.
    – amoeba
    Feb 1 at 16:16

  • @amoeba This is a thought-provoking remark. I'll have to consider it. I suppose your suggestion amounts to "keep the threshold at 0.5 and treat the intercept as a hyperparameter, which you tune." There's nothing mathematically to stop you from doing this, except the observation that the model no longer maximizes its likelihood. But achieving the MLE may not be a priority in some specific context.
    – Sycorax
    Feb 1 at 16:26
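The equivalence amoeba describes can be checked directly: thresholding the predicted probability at t gives the same decision rule as subtracting logit(t) from the linear score and keeping the 0.5 cutoff. A small sketch (dataset and t are illustrative):

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=1)
model = LogisticRegression().fit(X, y)
t = 0.7

# Decision rule 1: threshold the predicted probability at t.
preds_threshold = (model.predict_proba(X)[:, 1] >= t).astype(int)

# Decision rule 2: shift the intercept by -logit(t), then threshold at 0.5.
scores = X @ model.coef_.ravel() + model.intercept_[0] - logit(t)
preds_shifted = (expit(scores) >= 0.5).astype(int)

# Both rules classify every observation identically.
assert np.array_equal(preds_threshold, preds_shifted)
```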





















9














But varying the threshold will change the predicted classifications. Does this mean the threshold is a hyperparameter?




Yup, it does, sorta. It's a hyperparameter of your decision rule, but not of the underlying regression.




If so, why is it (for example) not possible to easily search over a grid of thresholds using scikit-learn's GridSearchCV method (as you would do for the regularisation parameter C).




This is a design error in sklearn. The best practice for most classification scenarios is to fit the underlying model (which predicts probabilities) using some measure of the quality of these probabilities (like the log-loss in a logistic regression). Afterwards, a decision threshold on these probabilities should be tuned to optimize some business objective of your classification rule. The library should make it easy to optimize the decision threshold based on some measure of quality, but I don't believe it does that well.



I think this is one of the places sklearn got it wrong. The library includes a method, predict, on all classification models that thresholds at 0.5. This method is useless, and I strongly advocate for not ever invoking it. It's unfortunate that sklearn is not encouraging a better workflow.
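The workflow this answer recommends can be sketched as follows: fit the probability model with a proper scoring rule, then pick the threshold that minimizes a business cost on held-out data, rather than calling predict() with its hard-coded 0.5 cutoff. The cost matrix below is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Step 1: fit the probability model (fitting maximizes the log-likelihood).
model = LogisticRegression().fit(X_tr, y_tr)
probs = model.predict_proba(X_val)[:, 1]

# Step 2: hypothetical business costs -- a false negative hurts 5x more
# than a false positive.
COST_FP, COST_FN = 1.0, 5.0

def expected_cost(threshold):
    preds = (probs >= threshold).astype(int)
    fp = np.sum((preds == 1) & (y_val == 0))
    fn = np.sum((preds == 0) & (y_val == 1))
    return COST_FP * fp + COST_FN * fn

# Step 3: choose the threshold that minimizes the cost on held-out data.
thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=expected_cost)
print("best threshold:", best)
```

With asymmetric costs like these, the chosen cutoff will generally sit below 0.5, exactly the kind of tuning the default predict() hides.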



















  • I also share your skepticism of the predict method's default choice of 0.5 as a cutoff, but GridSearchCV accepts scorer objects which can tune models with respect to out-of-sample cross-entropy loss. Am I missing your point?
    – Sycorax
    Jan 31 at 17:32

  • Right, agreed that is best practice, but it doesn't encourage users to tune decision thresholds.
    – Matthew Drury
    Jan 31 at 17:32

  • Gotcha. I understand what you mean!
    – Sycorax
    Jan 31 at 17:33

  • 1
    @Sycorax tried to edit to clarify!
    – Matthew Drury
    Jan 31 at 17:35




































2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









11












$begingroup$

The decision threshold creates a trade-off between the number of positives that you predict and the number of negatives that you predict -- because, tautologically, increasing the decision threshold will decrease the number of positives that you predict and increase the number of negatives that you predict.



The decision threshold is not a hyper-parameter in the sense of model tuning because it doesn't change the flexibility of the model.



The way you're thinking about the word "tune" in the context of the decision threshold is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN. However, the model remains the same, because this doesn't change the coefficients. (The same is true for models which do not have coefficients, such as random forests: changing the threshold doesn't change anything about the trees.) So in a narrow sense, you're correct that finding the best trade-off among errors is "tuning," but you're wrong in thinking that changing the threshold is linked to other model hyper-parameters in a way that is optimized by GridSearchCV.



Stated another way, changing the decision threshold reflects a choice on your part about how many False Positives and False Negatives that you want to have. Consider the hypothetical that you set the decision threshold to a completely implausible value like -1. All probabilities are non-negative, so with this threshold you will predict "positive" for every observation. From a certain perspective, this is great, because your false negative rate is 0.0. However, your false positive rate is also at the extreme of 1.0, so in that sense your choice of threshold at -1 is terrible.



The ideal, of course, is to have a TPR of 1.0 and a FPR of 0.0 and a FNR of 0.0. But this is usually impossible in real-world applications, so the question then becomes "how much FPR am I willing to accept for how much TPR?" And this is the motivation of roc curves.






share|cite|improve this answer











$endgroup$













  • $begingroup$
    Thanks for the answer @Sycorax. You have almost convinced me. But can't we formalise the idea of "how much FPR am I willing to accept for how much TPR"? e.g. using a cost matrix. If we have a cost matrix then would it not be desirable to find the optimal threshold via tuning, as you would tune a hyperparameter? Or is there a better way to find the optimal threshold?
    $endgroup$
    – Nick
    Feb 1 at 8:32






  • 1




    $begingroup$
    The way you're using the word "tune" here is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN (but the model remains the same -- same coefficients, etc.). You're right, that you want to find the best trade-off among errors, but you're wrong that such tuning takes place inside GridSearchCV.
    $endgroup$
    – Sycorax
    Feb 1 at 13:49










  • $begingroup$
    @Sycorax Isn't the threshold and the intercept (bias term) doing basically the same thing? I.e. you can keep the threshold fixed at 0.5 but change the intercept accordingly; this will "change the model" (as per your last comment) but will have the identical effect in terms of binary predictions. Is this correct? If so, I am not sure the strict distinction between "changing the model" and "changing the decision rule" is so meaningful in this case.
    $endgroup$
    – amoeba
    Feb 1 at 16:16










  • $begingroup$
    @amoeba This is a though-provoking remark. I'll have to consider it. I suppose your suggestion amounts to "keep the threshold at 0.5 and treat the intercept as a hyperparameter, which you tune." There's nothing mathematically to stop you from doing this, except the observation that the model no longer maximizes its likelihood. But achieving the MLE may not be a priority in some specific context.
    $endgroup$
    – Sycorax
    Feb 1 at 16:26


















11












$begingroup$

The decision threshold creates a trade-off between the number of positives that you predict and the number of negatives that you predict -- because, tautologically, increasing the decision threshold will decrease the number of positives that you predict and increase the number of negatives that you predict.



The decision threshold is not a hyper-parameter in the sense of model tuning because it doesn't change the flexibility of the model.



The way you're thinking about the word "tune" in the context of the decision threshold is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN. However, the model remains the same, because this doesn't change the coefficients. (The same is true for models which do not have coefficients, such as random forests: changing the threshold doesn't change anything about the trees.) So in a narrow sense, you're correct that finding the best trade-off among errors is "tuning," but you're wrong in thinking that changing the threshold is linked to other model hyper-parameters in a way that is optimized by GridSearchCV.



Stated another way, changing the decision threshold reflects a choice on your part about how many False Positives and False Negatives that you want to have. Consider the hypothetical that you set the decision threshold to a completely implausible value like -1. All probabilities are non-negative, so with this threshold you will predict "positive" for every observation. From a certain perspective, this is great, because your false negative rate is 0.0. However, your false positive rate is also at the extreme of 1.0, so in that sense your choice of threshold at -1 is terrible.



The ideal, of course, is to have a TPR of 1.0 and a FPR of 0.0 and a FNR of 0.0. But this is usually impossible in real-world applications, so the question then becomes "how much FPR am I willing to accept for how much TPR?" And this is the motivation of roc curves.






share|cite|improve this answer











$endgroup$













  • $begingroup$
    Thanks for the answer @Sycorax. You have almost convinced me. But can't we formalise the idea of "how much FPR am I willing to accept for how much TPR"? e.g. using a cost matrix. If we have a cost matrix then would it not be desirable to find the optimal threshold via tuning, as you would tune a hyperparameter? Or is there a better way to find the optimal threshold?
    $endgroup$
    – Nick
    Feb 1 at 8:32






  • 1




    $begingroup$
    The way you're using the word "tune" here is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN (but the model remains the same -- same coefficients, etc.). You're right, that you want to find the best trade-off among errors, but you're wrong that such tuning takes place inside GridSearchCV.
    $endgroup$
    – Sycorax
    Feb 1 at 13:49










  • $begingroup$
    @Sycorax Isn't the threshold and the intercept (bias term) doing basically the same thing? I.e. you can keep the threshold fixed at 0.5 but change the intercept accordingly; this will "change the model" (as per your last comment) but will have the identical effect in terms of binary predictions. Is this correct? If so, I am not sure the strict distinction between "changing the model" and "changing the decision rule" is so meaningful in this case.
    $endgroup$
    – amoeba
    Feb 1 at 16:16










  • $begingroup$
    @amoeba This is a though-provoking remark. I'll have to consider it. I suppose your suggestion amounts to "keep the threshold at 0.5 and treat the intercept as a hyperparameter, which you tune." There's nothing mathematically to stop you from doing this, except the observation that the model no longer maximizes its likelihood. But achieving the MLE may not be a priority in some specific context.
    $endgroup$
    – Sycorax
    Feb 1 at 16:26
















11












11








11





$begingroup$

The decision threshold creates a trade-off between the number of positives that you predict and the number of negatives that you predict -- because, tautologically, increasing the decision threshold will decrease the number of positives that you predict and increase the number of negatives that you predict.



The decision threshold is not a hyper-parameter in the sense of model tuning because it doesn't change the flexibility of the model.



The way you're thinking about the word "tune" in the context of the decision threshold is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN. However, the model remains the same, because this doesn't change the coefficients. (The same is true for models which do not have coefficients, such as random forests: changing the threshold doesn't change anything about the trees.) So in a narrow sense, you're correct that finding the best trade-off among errors is "tuning," but you're wrong in thinking that changing the threshold is linked to other model hyper-parameters in a way that is optimized by GridSearchCV.



Stated another way, changing the decision threshold reflects a choice on your part about how many False Positives and False Negatives that you want to have. Consider the hypothetical that you set the decision threshold to a completely implausible value like -1. All probabilities are non-negative, so with this threshold you will predict "positive" for every observation. From a certain perspective, this is great, because your false negative rate is 0.0. However, your false positive rate is also at the extreme of 1.0, so in that sense your choice of threshold at -1 is terrible.



The ideal, of course, is to have a TPR of 1.0 and a FPR of 0.0 and a FNR of 0.0. But this is usually impossible in real-world applications, so the question then becomes "how much FPR am I willing to accept for how much TPR?" And this is the motivation of roc curves.






share|cite|improve this answer











$endgroup$



The decision threshold creates a trade-off between the number of positives that you predict and the number of negatives that you predict -- because, tautologically, increasing the decision threshold will decrease the number of positives that you predict and increase the number of negatives that you predict.



The decision threshold is not a hyper-parameter in the sense of model tuning because it doesn't change the flexibility of the model.



The way you're thinking about the word "tune" in the context of the decision threshold is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN. However, the model remains the same, because this doesn't change the coefficients. (The same is true for models which do not have coefficients, such as random forests: changing the threshold doesn't change anything about the trees.) So in a narrow sense, you're correct that finding the best trade-off among errors is "tuning," but you're wrong in thinking that changing the threshold is linked to other model hyper-parameters in a way that is optimized by GridSearchCV.



Stated another way, changing the decision threshold reflects a choice on your part about how many False Positives and False Negatives that you want to have. Consider the hypothetical that you set the decision threshold to a completely implausible value like -1. All probabilities are non-negative, so with this threshold you will predict "positive" for every observation. From a certain perspective, this is great, because your false negative rate is 0.0. However, your false positive rate is also at the extreme of 1.0, so in that sense your choice of threshold at -1 is terrible.



The ideal, of course, is to have a TPR of 1.0 and a FPR of 0.0 and a FNR of 0.0. But this is usually impossible in real-world applications, so the question then becomes "how much FPR am I willing to accept for how much TPR?" And this is the motivation of roc curves.







share|cite|improve this answer














share|cite|improve this answer



share|cite|improve this answer








edited Feb 1 at 15:34

























answered Jan 31 at 17:27









SycoraxSycorax

40.7k12104204




40.7k12104204












  • $begingroup$
    Thanks for the answer @Sycorax. You have almost convinced me. But can't we formalise the idea of "how much FPR am I willing to accept for how much TPR"? e.g. using a cost matrix. If we have a cost matrix then would it not be desirable to find the optimal threshold via tuning, as you would tune a hyperparameter? Or is there a better way to find the optimal threshold?
    $endgroup$
    – Nick
    Feb 1 at 8:32






  • 1




    $begingroup$
    The way you're using the word "tune" here is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN (but the model remains the same -- same coefficients, etc.). You're right, that you want to find the best trade-off among errors, but you're wrong that such tuning takes place inside GridSearchCV.
    $endgroup$
    – Sycorax
    Feb 1 at 13:49










  • $begingroup$
    @Sycorax Isn't the threshold and the intercept (bias term) doing basically the same thing? I.e. you can keep the threshold fixed at 0.5 but change the intercept accordingly; this will "change the model" (as per your last comment) but will have the identical effect in terms of binary predictions. Is this correct? If so, I am not sure the strict distinction between "changing the model" and "changing the decision rule" is so meaningful in this case.
    $endgroup$
    – amoeba
    Feb 1 at 16:16










  • $begingroup$
    @amoeba This is a though-provoking remark. I'll have to consider it. I suppose your suggestion amounts to "keep the threshold at 0.5 and treat the intercept as a hyperparameter, which you tune." There's nothing mathematically to stop you from doing this, except the observation that the model no longer maximizes its likelihood. But achieving the MLE may not be a priority in some specific context.
    $endgroup$
    – Sycorax
    Feb 1 at 16:26




















  • $begingroup$
    Thanks for the answer @Sycorax. You have almost convinced me. But can't we formalise the idea of "how much FPR am I willing to accept for how much TPR"? e.g. using a cost matrix. If we have a cost matrix then would it not be desirable to find the optimal threshold via tuning, as you would tune a hyperparameter? Or is there a better way to find the optimal threshold?
    $endgroup$
    – Nick
    Feb 1 at 8:32






  • 1




    $begingroup$
    The way you're using the word "tune" here is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN (but the model remains the same -- same coefficients, etc.). You're right, that you want to find the best trade-off among errors, but you're wrong that such tuning takes place inside GridSearchCV.
    $endgroup$
    – Sycorax
    Feb 1 at 13:49










  • $begingroup$
    @Sycorax Isn't the threshold and the intercept (bias term) doing basically the same thing? I.e. you can keep the threshold fixed at 0.5 but change the intercept accordingly; this will "change the model" (as per your last comment) but will have the identical effect in terms of binary predictions. Is this correct? If so, I am not sure the strict distinction between "changing the model" and "changing the decision rule" is so meaningful in this case.
    $endgroup$
    – amoeba
    Feb 1 at 16:16
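The equivalence amoeba describes can be checked numerically. Since the sigmoid is monotone, $\sigma(z) \ge t \iff z \ge \operatorname{logit}(t)$, so thresholding the probability at $t$ is the same decision rule as shifting the intercept by $-\operatorname{logit}(t)$ and thresholding at 0.5. A small sketch with simulated scores (names illustrative):

```python
import numpy as np

def logit(t):
    return np.log(t / (1 - t))

rng = np.random.default_rng(0)
z = rng.normal(size=1000)          # linear scores w.x + b
p = 1 / (1 + np.exp(-z))           # predicted probabilities sigmoid(z)
t = 0.3

rule_threshold = p >= t                    # move the threshold off 0.5
rule_intercept = (z - logit(t)) >= 0.0     # or move the intercept instead
assert np.array_equal(rule_threshold, rule_intercept)
```

The two rules give identical binary predictions, which supports amoeba's point that the model/decision-rule distinction is blurry here.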










  • $begingroup$
    @amoeba This is a thought-provoking remark. I'll have to consider it. I suppose your suggestion amounts to "keep the threshold at 0.5 and treat the intercept as a hyperparameter, which you tune." There's nothing mathematically to stop you from doing this, except the observation that the model no longer maximizes its likelihood. But achieving the MLE may not be a priority in some specific context.
    $endgroup$
    – Sycorax
    Feb 1 at 16:26


















9












$begingroup$


But varying the threshold will change the predicted classifications. Does this mean the threshold is a hyperparameter?




Yup, it does, sorta. It's a hyperparameter of your decision rule, but not the underlying regression.




If so, why is it (for example) not possible to easily search over a grid of thresholds using scikit-learn's GridSearchCV method (as you would do for the regularisation parameter C).




This is a design error in sklearn. The best practice for most classification scenarios is to fit the underlying model (which predicts probabilities) using some measure of the quality of these probabilities (like the log-loss in a logistic regression). Afterwards, a decision threshold on these probabilities should be tuned to optimize some business objective of your classification rule. The library should make it easy to optimize the decision threshold based on some measure of quality, but I don't believe it does that well.



I think this is one of the places sklearn got it wrong. The library includes a method, predict, on all classification models that thresholds at 0.5. This method is useless, and I strongly advocate for not ever invoking it. It's unfortunate that sklearn is not encouraging a better workflow.
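The workflow the answer recommends (fit on probabilities, then choose the cutoff separately on held-out data) can be sketched roughly as follows; F1 stands in for whatever business metric you actually care about, and the dataset is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Step 1: fit the probability model (maximizes the likelihood / log-loss).
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]   # use predict_proba, not predict()

# Step 2: tune the decision threshold on held-out data, after fitting.
thresholds = np.linspace(0.05, 0.95, 19)
best_t = max(thresholds, key=lambda t: f1_score(y_val, proba >= t))
```

On an imbalanced problem like this, the chosen cutoff is typically well away from the 0.5 that `predict` hard-codes.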

















$endgroup$













  • $begingroup$
    I also share your skepticism of the predict method's default choice of 0.5 as a cutoff, but GridSearchCV accepts scorer objects which can tune models with respect to out-of-sample cross-entropy loss. Am I missing your point?
    $endgroup$
    – Sycorax
    Jan 31 at 17:32
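What this comment points at can be sketched concretely: GridSearchCV accepts a scoring string (or scorer object), so the regularisation strength $C$ can be tuned against out-of-sample log-loss. Note, per the thread, that this tunes the model's hyperparameters, not the decision threshold (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Score candidate models by held-out cross-entropy, not accuracy at 0.5.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                      scoring="neg_log_loss", cv=5)
search.fit(X, y)
```

The selected `search.best_params_["C"]` reflects probability quality; thresholding remains a separate, downstream step.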












  • $begingroup$
    Right, agreed that is best practice, but it doesn't encourage users to tune decision thresholds.
    $endgroup$
    – Matthew Drury
    Jan 31 at 17:32










  • $begingroup$
    Gotcha. I understand what you mean!
    $endgroup$
    – Sycorax
    Jan 31 at 17:33






  • 1




    $begingroup$
    @Sycorax tried to edit to clarify!
    $endgroup$
    – Matthew Drury
    Jan 31 at 17:35
















edited Feb 1 at 16:52

























answered Jan 31 at 17:28









Matthew Drury






























