Is decision threshold a hyperparameter in logistic regression?
Predicted classes from (binary) logistic regression are determined by applying a threshold to the class-membership probabilities generated by the model. As I understand it, 0.5 is typically used by default.
But varying the threshold will change the predicted classifications. Does this mean the threshold is a hyperparameter? If so, why is it (for example) not possible to easily search over a grid of thresholds using scikit-learn's GridSearchCV method (as you would do for the regularisation parameter C)?
machine-learning logistic scikit-learn hyperparameter
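For context, a minimal sketch (assuming scikit-learn is installed, with synthetic data) of the behaviour the question describes: `predict` is just `predict_proba` thresholded at a fixed 0.5, and that cutoff is not among the estimator's tunable parameters.

```python
# Sketch: LogisticRegression.predict is predict_proba thresholded at 0.5;
# the threshold is not part of the fitted model, so GridSearchCV never sees it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(C=1.0).fit(X, y)

default = clf.predict(X)
manual = (clf.predict_proba(X)[:, 1] > 0.5).astype(int)
assert np.array_equal(default, manual)  # predict() == "probability > 0.5"

# A different threshold changes the predicted labels but not the model:
stricter = (clf.predict_proba(X)[:, 1] > 0.8).astype(int)
```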
"As I understand it, typically 0.5 is used by default." Depends on the meaning of the word "typical". In practice, no one should be doing this.
– Matthew Drury, Jan 31 at 17:27
Very much related: Classification probability threshold
– Stephan Kolassa, Jan 31 at 18:17
Strictly you don't mean logistic regression; you mean using one logistic regressor with a threshold for binary classification (you could also train one regressor for each of the two classes, with a little seeded randomness or weighting to avoid them being linearly dependent).
– smci, Jan 31 at 19:53
asked Jan 31 at 17:17 by Nick
2 Answers
The decision threshold creates a trade-off between the number of positives that you predict and the number of negatives that you predict, because, tautologically, increasing the decision threshold will decrease the number of positives and increase the number of negatives that you predict.
The decision threshold is not a hyper-parameter in the sense of model tuning, because it doesn't change the flexibility of the model.
The way you're thinking about the word "tune" in the context of the decision threshold is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN. However, the model remains the same, because this doesn't change the coefficients. (The same is true for models which do not have coefficients, such as random forests: changing the threshold doesn't change anything about the trees.) So in a narrow sense, you're correct that finding the best trade-off among errors is "tuning", but you're wrong in thinking that changing the threshold is linked to other model hyper-parameters in a way that is optimized by GridSearchCV.
Stated another way, changing the decision threshold reflects a choice on your part about how many false positives and false negatives you want to have. Consider the hypothetical that you set the decision threshold to a completely implausible value like -1. All probabilities are non-negative, so with this threshold you will predict "positive" for every observation. From a certain perspective this is great, because your false negative rate is 0.0. However, your false positive rate is also at the extreme of 1.0, so in that sense your choice of threshold at -1 is terrible.
The ideal, of course, is to have a TPR of 1.0, an FPR of 0.0, and an FNR of 0.0. But this is usually impossible in real-world applications, so the question then becomes "how much FPR am I willing to accept for how much TPR?" And this is the motivation of ROC curves.
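The trade-off can be made concrete with a small sweep over thresholds. The sketch below uses synthetic scores as a stand-in for model probabilities; the point is that raising the threshold over fixed predictions can only move TPR and FPR down together, without touching the model.

```python
# Illustrative sketch: sweeping the threshold over fixed predicted
# probabilities only trades TPR against FPR; the model itself is unchanged.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# stand-in for model probabilities: noisy but informative scores
p = np.clip(0.5 * y_true + 0.5 * rng.random(1000), 0, 1)

def rates(threshold):
    y_hat = (p >= threshold).astype(int)
    tpr = np.mean(y_hat[y_true == 1])   # true positive rate
    fpr = np.mean(y_hat[y_true == 0])   # false positive rate
    return tpr, fpr

# Raising the threshold can only lower (or keep) both rates:
for lo, hi in [(0.2, 0.4), (0.4, 0.6), (0.6, 0.8)]:
    tpr_lo, fpr_lo = rates(lo)
    tpr_hi, fpr_hi = rates(hi)
    assert tpr_hi <= tpr_lo and fpr_hi <= fpr_lo
```

Plotting `fpr` against `tpr` over a fine grid of thresholds traces out exactly the ROC curve mentioned above.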
Thanks for the answer @Sycorax. You have almost convinced me. But can't we formalise the idea of "how much FPR am I willing to accept for how much TPR"? E.g. using a cost matrix. If we have a cost matrix, would it not be desirable to find the optimal threshold via tuning, as you would tune a hyperparameter? Or is there a better way to find the optimal threshold?
– Nick, Feb 1 at 8:32
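One hedged answer to Nick's cost-matrix question: with well-calibrated probabilities and known misclassification costs, no tuning loop is needed at all, because the cost-minimizing threshold has a closed form. The costs below are hypothetical values chosen for illustration.

```python
# Sketch of the cost-matrix idea: predict positive when the expected cost of
# doing so is lower, i.e. when (1 - p) * c_FP < p * c_FN, which rearranges to
# p > c_FP / (c_FP + c_FN). No grid search over thresholds is required.
import numpy as np

c_fp, c_fn = 1.0, 5.0            # hypothetical misclassification costs
t_star = c_fp / (c_fp + c_fn)    # optimal threshold; 1/6 with these costs

def expected_cost(p, threshold):
    """Expected cost of thresholding a single calibrated probability p."""
    if p >= threshold:            # predict positive: risk a false positive
        return (1 - p) * c_fp
    return p * c_fn               # predict negative: risk a false negative

# t_star is never beaten by any other threshold, for any probability:
for p in np.linspace(0.01, 0.99, 50):
    best = expected_cost(p, t_star)
    assert all(expected_cost(p, t) >= best - 1e-12
               for t in np.linspace(0, 1, 101))
```

Note this relies on the probabilities actually being calibrated; with miscalibrated scores, tuning the threshold empirically on held-out data is a reasonable fallback.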
The way you're using the word "tune" here is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN (but the model remains the same: same coefficients, etc.). You're right that you want to find the best trade-off among errors, but you're wrong that such tuning takes place inside GridSearchCV.
– Sycorax, Feb 1 at 13:49
@Sycorax Aren't the threshold and the intercept (bias term) doing basically the same thing? I.e. you can keep the threshold fixed at 0.5 but change the intercept accordingly; this will "change the model" (as per your last comment) but will have the identical effect in terms of binary predictions. Is this correct? If so, I am not sure the strict distinction between "changing the model" and "changing the decision rule" is so meaningful in this case.
– amoeba, Feb 1 at 16:16
@amoeba This is a thought-provoking remark. I'll have to consider it. I suppose your suggestion amounts to "keep the threshold at 0.5 and treat the intercept as a hyperparameter, which you tune." There's nothing mathematically to stop you from doing this, except the observation that the model no longer maximizes its likelihood. But achieving the MLE may not be a priority in some specific context.
– Sycorax, Feb 1 at 16:26
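Amoeba's equivalence can be checked directly: $\sigma(z) \ge t$ iff $z \ge \operatorname{logit}(t)$ iff $\sigma(z - \operatorname{logit}(t)) \ge 0.5$, so thresholding at $t$ is the same decision rule as shifting the intercept by $-\operatorname{logit}(t)$ and thresholding at 0.5. A small numeric sketch of that identity:

```python
# Numeric check of amoeba's point: thresholding sigmoid(z) at t is the same
# rule as shifting the intercept by -logit(t) and thresholding at 0.5.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

t = 0.8                 # arbitrary decision threshold
shift = logit(t)        # intercept adjustment that reproduces it

for z in [-3.0, -1.0, 0.0, 1.0, 1.3, 1.4, 2.0, 5.0]:   # sample log-odds
    rule_threshold = sigmoid(z) >= t                # threshold the probability
    rule_intercept = sigmoid(z - shift) >= 0.5      # shifted intercept, 0.5 cut
    assert rule_threshold == rule_intercept
```

As Sycorax notes, the shifted-intercept model gives identical labels but no longer maximizes the likelihood of the training data.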
But varying the threshold will change the predicted classifications. Does this mean the threshold is a hyperparameter?
Yup, it does, sorta. It's a hyperparameter of your decision rule, but not of the underlying regression.
If so, why is it (for example) not possible to easily search over a grid of thresholds using scikit-learn's GridSearchCV method (as you would do for the regularisation parameter C)?
This is a design error in sklearn. The best practice for most classification scenarios is to fit the underlying model (which predicts probabilities) using some measure of the quality of those probabilities (like the log-loss in a logistic regression). Afterwards, a decision threshold on these probabilities should be tuned to optimize some business objective of your classification rule. The library should make it easy to optimize the decision threshold based on some measure of quality, but I don't believe it does that well.
I think this is one of the places sklearn got it wrong. The library includes a predict method on all classification models that thresholds at 0.5. This method is useless, and I strongly advocate for never invoking it. It's unfortunate that sklearn is not encouraging a better workflow.
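The two-stage workflow this answer recommends can be sketched as follows (assuming scikit-learn; the data, the C grid, and F1 as the stand-in business objective are all illustrative choices, not anything prescribed by the library):

```python
# Sketch of the recommended workflow: tune hyper-parameters on probability
# quality (log-loss), then tune the decision threshold separately against
# the objective you actually care about, on held-out probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Stage 1: hyper-parameter search scored on probabilities, not 0.5-accuracy.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1.0, 10.0]},
                      scoring="neg_log_loss").fit(X_tr, y_tr)

# Stage 2: pick the threshold on held-out probabilities; predict() is never used.
p_val = search.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
best_t = max(thresholds, key=lambda t: f1_score(y_val, p_val >= t))
```

On imbalanced data like this, `best_t` typically lands well away from 0.5, which is the answer's point about the default being a poor choice.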
I also share your skepticism of the predict method's default choice of 0.5 as a cutoff, but GridSearchCV accepts scorer objects which can tune models with respect to out-of-sample cross-entropy loss. Am I missing your point?
– Sycorax, Jan 31 at 17:32
Right, agreed that is best practice, but it doesn't encourage users to tune decision thresholds.
– Matthew Drury, Jan 31 at 17:32
Gotcha. I understand what you mean!
– Sycorax, Jan 31 at 17:33
@Sycorax tried to edit to clarify!
– Matthew Drury, Jan 31 at 17:35
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f390186%2fis-decision-threshold-a-hyperparameter-in-logistic-regression%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
answered Jan 31 at 17:27, edited Feb 1 at 15:34 by Sycorax
$begingroup$
Thanks for the answer @Sycorax. You have almost convinced me. But can't we formalise the idea of "how much FPR am I willing to accept for how much TPR"? e.g. using a cost matrix. If we have a cost matrix then would it not be desirable to find the optimal threshold via tuning, as you would tune a hyperparameter? Or is there a better way to find the optimal threshold?
$endgroup$
– Nick
Feb 1 at 8:32
1
$begingroup$
The way you're using the word "tune" here is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN (but the model remains the same -- same coefficients, etc.). You're right, that you want to find the best trade-off among errors, but you're wrong that such tuning takes place insideGridSearchCV
.
$endgroup$
– Sycorax
Feb 1 at 13:49
$begingroup$
@Sycorax Isn't the threshold and the intercept (bias term) doing basically the same thing? I.e. you can keep the threshold fixed at 0.5 but change the intercept accordingly; this will "change the model" (as per your last comment) but will have the identical effect in terms of binary predictions. Is this correct? If so, I am not sure the strict distinction between "changing the model" and "changing the decision rule" is so meaningful in this case.
$endgroup$
– amoeba
Feb 1 at 16:16
$begingroup$
@amoeba This is a though-provoking remark. I'll have to consider it. I suppose your suggestion amounts to "keep the threshold at 0.5 and treat the intercept as a hyperparameter, which you tune." There's nothing mathematically to stop you from doing this, except the observation that the model no longer maximizes its likelihood. But achieving the MLE may not be a priority in some specific context.
$endgroup$
– Sycorax
Feb 1 at 16:26
add a comment |
$begingroup$
Thanks for the answer @Sycorax. You have almost convinced me. But can't we formalise the idea of "how much FPR am I willing to accept for how much TPR"? e.g. using a cost matrix. If we have a cost matrix then would it not be desirable to find the optimal threshold via tuning, as you would tune a hyperparameter? Or is there a better way to find the optimal threshold?
$endgroup$
– Nick
Feb 1 at 8:32
1
$begingroup$
The way you're using the word "tune" here is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN (but the model remains the same -- same coefficients, etc.). You're right, that you want to find the best trade-off among errors, but you're wrong that such tuning takes place insideGridSearchCV
.
$endgroup$
– Sycorax
Feb 1 at 13:49
$begingroup$
@Sycorax Isn't the threshold and the intercept (bias term) doing basically the same thing? I.e. you can keep the threshold fixed at 0.5 but change the intercept accordingly; this will "change the model" (as per your last comment) but will have the identical effect in terms of binary predictions. Is this correct? If so, I am not sure the strict distinction between "changing the model" and "changing the decision rule" is so meaningful in this case.
$endgroup$
– amoeba
Feb 1 at 16:16
$begingroup$
@amoeba This is a though-provoking remark. I'll have to consider it. I suppose your suggestion amounts to "keep the threshold at 0.5 and treat the intercept as a hyperparameter, which you tune." There's nothing mathematically to stop you from doing this, except the observation that the model no longer maximizes its likelihood. But achieving the MLE may not be a priority in some specific context.
$endgroup$
– Sycorax
Feb 1 at 16:26
$begingroup$
Thanks for the answer @Sycorax. You have almost convinced me. But can't we formalise the idea of "how much FPR am I willing to accept for how much TPR"? e.g. using a cost matrix. If we have a cost matrix then would it not be desirable to find the optimal threshold via tuning, as you would tune a hyperparameter? Or is there a better way to find the optimal threshold?
$endgroup$
– Nick
Feb 1 at 8:32
$begingroup$
Thanks for the answer @Sycorax. You have almost convinced me. But can't we formalise the idea of "how much FPR am I willing to accept for how much TPR"? e.g. using a cost matrix. If we have a cost matrix then would it not be desirable to find the optimal threshold via tuning, as you would tune a hyperparameter? Or is there a better way to find the optimal threshold?
$endgroup$
– Nick
Feb 1 at 8:32
1
1
$begingroup$
The way you're using the word "tune" here is different from how hyper-parameters are tuned. Changing $C$ and other model hyper-parameters changes the model (e.g., the logistic regression coefficients will be different), while adjusting the threshold can only do two things: trade off TP for FN, and FP for TN (but the model remains the same -- same coefficients, etc.). You're right, that you want to find the best trade-off among errors, but you're wrong that such tuning takes place inside
GridSearchCV
.$endgroup$
– Sycorax
Feb 1 at 13:49
$begingroup$
@Sycorax Isn't the threshold and the intercept (bias term) doing basically the same thing? I.e. you can keep the threshold fixed at 0.5 but change the intercept accordingly; this will "change the model" (as per your last comment) but will have the identical effect in terms of binary predictions. Is this correct? If so, I am not sure the strict distinction between "changing the model" and "changing the decision rule" is so meaningful in this case.
$endgroup$
– amoeba
Feb 1 at 16:16
$begingroup$
@amoeba This is a thought-provoking remark. I'll have to consider it. I suppose your suggestion amounts to "keep the threshold at 0.5 and treat the intercept as a hyperparameter, which you tune." There's nothing mathematically to stop you from doing this, except the observation that the model no longer maximizes its likelihood. But achieving the MLE may not be a priority in some specific context.
$endgroup$
– Sycorax
Feb 1 at 16:26
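$begingroup$
The equivalence amoeba describes can be checked directly: because the sigmoid is monotone, thresholding the predicted probability at $t$ gives exactly the same labels as subtracting $\operatorname{logit}(t)$ from the linear predictor (i.e., shifting the intercept) and thresholding at 0.5. A minimal numpy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

rng = np.random.default_rng(0)
z = rng.normal(size=1000)   # stands in for the linear predictor x @ w + b
t = 0.2                     # a non-default decision threshold

# Rule A: threshold the probability at t.
a = sigmoid(z) >= t
# Rule B: shift the intercept by -logit(t) and threshold at 0.5.
b = sigmoid(z - logit(t)) >= 0.5

assert np.array_equal(a, b)
```

So the two descriptions pick out the same classifier; the distinction is only in whether the shift is attributed to the model or to the decision rule.
$endgroup$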
$begingroup$
But varying the threshold will change the predicted classifications. Does this mean the threshold is a hyperparameter?
Yup, it does, sorta. It's a hyperparameter of your decision rule, but not of the underlying regression.
If so, why is it (for example) not possible to easily search over a grid of thresholds using scikit-learn's GridSearchCV method (as you would do for the regularisation parameter C).
This is a design error in sklearn. The best practice for most classification scenarios is to fit the underlying model (which predicts probabilities) using some measure of the quality of these probabilities (like the log-loss in a logistic regression). Afterwards, a decision threshold on these probabilities should be tuned to optimize some business objective of your classification rule. The library should make it easy to optimize the decision threshold based on some measure of quality, but I don't believe it does that well.
I think this is one of the places sklearn got it wrong. The library includes a method, predict, on all classification models that thresholds at 0.5. This method is useless, and I strongly advocate for not ever invoking it. It's unfortunate that sklearn is not encouraging a better workflow.
$endgroup$
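$begingroup$
The two-stage workflow described above can be sketched with no special library support (the function name and cost weights here are illustrative, not sklearn API): fit the probability model under a proper scoring rule, then sweep candidate thresholds on held-out probabilities against the business objective.

```python
import numpy as np

def tune_threshold(p_valid, y_valid, c_fp=1.0, c_fn=5.0):
    """Sweep candidate thresholds over held-out predicted probabilities
    and return the one minimising total misclassification cost
    (any business metric could replace the cost sum)."""
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in thresholds:
        pred = p_valid >= t
        fp = np.sum(pred & (y_valid == 0))
        fn = np.sum(~pred & (y_valid == 1))
        costs.append(c_fp * fp + c_fn * fn)
    return thresholds[int(np.argmin(costs))]

# Toy validation set: probabilities from an already-fitted model.
p = np.array([0.05, 0.1, 0.3, 0.6, 0.8, 0.95])
y = np.array([0,    0,   1,   1,   1,   1])
t_star = tune_threshold(p, y)
# With false negatives much costlier than false positives, the chosen
# threshold sits low enough to catch the positive at p = 0.3.
```

The key point is that this sweep never refits the model: the coefficients are frozen, and only the downstream decision rule changes.
$endgroup$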
$begingroup$
I also share your skepticism of the predict method's default choice of 0.5 as a cutoff, but GridSearchCV accepts scorer objects which can tune models with respect to out-of-sample cross-entropy loss. Am I missing your point?
$endgroup$
– Sycorax
Jan 31 at 17:32
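$begingroup$
For reference, the setup Sycorax describes looks like this: `"neg_log_loss"` is a built-in sklearn scoring string, so the hyperparameter search is judged on the quality of the probabilities rather than on 0.5-thresholded labels (the dataset here is a synthetic stand-in).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_log_loss",  # proper scoring rule on the probabilities
    cv=5,
)
search.fit(X, y)
# search.best_params_["C"] is the regularisation strength with the best
# held-out log-loss; the decision threshold is still not part of this search.
```

Note this tunes $C$, not the threshold, which is exactly the gap discussed in this thread.
$endgroup$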
$begingroup$
Right, agreed that is best practice, but it doesn't encourage users to tune decision thresholds.
$endgroup$
– Matthew Drury
Jan 31 at 17:32
$begingroup$
Gotcha. I understand what you mean!
$endgroup$
– Sycorax
Jan 31 at 17:33
1
$begingroup$
@Sycorax tried to edit to clarify!
$endgroup$
– Matthew Drury
Jan 31 at 17:35
answered Jan 31 at 17:28, edited Feb 1 at 16:52
– Matthew Drury