Binary search for students


























I'm putting together an example for students in which I would like to show how we can optimize even such a basic algorithm as binary search with respect to the hardware. I do not consider threads. I wrote it in C++ and the whole example can be found here. I'm trying to show some examples of prefetching.



Test data: If we want to see the effects of the optimizations proposed in the following examples, it is necessary to use data that are not in the CPU cache.



We are going to test it using some code like this:



// arr        - sorted array with at least 100 MB of values
// test_arr   - array with the values we are going to search for in arr
// test_count - size of test_arr
for (auto i = 0u; i < test_count; ++i) {
    auto position = binarySearch(item_count, arr, test_arr[i]);
}


Size - If the arr array is significantly larger than the largest CPU cache, then most of the array won't be stored there and we will inevitably experience L2 cache misses during the run. L2 cache misses are slow, and they become the major bottleneck of our binary search algorithm.



Randomness - It is better for the values in test_arr to access arr randomly; otherwise, the basic binary search algorithm will benefit from data-access locality.



Please note that L2 cache misses can occur even on a small array inside some complex algorithm that works with a lot of data. In other words, we need a large array only for testing purposes, but the optimization can be useful anywhere we expect the binary search to run on data outside of the CPU cache.
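A minimal sketch of how such cache-cold test data could be generated (the even-value scheme, sizes, and RNG seed here are my own illustrative choices, not the ones from the original demo):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

using Type = int64_t;

// Build a sorted array: even values only, so that odd search keys are
// guaranteed misses. At ~13 million int64_t values this exceeds 100 MB.
std::vector<Type> make_sorted_array(std::size_t item_count)
{
    std::vector<Type> arr(item_count);
    for (std::size_t i = 0; i < item_count; ++i)
        arr[i] = static_cast<Type>(2 * i);
    return arr;
}

// Uniformly random keys in [0, 2*item_count]: consecutive searches hit
// unrelated parts of arr (no locality), and roughly half of them miss.
std::vector<Type> make_random_queries(std::size_t test_count,
                                      std::size_t item_count,
                                      std::uint64_t seed = 42)
{
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<Type> dist(0, static_cast<Type>(2 * item_count));
    std::vector<Type> test_arr(test_count);
    for (auto& v : test_arr)
        v = dist(rng);
    return test_arr;
}
```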



The basic algorithm: It is something along these lines:



using Type = int64_t;

int binarySearch_basic(size_t item_count, const Type* arr, int search)
{
    int l = 0;
    int r = item_count - 1;
    while (l <= r) {
        int m = l + ((r - l) >> 1);
        if (arr[m] == search) return m;
        if (arr[m] < search) l = m + 1;
        else r = m - 1;
    }
    return -1;
}


Prefetch: The first optimization simply tries to prefetch the possibly needed memory in advance:



int binarySearch_prefetch(size_t item_count, const Type* arr, int search)
{
    int l = 0;
    int r = item_count - 1;
    while (l <= r) {
        int m1 = arr[l + ((r - l) >> 2)]; // new code: touch the left candidate
        int m2 = arr[r - ((r - l) >> 2)]; // new code: touch the right candidate

        int m = l + ((r - l) >> 1);
        if (arr[m] == search) return m;
        if (arr[m] < search) l = m + 1;
        else r = m - 1;
    }
    return -1;
}
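For comparison, the same idea can also be expressed with an explicit prefetch intrinsic instead of demand loads. This is only a sketch under the assumption of a GCC/Clang-style `__builtin_prefetch` (on MSVC one would use `_mm_prefetch` from `<xmmintrin.h>` instead), and the function name is mine:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using Type = int64_t;

// Hypothetical variant: instead of loading two extra elements (whose values
// are never used), issue prefetch hints for the two possible next midpoints.
// Only one of them will actually be needed, but both requests can overlap
// with the stalled demand load of arr[m].
int binarySearch_prefetch_intrinsic(std::size_t item_count, const Type* arr, int search)
{
    int l = 0;
    int r = static_cast<int>(item_count) - 1;
    while (l <= r) {
        int m = l + ((r - l) >> 1);
#if defined(__GNUC__) || defined(__clang__)
        if (l < m) __builtin_prefetch(&arr[l + ((m - 1 - l) >> 1)]);     // next midpoint if we go left
        if (m < r) __builtin_prefetch(&arr[m + 1 + ((r - m - 1) >> 1)]); // next midpoint if we go right
#endif
        if (arr[m] == search) return m;
        if (arr[m] < search) l = m + 1;
        else r = m - 1;
    }
    return -1;
}
```

Because the prefetches are pure hints, the optimizer cannot delete them the way it can delete unused loads.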


More items per iteration: The second optimization is a little more sophisticated. It accesses more than one item per iteration, which lets us read more relevant data per L2 cache-miss wait.



int binarySearch_duo(size_t item_count, const Type* arr, int search)
{
    int l = 0;
    int r = item_count - 1;
    while (r - l > 3) {
        int delta = (r - l) / 3;
        int m1 = l + delta;
        int m2 = r - delta;

        if (arr[m1] == search) return m1;
        if (arr[m2] == search) return m2;

        if (arr[m1] < search) {
            if (arr[m2] < search) {
                l = m2 + 1;
            }
            else {
                r = m2 - 1;
                l = m1 + 1;
            }
        }
        else {
            r = m1 - 1;
        }
    }

    for (int i = l; i <= r; i++) {
        if (arr[i] == search) return i;
    }

    return -1;
}


Of course, this approach can be extended to more accesses per iteration (I'm testing three as well in my demo). There are infinite-loop issues if the (l, r) interval is very small (less than three), so I finish the remaining items with a simple sequential search to keep the main loop clean and simple.



I did my tests with the Visual Studio 2017 C++ compiler.
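The measurement pattern itself can be sketched portably with std::chrono (the helper name `time_searches` is mine, and the checksum exists only to stop the optimizer from deleting the otherwise-unused search results):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using Type = int64_t;

// The basic search, repeated here so the sketch is self-contained.
int binarySearch(std::size_t item_count, const Type* arr, Type search)
{
    int l = 0, r = static_cast<int>(item_count) - 1;
    while (l <= r) {
        int m = l + ((r - l) >> 1);
        if (arr[m] == search) return m;
        if (arr[m] < search) l = m + 1;
        else r = m - 1;
    }
    return -1;
}

// Returns elapsed microseconds. Accumulating the results into 'checksum'
// defeats dead-code elimination of the whole loop.
long long time_searches(const std::vector<Type>& arr,
                        const std::vector<Type>& test_arr,
                        long long& checksum)
{
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < test_arr.size(); ++i)
        checksum += binarySearch(arr.size(), arr.data(), test_arr[i]);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```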



I have several questions:




  • Are there any other possible significant optimizations that could help the algorithm?

  • Is my explanation of the improvement in the last algorithm correct? (Is it really because the CPU reads the two data items in parallel, so it waits for memory just once, not twice as in the first algorithm?)

  • Are there any other comments about my code?






















  • Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to roll back the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time, though.
    – Mast, Jan 23 at 21:41



















c++ performance binary-search






asked Jan 23 at 17:43 by Radim Bača, edited Jan 23 at 21:42






















3 Answers







How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that, it is impossible to get into optimization. Depending on the chosen compiler, I was sort of able to reproduce your results; see this quick bench: for clang 7.0.0 the fancy algorithm is indeed faster. That being said, with gcc the basic one is a lot faster (outperforming the fancy algorithm in both clang and gcc).

With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup) by choosing the fastest of the three for specific sizes.



As far as the code is concerned, there are a couple of things you can improve.



You could rewrite your algorithms so that they accept iterators, just like all the algorithms from the standard library; that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit conversions, e.g. here:



int r = item_count - 1;


You should also prefer to keep the bodies of your functions short; it's hard to understand what binarySearch_duo actually does (you could easily extract the body of the loop, etc.). Also, avoid magic constants: where does the seven in while (r - l > 7) { come from?



Edit: I've created one more benchmark for this, this time as proposed by the OP, that is, with a huge array (actually a vector) and random access to its elements; see here. This time, with gcc, the fancy version is as good as the basic one. The lesson should probably be: the compiler is usually at least as smart as you are :).
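The iterator-based interface mentioned above could look something like this sketch in the style of std::lower_bound (the name binarySearch_it is made up for illustration):

```cpp
#include <cassert>
#include <vector>

// Iterator-based variant: works on any sorted random-access range, like the
// standard algorithms, with no index arithmetic and no size_t-to-int
// conversions. Returns 'last' when the value is not present.
template <typename It, typename T>
It binarySearch_it(It first, It last, const T& value)
{
    It lo = first;
    It hi = last;
    while (lo < hi) {
        It mid = lo + (hi - lo) / 2;
        if (*mid < value) lo = mid + 1;
        else hi = mid;
    }
    return (lo != last && *lo == value) ? lo : last;
}
```

Usage mirrors the standard library: `binarySearch_it(v.begin(), v.end(), key)` for a sorted `std::vector`, or plain pointers for a raw array.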






  • You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code win. However, it is still quite a small array, and L2 cache misses do not occur significantly often. The number 7 is not important for the performance; it can be any small number larger than 3. I selected 7 since it approximately corresponds to one 64-byte cache line.
    – Radim Bača, Jan 23 at 20:12

  • Not necessarily; gcc, even after increasing the count of elements in the array to 10^6, produces better times for the basic implementation. What I'm trying to say is that your results will vary depending on your benchmark, and if you don't provide a reliable way to measure performance, it's tough to recommend further optimisations.
    – paler123, Jan 23 at 20:29

  • You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
    – Radim Bača, Jan 23 at 20:30

  • I have updated the question. I hope it is clearer now. Thank you for your recommendations.
    – Radim Bača, Jan 23 at 21:06

  • Concerning your benchmark: it is not large enough, and the queries into the array are not random.
    – Radim Bača, Jan 23 at 21:13




















Prefetch version does nothing with the optimizer turned on

Since you are not using m1 and m2, the compiler will eliminate them. You get identical codegen; see here.

Branches

Rearranging the conditions appears to speed up the code by 20%, making the basic case almost as fast as the duo version. Note that the conditions are already swapped in the duo version, which may explain some of its performance. It appears the less-than branch is more common, which allows the other condition to be skipped:



if (arr[m] < search) l = m + 1;
else if (arr[m] == search) return m;
else r = m - 1;
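Dropped into the basic search, the reordered comparison looks like this (a sketch; the function name is mine):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using Type = int64_t;

// Basic binary search with the less-than test first: on a random workload
// most probes land on elements that are not equal to 'search', so testing
// the common branch first lets the equality check be skipped most of the time.
int binarySearch_reordered(std::size_t item_count, const Type* arr, int search)
{
    int l = 0;
    int r = static_cast<int>(item_count) - 1;
    while (l <= r) {
        int m = l + ((r - l) >> 1);
        if (arr[m] < search) l = m + 1;
        else if (arr[m] == search) return m;
        else r = m - 1;
    }
    return -1;
}
```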





  • True, I agree. The main issue here is why the MSVC compiler is so bad on the basic version compared to Clang/GCC, or compared to the duo version (and note that this is only the case for MSVC on data outside of the CPU cache).
    – Radim Bača, Jan 25 at 7:04

  • @RadimBača Oh, I just had an interesting observation. See my edit.
    – wooooooooosh, Jan 29 at 5:49

  • Yes, you are right! The order of the conditions seems to cause a significant slowdown. There are still some questions left, but thank you for the observation.
    – Radim Bača, Jan 29 at 7:30

  • The careful calculation of the mean isn't a performance hack; it's there to ensure correctness (avoiding overflow).
    – Toby Speight, Jan 29 at 9:28

  • See The curious case of Binary Search — The famous bug that remained undetected for 20 years, for instance.
    – Toby Speight, Jan 29 at 9:35




















In reply to "Are there any other possible significant optimizations that could help the algorithm?"



I don't know, and it may have no effect at all or even reduce performance, but I would like to know whether reducing the apparent number of array operations improves performance; for example, change



if (arr[m1] == search) return m1;
if (arr[m2] == search) return m2;

if (arr[m1] < search) {
    if (arr[m2] < search) {

to



Type am1 = arr[m1], am2 = arr[m2];
if (am1 == search) return m1;
if (am2 == search) return m2;

if (am1 < search) {
    if (am2 < search) {


Apart from that, with regard to "Are there any other comments about my code?", could I suggest using, say, left and right instead of l and r as variable names? l is very easy to mistake for 1!
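Putting both suggestions together (cached loads plus readable names), the duo search might look like this sketch; whether the caching actually helps is exactly the open question above, since the compiler may already perform this transformation itself:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using Type = int64_t;

int binarySearch_duo_cached(std::size_t item_count, const Type* arr, int search)
{
    int left = 0;
    int right = static_cast<int>(item_count) - 1;
    while (right - left > 3) {
        int delta = (right - left) / 3;
        int m1 = left + delta;
        int m2 = right - delta;

        // Read each probed element exactly once.
        const Type at_m1 = arr[m1];
        const Type at_m2 = arr[m2];

        if (at_m1 == search) return m1;
        if (at_m2 == search) return m2;

        if (at_m1 < search) {
            if (at_m2 < search) {
                left = m2 + 1;
            } else {
                right = m2 - 1;
                left = m1 + 1;
            }
        } else {
            right = m1 - 1;
        }
    }

    // Sequential tail for the tiny remaining interval, as in the original.
    for (int i = left; i <= right; i++)
        if (arr[i] == search) return i;

    return -1;
}
```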






share|improve this answer









$endgroup$













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "196"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f212088%2fbinary-search-for-students%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    8












    $begingroup$

    How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).

    With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.



    As far as the code is concerned, there are a couple of things you can improve.



    You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
    Apart from that, you should avoid implicit conversions, e.g. here:



    int r = item_count - 1;


    You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) { come from?



    Edit: I've created one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).






    share|improve this answer











    $endgroup$













    • $begingroup$
      You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
      $endgroup$
      – Radim Bača
      Jan 23 at 20:12






    • 1




      $begingroup$
      Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
      $endgroup$
      – paler123
      Jan 23 at 20:29










    • $begingroup$
      You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
      $endgroup$
      – Radim Bača
      Jan 23 at 20:30












    • $begingroup$
      I have updated the question. I hope it is more clear now. Thank you for your recommendations.
      $endgroup$
      – Radim Bača
      Jan 23 at 21:06










    • $begingroup$
      Concerning your benchmark - it is not large enough and also the queries into the array are not random.
      $endgroup$
      – Radim Bača
      Jan 23 at 21:13
















8

$begingroup$

How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that, it is impossible to get into optimization. Depending on the chosen compiler, I was more or less able to reproduce your results (see this quick bench): for clang 7.0.0 the fancy algorithm is indeed faster. That said, with gcc the basic one is a lot faster (outperforming the fancy algorithm in both clang and gcc).

With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup) by choosing the fastest of the three for specific sizes.

As far as the code is concerned, there are a couple of things you can improve.

You could rewrite your algorithms so that they accept iterators, just like the algorithms in the standard library. That will drastically improve their flexibility without added effort (and most probably without degrading performance, though you should measure it anyway).
Apart from that, you should avoid implicit conversions, e.g. here:

    int r = item_count - 1;

You should also prefer to keep the bodies of your functions short; it's hard to understand what binarySearch_duo actually does (you could easily extract the body of the loop, etc.). Also, avoid magic constants: where does the seven in while (r - l > 7) { come from?

Edit: I've created one more benchmark, this time as proposed by the OP, that is, with a huge array (actually a vector) and with random access to its elements (see here). This time, with gcc, the fancy version is as good as the basic one. The lesson is probably that the compiler is usually at least as smart as you are :).

$endgroup$













• $begingroup$
  You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code win. However, that is still quite a small array, and L2 cache misses do not occur significantly often. The number 7 is not important for the performance; it can be any small number larger than 3. I selected 7 since it approximately corresponds to one 64-byte cache line.
  $endgroup$
  – Radim Bača
  Jan 23 at 20:12

• 1
  $begingroup$
  Not necessarily: gcc, even after increasing the count of elements in the array to 10^6, produces better times for the basic implementation. What I'm trying to say is that your results will vary depending on your benchmark, and if you don't provide a reliable way to measure performance, then it's tough to recommend further optimisations.
  $endgroup$
  – paler123
  Jan 23 at 20:29

• $begingroup$
  You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
  $endgroup$
  – Radim Bača
  Jan 23 at 20:30

• $begingroup$
  I have updated the question. I hope it is clearer now. Thank you for your recommendations.
  $endgroup$
  – Radim Bača
  Jan 23 at 21:06

• $begingroup$
  Concerning your benchmark: it is not large enough, and the queries into the array are not random.
  $endgroup$
  – Radim Bača
  Jan 23 at 21:13
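The comment about the threshold of 7 can be made concrete with a little arithmetic: with 64-bit values, one 64-byte cache line holds 8 elements, so once the search range shrinks to roughly one line, further halving saves no memory traffic. A tiny sanity check (the constants are illustrative; the actual line size is hardware-dependent):

```cpp
#include <cstdint>
#include <cstddef>

// Assumed cache line size; 64 bytes is typical on x86, but this is
// hardware-dependent, so treat it as an illustrative constant.
constexpr std::size_t kCacheLineBytes = 64;

// How many int64_t values share a single cache line.
constexpr std::size_t kValuesPerLine = kCacheLineBytes / sizeof(std::int64_t);

static_assert(kValuesPerLine == 8, "8 values per 64-byte line");
```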














edited Jan 24 at 20:26
answered Jan 23 at 19:37 – paler123


3

$begingroup$

The prefetch version does nothing with the optimizer turned on.

Since you are not using m1 and m2, the compiler will eliminate them. You get identical codegen; see here.

Branches

Rearranging the conditions appears to speed up the code by 20%, making the basic case almost as fast as the duo version. Note that the conditions are already swapped in the duo version, which may explain some of its performance. It appears the less-than branch is more common, and testing it first allows the other conditions to be skipped.

if (arr[m] < search) l = m + 1;
else if (arr[m] == search) return m;
else r = m - 1;

$endgroup$
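Applied to the basic search from the question, the reordered version might look like this (a sketch; the signature is adapted slightly, passing the searched value as Type rather than int):

```cpp
#include <cstdint>
#include <cstddef>

using Type = std::int64_t;

// Basic binary search with the branches reordered: the less-than test,
// observed to be the more common outcome, comes first, so the equality
// and greater-than branches are usually skipped.
int binarySearch_reordered(std::size_t item_count, const Type* arr, Type search)
{
    int l = 0;
    int r = static_cast<int>(item_count) - 1;
    while (l <= r) {
        int m = l + ((r - l) >> 1); // overflow-safe midpoint
        if (arr[m] < search) l = m + 1;
        else if (arr[m] == search) return m;
        else r = m - 1;
    }
    return -1; // not found
}
```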













• $begingroup$
  True, I agree. The main issue here is why the MSVC compiler is so bad on the basic version compared to Clang/GCC, or compared to the duo version (and note that this is only the case for MSVC on data outside the CPU cache).
  $endgroup$
  – Radim Bača
  Jan 25 at 7:04

• $begingroup$
  @RadimBača Oh, I just had an interesting observation. See my edit.
  $endgroup$
  – wooooooooosh
  Jan 29 at 5:49

• $begingroup$
  Yes, you are right! The order of the conditions seems to cause a significant slowdown. There are still some questions left, but thank you for the observation.
  $endgroup$
  – Radim Bača
  Jan 29 at 7:30

• $begingroup$
  The careful calculation of the mean isn't a performance hack; it's there to ensure correctness (avoiding overflow).
  $endgroup$
  – Toby Speight
  Jan 29 at 9:28

• $begingroup$
  See The curious case of Binary Search — The famous bug that remained undetected for 20 years, for instance.
  $endgroup$
  – Toby Speight
  Jan 29 at 9:35
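Toby Speight's point about the midpoint is easy to demonstrate in isolation (the values below are chosen purely for illustration): computing (l + r) / 2 can overflow int when both bounds are large, while the question's l + ((r - l) >> 1) form cannot.

```cpp
#include <climits>

// Overflow-safe midpoint as used in the question. Because r - l is
// small relative to the range of int, the addition cannot overflow,
// even when l and r are both near INT_MAX.
int safe_mid(int l, int r)
{
    return l + ((r - l) >> 1);
}
```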
















edited Jan 29 at 23:36
answered Jan 25 at 1:11 – wooooooooosh


1

$begingroup$

In reply to "Are there any other possible significant optimizations that could help the algorithm?"

I don't know, and it may have no effect at all, or even reduce performance, but I would like to know whether reducing the apparent number of array operations improves performance. For example, change

    if (arr[m1] == search) return m1;
    if (arr[m2] == search) return m2;

    if (arr[m1] < search) {
    if (arr[m2] < search) {

to

    Type am1 = arr[m1], am2 = arr[m2];
    if (am1 == search) return m1;
    if (am2 == search) return m2;

    if (am1 < search) {
    if (am2 < search) {

Apart from that, with regard to "Are there any other comments about my code?", could I suggest using, say, left and right instead of l and r as variable names? l is very easy to mistake for 1!

$endgroup$
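The load-caching idea can be isolated into a small sketch (the helper name probePair and its shape are illustrative; this is not the original binarySearch_duo):

```cpp
#include <cstdint>

using Type = std::int64_t;

// One two-probe comparison step with the array values cached in
// locals, so each element is read once rather than per comparison.
// Returns the matching index, or -1 if neither probe hits.
int probePair(const Type* arr, int m1, int m2, Type search)
{
    const Type am1 = arr[m1], am2 = arr[m2]; // single loads
    if (am1 == search) return m1;
    if (am2 == search) return m2;
    return -1;
}
```

In practice an optimizing compiler will often perform this caching itself, so it is worth inspecting the generated code before attributing any speedup to it.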


















answered Jan 23 at 22:23 – George Barwood
