Binary search for students












7












$begingroup$


I'm putting together an example for students where I would like to show how we can optimize even such basic algorithm like a binary search with respect to hardware. I do not consider threads. I wrote it in C++ and the whole example can be found here. I'm trying to show some examples of the prefetching.



Test data If we would like to see the effects of the optimizations proposed in the following examples, then it is necessary to use data that are not in the CPU cache.



We are going to test it using some code like this



// arr - array with at least 100MB of values
// test_arr - array with values that we are going to search in arr
// test_count - size of the test_arr
for (auto i = 0u; i < test_count; ++i) {
auto position = binarySearch(item_count, arr, test_arr[i]);
}


Size - If the arr array is significantly larger then largest CPU cache then most of the array won't be stored there and we will inevitably experience the L2 cache misses during the run. L2 cache misses are slow and it will become the major bottleneck of our binary search algorithm.



Randomness - It is better to have data in the test_arr that accesses the arr randomly, otherwise, the basic binary search algorithm will benefit from data access locality.



Please note that the L2 cache misses can be experienced even on a small array in some complex algorithm which works with a lot of data. In other words, we need a large array only for the purpose of testing, but the optimization can be useful everywhere where expect the binary search to happen on a data outside of CPU cache.



The basic algorithm It is something along these lines:



using Type = int64_t;
int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


Prefetch The first optimization simply try to prefetch possible memory in advance



int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m1 = arr[l + ((r - l) >> 2)]; // new code
int m2 = arr[r - ((r - l) >> 2)]; // new code

int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


More items per iteration The second optimization is a little bit more sophisticated. It accesses more than one item per iteration which allows read more relevant data per CPU L2 cache miss wait.



int binarySearch_duo(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (r - l > 3) {

int delta = (r - l) / 3;
int m1 = l + delta;
int m2 = r - delta;

if (arr[m1] == search) return m1;
if (arr[m2] == search) return m2;

if (arr[m1] < search) {
if (arr[m2] < search) {
l = m2 + 1;
}
else {
r = m2 - 1;
l = m1 + 1;
}
}
else {
r = m1 - 1;
}
}

for (int i = l; i <= r; i++)
{
if (arr[i] == search) return i;
}

return -1;
}


Of course, this approach can be done with more accesses per iteration (I'm testing three as well in my demo). There are some infinite loop issues if the (l,r) interval is very small (less then three), therefore, I finish the remaining items by a simple sequential search to keep the main iteration clean and simple.



I was doing my tests with Visual Studio 2017 C++ compiler.



I have several questions:




  • Are there any other possible significant optimizations that could help the algorithm?

  • Is my explanation of the improvement of the last algorithm correct? (is it really due to the fact that the CPU read in parallel two data items; therefore, he waits for it just once not twice like in the first algorithm).

  • Are there any other comments about my code?










share|improve this question











$endgroup$












  • $begingroup$
    Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
    $endgroup$
    – Mast
    Jan 23 at 21:41
















7












$begingroup$


I'm putting together an example for students where I would like to show how we can optimize even such basic algorithm like a binary search with respect to hardware. I do not consider threads. I wrote it in C++ and the whole example can be found here. I'm trying to show some examples of the prefetching.



Test data If we would like to see the effects of the optimizations proposed in the following examples, then it is necessary to use data that are not in the CPU cache.



We are going to test it using some code like this



// arr - array with at least 100MB of values
// test_arr - array with values that we are going to search in arr
// test_count - size of the test_arr
for (auto i = 0u; i < test_count; ++i) {
auto position = binarySearch(item_count, arr, test_arr[i]);
}


Size - If the arr array is significantly larger then largest CPU cache then most of the array won't be stored there and we will inevitably experience the L2 cache misses during the run. L2 cache misses are slow and it will become the major bottleneck of our binary search algorithm.



Randomness - It is better to have data in the test_arr that accesses the arr randomly, otherwise, the basic binary search algorithm will benefit from data access locality.



Please note that the L2 cache misses can be experienced even on a small array in some complex algorithm which works with a lot of data. In other words, we need a large array only for the purpose of testing, but the optimization can be useful everywhere where expect the binary search to happen on a data outside of CPU cache.



The basic algorithm It is something along these lines:



using Type = int64_t;
int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


Prefetch The first optimization simply try to prefetch possible memory in advance



int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m1 = arr[l + ((r - l) >> 2)]; // new code
int m2 = arr[r - ((r - l) >> 2)]; // new code

int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


More items per iteration The second optimization is a little bit more sophisticated. It accesses more than one item per iteration which allows read more relevant data per CPU L2 cache miss wait.



int binarySearch_duo(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (r - l > 3) {

int delta = (r - l) / 3;
int m1 = l + delta;
int m2 = r - delta;

if (arr[m1] == search) return m1;
if (arr[m2] == search) return m2;

if (arr[m1] < search) {
if (arr[m2] < search) {
l = m2 + 1;
}
else {
r = m2 - 1;
l = m1 + 1;
}
}
else {
r = m1 - 1;
}
}

for (int i = l; i <= r; i++)
{
if (arr[i] == search) return i;
}

return -1;
}


Of course, this approach can be done with more accesses per iteration (I'm testing three as well in my demo). There are some infinite loop issues if the (l,r) interval is very small (less then three), therefore, I finish the remaining items by a simple sequential search to keep the main iteration clean and simple.



I was doing my tests with Visual Studio 2017 C++ compiler.



I have several questions:




  • Are there any other possible significant optimizations that could help the algorithm?

  • Is my explanation of the improvement of the last algorithm correct? (is it really due to the fact that the CPU read in parallel two data items; therefore, he waits for it just once not twice like in the first algorithm).

  • Are there any other comments about my code?










share|improve this question











$endgroup$












  • $begingroup$
    Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
    $endgroup$
    – Mast
    Jan 23 at 21:41














7












7








7


2



$begingroup$


I'm putting together an example for students where I would like to show how we can optimize even such basic algorithm like a binary search with respect to hardware. I do not consider threads. I wrote it in C++ and the whole example can be found here. I'm trying to show some examples of the prefetching.



Test data If we would like to see the effects of the optimizations proposed in the following examples, then it is necessary to use data that are not in the CPU cache.



We are going to test it using some code like this



// arr - array with at least 100MB of values
// test_arr - array with values that we are going to search in arr
// test_count - size of the test_arr
for (auto i = 0u; i < test_count; ++i) {
auto position = binarySearch(item_count, arr, test_arr[i]);
}


Size - If the arr array is significantly larger then largest CPU cache then most of the array won't be stored there and we will inevitably experience the L2 cache misses during the run. L2 cache misses are slow and it will become the major bottleneck of our binary search algorithm.



Randomness - It is better to have data in the test_arr that accesses the arr randomly, otherwise, the basic binary search algorithm will benefit from data access locality.



Please note that the L2 cache misses can be experienced even on a small array in some complex algorithm which works with a lot of data. In other words, we need a large array only for the purpose of testing, but the optimization can be useful everywhere where expect the binary search to happen on a data outside of CPU cache.



The basic algorithm It is something along these lines:



using Type = int64_t;
int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


Prefetch The first optimization simply try to prefetch possible memory in advance



int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m1 = arr[l + ((r - l) >> 2)]; // new code
int m2 = arr[r - ((r - l) >> 2)]; // new code

int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


More items per iteration The second optimization is a little bit more sophisticated. It accesses more than one item per iteration which allows read more relevant data per CPU L2 cache miss wait.



int binarySearch_duo(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (r - l > 3) {

int delta = (r - l) / 3;
int m1 = l + delta;
int m2 = r - delta;

if (arr[m1] == search) return m1;
if (arr[m2] == search) return m2;

if (arr[m1] < search) {
if (arr[m2] < search) {
l = m2 + 1;
}
else {
r = m2 - 1;
l = m1 + 1;
}
}
else {
r = m1 - 1;
}
}

for (int i = l; i <= r; i++)
{
if (arr[i] == search) return i;
}

return -1;
}


Of course, this approach can be done with more accesses per iteration (I'm testing three as well in my demo). There are some infinite loop issues if the (l,r) interval is very small (less then three), therefore, I finish the remaining items by a simple sequential search to keep the main iteration clean and simple.



I was doing my tests with Visual Studio 2017 C++ compiler.



I have several questions:




  • Are there any other possible significant optimizations that could help the algorithm?

  • Is my explanation of the improvement of the last algorithm correct? (is it really due to the fact that the CPU read in parallel two data items; therefore, he waits for it just once not twice like in the first algorithm).

  • Are there any other comments about my code?










share|improve this question











$endgroup$




I'm putting together an example for students where I would like to show how we can optimize even such basic algorithm like a binary search with respect to hardware. I do not consider threads. I wrote it in C++ and the whole example can be found here. I'm trying to show some examples of the prefetching.



Test data If we would like to see the effects of the optimizations proposed in the following examples, then it is necessary to use data that are not in the CPU cache.



We are going to test it using some code like this



// arr - array with at least 100MB of values
// test_arr - array with values that we are going to search in arr
// test_count - size of the test_arr
for (auto i = 0u; i < test_count; ++i) {
auto position = binarySearch(item_count, arr, test_arr[i]);
}


Size - If the arr array is significantly larger then largest CPU cache then most of the array won't be stored there and we will inevitably experience the L2 cache misses during the run. L2 cache misses are slow and it will become the major bottleneck of our binary search algorithm.



Randomness - It is better to have data in the test_arr that accesses the arr randomly, otherwise, the basic binary search algorithm will benefit from data access locality.



Please note that the L2 cache misses can be experienced even on a small array in some complex algorithm which works with a lot of data. In other words, we need a large array only for the purpose of testing, but the optimization can be useful everywhere where expect the binary search to happen on a data outside of CPU cache.



The basic algorithm It is something along these lines:



using Type = int64_t;
int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


Prefetch The first optimization simply try to prefetch possible memory in advance



int binarySearch_basic(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (l <= r) {
int m1 = arr[l + ((r - l) >> 2)]; // new code
int m2 = arr[r - ((r - l) >> 2)]; // new code

int m = l + ((r - l) >> 1);
if (arr[m] == search) return m;
if (arr[m] < search) l = m + 1;
else r = m - 1;
}
return -1;
}


More items per iteration The second optimization is a little bit more sophisticated. It accesses more than one item per iteration which allows read more relevant data per CPU L2 cache miss wait.



int binarySearch_duo(size_t item_count, const Type arr, int search)
{
int l = 0;
int r = item_count - 1;
while (r - l > 3) {

int delta = (r - l) / 3;
int m1 = l + delta;
int m2 = r - delta;

if (arr[m1] == search) return m1;
if (arr[m2] == search) return m2;

if (arr[m1] < search) {
if (arr[m2] < search) {
l = m2 + 1;
}
else {
r = m2 - 1;
l = m1 + 1;
}
}
else {
r = m1 - 1;
}
}

for (int i = l; i <= r; i++)
{
if (arr[i] == search) return i;
}

return -1;
}


Of course, this approach can be done with more accesses per iteration (I'm testing three as well in my demo). There are some infinite loop issues if the (l,r) interval is very small (less then three), therefore, I finish the remaining items by a simple sequential search to keep the main iteration clean and simple.



I was doing my tests with Visual Studio 2017 C++ compiler.



I have several questions:




  • Are there any other possible significant optimizations that could help the algorithm?

  • Is my explanation of the improvement of the last algorithm correct? (is it really due to the fact that the CPU read in parallel two data items; therefore, he waits for it just once not twice like in the first algorithm).

  • Are there any other comments about my code?







c++ performance binary-search






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 23 at 21:42







Radim Bača

















asked Jan 23 at 17:43









Radim BačaRadim Bača

1386




1386












  • $begingroup$
    Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
    $endgroup$
    – Mast
    Jan 23 at 21:41


















  • $begingroup$
    Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
    $endgroup$
    – Mast
    Jan 23 at 21:41
















$begingroup$
Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
$endgroup$
– Mast
Jan 23 at 21:41




$begingroup$
Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. Please see what you may and may not do after receiving answers. There is no immediate problem, but if you add more code we'll have to rollback the edit. Should your code be improved over time, post a follow-up question linking back to this one instead. Give it some time though.
$endgroup$
– Mast
Jan 23 at 21:41










3 Answers
3






active

oldest

votes


















8












$begingroup$

How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).

With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.



As far as the code is concerned, there are a couple of things you can improve.



You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
Apart from that, you should avoid implicit conversions, e.g. here:



int r = item_count - 1;


You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) { come from?



Edit: I've created one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).






share|improve this answer











$endgroup$













  • $begingroup$
    You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
    $endgroup$
    – Radim Bača
    Jan 23 at 20:12






  • 1




    $begingroup$
    Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
    $endgroup$
    – paler123
    Jan 23 at 20:29










  • $begingroup$
    You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
    $endgroup$
    – Radim Bača
    Jan 23 at 20:30












  • $begingroup$
    I have updated the question. I hope it is more clear now. Thank you for your recommendations.
    $endgroup$
    – Radim Bača
    Jan 23 at 21:06










  • $begingroup$
    Concerning your benchmark - it is not large enough and also the queries into the array are not random.
    $endgroup$
    – Radim Bača
    Jan 23 at 21:13



















3












$begingroup$

Prefetch version does nothing with optimizer turned on.



Since you are not using m1 and m2, compiler will eliminate them. You get identical codegen see here



Branchs



Rearranging the conditions appear to speed up the code by 20%, making basic case almost the same as duo version. Note the conditions are already swapped for duo version, this may explain some of its performance. It appears the less than branch is more common and allow other condition to be skipped.



if (arr[m] < search) l = m + 1;
else if (arr[m] == search) return m;
else r = m - 1;





share|improve this answer











$endgroup$













  • $begingroup$
    True, I agree. The main issue here is why MSVC compiler is so bad on the basic version when compared to CLang/Gcc or when compared to the duo version (and note that it is only the case of MSVC on out of the CPU data).
    $endgroup$
    – Radim Bača
    Jan 25 at 7:04












  • $begingroup$
    @RadimBača oh I just had an interesting observation. See my edit.
    $endgroup$
    – wooooooooosh
    Jan 29 at 5:49










  • $begingroup$
    Yes, you are right! The order of the conditions seems to cause significant slowdown. There are still some questions left but thank you for the observation.
    $endgroup$
    – Radim Bača
    Jan 29 at 7:30










  • $begingroup$
    The careful calculation of mean isn't a performance hack - it's to ensure correctness (avoiding overflow).
    $endgroup$
    – Toby Speight
    Jan 29 at 9:28










  • $begingroup$
    See The curious case of Binary Search — The famous bug that remained undetected for 20 years, for instance.
    $endgroup$
    – Toby Speight
    Jan 29 at 9:35



















1












$begingroup$

In reply to "Are there any other possible significant optimizations that could help the algorithm?"



I don't know, and it may have no effect at all, or even reduce performance, but I would like to know if reducing the apparent number of array operations improves performance, for example change



    if (arr[m1] == search) return m1;
if (arr[m2] == search) return m2;

if (arr[m1] < search) {
if (arr[m2] < search) {


to



    Type am1 = arr[m1], am2 = arr[m2];
if (am1 == search) return m1;
if (am2 == search) return m2;

if (am1 < search) {
if (am2 < search) {


Apart from that, with regard to "Are there any other comments about my code?" could I suggest using say left and right instead of l and r as variable names. l is very easy to mistake for 1!






share|improve this answer









$endgroup$













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "196"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f212088%2fbinary-search-for-students%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    8












    $begingroup$

    How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).

    With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.



    As far as the code is concerned, there are a couple of things you can improve.



    You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
    Apart from that, you should avoid implicit conversions, e.g. here:



    int r = item_count - 1;


    You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) { come from?



    Edit: I've created one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).






    share|improve this answer











    $endgroup$













    • $begingroup$
      You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
      $endgroup$
      – Radim Bača
      Jan 23 at 20:12






    • 1




      $begingroup$
      Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
      $endgroup$
      – paler123
      Jan 23 at 20:29










    • $begingroup$
      You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
      $endgroup$
      – Radim Bača
      Jan 23 at 20:30












    • $begingroup$
      I have updated the question. I hope it is more clear now. Thank you for your recommendations.
      $endgroup$
      – Radim Bača
      Jan 23 at 21:06










    • $begingroup$
      Concerning your benchmark - it is not large enough and also the queries into the array are not random.
      $endgroup$
      – Radim Bača
      Jan 23 at 21:13
















    8












    $begingroup$

    How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).

    With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.



    As far as the code is concerned, there are a couple of things you can improve.



    You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
    Apart from that, you should avoid implicit conversions, e.g. here:



    int r = item_count - 1;


    You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) { come from?



    Edit: I've created one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).






    share|improve this answer











    $endgroup$













    • $begingroup$
      You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
      $endgroup$
      – Radim Bača
      Jan 23 at 20:12






    • 1




      $begingroup$
      Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
      $endgroup$
      – paler123
      Jan 23 at 20:29










    • $begingroup$
      You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
      $endgroup$
      – Radim Bača
      Jan 23 at 20:30












    • $begingroup$
      I have updated the question. I hope it is more clear now. Thank you for your recommendations.
      $endgroup$
      – Radim Bača
      Jan 23 at 21:06










    • $begingroup$
      Concerning your benchmark - it is not large enough and also the queries into the array are not random.
      $endgroup$
      – Radim Bača
      Jan 23 at 21:13














    8












    8








    8





    $begingroup$

    How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).

    With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.



    As far as the code is concerned, there are a couple of things you can improve.



    You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
    Apart from that, you should avoid implicit conversions, e.g. here:



    int r = item_count - 1;


    You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) { come from?



    Edit: I've created one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).






    share|improve this answer











    $endgroup$



    How did you measure the performance of your code (test code, hardware, compiler, etc.)? Without that it is impossible to get into optimization. Depending on chosen compiler I was sort of able to reproduce your results, see this quick bench- for clang 7.0.0 the fancy algorithm is indeed faster. That being said with gcc the basic one is a lot faster (outperforming fancy algorithm in both clang and gcc).

    With that in mind, you can optimize your algorithm for different array sizes quite easily (benchmarking on your specific setup), by choosing fastest of the 3 for specific sizes.



    As far as the code is concerned, there are a couple of things you can improve.



    You could rewrite your algorithms so that they accept iterators, just like all the algorithms from standard library, that will drastically improve their flexibility without added effort (and most probably without degrading performance, but you should measure it anyway).
    Apart from that, you should avoid implicit conversions, e.g. here:



    int r = item_count - 1;


    You should also prefer to keep the bodies of your functions short, it's hard to understand what binarySearch_duo actually does (you could easily extract body of the loop etc). Also, avoid magic constants, where does the seven in: while (r - l > 7) { come from?



    Edit: I've created one more benchmark for this, this time as proposed by OP, that is with huge array size (actually a vector) and made the access to the elements of the array random, see here- this time with gcc fancy version is as good as basic. The lesson should probably be: compiler is usually at least as smart as you are :).







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Jan 24 at 20:26

























    answered Jan 23 at 19:37









    paler123paler123

    2665




    2665












    • $begingroup$
      You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
      $endgroup$
      – Radim Bača
      Jan 23 at 20:12






    • 1




      $begingroup$
      Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
      $endgroup$
      – paler123
      Jan 23 at 20:29










    • $begingroup$
      You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
      $endgroup$
      – Radim Bača
      Jan 23 at 20:30












    • $begingroup$
      I have updated the question. I hope it is more clear now. Thank you for your recommendations.
      $endgroup$
      – Radim Bača
      Jan 23 at 21:06










    • $begingroup$
      Concerning your benchmark - it is not large enough and also the queries into the array are not random.
      $endgroup$
      – Radim Bača
      Jan 23 at 21:13


















    • $begingroup$
      You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
      $endgroup$
      – Radim Bača
      Jan 23 at 20:12






    • 1




      $begingroup$
      Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
      $endgroup$
      – paler123
      Jan 23 at 20:29










    • $begingroup$
      You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
      $endgroup$
      – Radim Bača
      Jan 23 at 20:30












    • $begingroup$
      I have updated the question. I hope it is more clear now. Thank you for your recommendations.
      $endgroup$
      – Radim Bača
      Jan 23 at 21:06










    • $begingroup$
      Concerning your benchmark - it is not large enough and also the queries into the array are not random.
      $endgroup$
      – Radim Bača
      Jan 23 at 21:13
















    $begingroup$
    You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
    $endgroup$
    – Radim Bača
    Jan 23 at 20:12




    $begingroup$
    You are using a small array that fits into the L2 cache (the CPU type is unknown). You can try to extend the size of the array to 1M to see the fancy code to win. However, it is still quite a small array and L2 cache misses do not occur significantly often. The number 7 is not important for the performance, it can be any small number lager than 3. I selected 7 since it approximately corresponds to one 64 cache line.
    $endgroup$
    – Radim Bača
    Jan 23 at 20:12




    1




    1




    $begingroup$
    Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
    $endgroup$
    – paler123
    Jan 23 at 20:29




    $begingroup$
    Not necessarily, gcc even after increasing the count of elements in the array to 10^6 elements produces better times for basic implementation. What I'm trying to say is that your results will vary depending on your benchmark and if you won't provide reliable way to measure performance, then it's tough to recommend further optimisations.
    $endgroup$
    – paler123
    Jan 23 at 20:29












    $begingroup$
    You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
    $endgroup$
    – Radim Bača
    Jan 23 at 20:30






    $begingroup$
    You are right. It is true that I did not specify the benchmark clearly here. I'll update the question.
    $endgroup$
    – Radim Bača
    Jan 23 at 20:30














    $begingroup$
    I have updated the question. I hope it is more clear now. Thank you for your recommendations.
    $endgroup$
    – Radim Bača
    Jan 23 at 21:06




    $begingroup$
    I have updated the question. I hope it is more clear now. Thank you for your recommendations.
    $endgroup$
    – Radim Bača
    Jan 23 at 21:06












    $begingroup$
    Concerning your benchmark - it is not large enough and also the queries into the array are not random.
    $endgroup$
    – Radim Bača
    Jan 23 at 21:13




    $begingroup$
    Concerning your benchmark - it is not large enough and also the queries into the array are not random.
    $endgroup$
    – Radim Bača
    Jan 23 at 21:13













    3












    $begingroup$

    Prefetch version does nothing with optimizer turned on.



    Since you are not using m1 and m2, compiler will eliminate them. You get identical codegen see here



    Branchs



    Rearranging the conditions appear to speed up the code by 20%, making basic case almost the same as duo version. Note the conditions are already swapped for duo version, this may explain some of its performance. It appears the less than branch is more common and allow other condition to be skipped.



    if (arr[m] < search) l = m + 1;
    else if (arr[m] == search) return m;
    else r = m - 1;





    share|improve this answer











    $endgroup$













    • $begingroup$
      True, I agree. The main issue here is why MSVC compiler is so bad on the basic version when compared to CLang/Gcc or when compared to the duo version (and note that it is only the case of MSVC on out of the CPU data).
      $endgroup$
      – Radim Bača
      Jan 25 at 7:04












    • $begingroup$
      @RadimBača oh I just had an interesting observation. See my edit.
      $endgroup$
      – wooooooooosh
      Jan 29 at 5:49










    • $begingroup$
      Yes, you are right! The order of the conditions seems to cause significant slowdown. There are still some questions left but thank you for the observation.
      $endgroup$
      – Radim Bača
      Jan 29 at 7:30










    • $begingroup$
      The careful calculation of mean isn't a performance hack - it's to ensure correctness (avoiding overflow).
      $endgroup$
      – Toby Speight
      Jan 29 at 9:28










    • $begingroup$
      See The curious case of Binary Search — The famous bug that remained undetected for 20 years, for instance.
      $endgroup$
      – Toby Speight
      Jan 29 at 9:35
















    3












    $begingroup$

    Prefetch version does nothing with optimizer turned on.



    Since you are not using m1 and m2, compiler will eliminate them. You get identical codegen see here



    Branchs



    Rearranging the conditions appear to speed up the code by 20%, making basic case almost the same as duo version. Note the conditions are already swapped for duo version, this may explain some of its performance. It appears the less than branch is more common and allow other condition to be skipped.



    if (arr[m] < search) l = m + 1;
    else if (arr[m] == search) return m;
    else r = m - 1;





    share|improve this answer











    $endgroup$













    • $begingroup$
      True, I agree. The main issue here is why MSVC compiler is so bad on the basic version when compared to CLang/Gcc or when compared to the duo version (and note that it is only the case of MSVC on out of the CPU data).
      $endgroup$
      – Radim Bača
      Jan 25 at 7:04












    • $begingroup$
      @RadimBača oh I just had an interesting observation. See my edit.
      $endgroup$
      – wooooooooosh
      Jan 29 at 5:49










    • $begingroup$
      Yes, you are right! The order of the conditions seems to cause significant slowdown. There are still some questions left but thank you for the observation.
      $endgroup$
      – Radim Bača
      Jan 29 at 7:30










    • $begingroup$
      The careful calculation of mean isn't a performance hack - it's to ensure correctness (avoiding overflow).
      $endgroup$
      – Toby Speight
      Jan 29 at 9:28










    • $begingroup$
      See The curious case of Binary Search — The famous bug that remained undetected for 20 years, for instance.
      $endgroup$
      – Toby Speight
      Jan 29 at 9:35














    3












    3








    3





    $begingroup$

    Prefetch version does nothing with optimizer turned on.



    Since you are not using m1 and m2, compiler will eliminate them. You get identical codegen see here



    Branchs



    Rearranging the conditions appear to speed up the code by 20%, making basic case almost the same as duo version. Note the conditions are already swapped for duo version, this may explain some of its performance. It appears the less than branch is more common and allow other condition to be skipped.



    if (arr[m] < search) l = m + 1;
    else if (arr[m] == search) return m;
    else r = m - 1;





    share|improve this answer











    $endgroup$



    Prefetch version does nothing with optimizer turned on.



    Since you are not using m1 and m2, compiler will eliminate them. You get identical codegen see here



    Branchs



    Rearranging the conditions appear to speed up the code by 20%, making basic case almost the same as duo version. Note the conditions are already swapped for duo version, this may explain some of its performance. It appears the less than branch is more common and allow other condition to be skipped.



    if (arr[m] < search) l = m + 1;
    else if (arr[m] == search) return m;
    else r = m - 1;






    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Jan 29 at 23:36

























    answered Jan 25 at 1:11









    woooooooooshwooooooooosh

    47827




    47827












    • $begingroup$
      True, I agree. The main issue here is why MSVC compiler is so bad on the basic version when compared to CLang/Gcc or when compared to the duo version (and note that it is only the case of MSVC on out of the CPU data).
      $endgroup$
      – Radim Bača
      Jan 25 at 7:04












    • $begingroup$
      @RadimBača oh I just had an interesting observation. See my edit.
      $endgroup$
      – wooooooooosh
      Jan 29 at 5:49










    • $begingroup$
      Yes, you are right! The order of the conditions seems to cause significant slowdown. There are still some questions left but thank you for the observation.
      $endgroup$
      – Radim Bača
      Jan 29 at 7:30










    • $begingroup$
      The careful calculation of mean isn't a performance hack - it's to ensure correctness (avoiding overflow).
      $endgroup$
      – Toby Speight
      Jan 29 at 9:28










    • $begingroup$
      See The curious case of Binary Search — The famous bug that remained undetected for 20 years, for instance.
      $endgroup$
      – Toby Speight
      Jan 29 at 9:35


















    • $begingroup$
      True, I agree. The main issue here is why MSVC compiler is so bad on the basic version when compared to CLang/Gcc or when compared to the duo version (and note that it is only the case of MSVC on out of the CPU data).
      $endgroup$
      – Radim Bača
      Jan 25 at 7:04












    • $begingroup$
      @RadimBača oh I just had an interesting observation. See my edit.
      $endgroup$
      – wooooooooosh
      Jan 29 at 5:49










    • $begingroup$
      Yes, you are right! The order of the conditions seems to cause significant slowdown. There are still some questions left but thank you for the observation.
      $endgroup$
      – Radim Bača
      Jan 29 at 7:30










    • $begingroup$
      The careful calculation of mean isn't a performance hack - it's to ensure correctness (avoiding overflow).
      $endgroup$
      – Toby Speight
      Jan 29 at 9:28










    • $begingroup$
      See The curious case of Binary Search — The famous bug that remained undetected for 20 years, for instance.
      $endgroup$
      – Toby Speight
      Jan 29 at 9:35
















    $begingroup$
    True, I agree. The main issue here is why MSVC compiler is so bad on the basic version when compared to CLang/Gcc or when compared to the duo version (and note that it is only the case of MSVC on out of the CPU data).
    $endgroup$
    – Radim Bača
    Jan 25 at 7:04






    $begingroup$
    True, I agree. The main issue here is why MSVC compiler is so bad on the basic version when compared to CLang/Gcc or when compared to the duo version (and note that it is only the case of MSVC on out of the CPU data).
    $endgroup$
    – Radim Bača
    Jan 25 at 7:04














    $begingroup$
    @RadimBača oh I just had an interesting observation. See my edit.
    $endgroup$
    – wooooooooosh
    Jan 29 at 5:49




    $begingroup$
    @RadimBača oh I just had an interesting observation. See my edit.
    $endgroup$
    – wooooooooosh
    Jan 29 at 5:49












    $begingroup$
    Yes, you are right! The order of the conditions seems to cause significant slowdown. There are still some questions left but thank you for the observation.
    $endgroup$
    – Radim Bača
    Jan 29 at 7:30




    $begingroup$
    Yes, you are right! The order of the conditions seems to cause significant slowdown. There are still some questions left but thank you for the observation.
    $endgroup$
    – Radim Bača
    Jan 29 at 7:30












    $begingroup$
    The careful calculation of mean isn't a performance hack - it's to ensure correctness (avoiding overflow).
    $endgroup$
    – Toby Speight
    Jan 29 at 9:28




    $begingroup$
    The careful calculation of mean isn't a performance hack - it's to ensure correctness (avoiding overflow).
    $endgroup$
    – Toby Speight
    Jan 29 at 9:28












    $begingroup$
    See The curious case of Binary Search — The famous bug that remained undetected for 20 years, for instance.
    $endgroup$
    – Toby Speight
    Jan 29 at 9:35




    $begingroup$
    See The curious case of Binary Search — The famous bug that remained undetected for 20 years, for instance.
    $endgroup$
    – Toby Speight
    Jan 29 at 9:35











    1












    $begingroup$

    In reply to "Are there any other possible significant optimizations that could help the algorithm?"



    I don't know, and it may have no effect at all, or even reduce performance, but I would like to know if reducing the apparent number of array operations improves performance, for example change



        if (arr[m1] == search) return m1;
    if (arr[m2] == search) return m2;

    if (arr[m1] < search) {
    if (arr[m2] < search) {


    to



        Type am1 = arr[m1], am2 = arr[m2];
    if (am1 == search) return m1;
    if (am2 == search) return m2;

    if (am1 < search) {
    if (am2 < search) {


    Apart from that, with regard to "Are there any other comments about my code?" could I suggest using say left and right instead of l and r as variable names. l is very easy to mistake for 1!






    share|improve this answer









    $endgroup$


















      1












      $begingroup$

      In reply to "Are there any other possible significant optimizations that could help the algorithm?"



      I don't know, and it may have no effect at all, or even reduce performance, but I would like to know if reducing the apparent number of array operations improves performance, for example change



          if (arr[m1] == search) return m1;
      if (arr[m2] == search) return m2;

      if (arr[m1] < search) {
      if (arr[m2] < search) {


      to



          Type am1 = arr[m1], am2 = arr[m2];
      if (am1 == search) return m1;
      if (am2 == search) return m2;

      if (am1 < search) {
      if (am2 < search) {


      Apart from that, with regard to "Are there any other comments about my code?" could I suggest using say left and right instead of l and r as variable names. l is very easy to mistake for 1!






      share|improve this answer









      $endgroup$
















        1












        1








        1





        $begingroup$

        In reply to "Are there any other possible significant optimizations that could help the algorithm?"



        I don't know, and it may have no effect at all, or even reduce performance, but I would like to know if reducing the apparent number of array operations improves performance, for example change



            if (arr[m1] == search) return m1;
        if (arr[m2] == search) return m2;

        if (arr[m1] < search) {
        if (arr[m2] < search) {


        to



            Type am1 = arr[m1], am2 = arr[m2];
        if (am1 == search) return m1;
        if (am2 == search) return m2;

        if (am1 < search) {
        if (am2 < search) {


        Apart from that, with regard to "Are there any other comments about my code?" could I suggest using say left and right instead of l and r as variable names. l is very easy to mistake for 1!






        share|improve this answer









        $endgroup$



        In reply to "Are there any other possible significant optimizations that could help the algorithm?"



        I don't know, and it may have no effect at all, or even reduce performance, but I would like to know if reducing the apparent number of array operations improves performance, for example change



            if (arr[m1] == search) return m1;
        if (arr[m2] == search) return m2;

        if (arr[m1] < search) {
        if (arr[m2] < search) {


        to



            Type am1 = arr[m1], am2 = arr[m2];
        if (am1 == search) return m1;
        if (am2 == search) return m2;

        if (am1 < search) {
        if (am2 < search) {


        Apart from that, with regard to "Are there any other comments about my code?" could I suggest using say left and right instead of l and r as variable names. l is very easy to mistake for 1!







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Jan 23 at 22:23









        George BarwoodGeorge Barwood

        4309




        4309






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Code Review Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f212088%2fbinary-search-for-students%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Probability when a professor distributes a quiz and homework assignment to a class of n students.

            Aardman Animations

            Are they similar matrix