Dealing with large lists of numbers











up vote
0
down vote

favorite












I am looking for references that discuss computer programs handling a large collection of numbers that do not fit in machine memory.



Some context:

I am working with GNU multi precision library aka gmplib (double precision was insufficient) and generating a list of numbers; a lot of the numbers being generated are duplicates (some room for improvement, but still lots of duplicates). One of the small lists has ~300,000,000 unique points but I would like to work with more. Afterwards, I need to search through the list and make sure there are no duplicates near the defined precision, and then some further calculation.



Thoughts on my problem:

Naively, just write to a file and store points there. But then I need to search through the list, so perhaps I should keep the list sorted, and at that point, maybe it's just easier to use a database.



So I'm looking into how other people have handled this, maybe Mathematica handles this easily and transparently (my in-memory version, the kernel died after using all machine memory), maybe a C program interfacing to mysql is the best solution (this is where I am now, and it is slow, I am spending time on DB optimization), maybe "get a machine with more RAM" is the best solution, etc. (of course, arbitrary "best" depending on the problem).



References are preferred (I think this question is somewhat general), or even just comments pointing to the right collection of keywords to search, but I welcome any feedback for my specific problem. Thanks.










share|cite|improve this question
























  • This is a question of programming, not mathematics. You would have the same kind of problems if you were generating character strings instead of multiprecision numbers.
    – Somos
    Nov 16 at 20:40










  • @Somos Yes, I'm storing in the database as character strings. But the point of the question is to learn from others who have encountered similar challenges doing similar mathematical work. I am aware of generic data store solutions, but was hoping for something more domain specific and provided some context for that. Perhaps you have suggestions of how I can better ask this question to achieve those kinds of answers?
    – Burnsba
    Nov 16 at 21:11















up vote
0
down vote

favorite












I am looking for references that discuss computer programs handling a large collection of numbers that do not fit in machine memory.



Some context:

I am working with GNU multi precision library aka gmplib (double precision was insufficient) and generating a list of numbers; a lot of the numbers being generated are duplicates (some room for improvement, but still lots of duplicates). One of the small lists has ~300,000,000 unique points but I would like to work with more. Afterwards, I need to search through the list and make sure there are no duplicates near the defined precision, and then some further calculation.



Thoughts on my problem:

Naively, just write to a file and store points there. But then I need to search through the list, so perhaps I should keep the list sorted, and at that point, maybe it's just easier to use a database.



So I'm looking into how other people have handled this, maybe Mathematica handles this easily and transparently (my in-memory version, the kernel died after using all machine memory), maybe a C program interfacing to mysql is the best solution (this is where I am now, and it is slow, I am spending time on DB optimization), maybe "get a machine with more RAM" is the best solution, etc. (of course, arbitrary "best" depending on the problem).



References are preferred (I think this question is somewhat general), or even just comments pointing to the right collection of keywords to search, but I welcome any feedback for my specific problem. Thanks.










share|cite|improve this question
























  • This is a question of programming, not mathematics. You would have the same kind of problems if you were generating character strings instead of multiprecision numbers.
    – Somos
    Nov 16 at 20:40










  • @Somos Yes, I'm storing in the database as character strings. But the point of the question is to learn from others who have encountered similar challenges doing similar mathematical work. I am aware of generic data store solutions, but was hoping for something more domain specific and provided some context for that. Perhaps you have suggestions of how I can better ask this question to achieve those kinds of answers?
    – Burnsba
    Nov 16 at 21:11













up vote
0
down vote

favorite









up vote
0
down vote

favorite











I am looking for references that discuss computer programs handling a large collection of numbers that do not fit in machine memory.



Some context:

I am working with GNU multi precision library aka gmplib (double precision was insufficient) and generating a list of numbers; a lot of the numbers being generated are duplicates (some room for improvement, but still lots of duplicates). One of the small lists has ~300,000,000 unique points but I would like to work with more. Afterwards, I need to search through the list and make sure there are no duplicates near the defined precision, and then some further calculation.



Thoughts on my problem:

Naively, just write to a file and store points there. But then I need to search through the list, so perhaps I should keep the list sorted, and at that point, maybe it's just easier to use a database.



So I'm looking into how other people have handled this, maybe Mathematica handles this easily and transparently (my in-memory version, the kernel died after using all machine memory), maybe a C program interfacing to mysql is the best solution (this is where I am now, and it is slow, I am spending time on DB optimization), maybe "get a machine with more RAM" is the best solution, etc. (of course, arbitrary "best" depending on the problem).



References are preferred (I think this question is somewhat general), or even just comments pointing to the right collection of keywords to search, but I welcome any feedback for my specific problem. Thanks.










share|cite|improve this question















I am looking for references that discuss computer programs handling a large collection of numbers that do not fit in machine memory.



Some context:

I am working with GNU multi precision library aka gmplib (double precision was insufficient) and generating a list of numbers; a lot of the numbers being generated are duplicates (some room for improvement, but still lots of duplicates). One of the small lists has ~300,000,000 unique points but I would like to work with more. Afterwards, I need to search through the list and make sure there are no duplicates near the defined precision, and then some further calculation.



Thoughts on my problem:

Naively, just write to a file and store points there. But then I need to search through the list, so perhaps I should keep the list sorted, and at that point, maybe it's just easier to use a database.



So I'm looking into how other people have handled this, maybe Mathematica handles this easily and transparently (my in-memory version, the kernel died after using all machine memory), maybe a C program interfacing to mysql is the best solution (this is where I am now, and it is slow, I am spending time on DB optimization), maybe "get a machine with more RAM" is the best solution, etc. (of course, arbitrary "best" depending on the problem).



References are preferred (I think this question is somewhat general), or even just comments pointing to the right collection of keywords to search, but I welcome any feedback for my specific problem. Thanks.







sequences-and-series reference-request






share|cite|improve this question















share|cite|improve this question













share|cite|improve this question




share|cite|improve this question








edited Nov 16 at 19:49

























asked Nov 16 at 19:39









Burnsba

412311




412311












  • This is a question of programming, not mathematics. You would have the same kind of problems if you were generating character strings instead of multiprecision numbers.
    – Somos
    Nov 16 at 20:40










  • @Somos Yes, I'm storing in the database as character strings. But the point of the question is to learn from others who have encountered similar challenges doing similar mathematical work. I am aware of generic data store solutions, but was hoping for something more domain specific and provided some context for that. Perhaps you have suggestions of how I can better ask this question to achieve those kinds of answers?
    – Burnsba
    Nov 16 at 21:11


















  • This is a question of programming, not mathematics. You would have the same kind of problems if you were generating character strings instead of multiprecision numbers.
    – Somos
    Nov 16 at 20:40










  • @Somos Yes, I'm storing in the database as character strings. But the point of the question is to learn from others who have encountered similar challenges doing similar mathematical work. I am aware of generic data store solutions, but was hoping for something more domain specific and provided some context for that. Perhaps you have suggestions of how I can better ask this question to achieve those kinds of answers?
    – Burnsba
    Nov 16 at 21:11
















This is a question of programming, not mathematics. You would have the same kind of problems if you were generating character strings instead of multiprecision numbers.
– Somos
Nov 16 at 20:40




This is a question of programming, not mathematics. You would have the same kind of problems if you were generating character strings instead of multiprecision numbers.
– Somos
Nov 16 at 20:40












@Somos Yes, I'm storing in the database as character strings. But the point of the question is to learn from others who have encountered similar challenges doing similar mathematical work. I am aware of generic data store solutions, but was hoping for something more domain specific and provided some context for that. Perhaps you have suggestions of how I can better ask this question to achieve those kinds of answers?
– Burnsba
Nov 16 at 21:11




@Somos Yes, I'm storing in the database as character strings. But the point of the question is to learn from others who have encountered similar challenges doing similar mathematical work. I am aware of generic data store solutions, but was hoping for something more domain specific and provided some context for that. Perhaps you have suggestions of how I can better ask this question to achieve those kinds of answers?
– Burnsba
Nov 16 at 21:11










2 Answers
2






active

oldest

votes

















up vote
0
down vote













You have too many numbers to store in RAM. Hence you have to output them (in native binary form) to disk files first. You don't want the files to be too big because searching takes more time. The ideal would be to determine intervals which contain at most a maximum count of numbers of your data.
As a first pass, write all the numbers to one big master file. Then run a scan of that file to get an idea of the distribution of the numbers. Now you can determine intervals without too many numbers in them. On a second pass, scan the master file and output to secondary files each containing only numbers in one interval. Number of intervals is numer of secondary files. From here, one by one, you can read each secondary file into memory and process it there before going on to the next file.






share|cite|improve this answer




























    up vote
    0
    down vote













    The problem is that there is too much work for one machine. I thought I could offload data storage, losing latency to gain space. This is basically a variety of "get a better machine" (vertical scaling). However, there seems to be a limit to what is practical to scale vertically in a kind of all-or-none way. That is, trying to use disk-based storage in place of RAM isn't really practical, and isn't really done (to my limited knowledge; of course, this is commonly handled transparently by the host operating system).



    It seems the logical evolution to optimize a problem is to break the problem into smaller pieces (horizontal scaling). So I'm thinking I will have to adjust my approach to one based on parallel/distributed computing.



    I will mention a few bullet points and starter references, but at this point I think there is not a "real world" on topic (for math.SE) answer like I had originally hoped for (off loading datastore for a single monolithic application). That is, there are a number of papers and even journals that cover distributed and parallel computing, but nothing specifically on solving problems in the manner like I ask about above. With that, I'm not sure if this should be closed as off topic.





    Probably the most applicable aspects of a distributed computing project are checkpoints and load balancing.



    Checkpointing




    In addition to their primary roles in scientific
    computation, cores must also perform checkpointing
    by saving the exact state of the calculation to disk
    periodically, because a volunteer's PC may crash or be
    turned off at any time. Since work units can run for
    days, checkpointing reduces the amount of work that is
    lost to several minutes when the client resumes
    operation. Most applications do not normally include
    checkpointing, or do so without the complete state
    needed, so adding this capability is the most significant
    part of making code that is designed to run on a single
    system run in a distributed system [1]




    Some more discussion on Fault-tolerance and thread migration in [3].



    Load balancing




    Interestingly, executing these large, decoupled applications avoid many of the common bottlenecks of traditional parallel computing. For example, load balancing is a problem only for small problems, not large ones. With large problems the longest work unit is still only a small fraction of the time the total job requires, which leads to efficient tiling regardless of the heterogeneity in the system. [2]




    References



    [1] A. L. Beberg, V. S. Pande, D. L. Ensign, S. Khaliq and G. Jayachandran, "Folding@home: Lessons from eight years of volunteer distributed computing," 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Rome, 2009, pp. 1-8. url: doi.ieeecomputersociety.org/10.1109/IPDPS.2009.5160922



    [2] Andrew Chien, Brad Calder, Stephen Elbert, Karan Bhatia, "Entropia: architecture and performance of an enterprise desktop grid system," Journal of Parallel and Distributed Computing, Vol. 63, Iss. 5, 2003, pp 597-610. https://doi.org/10.1016/S0743-7315(03)00006-6



    [3] Michael O. Neary, Bernd O. Christiansen, Peter Cappello, Klaus E. Schauser, "Javelin: Parallel computing on the internet," Future Generation Computer Systems, Vol. 15, Iss. 5–6, 1999, pp 659-674. https://doi.org/10.1016/S0167-739X(99)00017-5






    share|cite|improve this answer





















      Your Answer





      StackExchange.ifUsing("editor", function () {
      return StackExchange.using("mathjaxEditing", function () {
      StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
      StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
      });
      });
      }, "mathjax-editing");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "69"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      noCode: true, onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














       

      draft saved


      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3001561%2fdealing-with-large-lists-of-numbers%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes








      up vote
      0
      down vote













      You have too many numbers to store in RAM. Hence you have to output them (in native binary form) to disk files first. You don't want the files to be too big because searching takes more time. The ideal would be to determine intervals which contain at most a maximum count of numbers of your data.
      As a first pass, write all the numbers to one big master file. Then run a scan of that file to get an idea of the distribution of the numbers. Now you can determine intervals without too many numbers in them. On a second pass, scan the master file and output to secondary files each containing only numbers in one interval. Number of intervals is numer of secondary files. From here, one by one, you can read each secondary file into memory and process it there before going on to the next file.






      share|cite|improve this answer

























        up vote
        0
        down vote













        You have too many numbers to store in RAM. Hence you have to output them (in native binary form) to disk files first. You don't want the files to be too big because searching takes more time. The ideal would be to determine intervals which contain at most a maximum count of numbers of your data.
        As a first pass, write all the numbers to one big master file. Then run a scan of that file to get an idea of the distribution of the numbers. Now you can determine intervals without too many numbers in them. On a second pass, scan the master file and output to secondary files each containing only numbers in one interval. Number of intervals is numer of secondary files. From here, one by one, you can read each secondary file into memory and process it there before going on to the next file.






        share|cite|improve this answer























          up vote
          0
          down vote










          up vote
          0
          down vote









          You have too many numbers to store in RAM. Hence you have to output them (in native binary form) to disk files first. You don't want the files to be too big because searching takes more time. The ideal would be to determine intervals which contain at most a maximum count of numbers of your data.
          As a first pass, write all the numbers to one big master file. Then run a scan of that file to get an idea of the distribution of the numbers. Now you can determine intervals without too many numbers in them. On a second pass, scan the master file and output to secondary files each containing only numbers in one interval. Number of intervals is numer of secondary files. From here, one by one, you can read each secondary file into memory and process it there before going on to the next file.






          share|cite|improve this answer












          You have too many numbers to store in RAM. Hence you have to output them (in native binary form) to disk files first. You don't want the files to be too big because searching takes more time. The ideal would be to determine intervals which contain at most a maximum count of numbers of your data.
          As a first pass, write all the numbers to one big master file. Then run a scan of that file to get an idea of the distribution of the numbers. Now you can determine intervals without too many numbers in them. On a second pass, scan the master file and output to secondary files each containing only numbers in one interval. Number of intervals is numer of secondary files. From here, one by one, you can read each secondary file into memory and process it there before going on to the next file.







          share|cite|improve this answer












          share|cite|improve this answer



          share|cite|improve this answer










          answered Nov 16 at 21:20









          Somos

          12.7k11034




          12.7k11034






















              up vote
              0
              down vote













              The problem is that there is too much work for one machine. I thought I could offload data storage, losing latency to gain space. This is basically a variety of "get a better machine" (vertical scaling). However, there seems to be a limit to what is practical to scale vertically in a kind of all-or-none way. That is, trying to use disk-based storage in place of RAM isn't really practical, and isn't really done (to my limited knowledge; of course, this is commonly handled transparently by the host operating system).



              It seems the logical evolution to optimize a problem is to break the problem into smaller pieces (horizontal scaling). So I'm thinking I will have to adjust my approach to one based on parallel/distributed computing.



              I will mention a few bullet points and starter references, but at this point I think there is not a "real world" on topic (for math.SE) answer like I had originally hoped for (off loading datastore for a single monolithic application). That is, there are a number of papers and even journals that cover distributed and parallel computing, but nothing specifically on solving problems in the manner like I ask about above. With that, I'm not sure if this should be closed as off topic.





              Probably the most applicable aspects of a distributed computing project are checkpoints and load balancing.



              Checkpointing




              In addition to their primary roles in scientific
              computation, cores must also perform checkpointing
              by saving the exact state of the calculation to disk
              periodically, because a volunteer's PC may crash or be
              turned off at any time. Since work units can run for
              days, checkpointing reduces the amount of work that is
              lost to several minutes when the client resumes
              operation. Most applications do not normally include
              checkpointing, or do so without the complete state
              needed, so adding this capability is the most significant
              part of making code that is designed to run on a single
              system run in a distributed system [1]




              Some more discussion on Fault-tolerance and thread migration in [3].



              Load balancing




              Interestingly, executing these large, decoupled applications avoid many of the common bottlenecks of traditional parallel computing. For example, load balancing is a problem only for small problems, not large ones. With large problems the longest work unit is still only a small fraction of the time the total job requires, which leads to efficient tiling regardless of the heterogeneity in the system. [2]




              References



              [1] A. L. Beberg, V. S. Pande, D. L. Ensign, S. Khaliq and G. Jayachandran, "Folding@home: Lessons from eight years of volunteer distributed computing," 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Rome, 2009, pp. 1-8. url: doi.ieeecomputersociety.org/10.1109/IPDPS.2009.5160922



              [2] Andrew Chien, Brad Calder, Stephen Elbert, Karan Bhatia, "Entropia: architecture and performance of an enterprise desktop grid system," Journal of Parallel and Distributed Computing, Vol. 63, Iss. 5, 2003, pp 597-610. https://doi.org/10.1016/S0743-7315(03)00006-6



              [3] Michael O. Neary, Bernd O. Christiansen, Peter Cappello, Klaus E. Schauser, "Javelin: Parallel computing on the internet," Future Generation Computer Systems, Vol. 15, Iss. 5–6, 1999, pp 659-674. https://doi.org/10.1016/S0167-739X(99)00017-5






              share|cite|improve this answer

























                up vote
                0
                down vote













                The problem is that there is too much work for one machine. I thought I could offload data storage, losing latency to gain space. This is basically a variety of "get a better machine" (vertical scaling). However, there seems to be a limit to what is practical to scale vertically in a kind of all-or-none way. That is, trying to use disk-based storage in place of RAM isn't really practical, and isn't really done (to my limited knowledge; of course, this is commonly handled transparently by the host operating system).



                It seems the logical evolution to optimize a problem is to break the problem into smaller pieces (horizontal scaling). So I'm thinking I will have to adjust my approach to one based on parallel/distributed computing.



                I will mention a few bullet points and starter references, but at this point I think there is not a "real world" on topic (for math.SE) answer like I had originally hoped for (off loading datastore for a single monolithic application). That is, there are a number of papers and even journals that cover distributed and parallel computing, but nothing specifically on solving problems in the manner like I ask about above. With that, I'm not sure if this should be closed as off topic.





                Probably the most applicable aspects of a distributed computing project are checkpoints and load balancing.



                Checkpointing




                In addition to their primary roles in scientific
                computation, cores must also perform checkpointing
                by saving the exact state of the calculation to disk
                periodically, because a volunteer's PC may crash or be
                turned off at any time. Since work units can run for
                days, checkpointing reduces the amount of work that is
                lost to several minutes when the client resumes
                operation. Most applications do not normally include
                checkpointing, or do so without the complete state
                needed, so adding this capability is the most significant
                part of making code that is designed to run on a single
                system run in a distributed system [1]




                Some more discussion on Fault-tolerance and thread migration in [3].



                Load balancing




                Interestingly, executing these large, decoupled applications avoid many of the common bottlenecks of traditional parallel computing. For example, load balancing is a problem only for small problems, not large ones. With large problems the longest work unit is still only a small fraction of the time the total job requires, which leads to efficient tiling regardless of the heterogeneity in the system. [2]




                References



                [1] A. L. Beberg, V. S. Pande, D. L. Ensign, S. Khaliq and G. Jayachandran, "Folding@home: Lessons from eight years of volunteer distributed computing," 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Rome, 2009, pp. 1-8. url: doi.ieeecomputersociety.org/10.1109/IPDPS.2009.5160922



                [2] Andrew Chien, Brad Calder, Stephen Elbert, Karan Bhatia, "Entropia: architecture and performance of an enterprise desktop grid system," Journal of Parallel and Distributed Computing, Vol. 63, Iss. 5, 2003, pp 597-610. https://doi.org/10.1016/S0743-7315(03)00006-6



                [3] Michael O. Neary, Bernd O. Christiansen, Peter Cappello, Klaus E. Schauser, "Javelin: Parallel computing on the internet," Future Generation Computer Systems, Vol. 15, Iss. 5–6, 1999, pp 659-674. https://doi.org/10.1016/S0167-739X(99)00017-5






                share|cite|improve this answer























                  up vote
                  0
                  down vote










                  up vote
                  0
                  down vote









                  The problem is that there is too much work for one machine. I thought I could offload data storage, losing latency to gain space. This is basically a variety of "get a better machine" (vertical scaling). However, there seems to be a limit to what is practical to scale vertically in a kind of all-or-none way. That is, trying to use disk-based storage in place of RAM isn't really practical, and isn't really done (to my limited knowledge; of course, this is commonly handled transparently by the host operating system).



                  It seems the logical evolution to optimize a problem is to break the problem into smaller pieces (horizontal scaling). So I'm thinking I will have to adjust my approach to one based on parallel/distributed computing.



                  I will mention a few bullet points and starter references, but at this point I think there is not a "real world" on topic (for math.SE) answer like I had originally hoped for (off loading datastore for a single monolithic application). That is, there are a number of papers and even journals that cover distributed and parallel computing, but nothing specifically on solving problems in the manner like I ask about above. With that, I'm not sure if this should be closed as off topic.





                  Probably the most applicable aspects of a distributed computing project are checkpoints and load balancing.



                  Checkpointing




                  In addition to their primary roles in scientific
                  computation, cores must also perform checkpointing
                  by saving the exact state of the calculation to disk
                  periodically, because a volunteer's PC may crash or be
                  turned off at any time. Since work units can run for
                  days, checkpointing reduces the amount of work that is
                  lost to several minutes when the client resumes
                  operation. Most applications do not normally include
                  checkpointing, or do so without the complete state
                  needed, so adding this capability is the most significant
                  part of making code that is designed to run on a single
                  system run in a distributed system [1]




                  Some more discussion on Fault-tolerance and thread migration in [3].



                  Load balancing




                  Interestingly, executing these large, decoupled applications avoid many of the common bottlenecks of traditional parallel computing. For example, load balancing is a problem only for small problems, not large ones. With large problems the longest work unit is still only a small fraction of the time the total job requires, which leads to efficient tiling regardless of the heterogeneity in the system. [2]




                  References



                  [1] A. L. Beberg, V. S. Pande, D. L. Ensign, S. Khaliq and G. Jayachandran, "Folding@home: Lessons from eight years of volunteer distributed computing," 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Rome, 2009, pp. 1-8. url: doi.ieeecomputersociety.org/10.1109/IPDPS.2009.5160922



                  [2] Andrew Chien, Brad Calder, Stephen Elbert, Karan Bhatia, "Entropia: architecture and performance of an enterprise desktop grid system," Journal of Parallel and Distributed Computing, Vol. 63, Iss. 5, 2003, pp 597-610. https://doi.org/10.1016/S0743-7315(03)00006-6



                  [3] Michael O. Neary, Bernd O. Christiansen, Peter Cappello, Klaus E. Schauser, "Javelin: Parallel computing on the internet," Future Generation Computer Systems, Vol. 15, Iss. 5–6, 1999, pp 659-674. https://doi.org/10.1016/S0167-739X(99)00017-5






                  share|cite|improve this answer












                  The problem is that there is too much work for one machine. I thought I could offload data storage, losing latency to gain space. This is basically a variety of "get a better machine" (vertical scaling). However, there seems to be a limit to what is practical to scale vertically in a kind of all-or-none way. That is, trying to use disk-based storage in place of RAM isn't really practical, and isn't really done (to my limited knowledge; of course, this is commonly handled transparently by the host operating system).



                  It seems the logical evolution to optimize a problem is to break the problem into smaller pieces (horizontal scaling). So I'm thinking I will have to adjust my approach to one based on parallel/distributed computing.



                  I will mention a few bullet points and starter references, but at this point I think there is not a "real world" on topic (for math.SE) answer like I had originally hoped for (off loading datastore for a single monolithic application). That is, there are a number of papers and even journals that cover distributed and parallel computing, but nothing specifically on solving problems in the manner like I ask about above. With that, I'm not sure if this should be closed as off topic.





                  Probably the most applicable aspects of a distributed computing project are checkpoints and load balancing.



                  Checkpointing




                  In addition to their primary roles in scientific
                  computation, cores must also perform checkpointing
                  by saving the exact state of the calculation to disk
                  periodically, because a volunteer's PC may crash or be
                  turned off at any time. Since work units can run for
                  days, checkpointing reduces the amount of work that is
                  lost to several minutes when the client resumes
                  operation. Most applications do not normally include
                  checkpointing, or do so without the complete state
                  needed, so adding this capability is the most significant
                  part of making code that is designed to run on a single
                  system run in a distributed system [1]




                  Some more discussion on Fault-tolerance and thread migration in [3].



                  Load balancing




                  Interestingly, executing these large, decoupled applications avoid many of the common bottlenecks of traditional parallel computing. For example, load balancing is a problem only for small problems, not large ones. With large problems the longest work unit is still only a small fraction of the time the total job requires, which leads to efficient tiling regardless of the heterogeneity in the system. [2]




                  References



                  [1] A. L. Beberg, V. S. Pande, D. L. Ensign, S. Khaliq and G. Jayachandran, "Folding@home: Lessons from eight years of volunteer distributed computing," 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Rome, 2009, pp. 1-8. url: doi.ieeecomputersociety.org/10.1109/IPDPS.2009.5160922



                  [2] Andrew Chien, Brad Calder, Stephen Elbert, Karan Bhatia, "Entropia: architecture and performance of an enterprise desktop grid system," Journal of Parallel and Distributed Computing, Vol. 63, Iss. 5, 2003, pp 597-610. https://doi.org/10.1016/S0743-7315(03)00006-6



                  [3] Michael O. Neary, Bernd O. Christiansen, Peter Cappello, Klaus E. Schauser, "Javelin: Parallel computing on the internet," Future Generation Computer Systems, Vol. 15, Iss. 5–6, 1999, pp 659-674. https://doi.org/10.1016/S0167-739X(99)00017-5







                  share|cite|improve this answer












                  share|cite|improve this answer



                  share|cite|improve this answer










                  answered Nov 19 at 17:42









                  Burnsba

                  412311




                  412311






























                       

                      draft saved


                      draft discarded



















































                       


                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3001561%2fdealing-with-large-lists-of-numbers%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Index of /

                      Tribalistas

                      Listed building