Dealing with large lists of numbers

up vote
0
down vote

favorite

I am looking for references that discuss computer programs handling a large collection of numbers that do not fit in machine memory.

Some context:

I am working with GNU multi precision library aka gmplib (double precision was insufficient) and generating a list of numbers; a lot of the numbers being generated are duplicates (some room for improvement, but still lots of duplicates). One of the small lists has ~300,000,000 unique points but I would like to work with more. Afterwards, I need to search through the list and make sure there are no duplicates near the defined precision, and then some further calculation.

Thoughts on my problem:

Naively, just write to a file and store points there. But then I need to search through the list, so perhaps I should keep the list sorted, and at that point, maybe it's just easier to use a database.

So I'm looking into how other people have handled this, maybe Mathematica handles this easily and transparently (my in-memory version, the kernel died after using all machine memory), maybe a C program interfacing to mysql is the best solution (this is where I am now, and it is slow, I am spending time on DB optimization), maybe "get a machine with more RAM" is the best solution, etc. (of course, arbitrary "best" depending on the problem).

References are preferred (I think this question is somewhat general), or even just comments pointing to the right collection of keywords to search, but I welcome any feedback for my specific problem. Thanks.

edited Nov 16 at 19:49

asked Nov 16 at 19:39

Burnsba

412311

This is a question of programming, not mathematics. You would have the same kind of problems if you were generating character strings instead of multiprecision numbers.
– Somos
Nov 16 at 20:40

@Somos Yes, I'm storing in the database as character strings. But the point of the question is to learn from others who have encountered similar challenges doing similar mathematical work. I am aware of generic data store solutions, but was hoping for something more domain specific and provided some context for that. Perhaps you have suggestions of how I can better ask this question to achieve those kinds of answers?
– Burnsba
Nov 16 at 21:11

add a comment |

up vote
0
down vote

favorite

I am looking for references that discuss computer programs handling a large collection of numbers that do not fit in machine memory.

edited Nov 16 at 19:49

asked Nov 16 at 19:39

Burnsba

412311

This is a question of programming, not mathematics. You would have the same kind of problems if you were generating character strings instead of multiprecision numbers.
– Somos
Nov 16 at 20:40

@Somos Yes, I'm storing in the database as character strings. But the point of the question is to learn from others who have encountered similar challenges doing similar mathematical work. I am aware of generic data store solutions, but was hoping for something more domain specific and provided some context for that. Perhaps you have suggestions of how I can better ask this question to achieve those kinds of answers?
– Burnsba
Nov 16 at 21:11

add a comment |

up vote
0
down vote

favorite

I am looking for references that discuss computer programs handling a large collection of numbers that do not fit in machine memory.

edited Nov 16 at 19:49

asked Nov 16 at 19:39

Burnsba

412311

I am looking for references that discuss computer programs handling a large collection of numbers that do not fit in machine memory.

sequences-and-series reference-request

edited Nov 16 at 19:49

asked Nov 16 at 19:39

Burnsba

412311

edited Nov 16 at 19:49

asked Nov 16 at 19:39

Burnsba

412311

edited Nov 16 at 19:49

asked Nov 16 at 19:39

Burnsba

412311

asked Nov 16 at 19:39

Burnsba

412311

asked Nov 16 at 19:39

Burnsba

412311

This is a question of programming, not mathematics. You would have the same kind of problems if you were generating character strings instead of multiprecision numbers.
– Somos
Nov 16 at 20:40

@Somos Yes, I'm storing in the database as character strings. But the point of the question is to learn from others who have encountered similar challenges doing similar mathematical work. I am aware of generic data store solutions, but was hoping for something more domain specific and provided some context for that. Perhaps you have suggestions of how I can better ask this question to achieve those kinds of answers?
– Burnsba
Nov 16 at 21:11

add a comment |

This is a question of programming, not mathematics. You would have the same kind of problems if you were generating character strings instead of multiprecision numbers.
– Somos
Nov 16 at 20:40

@Somos Yes, I'm storing in the database as character strings. But the point of the question is to learn from others who have encountered similar challenges doing similar mathematical work. I am aware of generic data store solutions, but was hoping for something more domain specific and provided some context for that. Perhaps you have suggestions of how I can better ask this question to achieve those kinds of answers?
– Burnsba
Nov 16 at 21:11

This is a question of programming, not mathematics. You would have the same kind of problems if you were generating character strings instead of multiprecision numbers.
– Somos
Nov 16 at 20:40

@Somos Yes, I'm storing in the database as character strings. But the point of the question is to learn from others who have encountered similar challenges doing similar mathematical work. I am aware of generic data store solutions, but was hoping for something more domain specific and provided some context for that. Perhaps you have suggestions of how I can better ask this question to achieve those kinds of answers?
– Burnsba
Nov 16 at 21:11

add a comment |

2 Answers
2

active

oldest

votes

up vote
0
down vote

You have too many numbers to store in RAM. Hence you have to output them (in native binary form) to disk files first. You don't want the files to be too big because searching takes more time. The ideal would be to determine intervals which contain at most a maximum count of numbers of your data.
As a first pass, write all the numbers to one big master file. Then run a scan of that file to get an idea of the distribution of the numbers. Now you can determine intervals without too many numbers in them. On a second pass, scan the master file and output to secondary files each containing only numbers in one interval. Number of intervals is numer of secondary files. From here, one by one, you can read each secondary file into memory and process it there before going on to the next file.

answered Nov 16 at 21:20

Somos

12.7k11034

add a comment |

up vote
0
down vote

The problem is that there is too much work for one machine. I thought I could offload data storage, losing latency to gain space. This is basically a variety of "get a better machine" (vertical scaling). However, there seems to be a limit to what is practical to scale vertically in a kind of all-or-none way. That is, trying to use disk-based storage in place of RAM isn't really practical, and isn't really done (to my limited knowledge; of course, this is commonly handled transparently by the host operating system).

It seems the logical evolution to optimize a problem is to break the problem into smaller pieces (horizontal scaling). So I'm thinking I will have to adjust my approach to one based on parallel/distributed computing.

I will mention a few bullet points and starter references, but at this point I think there is not a "real world" on topic (for math.SE) answer like I had originally hoped for (off loading datastore for a single monolithic application). That is, there are a number of papers and even journals that cover distributed and parallel computing, but nothing specifically on solving problems in the manner like I ask about above. With that, I'm not sure if this should be closed as off topic.

Probably the most applicable aspects of a distributed computing project are checkpoints and load balancing.

Checkpointing

In addition to their primary roles in scientific
computation, cores must also perform checkpointing
by saving the exact state of the calculation to disk
periodically, because a volunteer's PC may crash or be
turned off at any time. Since work units can run for
days, checkpointing reduces the amount of work that is
lost to several minutes when the client resumes
operation. Most applications do not normally include
checkpointing, or do so without the complete state
needed, so adding this capability is the most significant
part of making code that is designed to run on a single
system run in a distributed system [1]

Some more discussion on Fault-tolerance and thread migration in [3].

Load balancing

Interestingly, executing these large, decoupled applications avoid many of the common bottlenecks of traditional parallel computing. For example, load balancing is a problem only for small problems, not large ones. With large problems the longest work unit is still only a small fraction of the time the total job requires, which leads to efficient tiling regardless of the heterogeneity in the system. [2]

References

[1] A. L. Beberg, V. S. Pande, D. L. Ensign, S. Khaliq and G. Jayachandran, "Folding@home: Lessons from eight years of volunteer distributed computing," 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Rome, 2009, pp. 1-8. url: doi.ieeecomputersociety.org/10.1109/IPDPS.2009.5160922

[2] Andrew Chien, Brad Calder, Stephen Elbert, Karan Bhatia, "Entropia: architecture and performance of an enterprise desktop grid system," Journal of Parallel and Distributed Computing, Vol. 63, Iss. 5, 2003, pp 597-610. https://doi.org/10.1016/S0743-7315(03)00006-6

[3] Michael O. Neary, Bernd O. Christiansen, Peter Cappello, Klaus E. Schauser, "Javelin: Parallel computing on the internet," Future Generation Computer Systems, Vol. 15, Iss. 5–6, 1999, pp 659-674. https://doi.org/10.1016/S0167-739X(99)00017-5

answered Nov 19 at 17:42

Burnsba

412311

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "69"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f3001561%2fdealing-with-large-lists-of-numbers%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
0
down vote

answered Nov 16 at 21:20

Somos

12.7k11034

add a comment |

up vote
0
down vote

answered Nov 16 at 21:20

Somos

12.7k11034

add a comment |

up vote
0
down vote

answered Nov 16 at 21:20

Somos

12.7k11034

answered Nov 16 at 21:20

Somos

12.7k11034

answered Nov 16 at 21:20

Somos

12.7k11034

answered Nov 16 at 21:20

Somos

12.7k11034

answered Nov 16 at 21:20

Somos

12.7k11034

add a comment |

up vote
0
down vote

Probably the most applicable aspects of a distributed computing project are checkpoints and load balancing.

Checkpointing

In addition to their primary roles in scientific
computation, cores must also perform checkpointing
by saving the exact state of the calculation to disk
periodically, because a volunteer's PC may crash or be
turned off at any time. Since work units can run for
days, checkpointing reduces the amount of work that is
lost to several minutes when the client resumes
operation. Most applications do not normally include
checkpointing, or do so without the complete state
needed, so adding this capability is the most significant
part of making code that is designed to run on a single
system run in a distributed system [1]

Some more discussion on Fault-tolerance and thread migration in [3].

Load balancing

Interestingly, executing these large, decoupled applications avoid many of the common bottlenecks of traditional parallel computing. For example, load balancing is a problem only for small problems, not large ones. With large problems the longest work unit is still only a small fraction of the time the total job requires, which leads to efficient tiling regardless of the heterogeneity in the system. [2]

References

answered Nov 19 at 17:42

Burnsba

412311

add a comment |

up vote
0
down vote

Probably the most applicable aspects of a distributed computing project are checkpoints and load balancing.

Checkpointing

In addition to their primary roles in scientific
computation, cores must also perform checkpointing
by saving the exact state of the calculation to disk
periodically, because a volunteer's PC may crash or be
turned off at any time. Since work units can run for
days, checkpointing reduces the amount of work that is
lost to several minutes when the client resumes
operation. Most applications do not normally include
checkpointing, or do so without the complete state
needed, so adding this capability is the most significant
part of making code that is designed to run on a single
system run in a distributed system [1]

Some more discussion on Fault-tolerance and thread migration in [3].

Load balancing

Interestingly, executing these large, decoupled applications avoid many of the common bottlenecks of traditional parallel computing. For example, load balancing is a problem only for small problems, not large ones. With large problems the longest work unit is still only a small fraction of the time the total job requires, which leads to efficient tiling regardless of the heterogeneity in the system. [2]

References

answered Nov 19 at 17:42

Burnsba

412311

add a comment |

up vote
0
down vote

Probably the most applicable aspects of a distributed computing project are checkpoints and load balancing.

Checkpointing

In addition to their primary roles in scientific
computation, cores must also perform checkpointing
by saving the exact state of the calculation to disk
periodically, because a volunteer's PC may crash or be
turned off at any time. Since work units can run for
days, checkpointing reduces the amount of work that is
lost to several minutes when the client resumes
operation. Most applications do not normally include
checkpointing, or do so without the complete state
needed, so adding this capability is the most significant
part of making code that is designed to run on a single
system run in a distributed system [1]

Some more discussion on Fault-tolerance and thread migration in [3].

Load balancing

Interestingly, executing these large, decoupled applications avoid many of the common bottlenecks of traditional parallel computing. For example, load balancing is a problem only for small problems, not large ones. With large problems the longest work unit is still only a small fraction of the time the total job requires, which leads to efficient tiling regardless of the heterogeneity in the system. [2]

References

answered Nov 19 at 17:42

Burnsba

412311

Probably the most applicable aspects of a distributed computing project are checkpoints and load balancing.

Checkpointing

In addition to their primary roles in scientific
computation, cores must also perform checkpointing
by saving the exact state of the calculation to disk
periodically, because a volunteer's PC may crash or be
turned off at any time. Since work units can run for
days, checkpointing reduces the amount of work that is
lost to several minutes when the client resumes
operation. Most applications do not normally include
checkpointing, or do so without the complete state
needed, so adding this capability is the most significant
part of making code that is designed to run on a single
system run in a distributed system [1]

Some more discussion on Fault-tolerance and thread migration in [3].

Load balancing

Interestingly, executing these large, decoupled applications avoid many of the common bottlenecks of traditional parallel computing. For example, load balancing is a problem only for small problems, not large ones. With large problems the longest work unit is still only a small fraction of the time the total job requires, which leads to efficient tiling regardless of the heterogeneity in the system. [2]

References

answered Nov 19 at 17:42

Burnsba

412311

answered Nov 19 at 17:42

Burnsba

412311

answered Nov 19 at 17:42

Burnsba

412311

answered Nov 19 at 17:42

Burnsba

412311

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Jtdylktuy