Parallel computing: 2- vs. 4-processor speed [closed]
I am evaluating code that ends with a Table containing ParallelEvaluate of a function XXXX[phi, theta, si]. For a grid of 225 points, an ordinary 2-processor laptop takes 7 h, compared to 8.5 h on a high-end 4-processor Xeon machine. CPU and memory usage are about 66% and 700 MB on the laptop versus 99% and 900 MB on the Xeon. I would be thankful for any suggestion on how to improve the evaluation speed on the Xeon machine. Thanks

performance-tuning parallelization
closed as off-topic by m_goldberg, José Antonio Díaz Navas, PlatoManiac, Yves Klett, b3m2a1 Jan 25 at 4:12
This question appears to be off-topic. The users who voted to close gave this specific reason:
- "This question cannot be answered without additional information. Questions on problems in code must describe the specific problem and include valid code to reproduce it. Any data used for programming examples should be embedded in the question or code to generate the (fake) data must be included." – m_goldberg, José Antonio Díaz Navas, PlatoManiac, Yves Klett, b3m2a1
If this question can be reworded to fit the rules in the help center, please edit the question.
Okay, having skimmed through the code, I can say that your actual problem is not parallelization but your coding style. You use far too much symbolic computation; you use unpacked arrays (in particular because you mix integer, symbolic, and machine-precision numbers in arrays); you recompute data over and over again (have a look at the many Sort and SortBy calls); and instead of concise function calls with purely numerical input and output, you use replacement rules; ...
– Henrik Schumacher, Jan 19 at 2:20
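To illustrate Henrik's point about replacement rules versus numeric function calls, here is a toy sketch (the names `expr` and `g` are made up for illustration, not taken from the question's code): rule-based code keeps expressions symbolic until the rules fire, while a function restricted to numeric arguments evaluates straight to machine numbers.

```mathematica
(* Rule-based style: the expression stays symbolic until the rules apply *)
expr = a x^2 + b x;
expr /. {a -> 1.2, b -> 3.4, x -> 0.5}

(* Numeric-function style: the definition only fires for numeric input,
   so intermediate results stay at machine precision *)
g[a_?NumericQ, b_?NumericQ, x_?NumericQ] := a x^2 + b x
g[1.2, 3.4, 0.5]
```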
Running XXXX[0, 0, 0] once takes 135 s on my machine. I guess this could be executed 100-1000 times faster with proper refactoring of your code (and probably by using Compile here and there).
– Henrik Schumacher, Jan 19 at 2:22
@HenrikSchumacher You are correct, there could be many ways to write the code, and as someone writing such complicated Mathematica code for the first time, I may not have chosen the most efficient sub-steps. Still, my question about parallelization remains: for any code taking x seconds to evaluate at one point, how can we make it scale linearly with the number of points and the number of processors (using ParallelTable, ParallelEvaluate, or any other method)? I would be thankful for your suggestions on that. In the meantime, I will try to modify the code to reduce the time per point x by incorporating your suggestions. Thx
– user49535, Jan 19 at 4:49
@user49535 I think that your replacement rules are killing you. If I run DSC[0, 0, 0, 1] by itself, I get output that's over 12 million bytes, because the code is unable to multiply the numbers by your f values, since they are among the last things to be defined. If possible, I would try to store the f values as actual numbers in a matrix. It looks to me like the output of DSC is actually supposed to be a matrix with 36 rows and 3 columns, where the last 2 columns are just indices, so it should be on the order of 1000 bytes.
– MassDefect, Jan 19 at 5:52
@user49535 Out of curiosity, is there a particular resource (like an algorithm from a book, or code in another language) that you're trying to emulate? I'm trying to figure out what f does, but it's difficult. XX1 calls XXXX, which calls DSC -> SEM -> SE -> TrueStrain -> EigenStrain, which contains these f variables, but the computer doesn't know what they are yet. Then we go all the way back up the stack to DSC, where some of the f variables are replaced with other f variables but still not assigned a value, until we go back up to XXXX and have dsc /. R[m-1]
– MassDefect, Jan 19 at 7:39
asked Jan 18 at 7:27 by user49535; edited Jan 24 at 6:14
1 Answer
Without knowing the exact function (I assume it's something fairly long, possibly involving integrals or differential equations), I can only make the following suggestions:
It looks like you're using exact numbers. If this is necessary for your application, then there's probably not a lot you can do, but exact numbers usually slow things down substantially. If you can, use Real numbers (just place a dot after the numbers, like {phi, 0., Pi/4., Pi/56.}). If you need more precision than that but don't necessarily require the infinite precision of exact numbers, you can also do this: {phi, 0`50, Pi/4`50, Pi/56`50}. This gives you 50 digits of precision to work with, which should make your final answer pretty close to the exact one.
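As a minimal illustration of the difference (a toy sketch, unrelated to XXXX itself):

```mathematica
(* Exact input: the result is kept symbolic, which is slow in bulk *)
Sin[Pi/56]        (* stays unevaluated as Sin[Pi/56] *)

(* Machine-precision input: an immediate numeric result *)
Sin[Pi/56.]

(* 50-digit arbitrary precision, if more accuracy is needed *)
Sin[N[Pi/56, 50]]
```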
The other thing I would try is:
XX1 = ParallelTable[
    {XXXX[phi, theta, si], phi, theta, si},
    {phi, 0, Pi/4, Pi/56},
    {theta, 0, ArcCot[Cos[phi]], ArcCot[Cos[phi]]/14},
    {si, 0 Pi, 0 Pi, 0}
]
I think that ParallelTable is a better way to handle this than ParallelEvaluate. On a trial function, I see about a 100x speedup. ParallelEvaluate simply evaluates your exact same function 4 times at each data point, rather than splitting the task across multiple kernels.
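A quick way to see the difference (a sketch; `f` here is a stand-in, not the question's XXXX):

```mathematica
(* ParallelEvaluate runs the same expression once on every subkernel,
   so all kernels duplicate the work: *)
ParallelEvaluate[$KernelID]   (* returns one value per launched kernel *)

(* ParallelTable splits the iteration range across the subkernels,
   so each grid point is computed only once: *)
f[phi_] := N[Sin[phi]^2];
ParallelTable[f[phi], {phi, 0, Pi/4, Pi/56}]
```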
If you can, combine both things for the best speedup.
I hope this helps a bit! There are some people on here who are amazing at optimizing; perhaps they will be able to improve the speed even more. If possible, I would recommend posting your XXXX function, unless it's insanely long.
Thanks @LukasLang! How do you type grave accents without them being interpreted as inline code markers? I tried backslashes before them, but that didn't help.
– MassDefect, Jan 18 at 9:10
You have to increase the number of enclosing accents: ``` `` Code with accents `` ```. If you need double accents, you enclose the code with three, and so on. (Edit: for some reason it doesn't work in the comment section, but you can edit your answer to see how it's done.)
– Lukas Lang, Jan 18 at 9:24
@LukasLang Oh, I see! Thanks!
– MassDefect, Jan 18 at 9:32
Thanks, both of you. Three points: 1. I do not necessarily need to use exact values of (theta, phi); if it speeds things up, I can use ".". 2. I tried ParallelTable first, but in contrast to your experience it took 30 h / 48 h on the 4- / 2-processor machines, compared to 8 h / 7 h with ParallelEvaluate, i.e. 4-8 times slower. 3. How can I combine both... do you mean ParallelTable[ParallelEvaluate[...]]?
– user49535, Jan 18 at 11:03
@user49535 As MassDefect already pointed out, using ParallelEvaluate here does not make sense at all. It enforces that the same value is computed on each of your CPU cores, which is why you won't gain any speedup. Whether ParallelTable can help at all really depends on your actual function XXXX. If it is a pure function, then ParallelTable should help. But if XXXX has side effects (like modifying data that has to be used by another thread), then it is hard to parallelize the execution. In a nutshell, we cannot give any further suggestions without knowing XXXX.
– Henrik Schumacher, Jan 18 at 11:49
answered Jan 18 at 8:07 by MassDefect; edited Jan 18 at 8:42 by Lukas Lang
$begingroup$
Thanks @LukasLang ! How do you type grave accents without it interpreting them as the inline code markers? I tried backslashes before them, but that didn’t help.
$endgroup$
– MassDefect
Jan 18 at 9:10
1
$begingroup$
You have to increase the amount of enclosing accents: ``` `` Codewith
accents`` ```. If you need double accents, you enclose the code with three, and so on (edit: for some reason, it doesn't work in the comment section - but you can edit your answer to see how it's done)
$endgroup$
– Lukas Lang
Jan 18 at 9:24
$begingroup$
@LukasLang Oh, I see! Thanks!
$endgroup$
– MassDefect
Jan 18 at 9:32
$begingroup$
Thanks both of you. three points 1. I do not necessary need to use exact values of (theta, phi) if it can speed up, can use ".". 2. I tried to use ParalleleTable first, but in contrast to your experience, it took 30h/48h for 4/2 processor computer as compared to 8h/7h for ParallelEvaluate. 4 - 8 times slower. 3. How can I combine both...you mean ParallelTable[ParallelEvaluate[. ??
$endgroup$
– user49535
Jan 18 at 11:03
$begingroup$
@user49535 As MassDefect already pointed out, using ParallelEvaluate
here does not make sense at all. It enforces that the same value is computed on each of your CPU cores, which is why you won't gain any speedup. It really depends on your actual function XXXX
whether ParallelTable
can help at all. If it is a pure function, then ParallelTable
should help. But if XXXX
has side effects (like modifying data that has to be used by another thread), then it is hard to parallelize the execution. In a nutshell, we cannot give any further suggestions without knowing XXXX
.
$endgroup$
– Henrik Schumacher
Jan 18 at 11:49
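A minimal illustration of the side-effect problem described in the comment above (a hypothetical example, not the OP's code):

```
(* A pure, side-effect-free computation distributes cleanly *)
ParallelTable[PrimeQ[2^n - 1], {n, 1, 20}]

(* A side effect on a shared variable does not: each subkernel
   increments its OWN copy of acc, so the master kernel's acc
   stays at 0 unless SetSharedVariable[acc] is used *)
acc = 0;
ParallelDo[acc += n, {n, 1, 100}];
acc
```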
|
show 2 more comments
$begingroup$
Okay, having skimmed through the code, I can say that your actual problem is not parallelization but your coding style. You use far too much symbolic computation; you use unpacked arrays (in particular because you mix integer, symbolic and machine precision numbers in arrays); you recompute data over and over again (have a look at the many
Sort
and SortBy
commands); instead of concise function calls with purely numerical input and output, you use replacement rules; ...$endgroup$
– Henrik Schumacher
Jan 19 at 2:20
$begingroup$
Running
XXXX[0, 0, 0]
once takes 135 s on my machine. I guess this can be executed 100--1000 times faster with proper refactoring of your code (and probably by using Compile
here and there).$endgroup$
– Henrik Schumacher
Jan 19 at 2:22
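As a sketch of the Compile suggestion (the function body below is a made-up placeholder, since XXXX is not shown):

```
(* Compile a purely numerical kernel; Listable plus Parallelization
   lets it run over whole arrays of grid points at machine speed *)
g = Compile[{{phi, _Real}, {theta, _Real}, {si, _Real}},
    Sin[phi]^2 Cos[theta] + si,
    RuntimeAttributes -> {Listable}, Parallelization -> True];

g[{0., 0.1}, {0.2, 0.3}, {0., 0.}]
```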
$begingroup$
@HenrikSchumacher You are correct, there could be many ways to write the code, and as someone writing such complicated Mathematica code for the first time, I may not have chosen the most efficient sub-steps. Still, my question about parallelization remains: for any code taking x seconds to evaluate at one point, how can we make the total time scale linearly with the number of points and the number of processors (using ParallelTable, ParallelEvaluate, or any other method)? I will be thankful for your suggestions on that. In the meantime, I will try to modify the code to reduce the time per point x by incorporating your suggestions. Thx
$endgroup$
– user49535
Jan 19 at 4:49
1
$begingroup$
@user49535 I think that your replacement rules are killing you. If I run
DSC[0, 0, 0, 1]
by itself, I get output that's over 12 million bytes, because the code is unable to multiply the numbers by your f values, since they're among the last things to be defined. If possible, I would try to store the f values as actual numbers in a matrix. It looks to me like the output of DSC is actually supposed to be a matrix with 36 rows and 3 columns, where the last two columns are just indices, so it should be on the order of 1000 bytes.$endgroup$
– MassDefect
Jan 19 at 5:52
1
$begingroup$
@user49535 Out of curiosity, is there a particular resource (like an algorithm from a book or code in another language) that you're trying to emulate? I'm trying to figure out what f does, but it's difficult. XX1 calls XXXX which calls DSC -> SEM -> SE -> TrueStrain -> EigenStrain which contains these f variables, but the computer doesn't know what they are yet. Then we go all the way back up the stack to DSC, and some of the f variables are replaced with other f variables but still not assigned a value, until we go back up to XXXX and have
dsc/.R[m-1]
$endgroup$
– MassDefect
Jan 19 at 7:39