Long running work unit
http://www.rnaworld.de/rnaworld/workuni ... uid=157852
One of my machines picked up a rather long running work unit. The machine is a 2.4 GHz Q6600 with 4 GB of RAM running 64-bit Linux, so it should be powerful enough.
It is currently at 6 days of compute time and went high priority yesterday. The estimated time to completion keeps increasing and, if the estimate is correct, the unit will no longer complete before the deadline. Two people have already aborted Workunit 157852, two people have had errors and 9 of us are still grinding away at it.
I read the thread that says a 2 GHz athlon will finish all work in a couple of hours. If this is true then shouldn't the work automatically abort after a day and not tie up a core for a week when you guys give us buggy work?
Are all work units that have run more than 2 hours buggy and need to be aborted? It seems wasteful to tie up resources for so long when they would be more productive doing something else.
- Michael H.W. Weber
- Association board member
- Posts: 22000
- Joined: 07.01.2002 01:00
- Location: Marpurk
- Contact:
Re: Long running work unit
fractal wrote: I read the thread that says a 2 GHz athlon will finish all work in a couple of hours. If this is true then shouldn't the work automatically abort after a day and not tie up a core for a week when you guys give us buggy work?
Well, what thread are you actually talking about?
Michael.
P.S.: We do not send out any buggy WUs, and no WU requires a manual abort. There are also very few, if any, error reports for this project. Quite frequently, people just do not have enough patience to wait for proper completion. In that case, if a WU seems too long to you, it helps overall project performance if you delete it early, because we can then send it out again quickly. Anything else just delays the progress of the entire project.
Support, cooperate and build instead of demanding, competing and consuming.
Re: Long running work unit
I was referring to http://www.rechenkraft.net/phpBB/viewto ... 75&t=10592 . Re-reading it I see you said 47 hrs on a 2ghz AMD.
It is currently at 145 hrs of processing and time to complete has fluctuated between 10 and 12 hrs for the past two days. Report deadline is 5 hrs from now. You might want to take a look at that unit since, as I reported earlier, nobody has finished it and 64 bit linux cores that could be used to clean up the backlog of pending are tied up on it.
- FalconFly
- Mikrocruncher
- Posts: 25
- Joined: 28.07.2009 18:49
- Location: 5335N 00745E
- Contact:
Re: Long running work unit
I also have two work units (example) that could not finish within the deadline.
Both take approx. 670 hrs on a 2.5 GHz Phenom II X4 905e, which isn't exactly a slow CPU.
x86_64 Linux reports normal CPU usage, and the task is running normally.
As the deadline has already passed, should I abort them or wait another ~340 hrs for them to complete?
(i.e. will the server still accept them?)
-- edit --
Restarting them seems to have reset their figures. I guess I'll keep an eye on how they proceed (?)
Scientific Network : 44800 MHz - 77824 MB - 1970 GB
- Michael H.W. Weber
- Association board member
- Posts: 22000
- Joined: 07.01.2002 01:00
- Location: Marpurk
- Contact:
Re: Long running work unit
fractal wrote: I was referring to http://www.rechenkraft.net/phpBB/viewto ... 75&t=10592 . Re-reading it I see you said 47 hrs on a 2ghz AMD.
No, you are referring to something unrelated. Here we have a CMC WU, not a CMS set of the CSP type, which is the topic of the thread you link to.
fractal wrote: It is currently at 145 hrs of processing and time to complete has fluctuated between 10 and 12 hrs for the past two days. Report deadline is 5 hrs from now. You might want to take a look at that unit since, as I reported earlier, nobody has finished it and 64 bit linux cores that could be used to clean up the backlog of pending are tied up on it.
Hmmm, that means you must have had that WU in your queue for quite a long time without working on it, because the CMC WUs usually have a 14-day deadline. So, if you have invested 145 hrs and the final deadline is in 5 hrs, well, you can do the math yourself. I do not know whether it will be completed in time, but I can do a test run on a 955 BE to see what run time to expect. Maybe just keep it running if you do not mind. It might really help us fix a problem.
Michael.
[edit]: Run time estimate on an AMD 955 BE (3.0 GHz quad-core): 85 hrs. I had this WU before; it took more than 145 hrs, and then my machine was accidentally detached from BOINC, hence the restart.

- Michael H.W. Weber
- Association board member
- Posts: 22000
- Joined: 07.01.2002 01:00
- Location: Marpurk
- Contact:
Re: Long running work unit
FalconFly wrote:
I also have two work units (example) that could not finish within the deadline.
Both take approx. 670 hrs on a 2.5 GHz Phenom II X4 905e, which isn't exactly a slow CPU.
x86_64 Linux reports normal CPU usage, and the task is running normally.
As the deadline has already passed, should I abort them or wait another ~340 hrs for them to complete?
(i.e. will the server still accept them?)
-- edit --
Restarting them seems to have reset their figures. I guess I'll keep an eye on how they proceed (?)

That is strange. My 955 BE estimates a run time of 244 hrs for this WU (it will most likely be a bit more) with a RAM usage of up to 1.2 GB (it might be higher; usage increases and even fluctuates during computation). Could you please also check whether RAM is being swapped to disk?
Michael.
Re: Long running work unit
I think you got it.
boinc@g31mx-1:~$ free
             total       used       free     shared    buffers     cached
Mem:       4038988    3770556     268432          0     109184     357996
-/+ buffers/cache:    3303376     735612
Swap:      1253028     916248     336780
boinc ran some units from another project and switched to other units of the same project leaving them in memory. This eventually ate all of memory. I have suspended all other projects until this unit completes.
It is weird. I have told boinc not to switch between applications, but it just won't listen. It just loads them all up in memory and alternates between them. Silly program.
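For anyone who wants to run the same swap check on their own box, here is a minimal Python sketch. It assumes the old-style `free` layout shown in this post (values in KiB, with a "-/+ buffers/cache" row); the names `FREE_OUTPUT`, `parse_free` and `swap_used_fraction` are made up for this example and are not part of BOINC or of any project tool.

```python
# Sample taken from the `free` output quoted in this post (values in KiB).
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:       4038988    3770556     268432          0     109184     357996
-/+ buffers/cache:    3303376     735612
Swap:      1253028     916248     336780
"""


def parse_free(text):
    """Return the Mem and Swap rows of old-style `free` output as dicts."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] in ("Mem:", "Swap:"):
            total, used, free = (int(value) for value in parts[1:4])
            rows[parts[0].rstrip(":")] = {
                "total": total,
                "used": used,
                "free": free,
            }
    return rows


def swap_used_fraction(rows):
    """Fraction of swap space in use (0.0 if no swap is configured)."""
    swap = rows["Swap"]
    return swap["used"] / swap["total"] if swap["total"] else 0.0


rows = parse_free(FREE_OUTPUT)
print(f"swap in use: {swap_used_fraction(rows):.0%}")
```

A large share of swap in use while a work unit is running suggests the machine may be paging rather than computing, which would fit the ever-growing completion estimates described above. It is only a hint, though: used swap can also be stale pages that are no longer being touched.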
- FalconFly
- Mikrocruncher
- Posts: 25
- Joined: 28.07.2009 18:49
- Location: 5335N 00745E
- Contact:
Re: Long running work unit
That wasn't the case for me, I think (at least not yesterday, when I checked both machines).
cmcalibrate uses only ~700 MB of RAM; it is currently running alongside WCG (only 30 MB per task), leaving some 2.8 GB of RAM free and 2 GB of swap unused.
Even if cmcalibrate went up to 3.5 GB, there would still be no impact on the systems (4 GB RAM).
Since the next attempt is way past the deadline (a failed checkpoint set them back from 340 h of runtime to about 12 min), I think I'll just abort them like all the other users had to.
- Michael H.W. Weber
- Association board member
- Posts: 22000
- Joined: 07.01.2002 01:00
- Location: Marpurk
- Contact:
Re: Long running work unit
No, please keep it running. If the machine is not swapping and no zombie task is detectable, I do not see a reason why the WU should be dead. It should start counting down soon. I just had that situation on my box.
Michael.
- FalconFly
- Mikrocruncher
- Posts: 25
- Joined: 28.07.2009 18:49
- Location: 5335N 00745E
- Contact:
Re: Long running work unit
Okidok, I'll let them run.
It'll be at least 4 weeks before they can finish, however - 19h done, ~650h to go
(I wonder what kind of calibration is done that needs such enormous runtimes?)
Re: Long running work unit
And another very long unit (513689, cms_6S6[e]_Monodelphis-domestica-(gray-short-tailed-opossum)_CM000370.lin.EMBL_f_1268060823_33_0), running on a Q6600/2.4GHz, but only 2GB RAM.
It's currently at 63¾ hours @ 9.1%, so ~637 hours/26½ days to go at the current rate; the progress bar is clicking up 0.001% per tick. The deadline is 18/3 (just under 7 days); it's showing 125 hours to go, so BOINC hasn't put it on high priority yet.
Abort (I hate wasting crunching time) or persevere?
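The estimate above is plain linear extrapolation from elapsed time and reported progress. A minimal Python sketch of that arithmetic follows. It assumes progress stays linear, which may well not hold for these work units, and the function names are invented for this example.

```python
def remaining_hours(elapsed_hours, fraction_done):
    """Linearly extrapolate the hours still needed from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


def can_meet_deadline(elapsed_hours, fraction_done, hours_until_deadline):
    """True if the linear estimate finishes before the deadline."""
    return remaining_hours(elapsed_hours, fraction_done) <= hours_until_deadline


# Figures from the post: 63.75 h elapsed at 9.1 %, deadline just under 7 days.
left = remaining_hours(63.75, 0.091)
print(f"{left:.0f} h remaining, about {left / 24:.1f} days")
print("finishes before deadline:", can_meet_deadline(63.75, 0.091, 7 * 24))
```

With these figures the linear estimate lands far beyond the deadline, while the 125 hours shown by BOINC would still fit, which is why the two disagree about whether the unit needs high priority.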
Re: Long running work unit
Al Dente wrote:
And another very long unit (513689, cms_6S6[e]_Monodelphis-domestica-(gray-short-tailed-opossum)_CM000370.lin.EMBL_f_1268060823_33_0), running on a Q6600/2.4GHz, but only 2GB RAM.
It's currently at 63¾ hours @ 9.1%, so ~637 hours/26½ days to go at the current rate; the progress bar is clicking up 0.001% per tick. The deadline is 18/3 (just under 7 days); it's showing 125 hours to go, so BOINC hasn't put it on high priority yet.
Abort (I hate wasting crunching time) or persevere?

How much RAM does the WU consume?
yoyo