Home All Groups Group Topic Archive Search About

NOT using the System.Threadpool in server class applications

Author
27 Jun 2006 4:24 AM
Chris Mullins
Quite often of late I've been asked about when to use the System Threadpool.

I finally sat down and wrote a blog detailing the issues in the ThreadPool
and why it's unsuitable for use in production grade server applications.

Any feedback on this is appriciated!

http://www.coversant.net/dotnetnuke/Coversant/Blogs/tabid/88/EntryID/8/Default.aspx

--
Chris Mullins

Author
27 Jun 2006 6:19 AM
Jon Skeet [C# MVP]
Chris Mullins <cmull***@yahoo.com> wrote:
> Quite often of late I've been asked about when to use the System Threadpool.
>
> I finally sat down and wrote a blog detailing the issues in the ThreadPool
> and why it's unsuitable for use in production grade server applications.
>
> Any feedback on this is appriciated!
>
> http://www.coversant.net/dotnetnuke/Coversant/Blogs/tabid/88/EntryID/8/Default.aspx

My only feedback is that this is something I've been aware of for a
while. When I tried to make the same point on the CLR mailing list, I
was variously accused of being naive or finding problems when there
really aren't any.

Half the problem is that *anything* can use the system threadpool -
there's no way of isolating different areas of code to use different
pools.

Oh well - at least I'm not alone in considering it a problem.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Author
27 Jun 2006 5:22 PM
Chris Mullins
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote
> Chris Mullins <cmull***@yahoo.com> wrote:
>>
>> http://www.coversant.net/dotnetnuke/Coversant/Blogs/tabid/88/EntryID/8/Default.aspx
>

[Jon has seen this too]

> My only feedback is that this is something I've been aware of for a
> while. When I tried to make the same point on the CLR mailing list, I
> was variously accused of being naive or finding problems when there
> really aren't any.

That's what I've been bumping into as well. I dismiss alot of it, as its
from people who write proof-of-concept applications for a living, rather
than production grade applications. As we all know, there's a world of
difference between them.

> Half the problem is that *anything* can use the system threadpool -
> there's no way of isolating different areas of code to use different
> pools.

Yea, I agree. This was, when we ran into this 3 years or so ago, the big
issue. We needed the ability to say, "CLR gets these threads, SoapBox Server
gets these threads". Without that ability, the threadpool runs out of
threads and the whole appdomain hangs.

--
Chris Mullins
Author
27 Jun 2006 8:55 AM
Vadym Stetsyak
Hello, Chris!

CM> I finally sat down and wrote a blog detailing the issues in the
CM> ThreadPool and why it's unsuitable for use in production grade server
CM> applications.

CM> Any feedback on this is appriciated!

CM> http://www.coversant.net/dotnetnuke/Coversant/Blogs/tabid/88/EntryID/8/
CM> Default.aspx

Nice post :8-)

I have some questions on how you handled issues with ThreadPool.

Did you consider increasing number of worker threads in the threadpool?

Yes, ThreadPool is being used by vast number of classes in the framework, however, developer possesses some
power to influence that threadpool consumption, albeit indirect.
For instance, consider ADO.NET, server can use limited connection pool, that will limit usage of ThreadPool threads.

IMO under heavy load any threadpool can starve threads.

In one of the earlier threads William Stacey mentioned SEDA architecture.
( http://groups.google.com.ua/group/microsoft.public.dotnet.framework/browse_thread/thread/d9dfe43f5f60a98d/9c4bff3f789d2e32?lnk=st&q=Chris+Mullins+async&rnum=1&hl=uk#9c4bff3f789d2e32 )

SEDA paper ( http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf  )
In that scenario custom thread pool would be very helpful. But I think that such an architerure can also be built
using ThreadPool class, however one has to know what class consumes threads and how many of them it consumes. This will
help tune ThreadPool appropriately ( increase number of working threads )

--
Regards, Vadym Stetsyak
www: http://vadmyst.blogspot.com
Author
27 Jun 2006 5:12 PM
Chris Mullins
Show quote
"Vadym Stetsyak" <vady***@ukr.net> wrote
> Hello, Chris!
>
> CM> I finally sat down and wrote a blog detailing the issues in the
> CM> ThreadPool and why it's unsuitable for use in production grade server
> CM> applications.
>
> CM> Any feedback on this is appriciated!
>
> CM>
> http://www.coversant.net/dotnetnuke/Coversant/Blogs/tabid/88/EntryID/8/
> CM> Default.aspx
>
> Nice post :8-)

Thanks. I'm always worred about a post like that, as it goes against what so
many other people preach.

> I have some questions on how you handled issues with ThreadPool.
> Did you consider increasing number of worker threads in the threadpool?

The problem isn't the size of the threadpool. I can use one of the ICor
interfaces (I don't remember which one) to manipulate the size of the thread
pool, and have in fact done this. I've also run into this problem on
machines that have 400 threads (by default) in the threadpool.

The problems stems from the fact that no matter how many threads are in
there, I can exhaust them all. Because the CLR itself needs threadpool
threads, this means the whole system then deadlocks.

What's needed are two threadpools: One for my application to use, and one
for the CLR to use. There's just no getting around this.

--
Chris Mullins
Author
27 Jun 2006 6:50 PM
Vadym Stetsyak
Hello, Chris!
You wrote  on Tue, 27 Jun 2006 10:12:56 -0700:

      CM>  The problems stems from the fact that no matter how many threads
are in
      there, I can exhaust them all. Because the CLR itself needs threadpool
      threads, this means the whole system then deadlocks.


Strange workaround just came to my mind, it can solve the problem with
single thread pool, but its not optiomal, for the software that you
described in your post.

The solutions is multiple AppDomains. Every AppDomain has its own
ThreadPool, so we can create something like "worker" AppDomain.

This approach is an overkill, due to marshalling, but nevertheless this
solution exists, and application  can be built using this solution.

      CM>  What's needed are two threadpools: One for my application to use,
and
      one for the CLR to use. There's just no getting around this.


Yeah, that would simplify a lot of problems, with parallel exucution in
hight performance systems.

--
Regards,
Vadym Stetsyak.
Blog: http://vadmyst.blogspot.com
Author
28 Jun 2006 3:01 AM
William Stacey [MVP]
It might be easier to just use your own thread pool.  There are multiple
implementations out there.

--
William Stacey [MVP]

Show quote
"Vadym Stetsyak" <vady***@ukr.net> wrote in message
news:urHGmqhmGHA.4700@TK2MSFTNGP03.phx.gbl...
| Hello, Chris!
| You wrote  on Tue, 27 Jun 2006 10:12:56 -0700:
|
|      CM>  The problems stems from the fact that no matter how many threads
| are in
|      there, I can exhaust them all. Because the CLR itself needs
threadpool
|      threads, this means the whole system then deadlocks.
|
|
| Strange workaround just came to my mind, it can solve the problem with
| single thread pool, but its not optiomal, for the software that you
| described in your post.
|
| The solutions is multiple AppDomains. Every AppDomain has its own
| ThreadPool, so we can create something like "worker" AppDomain.
|
| This approach is an overkill, due to marshalling, but nevertheless this
| solution exists, and application  can be built using this solution.
|
|      CM>  What's needed are two threadpools: One for my application to
use,
| and
|      one for the CLR to use. There's just no getting around this.
|
|
| Yeah, that would simplify a lot of problems, with parallel exucution in
| hight performance systems.
|
| --
| Regards,
| Vadym Stetsyak.
| Blog: http://vadmyst.blogspot.com
|
|
Author
29 Jun 2006 12:39 AM
Chris Mullins
"William Stacey [MVP]" <william.sta***@gmail.com> wrote
> It might be easier to just use your own thread pool.  There are multiple
> implementations out there.

I've looked all over the place for a good threadpool implementation, and
have yet to find one.

My ideal thread pool would have the following features:
1 - Threads are dynamically added and removed form the queue depending on
load.
2 - Threads have some degree of processor affinity in order to minimize
cache issues
3 - The scheduler is smart about assigning work items to threads based on
processor load.
4 - A built-in watchdog. If a work item is exceeding some timeout, a
mechanism exists to abort it.
5 - Intelligent about locking on the work item queue

I've built a number of thread pools, and looked at several that I've found
on the Internet. None of them have ever had all these qualities..

--
Chris Mullins
Author
29 Jun 2006 2:02 AM
Barry Kelly
"Chris Mullins" <cmull***@yahoo.com> wrote:

> 4 - A built-in watchdog. If a work item is exceeding some timeout, a
> mechanism exists to abort it.

There's no good way to abort a thread, so I wouldn't be too optimistic
about finding this feature. Thread.Abort is dangerous and evil, and any
shared structures being currently manipulated by the thread will likely
be in an inconsistent state. The best you can do is tear down the
AppDomain.

-- Barry

Author
29 Jun 2006 3:16 AM
William Stacey [MVP]
That is true.  The best place for this is in your own code (running on a TP
thread).  I guess an optional callback could be a safty for things user code
does not check for, like timeouts on locks or send/receives.  But then
again, the callback has to interupt and/or abort the thread, but at least
you get the option to do some last second clean up before tearing down the
app.  Not sure a great solution exists as you said.

--
William Stacey [MVP]

Show quote
"Barry Kelly" <barry.j.ke***@gmail.com> wrote in message
news:vjc6a2pqi32e4s106le487tumkl6qort55@4ax.com...
| "Chris Mullins" <cmull***@yahoo.com> wrote:
|
| > 4 - A built-in watchdog. If a work item is exceeding some timeout, a
| > mechanism exists to abort it.
|
| There's no good way to abort a thread, so I wouldn't be too optimistic
| about finding this feature. Thread.Abort is dangerous and evil, and any
| shared structures being currently manipulated by the thread will likely
| be in an inconsistent state. The best you can do is tear down the
| AppDomain.
|
| -- Barry
|
| --
| http://barrkel.blogspot.com/
Author
29 Jun 2006 2:27 AM
William Stacey [MVP]
I'll play.  I wrote a couple before so have general interest and maybe I'll
blow the dust off it and update.
In general, some of these things may have more cost then payoff for a
general purpose thread pool, but let me ask.

| 1 - Threads are dynamically added and removed form the queue depending on
| load.

What defines "load" here?  Is it Q size or cpu load.  CPU load logic may
compete against min/max thread parms.  So what is algo here?

| 2 - Threads have some degree of processor affinity in order to minimize
| cache issues

Windows using soft affinity automatically. It prefers to run a task on the
same processor as last time, but will swith if it has to based on its algo.
IMO, the system will do a better job then we can do here.  The successful
usage cases of changing processor affinity are rare.  However, I agree there
may be special cases - but wonder if that will be abused if available.  What
are your thoughts here?

| 3 - The scheduler is smart about assigning work items to threads based on
| processor load.

What you mean?  tia

| 4 - A built-in watchdog. If a work item is exceeding some timeout, a
| mechanism exists to abort it.

That could be a good add.  It is overhead, but probably not too bad.

| 5 - Intelligent about locking on the work item queue

At a minimum, you need to lock on add and remove.  Are you thinking of
something specific, or just seen poor locking in the TPs you reviewed?


--
William Stacey [MVP]
Author
29 Jun 2006 6:06 AM
Chris Mullins
"William Stacey [MVP]" <william.sta***@gmail.com> wrote

[Features a threadpool should have]

> | 1 - Threads are dynamically added and removed form the queue depending
> on
> | load.
>
> What defines "load" here?  Is it Q size or cpu load.  CPU load logic may
> compete against min/max thread parms.  So what is algo here?

I think it would be more cpu load than anything else. I'm not sold however
on this being the correct metric.

The behavior that I explicitly want to avoid is dumping 100 items into a
Queue, and having the thread pool immediatly spin up 100 threads. Likewise,
if 100 threads are spun up, I wouldn't want to see them all terminated
immediatly should there be no items in the pool.

The System thread pool seems to have a timeout of some sort before spawning
new threads. This seems to work fairly well.


>
> | 2 - Threads have some degree of processor affinity in order to minimize
> | cache issues

[Thread Affinity]

> IMO, the system will do a better job then we can do here.  The successful
> usage cases of changing processor affinity are rare.  However, I agree
> there
> may be special cases - but wonder if that will be abused if available.
> What
> are your thoughts here?


I tend to agree with you here, and as a result I have never myself written
code that has processor affinity. Windows has always done a good job of
things.

However I've been runnnig alot of tests latley on this machine:
http://www.coversant.net/bigiron.png

That particular machine has 16 physical Itanium2 processors, and some of
it's behavior has been unexpected. We consistantly see certain processors
load signifigantly higher than others, and it's tough to tell quite why.
Down deep in the network stack, there are some issues (which are solved by
the new Chimney features in the Scalable Network Pack) with interrupts and
processor affinity, so this may explain some of it. On the other hand, there
seems to be room for improvement. This is a more of a feeling than a fact,
I'll readily admit. It's also compounded by the lack of profiling tools that
run on Itanium (there are a few, but they're not nearly as usefull for .Net
as the Compuware stuff is).

I would really like to be able to have a way to figure out what processor
cache (especially on these Itanium2 machines with their HUGE caches) my
datastructure is already in, have a thread on that processor wake up,
process the item from the cache, write the results to main memory. I don't
think I'm going to see this any time in the furture, but it's on my wish
list. I have that feeling (I know, I know) this will give a fiarly
signifigant boost to performance as the load increases.

> | 3 - The scheduler is smart about assigning work items to threads based
> on
> | processor load.
>
> What you mean?  tia

Many of the threadpools that I have seen use this algorithm:
1  -thread that wakes up
2 - thread checks for shutdown
3 - thread checks for a work item
4 - thread processes work item
5 - thread goes back to sleep
6 - goto 1

Some use variations on this, and fancy means of putting threads to sleep -
I've seen a few algorithms:
1 - Keep all threads blocked in a Montior. Whenver a work items comes in, an
"PulseOne" (or, god help us, a Pulse-All with a "First one wins" algorithm)
2 - Use Events and Waits to test for shutdown, items in the queue, etc.
3 - Items are posted to a Queue, a control thread wakes up and gives the
work item to a particular work thread.

I would really like to see an algorithm that has, in the "busy" case, as few
context switches as possible. When I post an item to a queue, there should
be exactly 1 context switch needed to process my item.

>
> | 5 - Intelligent about locking on the work item queue
>
> At a minimum, you need to lock on add and remove.  Are you thinking of
> something specific, or just seen poor locking in the TPs you reviewed?

I am thinking of some specific stuff. That 16 processor Itanium box has
signifigantly influenced the way I think about locking and contention.

Typically a queue has two locks - one on Enqueue, one on Dequeue. This is
how I've been coding it for years. However, after seeing some performance
that really puzzled me, I spent some time looking into the contention in our
system. While doing that, I came across this:
http://makeashorterlink.com/?Q4132185D

I played with this test for a while, and ran it on single, dual, quad, and
16 processor boxes, and got some shocking results. After looking into this
more and more, I became quite alarmed at just how much contention we were
seeing. It was.... shocking. I have all the numbers recorded, and will be
doing a blog entry on my findings here in the not-too-distant future.

Anyway, after much wailing and gnashing of teeth, I stumbled across
thread-safe, lock-free datastructures:
http://www.boyet.com/Articles/LockfreeQueue.html and have been educating
myself.

I added these into the test suites, and got more interesting results. On a
single or dual processor machine they're signifigantly slower. On a 16 way
machine, they're waaaaaay faster.

The end result, is that on different processor architectures, and different
numbers of processors, the locking mechanisms need to change. The difference
in some cases was a full 4 orders of magnitude (in one very dramatic case,
what took 85 milliseconds on a single processor x86 machine, took 200
milliseconds on a dual-processor machine, took 2 minutes on a quad processor
machine, and took more than 30 minutes on a 16-way machine). Granted these
cases were contrived and almost nothing but thread contention, but they
illustrated the case and let me pick the right datastructures.

My hypothetical thread pool would take this into account and have a variety
of locking mechanisms. It would even be especially nice if, in a general
some case, a work item could be handed directly to a thread with no lock
required at all. In the "lots of threads awake" case, it seems as if there
should be a way to do this.

Now, I'll be the first to admit that locking and unlocking on the queue for
work items shouldn't be too big a hit, especially when compared to parsing
some xml, string manipulation, etc - but these locks are hit at least twice
per operation (one in, one out), and at 10k operations per second, that adds
up quick.


--
Chris
Author
29 Jun 2006 1:51 PM
Carl Daniel [VC++ MVP]
Chris Mullins wrote:
Show quote
> "William Stacey [MVP]" <william.sta***@gmail.com> wrote
>
> [Features a threadpool should have]
>
>>> 1 - Threads are dynamically added and removed form the queue
>>> depending
>> on
>>> load.
>>
>> What defines "load" here?  Is it Q size or cpu load.  CPU load logic
>> may compete against min/max thread parms.  So what is algo here?
>
> I think it would be more cpu load than anything else. I'm not sold
> however on this being the correct metric.
>
> The behavior that I explicitly want to avoid is dumping 100 items
> into a Queue, and having the thread pool immediatly spin up 100
> threads. Likewise, if 100 threads are spun up, I wouldn't want to see
> them all terminated immediatly should there be no items in the pool.

[lots more great discussion about thread pools and lock-free data
structures]

It seems to me that there's a great model of how to make a high performance
server make the most effective use of multiple CPUs out there already:  SQL
Server (and, I would presume, Oracle and DB2 must use similar techniques).

Within SQL Server, they go to extraordinary lengths to try to have exactly 1
unblocked thread per CPU, almost never more, and fewer only when there are
fewer work items than CPUs. Further, they go to extraordinary lengths to
ensure that context switches never occur and that worker thread never block.
Considerable writing has been done on the inner workings of the User Mode
Scheduler that SQL Server uses.  For example,

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsqldev/html/sqldev_02252004.asp

How does that mesh with your wish-list for the ultimate thread pool?  Of
course, it's more than a thread pool - it is, in effect, an entire Operating
System built on top of the OS, requiring that code running atop it use it's
I/O and memory allocation facilities exclusively.

-cd
Author
30 Jun 2006 12:58 AM
Chris Mullins
"Carl Daniel [VC++ MVP]" <cpdaniel_remove_this_and_nospam@mvps.org.nospam>
wrote in message news:eeK3sM4mGHA.4960@TK2MSFTNGP04.phx.gbl...
> Chris Mullins wrote:
>> "William Stacey [MVP]" <william.sta***@gmail.com> wrote
>>
>> [Features a threadpool should have]
>>

> It seems to me that there's a great model of how to make a high
> performance server make the most effective use of multiple CPUs out there
> already:  SQL Server (and, I would presume, Oracle and DB2 must use
> similar techniques).

The SQL Server infrastructure is great, but I'm afraid it doesn't really do
me any good. Now, as soon as Microsoft releases SQL Server 2000 or 2005
under a GPL or BSD license, I'll use that code :)

I'm not so desperate yet as to use Fibers and write my own scheduler. Next
year, perhaps, but not yet!

--
Chris Mullins
Author
30 Jun 2006 3:22 AM
Carl Daniel [VC++ MVP]
Chris Mullins wrote:
Show quote
> "Carl Daniel [VC++ MVP]"
> <cpdaniel_remove_this_and_nospam@mvps.org.nospam> wrote in message
> news:eeK3sM4mGHA.4960@TK2MSFTNGP04.phx.gbl...
>> Chris Mullins wrote:
>>> "William Stacey [MVP]" <william.sta***@gmail.com> wrote
>>>
>>> [Features a threadpool should have]
>>>
>
>> It seems to me that there's a great model of how to make a high
>> performance server make the most effective use of multiple CPUs out
>> there already:  SQL Server (and, I would presume, Oracle and DB2
>> must use similar techniques).
>
> The SQL Server infrastructure is great, but I'm afraid it doesn't
> really do me any good. Now, as soon as Microsoft releases SQL Server
> 2000 or 2005 under a GPL or BSD license, I'll use that code :)
>
> I'm not so desperate yet as to use Fibers and write my own scheduler.
> Next year, perhaps, but not yet!

I have, a couple years back.  It would be interestsing to dust that code
off, make it .NET friendly (it's native C++), and use it to build a UMS-like
framework for .NET.  That'd be a powerful tool to have in the .NET toolbox.

-cd
Author
30 Jun 2006 5:35 AM
Chris Mullins
"Carl Daniel [VC++ MVP]"
>
>> I'm not so desperate yet as to use Fibers and write my own scheduler.
>> Next year, perhaps, but not yet!

> I have, a couple years back.  It would be interestsing to dust that code
> off, make it .NET friendly (it's native C++), and use it to build a
> UMS-like framework for .NET.  That'd be a powerful tool to have in the
> .NET toolbox.

.... and here I thought I was hardcore. You, sir, are well and truly sick! :)

--
Chris Mullins
Author
29 Jun 2006 7:52 PM
William Stacey [MVP]
| The behavior that I explicitly want to avoid is dumping 100 items into a
| Queue, and having the thread pool immediatly spin up 100 threads.
Likewise,
| if 100 threads are spun up, I wouldn't want to see them all terminated
| immediatly should there be no items in the pool.

Here is where trade-offs come into play and needing a specific behavior
instead of something general.  A simple way (that I used) is to have each
thread check, after, it is done with the delegate if new threads need to be
created (i.e. Q.Count > 1 and TPThreads.Count < maxThreads).  This avoids
needing a seperate manager thread running.  This, in fact, may be the
required behavior for the app (i.e. don't wait but bound the max).  Having
to walk the active threads and check if they are in fact blocked and only
then spin up more can induce delay and not sure there is a managed way to do
this.

| The System thread pool seems to have a timeout of some sort before
spawning
| new threads. This seems to work fairly well.

Not sure exact way they do it either.  Put has something to do with checking
if threads are blocked, then spin 1 more, and so on, upto the max.

| > | 2 - Threads have some degree of processor affinity in order to
minimize
| > | cache issues
|
| [Thread Affinity]

Not sure how much thread affinity will help here and is more likely to slow
things down.  You never want to wait, so if there is a ready, willing, and
able thread ready to pop the queue and start work, then that is better then
trying to wait for same thread used last time or spend cycles trying to
figure that out when the work could have already started or completed.  So
not sure we get payoff here in general.  Naturally, there are probably use
cases that prove my feeling wrong.

| Some use variations on this, and fancy means of putting threads to sleep -
| I've seen a few algorithms:
| 1 - Keep all threads blocked in a Montior. Whenver a work items comes in,
an
| "PulseOne" (or, god help us, a Pulse-All with a "First one wins"
algorithm)

Need to look very hard at this one.  I use pulseall in my blocking-Q because
it is the right thing to do in that case.  I have seen many incorrect
implementation that use Pulse().  It may look right at first, but there a
few funky things that can happen that can create a live-lock when using just
Pulse().  In most cases, if the loop is tight, a PulseAll is not going to
create a hurding issue and may, in fact, be the only way to ensure
correctness.  Sure you may get a couple threads here and there that wake-up
only to sleep again, but that should be the exception.

| I would really like to see an algorithm that has, in the "busy" case, as
few
| context switches as possible. When I post an item to a queue, there should
| be exactly 1 context switch needed to process my item.

That is the way mine implementation works.  I admit to more locking then I
would like, but can't figure out a way around it while keeping everything
correct and not getting into lock-free stuff.  I'll post here, so please
take shoots at it.

--
William Stacey [MVP]
Author
30 Jun 2006 1:02 AM
Chris Mullins
Show quote
"William Stacey [MVP]" <william.sta***@gmail.com> wrote

> | 1 - Keep all threads blocked in a Montior. Whenver a work items comes
> in,
> an
> | "PulseOne" (or, god help us, a Pulse-All with a "First one wins"
> algorithm)
>
> Need to look very hard at this one.  I use pulseall in my blocking-Q
> because
> it is the right thing to do in that case.  I have seen many incorrect
> implementation that use Pulse().  It may look right at first, but there a
> few funky things that can happen that can create a live-lock when using
> just
> Pulse().  In most cases, if the loop is tight, a PulseAll is not going to
> create a hurding issue and may, in fact, be the only way to ensure
> correctness.  Sure you may get a couple threads here and there that
> wake-up
> only to sleep again, but that should be the exception.

I used the word dreaded because the normal case (where only some threads
should be active) has Windows waking up all the threads, then putting most
of them right back to sleep. This can't be a good idea... Especially with
dozens (or hundreds) of threads in the pool.

I don't know the right answer to this, but there's gotta be a better way.

--
Chris Mullins
Author
30 Jun 2006 3:22 AM
Joe Seigh
Chris Mullins wrote:
Show quote
> "William Stacey [MVP]" <william.sta***@gmail.com> wrote
>
>
>>| 1 - Keep all threads blocked in a Montior. Whenver a work items comes
>>in,
>>an
>>| "PulseOne" (or, god help us, a Pulse-All with a "First one wins"
>>algorithm)
>>
>>Need to look very hard at this one.  I use pulseall in my blocking-Q
>>because
>>it is the right thing to do in that case.  I have seen many incorrect
>>implementation that use Pulse().  It may look right at first, but there a
>>few funky things that can happen that can create a live-lock when using
>>just
>>Pulse().  In most cases, if the loop is tight, a PulseAll is not going to
>>create a hurding issue and may, in fact, be the only way to ensure
>>correctness.  Sure you may get a couple threads here and there that
>>wake-up
>>only to sleep again, but that should be the exception.
>
>
> I used the word dreaded because the normal case (where only some threads
> should be active) has Windows waking up all the threads, then putting most
> of them right back to sleep. This can't be a good idea... Especially with
> dozens (or hundreds) of threads in the pool.
>
> I don't know the right answer to this, but there's gotta be a better way.
>
I'm not sure how you're using the queue here.  If it's just a simple queue
use a semaphore to control blocking on the queue.  You'll need a lock as
well if you have more than one queueing and one dequeueing thread.  Unless
it's a lock-free queue of course.

If you want a fast pathed semaphore which avoids system calls much like
what critical section does for mutex, you can check here
http://groups.google.com/groups?selm=412F5D8D.416AE***@xemaps.com
Note that if you use this on xbox360 the interlocked functions don't
have memory barriers so you'll have to provide your own.

For things like timer queues I use a gating mutex or semaphore to
avoid the thundering herd problem caused by having the entire
thread pool wait on the same timeout.  The gating mutex ensures
only one thread is waiting on an autoreset event with infinite
timeout if there's no work and a specific timeout if there is
a scheduled piece of work.  So in pseudo code

      gate.lock();
      queue.lock();
      for (;;) {
        if queue.empty()
          timeout = INFINITE;
        else
          timeout = queue.peek().deadline - now;


        if (timeout < 0)
          break;

        queue.unlock();
        event.wait();   // autoreset event
        queue.lock()
      }

      item = queue.pop();
      queue.unlock();
      gate.unlock();


Any queueing of work that changes the head of the queue (next item
to be scheduled) signals the autoreset event.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
Author
30 Jun 2006 4:12 AM
William Stacey [MVP]
Show quote
|      gate.lock();
|      queue.lock();
|      for (;;) {
|        if queue.empty()
|          timeout = INFINITE;
|        else
|          timeout = queue.peek().deadline - now;
|
|
|        if (timeout < 0)
|          break;
|
|        queue.unlock();
|        event.wait();   // autoreset event
|        queue.lock()
|      }
|
|      item = queue.pop();
|      queue.unlock();
|      gate.unlock();
|

I am not following that totally.
1) if queue is empty timeout is set to -1 (i.e. INFINITE).  The next "if" is
then true and does a break.  Then you try to pop() an empty queue?  This
does not seem right.  Please help.
2) Where else do you wait on timeout?  Should it be event.wait(timeout)
instead of event.wait()?

tia
Author
30 Jun 2006 11:09 AM
Joe Seigh
William Stacey [MVP] wrote:
Show quote
> |      gate.lock();
> |      queue.lock();
> |      for (;;) {
> |        if queue.empty()
> |          timeout = INFINITE;
> |        else
> |          timeout = queue.peek().deadline - now;
> |
> |
> |        if (timeout < 0)
> |          break;
> |
> |        queue.unlock();
> |        event.wait();   // autoreset event
> |        queue.lock()
> |      }
> |
> |      item = queue.pop();
> |      queue.unlock();
> |      gate.unlock();
> |
>
> I am not following that totally.
> 1) if queue is empty timeout is set to -1 (i.e. INFINITE).  The next "if" is
> then true and does a break.  Then you try to pop() an empty queue?  This
> does not seem right.  Please help.
> 2) Where else do you wait on timeout?  Should it be event.wait(timeout)
> instead of event.wait()?
>
> tia
>
>
Right that should be "if (timeout != INFINITE && timeout < 0)"
and "event.wait(timeout)".


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
Author
30 Jun 2006 2:31 AM
Joe Seigh
Chris Mullins wrote:
> Anyway, after much wailing and gnashing of teeth, I stumbled across
> thread-safe, lock-free datastructures:
> http://www.boyet.com/Articles/LockfreeQueue.html and have been educating
> myself.
>
That's Maged Michael and Michael Scott's lock free queue algorithm if you
want to google for it.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
Author
30 Jun 2006 5:47 AM
Chris Mullins
"Joe Seigh" <jseigh***@xemaps.com> wrote
> Chris Mullins wrote:
>> Anyway, after much wailing and gnashing of teeth, I stumbled across
>> thread-safe, lock-free datastructures:
>> http://www.boyet.com/Articles/LockfreeQueue.html and have been educating
>> myself.
>>
> That's Maged Michael and Michael Scott's lock free queue algorithm if you
> want to google for it.

Wow. I just spent some time reading Maged Michael's paper's over at IBM
reasearch. He's got some great, great stuff in there. Thanks for the
pointer.

Any idea if I can find nice (and debugged) C# implementations of his
algorithms?

--
Chris Mullins
Author
30 Jun 2006 9:22 AM
Vadym Stetsyak
Hello, Chris!

CM> Any idea if I can find nice (and debugged) C# implementations of his
CM> algorithms?

Yep, the stuff is great out there.
Also some of the algorithms are in the links provided by you, namely CAS,
( http://www.boyet.com/Articles/LockfreeStack.html )

--
Regards, Vadym Stetsyak
www: http://vadmyst.blogspot.com
Author
30 Jun 2006 11:58 AM
Joe Seigh
Chris Mullins wrote:
Show quote
> "Joe Seigh" <jseigh***@xemaps.com> wrote
>
>>Chris Mullins wrote:
>>
>>>Anyway, after much wailing and gnashing of teeth, I stumbled across
>>>thread-safe, lock-free datastructures:
>>>http://www.boyet.com/Articles/LockfreeQueue.html and have been educating
>>>myself.
>>>
>>
>>That's Maged Michael and Michael Scott's lock free queue algorithm if you
>>want to google for it.
>
>
> Wow. I just spent some time reading Maged Michael's paper's over at IBM
> reasearch. He's got some great, great stuff in there. Thanks for the
> pointer.
>
> Any idea if I can find nice (and debugged) C# implementations of his
> algorithms?
>

I'm not sure.  Most of the stuff I did was read lock-free (as opposed
to write lock-free) and mainly used RCU, Maged's SMR hazard pointers,
or lock-free reference counting for the GC component of the lock-free
algorithms.  So you could have read lock-free hash tables, linked lists,
or even balanced binary trees if I had ever gotten around to working out
the heuristics for balancing (you can't just plug in the current
AVL or Red/Black algorithms as is).  But too much patent activity in
that area so I shut down those projects.

You could take a look at the java.util.concurrent classes in the latest
version of Java to see how they did it.  JSR-166 added in lock-free support.
They basically use double wide interlocked compare and exchange to
implement the AtomicMarkableReference and AtomicStampedReference classes.
You need stuff like that to avoid the ABA problem.  You can use tracing GC if
it's built in to avoid the ABA problem like the article above did but
for a high throughput queue, you are going to generate a lot of garbage
for the GC to collect.



--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
Author
28 Jun 2006 4:51 AM
William Stacey [MVP]
Here is rough pseudocode for a server that uses a threadpool (the clr or
other)
to handle N client sockets. We break a TP rule here because we use a
threadpool thread
for a long running operation (i.e. the socket read). However, it is our
server, so we
understand and manage that issue. Basically, we add all client sockets to a
list - for management, etc. We then kick off the first read (and future
reads) on a thread pool thread.
We set a max of threads to 100 in this example. So at any one instant, we
could have a 100 pending reads. Other client reads will just queue up in the
TP FIFO queue.

Issues:
1) 100 slow clients (in this example) can stall the TP. This could be an
attack, or just 100 slow clients. We can mitigate attack by only allowing so
many connections from same IP. Other methods, such as an async server, can
also stall after many pending reads, so not sure how much of issue this is
in reality compared to other methods. After a client does a read/reply, it
frees that thread and posts another read to the queue so other clients get
time. The next client in the queue gets to read so the server will keep
running as long as client reads eventually happen.
2) Writes are done sync on the same thread. Not sure this is much of an
issue however. The write will be done quickly with no wait. Not sure we gain
anything by doing an async write.

Advantages:
1) Built in fairness. Client requests are handled FIFO.
2) Dynamic throttle. We can throttle and apply "back-pressure" just by
changing the number of MaxThreads in the TP. This will limit the number of
pending sync reads/readprocessing on the server at any one time. New clients
just queue up.
3) Logic is sync. However we don't assign 1 thread per client for the whole
client session. Each request/reply will be handle by another TP thread
(possibly the same if Q is empty). This makes it easy to code and find bugs.
We can also use the socket timeout feature on reads and writes. Slow clients
that timeout can just be dropped (or use some algo). After a timeout, we
free that thread for another client read.
4) Easy to replace system TP with a custom TP if needed.
5) Can handle many thousands of clients with little number of threads. Only
testing we find the optimal number of threads needed for the app.


PseudoCode (just the basics. Locks, exception handling, etc, not shown for
simplicity):
---------------------------------------------------------------------------------------
ThreadPool.MinThreads = 25;
ThreadPool.MaxThreads = 100;

while(started)
{
   sock = Accept();
   sock.ReadTimeout = 5000;
   if ( userList.Count > 5000 ) // Set to how many concurrent clients you
want.
   {
      sock.Close(); //A hard close - too many clients. Can do this better.
      continue;
   }
   userList.Add(sock);

   // Kick off the request/response chain for client.
   ThreadPool.QueueUserWorkItem(ReadMethod, sock);
}

private void ReadMethod(object state)
{
   // This is on a threadpool thread.
   //
   Socket sock = (Socket)state;
   byte[] buf = new byte[1024];
   try
   {
      int read = sock.Receive(buf);

      // Handle request processing.

      // Write reply back to client.
      sock.Send(writebuf);
      if ( done )
      {
         sock.Close()
         userList.Remove(sock);
         return; // Client done.
      }

      // Release this thread to pool, but queue another read before we
leave.
      // Note: If there is no other queued work items,
      // the TP will most likely just use the same thread.
      ThreadPool.QueueUserWorkItem(ReadMethod, sock);
   }
   catch(Timeout te)
   {
       //Handle timeout.
   }
   catch(Exception ex)
   {
       //Handle exception
   }
}

--
William Stacey [MVP]

"Chris Mullins" <cmull***@yahoo.com> wrote in message
news:uN57HGamGHA.4716@TK2MSFTNGP04.phx.gbl...
| Quite often of late I've been asked about when to use the System
Threadpool.
|
| I finally sat down and wrote a blog detailing the issues in the ThreadPool
| and why it's unsuitable for use in production grade server applications.
|
| Any feedback on this is appriciated!
|
|
http://www.coversant.net/dotnetnuke/Coversant/Blogs/tabid/88/EntryID/8/Default.aspx
Show quote
|
| --
| Chris Mullins
|
|

AddThis Social Bookmark Button