Multi-process or multi-threaded design for Ruby daemons? GIL to the rescue :)

MRI Ruby has a global interpreter lock (GIL), which means that even in multi-threaded Ruby code only one thread is on-CPU at any given time. Other Ruby implementations, such as JRuby, have done away with the GIL, but even in MRI threads can be useful. The Sidekiq background worker gem takes advantage of this by running multiple workers in separate threads within a single process.

If a job's workload blocks on I/O, Ruby can context-switch to other threads and do other work until the I/O finishes. This happens when the workload reaches out to an external API, shells out to another command, or accesses the file system.
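A minimal sketch of the effect, using `sleep` as a stand-in for real blocking I/O (an assumption: a real job would be waiting on a socket, a child process, or the disk). The helper names `elapsed` and `run_io_jobs` are hypothetical. Five 0.2 s waits take about a second one after another, but only about 0.2 s when each runs in its own thread, because MRI lets other threads run while one is blocked.

```ruby
# Measure wall-clock time for a block using a monotonic clock.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# `sleep` stands in for a blocking call (HTTP request, shell-out, file read).
# MRI releases the GIL while a thread waits, so the waits overlap.
def run_io_jobs(count, wait)
  Array.new(count) { Thread.new { sleep wait } }.each(&:join)
end

sequential = elapsed { 5.times { sleep 0.2 } }
threaded   = elapsed { run_io_jobs(5, 0.2) }
puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
```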

If a workload does not block on I/O, it is CPU-bound instead, and under a GIL it gains nothing from thread switching, since only one thread can run Ruby code at a time. In this case, multiple processes will be more efficient and will be able to take better advantage of multi-core systems.
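For CPU-bound work, a sketch of the multi-process alternative: split the work into slices, compute each slice in a forked child, and collect the results over pipes. This assumes a Unix platform where `fork` is available (as on SmartOS); `cpu_work` and `parallel_sum_of_squares` are hypothetical names, and the sum of squares simply stands in for any pure computation.

```ruby
# Pure computation with no I/O: threads would just take turns holding the GIL.
def cpu_work(range)
  range.sum { |i| i * i }
end

# Split 1..n into `workers` ranges, compute each in a forked child process
# (Unix only), and send each result back to the parent over its own pipe.
def parallel_sum_of_squares(n, workers)
  size = (n + workers - 1) / workers # integer ceiling
  ranges = Array.new(workers) { |w| (w * size + 1)..[(w + 1) * size, n].min }
  readers = ranges.map do |range|
    reader, writer = IO.pipe
    fork do
      reader.close
      writer.write(cpu_work(range).to_s)
      writer.close
      exit!(0) # skip at_exit handlers inherited from the parent
    end
    writer.close # parent keeps only the read end, so EOF arrives when the child exits
    reader
  end
  results = readers.map do |reader|
    value = Integer(reader.read) # reads until the child closes its end
    reader.close
    value
  end
  Process.waitall # reap the children
  results.sum
end

puts parallel_sum_of_squares(1_000_000, 4)
```

Each child gets its own interpreter and its own GIL, so the slices can run on separate cores at the same time.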

So… why not skip threads and just deal with processes? A number of reasons.

Threads are more memory-efficient. Fewer processes mean less virtual memory allocated, allowing more workers on fewer hosts, which can add up to considerably lower costs over the course of a year. Garbage collection improvements in Ruby 2 promise better sharing of memory between forked processes (by making GC friendlier to copy-on-write), but I have yet to see a material benefit from this in production.

Context switching between processes is also more expensive than context switching between threads, because a process switch involves swapping out the memory address space, while a thread switch happens within the same address space.

Even when pooling database connections through a connection manager like PgBouncer, more processes mean more idle connections, and if you're not careful (i.e., if you don't monitor connection counts) it's easy to accidentally max out your database's connection limit. We are particularly aggressive on this front (having burned ourselves a few times), sometimes going so far as to force ActiveRecord to release connections back into its connection pool before starting long, blocking requests. For instance:

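class PushNotificationWorker < WaneloWorker
  def perform!(user_id, message)
    user = User.find(user_id)
    # Return this thread's database connection to the pool before the slow,
    # network-bound delivery, so other threads can use it in the meantime.
    ActiveRecord::Base.clear_active_connections!
    PushNotification.new(user, message).deliver!
  end
end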

This way, other threads can check that connection out of the pool while the first thread is still waiting for the push notification to be delivered.
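To make the effect concrete, here is a toy model: a pool holding a single connection and three worker threads, each running a quick query and then a slow external call (both simulated with `sleep`). `ToyPool`, `run_workers`, and the timings are hypothetical and far simpler than ActiveRecord's real connection pool; the point is only the ordering. Holding the connection through the slow call serializes the workers, while returning it first lets them overlap.

```ruby
# Toy connection pool (hypothetical; much simpler than ActiveRecord's):
# a Queue pre-filled with placeholder "connections".
class ToyPool
  def initialize(size)
    @queue = Queue.new
    size.times { |i| @queue << "conn-#{i}" }
  end

  # Check a connection out (blocking while none is free), yield it,
  # and always check it back in.
  def with_connection
    conn = @queue.pop
    yield conn
  ensure
    @queue << conn if conn
  end
end

QUICK_QUERY = 0.01 # simulated User.find
SLOW_CALL   = 0.2  # simulated push notification delivery

# With release_early, the slow call happens after the connection has been
# returned, which is what clear_active_connections! achieves in the worker.
def run_workers(pool, count, release_early:)
  Array.new(count) do
    Thread.new do
      if release_early
        pool.with_connection { sleep QUICK_QUERY }
        sleep SLOW_CALL
      else
        pool.with_connection do
          sleep QUICK_QUERY
          sleep SLOW_CALL # connection held for the whole slow call
        end
      end
    end
  end.each(&:join)
end

def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

held  = elapsed { run_workers(ToyPool.new(1), 3, release_early: false) }
early = elapsed { run_workers(ToyPool.new(1), 3, release_early: true) }
puts format("held: %.2fs, released early: %.2fs", held, early)
```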

Sometimes a workload blocks (or sleeps) on purpose, for instance in a daemon process that only wakes up every N seconds to do some work. The spanx gem (https://github.com/wanelo/spanx) works this way, with multiple actors running in separate threads.
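A minimal sketch of that pattern, assuming nothing about spanx's actual code: each hypothetical `PeriodicActor` runs its block in its own thread, then waits up to `interval` seconds. Waiting on a `ConditionVariable` with a timeout, rather than a bare `sleep`, lets `stop` wake the actor immediately instead of waiting out the interval.

```ruby
# A periodic actor: runs its block, then waits `interval` seconds, in its own
# thread. Hypothetical sketch, not the spanx implementation.
class PeriodicActor
  attr_reader :runs

  def initialize(interval, &work)
    @interval = interval
    @work = work
    @runs = 0
    @stopped = false
    @mutex = Mutex.new
    @wakeup = ConditionVariable.new
  end

  def start
    @thread = Thread.new do
      @mutex.synchronize do
        until @stopped
          @work.call
          @runs += 1
          # Releases the mutex while waiting; returns on timeout or on #stop.
          @wakeup.wait(@mutex, @interval)
        end
      end
    end
    self
  end

  def stop
    @mutex.synchronize do
      @stopped = true
      @wakeup.signal
    end
    @thread.join
  end
end

watcher  = PeriodicActor.new(0.05) { } # hypothetical actor; real work goes in the block
reporter = PeriodicActor.new(0.1) { }
actors = [watcher, reporter].map(&:start)
sleep 0.35
actors.each(&:stop)
puts actors.map(&:runs).inspect
```

Both actors live in one process, so they share one interpreter, one set of database connections, and one service definition.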

From an operational point of view, it's much easier to manage a single-process daemon than a set of daemons. On SmartOS, this means that the SMF definition for the service does not have to manage multiple processes. It also avoids a situation that can happen with a multi-process design, where one of the actors fails to start. When there's a problem, it's much less confusing to type "svcadm disable spanx-watcher" than to track down four separate services in order to stop them all. (That said, SMF does support a "noop" service that can be declared as a single dependency of several others, so that stopping the noop service also stops its dependents.)
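With a single process, one signal handler can shut down every actor. A sketch, assuming the SMF manifest uses the `:kill` stop method (which sends SIGTERM to the service's processes); `shutdown_pipe` and `wait_for_shutdown` are hypothetical helpers. Ruby trap handlers run in a restricted context where `Mutex#lock` is not allowed, so the handler only writes to a pipe (the "self-pipe" idiom) and the main thread does the real shutdown work.

```ruby
# Turn trapped signals into lines on a pipe. The trap handler does as little
# as possible: a non-blocking write is safe in trap context.
def shutdown_pipe(signals = %w[TERM INT])
  reader, writer = IO.pipe
  signals.each do |sig|
    Signal.trap(sig) { writer.write_nonblock("#{sig}\n") }
  end
  reader
end

# Block until one of the trapped signals arrives; returns its name.
def wait_for_shutdown(reader)
  reader.gets.chomp
end

# Intended use in the daemon's main thread (not executed here):
#   pipe = shutdown_pipe
#   actors = [...].map(&:start)      # start every actor thread
#   wait_for_shutdown(pipe)          # e.g. after `svcadm disable`
#   actors.each(&:stop)              # stop and join them all, then exit
```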

In our Chef cookbook for running Sidekiq background jobs as a service on SmartOS, we define pools of Sidekiq workers, each attached to a set of queues and given a configurable concurrency. This lets us run CPU-bound background jobs as a pool of single-threaded processes. Conversely, we can configure I/O-bound jobs, such as workers that need to talk to external APIs, as a pool with high concurrency (often as high as 10 or 20). If workers have to wait 2-3 seconds for API calls to return, that leaves plenty of time for other jobs to make progress concurrently.
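As an illustration of that layout (not the cookbook's actual code), here is a sketch that turns hypothetical pool definitions into one `sidekiq` command line per process, using Sidekiq's `-c` (concurrency) and `-q` (queue) flags. The pool names, queue names, and counts are made up.

```ruby
# Hypothetical pool definitions: which queues each pool serves, how many
# processes to run, and how many threads (Sidekiq concurrency) per process.
POOLS = {
  "cpu" => { queues: %w[image_resize reports], processes: 4, concurrency: 1 },
  "io"  => { queues: %w[push_notifications api_sync], processes: 1, concurrency: 20 },
}

# Build one `sidekiq` command (as an argv array) per process.
def sidekiq_commands(pools)
  pools.flat_map do |_name, cfg|
    queue_args = cfg[:queues].flat_map { |q| ["-q", q] }
    Array.new(cfg[:processes]) { ["sidekiq", "-c", cfg[:concurrency].to_s, *queue_args] }
  end
end

sidekiq_commands(POOLS).each { |cmd| puts cmd.join(" ") }
```

Rough arithmetic for the I/O pool, ignoring CPU time: if each job spends about 2.5 seconds waiting on an API, one process with `-c 20` can finish roughly 20 / 2.5 = 8 jobs per second, versus about 0.4 jobs per second with `-c 1`.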

Related: in his presentation "Accelerating Wanelo to 200K RPM", Konstantin Gredeskoul shows how to use New Relic to calculate the ratio of CPU to I/O time in our Ruby stack, and from that determine how many single-threaded Unicorn processes to run on a multi-core system for optimal utilization.
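A back-of-the-envelope version of that sizing, as a common rule of thumb rather than necessarily the exact method from the talk: if a request spends `cpu_ms` on CPU and `wait_ms` waiting on I/O, one single-threaded process keeps a core busy only `cpu_ms / (cpu_ms + wait_ms)` of the time, so keeping `cores` cores busy takes roughly `cores * (cpu_ms + wait_ms) / cpu_ms` processes. The helper name is hypothetical, and the result is an upper bound that ignores the memory and connection limits discussed above.

```ruby
# Rule-of-thumb process count for single-threaded workers (hypothetical helper).
def processes_for(cores, cpu_ms, wait_ms)
  raise ArgumentError, "cpu_ms must be positive" unless cpu_ms.positive?
  # Rational arithmetic keeps float rounding from nudging ceil up by one.
  (Rational(cores) * (cpu_ms + wait_ms) / cpu_ms).ceil
end

puts processes_for(8, 40, 60) # 40% CPU on 8 cores → 20
```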

- Eric