Scaling a Ruby script by running multiple processes instead of using threads
I want to increase the throughput of a script that does pure I/O (a scraper). Instead of making it multithreaded in Ruby (I'm using 1.9.1 as the default interpreter), I want to start multiple processes. Is there a system for this that lets me track when a process finishes and start it again, so that X of them are running at all times? Some of them will also run with different command-line arguments. I was considering writing a bash script, but that sounds like a bad idea if there is already a standard way to do this on Linux.
I would recommend not forking, and using EventMachine instead (plus the excellent em-http-request if you are doing HTTP). Managing multiple processes can be a handful, even more so than handling multiple threads, whereas the evented approach is much simpler by comparison. Since you mostly do network I/O, which is mostly waiting, I think the evented approach will scale as well as, or better than, forking or threading. Most importantly, it will require much less code, and that code will be more readable.
Even if you choose to run a separate process for each task, EventMachine can help you write the code that manages the subprocesses, for example with EventMachine.popen.
And finally, if you want to do it without EventMachine, read the docs for IO.popen, Open3.popen3, and Open4.popen4. They all do more or less the same thing, but give you access to the stdin, stdout, and stderr (Open3, Open4) and the pid (Open4) of the subprocess.
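To illustrate the non-EventMachine route, here is a minimal sketch using Open3.popen3 from the standard library. The helper name `run_child` and the `scraper.rb` command in the usage comment are hypothetical, and the `wait_thr` argument assumes Ruby 1.9 or later.

```ruby
require 'open3'

# Hypothetical helper: runs a command and returns
# [stdout text, stderr text, exit status, pid] of the subprocess.
def run_child(*cmd)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close                               # the child gets no input
    err_reader = Thread.new { stderr.read }   # drain stderr in parallel to avoid a pipe deadlock
    out = stdout.read                         # blocks until the child closes stdout
    [out, err_reader.value, wait_thr.value.exitstatus, wait_thr.pid]
  end
end

# Usage (scraper.rb and its flags are placeholders):
# out, err, status, pid = run_child("ruby", "scraper.rb", "--site", "example.com")
```

Reading stderr in a separate thread is a precaution: if the child writes a lot to stderr while the parent is blocked reading stdout, both sides can otherwise stall.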
You can try fork: http://ruby-doc.org/core/classes/Process.html#M003148
It returns the PID of the child process, which you can use to see whether that process has finished and needs to be run again.
If you want to manage I/O concurrency, I suggest you use EventMachine.
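As a rough sketch of how the PID returned by fork can be used to keep a fixed number of workers busy, something like the following might work. The method name `run_pool` and its arguments are hypothetical, and it assumes a platform where fork is available (such as Linux).

```ruby
# Hypothetical supervisor: runs one child per argument set, never more than
# max_workers at a time, and starts the next child as soon as one exits.
# Returns a hash of argument set => exit status.
def run_pool(arg_sets, max_workers)
  queue    = arg_sets.dup
  running  = {}   # pid => argument set
  finished = {}   # argument set => exit status

  until queue.empty? && running.empty?
    # Top the pool up to max_workers children.
    while running.size < max_workers && !queue.empty?
      args = queue.shift
      pid = fork { yield(*args) }   # the child runs the block, then exits
      running[pid] = args
    end

    # Block until any child exits, then record its status.
    pid, status = Process.wait2
    finished[running.delete(pid)] = status.exitstatus
  end

  finished
end

# Usage (scrape is a placeholder for the real work):
# run_pool([["site-a"], ["site-b"], ["site-c"]], 2) { |site| scrape(site) }
```

To rerun a worker that failed, the caller could push its argument set back onto the queue when the exit status is non-zero; that is left out here to keep the sketch short.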
You can either
- implement a ThreadPool (a ProcessPool, in your case), or find an equivalent gem, or
- prepare an array of all, say, 1000 tasks to be processed, divide it into 10 chunks of 100 tasks (10 being the number of parallel processes you want to run), and start 10 processes, each of which immediately gets its 100 tasks to process. This way you do not need to start 1000 processes and make sure that no more than 10 of them are running at the same time.
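The chunking option could be sketched roughly as follows. The method name `process_in_chunks` is hypothetical, the block stands in for whatever a single task does, and fork is assumed to be available.

```ruby
# Hypothetical helper: splits tasks into at most process_count chunks,
# forks one child per chunk, and returns the children's exit statuses.
def process_in_chunks(tasks, process_count)
  return [] if tasks.empty?

  # Round up so that no more than process_count chunks are produced.
  chunk_size = (tasks.size / process_count.to_f).ceil

  pids = tasks.each_slice(chunk_size).map do |chunk|
    fork { chunk.each { |task| yield task } }   # each child works through its own chunk
  end

  # Wait for every child and collect its exit status.
  pids.map { |pid| Process.wait2(pid).last.exitstatus }
end

# Usage (scrape is a placeholder for the real work):
# process_in_chunks(urls, 10) { |url| scrape(url) }
```

One trade-off of this design is that the chunks are fixed up front, so a process that happens to get slow tasks keeps running after the others have finished.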