The current code uses a work queue to perform SCSI commands (or the block target's tasks). A work queue is simple and good enough for debugging; however, a single thread per CPU is not good enough from a performance perspective. I thought about creating multiple kernel threads by hand. Are there any handy APIs for this?