Coroutines in Python for Data Engineering (0)
Imagine we have several small pieces of data that need to be processed. To minimize the running time, running the processing job as multiple concurrent tasks on a multi-core machine is usually the best solution.
There are several ways to achieve a multi-task design: multi-processing, multi-threading, coroutines, and so on.
Multi-processing solutions usually involve several machines or clusters, as in MapReduce. Because of the nature of processes, a multi-process architecture can be complicated and expensive in terms of communication and coordination. That’s why, unless it is truly necessary, we should avoid multi-process architectures on a single machine.
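One way to see part of that communication cost: any object exchanged between Python processes must be serialized and deserialized (Python uses `pickle` for this under the hood). Below is a minimal sketch of that serialization overhead alone; the payload size is an arbitrary choice for illustration.

```python
import pickle
import time

# Any object sent between processes must be pickled on one side and
# unpickled on the other; this round trip is part of the communication
# cost of a multi-process design, before any actual work happens.
payload = list(range(1_000_000))

t0 = time.perf_counter()
blob = pickle.dumps(payload)
restored = pickle.loads(blob)
cost = time.perf_counter() - t0

print(f"serialized + deserialized {len(blob)} bytes in {cost:.4f}s")
```

Real inter-process communication adds OS pipe or socket transfer on top of this, so the sketch is a lower bound on the overhead.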
In this blog, I’ll also explain why multi-threading is not a good choice in Python.
Thread scheduling
The system scheduler operates on Kernel Scheduling Entities (KSEs). A user-space thread is scheduled by being bound to a KSE, and one kernel thread can be bound to one or more user threads.
1:1 binding
Most threading libraries, such as `java.lang.Thread` in Java, `std::thread` in C++, and `threading` in Python, use a 1:1 user-thread-to-kernel-thread binding. Scheduling is handled by the system scheduler, so the model is easy to implement. Since each user thread is bound to a different kernel thread, user threads can run in parallel.
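As a quick illustration of the 1:1 model, the sketch below (assuming Python 3.8+, which added `threading.get_native_id()`) shows that each `threading.Thread` is backed by a distinct kernel thread:

```python
import threading

native_ids = []
barrier = threading.Barrier(3)  # keep all three threads alive at once

def report():
    barrier.wait()
    # get_ident() is the Python-level thread id;
    # get_native_id() is the id of the kernel thread it is bound to.
    native_ids.append((threading.get_ident(), threading.get_native_id()))

threads = [threading.Thread(target=report) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Under 1:1 binding, each user thread maps to its own kernel thread.
print(native_ids)
```

Because the three threads are held alive simultaneously by the barrier, the three kernel thread ids printed are guaranteed to be distinct.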
However, 1:1 binding consumes more system resources, and performance degrades as the number of threads grows…
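To make that resource cost concrete, here is a rough, illustrative sketch (the counts are arbitrary) comparing the cost of spawning OS threads against spawning coroutine tasks, which all live inside a single kernel thread:

```python
import asyncio
import threading
import time

def spawn_threads(n):
    """Create and join n OS threads (1:1 model), each doing trivial work."""
    threads = [threading.Thread(target=lambda: None) for _ in range(n)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

async def spawn_tasks(n):
    """Create and await n coroutine tasks inside one OS thread."""
    t0 = time.perf_counter()
    tasks = [asyncio.create_task(asyncio.sleep(0)) for _ in range(n)]
    await asyncio.gather(*tasks)
    return time.perf_counter() - t0

thread_cost = spawn_threads(200)
task_cost = asyncio.run(spawn_tasks(200))
print(f"200 threads: {thread_cost:.4f}s, 200 coroutine tasks: {task_cost:.4f}s")
```

Each OS thread needs its own kernel structures and stack, while a coroutine task is just a Python object scheduled by the event loop, which is why coroutines scale to far higher task counts.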