multiprocessing

multiprocessingはpython標準の並列計算モジュールである.関数をn番目のprocessで実行することができる.もっともよくあるデータ解析での利用は,複数のアンサンブルやモデルの出力に同じ処理をすることであろう.この場合はそれぞれの計算結果は,ファイルに書き出すだけでよく,process間の通信を行う必要がない.

# The following example applies a time filter to 101 ensemble members.
# The results of each ensemble member are stored in a separate directory,
# so the work is embarrassingly parallel: no inter-process communication
# is needed, each worker just writes its own output files.
import multiprocessing
import subprocess
import time


def filter_in_dirs(dir_names, nth_p):
    """Process the files contained in each directory of *dir_names*.

    Parameters
    ----------
    dir_names : list of str
        Directories whose files this worker should operate on.
    nth_p : int
        Index of this worker process (useful for logging/diagnostics).
    """
    for dir_name in dir_names:
        pass  # ... work on the files contained in directory dir_name


if __name__ == '__main__':
    # List every directory below the current one.
    dir_names = subprocess.check_output(
        'find ./ -type d', shell=True).decode('ascii').split('\n')
    dir_names.sort()
    dir_names = dir_names[2:]  # The first two items, i.e. '' and './', are removed.

    no_processes = 10  # how many processes to split the work into
    no_ttl_dir = len(dir_names)  # total number of directories
    # Ceiling division so that every directory is assigned to some process
    # (the original hard-coded stride only worked for ~101 directories).
    no_dirs_per_process = -(-no_ttl_dir // no_processes)
    print('total dir number=%d no_dirs_per_process=%d'
          % (no_ttl_dir, no_dirs_per_process))

    time1 = time.time()
    jobs = []
    for nth_p in range(no_processes):  # nth_p stands for nth process
        ndir_bgn = no_dirs_per_process * nth_p
        ndir_end = min(no_dirs_per_process * (nth_p + 1), no_ttl_dir)
        print('process=%d, ndir=%d %d' % (nth_p, ndir_bgn, ndir_end))
        job = multiprocessing.Process(
            target=filter_in_dirs,
            args=(dir_names[ndir_bgn:ndir_end], nth_p))
        jobs.append(job)
        job.start()

    for job in jobs:  # wait for all workers to finish
        job.join()
    time2 = time.time()
    print('All process ends. Elapsed time=%d sec' % (time2 - time1))


# ---------------------------------------------------------------------------
# The following example applies a time filter to 101 ensemble members.
# The results of each ensemble member are stored in a separate directory.
import multiprocessing
import subprocess
import time


def filter_in_dirs(dir_names, nth_p):
    """Process the files contained in each directory of *dir_names*.

    Parameters
    ----------
    dir_names : list of str
        Directories whose files this worker should operate on.
    nth_p : int
        Index of this worker process (useful for logging/diagnostics).
    """
    for dir_name in dir_names:
        pass  # ... work on the files contained in directory dir_name


if __name__ == '__main__':
    # List every directory below the current one.
    dir_names = subprocess.check_output(
        'find ./ -type d', shell=True).decode('ascii').split('\n')
    dir_names.sort()
    dir_names = dir_names[2:]  # The first two items, i.e. '' and './', are removed.

    no_processes = 10  # how many processes to split the work into
    no_ttl_dir = len(dir_names)  # total number of directories
    # Ceiling division so that every directory is assigned to some process
    # (the original hard-coded stride only worked for ~101 directories).
    no_dirs_per_process = -(-no_ttl_dir // no_processes)
    print('total dir number=%d no_dirs_per_process=%d'
          % (no_ttl_dir, no_dirs_per_process))

    time1 = time.time()
    jobs = []
    for nth_p in range(no_processes):  # nth_p stands for nth process
        ndir_bgn = no_dirs_per_process * nth_p
        ndir_end = min(no_dirs_per_process * (nth_p + 1), no_ttl_dir)
        print('process=%d, ndir=%d %d' % (nth_p, ndir_bgn, ndir_end))
        job = multiprocessing.Process(
            target=filter_in_dirs,
            args=(dir_names[ndir_bgn:ndir_end], nth_p))
        jobs.append(job)
        job.start()

    for job in jobs:  # wait for all workers to finish
        job.join()
    time2 = time.time()
    print('All process ends. Elapsed time=%d sec' % (time2 - time1))


# multiprocessing is a standard Python module for parallel computation.
# It allows the same function to be executed in the n-th process.
# The most common application in parallel data analysis is to apply the
# same operation to different data. One example, shown in the sample
# script below, is applying time filtering to a large number of ensemble
# members. In this case each operation is completely independent; the
# results of different ensemble members are stored in different
# directories.
import multiprocessing
import subprocess
import time


def filter_in_dirs(dir_names, nth_p):
    """Process the files contained in each directory of *dir_names*.

    Parameters
    ----------
    dir_names : list of str
        Directories whose files this worker should operate on.
    nth_p : int
        Index of this worker process (useful for logging/diagnostics).
    """
    for dir_name in dir_names:
        pass  # ... work on the files contained in directory dir_name


if __name__ == '__main__':
    # List every directory below the current one.
    dir_names = subprocess.check_output(
        'find ./ -type d', shell=True).decode('ascii').split('\n')
    dir_names.sort()
    dir_names = dir_names[2:]  # The first two items, i.e. '' and './', are removed.

    no_processes = 10  # how many processes to split the work into
    no_ttl_dir = len(dir_names)  # total number of directories
    # Ceiling division so that every directory is assigned to some process
    # (the original hard-coded stride only worked for ~101 directories).
    no_dirs_per_process = -(-no_ttl_dir // no_processes)
    print('total dir number=%d no_dirs_per_process=%d'
          % (no_ttl_dir, no_dirs_per_process))

    time1 = time.time()
    jobs = []
    for nth_p in range(no_processes):  # nth_p stands for nth process
        ndir_bgn = no_dirs_per_process * nth_p
        ndir_end = min(no_dirs_per_process * (nth_p + 1), no_ttl_dir)
        print('process=%d, ndir=%d %d' % (nth_p, ndir_bgn, ndir_end))
        job = multiprocessing.Process(
            target=filter_in_dirs,
            args=(dir_names[ndir_bgn:ndir_end], nth_p))
        jobs.append(job)
        job.start()

    for job in jobs:  # wait for all workers to finish
        job.join()
    time2 = time.time()
    print('All process ends. Elapsed time=%d sec' % (time2 - time1))