Master Guide to Stream Control in PyTorch CUDA: An In-Depth Look at `torch.cuda.set_stream`

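torch.cuda.set_stream sets the current stream in PyTorch CUDA. A stream is a queue of operations on a CUDA device: operations in one stream execute in the order they were issued, while operations in different streams may run concurrently. Using several streams can therefore overlap independent operations and improve GPU utilization.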

Usage

torch.cuda.set_stream takes a Stream object as its argument. A Stream object is created by instantiating the torch.cuda.Stream class.

import torch

device = torch.device("cuda")

stream = torch.cuda.Stream(device=device)

torch.cuda.set_stream(stream)

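The code above creates a Stream object for the CUDA device given by device and makes it the current stream with torch.cuda.set_stream. The change stays in effect until set_stream is called again; it is not undone automatically.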
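To make the effect of set_stream visible, the following minimal sketch checks the current stream before and after the call and then restores the previous stream by hand. The function name set_and_restore is hypothetical, and the guard for machines without CUDA is an assumption added so the snippet can be run anywhere.

```python
import torch


def set_and_restore():
    """Make a side stream current, check it, then restore the previous one."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a CUDA device

    previous = torch.cuda.current_stream()  # remember what was current
    side = torch.cuda.Stream()

    torch.cuda.set_stream(side)
    became_current = torch.cuda.current_stream() == side

    torch.cuda.set_stream(previous)  # set_stream does not undo itself
    restored = torch.cuda.current_stream() == previous

    return became_current, restored


print(set_and_restore())
```

Saving the result of torch.cuda.current_stream() before switching is what makes the restore possible.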

Context manager

torch.cuda.stream is a context manager that selects a stream for the duration of a with block.

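import torch

device = torch.device("cuda")

# The stream has to exist before it can be passed to the context manager.
stream = torch.cuda.Stream(device=device)

with torch.cuda.stream(stream):
    # All CUDA operations issued in this block are enqueued on the given stream.
    pass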

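The code above uses a with statement with torch.cuda.stream so that every operation issued inside the block is enqueued on the given stream. When the block exits, the previously current stream is restored automatically. Exiting the block does not wait for the enqueued work to finish.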
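The restore-on-exit behavior also holds when the blocks are nested. The sketch below records which stream is current at each point; nested_stream_scopes is a hypothetical helper name, and the CUDA availability guard is an assumption added for portability.

```python
import torch


def nested_stream_scopes():
    """Record whether the expected stream is current at each nesting level."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a CUDA device

    outer = torch.cuda.Stream()
    inner = torch.cuda.Stream()
    before = torch.cuda.current_stream()

    seen = []
    with torch.cuda.stream(outer):
        seen.append(torch.cuda.current_stream() == outer)
        with torch.cuda.stream(inner):
            seen.append(torch.cuda.current_stream() == inner)
        # Leaving the inner block restores the outer stream.
        seen.append(torch.cuda.current_stream() == outer)
    # Leaving the outer block restores whatever was current before it.
    seen.append(torch.cuda.current_stream() == before)
    return seen


print(nested_stream_scopes())
```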

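Use cases for torch.cuda.set_stream

torch.cuda.set_stream is useful in a variety of situations. Some examples:

Reducing total run time by overlapping independent operations
Running multiple models in parallel
Parallelizing data transfers between multiple GPU devices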
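The first use case, overlap, can be sketched as a host-to-device copy issued on a side stream while a matrix multiplication runs on the current stream. This is an illustration under assumptions: overlap_copy_and_compute is a hypothetical name, the tensor sizes are arbitrary, and whether the two operations actually overlap depends on the hardware and on the host buffer being pinned.

```python
import torch


def overlap_copy_and_compute():
    """Issue a host-to-device copy and a matmul on different streams."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a CUDA device

    copy_stream = torch.cuda.Stream()
    main = torch.cuda.current_stream()

    host = torch.ones(1 << 20).pin_memory()  # pinned memory allows async copies
    a = torch.ones(512, 512, device="cuda")

    with torch.cuda.stream(copy_stream):
        on_gpu = host.to("cuda", non_blocking=True)  # enqueued on copy_stream

    b = a @ a  # enqueued on the current stream, independent of the copy

    # Before the current stream reads on_gpu, it must wait for the copy.
    main.wait_stream(copy_stream)
    # on_gpu was allocated on copy_stream; tell the allocator it is used here too.
    on_gpu.record_stream(main)

    total = on_gpu.sum() + b[0, 0]
    return total.item()  # .item() blocks until the result is available


print(overlap_copy_and_compute())
```

The two wait and record calls are the part that is easy to forget: the streams run independently until the code states where they must meet.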

Notes

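A full torch.cuda.synchronize() is not required before using a stream. It blocks the host until every stream on the device is idle, which removes the overlap that streams are meant to provide. What is required is ordering between streams that share data: a stream that reads a tensor produced on another stream must first wait for that work, for example with Stream.wait_stream() or Event.wait().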
torch.cuda.set_stream requires a CUDA-enabled build of PyTorch and a usable CUDA device; torch.cuda.is_available() reports whether both are present.
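Ordering between streams can be expressed without a device-wide synchronize. The following sketch produces a tensor on the current stream, consumes it on a side stream, and then uses the result on the current stream again. The name produce_then_consume is hypothetical and the CUDA guard is an assumption for portability.

```python
import torch


def produce_then_consume():
    """Hand a tensor from one stream to another and back, with explicit waits."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a CUDA device

    main = torch.cuda.current_stream()
    side = torch.cuda.Stream()

    x = torch.ones(1024, device="cuda")  # produced on the current stream

    side.wait_stream(main)  # side may start only after x is ready
    with torch.cuda.stream(side):
        y = x * 2
    x.record_stream(side)  # x was allocated on main but is read on side

    main.wait_stream(side)  # main may continue only after y is ready
    y.record_stream(main)   # y was allocated on side but is read on main
    z = y + 1

    return z.sum().item()


print(produce_then_consume())
```

wait_stream only inserts a dependency between the two queues; the host thread keeps running and blocks only at the final .item() call.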

torch.cuda.set_stream is a powerful tool for controlling streams in PyTorch CUDA. Used correctly, streams can improve GPU performance when there is independent work to overlap.

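Parallelizing data transfers between multiple GPU devices

This example shows how to issue and time data transfers between two GPU devices using one stream per device.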

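import torch

# This example assumes at least two CUDA devices.
device0 = torch.device("cuda:0")
device1 = torch.device("cuda:1")

x = torch.randn(1000, 1000, device=device0)

stream0 = torch.cuda.Stream(device=device0)
stream1 = torch.cuda.Stream(device=device1)

# x was created on device0's default stream, so stream0 must wait for it.
stream0.wait_stream(torch.cuda.current_stream(device0))

with torch.cuda.stream(stream0):
    y = x.to(device1, non_blocking=True)

# y is produced by work queued on stream0, so stream1 must wait for it.
stream1.wait_stream(stream0)

with torch.cuda.stream(stream1):
    z = y.to(device0, non_blocking=True)

# z is produced on stream1; the timed copies below must not start before it exists.
stream0.wait_stream(stream1)

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

with torch.cuda.stream(stream0):
    start_event.record()
    y.copy_(x)
    end_event.record()

# elapsed_time() requires both events to have completed.
end_event.synchronize()
elapsed_time = start_event.elapsed_time(end_event)
print(f"GPU0 to GPU1 copy time: {elapsed_time:.3f} ms")

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

stream1.wait_stream(stream0)  # y was just written on stream0

with torch.cuda.stream(stream1):
    start_event.record()
    z.copy_(y)
    end_event.record()

end_event.synchronize()
elapsed_time = start_event.elapsed_time(end_event)
print(f"GPU1 to GPU0 copy time: {elapsed_time:.3f} ms")

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

stream0.wait_stream(stream1)  # z was just written on stream1

with torch.cuda.stream(stream0):
    start_event.record()
    x.copy_(z)  # x and z are both on GPU0
    end_event.record()

end_event.synchronize()
elapsed_time = start_event.elapsed_time(end_event)
print(f"Copy within GPU0: {elapsed_time:.3f} ms")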

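This code creates two streams, stream0 and stream1, each associated with a different GPU device, and issues the data transfers inside with blocks for those streams. Because each copy here reads the result of the previous one, the streams have to wait for each other and the copies run one after another. Transfers overlap, and performance improves, only when the work placed on different streams is independent.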

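Running multiple models in parallel

This example shows how to run two models on separate streams. Note that model1 consumes the output of model0 here, so the two stages are dependent and stream1 has to wait for stream0.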

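import torch

device = torch.device("cuda")

# Streams only affect CUDA work, so the models and the input must live on the GPU.
model0 = torch.nn.Linear(100, 10).to(device)
model1 = torch.nn.Linear(10, 1).to(device)

# Linear(100, 10) expects 100 features in the last dimension.
x = torch.randn(64, 100, device=device)

stream0 = torch.cuda.Stream(device=device)
stream1 = torch.cuda.Stream(device=device)

# The parameters and x were initialized on the default stream.
stream0.wait_stream(torch.cuda.current_stream(device))

with torch.cuda.stream(stream0):
    y0 = model0(x)

# model1 consumes y0, so stream1 has to wait for stream0.
stream1.wait_stream(stream0)

with torch.cuda.stream(stream1):
    y1 = model1(y0)

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

with torch.cuda.stream(stream0):
    start_event.record()
    model0(x)
    end_event.record()

# elapsed_time() requires both events to have completed.
end_event.synchronize()
elapsed_time = start_event.elapsed_time(end_event)
print(f"Model 0 run time: {elapsed_time:.3f} ms")

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

with torch.cuda.stream(stream1):
    start_event.record()
    model1(y0)
    end_event.record()

end_event.synchronize()
elapsed_time = start_event.elapsed_time(end_event)
print(f"Model 1 run time: {elapsed_time:.3f} ms")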
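For the two models to be able to run concurrently, they need independent inputs. The sketch below is one way this might look; run_two_models, the layer sizes, and the batch size are illustrative assumptions, and small kernels like these may still execute back to back on a real GPU.

```python
import torch


def run_two_models():
    """Run two independent models on two streams and join on the current stream."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a CUDA device

    device = torch.device("cuda")
    model_a = torch.nn.Linear(100, 10).to(device)
    model_b = torch.nn.Linear(100, 10).to(device)
    x_a = torch.randn(64, 100, device=device)
    x_b = torch.randn(64, 100, device=device)

    main = torch.cuda.current_stream()
    stream_a = torch.cuda.Stream()
    stream_b = torch.cuda.Stream()

    # Parameters and inputs were initialized on the current stream.
    stream_a.wait_stream(main)
    stream_b.wait_stream(main)

    with torch.no_grad():
        with torch.cuda.stream(stream_a):
            out_a = model_a(x_a)  # enqueued on stream_a
        with torch.cuda.stream(stream_b):
            out_b = model_b(x_b)  # enqueued on stream_b, independent of stream_a

    # Join: the current stream continues only after both results exist.
    main.wait_stream(stream_a)
    main.wait_stream(stream_b)

    return tuple(out_a.shape), tuple(out_b.shape)


print(run_two_models())
```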

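Passing a stream explicitly

Most PyTorch operators, including Tensor.to(), do not accept a stream argument; they are enqueued on the current stream. A small number of APIs do take a Stream object explicitly, such as torch.cuda.Event.record(), Tensor.record_stream(), and torch.cuda.Stream.wait_stream(). Passing a Stream to these ties that one call to a specific stream.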

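import torch

device = torch.device("cuda")

stream = torch.cuda.Stream(device=device)

x = torch.randn(1000, 1000, device=device)
stream.wait_stream(torch.cuda.current_stream(device))  # x must be ready first

with torch.cuda.stream(stream):
    y = x * 2  # operators take no stream argument; they use the current stream

x.record_stream(stream)  # tell the caching allocator that x is in use on stream
done = torch.cuda.Event()
done.record(stream)      # Event.record() accepts the stream explicitly
done.synchronize()       # block the host until the work on stream has finished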

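This code runs x * 2 on stream by making it the current stream, then passes stream explicitly to Tensor.record_stream() and Event.record(). Where an API supports it, passing the stream explicitly is more concise than calling torch.cuda.set_stream and does not change the current stream for the surrounding code.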

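Advantages

Well suited to attaching a single call, such as recording an event, to a particular stream
The stream is visible at the call site, which keeps the code concise and readable

Disadvantages

Ordinary operators cannot be grouped onto a stream this way, because they do not accept a stream argument
The stream has to be passed explicitly on every such call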

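Using torch.cuda.current_stream()

torch.cuda.current_stream() returns the current stream for a device. It does not change anything by itself, but saving its return value is what makes it possible to switch the current stream temporarily and restore the original one afterwards. The torch.cuda.stream context manager does exactly this on entry and exit.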

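import torch

device = torch.device("cuda")

stream = torch.cuda.Stream(device=device)
previous = torch.cuda.current_stream(device)  # the stream that is current before the block

with torch.cuda.stream(stream):
    x = torch.randn(1000, 1000, device=device)
    y = x * 2

# The current stream is the original stream again.
assert torch.cuda.current_stream(device) == previous

previous.wait_stream(stream)  # y was computed on stream; wait before using it here
z = y + y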

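This code uses a with statement to enqueue the operations in the block on stream. The context manager saves the current stream on entry and restores it when the block exits; torch.cuda.current_stream() only reports which stream is current and does not perform the restore itself.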

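Advantages

All operations in a given code block can be grouped onto the same stream.
The with statement makes the scope of the stream easy to manage.

Disadvantages

Only one stream is current at a time, so interleaving work across several streams takes a separate with block for each switch
The code can become verbose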

Using CUDA kernel launches

Launching kernels directly through the CUDA runtime is a low-level API that offers finer-grained stream control. It is useful when advanced performance tuning is required.

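#include <cuda_runtime.h>

// A kernel must be declared __global__ to be launched from the host.
__global__ void my_kernel(float* data, int n) {
  // ...
}

int main() {
  const int n = 1 << 20;  // placeholder element count

  // Kernels operate on device memory, so allocate it with cudaMalloc.
  float* data = nullptr;
  cudaMalloc(&data, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // The stream is passed as the fourth launch-configuration argument.
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  my_kernel<<<blocks, threads, 0, stream>>>(data, n);

  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);

  cudaFree(data);

  return 0;
}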

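This CUDA C++ example passes the stream as the fourth launch-configuration argument so that my_kernel runs on it; the CUDA runtime has no call that sets a "current" stream. It then uses cudaStreamSynchronize to wait for the stream to finish before destroying the stream.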

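Advantages

Useful for advanced performance tuning
Allows very fine-grained stream control

Disadvantages

Requires familiarity with the CUDA runtime API
The code is more complex and harder to understand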

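torch.cuda.set_stream is a general-purpose tool for controlling which stream kernels are enqueued on for a CUDA device, but in some situations one of the alternatives is a better fit. Each of the alternatives described above has its own advantages and disadvantages, so it is important to choose the tool that best matches your needs.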

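torch.cuda.synchronize() does not have an async=True argument; async is a reserved word in Python and cannot be used as a keyword argument name. The function accepts only an optional device and always blocks the host until all streams on that device have finished. To wait without blocking the host, order streams with Stream.wait_stream() or Event.wait(), or poll for completion with Event.query() or Stream.query().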
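One way to check for completion without blocking the host is to record an event after the work and poll it with Event.query(). The sketch below assumes a hypothetical helper name, poll_until_done, and an arbitrary matrix size; the busy loop stands in for whatever host-side work the program would do in the meantime.

```python
import torch


def poll_until_done():
    """Enqueue work on a side stream and poll an event instead of blocking."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a CUDA device

    stream = torch.cuda.Stream()
    done = torch.cuda.Event()

    with torch.cuda.stream(stream):
        x = torch.randn(2048, 2048, device="cuda")
        y = x @ x
        done.record()  # recorded on the current stream, which is `stream` here

    polls = 0
    while not done.query():  # returns immediately with True or False
        polls += 1           # the host is free to do other work here

    return done.query()


print(poll_until_done())
```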