跳转至

分布式训练

以两台机器为例,每台机器 1 个 trainer,1 个分片的存储后端。

环境

约定:

  • 机器 A: 10.0.0.1,shard 0,trainer rank 0
  • 机器 B: 10.0.0.2,shard 1,trainer rank 1
  • master_addr=10.0.0.1
  • master_port=29670

配置文件

两台机器必须共享同一份 distributed_client 配置,尤其是 num_shardshash_methodservers,下称 ./recstore_config.distributed.json

配置文件示例
recstore_config.distributed.json
{
  "cache_ps": {
    "ps_type": "BRPC",
    "max_batch_keys_size": 65536,
    "num_threads": 32,
    "num_shards": 2,
    "servers": [
      {
        "host": "10.0.0.1",
        "port": 15123,
        "shard": 0
      },
      {
        "host": "10.0.0.2",
        "port": 15123,
        "shard": 1
      }
    ],
    "base_kv_config": {
      "path": "/tmp/recstore_dist_data",
      "capacity": 1000000,
      "value_size": 512,
      "value_type": "DRAM",
      "index_type": "DRAM",
      "value_memory_management": "PersistLoopShmMalloc"
    }
  },
  "distributed_client": {
    "num_shards": 2,
    "hash_method": "city_hash",
    "max_keys_per_request": 500,
    "servers": [
      {
        "host": "10.0.0.1",
        "port": 15123,
        "shard": 0
      },
      {
        "host": "10.0.0.2",
        "port": 15123,
        "shard": 1
      }
    ]
  },
  "client": {
    "host": "10.0.0.1",
    "port": 15123,
    "shard": 0
  }
}

参数服务器

./build/bin/ps_server \
  --config_path ./recstore_config.distributed.json \
  -local_shard_id=0

期望日志包含:

bRPC Server shard 0 listening on 10.0.0.1:15123
./build/bin/ps_server \
  --config_path ./recstore_config.distributed.json \
  -local_shard_id=1

期望日志包含:

bRPC Server shard 1 listening on 10.0.0.2:15123
连通性检查

本机端口:

ss -ltnp | grep 15123
ss -ltnp | grep 15123

跨机连通性检查:

nc -vz 10.0.0.2 15123
nc -vz 10.0.0.1 15123

训练

可以指定训练输出存放位置:

mkdir -p ./tmp/rs_demo ./tmp/recstore-dist-shared
python3 model_zoo/rs_demo/run_mock_stress.py \
  --backend recstore \
  --nnodes 2 \
  --node-rank 0 \
  --nproc-per-node 1 \
  --master-addr 10.0.0.1 \
  --master-port 29670 \
  --run-id recstore-dist-2node-2proc \
  --output-root ./tmp/rs_demo \
  --recstore-runtime-dir ./tmp/recstore-dist-shared \
  --no-start-server
python3 model_zoo/rs_demo/run_mock_stress.py \
  --backend recstore \
  --nnodes 2 \
  --node-rank 1 \
  --nproc-per-node 1 \
  --master-addr 10.0.0.1 \
  --master-port 29670 \
  --run-id recstore-dist-2node-2proc \
  --output-root ./tmp/rs_demo \
  --recstore-runtime-dir ./tmp/recstore-dist-shared \
  --no-start-server

结果

输出目录
ls -l ./tmp/rs_demo/outputs/recstore-dist-2node-2proc
rank 子目录
ls -l ./tmp/rs_demo/outputs/recstore-dist-2node-2proc/recstore_ranks
聚合结果
cat ./tmp/rs_demo/outputs/recstore-dist-2node-2proc/recstore_main_agg.csv
明细结果前 5 行
head -n 5 ./tmp/rs_demo/outputs/recstore-dist-2node-2proc/recstore_main.csv

TorchRec 对齐

python3 model_zoo/rs_demo/run_mock_stress.py \
  --backend torchrec \
  --nnodes 2 \
  --node-rank 0 \
  --nproc-per-node 1 \
  --master-addr 10.0.0.1 \
  --master-port 29671 \
  --run-id torchrec-fair-2node-2proc \
  --output-root ./tmp/rs_demo \
  --torchrec-dist-mode fair_remote \
  --no-start-server
python3 model_zoo/rs_demo/run_mock_stress.py \
  --backend torchrec \
  --nnodes 2 \
  --node-rank 1 \
  --nproc-per-node 1 \
  --master-addr 10.0.0.1 \
  --master-port 29671 \
  --run-id torchrec-fair-2node-2proc \
  --output-root ./tmp/rs_demo \
  --torchrec-dist-mode fair_remote \
  --no-start-server

工具命令

server 日志
tail -f /path/to/ps_server.log
rank 日志
ls -l ./tmp/rs_demo/outputs/recstore-dist-2node-2proc/recstore_ranks
worker 指纹
cat ./tmp/rs_demo/outputs/recstore-dist-2node-2proc/recstore_worker_fingerprints.json