Distributed Training
This walkthrough uses two machines as an example: each machine runs 1 trainer and 1 shard of the storage backend.
Environment
Conventions:
- Machine A: 10.0.0.1, shard 0, trainer rank 0
- Machine B: 10.0.0.2, shard 1, trainer rank 1
- master_addr=10.0.0.1
- master_port=29670
Configuration File
Both machines must share the same distributed_client configuration, in particular num_shards, hash_method, and servers. The shared file is referred to below as ./recstore_config.distributed.json; a quick consistency check is sketched after the example.
Configuration file example
recstore_config.distributed.json:

```json
{
  "cache_ps": {
    "ps_type": "BRPC",
    "max_batch_keys_size": 65536,
    "num_threads": 32,
    "num_shards": 2,
    "servers": [
      {
        "host": "10.0.0.1",
        "port": 15123,
        "shard": 0
      },
      {
        "host": "10.0.0.2",
        "port": 15123,
        "shard": 1
      }
    ],
    "base_kv_config": {
      "path": "/tmp/recstore_dist_data",
      "capacity": 1000000,
      "value_size": 512,
      "value_type": "DRAM",
      "index_type": "DRAM",
      "value_memory_management": "PersistLoopShmMalloc"
    }
  },
  "distributed_client": {
    "num_shards": 2,
    "hash_method": "city_hash",
    "max_keys_per_request": 500,
    "servers": [
      {
        "host": "10.0.0.1",
        "port": 15123,
        "shard": 0
      },
      {
        "host": "10.0.0.2",
        "port": 15123,
        "shard": 1
      }
    ]
  },
  "client": {
    "host": "10.0.0.1",
    "port": 15123,
    "shard": 0
  }
}
```
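Since both machines must load an identical distributed_client section (num_shards, hash_method, servers), one simple way to confirm the files really match is to hash that section on each machine and compare the output. This is only a sketch and assumes jq and sha256sum are available:

```bash
# Run on both machines; identical hashes mean the shard-routing config matches.
jq -S '.distributed_client' ./recstore_config.distributed.json | sha256sum
```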
Parameter Server
Connectivity checks
Local port:
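Assuming each parameter-server shard listens on port 15123 as configured above, a generic check that the port is open on the local machine (a sketch, not a project-specific command):

```bash
# Confirm a process is listening on the configured port (15123 here).
ss -lntp | grep 15123
```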
Cross-machine connectivity:
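To verify that each machine can reach the other machine's shard over the network, a plain netcat probe works (again a generic sketch, not a project command):

```bash
nc -vz 10.0.0.2 15123   # run on machine A, probes machine B's shard
nc -vz 10.0.0.1 15123   # run on machine B, probes machine A's shard
```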
Training
You can specify where the training outputs are stored:
```bash
mkdir -p ./tmp/rs_demo ./tmp/recstore-dist-shared
```
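The actual training entry point is repository-specific; the sketch below only shows how the conventions above (one trainer per machine, master_addr=10.0.0.1, master_port=29670) would map onto a standard torchrun launch. train_dist.py and its flags are placeholders, not the real script:

```bash
# Machine A (node_rank 0); replace train_dist.py and its flags with the real entry point.
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
  --master_addr=10.0.0.1 --master_port=29670 \
  train_dist.py --recstore_config ./recstore_config.distributed.json

# Machine B: same command with --node_rank=1.
```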
Results
| Item | Command |
|---|---|
| Output directory | `ls -l ./tmp/rs_demo/outputs/recstore-dist-2node-2proc` |
| Per-rank subdirectories | `ls -l ./tmp/rs_demo/outputs/recstore-dist-2node-2proc/recstore_ranks` |
| Aggregated results | `cat ./tmp/rs_demo/outputs/recstore-dist-2node-2proc/recstore_main_agg.csv` |
| First 5 rows of the detailed results | `head -n 5 ./tmp/rs_demo/outputs/recstore-dist-2node-2proc/recstore_main.csv` |
TorchRec Alignment
Utility commands
| Item | Command |
|---|---|
| Server log | `tail -f /path/to/ps_server.log` |
| Per-rank logs | `ls -l ./tmp/rs_demo/outputs/recstore-dist-2node-2proc/recstore_ranks` |
| Worker fingerprints | `cat ./tmp/rs_demo/outputs/recstore-dist-2node-2proc/recstore_worker_fingerprints.json` |