Every deep learning developer knows that in the later stages of "model alchemy", the training performance bottleneck is often a barrier that is hard to break through. You have already tuned the network structure and tried all kinds of BuildStrategy (computation-graph optimization strategies), yet training speed still falls short of the target. You have used every performance trick in the book, so why does the result still miss expectations?
Where exactly is the problem?
At this point you surely wish for an "X-ray lens" that could open the black box of model execution and show which "part" is at fault.
If you are reading this article, you are in luck: the PaddlePaddle performance analysis tool Profiler introduced here is exactly such a lens. It helps developers take a model apart, locate the performance bottleneck, and break through the training-speed ceiling.
So how does the PaddlePaddle Profiler do it?
Profiler collects the execution time of the OPs in a model at multiple levels and along multiple dimensions, providing direct performance data for further model optimization.
An example beats a thousand words, so let PP, a developer well into the later stages of "alchemy", walk through two experiences of using Profiler to see into model performance and finish the optimization work.
Examples of Performance Optimization with Profiler
Scenario 1: Mixed-Precision Performance Optimization for XLNet
PP was implementing mixed-precision training for XLNet and found that the speedup from mixed precision fell short of expectations. PP tried various optimization strategies, but performance did not improve. At this point PP decided to profile the whole model with Profiler: adding a profiler.profiler interface in a with statement profiles everything executed under the with statement:
with profiler.profiler('All', 'total') as prof:
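    # (Illustrative body of the with block: exe, compiled_prog and out stand for the
    #  model's executor, compiled program and fetch target, as in the complete
    #  interface example later in this article; only code run inside this block is profiled.)
    for i in range(10):
        result = exe.run(program=compiled_prog, fetch_list=[out])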
Examining the report carefully, PP found that the time share of the slice_grad OP was abnormal. Its main performance data are shown below:
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Total time: 35377.5
Computation time Total: 32290.3 Ratio: 91.2735%
Framework overhead Total: 3087.21 Ratio: 8.72648%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 6352 Total: 627.243 Ratio: 1.773%
GpuMemcpyAsync Calls: 6312 Total: 621.182 Ratio: 1.75587%
GpuMemcpySync Calls: 40 Total: 6.06099 Ratio: 0.0171323%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
slice_grad 1900 17120.6 16651.581006 (0.972604) 469.040045 (0.027396) 0.16375 92.6487 9.01085 0.48394
slice_grad/compute 1900 17085.3 16616.287107 (0.972547) 469.040045 (0.027453) 0.150132 92.6314 8.99228 0.482943
slice_grad/infer_shape 1900 7.84785 7.847853 (1.000000) 0.000000 (0.000000) 0.002344 0.131375 0.00413045 0.000221831
slice_grad/prepare_data 1900 3.35192 3.351923 (1.000000) 0.000000 (0.000000) 0.00098 0.027065 0.00176417 9.47472e-05
matmul_grad 3720 2403.56 591.410434 (0.246056) 1812.148788 (0.753944) 0.105223 12.9082 0.646118 0.0679402
matmul_grad/compute 3720 2338.72 526.575728 (0.225155) 1812.148788 (0.774845) 0.092585 12.8887 0.628689 0.0661076
matmul_grad/infer_shape 3720 15.3237 15.323685 (1.000000) 0.000000 (0.000000) 0.002286 0.034366 0.00411927 0.000433147
matmul_grad/prepare_data 3720 5.76746 5.767456 (1.000000) 0.000000 (0.000000) 0.000957 0.033173 0.00155039 0.000163026
cast 14600 1651.43 1240.865640 (0.751388) 410.564599 (0.248612) 0.02195 9.75191 0.113112 0.0466802
cast/compute 14600 1499.45 1088.888215 (0.726190) 410.564599 (0.273810) 0.011647 9.74306 0.102702 0.0423843
cast/infer_shape 14600 39.1496 39.149612 (1.000000) 0.000000 (0.000000) 0.001536 0.213652 0.00268148 0.00110662
cast/prepare_data 14600 22.4475 22.447455 (1.000000) 0.000000 (0.000000) 0.000908 0.130246 0.0015375 0.00063451
A closer look at the data shows that slice_grad accounts for roughly 48% of the total time (Ratio 0.48394). The computation of this OP is actually quite simple and should never take such a large share, so something is clearly wrong. The relevant rows are repeated below:
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
slice_grad 1900 17120.6 16651.581006 (0.972604) 469.040045 (0.027396) 0.16375 92.6487 9.01085 0.48394
slice_grad/compute 1900 17085.3 16616.287107 (0.972547) 469.040045 (0.027453) 0.150132 92.6314 8.99228 0.482943
slice_grad/infer_shape 1900 7.84785 7.847853 (1.000000) 0.000000 (0.000000) 0.002344 0.131375 0.00413045 0.000221831
slice_grad/prepare_data 1900 3.35192 3.351923 (1.000000) 0.000000 (0.000000) 0.00098 0.027065 0.00176417 9.47472e-05
Having found that slice_grad's share of compute time was abnormal, PP's first guess was that the implementation of the slice_grad OP itself had a problem. Looking at the OP's detailed data, the CPU portion (CPU Time (Ratio)) dominates at about 97%, so some part of the CPU-side implementation of this OP was clearly causing the trouble. With the scope narrowed down, PP quickly located the code responsible for the bottleneck and optimized the CPU part of slice_grad. After rerunning, the overall training performance improved by 10%, a solid result.
Scenario 2: Control-Flow Performance Optimization
After that example, let's look at another one. PP had recently been studying PaddlePaddle control flow. After reading the example on the official website, PP started measuring the performance of some control-flow code and found, puzzlingly, that this example ran more slowly on GPU than on CPU. Wanting to speed up the GPU run, PP turned to Profiler again, this time using the start_profiler and stop_profiler interfaces to delimit the profiling range; narrowing the range makes it easier to pinpoint the problem.
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import paddle.fluid.profiler as profiler
import time
# Start Profiler performance analysis
profiler.start_profiler('All')
def fn_1():
    return layers.fill_constant(shape=[1, 2], dtype='float32', value=1)
def fn_2():
    return layers.fill_constant(shape=[2, 2], dtype='int32', value=2)
def fn_3():
    return layers.fill_constant(shape=[3], dtype='int32', value=3)
main_program = fluid.default_main_program()
startup_program = fluid.default_startup_program()
with fluid.program_guard(main_program, startup_program):
    x = layers.fill_constant(shape=[1], dtype='float32', value=0.3)
    y = layers.fill_constant(shape=[1], dtype='float32', value=0.1)
    z = layers.fill_constant(shape=[1], dtype='float32', value=0.2)
    pred_1 = layers.less_than(z, x)  # true: 0.2 < 0.3
    pred_2 = layers.less_than(x, y)  # false: 0.3 < 0.1
    pred_3 = layers.equal(x, y)      # false: 0.3 == 0.1
    # run fn_1 if pred_1 is True
    out_1 = layers.case(
        pred_fn_pairs=[(pred_1, fn_1), (pred_2, fn_2)], default=fn_3)
    # default is None here: when none of the predicates holds,
    # the last pair's function is used as the default branch
    out_2 = layers.case(pred_fn_pairs=[(pred_2, fn_2), (pred_3, fn_3)])
exe = fluid.Executor(fluid.CUDAPlace(0))
for i in range(100):
    if i > 20:  # treat the first iterations as warm-up and only time the rest
        time_start = time.time()
    res_1, res_2 = exe.run(main_program, fetch_list=[out_1, out_2])
    if i > 20:
        time_end = time.time()
        print('time cost', time_end - time_start, 's')
# Stop Profiler performance analysis
profiler.stop_profiler('total', '/tmp/profile')
print(res_1) # [[1. 1.]]
print(res_2) # [3 3 3]
The Profiler report is as follows:
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
------------------------- Overhead Summary -------------------------
Total time: 104.442
Computation time Total: 38.1497 Ratio: 36.5271%
Framework overhead Total: 66.2927 Ratio: 63.4729%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 1200 Total: 32.5351 Ratio: 31.1512%
GpuMemcpyAsync Calls: 800 Total: 20.1238 Ratio: 19.2679%
GpuMemcpySync Calls: 400 Total: 12.4112 Ratio: 11.8833%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::conditional_block 400 45.318 43.963987 (0.970122) 1.354010 (0.029878) 0.037256 0.637326 0.113295 0.433904
GpuMemcpyAsync:GPU->CPU 400 11.8308 11.040198 (0.933176) 0.790586 (0.066824) 0.023449 0.059867 0.029577 0.113276
assign 200 8.0128 7.738432 (0.965759) 0.274368 (0.034241) 0.033865 0.069719 0.040064 0.0767198
GpuMemcpyAsync(same_gpu):GPU->GPU 200 4.55945 4.285085 (0.939824) 0.274368 (0.060176) 0.018909 0.048334 0.0227973 0.0436552
fill_constant 200 7.80007 7.511009 (0.962942) 0.289056 (0.037058) 0.034299 0.083957 0.0390003 0.0746829
thread0::fill_constant 300 13.7209 13.298308 (0.969203) 0.422559 (0.030797) 0.025552 0.31214 0.0457362 0.131373
thread0::select_input 200 13.4918 12.810170 (0.949480) 0.681598 (0.050520) 0.061153 0.095125 0.0674588 0.129179
GpuMemcpySync:GPU->CPU 200 6.69242 6.296225 (0.940800) 0.396191 (0.059200) 0.030713 0.052313 0.0334621 0.0640776
GpuMemcpyAsync(same_gpu):GPU->GPU 200 3.73361 3.448200 (0.923557) 0.285407 (0.076443) 0.016164 0.047654 0.018668 0.035748
thread0::less_than 200 7.32649 7.053690 (0.962765) 0.272799 (0.037235) 0.029206 0.07994 0.0366324 0.0701486
thread0::fetch 200 7.16217 6.766394 (0.944741) 0.395773 (0.055259) 0.031348 0.06381 0.0358108 0.0685753
GpuMemcpySync:GPU->CPU 200 5.71881 5.323041 (0.930795) 0.395773 (0.069205) 0.025997 0.046626 0.0285941 0.0547557
thread0::logical_not 200 7.04657 6.772137 (0.961055) 0.274432 (0.038945) 0.031275 0.054511 0.0352328 0.0674685
thread0::cast 200 7.04549 6.768117 (0.960631) 0.277375 (0.039369) 0.030785 0.068275 0.0352275 0.0674582
thread0::equal 100 3.33104 3.192579 (0.958432) 0.138464 (0.041568) 0.030176 0.053589 0.0333104 0.0318936
Looking through the data, PP noticed a large amount of CPU<-->GPU copying around the control-flow OPs (1200 GpuMemcpy calls in total):
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 1200 Total: 32.5351 Ratio: 31.1512%
GpuMemcpyAsync Calls: 800 Total: 20.1238 Ratio: 19.2679%
GpuMemcpySync Calls: 400 Total: 12.4112 Ratio: 11.8833%
Since the control-flow computation itself is simple, it can just as well run on the CPU, and most of this copying can be optimized away. PP therefore used fluid.device_guard("cpu") to pin the control-flow-related OPs to the CPU and reduce the data traffic between CPU and GPU, and immediately modified the code:
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import paddle.fluid.profiler as profiler
import time
profiler.start_profiler('All')
def fn_1():
    return layers.fill_constant(shape=[1, 2], dtype='float32', value=1)
def fn_2():
    return layers.fill_constant(shape=[2, 2], dtype='int32', value=2)
def fn_3():
    return layers.fill_constant(shape=[3], dtype='int32', value=3)
main_program = fluid.default_main_program()
startup_program = fluid.default_startup_program()
with fluid.program_guard(main_program, startup_program):
    # place x, y and z on the CPU
    with fluid.device_guard("cpu"):
        x = layers.fill_constant(shape=[1], dtype='float32', value=0.3)
        y = layers.fill_constant(shape=[1], dtype='float32', value=0.1)
        z = layers.fill_constant(shape=[1], dtype='float32', value=0.2)
    # place pred_1, pred_2 and pred_3 on the CPU
    with fluid.device_guard("cpu"):
        pred_1 = layers.less_than(z, x)  # true: 0.2 < 0.3
        pred_2 = layers.less_than(x, y)  # false: 0.3 < 0.1
        pred_3 = layers.equal(x, y)      # false: 0.3 == 0.1
    # run fn_1 if pred_1 is True
    out_1 = layers.case(
        pred_fn_pairs=[(pred_1, fn_1), (pred_2, fn_2)], default=fn_3)
    # default is None here: when none of the predicates holds,
    # the last pair's function is used as the default branch
    out_2 = layers.case(pred_fn_pairs=[(pred_2, fn_2), (pred_3, fn_3)])
exe = fluid.Executor(fluid.CUDAPlace(0))
for i in range(100):
    if i > 20:  # treat the first iterations as warm-up and only time the rest
        time_start = time.time()
    res_1, res_2 = exe.run(main_program, fetch_list=[out_1, out_2])
    if i > 20:
        time_end = time.time()
        print('time cost', time_end - time_start, 's')
profiler.stop_profiler('total', '/tmp/profile')
print(res_1) # [[1. 1.]]
print(res_2) # [3 3 3]
Profiling the modified code gives the following report:
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
------------------------- Overhead Summary -------------------------
Total time: 58.7053
Computation time Total: 19.7797 Ratio: 33.6932%
Framework overhead Total: 38.9256 Ratio: 66.3068%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 600 Total: 15.281 Ratio: 26.0301%
GpuMemcpyAsync Calls: 400 Total: 8.73338 Ratio: 14.8767%
GpuMemcpySync Calls: 200 Total: 6.54766 Ratio: 11.1534%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::conditional_block 400 29.0445 28.488841 (0.980867) 0.555707 (0.019133) 0.002378 0.383915 0.0726114 0.494752
fill_constant 200 9.52572 9.245150 (0.970546) 0.280573 (0.029454) 0.034405 0.159726 0.0476286 0.162263
assign 200 8.37987 8.104736 (0.967167) 0.275134 (0.032833) 0.031566 0.088268 0.0418994 0.142745
GpuMemcpyAsync(same_gpu):GPU->GPU 200 4.94725 4.672113 (0.944386) 0.275134 (0.055614) 0.017507 0.05243 0.0247362 0.0842726
thread0::fetch 200 8.15252 7.754766 (0.951210) 0.397758 (0.048790) 0.032766 0.075615 0.0407626 0.138872
GpuMemcpySync:GPU->CPU 200 6.54766 6.149903 (0.939252) 0.397758 (0.060748) 0.026847 0.052196 0.0327383 0.111534
thread0::fill_constant 300 7.08439 7.084390 (1.000000) 0.000000 (0.000000) 0.010914 0.535107 0.0236146 0.120677
thread0::select_input 200 5.03929 4.765534 (0.945675) 0.273757 (0.054325) 0.020717 0.053569 0.0251965 0.0858405
GpuMemcpyAsync(same_gpu):GPU->GPU 200 3.78614 3.512380 (0.927695) 0.273757 (0.072305) 0.015393 0.048223 0.0189307 0.064494
thread0::less_than 200 2.83091 2.830909 (1.000000) 0.000000 (0.000000) 0.009482 0.039851 0.0141545 0.0482224
thread0::logical_not 200 2.73213 2.732127 (1.000000) 0.000000 (0.000000) 0.010661 0.028379 0.0136606 0.0465397
thread0::cast 200 2.57408 2.574083 (1.000000) 0.000000 (0.000000) 0.010177 0.031029 0.0128704 0.0438476
thread0::equal 100 1.24741 1.247408 (1.000000) 0.000000 (0.000000) 0.010344 0.026703 0.0124741 0.0212487
The new report shows that the CPU-GPU data traffic has been cut in half: the total number of GpuMemcpy calls dropped from 1200 to 600.
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 600 Total: 15.281 Ratio: 26.0301%
GpuMemcpyAsync Calls: 400 Total: 8.73338 Ratio: 14.8767%
GpuMemcpySync Calls: 200 Total: 6.54766 Ratio: 11.1534%
PP then benchmarked the program and found the running time reduced by nearly 10%; the performance optimization was complete.
These two examples show how valuable Profiler is as an aid to model performance optimization, and PP came away with a much better feel for the process. Beyond what the examples show, Profiler offers analysis and visualization along even more dimensions, so anyone working on model performance should not miss it. Next, let's take a closer look at what Profiler can do, hoping it will inspire and help your own optimization work.
Profiler's Simple and Easy-to-Use API
PaddlePaddle provides the start_profiler and stop_profiler function interfaces for performance analysis. When profiling a model with Profiler, we can combine start_profiler and stop_profiler to delimit the profiling range and analyze only the code inside it, as follows:
import paddle.fluid as fluid
import paddle.fluid.compiler as compiler
import paddle.fluid.profiler as profiler
data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
shape = fluid.layers.shape(data2)
shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
out = fluid.layers.crop_tensor(data1, shape=shape)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
compiled_prog = compiler.CompiledProgram(fluid.default_main_program())
for iter in range(10):
    if iter == 2:
        # Start profiling: record both CPU and GPU, with tracer_option set to Default
        profiler.start_profiler(state='All', tracer_option='Default')
    if iter == 8:
        # Stop profiling: print the report sorted by total time and write a profile
        # file under /tmp for later timeline generation and analysis
        profiler.stop_profiler(sorted_key='total', profile_path='/tmp/profile')
    result = exe.run(program=compiled_prog, fetch_list=[out])
The code above runs the network for 10 iterations; profiling starts at iteration 2 and stops at iteration 8. The relevant parameters are listed below, with a small configuration variation sketched after the list:
state: specifies whether to record performance data for GPU computation, CPU computation, or both (All).
tracer_option: controls how the performance data are recorded, i.e. whether to print overall information or more detailed information.
sorted_key: specifies how the printed profiler results are sorted; we usually sort by total time (total) so that the main bottlenecks stand out.
profile_path: the path where the profiling results are saved.
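As a variation, if you only want to record CPU-side events (for example when isolating host-side framework overhead), the same pair of calls can be configured as sketched below. This is a minimal sketch based on the parameter descriptions above, and /tmp/profile_cpu is just an illustrative path:
# Record CPU events only for the iterations between these two calls.
profiler.start_profiler(state='CPU', tracer_option='Default')
# ... run the iterations to be measured ...
# Print the report sorted by total time and keep the raw data for later analysis.
profiler.stop_profiler(sorted_key='total', profile_path='/tmp/profile_cpu')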
Besides this flexible start_profiler/stop_profiler calling style, PaddlePaddle also provides a general-purpose profiler context manager: adding a profiler.profiler interface in a with statement profiles everything executed under the with statement, as shown below:
import paddle.fluid as fluid
import paddle.fluid.compiler as compiler
import paddle.fluid.profiler as profiler
data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
shape = fluid.layers.shape(data2)
shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
out = fluid.layers.crop_tensor(data1, shape=shape)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
compiled_prog = compiler.CompiledProgram(fluid.default_main_program())
# Profile everything executed inside the with block
with profiler.profiler('All', 'total') as prof:
    for i in range(10):
        result = exe.run(program=compiled_prog, fetch_list=[out])
Combining start_profiler and stop_profiler gives more flexible control over the profiling range at the cost of a little extra code, while the general profiler context manager is more convenient to use but offers less control over the range. Pick whichever interface suits your analysis.
Profiler's Multi-Dimensional Performance Analysis
The PaddlePaddle Profiler is a toolset that helps Paddle developers locate performance problems. Profiling through the Profiler interfaces ultimately produces a printed report like the following (a merged-thread report followed by a per-thread report):
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Total time: 7.30941
Computation time Total: 0.225158 Ratio: 3.08039%
Framework overhead Total: 7.08425 Ratio: 96.9196%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 2 Total: 0.74759 Ratio: 10.2278%
GpuMemcpyAsync Calls: 2 Total: 0.74759 Ratio: 10.2278%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
FastThreadedSSAGraphExecutorPrepare 1 6.08951 6.076904 (0.997930) 0.012608 (0.002070) 6.08951 6.08951 6.08951 0.833106
GpuMemcpyAsync:CPU->GPU 1 0.703009 0.700705 (0.996723) 0.002304 (0.003277) 0.703009 0.703009 0.703009 0.0961787
mul 1 0.219497 0.211977 (0.965740) 0.007520 (0.034260) 0.219497 0.219497 0.219497 0.0300294
elementwise_add 1 0.108017 0.105905 (0.980448) 0.002112 (0.019552) 0.108017 0.108017 0.108017 0.0147778
Fetch 1 0.102637 0.100142 (0.975691) 0.002495 (0.024309) 0.102637 0.102637 0.102637 0.0140418
GpuMemcpyAsync:GPU->CPU 1 0.044581 0.042086 (0.944034) 0.002495 (0.055966) 0.044581 0.044581 0.044581 0.00609913
mean 1 0.064722 0.061234 (0.946108) 0.003488 (0.053892) 0.064722 0.064722 0.064722 0.00885462
InitLocalVars 1 0.006395 0.006395 (1.000000) 0.000000 (0.000000) 0.006395 0.006395 0.006395 0.0008749
eager_deletion 2 0.006166 0.006166 (1.000000) 0.000000 (0.000000) 0.002291 0.003875 0.003083 0.000843571
ScopeBufferedMonitor::post_local_exec_scopes_process 1 0.005985 0.005985 (1.000000) 0.000000 (0.000000) 0.005985 0.005985 0.005985 0.000818808
ScopeBufferedMonitor::pre_local_exec_scopes_process 1 0.003467 0.003467 (1.000000) 0.000000 (0.000000) 0.003467 0.003467 0.003467 0.00047432
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread1::mul 1 0.219497 0.211977 (0.965740) 0.007520 (0.034260) 0.219497 0.219497 0.219497 0.438084
thread1::elementwise_add 1 0.108017 0.105905 (0.980448) 0.002112 (0.019552) 0.108017 0.108017 0.108017 0.215586
thread1::Fetch 1 0.102637 0.100142 (0.975691) 0.002495 (0.024309) 0.102637 0.102637 0.102637 0.204848
GpuMemcpyAsync:GPU->CPU 1 0.044581 0.042086 (0.944034) 0.002495 (0.055966) 0.044581 0.044581 0.044581 0.0889771
thread1::mean 1 0.064722 0.061234 (0.946108) 0.003488 (0.053892) 0.064722 0.064722 0.064722 0.129176
thread1::eager_deletion 2 0.006166 0.006166 (1.000000) 0.000000 (0.000000) 0.002291 0.003875 0.003083 0.0123064
thread0::FastThreadedSSAGraphExecutorPrepare 1 6.08951 6.076904 (0.997930) 0.012608 (0.002070) 6.08951 6.08951 6.08951 0.894416
thread0::GpuMemcpyAsync:CPU->GPU 1 0.703009 0.700705 (0.996723) 0.002304 (0.003277) 0.703009 0.703009 0.703009 0.103257
thread0::InitLocalVars 1 0.006395 0.006395 (1.000000) 0.000000 (0.000000) 0.006395 0.006395 0.006395 0.000939285
thread0::ScopeBufferedMonitor::post_local_exec_scopes_process 1 0.005985 0.005985 (1.000000) 0.000000 (0.000000) 0.005985 0.005985 0.005985 0.000879065
thread0::ScopeBufferedMonitor::pre_local_exec_scopes_process 1 0.003467 0.003467 (1.000000) 0.000000 (0.000000) 0.003467 0.003467 0.003467 0.000509226
We can see that the report falls into two main parts: the framework-level results and the OP-level results.
Framework-Level Performance Analysis
In the framework-level part of the report we can see the framework overhead during training, meaning all the time spent outside the OPs' compute functions. To better reflect the CPU-GPU interaction during training, the report also counts the Memcpy calls: excessive memory traffic between CPU and GPU degrades performance, and combined with device_guard (as in Scenario 2 above) we can reduce this traffic and improve performance.
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Total time: 7.30941
Computation time Total: 0.225158 Ratio: 3.08039%
Framework overhead Total: 7.08425 Ratio: 96.9196%
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 2 Total: 0.74759 Ratio: 10.2278%
GpuMemcpyAsync Calls: 2 Total: 0.74759 Ratio: 10.2278%
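For example, in the summary above the framework overhead is simply what remains of the total time after subtracting the OP computation time: 7.30941 ms - 0.225158 ms ≈ 7.08425 ms, or roughly 96.9% of the total, while the two GpuMemcpyAsync calls account for 0.74759 ms, about 10.2% of the total.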
Multi-Dimensional OP Performance Analysis
The OP-level results consist of two parts: the merged-thread OP summary and the per-thread (non-merged) OP summary.
In the merged-thread OP summary we can see, for each OP across all threads, the number of calls, the total time (together with its average, minimum and maximum), the CPU time, the GPU time, and the corresponding ratios.
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
FastThreadedSSAGraphExecutorPrepare 1 6.08951 6.076904 (0.997930) 0.012608 (0.002070) 6.08951 6.08951 6.08951 0.833106
GpuMemcpyAsync:CPU->GPU 1 0.703009 0.700705 (0.996723) 0.002304 (0.003277) 0.703009 0.703009 0.703009 0.0961787
mul 1 0.219497 0.211977 (0.965740) 0.007520 (0.034260) 0.219497 0.219497 0.219497 0.0300294
elementwise_add 1 0.108017 0.105905 (0.980448) 0.002112 (0.019552) 0.108017 0.108017 0.108017 0.0147778
Fetch 1 0.102637 0.100142 (0.975691) 0.002495 (0.024309) 0.102637 0.102637 0.102637 0.0140418
GpuMemcpyAsync:GPU->CPU 1 0.044581 0.042086 (0.944034) 0.002495 (0.055966) 0.044581 0.044581 0.044581 0.00609913
mean 1 0.064722 0.061234 (0.946108) 0.003488 (0.053892) 0.064722 0.064722 0.064722 0.00885462
InitLocalVars 1 0.006395 0.006395 (1.000000) 0.000000 (0.000000) 0.006395 0.006395 0.006395 0.0008749
eager_deletion 2 0.006166 0.006166 (1.000000) 0.000000 (0.000000) 0.002291 0.003875 0.003083 0.000843571
ScopeBufferedMonitor::post_local_exec_scopes_process 1 0.005985 0.005985 (1.000000) 0.000000 (0.000000) 0.005985 0.005985 0.005985 0.000818808
ScopeBufferedMonitor::pre_local_exec_scopes_process 1 0.003467 0.003467 (1.000000) 0.000000 (0.000000) 0.003467 0.003467 0.003467 0.00047432
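Reading one row of the table above as an example: eager_deletion is called 2 times for a total of 0.006166 ms, so Ave. = 0.006166 / 2 ≈ 0.003083 ms, which is exactly the value in the Ave. column; the CPU Time and GPU Time columns split that total between host and device, and the final Ratio. column is each event's share of the total time, e.g. 6.08951 / 7.30941 ≈ 0.833 for FastThreadedSSAGraphExecutorPrepare.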
The per-thread OP summary is printed in the same way as the merged-thread one, except that the data of different threads are kept separate.
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread1::mul 1 0.219497 0.211977 (0.965740) 0.007520 (0.034260) 0.219497 0.219497 0.219497 0.438084
thread1::elementwise_add 1 0.108017 0.105905 (0.980448) 0.002112 (0.019552) 0.108017 0.108017 0.108017 0.215586
thread1::Fetch 1 0.102637 0.100142 (0.975691) 0.002495 (0.024309) 0.102637 0.102637 0.102637 0.204848
GpuMemcpyAsync:GPU->CPU 1 0.044581 0.042086 (0.944034) 0.002495 (0.055966) 0.044581 0.044581 0.044581 0.0889771
thread1::mean 1 0.064722 0.061234 (0.946108) 0.003488 (0.053892) 0.064722 0.064722 0.064722 0.129176
thread1::eager_deletion 2 0.006166 0.006166 (1.000000) 0.000000 (0.000000) 0.002291 0.003875 0.003083 0.0123064
thread0::FastThreadedSSAGraphExecutorPrepare 1 6.08951 6.076904 (0.997930) 0.012608 (0.002070) 6.08951 6.08951 6.08951 0.894416
thread0::GpuMemcpyAsync:CPU->GPU 1 0.703009 0.700705 (0.996723) 0.002304 (0.003277) 0.703009 0.703009 0.703009 0.103257
thread0::InitLocalVars 1 0.006395 0.006395 (1.000000) 0.000000 (0.000000) 0.006395 0.006395 0.006395 0.000939285
thread0::ScopeBufferedMonitor::post_local_exec_scopes_process 1 0.005985 0.005985 (1.000000) 0.000000 (0.000000) 0.005985 0.005985 0.005985 0.000879065
thread0::ScopeBufferedMonitor::pre_local_exec_scopes_process 1 0.003467 0.003467 (1.000000) 0.000000 (0.000000) 0.003467 0.003467 0.003467 0.000509226
Profiler also supports several sort orders for the printed report. We usually sort by total time: attacking the entries that take the largest share of time is the natural way into a performance problem. In addition, Profiler records the hierarchy between events, so the printed report preserves the nesting relationships between OPs and the events inside them; a data copy inside an OP (such as a GpuMemcpyAsync:GPU->CPU event) shows up directly under that OP. This makes the report more intuitive and the analysis work more convenient.
In-Depth OP Performance Analysis
Looking back at the Profiler API above, note the tracer_option parameter. It accepts three values: Default, OpDetail and AllOpDetail. Default prints only the overall per-OP information, while OpDetail and AllOpDetail print additional, finer-grained information.
The output of the Default option can be seen in the OP summaries printed earlier: it shows only the overall time of each OP and does not break an OP down into the timings of its internal stages.
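Switching to the finer-grained levels only requires changing the tracer_option argument passed to start_profiler; a minimal sketch, with the rest of the code unchanged from the earlier interface example:
# Break every OP down into its infer_shape / prepare_data / compute stages.
profiler.start_profiler(state='All', tracer_option='OpDetail')
# Use tracer_option='AllOpDetail' to additionally separate OPs of the same type
# that sit at different positions in the network (e.g. conv1 vs conv2).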
The OpDetail option prints detailed performance data for each OP, broken down into infer_shape, prepare_data and compute, which helps us see where the time goes inside each OP and optimize more precisely. infer_shape is the stage where the OP infers and checks tensor shapes, prepare_data prepares the input data for the computation, and compute is the actual OP computation. The resulting report looks like this:
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
fill_constant 12 9.59502 9.577872 (0.998212) 0.017152 (0.001788) 0.078359 4.11928 0.799585 0.772524
fill_constant/compute 12 9.29641 9.279253 (0.998155) 0.017152 (0.001845) 0.064842 4.06913 0.7747 0.748481
fill_constant/infer_shape 12 0.075591 0.075591 (1.000000) 0.000000 (0.000000) 0.002567 0.01396 0.00629925 0.00608606
fill_constant/prepare_data 12 0.041932 0.041932 (1.000000) 0.000000 (0.000000) 0.001451 0.010575 0.00349433 0.00337607
slice 6 1.40287 1.382675 (0.985607) 0.020192 (0.014393) 0.078351 0.83448 0.233811 0.112949
slice/prepare_data 6 1.13939 1.127870 (0.989889) 0.011520 (0.010111) 0.045453 0.776968 0.189898 0.0917357
GpuMemcpySync:CPU->GPU 6 0.92333 0.911810 (0.987523) 0.011520 (0.012477) 0.027762 0.731752 0.153888 0.07434
slice/compute 6 0.185768 0.177096 (0.953318) 0.008672 (0.046682) 0.022966 0.041444 0.0309613 0.0149567
slice/infer_shape 6 0.039823 0.039823 (1.000000) 0.000000 (0.000000) 0.004883 0.009277 0.00663717 0.00320627
crop_tensor 6 0.586683 0.563035 (0.959692) 0.023648 (0.040308) 0.082797 0.13253 0.0977805 0.0472356
crop_tensor/compute 6 0.501688 0.478040 (0.952863) 0.023648 (0.047137) 0.068311 0.117511 0.0836147 0.0403924
GpuMemcpySync:GPU->CPU 6 0.230851 0.218243 (0.945385) 0.012608 (0.054615) 0.032632 0.055755 0.0384752 0.0185865
crop_tensor/infer_shape 6 0.03331 0.033310 (1.000000) 0.000000 (0.000000) 0.003598 0.009579 0.00555167 0.00268189
crop_tensor/prepare_data 6 0.009975 0.009975 (1.000000) 0.000000 (0.000000) 0.001146 0.003719 0.0016625 0.000803117
Fetch 6 0.328085 0.316213 (0.963814) 0.011872 (0.036186) 0.047929 0.071537 0.0546808 0.0264151
GpuMemcpyAsync:GPU->CPU 6 0.183378 0.171506 (0.935259) 0.011872 (0.064741) 0.026675 0.037943 0.030563 0.0147643
FastThreadedSSAGraphExecutorPrepare 6 0.165158 0.165158 (1.000000) 0.000000 (0.000000) 0.016557 0.052761 0.0275263 0.0132974
shape 6 0.160389 0.160389 (1.000000) 0.000000 (0.000000) 0.014121 0.055736 0.0267315 0.0129134
shape/compute 6 0.08775 0.087750 (1.000000) 0.000000 (0.000000) 0.004395 0.03734 0.014625 0.00706501
shape/infer_shape 6 0.023968 0.023968 (1.000000) 0.000000 (0.000000) 0.00276 0.00575 0.00399467 0.00192974
shape/prepare_data 6 0.008976 0.008976 (1.000000) 0.000000 (0.000000) 0.001318 0.001797 0.001496 0.000722684
ScopeBufferedMonitor::post_local_exec_scopes_process 6 0.109541 0.109541 (1.000000) 0.000000 (0.000000) 0.012726 0.029744 0.0182568 0.00881947
eager_deletion 18 0.056515 0.056515 (1.000000) 0.000000 (0.000000) 0.00099 0.008829 0.00313972 0.00455019
ScopeBufferedMonitor::pre_local_exec_scopes_process 6 0.016096 0.016096 (1.000000) 0.000000 (0.000000) 0.001069 0.009635 0.00268267 0.00129594
The AllOpDetail option goes one step further. In the previous reports, all OPs of the same type are aggregated into a single entry; AllOpDetail instead reports each OP at each position in the network separately. For example, if the network contains two convolution layers, AllOpDetail prints their performance data separately as conv1 and conv2, which supports an even deeper analysis. The report looks like this:
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
fill_constant 12 8.99499 8.977549 (0.998061) 0.017439 (0.001939) 0.059922 3.75175 0.749582 0.736424
fill_constant0 6 4.53633 4.527403 (0.998032) 0.008927 (0.001968) 0.084302 3.68861 0.756055 0.371391
fill_constant0/compute 6 4.33761 4.328688 (0.997942) 0.008927 (0.002058) 0.067428 3.60916 0.722936 0.355122
fill_constant0/infer_shape 6 0.050087 0.050087 (1.000000) 0.000000 (0.000000) 0.004186 0.018715 0.00834783 0.00410064
fill_constant0/prepare_data 6 0.019992 0.019992 (1.000000) 0.000000 (0.000000) 0.002385 0.00582 0.003332 0.00163675
fill_constant1 6 4.36386 4.355344 (0.998049) 0.008512 (0.001951) 0.05556 3.74226 0.727309 0.357271
fill_constant1/compute 6 4.20648 4.197966 (0.997976) 0.008512 (0.002024) 0.03893 3.68969 0.70108 0.344386
fill_constant1/infer_shape 6 0.043087 0.043087 (1.000000) 0.000000 (0.000000) 0.003811 0.017195 0.00718117 0.00352755
fill_constant1/prepare_data 6 0.016492 0.016492 (1.000000) 0.000000 (0.000000) 0.00199 0.004173 0.00274867 0.00135021
slice 6 1.49876 1.478472 (0.986463) 0.020288 (0.013537) 0.09695 0.936687 0.249793 0.122704
slice0 6 1.47483 1.454542 (0.986244) 0.020288 (0.013756) 0.093403 0.932261 0.245805 0.120745
slice0/prepare_data 6 1.16992 1.158303 (0.990071) 0.011616 (0.009929) 0.051469 0.86229 0.194986 0.0957818
GpuMemcpySync:CPU->GPU 6 0.993707 0.982091 (0.988310) 0.011616 (0.011690) 0.030551 0.813402 0.165618 0.0813552
slice0/compute 6 0.183369 0.174697 (0.952707) 0.008672 (0.047293) 0.025079 0.043277 0.0305615 0.0150125
slice0/infer_shape 6 0.053972 0.053972 (1.000000) 0.000000 (0.000000) 0.006931 0.01328 0.00899533 0.00441871
crop_tensor 6 0.686333 0.662589 (0.965405) 0.023744 (0.034595) 0.092694 0.159363 0.114389 0.0561904
crop_tensor0 6 0.661745 0.638001 (0.964119) 0.023744 (0.035881) 0.088815 0.154756 0.110291 0.0541773
crop_tensor0/compute 6 0.498947 0.475203 (0.952412) 0.023744 (0.047588) 0.069971 0.10652 0.0831578 0.040849
GpuMemcpySync:GPU->CPU 6 0.22741 0.214642 (0.943855) 0.012768 (0.056145) 0.034765 0.043249 0.0379017 0.0186182
crop_tensor0/infer_shape 6 0.040319 0.040319 (1.000000) 0.000000 (0.000000) 0.005645 0.008416 0.00671983 0.00330093
crop_tensor0/prepare_data 6 0.017584 0.017584 (1.000000) 0.000000 (0.000000) 0.001817 0.008174 0.00293067 0.00143961
Fetch 6 0.384527 0.371983 (0.967378) 0.012544 (0.032622) 0.052345 0.084889 0.0640878 0.0314814
GpuMemcpyAsync:GPU->CPU 6 0.178042 0.165498 (0.929545) 0.012544 (0.070455) 0.027737 0.032732 0.0296737 0.0145764
shape 6 0.228702 0.228702 (1.000000) 0.000000 (0.000000) 0.024993 0.064773 0.038117 0.0187239
shape0 6 0.203062 0.203062 (1.000000) 0.000000 (0.000000) 0.020906 0.059788 0.0338437 0.0166248
shape0/compute 6 0.093917 0.093917 (1.000000) 0.000000 (0.000000) 0.00592 0.039347 0.0156528 0.00768903
shape0/infer_shape 6 0.027612 0.027612 (1.000000) 0.000000 (0.000000) 0.003379 0.006415 0.004602 0.00226061
shape0/prepare_data 6 0.021202 0.021202 (1.000000) 0.000000 (0.000000) 0.001856 0.01158 0.00353367 0.00173582
FastThreadedSSAGraphExecutorPrepare 6 0.173718 0.173718 (1.000000) 0.000000 (0.000000) 0.01823 0.057948 0.028953 0.0142224
ScopeBufferedMonitor::post_local_exec_scopes_process 6 0.136108 0.136108 (1.000000) 0.000000 (0.000000) 0.014345 0.041874 0.0226847 0.0111432
eager_deletion 18 0.092351 0.092351 (1.000000) 0.000000 (0.000000) 0.002197 0.010871 0.00513061 0.00756082
ScopeBufferedMonitor::pre_local_exec_scopes_process 6 0.018935 0.018935 (1.000000) 0.000000 (0.000000) 0.002004 0.00793 0.00315583 0.00155022
Summary
The PaddlePaddle core framework has solid support for underlying libraries such as cuDNN and MKL-DNN; combined with careful implementation work, Paddle OPs deliver excellent performance overall, and the framework's architecture keeps end-to-end execution efficient. In the rare cases where training performance still falls short of expectations, the Profiler tool can be used to analyze performance and help developers find and fix performance problems during model or framework development. If you do run into OP performance or framework overhead issues, you are welcome to discuss them with us through GitHub issues or contribute your own PR, so that the PaddlePaddle deep learning framework can fly on even more hardware.
If you run into problems while using it, you can join the official PaddlePaddle QQ group: 1108045677.
For more details about PaddlePaddle, please refer to the following. Official website: https://www.paddlepaddle.org.cn
PaddlePaddle open-source framework repository:
GitHub: https://github.com/PaddlePadd...