极术小姐姐 · February 3, 2020

Partial register dependencies in NEON

I'm having trouble finding any information on partial NEON register dependencies.

Take for example the following code:


Does the second load have to wait for the previous one to complete or may it continue right away?

I'm working with image data that needs to be palettised from a 256-entry table of 16-bit values, and I want to process it further with NEON. Unfortunately, tbl instructions are not an option because of the table size: the 512-byte table would take up all 32 NEON registers. Would it be faster to do the lookup with ARM general-purpose instructions first, then combine the results and transfer them in four 64-bit registers?
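To make the second option concrete, here is a minimal sketch of the scalar lookup in portable C. It is only an illustration under stated assumptions, not code from the original post: the function names `pack4_palette` and `palettise` are hypothetical, the indices are assumed to be 8-bit, and the pixel count is assumed to be a multiple of four. Each 64-bit result holds four 16-bit entries with the first pixel in the lowest 16 bits, which on little-endian AArch64 should match the lane order of a `.4h` vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: look up four 8-bit indices in a 256-entry palette
 * of 16-bit values and pack the results into one 64-bit word.
 * The first pixel lands in bits 0..15, the fourth in bits 48..63. */
uint64_t pack4_palette(const uint16_t palette[256], const uint8_t idx[4])
{
    return (uint64_t)palette[idx[0]]
         | (uint64_t)palette[idx[1]] << 16
         | (uint64_t)palette[idx[2]] << 32
         | (uint64_t)palette[idx[3]] << 48;
}

/* Hypothetical driver: convert `count` indexed pixels into packed
 * 16-bit pixels, four per 64-bit output word.
 * Assumption: count is a multiple of 4. */
void palettise(const uint16_t palette[256], const uint8_t *src,
               uint64_t *dst, size_t count)
{
    assert(count % 4 == 0);
    for (size_t i = 0; i < count; i += 4)
        dst[i / 4] = pack4_palette(palette, src + i);
}
```

Each packed word could then be moved into a NEON register in a single transfer (for example with the `vcreate_u16` intrinsic) instead of four separate lane inserts into the same register. Whether that is actually faster on Cortex-A57 would need to be measured.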

If it helps, I'm targeting Cortex-A57.

1 answer
棋子 · February 3, 2020

Chapter "4.4 Register Forwarding Hazards" of the Cortex-A57 Software Optimization Guide talks about similar cases.

You might want to measure your specific example using the PMUs:

"The Performance Monitor Unit (PMU) in Cortex-A57 may be used to determine when register forwarding hazards are actually occurring. The implementation defined PMU event number 0x12C (DISP_SWDW_STALL) has been assigned to count the number of cycles spent stalling due to these hazards."
