棋子 · 2020年02月24日

Inexplicable performance on big.LITTLE technology (on Android)

I apologise for the long question, but I am trying to measure performance of different indexing techniques on various platforms, one of which is Adaptive Radix tree.

I have run tests where the basic steps look like this (c/c++):

Step 1: Generate or load data (few million key-value pairs)
Step 2: Insert into index and measure time taken (insert_time)
Step 3: Retrieve from index and measure time taken (retrieve_time)
I find that always insert_time > retrieve_time on most platforms such as Intel desktops (i386/amd64), iPad (Apple A9), Android (ARMv7) and Raspberry Pi 3 (ARMv8). This is expected, as insert complexity is higher than retrieve complexity.

But when I run the steps on big.LITTLE platforms, specifically Snapdragon 845 (Xiaomi POCO F1) and HiSilicon Kirin 659 (Honor 9 lite), I find insert_time < retrieve_time, except when data size is too low.

To diagnose what could be wrong, I went through the following steps:

Ensure that the thread is running at maximum speed by using following code:

void set_thread_priority() {

nice(-20);
int policy = 0;
struct sched_param param;
pthread_getschedparam(pthread_self(), &policy, &param);
param.sched_priority = sched_get_priority_max(policy);
pthread_setschedparam(pthread_self(), policy, &param);

}
I could see that the nice value is reflected against the process and the thread runs 100% CPU in most cases (it is basically single thread algorithm).

Set CPU affinity using following code:

void set_affinity() {

cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(4, &mask);
CPU_SET(5, &mask);
CPU_SET(6, &mask);
CPU_SET(7, &mask);
sched_setaffinity(0, sizeof(mask), &mask);

}
This code also reflects well on big.LITTLE because when I set CPUs as 0, 1, 2, 3, the code runs much slower than when I set CPUs as 4, 5, 6, 7. Even then insert_time < retrieve_time in both cases.

Ensure that sufficient free RAM is available for my dataset

To avoid the possibility that Step 3 might retrieve from virtual memory, I added Step 4, which is just repeating Step 3:

Step 4: Retrieve from index and measure time taken again (retrieve_time2)
To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records).

As for my code, the insert code looks like this:

it1 = m.begin();
start = getTimeVal();
for (; it1 != m.end(); ++it1) {
    art_insert(&at, (unsigned char*) it1->first.c_str(),
           (int) it1->first.length() + 1, (void *) it1->second.c_str(),
           (int) it1->second.length());
    ctr++;
}
stop = getTimeVal();

and retrieve code looks like this:

it1 = m.begin();
start = getTimeVal();
for (; it1 != m.end(); ++it1) {
    int len;
    char *value = (char *) art_search(&at,
        (unsigned char*) it1->first.c_str(), (int) it1->first.length() + 1, &len);
    ctr++;
}
stop = getTimeVal();

Any pointers as to what I could do further? Or is there an explanation for this from the platform perspective?

1 个回答 得票排序 · 时间排序
极术小姐姐 · 2020年02月24日

It's not easy for me to say what's the exact problem. But I hope you can investigate from below aspects:

1) Can you do the CPU hot-plug-out to shut down all big or LITTLE CPUs, so that the big.LITTLE SoC regresses into a non-big.LITTLE CPU topology?
Then you can test the insert/retrieve time instead of setting the CPU affinity.

2) "Ensure that the thread is running at maximum speed by using following code:" Can you double check that what's the CPU frequency governor in your big.LITTLE system? In the recent Linux kernels, Energy Aware Scheduling (EAS) governor was deployed. If your insert/retrieve operations are distributed into different threads or even in single thread but running in different CPU time, the CPU frequency may be adjusted based on the current CPU workloads.

See https://developer.arm.com/ope...

3) Is it possible for you to add some memory barrier instructions in the high level C++ STL language?
Just to make sure the memory results of insert operations are flushed into memory, not temporarily stored in cache. So that your next retrieve operations may not fetch the data from cache.

4) "To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records)."
It depends on your code design. If the data that retrieve_time2 wants to access is stored in the memory but the retrieve_time can access the data from cache, you may observe the result as what you mentioned. In a huge loop ( the loop count is too larger compared to a small TLB/L1 or L2 cache size), the retrieve_time operations may dither the TLB/L1/L2 cache so that some data are flushed out into next level cache or memory.

你的回答
关注数
1
收藏数
0
浏览数
2553
极术小姐姐
极术微信服务号
关注极术微信号
实时接收点赞提醒和评论通知
安谋科技学堂公众号
关注安谋科技学堂
实时获取安谋科技及 Arm 教学资源
安谋科技招聘公众号
关注安谋科技招聘
实时获取安谋科技中国职位信息