I apologise for the long question, but I am trying to measure performance of different indexing techniques on various platforms, one of which is Adaptive Radix tree.
I have run tests where the basic steps look like this (c/c++):
Step 1: Generate or load data (few million key-value pairs)
Step 2: Insert into index and measure time taken (insert_time)
Step 3: Retrieve from index and measure time taken (retrieve_time)
I find that always insert_time > retrieve_time on most platforms such as Intel desktops (i386/amd64), iPad (Apple A9), Android (ARMv7) and Raspberry Pi 3 (ARMv8). This is expected, as insert complexity is higher than retrieve complexity.
But when I run the steps on big.LITTLE platforms, specifically Snapdragon 845 (Xiaomi POCO F1) and HiSilicon Kirin 659 (Honor 9 lite), I find insert_time < retrieve_time, except when data size is too low.
To diagnose what could be wrong, I went through the following steps:
Ensure that the thread is running at maximum speed by using following code:
void set_thread_priority() {
nice(-20);
int policy = 0;
struct sched_param param;
pthread_getschedparam(pthread_self(), &policy, ¶m);
param.sched_priority = sched_get_priority_max(policy);
pthread_setschedparam(pthread_self(), policy, ¶m);
}
I could see that the nice value is reflected against the process and the thread runs 100% CPU in most cases (it is basically single thread algorithm).
Set CPU affinity using following code:
void set_affinity() {
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(4, &mask);
CPU_SET(5, &mask);
CPU_SET(6, &mask);
CPU_SET(7, &mask);
sched_setaffinity(0, sizeof(mask), &mask);
}
This code also reflects well on big.LITTLE because when I set CPUs as 0, 1, 2, 3, the code runs much slower than when I set CPUs as 4, 5, 6, 7. Even then insert_time < retrieve_time in both cases.
Ensure that sufficient free RAM is available for my dataset
To avoid the possibility that Step 3 might retrieve from virtual memory, I added Step 4, which is just repeating Step 3:
Step 4: Retrieve from index and measure time taken again (retrieve_time2)
To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records).
As for my code, the insert code looks like this:
it1 = m.begin();
start = getTimeVal();
for (; it1 != m.end(); ++it1) {
art_insert(&at, (unsigned char*) it1->first.c_str(),
(int) it1->first.length() + 1, (void *) it1->second.c_str(),
(int) it1->second.length());
ctr++;
}
stop = getTimeVal();
and retrieve code looks like this:
it1 = m.begin();
start = getTimeVal();
for (; it1 != m.end(); ++it1) {
int len;
char *value = (char *) art_search(&at,
(unsigned char*) it1->first.c_str(), (int) it1->first.length() + 1, &len);
ctr++;
}
stop = getTimeVal();
Any pointers as to what I could do further? Or is there an explanation for this from the platform perspective?