Nanopi Neo/M1 0.83ns "cycle counter register"
Posted: Sat Aug 27, 2016 1:12 pm
I learned a lot about the ARMv6 cycle counter register on the Raspberry Pi Zero in this thread:
https://www.raspberrypi.org/forums/viewtopic.php?f=63&t=155830
Since the Nanopi Neo and M1 are armv7 and quad core, there are some differences in how to measure time very precisely.
In earlier postings in this forum I described how to compile loadable kernel modules for the Neo/M1; here is the latest one:
http://www.friendlyarm.com/Forum/viewtopic.php?f=47&t=240&p=810#p810
I found the complete code for cycle counting on armv7 in this stackoverflow posting:
http://stackoverflow.com/a/31649809
This is the corresponding armv7 spec section:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0464f/BIIBDHAF.html
In that code, "enable_ccnt_read()" does two things: it enables the (CPU clock) cycle counter, and it enables user land access to the cycle counter (by default the cycle counter can only be accessed from kernel space). The user space program provided there just reads the cycle counter. The important difference from the Pi Zero code is not the different assembler instructions used for armv7 instead of armv6 (Pi Zero). The difference is that everything has to be enabled on each of the 4 CPU cores [by "on_each_cpu(enable_ccnt_read, NULL, 1)"], because normally you don't know which CPU your program will run on.
For ultra-precise time measurements with the cycle counter register I disabled 3 CPUs, just to be sure where the action is:
Code:
root@FriendlyARM:~# echo 0 > /sys/devices/system/cpu/cpu2/online
root@FriendlyARM:~# echo 0 > /sys/devices/system/cpu/cpu3/online
root@FriendlyARM:~# echo 0 > /sys/devices/system/cpu/cpu1/online
root@FriendlyARM:~#
This is the corresponding "dmesg" output:
Code:
[ 610.674308] CPU2: shutdown
[ 610.674352] [hotplug]: cpu(3) try to kill cpu(2)
[ 610.675445] [hotplug]: cpu2 is killed! .
[ 613.290250] CPU3: shutdown
[ 613.290287] [hotplug]: cpu(0) try to kill cpu(3)
[ 613.291379] [hotplug]: cpu3 is killed! .
[ 615.497689] CPU1: shutdown
[ 615.497729] [hotplug]: cpu(0) try to kill cpu(1)
[ 615.498826] [hotplug]: cpu1 is killed! .
Next I fixed the cpu0 frequency at the minimum of 480MHz for an initial investigation:
Code:
root@FriendlyARM:~# echo 480000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
root@FriendlyARM:~# echo 480000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
root@FriendlyARM:~# cpu_freq
CPU0 online=1 temp=46 governor=interactive cur_freq=480000
DDR governor=userspace cur_freq=432000 max=432000 min=408000
root@FriendlyARM:~#
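If the "cpu_freq" tool is not available, the pinned frequency can also be checked directly from sysfs. The following is only a minimal sketch: "read_khz" is a hypothetical helper name, and it assumes the standard cpufreq file "scaling_cur_freq" is present next to the "scaling_min_freq"/"scaling_max_freq" files written above.

```c
#include <stdio.h>

/*
 * Read a single integer (a frequency in kHz) from a cpufreq sysfs file.
 * Returns -1 if the file cannot be opened or does not start with a number.
 * "read_khz" is an illustrative name, not part of any existing tool.
 */
static long read_khz(const char *path)
{
    FILE *f = fopen(path, "r");
    long khz = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &khz) != 1)
        khz = -1;
    fclose(f);
    return khz;
}
```

Called as read_khz("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"), it would be expected to report 480000 after the two echo commands above, matching the cur_freq value that cpu_freq prints.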
Then I used the kernel space program shown below for two measurements, set the min and max cpu0 frequency to 1.2GHz, and did two measurements again:
Code:
root@FriendlyARM:~/ccnt-2# insmod ccnt-2.ko
root@FriendlyARM:~/ccnt-2# rmmod -f ccnt-2.ko
root@FriendlyARM:~/ccnt-2# insmod ccnt-2.ko
root@FriendlyARM:~/ccnt-2# rmmod -f ccnt-2.ko
root@FriendlyARM:~/ccnt-2# echo 1200000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
root@FriendlyARM:~/ccnt-2# echo 1200000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
root@FriendlyARM:~/ccnt-2# cpu_freq
CPU0 online=1 temp=47 governor=interactive cur_freq=1200000
DDR governor=userspace cur_freq=432000 max=432000 min=408000
root@FriendlyARM:~/ccnt-2# insmod ccnt-2.ko
root@FriendlyARM:~/ccnt-2# rmmod -f ccnt-2.ko
root@FriendlyARM:~/ccnt-2# insmod ccnt-2.ko
root@FriendlyARM:~/ccnt-2# rmmod -f ccnt-2.ko
root@FriendlyARM:~/ccnt-2#
This is what I was after, the corresponding "dmesg" output:
Code:
[ 705.247173] 130 135 5
[ 705.247202] 135 481201 481066
[ 705.247219] 481201 481206 5
[ 717.199641] Disabling lock debugging due to kernel taint
[ 720.169848] 967406470 967406475 5
[ 720.169876] 967406475 967887431 480956
[ 720.169894] 967887431 967887436 5
[ 807.820957] 2970431661 2970431666 5
[ 807.820972] 2970431666 2971632605 1200939
[ 807.820979] 2971632605 2971632610 5
[ 811.965180] 3322501780 3322501785 5
[ 811.965197] 3322501785 3323702717 1200932
[ 811.965205] 3323702717 3323702722 5
So what can we see from this?
First, the overhead of doing cycle counter measurements: it is always 5 clock ticks, regardless of whether the CPU runs at 480MHz or at 1.2GHz.
Second, we see roughly 480000 reported as the difference for "udelay(1000)", or 1ms, which fits 480MHz nearly perfectly.
Finally, we see roughly 1.2 million as the difference for the 1ms udelay(), which matches the 1.2GHz CPU frequency.
So these measurements confirm that the cycle counter readings depend on the CPU frequency and are really precise. They also show that the overhead for reading the cycle counter register (the difference of two consecutive register readings) is 5 clock ticks, which is less than the overhead of 8 clock ticks on the armv6 Pi Zero. It is even less in absolute time, considering that a clock tick on the Pi Zero (1GHz) is 1ns while it is 0.83ns(!) when you set the minimum CPU frequency to 1200000.
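The arithmetic behind these numbers can be sketched in a few lines of C. This is only an illustration: "tick_ns" and "cycles_to_us" are hypothetical helper names, and the frequency is assumed to be given in kHz, the same unit the cpufreq sysfs files use.

```c
#include <stdint.h>

/* Duration of one CPU clock tick in nanoseconds for a frequency in kHz. */
static double tick_ns(uint32_t freq_khz)
{
    return 1e6 / (double)freq_khz;
}

/* Convert a cycle counter difference to microseconds at freq_khz. */
static double cycles_to_us(uint32_t cycles, uint32_t freq_khz)
{
    return (double)cycles * 1000.0 / (double)freq_khz;
}
```

With the values from the dmesg output, tick_ns(1200000) is about 0.833, cycles_to_us(480956, 480000) is about 1002 and cycles_to_us(1200932, 1200000) is about 1000.8, so both 1ms delays come out within 0.2% of 1000us. The 5 tick read overhead is about 4.2ns at 1.2GHz and about 10.4ns at 480MHz.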
Last, but not least, the code:
Code:
root@FriendlyARM:~/ccnt-2# cat Makefile
obj-m += ccnt-2.o
all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
root@FriendlyARM:~/ccnt-2#
root@FriendlyARM:~/ccnt-2# cat ccnt-2.c
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/delay.h>
#include <linux/smp.h>   /* on_each_cpu() */

/* Runs on every online CPU and starts that core's cycle counter. */
static void enable_ccnt_read(void *data)
{
    /* PMCR.E (bit 0) = 1 */
    asm volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r"(1));
    /* PMCNTENSET.C (bit 31) = 1 */
    asm volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r"(1U << 31));
}

int init_module(void)
{
    volatile unsigned cc1, cc2, cc3, cc4;

    on_each_cpu(enable_ccnt_read, NULL, 1);

    /* two back-to-back reads give the overhead of reading the counter */
    asm volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (cc1));
    asm volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (cc2));
    udelay(1000);   /* busy-wait for 1ms */
    asm volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (cc3));
    asm volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (cc4));

    /* each line: start, end, difference */
    printk("%u %u %u\n", cc1, cc2, cc2 - cc1);
    printk("%u %u %u\n", cc2, cc3, cc3 - cc2);
    printk("%u %u %u\n", cc3, cc4, cc4 - cc3);
    return 0;
}

void cleanup_module(void)
{
}
MODULE_LICENSE("GPL");
root@FriendlyARM:~/ccnt-2#
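One note on the subtractions in init_module(): the armv7 cycle counter read here is a 32-bit register, and the later dmesg values (around 3.3 billion) are already close to 2^32 = 4294967296. The sketch below, with the hypothetical helper names "ccnt_delta" and "wrap_seconds", shows why the plain unsigned difference still gives the right result across a single wraparound, and how long the counter takes to wrap at a given frequency in kHz.

```c
#include <stdint.h>

/*
 * Cycles elapsed between two 32-bit counter readings.
 * Unsigned arithmetic is modulo 2^32, so the result is still correct
 * if the counter wrapped once between "start" and "end".
 */
static uint32_t ccnt_delta(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Seconds until a free-running 32-bit cycle counter wraps at freq_khz. */
static double wrap_seconds(uint32_t freq_khz)
{
    return 4294967296.0 / ((double)freq_khz * 1000.0);
}
```

For example, ccnt_delta(4294967290u, 4) is 10 even though the end value is smaller than the start value. wrap_seconds(1200000) is about 3.58, so at 1.2GHz a measured interval has to stay below roughly 3.5 seconds for a single difference to be meaningful.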
Hermann.