Power design error?

Gents,

I was looking at the NanoPi M3 as an interesting module for a cluster that computes large networks daily, because of:
- cost
- 8 cores
- Gigabit networking

So I got hold of 6 of them and started to test them for performance. Here are the test parameters:
- the M3 uses the dedicated heatsink; in addition I used a high-performance GELID thermal pad on the CPU and kept the original (trimmed) pad for the rest of the surface; to preempt some remarks I will make later: in this configuration I have not seen the temperature go over 65 degrees (but you will see there are other issues)
- I power the board through the GPIO pins using AWG20 cables from a bench power supply that can provide 30A
So up to this point there should be no problem with the power.

In terms of software testing, I installed OpenBLAS and let it build with the defaults (it builds for armv7l with 8 threads - we don't yet have an armv8 kernel...). I set it up with update-alternatives and then started running the standard OpenBLAS benchmarks with numpy.
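To confirm that numpy actually picked up the OpenBLAS build after the update-alternatives switch, a quick check like this can help (np.show_config() is a standard numpy call; its exact output format varies between numpy versions):

```python
# Quick check that numpy is linked against the OpenBLAS we built,
# rather than the stock reference BLAS.
import numpy as np
np.show_config()   # the blas/lapack sections should mention openblas
```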

So here things started to get interesting.

When testing single precision matrix multiplication (sgemm.py) I have no problems and can easily go up to 5000 x 5000 matrices. The average throughput is 16-17 Gflops (reported by the benchmark) and the board doesn't even break a sweat: it barely reaches 62 degrees.
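For reference, something along these lines reproduces the kind of load the sgemm.py/dgemm.py scripts generate; this is only a minimal sketch using numpy on top of OpenBLAS, not the benchmark scripts themselves, and it uses the usual 2*n^3 operation count for a matrix multiply:

```python
# Minimal GEMM throughput sketch: multiply two n x n matrices with numpy
# (which calls into OpenBLAS) and report GFLOPS using 2*n^3 operations.
# Note: 5000 x 5000 in double precision needs roughly 600 MB for the three matrices.
import time
import numpy as np

def gemm_gflops(n=5000, dtype=np.float32, repeats=3):
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.time()
        np.dot(a, b)
        best = min(best, time.time() - t0)
    return 2.0 * n ** 3 / best / 1e9

print("SGEMM:", gemm_gflops(dtype=np.float32), "GFLOPS")
print("DGEMM:", gemm_gflops(dtype=np.float64), "GFLOPS")
```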

The issues start with double precision matrix multiplication (dgemm.py). If you run it straight as it comes, the board quickly hangs. I initially thought it was a software problem, but then I started to investigate.

First I started reducing the number of threads that OpenBLAS uses (OPENBLAS_NUM_THREADS=x). At 1.4GHz I can safely run the double precision benchmark for a long time with 6 threads. The benchmark reports approx. 12 Gflops and the temperature stays below 75 degrees - there is no throttling. Running more than 6 threads at 1.4GHz quickly hangs the board.
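For anyone reproducing this: the limit can be set in the shell (OPENBLAS_NUM_THREADS=6 python dgemm.py) or from Python before numpy is loaded; a sketch of the second approach, since OpenBLAS reads the variable when the library initialises:

```python
# Cap the number of OpenBLAS worker threads. This must be done before numpy
# (and therefore the OpenBLAS shared library) is imported, because OpenBLAS
# reads OPENBLAS_NUM_THREADS when it initialises.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "6"

import numpy as np  # numpy is now backed by a 6-thread OpenBLAS
```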

The other option is to reduce the frequency but keep the 8 threads. With this I have managed to run safely up to 1GHz. At 1.1GHz the behaviour is unpredictable - it might work for a while and then suddenly hang. In all cases neither the temperature nor the (input) power supply is the problem.
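For anyone who wants to reproduce the frequency cap, a sketch assuming the kernel exposes the standard cpufreq sysfs nodes and the script runs as root (the frequencies actually accepted depend on the kernel's frequency table):

```python
# Cap every core's maximum frequency to 1.0 GHz via the cpufreq sysfs interface.
# Assumes the usual /sys/devices/system/cpu/cpuN/cpufreq layout and root access.
import glob

MAX_KHZ = "1000000"  # 1.0 GHz, expressed in kHz as cpufreq expects

for node in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_max_freq"):
    with open(node, "w") as f:
        f.write(MAX_KHZ)
```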

The conclusion I have reached is that the power supplied by the AXP228 is insufficient for the 8-core CPU. Since the only schematics of the board are for version 1604 (mine says 1605) I can only assume they are the same, and I notice there that only DC3 is used for powering the cores, while DC2 stays unused. The data sheet indicates that DC2 and DC3 can each provide up to 2.5A, so in this case you are providing at most 2.5A to the Samsung cores. That design may work for the 4-core boards (e.g. M1, Tx, etc.), but I'm afraid that for the M3 the CPU is massively underpowered.

At rest the overall board consumption is 0.45A (input current from the bench power supply). The maximum consumption before the board hangs is 2.32A. Of course not all of that power goes to the cores, but most of it does. Let's do a little math: the difference in power consumption is 5V x (2.32A - 0.45A) = 9.35W. Say half of this goes to the CPU cores, and allow for approx. 15% loss in the DC-DC switching regulator => about 3.97W delivered to the cores. Since the cores are powered at 1V (1.25V max according to the scaling voltages), the current is obviously well over 2.5A.
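The same back-of-the-envelope numbers spelled out; the 50/50 split and the 15% regulator loss are rough assumptions, as above:

```python
# Back-of-the-envelope estimate of the current drawn on the core rail.
v_in = 5.0                  # V, board input voltage
i_idle, i_max = 0.45, 2.32  # A, measured at the bench supply
p_delta = v_in * (i_max - i_idle)   # ~9.35 W extra draw under load
p_cores_in = 0.5 * p_delta          # assume half of it feeds the core rail
p_cores_out = p_cores_in * 0.85     # ~15% DC-DC conversion loss
v_core = 1.0                        # V, nominal core voltage
print(round(p_cores_out, 2), "W =>", round(p_cores_out / v_core, 2), "A on the core rail")
# roughly 3.97 W => 3.97 A, i.e. well above the 2.5 A rating of DC3
```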

I think you need to revisit the board design and provide additional core power from the DC2 output of the AXP228; otherwise the full power of the chip will never be usable.
sonel wrote: (full post quoted above)

Thanks for your suggestion. We'll look into it in the future; the power circuit is a complex thing, and we are struggling with it all the time.

-- Mindee
Hi there,
I'm slightly puzzled by the power supply circuits as well.
I note that the schematic shows a dedicated ETA3451 driving the CPU (VCC1P1_ARM), with the AXP228 being used to drive the SoC core (VCC1P0_CORE) -- from what I can glean from the Samsung S5P6818 data sheet, I'd guess the GPU is the biggest consumer on that "core" circuit. Separate pins on the AXP228 are used for the other sub-systems.

As described by sonel above, running multi-core OpenBLAS uses a LOT of power for the ARM CPU, so I'd guess the ETA3451 will hit its specified 3.5A limit at 1.2 volts.
Q: when running "standard" OpenBLAS, is the GPU (and the rest of the core) pushed hard?
OpenBLAS -WILL- push the CPU (i.e., the _ARM supply) hard, of course.

If I'm right, the GPU is not using a lot of power, so the AXP228 output driving the rest of the SoC core shouldn't be the limiting factor. It looks like it's the ETA3451 that will limit first. I imagine that's not easy to fix :/

To my question: as the ETA3451 is a fixed voltage regulator, how is the CPU voltage scaling done? Is this done within the S5P6818 itself?

all the best
Lawrence
lconroy wrote:
Q: when running "standard" OpenBLAS, is the GPU (and the rest of the core) pushed hard?

The OpenBLAS implementation does not yet optimise the use of the GPU on ARM processors. This is because the GPU is not part of the standard ARM design - each vendor has its own implementation and most of them do not make the API public. OpenBLAS does have some optimisation for the NEON instruction set of the A53 and should provide significantly better performance than a plain armv7 processor, even though the code is not armv8 -- at the moment there are no armv8 kernels.

In regards to the power, the two VDDs (VDD_CORE and VDD_CPU) are used exactly for that purpose: VDD_CORE powers the (8) cores while VDD_CPU powers the rest of the CPU (cache, DMA, etc.). That might indeed include the GPU. To answer the other question about voltage scaling: it is done by the AXP228. It is actually a very sophisticated chip and includes an I2C interface and a set of registers that allow for quite a lot of tweaking of the outputs.
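For the curious, those PMIC registers can be read from userspace over I2C. The sketch below makes several assumptions that should be checked against the board and the AXP22x data sheet before trusting it: that the AXP228 answers at address 0x34 on I2C bus 0, that register 0x23 holds the DCDC3 output-voltage code (0.6V + 20mV per step, as in the AXP22x family), and that the kernel PMIC driver does not claim the device exclusively:

```python
# Read back the DCDC3 output-voltage setting of the AXP228 over I2C.
# The address, bus number and register layout are assumptions - verify
# them against your board and the AXP22x data sheet first.
from smbus2 import SMBus

AXP228_ADDR = 0x34       # assumed PMIC I2C address
REG_DCDC3_VOLT = 0x23    # assumed DCDC3 voltage register (AXP22x family)

with SMBus(0) as bus:    # assumed I2C bus number
    code = bus.read_byte_data(AXP228_ADDR, REG_DCDC3_VOLT)
    print("DCDC3 set to", 600 + 20 * code, "mV")
```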

I don't think that the ETA3451 is the problem. If you look in the spec sheet for the S5P6818 you will see that VDD_CORE is provided on 32 pins while VDD_ARM (CPU) is on 16 pins. That clearly indicates that the expected current for the cores is significantly higher than for the rest of the CPU. As I mentioned in the post, running the same test limited to only 6 cores causes no problems. I still believe it is VDD_CORE that provides insufficient current when the cores are maxed out.
Hi Sonel, folks,
Thanks -- I didn't know that OpenBLAS wasn't GPU-optimised, but it makes sense. As you say, the GPU is a variable element, so it can't be relied on without many different builds depending on which device is used.

I'm looking at the schematic and the SoC user guide; as you point out, there are twice as many VDDI pads as there are VDDI_ARM pads on the SoC. It looks like Samsung assume that more power will be needed for the rest of the SoC than for the ARM cores.
BUT ... this is the nub of my confusion.

The schematic shows that VCC1P1_ARM is the output of the ETA3451D2I-T (i.e., a fixed voltage regulator), whilst VCC1P0_CORE is the DCDC3 output of the AXP228 (and so could be varied).
VCC1P0_CORE (i.e., the variable source) is shown connected to the SoC's many VDDI pads. VCC1P1_ARM (i.e., the fixed voltage) is shown connected to the SoC's fewer VDDI_ARM pads.

Hence my confusion: I had assumed that the ARM CPU supply voltage was varied with changes in CPU frequency -- but from the schematic it looks like it can't be: the ARM CPU gets a fixed 1.1 volt supply from the ETA3451D2I-T.

That seems to conflict with the SoC User Guide: section 2.4.31 says that VDDI should be fixed at 1 volt to drive the SoC core, whilst VDDI_ARM is variable (from 1.0 to 1.3 volts) to drive the ARM CPUs.
But ... this doesn't seem to be what's happening on the board schematic, which gives a variable VDDI and a fixed VDDI_ARM supply.

When we talk about DVFS (Dynamic Voltage & Frequency Scaling), do we actually mean varying the SoC core (VDDI) voltage, rather than the ARM CPU (VDDI_ARM) voltage? That's what the schematic seems to show.
If DVFS does vary the ARM supply voltage, then the schematic must be wrong. What am I missing?
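One way to check this empirically: if the kernel registers the PMIC outputs with the regulator framework (that depends on the board's device tree), their live voltages show up under /sys/class/regulator, and whichever rail is really scaled by DVFS should change as the CPU frequency is swept. A sketch under that assumption:

```python
# Dump the name and current voltage of every regulator the kernel exposes.
# Run it at different CPU frequencies: the rail that DVFS really controls
# should report different microvolts values.
import glob, os

for reg in sorted(glob.glob("/sys/class/regulator/regulator.*")):
    try:
        name = open(os.path.join(reg, "name")).read().strip()
        uv = open(os.path.join(reg, "microvolts")).read().strip()
    except OSError:
        continue  # fixed regulators may not expose every attribute
    print(f"{name}: {int(uv) / 1000:.0f} mV")
```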

best regards
Lawrence
