Dynamic Power Management: A Quantitative Approach
by Johan De Gelas on January 18, 2010 2:00 AM EST
Posted in: IT Computing
Analysis: What Happened?
The measurements on the previous page tell us what happened, but we also want to understand how well the hardware and operating system coped with the "low load" scenario. What did Windows Server 2008 R2 do? We asked the Windows Driver Kit "Powertest" tool to tell us more. The first thing we want to know is the clock speed the CPU was asked to run at in the "Balanced" power plan. The differences are very telling. First, the Xeon's clock speed changes:
Xeon L3426 Core Speeds (MHz)

| Occurrences | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7 |
|---|---|---|---|---|---|---|---|---|
| 10 times | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
| 20 times | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
| 1 time | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 | 1463 |
| 10 times | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
| 1 time | 1729 | 1729 | 1729 | 1729 | 1729 | 1729 | 1729 | 1729 |
| Many | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 | 1863 |
The Xeon L3426 almost always ran at 1.86GHz. In a period of 30 seconds we noticed only two P-state change requests: one for a single speed bin lower (-133MHz, to 1729MHz) and one for three speed bins lower (-400MHz, to 1463MHz). All cores were always asked to run at the same clock speed.
Next, the Opteron's clock speed changes:
Opteron 2435 Core Speeds (MHz)

| Occurrences | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 |
|---|---|---|---|---|---|---|
| 1 time | 800 | 1400 | 800 | 2600 | 800 | 800 |
| 1 time | 800 | 800 | 1400 | 1400 | 800 | 800 |
| 1 time | 800 | 800 | 800 | 800 | 800 | 800 |
| 1 time | 800 | 800 | 800 | 2600 | 800 | 800 |
| 1 time | 800 | 800 | 800 | 800 | 800 | 800 |
| 1 time | 800 | 800 | 800 | 800 | 2600 | 1400 |
| 1 time | 800 | 800 | 800 | 800 | 800 | 800 |
Where the Xeon hardly sees any P-state changes, the six-core Opteron 2435 frequently switches between 0.8GHz, 1.4GHz, and 2.6GHz. Much of the time one core runs at 1400MHz, another at 2600MHz, and the rest at 800MHz; the pattern in the table above basically repeats over and over. The frequency scaling is therefore far from ideal: since the application spawns two threads that each demand 100% of a core, we should see two cores at 2.6GHz most of the time. This in turn explains the 15% performance hit between "Balanced" and "Performance". If the hardware and OS worked together better, the performance hit should be no more than a few percent. We therefore conclude that in this case the 4W power savings are not worth the performance hit.
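For readers who want to take a similar per-core snapshot without digging into the Windows Driver Kit, below is a minimal sketch (not the Powertest tool we used) that queries the clock speed Windows has requested for every logical CPU through the public CallNtPowerInformation(ProcessorInformation, ...) call. The file name and output format are our own, and the PROCESSOR_POWER_INFORMATION structure is declared locally because the user-mode SDK headers do not always expose it; keep in mind that CurrentMhz reports the requested P-state, not a measured frequency.

```c
/*
 * Sketch only (not the actual tool used in the article): print the clock
 * speed Windows has requested for each logical CPU.
 * Build with:  cl percore_mhz.c /link PowrProf.lib
 */
#include <windows.h>
#include <powrprof.h>
#include <stdio.h>
#include <stdlib.h>

/* Layout as documented for ntpoapi.h; declared here because the
 * user-mode SDK headers do not always expose it. */
typedef struct _PROCESSOR_POWER_INFORMATION {
    ULONG Number;
    ULONG MaxMhz;
    ULONG CurrentMhz;
    ULONG MhzLimit;
    ULONG MaxIdleState;
    ULONG CurrentIdleState;
} PROCESSOR_POWER_INFORMATION;

int main(void)
{
    SYSTEM_INFO si;
    DWORD i, count;
    PROCESSOR_POWER_INFORMATION *ppi;

    GetSystemInfo(&si);
    count = si.dwNumberOfProcessors;

    ppi = (PROCESSOR_POWER_INFORMATION *)calloc(count, sizeof(*ppi));
    if (ppi == NULL)
        return 1;

    /* ProcessorInformation returns one entry per logical processor. */
    if (CallNtPowerInformation(ProcessorInformation, NULL, 0,
                               ppi, (ULONG)(count * sizeof(*ppi))) != 0) {
        fprintf(stderr, "CallNtPowerInformation failed\n");
        return 1;
    }

    for (i = 0; i < count; i++)
        printf("CPU %lu: current %lu MHz, max %lu MHz\n",
               ppi[i].Number, ppi[i].CurrentMhz, ppi[i].MaxMhz);

    free(ppi);
    return 0;
}
```

Sampling this in a loop for 30 seconds produces data much like the tables above.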
Sleeping
We have focused on the active cores so far, but significant power savings can also come from putting idle cores into sleep states. Did the CPU driver and the OS scheduler work well together? Again, there are remarkable differences.
CPU Sleep State Comparison

| CPU | % Idle | % of idle time in ACPI C1 | % of idle time in ACPI C2 | % of idle time in ACPI C3 |
|---|---|---|---|---|
| Opteron 2435 | 86 | 100 | 0 | 0 |
| Xeon L3426 | 81 | 7 | 93 | 0 |
| Opteron 2389 | 72.4 | 100 | 0 | 0 |
With only two active threads, the six-core Opteron had more idle cores than the quad-core Opteron, and as a result it spent more time idle (86% versus 72.4%). All idle time on the Opterons was spent in the C1 ("halt") state.
The Xeon was quite a bit more aggressive: 93% of the idle time was spent in the C2 state. However, C2 at the operating system level does not mean the hardware actually runs in C2; in theory, the hardware can put the core into a deeper hardware core C-state. Intel promised that idle Nehalem cores would be able to reach even the deepest C6 sleep while other cores were working. Did that actually happen?
Software tools query the operating system's APIs and thus - as far as we know - only report the ACPI states. We therefore followed the guidelines in Intel's white paper, "Intel Turbo Boost Technology in Intel Core Microarchitecture Based Processors", and did some programming (in assembly) to find the actual hardware C-states.
First we read out the Time Stamp Counter (TSC):
RDTSC
0x000086FCCA7EBD0E
Next we read out the relevant Model Specific Register (on Nehalem, MSR 0x3FD is the core C6 residency counter):
RDMSR 0x3FD
High 32bit(EDX) = 0x00007265, Low 32bit(EAX) = 0xF842A000
We wait for 1500ms and then repeat the previous procedure:
RDTSC
0x000086FD78268DC2
RDMSR 0x3FD
High 32bit(EDX) = 0x00007265, Low 32bit(EAX) = 0xFA3F0000
In some cases the MSR did not advance by a single tick, clearly indicating that the core had not entered C6 during the 1.5 second period. The C6 residency percentage is then simply the increase of the MSR divided by the increase of the TSC over the same interval. A physical core and its logical sibling report the same TSC and MSR values, so it is easy to distinguish the real cores from the extra logical cores that SMT (Hyper-Threading) creates.
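Our measurement was a small assembly routine running under Windows (RDMSR is a privileged instruction, so it needs driver support). Purely as an illustration, a rough user-space equivalent on Linux could lean on the msr kernel module instead; the sketch below is our own and assumes a Nehalem-class CPU where MSR 0x3FD is the core C6 residency counter incrementing at TSC rate, with the msr module loaded (modprobe msr) and root privileges.

```c
/*
 * Sketch only: a rough Linux user-space equivalent of the assembly routine
 * described above. Assumes a Nehalem-class CPU where MSR 0x3FD is the core
 * C6 residency counter (counting at TSC rate), the 'msr' kernel module is
 * loaded, and the program runs as root.
 * Usage: ./c6_residency <cpu-number>
 */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <x86intrin.h>              /* __rdtsc() */

#define MSR_CORE_C6_RESIDENCY 0x3FD /* per Intel's Nehalem documentation */

static uint64_t read_msr(int fd, uint32_t reg)
{
    uint64_t value;
    /* /dev/cpu/N/msr returns the 64-bit MSR at file offset == MSR address */
    if (pread(fd, &value, sizeof(value), reg) != (ssize_t)sizeof(value)) {
        perror("pread");
        exit(EXIT_FAILURE);
    }
    return value;
}

int main(int argc, char **argv)
{
    int cpu = (argc > 1) ? atoi(argv[1]) : 0;
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    /* Snapshot TSC and C6 residency, wait 1.5s, snapshot again.
     * The TSC is synchronized across cores on these CPUs, so reading it
     * from whichever core this process lands on still gives a valid
     * interval for the ratio. */
    uint64_t tsc_start = __rdtsc();
    uint64_t c6_start  = read_msr(fd, MSR_CORE_C6_RESIDENCY);
    usleep(1500 * 1000);
    uint64_t tsc_end = __rdtsc();
    uint64_t c6_end  = read_msr(fd, MSR_CORE_C6_RESIDENCY);

    uint64_t ticks = tsc_end - tsc_start;
    uint64_t in_c6 = c6_end - c6_start;
    printf("CPU %d: %" PRIu64 " clock ticks, %" PRIu64 " spent in C6 (%.2f%%)\n",
           cpu, ticks, in_c6, 100.0 * (double)in_c6 / (double)ticks);

    close(fd);
    return 0;
}
```

Running it once for each core number reproduces the kind of per-core data shown in the tables that follow.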
With the "Performance" power plan we get:
"Performance" Power Profile C6 | |||
Clockticks | Ticks spent in C6 | Percentage C6 | |
Core 1 | 2913456308 | 33316864 | 1.14% |
Core 2 | 2933155470 | 0 | 0.00% |
Core 3 | 2950461391 | 2809569280 | 95.22% |
Core 4 | 2957802638 | 0 | 0.00% |
So on average the CPU is in C6 24% of the time ((1.14% + 0.00% + 95.22% + 0.00%) / 4 ≈ 24%), which is quite impressive. However, the way we measure this is not perfect: the measurement itself puts an extra load (slightly less than a chess thread) on the CPU, so the CPU is running not two but roughly three threads. With only the two chess threads active, the CPU probably spends even more time in C6.
Next the same measurement but with the "Balanced" power plan:
"Balanced" Power Profile C6 | |||
Clockticks | Ticks spent in C6 | Percentage C6 | |
Core 1 | 2961019252 | 0 | 0.00% |
Core 2 | 2991271044 | 2371919872 | 79.29% |
Core 3 | 3012220038 | 74088448 | 2.46% |
Core 4 | 3012878436 | 22192128 | 0.74% |
This time the CPU spends a little less time in C6: about 21% ((0.00% + 79.29% + 2.46% + 0.74%) / 4 ≈ 21%). Setting the power plan to "Performance" allows the idle cores to spend a little more time in deep sleep, as the active cores are working harder. Of course total power does not decline: the higher power consumption of the Turbo Boosted cores matters far more than the small effect of some cores spending a few percentage points more of their time in deep sleep.
Comments
JohanAnandtech - Tuesday, January 19, 2010 - link
Well, Oracle has a few downsides when it comes to this kind of testing. It is not very popular with small and medium businesses AFAIK (our main target), and we still haven't worked out why it performs much worse on Linux than on Windows. So choosing Oracle is a sure way to make the project time explode... IMHO.
ChristopherRice - Thursday, January 21, 2010 - link
Works worse on Linux than Windows? You likely have a setup issue with the kernel parameters or within Oracle itself. I actually don't know of any enterprise location that uses Oracle on Windows anymore. Generally it's all RHEL4/RHEL5/Sun.
TeXWiller - Monday, January 18, 2010 - link
The 34xx series supports four quad-rank modules, giving today a maximum supported amount of 32GB per CPU (and board). The 24GB limit is that of the three-channel controller with unbuffered memory modules.
pablo906 - Monday, January 18, 2010 - link
I love Johan's articles. I think this has some implications for which virtualization solutions may be the most cost effective. When you're running at 75% capacity on every server, I think the AMD solution could possibly become more attractive. I think I'm going to have to do some independent testing in my datacenter with this.
I'd like to mention that focusing on VMware is a disservice to virtualization technology as a whole. It would be like not having benchmarked the K6-3+ just because P2s and Celerons were the mainstream and SS7 boards weren't quite up to par. There are situations, primarily virtualizing Linux, where Citrix XenServer is a better solution. Also, many people who are buying Server '08 licenses are getting Hyper-V licenses bundled in for "free."
I've known several IT directors in very large health care organizations who are deploying a mixed Hyper-V/XenServer environment because of the "integration" between the two. Many of the people I've talked to at events around the country are using this model for at least part of their virtualization deployments. I believe it would be important to publish to the industry what kind of performance you can expect from such deployments.
You can do some really interesting homebrew SAN deployments with OpenFiler or Open-iSCSI that can compete with the performance of EMC, Clarion, NetApp, LeftHand, etc. I've found that NFS deployments can bring you better performance and manageability. I would love to see some articles about the strengths and weaknesses of the storage subsystem used and how it affects each type of deployment. I would absolutely be willing to devote some datacenter time and experience to helping put something like this together.
I think this article ties in really well with your virtualization coverage, and I would love to see more comments on what you think this means for someone with a small, medium, or large datacenter.
maveric7911 - Tuesday, January 19, 2010 - link
I'd personally prefer to see KVM over XenServer. Even Red Hat is ditching Xen for KVM. In the environments I work in, Xen is actually being decommissioned in favor of VMware.
JohanAnandtech - Tuesday, January 19, 2010 - link
I can see the theoretical reasons why some people are excited about KVM, but I still don't see the practical ones. Who is using this in production? Getting Xen, VMware or Hyper-V to do their job is pretty easy; KVM does not even seem close to beta quality. It is hard to get working, and it is nowhere near Xen when it comes to reliability. Admittedly, those are our first impressions, but we are no virtualization rookies. Why do you prefer KVM?
VJ - Wednesday, January 20, 2010 - link
"It is hard to get working, and it nowhere near to Xen when it comes to reliabilty. "I found Xen (separate kernel boot at the time) more difficult to work with than KVM (kernel module) so I'm thinking that the particular (host) platform you're using (windows?) may be geared towards one platform.
If you had to set it up yourself then that may explain reliability issues you've had?
On Fedora linux, it shouldn't be more difficult than Xen.
Toadster - Monday, January 18, 2010 - link
One of the new technologies released with the Xeon 5500 (Nehalem) is Intel Intelligent Power Node Manager, which controls P/T states within the server CPU. This is a good article on existing P/C states, but will you guys be doing a review of the newer control technologies as well?
http://communities.intel.com/community/openportit/...
JohanAnandtech - Tuesday, January 19, 2010 - link
I don't think it is "newer". Going to C6 for idle cores is less than a year old, remember :-). It seems to be a sort of manager which monitors the electrical input (PDU based?) and then lowers the P-states to keep the power at a certain level. Did I miss something? (I only glanced at it quickly.)
Personally I think HP is more onto something by capping the power inside their server management software. But I still have to evaluate both; we will look into that.
n0nsense - Monday, January 18, 2010 - link
Maybe I missed something in the article, but from what I see at home, a C2Q (and C2D) can manage frequencies per core. I'm not sure it is possible under Windows, but in Linux it just works this way. You can actually see each core at its own frequency.
Moreover, you can select for each core which frequency it should run at.