What is a Core?
A standard processor has one core (single-core). Single-core processors process only one instruction stream at a time. (They do use pipelines internally, which allow several instructions to be in flight at once; however, instructions still complete one at a time.)
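The pipelining idea can be sketched with a toy model. This is a hypothetical 3-stage pipeline for illustration only; real pipelines have many more stages and complications:

```python
# Toy model of a 3-stage pipeline: several instructions are in flight at
# once, but only one enters and one leaves the pipeline per cycle.
stages = ["fetch", "decode", "execute"]
instructions = ["i1", "i2", "i3", "i4"]

timeline = []
for cycle in range(len(instructions) + len(stages) - 1):
    # Instruction k enters on cycle k and advances one stage per cycle.
    in_flight = [
        f"{instr}:{stages[cycle - k]}"
        for k, instr in enumerate(instructions)
        if 0 <= cycle - k < len(stages)
    ]
    timeline.append(in_flight)
    print(f"cycle {cycle}: {in_flight}")
```

By cycle 2 the pipeline is full: i1 is executing while i2 decodes and i3 is fetched, yet completions still happen one at a time.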
What is a Multi-Core Processor?
A multi-core processor consists of two or more independent cores, each capable of processing instructions on its own. A dual-core processor contains two cores, a quad-core processor contains four, and a hexa-core processor contains six.
Why do I Need Multiple Cores?
Multiple cores can be used to run two programs side by side: when an intensive program is running (an AV scan, video conversion, CD ripping, etc.), you can use another core to run your browser, check your email, and so on.
Multiple cores really shine when you’re using a program that can spread its work across more than one core (called parallelization) to improve efficiency and responsiveness. Programs such as graphics software and games can run multiple instructions at the same time and deliver faster, smoother results.
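As a sketch of what parallelization looks like in practice, here is a hypothetical CPU-bound workload split across cores with Python's standard library (the workload and chunk sizes are made up for illustration):

```python
# Minimal sketch: splitting a CPU-bound job across cores with worker
# processes. The workload and chunk sizes here are invented for the demo.
from concurrent.futures import ProcessPoolExecutor

def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-heavy)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    chunks = [10_000, 20_000, 30_000, 40_000]
    # Each chunk goes to its own worker process, so a quad-core CPU can
    # work on all four at once instead of one after another.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(count_primes, chunks))
    print(results)
```

A single-core machine still runs this correctly; it just gains nothing, since the workers time-share one core.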
If you use
CPU-intensive software, multiple cores will likely provide a better computing
experience. If you use your PC to check emails and watch the occasional video,
you really don’t need a multi-core processor.
How Do Core 2 Duo and Core 2 Quad Compare with Core
i3, i5, and i7?
Core 2 Duo processors run two threads; i3s and i5s run four. Core 2 Duo processors use socket 775 (45/65 nm); Core i3 and i5 processors use socket 1156 and only work with DDR3 RAM (some Core 2 Duos work with both DDR2 and DDR3).
For desktops, the newer Core i3/i5 parts are preferable, since Core 2 Duos lag behind in power and in compatibility with the newest PC hardware; for laptops, it all depends on your usage. As laptops aren’t easy to upgrade, buying dated technology might burn you in the future, when you find that your Core 2 Duo PC’s motherboard only supports 4 GB of RAM, for example.
i5s come with Turbo Boost; however, i3s overclock very well if that’s your thing. If you’re choosing among Core 2 Duo, i3, and i5, here’s where I’d put my money: the i3 provides the best value for most casual PC users.
While I focused
on i3, i5, and Core 2 Duo to answer a reader’s question, the principles apply
when comparing i5, i7, and Core 2 Quad. I’d go for the i5 unless you are
willing to pay a premium for a little more performance.
Do I need an i3, i5, or i7?
As with all computer hardware, the type of processor you need depends on what you do, how long you want your computer to stay current, and your budget.
Here’s a very simple breakdown of what you should look to buy depending on your computing needs. All suggestions assume you are buying a pre-built PC, so you don’t have to worry about motherboard and RAM specs or about upgrade compatibility.
If you:
- Browse the internet, check email, and play the occasional Flash game (like Farmville): get a single-core netbook or desktop.
- Do word processing and spreadsheets, listen to music often, and watch movies: get an i3 (or any dual-core processor, e.g. a Core 2 Duo).
- Play the occasional game and are happy with lower resolution and lower-quality graphics (my suggestion assumes the graphics processor in the pre-built PC is well matched to the processor), and watch HD movies: get an i5.
- Do graphic publishing, music creation, or programming (and compiling), watch HD movies, or like to play visually appealing games: get a quad-core i5, or an i7.
- Like to have the very best hardware and play the most graphically intense games: get a quad-core or hexa-core i7 Extreme.
Intel Sandy Bridge microarchitecture
This new design
is a significant improvement. Many of the bottlenecks of previous designs have
been dealt with in the Sandy Bridge. Instruction fetch and predecoding have been
a serious bottleneck in Intel designs for many years. In the NetBurst
architecture they tried to fix this problem by caching decoded µops, without
much success. In the Sandy Bridge design, they are caching instructions both before
and after decoding. The limited size of the µop cache is therefore less
problematic, and the µop cache appears to be very efficient.
The limited
number of register read ports has been a serious, and often neglected, bottleneck
since the old Pentium Pro. This bottleneck has now finally been removed in the Sandy
Bridge. Previous Intel processors had only one memory read port, whereas AMD processors have two. This was a bottleneck in many math applications. The Sandy Bridge has two read ports, which removes this bottleneck.
The branch
prediction has been improved by having bigger buffers and a shorter misprediction
penalty, but it has no loop predictor, and mispredictions are still quite common.
The new AVX
instruction set is an important improvement. The throughput of floating point addition
and multiplication is doubled when the new 256-bit YMM registers are used. The new
non-destructive three-operand instructions are quite convenient for reducing
register pressure and avoiding register move instructions. There is, however, a
serious performance penalty for mixing vector instructions with and without the
VEX prefix. This penalty is easily avoided if the programming guidelines are
followed, but I suspect that it will be a very common programming error in the
future to inadvertently mix VEX and non-VEX instructions, and such errors will
be difficult to detect.
Whenever the
narrowest bottleneck is removed from a system, the next less narrow bottleneck
will become the limiting factor. The new bottlenecks that require attention in
the Sandy Bridge are the following:
1. The µop
cache. This cache can ideally hold up to 1536 µops. The effective utilization will
be much less in most cases. The programmer should make sure that the most critical inner loops fit into the µop cache.
2. Instruction
fetch and decoding. The fetch/decode rate has not been improved over previous
processors and is still a potential bottleneck for code that doesn’t fit into
the µop cache.
3. Data cache
bank conflicts. The increased memory read bandwidth means that the frequency of
cache conflicts will increase. Cache bank conflicts are almost unavoidable in
programs that utilize the memory ports to their maximum capacity.
4. Branch
prediction. While the branch history buffer and branch target buffers are probably
bigger than in previous designs, mispredictions are still quite common.
5. Sharing of
resources between threads. Many of the critical resources are shared between
the two threads of a core when hyperthreading is on. It may be wise to turn off
hyperthreading when multiple threads depend on the same execution resources.
Intel Ivy Bridge microarchitecture
Ivy Bridge is the codename for the 22 nm die shrink of Intel's Sandy Bridge microarchitecture, based on tri-gate ("3D") transistors. The name also applies to the future Ivy Bridge-EX and Ivy Bridge-EP microprocessors. Ivy Bridge processors are backwards-compatible with the Sandy Bridge platform, but might require a firmware update (vendor specific).[1]
Intel has released new 7-series Panther Point chipsets with integrated USB 3.0
to complement Ivy Bridge.[2]
Volume
production of Ivy Bridge chips began in the third quarter of 2011.[3] Quad-core
and dual-core-mobile models launched on April 29, 2012 and May 31, 2012
respectively.[4] Core i3 desktop processors, as well as the first 22 nm Pentium
were launched and available the first week of September, 2012.
How much faster are the Ivy Bridge processors?
The base clock
frequency of these processors ranges from 2.8 GHz (for Core i5-3450S) to 3.5
GHz (for Core i7-3770K).
What other notable features are present in Ivy Bridge?
HD Graphics – Ivy Bridge processors have a built-in GPU. This GPU supports DirectX 11 (Sandy Bridge supports 10.1) and OpenGL 3.1 (Sandy Bridge supports 3.0). Ivy Bridge processors carry the Intel HD 4000/HD 2500 GPUs. This means that you do not need an add-on graphics card.
QuickSync Video – This feature uses dedicated media-processing hardware to make video creation and conversion faster and easier. (It debuted with the 2nd-generation Sandy Bridge processors and is carried forward here.) Whether you want to create DVDs, convert and edit 3D/2D videos, or upload to your favorite social networking sites, everything is done in a jiffy.
WiDi 3.0 –
Wireless Display technology allows you to stream media content to a multitude
of your Wi-Fi connected display devices. You can share a 1080p 60FPS video
using WiDi.
Turbo Boost Technology 2.0 – Turbo Boost lets your Ivy Bridge processor run faster than its base frequency when conditions allow. For example, a 3.5 GHz Core i7 can run at 3.9 GHz for some time.
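That example works out to roughly an 11% clock headroom; as a quick sanity check (numbers taken from the paragraph above):

```python
# Turbo Boost headroom for the example above: 3.5 GHz base, 3.9 GHz boosted.
base_ghz, boost_ghz = 3.5, 3.9
headroom = (boost_ghz - base_ghz) / base_ghz
print(f"Turbo Boost headroom: {headroom:.1%}")  # prints "Turbo Boost headroom: 11.4%"
```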
What different types of the Ivy Bridge processors are
available?
There are many types of processors in the Ivy Bridge family. The type is indicated by a suffix on the CPU model name. The following list explains these suffixes:
· K – Unlocked, ready to be overclocked.
· S – Performance optimized, with low power consumption.
· T – Power optimized, with ultra-low power consumption.
· M – Mobile processors for mobile devices.
· Q – Quad-core processors.
What is the main difference between Sandy Bridge and
Ivy Bridge?
The main difference is the manufacturing process. Ivy Bridge uses a 22 nm process while Sandy Bridge uses a 32 nm process. This means that Ivy Bridge can pack more transistors into a smaller area, giving better power efficiency and performance.
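A crude first-order estimate shows why the shrink matters. Real density gains depend on the actual design rules, but the square-law intuition holds:

```python
# Back-of-envelope: feature area scales with the square of the feature size.
sandy_nm, ivy_nm = 32, 22
area_ratio = (ivy_nm / sandy_nm) ** 2
print(f"A 22 nm feature takes roughly {area_ratio:.0%} of a 32 nm feature's area")
```

So the same logic fits in roughly half the area, which is where much of the power saving comes from.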
Intel has also
introduced a new 3D transistor technology in the Ivy Bridge processors. They
are also called tri-gate transistors. According to Intel, this new
architectural design would result in Ivy Bridge providing better performance
while consuming significantly less power compared to an equivalent Sandy Bridge
processor.
Champion vs. Rookie: Core 2 Duo vs. Core i3
The Core 2 Duo
is Intel's veteran, covering a wide range of price and performance sweet spots.
It is now being replaced, however, by Intel's rookie Core i3. So, is the Core
i3 actually better than the Core 2 Duo, or can you hold off upgrading for a
while longer?
Core 2 Duo vs. Core i3: The Veteran vs. The Rookie
The Core 2 Duo
has been the processor of choice in laptops for about three years. Over those
three years the average speeds of Core 2 Duo processors have advanced
significantly and many of today's Core 2 Duo laptops have speeds of around 2.2
GHz or faster. Core 2 Duo processors have also been the go-to for many less
expensive desktop systems, with speeds reaching over 3 GHz.
However, there
is a newcomer challenging the Core 2 Duo: the Core i3. It is very similar to the Core 2 Duo in many ways. Both are dual-core processors, and most Core 2 Duos and Core i3s have similar clock speeds. However, the processors are based on different architectures.
So, which one is
better?
Architecture
The Core 2 Duo processors are based on the Core 2 architecture. The Core and Core 2
architectures were arguably Intel's most successful architectures, as they
replaced the Pentium 4 processors in desktop systems and made Intel competitive
in that space once again.
The Core i3 is based on a new architecture called Nehalem. The Nehalem architecture has
numerous advantages over the Core 2 architecture. Nehalem is better constructed
for quad-core processors, has hyper-threading available, and can use a feature
called Turbo Boost which maximizes processor speed. However, because the Core
i3 is the low-end Nehalem variant, most of these features are disabled or not
relevant - the Core i3 is a dual core processor and Turbo Boost is disabled,
but hyper-threading is enabled.
Processor Performance
The Core i3 is the slowest variant of the Nehalem-based processors. The Core 2 Duo processors,
however, don't have the same differentiation between versions of the same
architecture. The fastest Core 2 Duo desktop processor has a speed of 3.33 GHz,
while the fastest Core i3 desktop processor is clocked at 3.06 GHz.
You might
therefore expect that the Core 2 Duo would have the edge - particularly when
you consider that the Core 2 Duo costs almost three times as much if you buy it
individually - but in fact the Core i3 is faster, and often by no small margin.
The Core i3 is faster even in single-threaded applications, but the performance
gap really widens in multi-threaded applications. This is because the Core i3
has hyper-threading, which turns the two real cores into four virtual cores.
Windows works with the Core i3 as if it is a quad-core processor.
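You can see this doubling from software: `os.cpu_count()` reports logical processors, not physical cores (a small sketch; the number varies by machine):

```python
# Sketch: the OS schedules work on logical processors, which is what
# os.cpu_count() reports. On a dual-core chip with Hyper-Threading
# (such as a Core i3), this typically returns 4.
import os

logical = os.cpu_count()
print(f"The OS sees {logical} logical processors")
```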
These results
remain true in the mobile space, as well. Core i3 processors punch at least 500
MHz above their weight in single-thread applications, and are virtually always
faster in multi-threaded applications, no matter the clock speeds of the Core 2
Duo and Core i3 processors you are comparing.
Power Usage and Heat
A look at the
technical specifications of the Core i3 processors automatically puts them into
a negative light when it comes to power consumption. The desktop Core i3 parts are listed as having a 73 Watt TDP, while most Core 2 Duo desktop parts have a
65 Watt TDP. In laptops the Core i3 has a 35 watt TDP, while Core 2 Duo mobile
processors usually have a 25 Watt TDP.
These
differences pan out about how you'd expect them to when it comes to absolute
power consumption. The Core i3 processors do consume just slightly more power
than Core 2 Duo processors at load and at idle. We're talking a difference of around 10 Watts on desktops and a few Watts on laptops - nothing huge, but a difference nonetheless.
However, when it
comes to power efficiency the answer becomes less clear. In order for a
processor to be power efficient, it needs to not only have low power
consumption but also the ability to complete tasks quickly. This lowers the
overall "task energy" because a faster processor will be done with a
task before a slower processor, and once done it will slip back into an idle
state.
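A worked example with made-up wattages and times shows the "task energy" idea:

```python
# Hypothetical numbers for illustration only - not measured figures.
def task_energy(load_watts, load_seconds, idle_watts, idle_seconds):
    """Energy in joules over a fixed window: work at load, then sit idle."""
    return load_watts * load_seconds + idle_watts * idle_seconds

# Faster chip: draws more at load but finishes in 60 s, idling the rest
# of a 100 s window.
fast = task_energy(73, 60, 5, 40)
# Slower chip: draws less at load but stays busy the whole 100 s.
slow = task_energy(65, 100, 5, 0)
print(fast, slow)  # prints "4580 6500" - the faster chip wins on energy
```

Even though the faster chip draws more watts while working, finishing early and dropping to idle gives it the lower total energy for the task.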
When viewed from
this perspective, the Core i3 is much more efficient than the Core 2 Duo on
both the desktop and the laptop. This means that the Core i3 will probably not
use any more power than a Core 2 Duo - and may actually use less - unless your
usage patterns place a constant load on your processor.
i3 vs. i5 vs. i7: A Branding Dream
Core i3 Series
Intel's Core i3 processor line has always been the budget option. These processors remain dual-core, unlike the rest of the Core line, which is made up of quad-core processors. Intel's Core i3 processors also have many features restricted.
The main feature withheld from the Core i3 processors is Turbo Boost, the dynamic overclocking available on most Intel processors. This, along with the dual-core design, accounts for most of the performance difference between Core i3 processors and the i5 and i7 options.
One feature that
Core i3 has - and i5 doesn't - is hyper-threading. This is Intel's logic-core
duplication technology which allows each physical core to be used as two logic
cores. The result of this is that Windows will display a dual-core Core i3
processor as if it were a quad-core.
Finally, Core i3
processors have their integrated graphics processor restricted to a maximum
clock speed of 1100 MHz, and all Core i3 processors have the 2000 series IGP, which is restricted to 6 execution units. This will result in slightly lower
IGP performance overall, but the difference is frankly inconsequential in many
situations.
Core i5 Series
Intel used to
split the Core i5 processor brand into two different lines, one of which was
dual-core and one of which was quad-core.
All Sandy Bridge Core i5 processors are quad-core; they all have Turbo Boost, and they all lack Hyper-Threading. Most of the Core i5 processors, aside from the K series (explained later), use the same 2000 series IGP with a maximum clock speed of 1100 MHz and six execution units.
In the i3 vs. i5 vs. i7 battle, the Core i5 processor is now clearly the mainstream option no matter which product you buy. The only substantial difference between the Core i5 options is clock speed, which ranges from 2.8 GHz to 3.3 GHz. Naturally, the products with a quicker clock speed are more expensive than the slower ones.
Core i7 Series
These processors
are virtually identical to the Core i5. They have a 100 MHz higher base clock
speed, which is inconsequential in most situations. The real feature difference
is the addition of hyper-threading on the Core i7, which means that the
processor will appear as an 8-core processor in Windows. This improves threaded
performance and can result in a substantial boost if you're using a program
that is able to take advantage of 8 threads.
Of course, most programs can't take advantage of 8 threads. Those that can are almost always meant for enterprise or advanced video editing applications - 3D rendering programs, photo editing programs, and scientific programs are categories of software frequently designed to use 8 threads. The average user is unlikely to see the full benefit of the hyper-threading feature. In the Core i3 vs. i5 vs. i7 battle, the i7 has limited appeal.
The IGP on Core i7 processors can also reach a higher maximum clock speed of 1350 MHz; however, this difference is largely inconsequential when measuring real-world performance.
The K series processor
Late in the lifespan of Intel's previous Core i branded products, Intel introduced the "K" series. These processors had unlocked multipliers, making them easier to overclock.
Intel has kept
this line of products alive with the new Sandy Bridge architecture by
introducing a K series Core i5 and i7 processor. As before, these processors
have unlocked multipliers. However, they also have a new feature - better
integrated graphics processors.
This comes in the form of the 3000 series IGP, which has 12 execution units instead of 6. The maximum clock speed remains limited by the processor brand - the Core i5 K is limited to 1100 MHz, while the Core i7 K can reach 1350 MHz. The additional execution units can result in better performance in games, although, to be honest, the IGP isn't remotely cut out for desktop gaming.
Buying Advice
Intel's Core i5 processor line remains the one to buy. The quad-core i5 processors are extremely quick and have all of the features that matter, such as Turbo Boost. They're also reasonably priced, with the 2.8 GHz variant starting at just under $180. That's not a bargain, but considering the performance - which is far in excess of Intel's previous Core i5 processors and AMD's quad-core offerings - it's a good value.
Still, the i3
processor should be considered if you're not looking for a performance
speed-demon. We reached the point at which a basic processor proved capable of
offering adequate day-to-day performance years ago. Tasks such as HD video,
basic video transcoding and productivity applications will easily be conquered
by the least expensive i3.
Finally, we have
the i7. In the i3 vs. i5 vs. i7 battle, the Core i7 is the hardest to
recommend. Hyper-threading is great, but only if you use specific applications
that can take advantage of 8 threads. If you don't, there isn't much reason to
spend the extra dough.
Core i3 vs. Core i5: What's the Difference?
The Processor
Features: Sandy Bridge
There are
several major differences when comparing Core i3 vs. Core i5 Sandy Bridge
processors, and they're more defined than they were before. The first is the
number of cores. The new Core i3 processors are dual-core only, while the new
Core i5 processors are quad-core only. Another major difference is Turbo Boost,
which is disabled on Core i3 processors but enabled on Core i5 processors.
However, Intel Core i3 processors have hyper-threading support but Core i5
processors don't offer this feature.
Besides these points, there are two minor differences. Core i3 processors don't support Intel's vPro management technology or Intel's AES-NI encryption acceleration. Intel Core i5 processors support both.
The IGP Features: Sandy Bridge
The most important new feature of Intel's Sandy Bridge processors is the inclusion of an IGP on the processor itself. Intel did this before with Core i3 and some Core i5 processors, but the IGP was still separate from the processor: the IGP and CPU were placed in the same package, but on separate pieces of silicon, and didn't physically work together.
Now Intel has
taken the IGP integration a step further and worked the IGP into the CPU
architecture. It even shares cache with the processor. What this means, in
practical terms, is that the on-board graphics of Intel's new processors are
superior to anything they've offered before. It also enables Quick Sync, a
video transcoding feature that provides blazing performance when converting
videos to a different format.
Intel is offering two different types of IGPs on its processors. The 2000 has 6 execution units, while the 3000 has 12. Obviously, the latter is quicker. Intel hasn't tied the IGP that you receive to the type of processor
you choose, however. Instead, it has tied the 3000 series IGP to the
"K" series processors. If you see a "K" at the end of the
processor's name, it has the 3000 series IGP. So far, Intel doesn't offer a
Core i3 K series processor, but that could change in the future.
Laying Out the Chipset
The staggered
release of Intel's previous Core i3/i5/i7 products also resulted in a staggered
release of processor sockets and their related chipsets. First came the LGA 1366 socket, which was tied to some Core i7 processors. Then Intel
confused things by releasing the LGA 1156 socket, which was made available on
several different chipsets and processor types. Choosing the right socket and
chipset for a processor wasn't easy.
Intel has now
clarified matters by releasing a single processor socket and two processor
chipsets alongside Sandy Bridge. The new socket is LGA 1155, and it isn't
backwards compatible with anything Intel has previously offered. The new
chipsets are P67 and H67, with the P variant being performance-oriented and the
H variant targeted at general use. The main difference is that P67 allows for
processor overclocking, while H67 does not. P67 also offers 16 additional PCIe
lanes. Both Core i3 and i5 processors are compatible with either chipset.
Core i5 vs. Core i7: What's the Difference?
Core i5: The New Middle Class
While the
hardware has changed, Intel's branding scheme remains the same, and Core i5
remains Intel's primary mid-range processor. It is targeted at the heart of the
market, with pricing that is not at budget levels but still affordable, and
performance that is extremely quick but not the fastest Intel offers.
Intel's high-end
processor line is the Core i7. Many users who are looking for a
high-performance part end up considering both i5 and i7 products.
A Unified Socket and Chipset
Perhaps the best news to come out of Intel's new line of i5 and i7 processors is the introduction of a single socket for all Sandy Bridge Core i3/i5/i7 processors: LGA 1155. In case you're wondering, this socket is not backwards compatible with previous LGA 1156 processors.
Intel Turbo Boost
Intel has made
Turbo Boost a standard feature on all Core i5 and i7 processors, from the least
to most expensive. Intel has also reduced the gap between the maximum turbo
boost frequencies on different processors. Previously, some of the older Core
i7 processors actually had a much less efficient Turbo Boost feature than some
newer Core i5s.
All of Intel's current Core i5 and i7 processors offer a boost of between 300 and 400 MHz. The least expensive i5s offer the 300 MHz boost - for example, the Core i5-2300 has a base clock speed of 2.8 GHz and a maximum Turbo Boost speed of 3.1 GHz. The Intel Core i7-2600, on the other hand, offers a base clock speed of 3.4 GHz and a maximum Turbo Boost of 3.8 GHz.
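Those boost steps can be tabulated directly (clock figures taken from the paragraph above):

```python
# Base and maximum Turbo Boost clocks (GHz) for the two parts cited above.
parts = {
    "Core i5-2300": (2.8, 3.1),
    "Core i7-2600": (3.4, 3.8),
}
for name, (base, turbo) in parts.items():
    print(f"{name}: +{round((turbo - base) * 1000)} MHz of Turbo Boost headroom")
```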
Besides the
clock speed difference, Turbo Boost is essentially the same on the i5 and i7
processors.
Difference in Hyper-Threading
Another significant performance difference is how the Core i7 and Core i5 products handle hyper-threading. Hyper-threading is a technology Intel uses to present more logical cores than physically exist on the processor. Although Core i7 products have all been quad-cores, they appear in Windows as having eight cores. This further improves performance in programs that make good use of multi-threading.
All Sandy Bridge
Core i5 processors have hyper-threading disabled, and all Sandy Bridge Core i7
processors have hyper-threading enabled. This is a major feature difference of
Core i5 vs. Core i7 processors, and it will give the Core i7 products an
advantage over Core i5 processors in some heavily multi-threaded applications.
The New IGP
All of Intel's Sandy Bridge processors include a new IGP that is part of the processor architecture. While far from a gaming-grade video solution, it offers reasonable performance without consuming much power. It also enables features like Quick Sync, which can transcode video extremely quickly.
There are two versions of this IGP: the 2000 and the 3000. The only difference between the two is the number of execution units - the 2000 has 6, while the 3000 has 12. This doesn't mean the 3000 is twice as quick, but it does mean the 3000 is about 50% quicker in most benchmarks.
i5 vs. i7: What it means to Consumers and Power Users
Currently, the
Core i5 processor brand makes up most of Intel's Sandy Bridge processor line.
The prices of these processors range from $177 to $216 with base clock speeds
between 2.8 GHz and 3.3 GHz. Intel only offers two Core i7 products, the Core
i7-2600 and Core i7-2600K, both of which have a 3.4 GHz base clock speed. The
i7-2600 has a price tag of $294.
As you may have
guessed, paying about $80 more for the 100 MHz clock speed increase between the
fastest i5 and the i7 isn't a great deal. The main reason to pay this
additional cash for an i7 is hyper-threading, but this advantage will only be
evident if you frequently use programs that can actually make use of 8 threads.
For most users,
the i5 is clearly the better deal. The i5-2500 makes the most sense in my
opinion, as it offers an extremely quick base clock speed of 3.3 GHz for about
$200. Of course, the value of this is subject to change in the future as Intel
fleshes out its product line with new models.
Hyper-Threading
Hyper-Threading
Technology brings the concept of simultaneous multi-threading to the Intel Architecture.
Hyper-Threading Technology makes a single physical processor appear as two
logical processors. The physical execution resources are shared and the
architecture state is duplicated for the two logical processors. From a
software or architecture perspective, this means operating systems and user
programs can schedule processes or threads to logical processors as they would
on multiple physical processors. From a microarchitecture perspective, this
means that instructions from both logical processors will persist and execute
simultaneously on shared execution resources.
The amazing
growth of the Internet and telecommunications is powered by ever-faster systems
demanding increasingly higher levels of processor performance. To keep up with
this demand we cannot rely entirely on traditional approaches to processor
design. Microarchitecture techniques used to achieve past processor performance improvements – super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches – have made microprocessors increasingly complex, with more transistors and higher power consumption. In fact, transistor counts and power are increasing at rates greater than processor performance. Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel’s Hyper-Threading Technology is one solution.
A look at
today’s software trends reveals that server applications consist of multiple
threads or processes that can be executed in parallel. On-line transaction
processing and Web services have an abundance of software threads that can be
executed simultaneously for faster performance. Even desktop applications are
becoming increasingly parallel. Intel architects have been trying to leverage
this so-called thread-level parallelism (TLP) to gain a better performance vs.
transistor count and power ratio.
In both the
high-end and mid-range server markets, multiprocessors have been commonly used
to get more performance from the system. By adding more processors,
applications potentially get substantial performance improvement by executing
multiple threads on multiple processors at the same time. These threads might
be from the same application, from different applications running simultaneously,
from operating system services, or from operating system threads doing
background maintenance. Multiprocessor systems have been used for many years,
and high-end programmers are familiar with the techniques to exploit
multiprocessors for higher performance levels.
In recent years
a number of other techniques to further exploit TLP have been discussed and
some products have been announced. One of these techniques is chip
multiprocessing (CMP), where two processors are put on a single die. The two
processors each have a full set of execution and architectural resources. The
processors may or may not share a large on-chip cache. CMP is largely
orthogonal to conventional multiprocessor systems, as you can have multiple CMP
processors in a multiprocessor configuration. Recently announced processors
incorporate two processors on each die. However, a CMP chip is significantly larger than a single-core chip and is therefore more expensive to manufacture; moreover, it does not begin to address the die size and power considerations.
Another approach
is to allow a single processor to execute multiple threads by switching between
them. Time-slice multithreading is where the processor switches between
software threads after a fixed time period. Time-slice multithreading can
result in wasted execution slots but can effectively minimize the effects of
long latencies to memory. Switch-on-event multithreading would switch threads
on long latency events such as cache misses. This approach can work well for
server applications that have large numbers of cache misses and where the two
threads are executing similar tasks. However, neither the time-slice nor the switch-on-event multithreading technique achieves optimal overlap of the many sources of inefficient resource usage, such as branch mispredictions, instruction dependencies, etc.
Finally, there
is simultaneous multi-threading, where multiple threads can execute on a single
processor without switching. The threads execute simultaneously and make much
better use of the resources. This approach makes the most effective use of
processor resources: it maximizes the performance vs. transistor count and
power consumption. Hyper-Threading Technology brings the simultaneous multi-threading approach to the Intel architecture; it was first implemented on the Intel Xeon processor family.
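A toy sketch of the difference (purely illustrative; real SMT issue logic is far more involved): with SMT, issue slots that one thread would waste on a stall can be filled by the other thread.

```python
# Toy model: thread A's schedule, with None marking a stall cycle
# (e.g. waiting on a cache miss); thread B has work queued up.
thread_a = ["a1", None, "a2", None, "a3"]
thread_b = ["b1", "b2", "b3"]

b_queue = iter(thread_b)
smt_schedule = []
for op in thread_a:
    # Issue A's op if it is ready; otherwise give the slot to thread B.
    smt_schedule.append(op if op is not None else next(b_queue, None))
print(smt_schedule)  # prints "['a1', 'b1', 'a2', 'b2', 'a3']"
```

A single-threaded core would waste the two stall cycles; here they carry b1 and b2, so five cycles retire five micro-ops instead of three.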
A few elements
of the CPU need to be understood before Hyper-Threading Technology makes
sense:
- Registers – Registers are
basically circuits that hold a single value (64 bits wide on x86-64) and are
the fastest form of storage available on a computer. The x86 architecture
provides a number of General Purpose Registers that are used by an executing
program. In a multicore chip, registers are unique to each core, so a
quad-core processor has four sets of general-purpose registers.
- Cache – Cache is essentially a
form of storage that falls between registers and RAM in terms of speed.
Modern processors generally have three levels; in the case of the i7,
Levels 1 and 2 are private to each core and Level 3 is shared by all the
cores on a chip. The most important thing to know is that accessing the
cache is slower than accessing registers but still faster than RAM.
- Execution Unit – This is the
section of the CPU responsible for actually executing instructions. If you
tell the computer to add 2 + 3, this is the part where that operation is
performed.
- Front-End – This unit of
the processor is also known as Instruction Fetch/Decode. Essentially, it
grabs instructions from either cache or RAM and decodes them into a form
the execution unit can understand.
- Branch Predictor – This unit
attempts to predict branches in program code. If there is an "if-then"
statement in a program, it guesses which statements will be executed and
prefetches them for the front-end.
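To make the branch-predictor bullet concrete, here is a minimal sketch of a classic 2-bit saturating-counter predictor, the textbook scheme behind "guess which way the if-then goes." The class name, table size, and indexing are illustrative assumptions, not the design of any specific Intel core.

```python
class TwoBitPredictor:
    """Toy 2-bit saturating branch predictor (illustrative, not Intel's design)."""

    def __init__(self, entries=1024):
        # Each entry is a counter 0..3; values >= 2 mean "predict taken".
        self.table = [1] * entries  # start in the weakly-not-taken state

    def _index(self, pc):
        return pc % len(self.table)  # stand-in for a hardware hash of the PC

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Saturating update: nudge the counter toward the actual outcome.
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = TwoBitPredictor()
# A loop branch at (hypothetical) address 0x40: taken 9 times, then falls through.
outcomes = [True] * 9 + [False]
hits = 0
for taken in outcomes:
    if p.predict(0x40) == taken:
        hits += 1
    p.update(0x40, taken)
print(hits)  # 8 of 10 predictions correct
```

The two misses are the first iteration (the predictor has not learned yet) and the loop exit, which is exactly the behavior loops exhibit on real hardware.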
In a core with
HT, the registers are all duplicated. This means that one core has two sets
of registers, and this is what the operating system sees as a "logical core,"
since a set of registers represents the processor's state. We'll call these
sets A and B. Even though the core appears as two cores, they still share
the same cache, branch predictor, front-end, and, most importantly, execution
unit. Because they share so many resources, only one thread technically
executes at once. The advantage of adding the HT logic is that if an
executing thread stalls for any reason, the other thread can be switched
in very quickly while the cause of the first thread's stall is addressed. To
better illustrate how this works, consider the following:
- Set A is considered the current state of the processor
- Thread A starts executing
- Thread A needs a value from memory that isn't in the cache
- Memory access is very time-consuming in CPU terms, so thread A is considered stalled
- Instead of wasting cycles waiting for the memory operation to complete, set B is made the current state
- Thread B now executes until it stalls or until thread A can execute again (the memory operation finishes)
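The steps above can be sketched as a toy simulation. Everything here is illustrative: the 3-cycle miss latency, the op format, and the switching rule are made-up simplifications of the switch-on-stall behavior, not real hardware timing.

```python
def run_ht(threads, cycles):
    """threads: dict mapping register-set name ('A'/'B') to a list of ops,
    each either 'exec' or 'miss'. Returns which set owned the execution
    unit each cycle ('-' means the unit sat idle)."""
    MISS_LATENCY = 3                       # assumed cache-miss stall, in cycles
    pc = {name: 0 for name in threads}     # per-thread instruction pointer
    stall_until = {name: 0 for name in threads}
    current, schedule = "A", []
    for cycle in range(cycles):
        def runnable(t):
            return cycle >= stall_until[t] and pc[t] < len(threads[t])
        other = "B" if current == "A" else "A"
        if not runnable(current) and runnable(other):
            current = other                # fast switch to the other register set
        if runnable(current):
            op = threads[current][pc[current]]
            pc[current] += 1
            if op == "miss":               # load misses the cache: thread stalls
                stall_until[current] = cycle + MISS_LATENCY
            schedule.append(current)
        else:
            schedule.append("-")           # both threads stalled or finished
    return schedule

sched = run_ht({"A": ["exec", "miss", "exec"], "B": ["exec", "exec", "exec"]}, 8)
print("".join(sched))                      # AABBBA--
```

In the printed schedule, B fills the cycles A spends waiting on memory, which is exactly the payoff the bullet list describes.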
This process
simply continues on constantly. Now there is an obvious question: what can
cause a thread to stall? There are a few things; the simplest to understand
is a cache miss. This is when the thread goes to access a value that isn't
currently in the cache or any of the registers. A branch misprediction can
also cause a stall, when the branch predictor prefetches the wrong
instructions into the cache.
There is another
situation where Hyper-Threading kicks in: when one thread is using
floating-point resources while the second is only using integer resources.
HT allows them both to execute simultaneously, since they don't conflict.
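This floating-point/integer case can be sketched as a tiny port-based dispatcher. The one-integer-port, one-FP-port layout is an assumption for illustration; real cores have several ports of each kind.

```python
PORTS = {"int": 1, "fp": 1}     # assumed layout: one integer and one FP port

def dispatch(cycle_ops):
    """cycle_ops: list of (thread, port) micro-ops ready this cycle.
    Returns the ops that actually issue, at most one per port."""
    free = dict(PORTS)          # copy of per-port availability for this cycle
    issued = []
    for thread, port in cycle_ops:
        if free.get(port, 0) > 0:
            free[port] -= 1
            issued.append((thread, port))
    return issued

# Thread A is FP-heavy, thread B is integer-only: no conflict, both issue.
print(dispatch([("A", "fp"), ("B", "int")]))   # [('A', 'fp'), ('B', 'int')]
# Both threads want the integer port: only one issues; the other waits.
print(dispatch([("A", "int"), ("B", "int")]))  # [('A', 'int')]
```

The second call shows why two integer-heavy threads on one HT core can contend rather than overlap.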
Does Hyper-Threading actually help?
Hyper-Threading
has some interesting performance characteristics as a result of its nature. HT
will provide close to zero advantage if instruction decoding or execution is
the limiting factor in performance. In the Nehalem architecture this is rarely
the case. HT performs ideally when there are a lot of cache misses or branch
mispredictions, since the execution unit would otherwise be idle waiting for
these issues to be resolved.
Basically,
certain applications will benefit more than others. Running a more parallel
workload such as rendering or encoding video will see a nice benefit from HT
since it’s likely both threads will be accessing the same data so they aren’t
really competing for cache. Additionally, the relatively small local
L2 cache in the i7 (256 KB) means there will be a fair amount of memory
access, giving the second thread time to execute. HT can also make a lightly
loaded machine more responsive, since threads will have very low execution
time and it's much faster for the CPU to switch the active register set than
to fetch another thread's state from RAM into the registers.
Are there drawbacks?
As with most
engineering decisions, there are drawbacks to HT. One of the more obvious
ones is that because HT keeps the execution unit fed more efficiently, the
unit spends less time idle, which can result in higher operating
temperatures. More idle time would give the CPU a chance to cool down before
the next execution burst, resulting in a lower maximum temperature.
There are also
programs that will either see no benefit from HT or even see decreased
performance. Typically, anything whose performance is limited by cache,
instruction decode, the execution unit, or memory access will see little or
even negative improvement from HT (one of the reasons the i7 has so much
memory bandwidth).
Running more
than one multithreaded, computationally intensive task at a time can also be a
situation where HT doesn’t help performance. If a processor core is running
threads from different programs or that are operating on different data, all of
the shared resources are effectively halved (data cache, branch prediction,
instruction cache). This means branch miss predictions and cache misses become
even more common, possibly to the point where both threads are stalled.
Depending on the specific program this can mean either lower performance
(compared to HT being disabled) or worse scaling than expected.
The last
drawback is probably the most important one: The benefit of HT is inconsistent
and dependent upon the specific operating environment and programs being run.
Because of the way it works, heavily optimized code is likely to show less
benefit, as it is already designed to minimize branch mispredictions and
cache misses. The inconsistency of HT while multitasking won't show up in
benchmarks, since they're designed to test only a single task at a time.
Is Hyper-Threading worth using?
If you do a
lot of 3D rendering or video transcoding, then it probably is, since those
are the workloads HT is best suited for. If you find that you generally run
multiple intensive tasks simultaneously (like playing a game while encoding
a video or recompiling the Linux kernel in a VM), then HT could have a
negative impact on overall performance (though not necessarily). One thing
that is certain is that its impact is exaggerated in synthetic benchmarks,
almost to the point of being misleading.
Virtualization
Server virtualization:
Huge data centers
contain large numbers of servers. Workload, user activity, and other factors
determine which server is used and when, yet companies still spend money,
energy, and resources keeping under-utilized servers updated and protected
from crashes and overheating. Server virtualization is used to consolidate
those physical servers onto fewer, more powerful, and more energy-efficient
machines, where each virtual machine (VM) imitates, or pretends to be, a
separate server on the network. The virtual server environment is transparent
on the network, so each user can interact with a virtual server as if there
were still multiple physical servers. The main advantage is that
administrators now have to take care of only a few energy-efficient servers
instead of many, saving resources, energy, and money.
As shown in
the figure, in the traditional architecture there is hardware running a
single operating system, and different applications run within that operating
system. Because this arrangement is not energy efficient, a virtual
environment is introduced through which we can run different operating
systems on a single machine.
Virtual machine
A virtual
machine monitor (VMM) is a host program that allows a single computer to
support multiple, identical execution environments. All the users see their
systems as self-contained computers isolated from other users, even though
every user is served by the same machine. In this context, a virtual machine is
an operating system (OS) that is managed by an underlying control program. For
example, IBM's VM/ESA can control multiple virtual machines on an IBM S/390
system.
Server
virtualization is done to reduce energy costs, simplify manageability, and
ease disaster recovery. In server virtualization, VMM software is added to
allow the hardware to run more than one OS.
Major components
of the server:
- Processor
- Chipset
- Network interface
The individual technologies
that make up Intel VT are built into these components to boost performance,
reliability, and flexibility.
Intel VT
supports virtual machine architectures composed of two principal classes of
software:
- Virtual-Machine Monitor (VMM):
A VMM acts as a host and has full control of the processor(s) and other
platform hardware. The VMM presents guest software (see below) with an
abstraction of a virtual processor and allows it to execute directly on a
logical processor. A VMM is able to retain selective control of processor
resources, physical memory, interrupt management, and I/O.
- Guest Software: Each virtual
machine is a guest software environment that supports a stack consisting of
an operating system (OS) and application software. Each operates
independently of other virtual machines and uses the same interface to
processor(s), memory, storage, graphics, and I/O provided by the physical
platform. The software stack acts as if it were running on a platform with
no VMM. Software executing in a virtual machine must operate with reduced
privilege so that the VMM can retain control of platform resources.
Intel virtualization technology flex migration (Intel VT FlexMigration)
Obviously, as IT
adds new systems, it would be much more convenient and efficient if an IT
manager could simply add new resources to existing pools without having to
worry about differences in processor generation. For this reason, Intel has
developed Intel VT Flex Migration. When combined with support from
virtualization software, it ensures that the hypervisor can expose a consistent
set of instructions across all servers in the pool. Intel VT Flex Migration
support starts with Intel® Core™ microarchitecture and will be available in
future generations of the Intel Xeon processor family.
With Intel VT
Flex Migration, IT managers can easily add current and future Intel Xeon
processor-based systems to the same resource pool when using supporting
hypervisor software. This gives IT the power to choose the right server
platform to optimize performance, cost, power, and reliability, without
worrying about forward and backward compatibility across generations of
Intel Xeon processor-based servers, from the Intel Core microarchitecture
onward. IT managers can pool server resources using multiple generations of
Intel Xeon processors (Intel Core microarchitecture and beyond), whether
single-, dual-, or multi-processor based. This creates a dynamic virtual server
infrastructure that enables the use of live VM migration to improve usage
models such as failover, load balancing, disaster recovery, and server
maintenance.
Current Intel®
Xeon® 5400 and 5200 processor series, 3300 and 3100 processor series, as well
as future Intel Xeon processors, support Intel VT Flex Migration. Using
virtualization software that is enabled to take advantage of this feature,
Intel servers based on these processors can be pooled with earlier generation
of Intel Core microarchitecture processors. These include Intel® Xeon® 7300,
5300, 5100, 3200, 3000 series processors. A major Intel VT-x component is
Intel VT FlexMigration: using this technology, applications can be migrated
from one server to another and recovered after a disaster.
With Intel VT
FlexMigration, one can migrate virtual machines between processor
generations, reacting quickly to changing conditions and making it much
easier to keep servers up and running.
Flex priority
Intel VT Flex
Priority optimizes and accelerates interrupt virtualization by improving
virtual machine access to the Task Priority Register thereby enabling efficient
Symmetric Multi-Processing (SMP) configurations of 32-bit guest operating
systems. For users, this translates into more efficient performance in virtual
environments for their critical enterprise applications.
Intel VT Flex
Priority was designed to accelerate virtualization interrupt handling thereby
improving virtualization performance. Intel VT Flex Priority accelerates
interrupt handling by preventing unnecessary VMExits on accesses to the
Advanced Programmable Interrupt Controller…
Intel Flex
Priority improves virtualization performance by 35%.
A processor is
constantly bombarded with interrupts, only some of which are critical. Not
every interrupt needs to be handled the moment it occurs, so Intel VT Flex
Priority acts like a receptionist who alerts the processor only when an
interruption is critical. The processor can work more efficiently when it is
interrupted less often.
Virtualization for directed I/O
A VMM must
support virtualization of I/O requests from guest software. I/O virtualization
may be supported by a VMM through any of the following models:
- Emulation: A VMM may expose a
virtual device to guest software by emulating an existing (legacy) I/O
device. The VMM emulates the functionality of the I/O device in software over
whatever physical devices are available on the physical platform. I/O
virtualization through emulation provides good compatibility (by allowing
existing device drivers to run within a guest), but poses limitations on
performance and functionality.
- New Software Interfaces: This model
is similar to I/O emulation, but instead of emulating legacy devices, the VMM
software exposes a synthetic device interface to guest software. The
synthetic device interface is defined to be virtualization-friendly, avoiding
the overhead associated with I/O emulation. This model provides improved
performance over emulation, but has reduced compatibility (due to the need
for specialized guest software or drivers utilizing the new software
interfaces).
- Assignment: A VMM may directly
assign the physical I/O devices to VMs. In this model, the driver for an
assigned I/O device runs in the VM to which it is assigned and is allowed to
interact directly with the device hardware with minimal or no VMM
involvement. Robust I/O assignment requires additional hardware support to
ensure the assigned device accesses are isolated and restricted to resources
owned by the assigned partition. The I/O assignment model may also be used to
create one or more I/O container partitions that support emulation or
software interfaces for virtualizing I/O requests from other guests. The
I/O-container-based approach removes the need for running the physical device
drivers as part of VMM privileged software.
- I/O Device Sharing: In this
model, which is an extension of the I/O assignment model, an I/O device
supports multiple functional interfaces, each of which may be independently
assigned to a VM. The device hardware itself is capable of accepting multiple
I/O requests through any of these functional interfaces and processing them
utilizing the device's hardware resources.
Depending on the
usage requirements, a VMM may support any of the above models for I/O
virtualization. For example, I/O emulation may be best suited for virtualizing
legacy devices. I/O assignment may provide the best performance when hosting
I/O-intensive workloads in a guest. Using new software interfaces makes a
trade-off between compatibility and performance, and device I/O sharing
provides more virtual devices than the number of physical devices in the
platform.
Overview
A general
requirement for all of the above I/O virtualization models is the ability to
isolate and restrict device accesses to the resources owned by the partition
managing the device. Intel VT for Directed I/O provides VMM software with the
following capabilities:
- I/O device assignment: for
flexibly assigning I/O devices to VMs and extending the protection and
isolation properties of VMs for I/O operations.
- DMA remapping: for supporting
independent address translations for Direct Memory Accesses (DMA) from
devices.
- Interrupt remapping: for
supporting isolation and routing of interrupts from devices and external
interrupt controllers to appropriate VMs.
- Reliability: for recording and
reporting to system software DMA and interrupt errors that may otherwise
corrupt memory or impact VM isolation.
DMA REMAPPING
DMA remapping
facilities have been implemented in a variety of contexts in the past to
facilitate different usages. In workstations and server platforms, traditional
I/O memory management units (IOMMUs) have been implemented in PCI root bridges
to efficiently support scatter/gather operations or I/O devices with limited
DMA addressability. Other well-known examples of DMA remapping facilities
include the AGP Graphics Aperture Remapping Table (GART) and the Translation
and Protection Table (TPT) defined in the Virtual Interface Architecture,
which subsequently influenced similar capabilities in the InfiniBand
Architecture and Remote DMA (RDMA) over TCP/IP specifications. DMA remapping
facilities have
also been explored in the context of NICs designed for low latency cluster
interconnects.
Traditional
IOMMUs typically support an aperture-based architecture. All DMA requests that
target a programmed aperture address range in the system physical address space
are translated irrespective of the source of the request. While this is
useful for handling legacy device limitations (such as limited DMA
addressability or scatter/gather capabilities), aperture-based IOMMUs are not
adequate for I/O virtualization usages that require full DMA isolation.
The VT-d
architecture is a generalized IOMMU architecture that enables system software
to create multiple DMA protection domains. A protection domain is abstractly
defined as an isolated environment to which a subset of the host physical
memory is allocated. Depending on the software usage model, a DMA protection
domain may represent memory allocated to a VM, or the DMA memory allocated by a
guest-OS driver running in a VM or as part of the VMM itself. The VT-d
architecture enables system software to assign one or more I/O devices to a
protection domain. DMA isolation is achieved by restricting access to a
protection domain's physical memory from I/O devices not assigned to it,
through address-translation tables.
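The per-domain translation idea can be modeled in a few lines. This is a toy sketch with made-up names and single-level page tables; real VT-d uses multi-level hardware page tables, but the isolation property is the same.

```python
class ProtectionDomain:
    """An isolated environment with its own DVA -> HPA page mapping."""

    def __init__(self, name):
        self.name = name
        self.table = {}                    # DVA page -> HPA page

    def map_page(self, dva_page, hpa_page):
        self.table[dva_page] = hpa_page

class VtdRemapper:
    """Toy remapper: each device is assigned to exactly one domain."""

    def __init__(self):
        self.assignment = {}               # device -> ProtectionDomain

    def assign(self, device, domain):
        self.assignment[device] = domain

    def translate(self, device, dva_page):
        domain = self.assignment[device]
        if dva_page not in domain.table:
            # DMA outside the domain's memory is blocked: this is the isolation.
            raise PermissionError(f"{device}: DVA page {dva_page:#x} "
                                  f"not mapped in {domain.name}")
        return domain.table[dva_page]

dom1, dom2 = ProtectionDomain("domain1"), ProtectionDomain("domain2")
dom1.map_page(0x10, 0x8f)                  # same DVA maps to different HPAs
dom2.map_page(0x10, 0x42)                  # in different domains
iommu = VtdRemapper()
iommu.assign("device1", dom1)
iommu.assign("device2", dom2)
print(hex(iommu.translate("device1", 0x10)))   # 0x8f
print(hex(iommu.translate("device2", 0x10)))   # 0x42
```

Note how the two devices issue the same DMA virtual address but reach different host physical pages, mirroring Figure 5's two-domain picture.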
The I/O devices
assigned to a protection domain can be provided a view of memory that may be
different than the host view of physical memory. VT-d hardware treats the
address specified in a DMA request as a DMA virtual address (DVA). Depending on
the software usage model, a DVA may be the Guest Physical Address (GPA) of the
VM to which the I/O device is assigned, or some software-abstracted virtual I/O
address (similar to CPU linear addresses). VT-d hardware transforms the address
in a DMA request issued by an I/O device to its corresponding Host Physical
Address (HPA).
Figure 5
illustrates DMA address translation in a multi-domain usage. I/O devices 1 and
2 are assigned to protection domains 1 and 2, respectively, each with its own
view of the DMA address space.
Figure
5: DMA remapping
Figure 6
illustrates a PC platform configuration with VT-d hardware implemented in the
north-bridge component.
Figure
6: Platform configuration with VT-d
Intel smart memory access
Intel Smart
Memory Access improves system performance by optimizing the use of the
available data bandwidth from the memory subsystem and hiding the latency of
memory accesses. The goal is to ensure that data can be used as quickly as
possible and is located as close as possible to where it’s needed to minimize
latency and thus improve efficiency and speed. Intel Smart Memory Access
includes a new capability called memory disambiguation, which increases the
efficiency of out-of-order processing by providing the execution cores with the
built-in intelligence to speculatively load data for instructions that are
about to execute before all previous store instructions are executed. Intel
Smart Memory Access also includes an instruction pointer-based prefetcher that
“prefetches” memory contents before they are requested so they can be placed in
cache and readily accessed when needed. Increasing the number of loads that
occur from cache versus main memory reduces memory latency and improves
performance.
How Intel smart memory access improves execution throughput
The Intel Core
microarchitecture memory cluster (the Level 1 data memory subsystem) is
highly out-of-order, non-blocking, and speculative. It has a variety of
caching and buffering methods to help achieve its performance, including
Intel Smart Memory Access and its two key features: memory disambiguation and
an instruction-pointer-based (IP-based) prefetcher to the Level 1 data cache.
Memory disambiguation
Since the Intel
Pentium Pro, Intel processors have featured a sophisticated out-of-order
memory engine allowing the CPU to execute non-dependent instructions in any
order, but it had a significant shortcoming: these processors were built
around a conservative set of assumptions concerning which memory accesses
could proceed out of order. They would not move a load above a store with an
unknown address in the execution order (cases where a prior store has not yet
been executed), because if the store and load ended up sharing the same
address, the result would be incorrect execution. Yet many loads target
locations unrelated to recently executed stores. Prior hardware
implementations created false dependencies by blocking such loads based on
unknown store addresses, and these false dependencies cost many opportunities
for out-of-order execution.
In designing
Intel Core microarchitecture, Intel sought a way to eliminate false
dependencies using a technique known as memory disambiguation.
(“Disambiguation” is defined as the clarification that follows the removal of
an ambiguity.) Through memory disambiguation, Intel Core microarchitecture is
able to resolve many of the cases where ambiguity about whether a particular
load and store share the same address thwarts out-of-order execution.
Memory
disambiguation uses a predictor and accompanying algorithms to eliminate these
false dependencies that block a load from being moved up and completed as soon
as possible. The basic objective is to be able to ignore unknown store-address
blocking conditions whenever a load operation dispatched from the processor’s
reservation station (RS) is predicted to not collide with a store. This
prediction is eventually verified by checking all RS-dispatched store addresses
for an address match against newer loads that were predicted non-conflicting
and already executed. If there is an offending load already executed, the pipe
is flushed and execution restarted from that load.
The memory
disambiguation predictor is based on a hash table that is indexed with a hashed
version of the load’s EIP address bits. (“EIP” is used here to represent the
instruction pointer in all x86 modes.) Each predictor entry behaves as a
saturating counter, with reset.
The predictor
has two write operations, both done during the load's retirement:
- Increment the entry if the load
"behaved well," that is, if it met unknown store addresses but none of them
collided.
- Reset the entry to zero if the
load "misbehaved," that is, if it collided with at least one older store that
was dispatched by the RS after the load. The reset is done regardless of
whether the load was actually disambiguated.
The predictor
takes a conservative approach. In order to allow memory disambiguation, it
requires that a number of consecutive iterations of a load having the same EIP
behave well. This isn’t necessarily a guarantee of success though. If two loads
with different EIPs clash in the same predictor entry, their prediction will
interact.
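The hashed, saturating, resettable predictor described above can be sketched as follows. The table size, hash, and saturation threshold are invented for illustration; the real hardware parameters are not public in this form.

```python
SATURATE = 15          # assumed counter value at which disambiguation is allowed
TABLE_SIZE = 256       # assumed predictor table size

table = [0] * TABLE_SIZE

def index(eip):
    # Stand-in for the hardware hash of the load's EIP bits.
    return hash(eip) % TABLE_SIZE

def disambiguation_allowed(eip):
    # Lookup at dispatch: only a saturated counter lets the load bypass
    # unknown store addresses.
    return table[index(eip)] >= SATURATE

def update_at_retirement(eip, collided):
    i = index(eip)
    if collided:
        table[i] = 0                             # reset on misbehavior
    else:
        table[i] = min(SATURATE, table[i] + 1)   # increment on good behavior

eip = 0x401000
for _ in range(SATURATE):                  # the load behaves well repeatedly...
    update_at_retirement(eip, collided=False)
print(disambiguation_allowed(eip))         # True: the load may now be hoisted
update_at_retirement(eip, collided=True)   # ...then one collision occurs
print(disambiguation_allowed(eip))         # False: back to conservative mode
```

The asymmetry is the point: many consecutive good iterations are needed to earn speculation, but a single collision revokes it, matching the conservative approach the text describes. Note too how two different EIPs hashing to the same entry would share one counter, which is the clash mentioned above.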
Predictor lookup
The predictor is
looked up when a load instruction is dispatched from the RS to the memory
pipe. If the respective counter is saturated, the load is assumed to be safe
and the result is written to the "disambiguation allowed" bit in the load
buffer. This means that even if the load finds a relevant unknown store
address, it is allowed to go on. If the predictor is not saturated, the load
behaves as in prior implementations; in other words, if there is a relevant
unknown store address, the load gets blocked.
Load dispatch
If the load
meets an older unknown store address, it sets the "update bit," indicating
that the load should update the predictor. If the prediction was "go," the
load is dispatched and sets the "done" bit, indicating that disambiguation
was done. If the prediction was "no go," the load is conservatively blocked
until all older store addresses are resolved.
Prediction verification
To recover in
case of a misprediction by the disambiguation predictor, the address of all the
store operations dispatched from the RS to the Memory Order Buffer must be
compared with the addresses of all the loads that are younger than the store.
If such a match is found, the respective "reset bit" is set. When a
disambiguated load retires with its reset bit set, the pipe is restarted from
that load to re-execute it and all its dependent instructions correctly.
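A rough sketch of this verification step, under assumed names and a simple age-based ordering: every RS-dispatched store address is checked against younger loads that were speculatively executed, and a match flags the load for a pipe restart.

```python
def verify_store(store_addr, store_age, executed_loads):
    """executed_loads: list of dicts with 'addr', 'age', 'disambiguated'.
    Sets the reset bit on any younger, already-executed disambiguated load
    that collides with this store. Returns the flagged loads."""
    flagged = []
    for load in executed_loads:
        if (load["disambiguated"]
                and load["age"] > store_age       # load is younger than the store
                and load["addr"] == store_addr):  # addresses collide: misprediction
            load["reset_bit"] = True              # triggers pipe flush at retirement
            flagged.append(load)
    return flagged

loads = [
    {"addr": 0x100, "age": 5, "disambiguated": True},
    {"addr": 0x200, "age": 6, "disambiguated": True},
]
# A store to 0x100 dispatches with age 3: the speculated load at age 5 collided.
bad = verify_store(0x100, 3, loads)
print([l["age"] for l in bad])   # [5]
```

Only the colliding load is flagged; the load to 0x200 keeps its speculative win, which is why disambiguation pays off on average despite occasional flushes.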
Watchdog mechanism
Disambiguation
is based on prediction, and mispredictions can cause execution-pipe flushes,
so it's important to build in safeguards against rare cases of performance
loss. Consequently, Intel Core microarchitecture includes a mechanism to
temporarily disable memory disambiguation to prevent such cases of
performance loss. This mechanism constantly monitors the success rate of the
disambiguation predictor.
Advanced smart cache
Intel Advanced
Smart Cache is a multi-core optimized cache that improves performance and
efficiency by increasing the probability that each execution core of a dual
core processor can access data from a higher-performance, more-efficient cache
subsystem. To accomplish this, Intel Core microarchitecture shares the Level 2
(L2) cache between the cores. This better optimizes cache resources by storing
data in one place that each core can access. By sharing L2 cache between each
core, Intel Advanced Smart Cache allows each core to dynamically use up to 100
percent of available L2 cache. Threads can then dynamically use the required
cache capacity. As an extreme example, if one of the cores is inactive, the
other core will have access to the full cache. Intel Advanced Smart Cache enables
very efficient sharing of data between threads running in different cores. It
also enables obtaining data from cache at higher throughput rates for better
performance. Intel Advanced Smart Cache provides a peak transfer rate of 96
GB/sec (at 3 GHz frequency).
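The quoted peak transfer rate can be sanity-checked with simple arithmetic: 96 GB/s at 3 GHz implies the shared cache delivers 32 bytes (256 bits) per clock cycle. The port-width figure below is inferred from the quoted numbers, not taken from a datasheet.

```python
# Consistency check of the quoted Advanced Smart Cache figure.
frequency_hz = 3e9          # 3 GHz, per the text
bytes_per_cycle = 32        # implied cache transfer width (inferred)
peak_gb_per_s = frequency_hz * bytes_per_cycle / 1e9
print(peak_gb_per_s)        # 96.0
```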
Wide dynamic execution
Intel Wide
Dynamic Execution significantly enhances dynamic execution, enabling delivery
of more instructions per clock cycle to improve execution time and energy
efficiency. Every execution core is 33 percent wider than previous generations,
allowing each core to fetch, decode, and retire up to four full instructions
simultaneously.
Intel Wide
Dynamic Execution also includes a new and innovative capability called
Macrofusion. Macrofusion combines certain common x86 instructions into a single
instruction that is executed as a single entity, increasing the peak throughput
of the engine to five instructions per clock. The wide execution engine, when
Macrofusion comes into play, is then capable of up to six instructions per
cycle of throughput for even greater energy-efficient performance. Intel Core
microarchitecture also uses extended microfusion, a technique that “fuses”
micro-ops derived from the same macro-op to reduce the number of micro-ops that
need to be executed. Studies have shown that micro-op fusion can reduce the
number of micro-ops handled by the out-of-order logic by more than 10 percent.
Intel Core microarchitecture “extends” the number of micro-ops that can be
fused internally within the processor.
Intel Core microarchitecture
also incorporates an updated ESP (Extended Stack Pointer) Tracker. Stack
tracking allows safe early resolution of stack references by keeping track of
the value of the ESP register. About 25 percent of all loads are stack loads
and 95 percent of these loads may be resolved in the front end, again
contributing to greater energy efficiency [Bekerman].
Micro-op
reduction resulting from micro-op fusion, Macrofusion, the ESP Tracker, and
other techniques makes various resources in the engine appear virtually
deeper than their actual size and allows a given amount of work to be
executed with less toggling of signals, two factors that provide more
performance for the same or less power.
Intel Core
microarchitecture also provides deep out-of-order buffers to allow more
instructions in flight, enabling more out-of-order execution to better
exploit instruction-level parallelism.
Advanced Digital media boost
Intel Advanced
Digital Media Boost achieves similarly dramatic gains in throughput for
programs using SSE instructions with 128-bit operands. (SSE instructions
enhance Intel architecture by enabling programmers to develop algorithms that
mix packed single-precision and double-precision floating point and
integers.) These throughput gains come from combining a
128-bit-wide internal data path with Intel Wide Dynamic Execution and matching
widths and throughputs in the relevant caches. Intel Advanced Digital Media
Boost enables most 128-bit instructions to be dispatched at a throughput rate
of one per clock cycle, effectively doubling the speed of execution and
resulting in peak floating point performance of 24 GFlops (on each core, single
precision, at 3 GHz frequency). Intel Advanced Digital Media Boost is
particularly useful when running many important multimedia operations involving
graphics, video, and audio, and processing other rich data sets that use SSE,
SSE2, and SSE3 instructions.
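The quoted 24 GFLOPS peak (per core, single precision, at 3 GHz) follows from simple arithmetic: a 128-bit SSE operand holds four 32-bit floats, and assuming one SSE multiply plus one SSE add can retire per cycle (an inference from the quoted total, not a documented port layout), that is 8 flops per cycle.

```python
# Arithmetic behind the quoted single-precision peak.
floats_per_op = 128 // 32      # four single-precision lanes per 128-bit operand
ops_per_cycle = 2              # assumed: one SSE mul + one SSE add per clock
frequency_hz = 3e9             # 3 GHz, per the text
peak_gflops = floats_per_op * ops_per_cycle * frequency_hz / 1e9
print(peak_gflops)             # 24.0
```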
Intelligent power capability
Intel
Intelligent Power Capability is a set of capabilities for reducing power
consumption and device design requirements. This feature manages the runtime
power consumption of all the processor’s execution cores. It includes an
advanced power-gating capability that allows for an ultra fine-grained logic control
that turns on individual processor logic subsystems only if and when they are
needed. Additionally, many buses and arrays are split so that data required in
some modes of operation can be put in a low-power state when not needed. In the
past, implementing such power gating has been challenging because of the power
consumed in powering down and ramping back up, as well as the need to maintain
system responsiveness when returning to full power [Wechsler]. Through Intel
Intelligent Power Capability, Intel has been able to satisfy these concerns,
ensuring significant power savings without sacrificing responsiveness.
References: