NUMA Frequently Asked Questions
Purpose:
This document is designed to (hopefully) answer some frequently asked questions
about the NUMA architecture.
Frequently Asked Questions:
- What does NUMA stand for?
- OK, so what does Non-Uniform Memory
Access really mean?
- What is the difference between NUMA and SMP?
- What is the difference between NUMA and ccNUMA?
- What is a node?
- What is meant by local and remote memory?
- What do you mean by distance?
- Could you give a real-world analogy of the
NUMA architecture to help understand all these terms?
- Why should I use NUMA? What are the benefits of NUMA?
- What are the peculiarities of NUMA?
- What are some alternatives to NUMA?
- Could you give a brief description of the main
NUMA architecture implementations?
Frequently Given Answers:
- What does NUMA stand for?
NUMA stands for Non-Uniform Memory Access.
- OK, so what does Non-Uniform Memory
Access really mean?
Non-Uniform Memory Access means that it will take longer to access some regions
of memory than others. This is because some regions of memory are on physically
different busses from other regions. For a more visual description, please refer
to the section on NUMA architecture implementations, and see the real-world
analogy for the NUMA architecture. This non-uniformity can cause programs that
are not NUMA-aware to perform poorly, and it introduces the concept of local
and remote memory.
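To make that concrete, here is a minimal sketch using the Linux libnuma library
(my choice for illustration; this FAQ is not tied to any particular OS or API).
It allocates one buffer on whatever node the calling CPU belongs to, and one
pinned to node 0:

    /* Sketch: local vs. node-specific allocation with libnuma.
     * Build with: cc numa_alloc.c -lnuma */
    #include <stdio.h>
    #include <stdlib.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        size_t len = 4 * 1024 * 1024;

        /* Memory on the node of the calling CPU: "local". */
        void *local = numa_alloc_local(len);

        /* Memory pinned to node 0: "remote" for any CPU that is
         * not part of node 0. */
        void *on_node0 = numa_alloc_onnode(len, 0);

        if (local)
            numa_free(local, len);
        if (on_node0)
            numa_free(on_node0, len);
        return 0;
    }

A NUMA-aware program would arrange to touch mostly the first kind of buffer; a
NUMA-unaware one has no say in the matter, which is where the poor performance
comes from.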
- What is the difference between NUMA and SMP?
The NUMA architecture was designed to surpass the scalability limits of the
SMP architecture. With SMP, which stands for Symmetric Multi-Processing, all
memory accesses are posted to the same shared memory bus. This works fine for
a relatively small number of CPUs, but the problem with the shared bus appears
when you have dozens, even hundreds, of CPUs competing for access to the shared
memory bus. NUMA alleviates these bottlenecks by limiting the number of CPUs
on any one memory bus and connecting the various nodes by means of a high-speed
interconnect.
- What is the difference between NUMA and ccNUMA?
The difference is almost nonexistent at this point. ccNUMA stands for Cache-Coherent
NUMA, but NUMA and ccNUMA have really come to be synonymous. The applications
for non-cache-coherent NUMA machines are almost nonexistent, and they are a
real pain to program for, so unless specifically stated otherwise, NUMA actually
means ccNUMA.
- What is a node?
One of the problems with describing NUMA is that there are many different ways
to implement this technology. This has led to a plethora of "definitions" for
node. A fairly technically correct and also fairly ugly definition of a node
is: a region of memory in which every byte has the same distance from each CPU.
A more common definition is: a block of memory and the CPUs, I/O, etc. physically
on the same bus as the memory. Some architectures do not have memory, CPUs,
and I/O all on the same physical bus, so the second definition does not truly
hold. In many cases, the less technical definition should be sufficient, but
often the technical definition is more correct.
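Whatever definition you prefer, the operating system keeps its own list of
nodes. As an illustration, here is a small sketch that asks the Linux libnuma
library (again my choice, not something this FAQ mandates) which nodes exist
and how much memory each one holds:

    /* Sketch: enumerate nodes and their memory with libnuma.
     * Build with: cc list_nodes.c -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        int max = numa_max_node();  /* highest node number in the system */
        for (int node = 0; node <= max; node++) {
            long free_bytes;
            long size = numa_node_size(node, &free_bytes);
            if (size >= 0)
                printf("node %d: %ld MB total, %ld MB free\n",
                       node, size >> 20, free_bytes >> 20);
        }
        return 0;
    }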
- What is meant by local and remote memory?
The terms local memory and remote memory are typically used in reference to
a currently running process. That said, local memory is typically defined to
be the memory that is on the same node as the CPU currently running the process.
Any memory that does not belong to the node on which the process is currently
running is then, by that definition, remote.
Local and remote memory can also be used in reference to things other than the
currently running process. When in interrupt context, there technically is no
currently executing process, but memory on the node containing the CPU handling
the interrupt is still called local memory. You can also speak of local and
remote memory in terms of a disk. For example, if a disk attached to node 1
were doing DMA, the memory it reads or writes would be called remote if it
were located on another node (e.g., node 0).
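As a sketch of that process-centric view, a process on Linux can ask which node
it is currently on and allocate its memory there (sched_getcpu() and libnuma's
numa_node_of_cpu() are my assumed tools here, not anything this FAQ requires):

    /* Sketch: find our current node and allocate memory there.
     * Build with: cc local_mem.c -lnuma */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        int cpu  = sched_getcpu();        /* CPU we are running on now */
        int node = numa_node_of_cpu(cpu); /* the node that CPU sits on */
        printf("running on CPU %d, node %d\n", cpu, node);

        /* Memory on 'node' is local to us right now; memory on any
         * other node would be remote. */
        void *buf = numa_alloc_onnode(1 << 20, node);
        if (buf)
            numa_free(buf, 1 << 20);
        return 0;
    }

Note the "currently" in the definitions above: the scheduler is free to migrate
the process to another node right after the call, at which point the very same
memory becomes remote.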
- What do you mean by distance?
NUMA-based architectures necessarily introduce a notion of distance between
system components (e.g., CPUs, memory, I/O busses, etc.). The metric used to
determine distance varies; hops is a popular one, along with latency and bandwidth.
These terms all mean essentially the same thing that they do when used in a
networking context (mostly because a NUMA machine is not all that different
from a very tightly coupled cluster). So when used to describe a node, we could
say that a particular range of memory is 2 hops (busses) from CPUs 0..3 and
SCSI Controller 0. Thus, CPUs 0..3 and the SCSI Controller are a part of the
same node.
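On Linux, the kernel exports these distances (typically taken from the
firmware's SLIT table, where 10 conventionally means local and larger numbers
mean further away), and libnuma can read them back. A sketch, with the usual
caveat that the library choice is mine:

    /* Sketch: print the node distance matrix via libnuma.
     * Build with: cc distances.c -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        int max = numa_max_node();
        printf("     ");
        for (int j = 0; j <= max; j++)
            printf("%5d", j);                       /* column headers */
        printf("\n");
        for (int i = 0; i <= max; i++) {
            printf("%5d", i);                       /* row header */
            for (int j = 0; j <= max; j++)
                printf("%5d", numa_distance(i, j)); /* 10 = local */
            printf("\n");
        }
        return 0;
    }

On a two-node machine this will typically print 10 on the diagonal and
something like 20 everywhere else.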
- Could you give a real-world analogy of the
NUMA architecture to help understand all these terms?
Imagine that you are baking a cake. You have a group of ingredients (= memory
pages) that you need to complete the recipe (= process). Some of the ingredients
you may have in your cabinet (= local memory), but some of the ingredients you
might not have and have to ask a neighbor for (= remote memory). The general
idea is to try to keep as many of the ingredients in your own cabinet as possible,
since this reduces your time and effort in making the cake.
You also have to remember that your cabinets can only hold a fixed amount of
ingredients (= physical nodal memory). If you buy more but have no room to
store it, you may have to ask your neighbor to keep it in his/her cabinet
until you need it (= local memory full, so allocate pages remotely).
A bit of a strange example, I'll admit, but I think it works. If you have a
better analogy, I'm all ears! ;)
- Why should I use NUMA? What are the benefits of NUMA?
The main benefit of NUMA is, as mentioned above, scalability. It is extremely
difficult to scale SMP past 8-12 CPUs. At that number of CPUs, the memory bus
is under heavy contention. NUMA is one way of reducing the number of CPUs competing
for access to a shared memory bus. This is accomplished by having several memory
busses and only having a small number of CPUs on each of those busses. There
are other ways of building massively multiprocessor machines, but this is a
NUMA FAQ, so we'll leave the discussion of other methods to other FAQs.
- What are the peculiarities of NUMA?
CPU and/or node caches can result in NUMA effects. For example, the CPUs on
a particular node will have higher bandwidth and/or lower latency when accessing
the memory and CPUs on that same node. Because of this, you can see things like
lock starvation under high contention: if CPU x on a node requests a lock
already held by another CPU y on the same node, its request will tend to beat
out a request from a remote CPU z.
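Here is a purely illustrative sketch of the kind of experiment behind that
claim. The two-node layout, the thread counts, and the use of libnuma's
numa_run_on_node() to pin threads are all my assumptions, and whether you
actually observe the bias depends heavily on the hardware and the lock
implementation. Two threads (x and y) sit on node 0, one (z) on node 1, and
all three hammer a shared lock for a couple of seconds:

    /* Sketch: count lock acquisitions when two threads share a node
     * and a third is remote. Needs at least two nodes.
     * Build with: cc lock_bias.c -lnuma -lpthread */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <numa.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static atomic_int stop;
    static const int thread_node[3] = { 0, 0, 1 }; /* x, y local; z remote */
    static long counts[3];

    static void *worker(void *arg)
    {
        int id = (int)(long)arg;
        numa_run_on_node(thread_node[id]); /* pin this thread to its node */
        while (!atomic_load(&stop)) {
            pthread_mutex_lock(&lock);
            counts[id]++;                  /* protected by 'lock' */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        if (numa_available() < 0 || numa_max_node() < 1)
            return 1; /* needs NUMA and at least two nodes */

        pthread_t t[3];
        for (long i = 0; i < 3; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);

        sleep(2);                          /* let them contend for a while */
        atomic_store(&stop, 1);
        for (int i = 0; i < 3; i++)
            pthread_join(t[i], NULL);

        for (int i = 0; i < 3; i++)
            printf("thread %d (node %d): %ld acquisitions\n",
                   i, thread_node[i], counts[i]);
        return 0;
    }

If the nodal bias is present, the remote thread's count will lag noticeably
behind the two local ones.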
- What are some alternatives to NUMA?
One alternative is to split memory up and (possibly arbitrarily) assign it to
groups of CPUs; this can give some performance benefits similar to actual NUMA. A setup like
this would be like a regular NUMA machine where the line between local and remote
memory is blurred, since all the memory is actually on the same bus. The PowerPC
Regatta system is an example of this.
You can achieve some NUMA-like performance by using clusters as well. A cluster
is very similar to a NUMA machine: each individual machine in the cluster
becomes a node of a virtual NUMA machine. The only real difference is the
nodal latency; in a clustered environment, the latency and bandwidth of the
internodal links are likely to be much worse.
- Could you give a brief description of the main NUMA
architecture implementations?
Sure! The main types are IBM NUMA-Q, Compaq Wildfire, and SGI MIPS64. See the
section on NUMA architecture implementations for descriptions and diagrams of
these system types, along with a standard SMP system for comparison.
Last updated: 1/04/02. For any problems, additions, etc., please send email to
this page's maintainer [snoopdobb@users.sourceforge.net].