
Lec 12 - Memory Hierarchy

College: College of Science for Women     Department: Computer Science     Stage: 1
Course instructor: Ahmed Mohammed Hussein Al-Ghazali       12/03/2019 22:12:22
LEARNING OBJECTIVES
After completion of this lecture, you should be able to:
• Describe the memory hierarchy and its characteristic parameters
• Describe cache memory and its mapping (organization) techniques
• Describe virtual memory and the von Neumann model

1. MEMORY HIERARCHY

Memory in a conventional digital computer is organized in a hierarchy as illustrated in Figure 1. At the top of the hierarchy are registers that are matched in speed to the CPU, but tend to be large and consume a significant amount of power. There are normally only a small number of registers in a processor, on the order of a few hundred or less. At the bottom of the hierarchy are secondary and off-line storage memories such as hard magnetic disks and magnetic tapes, in which the cost per stored bit is small in terms of money and electrical power, but the access time is very long when compared with registers. Between the registers and secondary storage are a number of other forms of memory that bridge the gap between the two.
Cache and main memory are built using solid-state semiconductor material. It is customary to call the fast memory level the primary memory. The solid-state memory is followed by larger, less expensive, and far slower magnetic memories that consist typically of the (hard) disk and the tape. It is customary to call the slower level the secondary memory. The objective behind designing a memory hierarchy is to have a memory system that performs as if it consists entirely of the fastest unit and whose cost is dominated by the cost of the slowest unit.
The memory hierarchy can be characterized by a number of parameters. Among these parameters are the access type, capacity, cycle time, latency, bandwidth, and cost. The term access refers to the action that physically takes place during a read or write operation. The capacity of a memory level is usually measured in bytes. The cycle time is defined as the time elapsed from the start of a read operation to the start of a subsequent read. The latency is defined as the time interval between the request for information and the access to the first bit of that information. The bandwidth provides a measure of the number of bits per second that can be accessed. The cost of a memory level is usually specified as dollars per megabyte. Figure 1 depicts a typical memory hierarchy. Table 1 provides typical values of the memory hierarchy parameters.

[Figure 1: A typical memory hierarchy, from CPU registers at the top to secondary and off-line storage at the bottom]

TABLE 1 MEMORY HIERARCHY PARAMETERS

               Access type  Capacity       Latency       Bandwidth          Cost/MB
CPU registers  Random       64-1024 bytes  1-10 ns       System clock rate  High
Cache memory   Random       8-512 KB       15-20 ns      10-20 MB/s         $500
Main memory    Random       16-512 MB      30-50 ns      1-2 MB/s           $20-50
Disk memory    Direct       1-20 GB        10-30 ms      1-2 MB/s           $0.25
Tape memory    Sequential   1-20 TB        30-10,000 ms  1-2 MB/s           $0.025
2. CACHE MEMORY

Memory access is generally slow when compared with the speed of the central processing unit (CPU), so memory poses a significant bottleneck in computer performance. Processing speed is mostly limited by the speed of main memory. If the average memory access time can be reduced, the total execution time of the program is reduced as well. In many systems it is not economical to use high-speed memory devices for all of the internal memory. Instead, a small but fast cache memory (fast SRAM), which holds the contents of the most commonly accessed locations, can be placed between the main memory and the CPU. The idea behind using a cache is to keep the information expected to be used most frequently by the CPU in the cache (a small high-speed memory that is near the CPU).
Cache memory is much smaller than the main memory. It is usually implemented using SRAMs and sits between the processor and main memory in the memory hierarchy. The cache effectively isolates the processor from the slowness of the main memory, which is DRAM-based. The principle behind cache memories is to prefetch data from the main memory before the processor needs them. If we are successful in predicting the data that the processor will need in the near future, we can preload the cache and supply those data from the faster cache. It turns out that predicting the processor's future access requirements is not difficult, owing to a phenomenon known as locality of reference that programs exhibit.
A typical situation is shown in Figure 2. A simple computer without a cache memory is shown in the left side of the figure. This cache-less computer contains a CPU that has a clock speed of 400 MHz, but communicates over a 66 MHz bus to a main memory that supports a lower clock speed of 10 MHz. A few bus cycles are normally needed to synchronize the CPU with the bus, and thus the difference in speed between main memory and the CPU can be as large as a factor of ten or more. A cache memory can be positioned closer to the CPU as shown in the right side of Figure 2, so that the CPU sees fast accesses over a 400 MHz direct path to the cache.

[Figure 2: A computer system without a cache (left) and with a cache positioned close to the CPU (right)]

The performance of cache systems is measured using the hit rate or hit ratio. When the processor makes a request for a memory reference, the request is first sought in the cache. If the request corresponds to an element that is currently residing in the cache, we call that a cache hit. On the other hand, if the request corresponds to an element that is not currently in the cache, we call that a cache miss. The cache hit ratio, hc, is defined as the probability of finding the requested element in the cache. The cache miss ratio (1 - hc) is defined as the probability of not finding the requested element in the cache. The likelihood of the processor finding what it wants in the cache can be as high as 95 percent.
Figure 3 shows how memory read operations are handled to include the cache memory. When the data needed by the processor are in the cache memory (“read hit”), the requested data are supplied by the cache (see Figure 3a).
The dashed line indicates the flow of information in the case of a read hit. In the case of a read miss, we have to read the data from the main memory. While supplying the data from the main memory, a copy is also placed in the cache as indicated by the bottom dashed lines in Figure 3b. Reading data on a cache miss takes more time than reading directly from the main memory as we have to first test to see if the data are in the cache and also consider the overhead involved in copying the data into the cache. This additional time is referred to as the miss penalty.
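As a rough illustration of how the hit ratio and miss penalty combine, the Python sketch below computes an effective (average) access time. The 20 ns cache access, the 50 ns main memory access, and the simple miss model (a failed cache probe followed by a main memory read) are illustrative assumptions, not figures from this lecture.

```python
def effective_access_time(hc, t_cache, t_main):
    """Average time per access: a hit costs t_cache; a miss pays the
    miss penalty of the failed cache probe plus the main memory read."""
    miss_penalty = t_cache + t_main
    return hc * t_cache + (1 - hc) * miss_penalty

# With the 95 percent hit ratio quoted above:
t = effective_access_time(0.95, 20, 50)  # 0.95*20 + 0.05*70 = 22.5 ns
```

Even a modest drop in hit ratio raises the average noticeably, which is why cache organization matters so much.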

[Figure 3: Memory read operations with a cache: (a) read hit; (b) read miss]

In the case that the requested element is not found in the cache, it has to be brought from a subsequent level in the memory hierarchy. Assuming that the element exists in the next memory level, that is, the main memory, it has to be brought in and placed in the cache. In expectation that the next requested element will reside in the neighborhood of the current requested element (spatial locality), upon a cache miss what is actually brought from the main memory is a block of elements that contains the requested element (a block is the minimum unit of data transfer). The advantage of transferring a block from the main memory to the cache is most visible if the block can be transferred in one main memory access time. Such a possibility can be achieved by increasing the rate at which information is transferred between the main memory and the cache. This transfer of data is referred to as a mapping process.
Modern memory systems may have several levels of cache, referred to as Level 1 (L1), Level 2 (L2), and even, in some cases, Level 3 (L3). In most instances the L1 cache is implemented right on the CPU chip. Both the Intel Pentium and the IBM-Motorola PowerPC G3 processors have 32 Kbytes of L1 cache on the CPU chip. Processors often have two separate L1 caches, one for instructions (I-cache) and one for data (D-cache). A Level 2 (L2) high-speed cache can store up to 512 kilobytes of data.

2.1. CACHE MEMORY ORGANIZATION
There are three main organization techniques used for cache memory; they are discussed below. These techniques differ in two main aspects:
1. The criterion used to place an incoming block from the main memory in the cache.
2. The criterion used to replace a cache block by an incoming block (when the cache is full).

2.1.1. DIRECT MAPPING
This is the simplest among the three techniques. Its simplicity stems from the fact that it places an incoming main memory block into a specific fixed cache block location. The placement is done based on a fixed relation between the incoming block number, i, the cache block number, j, and the number of cache blocks, N:
j = i mod N

EXAMPLE 1: Consider, for example, the case of a main memory consisting of 4K blocks, a cache memory consisting of 128 blocks, and a block size of 16 words. Figure 5 shows the division of the main memory and the cache according to the direct-mapped cache technique. As the figure shows, there are a total of 32 main memory blocks that map to a given cache block. For example, main memory blocks 0, 128, 256, 384, . . . , 3968 map to cache block 0. We therefore call the direct-mapping technique a many-to-one mapping technique.

[Figure 5: Division of the main memory and the cache under direct mapping (Example 1)]

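The placement rule of Example 1 can be checked with a short Python sketch; the numbers (a 128-block cache and a 4K-block main memory) are those of the example itself.

```python
# Direct mapping: main memory block i goes to cache block j = i mod N.
N = 128  # number of cache blocks (Example 1)

def cache_block(i, n=N):
    """Fixed cache block that main memory block i maps to."""
    return i % n

# Main memory blocks 0, 128, 256, 384, ..., 3968 all share cache block 0,
# which is why direct mapping is a many-to-one technique:
conflicting = [i for i in range(4096) if cache_block(i) == 0]
```

As the text states, exactly 32 main memory blocks (4096/128) compete for each cache block.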
The main advantage of the direct-mapping technique is its simplicity in determining where to place an incoming main memory block in the cache. Its main disadvantage is the inefficient use of the cache: according to this technique, a number of main memory blocks may compete for a given cache block even if other cache blocks are empty. This disadvantage can lead to a low cache hit ratio.
According to the direct-mapping technique, the address issued by the processor is interpreted by dividing it into three fields, as shown in Figure 6. The lengths, in bits, of the fields in Figure 6 are:
1. Word field = log2 B, where B is the size of the block in words.
2. Block field = log2 N, where N is the size of the cache in blocks.
3. Tag field = log2 (M/N), where M is the size of the main memory in blocks.
4. The number of bits in the main memory address = log2 (B × M)

It should be noted that the total number of bits as computed by the first three equations should add up to the length of the main memory address. This can be used as a check for the correctness of your computation.

[Figure 6: Direct-mapped address format: Tag | Block | Word]

EXAMPLE 2: Compute the above four parameters for Example 1.
Word field = log2 B = log2 16 = log2 2^4 = 4 bits
Block field = log2 N = log2 128 = log2 2^7 = 7 bits
Tag field = log2 (M/N) = log2 (2^2 × 2^10 / 2^7) = 5 bits
The number of bits in the main memory address = log2 (B × M) = log2 (2^4 × 2^12) = 16 bits.
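The field-width computation of Example 2 can be reproduced in a few lines of Python (integer log2 is valid here because all the sizes are powers of two):

```python
from math import log2

B = 16        # block size in words
N = 128       # cache size in blocks
M = 4 * 1024  # main memory size in blocks (4K)

word_field   = int(log2(B))       # 4 bits
block_field  = int(log2(N))       # 7 bits
tag_field    = int(log2(M // N))  # 5 bits
address_bits = int(log2(B * M))   # 16 bits

# Sanity check from the text: the three fields must add up to the
# full main memory address length.
assert word_field + block_field + tag_field == address_bits
```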
2.1.2. FULLY ASSOCIATIVE MAPPING
According to this technique, an incoming main memory block can be placed in any available cache block. Therefore, the address issued by the processor need only have two fields: the Tag and Word fields. The Tag field uniquely identifies the block while it resides in the cache. The Word field identifies the element within the block that is requested by the processor. The address issued by the processor is interpreted by dividing it into two fields, as shown in Figure 7. The lengths, in bits, of the fields in Figure 7 are given by:
1. Word field = log2 B, where B is the size of the block in words
2. Tag field = log2 M, where M is the size of the main memory in blocks
3. The number of bits in the main memory address = log2 (B × M)

[Figure 7: Associative-mapped address format: Tag | Word]

EXAMPLE 3: Compute the above three parameters for a memory system having the following specification: size of the main memory is 4K blocks, size of the cache is 128 blocks, and the block size is 16 words. Assume that the system uses associative mapping.
Word field = log2 B = log2 16 = log2 2^4 = 4 bits
Tag field = log2 M = log2 (2^2 × 2^10) = 12 bits
The number of bits in the main memory address = log2 (B × M) = log2 (2^4 × 2^12) = 16 bits.

2.1.3. SET-ASSOCIATIVE MAPPING
In the set-associative mapping technique, the cache is divided into a number of sets, each consisting of a number of blocks. A given main memory block maps to a specific cache set based on the equation s = i mod S, where S is the number of sets in the cache, i is the main memory block number, and s is the specific cache set to which block i maps. However, the incoming block can be placed in any block within the assigned cache set. Therefore, the address issued by the processor is divided into three distinct fields: the Tag, Set, and Word fields. The Set field uniquely identifies the specific cache set that should hold the targeted block. The Tag field uniquely identifies the targeted block within the determined set. The Word field identifies the element (word) within the block that is requested by the processor. According to the set-associative mapping technique, the address issued by the processor is interpreted by dividing it into three fields, as shown in Figure 8. The length, in bits, of each of the fields of Figure 8 is given by
1. Word field = log2 B, where B is the size of the block in words
2. Set field = log2 S, where S is the number of sets in the cache
3. Tag field = log2 (M/S), where M is the size of the main memory in blocks. S = N/Bs, where N is the number of cache blocks and Bs is the number of blocks per set
4. The number of bits in the main memory address = log2 (B × M)


[Figure 8: Set-associative address format: Tag | Set | Word]

EXAMPLE 4: Compute the above three parameters (Word, Set, and Tag) for a memory system having the following specification: size of the main memory is 4K blocks, size of the cache is 128 blocks, and the block size is 16 words. Assume that the system uses set-associative mapping with four blocks per set.
S = 128/4 = 32 sets:
1. Word field = log2 B = log2 16 = log2 2^4 = 4 bits
2. Set field = log2 S = log2 32 = log2 2^5 = 5 bits
3. Tag field = log2 (M/S) = log2 (2^2 × 2^10 / 2^5) = 7 bits
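The set-assignment rule and the field widths of Example 4 can be sketched together in Python, using the example's own numbers:

```python
from math import log2

# Example 4: 4K main memory blocks, 128 cache blocks, 16-word blocks,
# and 4 blocks per set.
B, N, M, Bs = 16, 128, 4 * 1024, 4
S = N // Bs  # 32 sets

def cache_set(i, s=S):
    """Set s = i mod S that main memory block i maps to; the block can
    then be placed in any of the Bs blocks of that set."""
    return i % s

word_field = int(log2(B))       # 4 bits
set_field  = int(log2(S))       # 5 bits
tag_field  = int(log2(M // S))  # 7 bits
```

Note how set-associative mapping sits between the two earlier schemes: with Bs = 1 it degenerates to direct mapping, and with S = 1 it becomes fully associative.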

3. VIRTUAL MEMORY
When you write programs, you are not really concerned with the amount of memory available on your system to run the program. What if your program requires more memory to run than is available on your machine? This is not a theoretical question, in spite of the amount of memory available on current machines. Even on a single-user system, not all the memory is available for your program: the operating system takes quite a big chunk of it. If you consider a time-shared multiprogrammed system, the problem becomes even more serious. Virtual memory was proposed to deal with this problem.
Virtual memory was developed to eliminate the physical memory size restriction mentioned before. There are some similarities between cache memory and virtual memory. Just as with the cache memory, we would like to use the relatively small main memory and create the illusion (to the programmer) of a much larger memory, as shown in Figure 9. The programmer is concerned only with the virtual address space. Programs use virtual addresses, and when these programs are run, their virtual addresses are mapped to physical addresses at run time.
The illusion of larger address space is realized by using much slower disk storage. Virtual memory can be implemented by devising an appropriate mapping function between the virtual and physical address spaces. As a result of this similarity between cache and virtual memories, both memory system designs are based on the same underlying principles. The success of the virtual memory in providing larger virtual address space also depends on the locality we mentioned before.
Before the virtual memory technique was proposed, one had to resort to a technique known as overlaying in order to run programs that did not fit into the physical memory. With only a tiny amount of memory available (by current standards) on the earlier machines, only a simple program could fit into the memory. In this technique, the programmer divides the program into several chunks, each of which can fit in the memory. These chunks are known as overlays. The whole program (i.e., all overlays) resides on the disk, and the programmer is responsible for explicitly managing the overlays: typically, when an overlay in the memory is finished, the program brings in the next overlay required for execution. Needless to say, this is not something a programmer likes to do. Virtual memory automates the management of overlays without requiring any help from the programmer.

[Figure 9: Virtual memory: a large virtual address space mapped onto a smaller physical memory]

Many of the concepts we use here are similar to the concepts used in cache systems. Caches use a small amount of fast memory, but to a program it appears as a large amount of fast memory. As mentioned before, virtual memory provides a similar illusion. As a result of this similarity, the principles involved are the same. The details, however, are quite different, because the motivations for cache memory and virtual memory are different: we use cache memory to improve performance, whereas virtual memory is necessary to run programs in a small amount of physical memory.
Virtual memory implements a mapping function between a much larger virtual address space and the physical memory address space. For example, in the PowerPC a 48-bit virtual address is translated into a 32-bit memory address. On the other hand, the Pentium uses 32-bit addresses for both virtual and physical memory addresses.
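The mapping function can be sketched with a minimal page table in Python. The 4 KB page size and the toy page table below are illustrative assumptions for this sketch, not details from this lecture or from the PowerPC/Pentium schemes mentioned above.

```python
PAGE_SIZE = 4096  # bytes per page; the offset field is 12 bits

# Hypothetical page table: virtual page number -> physical frame number.
page_table = {0: 5, 1: 2, 7: 9}

def translate(vaddr):
    """Return the physical address for vaddr, or None on a page fault
    (the virtual page is not resident in physical memory)."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table.get(vpn)
    if frame is None:
        return None  # page fault: the OS must bring the page in from disk
    return frame * PAGE_SIZE + offset

paddr = translate(100)  # virtual page 0 -> frame 5, same offset within the page
```

The offset passes through unchanged; only the page number is remapped, which is what lets the virtual address space be larger (or differently arranged) than physical memory.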

4. THE VON NEUMANN MODEL

John von Neumann, along with others, proposed the concept of the stored program that we use even today. The idea was to keep a program in the memory and read the instructions from it. He also proposed an architecture that clearly identified the components we have presented previously: ALU, control, input, output, and memory as illustrated in Figure 10. This architecture is known as the von Neumann architecture.

[Figure 10: The von Neumann architecture: ALU, control, input, output, and memory]

This architecture uses what is known as the stored program model. In the von Neumann architecture, the stored program is the most important aspect of the von Neumann model. The key features of this architecture are as follows:
• There is no distinction between instructions and data. This requirement has several main implications:
1. Instructions are represented as numbers, just like the data themselves. This uniform treatment of instructions and data simplifies the design of memory and software.
2. Instructions and data are not stored in separate memories; a single memory is used for both. Thus, a single path from the memory can carry both data and instructions.
3. The memory is addressed by location, regardless of the type of data at that location.
• By default, instructions are executed in the sequential manner in which they are present in the stored program.
A program is stored in the computer's memory along with the data to be processed (a computer with a von Neumann architecture has a single memory space that contains both the instructions and the data; see Figure 11). This can lead to a condition called the von Neumann bottleneck, which places a limitation on how fast the processor can run. Instructions and data must share the same path to the CPU from memory, so if the CPU is writing a data value out to memory, it cannot fetch the next instruction to be executed. It must wait until the data has been written before proceeding, and vice versa.

[Figure 11: A single memory space holding both instructions and data]

In contrast to the single memory concept used in the von Neumann architecture, the Harvard architecture uses separate memories for instructions and data. The term now refers to machines that have a single main memory but use separate caches for instructions and data.
Most processors now use two separate caches: one for instructions and the other for data. This design uses separate buses for instructions and data. Processors typically use the Harvard architecture only at the CPU-cache interface.

To mitigate the von Neumann bottleneck:
• Multi-level caches are used to reduce the miss penalty (assuming that the L1 cache is on-chip); and
• The memory system is designed to support caches with burst-mode accesses.

