
    Further Adventures With CAS Instructions And Micro Benchmarking

    Original article: http://mechanical-sympathy.blogspot.com/2013/01/further-adventures-with-cas.html

    In a previous article I reported what appeared to be a performance issue with CAS/LOCK instructions on the Sandy Bridge microarchitecture compared to the previous Nehalem microarchitecture. Since then I've worked with the good people of Intel to understand what was going on, and I'm now pleased to be able to shine some light on the previous results.

    I observed a small drop in throughput in the uncontended single-threaded case, and an order-of-magnitude decrease in throughput once multiple threads contend while performing updates. This testing grew out of observations made while testing Java Queue implementations and the Disruptor for the multi-producer case. I was initially puzzled by these findings because almost every other performance test I applied to Sandy Bridge indicated a major step forward for this microarchitecture.

    After digging deeper into this issue, it has come to light that my tests have once again fallen foul of the difficulties of micro-benchmarking. My test is not a good means of testing throughput; it is actually testing fairness in a roundabout manner. Let's revisit the code and work through what is going on.

    Test Code

    #include <time.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <iostream>
     
    typedef unsigned long long uint64;
     
    const uint64 COUNT = 500 * 1000 * 1000;
     
    volatile uint64 counter = 0;
     
    // Each thread performs its share of the increments with an atomic add (lock add).
    void* run_add(void* numThreads)
    {
        register uint64 value = (COUNT / *((int*)numThreads)) + 1;
     
        while (--value != 0)
        {
            __sync_add_and_fetch(&counter, 1);
        }
     
        return NULL;
    }
     
    // Each thread increments the shared counter with an atomic fetch-and-add (lock xadd).
    void* run_xadd(void*)
    {
        register uint64 value = counter;
     
        while (value < COUNT)
        {
            value = __sync_add_and_fetch(&counter, 1);
        }
     
        return NULL;
    }
     
    // Each thread increments the counter via a CAS retry loop (boolean form, lock cmpxchg).
    void* run_cas(void*)
    {
        register uint64 value = 0;
     
        while (value < COUNT)
        {
            do
            {
                value = counter;
            }
            while (!__sync_bool_compare_and_swap(&counter, value, value + 1));
        }
     
        return NULL;
    }
     
    // Each thread increments the counter via a CAS retry loop (value-returning form, lock cmpxchg).
    void* run_cas2(void*)
    {
        register uint64 value = 0;
        register uint64 next = 0;
     
        while (value < COUNT)
        {
            value = counter;
            do
            {
                next = value + 1;
                value = __sync_val_compare_and_swap(&counter, value, next);
            }
            while (value != next);
        }
     
        return NULL;
    }
     
    int main (int argc, char *argv[])
    {
        const int NUM_THREADS = atoi(argv[1]);
        const int TESTCASE = atoi(argv[2]);
     
        pthread_t threads[NUM_THREADS];
        void* status;
     
        timespec ts_start;
        timespec ts_finish;
        clock_gettime(CLOCK_MONOTONIC, &ts_start);
     
     
        for (int i = 0; i < NUM_THREADS; i++)
        {
            switch (TESTCASE)
            {
                case 1:
                    std::cout << "LOCK ADD" << std::endl;
                    pthread_create(&threads[i], NULL, run_add, (void*)&NUM_THREADS);
                    break;
     
                case 2:
                    std::cout << "LOCK XADD" << std::endl;
                    pthread_create(&threads[i], NULL, run_xadd, (void*)&NUM_THREADS);
                    break;
     
                case 3:
                    std::cout << "LOCK CMPXCHG BOOL" << std::endl;
                    pthread_create(&threads[i], NULL, run_cas, (void*)&NUM_THREADS);
                    break;
     
                case 4:
                    std::cout << "LOCK CMPXCHG VAL" << std::endl;
                    pthread_create(&threads[i], NULL, run_cas2, (void*)&NUM_THREADS);
                    break;
     
                default:
                    exit(1);
            }
        }
     
        for (int i = 0; i < NUM_THREADS; i++)
        {
            pthread_join(threads[i], &status);
        }
     
        clock_gettime(CLOCK_MONOTONIC, &ts_finish);
     
        uint64 start = (ts_start.tv_sec * 1000000000) + ts_start.tv_nsec;
        uint64 finish = (ts_finish.tv_sec * 1000000000) + ts_finish.tv_nsec;
        uint64 duration = finish - start;
     
        std::cout << "threads = " << NUM_THREADS << std::endl;
        std::cout << "duration = " <<  duration << std::endl;
        std::cout << "ns per op = " << (duration / (COUNT * 2)) << std::endl;
        std::cout << "op/sec = " << ((COUNT * 2 * 1000 * 1000 * 1000) / duration) << std::endl;
        std::cout << "counter = " << counter << std::endl;
     
        return 0;
    }
    

    The code above makes it possible to test the major CAS-based techniques on x86. For full clarity, an objdump -d of the binary reveals the assembly instructions the compiler generated for the above methods. The lock-prefixed instruction in each section is where the atomic update happens.

    0000000000400dc0 <_Z8run_cas2Pv>:
      400dc0: 48 8b 05 d9 07 20 00  mov    0x2007d9(%rip),%rax      
      400dc7: 66 0f 1f 84 00 00 00  nopw   0x0(%rax,%rax,1)
      400dce: 00 00 
      400dd0: 48 8d 50 01           lea    0x1(%rax),%rdx
      400dd4: f0 48 0f b1 15 c3 07  lock cmpxchg %rdx,0x2007c3(%rip)
      400ddb: 20 00 
      400ddd: 48 39 c2              cmp    %rax,%rdx
      400de0: 75 ee                 jne    400dd0 <_Z8run_cas2Pv>
      400de2: 48 3d ff 64 cd 1d     cmp    $0x1dcd64ff,%rax
      400de8: 76 d6                 jbe    400dc0 <_Z8run_cas2Pv>
      400dea: f3 c3                 repz retq 
      400dec: 0f 1f 40 00           nopl   0x0(%rax)
    
    0000000000400df0 <_Z7run_casPv>:
      400df0: 48 8b 15 a9 07 20 00  mov    0x2007a9(%rip),%rdx     
      400df7: 48 8d 4a 01           lea    0x1(%rdx),%rcx
      400dfb: 48 89 d0              mov    %rdx,%rax
      400dfe: f0 48 0f b1 0d 99 07  lock cmpxchg %rcx,0x200799(%rip)  
      400e05: 20 00 
      400e07: 75 e7                 jne    400df0 <_Z7run_casPv>
      400e09: 48 81 fa ff 64 cd 1d  cmp    $0x1dcd64ff,%rdx
      400e10: 76 de                 jbe    400df0 <_Z7run_casPv>
      400e12: f3 c3                 repz retq 
      400e14: 66 66 66 2e 0f 1f 84  data32 data32 nopw %cs:0x0(%rax,%rax,1)
      400e1b: 00 00 00 00 00 
    
    0000000000400e20 <_Z8run_xaddPv>:
      400e20: 48 8b 05 79 07 20 00  mov    0x200779(%rip),%rax    
      400e27: 48 3d ff 64 cd 1d     cmp    $0x1dcd64ff,%rax
      400e2d: 77 1b                 ja     400e4a <_Z8run_xaddPv>
      400e2f: 90                    nop
      400e30: b8 01 00 00 00        mov    $0x1,%eax
      400e35: f0 48 0f c1 05 62 07  lock xadd %rax,0x200762(%rip) 
      400e3c: 20 00 
      400e3e: 48 83 c0 01           add    $0x1,%rax
      400e42: 48 3d ff 64 cd 1d     cmp    $0x1dcd64ff,%rax
      400e48: 76 e6                 jbe    400e30 <_Z8run_xaddPv>
      400e4a: f3 c3                 repz retq 
      400e4c: 0f 1f 40 00           nopl   0x0(%rax)
    
    0000000000400e50 <_Z7run_addPv>:
      400e50: 48 63 0f              movslq (%rdi),%rcx
      400e53: 31 d2                 xor    %edx,%edx
      400e55: b8 00 65 cd 1d        mov    $0x1dcd6500,%eax
      400e5a: 48 f7 f1              div    %rcx
      400e5d: 48 85 c0              test   %rax,%rax
      400e60: 74 15                 je     400e77 <_Z7run_addPv>
      400e62: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
      400e68: f0 48 83 05 2f 07 20  lock addq $0x1,0x20072f(%rip)    
      400e6f: 00 01 
      400e71: 48 83 e8 01           sub    $0x1,%rax
      400e75: 75 f1                 jne    400e68 <_Z7run_addPv>
      400e77: f3 c3                 repz retq 
      400e79: 90                    nop
      400e7a: 90                    nop
      400e7b: 90                    nop
      400e7c: 90                    nop
      400e7d: 90                    nop
      400e7e: 90                    nop
      400e7f: 90                    nop

    To isolate the performance of the CAS operation itself, the test should be run using the lock xadd option for an atomic increment in hardware. This instruction lets us avoid the spin-retry loop of a pure software CAS, which can dirty the experiment.
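    For readers more familiar with C++11 atomics than the GCC __sync builtins, the sketch below shows the same two update strategies; it is an illustration only, not the code that produced the measurements above. On x86, fetch_add typically compiles to lock xadd, while the compare-exchange loop compiles to lock cmpxchg inside a software retry loop.

    // Illustrative sketch only (not the benchmarked code): the two update
    // strategies expressed with C++11 std::atomic.
    #include <atomic>
    #include <cstdint>
     
    std::atomic<uint64_t> atomic_counter(0);
     
    // Hardware atomic increment: on x86 this typically becomes LOCK XADD,
    // so there is no software retry loop. Returns the incremented value.
    uint64_t increment_xadd()
    {
        return atomic_counter.fetch_add(1) + 1;
    }
     
    // Software CAS loop: on x86 the compare-exchange typically becomes
    // LOCK CMPXCHG, and the loop retries whenever another thread got in first.
    uint64_t increment_cas()
    {
        uint64_t value = atomic_counter.load();
        while (!atomic_counter.compare_exchange_weak(value, value + 1))
        {
            // On failure, value is reloaded with the current counter; try again.
        }
        return value + 1;
    }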

    I repeated the experiment from the previous article and got very similar results. Previously, I thought I'd observed a throughput drop even in the uncontended single-threaded case, so I focused on confirming this. To do so I had to find two processors whose clock speeds would be comparable once Turbo Boost had kicked in. I settled on a 2.8GHz Nehalem and a 2.4GHz Sandy Bridge; for the single-threaded case they are both operating at ~3.4GHz.

    Nehalem 2.8GHz
    ==============
    $ perf stat ./atomic_inc 1 2
    LOCK XADD
    threads = 1
    duration = 3090445546
    ns per op = 3
    op/sec = 323577938
    
     Performance counter stats for './atomic_inc 1 2':
    
           3085.466216 task-clock                #    0.997 CPUs utilized          
                   331 context-switches          #    0.107 K/sec                  
                     4 CPU-migrations            #    0.001 K/sec                  
                   360 page-faults               #    0.117 K/sec                  
        10,527,264,923 cycles                    #    3.412 GHz                 
         9,394,575,677 stalled-cycles-frontend   #   89.24% frontend cycles idle
         7,423,070,202 stalled-cycles-backend    #   70.51% backend  cycles idle 
         2,517,668,293 instructions              #    0.24  insns per cycle        
                                                 #    3.73  stalled cycles per insn
           503,526,119 branches                  #  163.193 M/sec                  
               110,695 branch-misses             #    0.02% of all branches       
    
           3.093402966 seconds time elapsed
    
    Sandy Bridge 2.4GHz
    ===================
    $ perf stat ./atomic_inc 1 2
    LOCK XADD
    threads = 1
    duration = 3394221940
    ns per op = 3
    op/sec = 294618330
    
     Performance counter stats for './atomic_inc 1 2':
    
           3390.404400 task-clock                #    0.998 CPUs utilized          
                   357 context-switches          #    0.105 K/sec                  
                     1 CPU-migrations            #    0.000 K/sec                  
                   358 page-faults               #    0.106 K/sec                  
        11,522,932,068 cycles                    #    3.399 GHz                 
         9,542,667,311 stalled-cycles-frontend   #   82.81% frontend cycles idle  
         6,721,330,874 stalled-cycles-backend    #   58.33% backend  cycles idle  
         2,518,638,461 instructions              #    0.22  insns per cycle        
                                                 #    3.79  stalled cycles per insn
           502,490,710 branches                  #  148.210 M/sec                  
                36,955 branch-misses             #    0.01% of all branches        
    
           3.398206155 seconds time elapsed

    Analysis

    So repeating the tests with comparable clock speeds confirmed the previous results. The single-threaded case shows a ~10% drop in throughput, and the multi-threaded contended case displays an order-of-magnitude difference in throughput.
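    From the two runs above, 294,618,330 op/sec on the 2.4GHz Sandy Bridge against 323,577,938 op/sec on the 2.8GHz Nehalem works out at 294,618,330 / 323,577,938 ≈ 0.91, i.e. roughly a 9-10% drop at comparable effective clock speeds.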

    Now the big question is: what is going on and why has throughput dropped? Well, the single-threaded case suggests nothing major has happened to the number of cycles required to execute the instruction when uncontended. The small differences could be attributed to system noise or to the changes in the CPU front-end for Sandy Bridge with the introduction of the additional load address generation unit.

    For the multi-threaded case we found an interesting surprise when Intel monitored what the instructions were doing. We found that each thread on Nehalem was able to perform more updates in a batch before losing the exclusive state on the cacheline containing the counter. This is because the inter-core latency has improved with Sandy Bridge, so other threads are able to claim the cacheline containing the counter more quickly to do their own updates. What we are actually measuring with this micro-benchmark is how long a core can hold a cacheline before it is released to another core. Sandy Bridge is exhibiting greater fairness, which is what you'd want in a real-world application.
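    One way to see this batching directly would be to instrument the lock xadd case so that each thread counts how many consecutive increments it lands before another core gets in. The sketch below is a hypothetical addition of mine, not part of the original test: run_length, longest_run and the surrounding bookkeeping are invented names, and the idea is simply that longer runs mean a core is holding the cacheline for longer, i.e. less fairness.

    // Hypothetical instrumentation (not part of the original benchmark): use the
    // value returned by the atomic increment to detect whether another core
    // updated the counter since this thread's previous increment.
    void* run_xadd_instrumented(void*)
    {
        uint64 run_length = 0;   // consecutive increments without interference
        uint64 longest_run = 0;  // longest batch observed by this thread
        uint64 previous = __sync_add_and_fetch(&counter, 1);
     
        while (previous < COUNT)
        {
            uint64 current = __sync_add_and_fetch(&counter, 1);
     
            if (current == previous + 1)
            {
                // No other thread incremented in between: we are still batching
                // updates on a cacheline this core effectively owns.
                ++run_length;
            }
            else
            {
                // Another core intervened, so the cacheline was handed over.
                if (run_length > longest_run)
                {
                    longest_run = run_length;
                }
                run_length = 1;
            }
     
            previous = current;
        }
     
        // Reporting longest_run (or a histogram of run lengths) per thread is one
        // way to expose the batching effect described above.
        return NULL;
    }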

    This micro-benchmark is very unrealistic for a real-world application. Normally, between performing counter updates, a core would be doing a lot of other work. At the point when the counter needs to be updated, the reduced inter-core latency would then be a benefit.
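    To make that concrete, a more representative (though still artificial) variant would interleave some purely local work between increments so that each core naturally gives up the cacheline. The do_other_work function below is an invented stand-in for real application work and is not from the original test.

    // Hypothetical variant (not from the original article): interleave local-only
    // work between atomic increments so the cacheline is regularly given up.
    volatile uint64 sink = 0;  // prevents the local work being optimised away
     
    static uint64 do_other_work(uint64 seed)
    {
        // Purely register/ALU work with no shared-memory traffic.
        for (int i = 0; i < 200; i++)
        {
            seed = (seed * 2862933555777941757ULL) + 3037000493ULL;
        }
        return seed;
    }
     
    void* run_xadd_with_work(void*)
    {
        register uint64 local = 0;
        register uint64 value = __sync_add_and_fetch(&counter, 1);
     
        while (value < COUNT)
        {
            local = do_other_work(local ^ value);  // simulate work between updates
            value = __sync_add_and_fetch(&counter, 1);
        }
     
        sink = local;
        return NULL;
    }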

    In all my macro application benchmarks Sandy Bridge has proved to have better performance than Nehalem at comparable clock speeds.

    Conclusion

    What did I learn from this? Well, once again, that writing micro-benchmarks is notoriously difficult. It is so hard to know what you are measuring and what effects can come into play. To illustrate how difficult it is to recognise such a flaw: of all those who have read this blog, no one identified the issue and fed it back to me.

    It also shows that what at first blush can be considered a performance bug is actually the opposite. This shows how it is possible to have a second-order effect whereby a performance improvement can make a specific workload run more slowly.
