
Various Limulus Benchmarks

HPL Performance

i5-2400S Sandy Bridge

  • 200.3 GFLOPS N=40220 (Raw HPL Results)
  • 58% of Peak (3.3GHz * 4 cores * 8 DP FLOPS/cycle) + (2.5GHz * 12 cores * 8 DP FLOPS/cycle) = 345.6 GFLOPS Peak
  • Three i5-2400S and one i5-2500K each with 4 GB DDR3 RAM, GbE, Intel MKL and compilers
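
The peak figure above is the sum, over each processor type, of clock rate times core count times double-precision FLOPS per cycle. The following is a minimal sketch of that arithmetic using only the numbers quoted in the bullets; the function name `peak_gflops` is made up for illustration.

```python
def peak_gflops(groups):
    """Sum clock (GHz) * cores * DP FLOPS/cycle over each processor group."""
    return sum(ghz * cores * flops for ghz, cores, flops in groups)

# (clock in GHz, total cores, DP FLOPS per cycle), as quoted in the bullet above
sandy_bridge = [(3.3, 4, 8), (2.5, 12, 8)]

peak = peak_gflops(sandy_bridge)
efficiency = 200.3 / peak  # measured HPL GFLOPS divided by theoretical peak

print(f"peak = {peak:.1f} GFLOPS, efficiency = {efficiency:.0%}")
# prints: peak = 345.6 GFLOPS, efficiency = 58%
```

The same function covers the homogeneous clusters below by passing a single group, for example `[(2.9, 16, 8)]` for the four i5-3470S nodes.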

i5-3470S Ivy Bridge

  • 256.4 GFLOPS N=58800 (Raw HPL Results)
  • 69% of Peak (2.9GHz * 16 cores * 8 DP FLOPS/cycle) = 371.2 GFLOPS Peak
  • Four i5-3470S each with 8 GB DDR3 RAM, GbE, Intel MKL and compilers

i5-4570S Haswell

  • 385.5 GFLOPS N=60000 (Raw HPL Results)
  • 52% of Peak (2.9GHz * 16 cores * 16 DP FLOPS/cycle) = 742.4 GFLOPS Peak (Note: Haswell is now 16 FLOPS/cycle)
  • Four i5-4570S each with 8 GB DDR3 RAM, GbE, Intel MKL and compilers
  • 567.4 GFLOPS N=60000 (Raw HPL Results)
  • 76% of Peak (2.9GHz * 16 cores * 16 DP FLOPS/cycle) = 742.4 GFLOPS Peak (Note: Haswell is now 16 FLOPS/cycle)
  • Four i5-4570S each with 8 GB DDR3 RAM, 10-GbE, Intel MKL and compilers

i7-4770S Haswell

  • 444.8 GFLOPS N=86000 (Raw HPL Results)
  • 56% of Peak (3.1GHz * 16 cores * 16 DP FLOPS/cycle) = 793.6 GFLOPS Peak (Note: Haswell is now 16 FLOPS/cycle)
  • Four i7-4770S each with 16 GB DDR3 RAM, GbE, HT disabled, Intel MKL and compilers
  • 498.3 GFLOPS N=126000 (Raw HPL Results)
  • 63% of Peak (3.1GHz * 16 cores * 16 DP FLOPS/cycle) = 793.6 GFLOPS Peak (Note: Haswell is now 16 FLOPS/cycle)
  • Four i7-4770S each with 32 GB DDR3 RAM, GbE, HT disabled, Intel MKL and compilers

i5-6500 Skylake

  • 480.2 GFLOPS N=86000 (Raw HPL Results)
  • 59% of Peak (3.2GHz * 16 cores * 16 DP FLOPS/cycle) = 819.2 GFLOPS Peak (Note: Skylake is 16 FLOPS/cycle)
  • Four i5-6500 each with 16 GB DDR4 RAM, GbE, Intel MKL and compilers
  • 658.3 GFLOPS N=86000 (Raw HPL Results)
  • 80% of Peak (3.2GHz * 16 cores * 16 DP FLOPS/cycle) = 819.2 GFLOPS Peak (Note: Skylake is 16 FLOPS/cycle)
  • Four i5-6500 each with 16 GB DDR4 RAM, 10-GbE, Intel MKL and compilers

i7-6700 Skylake

  • 592.5 GFLOPS N=126000 (Raw HPL Results)
  • 68% of Peak (3.4GHz * 16 cores * 16 DP FLOPS/cycle) = 870.4 GFLOPS Peak (Note: Skylake is 16 FLOPS/cycle)
  • Four i7-6700 each with 32 GB DDR4 RAM, GbE, HT disabled, Intel MKL and compilers
  • 640.4 GFLOPS N=180000 (Raw HPL Results)
  • 74% of Peak (3.4GHz * 16 cores * 16 DP FLOPS/cycle) = 870.4 GFLOPS Peak (Note: Skylake is 16 FLOPS/cycle)
  • Four i7-6700 each with 64 GB DDR4 RAM, GbE, HT disabled, Intel MKL and compilers
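
The problem size N in each result tracks the memory available, because HPL factors a dense N x N double-precision matrix that takes 8 * N^2 bytes. The sketch below estimates that footprint per node. It assumes four nodes sharing the matrix evenly and that the per-node memory figures on this page are gigabytes; the function name `hpl_matrix_gib` is hypothetical.

```python
def hpl_matrix_gib(n, nodes=4):
    """Return (total, per-node) size in GiB of an N x N double-precision matrix."""
    total = n * n * 8 / 2**30  # 8 bytes per double
    return total, total / nodes

# Problem sizes copied from the results above
for label, n in [("i5-2400S", 40220),
                 ("i5-3470S", 58800),
                 ("i7-6700 (larger run)", 180000)]:
    total, per_node = hpl_matrix_gib(n)
    print(f"{label}: N={n} -> {total:.1f} GiB total, {per_node:.1f} GiB per node")
```

Under these assumptions each run fills most of the installed memory on every node, which is the usual way to pick N for HPL.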

Hadoop

Updated Tests

The cluster consists of four nodes with the following hardware specifications.

  • Processor: Intel Core i5-3470S CPU @ 2.90GHz (65W Quad-core Ivy Bridge)
  • Memory: 16 GB (DDR3-1600)
  • Interconnect: Intel GbE (MTU=4500)
  • HDFS Storage: 192 GB SSD (Plextor M5S 128 GB, Plextor M5S 64 GB)
  • Hadoop Version: 2.2.0 (Hortonworks HDP 2.0)
  • HDFS size: 696.23 GB
  • HDFS replication: 2
  • OS: Scientific Linux 6.4
  • Number of Data nodes: 4
  • Number of worker nodes: 4

TestDFSIO (average of 10 runs):

  • write: 13.86 MB/sec Single File; 221.80 MB/sec Total Throughput
  • read: 53.97 MB/sec Single File; 863.49 MB/sec Total Throughput

terasort (100 GB):

  • 886 seconds, 112.87 MB/sec
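
The terasort rate is the data size divided by the run time, using decimal units (1 GB = 1000 MB). The sketch below reproduces it and also divides the TestDFSIO total throughput by the single-file rate. That ratio suggests about 16 concurrent files, but the file count is not stated here, so treat it as an inference rather than a reported setting.

```python
# terasort: 100 GB sorted in 886 seconds (decimal units: 1 GB = 1000 MB)
terasort_mb = 100 * 1000
terasort_seconds = 886
terasort_rate = terasort_mb / terasort_seconds
print(f"terasort: {terasort_rate:.2f} MB/sec")  # prints: terasort: 112.87 MB/sec

# TestDFSIO: total throughput divided by the single-file rate.
# The result is only an implied file count (an assumption, not a reported value).
write_ratio = 221.80 / 13.86
read_ratio = 863.49 / 53.97
print(f"implied concurrent files: write {write_ratio:.1f}, read {read_ratio:.1f}")
```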

Previous Results

I recently presented a tutorial on Hadoop and used my original Limulus to demonstrate a four-node Hadoop cluster. Some of the hardware is quite old. Each node has a single dual-core E6550 with 4 GB RAM and a 64 GB SSD. The head node has a quad-core Sandy Bridge (i5-2400S) with 4 GB RAM, a 64 GB SSD, and 0.5 TB of RAID1. The nodes were connected through mediocre consumer-grade Realtek GigE Ethernet controllers. Overall it worked well. I plan on making a Hadoop VNFS so that the nodes can boot directly into Hadoop. The Hadoop install used Hortonworks HDP. The following are the results of the TestDFSIO benchmark, which tests the Hadoop file system (HDFS):

----- TestDFSIO ----- : write
           Date & time: Wed Feb 06 22:23:24 EST 2013
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 9.684751649071087
Average IO rate mb/sec: 9.751626968383789
 IO rate std deviation: 0.8502143053358981
    Test exec time sec: 141.977

----- TestDFSIO ----- : read
           Date & time: Wed Feb 06 22:26:34 EST 2013
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 44.49229838314988
Average IO rate mb/sec: 51.67889404296875
 IO rate std deviation: 23.5285090607248
    Test exec time sec: 58.388
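
To compare runs, the TestDFSIO reports above can be read into dictionaries. The following is a minimal sketch that assumes the exact "key: value" layout shown above; the helper name `parse_testdfsio` is made up, and the sample text is shortened to the fields the example uses.

```python
def parse_testdfsio(text):
    """Parse TestDFSIO report text into a list of dicts, one per test."""
    results = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("----- TestDFSIO -----"):
            # Header line, e.g. "----- TestDFSIO ----- : write"
            current = {"test": line.split(":", 1)[1].strip()}
            results.append(current)
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return results

report = """
----- TestDFSIO ----- : write
Total MBytes processed: 10000
     Throughput mb/sec: 9.684751649071087
    Test exec time sec: 141.977

----- TestDFSIO ----- : read
Total MBytes processed: 10000
     Throughput mb/sec: 44.49229838314988
    Test exec time sec: 58.388
"""

runs = {r["test"]: r for r in parse_testdfsio(report)}
for name, r in runs.items():
    # Total data over wall-clock time; the exec time includes job start-up.
    wall = float(r["Total MBytes processed"]) / float(r["Test exec time sec"])
    print(f"{name}: {float(r['Throughput mb/sec']):.2f} MB/sec per task, "
          f"{wall:.1f} MB/sec over the whole job")
```

The second figure is simply total megabytes over the reported execution time, so it understates steady-state throughput on short runs.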

Switchless 10 GigE (GbE)

These tests are preliminary using old hardware and untuned software. This approach has been abandoned. The results remain for reference.

The first issue is how to add 10GigE (10 GbE) without a switch using low-cost dual-port 10GigE cards. The approach is to create a four-node loop using Ethernet bridges. A simple loop requires the Spanning Tree Protocol, but that disables one link, which leaves one 2-hop route, two 1-hop routes, and three 0-hop routes. If one of the bridges is replaced by a bonded link in "mode 0" (round robin), there are effectively four 1-hop routes and two 0-hop routes, and no route costs more than "1-hop" in terms of latency and bandwidth.

LAN -- head node----------- n2
         bridge           bridge
          |                  |
          |                  |
          |                  |
         n0 ---------------- n1
       bridge               bonded
                       mode 0 Round Robin
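
The route counts quoted for the Spanning Tree case can be checked by counting the intermediate nodes between every pair once one link of the loop is removed. The sketch below does this with a breadth-first search. Which link Spanning Tree would block is not stated, so removing n0-n1 is an assumption; the counts are the same whichever link is cut. The bonded case is not modelled, because round-robin bonding changes how packets travel in a way a plain graph does not capture.

```python
from collections import Counter, deque
from itertools import combinations

def intermediate_hops(links, src, dst):
    """Breadth-first search; returns the number of nodes between src and dst."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    distance = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return distance[node] - 1  # links traversed minus one
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    raise ValueError(f"no path from {src} to {dst}")

nodes = ["head", "n0", "n1", "n2"]
loop = [("head", "n2"), ("n2", "n1"), ("n1", "n0"), ("n0", "head")]

# Assumption: Spanning Tree blocks the n0-n1 link, leaving a chain of four nodes.
tree = [link for link in loop if link != ("n1", "n0")]

counts = Counter(intermediate_hops(tree, a, b) for a, b in combinations(nodes, 2))
print(dict(sorted(counts.items())))  # prints: {0: 3, 1: 2, 2: 1}
```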

Hardware: The head node is a Sandy Bridge i5-2400S with 4 GB RAM. The nodes each have a single dual-core E6550 with 4 GB RAM. The 10GigE NICs are Chelsio T420-SO-CR.

Software: Open-MX

The following omx_perf results are for the best case "0-hop" routes (head-n0, head-n2). Note: the MTU is 1500 and no other tuning was applied.

length         0:       6.992 us        0.00 MB/s        0.00 MiB/s
length         1:       6.495 us        0.15 MB/s        0.15 MiB/s
length         2:       6.460 us        0.31 MB/s        0.30 MiB/s
length         4:       6.575 us        0.61 MB/s        0.58 MiB/s
length         8:       6.481 us        1.23 MB/s        1.18 MiB/s
length        16:       6.541 us        2.45 MB/s        2.33 MiB/s
length        32:       6.441 us        4.97 MB/s        4.74 MiB/s
length        64:       7.045 us        9.08 MB/s        8.66 MiB/s
length       128:       7.293 us        17.55 MB/s       16.74 MiB/s
length       256:       8.629 us        29.67 MB/s       28.29 MiB/s
length       512:       9.286 us        55.14 MB/s       52.59 MiB/s
length      1024:       10.649 us       96.15 MB/s       91.70 MiB/s
length      2048:       13.121 us       156.08 MB/s      148.85 MiB/s
length      4096:       17.434 us       234.94 MB/s      224.05 MiB/s
length      8192:       21.409 us       382.64 MB/s      364.92 MiB/s
length     16384:       31.486 us       520.36 MB/s      496.25 MiB/s
length     32768:       49.448 us       662.68 MB/s      631.98 MiB/s
length     65536:       90.665 us       722.83 MB/s      689.35 MiB/s
length    131072:       150.501 us      870.90 MB/s      830.56 MiB/s
length    262144:       266.566 us      983.41 MB/s      937.85 MiB/s
length    524288:       500.195 us      1048.17 MB/s     999.61 MiB/s
length   1048576:       997.626 us      1051.07 MB/s     1002.38 MiB/s
length   2097152:       2055.265 us     1020.38 MB/s     973.11 MiB/s
length   4194304:       4249.680 us     986.97 MB/s      941.25 MiB/s
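
The MB/s and MiB/s columns above appear to follow directly from the message length and the reported time: bytes per microsecond equals decimal MB/s, and dividing by 2^20 instead of 10^6 gives MiB/s. The following sketch checks that reading against one row of the table; the function name `omx_bandwidth` is made up for illustration.

```python
def omx_bandwidth(length_bytes, time_us):
    """Return (MB/s, MiB/s) for one message of length_bytes taking time_us."""
    mb_s = length_bytes / time_us   # bytes per microsecond == 10**6 bytes per second
    mib_s = mb_s * 1e6 / 2**20      # same rate in units of 2**20 bytes per second
    return mb_s, mib_s

# Row copied from the table above: length 1048576, 997.626 us
mb_s, mib_s = omx_bandwidth(1048576, 997.626)
print(f"{mb_s:.2f} MB/s, {mib_s:.2f} MiB/s")
# prints: 1051.07 MB/s, 1002.38 MiB/s
```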

The following omx_perf results are for the worst case "1-hop" routes (head-n1, n0-n1, n0-n2, n1-n2):

length         0:       14.197 us       0.00 MB/s        0.00 MiB/s
length         1:       14.280 us       0.07 MB/s        0.07 MiB/s
length         2:       13.970 us       0.14 MB/s        0.14 MiB/s
length         4:       13.838 us       0.29 MB/s        0.28 MiB/s
length         8:       13.538 us       0.59 MB/s        0.56 MiB/s
length        16:       13.689 us       1.17 MB/s        1.11 MiB/s
length        32:       13.636 us       2.35 MB/s        2.24 MiB/s
length        64:       14.697 us       4.35 MB/s        4.15 MiB/s
length       128:       15.761 us       8.12 MB/s        7.74 MiB/s
length       256:       17.786 us       14.39 MB/s       13.73 MiB/s
length       512:       19.116 us       26.78 MB/s       25.54 MiB/s
length      1024:       21.683 us       47.22 MB/s       45.04 MiB/s
length      2048:       25.549 us       80.16 MB/s       76.44 MiB/s
length      4096:       28.822 us       142.11 MB/s      135.53 MiB/s
length      8192:       38.929 us       210.43 MB/s      200.68 MiB/s
length     16384:       52.026 us       314.92 MB/s      300.33 MiB/s
length     32768:       76.272 us       429.62 MB/s      409.72 MiB/s
length     65536:       144.706 us      452.89 MB/s      431.91 MiB/s
length    131072:       224.688 us      583.35 MB/s      556.33 MiB/s
length    262144:       384.179 us      682.35 MB/s      650.74 MiB/s
length    524288:       704.347 us      744.36 MB/s      709.88 MiB/s
length   1048576:       1390.370 us     754.17 MB/s      719.23 MiB/s
length   2097152:       2910.845 us     720.46 MB/s      687.09 MiB/s
length   4194304:       6007.359 us     698.19 MB/s      665.85 MiB/s
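
The cost of the extra hop can be summarized from the two tables by comparing the zero-length latency and the best bandwidth in each. The sketch below uses only values copied from the tables above (the length 0 rows and the length 1048576 rows); the variable names are illustrative.

```python
# Zero-length latency (us) and best bandwidth (MB/s), copied from the two tables
zero_hop = {"latency_us": 6.992, "peak_mb_s": 1051.07}
one_hop = {"latency_us": 14.197, "peak_mb_s": 754.17}

added_latency = one_hop["latency_us"] - zero_hop["latency_us"]
latency_ratio = one_hop["latency_us"] / zero_hop["latency_us"]
bandwidth_ratio = one_hop["peak_mb_s"] / zero_hop["peak_mb_s"]

print(f"added latency per hop: {added_latency:.3f} us ({latency_ratio:.2f}x)")
print(f"1-hop peak bandwidth is {bandwidth_ratio:.0%} of the 0-hop peak")
```

On these numbers the extra hop roughly doubles small-message latency and costs a little over a quarter of the peak bandwidth.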