Various Limulus Benchmarks
HPL Performance
i5-2400S Sandy Bridge
- 200.3 GFLOPS N=40220 (Raw HPL Results)
- 58% of Peak (3.3GHz * 4 cores * 8 DP FLOPS/cycle) + (2.5GHz * 12 cores * 8 DP FLOPS/cycle) = 345.6 GFLOPS Peak (peak arithmetic sketched after the HPL results below)
- Three i5-2400S and one i5-2500K each with 4 GB DDR3 RAM, GbE, Intel MKL and compilers
i5-3470S Ivy Bridge
- 256.4 GFLOPS N=58800 (Raw HPL Results)
- 69% of Peak (2.9GHz * 16 cores * 8 DP FLOPS/cycle) = 371.2 GFLOPS Peak
- Four i5-3470S each with 8 GB DDR3 RAM, GbE, Intel MKL and compilers
i5-4570S Haswell
- 385.5 GFLOPS N=60000 (Raw HPL Results)
- 52% of Peak (2.9GHz * 16 cores * 16 DP FLOPS/cycle) = 742.4 GFLOPS Peak (Note: Haswell is now 16 FLOPS/cycle)
- Four i5-4570S each with 8 GB DDR3 RAM, GbE, Intel MKL and compilers
- 567.4 GFLOPS N=60000 (Raw HPL Results)
- 76% of Peak (2.9GHz * 16 cores * 16 DP FLOPS/cycle) = 742.4 GFLOPS Peak (Note: Haswell is now 16 FLOPS/cycle)
- Four i5-4570S each with 8 GB DDR3 RAM, 10-GbE, Intel MKL and compilers
i7-4770S Haswell
- 444.8 GFLOPS N=86000 (Raw HPL Results)
- 56% of Peak (3.1GHz * 16 cores * 16 DP FLOPS/cycle) = 793.6 GFLOPS Peak (Note: Haswell is now 16 FLOPS/cycle)
- Four i7-4770S each with 16 GB DDR3 RAM, GbE, HT disabled, Intel MKL and compilers
- 498.3 GFLOPS N=126000 (Raw HPL Results)
- 63% of Peak (3.1 GHz * 16 cores * 16 DP FLOPS/cycle) = 793.6 GFLOPS Peak (Note: Haswell is now 16 FLOPS/cycle)
- Four i7-4770S each with 32 GB DDR3 RAM, GbE, HT disabled, Intel MKL and compilers
i5-6500 Skylake
- 480.2 GFLOPS N=86000 (Raw HPL Results)
- 59% of Peak (3.2GHz * 16 cores * 16 DP FLOPS/cycle) = 819.2 GFLOPS Peak (Note: Skylake is 16 FLOPS/cycle)
- Four i5-6500 each with 16 GB DDR4 RAM, GbE, Intel MKL and compilers
- 658.3 GFLOPS N=86000 (Raw HPL Results)
- 80% of Peak (3.2GHz * 16 cores * 16 DP FLOPS/cycle) = 819.2 GFLOPS Peak (Note: Skylake is 16 FLOPS/cycle)
- Four i5-6500 each with 16 GB DDR4 RAM, 10-GbE, Intel MKL and compilers
i7-6700 Skylake
- 592.5 GFLOPS N=126000 (Raw HPL Results)
- 68% of Peak (3.4GHz * 16 cores * 16 DP FLOPS/cycle) = 870.4 GFLOPS Peak (Note: Skylake is 16 FLOPS/cycle)
- Four i7-6700 each with 32 GB DDR4 RAM, GbE, HT disabled, Intel MKL and compilers
- 640.4 GFLOPS N=180000 (Raw HPL Results)
- 74% of Peak (3.4GHz * 16 cores * 16 DP FLOPS/cycle) = 870.4 GFLOPS Peak (Note: Skylake is 16 FLOPS/cycle)
- Four i7-6700 each with 64 GB DDR4 RAM, GbE, HT disabled, Intel MKL and compilers
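All of the peak numbers above come from the same arithmetic: clock speed times total cores times double-precision FLOPS per cycle, summed over CPU types when the cluster is mixed. The following is a minimal Python sketch of that calculation, using only values that appear in the entries above:

def peak_gflops(parts):
    # parts: list of (GHz, cores, DP FLOPS/cycle) tuples, one per CPU type in the cluster
    return sum(ghz * cores * flops for ghz, cores, flops in parts)

# Mixed Sandy Bridge cluster: one i5-2500K (3.3GHz, 4 cores) + three i5-2400S (2.5GHz, 12 cores)
sandy_peak = peak_gflops([(3.3, 4, 8), (2.5, 12, 8)])     # 345.6 GFLOPS
# Haswell cluster: four i5-4570S (2.9GHz, 16 cores total); FMA gives 16 DP FLOPS/cycle
haswell_peak = peak_gflops([(2.9, 16, 16)])               # 742.4 GFLOPS

print(f"Sandy Bridge: {sandy_peak:.1f} GFLOPS peak, {200.3 / sandy_peak:.0%} achieved")
print(f"Haswell (10-GbE): {haswell_peak:.1f} GFLOPS peak, {567.4 / haswell_peak:.0%} achieved")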
Hadoop
Updated Tests
The cluster consists of four nodes with the following hardware specifications.
- Processor: Intel Core i5-3470S CPU @ 2.90GHz (65W Quad-core Ivy Bridge)
- Memory: 16 GB (DDR3-1600)
- Interconnect: Intel GbE (MTU=4500)
- HDFS Storage: 192 GB SSD (Plextor M5S 128 GB, Plextor M5S 64 GB)
- Hadoop Version: 2.2.0 (Hortonworks HDP 2.0)
- HDFS size: 696.23 GB
- HDFS replication: 2
- OS: Scientific Linux 6.4
- Number of Data nodes: 4
- Number of worker nodes: 4
TestDFSIO (average of 10 runs):
- write: 13.86 MB/sec Single File; 221.80 MB/sec Total Throughput
- read: 53.97 MB/sec Single File; 863.49 MB/sec Total Throughput
terasort (100 GB):
- 886 seconds, 112.87 MB/sec
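As a quick sanity check of the numbers above, the terasort rate follows from treating 100 GB as 100,000 MB, and the TestDFSIO total throughput is simply the single-file rate multiplied by the number of concurrent files (the file count is not stated above; the ratio suggests 16). A minimal sketch:

# terasort: 100 GB counted as 100,000 MB, completed in 886 seconds
print(f"terasort: {100_000 / 886:.2f} MB/sec")       # 112.87 MB/sec

# TestDFSIO write: total throughput / single-file throughput = implied concurrent files
print(f"implied files: {221.80 / 13.86:.1f}")        # ~16.0 (inferred, not stated above)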
Previous Results
I presented a tutorial on Hadoop recently and used my original Limulus to demonstrate a four-node Hadoop cluster. Some of the hardware is actually quite old. The nodes each have a single dual-core E6550 with 4 GB RAM and a 64 GB SSD. The head node has a quad-core Sandy Bridge (i5-2400S) with 4 GB RAM, a 64 GB SSD, and 0.5 TB of RAID1. Connections used the mediocre consumer-grade Realtek GbE controllers on the nodes. Overall it worked well. I plan on making a Hadoop VNFS so that it is possible to boot the nodes into Hadoop. I used the Hortonworks HDP for the Hadoop install. The following are the results of the TestDFSIO benchmark (which tests the Hadoop Distributed File System):
----- TestDFSIO ----- : write
           Date & time: Wed Feb 06 22:23:24 EST 2013
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 9.684751649071087
Average IO rate mb/sec: 9.751626968383789
 IO rate std deviation: 0.8502143053358981
    Test exec time sec: 141.977

----- TestDFSIO ----- : read
           Date & time: Wed Feb 06 22:26:34 EST 2013
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 44.49229838314988
Average IO rate mb/sec: 51.67889404296875
 IO rate std deviation: 23.5285090607248
    Test exec time sec: 58.388
Switchless 10 Gigabit Ethernet (10 GbE)
These tests are preliminary, using old hardware and untuned software. This approach has since been abandoned; the results remain here for reference.
The first issue is how to add 10 GbE without a switch, using low-cost dual-port 10 GbE cards. The idea is to create a four-node loop using Ethernet bridging. A simple loop requires the Spanning Tree Protocol, which blocks one link and leaves one 2-hop route, two 1-hop routes, and three 0-hop routes. If one of the bridges is replaced by a bonded link in "mode 0" (round robin), there are effectively four 1-hop routes and two 0-hop routes, and no route effectively costs more than "1 hop" in terms of latency and bandwidth (a hop-count check for the spanning-tree case is sketched after the diagram below).
LAN -- head node ------------- n2
         bridge              bridge
           |                    |
           |                    |
           |                    |
          n0 ----------------- n1
         bridge          bonded mode 0
                         (Round Robin)
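To make the hop counting concrete, here is a minimal sketch of the spanning-tree case. It assumes STP blocks the n1-n2 link; blocking any other link of the loop gives the same tally of 0-, 1-, and 2-hop routes:

from itertools import combinations

# Spanning tree left after STP blocks one link of the loop (assumed here: n1-n2 is blocked)
tree = {"head": ["n0", "n2"], "n0": ["head", "n1"], "n2": ["head"], "n1": ["n0"]}

def hops(src, dst):
    # Count intermediate bridges on the unique tree path from src to dst (level-by-level BFS)
    seen, frontier, links = {src}, [src], 0
    while dst not in frontier:
        frontier = [n for f in frontier for n in tree[f] if n not in seen]
        seen.update(frontier)
        links += 1
    return links - 1   # a direct link crosses 0 intermediate bridges

counts = {}
for a, b in combinations(["head", "n0", "n1", "n2"], 2):
    h = hops(a, b)
    counts[h] = counts.get(h, 0) + 1
print(counts)   # {0: 3, 1: 2, 2: 1} -> three 0-hop, two 1-hop, one 2-hop routes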
Hardware: The head node is a Sandy Bridge i5-2400S with 4 GB RAM; the nodes each have a single dual-core E6550 with 4 GB RAM. The 10 GbE NICs are Chelsio T420-SO-CR.
Software: open-mx
The following are the best case "0-hop" routes (head-n0, head-n2) for omx_perf (Note: MTU is 1500, with no other tuning).
length 0:         6.992 us     0.00 MB/s     0.00 MiB/s
length 1:         6.495 us     0.15 MB/s     0.15 MiB/s
length 2:         6.460 us     0.31 MB/s     0.30 MiB/s
length 4:         6.575 us     0.61 MB/s     0.58 MiB/s
length 8:         6.481 us     1.23 MB/s     1.18 MiB/s
length 16:        6.541 us     2.45 MB/s     2.33 MiB/s
length 32:        6.441 us     4.97 MB/s     4.74 MiB/s
length 64:        7.045 us     9.08 MB/s     8.66 MiB/s
length 128:       7.293 us    17.55 MB/s    16.74 MiB/s
length 256:       8.629 us    29.67 MB/s    28.29 MiB/s
length 512:       9.286 us    55.14 MB/s    52.59 MiB/s
length 1024:     10.649 us    96.15 MB/s    91.70 MiB/s
length 2048:     13.121 us   156.08 MB/s   148.85 MiB/s
length 4096:     17.434 us   234.94 MB/s   224.05 MiB/s
length 8192:     21.409 us   382.64 MB/s   364.92 MiB/s
length 16384:    31.486 us   520.36 MB/s   496.25 MiB/s
length 32768:    49.448 us   662.68 MB/s   631.98 MiB/s
length 65536:    90.665 us   722.83 MB/s   689.35 MiB/s
length 131072:  150.501 us   870.90 MB/s   830.56 MiB/s
length 262144:  266.566 us   983.41 MB/s   937.85 MiB/s
length 524288:  500.195 us  1048.17 MB/s   999.61 MiB/s
length 1048576: 997.626 us  1051.07 MB/s  1002.38 MiB/s
length 2097152: 2055.265 us 1020.38 MB/s   973.11 MiB/s
length 4194304: 4249.680 us  986.97 MB/s   941.25 MiB/s
The following are the worst case "1-hop" routes (head-n1, n0-n1, n0-n2, n1-n2) for omx_perf:
length 0:        14.197 us     0.00 MB/s     0.00 MiB/s
length 1:        14.280 us     0.07 MB/s     0.07 MiB/s
length 2:        13.970 us     0.14 MB/s     0.14 MiB/s
length 4:        13.838 us     0.29 MB/s     0.28 MiB/s
length 8:        13.538 us     0.59 MB/s     0.56 MiB/s
length 16:       13.689 us     1.17 MB/s     1.11 MiB/s
length 32:       13.636 us     2.35 MB/s     2.24 MiB/s
length 64:       14.697 us     4.35 MB/s     4.15 MiB/s
length 128:      15.761 us     8.12 MB/s     7.74 MiB/s
length 256:      17.786 us    14.39 MB/s    13.73 MiB/s
length 512:      19.116 us    26.78 MB/s    25.54 MiB/s
length 1024:     21.683 us    47.22 MB/s    45.04 MiB/s
length 2048:     25.549 us    80.16 MB/s    76.44 MiB/s
length 4096:     28.822 us   142.11 MB/s   135.53 MiB/s
length 8192:     38.929 us   210.43 MB/s   200.68 MiB/s
length 16384:    52.026 us   314.92 MB/s   300.33 MiB/s
length 32768:    76.272 us   429.62 MB/s   409.72 MiB/s
length 65536:   144.706 us   452.89 MB/s   431.91 MiB/s
length 131072:  224.688 us   583.35 MB/s   556.33 MiB/s
length 262144:  384.179 us   682.35 MB/s   650.74 MiB/s
length 524288:  704.347 us   744.36 MB/s   709.88 MiB/s
length 1048576: 1390.370 us  754.17 MB/s   719.23 MiB/s
length 2097152: 2910.845 us  720.46 MB/s   687.09 MiB/s
length 4194304: 6007.359 us  698.19 MB/s   665.85 MiB/s
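One note on units in the omx_perf output: the MB/s and MiB/s columns appear to be derived directly from the message length divided by the reported time (this relationship is inferred from the numbers themselves, not from omx_perf documentation). For example:

# Take the 1 MB row from the 0-hop results above
length_bytes, time_us = 1048576, 997.626
mb_per_s  = length_bytes / time_us          # bytes per microsecond == decimal MB per second
mib_per_s = mb_per_s * 1e6 / 2**20          # convert decimal MB/s to binary MiB/s
print(f"{mb_per_s:.2f} MB/s, {mib_per_s:.2f} MiB/s")   # 1051.07 MB/s, 1002.38 MiB/s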