Compiler optimizations for improving performance of Harris Corner Detection Algorithm on multicore/SIMD CPUs
- Compiler Optimizations
- SIMD Vectorization
- AVX / SSE2
- Parallelization / OpenMP
- Image/Video processing algo
- Course Project Full report (pdf)
Source (github)
The objective of the assignment is to optimize and tune the Harris corner detection algorithm for performance using locality, SIMD and multicore parallelism transformations. We tune the
Using suitable compiler flags, transforms, and optimizations, we obtain a speedup of 11.5x over the unparallelized reference implementation and 13.5x over OpenCV with the GCC 4.9.2 compiler, and 11.3x over the unparallelized reference implementation and 14.6x over OpenCV with the ICC 15.0 compiler. All experiments were performed on an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz [Haswell μarch, 4 cores, 64 KB L1 private / 256 KB L2 private / 8 MB L3 shared cache].
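To make the three kinds of transformation concrete, here is a minimal sketch in C of a Harris response kernel. It is not the project's code: the names `harris_at` and `harris_response`, the central-difference gradients, the unweighted 3x3 window, and the border handling are all assumptions chosen for brevity. The gradient products and the window sum are fused into a single pass so that no intermediate images are written (locality), the inner loop is marked for SIMD, and rows are split across cores with OpenMP.

```c
#include <stddef.h>

/* Hypothetical helper: Harris response R = det(M) - k * trace(M)^2 at pixel
 * (x, y), where M is accumulated over a 3x3 window of gradient products.
 * Gradients are plain central differences (an assumption; Sobel is common). */
static inline float harris_at(const float *img, int w, int x, int y, float k)
{
    float sxx = 0.0f, syy = 0.0f, sxy = 0.0f;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            const float *p = img + (size_t)(y + dy) * w + (x + dx);
            float ix = 0.5f * (p[1] - p[-1]);   /* horizontal gradient */
            float iy = 0.5f * (p[w] - p[-w]);   /* vertical gradient   */
            sxx += ix * ix;
            syy += iy * iy;
            sxy += ix * iy;
        }
    }
    float det = sxx * syy - sxy * sxy;
    float tr  = sxx + syy;
    return det - k * tr * tr;
}

/* Fills out[] with the Harris response for all pixels at least 2 away from
 * the image border; border pixels are left untouched by this sketch. */
void harris_response(const float *img, float *out, int w, int h, float k)
{
    /* Multicore: rows are independent, so distribute them over threads. */
    #pragma omp parallel for schedule(static)
    for (int y = 2; y < h - 2; ++y) {
        /* SIMD: ask the compiler to vectorize across x. */
        #pragma omp simd
        for (int x = 2; x < w - 2; ++x) {
            out[(size_t)y * w + x] = harris_at(img, w, x, y, k);
        }
    }
}
```

With GCC, building with `-O3 -march=native -fopenmp` would enable both pragmas; without `-fopenmp` they are ignored and the code runs serially. A typical value for `k` is around 0.04. Full fusion recomputes each gradient up to nine times, so a tuned version would more likely tile the image and keep small intermediate buffers in cache rather than fuse everything as done here.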
Highlighted results are shown below. A detailed performance comparison of ICC vs. GCC, including auto-vectorization and parallel scaling, is available in the full report.
| | OpenCV | Reference | Optimized | Speedup by locality transforms |
| No Vectorize | 3515.29 | 3767.32 | 2442.4 | 1.54x |
| Vectorize | 3566.35 | 3035.41 | 930.90 | 3.26x |
| Vectorization speedup | - | 1.24x | 2.62x | |
◉ Speedup and Execution time (in ms) using ICC 15.0
| | OpenCV | Reference | Optimized | Speedup w.r.t. Reference |
| 1 core | 3567.95 | 2755.83 | 904.61 | 3.04x |
| 2 cores | - | 1617.88 | 355.724 | 4.54x |
| 4 cores | - | 1444.89 | 243.19 | 5.94x |
| Speedup by parallelism | - | 1.90x | 3.72x | |
◉ Speedup and Execution time (in ms) using GCC 4.9.2
| | OpenCV | Reference | Optimized | Speedup w.r.t. Reference |
| 1 core | 3566.35 | 3035.41 | 930.90 | 3.26x |
| 2 cores | - | 1990.6 | 422.54 | 4.71x |
| 4 cores | - | 1940.92 | 264.73 | 7.34x |
| Speedup by parallelism | - | 1.56x | 3.52x | |
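The headline figures quoted earlier can be recovered from the tables by dividing the single-core reference (or OpenCV) time by the 4-core optimized time. The short C sketch below does that arithmetic for the GCC 4.9.2 numbers. The helper names are made up for illustration, and the parallel-efficiency figure is a derived quantity that does not appear in the original tables.

```c
#include <stdio.h>

/* Speedup of a run taking t_new_ms relative to a baseline taking t_base_ms. */
double speedup(double t_base_ms, double t_new_ms)
{
    return t_base_ms / t_new_ms;
}

/* Parallel efficiency: speedup over the 1-core time divided by core count. */
double parallel_efficiency(double t1_ms, double tn_ms, int cores)
{
    return speedup(t1_ms, tn_ms) / cores;
}

/* Prints the derived figures for the GCC 4.9.2 table (times in ms). */
void print_gcc_summary(void)
{
    const double opencv_1core    = 3566.35;
    const double reference_1core = 3035.41;
    const double optimized_1core = 930.90;
    const double optimized_4core = 264.73;

    printf("vs reference: %.1fx\n", speedup(reference_1core, optimized_4core));
    printf("vs OpenCV:    %.1fx\n", speedup(opencv_1core, optimized_4core));
    printf("4-core efficiency of optimized code: %.2f\n",
           parallel_efficiency(optimized_1core, optimized_4core, 4));
}
```

Calling `print_gcc_summary()` from a `main` function prints the three derived values; substituting the ICC 15.0 timings gives the corresponding ICC figures.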