
"Solving convex optimization problems, whether generated directly or as sub-problems when solving non-convex problems, is the principal requirement in computational optimization. Ever-increasing problem size, particularly as a result of the growing importance and scale of data analysis, is driving the development of novel techniques for solving convex optimization problems using high performance computing systems." Convex Optimization and Beyond workshop, 27 June 2014, ICMS Edinburgh.
Parallel Optimization Services --- Parallel Computation in Mathematical Algorithmic Optimization (PCMAO)
The Research Area :

Title : Parallel Optimization in Engineering
Author : Henry Kar Ming Chan

Parallel optimization in computational science and engineering requires solving large-scale nonlinear optimization problems in a parallel processing mode. OpenMP is an industry-standard Application Programming Interface (API) for multi-platform shared-memory multiprocessing that supports forking and joining program and subprogram tasks. Among available optimizers, Knitro is an advanced solver for nonlinear optimization problems, handling bound constraints, nonlinear equalities and inequalities (both convex and non-convex), and complementarity constraints. Knitro offers state-of-the-art algorithmic options, such as interior-point methods and an active-set method, which can be switched between automatically or manually by changing parameter settings. Knitro can treat inequality constraints with an interior-point algorithm and a direct solution of the barrier sub-problems; for large-scale nonlinear problems, the barrier sub-problems can instead be solved by conjugate-gradient iterations. For detecting infeasibility, Knitro can treat inequality constraints with an active-set algorithm, which is especially beneficial when a good initial point is available and when solving a sequence of related problems. When the initial guess provided by the user and all subsequent iterates also satisfy these inequality constraints, a feasible mode can be set in the parameter list. A simple scenario is used as the test case: different active-set methods are run in parallel using OpenMP section, loop, or task parallelism, and within each active-set method a multi-start feature randomly selects initial points so as to find globally optimal points where possible.
The parallel Knitro nonlinear programming solver and the Intel MKL library for vector and matrix processing are used together with OpenMP directives. Gradient vectors and Hessian matrices for directional and partial derivatives are generated through automatic differentiation with OpenMP features. Finally, vectorization of data, data pre-fetching, and inter-procedural optimization of parallel OpenMP programs should be considered during compilation to achieve the best performance.

In parallel branch and bound: strong branching and pseudo-costs; depth-first search and best-first search. Potential problems with OpenMP: load balancing, thread cancellation, the error model, data affinity, data sharing, data dependency, and task dependency. Potential parallelism in strong branching includes generating candidate lists with job cancellation, memory-aware load balancing for parallelizing the (most likely nested) while-loop subprogram within a maximum iteration limit, and keeping work near its search-tree nodes by controlling thread placement on processors (spreading the threads to idle processors, instead of keeping them close, for strong branching and depth-first search at root nodes and near-root nodes, especially for large-scale integer problems). Note also the "free" ramp-up phase at the beginning and the "free" processors at the end.

Desirable Features for Parallel Branch and Bound

  • Parallel Multi-start (lower/upper bounds obtained from multi-starts)
  • Priority for variables with Branch Predictors features (Automatic Tuner, Profiler and Compiler options)
  • Fork-Join hierarchy (Workshare with Transactional Memory and Speculative Execution)
  • Vectorization/SIMD features (Common Array Shaping and Prefetch)
  • Loop, Section, Concurrent and Task parallelism (Looping constructs and Task Model refinements)
  • Memory-aware load balancing (CPU/Thread Affinity and Cache Hierarchy)
  • Cancellation features (Full Error Model and Event Based Programming)
Footnote : Some developers find that OpenMP in gfortran with the Intel MKL BLAS and sparse linear solver reaches linear speedup in the number of cores. If your application does not exploit the CPU's SIMD capabilities, you can easily lose a factor of 8x or 16x relative to the peak performance of the CPU.
Benchmarks :


NEW CASE for the F-8 aircraft (in parallel). A solution obtained with Knitro version 9.0 in the NEOS server environment was tested by Henry Kar Ming Chan for the reduced model. The reduced model is defined by using a smaller number of time periods (i.e. ntperu := 10). The standard defaults were used, except mainly for the strong-branching strategies, maximum iterations, and maximum nodes, with the parallel feature in MKL BLAS only and without external optimizers. The final objective value is 0.020929. The number of nodes processed is 519 and the number of sub-problems solved is 526. Total program CPU time is 30.6946 seconds, of which 12.14195 seconds were spent in evaluations. The intervals of the equidistant grid on which w(t) = 1 holds, counting from 0 to 59, are 0, 1, 31, 32, 44, 52, and 53.

A future release of Knitro may include features such as a new object-oriented interface, a new SQP algorithm for derivative-free optimization (DFO), and parallel branch and bound.


A sample gfortran program using OpenMP 5.0 features for parallel branch and bound is attached for reference and testing purposes:

program parBandB
  use OMP_LIB
  implicit none
  integer :: shared_root_node, private_itr, max_itr
  double precision :: shared_max, private_temp, private_temp_prod
  double precision, allocatable, dimension(:) :: myarray

  max_itr = 100
  allocate ( myarray(max_itr) )
  ! Descending "bounds": myarray(1) is deliberately the largest entry,
  ! so the search can only terminate by reaching index 1.
  do private_itr = 1, max_itr
     myarray(private_itr) = (max_itr + 1.0d0 - private_itr) / max_itr
  end do
  myarray(1) = 1.001d0

  shared_root_node = max_itr
  shared_max = 0.999d0

!$OMP PARALLEL proc_bind(spread) num_threads(4) &
!$OMP & shared(shared_max, shared_root_node, myarray)
!$OMP SINGLE
  ! One thread drives the outer loop; each pass spawns a taskloop over
  ! the remaining candidate nodes. Note: the depend/affinity/cancel
  ! variants sketched in earlier drafts require OMP_CANCELLATION and
  ! careful taskgroup placement, so this version keeps to the portable
  ! core (a critical section in place of atomic grouping constructs).
  do while (shared_root_node > 1)
!$OMP TASKLOOP grainsize(4) priority(10) untied nogroup &
!$OMP & private(private_temp, private_temp_prod)
     do private_itr = 1, shared_root_node
        call random_number(private_temp)
        private_temp_prod = private_temp * myarray(private_itr)
        ! The critical section keeps the two shared bounds consistent
        ! and free of data races.
!$OMP CRITICAL
        if (private_temp_prod > shared_max .or. private_temp > shared_max) then
           shared_max = max(shared_max, private_temp, private_temp_prod)
           shared_root_node = min(shared_root_node, private_itr)
        end if
!$OMP END CRITICAL
     end do
!$OMP END TASKLOOP
!$OMP TASKWAIT
  end do
!$OMP END SINGLE
!$OMP END PARALLEL

  print *, 'Maximum bound', shared_max
  deallocate( myarray )
end program parBandB

REMARK : It is worth enabling nested parallelism and cancellation, setting the maximum number of active levels, and setting the maximum task priority. Sometimes it is also worth setting a thread limit to prevent runaway oversubscription, setting the wait policy (and using the taskyield directive) to minimize idle time, enabling dynamic adjustment of the number of threads, and specifying on which CPUs the threads should be placed.

Performance Analysis : (Concurrent Parallelization - Load Balance and No Idle Threads)

Outer Taskloop SIMD thread 0 running on cpu 0
Outer Taskloop SIMD thread 1 running on cpu 4

                                                                                 Current Thread             Previous Threads

Inner Task thread 0 of Outer Task 0 on cpu 0 (E-HH:HH:HH S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz)..
Inner Task thread 1 of Outer Task 0 on cpu 1 (E-GG:GG:GG S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz)...
Inner Task thread 2 of Outer Task 0 on cpu 2 (E-FF:FF:FF S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz)...
Inner Task thread 3 of Outer Task 0 on cpu 3 (E-EE:EE:EE S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz)...

Inner Task thread 0 of Outer Task 1 on cpu 4 (E-DD:DD:DD S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz)...
Inner Task thread 1 of Outer Task 1 on cpu 5 (E-CC:CC:CC S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz)...
Inner Task thread 2 of Outer Task 1 on cpu 6 (E-BB:BB:BB S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz)...
Inner Task thread 3 of Outer Task 1 on cpu 7 (E-AA:AA:AA S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz) (E-xx:yy:zz S-xx:yy:zz)...

REMARK : It is worth trying CPU and memory affinity where possible. It is also good to check that the start times of the other CPUs come before the earliest end time. The mergeable clause is useful if the compiler supports it. The 'Taskloop' construct has grainsize, num_tasks, collapse, and priority clauses; the nogroup clause and the 'Taskgroup' construct should be used carefully. In OpenMP 5.0, more flexible features such as 'Atomic' constructs and 'Reduction' clauses are allowed under the 'Taskgroup', 'Task', 'Taskloop', and 'Loop' constructs. Concurrent parallelization allows the depend clause and the depobj construct to be used in a 'Task' construct, with a 'Taskloop' construct embedding (nested) 'Loop' constructs.

Final remark : For nested parallelism, the issues are compiler optimizations, runtime system tuning, and context-aware parallel API design. Note the performance criteria for (nested parallel) Loop/Taskloop SIMD constructs combined with Atomic constructs. Important clauses for the SIMD construct include: Collapse (for nested loops), Linear (additional induction variables), Simdlen (preferred number of iterations to execute concurrently), Safelen (maximum number of iterations that can be executed concurrently), Nontemporal (the listed storage locations have low temporal locality), and Aligned (tells the compiler about data alignment). Compiler options such as -fopenmp-simd should also be noted. Remember to use OpenMP functions such as omp_get_num_teams(), omp_get_team_num(), and omp_get_team_size() to check teams information. It is also important to know state-of-the-art methods for detecting errors related to parallelism and concurrency in OpenMP branch-and-bound applications.

Learn by example : DataRaceBench is a benchmark suite designed to systematically and quantitatively evaluate the effectiveness of data-race detection tools. It includes a set of microbenchmarks with and without data races, with parallelism expressed through OpenMP directives.