
"Solving convex optimization problems, whether generated directly or as subproblems when solving nonconvex problems, is the principal requirement in computational optimization. Everincreasing problem size, particularly as a result of the growing importance and scale of data analysis, is driving the development of novel techniques for solving convex optimization problems using high performance computing systems." Convex Optimization and Beyond workshop, 27 June 2014, ICMS Edinburgh. Parallel Optimization Services  Parallel Computation in Mathematical Algorithmic Optimization (PCMAO) The Research Area :
Title : Parallel Optimization in
Engineering
Author : Henry Kar Ming Chan
For parallel optimization in computational science and engineering, large-scale nonlinear optimization problems need to be solved in parallel processing mode. OpenMP supports the multi-platform shared-memory multiprocessing architecture and is an industry-standard Application Programming Interface (API) for forking and joining program and subprogram tasks. Among the available optimizers, Knitro is an advanced solver for nonlinear optimization problems, handling bound constraints, nonlinear equalities and inequalities (both convex and nonconvex), and complementarity constraints. This is because Knitro has state-of-the-art features in its algorithmic options, such as interior-point methods and an active-set method, which can be switched between one another automatically or manually by changing the parameter settings. Knitro can handle the inequality constraints by an interior-point algorithm with direct solution of the barrier subproblems (each barrier subproblem minimizes f(x) - mu * sum of log(s_i) over positive slacks s_i, with the barrier parameter mu driven towards zero). For large-scale nonlinear problems, Knitro can instead solve the barrier subproblems by conjugate gradient iterations. For detecting infeasibility, Knitro can handle the inequality constraints by an active-set algorithm, which is especially beneficial when a good initial point is available and when solving a sequence of related problems. If the initial guess provided by the user and all subsequent iterates satisfy these inequality constraints, a feasible mode can be set in the parameter list. A simple test scenario is used as the test case. Different active-set methods are tested in parallel using OpenMP section, loop or task parallelism. Within each active-set method, multistart features are used to randomly select the initial points so as to find the global optimum where possible; a sketch combining both ideas follows below. The parallel Knitro nonlinear programming solver and the Intel MKL library for vector and matrix processing are used with OpenMP directives.
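The source does not include the solver driver itself; the following minimal sketch (with a hypothetical local_solve subroutine standing in for a real Knitro call, not the Knitro API) shows how OpenMP section parallelism and a multistart loop might be combined :

program multistart_sections
  implicit none
  integer, parameter :: nstarts = 8
  double precision :: best1, best2
  best1 = huge(1.0d0)
  best2 = huge(1.0d0)
  ! Fork : one section per algorithmic option, joined at the end.
  !$OMP PARALLEL SECTIONS num_threads(2)
  !$OMP SECTION
  call multistart(1, nstarts, best1)     ! option 1, e.g. interior-point
  !$OMP SECTION
  call multistart(2, nstarts, best2)     ! option 2, e.g. active-set
  !$OMP END PARALLEL SECTIONS
  print *, 'best objective over both options :', min(best1, best2)
contains
  subroutine multistart(option, n, best)
    integer, intent(in) :: option, n
    double precision, intent(out) :: best
    double precision :: x0(2), f
    integer :: i
    best = huge(1.0d0)
    do i = 1, n
      call random_number(x0)             ! random initial point in [0,1)^2
      call local_solve(option, x0, f)
      best = min(best, f)                ! keep the best local optimum
    end do
  end subroutine multistart
  subroutine local_solve(option, x0, f)
    ! Hypothetical stand-in for a real local solve (e.g. a Knitro call).
    integer, intent(in) :: option
    double precision, intent(in) :: x0(2)
    double precision, intent(out) :: f
    f = (x0(1) - 0.5d0)**2 + (x0(2) - 0.25d0)**2 + 1.0d-3 * option
  end subroutine local_solve
end program multistart_sections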
Gradient and Hessian vectors for directional and partial derivatives are generated through automatic differentiation, with OpenMP features used to evaluate independent components in parallel.
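The automatic differentiation code itself is not shown in the source; as a hedged sketch of the parallel derivative evaluation, the loop below fills gradient components in parallel, with central finite differences standing in for automatic differentiation :

! Sketch : evaluate gradient components in parallel. Central finite
! differences stand in here for automatic differentiation.
subroutine parallel_gradient(f, n, x, g)
  implicit none
  interface
    double precision function f(n, x)
      integer, intent(in) :: n
      double precision, intent(in) :: x(n)
    end function f
  end interface
  integer, intent(in) :: n
  double precision, intent(in) :: x(n)
  double precision, intent(out) :: g(n)
  double precision, parameter :: h = 1.0d-6
  double precision :: xp(n), xm(n)
  integer :: i
  !$OMP PARALLEL DO private(xp, xm)
  do i = 1, n
    xp = x                               ! per-thread copies of the point
    xm = x
    xp(i) = xp(i) + h
    xm(i) = xm(i) - h
    g(i) = (f(n, xp) - f(n, xm)) / (2.0d0 * h)
  end do
  !$OMP END PARALLEL DO
end subroutine parallel_gradient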
Finally, vectorization of data, prefetching of data and interprocedural optimization for parallel OpenMP programs should be considered during program compilation to achieve the best performance.
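For example, with gfortran these can be requested at compile time with flags such as the following (an illustrative choice, not a prescription from the source) :

gfortran -O3 -fopenmp -ftree-vectorize -fprefetch-loop-arrays -flto -march=native program.f90 -o program

Here -ftree-vectorize enables SIMD vectorization (already implied by -O3), -fprefetch-loop-arrays inserts software prefetches for array data, and -flto performs link-time interprocedural optimization.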
In Parallel Branch and Bound : strong branching and pseudocosts; depth-first search and best-first search.
Potential Problems with OpenMP : Load Balancing, Thread Cancellation, Error Model, Data Affinity, Data Sharing, Data Dependency, Task Dependency.
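To make the data-sharing problem concrete, here is a minimal sketch (not from the source) in which a reduction clause prevents a race on a shared incumbent bound :

! Sketch : without reduction(min:best), concurrent updates of the
! shared bound would race; the reduction clause makes the loop safe.
program bound_reduction
  implicit none
  integer :: i
  double precision :: best, trial
  best = huge(1.0d0)
  !$OMP PARALLEL DO private(trial) reduction(min:best)
  do i = 1, 1000
    call random_number(trial)            ! stand-in for a subproblem bound
    best = min(best, trial)
  end do
  !$OMP END PARALLEL DO
  print *, 'best bound :', best
end program bound_reduction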
Potential parallelism in strong branching includes generating lists of candidates using job cancellation, memory-aware load-balancing features for parallelizing the (most likely nested) while-loop subprogram within a maximum iteration limit, and applying within-subprogram search near the tree nodes by controlling thread placement on processors (spreading [instead of keeping close] the threads onto idle processors for strong branching and depth-first searching at root nodes and at the nodes near the roots, especially for large-scale integer problems). Please also note the "free" ramp-up at the beginning and the "free" processors at the end.
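A minimal sketch of such thread placement follows, assuming OMP_PLACES is set to cores; the outer team is spread across the machine while each inner team stays close to its parent :

! Sketch : nested regions with spread/close thread placement.
! Run with e.g. : OMP_PLACES=cores OMP_NESTED=true ./a.out
program placement
  use OMP_LIB
  implicit none
  call omp_set_nested(.true.)
  !$OMP PARALLEL num_threads(2) proc_bind(spread)
  print *, 'outer thread', omp_get_thread_num()
  !$OMP PARALLEL num_threads(4) proc_bind(close)
  print *, '  inner thread', omp_get_thread_num()
  !$OMP END PARALLEL
  !$OMP END PARALLEL
end program placement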
Desirable Features for Parallel Branch and Bound :
- Parallel Multistart (lower/upper bounds obtained from multistarts)
- Priority for variables with Strong Branching features (Automatic Tuner; see the sketch after this list)
- Fork-Join hierarchy (Workshare with Transactional Memory and Speculative Execution)
- Vectorization/SIMD features (Common Array Shaping)
- Loop, Section and Task parallelism (looping constructs and Task Model refinements)
- Memory-aware load balancing (CPU Affinity)
- Cancellation features (Full Error Model)
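The priority item above can be sketched with the task priority and taskloop constructs; note that these are OpenMP 4.5 features, one version beyond the 4.0 level used elsewhere on this page :

! Sketch (OpenMP 4.5) : taskloop for node-level task parallelism and a
! priority hint for a promising branching task. The hint takes effect
! only when OMP_MAX_TASK_PRIORITY is set to a nonzero value.
program tasks_sketch
  use OMP_LIB
  implicit none
  integer :: i
  !$OMP PARALLEL num_threads(4)
  !$OMP SINGLE
  !$OMP TASK priority(10)
  print *, 'high-priority task : strong-branching candidate'
  !$OMP END TASK
  !$OMP TASKLOOP grainsize(2)
  do i = 1, 8
    print *, 'node task', i, 'on thread', omp_get_thread_num()
  end do
  !$OMP END TASKLOOP
  !$OMP END SINGLE
  !$OMP END PARALLEL
end program tasks_sketch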
Footnote : Some developers find that OpenMP in gfortran with the Intel MKL BLAS and Sparse Linear Solver reaches linear speedup in the number of cores. If your application does not exploit the SIMD capabilities of the CPU, you can easily lose a factor of 16x or 8x compared to its peak performance. External Link : 21-25 July 2014.
Benchmarks : NEW CASE for F8 aircraft (In Parallel). A solution obtained by Knitro version 9.0 under the NEOS server environment was tested by Henry Kar Ming Chan for the reduced model. The reduced model is defined by using a smaller number of time periods (i.e. ntperu := 10). The standard defaults were used, except mainly the strong branching strategies, maximum iterations and maximum nodes, with the parallel feature in MKL BLAS only and without external optimizers. The final objective value is 0.020929. The number of nodes processed is 519 and the number of subproblems solved is 526. Total program CPU time is 30.6946 seconds, of which 12.14195 seconds were spent in evaluations. The intervals on the equidistant grid on which w(t) = 1 holds, counting from 0 to 59, are 0, 1, 31, 32, 44, 52, and 53. A future release of Knitro may add features such as the tuner, a new presolver, new parallel features and improvements in mixed-integer programming and the active-set algorithm.

A sample gfortran program, using compiler version 4.10 (development stage) with OpenMP 4.0 features, for Parallel Branch and Bound is attached for testing and reference :

program parBandB
  use OMP_LIB
  implicit none
  integer :: shared_root_node, private_itr, max_itr
  double precision :: shared_max, private_temp, private_temp_prod
  double precision, allocatable, dimension(:) :: myarray

  ! Set up a decreasing array of bounds; element 1 holds the best (1.001).
  max_itr = 100
  allocate ( myarray(max_itr) )
  do private_itr = 1, max_itr
    myarray(private_itr) = (max_itr + 1.0 - private_itr) / max_itr
  end do
  myarray(1) = 1.001

  !$OMP PARALLEL proc_bind(spread) num_threads(2) &
  !$OMP & reduction(max:shared_max) reduction(min:shared_root_node)
  !$OMP MASTER
  shared_root_node = min(100,100)
  shared_max = max(0.999,0.999)
  !$OMP END MASTER
  !$OMP SINGLE
  !$OMP TASKGROUP
  !$OMP TASK shared(shared_max, shared_root_node) default(firstprivate) untied mergeable
  ! Outer search loop : keep scanning until the root node (index 1) is found.
  do while (shared_root_node > 1)
    !$OMP PARALLEL shared(shared_max, shared_root_node) proc_bind(close) num_threads(2)
    !$OMP SINGLE
    !$OMP TASKGROUP
    !$OMP TASK depend(inout:myarray) private(private_temp,private_temp_prod,private_itr) &
    !$OMP & shared(shared_max, shared_root_node) default(firstprivate) untied mergeable
    ! Vectorized scan over candidate nodes : improve the incumbent bound.
    !$OMP SIMD linear(private_itr) safelen(4) private(private_temp,private_temp_prod)
    do private_itr = 1, max_itr
      call random_number(private_temp)
      private_temp_prod = private_temp * myarray(private_itr)
      if ((private_temp_prod > shared_max) .or. (private_temp > shared_max)) then
        shared_max = max(shared_max,private_temp,private_temp_prod)
        shared_root_node = min(shared_root_node,private_itr)
      endif
    enddo
    !$OMP END SIMD
    ! Cancel the remaining tasks once the root node has been reached.
    !$OMP CANCEL TASKGROUP IF (shared_root_node == 1)
    !$OMP END TASK
    !$OMP END TASKGROUP
    !$OMP END SINGLE
    !$OMP SINGLE
    print*, 'Maximum bound', shared_max
    !$OMP END SINGLE
    !$OMP CANCEL PARALLEL IF (shared_root_node == 1)
    !$OMP END PARALLEL
  enddo
  !$OMP END TASK
  !$OMP END TASKGROUP
  !$OMP END SINGLE NOWAIT
  !$OMP CANCEL PARALLEL IF (shared_root_node == 1)
  !$OMP END PARALLEL
  deallocate( myarray )
  STOP
end program parBandB

REMARK : It is good to consider enabling nested parallelism and setting the maximum number of active levels. Sometimes it is also good to set the thread limit to prevent runaway oversubscription, and to enable dynamic adjustment of the number of threads.
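As an illustration of that remark, the standard OpenMP runtime calls and environment variables below control nesting, active levels, the thread limit and dynamic adjustment :

! Sketch : runtime controls mentioned in the remark above.
program runtime_controls
  use OMP_LIB
  implicit none
  call omp_set_nested(.true.)            ! enable nested parallel regions
  call omp_set_max_active_levels(2)      ! cap nesting depth at two levels
  call omp_set_dynamic(.true.)           ! let the runtime adjust team sizes
  print *, 'thread limit :', omp_get_thread_limit()
  ! The overall limit is set in the environment, e.g. :
  !   export OMP_THREAD_LIMIT=16
  !   export OMP_CANCELLATION=true   (required for the CANCEL constructs above)
end program runtime_controls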
Performance Analysis :
Outer Taskloop SIMD thread 0 running on cpu 0
Outer Taskloop SIMD thread 1 running on cpu 4
(columns : Previous Thread | Current Thread)
Inner Task thread 0 of Outer Taskloop 0 gets task 0 on cpu 0 (start xx:yy:zz end xx:yy:zz) (start xx:yy:zz end xx:yy:zz)
Inner Task thread 1 of Outer Taskloop 0 gets task 1 on cpu 1 (start xx:yy:zz end xx:yy:zz) (start xx:yy:zz end xx:yy:zz)
Inner Task thread 2 of Outer Taskloop 0 gets task 2 on cpu 2 (start xx:yy:zz end xx:yy:zz) (start xx:yy:zz end xx:yy:zz)
Inner Task thread 3 of Outer Taskloop 0 gets task 3 on cpu 3 (start xx:yy:zz end xx:yy:zz) (start xx:yy:zz end xx:yy:zz)
Inner Task thread 0 of Outer Taskloop 1 gets task 0 on cpu 4 (start xx:yy:zz end xx:yy:zz) (start xx:yy:zz end xx:yy:zz)
Inner Task thread 1 of Outer Taskloop 1 gets task 1 on cpu 5 (start xx:yy:zz end xx:yy:zz) (start xx:yy:zz end xx:yy:zz)
Inner Task thread 2 of Outer Taskloop 1 gets task 2 on cpu 6 (start xx:yy:zz end xx:yy:zz) (start xx:yy:zz end xx:yy:zz)
Inner Task thread 3 of Outer Taskloop 1 gets task 3 on cpu 7 (start xx:yy:zz end xx:yy:zz) (start xx:yy:zz end xx:yy:zz)
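A log of this shape can be produced with omp_get_thread_num() for the thread number and, on Linux with glibc (an assumption here), sched_getcpu() for the cpu, called through ISO_C_BINDING; a minimal sketch :

! Sketch : print which cpu each OpenMP thread runs on (Linux/glibc only).
program where_am_i
  use OMP_LIB
  use iso_c_binding
  implicit none
  interface
    integer(c_int) function sched_getcpu() bind(c, name='sched_getcpu')
      import :: c_int
    end function sched_getcpu
  end interface
  !$OMP PARALLEL proc_bind(spread)
  print *, 'thread', omp_get_thread_num(), 'running on cpu', sched_getcpu()
  !$OMP END PARALLEL
end program where_am_i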
