Optimization is a widespread notion: a large number of concrete problems require, during the solving process, optimizing a set of parameters (or variables) with respect to a given objective function (or fitness function). These variables can take integer or real values. Such problems for which variables can only take integer values, are called combinatorial optimization problems. The set of all possible combinations of values for the variables of the problem represents the search space of the problem. Global combinatorial optimization problems - for which the whole search space is available – are our main focus. In a large number of real problems, in particular in bioinformatics, the objective function is partially or entirely unknown. In this contest, it is however possible to calculate the value of the function in each point of the search space. This kind of problem is called a “black box” problem. Classical techniques of operations research are weakly or not at all fitted to black box problems. Consequently, population-based
search algorithms have been developed specifically for black box problems. A population-based method uses a set of solutions rather than a single solution in each iteration of the algorithm and provide a natural, intrinsic way to explore the search space. They are inspired from living organisms which adapt themselves to their environment. The most well-known population-based method is Evolutionary Computation (EC) including Genetic Algorithms (GAs). EC algorithms usually use a operator called crossover/recombination to recombine two or more solutions (called also individuals) to generate new individuals. Another operator used in these algorithms is mutation which is kind of modifier which can change the composition of an individuals. Selection of individuals are based on a quality measure such as the value of an objective function or the results of some experiment. Selection can be considered as the driving force in EC algorithms . Individuals with higher fitness have the higher chance to be chosen for producing the next iteration individuals/search points set. The general idea behind the concept of EC is that the exploration of the solution space is guided by some information about the previous step of the exploration. These information come from the use of a set of solutions from which statistical properties can be extracted giving some insights about the structure of the optimization problem to solve. These statistical properties can in turn be used to generate new promising potential solutions. In GA, crossover is the operator which use the statistical information of the population as it generate a new solution by combining two (or more) previously generated solutions. The mutation operator gives the possibility to bring new information into the population that cannot be discovered by just combining the existing solutions. Finally, the selection process allows to drift the exploration process toward the solutions with higher fitness. The recombination operator in GAs manipulates the partial solutions of an optimization problem. These partial solutions are called building blocks. It often happens that the building blocks are loosely distributed in a problem domain. Therefore a fixed crossover operator can break the building blocks and lead to convergence to local optimum. This problem is called linkage problem. This problem makes the classical genetic algorithm inefficient in solving problems composed of sum of simple sub-problems. Another difficulty with classical GA is to define the parameters such as the crossover and mutation probabilities. In order to solve these deficiencies another group of evolutionary algorithms called Estimation of Distribution Algorithms (EDAs), also called Probabilistic Model Building Genetic Algorithms (PMGAs), have been proposed. EDAs learn distributions or in other words probabilistic models from the most promising solutions, to guide the search process and preserve the important building block in the next generation. EDAs have been used in different machine learning problems such as feature subset selection and classifier systems in many bioinformatics problems. Conception of a new EDA method Bayesian networks are usually used in Estimation of Distribution Algorithms (EDAs) for encoding the multivariate interactions among the variables of an optimization problem. Learning Bayesian networks, especially for large problems with high degree of dependencies among their variables is highly computationally expensive which makes it the bottleneck of EDAs. Therefore introducing efficient Bayesian learning algorithms in EDAs seems necessary in order to use them for large problems. We conceived an algorithm, called CMSS-BOA, which uses a recently introduced heuristic called max-min parent children (MMPC) in order to constraint the models search space. We compare the efficiency of CMSS-BOA with BOA, the standard Bayesian network based EDA for solving several benchmark problems. Our results show that our algorithm is able to obtain comparable final results (and some times better results) with BOA when the same number of generations is used. More interestingly, these results show clearly that the computational time needed to reach the same quality of solution is considerably less with our approach with a reduction of up to 90 percent. Moreover, it seems that the gain in computational time increases with the highest degree of dependencies. This is a very important result as the complexity of this kind of optimization problems grows exponentially with the degree of dependencies. Therefore it will make our algorithm practical for larger difficult real problems that include high level of dependencies which can hardly be solved by classical EDA approaches. This work has been published in:
Analysis of population-based approachesWe have investigated the behavior of population-based algorithms to solve combinatorial optimization problems. A large number of properties are involved in the intrinsic complexity of a global combinatorial optimization problem. We have focused our studies on two particularly important properties: epistasis (dependencies between the variables of the problem) and deceptiveness. We first show that the canonical genetic algorithm is unable to solve problems (even simple ones) with epistasis when no information is available on the nature of the dependencies. We then propose to distinguish two classes of strategies: (1) those based on an algorithm that is dedicated to a unique problem using expert knowledge on the structure of this problem - we call it the expert strategy - and (2) those based on the discovery, through modeling a bias sample of the search space, of the structure of the function to be optimized to determine a pertinent neighborhood for exploration - we call it the automated strategy. We study the capability of these two strategies using one representative of each class: hBOA, an EDA method, for the automated class and a very simple algorithm, exploiting the unique knowledge that the problem is deceptive, for the expert class. Depending on the nature of the problem, either one or the other of these classes can be efficient. If the first one is highly dependent to the information exploited and therefore to the expert’s knowledge, the second one is limited if the number of dependencies between the problem variables is high. However we have shown that hBOA can really discover dependencies even in non-adjacent and mixed configurations. We have also shown how the use of partial information on the problem structure can lead to the definition of a highly efficient algorithm, solving very complex problems in a few minutes. This work has been published in:
Detection of multiple dependenciesThere are many situations in which founding the dependencies among the variables of a domain is needed. Therefore having a model describing these dependencies provides significant information. For example, which variable(s) affect(s) the other variable(s) may be very useful for the problem of selection of variables; decomposition of a problem to independent sub-problems; predicting the value of a variable depending on other variables to solve the classification problem; founding an instantiation of a set of variables for maximizing the value of some function, etc. The classical model used for the detection of dependencies is the Bayesian network. This network is a factorization of the probability distribution of a set of examples. It is well known that the construction of a Bayesian network from examples is a NP-hard problem, thus different heuristic algorithms have been designed to solve this problem. Most of these heuristics are greedy and/or try to reduce the size of the exponential search space by a filtering strategy. The filtering is based on some measures that aim to discover sets of variables that have high potentiality to be mutually dependent or independent. However other measures exist which are not based on conditional probability measurements that have the ability to discover dependencies. Using another measure that is not based on conditional probabilities can provide another perspective about the structure of dependencies of variables of a domain. Statistical Implicative Analysis (SIA) has already shown a great capability in extracting quasi-implications also called as association rules. We have conceived a measure for multiple dependencies based on SIA and then used this measure in a greedy algorithm for solving the problem of multiple dependencies detection. We have compared our new algorithm for founding dependencies with one of the most successful conditional dependencies based heuristic introduced so far, MMPC. We have designed a set of experiments to evaluate the capacity of each of them to discover two kinds of knowledge: the fact that one variable conditionally depends on another one and the sets of variables that are involved in a conditional dependencies relation. Our results showed a good efficiency of the MMPC algorithm for discovering the dependencies. The algorithm appears to be little affected by the change in the complexity of the model and the distributions of the independent variables. However, it has some significant limitations to detect dependencies when a part of the variables is independent. The algorithm MMPC does not appear to be effective for the second problem: the selection of variables. Our SIA based algorithm, does not seem capable of directly detecting the dependencies whatever the configuration was. But it seems very effective to determine the dependent variables. However, it is less efficient in situations where the independent variables have extreme distributions. The two approaches seem complementary and promising. This work has been published in:
Effect of dependencies on feature selection
There are many feature selection approaches that have been proposed and broadly used on machine learning, bioinformatics and pattern recognition areas in recent decades. We cannot expect that one feature selection approach could be employed for all problems and get good performance. So questions arise here: Which are the suitable situations for different feature selection approaches respectively? In our research, we try to find some limitations of several feature selection approaches and focus on how the dependency can affect the capability of these feature selection approaches to extract the relevant features for classification. The goal of our research is to find how dependencies affect the capability of several feature selection approaches to extract of the relevant features for a classification purpose. The hypothesis is that more dependencies and higher level dependencies mean more complexity for the task. In our experiments, we intended to discover some limitations of several feature selection approaches by altering the degree of dependency of the test datasets. We proposed a new method using pre-designed Bayesian Networks to generate the test datasets with an easy tuning level of complexity for feature selection test. Relief, CFS, NB-GA, NB-BOA, SVM-GA, SVMBOA and SVM-mBOA are the filter or wrapper model feature selection approaches which we are used and evaluated in our experiments. Moreover, a multiobjective optimization method is used to keep the diversity of the populations in each generation of the BOA search algorithm improving the overall quality of solutions in our experiments. According to our experiment, we found that the higher level and more complexity dependency among the relevant features will greatly affect the capability to find the relevant features for classification. The higher dependency level, the more complex of the task is. The multi-objective optimization method not only helps to choose a trade-off solution from conflict objectives, but it also can help keeping the diversity of the populations in each generation and improve the overall quality of solutions for BOA in our experiments. Finally, Relief usually is a very effective and efficient feature selection method. One well known drawback of Relief is that it is difficult to filter redundant features. Another limitations of Relief we found in our experiments are that Relief cannot handle sophistic dependencies among the relevant features and these features have similar distribution for both classes at same time; moreover, the number or the dependencies among the irrelevant feature can also affect the capability of Relief. This work has been published in:
|


