The University of Arizona

Scalable Feature Subset Selection and Learning in Dynamic Environments

By Gregory Ditzler and Heng Lui

Figure 1: Feature subset selection is an important step toward producing a classifier that relies only on relevant features, while keeping the computational complexity of the classifier low. Feature selection is also used to make inferences about the importance of attributes, even when classification is not the ultimate goal. For example, in bioinformatics and genomics, feature subset selection is used to identify the variables that best discriminate between multiple populations. Unfortunately, many feature selection algorithms require the subset size to be specified a priori, and knowing how many variables to select is typically a nontrivial task. Other approaches are tied to a specific variable subset selection framework. The University of Arizona’s Machine Learning and Data Analytics group develops approaches to feature subset selection that scale to large volumes of data. Our contributions include a Neyman-Pearson feature selection (NPFS) hypothesis test, which acts as a meta-subset-selection algorithm. NPFS is a parallel feature selection algorithm that scales to large volumes of data, while letting the user choose a filter-based objective function and identifying the number of relevant features in a data set. Our research in feature selection has been used for data analytics in the life sciences.
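The meta-selection idea behind NPFS can be sketched as follows: run any filter-based selector on bootstrap samples of the data, count how often each feature is chosen, and apply a binomial hypothesis test to decide which features are selected more often than chance. This is a minimal illustration, not the group's released implementation; the toy correlation filter `corr_topk` and all parameter values are assumptions chosen for the demo.

```python
import math
import numpy as np

def binom_critical(n, p, alpha):
    """Smallest c with P(Binomial(n, p) <= c) >= 1 - alpha."""
    cdf = 0.0
    for c in range(n + 1):
        cdf += math.comb(n, c) * p**c * (1 - p)**(n - c)
        if cdf >= 1 - alpha:
            return c
    return n

def corr_topk(X, y, k):
    """Toy filter objective (an assumption for this sketch):
    the top-k features by |Pearson correlation| with the labels."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.argsort(corr)[-k:]

def npfs(X, y, base_selector, k, n_bootstraps=50, alpha=0.01, seed=0):
    """Sketch of the NPFS meta-algorithm: run a base filter on bootstrap
    samples, count how often each feature is selected, and keep features
    whose counts exceed a binomial critical value under the null
    hypothesis that selection is uniformly random (p = k/d)."""
    n, d = X.shape
    counts = np.zeros(d, dtype=int)
    rng = np.random.default_rng(seed)
    for _ in range(n_bootstraps):
        idx = rng.integers(0, n, size=n)      # bootstrap resample
        counts[base_selector(X[idx], y[idx], k)] += 1
    c = binom_critical(n_bootstraps, k / d, alpha)
    return np.flatnonzero(counts > c)         # inferred relevant feature set

# Synthetic demo: 20 features, only features 0 and 1 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
relevant = npfs(X, y, corr_topk, k=5)
```

Because each bootstrap run is independent, the runs parallelize trivially, which is what makes the scheme attractive for large data volumes; the final hypothesis test also frees the user from fixing the subset size in advance.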

Figure 2: Two of the more common assumptions that applied machine learning researchers make when using an algorithm are that (1) the training and testing data are sampled from a fixed probability distribution, and (2) there are an equal number of samples from each class. A violation of the former, when new data are presented over time, is referred to as concept drift (a.k.a. learning in non-stationary environments), and a violation of the latter is known as class imbalance. Our research focuses on developing multiple classifier system solutions that draw on both theoretical and empirical observations for learning in dynamic and uncertain environments.
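To make the concept-drift setting concrete, here is a minimal streaming illustration, not the group's method: a toy one-dimensional nearest-mean classifier that tracks per-class running means with exponential forgetting, so old data fade out and the model follows an abruptly drifting distribution. The class name, decay value, and the simulated drift are all assumptions for the demo.

```python
import numpy as np

class DriftingNearestMean:
    """Toy streaming classifier: one exponentially forgotten running
    mean per class; predicts the class with the closest mean."""
    def __init__(self, decay=0.2):
        self.decay = decay
        self.means = {}

    def partial_fit(self, x, label):
        # Exponential forgetting: recent samples dominate the estimate.
        if label not in self.means:
            self.means[label] = float(x)
        else:
            self.means[label] += self.decay * (float(x) - self.means[label])

    def predict(self, x):
        return min(self.means, key=lambda c: abs(x - self.means[c]))

rng = np.random.default_rng(0)
clf = DriftingNearestMean(decay=0.2)
# Phase 1: class 0 centered at 0.0, class 1 centered at 2.0.
for _ in range(200):
    clf.partial_fit(rng.normal(0.0, 0.1), 0)
    clf.partial_fit(rng.normal(2.0, 0.1), 1)
# Phase 2: abrupt drift -- both class distributions shift by +5.
for _ in range(200):
    clf.partial_fit(rng.normal(5.0, 0.1), 0)
    clf.partial_fit(rng.normal(7.0, 0.1), 1)
```

A learner trained once on phase-1 data and then frozen would misclassify everything after the shift; the forgetting factor is the simplest device that lets the model adapt, and multiple classifier systems generalize this idea by weighting or replacing ensemble members as the distribution changes.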

Collaborators:

Salim Hariri
