There are two aims you may have when doing feature selection in a supervised setup: finding a set of features on which the final model works best, or extracting all features that are relevant to the decision. Contrary to popular belief, these two aims are substantially different, and the first is much less attractive than it seems.

To show that, we will need some theory; but bear with me, it won’t be too much.
First, what does it mean that a feature x is relevant?
It certainly must contain information about the decision Y; still, our feature might not be directly correlated with the decision, but rather involved in some interaction with other features.
Formally, it boils down to checking whether there is a subset of features, say S, for which the conditional mutual information (CMI) between x and Y given S is positive.
It is important that the CMI of a relevant feature may be 0 either because S is too small or because it is too large: too small when it lacks the features required to cover the interaction our feature is involved in, too large when it contains features which completely duplicate the information about Y that our feature carries.
Consequently, an appropriate S may be hard to find, especially since there are 2^{M} options in a set of M features, and it is not unlikely that this space isn't smooth enough for gradient-based optimisation.
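
To make this concrete, here is a minimal sketch of a plug-in CMI estimator for discrete data (the `entropy` and `cmi` helpers are mine, not from any package), run on a pure XOR; A alone looks irrelevant, yet becomes maximally informative once we condition on B:

```
# Plug-in entropy of a discrete variable (or a joint of several)
entropy <- function(...) {
 p <- prop.table(table(...))
 -sum(ifelse(p > 0, p * log2(p), 0))
}
# I(x; Y | S) = H(x, S) + H(Y, S) - H(x, Y, S) - H(S)
cmi <- function(x, y, S = NULL) {
 if (is.null(S))
  return(entropy(x) + entropy(y) - entropy(x, y))
 entropy(x, S) + entropy(y, S) - entropy(x, y, S) - entropy(S)
}
A <- rep(c(TRUE, FALSE), each = 2)
B <- rep(c(TRUE, FALSE), 2)
Y <- A != B
cmi(A, Y)    # 0 -- marginally, A looks irrelevant
cmi(A, Y, B) # 1 -- the XOR interaction surfaces given S = {B}
```

These toy variables get fully rebuilt by the code later in the post, so the names do not clash.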

The second issue is what it means that a model works on some set. The idealisation of a machine learning problem is the Bayes classifier, basically a table which assigns to each possible combination of feature values an exact probability of each class. In solvable cases these probabilities are zeroes and ones; otherwise we have a problem with noise that disallows perfect prediction even when we have all observable information; this noise is called risk. For reasonable classifiers we can prove or suspect consistency, that is, convergence to the Bayes classifier in the limit of an infinite number of training objects, with their error converging to the lower bound defined by risk. Quite obviously, the Bayes classifier will predict the same probability regardless of the values of some features; let's call them redundant. Consequently, removing them cannot break a good, consistent classifier. Even more, we may expect better convergence on such a reduced set (as the classifier won't waste information on unnecessary estimations), hence likely a smaller error on an actual, finite training set.

Now, the problem is that redundancy is not a strict negation of relevance, as relevant features may be redundant; for instance, feature A may explain 100% of the decision and feature B 99.7%. Both would be relevant, but the Bayes classifier would only need A. This is why these two aims are separate and require different methods.
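
A quick simulated sketch of that situation (the 99.7% figure is just the illustrative number from above):

```
set.seed(42)
A <- runif(10000) < 0.5
# B agrees with A in roughly 99.7% of cases
B <- ifelse(runif(10000) < 0.997, A, !A)
Y <- A
mean(A == Y) # exactly 1; A alone reproduces the decision
mean(B == Y) # about 0.997; B is clearly relevant, yet redundant given A
```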

Let's see how this works in practice on a simple, binary problem: there are 5 independent features A, B, N_{1}, N_{2}, N_{3}, as well as 3 features that are functions of A and B: A or B (AoB), A and B (AnB) and not A (nA).
The decision Y is A xor B (that is, true when A differs from B).
To simplify things further, we will make a training set covering all possible objects in such a case; the code to do this in R would look like this:

```
# All 32 combinations of 5 binary features
setNames(
 do.call(expand.grid, rep(list(c(T, F)), 5)),
 c("A", "B", "N1", "N2", "N3")
) -> X
# Derived features: A or B, A and B, not A
cbind(X, AoB = with(X, A|B), AnB = with(X, A&B), nA = !X$A) -> X
# Decision: A xor B
factor(with(X, A != B)) -> Y
# Convert everything to factors
data.frame(lapply(X, factor)) -> X
```

The relevant features are all except N_{1-3}, while there are 3 possible distinct Bayes classifiers, built on A and B, on nA and B, and finally on AnB and AoB.
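
One can verify this claim on the X and Y built above; a pair of features supports a Bayes classifier here if and only if Y is a deterministic function of it (the `deterministic` helper is ad hoc):

```
# TRUE iff every observed combination of the given features
# co-occurs with exactly one class of Y
deterministic <- function(pair)
 all(rowSums(table(interaction(X[, pair]), Y) > 0) <= 1)
sapply(list(c("A", "B"), c("nA", "B"), c("AnB", "AoB")), deterministic)
## [1] TRUE TRUE TRUE
```
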
Let's then try an obscenely state-of-the-art backward greedy selection driven by a cross-validation error approximation from an SVM model:

```
greedySVM <- function(X, Y, sel = names(X), err = Inf) {
 # Empty selection is not interesting
 if (length(sel) == 0) return(NULL)
 # Try the proposed selection
 cerr <- try(kernlab::ksvm(Y~., X[, sel, drop = FALSE], cross = 5)@cross)
 # Cross-validation may fail
 if (inherits(cerr, "try-error")) return(NULL)
 # Error increase, not good
 if (cerr > err) return(NULL)
 # It is not worse, but maybe it can be even better?
 for (remove in seq(along = sel))
  if (!is.null(ans <- greedySVM(X, Y, sel[-remove], cerr)))
   return(ans)
 # The current selection is an optimum
 return(paste(sort(sel), collapse = "+"))
}
greedySVM(X,Y)
## [1] "AnB+AoB"
```

The answer is what one could expect, as only AnB and AoB have a marginal interaction with Y, so they are the easiest to use; still, the result may clearly depend on the order of features. With 8! > 40k permutations being a bit too much to check, let's settle on 1000 samples:

```
permuteGreedySVM <- function(X, Y, tries = 100) {
 # Repeat the greedy search over randomly permuted feature orders
 unlist(parallel::mclapply(
  1:tries,
  function(.) greedySVM(X[, sample(ncol(X))], Y),
  mc.cores = parallel::detectCores()
 )) -> sels
 # Frequency of each distinct selection
 sort(table(sels)/length(sels), dec = TRUE)
}
permuteGreedySVM(X,Y,1000)
## sels
## AnB+AoB B+nA A+B A+AoB+B A+B+N1+N2+N3+nA
## 0.408 0.302 0.287 0.001 0.001
## B+N2+nA
## 0.001
```

And so the aforementioned instability resurfaces; we may randomly converge to any of the Bayes classifiers (with a slight preference for the easiest one), rarely a mixture, sometimes even with nonsense features (which is due to the stochastic nature of cross-validation). Most people naturally won't check the model structure, and will just be satisfied to robustly get a 100% accurate model. On the other hand, more curious ones may focus on the instability and come to the conclusion that either the whole problem is nonsense or that the sole important feature is B, as it is the only one that appeared in more than half of the iterations. It is also important to mention that once the feature space is settled, we cannot detect such an issue any more; the feature space has been clipped to the surroundings of a single mechanism, which has likely become the only possible solution.
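
For illustration, we can do the clipping ourselves and repeat the permuted search (re-using the permuteGreedySVM defined above); with only one mechanism left, I would expect the selection to look perfectly stable:

```
# Only the {A, B} mechanism (plus noise) survives the clipping,
# so every permutation should converge to the same selection
permuteGreedySVM(X[, c("A", "B", "N1", "N2", "N3")], Y, 1000)
```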

What is the solution to that problem? Naturally, to use an all-relevant method; unfortunately, this is a way more complex enterprise than minimal-optimal selection, as it theoretically requires a full scan over all feature subsets, as well as an infinite number of objects to properly estimate feature independence. Still, there are some promising approximate methods, for instance Boruta:

```
Boruta::Boruta(X,Y)
## Boruta performed 14 iterations in 0.9307714 secs.
## 5 attributes confirmed important: A, AnB, AoB, B, nA;
## 3 attributes confirmed unimportant: N1, N2, N3;
```

It works because it relies on an ensemble classifier which combines many attempts to solve slightly perturbed versions of the original problem; hence, it visits all three mechanisms and obtains a comprehensive list of important features in a stable manner.
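
The same effect can be glimpsed directly from the underlying ensemble; a sketch using the randomForest package (the importance source Boruta builds on), where I would expect all five relevant features to clearly outscore the noise:

```
library(randomForest)
set.seed(17)
rf <- randomForest(X, Y, importance = TRUE)
# Permutation importance (mean decrease in accuracy)
importance(rf, type = 1)
```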

You may think now: cool, but that's a carefully crafted problem which has nothing to do with real-world datasets. Indeed, it is a rare case when there are a few separate mechanisms with exactly the same minimal risk; still, we can rarely estimate the accuracy of a model precisely, hence it is much more likely to have optimal mechanisms of statistically indistinguishable risks.

There is an even deeper problem: sometimes good mechanisms are not even true, but rather emerge from random fluctuations, similarly to how spurious hypotheses get seemingly significant p-values in multiple hypothesis testing without an appropriate correction. Such a situation is basically unavoidable for datasets with substantially more features than objects, which are in particular a usual product of the high-throughput experiments popular in biology. In that case, all-relevant selection is pretty much the only option not to get a model based purely on nonsense, if not on some single well-known factor of mediocre prediction power leading to a very shallow understanding.

The code is here.
