Yet another feature filter bundle

Recently, I released praznik v3; this version probably marks the maturity of the initially planned feature set, so it is a good time to introduce it with a post here. Praznik is an R package implementing a collection of information filters, i.e., feature selection methods which do not try to do any inference, but simply select features which share most of the information contained in the decision attribute. These are pretty old tools from the dawn of machine learning, yet still popular due to their simplicity and efficiency; despite this, they were surprisingly absent from CRAN, so there was a gap to fill.

Because of the difficulties I’ve highlighted earlier, information filters are usually formulated as greedy forward optimisers which only rank features and select as many of them as the user wants; hence, their overall blueprint looks like this, in pascalesque pseudocode:

function(X,Y,K)
 selected:={}
 for k in 1..K
  for each feature i that is not in selected
   score[i]:=
    if k=1
     mutual information(X[i],Y)
    else
     criterion(X[i],Y,selected)
    end
  end
  add the i with the best score[i] to selected
 end
 return(selected)
end
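Translated into toy R, the blueprint above could look like this; everything here is illustrative rather than a part of praznik’s API, with `mi` being a simple plug-in mutual information estimator for factors:

```r
# Toy R rendition of the greedy forward blueprint; illustrative only.
mi <- function(x, y) {              # plug-in mutual information of two factors
  p <- table(x, y) / length(x)
  px <- rowSums(p); py <- colSums(p)
  sum(ifelse(p > 0, p * log(p / outer(px, py)), 0))
}

greedyFilter <- function(X, Y, K, criterion) {
  selected <- integer(0)
  for (k in seq_len(K)) {
    candidates <- setdiff(seq_along(X), selected)
    # The first feature is scored with plain MI, later ones with the criterion
    score <- sapply(candidates, function(i)
      if (k == 1) mi(X[[i]], Y) else criterion(X, Y, i, selected))
    selected <- c(selected, candidates[which.max(score)])
  }
  names(X)[selected]
}
```

For instance, passing `function(X, Y, i, selected) mi(X[[i]], Y)` as the criterion (i.e., ignoring `selected` entirely) recovers plain MIM ranking.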

Assuming this form, the difference between methods lies only in how the criterion function is defined; still, its structure often allows for certain optimisations, so the final code is often more sophisticated. Namely, in the methods praznik implements, the criterion is either a sum or a minimum over elements of selected, hence it pays off to progressively accumulate the score. Moreover, in the latter case an even deeper optimisation is possible: as the accumulated score can only decrease, one can delay evaluation of the criterion for features that have a lesser score than the current maximum, and calculate it in a batch for all features selected in the meantime, as proposed by Fleuret for his CMIM method.
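To make the accumulation trick concrete, here is a sketch of the minimum-type case, as in CMIM; again this is illustrative, with assumed plug-in helpers, not praznik’s actual code. The running score of each feature starts at its MI with Y and is only ever lowered, by taking a minimum with the conditional MI given the most recently selected feature, so nothing has to be recomputed from scratch:

```r
# Illustrative sketch of progressive accumulation for a minimum-type
# criterion (CMIM-style); H() and cmi() are assumed plug-in helpers.
H <- function(...) {                 # joint entropy of one or more factors
  p <- table(...) / length(..1)
  -sum(p[p > 0] * log(p[p > 0]))
}
cmi <- function(a, b, c) H(a, c) + H(b, c) - H(a, b, c) - H(c)  # I(a;b|c)

cmimSketch <- function(X, Y, K) {
  score <- sapply(X, function(x) H(x) + H(Y) - H(x, Y))  # start at I(Xi;Y)
  selected <- which.max(score)
  for (k in seq_len(K - 1)) {
    last <- X[[selected[length(selected)]]]
    for (i in setdiff(seq_along(X), selected))
      score[i] <- min(score[i], cmi(X[[i]], Y, last))    # can only decrease
    candidates <- setdiff(seq_along(X), selected)
    selected <- c(selected, candidates[which.max(score[candidates])])
  }
  names(X)[selected]
}
```

Fleuret’s delayed-evaluation refinement would go one step further and skip the inner update for features whose accumulated score already trails the current maximum.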

In general, there are two main downsides of information filters. The first is that in practice the input data must be categorical; otherwise the entropy estimation would become too complex for any generic approach. Hence, numerical features are usually just cut into a few equally spaced bins, usually 10, without much rationale for that number; this is also what praznik automagically does when given such a feature (well, to some extent; the manual has the details). The other downside is that the criterion only ranks features, and you have to specify the size of the selection a priori; hence you get yet another hyper-parameter of the modelling pipeline, consequently another thing to optimise and another chance to over-fit. To be honest, some methods may stop early and return fewer features, but this is a rare case which usually happens only on synthetic data, hence it cannot be relied upon.
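For illustration, this default treatment boils down to something along these lines: an equal-width cut into 10 bins (praznik’s actual handling, including NAs, is described in its manual):

```r
# Equal-width discretisation of a numeric feature into 10 bins,
# roughly the default pre-processing an information filter expects.
set.seed(1)
x <- rnorm(1000)
xb <- cut(x, breaks = 10)   # 10 equally spaced intervals over range(x)
nlevels(xb)
table(xb)                   # bin counts; note the uneven occupancy
```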

Originally, praznik was a simple port of FEAST, a popular MATLAB library by Adam Pocock, born from an attempt to unify information filters under a single theoretical umbrella. By the way, this incarnation still exists under a different name; yet I ended up writing the whole code from scratch, mostly because FEAST was not playing well with R idiosyncrasies. Not everything covered by FEAST was ported, but all the good and important methods are there, namely CMIM, DISR, JMI and MRMR; there are also two extras, JMIM and NJMIM, published after FEAST was first released.

Generally, praznik turned out to be faster than FEAST; here is a benchmark on the Madelon dataset (you can find it bundled with praznik), on a Ryzen 1700X machine:

[Figure: Comparison of praznik and FEAST speed for different methods and feature yields.]

The other substantial feature is that praznik supports local parallelism via OpenMP (hence its C code has more lines doing feature selection than spawning, locking and syncing threads); here are some speed-up plots, on the same computer:

[Figure: Scalability plots; follow the text for a summary.]

The scaling is pretty good, yet far from linear due to the high overhead of the scalar parts: the pre-processing step that converts features and detects NAs, the barrier guarding advancement into the next selection, and finally the synchronisation of heterogeneously loaded threads in the delayed-assessment methods. So, as almost always, if you happen to have spare RAM and a large list of praznik tasks, it is better to switch praznik into single-thread mode and execute the tasks concurrently instead. Also, the CPU used has 2× SMT, hence the speed-up above 8 threads is expected to be less pronounced; in fact, some will say it is surprising that there is no slow-down.
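For example, a batch of selections could be run concurrently along these lines, with each praznik call pinned to a single thread via its `threads` argument; this sketch assumes a Unix-like system for `mclapply` and re-uses MadelonD as a stand-in for a list of real tasks:

```r
library(praznik)
library(parallel)   # mclapply forks, so this works on Unix-like systems only
data(MadelonD)

# A stand-in task list; in practice these would be different datasets
tasks <- rep(list(MadelonD), 4)

# Each task runs with threads=1; concurrency comes from mclapply instead
res <- mclapply(
  tasks,
  function(d) CMIM(d$X, d$Y, k = 20, threads = 1),
  mc.cores = 4
)
```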

One more detail: accuracy. Filter methods are usually assessed by how well some model works on the subset of features they provide, and not surprisingly they are tuned in this regard. As you know, I think such an approach is close to useless, hence the selection of Madelon, whose relevant features are mostly redundant. Despite the fact that the details of the original structure of Madelon are secret, I am close to certain that Boruta can solve this problem exactly, hence I used its selection as a gold standard.

library(praznik)
data(MadelonD)
K<-list(
 DISR=DISR,MIM=MIM,MRMR=MRMR,JMI=JMI,
 JMIM=JMIM,NJMIM=NJMIM,CMIM=CMIM)
sapply(
 #Pull 20 features using each praznik method
 lapply(
  K,
  do.call,c(MadelonD,list(k=20))
 ),
 #Relevant features have >>Rel<< in name
 function(ans){
  good<-grepl("Rel",names(ans$selection))
  c(hits=sum(good),streak=rle(good)$lengths[1])
 }
)

##        DISR MIM MRMR JMI JMIM NJMIM CMIM
## hits     20  15    3  20   12    14   13
## streak   20  14    1  20   11    14   11

And, surprise: both JMI and DISR happen to solve the all-relevant problem exactly, the *MIM methods formed the peloton of, let’s say, a robust selection of a strong subset of relevant attributes, and finally the actively redundancy-avoiding MRMR recovered only 3 relevant features, managing to return nonsense features as its first three individual guesses (data not shown; the initial, highest-MI feature is common to all methods).

This was obviously a toy example, but it gives some intuition of how these methods’ stability could behave… Who knows, maybe a good topic for a paper. For the time being, you can find plenty of mainstream benchmarks in the methods’ papers; just follow the references from the manual. Praznik should be result-compatible with the reference implementations, not counting some quirks when there are features of basically the same score. If you wish to quickly hack around in R, praznik also contains mostly pure R implementations of all methods (used for testing and slow, hence not exported, but accessible with :::).

A final thing, the name; you might expect that praznik is some kind of Slavic god of entropy estimation, while in fact it is just feast in Slovenian (fulfilling the rule that an R package name should be an embarrassing pun about R).

Praznik is available from CRAN, and is developed here. The code to reproduce the results in this post is here.


CC-BY mbq, written 23-1-2018, last revised 28-7-2018.