weka.classifiers.meta.OneClassClassifier Documentation
What is one-class classification?
One-class classification is a type of machine learning in which information about only a single "target" class is needed to build a model that can be used for prediction. A one-class classifier makes one of two judgements: it predicts target if the instance appears to belong to the class the model was trained on, or unknown if the instance does not seem to come from the same class as the training data. This style of classification is sometimes called outlier detection or novelty detection because it attempts to differentiate between data that appears normal and data that appears abnormal with respect to the training data. The literature can be confusing because the terms are used inconsistently. Sometimes outlier/novelty detection refers to problems where all instances are available at training time and, instead of building a model for prediction, the learning algorithm must determine which instances in the dataset, if any, are outliers. Here we prefer the term one-class classification because we are attempting to predict membership of the target class (regardless of whether there are outliers in the training data).
What is this classifier?
The OneClassClassifier combines a class probability estimator (such as C4.5 decision trees) with a density estimator (such as a Gaussian distribution) in order to predict the likelihood a given instance belongs to the target class, as described in:
Kathryn Hempstalk, Eibe Frank, Ian H. Witten: One-Class Classification by Combining Density and Class Probability Estimation. In: Proceedings of the 12th European Conference on Principles and Practice of Knowledge Discovery in Databases and 19th European Conference on Machine Learning, ECMLPKDD2008, Berlin, pp. 505-519, 2008.
This paper can be downloaded from here, here or here.
This implementation is intended to provide the combined classifier from this paper. However, it also has the parameter densityOnly, which forces the classifier to use only the density estimator for prediction. The package therefore provides not just the combined classifier but also density-based one-class classifiers. For example, using the GaussianGenerator with densityOnly set gives one-class "Naive Bayes". Changing the generator changes the type of one-class density estimation used.
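The way the two estimators combine can be written down with Bayes' rule. The following is a sketch of that relationship as presented in the cited paper, not a description of the implementation's internals. Here T is the target class, A is the artificial (generated) class, P(T|X) is the class probability estimate and P(X|A) is the generator density. The prior P(T), the proportion of target data in the two-class training set, is notation introduced for this sketch.

```latex
\begin{align*}
P(T \mid X) &= \frac{P(T)\,P(X \mid T)}{P(T)\,P(X \mid T) + (1 - P(T))\,P(X \mid A)} \\[4pt]
\Longrightarrow\quad
P(X \mid T) &= \frac{(1 - P(T))\,P(T \mid X)}{P(T)\,(1 - P(T \mid X))}\;P(X \mid A)
\end{align*}
```

If target and artificial data are equally represented, so that P(T) = 0.5, the first factor reduces to the odds P(T|X) / (1 - P(T|X)).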
Where can I get the classifier from?
This classifier has been integrated into WEKA and the source code can be obtained by downloading the latest copy of WEKA from Subversion. Details on how to do this can be found here: http://weka.wikispaces.com/Subversion
The classifier can also be downloaded from here. You will also need two props files for the options to be editable in the WEKA GUI. The props files are #1 here and #2 here. The source code is only available by downloading the latest copy of WEKA.
How do I run it?
Using weka-oneclass.jar:
First, make sure you have the weka.jar file. Next, make sure you have the props files in your home directory or in the directory you are launching WEKA from (depending on your OS). Finally, run the following command:
WINDOWS: java -cp weka.jar;weka-oneclass.jar weka.gui.explorer.Explorer
LINUX: java -cp weka.jar:weka-oneclass.jar weka.gui.explorer.Explorer
IMPORTANT:
Unlike the libSVM wrapper for WEKA, which requires the ARFF file to contain only one class label, this classifier can operate successfully on an ARFF file containing multiple class labels. However, it will only learn one class label at a time, and you must tell it which class label you want it to learn. To do this, you MUST set the parameter targetClassLabel to match the label of the class you wish to predict. The classifier will complain until you set this parameter correctly. The other essential parameter is targetRejectionRate, which adjusts the threshold between target and outlier instances. The default rate of 0.1 will cause approximately 10% of all target instances to be classified as "outlier".
What does the classifier predict?
The classifier will make a binary classification of either the targetClassLabel (that you have set), or "outlier". If the "outlier" class label does not exist in the dataset, it will predict the class label as missing instead. Therefore, to get the most meaningful output from WEKA it is advisable to add this label into your set of class labels - regardless of whether there are instances belonging to this label or not.
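The prediction rule described above can be illustrated with a small, WEKA-free sketch. The class and method names are hypothetical, Optional.empty() stands in for a "missing" prediction, and the sketch assumes that a higher score means "more target-like" with ties counted as target. None of this is taken from the actual source code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class PredictionSketch {
    static final String OUTLIER = "outlier";

    /**
     * Maps a score to a label. Optional.empty() stands in for a "missing" prediction.
     * Assumption: scores at or above the threshold are accepted as target.
     */
    static Optional<String> predict(double score, double threshold,
                                    String targetLabel, List<String> classLabels) {
        if (score >= threshold) {
            return Optional.of(targetLabel);
        }
        // The instance is rejected: report "outlier" only if that label is declared.
        return classLabels.contains(OUTLIER) ? Optional.of(OUTLIER) : Optional.empty();
    }

    public static void main(String[] args) {
        List<String> labels =
            Arrays.asList("Iris-setosa", "Iris-versicolor", "Iris-virginica", "outlier");
        System.out.println(predict(0.8, 0.5, "Iris-setosa", labels));
        System.out.println(predict(0.2, 0.5, "Iris-setosa", labels));
    }
}
```

The empty result in the last branch is the reason for adding "outlier" to the class labels: without it, a rejection carries no label at all.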
A quick example:
1. Open the iris.arff data file and edit the class attribute so that it contains "outlier" as well as the usual "Iris-setosa", "Iris-versicolor" and "Iris-virginica".
2. Open the WEKA Explorer and load the altered iris.arff file.
3. On the classifiers pane, select weka.classifiers.meta.OneClassClassifier.
4. Edit the options and change targetClassLabel to Iris-setosa (without quotes).
5. Run the classifier. The OneClassClassifier will attempt to use Iris-setosa as the target class, and during testing all other classes will fall under "outlier". The results window will reflect this. If step 1 was not completed, a prediction of outlier is replaced with "missingClassLabel" and there is less useful information in the Explorer window.
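The edit to the class attribute can be made by hand in a text editor. For anyone who prefers to script it, here is a minimal sketch that appends a label to a nominal attribute declaration. It assumes the declaration is a single line of the form "@ATTRIBUTE class {a,b,c}" with unquoted labels; the helper name addLabel is hypothetical and is not part of WEKA.

```java
final class ArffLabelSketch {
    /**
     * Appends a label to a nominal ARFF attribute declaration such as
     * "@ATTRIBUTE class {a,b}", unless the label is already listed.
     * Quoted labels and labels containing commas are not handled.
     */
    static String addLabel(String declaration, String label) {
        int open = declaration.indexOf('{');
        int close = declaration.lastIndexOf('}');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Not a nominal declaration: " + declaration);
        }
        String inner = declaration.substring(open + 1, close);
        for (String existing : inner.split(",")) {
            if (existing.trim().equals(label)) {
                return declaration; // already present, nothing to do
            }
        }
        String separator = inner.trim().isEmpty() ? "" : ",";
        return declaration.substring(0, close) + separator + label
            + declaration.substring(close);
    }

    public static void main(String[] args) {
        String line = "@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}";
        System.out.println(addLabel(line, "outlier"));
    }
}
```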
What is the general procedure of the classifier?
The OneClassClassifier takes the following steps during training:
1. Check that the classifier can handle the data it has been given.
2. Delete all data that does not belong to the target class.
3. Check that there are instances left that belong to the target class.
4. Add the outlier class label if it doesn't already exist.
5. Calculate the threshold by:
   5.1. Hold out some instances.
   5.2. Build the classifier on artificially generated and non-held-out data.
   5.3. Find the threshold that results in a matching targetRejectionRate on the held-out data.
   5.4. Repeat from 5.1.
6. Set the threshold on the probability to be the average threshold from (5).
7. Generate enough artificial data for the entire classifier, and build the full model.
How many instances count as "some" depends on the parameters that you have set (see crossValidationFold and crossValidationRepeats below).
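The threshold calculation in the steps above can be sketched without WEKA. The example assumes that each repetition yields one score per held-out target instance, that higher scores are more target-like, and that the threshold is chosen as the lowest score still accepted once roughly targetRejectionRate of the held-out scores have been rejected. The quantile rule and all names are illustrative assumptions, not the implementation's exact method.

```java
import java.util.Arrays;

final class ThresholdSketch {
    /** Threshold below which roughly rejectionRate of the held-out target scores fall. */
    static double thresholdFor(double[] heldOutScores, double rejectionRate) {
        if (heldOutScores.length == 0) {
            throw new IllegalArgumentException("Need at least one held-out score");
        }
        double[] sorted = heldOutScores.clone();
        Arrays.sort(sorted);
        int rejected = (int) Math.floor(rejectionRate * sorted.length);
        // The score at index `rejected` is the lowest one that is still accepted.
        return sorted[Math.min(rejected, sorted.length - 1)];
    }

    /** Averages the per-repetition thresholds, one array of scores per repetition. */
    static double averageThreshold(double[][] scoresPerRepetition, double rejectionRate) {
        double sum = 0.0;
        for (double[] scores : scoresPerRepetition) {
            sum += thresholdFor(scores, rejectionRate);
        }
        return sum / scoresPerRepetition.length;
    }

    public static void main(String[] args) {
        double[][] scores = {
            {0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0},
            {0.55, 0.15, 0.95, 0.35, 0.75, 0.25, 0.85, 0.45, 0.65, 0.05}
        };
        System.out.println(averageThreshold(scores, 0.1));
    }
}
```

With scores at or above the threshold accepted, a rate of 0.1 on ten held-out scores rejects exactly the single lowest score in each repetition.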
What are all the possible parameters and what do they mean?
-trr <rate>
targetRejectionRate - Sets the target rejection rate (default: 0.1)
-tcl <label>
targetClassLabel - Sets the target class label (default: 'target')
-cvr <rep>
crossValidationRepeats - Sets the number of times to repeat cross validation to find the threshold (default: 10)
-P <prop>
proportionGenerated - Sets the proportion of generated data (default: 0.5)
-cvf <perc>
crossValidationFold - Sets the percentage of heldout data for each internal cross validation fold (default: 10)
-num <classname + options>
numericGenerator - Sets the numeric generator (default: weka.classifiers.meta.generators.GaussianGenerator)
-nom <classname + options>
nominalGenerator - Sets the nominal generator (default: weka.classifiers.meta.generators.NominalGenerator)
-L
laplaceCorrection - Sets whether to correct the number of classes to two; if omitted, no correction is made.
-E
densityOnly - Sets whether to exclusively use the density estimate P(X|A).
-I
instanceWeights - Sets whether to use instance weights.
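When the classifier is configured from code rather than from the GUI, the flags above are usually passed as a string array. The sketch below only assembles such an array from the flags listed in this section; the helper name buildOptions is hypothetical. WEKA option handlers generally accept an array like this through setOptions(String[]), which is not exercised here so that the example does not depend on weka.jar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OptionsSketch {
    /** Builds an option array using the -tcl, -trr and -E flags documented above. */
    static String[] buildOptions(String targetClassLabel, double targetRejectionRate,
                                 boolean densityOnly) {
        List<String> options = new ArrayList<>();
        options.add("-tcl");
        options.add(targetClassLabel);
        options.add("-trr");
        options.add(Double.toString(targetRejectionRate));
        if (densityOnly) {
            options.add("-E"); // flag without an argument
        }
        return options.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildOptions("Iris-setosa", 0.1, false)));
    }
}
```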
What are all the generators?
The generators are the density estimators P(X|A), but they are called 'generators' here because they also are used to generate artificial data that is used to train the class probability estimator P(T|X). Each generator forms a slightly different density estimate of the data it is trained on, and generates data for a single attribute using the density model. A brief summary of the standard estimators follows...
Discrete (Numeric)
Values are put into buckets according to how often they appear, and P(X|A) of a value x is equal to the probability of the bucket that x falls into.
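As an illustration of a bucket-based estimate, here is a minimal sketch. It assumes equal-width buckets between the smallest and largest training value and clamps values outside that range to the nearest bucket. The real generator may lay out its buckets differently, so treat the layout and the class name as assumptions.

```java
final class DiscreteBucketSketch {
    private final double min;
    private final double width;
    private final double[] probability;

    /** Builds equal-width buckets and stores the fraction of values in each. */
    DiscreteBucketSketch(double[] values, int buckets) {
        if (values.length == 0 || buckets < 1) {
            throw new IllegalArgumentException("Need values and at least one bucket");
        }
        double lo = Double.POSITIVE_INFINITY;
        double hi = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        min = lo;
        width = (hi - lo) / buckets;
        probability = new double[buckets];
        for (double v : values) {
            probability[bucketOf(v)] += 1.0 / values.length;
        }
    }

    /** Index of the bucket x falls into; out-of-range values go to the edge buckets. */
    private int bucketOf(double x) {
        if (width == 0) {
            return 0; // all training values were identical
        }
        int index = (int) ((x - min) / width);
        return Math.max(0, Math.min(probability.length - 1, index));
    }

    /** Probability of the bucket that x falls into. */
    double probabilityOf(double x) {
        return probability[bucketOf(x)];
    }

    public static void main(String[] args) {
        DiscreteBucketSketch sketch =
            new DiscreteBucketSketch(new double[] {0, 1, 2, 3, 10}, 2);
        System.out.println(sketch.probabilityOf(4));
        System.out.println(sketch.probabilityOf(7));
    }
}
```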
DiscreteUniform (Numeric)
Values are put into buckets as in the Discrete Generator, but the probability of every bucket is the same.
EM (Numeric)
Uses the expectation-maximisation algorithm (weka.clusterers.EM) to fit a mixture of Gaussians to the training data. Each data point is generated by randomly selecting a single Gaussian, then generating a single value under it.
Gaussian (Numeric)
Fits a single Gaussian distribution to the data, and generates data using this distribution.
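A single-Gaussian generator plays two roles: it evaluates a density and it produces artificial values. The sketch below shows both for one numeric attribute. It assumes the sample standard deviation (dividing by n - 1), which may differ from what the actual generator computes, and the class name is hypothetical.

```java
import java.util.Random;

final class GaussianSketch {
    final double mean;
    final double stdDev;

    /** Fits mean and sample standard deviation to the values of one attribute. */
    GaussianSketch(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("Need at least two values");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        mean = sum / values.length;
        double squares = 0.0;
        for (double v : values) {
            squares += (v - mean) * (v - mean);
        }
        stdDev = Math.sqrt(squares / (values.length - 1));
    }

    /** Normal probability density at x. */
    double density(double x) {
        double z = (x - mean) / stdDev;
        return Math.exp(-0.5 * z * z) / (stdDev * Math.sqrt(2 * Math.PI));
    }

    /** Draws one artificial value from the fitted distribution. */
    double generate(Random random) {
        return mean + stdDev * random.nextGaussian();
    }

    public static void main(String[] args) {
        GaussianSketch sketch = new GaussianSketch(new double[] {1, 2, 3, 4, 5});
        System.out.println(sketch.density(3.0));
        System.out.println(sketch.generate(new Random(42)));
    }
}
```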
MixedGaussian (Numeric)
Fits two Gaussian distributions to the data, with means three standard deviations above and three standard deviations below the mean of the training data respectively. Each has half the probability, and both have the same standard deviation as the training data.
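Read literally, this description gives a two-component mixture with equal weights. The following sketch evaluates and samples from such a mixture given the training mean and standard deviation. It is an interpretation of the text above, with hypothetical names, rather than a copy of the generator's code.

```java
import java.util.Random;

final class MixedGaussianSketch {
    /** Normal probability density with the given mean and standard deviation. */
    static double normalDensity(double x, double mean, double stdDev) {
        double z = (x - mean) / stdDev;
        return Math.exp(-0.5 * z * z) / (stdDev * Math.sqrt(2 * Math.PI));
    }

    /** Equal-weight mixture of two Gaussians centred three deviations either side. */
    static double density(double x, double trainMean, double trainStdDev) {
        double lower = normalDensity(x, trainMean - 3 * trainStdDev, trainStdDev);
        double upper = normalDensity(x, trainMean + 3 * trainStdDev, trainStdDev);
        return 0.5 * lower + 0.5 * upper;
    }

    /** Picks one of the two components with equal probability, then samples from it. */
    static double generate(Random random, double trainMean, double trainStdDev) {
        double offset = random.nextBoolean() ? -3 * trainStdDev : 3 * trainStdDev;
        return trainMean + offset + trainStdDev * random.nextGaussian();
    }

    public static void main(String[] args) {
        System.out.println(density(0.0, 0.0, 1.0)); // at the training mean
        System.out.println(density(3.0, 0.0, 1.0)); // at the upper component's centre
        System.out.println(generate(new Random(7), 0.0, 1.0));
    }
}
```

Under this reading the density is low at the training mean and highest three standard deviations either side of it, so the artificial data concentrates around the edges of the target data.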
Nominal (Nominal)
The same as Discrete, but for nominal attributes.
UniformData (Numeric)
Generates values that fall within the range of the lower and upper values of the training data. All values have the same probability.
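Because each generator models a single attribute, a density for a whole instance has to be assembled from the per-attribute estimates. One plausible way, in line with the one-class "Naive Bayes" description earlier, is to treat attributes as independent and sum the log densities. The independence assumption and all names below are assumptions made for this sketch.

```java
import java.util.function.DoubleUnaryOperator;

final class JointDensitySketch {
    /**
     * Log of the joint density of an instance, assuming independent attributes:
     * the sum of the logs of the per-attribute densities.
     */
    static double logJointDensity(double[] instance,
                                  DoubleUnaryOperator[] perAttributeDensity) {
        if (instance.length != perAttributeDensity.length) {
            throw new IllegalArgumentException("One density function per attribute");
        }
        double logDensity = 0.0;
        for (int i = 0; i < instance.length; i++) {
            logDensity += Math.log(perAttributeDensity[i].applyAsDouble(instance[i]));
        }
        return logDensity;
    }

    public static void main(String[] args) {
        // Two toy per-attribute densities: uniform on [0, 2] and uniform on [0, 4].
        DoubleUnaryOperator[] densities = {
            x -> (x >= 0 && x <= 2) ? 0.5 : 0.0,
            x -> (x >= 0 && x <= 4) ? 0.25 : 0.0
        };
        System.out.println(logJointDensity(new double[] {1.0, 3.0}, densities));
    }
}
```

Working in log space avoids underflow when many small densities are multiplied together.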
How can I test a one-class classifier against multi-class classifiers?
In the case of security applications, where the task is to discriminate against new "attacker" classes, multi-class (supervised) classifiers have an unfair advantage over one-class classifiers when evaluated using standard evaluation methods: they have seen the negative data during training, whereas one-class classifiers always ignore negative data (even if it is present during training). It is possible to fairly compare the two different classification styles without biasing towards the multi-class situation. To do so, a heldout class is identified, and during training all instances from this class are deleted from the training data. A target class is also identified. During prediction, only instances from the target or heldout class are used to test the model. This means that the negative data used to test the model is always novel to the classifier.
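The class filtering in this procedure can be sketched as two small functions. The example shows only the filtering; splitting the data into training and test portions (for example by cross-validation) is left out, and the type and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HeldOutClassSketch {
    /** A minimal stand-in for a labelled instance. */
    static final class Labelled {
        final String id;
        final String classLabel;

        Labelled(String id, String classLabel) {
            this.id = id;
            this.classLabel = classLabel;
        }
    }

    /** Training data: every instance except those of the held-out class. */
    static List<Labelled> forTraining(List<Labelled> data, String heldOutClass) {
        List<Labelled> kept = new ArrayList<>();
        for (Labelled instance : data) {
            if (!instance.classLabel.equals(heldOutClass)) {
                kept.add(instance);
            }
        }
        return kept;
    }

    /** Test data: only instances of the target class or the held-out class. */
    static List<Labelled> forTesting(List<Labelled> data, String targetClass,
                                     String heldOutClass) {
        List<Labelled> kept = new ArrayList<>();
        for (Labelled instance : data) {
            if (instance.classLabel.equals(targetClass)
                    || instance.classLabel.equals(heldOutClass)) {
                kept.add(instance);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Labelled> data = Arrays.asList(
            new Labelled("a", "Iris-setosa"),
            new Labelled("b", "Iris-versicolor"),
            new Labelled("c", "Iris-virginica"));
        System.out.println(forTraining(data, "Iris-virginica").size());
        System.out.println(forTesting(data, "Iris-setosa", "Iris-virginica").size());
    }
}
```

A multi-class learner then trains on the target class plus the remaining non-held-out classes, while the one-class learner uses only the target instances from the same training data.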
This full procedure is described in the following paper which can be downloaded from here, here or here:
Hempstalk, K. and Frank, E. (2008) "Discriminating Against New Classes: One-Class versus Multi-Class Classification", In W. Wobcke and M. Zhang (Eds.), Proceedings 21st Australasian Joint Conference on Artificial Intelligence Auckland, New Zealand, December 1-5, 2008. pp. 325-336, Berlin, Springer.
How can I get more info?
The best way to understand this classifier is to read the paper "One-Class Classification by Combining Density and Class Probability Estimation". When the source code is available I recommend looking at that directly. There is probably nothing I can personally tell you that isn't in either the paper or the source code.
If you've found a bug - great, please let me know so I can squash it! My email is kah18 |at| cs.waikato.ac.nz
Final word...
I'm always keen to hear from people who are using my code, so if you've found a cool use for it or have had success with a particular dataset - I'd love to hear from you. If you have a publication that cites this classifier, please let me know.
(c) Dr. Kathryn Hempstalk, University of Waikato, 2009.