COMP416A/516A 2006: Topics in Artificial Intelligence
- Assignment 3: Automatic Data Cleansing -
UPDATED Section 4 (01/05/06)
UPDATED submission instructions (10/05/06) (see bottom of this page)
- Write a meta classifier that takes another classifier (i.e. a
"base" classifier) as an argument and implements the following
procedure:
  1. Build the base classifier on the training data.
  2. Classify the training data using the base classifier.
  3. Collect the training instances that have been classified correctly and make this the new training set.
  4. Go to Step 1 if the training set has changed in Step 3.
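The loop in Steps 1-4 can be sketched in plain Java. The BaseClassifier interface below is only a stand-in for the Weka classifier API (your real implementation will build on Weka's classes instead), but the control flow is the same: rebuild, classify, keep only the correctly classified instances, and stop once a pass discards nothing.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the Weka classifier API; names here are illustrative only. */
interface BaseClassifier {
    void buildClassifier(List<int[]> instances, List<Integer> labels);
    int classifyInstance(int[] instance);
}

/** Steps 1-4: build, classify, keep the correctly classified
 *  instances, and repeat while the training set keeps changing. */
class IterativeCleanser {
    static List<Integer> cleanse(BaseClassifier base,
                                 List<int[]> instances, List<Integer> labels) {
        List<int[]> data = new ArrayList<>(instances);
        List<Integer> y = new ArrayList<>(labels);
        boolean changed = true;
        while (changed) {
            base.buildClassifier(data, y);            // Step 1
            List<int[]> keptX = new ArrayList<>();
            List<Integer> keptY = new ArrayList<>();
            for (int i = 0; i < data.size(); i++) {   // Steps 2-3
                if (base.classifyInstance(data.get(i)) == y.get(i)) {
                    keptX.add(data.get(i));
                    keptY.add(y.get(i));
                }
            }
            changed = keptY.size() != y.size();       // Step 4
            data = keptX;
            y = keptY;
        }
        return y;  // labels of the surviving (cleansed) training set
    }
}
```

Note that the loop always terminates: the training set can only shrink, and the iteration stops the first time it stays the same size.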
Your new meta classifier should extend the class
- Run your new meta classifier with the decision tree learner
weka.classifiers.trees.J48 as the base learner on the
soybean datasets in
/home/ml/datasets/UCI. Compare the size of the final tree
generated by the meta classifier and the tree generated by plain J48.
- Perform a more extensive experiment to evaluate the effect of the
meta classifier on accuracy. Use Weka's Experimenter to
run a 10 times 10-fold cross-validation on all datasets in
/home/ml/datasets/UCI. In your first experiment, compare
the accuracy of plain
J48 to the accuracy of the meta
classifier applied in conjunction with
J48. Also record
how much data (in percent) is discarded by the meta classifier. (You
can record that information in the
Experimenter by making
your meta classifier implement the
weka.core.AdditionalMeasureProducer interface and
implementing appropriate methods. The
J48 class is an
example of a class that implements this interface.)
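For reference, the AdditionalMeasureProducer interface asks for two methods: one enumerating the names of the extra measures (which by Weka convention begin with "measure"), and one returning the value for a given name. Below is a library-free sketch of that shape; the interface is re-declared locally so the example is self-contained, and the measure name measurePercentDiscarded is made up for illustration.

```java
import java.util.Enumeration;
import java.util.Vector;

/** Local re-declaration so the sketch compiles on its own; in the
 *  assignment, implement the weka.core version instead. */
interface AdditionalMeasureProducer {
    Enumeration<String> enumerateMeasures();
    double getMeasure(String measureName);
}

/** Sketch: expose the percentage of discarded training data as an
 *  additional measure ("measurePercentDiscarded" is a made-up name). */
class CleansingMeasures implements AdditionalMeasureProducer {
    private double percentDiscarded;  // to be set by the cleansing loop

    void setPercentDiscarded(double p) { percentDiscarded = p; }

    public Enumeration<String> enumerateMeasures() {
        Vector<String> names = new Vector<>();
        names.add("measurePercentDiscarded");  // "measure..." prefix convention
        return names.elements();
    }

    public double getMeasure(String measureName) {
        if (measureName.equals("measurePercentDiscarded")) return percentDiscarded;
        throw new IllegalArgumentException(measureName + " not supported");
    }
}
```

As the assignment notes, the J48 source is the authoritative example of how Weka itself implements this interface.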
- Repeat the same experiment with
weka.classifiers.functions.SMO. Make sure to turn
on the "-M" option of SMO. Because SMO is not very efficient for
large datasets, use only the datasets of Section 2 (soybean) for
any SMO-based experiments. But do use the same procedure as in
Section 3 (10 times 10-fold CV, etc.).
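The Experimenter does the cross-validation bookkeeping for you, but it may help to see what "10 times 10-fold CV" means mechanically: for each of the 10 runs, the data is reshuffled with a different seed and dealt into 10 folds, so every instance is tested exactly once per run. A minimal sketch (class and method names are illustrative, not Weka's):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of repeated cross-validation bookkeeping: for each run,
 *  shuffle with a run-specific seed and deal instances into folds. */
class RepeatedCV {
    static int[][] foldAssignments(int n, int runs, int folds) {
        int[][] assign = new int[runs][n];
        for (int r = 0; r < runs; r++) {
            List<Integer> idx = new ArrayList<>();
            for (int i = 0; i < n; i++) idx.add(i);
            Collections.shuffle(idx, new Random(r));   // new order each run
            for (int pos = 0; pos < n; pos++)
                assign[r][idx.get(pos)] = pos % folds; // deal round-robin
        }
        return assign;
    }
}
```

Averaging accuracy over all 100 train/test pairs gives the estimate the Experimenter reports, which is what you compare between plain J48 (or SMO) and the meta classifier.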
- Change your meta classifier so that it only discards an instance if
it is misclassified and the classifier is very confident in its
prediction (i.e. if the classifier gets it "badly wrong"). To this
end, introduce a new parameter
X to your classifier and
discard an instance if it is misclassified and the base classifier's
probability for the predicted class is greater than
the cutoff value X.
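The modified discard rule can be sketched as a small predicate over the base classifier's class-probability distribution (in Weka this distribution comes from the classifier's probability estimates; the class and method names below are illustrative):

```java
/** Discard an instance only when it is misclassified AND the predicted
 *  class's probability exceeds the cutoff X ("badly wrong"). */
class ConfidentDiscard {
    static boolean shouldDiscard(double[] classProbs, int trueClass, double cutoff) {
        int predicted = 0;
        for (int k = 1; k < classProbs.length; k++)  // argmax = predicted class
            if (classProbs[k] > classProbs[predicted]) predicted = k;
        boolean misclassified = predicted != trueClass;
        boolean confident = classProbs[predicted] > cutoff;
        return misclassified && confident;
    }
}
```

With the cutoff at 0.9, an instance predicted with probability 0.95 for the wrong class is discarded, while a narrowly misclassified instance (say 0.6 for the wrong class) is kept.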
- Re-run all the above experiments with this new version of the
meta classifier, setting the cutoff value
X to 0.9.
- Write a report that records how your investigations proceeded and
the results you obtained. Make sure you comment on the results and
structure your report appropriately. Also comment on the method's
ability to identify actual errors in the data.
Value: 25% of the total marks for all four assignments
Due date: Friday, 12 May, 5:00 PM
No extensions will be granted except for sound, documented, medical reasons.
Complete your assignment early: computers tend to go wrong at the last minute.
WHAT and HOW to submit: email me (firstname.lastname@example.org) the Java code for the version of
your meta-classifier that was needed in Section 5. The report
can be handed in on paper or electronically. Acceptable formats
for electronic report submission, in order of preference, are:
pdf, ps, txt, (OpenOffice or Word if you have to).