COMP416A/516A 2006: Topics in Artificial Intelligence

- Assignment 4: Solving a Real-World Regression Problem -

For this assignment you will work on real data collected from soil samples. Each soil sample is processed by a so-called near-infrared spectrometer, producing a spectrogram of about 2000 variables. In addition, "wet" chemistry determines a further eight parameters for each sample. You are given two data files: clay.train.csv.gz and clay.test.csv.gz.

Both comprise one header line naming all attributes, followed by 2000 training examples and 700-odd test examples respectively. In the test file all wet chemistry data is missing, denoted by "?".
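
If you want to inspect the files outside Weka, the following is a minimal Python sketch for reading them. It assumes plain comma-separated text inside the gzip container and maps "?" to None; the helper names parse_cell and load_soil_csv are made up for this illustration.

```python
import csv
import gzip


def parse_cell(cell):
    """Map '?' to None, numeric text to float, anything else to the raw string."""
    text = cell.strip()
    if text == "?":
        return None
    try:
        return float(text)
    except ValueError:
        return text


def load_soil_csv(source):
    """Read a gzipped CSV with one header line; returns (header, rows).

    `source` may be a file name or an open binary file object.
    """
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # the single header line naming all attributes
        rows = [[parse_cell(cell) for cell in record] for record in reader if record]
    return header, rows
```

For example, header, rows = load_soil_csv("clay.train.csv.gz") should give one list per example, with None wherever the file contains "?".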

We are only interested in predicting the wet chemistry attribute "clay". However, "clay" is also correlated with the other 7 wet chemistry attributes, so several learning scenarios are plausible:

  1. only use VAR2-VAR2152 to predict "clay", i.e. ignore the other 7 attributes.
  2. have a two-level setup: predict each of the 8 attributes in isolation (i.e. using only VAR2-VAR2152 for each), and then add a second level where the 8 predictions are used to finally predict "clay"
  3. have a slightly different two-level setup: predict each of the other 7 attributes in isolation (i.e. using only VAR2-VAR2152 for each), use these predictions to fill in the missing values, and then predict "clay" from the union of VAR2-VAR2152 and the other 7 wet chemistry attributes.
  4. even more elaborate setups: ...
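
To make the two-level idea of scenario 2 concrete, here is a rough Python sketch. It is not a recommended solution: as a stand-in learner it uses a single-attribute least-squares line (similar in spirit to a simple linear regression), and all function names are invented for this illustration. Inputs are given as a list of columns, one list of values per spectral variable.

```python
def fit_simple(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    return mean_y - b * mean_x, b


def fit_best_single(columns, ys):
    """Fit one line per input column and keep the one with lowest squared error."""
    best = None
    for index, xs in enumerate(columns):
        a, b = fit_simple(xs, ys)
        sse = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, index, a, b)
    _, index, a, b = best
    return index, a, b


def predict(model, columns):
    """Apply a (column index, a, b) model to a list of columns."""
    index, a, b = model
    return [a + b * x for x in columns[index]]


def fit_two_level(spectra, targets, goal="clay"):
    """Level 1: one model per wet chemistry attribute, from the spectra only.
    Level 2: predict `goal` from the level-1 predictions."""
    level1 = {name: fit_best_single(spectra, ys) for name, ys in targets.items()}
    level1_out = [predict(level1[name], spectra) for name in sorted(level1)]
    level2 = fit_best_single(level1_out, targets[goal])
    return level1, level2


def predict_two_level(models, spectra):
    """Run new spectra through both levels and return the final predictions."""
    level1, level2 = models
    level1_out = [predict(level1[name], spectra) for name in sorted(level1)]
    return predict(level2, level1_out)
```

Note that this sketch trains the second level on predictions made for the very examples the first level was fitted on, which tends to be optimistic; using cross-validated first-level predictions instead is one common way around that.
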

Weka supplies a reasonable set of algorithms for regression: linear regression, smo regression, m5, m5rules, REPTree, additiveRegression (wrapping any other regression algorithm), decision stumps, simpleLinearRegression, regressionViaDiscretization, ...

2-fold cross-validation might be good enough for ranking candidate algorithms, and is faster than 10-fold cross-validation.
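
For reference, k-fold cross-validation can be sketched in a few lines of Python. This is only an illustration of the procedure, not how Weka implements it; the function names and the mean-predicting baseline are assumptions made for the example, and root mean squared error is used as the score.

```python
import math
import random


def k_fold_rmse(xs, ys, fit, predict, k=2, seed=0):
    """Estimate the root mean squared error of a learner by k-fold cross-validation.

    `fit(train_xs, train_ys)` returns a model; `predict(model, x)` returns a number.
    """
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, roughly equal folds
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            squared_errors.append((predict(model, xs[i]) - ys[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def fit_mean(train_xs, train_ys):
    """Baseline learner: ignore the inputs and remember the mean target."""
    return sum(train_ys) / len(train_ys)


def predict_mean(model, x):
    """Baseline prediction: always the remembered mean."""
    return model
```

With k=2 each model is trained only twice, on half of the data each time, which is why it is so much cheaper than k=10; the price is a noisier and more pessimistic estimate.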

VAR2-VAR2152 show a fair bit of correlation with neighbouring attributes. This can have negative consequences for some algorithms, but can also be exploited positively.
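
One possible way of exploiting this correlation (an illustration only, not something the assignment prescribes) is to average runs of neighbouring spectral values into fewer, smoother attributes. The function name bin_adjacent and the choice of bin width are assumptions for this sketch.

```python
def bin_adjacent(row, width):
    """Average each run of `width` neighbouring values into one attribute.

    The last bin may be shorter if len(row) is not a multiple of `width`.
    """
    binned = []
    for start in range(0, len(row), width):
        chunk = row[start:start + width]
        binned.append(sum(chunk) / len(chunk))
    return binned
```

Whether such a reduction helps or hurts depends on the algorithm, so it is worth checking by cross-validation.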

Work on scenario 1 from above, and on one more out of 2-4.
For both scenarios, try to find or build the best possible regression model.

Write a report that records how your investigations proceeded and the results you obtained. Make sure you comment on the results and structure your report appropriately. Also comment on the method's ability to identify actual errors in the data.

Additionally, email me what you think are your best predictions for "clay" for the test file, one prediction per line, in the order of the test file. E.g.:


There will be a spot prize for the best submission (as judged by the highest correlation coefficient).
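
If you want to compute this measure yourself, the sketch below assumes that "correlation coefficient" means the Pearson correlation between predicted and actual values, which is what Weka reports for numeric prediction. The function name is made up for this illustration.

```python
import math


def correlation_coefficient(predicted, actual):
    """Pearson correlation between two equally long lists of numbers.

    Returns 0.0 if either list is constant, where the coefficient is undefined.
    """
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        return 0.0
    return cov / math.sqrt(var_p * var_a)
```

Keep in mind that this measure is unaffected by a constant offset or a positive rescaling of the predictions, so a high correlation does not by itself guarantee a small absolute error.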

Other Information

Value: 25% of the total marks for all four assignments
Due date: Tuesday, 6 June, 5:00 PM

No extensions will be granted except for sound, documented, medical reasons. Complete your assignment early: computers tend to go wrong at the last minute.