COMP416A/516A 2006: Topics in Artificial Intelligence
Assignment 4: Solving a Real-World Regression Problem

For this assignment you will work on real data collected from soil samples. Each soil sample is processed by a so-called near-infrared spectrometer, producing a spectrogram of about 2000 variables. Additionally, "wet" chemistry determines a further eight parameters for each sample.

You are given two data files: clay.train.csv.gz and clay.test.csv.gz. Both comprise one header line naming all attributes, followed by 2000 training and 700-odd test examples respectively. In the test file all wet chemistry data is missing, denoted by "?".

We are only interested in predicting the wet chemistry attribute "clay". However, "clay" is also correlated with the other seven wet chemistry attributes, so different learning scenarios are plausible:
2-fold cross-validation might be good enough for ranking candidate approaches, and it is faster than 10-fold cross-validation. Attributes VAR2-VAR2152 show a fair bit of correlation with neighbouring attributes. This can have negative consequences for some algorithms, but it can also be exploited positively.
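The ranking idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic (not the clay data), the two candidate learners (`fit_mean`, `fit_one_feature`) and all helper names are hypothetical, and root mean squared error is chosen here as the score, which the assignment does not prescribe.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def rmse(y_true, y_pred):
    """Root mean squared error (an assumed choice of score)."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def cross_validate(fit, X, y, k=2, seed=0):
    """Average held-out RMSE of a learner over k folds.

    fit(X_train, y_train) must return a function mapping one row to a prediction.
    """
    scores = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(X[i]) for i in held_out]
        scores.append(rmse([y[i] for i in held_out], preds))
    return mean(scores)


def fit_mean(X_train, y_train):
    """Baseline learner: always predict the training mean."""
    m = mean(y_train)
    return lambda row: m


def fit_one_feature(X_train, y_train):
    """Least-squares line fitted on the first attribute only."""
    xs = [row[0] for row in X_train]
    mx, my = mean(xs), mean(y_train)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (t - my) for x, t in zip(xs, y_train))
    slope = sxy / sxx if sxx else 0.0
    return lambda row: my + slope * (row[0] - mx)


# Synthetic stand-in data: the target depends linearly on one attribute plus noise.
rng = random.Random(1)
X = [[rng.uniform(0, 10)] for _ in range(200)]
y = [3.0 * row[0] + rng.gauss(0, 1) for row in X]

learners = {"mean": fit_mean, "linear": fit_one_feature}
score_2fold = {name: cross_validate(f, X, y, k=2) for name, f in learners.items()}
score_10fold = {name: cross_validate(f, X, y, k=10) for name, f in learners.items()}

# Rank learners from best (lowest error) to worst under each scheme.
ranking_2 = sorted(score_2fold, key=score_2fold.get)
ranking_10 = sorted(score_10fold, key=score_10fold.get)
print(ranking_2, ranking_10)
```

If the two rankings agree on the learners you are comparing, the cheaper 2-fold estimate is probably sufficient for choosing between them, even though its absolute error estimates are based on models trained on only half of the data.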
Work on scenario 1 from above, and on one more out of 2-4.
Write a report that records how your investigations proceeded and the results you obtained. Make sure you comment on the results and structure your report appropriately. Also comment on the method's ability to identify actual errors in the data.
Additionally, email me what you think are your best predictions for "clay" for the test file, one prediction per line, in the order of the test file. E.g.:
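A file in this one-prediction-per-line layout could be produced along the following lines. This is a minimal sketch: the values, the output file name, the helper `write_predictions`, and the choice of four decimal places are all hypothetical and not part of the assignment.

```python
import os
import tempfile


def write_predictions(predictions, path):
    """Write one numeric prediction per line, preserving the given order."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        for p in predictions:
            # Four decimal places is an assumed format, not a requirement.
            f.write(f"{p:.4f}\n")


# Hypothetical values, listed in the same order as the rows of the test file.
example_predictions = [12.5, 30.25, 7.0]
out_path = os.path.join(tempfile.gettempdir(), "clay_predictions.txt")
write_predictions(example_predictions, out_path)

# Read the file back to check that it holds exactly one value per line.
with open(out_path, encoding="ascii") as f:
    lines = f.read().splitlines()
print(lines)
```

Checking that the number of lines equals the number of test examples before sending the file guards against predictions that are silently dropped or reordered.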
Other Information
Value: 25% of the total marks for all four assignments.
No extensions will be granted except for sound, documented medical reasons.
Complete your assignment early: computers tend to go wrong at the last minute.