Mining Big Data with Weka 3

A common misconception is that the Weka machine learning software cannot be applied to large datasets. When considering large datasets, it is important to distinguish between training machine learning models and deploying such models for prediction. Weka is being used to make predictions in real time in very demanding real-world applications, and this can be done with almost all Weka models once they have been trained.

However, training classifiers on large datasets can be challenging, particularly when using Weka's popular graphical Explorer user interface. The Explorer always loads the entire training dataset into the computer's main memory and also incurs significant overhead due to visualisation, etc. Moreover, the amount of memory usable by the Explorer depends on the "heap space" available to Java, which, by default, is less than the physical amount of memory in the computer. (It is possible to increase this heap space by configuring the Java environment for Weka appropriately, for example with the JVM's -Xmx option.)

Fortunately, there are alternatives: the Knowledge Flow interface for Weka, the command-line interface (e.g., Weka's SimpleCLI), or programmatic application of Weka with Java or a Java-based scripting language such as Groovy or Jython. These alternatives make it possible to process datasets that are too big to fit into the computer's main memory. For example, any so-called "UpdateableClassifier" in Weka can be trained incrementally by loading and processing each instance in a dataset separately. (The massiveOnlineAnalysis package for Weka provides access to the MOA data stream software containing state-of-the-art incremental algorithms for large datasets or data streams.) Additionally, non-incremental learning algorithms can be applied to large datasets by subsampling the data. (Reservoir sampling is an incremental sampling method that can be used for this purpose.)

Weka also has optional support for distributed data mining with Hadoop and Spark.
The distributedWekaBase package provides base "map" and "reduce" tasks that are not tied to any specific distributed platform. The distributedWekaHadoop package provides Hadoop-specific wrappers and jobs for these base tasks. The distributedWekaSpark package provides Spark-specific wrappers.
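
As an illustration of the reservoir sampling mentioned above, the following is a minimal sketch in plain Java. It is not Weka's own implementation: the class name ReservoirSampler and its sample method are hypothetical names chosen for this example, and the data stream is represented by an ordinary Iterator. The idea is that a uniform random sample of k items can be drawn in a single pass over a stream of unknown length while holding only k items in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.stream.IntStream;

// Hypothetical helper for illustration only; not part of Weka.
class ReservoirSampler {

    // Returns a uniform random sample of up to k items from the stream,
    // reading each item exactly once and keeping at most k items in memory.
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0; // number of items read so far
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Draw j from [0, seen); the new item replaces slot j
                // only if j < k, which happens with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Sample 5 values from a stream of one million integers.
        Iterator<Integer> stream = IntStream.range(0, 1_000_000).boxed().iterator();
        List<Integer> subsample = sample(stream, 5, new Random());
        System.out.println(subsample);
    }
}
```

In a Weka setting, one would presumably wrap a loader that reads instances one at a time in such an iterator and then train a non-incremental classifier on the resulting subsample; that wiring is left out here because it depends on the data source.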