Weka on cloud

Overview

WEKA is open source software that provides tools for data preprocessing, implementations of several machine learning algorithms, and visualization tools, so that you can develop machine learning techniques and apply them to real-world data mining problems. The following diagram summarizes what WEKA offers:

Weka Summarized

First, you start with the raw data collected from the field. This data may contain null values and irrelevant fields, so you use the data preprocessing tools provided in WEKA to cleanse it. Then you save the preprocessed data in your local storage for applying ML algorithms.

Next, depending on the kind of ML model you are trying to develop, you select one of the options such as Classify, Cluster, or Associate. The Attributes Selection option allows automatic selection of features to create a reduced dataset. Under each category, WEKA provides implementations of several algorithms. You select an algorithm of your choice, set the desired parameters, and run it on the dataset.

WEKA then gives you the statistical output of the model processing and provides a visualization tool to inspect the data. Various models can be applied to the same dataset, so you can compare their outputs and select the one that best meets your purpose.


Weka is a data mining/machine learning application. The purpose of this article is to teach you how to use the Weka Explorer, classify a dataset with Weka, and visualize the results.

Figure 1.1 Weka GUI

  1. Simple CLI is a simple command line interface for running Weka functions directly.
  2. Explorer is an environment for exploring the data.
  3. Experimenter is an environment for running experiments and statistical tests between learning schemes.
  4. KnowledgeFlow is a Java Beans based interface for setting up and running machine learning experiments.

I will use ‘Explorer’ for the exercises. Just click the Explorer button to switch to the Explorer section.

Pre-processing

Figure 2.1 Iris Dataset

If your data is in .xls format, as in Figure 2.1, you have to convert the file first. I’ll use the Iris dataset to illustrate the conversion:

  1. Convert your .xls file to .csv format.
  2. Open the CSV file in any text editor and add @RELATION database_name as the first row of the file.
  3. Define the attributes, one per line, using @ATTRIBUTE attr_name attr_type, in place of the CSV header row. If the attribute is numeric, define attr_type as REAL; otherwise list its possible values between curly braces. Sample images are below.
  4. Finally, add a @DATA tag just above your data rows, then save the file with the .arff extension. You can see the result in Figure 2.2.
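
The manual steps above can also be scripted. The following is a minimal sketch, not part of Weka itself: the function name csv_to_arff and its parameters are hypothetical, it assumes the CSV has a header row, and it treats every column as REAL unless the column is named in nominal_columns. Values containing spaces or commas would need quoting, which this sketch does not handle.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal_columns=()):
    """Convert CSV text with a header row into ARFF text (illustrative helper)."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]

    lines = ["@RELATION " + relation, ""]
    for index, name in enumerate(header):
        if name in nominal_columns:
            # Nominal attribute: list the distinct values in curly braces.
            values = sorted({row[index] for row in data})
            lines.append("@ATTRIBUTE %s {%s}" % (name, ",".join(values)))
        else:
            # Everything else is assumed to be numeric.
            lines.append("@ATTRIBUTE %s REAL" % name)

    lines.append("")
    lines.append("@DATA")
    lines.extend(",".join(row) for row in data)
    return "\n".join(lines) + "\n"


# Three sample rows in the Iris column layout, one per species.
sample_csv = (
    "sepallength,sepalwidth,petallength,petalwidth,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
arff_text = csv_to_arff(sample_csv, "iris", nominal_columns=("class",))
print(arff_text)
```

Writing arff_text to a file with the .arff extension should give a file that can be opened from the Explorer in the same way as a hand-edited one.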

Figure 2.2 Iris Dataset in .arff format

Load Your Data

If you followed all the steps so far, you can load your dataset successfully and you’ll see the attribute names (the red area in the images above). The pre-processing stage is called Filter in Weka; you can click the ‘Choose’ button under Filter and apply any filter you want. For example, if you would like to use Association Rule Mining as a training model, you have to discretize numeric, continuous attributes. To do that, follow the path: Choose -> Filter -> Supervised -> Attribute -> Discretize
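
To give a feel for what discretization does to a numeric attribute, here is a small sketch. It is not Weka's algorithm: the supervised Discretize filter chooses cut points using the class labels, whereas this sketch uses plain equal-width binning, which is simpler to follow. The function name equal_width_bins and the sample values are hypothetical.

```python
def equal_width_bins(values, bins=3):
    """Map each numeric value to a bin label using equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    labels = []
    for value in values:
        if width == 0:
            index = 0  # all values identical: everything falls in one bin
        else:
            # The maximum value would land in bin number `bins`, so clamp it.
            index = min(int((value - low) / width), bins - 1)
        labels.append("bin%d" % (index + 1))
    return labels


# Hypothetical measurements, chosen so the bin boundaries are easy to check.
measurements = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]
binned = equal_width_bins(measurements, bins=3)
print(binned)  # ['bin1', 'bin1', 'bin2', 'bin2', 'bin3', 'bin3']
```

After discretization each numeric attribute becomes a nominal one with a small set of values, which is the form association rule learners expect.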

Classification

For this tutorial we will use the Iris dataset to illustrate classification with Weka. You can download the dataset from here. Since the Iris dataset doesn’t need pre-processing, we can classify it directly. Weka is a good tool for beginners; it includes a large number of algorithms. After you load your dataset, click the Classify tab to switch to the window that we will talk about in this post.

ZeroR is the default classifier in Weka. But since ZeroR always predicts the majority class and therefore performs poorly on the Iris dataset, we’ll replace it with the J48 algorithm, which is known for its very good success rate on this dataset. By clicking the Choose button in Area 1 of Figure 4.1, you can select a new algorithm from the list. J48 is in the trees directory of the classifier list. Before running the algorithm we have to select one of the test options in Area 2. There are 4 test options:

  1. Use training set: Evaluates your model on the same dataset it was trained on.
  2. Supplied test set: Evaluates your model on a separate dataset that you supply. Select the dataset file by clicking the Set button.
  3. Cross-validation: A widely used option, especially if you have a limited amount of data. The number you enter in the Folds field is the number of subsets your dataset is divided into (let’s say it is 10). The original dataset is randomly partitioned into 10 subsets. Weka then uses set 1 for testing and the other 9 sets for training in the first round, set 2 for testing and the other 9 sets for training in the second round, and so on, repeating 10 times in total so that each set is used for testing exactly once. In the end, the average success rate is reported to the user.
  4. Percentage split: Divides your dataset into training and test sets according to the number you enter. The default value is 66%, which means 66% of your dataset is used as the training set and the remaining 34% as the test set.
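
The cross-validation and percentage split options can be sketched in a few lines. This is an illustration of the idea only, not Weka's implementation: it works on instance indices, it does not stratify by class, and the function names, the seed parameter and the rounding rule are assumptions.

```python
import random


def k_fold_indices(n_instances, folds=10, seed=1):
    """Yield (train, test) index lists, one pair per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every folds-th index, so fold sizes differ by at most one.
    parts = [indices[i::folds] for i in range(folds)]
    for i in range(folds):
        test = parts[i]
        train = [idx for j, part in enumerate(parts) if j != i for idx in part]
        yield train, test


def percentage_split(n_instances, train_percent=66, seed=1):
    """Return (train, test) index lists for a single percentage split."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_percent / 100)
    return indices[:cut], indices[cut:]


# The Iris dataset has 150 instances.
for train, test in k_fold_indices(150, folds=10):
    print(len(train), len(test))  # 135 15 in every round

train, test = percentage_split(150, train_percent=66)
print(len(train), len(test))  # 99 51
```

With 10 folds every instance is used for testing exactly once, which is why cross-validation makes better use of a small dataset than a single split does.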

Figure 4.2 Parameters of Algorithm

By clicking the text area (the arrow in Figure 4.2), you can edit the parameters of the algorithm according to your needs.

I chose 10-fold cross-validation from the Test Options and used the J48 algorithm. I chose class as my class feature from the drop-down list and clicked the “Start” button in Area 2 of Figure 4.3. According to the result, the success rate is 96%; you can see it in the Classifier Output shown in Area 1 of Figure 4.3.

Figure 4.3 Classification Results

The Classifier Output in Area 1 gives you detailed results, as you can see in Figure 4.4. It consists of 5 parts. The first one is Run Information, which gives detailed information about the dataset and the model you used. As you can see in Figure 4.4, we used J48 as the classification model, our dataset was the Iris dataset, and its features are sepallength, sepalwidth, petallength, petalwidth, and class. Our test mode is 10-fold cross-validation. Since J48 is a decision tree, our model created a pruned tree. As you can see in the tree, the first branching happens on petalwidth, the petal width of the flowers: if the value is smaller than or equal to 0.6, the species is Iris-setosa; otherwise another branch checks a further condition to decide the species. In the textual tree, ‘:’ introduces the class label.
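
A pruned decision tree is just a set of nested conditions, so it can be written out as code. The sketch below is an assumption rather than a transcription of Figure 4.4: the attributes and thresholds follow the tree that J48 with default settings commonly reports for Iris (a first split at 0.6 on petal width, then further splits on petal width and petal length). Substitute the values from your own Classifier Output if they differ. The function name classify_iris is hypothetical.

```python
def classify_iris(petallength, petalwidth):
    """Classify one flower with a hand-written copy of a small pruned tree."""
    # Thresholds are assumed, not taken from the figures in this article.
    if petalwidth <= 0.6:
        return "Iris-setosa"
    if petalwidth > 1.7:
        return "Iris-virginica"
    # From here on, petal width is between 0.6 and 1.7.
    if petallength <= 4.9:
        return "Iris-versicolor"
    if petalwidth <= 1.5:
        return "Iris-virginica"
    return "Iris-versicolor"


print(classify_iris(petallength=1.4, petalwidth=0.2))  # Iris-setosa
print(classify_iris(petallength=4.7, petalwidth=1.4))  # Iris-versicolor
print(classify_iris(petallength=6.0, petalwidth=2.5))  # Iris-virginica
```

Each return statement corresponds to one leaf of the tree, so this version has five leaves.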

The Classifier Model part shows the model as a tree and gives some information about it, such as the number of leaves and the size of the tree. Next is the stratified cross-validation part, which shows the error rates. By checking this part you can see how successful your model is. For example, our model correctly classified 96% of the instances and our mean absolute error is 0.035, which is acceptable for the Iris dataset and our model.

Figure 4.4 Detailed Classification Result

You can see a Confusion Matrix and a Detailed Accuracy table at the bottom of the report. The F-Measure and ROC Area values are important for evaluating models, and they are derived from the confusion matrix. A confusion matrix represents the True Positive, True Negative, False Positive and False Negative counts, which I explain next. If you already understand confusion matrices, you can skip directly to the Visualizing the Result part.

Confusion Matrix
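
In Weka's output each row of the confusion matrix is an actual class and each column is the class the model predicted. For a given class, true positives are instances of that class predicted as that class, false negatives are instances of that class predicted as something else, and false positives are instances of other classes predicted as that class. The sketch below derives accuracy, precision, recall and F-Measure from such a matrix. The matrix values are hypothetical, chosen only so that the overall accuracy is 96% on 150 instances; they are not copied from Figure 4.4. The function name per_class_metrics is also an assumption.

```python
def per_class_metrics(matrix, labels):
    """Return accuracy plus (precision, recall, f_measure) per class."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # rest of the row
        fp = sum(row[i] for row in matrix) - tp  # rest of the column
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        report[label] = (precision, recall, f_measure)
    return report


labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
# Hypothetical matrix: rows are actual classes, columns are predicted classes.
matrix = [
    [50, 0, 0],
    [0, 47, 3],
    [0, 3, 47],
]
report = per_class_metrics(matrix, labels)
print("accuracy: %.2f" % report["accuracy"])
for label in labels:
    print(label, "precision=%.2f recall=%.2f f=%.2f" % report[label])
```

The diagonal holds the correctly classified instances, so a model is doing well when most of the mass of the matrix sits on the diagonal.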

Visualizing the Result

Figure 4.5 Visualize Tree Menu

Right-click the result in the result list and select Visualize tree to see your model illustrated as in Figure 4.6.

Figure 4.6 Visualized Tree

If you’d like to see the classification errors illustrated, select Visualize Classifier Errors in the same menu. By sliding jitter (you can see it in Area 1 of Figure 4.6) you can see all samples on the coordinate plane. The X axis represents the predicted classes and the Y axis represents the actual classes. Squares represent wrongly classified samples and stars represent correctly classified samples. Blue ones are Iris-setosa, red ones are Iris-versicolor, and green ones are Iris-virginica. So, a red square means our model classified this sample as Iris-versicolor but it is supposed to be Iris-virginica.

Figure 4.7 Visualize Classifier Errors

If you click on one of the squares, you can see more detailed information. I clicked one of the blue ones, as shown in Figure 4.8, and saw in detail which sample was classified wrongly. But why would we want to see wrongly classified samples in detail?

In machine learning we have various samples that have to be classified. Sometimes, looking at the samples yourself gives you basic ideas for making your classifier model more robust, or helps you find outliers that are irrelevant to the data you use. So, although we call it machine learning, most of the time it depends on a human to control the data in the datasets.

Weka is data mining software that uses a collection of machine learning algorithms. These algorithms can be applied directly to the data or called from Java code. Weka is a collection of tools for:
  • Regression
  • Clustering
  • Association
  • Data pre-processing
  • Classification
  • Visualisation

The features of Weka are shown below

Figure 1 Weka Features

Figure 2 Weka’s Application Interfaces

Weka is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. The algorithms can either be applied directly to a dataset or called from your own Java code.

Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

Weka is owned by Weka (https://sourceforge.net/projects/weka/) and they own all related trademarks and IP rights for this software.

Weka is open source software issued under the GNU General Public License.

Weka on cloud for AWS

Features

Five features of Weka are:

  • Open Source: It is released as open source software under the GNU GPL. It is dual licensed and Pentaho Corporation owns the exclusive license to use the platform for business intelligence in their own product.
  • Graphical Interface: It has a Graphical User Interface (GUI). This allows you to complete your machine learning projects without programming.
  • Command Line Interface: All features of the software can be used from the command line. This can be very useful for scripting large jobs.
  • Java API: It is written in Java and provides an API that is well documented and promotes integration into your own applications. Note that the GNU GPL means that in turn your software would also have to be released as GPL.
  • Documentation: There are books, manuals, wikis and MOOC courses that can teach you how to use the platform effectively.

Major Features of Weka

    • Machine Learning
    • Data Mining
    • Preprocessing
    • Classification
    • Regression
    • Clustering
    • Association rules
    • Attribute selection
    • Experiments
    • Workflow
    • Visualization

AWS

Installation Instructions For Windows

Note: How to find PublicDNS in AWS

Step 1) RDP Connection: To connect to the deployed instance, please follow the Instructions to Connect to Windows instance on AWS Cloud

1) Connect to the virtual machine using the following RDP credentials:

  • Hostname: PublicDNS / IP of machine
  • Port: 3389

Username: Administrator
Password: Please click here to learn how to get the password.

Step 2) Click the Windows “Start” button and select “All Programs” and then point to Weka.

Step 3) Other Information:

1. Default installation path: in your root folder, “C:\Program Files\Weka-3-8”
2. Default ports:

  • Windows Machines:  RDP Port – 3389
  • Http: 80
  • Https: 443

Note: Click the desktop icon and press Start; the app will then open in the browser.

Configure custom inbound and outbound rules using this link

Installation Step by Step Screenshots

Google

Installation Instructions For Windows

Step 1) VM Creation:

  1. Click the Launch on Compute Engine button to choose the hardware and network settings.
  2. On this page you can see an overview of the Cognosys image as well as the estimated cost of running the instance.
  3. On the settings page, you can choose the number of CPUs, the amount of RAM, the disk size and type, etc.

Step 2) RDP Connection: To connect to the deployed instance, please follow the Instructions to Connect to Windows instance on Google Cloud

Step 3) Click the Windows “Start” button and select “All Programs” and then point to Weka.

Step 4) Other Information:

1. Default installation path: in your root folder, “C:\Program Files\Weka-3-8”
2. Default ports:

  • Windows Machines:  RDP Port – 3389
  • Http: 80
  • Https: 443

Note: Click the desktop icon and press Start; the app will then open in the browser.
