GRAC


Project maintained by GlorimarCastro Hosted on GitHub Pages — Theme by mattgraham

Introduction

Description

GRAC is a Python-based, special-purpose programming language for supervised machine learning and statistics. It lets the user upload CSV files and run classification methods, predictions, and statistics on them.

Motivation

Python and R are among the most widely used languages for machine learning and data mining. Tools such as Weka and Scikit-Learn were created to facilitate machine learning, data mining, and big data analysis. Weka, however, is built for Java programming and works only with the attribute-relation file format (ARFF). Scikit-Learn is for Python but doesn't let you pre-process text data files, and it requires many dependencies (e.g. NumPy, SciPy, PyDot). R is aimed mainly at statistical computing and graphics, which makes machine learning algorithms hard to code. For these reasons, a new programming tool that combines the benefits of Weka, Scikit-Learn, and R while still letting you program in Python is desirable. Here we propose a new Python-based programming language: GRAC. GRAC is a programming language for machine learning, data mining, and big data analysis that combines some of the benefits of Weka, Scikit-Learn, and R. At the same time, GRAC lets you pre-process text files and use comma-separated values (CSV) files.

Machine learning, data mining, and big data have driven advances in artificial intelligence, gene therapy, cybersecurity, bioinformatics, medical diagnosis, computer vision, and other fields. They have also been used to improve financial trading, business processing, sports, law enforcement, telecommunications, search engines, and terrorism detection. Because their applications are so numerous, good tools for machine learning, data mining, and big data analysis are needed. As described above, several tools already exist for these tasks. We propose to bring the benefits of these tools together in a single one, with the main motivation being to let users work with CSV files, one of the most widely used formats in these fields. The main purpose of GRAC is to let users simply indicate the name of a CSV file and a list of actions, so that research can be accomplished faster.

Statistic Section

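The available methods for statistics, as documented in the Reference Manual, are: mean(x), avg(x), min(x), max(x), mode(x), least(x), rndm(x), count(x), and stdev(x).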

Machine Learning Section

For now, the available machine learning classifiers are supervised classifiers. The Reference Manual documents svc(), dtc(), and gnbc().

GRAC also lets you determine the best classifier for your data (based on accuracy) and run cross-validation.

Version

1.0


Installation

Dependencies

GRAC uses a series of Python packages, all listed in the requirements.txt file. To install these packages, run the command given below for your operating system.

Ubuntu:

$ python -m pip install -r requirements.txt

Make sure that you have the newest version of pip on your system, since Ubuntu ships an outdated version. The dependecies folder contains the Python file get-pip.py, which installs the newest version of pip.

Windows:

On Windows, if you have Anaconda or Conda, you can run the following command:

$ conda create -n new_environment --file requirements.txt

If you don't have Anaconda or Conda, you can install the dependencies with the following commands:

$ python -m pip install numpy
$ python -m pip install scipy
$ python -m pip install ply
$ python -m pip install pydot
$ python -m pip install scikit-learn

If you get an error installing scikit-learn you should download the source file. Users with Python 2.7 can find an installer for scikit-learn in the dependecies folder.
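After installing, it can help to confirm that every dependency is importable. The following Python sketch is not part of GRAC; it simply tries to import each package. The import names are assumptions based on the packages listed above (scikit-learn is imported as sklearn), and missing_modules is a hypothetical helper name.

```python
import importlib

# Import names assumed for the packages listed above
# (scikit-learn is imported under the name "sklearn").
REQUIRED_MODULES = ["numpy", "scipy", "ply", "pydot", "sklearn"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names from the list that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


missing = missing_modules()
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All listed dependencies could be imported.")
```

Any package reported as missing can then be installed with the pip commands shown above.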

GRAC

To use GRAC you can clone this repository or you can download the zip file from the latest release.


Example Section

Examples of how to use GRAC can be found in the examples folder.


GRAC Grammar

[Image: GRAC grammar diagram]


Language Tutorial

The following video contains a tutorial and a description of GRAC:

Users can also consult the Reference Manual section for detailed instructions on how to use the different methods of GRAC.


Reference Manual

Basic Syntax

grac{
}
grac{
hasheader = true;
t = 9
}

Types

Booleans

Arrays

Variables

kfold

$ grac{
$    uploadTrainingData("C:\path\to\file");
$    kfold = 10;
$    gnbc();
$    executeCV();
$    saveCVResult("path\to\output\file")
$}

hasheader

$ grac{
$    hasheader = true;
$    class_column = 0;
$    features_columns = [1,2,52];
$    uploadTrainingData("C:\path\to\file");
$    gnbc();
$    execute()
$}

class_column

$ grac{
$    hasheader = true;
$    class_column = 0;
$    features_columns = [1,2,52];
$    uploadTrainingData("C:\path\to\file");
$    gnbc();
$    execute()
$}

test_class_column

$ grac{
$    hasheader = true;
$    class_column = 0;
$    test_class_column = 1;
$    features_columns = [1,2,52];
$    test_features_column = [2,5,8];
$    uploadTrainingData("C:\path\to\file");
$    uploadTestData("C:\path\to\file");
$    svc();
$    execute();
$    predict()
$}

features_columns

$ grac{
$    hasheader = true;
$    class_column = 0;
$    features_columns = [1,2,52];
$    uploadTrainingData("C:\path\to\file");
$    gnbc();
$    execute()
$}
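To make the roles of hasheader, class_column, and features_columns concrete, here is a small Python sketch of the kind of column selection these variables describe. It is not GRAC's implementation. The sample data and the split_columns helper are hypothetical, and the sketch assumes that column indices are zero-based and that hasheader = true means the first row holds column names.

```python
import csv
import io

# Hypothetical sample data; column 0 holds the class label.
SAMPLE_CSV = """label,height,width,depth
cat,1.0,2.0,3.0
dog,4.0,5.0,6.0
"""


def split_columns(csv_text, class_column, features_columns, hasheader=True):
    """Split CSV rows into class labels and numeric feature vectors.

    Column indices are assumed to be zero-based.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if hasheader:
        rows = rows[1:]  # drop the header row
    labels = [row[class_column] for row in rows]
    features = [[float(row[i]) for i in features_columns] for row in rows]
    return labels, features


labels, features = split_columns(SAMPLE_CSV, class_column=0, features_columns=[1, 3])
print(labels)    # ['cat', 'dog']
print(features)  # [[1.0, 3.0], [4.0, 6.0]]
```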

test_features_column

$ grac{
$    hasheader = true;
$    class_column = 0;
$    test_class_column = 1;
$    features_columns = [1,2,52];
$    test_features_column = [2,5,8];
$    uploadTrainingData("C:\path\to\file");
$    uploadTestData("C:\path\to\file");
$    svc();
$    execute();
$    predict()
$}

Functions

Machine Learning

svc()

$ grac{
$     uploadTrainingData("C:\path\to\file");
$     svc();
$     execute()
$}

dtc()

$ grac{
$     uploadTrainingData("C:\path\to\file");
$     dtc();
$     execute()
$}

gnbc()

$ grac{
$     uploadTrainingData("C:\path\to\file");
$     gnbc();
$     execute()
$}

execute()

$ grac{
$    hasheader = true;
$    class_column = 0;
$    features_columns = [1,2,52];
$    uploadTrainingData("C:\path\to\file");
$    gnbc();
$    execute()
$}

executeCV()

$ grac{
$    uploadTrainingData("C:\path\to\file");
$    kfold = 3;
$    gnbc();
$    executeCV();
$    saveCVResult("path\to\output\file")
$}
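This manual does not describe how executeCV() splits the data, so the following Python sketch only illustrates k-fold cross-validation in general: the samples are divided into kfold groups, and each group serves once as the test set while the rest are used for training. The kfold_indices helper is hypothetical, and the sketch does not shuffle the data.

```python
def kfold_indices(n_samples, kfold):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= kfold <= n_samples:
        raise ValueError("kfold must be between 2 and the number of samples")
    indices = list(range(n_samples))
    fold_sizes = [n_samples // kfold] * kfold
    for i in range(n_samples % kfold):
        fold_sizes[i] += 1  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# With 7 samples and kfold = 3 the test folds have sizes 3, 2, and 2.
for train, test in kfold_indices(n_samples=7, kfold=3):
    print(train, test)
```

Every sample appears in exactly one test fold, so each one is used for evaluation once and for training kfold - 1 times.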

getCVErrorRate()

$ grac{
$    uploadTrainingData("C:\path\to\file");
$    kfold = 3;
$    gnbc();
$    executeCV();
$    saveCVResult("path\to\output\file");
$    getCVErrorRate()
$}

printBestClassifier()

$ grac{
$    uploadTrainingData("C:\path\to\file");
$    printBestClassifier()
$}
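GRAC chooses the best classifier based on accuracy. As a rough illustration of that idea only, and not of GRAC's internals, the Python sketch below computes the accuracy of several sets of predictions and picks the highest. The accuracy and best_classifier helpers and the sample predictions are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / float(len(true_labels))


def best_classifier(true_labels, predictions_by_classifier):
    """Return (name, accuracy) for the classifier with the highest accuracy."""
    scores = {name: accuracy(true_labels, preds)
              for name, preds in predictions_by_classifier.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical predictions, for illustration only.
truth = ["a", "b", "a", "b"]
predictions = {
    "svc": ["a", "b", "a", "a"],   # 3 of 4 correct
    "dtc": ["a", "b", "a", "b"],   # 4 of 4 correct
    "gnbc": ["b", "b", "a", "a"],  # 2 of 4 correct
}
print(best_classifier(truth, predictions))  # ('dtc', 1.0)
```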

predict()

$ grac{
$    hasheader = true;
$    class_column = 0;
$    test_class_column = 1;
$    features_columns = [1,2,52];
$    test_features_column = [2,5,8];
$    uploadTrainingData("C:\path\to\file");
$    uploadTestData("C:\path\to\file");
$    svc();
$    execute();
$    predict()
$}

Statistic

mean(x)

$ grac{
$    x = [1,2,3,4];
$    mean(x)
$  }

avg(x)

$ grac{
$    x = [1,2,3,4];
$    mean(x);
$    avg([1,2,3,4])
$  }

min(x)

$ grac{
$    x = [1,2,3,4];
$    min(x)
$  }

max(x)

$ grac{
$    x = [1,2,3,4];
$    max(x)
$  }

mode(x)

$ grac{
$    x = [1,2,3,4];
$    mode(x)
$  }

least(x)

$ grac{
$    x = [1,2,3,4];
$    least(x)
$  }

rndm(x)

$ grac{
$    x = [1,2,3,4];
$    rndm(x)
$  }

count(x)

$ grac{
$    x = [1,2,3,4];
$    count(x)
$  }

stdev(x)

$ grac{
$    x = [1,2,3,4];
$    stdev(x)
$  }
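To sanity-check results from the statistics functions, comparable values can be computed with Python's standard library, as in the sketch below. This is not GRAC's implementation, and the sample list is hypothetical. In particular, this manual does not say whether stdev(x) is the sample or the population standard deviation, so both are shown. least(x) and rndm(x) are left out because their exact behaviour is not described here.

```python
import statistics
from collections import Counter

x = [1, 2, 2, 3, 4]  # hypothetical sample data

mean_value = statistics.mean(x)                  # arithmetic mean
smallest, largest = min(x), max(x)               # extreme values
count_value = len(x)                             # number of elements
mode_value = Counter(x).most_common(1)[0][0]     # most frequent value
sample_stdev = statistics.stdev(x)               # n - 1 in the denominator
population_stdev = statistics.pstdev(x)          # n in the denominator

print(mean_value)         # 2.4
print(smallest, largest)  # 1 4
print(count_value)        # 5
print(mode_value)         # 2
print(sample_stdev, population_stdev)
```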

Data Upload

uploadTrainingData()

$ grac{
$    hasheader = true;
$    class_column = 0;
$    test_class_column = 1;
$    features_columns = [1,2,52];
$    test_features_column = [2,5,8];
$    uploadTrainingData("C:\path\to\file");
$    svc();
$    execute();
$ }

uploadTestData()

$ grac{
$    hasheader = true;
$    class_column = 0;
$    test_class_column = 1;
$    features_columns = [1,2,52];
$    test_features_column = [2,5,8];
$    uploadTrainingData("C:\path\to\file");
$    uploadTestData("C:\path\to\file");
$    svc();
$    execute();
$    predict()
$ }

uploadData()

$ grac{
$    hasheader = true;
$    class_column = 0;
$    test_class_column = 1;
$    features_columns = [1,2,52];
$    test_features_column = [2,5,8];
$    uploadTrainingData("C:\path\to\file");
$    uploadTestData("C:\path\to\file");
$    svc();
$    execute();
$    predict();
$    savePredResult("fileName");
$    uploadData("fileName.csv");
$    mode(0);
$    mean(0);
$    stdev(0);
$ }

Saving results:

saveCVResult():

$ grac{
$    hasheader = true;
$    uploadTrainingData("C:\path\to\file");
$    uploadTestData("C:\path\to\file");
$    svc();
$    executeCV();
$    saveCVResult()
$ }

saveStatResult():

$ grac{
$    x = [1,2,3,4];
$    stdev(x);
$    max(x);
$    mode(x);
$    saveStatResult()
$  }

savePredResult():

$ grac{
$    hasheader = true;
$    class_column = 0;
$    test_class_column = 1;
$    features_columns = [1,2,52];
$    test_features_column = [2,5,8];
$    uploadTrainingData("C:\path\to\file");
$    uploadTestData("C:\path\to\file");
$    svc();
$    execute();
$    predict();
$    savePredResult()
$ }

Language Development

Translator Architecture

For details of the architecture of the translator see the Grammar and Lexemes, Tokens and Syntax section.

Interfaces between the modules

In order to execute some modules, others must be executed first. The modules that depend on others are:

Dependent module         Methods that need to be run first
execute()                uploadTrainingData(), svc() | dtc() | gnbc()
executeCV()              uploadTrainingData(), svc() | dtc() | gnbc()
getCVErrorRate()         uploadTrainingData(), svc() | dtc() | gnbc(), executeCV()
printBestClassifier()    uploadTrainingData()
predict()                uploadTrainingData(), uploadTestData(), svc() | dtc() | gnbc(), execute()
saveCVResult()           uploadTrainingData(), uploadTestData(), svc() | dtc() | gnbc(), executeCV()
savePredResult()         uploadTrainingData(), uploadTestData(), svc() | dtc() | gnbc(), execute(), predict()
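The dependencies listed above can also be written down as data and checked mechanically. The Python sketch below is a hypothetical helper, not part of the GRAC translator. It records each method's prerequisites and reports the first call in a sequence whose prerequisites have not run yet. It assumes that a prerequisite only has to appear somewhere earlier in the program.

```python
# Prerequisites transcribed from the list above. Each entry is a list of
# requirements; a requirement is a set of alternatives, any one of which suffices.
CLASSIFIER = {"svc", "dtc", "gnbc"}
PREREQUISITES = {
    "execute": [{"uploadTrainingData"}, CLASSIFIER],
    "executeCV": [{"uploadTrainingData"}, CLASSIFIER],
    "getCVErrorRate": [{"uploadTrainingData"}, CLASSIFIER, {"executeCV"}],
    "printBestClassifier": [{"uploadTrainingData"}],
    "predict": [{"uploadTrainingData"}, {"uploadTestData"}, CLASSIFIER,
                {"execute"}],
    "saveCVResult": [{"uploadTrainingData"}, {"uploadTestData"}, CLASSIFIER,
                     {"executeCV"}],
    "savePredResult": [{"uploadTrainingData"}, {"uploadTestData"}, CLASSIFIER,
                       {"execute"}, {"predict"}],
}


def first_violation(calls):
    """Return (call, missing_alternatives) for the first call whose
    prerequisites have not run yet, or None if the sequence is valid."""
    seen = set()
    for call in calls:
        for alternatives in PREREQUISITES.get(call, []):
            if not (alternatives & seen):
                return call, sorted(alternatives)
        seen.add(call)
    return None


print(first_violation(["uploadTrainingData", "svc", "execute"]))  # None
print(first_violation(["uploadTrainingData", "execute"]))
# ('execute', ['dtc', 'gnbc', 'svc'])
```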

Software used for the translator

Ply

Test methodology & program used to test

Using Python 2.7, we tested every variable and every method exhaustively and compared the results. All the code used for testing can be found in the example folder of the project's GitHub repository.


Conclusion

Machine learning, data mining, and big data are important fields in programming, as they drive advances in medicine, cybersecurity, financial analysis, and other areas. It would be convenient to have a language that could do all of this without needing support from other languages or complicated file formats. This is why GRAC was created. With GRAC, the user can analyse data in a CSV file using machine learning or statistical methods.
GRAC is convenient because it does not require knowledge of many languages. The user only needs a basic knowledge of Python, one of the best-known languages, and the ability to follow the basic guidelines of GRAC to carry out the desired analysis.


Developers