An Investigation of Machine Learning Based Prediction Systems

(File Last Modified Wed, May 29, 2002.)

Review: An Investigation of Machine Learning Based Prediction Systems

Investigation task and field
Motivation
Introduction of the ML techniques
Approach for the investigation
Results
Conclusion
Comments

Investigation task and field

Machine learning methods have attracted a lot of attention for the software effort estimation task, which has traditionally been approached with either off-the-shelf models (e.g. COCOMO) or local models built with statistical techniques (e.g. stepwise regression).
This paper investigated three popular machine learning methods, namely artificial neural nets (ANNs), case-based reasoning (CBR) and rule induction (RI), and applied them to a dataset of software projects for effort estimation.
The authors compared the three prediction systems in terms of accuracy, explanatory value and configurability.

Motivation

Software effort estimates produced by algorithmic models, whether off-the-shelf models or local models built with statistical techniques, are frequently inaccurate. This is because algorithmic approaches are often unable to adequately model the complex set of relationships in software development environments.
ML techniques have been used successfully to solve many difficult problems and have been proposed as an alternative way of predicting software effort.
Three ML techniques (ANN, CBR and RI) were selected on the grounds that adequate software tool support exists for them and because of their contrasting vantage points.

Introduction of the ML techniques

Artificial Neural Networks (ANNs): Most studies concerned with the use of ANNs to predict software development effort have focused on comparative accuracy with algorithmic models rather than on the suitability of the approach for building software effort prediction systems.

Case-based reasoning (CBR): the learning process is as follows

        retrieval of similar cases
        reuse of the retrieved cases to find a solution to the problem
        revision of the proposed solution if necessary
        retention of the solution to form a new case

The ``retrieval of similar cases'' step is usually based upon analogical reasoning and rules are inferred from the estimator's own protocols.
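
As a rough illustration of the four-step cycle above, here is a minimal Python sketch. The Euclidean distance, the averaging of the two nearest cases, the `adjustment` parameter of `revise`, all function names and the project data are assumptions made up for this example; they are not taken from the paper or from ANGEL.

```python
import math


def retrieve(case_base, query, k=2):
    """Retrieval: return the k stored cases closest to the query.

    Euclidean distance is an assumption of this sketch.
    """
    def distance(case):
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(case["features"], query)))
    return sorted(case_base, key=distance)[:k]


def reuse(similar_cases):
    """Reuse: propose the mean effort of the retrieved cases."""
    return sum(c["effort"] for c in similar_cases) / len(similar_cases)


def revise(proposed, adjustment=0.0):
    """Revision: apply an optional correction (hypothetical expert input)."""
    return proposed + adjustment


def retain(case_base, query, effort):
    """Retention: store the solved problem as a new case."""
    case_base.append({"features": list(query), "effort": effort})


# Made-up cases: features are (team size, size in hundreds of function points).
case_base = [
    {"features": [2.0, 1.0], "effort": 1000.0},
    {"features": [3.0, 2.0], "effort": 2000.0},
    {"features": [8.0, 9.0], "effort": 9000.0},
]

query = [2.5, 1.5]
similar = retrieve(case_base, query, k=2)
estimate = revise(reuse(similar))
retain(case_base, query, estimate)
print(estimate)
```

With this toy data the two small projects are retrieved and their efforts are averaged; the solved query then joins the case base for future retrievals.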

Rule Induction (RI) is based on induction algorithms which, given a training set of examples, each described by the values of its attributes and an outcome, automatically build decision trees that correctly classify not only all the examples in the training set but also unknown examples from the wider universe of which the training set is presumed to be a representative sample.
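
To make the idea of inducing a tree from examples more concrete, here is a heavily simplified Python sketch. It induces a single split on one numeric attribute (a one-level tree) and predicts a numeric effort rather than a class. The squared-error criterion, the function names and the data are assumptions for illustration only, not the algorithm used by Clementine or described in the paper.

```python
def induce_rule(examples):
    """Find the threshold on one numeric attribute that minimises squared error.

    examples: list of (attribute_value, outcome) pairs.
    Returns (threshold, mean_outcome_at_or_below, mean_outcome_above).
    """
    def mean(values):
        return sum(values) / len(values)

    def sse(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    xs = sorted({x for x, _ in examples})
    best = None
    # Candidate thresholds lie midway between consecutive attribute values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in examples if x <= threshold]
        right = [y for x, y in examples if x > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    return best[1:]


# Made-up training set: (project size, effort).
training = [(100, 1000.0), (120, 1200.0), (400, 5000.0), (450, 5400.0)]
threshold, low_effort, high_effort = induce_rule(training)


def predict(size):
    """Apply the induced rule: IF size <= threshold THEN low ELSE high."""
    return low_effort if size <= threshold else high_effort


print(threshold, predict(110), predict(300))
```

The induced rule can be read directly as an IF-THEN statement, which is one reason rule induction tends to score well on explanatory value.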

Approach for the investigation

An existing project effort dataset (77 software projects with 10 attributes, no missing data) was selected and applied to each ML system in turn for prediction. A least squares regression (LSR) procedure was used to provide a benchmark comparison. The ML tools were selected as follows:

ANN: multi-layer backpropagation net, Neuframe.
CBR: ANGEL
RI: Clementine
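
For reference, the kind of least squares benchmark mentioned above can be sketched in a few lines of Python. This is a single-predictor fit on made-up numbers; the function name, the choice of project size as the only predictor and the data are assumptions for illustration, not the regression model actually built in the paper.

```python
def fit_least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: project size versus effort.
sizes = [100.0, 200.0, 300.0, 400.0]
efforts = [1100.0, 2100.0, 3100.0, 4100.0]

slope, intercept = fit_least_squares(sizes, efforts)
predicted = slope * 250.0 + intercept
print(slope, intercept, predicted)
```

Each ML system's predictions would then be compared against the predictions of a model fitted in this way.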

Results

All approaches are sensitive to outliers. ANN achieved the best accuracy, while RI was consistently the least accurate.
RI does not deal effectively with the categorical attribute (usually unique for each data entry).
Pruning is potentially an important aspect of configuring a prediction system.

Conclusion

Accuracy: ANN > CBR >> RI
Explanatory Value: RI > CBR >> ANN
Configurability: CBR > RI >> ANN
Explanatory value and configurability are expected to have an equal, if not greater, impact than accuracy on the adoption of these techniques.

Comments

This paper introduced and compared three popular ML techniques and their application to software effort estimation. The comparison was done by applying one dataset to each technique in turn, and the conclusion is drawn from the resulting accuracy.

The authors emphasize that the practical value of adopting one technique over the others is not based solely on accuracy; rather, explanatory value and configurability are equally important. The three techniques chosen have contrasting vantage points, which makes them equally promising in application. I think this is a good point.

However, this investigation does not seem thorough enough, and hence it does not convey much insight. For example:

The conclusion is drawn from only one case study, and the dataset used is relatively small.
Only one software tool was chosen for each technique, which makes me wonder whether the particular tool is representative of the technique or suitable for this particular case.
The pros and cons of each technique analyzed in this paper are well known, even to a beginner like me. I doubt that experienced researchers will find anything new in this paper. (Just my thought ^_^)

Build 11. Apr 12, 2003
