An Investigation of Machine Learning Based Prediction Systems
-
Machine learning methods have attracted considerable attention for the software effort estimation task, which has traditionally been approached with either off-the-shelf models (e.g. COCOMO) or local models built with statistical techniques (e.g. stepwise regression).
-
This paper investigated three popular machine learning methods, namely artificial neural networks (ANNs), case-based reasoning (CBR) and rule induction (RI), and applied each to a dataset of software projects for effort estimation.
-
They compared the three prediction systems in terms of accuracy, explanatory value and configurability.
-
Effort estimates produced by algorithmic models, whether off-the-shelf models or local models built with statistical techniques, are frequently inaccurate. This is because algorithmic approaches are often unable to adequately model the complex set of relationships found in software development environments.
-
ML techniques have been used successfully to solve many difficult problems and have been proposed as an alternative way of predicting software effort.
-
The three ML techniques (ANNs, CBR and RI) were selected on the grounds that adequate software tool support exists for each and because of their contrasting vantage points.
-
Artificial neural networks (ANNs): most studies concerned with the use of ANNs to predict software development effort have focused on comparative accuracy with algorithmic models rather than on the suitability of the approach for building software effort prediction systems.
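To make the approach concrete, here is a minimal sketch of a multi-layer backpropagation regressor for effort prediction, using scikit-learn's MLPRegressor as a stand-in for the commercial tool mentioned later in the paper; the project attributes, values and network size are made-up assumptions, not the paper's data or configuration.

    # Minimal backpropagation effort estimator (illustrative data only).
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler

    # Hypothetical attributes per project: [size in KLOC, team experience in years]
    X = np.array([[10.0, 3.0], [25.0, 5.0], [4.0, 1.0], [60.0, 8.0]])
    y = np.array([120.0, 300.0, 40.0, 800.0])   # effort in person-hours

    scaler = StandardScaler().fit(X)             # ANNs work best on scaled inputs
    net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
    net.fit(scaler.transform(X), y)

    # Predict effort for a new, unseen project.
    print(net.predict(scaler.transform([[30.0, 4.0]])))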
-
Case-based reasoning (CBR): the learning process is as follows (the first two steps are sketched in code below):
retrieval of similar cases
reuse of the retrieved cases to find a solution to the problem
revision of the proposed solution if necessary
retention of the solution to form a new case
The "retrieval of similar cases" step is usually based upon analogical reasoning, with rules inferred from the estimator's own protocols.
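The sketch below illustrates the retrieval and reuse steps as a simple nearest-neighbour analogy; the case base, attributes and choice of k are illustrative assumptions, not the tool or data used in the paper.

    # CBR sketch: retrieve the k most similar past projects, reuse their effort.
    import numpy as np

    # Hypothetical case base: [size in KLOC, team experience]; known efforts.
    case_base = np.array([[10.0, 3.0], [25.0, 5.0], [4.0, 1.0], [60.0, 8.0]])
    efforts   = np.array([120.0, 300.0, 40.0, 800.0])

    def estimate(new_project, k=2):
        # Retrieval: Euclidean distance on range-normalised features.
        ranges = case_base.max(axis=0) - case_base.min(axis=0)
        dist = np.linalg.norm((case_base - new_project) / ranges, axis=1)
        nearest = np.argsort(dist)[:k]
        # Reuse: unweighted mean of the retrieved analogues' effort.
        # (Revision would adjust this estimate; retention would add the
        #  finished project back into the case base.)
        return efforts[nearest].mean()

    print(estimate(np.array([30.0, 4.0])))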
-
Rule induction (RI) is based on induction algorithms which, given a training set of examples, each described by attribute values and an outcome, automatically build decision trees that correctly classify not only all the examples in the training set but also unseen examples from the wider universe of which the training set is presumed to be a representative sample.
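For concreteness, here is a minimal sketch of that idea: grow a decision tree from (attributes, outcome) examples and print the induced rules. Scikit-learn's DecisionTreeRegressor stands in for the tool used in the paper, and the attributes and data are illustrative assumptions.

    # Rule induction sketch: build a tree and inspect the induced rules.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    X = np.array([[10.0, 3.0], [25.0, 5.0], [4.0, 1.0], [60.0, 8.0], [35.0, 2.0]])
    y = np.array([120.0, 300.0, 40.0, 800.0, 500.0])   # effort

    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["size_kloc", "experience_yrs"]))
-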
An existing project effort dataset (77 software projects with 10 attributes, no missing data) was selected and applied to each ML system in turn for prediction. A least squares regression (LSR) procedure is used to provide a benchmark comparison (a sketch of such a benchmark, with one common accuracy measure, follows the tool list below). The three ML tools selected are as follows:
-
ANN: multi-layer backpropagation net, Neuframe.
-
CBR: ANGEL
-
RI: Clementine
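As referenced above, here is a minimal sketch of a least squares regression benchmark and one way of scoring it. MMRE (mean magnitude of relative error) is a common accuracy measure in the effort-estimation literature, though these notes do not record which measure the paper actually used; the data below are illustrative only.

    # Least squares regression benchmark plus a common accuracy measure (MMRE).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X_train = np.array([[10.0], [25.0], [4.0], [60.0]])   # e.g. project size
    y_train = np.array([120.0, 300.0, 40.0, 800.0])
    X_test  = np.array([[35.0], [8.0]])
    y_test  = np.array([460.0, 95.0])

    lsr = LinearRegression().fit(X_train, y_train)
    pred = lsr.predict(X_test)

    # MMRE: mean of |actual - predicted| / actual over the test projects.
    mmre = np.mean(np.abs(y_test - pred) / y_test)
    print(f"MMRE = {mmre:.2f}")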
-
All approaches are sensitive to outliers; the ANN achieved the best accuracy, while RI was consistently the least accurate.
-
RI does not deal effectively with the categorical attribute (whose value is usually unique for each data entry).
-
Pruning is potentially an important aspect of configuring a prediction system.
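To illustrate the effect of pruning, here is a minimal sketch using cost-complexity pruning of a scikit-learn regression tree; this is a generic mechanism, not necessarily the one offered by the tool used in the paper, and the data and alpha value are made-up assumptions.

    # Pruning sketch: a larger ccp_alpha removes more branches, trading
    # training-set fit for simpler, more general rules. Illustrative data only.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    X = np.array([[10.0], [25.0], [4.0], [60.0], [35.0], [8.0]])
    y = np.array([120.0, 300.0, 40.0, 800.0, 500.0, 95.0])

    full   = DecisionTreeRegressor(random_state=0).fit(X, y)
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=100.0).fit(X, y)

    print("unpruned leaves:", full.get_n_leaves())
    print("pruned leaves:  ", pruned.get_n_leaves())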
-
Accuracy: ANN > CBR >> RI
-
Explanatory Value: RI > CBR >> ANN
-
Configurability: CBR > RI >> ANN
-
Factors other than accuracy, namely explanatory value and configurability, may have an equal, if not greater, impact upon the adoption of these techniques.
This paper introduced and compared three popular ML techniques as applied to software effort estimation. The comparison applies a single dataset to each technique in turn, and conclusions are drawn from the observed accuracy.
The authors emphasize that the practical case for adopting one technique over another does not rest solely on accuracy; explanatory value and configurability are equally important. The three techniques chosen have contrasting vantage points, which makes each of them promising in application. I think this is a good point.
However, the investigation does not seem thorough enough, and hence it does not convey much insight. For example:
-
The conclusions are drawn from only one case study, and the dataset used is relatively small.
-
Only one software tool is chosen for each technique, which makes me wonder whether that particular tool is representative of the technique, or whether it is simply unsuited to this particular case.
-
The pros and cons of each technique analyzed in this paper are well known, even to a beginner like me. I doubt that experienced researchers will find anything new here. (Just my thought ^_^)