A Survey of Data Mining and Knowledge Discovery Software Tools

(File Last Modified Wed, May 29, 2002.)


Review: A Survey of Data Mining and Knowledge Discovery Software Tools

Goal

  • Provide an overview of existing knowledge discovery and data mining techniques.
  • Provide a feature classification scheme that identifies the important features for studying such tools.
  • Investigate existing knowledge discovery and data mining software tools using the above scheme.
  • Identify the features that discovery software should possess for further reference.

KDD and data mining

Concepts:

  • KDD: Knowledge Discovery in Databases. It is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
  • Data mining: the extraction of patterns or models from observed data.

Relation:

  • KDD refers to the overall process, while data mining is one step at the core of the process, dealing with the extraction of patterns and relationships from large amounts of data.
  • Data mining usually takes only a small part (15%-20%) of the overall effort.

Other steps:

  • Developing an understanding of the application domain and the goals of the data mining process.
  • Acquiring or selecting a target data set.
  • Integrating and checking the data set.
  • Data cleaning, preprocessing and transformation.
  • Model development and hypothesis building.
  • Choosing suitable data mining algorithms.
  • Result interpretation and visualization.
  • Result testing and verification.
  • Using and maintaining the discovered knowledge.
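To make the relation between the data mining step and the surrounding KDD steps concrete, here is a minimal sketch of a toy pipeline. It is not taken from the paper: the function names, the sample records, and the choice of counting co-occurring attribute-value pairs as the "mining" step are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def select_target(records, attributes):
    """Acquire/select a target data set: keep only the chosen attributes."""
    return [{a: r.get(a) for a in attributes} for r in records]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def mine(records):
    """Data mining step (toy version): count co-occurring attribute-value pairs."""
    counts = Counter()
    for r in records:
        items = sorted(r.items())
        counts.update(combinations(items, 2))
    return counts


def interpret(counts, min_count):
    """Result interpretation: keep only pairs seen at least min_count times."""
    return {pair: n for pair, n in counts.items() if n >= min_count}


def kdd(records, attributes, min_count=2):
    """Chain the steps; mining is only one call among several."""
    return interpret(mine(clean(select_target(records, attributes))), min_count)


# Hypothetical sample data, invented for this sketch.
raw = [
    {"outlook": "sunny", "windy": "no", "play": "yes", "id": 1},
    {"outlook": "sunny", "windy": "no", "play": "yes", "id": 2},
    {"outlook": "rainy", "windy": None, "play": "no", "id": 3},
    {"outlook": "rainy", "windy": "yes", "play": "no", "id": 4},
]
patterns = kdd(raw, ["outlook", "windy", "play"])
for pair, n in sorted(patterns.items()):
    print(pair, n)
```

Even in this toy version, most of the code deals with selection, cleaning, and interpretation rather than with the mining step itself.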

Data mining tasks

  • Data Processing
  • Prediction
  • Regression
  • Classification
  • Clustering
  • Link Analysis (Associations)
  • Model Visualization
  • Exploratory Data Analysis (EDA)

Data mining methodology (approaches to solve the above tasks)

  • Statistical Methods
  • Case-Based Reasoning (CBR)
  • Neural Networks (NN)
  • Decision Trees
  • Rule Induction
  • Bayesian Belief Networks
  • Genetic Algorithms / Evolutionary Programming
  • Fuzzy Sets
  • Rough Sets

This paper proposes a scheme for studying knowledge discovery and data mining tools and applies it to review existing tools. The scheme classifies a tool's features into three groups. Here I use it to describe the TAR2 treatment learner:

General Characteristics

  • Production Status: Research Prototype
  • Legal Status: Freeware
  • Demo: Demo version available for download on the internet
  • Architecture: Standalone
  • Operating Systems: DOS

Database Connectivity

  • Data sources: ASCII text files
  • DB connection: Offline
  • Size: Medium (10k to 1000k records)
  • Data Model: Relational
  • Attributes: Continuous, Categorical (discrete numerical values) and Symbolic
  • Queries: Not applicable

Data Mining Characteristics

  • Discovery Tasks: Link Analysis (Associations)
  • Discovery Methodology: Rule Induction
  • Human Interaction: Human guided discovery process.
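The three feature groups above map naturally onto a small record type. The following is a sketch only: the class and field names are my own, not the paper's, and the values are copied from the TAR2 entries listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneralCharacteristics:
    production_status: str
    legal_status: str
    demo: str
    architecture: str
    operating_systems: List[str]


@dataclass
class DatabaseConnectivity:
    data_sources: List[str]
    db_connection: str
    size: str
    data_model: str
    attributes: List[str]
    queries: str


@dataclass
class DataMiningCharacteristics:
    discovery_tasks: List[str]
    discovery_methodology: List[str]
    human_interaction: str


@dataclass
class ToolProfile:
    """One reviewed tool, described by the three feature groups."""
    name: str
    general: GeneralCharacteristics
    connectivity: DatabaseConnectivity
    mining: DataMiningCharacteristics


tar2 = ToolProfile(
    name="TAR2",
    general=GeneralCharacteristics(
        production_status="Research Prototype",
        legal_status="Freeware",
        demo="Demo version available for download on the internet",
        architecture="Standalone",
        operating_systems=["DOS"],
    ),
    connectivity=DatabaseConnectivity(
        data_sources=["ASCII text files"],
        db_connection="Offline",
        size="Medium (10k to 1000k records)",
        data_model="Relational",
        attributes=["Continuous", "Categorical", "Symbolic"],
        queries="Not applicable",
    ),
    mining=DataMiningCharacteristics(
        discovery_tasks=["Link Analysis (Associations)"],
        discovery_methodology=["Rule Induction"],
        human_interaction="Human guided discovery process",
    ),
)


def supports_task(profile: ToolProfile, task: str) -> bool:
    """Hypothetical helper: does the tool list the given discovery task?"""
    return task in profile.mining.discovery_tasks


print(supports_task(tar2, "Link Analysis (Associations)"))
print(supports_task(tar2, "Clustering"))
```

Storing each tool as one such record would make it easy to filter a collection of reviewed tools by task, methodology, or supported attribute types.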

Investigation summary

  • The majority of currently available tools still support only a small number of data formats.
  • Almost all of the reviewed products can analyze continuous as well as discrete and symbolic attribute types.
  • Most of the tools employ "standard" data mining techniques like rule induction, decision trees, and statistical methods.

Future research direction of KDD

  • Integration of different techniques.
  • Extensibility
  • Seamless integration with databases
  • Support for both analysis experts and novice users
  • Managing changing data
  • Non-standard data types

Build 11. Apr 12, 2003




