For my thesis research (thesis outline), I have
designed and delivered two software packages: TAR2 and TAR3. Both are
data mining tools of a specific kind called treatment learners. Treatment learners, like other
machine learners, are rule discovery paradigms. However,
classical machine learners like C4.5 aim at discovering
classification rules: i.e., given a classified training set,
they output rules that are predictive of the class
attribute. Treatment learners differ from those learners in that:
- A treatment learner assumes the classes can be assessed by their scores (some
domain-specific measure).
- Highly scored classes are preferable to
lower scored classes.
- Further, one class, called the best class, is more desirable than all
others.
- A treatment learner finds rules that predict both increased
frequency of the best class and decreased frequency of the worst
class.
That is, a treatment learner finds discriminating rules that drive the system from the worst class toward the best class.
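The idea of scored classes can be illustrated with a small sketch. It is not taken from the TAR2 or TAR3 sources: the class names, the score values, and the helper names `class_frequencies` and `weighted_score` are all assumptions made for illustration. It only shows how a subset of examples with more of the best class and less of the worst class ends up with a higher score-weighted mean than the baseline.

```python
from collections import Counter

# Hypothetical class scores: a higher score means a more desirable class.
# The names and values are illustrative only, not from TAR2/TAR3.
CLASS_SCORES = {"low": 1, "medium": 2, "high": 4}


def class_frequencies(labels):
    """Return the relative frequency of each scored class in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts.get(cls, 0) / total for cls in CLASS_SCORES}


def weighted_score(labels):
    """Return the score-weighted mean of the class distribution."""
    freqs = class_frequencies(labels)
    return sum(CLASS_SCORES[cls] * freq for cls, freq in freqs.items())


# The best and worst classes follow directly from the scores.
best = max(CLASS_SCORES, key=CLASS_SCORES.get)
worst = min(CLASS_SCORES, key=CLASS_SCORES.get)

# Baseline class labels versus a subset selected by some candidate rule.
baseline = ["low"] * 5 + ["medium"] * 3 + ["high"] * 2
subset = ["low"] * 1 + ["medium"] * 1 + ["high"] * 2

if __name__ == "__main__":
    print("best:", best, "worst:", worst)
    print("baseline score:", weighted_score(baseline))
    print("subset score:", weighted_score(subset))
```

In this toy data the subset raises the share of the best class and lowers the share of the worst class, which is the kind of shift a treatment learner looks for.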
A treatment learner takes in classified data sets and outputs treatments. A
treatment is a single attribute-value pair or a conjunction of such pairs. It is a
constraint on future controllable inputs of the system. In
summary, treatment learners give us controllers rather than
classifiers. To understand the distinction, consider the case
of someone reading a map. Classifiers say "you are here" on
the map while controllers say "go this way". You can
find a detailed illustration of how TAR2 works in intro.pdf.
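As a minimal sketch of what a treatment looks like in practice, the example below represents a treatment as a set of attribute-value pairs and uses it to constrain a small data set. The attribute names, rows, and the functions `matches`, `apply_treatment`, and `class_fraction` are hypothetical and are not part of TAR2 or TAR3; the sketch only illustrates the notion of a conjunction acting as a constraint on inputs.

```python
# Hypothetical rows: attribute-value pairs plus a class label.
ROWS = [
    {"heating": "gas", "insulation": "good", "class": "high"},
    {"heating": "gas", "insulation": "poor", "class": "low"},
    {"heating": "oil", "insulation": "good", "class": "high"},
    {"heating": "oil", "insulation": "poor", "class": "low"},
    {"heating": "gas", "insulation": "good", "class": "medium"},
]


def matches(row, treatment):
    """True when the row satisfies every attribute-value pair in the treatment."""
    return all(row.get(attr) == value for attr, value in treatment.items())


def apply_treatment(rows, treatment):
    """Keep only the rows consistent with the treatment (a conjunction)."""
    return [row for row in rows if matches(row, treatment)]


def class_fraction(rows, cls):
    """Fraction of rows labelled with the given class (0.0 for no rows)."""
    if not rows:
        return 0.0
    return sum(row["class"] == cls for row in rows) / len(rows)


# A treatment with a single attribute-value pair.
treatment = {"insulation": "good"}
treated = apply_treatment(ROWS, treatment)

# A treatment that is a conjunction of two attribute-value pairs.
conjunction = {"heating": "gas", "insulation": "good"}
treated_conjunction = apply_treatment(ROWS, conjunction)

if __name__ == "__main__":
    print("high before:", class_fraction(ROWS, "high"))
    print("high after: ", class_fraction(treated, "high"))
    print("low after:  ", class_fraction(treated, "low"))
```

Read as a controller, the treatment `insulation = good` says which input to fix in the future; in this toy data, doing so removes the "low" class entirely and raises the share of "high".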
TAR2 and TAR3 are based on a Prolog prototype, "TARZAN", a post-processor to a decision tree
learner. A description of TARZAN can be found in "Practical Large Scale What-if Queries:
Case Studies with Software Risk Assessment".
TAR2 and TAR3 are written in C. They are data miners that no
longer need the decision-tree pre-processor. Both learners combine
search with a self-defined heuristic evaluation of
attribute utility. While the runtime of TAR2's breadth-first search can grow
exponentially, TAR3 fixes the problem by employing a series of
strategies, including random sampling. On datasets where TAR2 is
exponential, TAR3 runs in linear time.

Download the file tar3.zip shown at the bottom of this page. Depending on the
experiment, TAR2 can also be downloaded for baseline
comparison.
Simply unzip tar3.zip and you get the following:
- Source code of TAR3 and an N-way cross validation facility.
- DOS executables to run TAR2 and the N-way cross validation
experiment (UNIX executables are not provided but can easily be generated by
compiling the source code files directly).
- Sample datasets and corresponding configuration/output files.
- Documents including the user manual and several associated research
papers.
The directory structure of the unzipped TAR3 system is as follows:
- README
- COPYRITE: includes the GPL-2 copy policy
- .\doc: user instructions and PDFs
- .\src: source files for TAR3 and N-way cross validation
- .\bin: all executables
- .\samples: sample data sets and output files
See also \doc in the downloaded zip files.

With Windows 98, TAR2 easily handles 350,000 examples (13
attributes) in 64 MB of memory, but needs more (196 MB is suggested) to
handle more than, say, 550,000 examples (in 80 sec).

- TAR2.2:
start with .\dispatchTAR2\doc\TAR2intro.pdf.
- TAR3:
start with .\tar3\doc\TAR3manual.pdf, or have a look at the manual
page.