Datasets/hayes-roth.names

1. Title: Hayes-Roth & Hayes-Roth (1977) Database

2. Source Information:
   (a) Creators: Barbara and Frederick Hayes-Roth
   (b) Donor: David W. Aha (aha@ics.uci.edu) (714) 856-8779   
   (c) Date: March, 1989

3. Past Usage:
    1. Hayes-Roth, B., & Hayes-Roth, F. (1977).  Concept learning and the
       recognition and classification of exemplars.  Journal of Verbal Learning
       and Verbal Behavior, 16, 321-338.
       -- Results: 
          -- Human subjects' classification and recognition performance:
	       1. decreases with distance from the prototype,
	       2. is better on unseen prototypes than old instances, and
	       3. improves with presentation frequency during learning.
    2. Anderson, J.R., & Kline, P.J. (1979).  A learning system and its 
       psychological implications.  In Proceedings of the Sixth International
       Joint Conference on Artificial Intelligence (pp. 16-21).  Tokyo, Japan:
       Morgan Kaufmann.
       -- Partitioned the results into 4 classes:
	    1. prototypes
	    2. near-prototypes with high presentation frequency during learning
	    3. near-prototypes with low presentation frequency during learning
	    4. instances that are far from prototypes
       -- Described evidence that ACT's classification confidence and
          recognition behaviors closely simulated human subjects' behaviors.
    3. Aha, D.W. (1989).  Incremental learning of independent, overlapping, and
       graded concept descriptions with an instance-based process framework.
       Manuscript submitted for publication.
       -- Used same partition as Anderson & Kline
       -- Described evidence that Bloom's classification confidence behavior
	  is similar to the human subjects' behavior.  Bloom fitted the data
	  more closely than did ACT. 

4. Relevant Information:
     This database contains 5 numeric-valued attributes.  Only a subset of
     3 are used during testing (the latter 3).  Furthermore, only 2 of the
     3 concepts are "used" during testing (i.e., those with the prototypes
     000 and 111).  I've mapped all values to their zero-indexed equivalents.

     Some instances could be placed in either category 0 or 1.  I've followed
     the authors' suggestion, placing them in each category with equal
     probability.

     I've replaced the actual values of the attributes (e.g., hobby has values
     chess, sports and stamps) with numeric values.  I think this is how
     the authors did this when testing the categorization models described
     in the paper.  I find this unfair.  While the subjects were able to bring
     background knowledge to bear on the attribute values and their
     relationships, the algorithms were provided with no such knowledge.  I'm
     uncertain whether the 2 distractor attributes (name and hobby) are
     presented to the authors' algorithms during testing.  However, it is clear
     that only the age, educational status, and marital status attributes are
     given during the human subjects' transfer tests.  
    
5. Number of Instances: 132 training instances, 28 test instances

6. Number of Attributes: 5 plus the class membership attribute.  3 concepts.

7. Attribute Information:
      -- 1. name: distinct for each instance and represented numerically
      -- 2. hobby: nominal values ranging between 1 and 3
      -- 3. age: nominal values ranging between 1 and 4
      -- 4. educational level: nominal values ranging between 1 and 4
      -- 5. marital status: nominal values ranging between 1 and 4
      -- 6. class: nominal value between 1 and 3
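
The attribute layout above can be read into named fields as in the following
minimal sketch.  It assumes each instance is stored as one line of six
comma-separated integers in the order listed; the on-disk format is not stated
in this description, so the delimiter, the Instance type, the parse_line
helper, and the sample line are all hypothetical.

```python
from typing import NamedTuple


class Instance(NamedTuple):
    """One record, with fields in the order given under Attribute Information."""
    name: int
    hobby: int
    age: int
    education: int
    marital_status: int
    label: int


def parse_line(line: str) -> Instance:
    # Assumption: six comma-separated integers per line.
    fields = [int(x) for x in line.strip().split(",")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return Instance(*fields)


# Made-up sample line, for illustration only (not taken from the data file).
sample = parse_line("17,3,1,2,1,1")
```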

8. Missing Attribute Values: none

9. Class Distribution: see below

10. Detailed description of the experiment:
  1. 3 categories (1, 2, and neither -- which I call 3)
     -- some of the instances could be classified in either class 1 or 2, and
        they have been evenly distributed between the two classes
  2. 5 Attributes
     -- A. name (a randomly-generated number between 1 and 132)
     -- B. hobby (a randomly-generated number between 1 and 3)
     -- C. age (a number between 1 and 4)
     -- D. education level (a number between 1 and 4)
     -- E. marital status (a number between 1 and 4)
  3. Classification: 
     -- only attributes C-E are diagnostic; values for A and B are ignored
     -- Class Neither: if a 4 occurs for any attribute C-E
     -- Class 1: Otherwise, if (# of 1's)>(# of 2's) for attributes C-E
     -- Class 2: Otherwise, if (# of 2's)>(# of 1's) for attributes C-E
     -- Either 1 or 2: Otherwise, if (# of 2's)=(# of 1's) for attributes C-E
  4. Prototypes:
     -- Class 1: 111
     -- Class 2: 222
     -- Class Either: 333
     -- Class Neither: 444  
  5. Number of training instances: 132
     -- Each instance presented 0, 1, or 10 times
     -- None of the prototypes seen during training
     -- 3 instances from each of categories 1, 2, and either are repeated 
        10 times each
     -- 3 additional instances from the Either category are shown during
        learning
  6. Number of test instances: 28
     -- All 9 class 1
     -- All 9 class 2
     -- All 6 class Either
     -- All 4 prototypes
     --------------------
     --    28 total
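
The classification rule and the test-set counts above can be checked with a
short sketch.  The function name classify and the string labels are
illustrative choices, not part of the original experiment.  Enumerating every
combination of the values 1-3 over attributes C-E and setting the prototypes
aside gives 9, 9, and 6 instances for class 1, class 2, and Either; with the
4 prototypes that makes the 28 test instances.

```python
from collections import Counter
from itertools import product


def classify(age: int, education: int, marital: int) -> str:
    """Apply the rule for attributes C-E; attributes A and B are ignored."""
    values = (age, education, marital)
    if 4 in values:
        return "neither"
    ones = values.count(1)
    twos = values.count(2)
    if ones > twos:
        return "1"
    if twos > ones:
        return "2"
    return "either"


PROTOTYPES = {(1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4)}

# All value combinations without a 4, excluding the prototypes.
counts = Counter(
    classify(*v) for v in product((1, 2, 3), repeat=3) if v not in PROTOTYPES
)
total = sum(counts.values()) + len(PROTOTYPES)
```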

Observations of interest:
  1. Relative classification confidence of 
     -- prototypes for classes 1 and 2 (2 instances)
        (Anderson calls these Class 1 instances)
     -- instances of class 1 with frequency 10 during training and
        instances of class 2 with frequency 10 during training that
        are 1 value away from their respective prototypes (6 instances)
        (Anderson calls these Class 2 instances)
     -- instances of class 1 with frequency 1 during training and 
        instances of class 2 with frequency 1 during training that
        are 1 value away from their respective prototypes (6 instances)
        (Anderson calls these Class 3 instances)
     -- instances of class 1 with frequency 1 during training and 
        instances of class 2 with frequency 1 during training that
        are 2 values away from their respective prototypes (6 instances)
        (Anderson calls these Class 4 instances)
  2. Relative recognition performance on the same groups of instances
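
The four groups listed under observation 1 can be computed as in the sketch
below.  It assumes "values away from the prototype" means the number of
attributes C-E that differ from the class prototype (a Hamming distance); the
function names are hypothetical.  Instances that fit none of the four groups
return None.

```python
from typing import Optional, Tuple

PROTOTYPE = {1: (1, 1, 1), 2: (2, 2, 2)}


def distance_from_prototype(values: Tuple[int, int, int], label: int) -> int:
    """Count the attributes C-E that differ from the class prototype."""
    return sum(v != p for v, p in zip(values, PROTOTYPE[label]))


def anderson_group(values: Tuple[int, int, int], label: int,
                   frequency: int) -> Optional[int]:
    """Map a class 1 or class 2 instance to one of the four groups above."""
    d = distance_from_prototype(values, label)
    if d == 0:
        return 1  # prototypes (never shown during training)
    if d == 1 and frequency == 10:
        return 2  # near-prototype, high presentation frequency
    if d == 1 and frequency == 1:
        return 3  # near-prototype, low presentation frequency
    if d == 2 and frequency == 1:
        return 4  # two values away from the prototype
    return None
```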

Some Expected results:
   Both frequency and distance from prototype will affect the classification
   accuracy of instances.  The greater the frequency, the higher the
   classification confidence.  The closer to the prototype, the higher the
   classification confidence.