2000 DataMiningPracticalMLToolsWithJava

* ([[2000_DataMiningPracticalMLToolsWithJava|Witten & Frank, 2000]]) ⇒ [[author::Ian H. Witten]], and [[author::Eibe Frank]]. ([[year::2000]]). “[https://books.google.ca/books?id=VzLZvhkIg9IC Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations].” [[Publisher::Morgan Kaufmann, 2000]]. ISBN:1558605525, 9781558605527. ISSN 1046-1698.


<B>Subject Headings:</B> [[Data Mining Text Book]].


== Notes ==
* Book Summary: http://www09.sigmod.org/sigmod/record/issues/0203/bookreview2-geller.pdf
* [https://books.google.ca/books?id=VzLZvhkIg9IC Google Books ID: VzLZvhkIg9IC].
* Book Second Edition: ([[2005_DataMiningPracticalMLTools|Witten & Frank, 2005]]).
* Book Third Edition: ([[2011_DataMiningPracticalMLTools|Witten et al., 2011]]).
* Book Fourth Edition: ([[2016_DataMiningPracticalMLTools|Witten et al., 2016]]).
* Used as reference for: [[Associated]], [[Association Learning]], [[Attribute Value]], [[Attribute]], [[Boolean Attribute]], [[Categorical Attribute]], [[Class Value]], [[Classification Learning]], [[Classified]], [[Closed World Assumption]], [[Clustered]], [[Clustering]], [[Concept Description]], [[Concept]], [[Continuous Attribute]], [[Database Mining]], [[Denormalization]], [[Dichotomous Attribute]], [[Dichotomy]], [[Discrete Attribute]], [[Enumerated Attribute]], [[Example]], [[Feature]], [[File Mining]], [[Independent Instance]], [[Instance]], [[Integer-Valued Number]], [[Interval Quantity]], [[Learning Scheme]], [[Learning Style]], [[Machine Learning Scheme]], [[Machine Learning System]], [[Measurement Level]], [[Missing Value]], [[Nominal Quantity]], [[Numeric Attribute]], [[Numeric Prediction]], [[Numeric Quantity]], [[Numeric Value]], [[Posthoc Analysis]], [[Ratio Quantity]], [[Real-Valued Number]], [[Recursive Rule]], [[Supervised Classification Learning]], [[Supervised Learning]], [[Supervised]], [[Table Row]].


== Quotes ==


=== Book Overview ===
This book offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. Inside, you'll learn all you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining, including both tried-and-true techniques of the past and Java-based methods at the leading edge of contemporary research. If you're involved at any level in the work of extracting usable knowledge from large collections of data, this clearly written and effectively illustrated book will prove an invaluable resource. <P> Complementing the authors' instruction is a fully functional platform-independent Java software system for [[machine learning]], available for download. Apply it to the sample data sets provided to refine your data mining skills, apply it to your own data to discern meaningful patterns and generate valuable insights, adapt it for your specialized data mining applications, or use it to develop your own machine learning schemes.
* Helps you select appropriate approaches to particular problems and to compare and evaluate the results of different techniques.
* Covers performance improvement techniques, including input preprocessing and combining output from different methods.
* Comes with downloadable machine learning software: use it to master the techniques covered inside, apply it to your own projects, and/or customize it to meet special needs.


{{#ifanon:|
{{#ifanon:|


=== 2 Inputs: Concepts, instances, attributes ===


=== 2.1 What's a concept? ===
Four basically different styles of learning appear in data mining applications. In ''classification learning'', a learning scheme takes a set of classified examples from which it is expected to learn a way of classifying unseen examples. In ''association learning'', any association between features is sought, not just ones that predict a particular ''class'' value. In ''[[Clustering Task|clustering]]'', groups of [[Learning Example|examples]] that [[belong together]] are sought. In ''numeric prediction'', the outcome to be predicted is not a discrete class but a numeric quantity. Regardless of the type of learning involved, we call the thing to be learned the ''concept'', and the output produced by a learning scheme the ''concept description''.
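The classification-learning style described above can be sketched with a deliberately trivial scheme (this example is mine, not the book's): a majority-class predictor that "learns" from classified examples and labels every unseen example with the most frequent class. Weka's ZeroR baseline behaves this way; the class and method names here are my own.

```java
import java.util.*;

// Minimal sketch of classification learning: the scheme is handed a set of
// classified examples and must produce a way of classifying unseen ones.
// This learner simply memorizes the majority class of the training data.
public class MajorityClassLearner {
    private String predictedClass;

    // "Training": count how often each class label occurs.
    public void learn(List<String> classLabels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : classLabels) {
            counts.merge(label, 1, Integer::sum);
        }
        predictedClass = Collections.max(
            counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Classify an unseen example: always predict the majority class.
    public String classify() {
        return predictedClass;
    }

    public static void main(String[] args) {
        // Class column ("play") of the book's 14-instance weather data:
        // 9 yes, 5 no.
        MajorityClassLearner learner = new MajorityClassLearner();
        learner.learn(Arrays.asList(
            "no", "no", "yes", "yes", "yes", "no", "yes",
            "no", "yes", "yes", "yes", "yes", "yes", "no"));
        System.out.println(learner.classify());  // prints "yes"
    }
}
```

Any real scheme replaces the body of <code>learn</code> with something that actually inspects the attribute values; the input/output contract stays the same.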


...


Classification learning is sometimes called ''supervised'' because, in a sense, the scheme operates under supervision by being provided with the actual outcome for each of the [[training examples]] ...
... there are far more association rules than classification rules, and the challenge is to avoid being swamped with them. ...


When there is no specified class, clustering is used to group items that seem to fall naturally together.
Numeric prediction is a variant of classification learning where the outcome is a numeric value rather than a category. The CPU performance problem is one example.


=== 2.2 What's in an example? ===
The input to a machine learning scheme is a set of instances. These instances are the things that are to be classified, or associated, or clustered. Although until now we have called them ''examples'', henceforth we will use the more specific term ''instances'' to refer to the input. Each instance is an individual, independent example of the concept to be learned. And each one is characterized by the values of a set of predetermined attributes. ...
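The "fixed vector of attribute values" idea can be made concrete (my own illustration; the attribute names come from the book's weather example, the record type does not appear in the book):

```java
// Sketch: an instance is a fixed vector of attribute values, one slot per
// predetermined attribute, optionally plus a class value.
public class InstanceDemo {
    // Four attribute values and the class value "play".
    record WeatherInstance(String outlook, double temperature,
                           double humidity, boolean windy, String play) {}

    public static void main(String[] args) {
        // The input to a learning scheme: a set of independent instances.
        WeatherInstance[] instances = {
            new WeatherInstance("sunny", 85, 85, false, "no"),
            new WeatherInstance("overcast", 83, 86, false, "yes"),
            new WeatherInstance("rainy", 70, 96, false, "yes"),
        };
        System.out.println(instances.length);        // 3
        System.out.println(instances[0].outlook());  // sunny
    }
}
```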


... Problems often involve relationships between objects, rather than separate, independent instances.


... The idea of specifying only [[positive example]]s and adopting a standing assumption that the rest are negative is called the ''closed world assumption''. It is frequently assumed in theoretical studies; however, it is not of much practical use in real-life problems because they rarely involve "closed" worlds where you can be certain that all cases are covered.


... it has been suggested, disparagingly, that we should really talk of ''file mining'' rather than ''database mining''. Relational data is more complex than a flat file. A finite set of finite relations can always be recast into a single table, although often at enormous cost in space. Moreover, denormalization can generate spurious regularities in the data, and it is essential to check the data for such artifacts before applying a learning scheme. Finally, potentially infinite concepts can be dealt with by learning rules that are recursive, though that is beyond the scope of this book.
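The recasting of relations into one table can be sketched as follows (toy data of my own): every row of the flattened table repeats its customer's attributes, which is exactly how denormalization lets spurious regularities, such as a city value that always co-occurs with one customer, creep into the data.

```java
import java.util.*;

// Sketch: denormalizing two finite relations into a single flat table.
public class DenormalizeDemo {
    // One flat row per order, with the customer's attributes duplicated.
    static List<String> denormalize(Map<String, String> customerCity,
                                    String[][] orders) {
        List<String> flat = new ArrayList<>();
        for (String[] order : orders) {
            flat.add(order[0] + "," + customerCity.get(order[0]) + "," + order[1]);
        }
        return flat;
    }

    public static void main(String[] args) {
        // Relation 1: customer -> city.
        Map<String, String> customers = Map.of("ann", "hamilton", "bob", "waikato");
        // Relation 2: (customer, item) order pairs.
        String[][] orders = { {"ann", "milk"}, {"ann", "bread"}, {"bob", "milk"} };
        denormalize(customers, orders).forEach(System.out::println);
    }
}
```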


=== 2.3 What's in an attribute? ===
Each individual, independent instance that provides the input to machine learning is characterized by its value on a fixed, predefined set of features or ''attributes''. The instances are the rows of the tables that we have shown for the weather, contact lens, Iris, and CPU performance problems, and the attributes are the columns. ... The use of a fixed set of features imposes another restriction on the kinds of problems generally considered in practical data mining.


...


The value of an attribute for a particular instance is a measurement of the quantity that the attribute refers to. There is a broad distinction between quantities that are numeric and ones that are ''nominal''. Numeric attributes, sometimes called ''continuous'' attributes, measure numbers - either real- or integer-valued. Note that the term ''continuous'' is routinely abused in this context: integer-valued attributes are certainly not continuous in the mathematical sense. Nominal attributes take on values in a prespecified, finite set of possibilities and are sometimes called ''categorical''. But there are other possibilities. Statistics texts often introduce "levels of measurement" such as ''nominal'', ''ordinal'', ''interval'', and ''ratio''.


Nominal quantities have values that are distinct symbols. The values themselves serve only as labels or names - hence the term ''nominal'', which comes from the Latin word for ''name''. For example, ... <code>sunny, overcast, rainy</code>.


Ordinal quantities are ones that make it possible to rank-order the categories. However, although there is a notion of ordering, there is no notion of ''distance''. For example, ... <code>hot, mild, cool</code>. ...


Notice that the distinction between nominal and ordinal quantities is not always straightforward ... you might argue that ... <code>overcast</code> is somehow intermediate between <code>sunny</code> and <code>rainy</code> as weather goes from good to bad.


Interval quantities have values that are not only ordered but measured in fixed and equal units. A good example is temperature, expressed in degrees ...


Ratio quantities are ones for which the measurement scheme inherently defines a zero point. For example, when measuring the distance from one object to others, the distance between the object and itself forms a natural zero. Ratio quantities are treated as real numbers: all mathematical operations are allowed.
Ordinal attributes are generally called ''numeric'', or perhaps ''continuous'', but without the implication of mathematical continuity. A special case of the nominal scale is the ''dichotomy'', which has only two members - often designated ''true'' and ''false'', or ''yes'' and ''no'' in the weather data. Such attributes are sometimes called ''boolean''.
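The four levels of measurement above differ in which operations are meaningful, which a short sketch can make concrete (the terminology is the passage's; the enum, class, and variable names are my own):

```java
import java.util.*;

// The four "levels of measurement" and which operations each supports.
enum MeasurementLevel {
    NOMINAL,   // labels only: equality tests, e.g. outlook = sunny
    ORDINAL,   // ordered, but no notion of distance: hot > mild > cool
    INTERVAL,  // ordered in fixed, equal units; differences are meaningful
    RATIO      // inherent zero point: all arithmetic is meaningful
}

public class AttributeDemo {
    public static void main(String[] args) {
        // Nominal: values drawn from a prespecified, finite set of symbols;
        // only membership/equality is meaningful.
        List<String> outlookValues = List.of("sunny", "overcast", "rainy");
        System.out.println(outlookValues.contains("overcast")); // true

        // Ordinal: rank order is meaningful, distance is not, so we sort by
        // position in a declared ordering rather than by any numeric value.
        List<String> order = List.of("cool", "mild", "hot");
        List<String> temps = new ArrayList<>(List.of("mild", "cool", "hot"));
        temps.sort(Comparator.comparingInt(order::indexOf));
        System.out.println(temps); // [cool, mild, hot]

        // A dichotomy (boolean attribute): nominal with exactly two values.
        boolean windy = true;
        System.out.println(windy); // true
    }
}
```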


[[Machine Learning System|Machine learning systems]] can use a wide variety of other information about attributes. For instance, dimensional considerations could be used to restrict the search to expressions or comparisons that are dimensionally correct. Circular ordering could affect the kinds of tests that are considered. For example, in a temporal context, tests on a <code>day</code> attribute could involve <code>next day, previous day, next week, same day next week</code>. Partial orderings, that is, generalization/specialization relations, frequently occur in practical situations. Information of this kind is often referred to as ''metadata'', data about data. However, the kinds of practical schemes currently used for data mining are rarely capable of taking metadata into account, although it is likely that these capabilities will develop rapidly in the future.


==== Missing values ====


Most datasets encountered in practice ... contain missing values.


You have to think carefully about the significance of missing values. They may occur for a number of reasons, such as malfunctioning measurement equipment, changes in experimental design during data collection, and collation of several similar but not identical datasets.
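In practice a file format needs some explicit marker for a missing value. One common convention, used for instance by Weka's ARFF data files, writes missing values as "?" and maps them to a sentinel internally; for numeric attributes <code>Double.NaN</code> is a convenient sentinel. The class and method names below are my own sketch, not Weka's API:

```java
// Sketch: parsing a numeric field that may be missing, with "?" as the
// missing-value marker and Double.NaN as the internal sentinel.
public class MissingValues {
    static double parseNumeric(String field) {
        return field.equals("?") ? Double.NaN : Double.parseDouble(field);
    }

    // NaN never equals anything (including itself), so test via isNaN.
    static boolean isMissing(double value) {
        return Double.isNaN(value);
    }

    public static void main(String[] args) {
        System.out.println(isMissing(parseNumeric("?")));   // true
        System.out.println(isMissing(parseNumeric("72")));  // false
    }
}
```

Keeping missing values distinct from real values matters precisely because, as the passage notes, "missing" can carry its own significance and should not silently be treated as just another number.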


==== Inaccurate values ====
It is important to check data mining files carefully for rogue attributes and [[attribute value]]s. The data used for mining has almost certainly not been gathered expressly for that purpose.


}}
}}
| ?year
| format=bibtex
}}{{Publication|doi=|title=Data Mining: Practical Machine Learning Tools and Techniques with Java implementations|titleUrl=http://books.google.com/books/elsevier?id=6lVEKlrTq8EC}}

Latest revision as of 04:30, 8 May 2024