2002 UsingTheIntelMinerForData

From GM-RKB

Subject Headings: Data Mining Glossary, IBM Intelligent Miner System

Notes

Quotes

Glossary

A

  • adaptive connection. A numeric weight used to describe the strength of the connection between two processing units in a neural network. The connection is called adaptive because it is adjusted during training. Values typically range from zero to one, or -0.5 to +0.5.
  • AFS; Andrew File System. A distributed file system developed by IBM and Carnegie Mellon University.
  • aggregate. To summarize data in a field.
  • application program interface (API). A functional interface supplied by the operating system or a separately orderable licensed program that allows an application program written in a high-level language to use specific data or functions of the operating system or the licensed program.
  • architecture. The number of processing units in the input, output, and hidden layers of a neural network. The number of units in the input and output layers is calculated from the mining data and input parameters.
  • associations. The relationship of items in a transaction in such a way that items imply the presence of other items in the same transaction.
  • attribute. Characteristics or properties that can be controlled, usually to obtain a required appearance. For example, the color is an attribute of a line. In object-oriented programming, a data element defined within a class.

B

  • back propagation. A general-purpose neural network named for the method used to adjust its weights while learning data patterns. The Neural Classification mining function uses such a network.
  • boundary field. The upper limit of an interval as used for the Discretization using ranges processing function.
  • bucket. One of the bars in a bar chart representing the frequency distribution of a continuous field. A bucket shows how many values lie within a specific frequency range.

C

  • chi-square test. A test to check whether two variables are statistically dependent or not. Chi-square is calculated from the differences between the observed frequencies (actual values) and the expected frequencies (imaginary values): each difference is squared and divided by the expected frequency, and the results are summed. The expected frequencies represent the values that would be expected if the variables in question were statistically independent.
  • classification. The assignment of objects into groups or categories based on their characteristics.
  • cluster. A group of records with similar characteristics.
  • cluster prototype. The attribute values that are typical of all records in a given cluster. Used to compare the input records to determine if a record should be assigned to the cluster represented by these values.
  • clustering. A mining function that creates groups of data records within the input data on the basis of similar characteristics. Each group is called a cluster.
  • confidence factor. Indicates the strength or the reliability of the associations detected.
  • comma-separated variables format (CSV). A file format used by spreadsheet, database, and statistical applications.
  • CSV. See comma-separated variables format.
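
The chi-square test entry above can be illustrated with a short sketch. It assumes the standard Pearson form of the statistic (squared difference between observed and expected frequency, divided by the expected frequency, summed over all cells). The function name `chi_square` and the sample table are illustrative and are not part of the Intelligent Miner.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected frequency if the two variables were statistically independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    return statistic


# Hypothetical 2x2 table of observed frequencies.
table = [[20, 30], [30, 20]]
print(chi_square(table))  # prints 4.0
```

A value of zero means the observed frequencies match the expected ones exactly; larger values indicate a stronger departure from independence.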

D

  • Database 2 (DB2). An IBM relational database management system.
  • database table. A table residing in a database.
  • database view. An alternative representation of data from one or more database tables. A view can include all or some of the columns contained in the database table or tables on which it is defined.
  • data field. In a database table, the intersection from table description and table column where the corresponding data is entered.
  • data format. There are different kinds of data formats, for example, database tables, database views, pipes, or flat files.
  • data table. A data table, regardless of the data format it contains.
  • data type. There are different kinds of Intelligent Miner data types, for example, categorical, continuous, or discrete-numeric.
  • delimiter. A character used to indicate the beginning and end of a character string.
  • discrete. Pertaining to data that consists of distinct elements such as characters, or to physical quantities having a finite number of distinctly recognizable values.
  • discretization. The act of assigning continuous values to intervals.
  • distributed file system. A file system composed of files or directories that physically reside on more than one computer in a communication network.
  • dotted decimal. A common notation for Internet host addresses that divides the 32-bit address into four 8-bit fields. The value of each field is specified as a decimal number and the fields are separated by periods, for example, 010.002.000.052 or 10.2.0.52.
  • double-byte character set (DBCS). A set of characters in which each character is represented by two bytes.
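
The dotted decimal entry above describes a 32-bit address split into four 8-bit fields. A minimal sketch of the conversion in both directions follows; the function names are illustrative assumptions, and the sample address is the one used in the entry.

```python
def dotted_decimal_to_int(address):
    """Pack a dotted decimal string such as '10.2.0.52' into a 32-bit integer."""
    fields = [int(part) for part in address.split(".")]
    if len(fields) != 4 or any(not 0 <= field <= 255 for field in fields):
        raise ValueError("expected four fields, each between 0 and 255")
    value = 0
    for field in fields:
        # Shift the accumulated value by one 8-bit field and append the next field.
        value = (value << 8) | field
    return value


def int_to_dotted_decimal(value):
    """Unpack a 32-bit integer into four decimal fields separated by periods."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


print(dotted_decimal_to_int("10.2.0.52"))        # prints 167903284
print(dotted_decimal_to_int("010.002.000.052"))  # prints 167903284
print(int_to_dotted_decimal(167903284))          # prints 10.2.0.52
```

Both spellings from the entry, with and without leading zeros, map to the same 32-bit value.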

E

  • envelope. The area between two curves that are parallel to a curve of time-sequence data. The first curve runs above the curve of time-sequence data, the second one below. Both curves have the same distance to the curve of time-sequence data. The width of the envelope, that is, the distance from the first parallel curve to the second, is defined by epsilon.
  • epsilon. The maximum width of an envelope that encloses a sequence. Another sequence is epsilon-similar if it fits in this envelope.
  • epsilon-similar. Two sequences are epsilon-similar if one sequence does not go beyond the envelope that encloses the other sequence.
  • equality compatible. Pertaining to different data types that can be operands for the = logical operator.
  • Euclidean distance. The square root of the sum of the squared differences between two numeric vectors. The Euclidean distance is used to calculate the error between the calculated network output and the target output in Neural Classification, and to calculate the difference between a record and a prototype cluster value in Neural Clustering. A zero value indicates an exact match; larger numbers indicate greater differences.
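
The Euclidean distance entry above translates directly into code. The following is a minimal sketch; the function name and sample vectors are illustrative, and it is not a description of how the Intelligent Miner implements the calculation internally.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of the squared differences between two numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # prints 5.0
print(euclidean_distance([1.0, 2.0], [1.0, 2.0]))  # prints 0.0 (exact match)
```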

F

  • field. A set of one or more related data items grouped for processing. In this document, with regard to database tables and views, field is synonymous with column.
  • file. A collection of related data that is stored and retrieved by an assigned name.
  • file name. (1) A name assigned or declared for a file. (2) The name used by a program to identify a file.
  • file-selection box. A box that enables the user to choose a file to work with by selecting a file name from the ones listed or by typing a file name into the space provided.
  • file specification. The name and location of a file.
  • file system. The collection of files and file management structures on a physical or logical mass storage device, such as a diskette or minidisk. See distributed file system, virtual file system.
  • flat file. (1) A one-dimensional or two-dimensional array: a list or table of items. (2) A file that has no hierarchical structure.
  • formatted information. An arrangement of information into discrete units and structures in a manner that facilitates its access and processing. Contrast with narrative information.
  • frequent item sets. The total volume of items above the specified support factor returned by the Associations mining function.
  • F-test. A statistical test that checks whether two estimates of the variances of two independent samples are the same. In addition, the F-test checks whether the null hypothesis is true or false.
  • function. Any instruction or set of related instructions that perform a specific operation.

H

  • hidden layer. A set of processing units in a neural network used to calculate its outputs. Hidden layer processing units take their inputs from the preceding hidden layer units, or from the input layer. Their outputs are passed to either a succeeding hidden layer or the network's output layer. The number of hidden layers and the number of processing units in each hidden layer are part of the network architecture.
  • host. Pertaining to a computer controlling all or part of a network, and providing an access method to that network.

I

  • index. In SQL, pointers that are logically arranged by the values of a key. Indexes provide quick access and can enforce uniqueness on the rows in a table.
  • input data. The metadata of the database table, database view, or flat file containing the data you specified to be mined.
  • input layer. A set of processing units in a neural network which present the numeric values derived from user data to the network. The number of fields and type of data in those fields is used to calculate the number of processing units in the input layer.
  • interval. A set of real numbers between two numbers either including or excluding both of them.
  • interval boundaries. Values that represent the upper and lower limits of an interval.
  • item category. A categorization of an item. For example, a room in a hotel can have the following categories: Standard, Comfort, Superior, Luxury. The lowest category is called child item category. Each child item category can have several parent item categories. Each parent item category can have several grandparent item categories.
  • item description. The descriptive name of a character string in a data table.
  • item ID. The identifier for an item.
  • item set. A collection of items. For example, all items bought by one customer during one visit to a department store.

K

  • key. In SQL, a column or an ordered collection of columns identified in the description of an index.
  • Kohonen Feature Map. A neural network model comprised of processing units arranged in an input layer and output layer. All processors in the input layer are connected to each processor in the output layer by an adaptive connection. The learning algorithm used involves competition between units for each input pattern and the declaration of a winning unit. Used in neural clustering to partition data into similar record groups.

L

  • learning algorithm. The set of well-defined rules used during the training process to adjust the connection weights of a neural network. The criteria and methods used to adjust the weights define the different learning algorithms.
  • learning parameters. The variables used by each neural network model to control the training of a neural network, which is accomplished by modifying network weights.
  • lift. Confidence factor divided by expected confidence.
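
The lift definition above (confidence factor divided by expected confidence) can be sketched for a single association rule. This sketch makes two assumptions that the glossary does not state: the confidence factor is the share of transactions containing the rule body that also contain the rule head, and the expected confidence is the share of all transactions containing the rule head. The function name and the sample transactions are hypothetical.

```python
def confidence_and_lift(transactions, body, head):
    """Return (confidence, expected_confidence, lift) for the rule head <== body."""
    body, head = set(body), set(head)
    n_total = len(transactions)
    n_body = sum(1 for t in transactions if body <= set(t))
    n_head = sum(1 for t in transactions if head <= set(t))
    n_both = sum(1 for t in transactions if (body | head) <= set(t))
    if n_body == 0 or n_head == 0:
        raise ValueError("rule body and rule head must each occur at least once")

    confidence = n_both / n_body            # assumed definition, see lead-in
    expected_confidence = n_head / n_total  # assumed definition, see lead-in
    lift = confidence / expected_confidence
    return confidence, expected_confidence, lift


# Hypothetical market-basket data.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
confidence, expected, lift = confidence_and_lift(baskets, {"bread"}, {"butter"})
print(confidence, expected, lift)
```

Under these assumptions, a lift above 1 means the head appears with the body more often than its overall frequency alone would suggest.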

M

  • metadata. In databases, data that describes data objects.
  • mining. Synonym for analyzing or searching.
  • mining base. A repository where all the information about the mining data, the mining run settings, and the corresponding results is stored.
  • model. A specific type of neural network and its associated learning algorithm. Examples include the Kohonen Feature Map and back propagation.
  • mount. (1) To place a data medium in a position to operate. (2) To make recording media accessible.

N

  • name mapping. A table containing descriptive names or translations of other languages mapped to the numerals or the character strings of a data table.
  • named pipe. A named buffer that provides client-to-server, server-to-client, or full duplex communication between unrelated processes.
  • narrative information. Information that is presented according to the syntax of a natural language. Contrast with formatted information.
  • neural network. A collection of processing units and adaptive connections that is designed to perform a specific processing function.
  • nonsupervised learning. A learning algorithm that requires only input data to be present in the data source during the training process. No target output is provided; instead, the desired output is discovered during the mining run. A Kohonen Feature Map, for example, uses nonsupervised learning.

O

  • offset. (1) The number of measuring units from an arbitrary starting point in a record, area, or control block, to some other point. (2) The distance from the beginning of an object to the beginning of a particular field.
  • operator. (1) A symbol that represents an operation to be done. (2) In a language statement, the lexical entity that indicates the action to be performed on operands.
  • output data object. The metadata of the database table, database view, or flat file containing the data being produced or to be produced by a function.
  • output layer. A set of processing units in a neural network which contain the output calculated by the network. The number of outputs depends on the number of classification categories or maximum clusters value in Neural Classification and Neural Clustering, respectively.

P

  • pass. One cycle of processing a body of data. During a pass, each record is read once.
  • path. The route used to locate files; the storage location of a file. A fully qualified path lists the drive identifier, directory name, subdirectory name (if any), and file name with the associated extension.
  • pipe. A named or unnamed buffer used to pass data between processes.
  • prediction model. A model of the dependency and the variation of one field's value within a record on the other fields within the same record. A profile is then generated that can predict a value for the particular field in a new record of the same form, based on its other field values.
  • processing unit. A processing unit in a neural network is used to calculate an output value by summing all incoming values multiplied by their respective adaptive connection weights.
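
The processing unit entry above describes a weighted sum. A minimal sketch follows; the function name and sample values are illustrative. Many neural network models also pass this sum through an activation function, which is omitted here because the entry describes only the summation.

```python
def processing_unit_output(inputs, weights):
    """Sum all incoming values multiplied by their adaptive connection weights."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one connection weight")
    return sum(value * weight for value, weight in zip(inputs, weights))


# Hypothetical incoming values and adaptive connection weights.
incoming = [1.0, 0.5, -1.0]
weights = [0.2, 0.4, 0.1]
print(processing_unit_output(incoming, weights))  # approximately 0.3
```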

Q

  • quantile range. One of a finite number of nonoverlapping subranges or intervals, each of which is represented by an assigned value. Q is an N%-quantile of a value set S when: (1) approximately N percent of the values in S are lower than or equal to Q, and (2) approximately (100-N) percent of the values are greater than or equal to Q. The approximation is less exact when there are many values equal to Q. N is called the quantile label or quantile limit. The 50%-quantile represents the median.
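
One way to compute an N%-quantile that satisfies the conditions above is the nearest-rank method, sketched below. This is one convention among several and is an assumption here; the glossary does not say which method the Intelligent Miner uses. The function name and sample values are illustrative.

```python
import math


def quantile(values, n_percent):
    """Nearest-rank N%-quantile: the smallest value with at least N percent of values at or below it."""
    if not values:
        raise ValueError("value set must not be empty")
    if not 0 < n_percent <= 100:
        raise ValueError("n_percent must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(n_percent / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


sample = [7, 1, 5, 3, 9]
print(quantile(sample, 50))   # prints 5 (the median)
print(quantile(sample, 100))  # prints 9
```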

R

  • Radial Basis Function (RBF). The individual Radial Basis Functions are functions of the distance or the radius from a particular point. They are used to build up approximations to more complicated functions. The RBF-Prediction mining function uses Radial Basis Functions to predict values.
  • record. A set of one or more related data items grouped for processing. In reference to a database table, record is synonymous with row.
  • region. A (sub)set of records with similar characteristics in their active fields. Regions are used to visualize a prediction result.
  • root. In the AIX operating system, the user name for the system user with the highest authority.
  • round-robin method. A method by which items are sequentially assigned to units. When an item has been assigned to the last unit in the series, the next item is assigned to the first again. This process is repeated until the last item has been assigned.
  • rule. A clause in the form head <== body. It specifies that the head is true if the body is true.
  • rule body. Represents the specified input data for a mining function.
  • rule group. Covers all rules containing the same items in different variations.
  • rule head. Represents the derived items detected by the Associations mining function.
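
The round-robin method described above can be sketched in a few lines. The function name, the items, and the number of units are illustrative assumptions.

```python
def round_robin_assign(items, unit_count):
    """Assign items to units in sequence, wrapping back to the first unit after the last."""
    if unit_count < 1:
        raise ValueError("at least one unit is required")
    units = [[] for _ in range(unit_count)]
    for index, item in enumerate(items):
        # The modulo wraps the assignment back to the first unit after the last one.
        units[index % unit_count].append(item)
    return units


print(round_robin_assign(list("abcdefg"), 3))
# prints [['a', 'd', 'g'], ['b', 'e'], ['c', 'f']]
```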

S

  • scale. A system of mathematical notation: fixed-point or floating-point scale of an arithmetic value.
  • scaling. To adjust the representation of a quantity by a factor in order to bring its range within prescribed limits.
  • scale factor. A number used as a multiplier in scaling. For example, a scale factor of 1/1000 would be suitable to scale the values 856, 432, -95, and -182 to lie in the range from -1 to +1, inclusive.
  • schema. A logical grouping for database objects. When a database object is created, it is assigned to one schema, which is determined by the name of the object. For example, the following command creates table X in schema C: CREATE TABLE C.X
  • self-organizing feature map. See Kohonen Feature Map.
  • sensitivity analysis. An output from the Neural Classification mining function that shows which input fields are relevant to the classification decision.
  • sequential patterns. Intertransaction patterns such that the presence of one set of items is followed by another set of items in a database of transactions over a period of time.
  • similar sequences. Occurrences of similar sequences in a database of sequences.
  • Structured Query Language (SQL). An established set of statements used to manage information stored in a database. By using these statements, users can add, delete, or update information in a table, request information through a query, and display the results in a report.
  • supervised learning. A learning algorithm that requires input and resulting output pairs to be presented to the network during the training process. Back propagation, for example, uses supervised learning and makes adjustments during training so that the value computed by the neural network will approach the actual value as the network learns from the data presented. Supervised learning is used in the techniques provided for classification as well as value prediction.
  • support factor. Indicates the occurrence of the detected association rules and sequential patterns based on the input data.
  • swapping. A process that interchanges the contents of an area of real storage with the contents of an area in auxiliary storage.
  • symbolic name. In a programming language, a unique name used to represent an entity such as a field, file, data structure, or label. In the Intelligent Miner you specify symbolic names, for example, for input data, name mappings, or taxonomies.
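
The scaling and scale factor entries above can be illustrated with a short sketch that multiplies each value by a scale factor of 1/1000 and checks that the results lie in the range from -1 to +1. The function name is an illustrative assumption, and only three of the sample values from the scale factor entry are used.

```python
def scale(values, scale_factor):
    """Multiply each value by the scale factor to bring it within the desired range."""
    return [value * scale_factor for value in values]


scaled = scale([856, 432, -95], 1 / 1000)
print(scaled)

# Every scaled value now lies in the range from -1 to +1, inclusive.
print(all(-1 <= value <= 1 for value in scaled))  # prints True
```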

T

  • taxonomy. Represents a hierarchy or a lattice of associations between the item categories of an item. These associations are called taxonomy relations.
  • taxonomy relation. The hierarchical associations between the item categories you defined for an item. A taxonomy relation consists of a child item category and a parent item category.
  • trained network. A neural network containing connection weights that have been adjusted by a learning algorithm. A trained network can be considered a virtual processor; it transforms inputs to outputs.
  • training. The process of developing a model which understands the input data. In neural networks, the model is created by reading the records of the input data and modifying the network weights until the network calculates the desired output data.
  • translation process. Converting the data provided in the database to scaled numeric values in the appropriate range for a mining kernel using neural networks. Different techniques are used depending on whether the data is numeric or symbolic. Also, converting neural network output back to the units used in the database.
  • transaction. A set of items or events that are linked by a common key value, for example, the articles (items) bought by a customer (customer number) on a particular date (transaction identifier). In this example, the customer number represents the key value.
  • transaction ID. The identifier for a transaction, for example, the date of a transaction.
  • transaction group. The identifier for a set of transactions. For example, a customer number can represent a transaction group that includes all purchases of a particular customer during the month of May.

V

  • vector. A quantity usually characterized by an ordered set of numbers.
  • virtual file system. In the AIX operating system, a remote file system that has been mounted so that it is accessible to the local user.

W

  • weight. The numeric value of an adaptive connection representing the strength of the connection between two processing units in a neural network.


2002 UsingTheIntelMinerForData
Author: International Business Machines (IBM) Corporation (1911-)
Title: Using the Intelligent Miner for Data, v8 r1
URL: http://publibfp.boulder.ibm.com/epubs/pdf/h1267500.pdf
Year: 2002