2001 ScientificDataMiningInteAndVis

From GM-RKB

Subject Headings: Scientific Data Mining, Scientific Data.

Notes

Cited By

~8 http://scholar.google.com/scholar?cites=5743326407429456457

Quotes

Executive Summary

This report summarises the workshop on Scientific Data Mining, Integration and Visualization (SDMIV) held at the e-Science Institute, Edinburgh (eSI[1] ) on 24-25 October 2002, and presents a set of recommendations arising from the discussion that took place there. The aims of the workshop were three-fold: (A) To inform researchers in the SDMIV communities of the infrastructural advances being made by computing initiatives, such as the Grid; (B) To feed back requirements from the SDMIV areas to those developing the computational infrastructure; and (C) To foster interaction among all these communities, since the coordinated efforts of all of them will be required to realise the potential for scientific knowledge extraction offered by e-science initiatives worldwide.

1.3.1. Scientific Data

Much of the scientific data discussed at the workshop fell into three categories, and, while these do not represent an exhaustive list of scientific data types, much of the technology discussed in the meeting was directed to them. The three categories are:

  • The datacube, or array, class - meaning an annotated block of data in one, two, or more dimensions. This includes time-series and spectra (one-dimensional); images, frequency-time spectra, etc. (two-dimensional); voxel datasets and hyperspectral images (three-dimensional); and so on. The highly-optimised chips of modern computers handle these data structures well.
  • Records, or events, collected as a table; also known as multi-parameter data. These datasets may come directly from an instrument (for example, in a particle accelerator) or may be derived by picking features from a datacube (as when stars are identified in an astronomical image). Relational databases hold these data effectively.
  • Sequences of symbols, for example a biological gene is represented by a …
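
The three classes can be pictured with a toy Python sketch; all values and field names below (the tiny "image", the `flux` field, the sequence fragment) are invented purely for illustration:

```python
# 1. Datacube / array class: a 3x3 "image" as a nested list (2-D datacube).
image = [[0, 1, 0],
         [1, 2, 1],
         [0, 1, 0]]

# 2. Records / events as a table: features picked out of the datacube,
#    e.g. non-zero pixels treated as detected "objects" with a position and flux.
catalogue = [
    {"x": x, "y": y, "flux": v}
    for y, row in enumerate(image)
    for x, v in enumerate(row)
    if v > 0
]

# 3. Sequence of symbols: a made-up fragment of a gene sequence.
sequence = "ATGGCATTC"

print(len(catalogue))        # number of extracted records -> 5
print(sequence.count("T"))   # a simple symbol statistic   -> 3
```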

3.1.2. Data Mining Overview – Chris Williams

The essence of data mining is the finding of structure in data. A large number of different tasks fall under the heading of data mining – such as exploratory data analysis, descriptive and predictive modelling, and the discovery of association rules and outliers – and many practical problems arise from their application to complex types of data. Predictive modelling centres on learning from existing input/output pairs, so that the output(s) can be predicted for further inputs; it can employ a number of techniques, such as neural networks, decision trees, nearest-neighbour methods and Support Vector Machines. All such supervised learning is inherently inductive in nature, and its key issue is generalisation – how to make predictions for new inputs based on previous knowledge. Descriptive modelling seeks to find significant patterns in data with no external guidance, typically using techniques such as clustering, or reducing the dimensionality of the dataset by fitting it to a lower-dimensional manifold.
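
The supervised, inductive setting can be sketched with a minimal nearest-neighbour classifier; the data and names are invented for the example, and this stands in for the whole family of techniques listed above:

```python
import math

def predict_1nn(train, query):
    """Predict the label of `query` as the label of its nearest training point.

    `train` is a list of (feature_vector, label) input/output pairs.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(train, key=lambda pair: dist(pair[0], query))
    return label

# Toy input/output pairs: 2-D features with two classes.
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]

# Generalisation: predictions for inputs never seen before.
print(predict_1nn(train, (0.2, 0.1)))  # -> A
print(predict_1nn(train, (0.8, 0.9)))  # -> B
```

The "key issue of generalisation" shows up directly here: the method's only notion of a new input is its distance to previously seen examples.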

3.2.1. Computational astrostatistics – Bob Nichol

  • The Pittsburgh Computational Astrostatistics (PiCA[37] ) Group brings together statisticians,

computer scientists and astronomers from Carnegie Mellon University and the University of Pittsburgh, to develop new statistical tools for the analysis of large astronomical datasets, notably the Sloan Digital Sky Survey (SDSS[38]). This collaboration works well because it is of benefit to all concerned: the astronomers want to exploit the large, rich SDSS dataset to the full scientifically, and that requires expertise in algorithms for knowledge extraction, while that same size and richness challenges and stimulates the computer scientists and statisticians who work on such algorithms. The key to the success of the PiCA group’s algorithms is the use of multi-resolution k-d trees for data storage. As well as partitioning the data effectively, these trees also store basic statistical information about the objects stored in each node. This representation of the data is usually sufficiently condensed that many operations – such as the calculation of the N-pt spatial correlation functions, which characterise the clustering of the galaxies in the SDSS – can be performed in memory, with the tree structure avoiding many needless computations made by traditional techniques.
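
The cached-statistics idea can be sketched as follows. This is a toy 2-D version, not the PiCA group's actual multi-resolution k-d tree code: each node stores its point count and bounding box, so a radius-count query (a building block of correlation-function estimation) can accept or reject whole subtrees at once instead of testing every point:

```python
class Node:
    """k-d tree node caching a point count and bounding box per subtree."""
    def __init__(self, points, depth=0):
        self.count = len(points)
        xs = [p[0] for p in points]; ys = [p[1] for p in points]
        self.bbox = (min(xs), min(ys), max(xs), max(ys))
        if len(points) <= 2:                         # small node: keep as a leaf
            self.points, self.left, self.right = points, None, None
            return
        axis = depth % 2                             # alternate split axis
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        self.points = None
        self.left = Node(points[:mid], depth + 1)
        self.right = Node(points[mid:], depth + 1)

def count_within(node, q, r):
    """Count points within distance r of q, pruning with the cached statistics."""
    x0, y0, x1, y1 = node.bbox
    # nearest squared distance from q to the box: box entirely outside -> prune
    dx = max(x0 - q[0], 0, q[0] - x1); dy = max(y0 - q[1], 0, q[1] - y1)
    if dx * dx + dy * dy > r * r:
        return 0
    # farthest squared distance: box entirely inside -> take the cached count
    fx = max(abs(x0 - q[0]), abs(x1 - q[0])); fy = max(abs(y0 - q[1]), abs(y1 - q[1]))
    if fx * fx + fy * fy <= r * r:
        return node.count
    if node.points is not None:                      # leaf: test points directly
        return sum((p[0]-q[0])**2 + (p[1]-q[1])**2 <= r*r for p in node.points)
    return count_within(node.left, q, r) + count_within(node.right, q, r)

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
tree = Node(pts)
print(count_within(tree, (0.0, 0.0), 0.5))  # -> 3 points in the query disc
```

The two early returns are the point of the technique: whole subtrees contribute either nothing or their cached count, so most of the "needless computations" of a brute-force pair count never happen.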

3.2.3. Bioinformatics – David Gilbert

Bioinformatics is the application of molecular biology, computer science, artificial intelligence, statistics and mathematics to model, organise, understand and discover interesting knowledge associated with large-scale molecular biology databases. This combination of expertise is required not only because of the rapid increase in the volume of molecular biology data, but also because of how those data are used; the life sciences are characterised by coordinated study at many different levels of granularity – from a single nucleotide sequence, through protein structure to a cell, to an organ, all the way up to the physiology of a whole organism. Classification is a major part of biology, so classification techniques feature strongly in bioinformatics, often using similarities of structure (found through pattern-matching – e.g. in gene sequences) to infer similarity of function.
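
The structure-to-function inference can be sketched, very schematically, with an edit-distance comparison of sequences; the sequences and their annotations below are invented for illustration, and real tools use far richer alignment scores:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, by dynamic programming."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution / match
        prev = cur
    return prev[-1]

# Hypothetical annotated sequences and an unannotated query.
known = {"ATGGCATTC": "kinase", "TTGACCGTA": "transporter"}
query = "ATGGCTTTC"

# Similarity of structure suggests similarity of function: take the
# annotation of the closest known sequence as a guess for the query.
nearest = min(known, key=lambda s: edit_distance(s, query))
print(known[nearest])  # -> kinase
```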

References


* (Mann et al., 2002) ⇒ Bob Mann, Roy Williams, Malcolm Atkinson, Ken Brodlie, Amos Storkey, and Chris Williams. (2002). "Scientific Data Mining, Integration, and Visualization." http://www.cacr.caltech.edu/~roy/papers/sdmiv.pdf