0xdata H2O Environment
- See: Data Robots.
- QUOTE: H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O’s NanoFast™ Scoring Engine. …
… H2O makes it possible for anyone to easily apply math and predictive analytics to solve today’s most challenging business problems. It intelligently combines unique features not currently found in other machine learning platforms including:
- Best of Breed Open Source Technology – Enjoy the freedom that comes with big data science powered by OpenSource technology. H2O leverages the most popular OpenSource products like Apache Hadoop® and Spark to give customers the flexibility to solve their most challenging data problems.
- Easy-to-use WebUI and Familiar Interfaces – Set up and get started quickly using either H2O’s intuitive Web-based user interface or familiar programming environments like R, Java, Scala, Python, JSON, and through our powerful APIs.
- Data Agnostic Support for all Common Database and File Types – Easily explore and model big data from within Microsoft Excel, R Studio, Tableau and more. Connect to data from HDFS, S3, SQL and NoSQL data sources. Install and deploy anywhere.
- Massively Scalable Big Data Analysis – Train a model on complete data sets, not just small samples, and iterate and develop models in real-time with H2O’s rapid in-memory distributed parallel processing.
- Real-time Data Scoring – Use the Nanofast Scoring Engine to score data against models for accurate predictions in just nanoseconds in any environment. Enjoy 10X faster scoring and predictions than the next nearest technology in the market.
- Prepare Your Data For Modeling
- Munge Tool – Description
- Data Profiling – Quickly summarize the shape of your dataset to avoid bias or missing information before you start building your model. Missing data, zero values, text, and a visual distribution of the data are visualized automatically upon data ingestion.
- Summary Statistics – Visualize your data with summary statistics to get the mean, standard deviation, min, max, cardinality, quantile and a preview of the data set.
- Aggregate, Filter, Bin, and Derive Columns – Build unique views with Group functions, Filtering, Binning, and Derived Columns.
- Slice, Log Transform, and Anonymize – Normalize, anonymize, and partition to get your data into the right shape for modeling.
- Variable Creation – Highly customizable variable value creation to hone in on the key data characteristics to model.
- PCA – Principal Component Analysis makes feature selection easy with a simple to use interface and standard input values.
- Training and Validation Sampling Plan – Design a random or stratified sampling plan to generate data sets for model training and scoring.
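The summary statistics and stratified sampling plan described above can be illustrated with a minimal sketch in plain Python. This is not H2O's API and is not part of the quoted text; the function names `summarize` and `stratified_split`, the 80/20 split, and the sample data are all assumptions made for illustration.

```python
import random
import statistics
from collections import defaultdict


def summarize(values):
    """Summary statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if v == 0),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),
        "median": statistics.median(present),
        "cardinality": len(set(present)),  # number of distinct values
    }


def stratified_split(rows, label_of, train_fraction=0.8, seed=0):
    """Split rows into train/validation sets while keeping label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, valid = [], []
    for group in by_label.values():
        rng.shuffle(group)  # random order within each label group
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid


# Hypothetical data: one column with a missing value, and 40 labeled rows.
column = [3.0, 0.0, None, 7.0, 5.0, 0.0, 9.0]
stats = summarize(column)

rows = [(i, "yes" if i % 4 == 0 else "no") for i in range(40)]
train, valid = stratified_split(rows, label_of=lambda r: r[1])
print(stats["mean"], len(train), len(valid))
```

Because the split is done per label group, the 1:3 ratio of "yes" to "no" rows in the hypothetical data is preserved in both the training and validation sets.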
- Model with State of the Art Machine Learning Algorithms
- Model – Description
- Generalized Linear Models (GLM) – A flexible generalization of ordinary linear regression for response variables that have error distribution models other than a normal distribution. GLM unifies various other statistical models, including linear, logistic, Poisson, and more.
- Decision Trees – A decision support tool that uses a tree-like graph or model of decisions and their possible consequences.
- Gradient Boosting (GBM) – A method to produce a prediction model in the form of an ensemble of weak prediction models. It builds the model in a stage-wise fashion and is generalized by allowing an arbitrary differentiable loss function. It is one of the most powerful methods available today.
- K-Means – A method to uncover groups or clusters of data points often used for segmentation. It clusters observations into k certain points with the nearest mean.
- Anomaly Detection – Identify the outliers in your data by invoking a powerful pattern recognition model.
- Deep Learning – Model high-level abstractions in data by using non-linear transformations in a layer-by-layer method. Deep learning is an example of unsupervised learning and can make use of unlabeled data that other algorithms cannot.
- Naïve Bayes – A probabilistic classifier that assumes the value of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. It is often used in text categorization.
- Grid Search – The standard way of performing hyperparameter optimization to make model configuration easier. It is measured by cross-validation of an independent data set.
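As an illustration of the K-Means entry above, the following is a minimal sketch of Lloyd's algorithm in plain Python: each point is assigned to the centroid with the nearest mean, and the means are then recomputed. It is not H2O's implementation and is not part of the quoted text; the function name `kmeans`, the naive "first k points" initialization, and the sample points are assumptions chosen for brevity.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = [points[i] for i in range(k)]  # assumption: naive initialization
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, assignments


# Hypothetical data: two well-separated groups in two dimensions.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 10.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)  # [0, 1, 0, 1, 0, 1]
```

Production implementations typically use a smarter initialization (such as k-means++) and a convergence check rather than a fixed iteration count; both are omitted here to keep the sketch short.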
- Score Models with Confidence
- Score Tool – Description
- Predict – Generate outcomes of a data set with any model. Predict with GLM, GBM, Decision Trees or Deep Learning models.
- Confusion Matrix – Visualize the performance of an algorithm in a table to understand how a model performs.
- AUC – A graphical plot to visualize the performance of a model by its sensitivity, true positive, false positive to select the best model.
- HitRatio – A classification matrix to visualize the ratio of the number of correctly classified and incorrectly classified cases.
- PCA Score – Determine how well your feature selection is for a particular model.
- Multi-Model Scoring – Compare and contrast multiple models on a data set to find the best performer to deploy into production.
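The confusion matrix, hit ratio, and AUC entries above can be made concrete with a small plain-Python sketch for a binary classifier. It is not H2O's scoring code and is not part of the quoted text; the function names, the 0.5 decision threshold, and the labels and scores are hypothetical. AUC is computed here through its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one), which equals the area under the ROC curve.

```python
def confusion_matrix(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def hit_ratio(actual, predicted):
    """Fraction of cases that were classified correctly."""
    tp, fp, fn, tn = confusion_matrix(actual, predicted)
    return (tp + tn) / (tp + fp + fn + tn)


def auc(actual, scores):
    """AUC as P(score of a positive > score of a negative); ties count half."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.6, 0.7, 0.2]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(confusion_matrix(actual, predicted))  # (3, 1, 1, 3)
print(hit_ratio(actual, predicted))         # 0.75
print(auc(actual, scores))                  # 0.875
```

The confusion matrix and hit ratio depend on the chosen threshold, whereas AUC is computed from the raw scores and summarizes performance across all thresholds, which is why it is commonly used to compare models.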