0xdata H2O Environment


An 0xdata H2O Environment is a data mining environment produced by 0xdata.



  • QUOTE: H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O’s NanoFast™ Scoring Engine. …

    … H2O makes it possible for anyone to easily apply math and predictive analytics to solve today’s most challenging business problems. It intelligently combines unique features not currently found in other machine learning platforms including:

    • Best of Breed Open Source Technology – Enjoy the freedom that comes with big data science powered by OpenSource technology. H2O leverages the most popular OpenSource products like Apache Hadoop® and Spark to give customers the flexibility to solve their most challenging data problems.
    • Easy-to-use WebUI and Familiar Interfaces – Set up and get started quickly using either H2O’s intuitive Web-based user interface or familiar programming environments like R, Java, Scala, Python, JSON, and through our powerful APIs.
    • Data Agnostic Support for all Common Database and File Types – Easily explore and model big data from within Microsoft Excel, R Studio, Tableau and more. Connect to data from HDFS, S3, SQL and NoSQL data sources. Install and deploy anywhere.
    • Massively Scalable Big Data Analysis – Train a model on complete data sets, not just small samples, and iterate and develop models in real-time with H2O’s rapid in-memory distributed parallel processing.
    • Real-time Data Scoring – Use the NanoFast Scoring Engine to score data against models for accurate predictions in just nanoseconds in any environment. Enjoy 10X faster scoring and predictions than the next nearest technology in the market.

  • http://0xdata.com/product/algorithms/
    • Prepare Your Data For Modeling
      • Munge Tool Description
      • Data Profiling: Quickly summarize the shape of your dataset to avoid bias or missing information before you start building your model. Missing data, zero values, text, and a visual distribution of the data are visualized automatically upon data ingestion.
      • Summary Statistics: Visualize your data with summary statistics to get the mean, standard deviation, min, max, cardinality, quantiles, and a preview of the data set.
      • Aggregate, Filter, Bin, and Derive Columns: Build unique views with group functions, filtering, binning, and derived columns.
      • Slice, Log Transform, and Anonymize: Normalize, anonymize, and partition to get your data into the right shape for modeling.
      • Variable Creation: Highly customizable variable creation to home in on the key data characteristics to model.
      • PCA: Principal Component Analysis makes feature selection easy with a simple-to-use interface and standard input values.
      • Training and Validation Sampling Plan: Design a random or stratified sampling plan to generate data sets for model training and scoring.
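The stratified sampling step described above can be sketched in plain Python. This is an illustrative stand-alone implementation, not H2O's own API (H2O performs its splits in-cluster on distributed frames); the function name and toy dataset are assumptions for the example:

```python
import random
from collections import defaultdict

def stratified_split(rows, label_index, train_fraction=0.8, seed=42):
    """Split rows into train/validation sets while preserving the class
    distribution of the column at label_index in each partition."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_index]].append(row)
    train, valid = [], []
    for _, members in sorted(by_class.items()):
        rng.shuffle(members)
        cut = int(len(members) * train_fraction)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Toy dataset of (feature, class label): 25 "pos" rows, 75 "neg" rows
data = [(i, "pos" if i % 4 == 0 else "neg") for i in range(100)]
train, valid = stratified_split(data, label_index=1, train_fraction=0.8)
```

Because the split is done per class, both partitions keep the original 1:3 positive-to-negative ratio, which a purely random split on a small or imbalanced dataset would not guarantee.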
    • Model with State of the Art Machine Learning Algorithms
    • Score Models with Confidence
      • Score Tool Description
      • Predict: Generate outcomes for a data set with any model. Predict with GLM, GBM, Decision Trees, or Deep Learning models.
      • Confusion Matrix: Visualize the performance of an algorithm in a table to understand how a model performs on each class.
      • AUC: A metric, visualized as an ROC curve plotting the true positive rate (sensitivity) against the false positive rate, used to select the best model.
      • HitRatio: A classification metric that gives the ratio of correctly classified to incorrectly classified cases.
      • PCA Score: Determine how well your feature selection works for a particular model.
      • Multi-Model Scoring: Compare and contrast multiple models on a data set to find the best performer to deploy into production.
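The confusion-matrix and hit-ratio ideas above can be illustrated with a minimal, library-free Python sketch. This is not the H2O API (H2O computes these metrics server-side and surfaces them through its R/Python clients and WebUI); the function names and toy labels below are assumptions for the example:

```python
def confusion_matrix(actual, predicted, labels):
    """Count outcomes into a table: rows = actual class, columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def hit_ratio(actual, predicted):
    """Fraction of cases whose predicted class matches the actual class."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

actual    = ["cat", "cat", "dog", "dog", "dog", "cat"]
predicted = ["cat", "dog", "dog", "dog", "cat", "cat"]
cm = confusion_matrix(actual, predicted, labels=["cat", "dog"])
hr = hit_ratio(actual, predicted)
```

Reading the matrix row by row shows where the model confuses one class for another, while the hit ratio collapses the same table into a single accuracy-style number for quick model comparison.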