Machine Learning (ML) Feature Store Platform

From GM-RKB

A Machine Learning (ML) Feature Store Platform is a purpose-built data platform for managing ML feature sets.



References

2021

  • https://aws.amazon.com/sagemaker/feature-store/
    • QUOTE: Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, update, retrieve, and share machine learning (ML) features.

      Features are the attributes or properties models use during training and inference to make predictions. For example, in a ML application that recommends a music playlist, features could include song ratings, which songs were listened to previously, and how long songs were listened to. The accuracy of a ML model is based on a precise set and composition of features. Often, these features are used repeatedly by multiple teams training multiple models. And whichever feature set was used to train the model needs to be available to make real-time predictions (inference). Keeping a single source of features that is consistent and up-to-date across these different access patterns is a challenge as most organizations keep two different feature stores, one for training and one for inference.

       Amazon SageMaker Feature Store is a purpose-built repository where you can store and access features so it’s much easier to name, organize, and reuse them across teams. SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent. SageMaker Feature Store keeps track of the metadata of stored features (e.g. feature name or version number) so that you can query the features for the right attributes in batches or in real time using Amazon Athena, an interactive query service. SageMaker Feature Store also keeps features updated, because as new data is generated during inference, the single repository is updated so new features are always available for models to use during training and inference.
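The idea of one store serving both training and real-time inference can be illustrated with a small sketch. This is a toy, in-memory model written only for illustration; it is not the SageMaker Feature Store API. The class ToyFeatureStore and its methods (register, put, get_online, training_rows) are hypothetical names. A single write path feeds an append-only history (the "offline" view used for training) and a latest-value map (the "online" view used for inference), and feature metadata is kept alongside.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ToyFeatureStore:
    """Hypothetical in-memory feature store; NOT the SageMaker API."""

    # Feature metadata: name -> {"version": ..., "description": ...}
    metadata: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    # Offline view: append-only history of (entity_id, event_time, features)
    offline: List[Tuple[str, float, Dict[str, Any]]] = field(default_factory=list)
    # Online view: entity_id -> (latest event_time, latest feature values)
    online: Dict[str, Tuple[float, Dict[str, Any]]] = field(default_factory=dict)

    def register(self, name: str, version: int, description: str) -> None:
        """Record metadata for a feature before any values are written."""
        self.metadata[name] = {"version": version, "description": description}

    def put(self, entity_id: str, event_time: float, features: Dict[str, Any]) -> None:
        """Single write path: every record goes to both views."""
        unknown = set(features) - set(self.metadata)
        if unknown:
            raise KeyError(f"unregistered features: {sorted(unknown)}")
        self.offline.append((entity_id, event_time, dict(features)))
        last_time, latest = self.online.get(entity_id, (float("-inf"), {}))
        # Late-arriving records are kept in history but do not overwrite
        # newer online values.
        if event_time >= last_time:
            self.online[entity_id] = (event_time, {**latest, **features})

    def get_online(self, entity_id: str, names: List[str]) -> Dict[str, Any]:
        """Low-latency read of the latest values, as used at inference time."""
        _, latest = self.online[entity_id]
        return {name: latest.get(name) for name in names}

    def training_rows(self, names: List[str]) -> List[Dict[str, Any]]:
        """Batch read of the full history, as used to build training data."""
        return [
            {"entity_id": e, "event_time": t, **{n: f.get(n) for n in names}}
            for e, t, f in self.offline
        ]


# Usage, loosely following the music-playlist example in the quote.
store = ToyFeatureStore()
store.register("song_rating", 1, "rating the user gave a song")
store.register("listen_seconds", 1, "how long the song was listened to")
store.put("user-1", 100.0, {"song_rating": 4, "listen_seconds": 180})
store.put("user-1", 200.0, {"song_rating": 5})

online = store.get_online("user-1", ["song_rating", "listen_seconds"])
rows = store.training_rows(["song_rating", "listen_seconds"])
```

Because both views are fed by the same put call, the feature definitions used for training and for inference cannot drift apart, which is the consistency problem the quote describes. A production platform would back the two views with separate storage systems rather than Python containers.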

2020

Feature Store Comparison

| Platform | Open-Source | Offline | Online | Metadata | Feature Engineering | Supported Platforms | TimeTravel / Point-in-Time Queries | Training Data |
|---|---|---|---|---|---|---|---|---|
| | AGPL-V3 | Hudi/Hive | MySQL Cluster | DB Tables, Elasticsearch | (Py)Spark, Python | AWS, GCP, On-Prem | SQL Join or Hudi Queries | .tfrecords, .csv, .npy, .petastorm, .hf5, etc |
| | N/A | Hive | Cassandra | Content | Spark, DSL | Proprietary | SQL Join | Streamed to models? |
| | Apache V2 | BigQuery | BigTable/Redis | DB Tables | Beam, Python | GCP | SQL Join | Streamed to models |
| | N/A | Kafka/Cassandra | Kafka/Cassandra | Protocol Buffers | Shared libraries | Proprietary | ? | Protobuf |
| | N/A | Hive | KV Store | KV Entries | Flink, Spark, DSL | Proprietary | Schema | Streamed to models? |
| | N/A | HDFS, Cassandra | Kafka / Redis | Github | Flink, Spark | Proprietary | No? | Unknown |
| | N/A | Kafka & S3 | Kafka & Microservices | Protobufs | Spark, shared libraries | Proprietary | Custom | Protobuf |
| | N/A | HDFS | Strato / Manhattan | Scala shared feature libraries | Scala DSL, Scalding, shared libraries | Proprietary | No | Unknown |
| | N/A | ? | Yes, no details | Yes, no details | ? | Proprietary | ? | Unknown |
| | N/A | S3/Hive | Yes, no details | Yes, no details | DSL (Linchpin), Spark | Proprietary | ? | Unknown |
| | N/A | Parquet | Yes, in mem database | Yes, no details | Spark, Python, Nuclio | AWS, Azure, GCP, on-prem | Yes, native time series or SQL | Yes, no details |
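
The "TimeTravel / Point-in-Time Queries" column in the comparison above refers to reading feature values as they were known at a given moment, so that training rows do not leak information from after the label was observed. A minimal sketch of that idea follows, assuming an in-memory history sorted by event time. The names value_as_of and build_training_rows and the sample data are hypothetical, and none of the listed platforms is claimed to work this way internally (they are listed as using SQL joins, Hudi queries, or custom mechanisms).

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical feature history:
# entity_id -> list of (event_time, value), sorted by event_time.
History = Dict[str, List[Tuple[int, float]]]


def value_as_of(history: History, entity_id: str, as_of: int) -> Optional[float]:
    """Return the latest value recorded at or before `as_of`, else None."""
    records = history.get(entity_id, [])
    times = [t for t, _ in records]
    # Index of the first record strictly after `as_of`.
    idx = bisect_right(times, as_of)
    return records[idx - 1][1] if idx else None


def build_training_rows(
    history: History, labels: List[Tuple[str, int, int]]
) -> List[Tuple[str, int, Optional[float], int]]:
    """Attach to each (entity_id, label_time, label) the value known at label_time."""
    return [
        (entity_id, label_time, value_as_of(history, entity_id, label_time), label)
        for entity_id, label_time, label in labels
    ]


# Illustrative data only.
history = {"user-1": [(10, 3.0), (20, 4.5), (30, 5.0)]}
labels = [("user-1", 5, 0), ("user-1", 20, 1), ("user-1", 25, 0)]
rows = build_training_rows(history, labels)
```

The label at time 5 gets no feature value because nothing had been recorded yet, and the label at time 25 gets the value from time 20 rather than the later one from time 30.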