Machine Learning (ML) Feature Store System

From GM-RKB

A Machine Learning (ML) Feature Store System is a specialized database system (for ML features) that can support feature store tasks.



References

2020

  • https://www.tecton.ai/blog/what-is-a-feature-store/
    • QUOTE: Feature stores make it easy to:
      • Productionize new features without extensive engineering support
      • Automate feature computation, backfills, and logging
      • Share and reuse feature pipelines across teams
      • Track feature versions, lineage, and metadata
      • Achieve consistency between training and serving data
      • Monitor the health of feature pipelines in production
    • Feature stores aim to solve the full set of data management problems encountered when building and operating operational ML applications.
    • A feature store is an ML-specific data system that:
      • Runs data pipelines that transform raw data into feature values
      • Stores and manages the feature data itself, and
      • Serves feature data consistently for training and inference purposes
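
The three responsibilities listed above (transform, store, serve) can be illustrated with a minimal in-memory sketch. Everything in it is hypothetical: `ToyFeatureStore`, its method names, and the example features are assumptions made for illustration and do not correspond to Tecton's or any other product's API. A real system would use separate offline and online storage backends rather than one dictionary.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

RawRow = Dict[str, Any]


@dataclass
class ToyFeatureStore:
    """Hypothetical in-memory feature store (illustration only)."""
    # feature name -> function that derives the feature from a raw row
    transforms: Dict[str, Callable[[RawRow], float]] = field(default_factory=dict)
    # entity id -> feature name -> stored feature value
    values: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[RawRow], float]) -> None:
        """Register a named transformation (the 'pipeline' definition)."""
        self.transforms[name] = fn

    def materialize(self, raw_rows: Dict[str, RawRow]) -> None:
        """Run every transformation on the raw data and store the results."""
        for entity_id, row in raw_rows.items():
            self.values[entity_id] = {
                name: fn(row) for name, fn in self.transforms.items()
            }

    def get_online(self, entity_id: str, names: List[str]) -> List[float]:
        """Serve one entity's feature vector, as an inference call would."""
        stored = self.values[entity_id]
        return [stored[n] for n in names]

    def get_training_rows(self, entity_ids: List[str], names: List[str]) -> List[List[float]]:
        """Serve many entities at once, reading the same stored values."""
        return [self.get_online(e, names) for e in entity_ids]


store = ToyFeatureStore()
store.register("trips_per_day", lambda r: r["trips"] / r["days_active"])
store.register("is_new_user", lambda r: 1.0 if r["days_active"] < 30 else 0.0)
store.materialize({
    "u1": {"trips": 60, "days_active": 20},
    "u2": {"trips": 90, "days_active": 45},
})

names = ["trips_per_day", "is_new_user"]
training = store.get_training_rows(["u1", "u2"], names)  # [[3.0, 1.0], [2.0, 0.0]]
online = store.get_online("u1", names)                   # [3.0, 1.0]
```

Because both read paths return values computed by the same registered transformation, the training row for `u1` and the online vector for `u1` are identical, which is the training/serving consistency property the quote describes.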

2017b

  • "Meet Michelangelo: Uber’s Machine Learning Platform." 2017-07-05
    • QUOTE: We found great value in building a centralized Feature Store in which teams around Uber can create and manage canonical features to be used by their teams and shared with others. At a high level, it accomplishes two things:
      1. It allows users to easily add features they have built into a shared feature store, requiring only a small amount of extra metadata (owner, description, SLA, etc.) on top of what would be required for a feature generated for private, project-specific usage.
      2. Once features are in the Feature Store, they are very easy to consume, both online and offline, by referencing a feature’s simple canonical name in the model configuration. Equipped with this information, the system handles joining in the correct HDFS data sets for model training or batch prediction and fetching the right value from Cassandra for online predictions.
    • At the moment, we have approximately 10,000 features in Feature Store that are used to accelerate machine learning projects, and teams across the company are adding new ones all the time. Features in the Feature Store are automatically calculated and updated daily. In the future, we intend to explore the possibility of building an automated system to search through Feature Store and identify the most useful and important features for solving a given prediction problem.
    • Domain specific language for feature selection and transformation.

      Often the features generated by data pipelines or sent from a client service are not in the proper format for the model, and they may be missing values that need to be filled. Moreover, the model may only need a subset of features provided. In some cases, it may be more useful for the model to transform a timestamp into an hour-of-day or day-of-week to better capture seasonal patterns. In other cases, feature values may need to be normalized (e.g., subtract the mean and divide by standard deviation).

      To address these issues, we created a DSL (domain specific language) that modelers use to select, transform, and combine the features that are sent to the model at training and prediction times. The DSL is implemented as sub-set of Scala.
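
The kinds of transformation named in the passage above (filling missing values, deriving hour-of-day or day-of-week from a timestamp, and normalizing by mean and standard deviation) can be sketched as follows. This is not Uber's DSL, which the source describes as a subset of Scala; it is a plain Python illustration, and the function names, the default fill value, and the choice of population standard deviation are all assumptions.

```python
from datetime import datetime, timezone
from statistics import mean, pstdev
from typing import List, Optional


def fill_missing(values: List[Optional[float]], default: float) -> List[float]:
    """Replace missing (None) entries with a caller-chosen default."""
    return [default if v is None else v for v in values]


def hour_of_day(ts: datetime) -> int:
    """Hour in the range 0..23, in the timestamp's own time zone."""
    return ts.hour


def day_of_week(ts: datetime) -> int:
    """Day index with Monday = 0 and Sunday = 6 (Python's convention)."""
    return ts.weekday()


def z_score(values: List[float]) -> List[float]:
    """Subtract the mean and divide by the standard deviation.

    Assumption: population standard deviation; the source does not say
    which variant is used. A constant column maps to all zeros.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical inputs, for illustration only.
ts = datetime(2017, 7, 5, 14, 30, tzinfo=timezone.utc)
hour = hour_of_day(ts)   # 14
dow = day_of_week(ts)    # 2 (Wednesday)

raw = [2.0, None, 4.0, 6.0]
filled = fill_missing(raw, default=4.0)  # [2.0, 4.0, 4.0, 6.0]
normalized = z_score(filled)
```

In practice the mean and standard deviation would be computed once on the training data and reused unchanged at prediction time, so that both paths apply the same transformation.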

2017a

  • "Using Machine Learning to Predict Value of Homes On Airbnb." 2017-07-17
    • QUOTE: One of the first steps of any supervised machine learning project is to define relevant features that are correlated with the chosen outcome variable, a process called feature engineering. For example, in predicting LTV, one might compute the percentage of the next 180 calendar dates that a listing is available or a listing’s price relative to comparable listings in the same market.

      At Airbnb, feature engineering often means writing Hive queries to create features from scratch. However, this work is tedious and time consuming as it requires specific domain knowledge and business logic, which means the feature pipelines are often not easily sharable or even reusable. To make this work more scalable, we developed Zipline — a training feature repository that provides features at different levels of granularity, such as at the host, guest, listing, or market level.

      The crowdsourced nature of this internal tool allows data scientists to use a wide variety of high quality, vetted features that others have prepared for past projects. If a desired feature is not available, a user can create her own feature with a feature configuration file like the following:

2015