AWS SageMaker Model Training Service


An AWS SageMaker Model Training Service is a model training service within AWS SageMaker (a fully managed end-to-end machine learning service).



References

2018a

2018b

  • https://userX-180207.notebook.us-east-2.sagemaker.aws/notebooks/sample-notebooks/advanced_functionality/data_distribution_types/data_distribution_types.ipynb#Train
    • QUOTE: Now that we have our data in S3, we can begin training. We'll use Amazon SageMaker's linear regression algorithm, and will actually fit two models in order to properly compare data distribution types:
      • In the first job, we'll use FullyReplicated for our train channel. This will pass every file in our input S3 location to every machine (in this case we're using 5 machines).
      • While in the second job, we'll use ShardedByS3Key for the train channel (note that we'll keep FullyReplicated for the validation channel). So, for the training data, we'll pass each S3 object to a separate machine. Since we have 5 files (one for each year), we'll train on 5 machines, meaning each machine will get a year's worth of records.
      • First, let's set up a list of training parameters that are common across the two jobs.
common_training_params = {
    ...
    "ResourceConfig": {
        "InstanceCount": 5,
        "InstanceType": "ml.c4.2xlarge",
        "VolumeSizeInGB": 10
    },
    ...
}
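
    • The notebook truncates the parameter dictionary above. As a rough, illustrative sketch (not the notebook's own code) of how the two jobs could be launched, the following uses boto3's create_training_job, with only the train channel's S3DataDistributionType differing between jobs; the bucket, role ARN, and algorithm image URI are placeholders, and the validation channel is omitted for brevity.
import boto3

sm = boto3.client('sagemaker')

def train_channel(distribution_type):
    # "FullyReplicated" copies every S3 object to every instance;
    # "ShardedByS3Key" assigns each S3 object to exactly one instance.
    return {
        "ChannelName": "train",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://example-bucket/train/",  # placeholder
                "S3DataDistributionType": distribution_type,
            }
        },
        "ContentType": "text/csv",
    }

for job_name, dist in [("example-replicated", "FullyReplicated"),
                       ("example-sharded", "ShardedByS3Key")]:
    sm.create_training_job(
        TrainingJobName=job_name,
        AlgorithmSpecification={
            "TrainingImage": "<algorithm-image-uri>",  # placeholder
            "TrainingInputMode": "File",
        },
        RoleArn="arn:aws:iam::123456789012:role/ExampleRole",  # placeholder
        InputDataConfig=[train_channel(dist)],
        OutputDataConfig={"S3OutputPath": "s3://example-bucket/output/"},
        ResourceConfig={"InstanceCount": 5,
                        "InstanceType": "ml.c4.2xlarge",
                        "VolumeSizeInGB": 10},
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )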

2018c

  • https://aws.amazon.com/blogs/aws/sagemaker/
    • QUOTE: I’m going to leave out the actual model training code here for brevity, but in general for any kind of Amazon SageMaker common framework training you can implement a simple training interface that looks something like this:
def train(
    channel_input_dirs, hyperparameters, output_data_dir,
    model_dir, num_gpus, hosts, current_host):
    # Receives local paths to each input channel's data, the job's
    # hyperparameters, and the cluster topology (hosts / current_host).
    pass

def save(model):
    # Persists the object returned by train() so SageMaker can upload it to S3.
    pass
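    • As a minimal sketch of one way this interface could be filled in (assuming an MXNet Gluon container; the synthetic data and one-layer network below are illustrative, not from the original post):
import mxnet as mx
from mxnet import autograd, gluon, nd

def train(channel_input_dirs, hyperparameters, output_data_dir,
          model_dir, num_gpus, hosts, current_host):
    ctx = mx.gpu(0) if num_gpus > 0 else mx.cpu()
    lr = hyperparameters.get('learning_rate', 0.1)
    epochs = hyperparameters.get('epochs', 2)

    # Synthetic stand-in for data a real job would read from
    # channel_input_dirs['train'].
    X = nd.random.normal(shape=(256, 10), ctx=ctx)
    y = (X.sum(axis=1) > 0).astype('float32')

    net = gluon.nn.Dense(1)
    net.initialize(ctx=ctx)
    loss_fn = gluon.loss.SigmoidBinaryCrossEntropyLoss()
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': lr})

    for _ in range(epochs):
        with autograd.record():
            loss = loss_fn(net(X), y)
        loss.backward()
        trainer.step(X.shape[0])
    return net

def save(model):
    # /opt/ml/model is the standard directory SageMaker uploads to S3.
    model.save_parameters('/opt/ml/model/model.params')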
    • I want to create a distributed training job on 4 ml.p2.xlarge instances in my Amazon SageMaker infrastructure. I’ve already downloaded all of the data I need locally.
import sagemaker
from sagemaker.mxnet import MXNet

role = sagemaker.get_execution_role()  # IAM role the training job will assume
m = MXNet("cifar10.py", role=role,
          train_instance_count=4, train_instance_type="ml.p2.xlarge",
          hyperparameters={'batch_size': 128, 'epochs': 50,
                           'learning_rate': 0.1, 'momentum': 0.9})
    • Now that we’ve constructed our model training job we can feed it data by calling: m.fit("s3://randall-likes-sagemaker/data/gluon-cifar10").
    • If I navigate to the jobs console I can see that my job is running!
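    • The same check can be scripted rather than done in the console; a small sketch using boto3's describe_training_job (the job name below is hypothetical; in this example m.latest_training_job.name would hold the actual name):
import boto3

sm = boto3.client('sagemaker')
desc = sm.describe_training_job(TrainingJobName='mxnet-cifar10-example')
print(desc['TrainingJobStatus'])  # e.g. 'InProgress' or 'Completed'
print(desc['SecondaryStatus'])    # finer-grained phase, e.g. 'Training'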