This document is a spike to find out things about Amazon Machine Learning. It assumes that the reader has some basic knowledge of:
- Machine Learning
- What Amazon Machine Learning does
A typical machine learning workflow:
- Get enough samples from different data sources
- Normalise the data and turn it into a proper dataset
- (For supervised learning) Specify an algorithm, e.g. Linear Regression
- Split the dataset into two parts, e.g. 80% vs 20%, one for training and the other for validation
- Train the model on the training set and validate it against the validation set
- Use the model to predict results with new input parameters
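The 80/20 split step above can be sketched in plain Python (a minimal sketch; the ratio and seed are arbitrary examples):

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle and split samples into training and validation sets."""
    rng = random.Random(seed)
    shuffled = samples[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Example: 10 samples -> 8 for training, 2 for validation
train, validation = split_dataset(list(range(10)))
```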
In most cases, the most time-consuming part is collecting and normalising the data.
- For models that predict a category, we measure the quality by the percentage of correct predictions
- For models that predict a number, we measure the quality by `RMSE` (root-mean-square error)
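`RMSE` is straightforward to compute by hand (a quick sketch with made-up price values):

```python
import math

def rmse(predictions, targets):
    """Root-mean-square error: sqrt of the mean squared difference."""
    assert len(predictions) == len(targets)
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

# e.g. predicted prices vs actual sale prices
print(rmse([22000, 18500], [21000, 19500]))  # 1000.0
```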
- Collecting and normalising the dataset is done outside `AML`
- Upload your dataset to either `S3`, `Redshift` or `RDS`
- Create a `datasource` in `AML` from one of the above sources
- Create and train a `model` in `AML` with the `datasource` you created
- Use the console to do some trial predictions
- Potentially bind to an HTTP endpoint if you are happy with the model
- There are many ways to automate the data collection process, such as crawling a website
- The creation of the `datasource` and `model` can be done either in the AWS console, the `aws` CLI, or SDKs like `boto3`
- Although you can automate the above steps, you need to specify the JSON schema of your `datasource`
  - It's probably not worth the effort for a one-off training of a model
  - It will be worth it if you want to continuously train the model with ongoing datasets
- You can automate the creation of the endpoint as well
- For more details on automation, see the awscli doc and the boto3 doc
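The automated workflow can be sketched with `boto3`'s `machinelearning` client (a sketch only: the IDs, names and S3 URL are hypothetical placeholders, and the schema JSON string is whatever you defined for your `datasource`):

```python
def train_car_price_model(schema_json, s3_url,
                          datasource_id="ds-car-ads",
                          model_id="ml-car-price"):
    """Create a datasource from S3, train a regression model on it,
    and bind the model to a realtime endpoint. All IDs, names and
    the S3 URL are hypothetical placeholders."""
    import boto3  # imported here so the sketch reads without AWS set up

    aml = boto3.client("machinelearning")

    aml.create_data_source_from_s3(
        DataSourceId=datasource_id,
        DataSourceName="car ads",
        DataSpec={"DataLocationS3": s3_url, "DataSchema": schema_json},
        ComputeStatistics=True,  # required for datasources used in training
    )
    aml.create_ml_model(
        MLModelId=model_id,
        MLModelName="car price model",
        MLModelType="REGRESSION",  # numeric target, i.e. the price
        TrainingDataSourceId=datasource_id,
    )
    # Bind the trained model to an HTTP endpoint for realtime predictions
    return aml.create_realtime_endpoint(MLModelId=model_id)
```

The same three calls exist in the `aws machinelearning` CLI if you prefer shell scripting over an SDK.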
- Managed service, no hassles of GPU EC2 instances etc.
- Capability of training simple models without having to write a single line of code
- Fast training speed - finished learning 120,000 samples in 2 minutes
- Trivial to bind a model to an endpoint
- Batch prediction capability
- In most cases cheaper than training on GPU EC2 instances with tools like TensorFlow
- No option to export the trained model, i.e. the model can only be used within AWS
- Very limited training result stats (only `RMSE`)
- Options to manipulate training behaviour are quite limited
- (Same as `API Gateway`) Potential bill shock as a result of a `DDoS` attack after binding to an endpoint
- You can easily get started and train a model in minutes by following the guide
- To be very good at machine learning you need to be equipped with adequate mathematics and statistics knowledge
- Relevant knowledge covers a wide range, including but not limited to `L0`/`L1`/`L2` normalisation, `RMSE`, etc.
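As a quick refresher, the `L0`/`L1`/`L2` norms that underlie those normalisation terms can be computed directly (`L0` counts non-zero components; the vector below is an arbitrary example):

```python
import math

def l0(v):
    """L0 'norm': number of non-zero components."""
    return sum(1 for x in v if x != 0)

def l1(v):
    """L1 norm: sum of absolute values."""
    return sum(abs(x) for x in v)

def l2(v):
    """L2 norm: Euclidean length."""
    return math.sqrt(sum(x * x for x in v))

v = [3, -4, 0]
print(l0(v), l1(v), l2(v))  # 2 7 5.0
```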
I've scraped 120,000 ads from carsales on 2017-07-26. The following parameters have been chosen as `datasource` features:
| Feature | Type | Input / Output | Example |
|---|---|---|---|
| make | Categorical | Input | Toyota |
| model | Categorical | Input | Camry |
| badge | Categorical | Input | Altise |
| year | Categorical | Input | 2014 |
| kilometers | Numeric | Input | 20000 |
| transmission | Categorical | Input | Manual |
| state | Categorical | Input | NSW |
| price | Numeric | Output | 22000 |
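The feature table above translates into an `AML` datasource schema along these lines (a sketch in the schema format `AML` expects for CSV data; `price` is marked as the target attribute to predict):

```python
import json

# Datasource schema matching the feature table; `price` is the target.
schema = {
    "version": "1.0",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "targetAttributeName": "price",
    "attributes": [
        {"attributeName": "make",         "attributeType": "CATEGORICAL"},
        {"attributeName": "model",        "attributeType": "CATEGORICAL"},
        {"attributeName": "badge",        "attributeType": "CATEGORICAL"},
        {"attributeName": "year",         "attributeType": "CATEGORICAL"},
        {"attributeName": "kilometers",   "attributeType": "NUMERIC"},
        {"attributeName": "transmission", "attributeType": "CATEGORICAL"},
        {"attributeName": "state",        "attributeType": "CATEGORICAL"},
        {"attributeName": "price",        "attributeType": "NUMERIC"},
    ],
}

# The JSON string is what you pass as the datasource's DataSchema
schema_json = json.dumps(schema)
```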
The model is trained to predict the car price from the above input parameters. I've trained multiple models with either the complete set or a subset of the 120,000 samples.
| Datasource Scope | # Samples | RMSE (smaller is better) |
|---|---|---|
| Full Dataset (all brands) | 113,195 | $12,595.750 |
| Toyota | 15,211 | $5,835.131 |
| Toyota-Camry | 1,897 | $1,872.218 |
| Toyota-Camry-Altise | 873 | $1,884.205 |
- The results are quite good: all models were able to predict the sale prices with reasonable accuracy.
- The accuracy is MUCH better when you remove more variables from the input parameters list.
- All models could figure out that the price difference between `Automatic` and `Manual` is a few thousand dollars.
- All models could figure out that `NSW` has slightly higher prices compared to `QLD`.
- When `year` is treated as `Categorical`, the result is better than treating it as `Numeric` or converting it to `age`.
- It is hard to tell the exact accuracy because advertisers' strategies and car conditions vary from case to case.
- Supervised training would require your domain knowledge (in this case assuming a linear distribution, knowing the impact of having multiple makes / models may affect your prediction accuracy etc.)
- Having a larger sample size does NOT necessarily result in a more accurate model
- Removing input parameters (by having more constraints) might increase the model accuracy by quite a bit
- It is a good idea not to be too greedy and think about removing unnecessary variables before you start training
- The most time-consuming step would be collecting and normalising datasets in most cases
- `AML` is generally very fast, at least at the scale of up to 120,000 samples
- `AML` does supervised training, i.e. you need to specify a model algorithm in advance
- `AML` is probably suitable for proof of concept or finding out the level of relevance of a particular feature
- We need to use more sophisticated tools that require coding if we need to build a complex and fine-tuned model