This document is a spike to find out things about Amazon Machine Learning. It assumes that the reader has some basic knowledge of:
- Machine Learning
- What Amazon Machine Learning does
A typical machine learning workflow:
- Get enough samples from different data sources
- Normalise the data and turn it into a proper dataset
- (For supervised learning) Specify an algorithm, e.g. Linear Regression
- Split the dataset into two parts, e.g. 80% vs 20%, one for training and the other for validation
- Train the model on the training set and validate it against the validation set
- Use the model to predict results with new input parameters
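The 80/20 split step above can be sketched in plain Python (a minimal sketch; the ratio and seed are arbitrary examples):

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Shuffle and split samples into training and validation sets."""
    rng = random.Random(seed)
    shuffled = samples[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Example: 10 samples -> 8 for training, 2 for validation
train, validation = split_dataset(list(range(10)))
```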
In most cases, the most time-consuming part is collecting and normalising the data.
- For models that predict a category, we measure the quality by the percentage of correct predictions
- For models that predict a number, we measure the quality by `RMSE` (root-mean-square error)
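`RMSE` is straightforward to compute by hand (a quick sketch with made-up price values):

```python
import math

def rmse(predictions, targets):
    """Root-mean-square error: sqrt of the mean squared difference."""
    assert len(predictions) == len(targets)
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

# e.g. predicted prices vs actual sale prices
print(rmse([22000, 18500], [21000, 19500]))  # 1000.0
```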
- Collecting and normalising the dataset is done outside `AML`
- Upload your dataset to either `S3`, `Redshift` or `RDS`
- Create a `datasource` in `AML` from one of the above sources
- Create and train a `model` in `AML` with the `datasource` you created
- Use the console to do some trial predictions
- Potentially bind to an HTTP endpoint if you are happy with the model
- There are many ways to automate the data collection process, such as crawling a website
- The creation of the `datasource` and `model` can be done either in the AWS console, the `aws` CLI, or SDKs like `boto3`
- Although you can automate the above steps, you need to specify the JSON schema of your `datasource`
  - It's probably not worth the effort for a one-off training of a model
  - It will be worth it if you want to continuously train the model with ongoing datasets
- You can automate the creation of the endpoint as well
- For more details on automation, see the awscli doc and the boto3 doc
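The automated workflow can be sketched with `boto3`'s `machinelearning` client (a sketch only: the IDs, names and S3 URL are hypothetical placeholders, and the schema JSON string is whatever you defined for your `datasource`):

```python
def train_car_price_model(schema_json, s3_url,
                          datasource_id="ds-car-ads",
                          model_id="ml-car-price"):
    """Create a datasource from S3, train a regression model on it,
    and bind the model to a realtime endpoint. All IDs, names and
    the S3 URL are hypothetical placeholders."""
    import boto3  # imported here so the sketch reads without AWS set up

    aml = boto3.client("machinelearning")

    aml.create_data_source_from_s3(
        DataSourceId=datasource_id,
        DataSourceName="car ads",
        DataSpec={"DataLocationS3": s3_url, "DataSchema": schema_json},
        ComputeStatistics=True,  # required for datasources used in training
    )
    aml.create_ml_model(
        MLModelId=model_id,
        MLModelName="car price model",
        MLModelType="REGRESSION",  # numeric target, i.e. the price
        TrainingDataSourceId=datasource_id,
    )
    # Bind the trained model to an HTTP endpoint for realtime predictions
    return aml.create_realtime_endpoint(MLModelId=model_id)
```

The same three calls exist in the `aws machinelearning` CLI if you prefer shell scripting over an SDK.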
- Managed service, no hassles of GPU EC2 instances etc.
- Capability of training simple models without having to write a single line of code
- Fast training speed - finished learning 120,000 samples in 2 minutes
- Trivial to bind a model to an endpoint
- Batch prediction capability
- In most cases cheaper than training on GPU EC2 instances with tools like TensorFlow
- No option to export the trained model, i.e. the model can only be used within AWS
- Very limited training result stats (only `RMSE`)
- Options to manipulate training behaviour are quite limited
- (Same as `API Gateway`) Potential bill shock as a result of a `DDoS` attack after binding to an endpoint
- You can easily get started and train a model in minutes by following the guide
- To be very good at machine learning you need to be equipped with adequate mathematics and statistics knowledge
- Relevant knowledge covers a wide range, including but not limited to `L0`/`L1`/`L2` normalisation, `RMSE`, etc.
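As a quick refresher, the `L0`/`L1`/`L2` norms that underlie those normalisation terms can be computed directly (`L0` counts non-zero components; the vector below is an arbitrary example):

```python
import math

def l0(v):
    """L0 'norm': number of non-zero components."""
    return sum(1 for x in v if x != 0)

def l1(v):
    """L1 norm: sum of absolute values."""
    return sum(abs(x) for x in v)

def l2(v):
    """L2 norm: Euclidean length."""
    return math.sqrt(sum(x * x for x in v))

v = [3, -4, 0]
print(l0(v), l1(v), l2(v))  # 2 7 5.0
```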
I've scraped 120,000 ads from carsales on 2017-07-26. The following parameters have been chosen as `datasource` features:
| Feature | Type | Input / Output | Example |
|---|---|---|---|
| make | Categorical | Input | Toyota |
| model | Categorical | Input | Camry |
| badge | Categorical | Input | Altise |
| year | Categorical | Input | 2014 |
| kilometers | Numeric | Input | 20000 |
| transmission | Categorical | Input | Manual |
| state | Categorical | Input | NSW |
| price | Numeric | Output | 22000 |
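The feature table above translates into an `AML` datasource schema along these lines (a sketch in the schema format `AML` expects for CSV data; `price` is marked as the target attribute to predict):

```python
import json

# Datasource schema matching the feature table; `price` is the target.
schema = {
    "version": "1.0",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "targetAttributeName": "price",
    "attributes": [
        {"attributeName": "make",         "attributeType": "CATEGORICAL"},
        {"attributeName": "model",        "attributeType": "CATEGORICAL"},
        {"attributeName": "badge",        "attributeType": "CATEGORICAL"},
        {"attributeName": "year",         "attributeType": "CATEGORICAL"},
        {"attributeName": "kilometers",   "attributeType": "NUMERIC"},
        {"attributeName": "transmission", "attributeType": "CATEGORICAL"},
        {"attributeName": "state",        "attributeType": "CATEGORICAL"},
        {"attributeName": "price",        "attributeType": "NUMERIC"},
    ],
}

# The JSON string is what you pass as the datasource's DataSchema
schema_json = json.dumps(schema)
```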
The model is trained to predict the car price from the above input parameters. I've trained multiple models with either the complete set or a subset of the 120,000 samples.
| Datasource Scope | # Samples | RMSE (smaller is better) |
|---|---|---|
| Full Dataset (all brands) | 113,195 | $12,595.750 |
| Toyota | 15,211 | $5,835.131 |
| Toyota-Camry | 1,897 | $1,872.218 |
| Toyota-Camry-Altise | 873 | $1,884.205 |
- The results are quite good: all models were able to predict the sale prices with reasonable accuracy.
- The accuracy is MUCH better when you remove more variables from the input parameters list.
- All models could figure out that the price difference between `Automatic` and `Manual` is a few thousand dollars.
- All models could figure out that `NSW` has slightly higher prices compared to `QLD`.
- When `year` is treated as `Categorical`, the result is better than treating it as `Numeric` or converting it to `age`.
- It is hard to tell the exact accuracy because advertisers' strategies and car conditions vary from case to case.
- Supervised training would require your domain knowledge (in this case assuming a linear distribution, knowing the impact of having multiple makes / models may affect your prediction accuracy etc.)
- Having a larger sample size does NOT necessarily result in a more accurate model
- Removing input parameters (by having more constraints) might increase the model accuracy by quite a bit
- It is a good idea not to be too greedy and think about removing unnecessary variables before you start training
- The most time-consuming step would be collecting and normalising datasets in most cases
- `AML` is generally very fast, at least at the scale of up to 120,000 samples
- `AML` does supervised training, i.e. you need to specify a model algorithm in advance
- `AML` is probably suitable for proof of concept or finding out the level of relevance of a particular feature
- We need to use more sophisticated tools that require coding if we need to build a complex and fine-tuned model