Build a movie recommender with Amazon SageMaker's Factorization Machines algorithm
Recommendation is one of the most common applications of machine learning (ML). In this blog post, we'll show you how to build a movie recommendation model based on Factorization Machines, one of the built-in algorithms of Amazon SageMaker, and the popular MovieLens dataset.
About Factorization Machines
Factorization Machines (FM) is a machine learning technique introduced in 2010 (research paper, PDF). FM gets its name from its ability to reduce problem dimensionality thanks to matrix factorization.
FM can be used for classification and regression, and is much more computationally efficient on large sparse datasets than traditional algorithms such as linear regression. This is why FM is widely used for recommendation: while the number of users and the number of items are typically very large, the number of actual recommendations is comparatively small (users do not rate all available items).
Here's a simple example showing how a sparse rating matrix (dimension 4x4) is factored into a dense user matrix (dimension 4x2) and a dense item matrix (dimension 2x4). As you can see, the number of factors (2) is smaller than the number of columns of the rating matrix (4). In addition, multiplying the two factor matrices fills in all blank values in the rating matrix, which we can use to recommend new items to any user.
Source: Data-Artisans.com
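To make this concrete, here's a minimal NumPy sketch of the idea. The factor values below are made up for illustration (not learned from data): multiplying a dense 4x2 user matrix by a dense 2x4 item matrix produces a fully dense 4x4 matrix, so every user/item pair, including the ones that were blank, gets a predicted rating.

```python
import numpy as np

# Made-up factor matrices: 4 users x 2 factors, and 2 factors x 4 items
U = np.array([[1.2, 0.8],
              [0.4, 1.1],
              [1.5, 0.3],
              [0.7, 0.9]])
I = np.array([[1.0, 0.2, 0.9, 0.4],
              [0.3, 1.3, 0.1, 1.0]])

# Their product is a dense 4x4 rating matrix: every user/item pair
# now has a value, including the pairs that were blank originally.
R = U @ I
print(R.shape)  # (4, 4)
print(R)
```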
In this post, we're going to use FM to build a movie recommender. A companion Jupyter notebook can be downloaded from Amazon S3 or GitHub.
The MovieLens dataset
This dataset is a great starting point for building a recommender. It comes in multiple sizes. In this blog post we'll use ML100K: 100,000 ratings from 943 users on 1,682 movies. As you can see, the ML100K rating matrix is quite sparse: it holds only 100,000 ratings out of a possible 1,586,126 (943*1682), meaning that 93.6% of it is empty.
Here are the first 10 lines of the dataset: user 754 gave movie 595 a 2-star rating, and so on.
```
# user id, movie id, rating, timestamp
754    595    2    879452073
932    157    4    891250667
751    100    4    889132252
101    820    3    877136954
606    1277   3    878148493
581    475    4    879641850
13     50     5    882140001
457    59     5    882397575
111    321    3    891680076
123    657    4    879872066
```
Preparing the dataset
As mentioned earlier, FM works best on high-dimensional datasets. Accordingly, we're going to one-hot encode the user IDs and movie IDs (and ignore the timestamps). Each sample in the dataset thus becomes a 2,625-dimension (943+1682) boolean vector with only two values set to 1, corresponding to the user ID and the movie ID.
We're going to build a binary recommender (i.e., like or don't like): 4-star and 5-star ratings are set to 1, lower ratings to 0.
Lastly, the Amazon SageMaker implementation of FM requires the training and test data to be stored as float32 tensors in protobuf format. (This sounds complicated, but the Amazon SageMaker SDK includes a convenient utility function that takes care of it, so don't worry too much.)
High-level view
Here are the steps we'll implement:
- Load the MovieLens training set and test set from disk.
- For each set, build a sparse matrix holding the one-hot encoded samples and a label array holding the ratings.
- Write both sets to protobuf-encoded files and upload them to Amazon S3.
- Configure and run a Factorization Machines training job on Amazon SageMaker.
- Deploy the resulting model to an endpoint and run some predictions.
Let's get to it.
Loading the MovieLens dataset
ML-100K contains multiple text files, but we're only going to use two of them to build our model: a training set and a test set.
Both files have the same tab-separated format shown above.
Consequently, we'll build the following data structures:
- Training sparse matrix: 90,570 lines x 2,625 columns
- Training label array: 90,570 ratings
- Test sparse matrix: 9,430 lines x 2,625 columns
- Test label array: 9,430 ratings
A quick sanity check: each sample must be a single one-hot encoded feature vector. You need to concatenate the one-hot encoded user ID, the movie ID, and any additional features you might add. Building a list of separate vectors (one for user IDs, another for movie IDs, and so on) is not the right way to do it.
The training matrix is sparser still: of its 237,746,250 values (90,570*2,625), only 181,140 (90,570*2) are non-zero. In other words, the matrix is 99.92% sparse. Storing it as a dense matrix would be a massive waste of both storage and compute power.
To avoid this, we'll use a SciPy lil_matrix sparse matrix for the samples and a NumPy array for the labels.
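The companion notebook has the full loading code; here's a sketch of one way to write it, assuming the standard ua.base/ua.test split of ML-100K (the function and variable names below are illustrative):

```python
import csv
import numpy as np
from scipy.sparse import lil_matrix

nbUsers = 943
nbMovies = 1682
nbFeatures = nbUsers + nbMovies   # 2,625 one-hot encoded columns
nbRatingsTrain = 90570
nbRatingsTest = 9430

def loadDataset(filename, lines, columns):
    # Sparse matrix for the one-hot encoded samples, list for the labels
    X = lil_matrix((lines, columns)).astype('float32')
    Y = []
    line = 0
    with open(filename, 'r') as f:
        samples = csv.reader(f, delimiter='\t')
        for userId, movieId, rating, timestamp in samples:
            X[line, int(userId) - 1] = 1               # one-hot user ID
            X[line, nbUsers + int(movieId) - 1] = 1    # one-hot movie ID
            Y.append(1 if int(rating) >= 4 else 0)     # like / don't like
            line += 1
    return X, np.array(Y).astype('float32')

X_train, Y_train = loadDataset('ua.base', nbRatingsTrain, nbFeatures)
X_test, Y_test = loadDataset('ua.test', nbRatingsTest, nbFeatures)
```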
Let's also check that we have roughly the same number of samples per class: imbalanced datasets are a serious problem for classifiers.
```
print(np.count_nonzero(Y_train)/nbRatingsTrain)
0.55
print(np.count_nonzero(Y_test)/nbRatingsTest)
0.58
```
It's a bit unbalanced, but nothing too bad. Let's move on.
Writing to protobuf files
Next, we'll write the training set and the test set to two protobuf files stored in Amazon S3. Fortunately, we can rely on the write_spmatrix_to_sparse_tensor() utility function, which writes our samples and labels to an in-memory protobuf-encoded sparse multi-dimensional array (a.k.a. tensor).
Then we commit the buffer to Amazon S3. Once this step is complete, our data is ready, and we can focus on the training job.
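Here's a sketch of what these two steps could look like (the bucket and prefix reuse the S3 location shown later in this post; use your own; write_spmatrix_to_sparse_tensor() comes from the SageMaker SDK):

```python
import io
import boto3
import sagemaker.amazon.common as smac

bucket = 'jsimon-sagemaker-us'        # replace with your own bucket
prefix = 'sagemaker/fm-movielens'

def writeDatasetToProtobuf(X, Y, bucket, key):
    # Encode samples and labels into an in-memory protobuf sparse tensor
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, labels=Y)
    buf.seek(0)
    # Commit the buffer to Amazon S3
    boto3.resource('s3').Bucket(bucket).Object(key).upload_fileobj(buf)

writeDatasetToProtobuf(X_train, Y_train, bucket, prefix + '/train/train.protobuf')
writeDatasetToProtobuf(X_test, Y_test, bucket, prefix + '/test/test.protobuf')
```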
A troubleshooting hint for training
Note: make sure you've upgraded to the latest Amazon SageMaker SDK:
```
source activate python2
pip install -U sagemaker
```
Here's what the training set looks like in Amazon S3: only 5.5 MB. Sparse matrices for the win!
```
$ aws s3 ls s3://jsimon-sagemaker-us/sagemaker/fm-movielens/train/train.protobuf
2018-01-28 16:50:29    5796480 train.protobuf
```
Running the training job
First, we create an estimator based on the FM container available in our AWS Region. Then, we have to set some FM-specific hyperparameters (the full list is in the documentation):
The other hyperparameters used here are optional and self-explanatory.
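Here's a sketch of the estimator setup, written against the SDK v1 API that was current when this post was published (newer SDK versions rename several of these parameters, e.g. image_uris.retrieve() and instance_type). The instance type and the num_factors, mini_batch_size, and epochs values are illustrative choices, not prescriptions from the post:

```python
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri  # SDK v1 helper

# FM container for the current AWS Region
container = get_image_uri(boto3.Session().region_name, 'factorization-machines')

fm = sagemaker.estimator.Estimator(
    container,
    get_execution_role(),
    train_instance_count=1,
    train_instance_type='ml.c4.xlarge',
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    sagemaker_session=sagemaker.Session())

fm.set_hyperparameters(
    feature_dim=nbFeatures,              # 2,625 features per sample
    predictor_type='binary_classifier',  # like / don't like
    num_factors=64,
    mini_batch_size=1000,
    epochs=50)                           # the log below reports 50 epochs
```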
Finally, let's run the training job. All we have to do is call the fit() API and pass it the locations of both the training and test sets hosted in Amazon S3. Simple and elegant.
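Something like this, pointing at the two protobuf files written earlier:

```python
fm.fit({
    'train': 's3://{}/{}/train/train.protobuf'.format(bucket, prefix),
    'test':  's3://{}/{}/test/test.protobuf'.format(bucket, prefix)
})
```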
Training completes in a few minutes. We can check the training log either in the Jupyter notebook or in Amazon CloudWatch Logs (in the /aws/sagemaker/TrainingJobs log group).
After 50 epochs, test accuracy is 71.5% and the F1 score (a typical metric for binary classifiers) is 0.75 (1 would indicate a perfect classifier). Not great, but with all the excitement about sparse matrices and protobuf, I didn't spend much time tuning hyperparameters. You can certainly do better than this example.
```
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : binary_classification_accuracy
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : 0.7159
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : binary_classification_cross_entropy
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : 0.581087609863
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : binary_f_1.000
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : 0.74558968389
```
The last topic to cover is model deployment.
Model deployment
Deploying the model requires just one simple API call. Not so long ago (well, about six months ago), this would have taken quite a bit of work. Now, all we have to do is call deploy().
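A minimal sketch (the instance type is an illustrative choice):

```python
fm_predictor = fm.deploy(
    initial_instance_count=1,
    instance_type='ml.c4.xlarge')
```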
The model is now available through an HTTP endpoint that we can invoke with the predict() API. Both request and response data are in JSON format, so we need to provide a simple serializer that converts our sparse matrix samples to JSON.
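Here's one way to wire this up with the SDK v1 predictor API (fm_serializer is an illustrative helper name; the JSON structure matches the FM inference format documented by SageMaker):

```python
import json
from sagemaker.predictor import json_deserializer  # SDK v1 import path

def fm_serializer(data):
    # Wrap each sample in the JSON structure the FM endpoint expects
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    return json.dumps(js)

fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer
fm_predictor.deserializer = json_deserializer

# Predict the first 10 test samples (densified for JSON serialization)
result = fm_predictor.predict(X_test[:10].toarray())
print(result)
```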
We can now classify any movie for any user. Simply build a new dataset, process it the same way as the training and test sets, and use predict() to get the results. You should also experiment with different prediction thresholds (predict 1 when the score is above a certain value, and 0 below it) and find the value that yields the most effective recommendations. The MovieLens dataset also includes movie titles, so there's plenty more to explore.
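For example, here's a small sketch of thresholding the scores returned by the endpoint. It assumes the result variable from the previous snippet and the documented FM binary classifier response format (a score and a predicted_label per sample):

```python
threshold = 0.5   # experiment with this value
custom_labels = [1 if p['score'] > threshold else 0
                 for p in result['predictions']]
print(custom_labels)
```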
Summary
Built-in algorithms are a great way to get the job done without having to write any training code. There's quite a bit of data preparation involved, but as you saw in this blog post, it's the key to running very large training jobs quickly and scalably.
If you're interested in the Amazon SageMaker built-in algorithms, here are some previous related articles:
And if you'd like to learn more about recommendation systems, here are some interesting resources:
Thanks for reading. I'm happy to answer questions on Twitter.
My AWS colleagues gave me great advice and debugging tips. Many thanks to Sireesha Muppala, Yuri Astashanok, David Arpin, and Guy Ernest.
Julien is the EMEA Artificial Intelligence and Machine Learning Evangelist. He works with developers and enterprises to help bring their ideas to life. In his spare time, he reads the works of J.R.R. Tolkien again and again.