by Jayani Withanawasam November 3rd, 2016

Movie Recommender using Talend Machine Learning

Why Talend?

Talend provides specialised support for big data integration. The notable advantage of using Talend for machine learning over other machine learning libraries is that no coding effort is required to implement a machine learning solution: the solution is designed with drag-and-drop components, and native code is generated automatically.

Accordingly, Talend makes implementation convenient, especially for developers who are not familiar with machine learning and big data analytics. It integrates easily with big data platforms such as Cloudera, Hortonworks, and AWS. Because big data operations are usually computationally expensive, Talend lets computation scale out over Hadoop and Spark (batch or in-memory processing).

Talend Machine Learning Features

Talend machine learning features are powered by Spark MLlib. The following machine learning features are provided in Talend Open Studio for Big Data:

  • Classification
  • Clustering
  • Recommendation
  • Regression

Talend machine learning features are provided as enterprise features. However, a free trial version is available as “Talend Big Data Sandbox with Hadoop Docker Environment”.

Content Recommendations

Recommendation is an application area of machine learning that provides the capability to recommend items (e.g., movies, books, friends) by analysing patterns in users' behaviours or actions on items (e.g., likes, ratings, purchases, views). Recommenders use either content-based filtering or collaborative filtering. Content-based filtering matches the attributes of a user (e.g., age, gender) against the attributes of items (e.g., movie genre). Collaborative filtering instead finds patterns in the interactions between users and items. Talend's machine learning features provide collaborative filtering using the Alternating Least Squares (ALS) algorithm.
Fig 1. Talend machine learning components

Collaborative Filtering using ALS

ALS is a matrix-factorisation-based recommendation algorithm. The idea is to fill in the missing entries of the user-item association matrix in order to predict recommendations. Recommendations are based on latent factors identified by the algorithm; for movie recommendations, latent factors can correspond to similar genres, concepts, or actors, as well as patterns that we as humans are not even aware of.
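
In matrix-factorisation terms (the standard formulation rather than anything Talend-specific), ALS approximates the sparse user-item rating matrix by the product of two low-rank factor matrices, alternately solving a least-squares problem for one factor while holding the other fixed:

    \min_{X,Y} \sum_{(u,i)\ \mathrm{observed}} \left( r_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

Here x_u and y_i are the latent factor vectors of user u and movie i, and lambda is the regularisation factor mentioned later in this post; a missing rating is then predicted as the dot product of x_u and y_i.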

MovieLens Dataset

The MovieLens dataset, created by GroupLens Research, contains movie preferences of different users in “user ID - movie ID - rating” format. We used the MovieLens 100k dataset to implement an example movie recommender using Talend machine learning features.
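
For reference, each record in the MovieLens 100k ratings file (u.data) is a tab-separated line; the rows below are illustrative of the format rather than quoted from the file. The timestamp column is not needed here, so the input CSV for the Talend job keeps only the first three fields:

    userid   movieid   rating   timestamp
    196      242       3        881250949
    186      302       3        891717742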

How to Implement a Movie Recommender using Talend Machine Learning?

The purpose of this demo is to create a movie recommender using Talend. As a first step, we train a model on a data set containing users, movies, and ratings. Then, based on the trained model, the five most suitable movies are predicted for each user.

Pre-requisites
  • Hadoop installation - Cloudera or Hortonworks

Standard Job

In Talend Big Data Studio, standard jobs can be used for data ingestion from different data sources into a Hadoop cluster. We created a standard job to upload the required data files to HDFS (the Hadoop Distributed File System).
Fig 2. File Upload Job
This job contains a single component, tHDFSPut, which can be found under File > Hadoop > tHDFSPut in the palette of Talend Studio. tHDFSPut connects to HDFS and loads large-scale files into it with optimised performance. This is where we put the entire training data set into HDFS.
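
For intuition, what tHDFSPut does corresponds roughly to a put through the Hadoop FileSystem API. The following is a minimal sketch, not the code Talend actually generates; the local and HDFS paths are placeholder values:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutRatings {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // NameNode URI taken from the job's HDFS connection settings.
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://talend-hdp240.weave.local:8020"), conf);
            // Copy the local training data into HDFS for the batch jobs to read.
            fs.copyFromLocalFile(
                    new Path("/tmp/ratings.csv"),                    // placeholder local path
                    new Path("/user/talend/movielens/ratings.csv")); // placeholder HDFS path
            fs.close();
        }
    }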

Big Data Batch Job

A big data batch job uses the Spark engine to execute jobs over big data. We created a big data batch job to build the ALS model and the recommender.

Training

The training job trains a model using the available data. As mentioned before, we use HDFS as our data platform, from which the training data is read. For this job, the following components should be included:

  • Storage > tHDFSConfiguration
  • File > Input > tFileInputDelimited
  • Custom Code > tJavaRow
  • Machine Learning > Recommendation > tALSModel
Fig 3. ALS Model Creation

tHDFSConfiguration provides the HDFS connection information, and this connection can be reused within the same job. The basic settings of this component are found in the Basic settings tab of the Component view.

  • Property type : Repository (where data is stored centrally)
  • Distribution : we used Hortonworks (used to create, distribute, and support enterprise-ready open data platforms)
  • Version : the appropriate version of the selected distribution
  • NameNode URI : the master node of the Hadoop system (e.g., hdfs://masternode:portno). In this example it is hdfs://talend-hdp240.weave.local:8020
  • Username : the default username is “talend”; leaving it empty means the hosting machine's username is used

Once the connection is established, we need a component to read data from the repository. tFileInputDelimited reads a file row by row, with simple comma-separated fields. We defined the HDFS repository as the data platform: by ticking the “Define a storage configuration component” checkbox, we selected tHDFSConfiguration_1 to provide the configuration information, which gives access to the centralised file system.

Once the connection works, we can retrieve the input file from the repository. In this case, we select a CSV file with three attributes, userid, movieid, and rating, as input. The fetched data is passed to tJavaRow, which allows you to enter custom Java code; here it is used to convert the data types of the received attributes as required.
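
For example, the tJavaRow body can be as small as the following; input_row and output_row are the variables Talend exposes in this component, and we assume the input schema declares the three columns as strings while the output schema declares numeric types:

    // Convert the string fields read from the delimited file into
    // the numeric types the downstream tALSModel component expects.
    output_row.userid = Integer.parseInt(input_row.userid);
    output_row.movieid = Integer.parseInt(input_row.movieid);
    output_row.rating = Double.parseDouble(input_row.rating);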

Finally, the preprocessed data is passed to a component called tALSModel, which builds the user-movie rating matrix from the user-movie interaction data and factorises it. The training percentage should be expressed as a decimal fraction, and an appropriate value should be set for the number of latent factors, the hidden features by which each user and movie is characterised.
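
Because Talend's recommendation components are powered by Spark MLlib, the training step corresponds roughly to the MLlib call sketched below. This is an illustrative sketch rather than Talend's generated code; the rank, iteration count, regularisation value, and HDFS paths are example settings:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.recommendation.ALS;
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
    import org.apache.spark.mllib.recommendation.Rating;

    public class TrainAlsModel {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("TrainAlsModel"));
            // Parse "userid,movieid,rating" lines into MLlib Rating objects.
            JavaRDD<Rating> ratings = sc
                    .textFile("hdfs://talend-hdp240.weave.local:8020/user/talend/movielens/ratings.csv")
                    .map(line -> {
                        String[] f = line.split(",");
                        return new Rating(Integer.parseInt(f[0]),
                                          Integer.parseInt(f[1]),
                                          Double.parseDouble(f[2]));
                    });
            int rank = 10;        // number of latent factors per user/movie
            int iterations = 10;  // ALS alternation rounds
            double lambda = 0.01; // regularisation factor, guards against overfitting
            MatrixFactorizationModel model =
                    ALS.train(ratings.rdd(), rank, iterations, lambda);
            // Persist the factor matrices to HDFS for the recommendation job.
            model.save(sc.sc(),
                    "hdfs://talend-hdp240.weave.local:8020/user/talend/movielens/alsModel");
            sc.stop();
        }
    }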

The number of iterations and the size of the training data set may cause stack overflow issues. In that case, we have to increase the stack size by adding an -Xss argument to the JVM settings table in the Advanced settings tab of the Run view. We also have to set an appropriate regularisation factor to avoid overfitting.
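
For example, an entry like the following in the JVM settings table raises the per-thread stack size; the exact value here is only an assumption and should be tuned to the iteration count and data volume:

    -Xss4m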

Each component should be connected with Main rows. A trained model is generated as the result of this job execution and stored at a path in HDFS.

Recommendations

After creating the model, we need another job to predict movies. This job takes a known user as input and returns the five most relevant movies for that user.

The following components should be included:
  • Storage > tHDFSConfiguration
  • File > Input > tFileInputDelimited
  • Custom Code > tJavaRow
  • Machine Learning > Recommendation > tRecommend
Fig 4. Recommender

The recommender job recommends movies based on the user-product matrix created by the training job. tFileInputDelimited_2 reads a user ID as input and passes it to tJavaRow_2, which converts the received string into an integer. Another machine learning component, tRecommend, can be found under the Machine Learning category in the palette of Talend Big Data Studio; it receives input from the preceding component and predicts results based on the trained model. We have to give the path of the trained model in the basic settings of tRecommend_1; our previous job generated the model in HDFS. Since we are using the repository as the data platform, tRecommend also needs a valid HDFS configuration.
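
Conceptually, the prediction step amounts to loading the saved MLlib model and asking it for the top five products for the user. Again, this is an illustrative sketch with placeholder paths and an example user ID, not Talend's generated code:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
    import org.apache.spark.mllib.recommendation.Rating;

    public class RecommendMovies {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("RecommendMovies"));
            // Load the factorised model saved by the training job.
            MatrixFactorizationModel model = MatrixFactorizationModel.load(sc.sc(),
                    "hdfs://talend-hdp240.weave.local:8020/user/talend/movielens/alsModel");
            int userId = 1; // example user; the Talend job reads this via tFileInputDelimited_2
            // Top five movies for the user, ordered by predicted rating.
            for (Rating r : model.recommendProducts(userId, 5)) {
                System.out.println(r.user() + " | " + r.product() + " | " + r.rating());
            }
            sc.stop();
        }
    }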

Finally, the tLogRow component is used to log the output from tRecommend_1. The recommended results are shown below; each output row has the format “userid | movieid | rating”, and the output is ordered by predicted rating.

Fig 5. Recommended movies for the given user with their predicted ratings

Operationalising the Model

Data is one of the most valuable assets of any organisation, and the goal is to realise the insights it has to offer, effectively and efficiently, for better business decision making.

The accuracy of a machine learning model depends heavily on the available training data. Typically, new data is generated at high velocity and in large volumes. The problem with many current machine learning solutions is that deploying a model to production takes a long time, so the model risks becoming outdated and the accuracy of its results degrading.

We will document how to effectively operationalise Talend-built models in a future blog post.

As you can see, Talend provides a rich set of features that let its end users make optimal use of their available data. We at Zaizi are keen to find ways to offer a better customer experience by exploiting the features Talend provides.

Movie Recommender using Talend Demo

About the author: Jayani Withanawasam
Jayani has more than 6 years of industry experience and has worked in areas such as machine learning, natural language processing, and the semantic web. She is passionate about working with semantic technologies and big data. Follow her on Twitter @jaywith_7