What is the best way to implement an SVM using Hadoop?

Suchi Saria, Stanford machine learning, CS PhD
Written Nov 17, 2010 · Upvoted by Vladimir Novakovski (started Quora machine learning team, 2012-2014) and Tao Xu (built ML systems at Airbnb, Quora, Facebook and Microsoft)
Thought it might be worthwhile to put all the proposed solutions in context to see when one might use one over the other.

The atbrox blog post (http://atbrox.com/2010/02/08/par...) that Soren points to gives details on a map-reduce implementation, but the version of SVM it implements can be swapped out based on the desired properties of the solution:

1. If the original problem is easy enough that the data required to train the model fits on a single machine, then train on a single machine (with, say, LibSVM) but classify using multiple machines (as Peter Skomoroch suggests). To check whether your problem is "easy", look at the performance on your test data as you add training data. If test performance plateaus before you reach the memory limit of a single machine, you're good to use this version.

2. Compared to LibSVM, which solves the dual problem, Chu et al., 2006 (http://www.cs.stanford.edu/peopl...) solve the primal SVM formulation directly using a second-order method (Newton's method). This gives an exact solution that can be implemented within map-reduce and carries the usual guarantees as long as your Newton iteration converges. For derivation details, see the original paper by Chapelle, 2006 (http://citeseerx.ist.psu.edu/vie...). This algorithm also lets you use the kernel trick (if you need it), as described in section 3. The only assumption the algorithm makes is that you are able to do a matrix inverse, an O(s^3) operation, where s is the number of support vectors. To evaluate whether this holds for your data: intuitively, if the data seems separable, then the decision boundary shouldn't be too complex, so the number of support vectors required is small and definitely does not scale with the data. If this is not the case, you have to resort to approximate solutions. (A rough sketch of the Newton iteration appears after this list.)

3. The PSVM solution by Chang et al., 2007 (http://books.nips.cc/papers/file...) is an approximate solution to the SVM objective. There are no guarantees associated with it, but Google has an implementation, as Jonathan pointed out: http://code.google.com/p/psvm/

Special cases:
4. If you care about having a sparse solution for your SVM (i.e. feature selection is important), then use the Problem 6 formulation in the following paper (that Amund pointed to):
http://jmlr.csail.mit.edu/papers...
The original problem is NP-hard but they solve an approximate version of the objective.

5. All of the above solutions assume that the number of features given is not very large and the size issue is only due to the large number of data samples. If the number of features is also really large, you'll want to split across features as well. Use the version Boyd's group proposes (as Matt mentioned):
http://www.stanford.edu/~boyd/pa...
This is also approximate and does not give you a way to use the kernel trick.
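To make the Chapelle/Chu route in point 2 concrete, here is a minimal single-machine sketch of the primal Newton iteration for a linear SVM with a squared hinge loss. The two sums flagged in the comments are the per-shard quantities a map-reduce job would accumulate; the function name, the regularization parameter `lam`, and the O(d^3) linear solve (instead of the O(s^3) kernelized one) are my own illustration, not code from either paper.

```python
import numpy as np

def primal_svm_newton(X, y, lam=1.0, n_iter=20, tol=1e-6):
    """Sketch of a primal Newton solver. X: (n, d) data, y: (n,) labels in {-1, +1}."""
    d = X.shape[1]
    w = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ w)
        sv = margins < 1  # current margin violators ("support vectors")
        # In a map-reduce version, each mapper would compute these two sums
        # on its shard of the data and a reducer would add the shard results:
        H = lam * np.eye(d) + X[sv].T @ X[sv]        # Hessian of the primal objective
        g = lam * w - X[sv].T @ (y[sv] - X[sv] @ w)  # gradient
        w_new = w - np.linalg.solve(H, g)            # Newton step
        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    return w
```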

In summary, for most problems, I'd say Chapelle/Chu would be your best bet. I'd love to hear feedback if you see mistakes in my summary.
15.2k Views
Vineet Yadav, Text Analytics, Natural Language Processing and Big Data Developer
Written Nov 7, 2010 · Upvoted by Sean Owen (Director, Data Science @ Cloudera)
Well, Apache Mahout (http://mahout.apache.org/) is a machine learning library that uses Hadoop. Currently it doesn't have support vector machine functionality, but it is in their future plans (https://cwiki.apache.org/conflue...). Parallel GPDT (http://dm.unife.it/gpdt/) supports parallel training of support vector machines.
2.4k Views
Peter Skomoroch, Sr. Data Scientist @ LinkedIn
You can train your classifiers and compile the binaries, then apply the trained classifiers to each image via a Hadoop Streaming mapper, sending the trained model as side data via the distributed cache.
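As a rough sketch of what that might look like (the model file name "svm_model.pkl" and the tab-separated input format are assumptions for illustration, not a fixed convention), a Streaming mapper in Python could be:

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper that loads a pre-trained linear SVM
# from the distributed cache and classifies each input record.
# Assumed input: one record per line, "record_id<TAB>comma-separated features".
import sys
import pickle

import numpy as np

# Files shipped via the distributed cache (e.g. with -files) appear in the
# task's working directory under their original names.
with open("svm_model.pkl", "rb") as f:
    w, b = pickle.load(f)  # weight vector and bias of a trained linear SVM

for line in sys.stdin:
    record_id, raw = line.rstrip("\n").split("\t", 1)
    x = np.array([float(v) for v in raw.split(",")])
    label = 1 if np.dot(w, x) + b > 0 else -1
    print("%s\t%d" % (record_id, label))
```

The job would then be launched with the streaming jar, pointing its -files (or older -cacheFile) option at the model file; exact flags depend on your Hadoop version.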
3k Views
SVMs don't really fit the MapReduce model.

Google, however, has open-sourced their parallel SVM implementation:
http://code.google.com/p/psvm/
2.7k Views
Charles H Martin, Calculation Consulting; we predict things
FYI, LibSVM supports parallelism on shared-memory, multi-core machines (this has to be way easier than trying to use Hadoop):

http://www.csie.ntu.edu.tw/~cjli...
1.5k Views
Soren Macbeth, I like playing with data.
Written Nov 7, 2010 · Upvoted by Amund Tveit (PhD in machine learning)
Here's an example of a proximal SVM in Python and Hadoop Streaming:

http://atbrox.com/2010/02/08/par...
1.7k Views
Matt Kraning
Updated May 27, 2011 · Upvoted by Amund Tveit (PhD in machine learning)
Our group has been working on this very problem:
http://www.stanford.edu/~boyd/pa...
2.2k Views
From what I remember of using LibSVM/LibLinear, the resulting models are sets of attribute values that can be averaged together to produce a final model. So you could easily distribute the training dataset using Hadoop and train N classifiers, then combine.

But in reality you should be able to sub-sample your labeled document space for training purposes, train a single model, cross-check it against the held-out documents, and then do exactly as Pete says - distribute that one model (e.g. use Hadoop's DistributedCache) to every task.
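A tiny sketch of that averaging step (assuming each task emits the weight vector of a linear model, e.g. from LibLinear; note that averaging weights is a heuristic and not equivalent to training one SVM on the full data set):

```python
import numpy as np

def average_models(weight_vectors):
    """weight_vectors: list of (d,) arrays, one per model trained on a data split."""
    # Simple unweighted average of the per-split linear models.
    return np.mean(np.stack(weight_vectors), axis=0)
```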
1.4k Views
There is a set of packages for running R on Hadoop called RHadoop. You just have to install R on every node of your cluster and then install three packages (rmr, rhdfs, rhbase), and you are good to go. This firm, Revolution Analytics, has essentially re-engineered the R engine to support parallel processing, and you can run e1071 or any other relevant SVM package on RHadoop.

PS: I am not associated with Revolution Analytics in any way.
983 Views

Apache Hadoop is a Big Data ecosystem consisting of open-source components that essentially change the way large data sets are analyzed, stored, transferred and processed. In contrast to traditional distributed processing systems, Hadoop facilitates multiple kinds of analytic workloads on the same data sets at the same time. The most widely and frequently used framework for managing massive data across a number of computing platforms and servers in every industry, Hadoop is rocketing ahead in enterprises. It lets organizations store files that are bigger than what you can store on a single node or server. More importantly, Hadoop is not just a storage platform; it is one of the most optimized and efficient computational frameworks for big data analytics.

This Hadoop tutorial is an excellent guide for students and professionals to gain expertise in Hadoop technology and its related components. With the aim of serving larger audiences worldwide, the tutorial is designed to train Developers, Administrators, Analysts and Testers on this most commonly applied Big Data framework. From installation to application benefits to future scope, the tutorial explains how learners can make the most efficient use of Hadoop and its ecosystem. It also gives insights into many Hadoop libraries and packages that are not known to many Big Data Analysts and Architects.

In addition, several significant and advanced Big Data components like MapReduce, YARN, HBase, Impala, ETL connectivity, multi-node cluster setup, advanced Oozie, advanced Flume, advanced Hue and ZooKeeper are also explained extensively via real-time examples and scenarios in this learning package.

Thanks to these benefits, Hadoop adoption is accelerating. Since the number of business organizations embracing Hadoop technology to compete on data analytics, increase customer traffic and improve overall business operations is growing at a rapid rate, the number of jobs and the demand for expert Hadoop professionals are increasing at an ever-faster pace. More and more individuals are looking to master their Hadoop skills through professional training courses that can prepare them for various Cloudera Hadoop certifications like CCAH and CCDH.

After finishing this tutorial, you can consider yourself moderately proficient in the Hadoop ecosystem and related mechanisms. You will then understand the concepts well enough to explain them confidently to peer groups and to give quality answers to many of the Hadoop questions asked by seniors or experts.

If you find this tutorial helpful, we would suggest you browse through our Big Data and Hadoop training courses.

Recommended Audience

  • Intellipaat’s Hadoop tutorial is designed for Programming Developers and System Administrators
  • Project Managers eager to learn new techniques of maintaining large datasets
  • Experienced working professionals aiming to become Big Data Analysts
  • Mainframe Professionals, Architects & Testing Professionals
  • Entry-level programmers and working professionals in Java, Python or C++ who are eager to learn the latest Big Data technology

Prerequisites

  • Before starting with this Hadoop tutorial, it is advisable to have prior experience with the Java programming language and the Linux operating system.
  • Basic knowledge of UNIX commands and SQL scripting is beneficial for better understanding the Big Data concepts in Hadoop applications.
204 Views
Mariana Soffer, Artificial Intelligence, Machine Learning, Data Mining, Neuroscience
I would recommend using the best SVM library, which is LibSVM; it comes in several versions with different varieties.