LARGE DATA WITH SCIKIT-LEARN

  Alex Perrier - @alexip

  Data & Software - @BerkleeOnline - Day

  Data Science contributor - @ODSC - Night

  • 1) What is large data?

  • WHAT IS LARGE DATA?

    Cannot fit in the main memory (RAM)

    Even 2-10 GB makes the computer swap too much. Slow!

    From in-memory to on-disk

    • Being able to work with data that fits on disk on a single machine with no cluster
    • MANY GREAT ALTERNATIVES

    • TERMINOLOGY

      • Out-of-core / External memory

      Data does not fit in main memory => access it from a slower data store (disk, ...), typically 10x to 100x slower

      • Offline: all the data is available from the start 
      • Online: Serialized. Same results as offline. Act on new data right away

      • Streaming: Serialized. Limited number of passes over the data, can postpone processing. "Old" data is of less importance.

      • Minibatch: Serialized in blocks

      • Incremental = online + minibatch
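
The difference between offline, online and minibatch processing can be illustrated on something much simpler than a model: a running mean. This is only a sketch on synthetic data (the array `data` and the block count of 10 are assumptions made for illustration); it shows that updating one sample at a time, or one block at a time, can give the same result as seeing all the data at once.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)  # synthetic stand-in for a data stream

# Offline: all the data is available from the start.
offline_mean = data.mean()

# Online: update the estimate one sample at a time.
online_mean, n_seen = 0.0, 0
for x in data:
    n_seen += 1
    online_mean += (x - online_mean) / n_seen

# Minibatch: update the estimate one block at a time.
batch_mean, n_seen = 0.0, 0
for block in np.array_split(data, 10):
    n_seen += len(block)
    batch_mean += (block.sum() - len(block) * batch_mean) / n_seen
```

Not every algorithm has an exact incremental form like this one; many incremental learners only approximate the offline result.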

    • clf = Some Model or Transformation

      • Train on training data X_train, y_train

        clf.fit(X_train,y_train)

      • Predict on test data: X_test => y_test

        y_test = clf.predict(X_test)

      • Assess model performance on test: y_truth vs y_test

        clf.score(X_test, y_truth)
        (or a metric from sklearn.metrics applied to y_truth and y_test)

      • Predict on new data: y_hat = clf.predict(X_new)
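
The fit / predict / score steps above can be sketched end to end as follows. The synthetic dataset and the choice of `LogisticRegression` are assumptions made for illustration; any scikit-learn classifier follows the same pattern. Variable names follow the slides: `y_test` holds the predictions and `y_truth` the true labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_truth = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)        # train on the training data
y_test = clf.predict(X_test)     # predict on the test data

# score() takes the test features and the true labels ...
acc_from_score = clf.score(X_test, y_truth)
# ... while functions in sklearn.metrics compare true and predicted labels.
acc_from_metric = accuracy_score(y_truth, y_test)
```

For a classifier, `score` returns the mean accuracy, so both values agree here.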

    • out-of-core, streaming, online, batch?
    • SCIKIT-LEARN: OUT-OF-CORE

      • Split the training data into blocks: mini-batches (sequential, random, ...)

      • Load each block in sequence

      • Train the model adaptively on each block

      • Convergence!
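
A rough sketch of the loop above, with convergence watched on a held-out set. Everything here is an assumption for illustration: the data is synthetic and already in memory, so `np.array_split` only stands in for loading each block from disk, and `SGDClassifier` is just one of the estimators that can be trained block by block.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=6000, n_features=20, random_state=0)
X_train, y_train = X[:5000], y[:5000]
X_val, y_val = X[5000:], y[5000:]   # held-out set used to watch convergence

clf = SGDClassifier(random_state=0)
classes = np.unique(y)

scores = []
blocks = zip(np.array_split(X_train, 10), np.array_split(y_train, 10))
for X_block, y_block in blocks:                  # load each block in sequence
    clf.partial_fit(X_block, y_block, classes=classes)  # train adaptively
    scores.append(clf.score(X_val, y_val))       # accuracy after this block
```

Plotting `scores` against the block index gives a simple picture of whether the model has stopped improving.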


    • SCIKIT-LEARN MINIBATCH LEARNING
      Batch size n
      Split the data into N blocks
      Use .partial_fit instead of .fit
      clf.partial_fit(X_train[i], y_train[i], classes=all_classes)
      All possible classes must be given to partial_fit on the first call
      i = 1..N, where N is the number of blocks
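
Why the classes have to be given on the first call: a single block may not contain every label. The toy blocks below are made up for illustration (three count features, three classes), and the first block deliberately lacks class 2. A minimal sketch with `MultinomialNB`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

all_classes = np.array([0, 1, 2])

# Hypothetical mini-batches of (X, y); the first one never shows class 2.
blocks = [
    (np.array([[3, 0, 0], [0, 2, 0]]), np.array([0, 1])),
    (np.array([[0, 0, 4], [2, 0, 0]]), np.array([2, 0])),
    (np.array([[0, 3, 0], [0, 0, 1]]), np.array([1, 2])),
]

clf = MultinomialNB()
for i, (X_block, y_block) in enumerate(blocks):
    if i == 0:
        # First call: declare every class the model will ever see.
        clf.partial_fit(X_block, y_block, classes=all_classes)
    else:
        clf.partial_fit(X_block, y_block)

pred = clf.predict(np.array([[0, 0, 5]]))
```

Without `classes` on the first call, the model would have no way to reserve parameters for labels that only appear in later blocks.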

    SCIKIT: IMPLEMENTATION

    Implementation with generators

    Generator code
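
The original generator code is not reproduced in this transcript. As a stand-in, here is one possible sketch, not the author's implementation: a generator that reads a CSV file lazily and yields one mini-batch at a time, so only `batch_size` rows are in memory at once. The helper name `iter_minibatches`, the file layout (label in the last column) and the synthetic file are all assumptions.

```python
import csv
import itertools
import os
import tempfile

import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_minibatches(path, batch_size):
    """Yield (X, y) blocks from a CSV file whose last column is the label."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        while True:
            rows = list(itertools.islice(reader, batch_size))
            if not rows:          # end of file reached
                return
            block = np.array(rows, dtype=float)
            yield block[:, :-1], block[:, -1].astype(int)


# Write a small synthetic file to stand in for a large on-disk dataset.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 5))
y_all = (X_all[:, 0] + X_all[:, 1] > 0).astype(int)
path = os.path.join(tempfile.mkdtemp(), "train.csv")
np.savetxt(path, np.column_stack([X_all, y_all]), delimiter=",")

clf = SGDClassifier(random_state=0)
n_blocks = 0
for X_block, y_block in iter_minibatches(path, batch_size=100):
    clf.partial_fit(X_block, y_block, classes=np.array([0, 1]))
    n_blocks += 1
```

The training loop never needs to know where the blocks come from, so the same loop works if the generator is swapped for one that reads from a database or a set of files.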

    better example

    ALGORITHMS

                                Regression    Classification
    Stochastic Gradient (1)         X               X
    Naive Bayes (2)                                 X

    (1) SG + PassiveAggressive + Perceptron (clf only)

    (2) NB: MultinomialNB, BernoulliNB

                                Clustering    Decomposition
    MiniBatchKMeans                 X               X
    DictionaryLearning, PCA                         X
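
The unsupervised estimators follow the same block-by-block pattern, just without labels. A short sketch, assuming synthetic two-blob data and using `IncrementalPCA`, scikit-learn's incremental variant of PCA, for the decomposition side:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
# Two well-separated blobs in 10 dimensions (synthetic, for illustration).
X = np.vstack([
    rng.normal(0, 1, size=(500, 10)),
    rng.normal(8, 1, size=(500, 10)),
])
rng.shuffle(X)  # shuffle rows so every block mixes both blobs

kmeans = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
ipca = IncrementalPCA(n_components=2)

for X_block in np.array_split(X, 10):
    kmeans.partial_fit(X_block)   # update the cluster centers
    ipca.partial_fit(X_block)     # update the principal components

labels = kmeans.predict(X)
X_reduced = ipca.transform(X)
```

In a real out-of-core setting, the final `predict` and `transform` calls would also be applied block by block rather than on the full array.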
  • 2) Algorithms

  • 3) Implementation

  • 4) Examples

  • Comments