Have you ever tried working with a large dataset on a machine with 4 GB of RAM? The machine starts heating up on even the simplest machine learning tasks. This is a common problem data scientists face when working with restricted computational resources.
Exploring and applying machine learning algorithms to datasets that are too large to fit into memory is pretty common.
Working with large amounts of data causes several problems:

- Training will be slow.
- You may run into out-of-memory errors and sluggish behaviour (plots redrawing slowly, for example).
- Comparing multiple algorithms becomes much slower, since cross-validation retrains each candidate algorithm several times.
Enhancing resources is a reactive action. Before doing that, it is better to work on the data itself.
Look for redundancy in the data. Redundancy can be:

- In the form of samples: a data condensation technique can be used. k-means clustering is one approach; this paper (linked in the references below) describes another. The effectiveness of a condensation approach is measured by the accuracy of the model trained on the condensed data. A sketch of the k-means idea is shown after this list.
- In the form of features: apply a feature selection approach; the articles linked below can help. A sketch of this follows as well.
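As an illustration of the k-means idea, here is a minimal sketch, assuming synthetic data and scikit-learn's MiniBatchKMeans; the cluster count and the majority-label assignment are choices made for this example, not prescriptions from the paper:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Toy stand-in for a dataset too large to train on comfortably.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Condense 100k samples into 1k cluster centres; the centres
# become a smaller, representative training set.
kmeans = MiniBatchKMeans(n_clusters=1_000, batch_size=10_000, random_state=0)
cluster_ids = kmeans.fit_predict(X)
X_small = kmeans.cluster_centers_

# Give each centre the majority class of its cluster, so the
# condensed set can still be used for supervised training.
y_small = np.array([
    np.bincount(y[cluster_ids == c]).argmax() if (cluster_ids == c).any() else 0
    for c in range(1_000)
])
```

Comparing a model trained on (X_small, y_small) against one trained on the full data is exactly the accuracy-based effectiveness check mentioned above.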
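Similarly, a minimal sketch of feature selection, assuming synthetic data and scikit-learn's SelectFromModel with a random forest (both are illustrative choices; the linked articles discuss other methods):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Toy data: 50 features, of which only 10 are informative.
X, y = make_classification(n_samples=5_000, n_features=50,
                           n_informative=10, random_state=0)

# Keep only the features whose importance exceeds the mean
# importance (SelectFromModel's default threshold).
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # fewer columns to train on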
You can also remove noisy samples, and you can drop records with missing values, etc. However, be careful not to hurt the generalisation capability of the model. A short sketch follows.
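A minimal sketch of both clean-ups with pandas; the file name train.csv and the 3-standard-deviation cut-off are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical file

# Drop rows containing missing values. Only safe when such rows
# are a small fraction of the data; otherwise consider imputing.
df = df.dropna()

# A crude noise filter: drop rows whose numeric values lie more
# than 3 standard deviations from the column mean.
numeric = df.select_dtypes(include=np.number)
z = (numeric - numeric.mean()) / numeric.std()
df = df[(z.abs() < 3).all(axis=1)]
```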
The required training set size varies with the problem at hand. As a rule of thumb, there should be at least 10 observations/samples per variable (note that in a categorical feature, each category counts as one variable). So, depending on how many variables you have, you may only need a modest fraction of a large dataset to train reliably.
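For example, with 15 numeric features and one categorical feature of 5 categories (made-up numbers), the rule works out as:

```python
n_numeric = 15
n_categories = 5                         # each category counts as one variable
n_variables = n_numeric + n_categories   # 20 variables in total
min_samples = 10 * n_variables           # at least 200 samples needed
```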
https://machinelearningmastery.com/large-data-files-machine-learning/
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-redundant-features-in-machine-learning
https://machinelearningmastery.com/feature-selection-to-improve-accuracy-and-decrease-training-time/
https://www.isical.ac.in/~sankar/paper/PAMI_02_PM_CAM_SKP_2.pdf
https://youtu.be/q8gVpKl1f-4
https://www.analyticsvidhya.com/blog/2018/08/dask-big-datasets-machine_learning-python/
http://www.fekete.com/san/webhelp/welltest/webhelp/content/html_files/procedures/preparing_data_for_analysis/handling_large_datasets.htm
https://images.app.goo.gl/1fH1aDuBWQctSS957
https://www.statisticssolutions.com/sample-size-formula/