
Handling missing values

1. Missing data orientation with numpy

2. Using MASK in Numpy
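
As a rough illustration of the masking idea, the sketch below uses numpy.ma to hide NaN entries so that statistics skip them. The sample array is made up for this example.

```python
import numpy as np

# Hypothetical sample data with two missing entries.
data = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

# Mask every NaN/inf entry; masked entries are ignored by reductions.
masked = np.ma.masked_invalid(data)

# Mean over the three valid entries only.
mean_ignoring_missing = masked.mean()

# Turn the masked array back into a plain array, filling the gaps with that mean.
filled = masked.filled(mean_ignoring_missing)

print(masked.count(), mean_ignoring_missing, filled)
```

For a plain mean, np.nanmean(data) gives the same value without building a mask; the masked array is more useful when several operations need to share one mask.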

3. Blog about missing data in python by Aleksey Bilogur

4. Recommended: using Pandas to handle missing values

5. Another tutorial on handling missing values in Pandas

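6. Use the sklearn.preprocessing.Imputer class to impute missing values
Note: Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; sklearn.impute.SimpleImputer is its replacement.
3 ways to impute missing values, chosen with the strategy parameter:
- "mean": replace each NaN with the mean of its column
- "median": replace each NaN with the median of its column
- "most_frequent": replace each NaN with the most frequent value in its column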
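
A minimal sketch of the three strategies, assuming a recent scikit-learn where sklearn.impute.SimpleImputer stands in for the older Imputer class. The array X is made-up sample data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical sample data: one NaN in each column.
X = np.array([
    [1.0, 2.0],
    [np.nan, 2.0],
    [3.0, np.nan],
    [1.0, 5.0],
    [7.0, 3.0],
])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    # Statistics are computed per column, ignoring the NaN entries.
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    results[strategy] = imputer.fit_transform(X)

for strategy, filled in results.items():
    print(strategy)
    print(filled)
```

In a real pipeline it is usually safer to call fit on the training data only and then transform the test data, so that the fill values do not leak information from the test set.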

7. Using the interpolate function in Pandas to handle missing values

Supported methods: {‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘krogh’, ‘polynomial’, ‘spline’, ‘piecewise_polynomial’, ‘from_derivatives’, ‘pchip’, ‘akima’}

  • ‘linear’: ignore the index and treat the values as equally spaced. This is the default, and the only method supported on MultiIndexes.
  • ‘time’: works on daily and higher resolution data, interpolating according to the length of the time interval between points
  • ‘index’, ‘values’: use the actual numerical values of the index
  • ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘polynomial’ are passed to scipy.interpolate.interp1d. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=4). These use the actual numerical values of the index.
  • ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ are all wrappers around the scipy interpolation methods of similar names. These use the actual numerical values of the index. See the scipy documentation for more on their behavior.
  • ‘from_derivatives’ refers to scipy.interpolate.BPoly.from_derivatives, which replaces the ‘piecewise_polynomial’ interpolation method in scipy 0.18

New in version 0.18.1: Added support for the ‘akima’ method. Added the interpolate method ‘from_derivatives’, which replaces ‘piecewise_polynomial’ in scipy 0.18; backwards-compatible with scipy < 0.18.
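
To make the difference between ‘linear’ and ‘index’ concrete, here is a small sketch on a Series with an unevenly spaced index. The values are made up for this example; only these two methods are shown because they do not need scipy.

```python
import numpy as np
import pandas as pd

# Hypothetical series: two gaps, and an index that is not evenly spaced.
s = pd.Series([0.0, np.nan, np.nan, 30.0], index=[0, 1, 10, 20])

# 'linear' ignores the index and treats the four points as equally spaced.
linear = s.interpolate(method="linear")

# 'index' uses the numerical index values as the x coordinates.
by_index = s.interpolate(method="index")

print(linear.tolist())    # equally spaced steps between 0 and 30
print(by_index.tolist())  # steps proportional to the distance along the index
```

The scipy-backed methods are called the same way, for example s.interpolate(method='polynomial', order=2), but they need scipy installed and enough non-missing points for the chosen order.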

8. Other methods for handling missing values in Pandas

Missing data handling

DataFrame.dropna([axis, how, thresh, ...]): Return object with labels on the given axis omitted where any or all of the data are missing.
DataFrame.fillna([value, method, axis, ...]): Fill NA/NaN values using the specified method.
DataFrame.replace([to_replace, value, ...]): Replace values given in ‘to_replace’ with ‘value’.
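
The three methods above can be sketched on a small frame as follows. The DataFrame and the "?" placeholder are made-up sample data, not taken from any of the linked tutorials.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: NaN in the numeric columns, "?" as a text placeholder.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": ["x", "?", "z"],
})

# dropna: drop rows that contain any missing value.
dropped_any = df.dropna()

# dropna with how="all": drop rows only when every column in `subset` is missing.
dropped_all = df.dropna(how="all", subset=["a", "b"])

# fillna: fill each column with its own value (a constant for "a", the mean for "b").
filled = df.fillna({"a": 0.0, "b": df["b"].mean()})

# replace: turn a placeholder string into a real missing value.
replaced = df.replace("?", np.nan)

print(dropped_any, dropped_all, filled, replaced, sep="\n\n")
```

A common order is replace first (so placeholders become NaN), then either dropna or fillna depending on how much data can be lost.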