Feature Engineering

Some notes:

Standardization is necessary for algorithms that compute distances between points, such as k-means clustering; otherwise a feature with a much larger scale dominates the distance. SVMs need it too, since the loss is based on the distances from points to the separating hyperplane. Logistic regression with an L1 or L2 regularization term should also be standardized first: the learned weights depend on feature scale, so without standardization the penalty does not treat the features fairly.
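A minimal sketch with scikit-learn's StandardScaler; the toy matrix X is made up:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: without standardization the
# second column would dominate any distance computation.
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])

X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # ~0 for each column
print(X_std.std(axis=0))   # ~1 for each column
```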

Normalization, by contrast, applies to a single sample: treating each attribute as one dimension of the sample's feature vector, normalization rescales that vector to a unit vector, i.e. length 1.
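For contrast, a sketch of per-sample normalization, again on made-up data:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],   # this row has L2 norm 5
              [1.0, 1.0]])

# Each row (sample) is rescaled to unit L2 length.
X_norm = Normalizer(norm="l2").fit_transform(X)
print(np.linalg.norm(X_norm, axis=1))  # -> [1. 1.]
```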

Continuous numeric features are often a source of model instability; where it makes sense, bin them into categories.
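One common way to bin a continuous value, sketched with pandas; the bin edges and labels are illustrative:

```python
import pandas as pd

age = pd.Series([5, 17, 25, 42, 67, 80])

# Cut the continuous value into labeled buckets.
age_bucket = pd.cut(age, bins=[0, 18, 35, 60, 120],
                    labels=["child", "young", "middle", "senior"])
print(age_bucket)
```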

Filling missing values: if only a small fraction (<20%) is missing, just fill the gaps (e.g. with an explicit NA value). If 20-50% is missing, fill the gaps and also create a new 0/1 variable marking whether each value was imputed. If more than half is missing, either keep only the 0/1 missingness indicator or drop the feature entirely.
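A sketch of this three-tier rule; the income column is made up, and the median fill is one concrete choice of fill value (the note only says to fill the gap):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000, np.nan, 80_000, np.nan, 60_000]})
missing_rate = df["income"].isna().mean()

if missing_rate < 0.2:
    # Few missing values: just fill them.
    df["income"] = df["income"].fillna(df["income"].median())
elif missing_rate < 0.5:
    # Moderate: fill, and record which rows were imputed in a 0/1 indicator.
    df["income_missing"] = df["income"].isna().astype(int)
    df["income"] = df["income"].fillna(df["income"].median())
else:
    # Mostly missing: keep only the indicator (or drop the feature entirely).
    df["income_missing"] = df["income"].isna().astype(int)
    df = df.drop(columns=["income"])
```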

Don't make a value with too few samples into its own category (one common remedy is sketched below).
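One way to handle this, folding rare values into a shared "other" bucket, sketched with pandas on made-up data (the remedy itself is an assumption; the note only says not to keep rare values as separate categories):

```python
import pandas as pd

s = pd.Series(["US", "US", "UK", "UK", "FR", "FJ", "MC"])  # "FJ", "MC" are rare
counts = s.value_counts()

# Merge values with too few samples into a shared bucket instead of
# giving each one its own category.
rare = counts[counts < 2].index
s_grouped = s.where(~s.isin(rare), "other")
print(s_grouped.value_counts())
```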

A value more than five standard deviations from the mean can be treated as an outlier; either delete the record or clip the value to the boundary.
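A sketch of the five-sigma rule with NumPy on synthetic data, showing both remedies from the note:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 1000), [50.0]])  # one extreme value

mu, sigma = x.mean(), x.std()
low, high = mu - 5 * sigma, mu + 5 * sigma

x_dropped = x[(x >= low) & (x <= high)]  # option 1: delete the record
x_clipped = np.clip(x, low, high)        # option 2: clip to the boundary
```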

Feature construction is about finding numeric representations that capture something the business cares about. For example, the first few digits of an IMEI identify the phone's brand, which in turn signals purchasing power.
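A sketch of this kind of construction; the IMEI values and the prefix-to-brand lookup here are entirely hypothetical (a real mapping would come from a TAC database):

```python
import pandas as pd

imei = pd.Series(["861234081234567", "358745056543210"], dtype="string")

# Hypothetical prefix -> brand lookup table.
prefix_to_brand = {"861234": "brand_a", "358745": "brand_b"}

# Construct a brand feature from the leading digits of the IMEI.
brand = imei.str[:6].map(prefix_to_brand).fillna("unknown")
```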

The goal of feature extraction is dimensionality reduction: PCA and LDA both find, among a large number of features, a small set of important components, discarding the less important directions.
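A PCA sketch with scikit-learn on random data, keeping only the leading components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # 100 samples, 20 features

pca = PCA(n_components=5)             # keep the 5 most important directions
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance retained by each component
```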

The goal of feature selection is to keep relevant features and drop irrelevant ones. One approach is to measure how related each feature is to the target, e.g. via variance or correlation, and drop the weakly related ones. Another is to add or remove features one at a time and check whether model performance improves; if it doesn't, drop the feature. Tree-based algorithms such as random forests and XGBoost can also estimate feature importance.
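A sketch of both approaches with scikit-learn; mutual information is used here as one concrete relevance measure, and a random forest supplies the importance estimates:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

# Filter approach: keep the k features most related to the target.
selector = SelectKBest(mutual_info_classif, k=3)
X_selected = selector.fit_transform(X, y)

# Embedded approach: let a tree ensemble estimate feature importance.
rf = RandomForestClassifier(random_state=0).fit(X, y)
print(rf.feature_importances_)
```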

Example

To predict how likely a person is to prefer Coke (1) or Pepsi (0).

The model features include age, country, etc.

The problem is the country feature, which has 200+ distinct values.

There is no order between countries, so you can't encode them with ordinal values.

One-hot encoding

Every country becomes its own feature with value 1 or 0.

This introduces 200+ new features (see the code sketch after the dummy-encoding section).

Dummy encoding

The only difference from one-hot encoding is that it uses N-1 feature values instead of N, so one country is left out and is represented by all zeros (the reference level).
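A sketch of both encodings with pandas; the toy DataFrame is made up:

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "FR", "JP", "FR"],
                   "age": [23, 35, 41, 29]})

# One-hot encoding: one 0/1 column per country (N new columns).
one_hot = pd.get_dummies(df, columns=["country"])

# Dummy encoding: drop one level, so N-1 columns; the left-out
# country is represented by all zeros.
dummy = pd.get_dummies(df, columns=["country"], drop_first=True)
```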

Mean encoding / Target encoding

Pre-calculate the mean of the target (Coke=1 / Pepsi=0) for each country.

That mean is essentially the rate at which people in that country like Coke.

Use that mean as the feature value, so only one feature is added.

The downside is that the feature is derived from the target labels, so it is prone to overfitting.
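A minimal sketch with pandas on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "US", "FR", "FR", "FR"],
                   "likes_coke": [1, 0, 1, 1, 0]})

# Per-country mean of the target = observed rate of Coke drinkers there.
country_means = df.groupby("country")["likes_coke"].mean()
df["country_encoded"] = df["country"].map(country_means)
```

In practice the per-category means are usually computed on training folds only, or smoothed toward the global mean, to limit the leakage described above.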