TEAM EDA
Chris's Feature Engineering Tips
Original post: https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575#latest-624919
Feature Engineering Techniques
Engineering features is key to improving your LB score. Below are some ideas on how to engineer new features. Create a new feature and then evaluate it with a local validation scheme to see if it improves your model's CV (and thus LB). Keep beneficial features and discard the others.
If you create lots of new features at once, you can use forward feature selection, recursive feature elimination, LGBM importance, or permutation importance to determine which are useful.
The kernel here by Konstantin shows this procedure and demonstrates many of the following techniques.
NAN processing
If you give np.nan to LGBM, then at each tree node split it will split the non-NAN values and then send all the NANs to either the left child or the right child, depending on which is best. NANs therefore get special treatment at every node and can be overfit. By simply converting all NANs to a negative number lower than all non-NAN values (such as -999),
df[col] = df[col].fillna(-999)
then LGBM will no longer overprocess NAN. Instead it will give it the same attention as other numbers. Try both ways and see which gives the highest CV.
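As a minimal sketch of the conversion above, assuming a pandas DataFrame: the helper name `fill_nan_below_min` and the sample values are hypothetical and not part of the original post. The helper checks that the sentinel really is lower than every non-NAN value before filling.

```python
import numpy as np
import pandas as pd

def fill_nan_below_min(df, cols, sentinel=-999):
    """Return a copy of df with NaN in cols replaced by sentinel (hypothetical helper)."""
    out = df.copy()
    for col in cols:
        col_min = out[col].min()  # min() skips NaN by default
        # Refuse a sentinel that would collide with or exceed real values
        if pd.notna(col_min) and sentinel >= col_min:
            raise ValueError(f"{col}: sentinel {sentinel} is not below min {col_min}")
        out[col] = out[col].fillna(sentinel)
    return out

# Made-up sample data for illustration
sample = pd.DataFrame({"D1": [0.0, 14.0, np.nan, 3.0]})
filled = fill_nan_below_min(sample, ["D1"])
print(filled["D1"].tolist())
```

Training once on `sample` and once on `filled` and comparing CV is the comparison the text recommends.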
Label Encode / Factorize / Memory Reduction
Label encoding (factorizing) converts a (string, category, object) column to integers. Afterward you can cast it to int8, int16, or int32 depending on whether the max is less than 128, less than 32768, or larger. Factorizing reduces memory and turns NAN into a number (i.e. -1), which affects CV and LB as described above. Factorizing also gives you the choice to treat a categorical variable as numeric, as described below.
df[col], _ = df[col].factorize()
if df[col].max() < 128: df[col] = df[col].astype('int8')
elif df[col].max() < 32768: df[col] = df[col].astype('int16')
else: df[col] = df[col].astype('int32')
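The same steps can be wrapped in a small function. This is a sketch only: `label_encode` and the sample column are hypothetical names, and it uses `pd.factorize`, which encodes NAN as -1 by default.

```python
import numpy as np
import pandas as pd

def label_encode(series):
    """Factorize a column and downcast to the smallest fitting int type (hypothetical helper)."""
    codes, uniques = pd.factorize(series)  # NaN is encoded as -1
    max_code = codes.max() if len(codes) else 0
    if max_code < 128:
        dtype = "int8"
    elif max_code < 32768:
        dtype = "int16"
    else:
        dtype = "int32"
    encoded = pd.Series(codes, index=series.index, name=series.name).astype(dtype)
    return encoded, uniques

# Made-up sample data for illustration
sample = pd.Series(["visa", "mastercard", np.nan, "visa"], name="card4")
encoded, classes = label_encode(sample)
print(encoded.tolist())  # codes follow order of first appearance
print(encoded.dtype)
```

Returning `uniques` as well keeps the mapping from integer codes back to the original labels.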
Additionally for memory reduction, people use the popular memory_reduce function on the other columns. A simpler and safer approach is to convert all float64 to float32 and all int64 to int32. (It's best to avoid float16. You can use int8 and int16 if you like).
for col in df.columns:
    if df[col].dtype == 'float64': df[col] = df[col].astype('float32')
    if df[col].dtype == 'int64': df[col] = df[col].astype('int32')
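To see the effect of this downcast, the sketch below compares memory use before and after. The helper `downcast_64bit` and the sample frame are hypothetical. The int32 range check is an added precaution that is not in the original post.

```python
import numpy as np
import pandas as pd

def downcast_64bit(df):
    """Return a copy with float64 -> float32 and int64 -> int32 (hypothetical helper)."""
    out = df.copy()
    info = np.iinfo("int32")
    for col in out.columns:
        if out[col].dtype == "float64":
            out[col] = out[col].astype("float32")
        elif out[col].dtype == "int64":
            # Added precaution: only downcast when every value fits in int32
            if out[col].min() >= info.min and out[col].max() <= info.max:
                out[col] = out[col].astype("int32")
    return out

# Made-up sample data for illustration
sample = pd.DataFrame({
    "amt": np.arange(1000, dtype="float64"),
    "cnt": np.arange(1000, dtype="int64"),
})
small = downcast_64bit(sample)
before = sample.memory_usage(index=False).sum()
after = small.memory_usage(index=False).sum()
print(before, "->", after)
```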
Categorical Features
With categorical variables, you can either tell LGBM that they are categorical or tell LGBM to treat them as numeric (if you label encode them first). Either way, LGBM can extract the category classes. Try both ways and see which gives the highest CV. After label encoding, do the following to mark a column as categorical, or leave it as int to treat it as numeric:
df[col] = df[col].astype('category')
Splitting
A single (string or numeric) column can be made into two columns by splitting. For example, a string column id_30 such as “Mac OS X 10_9_5” can be split into Operating System “Mac OS X” and Version “10_9_5”. Or, for example, the numeric column TransactionAmt “1230.45” can be split into Dollars “1230” and Cents “45”. LGBM cannot see these pieces on its own; you need to split them.
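One possible way to do both splits in pandas is sketched here. The new column names and the sample rows are hypothetical, and the code assumes that the version is the last space-separated token of id_30 and that the amounts contain no NAN and at most two decimals.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration
sample = pd.DataFrame({
    "id_30": ["Mac OS X 10_9_5", "Windows 10", "iOS 11.1.2"],
    "TransactionAmt": [1230.45, 59.00, 7.5],
})

# Assumption: the version is the last space-separated token
parts = sample["id_30"].str.rsplit(" ", n=1, expand=True)
sample["id_30_os"] = parts[0]
sample["id_30_version"] = parts[1]

# Whole units and hundredths; round() guards against float error such as 44.99999
sample["dollars"] = np.floor(sample["TransactionAmt"]).astype("int32")
sample["cents"] = ((sample["TransactionAmt"] - sample["dollars"]) * 100).round().astype("int32")

print(sample[["id_30_os", "id_30_version", "dollars", "cents"]])
```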
Combining
Two (string or numeric) columns can be combined into one column. For example card1 and card2 can become a new column with
df['uid'] = df['card1'].astype(str) + '_' + df['card2'].astype(str)
Numeric columns can be combined by adding, subtracting, multiplying, etc. This helps LGBM because, by themselves, card1 and card2 may not correlate with the target, and therefore LGBM won’t split on them at a tree node. But the interaction uid = card1_card2 may correlate with the target, and now LGBM will split on it.
Frequency Encoding
Frequency encoding is a powerful technique that allows LGBM to see whether column values are rare or common. For example, if you want LGBM to "see" which credit cards are used infrequently, try
temp = df['card1'].value_counts().to_dict()
df['card1_counts'] = df['card1'].map(temp)
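A small worked example of frequency encoding, as a sketch: the helper `frequency_encode` and the sample values are hypothetical. `value_counts()` skips NAN by default, so NAN rows stay NAN in the new column.

```python
import pandas as pd

def frequency_encode(df, col):
    """Add <col>_counts holding how many rows share each value (hypothetical helper)."""
    counts = df[col].value_counts()  # NaN is excluded by default
    out = df.copy()
    out[col + "_counts"] = out[col].map(counts)
    return out

# Made-up sample data for illustration
sample = pd.DataFrame({"card1": [1000, 2000, 1000, 3000, 1000]})
encoded = frequency_encode(sample, "card1")
print(encoded)
```

Here card 1000 gets a count of 3 while cards 2000 and 3000 get 1, which is the "rare or common" signal the text describes.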
Aggregations / Group Statistics
Providing LGBM with group statistics allows LGBM to determine if a value is common or rare for a particular group. You calculate group statistics by providing pandas with 3 variables. You give it the group, variable of interest, and type of statistic. For example,
# requires: import pandas as pd
temp = (df.groupby('card1')['TransactionAmt'].agg(['mean'])
          .rename({'mean': 'TransactionAmt_card1_mean'}, axis=1))
df = pd.merge(df, temp, on='card1', how='left')
The feature here adds to each row the average TransactionAmt for that row's card1 group. Therefore LGBM can now tell if a row has an abnormal TransactionAmt for its card1 group.
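An alternative sketch of the same idea uses `groupby(...).transform`, which returns a result aligned to the original rows and so avoids the merge. The helper `add_group_stats`, the ratio column, and the sample rows are hypothetical additions for illustration.

```python
import pandas as pd

def add_group_stats(df, group, col, stats=("mean", "std")):
    """Add one <col>_<group>_<stat> column per statistic (hypothetical helper)."""
    out = df.copy()
    for stat in stats:
        out[f"{col}_{group}_{stat}"] = out.groupby(group)[col].transform(stat)
    return out

# Made-up sample data for illustration
sample = pd.DataFrame({
    "card1": [1, 1, 1, 2, 2],
    "TransactionAmt": [10.0, 12.0, 500.0, 40.0, 60.0],
})
result = add_group_stats(sample, "card1", "TransactionAmt", stats=("mean",))

# Illustrative extra: the ratio to the group mean makes the unusual 500.0 row stand out
result["amt_to_mean"] = result["TransactionAmt"] / result["TransactionAmt_card1_mean"]
print(result)
```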
Normalize / Standardize
You can normalize columns against themselves. For example
df[col] = ( df[col]-df[col].mean() ) / df[col].std()
Or you can normalize one column against another column. For example, if you create a Group Statistic (described above) indicating the mean value of D3 for each week, then you can remove time dependence by
df['D3_remove_time'] = df['D3'] - df['D3_week_mean']
The new variable D3_remove_time no longer increases as we advance in time because we have normalized it against the effects of time.
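The weekly mean itself can be built as a group statistic. The sketch below assumes a time column named TransactionDT that counts seconds from some reference point; that assumption, the `week` column, and the sample values are illustrative and not taken from the original post.

```python
import pandas as pd

SECONDS_PER_WEEK = 7 * 24 * 60 * 60

# Made-up sample data; TransactionDT is assumed to be in seconds
sample = pd.DataFrame({
    "TransactionDT": [0, 100_000, 700_000, 800_000],
    "D3": [1.0, 3.0, 10.0, 14.0],
})

# Bucket rows into weeks, then attach each week's mean D3 to its rows
sample["week"] = sample["TransactionDT"] // SECONDS_PER_WEEK
sample["D3_week_mean"] = sample.groupby("week")["D3"].transform("mean")

# Subtracting the weekly mean removes the upward drift over time
sample["D3_remove_time"] = sample["D3"] - sample["D3_week_mean"]
print(sample)
```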
Outlier Removal / Relax / Smooth / PCA
Normally you want to remove anomalies from your data because they confuse your models. However, in this competition we want to find anomalies, so use smoothing techniques carefully. The idea behind these methods is to identify and remove uncommon values. For example, after frequency encoding a variable, you can remove all values that appear in fewer than 0.1% of rows by replacing them with a new value like -9999.
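The rare-value replacement described above might look like the following sketch. The helper `replace_rare` and the sample data are hypothetical. NAN entries have no frequency, so this version also replaces them with the new value.

```python
import pandas as pd

def replace_rare(series, min_frac=0.001, new_value=-9999):
    """Replace values seen in fewer than min_frac of rows (hypothetical helper)."""
    # Relative frequency of each row's value; NaN rows map to NaN
    freq = series.map(series.value_counts(normalize=True))
    # Keep values at or above the threshold, replace everything else
    return series.where(freq >= min_frac, new_value)

# Made-up sample: value 3 appears once in 2000 rows (0.05%)
sample = pd.Series([1] * 1500 + [2] * 499 + [3], name="card1")
cleaned = replace_rare(sample)
print(cleaned.value_counts())
```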