This post investigates the best method and settings for logistic regression under negative down sampling.
The conclusion: before negative down sampling, estimate the positive rate [tex:{ \hat{\pi} }] as the mean of the labels ([tex:{ \hat{\pi} = \frac{1}{N} \sum_i y_i }]); its logarithm gives the intercept weight ([tex:{ \ln \hat{\pi} }]). Then train on the down-sampled data with class_weight='balanced', fit_intercept=False.
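A minimal sketch of this recipe (assumptions: df is the full DataFrame, df_d the down-sampled one, X_cols the feature columns — all illustrative names; solver='liblinear' is added so that penalty='l1' runs on current scikit-learn):

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

# df: full data, df_d: negative-down-sampled data, X_cols: feature columns
# (all assumed to exist already; names are illustrative).
pi_hat = df['y'].mean()        # positive rate, estimated BEFORE down sampling
intercept = np.log(pi_hat)     # intercept weight (rare-event approximation)

clf = LogisticRegression(penalty='l1', solver='liblinear', C=1,
                         class_weight='balanced', fit_intercept=False)
clf.fit(df_d[X_cols], df_d['y'])

# At prediction time, combine the fixed intercept with the learned coefficients.
p = expit(intercept + df[X_cols].values @ clf.coef_.ravel())
```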
Background
Logistic regression does not learn the right model when the class ratio of the training sample is changed.
http://quinonero.net/Publications/predicting-clicks-facebook.pdf performs negative down sampling and then applies a Re-Calibration step, which corrects the predictions distorted by down sampling back to the original scale.
However, when log-loss is the loss function and the learning parameters are optimized against it, this correction alone is not satisfactory. The reason, presumably, is that the log-loss values before and after down sampling differ, and their optimum (a maximum of neg_log_loss, i.e. a minimum of the loss) can sit at a different parameter setting.
The verification below draws on the example at
https://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients
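For reference, the Re-Calibration in the paper maps a prediction p from the down-sampled model back to the original scale via the negative sampling rate w (the fraction of negatives kept); a one-function sketch:

```python
def recalibrate(p, w):
    """Re-Calibration from the Facebook paper: q = p / (p + (1 - p) / w),
    where w is the probability with which each negative was kept (0 < w <= 1)."""
    return p / (p + (1.0 - p) / w)
```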
What we want to check here
- Penalizing the intercept is not appropriate here, so set intercept_scaling large — is that actually better?
- Does negative down sampling change the parameters at the neg_log_loss optimum?
- How effective is correction via class_weight?
- Validation of class_weight='balanced'.
Generating the experimental data
```python
import pandas as pd
import random

FAKE = 10  # number of irrelevant noise features z0..z9

def sample_data(size=10000, fake=FAKE):
    """Generate sample data: P(y==1) is 10% when x==1 and 1% when x==0."""
    y = []
    x = []
    z = [[] for i in range(fake)]
    r = []
    for i in range(size):
        xx = 1 if random.random() < 0.05 else 0  # x==1 for about 5% of rows
        x.append(xx)
        if xx == 1:
            yy = 1 if random.random() < 0.1 else 0
        else:
            yy = 1 if random.random() < 0.01 else 0
        y.append(yy)
        for j in range(fake):
            zz = 1 if random.random() > 0.1 else 0  # pure noise, independent of y
            z[j].append(zz)
        rr = random.random() if yy == 0 else 0  # r: 0 for positives, uniform for negatives
        r.append(rr)
    data_dict = {'z' + str(i): z[i] for i in range(fake)}
    data_dict['y'] = y
    data_dict['x'] = x
    data_dict['r'] = r
    df = pd.DataFrame.from_dict(data_dict)
    return df

df_test = sample_data()
df_test[df_test['r'] < 0.005][:20]
```
| | r | x | y | z0 | z1 | z2 | z3 | z4 | z5 | z6 | z7 | z8 | z9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 83 | 0.000000 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 132 | 0.000000 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 147 | 0.000000 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 324 | 0.005000 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 329 | 0.000000 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 363 | 0.001359 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 419 | 0.001078 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 494 | 0.000486 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 647 | 0.000000 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 698 | 0.003799 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 720 | 0.000000 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 737 | 0.000000 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 793 | 0.000000 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 797 | 0.000000 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 809 | 0.000000 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 925 | 0.000000 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1035 | 0.000000 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1093 | 0.000000 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1110 | 0.000000 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1145 | 0.000000 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
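As a sanity check on these settings: the marginal positive rate is [tex:{ 0.05 \times 0.1 + 0.95 \times 0.01 = 0.0145 }], so [tex:{ \ln \hat{\pi} \approx \ln 0.0145 \approx -4.23 }] — the value that will reappear later as the fixed intercept in the balanced experiment.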
```python
df_list = [sample_data() for i in range(10)]
df = df_list[0]
```
Checking the weights as the down sampling rate changes
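Down sampling below relies on the r column: r is 0 for every positive example and uniform on [0, 1) for negatives, so a single filter keeps all positives and roughly a 1/ratio fraction of the negatives:

```python
df_d = df[df['r'] < 1.0 / ratio]  # positives have r == 0 and always survive
```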
```python
from sklearn.linear_model import LogisticRegression
from IPython.core.display import display, Markdown

def check_coef_by_down_sampling(df_list, intercept_scale=1):
    for ratio in [1, 2, 5, 10, 20, 50, 100, 200]:
        display(Markdown("### ratio: %s" % str(ratio)))
        for i in range(0, 10):
            df = df_list[i]
            df_d = df[df['r'] < 1.0/ratio]  # negative down sampling via the r column
            logreg = LogisticRegression(penalty='l1', C=1, intercept_scaling=intercept_scale)
            column_list = ['x'] + ['z' + str(i) for i in range(FAKE)]
            logreg.fit(df_d[column_list], pd.Series(df_d['y'].values.flatten()))
            w_x = logreg.coef_.tolist()[0][0]  # coefficient of x
            w_z = sum([w * w for w in logreg.coef_.tolist()[0][1:]])  # sum of squared noise weights
            print([round(x, 5) for x in [logreg.intercept_[0], w_x, w_z]])

check_coef_by_down_sampling(df_list)
```
ratio: 1
[-4.91204, 2.74654, 0.17877]
[-4.0515, 2.3485, 0.83755]
[-5.42401, 2.51179, 0.4848]
[-4.03337, 2.06947, 0.44515]
[-4.69719, 2.57411, 0.59647]
[-3.76437, 2.35255, 0.35005]
[-4.39169, 2.05972, 0.61103]
[-4.11071, 2.53458, 0.17263]
[-2.99168, 2.46254, 0.63564]
[-3.99944, 2.30797, 0.55562]
ratio: 2
[-4.2107, 2.70525, 0.23816]
[-3.38297, 2.31001, 0.92924]
[-4.77464, 2.50352, 0.55668]
[-3.37317, 2.03352, 0.4837]
[-4.03382, 2.53982, 0.58345]
[-2.97067, 2.31873, 0.38171]
[-3.41311, 2.12467, 0.67182]
[-3.44971, 2.49872, 0.19891]
[-2.39652, 2.40039, 0.5989]
[-3.3235, 2.31424, 0.65252]
ratio: 5
[-3.35169, 2.72117, 0.23848]
[-2.27844, 2.41117, 0.77865]
[-3.52712, 2.46208, 0.72024]
[-2.23162, 2.27092, 0.56094]
[-3.22692, 2.54749, 0.57494]
[-1.91845, 2.26757, 0.57179]
[-2.54464, 2.15011, 0.68264]
[-2.45929, 2.51433, 0.26709]
[-1.19256, 2.56356, 0.80125]
[-2.54425, 2.20229, 0.74773]
ratio: 10
[-3.25613, 2.75951, 0.52]
[-1.42508, 2.38864, 0.48791]
[-2.41798, 2.45274, 0.90139]
[-1.60434, 2.24338, 0.59875]
[-1.95903, 2.56728, 0.65356]
[-1.02283, 2.17213, 0.57539]
[-1.56087, 2.2855, 0.89761]
[-1.4418, 2.45697, 0.47024]
[-0.72239, 2.52637, 0.75559]
[-1.6296, 2.47578, 0.9896]
ratio: 20
[-2.40845, 2.86692, 0.62559]
[-1.03619, 2.22112, 0.40693]
[-1.43415, 2.61045, 0.5262]
[-0.45256, 2.03482, 0.52997]
[-1.19788, 2.51803, 0.47951]
[0.0, 2.08584, 0.65126]
[-1.23638, 2.23326, 1.23025]
[-0.46025, 2.78909, 0.70659]
[-0.45358, 2.60837, 0.5026]
[-1.25244, 2.28267, 0.91885]
ratio: 50
[-0.66989, 2.5969, 0.87268]
[0.0, 2.07957, 0.88407]
[0.0, 2.63922, 0.5973]
[0.0, 1.92209, 0.67016]
[-0.52315, 2.67275, 0.84047]
[0.0, 1.69736, 0.38221]
[0.0, 2.49584, 1.50871]
[0.0, 2.58153, 0.85577]
[0.0, 2.57994, 0.81271]
[-0.88678, 2.0582, 0.9175]
ratio: 100
[0.0, 3.11442, 0.84759]
[0.0, 1.93646, 1.44019]
[0.0, 2.61444, 1.39052]
[0.0, 1.8008, 0.48262]
[-0.67109, 3.00093, 1.18918]
[0.0, 1.37616, 0.1724]
[0.0, 2.52167, 1.31528]
[0.0, 2.44528, 0.80262]
[0.0, 2.23357, 0.57737]
[0.0, 2.59899, 0.89581]
ratio: 200
[0.0, 2.86792, 1.13424]
[0.0, 2.1795, 0.93253]
[0.2598, 2.56551, 1.55929]
[0.05154, 1.95667, 0.3683]
[0.0, 2.43416, 1.52139]
[0.0, 2.09143, 0.38268]
[0.0, 2.10479, 1.78338]
[0.18917, 2.81943, 0.89836]
[0.0, 3.27659, 1.14055]
[0.0, 2.41917, 0.91509]
The first number is the intercept, the second is the coefficient of x (the feature that actually contributes to generating y), and the third is the sum of squared coefficients of the entirely unrelated z* features. Since x raises the positive rate tenfold, from 1% to 10%, the ideal weight for x is the natural log of 10, 2.302585 (strictly, the log odds ratio is ln 11 ≈ 2.398; ln 10 is the rare-event approximation).
The coefficient of x barely changes with the down sampling rate, but the intercept changes drastically. We can also see that
- the larger the down sampling rate, the more easily the model spuriously learns weights for the noise features z*.
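Before rerunning with a large intercept_scaling, a note on what that parameter does in liblinear-based models (paraphrasing the scikit-learn documentation): the intercept is implemented as one extra synthetic feature with constant value s = intercept_scaling, so its raw weight [tex:{ w_0 }] is regularized along with all the others, and the reported intercept is [tex:{ s \cdot w_0 }]. Roughly:

```latex
\min_{w,\,w_0}\ \lVert (w, w_0) \rVert_1
  + C \sum_i \log\!\left(1 + e^{-y_i\,(w^\top x_i + s\,w_0)}\right),
\qquad \texttt{intercept\_} = s\, w_0
```

With s large, a near-zero (hence barely penalized) raw weight still yields a large effective intercept, which is what the rerun below exploits.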
```python
check_coef_by_down_sampling(df_list, intercept_scale=10000)
```
ratio: 1
[-5.45556, 2.75912, 0.28262]
[-4.6428, 2.35239, 0.92031]
[-5.78945, 2.52124, 0.64852]
[-4.49992, 2.07685, 0.49226]
[-5.20276, 2.58633, 0.69073]
[-4.34302, 2.36255, 0.28314]
[-5.01426, 2.0722, 0.7266]
[-4.6828, 2.54017, 0.13943]
[-3.47544, 2.47044, 0.48384]
[-4.54506, 2.31714, 0.56532]
ratio: 2
[-4.47498, 2.71107, 0.27932]
[-4.04952, 2.31372, 1.04347]
[-5.06899, 2.50929, 0.67163]
[-3.87872, 2.04466, 0.54536]
[-4.33338, 2.54452, 0.60214]
[-3.54863, 2.32828, 0.29453]
[-4.04781, 2.13091, 0.7376]
[-4.02237, 2.50532, 0.17511]
[-2.98376, 2.41174, 0.45404]
[-3.84655, 2.32399, 0.66282]
ratio: 5
[-3.97839, 2.73628, 0.37851]
[-2.93231, 2.40569, 0.85465]
[-4.28537, 2.47596, 0.97301]
[-2.35712, 2.27262, 0.55879]
[-4.03678, 2.56098, 0.76631]
[-2.53755, 2.27179, 0.45956]
[-3.13683, 2.15767, 0.73814]
[-2.9908, 2.51932, 0.22433]
[-1.8004, 2.56204, 0.57123]
[-3.09578, 2.21253, 0.79138]
ratio: 10
[-3.72013, 2.77988, 0.7003]
[-2.08104, 2.37616, 0.48348]
[-3.22456, 2.4633, 1.11463]
[-2.22255, 2.2543, 0.64177]
[-2.92231, 2.58176, 0.72124]
[-1.64388, 2.18194, 0.4027]
[-2.34855, 2.29581, 0.92035]
[-2.15464, 2.45468, 0.38139]
[-1.33011, 2.52442, 0.55185]
[-2.25828, 2.48406, 1.00449]
ratio: 20
[-3.00527, 2.89685, 0.8809]
[-1.62548, 2.2128, 0.43464]
[-2.2563, 2.62659, 0.68069]
[-0.98092, 2.04177, 0.4648]
[-1.47433, 2.52369, 0.45993]
[-0.57554, 2.09909, 0.42959]
[-1.85798, 2.26083, 1.32768]
[-1.37675, 2.77282, 0.49475]
[-1.26171, 2.60299, 0.35655]
[-1.74328, 2.28939, 0.97978]
ratio: 50
[-1.84861, 2.62367, 1.05701]
[-0.51107, 2.07379, 0.84463]
[-1.10631, 2.63535, 0.60667]
[-0.1167, 1.92674, 0.56876]
[-1.42434, 2.7109, 0.87041]
[-0.18693, 1.70358, 0.35005]
[-0.7797, 2.51998, 1.44281]
[0.09158, 2.58324, 0.88211]
[-0.46089, 2.57796, 0.76345]
[-1.71747, 2.1023, 1.21375]
ratio: 100
[-0.59101, 3.13524, 0.90419]
[-0.1334, 1.9426, 0.97734]
[0.07708, 2.61341, 1.3952]
[0.67081, 1.79631, 0.55785]
[-2.34283, 3.06153, 2.02438]
[-0.277, 1.39244, 0.18829]
[0.65186, 2.51285, 1.41825]
[0.3768, 2.44727, 0.88092]
[0.29966, 2.23463, 0.60469]
[-1.20692, 2.6587, 1.16563]
ratio: 200
[-0.40691, 2.90292, 1.28256]
[0.53202, 2.184, 0.87883]
[1.80984, 2.59295, 1.85358]
[0.95417, 1.92716, 0.19392]
[-0.92467, 2.48612, 1.94237]
[-0.38448, 2.11439, 0.50066]
[1.05276, 2.08541, 1.95901]
[2.15207, 2.8209, 1.84377]
[1.16178, 3.27734, 1.26292]
[-0.13707, 2.42296, 0.95436]
Setting intercept_scaling large (the latter runs)
The intercept weight is now assigned without strain, and the weight for x stably comes out near 2.302585. Since negative down sampling deliberately alters the class prior, there is no point whatsoever in penalizing the intercept weight.
Note that the reported intercept already has intercept_scaling multiplied in, so there is no need to multiply it yourself. Source:
https://github.com/scikit-learn/scikit-learn/blob/45dc891c96eebdb3b81bf14c2737d8f6540fabfe/sklearn/svm/base.py#L902
It is better to set intercept_scaling large
- Better yet, avoid negative down sampling, and if you must, keep the rate small. (Even so, there can be reasons to do it, such as memory limits.)
- With intercept_scaling set large, the model does not waste effort learning a penalized intercept weight.
Does negative down sampling change the parameters at the neg_log_loss optimum?
Cross validation for the parameter search
```python
from sklearn.model_selection import GridSearchCV

def grid_search_cv(X, y, class_weight=None, fit_intercept=True):
    test_parameters = [
        {'C': [0.01, 0.1, 1, 10, 100]}
    ]
    clf = GridSearchCV(
        LogisticRegression(penalty='l1',
                           intercept_scaling=10000,
                           class_weight=class_weight,
                           fit_intercept=fit_intercept),
        test_parameters,
        cv=20,
        scoring=['neg_log_loss', 'accuracy'],
        n_jobs=-1,
        refit='neg_log_loss'
    )
    clf.fit(X, y)
    return clf
```
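Because scoring lists two metrics, refit='neg_log_loss' tells GridSearchCV which one drives best_params_ and the refitted best_estimator_. For example (column_list as defined in the loop below):

```python
clf = grid_search_cv(df[column_list], df['y'])
print(clf.best_params_)                       # C selected by mean neg_log_loss over the 20 folds
print(clf.cv_results_['mean_test_accuracy'])  # the second metric is still recorded
```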
```python
for ratio in [1, 10, 100, 200]:
    display(Markdown("### ratio: %s" % str(ratio)))
    df_d = df[df['r'] < 1.0/ratio]
    column_list = ['x'] + ['z' + str(i) for i in range(FAKE)]
    grid_search_result = grid_search_cv(df_d[column_list], pd.Series(df_d['y'].values.flatten()))
    df_result = pd.DataFrame(grid_search_result.cv_results_)
    display(df_result[['params', 'mean_test_neg_log_loss', 'std_test_neg_log_loss']])
```
ratio: 1

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.077886 | 0.004059 |
| 1 | {u'C': 0.1} | -0.068239 | 0.006209 |
| 2 | {u'C': 1} | -0.068588 | 0.007413 |
| 3 | {u'C': 10} | -0.068781 | 0.007660 |
| 4 | {u'C': 100} | -0.069028 | 0.007834 |

ratio: 10

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.390054 | 0.012941 |
| 1 | {u'C': 0.1} | -0.326330 | 0.028839 |
| 2 | {u'C': 1} | -0.326252 | 0.038349 |
| 3 | {u'C': 10} | -0.329250 | 0.041436 |
| 4 | {u'C': 100} | -0.329717 | 0.041785 |

ratio: 100

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.661588 | 0.005336 |
| 1 | {u'C': 0.1} | -0.576321 | 0.037786 |
| 2 | {u'C': 1} | -0.554997 | 0.121124 |
| 3 | {u'C': 10} | -0.571344 | 0.137919 |
| 4 | {u'C': 100} | -0.587997 | 0.181002 |

ratio: 200

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.554096 | 0.031750 |
| 1 | {u'C': 0.1} | -0.516677 | 0.032798 |
| 2 | {u'C': 1} | -0.473868 | 0.112612 |
| 3 | {u'C': 10} | -0.530726 | 0.213819 |
| 4 | {u'C': 100} | -0.554108 | 0.242809 |
No change in the parameters at the neg_log_loss optimum could be confirmed under negative down sampling.
```python
import numpy as np

def make_class_weight(orig, down, log):
    """Compute class weights: re-weight the negatives so that the examples
    removed by down sampling are counted back in."""
    neg_w = float(orig) / down
    if log:
        neg_w = np.log(neg_w + 1)
    return {0: neg_w, 1: 1}
```
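For intuition, with hypothetical counts of 9,900 negatives before sampling and 99 after:

```python
make_class_weight(9900, 99, log=False)  # -> {0: 100.0, 1: 1}
make_class_weight(9900, 99, log=True)   # -> {0: 4.615..., 1: 1}  (np.log(101))
```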
```python
def check_learn_param(class_weight=False, log=False):
    for ratio in [1, 2, 5, 10, 20, 50, 100, 200]:
        display(Markdown("### ratio: %s" % str(ratio)))
        df_d = df[df['r'] < 1.0/ratio]
        cw = make_class_weight(len(df[df['y'] == 0]), len(df_d[df_d['y'] == 0]), log) if class_weight else None
        column_list = ['x'] + ['z' + str(i) for i in range(FAKE)]
        grid_search_result = grid_search_cv(df_d[column_list],
                                            pd.Series(df_d['y'].values.flatten()),
                                            class_weight=cw)
        df_result = pd.DataFrame(grid_search_result.cv_results_)
        display(df_result[['params', 'mean_test_neg_log_loss', 'std_test_neg_log_loss']])

check_learn_param(class_weight=True)
```
ratio: 1

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.077886 | 0.004059 |
| 1 | {u'C': 0.1} | -0.068239 | 0.006205 |
| 2 | {u'C': 1} | -0.068631 | 0.007474 |
| 3 | {u'C': 10} | -0.068918 | 0.007800 |
| 4 | {u'C': 100} | -0.069015 | 0.007842 |

ratio: 2

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.141263 | 0.008058 |
| 1 | {u'C': 0.1} | -0.122711 | 0.011923 |
| 2 | {u'C': 1} | -0.123334 | 0.014466 |
| 3 | {u'C': 10} | -0.123813 | 0.015084 |
| 4 | {u'C': 100} | -0.124233 | 0.015270 |

ratio: 5

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.310091 | 0.018207 |
| 1 | {u'C': 0.1} | -0.266433 | 0.026865 |
| 2 | {u'C': 1} | -0.267755 | 0.032059 |
| 3 | {u'C': 10} | -0.268953 | 0.034197 |
| 4 | {u'C': 100} | -0.269233 | 0.033790 |

ratio: 10

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.567181 | 0.029106 |
| 1 | {u'C': 0.1} | -0.485387 | 0.049120 |
| 2 | {u'C': 1} | -0.485032 | 0.061664 |
| 3 | {u'C': 10} | -0.486115 | 0.064668 |
| 4 | {u'C': 100} | -0.486117 | 0.063645 |

ratio: 20

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.991310 | 0.041022 |
| 1 | {u'C': 0.1} | -0.833435 | 0.086217 |
| 2 | {u'C': 1} | -0.830881 | 0.113298 |
| 3 | {u'C': 10} | -0.830402 | 0.123086 |
| 4 | {u'C': 100} | -0.833500 | 0.125014 |

ratio: 50

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -1.803340 | 0.068163 |
| 1 | {u'C': 0.1} | -1.551492 | 0.148318 |
| 2 | {u'C': 1} | -1.519491 | 0.213549 |
| 3 | {u'C': 10} | -1.532629 | 0.227115 |
| 4 | {u'C': 100} | -1.526625 | 0.225905 |

ratio: 100

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -2.630216 | 0.045231 |
| 1 | {u'C': 0.1} | -1.994720 | 0.314057 |
| 2 | {u'C': 1} | -1.969576 | 0.448479 |
| 3 | {u'C': 10} | -1.995761 | 0.481577 |
| 4 | {u'C': 100} | -1.961631 | 0.429366 |

ratio: 200

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -3.184818 | 0.122700 |
| 1 | {u'C': 0.1} | -2.228196 | 0.525339 |
| 2 | {u'C': 1} | -2.231107 | 0.676387 |
| 3 | {u'C': 10} | -2.318523 | 0.762673 |
| 4 | {u'C': 100} | -2.345855 | 0.750384 |
Contrary to expectation, neg_log_loss did not come out the same across ratios (the score is computed on the down-sampled, unweighted validation folds, whose class balance differs with the ratio, so the absolute values need not match). Still, the location of the optimum appears to be preserved.
Checking the coefficient weights
```python
def check_coefs_for_each_ratio(df_list, intercept_scale=1, class_weight=False, log=False):
    for ratio in [1, 2, 5, 10, 20, 50, 100, 200]:
        display(Markdown("### ratio: %s" % str(ratio)))
        for i in range(0, 10):
            df = df_list[i]
            df_d = df[df['r'] < 1.0/ratio]
            cw = make_class_weight(len(df[df['y'] == 0]), len(df_d[df_d['y'] == 0]), log) if class_weight else None
            logreg = LogisticRegression(penalty='l1', C=1, intercept_scaling=intercept_scale, class_weight=cw)
            column_list = ['x'] + ['z' + str(i) for i in range(FAKE)]
            logreg.fit(df_d[column_list], pd.Series(df_d['y'].values.flatten()))
            w_x = logreg.coef_.tolist()[0][0]
            w_z = sum([w * w for w in logreg.coef_.tolist()[0][1:]])
            print([round(x, 5) for x in [logreg.intercept_[0], w_x, w_z]])

check_coefs_for_each_ratio(df_list, intercept_scale=10000, class_weight=True)
```
ratio: 1
[-5.456, 2.75913, 0.28275]
[-4.68984, 2.35276, 0.93339]
[-5.93321, 2.52496, 0.73554]
[-4.49972, 2.07685, 0.49214]
[-5.07817, 2.58326, 0.65842]
[-4.18139, 2.36003, 0.29171]
[-5.01522, 2.07222, 0.72691]
[-4.63656, 2.53985, 0.13835]
[-2.93413, 2.46116, 0.66034]
[-4.58423, 2.31785, 0.56987]
ratio: 2
[-4.91084, 2.70796, 0.24419]
[-4.73899, 2.3222, 1.09332]
[-5.49024, 2.50968, 0.57039]
[-4.57879, 2.0469, 0.54657]
[-5.37927, 2.55492, 0.71975]
[-4.18683, 2.33364, 0.30944]
[-4.72929, 2.13681, 0.76036]
[-4.78652, 2.50797, 0.18094]
[-3.74175, 2.41311, 0.45017]
[-4.47882, 2.32848, 0.68062]
ratio: 5
[-5.52737, 2.75556, 0.44584]
[-4.32948, 2.43492, 0.95364]
[-5.36533, 2.48112, 0.79776]
[-4.49862, 2.30294, 0.67979]
[-4.66343, 2.56162, 0.58413]
[-3.80696, 2.3091, 0.61495]
[-4.68544, 2.18322, 0.82486]
[-4.55734, 2.52214, 0.21192]
[-3.49043, 2.57608, 0.64307]
[-4.55503, 2.22938, 0.85661]
ratio: 10
[-6.32183, 2.84784, 1.00543]
[-4.18039, 2.42215, 0.64773]
[-5.70742, 2.54972, 1.53197]
[-4.58312, 2.32635, 0.85909]
[-5.08476, 2.60348, 0.69908]
[-3.32753, 2.26653, 0.80517]
[-4.30632, 2.3637, 1.1924]
[-4.57177, 2.47107, 0.39637]
[-3.74324, 2.5844, 0.79115]
[-4.25131, 2.58967, 1.42814]
ratio: 20
[-3.73082, 2.97579, 1.38839]
[-4.48063, 2.30836, 0.82273]
[-5.26905, 2.81853, 1.29472]
[-4.26113, 2.13226, 0.80125]
[-5.23599, 2.62421, 0.66548]
[-2.60202, 2.29986, 1.61421]
[-4.92805, 2.39065, 2.24484]
[-4.5201, 2.85804, 0.65841]
[-4.47336, 2.67563, 0.54306]
[-4.70123, 2.38434, 1.21893]
ratio: 50
[-6.1262, 3.10075, 3.12108]
[-4.1213, 2.26967, 1.92844]
[-6.14338, 2.9595, 1.32346]
[-4.26593, 2.21141, 1.51868]
[-3.33964, 3.0612, 1.31777]
[-3.13537, 1.96733, 1.71331]
[-3.31786, 3.01146, 3.9546]
[-3.43166, 3.07311, 4.31343]
[-4.97486, 2.72593, 0.82405]
[-5.88227, 2.49085, 3.02388]
ratio: 100
[-8.0263, 5.20309, 14.74563]
[-3.96654, 2.50111, 5.92777]
[-3.77454, 3.32975, 4.06035]
[-2.66965, 2.29368, 2.19824]
[-9.43126, 5.73963, 11.59807]
[-4.36287, 1.76316, 1.47788]
[-1.57113, 3.0898, 4.08203]
[-4.62372, 3.20552, 5.05104]
[-4.9451, 2.44147, 0.9702]
[-8.34704, 4.66164, 11.21491]
ratio: 200
[-9.04834, 6.70274, 26.74032]
[-3.0757, 2.93908, 6.78269]
[4.31901, 3.79058, 15.57901]
[-3.1984, 2.82233, 2.86396]
[-9.31711, 5.73608, 16.86566]
[-9.31487, 5.56711, 19.90607]
[-0.56633, 2.71122, 15.52765]
[-6.83777, 6.20401, 20.82009]
[-1.64302, 8.69927, 9.08549]
[-7.99419, 5.54549, 9.48558]
The intercept does come out now, but the scatter is large, both in the weight of x (ideal: 2.302585) and in the weights of z (ideal: 0).
Taking the logarithm of the weights
Admittedly crude, but instead of scaling the negative class by, say, 10, scale it by the log of that factor, in the hope of mitigating the over-fitting.
```python
check_coefs_for_each_ratio(df_list, intercept_scale=10000, class_weight=True, log=True)
```
ratio: 1
[-5.06334, 2.75755, 0.2739]
[-4.285, 2.34939, 0.90332]
[-5.54231, 2.52186, 0.71662]
[-4.11011, 2.07534, 0.48425]
[-4.79435, 2.58338, 0.6687]
[-3.54648, 2.35229, 0.31755]
[-4.61396, 2.07039, 0.71198]
[-4.25854, 2.53891, 0.1378]
[-3.00963, 2.46912, 0.50412]
[-4.22915, 2.31719, 0.56841]
ratio: 2
[-4.9026, 2.71966, 0.36531]
[-4.14688, 2.31516, 1.05298]
[-5.48156, 2.51683, 0.83853]
[-3.98731, 2.04515, 0.54704]
[-4.81269, 2.55341, 0.71851]
[-3.62939, 2.32907, 0.29694]
[-4.14463, 2.13188, 0.74179]
[-3.78081, 2.50169, 0.1792]
[-3.0896, 2.41201, 0.45301]
[-3.6255, 2.31867, 0.64602]
ratio: 5
[-4.55296, 2.74552, 0.41394]
[-3.24339, 2.42041, 0.86243]
[-4.98751, 2.48683, 1.04366]
[-3.44419, 2.29147, 0.63844]
[-4.27659, 2.56317, 0.66561]
[-2.95918, 2.28709, 0.5224]
[-3.69191, 2.16886, 0.7741]
[-3.6336, 2.52211, 0.22204]
[-2.41829, 2.56805, 0.60344]
[-3.38312, 2.21611, 0.78615]
ratio: 10
[-4.97196, 2.82519, 0.98587]
[-2.9001, 2.39795, 0.56099]
[-3.63711, 2.48865, 1.09703]
[-3.22029, 2.29187, 0.7576]
[-3.71282, 2.59307, 0.70221]
[-2.18677, 2.21791, 0.57243]
[-3.16291, 2.32609, 1.03614]
[-2.98868, 2.4629, 0.40026]
[-2.37152, 2.55224, 0.64066]
[-3.01113, 2.52921, 1.17777]
ratio: 20
[-4.48405, 3.00463, 1.45081]
[-2.78745, 2.25786, 0.62019]
[-3.80631, 2.71905, 1.10831]
[-2.33652, 2.0821, 0.60125]
[-2.78862, 2.57325, 0.5167]
[-1.23831, 2.17586, 0.85325]
[-3.00064, 2.318, 1.66552]
[-2.54605, 2.8181, 0.56879]
[-2.62482, 2.64067, 0.43063]
[-2.9084, 2.34565, 1.14212]
ratio: 50
[-3.61535, 2.8079, 1.90597]
[-1.765, 2.1605, 1.30307]
[-2.90974, 2.74135, 0.71409]
[-1.00802, 2.01203, 0.95008]
[-2.69431, 2.86702, 1.32219]
[-1.26446, 1.80058, 0.77125]
[-1.10304, 2.69649, 2.32091]
[-1.15043, 2.74996, 1.84605]
[-2.011, 2.66888, 0.8709]
[-2.50288, 2.23787, 1.62665]
ratio: 100
[-2.38678, 3.55555, 2.05781]
[-1.07891, 2.12761, 2.53363]
[-1.02447, 2.88643, 2.16754]
[-0.47467, 1.97916, 1.08508]
[-3.95229, 3.45432, 2.45714]
[-1.86771, 1.53591, 0.67398]
[-0.29807, 2.81919, 2.541]
[-1.16426, 2.69413, 1.97518]
[-1.24353, 2.35371, 0.90051]
[-2.73079, 2.96275, 1.71517]
ratio: 200
[-2.57214, 3.65385, 3.48845]
[-0.68745, 2.47665, 2.19183]
[2.5264, 3.1807, 5.11516]
[-0.41136, 2.27868, 0.87186]
[-2.83457, 2.97796, 3.13823]
[-2.42656, 2.67244, 1.94582]
[0.97235, 2.61841, 6.52571]
[-0.15809, 3.33408, 3.0626]
[0.37536, 4.97194, 3.57977]
[-1.90523, 2.93487, 1.37925]
This curbed the problem to some extent, but unless the intercept itself is wanted, there seems to be no benefit to this weighting.
Correction via class_weight does not appear effective
At the end of
[https://swarbrickjones.wordpress.com/2017/03/28/cross-entropy-and-training-test-class-imbalance/:embed:cite]
we find:
"Also note that we heavily relied on the assumption that the positives/negatives in X were equidistributed to the positives/negatives in X' respectively. If that it is not true, then this analysis could be of limited use!"
As it says, strictly speaking we should not be using balanced.
That said, in any situation where we can down sample, the intercept weight can be computed in advance; if anything, estimating the intercept from the down-sampled data may be the less efficient option. So let us compute the intercept weight from the full set of examples ([tex:{ \ln \hat{\pi} }]), use the balanced option, and fix the intercept to 0 with fit_intercept=False.
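One way to see why only the intercept needs outside information (the standard case-control prior-correction result, stated here as a sketch): if every positive is kept and each negative is kept with probability 1/ratio, Bayes' rule shifts the log-odds by a constant,

```latex
\ln\frac{P(y=1 \mid x,\ \text{kept})}{P(y=0 \mid x,\ \text{kept})}
  = \ln\frac{P(y=1 \mid x)}{P(y=0 \mid x)} + \ln(\mathrm{ratio})
```

so the feature weights are untouched and only the intercept absorbs the sampling. Training with the intercept fixed at 0 and supplying [tex:{ \ln \hat{\pi} }] from the full data afterwards is therefore consistent.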
```python
def check_coefs_for_each_ratio_with_balanced(df_list):
    for ratio in [1, 2, 5, 10, 20, 50, 100, 200]:
        display(Markdown("### ratio: %s" % str(ratio)))
        for i in range(0, 10):
            df = df_list[i]
            # intercept weight estimated on the FULL data, before down sampling
            intercept = np.log(float(len(df[df['y'] == 1])) / len(df))
            df_d = df[df['r'] < 1.0/ratio]
            logreg = LogisticRegression(penalty='l1', C=1, fit_intercept=False, class_weight='balanced')
            column_list = ['x'] + ['z' + str(i) for i in range(FAKE)]
            logreg.fit(df_d[column_list], pd.Series(df_d['y'].values.flatten()))
            w_x = logreg.coef_.tolist()[0][0]
            w_z = sum([w * w for w in logreg.coef_.tolist()[0][1:]])
            print([round(x, 5) for x in [intercept, w_x, w_z]])

check_coefs_for_each_ratio_with_balanced(df_list)
```
ratio: 1
[-4.19971, 2.80006, 0.58413]
[-4.14144, 2.37942, 1.11962]
[-4.14775, 2.60704, 0.80826]
[-4.05129, 2.12227, 0.7282]
[-4.43966, 2.70717, 1.61314]
[-4.2475, 2.38503, 0.50004]
[-4.23361, 2.15592, 1.2708]
[-4.2687, 2.61576, 0.52414]
[-4.16048, 2.54132, 0.72488]
[-4.20639, 2.43589, 1.23491]
ratio: 2
[-4.19971, 2.73953, 0.51565]
[-4.14144, 2.32173, 1.09193]
[-4.14775, 2.59147, 0.94536]
[-4.05129, 2.08336, 0.76821]
[-4.43966, 2.6852, 1.72355]
[-4.2475, 2.35151, 0.54959]
[-4.23361, 2.19345, 1.20504]
[-4.2687, 2.57608, 0.56743]
[-4.16048, 2.48444, 0.72722]
[-4.20639, 2.41925, 1.15235]
ratio: 5
[-4.19971, 2.73559, 0.41098]
[-4.14144, 2.40825, 0.89882]
[-4.14775, 2.51915, 1.05386]
[-4.05129, 2.29807, 0.76]
[-4.43966, 2.63634, 1.39499]
[-4.2475, 2.27439, 0.62931]
[-4.23361, 2.21276, 1.2259]
[-4.2687, 2.58391, 0.63055]
[-4.16048, 2.61542, 0.69615]
[-4.20639, 2.27213, 1.08081]
ratio: 10
[-4.19971, 2.72486, 0.48284]
[-4.14144, 2.38394, 0.65321]
[-4.14775, 2.47043, 1.16771]
[-4.05129, 2.2557, 0.76595]
[-4.43966, 2.64495, 1.43846]
[-4.2475, 2.17495, 0.47283]
[-4.23361, 2.31494, 1.26874]
[-4.2687, 2.49734, 0.62739]
[-4.16048, 2.55463, 0.55818]
[-4.20639, 2.50076, 1.1878]
ratio: 20
[-4.19971, 2.79394, 0.49988]
[-4.14144, 2.20952, 0.49212]
[-4.14775, 2.61027, 0.60249]
[-4.05129, 2.04884, 0.57751]
[-4.43966, 2.53871, 0.77375]
[-4.2475, 2.09275, 0.34751]
[-4.23361, 2.24294, 1.47811]
[-4.2687, 2.77854, 0.68346]
[-4.16048, 2.61743, 0.496]
[-4.20639, 2.29761, 1.10332]
ratio: 50
[-4.19971, 2.57246, 0.9157]
[-4.14144, 2.07143, 0.86454]
[-4.14775, 2.63409, 0.61976]
[-4.05129, 1.91869, 0.65903]
[-4.43966, 2.65093, 0.9455]
[-4.2475, 1.69904, 0.30351]
[-4.23361, 2.48733, 1.46496]
[-4.2687, 2.55024, 0.71679]
[-4.16048, 2.56966, 0.81352]
[-4.20639, 2.03153, 0.87826]
ratio: 100
[-4.19971, 3.18644, 0.98824]
[-4.14144, 1.96485, 1.55056]
[-4.14775, 2.66571, 1.46552]
[-4.05129, 1.83133, 0.55313]
[-4.43966, 2.9996, 1.15968]
[-4.2475, 1.3735, 0.21852]
[-4.23361, 2.57707, 1.42351]
[-4.2687, 2.45839, 0.84825]
[-4.16048, 2.25478, 0.61956]
[-4.20639, 2.65479, 0.88477]
ratio: 200
[-4.19971, 3.15883, 1.66064]
[-4.14144, 2.32423, 1.09636]
[-4.14775, 2.85993, 2.00946]
[-4.05129, 2.1121, 0.3894]
[-4.43966, 2.5296, 1.53771]
[-4.2475, 2.26719, 0.72349]
[-4.23361, 2.36902, 2.97373]
[-4.2687, 3.02366, 1.29845]
[-4.16048, 3.97411, 1.76377]
[-4.20639, 2.64598, 0.87854]
This looks good. First, the intercept is estimated quite accurately regardless of down sampling, since it uses the full data. The weight of x (ideal: 2.302585) also seems to be the best obtained by any method so far; presumably fixing the intercept narrows the search space and yields a very good estimate.
Searching the learning parameters with balanced
```python
def check_learn_param_balanced():
    for ratio in [1, 2, 5, 10, 20, 50, 100, 200]:
        display(Markdown("### ratio: %s" % str(ratio)))
        df_d = df[df['r'] < 1.0/ratio]
        column_list = ['x'] + ['z' + str(i) for i in range(FAKE)]
        grid_search_result = grid_search_cv(df_d[column_list],
                                            pd.Series(df_d['y'].values.flatten()),
                                            class_weight='balanced',
                                            fit_intercept=False)
        df_result = pd.DataFrame(grid_search_result.cv_results_)
        display(df_result[['params', 'mean_test_neg_log_loss', 'std_test_neg_log_loss']])

check_learn_param_balanced()
```
ratio: 1

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.579756 | 0.016201 |
| 1 | {u'C': 0.1} | -0.557804 | 0.020383 |
| 2 | {u'C': 1} | -0.557315 | 0.020976 |
| 3 | {u'C': 10} | -0.557317 | 0.021046 |
| 4 | {u'C': 100} | -0.557318 | 0.021053 |

ratio: 2

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.614889 | 0.015515 |
| 1 | {u'C': 0.1} | -0.562747 | 0.024808 |
| 2 | {u'C': 1} | -0.561251 | 0.026736 |
| 3 | {u'C': 10} | -0.561220 | 0.026921 |
| 4 | {u'C': 100} | -0.561219 | 0.026940 |

ratio: 5

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.704480 | 0.007989 |
| 1 | {u'C': 0.1} | -0.569775 | 0.027439 |
| 2 | {u'C': 1} | -0.565177 | 0.031148 |
| 3 | {u'C': 10} | -0.565230 | 0.031662 |
| 4 | {u'C': 100} | -0.565249 | 0.031717 |

ratio: 10

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.692749 | 0.000850 |
| 1 | {u'C': 0.1} | -0.582792 | 0.040261 |
| 2 | {u'C': 1} | -0.570911 | 0.051377 |
| 3 | {u'C': 10} | -0.571530 | 0.053412 |
| 4 | {u'C': 100} | -0.571638 | 0.053624 |

ratio: 20

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.693147 | 2.135512e-16 |
| 1 | {u'C': 0.1} | -0.591359 | 5.392871e-02 |
| 2 | {u'C': 1} | -0.573781 | 8.087916e-02 |
| 3 | {u'C': 10} | -0.575191 | 8.687242e-02 |
| 4 | {u'C': 100} | -0.575433 | 8.755660e-02 |

ratio: 50

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.693147 | 1.110223e-16 |
| 1 | {u'C': 0.1} | -0.610073 | 4.553321e-02 |
| 2 | {u'C': 1} | -0.589434 | 7.575629e-02 |
| 3 | {u'C': 10} | -0.598316 | 8.391099e-02 |
| 4 | {u'C': 100} | -0.599609 | 8.501846e-02 |

ratio: 100

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.693147 | 2.220446e-16 |
| 1 | {u'C': 0.1} | -0.588712 | 4.332537e-02 |
| 2 | {u'C': 1} | -0.574640 | 1.232090e-01 |
| 3 | {u'C': 10} | -0.598204 | 1.625511e-01 |
| 4 | {u'C': 100} | -0.601811 | 1.678043e-01 |

ratio: 200

| | params | mean_test_neg_log_loss | std_test_neg_log_loss |
|---|---|---|---|
| 0 | {u'C': 0.01} | -0.693147 | 1.110223e-16 |
| 1 | {u'C': 0.1} | -0.580985 | 3.726425e-02 |
| 2 | {u'C': 1} | -0.583421 | 1.358876e-01 |
| 3 | {u'C': 10} | -0.626849 | 2.029742e-01 |
| 4 | {u'C': 100} | -0.643918 | 2.217282e-01 |
The learning parameter at the optimum does not seem to shift either. Looks good.
Summary: negative down sampling settings for logistic regression
The method using balanced:
- Compute the intercept weight before down sampling.
- Train with class_weight='balanced', fit_intercept=False.
- No Re-Calibration is needed.
If you really do not want to use balanced:
- Set intercept_scaling=10000.
- Re-Calibration is needed.
The usual disclaimers