
Negative Down Sampling in Logistic Regression

This post investigates the best method and settings for logistic regression when negative down sampling is applied.

Conclusion: before doing negative down sampling, estimate { P(y=1) } from the mean of the labels ( { \hat{\pi} } ). Its logarithm becomes the intercept weight ( { \beta_0 = \ln{\hat{\pi}} } ). The down-sampled data is then trained with class_weight='balanced' and fit_intercept=False.
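
As a minimal sketch of that recipe (assuming, as in the experiments below, a DataFrame df with a binary label column y, a down-sampled copy df_d, and a feature list column_list; solver='liblinear' is added here because recent scikit-learn versions require an explicit solver for the L1 penalty):

import numpy as np
from sklearn.linear_model import LogisticRegression

pi_hat = df['y'].mean()    # P(y=1) estimated on the full data, before down sampling
beta_0 = np.log(pi_hat)    # intercept weight: ln(pi_hat)

logreg = LogisticRegression(penalty='l1', C=1, solver='liblinear',
                            class_weight='balanced',  # re-balance the down-sampled data
                            fit_intercept=False)      # beta_0 supplies the intercept instead
logreg.fit(df_d[column_list], df_d['y'])
# at prediction time, add beta_0 back into the linear predictor (see the sketch near the end of this post)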

Background

With logistic regression, training does not go well if you change the class proportions of the sample. In http://quinonero.net/Publications/predicting-clicks-facebook.pdf, negative down sampling is used, and a computation called Re-Calibration corrects the predictions that were distorted by the down sampling back into predictions on the original distribution. However, when log-loss is taken as the loss function and the model parameters are optimized against it, this correction does not give satisfactory results. The likely reason is that the value of log-loss changes before and after down sampling, and so does the location of its optimum (the maximum of the neg_log_loss score, i.e. the minimum of the loss). For the verification I referred to the example on the following page: https://stats.stackexchange.com/questions/67903/does-down-sampling-change-logistic-regression-coefficients
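
For reference, the Re-Calibration in the Facebook paper maps a prediction p made on negatively down-sampled data back onto the original distribution via the negative sampling rate w (the fraction of negatives that were kept). A sketch of my reading of it:

def recalibrate(p, w):
    """Undo the effect on a predicted probability p of keeping only a fraction w of the negatives."""
    return p / (p + (1.0 - p) / w)

recalibrate(0.5, 0.01)  # a prediction of 0.5 at 1% negative sampling maps back to roughly 0.0099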

What I want to verify here

  • Since penalizing the intercept is not appropriate, set intercept_scaling to a large value
  • Does negative down sampling change the parameters at which neg_log_loss attains its optimum?
  • Effectiveness of a correction using class_weight
  • Verification of balanced

    Is it better to set intercept_scaling to a large value?

    Generating the experimental data

FAKE = 10 # number of fake variables
import pandas as pd
import random
def sample_data(size=10000, fake=FAKE):
    """ サンプルデータ生成。
    x==1 のとき 10%, x==0 のとき 1%。
    """
    y = []
    x = []
    z = [[] for i in range(fake)] # fake
    r = []
    for i in range(size):
        xx = 1 if random.random() < 0.05 else 0
        x.append(xx)
        if xx == 1:
            yy = 1 if random.random() < 0.1 else 0
        else:
            yy = 1 if random.random() < 0.01 else 0
        y.append(yy)
        for j in range(fake):
            zz = 1 if random.random() > 0.1 else 0
            z[j].append(zz)
        rr = random.random() if yy == 0 else 0  # r is used later for down sampling; positives keep r=0 so they are never dropped
        r.append(rr)
    data_dict = {'z'+str(i): z[i] for i in range(fake)}
    data_dict['y'] = y
    data_dict['x'] = x
    data_dict['r'] = r 
    df = pd.DataFrame.from_dict( data_dict )
    return df
df_test = sample_data()
df_test[df_test['r']< 0.005][:20]
r x y z0 z1 z2 z3 z4 z5 z6 z7 z8 z9
83 0.000000 0 1 1 1 1 1 1 1 1 1 1 1
132 0.000000 0 1 1 1 1 1 1 1 0 1 1 1
147 0.000000 0 1 1 1 1 1 1 0 1 1 1 1
324 0.005000 0 0 1 1 1 1 1 1 0 1 1 1
329 0.000000 1 1 1 1 1 1 1 1 1 1 1 1
363 0.001359 0 0 1 1 1 0 1 1 0 1 1 1
419 0.001078 0 0 1 1 1 1 1 1 1 1 1 1
494 0.000486 0 0 1 1 1 1 1 0 1 1 1 1
647 0.000000 1 1 0 1 1 0 1 1 1 1 1 1
698 0.003799 0 0 0 1 1 1 1 1 1 1 1 1
720 0.000000 0 1 1 1 1 0 1 1 1 1 1 1
737 0.000000 0 1 1 1 1 1 1 1 0 1 1 1
793 0.000000 0 1 1 0 1 1 1 1 1 1 1 1
797 0.000000 0 1 1 1 1 1 1 1 1 1 1 1
809 0.000000 0 1 1 1 1 1 0 1 0 1 1 1
925 0.000000 1 1 1 1 1 1 1 1 1 1 1 1
1035 0.000000 0 1 1 1 1 1 1 0 1 1 1 1
1093 0.000000 0 1 1 1 1 1 1 1 1 1 1 1
1110 0.000000 0 1 1 1 1 1 1 1 1 1 1 1
1145 0.000000 0 1 1 1 1 1 1 1 1 1 0 1
df_list = [sample_data() for i in range(10)]
df = df_list[0]

Checking the coefficients as the down-sampling ratio is varied

from sklearn.linear_model import LogisticRegression 
from IPython.core.display import display, Markdown
def check_coef_by_down_sampling(df_list, intercept_scale=1):
    for ratio in [1, 2, 5, 10, 20, 50, 100, 200]:
        display(Markdown("### ratio: %s" % str(ratio)))
        for i in range(0, 10):
            df = df_list[i]
            df_d = df[df['r'] < 1.0/ratio]
            logreg = LogisticRegression(penalty='l1', C=1, intercept_scaling=intercept_scale)
            column_list = ['x'] + ['z'+str(i) for i in range(FAKE)]
            logreg.fit(df_d[column_list], pd.Series(df_d['y'].values.flatten()))
            w_x = logreg.coef_.tolist()[0][0]
            w_z = sum([w*w for w in logreg.coef_.tolist()[0][1:]]) # sum of squares (these are fake features, so smaller is better)
            print([round(x,5) for x in [logreg.intercept_[0], w_x, w_z] ])
check_coef_by_down_sampling(df_list)

ratio: 1

[-4.91204, 2.74654, 0.17877]
[-4.0515, 2.3485, 0.83755]
[-5.42401, 2.51179, 0.4848]
[-4.03337, 2.06947, 0.44515]
[-4.69719, 2.57411, 0.59647]
[-3.76437, 2.35255, 0.35005]
[-4.39169, 2.05972, 0.61103]
[-4.11071, 2.53458, 0.17263]
[-2.99168, 2.46254, 0.63564]
[-3.99944, 2.30797, 0.55562]

ratio: 2

[-4.2107, 2.70525, 0.23816]
[-3.38297, 2.31001, 0.92924]
[-4.77464, 2.50352, 0.55668]
[-3.37317, 2.03352, 0.4837]
[-4.03382, 2.53982, 0.58345]
[-2.97067, 2.31873, 0.38171]
[-3.41311, 2.12467, 0.67182]
[-3.44971, 2.49872, 0.19891]
[-2.39652, 2.40039, 0.5989]
[-3.3235, 2.31424, 0.65252]

ratio: 5

[-3.35169, 2.72117, 0.23848]
[-2.27844, 2.41117, 0.77865]
[-3.52712, 2.46208, 0.72024]
[-2.23162, 2.27092, 0.56094]
[-3.22692, 2.54749, 0.57494]
[-1.91845, 2.26757, 0.57179]
[-2.54464, 2.15011, 0.68264]
[-2.45929, 2.51433, 0.26709]
[-1.19256, 2.56356, 0.80125]
[-2.54425, 2.20229, 0.74773]

ratio: 10

[-3.25613, 2.75951, 0.52]
[-1.42508, 2.38864, 0.48791]
[-2.41798, 2.45274, 0.90139]
[-1.60434, 2.24338, 0.59875]
[-1.95903, 2.56728, 0.65356]
[-1.02283, 2.17213, 0.57539]
[-1.56087, 2.2855, 0.89761]
[-1.4418, 2.45697, 0.47024]
[-0.72239, 2.52637, 0.75559]
[-1.6296, 2.47578, 0.9896]

ratio: 20

[-2.40845, 2.86692, 0.62559]
[-1.03619, 2.22112, 0.40693]
[-1.43415, 2.61045, 0.5262]
[-0.45256, 2.03482, 0.52997]
[-1.19788, 2.51803, 0.47951]
[0.0, 2.08584, 0.65126]
[-1.23638, 2.23326, 1.23025]
[-0.46025, 2.78909, 0.70659]
[-0.45358, 2.60837, 0.5026]
[-1.25244, 2.28267, 0.91885]

ratio: 50

[-0.66989, 2.5969, 0.87268]
[0.0, 2.07957, 0.88407]
[0.0, 2.63922, 0.5973]
[0.0, 1.92209, 0.67016]
[-0.52315, 2.67275, 0.84047]
[0.0, 1.69736, 0.38221]
[0.0, 2.49584, 1.50871]
[0.0, 2.58153, 0.85577]
[0.0, 2.57994, 0.81271]
[-0.88678, 2.0582, 0.9175]

ratio: 100

[0.0, 3.11442, 0.84759]
[0.0, 1.93646, 1.44019]
[0.0, 2.61444, 1.39052]
[0.0, 1.8008, 0.48262]
[-0.67109, 3.00093, 1.18918]
[0.0, 1.37616, 0.1724]
[0.0, 2.52167, 1.31528]
[0.0, 2.44528, 0.80262]
[0.0, 2.23357, 0.57737]
[0.0, 2.59899, 0.89581]

ratio: 200

[0.0, 2.86792, 1.13424]
[0.0, 2.1795, 0.93253]
[0.2598, 2.56551, 1.55929]
[0.05154, 1.95667, 0.3683]
[0.0, 2.43416, 1.52139]
[0.0, 2.09143, 0.38268]
[0.0, 2.10479, 1.78338]
[0.18917, 2.81943, 0.89836]
[0.0, 3.27659, 1.14055]
[0.0, 2.41917, 0.91509]

The first number is the intercept, the second is the coefficient of x (which actually contributes to generating y), and the third is the sum of squares of the coefficients of the completely unrelated z* features. As for the weight of x: since the rate goes from 1% up to 10%, a factor of 10, the ideal value is the natural log of 10, 2.302585.
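
To spell out where that number comes from: the log-odds gap implied by the data-generating process is { \ln\frac{0.10}{0.90} - \ln\frac{0.01}{0.99} \approx 2.40 }, which is close to { \ln 10 = 2.302585 } because both rates are small.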

Depending on the down-sampling ratio, we can see that:

  • the coefficient of x barely changes, while the intercept changes substantially;
  • the larger the down-sampling ratio, the more easily weights for the noise features z* are learned by mistake.
check_coef_by_down_sampling(df_list, intercept_scale=10000)

ratio: 1

[-5.45556, 2.75912, 0.28262]
[-4.6428, 2.35239, 0.92031]
[-5.78945, 2.52124, 0.64852]
[-4.49992, 2.07685, 0.49226]
[-5.20276, 2.58633, 0.69073]
[-4.34302, 2.36255, 0.28314]
[-5.01426, 2.0722, 0.7266]
[-4.6828, 2.54017, 0.13943]
[-3.47544, 2.47044, 0.48384]
[-4.54506, 2.31714, 0.56532]

ratio: 2

[-4.47498, 2.71107, 0.27932]
[-4.04952, 2.31372, 1.04347]
[-5.06899, 2.50929, 0.67163]
[-3.87872, 2.04466, 0.54536]
[-4.33338, 2.54452, 0.60214]
[-3.54863, 2.32828, 0.29453]
[-4.04781, 2.13091, 0.7376]
[-4.02237, 2.50532, 0.17511]
[-2.98376, 2.41174, 0.45404]
[-3.84655, 2.32399, 0.66282]

ratio: 5

[-3.97839, 2.73628, 0.37851]
[-2.93231, 2.40569, 0.85465]
[-4.28537, 2.47596, 0.97301]
[-2.35712, 2.27262, 0.55879]
[-4.03678, 2.56098, 0.76631]
[-2.53755, 2.27179, 0.45956]
[-3.13683, 2.15767, 0.73814]
[-2.9908, 2.51932, 0.22433]
[-1.8004, 2.56204, 0.57123]
[-3.09578, 2.21253, 0.79138]

ratio: 10

[-3.72013, 2.77988, 0.7003]
[-2.08104, 2.37616, 0.48348]
[-3.22456, 2.4633, 1.11463]
[-2.22255, 2.2543, 0.64177]
[-2.92231, 2.58176, 0.72124]
[-1.64388, 2.18194, 0.4027]
[-2.34855, 2.29581, 0.92035]
[-2.15464, 2.45468, 0.38139]
[-1.33011, 2.52442, 0.55185]
[-2.25828, 2.48406, 1.00449]

ratio: 20

[-3.00527, 2.89685, 0.8809]
[-1.62548, 2.2128, 0.43464]
[-2.2563, 2.62659, 0.68069]
[-0.98092, 2.04177, 0.4648]
[-1.47433, 2.52369, 0.45993]
[-0.57554, 2.09909, 0.42959]
[-1.85798, 2.26083, 1.32768]
[-1.37675, 2.77282, 0.49475]
[-1.26171, 2.60299, 0.35655]
[-1.74328, 2.28939, 0.97978]

ratio: 50

[-1.84861, 2.62367, 1.05701]
[-0.51107, 2.07379, 0.84463]
[-1.10631, 2.63535, 0.60667]
[-0.1167, 1.92674, 0.56876]
[-1.42434, 2.7109, 0.87041]
[-0.18693, 1.70358, 0.35005]
[-0.7797, 2.51998, 1.44281]
[0.09158, 2.58324, 0.88211]
[-0.46089, 2.57796, 0.76345]
[-1.71747, 2.1023, 1.21375]

ratio: 100

[-0.59101, 3.13524, 0.90419]
[-0.1334, 1.9426, 0.97734]
[0.07708, 2.61341, 1.3952]
[0.67081, 1.79631, 0.55785]
[-2.34283, 3.06153, 2.02438]
[-0.277, 1.39244, 0.18829]
[0.65186, 2.51285, 1.41825]
[0.3768, 2.44727, 0.88092]
[0.29966, 2.23463, 0.60469]
[-1.20692, 2.6587, 1.16563]

ratio: 200

[-0.40691, 2.90292, 1.28256]
[0.53202, 2.184, 0.87883]
[1.80984, 2.59295, 1.85358]
[0.95417, 1.92716, 0.19392]
[-0.92467, 2.48612, 1.94237]
[-0.38448, 2.11439, 0.50066]
[1.05276, 2.08541, 1.95901]
[2.15207, 2.8209, 1.84377]
[1.16178, 3.27734, 1.26292]
[-0.13707, 2.42296, 0.95436]

Setting intercept_scaling to a large value (the latter set of results)

The intercept now gets a weight without strain, and the weight of x also stably comes out close to 2.302585. Since negative down sampling deliberately shifts the class balance, there is no point at all in penalizing the intercept weight. Note that the reported intercept has apparently already been multiplied by the given intercept_scaling, so there is no need to multiply it again yourself. The source is here:

https://github.com/scikit-learn/scikit-learn/blob/45dc891c96eebdb3b81bf14c2737d8f6540fabfe/sklearn/svm/base.py#L902
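
A quick way to confirm this (a sketch; logreg, df_d and column_list are the objects from the fitting loop above): probabilities reconstructed directly from coef_ and intercept_ already match predict_proba, with no extra multiplication by intercept_scaling.

import numpy as np
X = df_d[column_list].values
manual = 1.0 / (1.0 + np.exp(-(X.dot(logreg.coef_.ravel()) + logreg.intercept_[0])))
assert np.allclose(manual, logreg.predict_proba(X)[:, 1])  # intercept_ is already on the right scale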

It is better to set intercept_scaling to a large value

  • It is better not to do negative down sampling at all, and if you must, a smaller ratio is better. (Even so, there are practical reasons for it, such as memory limits.)
  • Setting intercept_scaling to a large value avoids learning a needlessly distorted intercept weight.

    Does negative down sampling change the parameters at which neg_log_loss attains its optimum?

    Cross validation for the parameter search

from sklearn.model_selection import GridSearchCV
def grid_search_cv(X, y, class_weight=None, fit_intercept=True):    
    test_parameters = [
        {'C': [0.01, 0.1, 1, 10, 100] }
    ]
    clf = GridSearchCV(
        LogisticRegression(penalty='l1', 
                           intercept_scaling=10000,
                           class_weight=class_weight,
                           fit_intercept=fit_intercept),
        test_parameters,
        cv=20,
        scoring=['neg_log_loss', 'accuracy'],
        n_jobs=-1,
        refit='neg_log_loss'
    )
    clf.fit(X, y)
    return clf
for ratio in [1, 10, 100, 200]:
    display(Markdown("### ratio: %s" % str(ratio)))
    df_d = df[df['r'] < 1.0/ratio]
    column_list = ['x'] + ['z'+str(i) for i in range(FAKE)]
    grid_search_result = grid_search_cv(df_d[column_list], pd.Series(df_d['y'].values.flatten()))
    df_result = pd.DataFrame(grid_search_result.cv_results_)
    display(df_result[['params', 'mean_test_neg_log_loss', 'std_test_neg_log_loss']])

ratio: 1

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.077886 0.004059
1 {u'C': 0.1} -0.068239 0.006209
2 {u'C': 1} -0.068588 0.007413
3 {u'C': 10} -0.068781 0.007660
4 {u'C': 100} -0.069028 0.007834

ratio: 10

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.390054 0.012941
1 {u'C': 0.1} -0.326330 0.028839
2 {u'C': 1} -0.326252 0.038349
3 {u'C': 10} -0.329250 0.041436
4 {u'C': 100} -0.329717 0.041785

ratio: 100

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.661588 0.005336
1 {u'C': 0.1} -0.576321 0.037786
2 {u'C': 1} -0.554997 0.121124
3 {u'C': 10} -0.571344 0.137919
4 {u'C': 100} -0.587997 0.181002

ratio: 200

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.554096 0.031750
1 {u'C': 0.1} -0.516677 0.032798
2 {u'C': 1} -0.473868 0.112612
3 {u'C': 10} -0.530726 0.213819
4 {u'C': 100} -0.554108 0.242809

No difference could be confirmed in the parameters at which neg_log_loss attains its optimum under negative down sampling

  • The value of neg_log_loss itself differed. (Only natural, since the number of examples differs.)
  • There were slight differences in the learning parameter at which neg_log_loss attains its optimum, but they were within the noise, and no real difference could be established in this experiment.

    Effectiveness of a correction using class_weight

    With class_weight, the weight of the discarded examples can be piled back onto the remaining ones. The idea is that this might restore the balance during training.

import numpy as np
def make_class_weight(orig, down, log):
    """クラスの重みを計算する。down sampling によって減った事例数を戻すように補正する"""
    neg_w = float(orig) / down
    if log:
        neg_w = np.log(neg_w + 1)
    return {0: neg_w, 1: 1}
def check_learn_param(class_weight=False, log=False):
    for ratio in [1, 2, 5, 10, 20, 50, 100, 200]:
        display(Markdown("### ratio: %s" % str(ratio)))
        df_d = df[df['r'] < 1.0/ratio]
        cw = make_class_weight(len(df[df['y']==0]), len(df_d[df_d['y']==0]), log) if class_weight else None
        column_list = ['x'] + ['z'+str(i) for i in range(FAKE)]
        grid_search_result = grid_search_cv(df_d[column_list], 
                                            pd.Series(df_d['y'].values.flatten()),
                                            class_weight=cw)
        df_result = pd.DataFrame(grid_search_result.cv_results_)
        display(df_result[['params', 'mean_test_neg_log_loss', 'std_test_neg_log_loss']])
check_learn_param(class_weight=True)

ratio: 1

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.077886 0.004059
1 {u'C': 0.1} -0.068239 0.006205
2 {u'C': 1} -0.068631 0.007474
3 {u'C': 10} -0.068918 0.007800
4 {u'C': 100} -0.069015 0.007842

ratio: 2

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.141263 0.008058
1 {u'C': 0.1} -0.122711 0.011923
2 {u'C': 1} -0.123334 0.014466
3 {u'C': 10} -0.123813 0.015084
4 {u'C': 100} -0.124233 0.015270

ratio: 5

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.310091 0.018207
1 {u'C': 0.1} -0.266433 0.026865
2 {u'C': 1} -0.267755 0.032059
3 {u'C': 10} -0.268953 0.034197
4 {u'C': 100} -0.269233 0.033790

ratio: 10

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.567181 0.029106
1 {u'C': 0.1} -0.485387 0.049120
2 {u'C': 1} -0.485032 0.061664
3 {u'C': 10} -0.486115 0.064668
4 {u'C': 100} -0.486117 0.063645

ratio: 20

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.991310 0.041022
1 {u'C': 0.1} -0.833435 0.086217
2 {u'C': 1} -0.830881 0.113298
3 {u'C': 10} -0.830402 0.123086
4 {u'C': 100} -0.833500 0.125014

ratio: 50

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -1.803340 0.068163
1 {u'C': 0.1} -1.551492 0.148318
2 {u'C': 1} -1.519491 0.213549
3 {u'C': 10} -1.532629 0.227115
4 {u'C': 100} -1.526625 0.225905

ratio: 100

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -2.630216 0.045231
1 {u'C': 0.1} -1.994720 0.314057
2 {u'C': 1} -1.969576 0.448479
3 {u'C': 10} -1.995761 0.481577
4 {u'C': 100} -1.961631 0.429366

ratio: 200

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -3.184818 0.122700
1 {u'C': 0.1} -2.228196 0.525339
2 {u'C': 1} -2.231107 0.676387
3 {u'C': 10} -2.318523 0.762673
4 {u'C': 100} -2.345855 0.750384

Contrary to expectation, neg_log_loss did not take the same value. However, the location of its optimum appears to be preserved.

Checking the coefficient weights

def check_coefs_for_each_ratio(df_list, intercept_scale=1, class_weight=False, log=False):
    for ratio in [1, 2, 5, 10, 20, 50, 100, 200]:
        display(Markdown("### ratio: %s" % str(ratio)))
        for i in range(0, 10):
            df = df_list[i]
            df_d = df[df['r'] < 1.0/ratio]
            cw = make_class_weight(len(df[df['y']==0]), len(df_d[df_d['y']==0]), log) if class_weight else None
            logreg = LogisticRegression(penalty='l1', C=1, intercept_scaling=intercept_scale, class_weight=cw)
            column_list = ['x'] + ['z'+str(i) for i in range(FAKE)]
            logreg.fit(df_d[column_list], pd.Series(df_d['y'].values.flatten()))
            w_x = logreg.coef_.tolist()[0][0]
            w_z = sum([w*w for w in logreg.coef_.tolist()[0][1:]]) # sum of squares (these are fake features, so smaller is better)
            print([round(x,5) for x in [logreg.intercept_[0], w_x, w_z] ])
check_coefs_for_each_ratio(df_list, intercept_scale=10000, class_weight=True)

ratio: 1

[-5.456, 2.75913, 0.28275]
[-4.68984, 2.35276, 0.93339]
[-5.93321, 2.52496, 0.73554]
[-4.49972, 2.07685, 0.49214]
[-5.07817, 2.58326, 0.65842]
[-4.18139, 2.36003, 0.29171]
[-5.01522, 2.07222, 0.72691]
[-4.63656, 2.53985, 0.13835]
[-2.93413, 2.46116, 0.66034]
[-4.58423, 2.31785, 0.56987]

ratio: 2

[-4.91084, 2.70796, 0.24419]
[-4.73899, 2.3222, 1.09332]
[-5.49024, 2.50968, 0.57039]
[-4.57879, 2.0469, 0.54657]
[-5.37927, 2.55492, 0.71975]
[-4.18683, 2.33364, 0.30944]
[-4.72929, 2.13681, 0.76036]
[-4.78652, 2.50797, 0.18094]
[-3.74175, 2.41311, 0.45017]
[-4.47882, 2.32848, 0.68062]

ratio: 5

[-5.52737, 2.75556, 0.44584]
[-4.32948, 2.43492, 0.95364]
[-5.36533, 2.48112, 0.79776]
[-4.49862, 2.30294, 0.67979]
[-4.66343, 2.56162, 0.58413]
[-3.80696, 2.3091, 0.61495]
[-4.68544, 2.18322, 0.82486]
[-4.55734, 2.52214, 0.21192]
[-3.49043, 2.57608, 0.64307]
[-4.55503, 2.22938, 0.85661]

ratio: 10

[-6.32183, 2.84784, 1.00543]
[-4.18039, 2.42215, 0.64773]
[-5.70742, 2.54972, 1.53197]
[-4.58312, 2.32635, 0.85909]
[-5.08476, 2.60348, 0.69908]
[-3.32753, 2.26653, 0.80517]
[-4.30632, 2.3637, 1.1924]
[-4.57177, 2.47107, 0.39637]
[-3.74324, 2.5844, 0.79115]
[-4.25131, 2.58967, 1.42814]

ratio: 20

[-3.73082, 2.97579, 1.38839]
[-4.48063, 2.30836, 0.82273]
[-5.26905, 2.81853, 1.29472]
[-4.26113, 2.13226, 0.80125]
[-5.23599, 2.62421, 0.66548]
[-2.60202, 2.29986, 1.61421]
[-4.92805, 2.39065, 2.24484]
[-4.5201, 2.85804, 0.65841]
[-4.47336, 2.67563, 0.54306]
[-4.70123, 2.38434, 1.21893]

ratio: 50

[-6.1262, 3.10075, 3.12108]
[-4.1213, 2.26967, 1.92844]
[-6.14338, 2.9595, 1.32346]
[-4.26593, 2.21141, 1.51868]
[-3.33964, 3.0612, 1.31777]
[-3.13537, 1.96733, 1.71331]
[-3.31786, 3.01146, 3.9546]
[-3.43166, 3.07311, 4.31343]
[-4.97486, 2.72593, 0.82405]
[-5.88227, 2.49085, 3.02388]

ratio: 100

[-8.0263, 5.20309, 14.74563]
[-3.96654, 2.50111, 5.92777]
[-3.77454, 3.32975, 4.06035]
[-2.66965, 2.29368, 2.19824]
[-9.43126, 5.73963, 11.59807]
[-4.36287, 1.76316, 1.47788]
[-1.57113, 3.0898, 4.08203]
[-4.62372, 3.20552, 5.05104]
[-4.9451, 2.44147, 0.9702]
[-8.34704, 4.66164, 11.21491]

ratio: 200

[-9.04834, 6.70274, 26.74032]
[-3.0757, 2.93908, 6.78269]
[4.31901, 3.79058, 15.57901]
[-3.1984, 2.82233, 2.86396]
[-9.31711, 5.73608, 16.86566]
[-9.31487, 5.56711, 19.90607]
[-0.56633, 2.71122, 15.52765]
[-6.83777, 6.20401, 20.82009]
[-1.64302, 8.69927, 9.08549]
[-7.99419, 5.54549, 9.48558]

The intercept does get a weight, but the spread in the weight of x (ideally 2.302585) and in the weights of z (ideally 0) is large.

Taking the log of the weight

It is simplistic, but instead of weighting by a factor of 10, we weight by log(10), hoping to mitigate the overfitting.

check_coefs_for_each_ratio(df_list, intercept_scale=10000, class_weight=True, log=True)

ratio: 1

[-5.06334, 2.75755, 0.2739]
[-4.285, 2.34939, 0.90332]
[-5.54231, 2.52186, 0.71662]
[-4.11011, 2.07534, 0.48425]
[-4.79435, 2.58338, 0.6687]
[-3.54648, 2.35229, 0.31755]
[-4.61396, 2.07039, 0.71198]
[-4.25854, 2.53891, 0.1378]
[-3.00963, 2.46912, 0.50412]
[-4.22915, 2.31719, 0.56841]

ratio: 2

[-4.9026, 2.71966, 0.36531]
[-4.14688, 2.31516, 1.05298]
[-5.48156, 2.51683, 0.83853]
[-3.98731, 2.04515, 0.54704]
[-4.81269, 2.55341, 0.71851]
[-3.62939, 2.32907, 0.29694]
[-4.14463, 2.13188, 0.74179]
[-3.78081, 2.50169, 0.1792]
[-3.0896, 2.41201, 0.45301]
[-3.6255, 2.31867, 0.64602]

ratio: 5

[-4.55296, 2.74552, 0.41394]
[-3.24339, 2.42041, 0.86243]
[-4.98751, 2.48683, 1.04366]
[-3.44419, 2.29147, 0.63844]
[-4.27659, 2.56317, 0.66561]
[-2.95918, 2.28709, 0.5224]
[-3.69191, 2.16886, 0.7741]
[-3.6336, 2.52211, 0.22204]
[-2.41829, 2.56805, 0.60344]
[-3.38312, 2.21611, 0.78615]

ratio: 10

[-4.97196, 2.82519, 0.98587]
[-2.9001, 2.39795, 0.56099]
[-3.63711, 2.48865, 1.09703]
[-3.22029, 2.29187, 0.7576]
[-3.71282, 2.59307, 0.70221]
[-2.18677, 2.21791, 0.57243]
[-3.16291, 2.32609, 1.03614]
[-2.98868, 2.4629, 0.40026]
[-2.37152, 2.55224, 0.64066]
[-3.01113, 2.52921, 1.17777]

ratio: 20

[-4.48405, 3.00463, 1.45081]
[-2.78745, 2.25786, 0.62019]
[-3.80631, 2.71905, 1.10831]
[-2.33652, 2.0821, 0.60125]
[-2.78862, 2.57325, 0.5167]
[-1.23831, 2.17586, 0.85325]
[-3.00064, 2.318, 1.66552]
[-2.54605, 2.8181, 0.56879]
[-2.62482, 2.64067, 0.43063]
[-2.9084, 2.34565, 1.14212]

ratio: 50

[-3.61535, 2.8079, 1.90597]
[-1.765, 2.1605, 1.30307]
[-2.90974, 2.74135, 0.71409]
[-1.00802, 2.01203, 0.95008]
[-2.69431, 2.86702, 1.32219]
[-1.26446, 1.80058, 0.77125]
[-1.10304, 2.69649, 2.32091]
[-1.15043, 2.74996, 1.84605]
[-2.011, 2.66888, 0.8709]
[-2.50288, 2.23787, 1.62665]

ratio: 100

[-2.38678, 3.55555, 2.05781]
[-1.07891, 2.12761, 2.53363]
[-1.02447, 2.88643, 2.16754]
[-0.47467, 1.97916, 1.08508]
[-3.95229, 3.45432, 2.45714]
[-1.86771, 1.53591, 0.67398]
[-0.29807, 2.81919, 2.541]
[-1.16426, 2.69413, 1.97518]
[-1.24353, 2.35371, 0.90051]
[-2.73079, 2.96275, 1.71517]

ratio: 200

[-2.57214, 3.65385, 3.48845]
[-0.68745, 2.47665, 2.19183]
[2.5264, 3.1807, 5.11516]
[-0.41136, 2.27868, 0.87186]
[-2.83457, 2.97796, 3.13823]
[-2.42656, 2.67244, 1.94582]
[0.97235, 2.61841, 6.52571]
[-0.15809, 3.33408, 3.0626]
[0.37536, 4.97194, 3.57977]
[-1.90523, 2.93487, 1.37925]

This curbs the problem to some extent, but unless the goal is specifically to estimate the intercept, there seems to be no particular benefit to adding these weights.

Correction using class_weight does not seem effective

  • For negative down sampling, applying class weights does not seem to bring any benefit.

    Is it OK to use balanced?

    If balanced is simply used with logistic regression, the information carried by the intercept weight is completely lost (the classes are reweighted to an effective 1:1 balance).

https://swarbrickjones.wordpress.com/2017/03/28/cross-entropy-and-training-test-class-imbalance/

At the end of that post it says:

Also note that we heavily relied on the assumption that the positives/negatives in X were equidistributed to the positives/negatives in X' respectively. If that it is not true, then this analysis could be of limited use!

As this says, strictly speaking we should not be using balanced. However, in any situation where we can down sample, the intercept weight should be obtainable beforehand. If anything, estimating the intercept weight from the down-sampled data may be the less efficient option. So let us compute the intercept weight from the full set of examples ( { \ln \hat{\pi} } ), use the balanced option, fix the intercept to 0 by setting fit_intercept=False, and see what happens.

def check_coefs_for_each_ratio_with_balanced(df_list):
    for ratio in [1, 2, 5, 10, 20, 50, 100, 200]:
        display(Markdown("### ratio: %s" % str(ratio)))
        for i in range(0, 10):
            df = df_list[i]
            intercept = np.log( float(len(df[df['y'] == 1])) / len(df) )
            df_d = df[df['r'] < 1.0/ratio]
            logreg = LogisticRegression(penalty='l1', C=1, fit_intercept=False, class_weight='balanced')
            column_list = ['x'] + ['z'+str(i) for i in range(FAKE)]
            logreg.fit(df_d[column_list], pd.Series(df_d['y'].values.flatten()))
            w_x = logreg.coef_.tolist()[0][0]
            w_z = sum([w*w for w in logreg.coef_.tolist()[0][1:]]) # sum of squares (these are fake features, so smaller is better)
            print([round(x,5) for x in [intercept, w_x, w_z] ])
check_coefs_for_each_ratio_with_balanced(df_list)

ratio: 1

[-4.19971, 2.80006, 0.58413]
[-4.14144, 2.37942, 1.11962]
[-4.14775, 2.60704, 0.80826]
[-4.05129, 2.12227, 0.7282]
[-4.43966, 2.70717, 1.61314]
[-4.2475, 2.38503, 0.50004]
[-4.23361, 2.15592, 1.2708]
[-4.2687, 2.61576, 0.52414]
[-4.16048, 2.54132, 0.72488]
[-4.20639, 2.43589, 1.23491]

ratio: 2

[-4.19971, 2.73953, 0.51565]
[-4.14144, 2.32173, 1.09193]
[-4.14775, 2.59147, 0.94536]
[-4.05129, 2.08336, 0.76821]
[-4.43966, 2.6852, 1.72355]
[-4.2475, 2.35151, 0.54959]
[-4.23361, 2.19345, 1.20504]
[-4.2687, 2.57608, 0.56743]
[-4.16048, 2.48444, 0.72722]
[-4.20639, 2.41925, 1.15235]

ratio: 5

[-4.19971, 2.73559, 0.41098]
[-4.14144, 2.40825, 0.89882]
[-4.14775, 2.51915, 1.05386]
[-4.05129, 2.29807, 0.76]
[-4.43966, 2.63634, 1.39499]
[-4.2475, 2.27439, 0.62931]
[-4.23361, 2.21276, 1.2259]
[-4.2687, 2.58391, 0.63055]
[-4.16048, 2.61542, 0.69615]
[-4.20639, 2.27213, 1.08081]

ratio: 10

[-4.19971, 2.72486, 0.48284]
[-4.14144, 2.38394, 0.65321]
[-4.14775, 2.47043, 1.16771]
[-4.05129, 2.2557, 0.76595]
[-4.43966, 2.64495, 1.43846]
[-4.2475, 2.17495, 0.47283]
[-4.23361, 2.31494, 1.26874]
[-4.2687, 2.49734, 0.62739]
[-4.16048, 2.55463, 0.55818]
[-4.20639, 2.50076, 1.1878]

ratio: 20

[-4.19971, 2.79394, 0.49988]
[-4.14144, 2.20952, 0.49212]
[-4.14775, 2.61027, 0.60249]
[-4.05129, 2.04884, 0.57751]
[-4.43966, 2.53871, 0.77375]
[-4.2475, 2.09275, 0.34751]
[-4.23361, 2.24294, 1.47811]
[-4.2687, 2.77854, 0.68346]
[-4.16048, 2.61743, 0.496]
[-4.20639, 2.29761, 1.10332]

ratio: 50

[-4.19971, 2.57246, 0.9157]
[-4.14144, 2.07143, 0.86454]
[-4.14775, 2.63409, 0.61976]
[-4.05129, 1.91869, 0.65903]
[-4.43966, 2.65093, 0.9455]
[-4.2475, 1.69904, 0.30351]
[-4.23361, 2.48733, 1.46496]
[-4.2687, 2.55024, 0.71679]
[-4.16048, 2.56966, 0.81352]
[-4.20639, 2.03153, 0.87826]

ratio: 100

[-4.19971, 3.18644, 0.98824]
[-4.14144, 1.96485, 1.55056]
[-4.14775, 2.66571, 1.46552]
[-4.05129, 1.83133, 0.55313]
[-4.43966, 2.9996, 1.15968]
[-4.2475, 1.3735, 0.21852]
[-4.23361, 2.57707, 1.42351]
[-4.2687, 2.45839, 0.84825]
[-4.16048, 2.25478, 0.61956]
[-4.20639, 2.65479, 0.88477]

ratio: 200

[-4.19971, 3.15883, 1.66064]
[-4.14144, 2.32423, 1.09636]
[-4.14775, 2.85993, 2.00946]
[-4.05129, 2.1121, 0.3894]
[-4.43966, 2.5296, 1.53771]
[-4.2475, 2.26719, 0.72349]
[-4.23361, 2.36902, 2.97373]
[-4.2687, 3.02366, 1.29845]
[-4.16048, 3.97411, 1.76377]
[-4.20639, 2.64598, 0.87854]

This looks good. First, the intercept is estimated quite accurately, since it is computed from all of the examples regardless of down sampling. The weight of x (ideally 2.302585) also seems to be the best value obtained by any of the methods so far. Presumably this is because fixing the intercept narrows the search space, which yields a very good estimate.
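
One practical note, as a sketch (intercept, logreg and column_list are the objects from the code above): because fit_intercept=False leaves logreg.intercept_ at 0, the precomputed intercept { \ln \hat{\pi} } has to be added back by hand at prediction time; once it is, no further Re-Calibration step should be required.

def predict_proba_fixed_intercept(logreg, intercept, X):
    """P(y=1) using an externally supplied intercept (the model was fitted with fit_intercept=False)."""
    scores = intercept + X.dot(logreg.coef_.ravel())
    return 1.0 / (1.0 + np.exp(-scores))

p = predict_proba_fixed_intercept(logreg, intercept, df[column_list].values)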

Searching the learning parameters with balanced

def check_learn_param_balanced():
    for ratio in [1, 2, 5, 10, 20, 50, 100, 200]:
        display(Markdown("### ratio: %s" % str(ratio)))
        df_d = df[df['r'] < 1.0/ratio]
        column_list = ['x'] + ['z'+str(i) for i in range(FAKE)]
        grid_search_result = grid_search_cv(df_d[column_list], 
                                            pd.Series(df_d['y'].values.flatten()),
                                            class_weight='balanced',
                                            fit_intercept=False)
        df_result = pd.DataFrame(grid_search_result.cv_results_)
        display(df_result[['params', 'mean_test_neg_log_loss', 'std_test_neg_log_loss']])
check_learn_param_balanced()

ratio: 1

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.579756 0.016201
1 {u'C': 0.1} -0.557804 0.020383
2 {u'C': 1} -0.557315 0.020976
3 {u'C': 10} -0.557317 0.021046
4 {u'C': 100} -0.557318 0.021053

ratio: 2

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.614889 0.015515
1 {u'C': 0.1} -0.562747 0.024808
2 {u'C': 1} -0.561251 0.026736
3 {u'C': 10} -0.561220 0.026921
4 {u'C': 100} -0.561219 0.026940

ratio: 5

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.704480 0.007989
1 {u'C': 0.1} -0.569775 0.027439
2 {u'C': 1} -0.565177 0.031148
3 {u'C': 10} -0.565230 0.031662
4 {u'C': 100} -0.565249 0.031717

ratio: 10

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.692749 0.000850
1 {u'C': 0.1} -0.582792 0.040261
2 {u'C': 1} -0.570911 0.051377
3 {u'C': 10} -0.571530 0.053412
4 {u'C': 100} -0.571638 0.053624

ratio: 20

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.693147 2.135512e-16
1 {u'C': 0.1} -0.591359 5.392871e-02
2 {u'C': 1} -0.573781 8.087916e-02
3 {u'C': 10} -0.575191 8.687242e-02
4 {u'C': 100} -0.575433 8.755660e-02

ratio: 50

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.693147 1.110223e-16
1 {u'C': 0.1} -0.610073 4.553321e-02
2 {u'C': 1} -0.589434 7.575629e-02
3 {u'C': 10} -0.598316 8.391099e-02
4 {u'C': 100} -0.599609 8.501846e-02

ratio: 100

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.693147 2.220446e-16
1 {u'C': 0.1} -0.588712 4.332537e-02
2 {u'C': 1} -0.574640 1.232090e-01
3 {u'C': 10} -0.598204 1.625511e-01
4 {u'C': 100} -0.601811 1.678043e-01

ratio: 200

params mean_test_neg_log_loss std_test_neg_log_loss
0 {u'C': 0.01} -0.693147 1.110223e-16
1 {u'C': 0.1} -0.580985 3.726425e-02
2 {u'C': 1} -0.583421 1.358876e-01
3 {u'C': 10} -0.626849 2.029742e-01
4 {u'C': 100} -0.643918 2.217282e-01

The learning parameter at the optimum does not appear to shift either, so this looks fine.

Summary of Negative Down Sampling settings for Logistic Regression

  • The method using balanced
    • Compute the intercept weight before down sampling: { \ln \hat{\pi} }
    • Train with class_weight='balanced', fit_intercept=False
    • No Re-Calibration is needed
  • If you really do not want to use balanced
    • Set intercept_scaling=10000
    • Re-Calibration is needed
