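Background

Trying to train a model with scikit-learn on a pandas DataFrame uses up all available memory.

Converting to various sparse types before training

Preparation

First, the following data is used.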

measurements = [
    {'city': 'Dubai', 'temperature': 31.0, 'country': 'U.A.E.'},
    {'city': 'London', 'country': 'U.K.', 'temperature': 27.0},
    {'city': 'San Fransisco', 'country': 'U.S.', 'temperature': 24.0},
]

It is first converted to a column-oriented dict so that it can be passed to DataFrame.

x_column_dict = {}
for d in measurements:
    for c, v in d.items():
        if c in x_column_dict:
            x_column_dict[c].append(v)
        else:
            x_column_dict[c] = [v]
x_column_dict

{'city': ['Dubai', 'London', 'San Fransisco'],
 'country': ['U.A.E.', 'U.K.', 'U.S.'],
 'temperature': [31.0, 27.0, 24.0]}
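The same column-oriented dict can also be built with `collections.defaultdict`, which removes the explicit membership check. This is only a sketch of an alternative, and it assumes, as in the data above, that every record has the same keys.

```python
from collections import defaultdict

measurements = [
    {'city': 'Dubai', 'temperature': 31.0, 'country': 'U.A.E.'},
    {'city': 'London', 'country': 'U.K.', 'temperature': 27.0},
    {'city': 'San Fransisco', 'country': 'U.S.', 'temperature': 24.0},
]

# Missing keys start out as an empty list, so no "if c in ..." branch is needed.
x_column_dict = defaultdict(list)
for d in measurements:
    for c, v in d.items():
        x_column_dict[c].append(v)

# Convert back to a plain dict before handing it to other code.
x_column_dict = dict(x_column_dict)
```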

import pandas as pd
df = pd.DataFrame(x_column_dict)
df

	city	country	temperature
0	Dubai	U.A.E.	31.0
1	London	U.K.	27.0
2	San Fransisco	U.S.	24.0

Prepare the training labels.

import numpy as np
y = np.array([1,0,0])

Use `sparse=True` with get_dummies.

X = pd.get_dummies(df, sparse=True)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 7 columns):
temperature           3 non-null float64
city_Dubai            3 non-null uint8
city_London           3 non-null uint8
city_San Fransisco    3 non-null uint8
country_U.A.E.        3 non-null uint8
country_U.K.          3 non-null uint8
country_U.S.          3 non-null uint8
dtypes: float64(1), uint8(6)
memory usage: 114.0 bytes

↑ The result is not a SparseDataFrame, so is the `sparse=True` option a sham?

from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Training does work, for what it is worth (the input is a plain DataFrame).
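In pandas 1.0 and later, where `SparseDataFrame` no longer exists, `get_dummies(..., sparse=True)` still returns an ordinary `DataFrame`, but the dummy columns carry a sparse dtype. The sketch below assumes such a version and inspects the dtype of each column rather than the class of the frame. The older version used in this post may behave differently.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Dubai', 'London', 'San Fransisco'],
    'country': ['U.A.E.', 'U.K.', 'U.S.'],
    'temperature': [31.0, 27.0, 24.0],
})

X = pd.get_dummies(df, sparse=True)

# Record, for every column, whether it is stored with a sparse dtype.
is_sparse = {col: isinstance(dtype, pd.SparseDtype) for col, dtype in X.dtypes.items()}
for col, flag in is_sparse.items():
    print(col, X[col].dtype, 'sparse' if flag else 'dense')
```

Only the columns created by get_dummies are sparse. The numeric `temperature` column is passed through unchanged and stays dense.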

Use `to_sparse()`.

X = pd.get_dummies(df).to_sparse()
X.info()

<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 7 columns):
temperature           3 non-null float64
city_Dubai            3 non-null uint8
city_London           3 non-null uint8
city_San Fransisco    3 non-null uint8
country_U.A.E.        3 non-null uint8
country_U.K.          3 non-null uint8
country_U.S.          3 non-null uint8
dtypes: float64(1), uint8(6)
memory usage: 114.0 bytes

The frame has indeed been converted to a SparseDataFrame. However, the memory usage is unchanged.

logistic.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Training works here as well.

In addition, use `fill_value=0`.

X = pd.get_dummies(df).to_sparse(fill_value=0)
X.info()

<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 7 columns):
temperature           3 non-null float64
city_Dubai            3 non-null uint8
city_London           3 non-null uint8
city_San Fransisco    3 non-null uint8
country_U.A.E.        3 non-null uint8
country_U.K.          3 non-null uint8
country_U.S.          3 non-null uint8
dtypes: float64(1), uint8(6)
memory usage: 102.0 bytes

↑ Memory usage has gone down. Note that this is the size before training.

logistic.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

↑ In practice, even after the conversion above, training still uses a lot of memory.
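The three-row example is too small to show much of a difference. A rough way to see the effect of storing only the non-zero entries is to compare `memory_usage` on a larger synthetic column. The data below is made up for illustration. The sketch assumes pandas 1.0 or later, where `to_sparse()` is no longer available and `get_dummies(..., sparse=True)` is the nearest equivalent.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows, one categorical column with up to 100 distinct values.
rng = np.random.RandomState(0)
cities = pd.Series(rng.randint(0, 100, size=10000).astype(str), name='city')

dense = pd.get_dummies(cities, dtype=float)
sparse = pd.get_dummies(cities, dtype=float, sparse=True)

# Sum the per-column byte counts, ignoring the index.
dense_bytes = int(dense.memory_usage(index=False).sum())
sparse_bytes = int(sparse.memory_usage(index=False).sum())
print(dense_bytes, sparse_bytes)
```

This measures only the size of the frame itself. As noted above, what happens to memory during `fit` is a separate question.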

New!! Going further, convert to coo.

[https://pandas.pydata.org/pandas-docs/stable/generated/pandas.SparseSeries.to_coo.html]

It turns out the frame can also be converted to a scipy sparse matrix. (Thanks to @hagino3000 for pointing this out.)

X = pd.get_dummies(df).to_sparse(fill_value=0)
Xcoo = X.to_coo()
print(Xcoo)

  (0, 0)    31.0
  (1, 0)    27.0
  (2, 0)    24.0
  (0, 1)    1.0
  (1, 2)    1.0
  (2, 3)    1.0
  (0, 4)    1.0
  (1, 5)    1.0
  (2, 6)    1.0

logistic.fit(Xcoo, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

↑ Memory usage during training drops sharply.
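`to_sparse()` and `SparseDataFrame` were removed in pandas 1.0. A minimal sketch of the same idea with the newer API, assuming pandas 1.0 or later together with scipy and scikit-learn, is to make every column sparse and then call the `DataFrame.sparse.to_coo()` accessor. Casting the whole frame with `astype` is a choice made for this sketch, because the accessor requires all columns to be sparse.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'city': ['Dubai', 'London', 'San Fransisco'],
    'country': ['U.A.E.', 'U.K.', 'U.S.'],
    'temperature': [31.0, 27.0, 24.0],
})
y = np.array([1, 0, 0])

# Dummy columns come back sparse; the numeric 'temperature' column stays dense.
X = pd.get_dummies(df, sparse=True, dtype=float)

# The .sparse accessor only works when every column is sparse, so cast them all.
X = X.astype(pd.SparseDtype(float, 0.0))

# scipy.sparse COO matrix holding only the stored (non-zero) entries.
Xcoo = X.sparse.to_coo()

logistic = LogisticRegression()
logistic.fit(Xcoo, y)
```

The fitted coefficients will not necessarily match the numbers in this post, since newer scikit-learn releases use a different default solver.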

Continued

logistic.coef_

array([[-0.01603272,  0.43215653, -0.25978198, -0.26661549,  0.43215653,
        -0.25978198, -0.26661549]])

w = dict(zip(X.columns.tolist(), logistic.coef_[0]))
w

{'city_Dubai': 0.43215653143933291,
 'city_London': -0.25978198285599019,
 'city_San Fransisco': -0.26661548568954441,
 'country_U.A.E.': 0.43215653143933291,
 'country_U.K.': -0.25978198285599019,
 'country_U.S.': -0.26661548568954441,
 'temperature': -0.016032719041480594}

b = logistic.intercept_[0]
b

-0.094240937106201669
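The weights `w` and the intercept `b` define the model's prediction as sigmoid(w·x + b). As a worked check, the sketch below recomputes the probability of class 1 for two of the training rows using only the standard library. The values are copied (rounded) from the output above, and the helper name `predict_proba_one` is made up for this example.

```python
import math

w = {
    'temperature': -0.016032719,
    'city_Dubai': 0.432156531,
    'city_London': -0.259781983,
    'city_San Fransisco': -0.266615486,
    'country_U.A.E.': 0.432156531,
    'country_U.K.': -0.259781983,
    'country_U.S.': -0.266615486,
}
b = -0.094240937


def predict_proba_one(x):
    """Probability of class 1 for one row given as {feature name: value}."""
    z = b + sum(w[name] * value for name, value in x.items())
    return 1.0 / (1.0 + math.exp(-z))


# Features that are zero can simply be left out of the row dict.
p_dubai = predict_proba_one({'temperature': 31.0, 'city_Dubai': 1, 'country_U.A.E.': 1})
p_london = predict_proba_one({'temperature': 27.0, 'city_London': 1, 'country_U.K.': 1})
print(p_dubai, p_london)
```

The Dubai row, the only one labelled 1, comes out above 0.5, and the London row comes out below it.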

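Summary

The `sparse=True` option of get_dummies had no visible effect here (the result was a plain DataFrame).
to_sparse(fill_value=0) drops the zeros from storage.
to_coo() sharply reduces memory use during training. (Thanks to @hagino3000.)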

Blog of 中野智文

Notes of a data "maeshorist" (preprocessing specialist)

scikit-learn: converting a DataFrame to various sparse types before training
