Comparing pedestrian counts with Wi-Fi data

Pedestrian counts (survey_data) and Wi-Fi address counts (wifi_data) are compared using a hierarchical linear model.

Procedure:

  • Use two days of data as training data and estimate per-point regression coefficients (intercept and slope) with a mixed linear model (statsmodels' mixedlm, the counterpart of R's lmer)
  • Using those coefficients, compute predicted values per point and hour from the remaining day's Wi-Fi data
  • Draw a scatter plot of the predictions against the observations

Manual page: http://www.statsmodels.org/stable/mixed_linear.html
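Before running on the actual CSVs, the modelling step can be illustrated in isolation. The sketch below fits the same kind of model (a random intercept and slope per point) to synthetic data; the group count and all data values here are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for point in range(5):                      # 5 hypothetical observation points
    a = 50 + rng.normal(0, 10)              # per-point intercept
    b = 0.3 + rng.normal(0, 0.05)           # per-point slope
    for _ in range(40):
        wifi = rng.uniform(0, 500)
        rows.append({"point": point,
                     "wifi_data": wifi,
                     "survey_data": a + b * wifi + rng.normal(0, 5)})
df = pd.DataFrame(rows)

# Random intercept and slope for each point, as in the notebook
md = smf.mixedlm("survey_data ~ wifi_data", df, groups=df["point"],
                 re_formula="~wifi_data")
mdf = md.fit()
print(mdf.fe_params)            # common (fixed) intercept and slope
print(len(mdf.random_effects))  # one random-effect entry per point
```

The `re_formula="~wifi_data"` argument is what allows each point to deviate from the common slope as well as the common intercept.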

In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# Test data: one CSV per day
df = {}
df[1]= pd.read_csv('/home/tamada/survey_wifi20181130.csv', names = ('point','date','time','wifi_data','survey_data'))
df[2] = pd.read_csv('/home/tamada/survey_wifi20181201.csv', names = ('point','date','time','wifi_data','survey_data'))
df[3] = pd.read_csv('/home/tamada/survey_wifi20181202.csv', names = ('point','date','time','wifi_data','survey_data')) 
plt.scatter(df[1]['wifi_data'], df[1]['survey_data'], label = "20181130")
plt.scatter(df[2]['wifi_data'], df[2]['survey_data'], label = "20181201")
plt.scatter(df[3]['wifi_data'], df[3]['survey_data'], label = "20181202")
plt.xlabel("wifi_data")
plt.ylabel("survey_data")
plt.legend()
plt.show()
In [2]:
# Use the days other than the test day as training data (train_df) and estimate coefficients (OLS)
train_df = {}
train_df[1] = pd.concat([df[2],df[3]])
train_df[2] = pd.concat([df[3],df[1]])
train_df[3] = pd.concat([df[1],df[2]])
import statsmodels.api as sm
result = {}
for i in train_df.keys():
    Y = train_df[i]['survey_data'].values
    df_X = train_df[i][["wifi_data"]]
    X = df_X.values
    X1 = sm.add_constant(X)
    model = sm.OLS(Y,X1)
    result[i] = model.fit()
    print(result[i].summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.237
Model:                            OLS   Adj. R-squared:                  0.235
Method:                 Least Squares   F-statistic:                     123.6
Date:                Mon, 10 Jun 2019   Prob (F-statistic):           3.43e-25
Time:                        19:12:59   Log-Likelihood:                -2506.8
No. Observations:                 400   AIC:                             5018.
Df Residuals:                     398   BIC:                             5026.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         88.7806     10.024      8.857      0.000      69.073     108.488
x1             0.2431      0.022     11.118      0.000       0.200       0.286
==============================================================================
Omnibus:                      112.728   Durbin-Watson:                   0.318
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              240.417
Skew:                           1.477   Prob(JB):                     6.22e-53
Kurtosis:                       5.388   Cond. No.                         719.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.310
Model:                            OLS   Adj. R-squared:                  0.309
Method:                 Least Squares   F-statistic:                     179.1
Date:                Mon, 10 Jun 2019   Prob (F-statistic):           5.48e-34
Time:                        19:12:59   Log-Likelihood:                -2518.4
No. Observations:                 400   AIC:                             5041.
Df Residuals:                     398   BIC:                             5049.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         80.2520      9.912      8.096      0.000      60.766      99.738
x1             0.2841      0.021     13.383      0.000       0.242       0.326
==============================================================================
Omnibus:                      113.122   Durbin-Watson:                   0.334
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              280.737
Skew:                           1.391   Prob(JB):                     1.09e-61
Kurtosis:                       6.018   Cond. No.                         704.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.244
Model:                            OLS   Adj. R-squared:                  0.242
Method:                 Least Squares   F-statistic:                     128.3
Date:                Mon, 10 Jun 2019   Prob (F-statistic):           5.76e-26
Time:                        19:12:59   Log-Likelihood:                -2545.7
No. Observations:                 400   AIC:                             5095.
Df Residuals:                     398   BIC:                             5103.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         99.2634     11.264      8.812      0.000      77.118     121.409
x1             0.2382      0.021     11.326      0.000       0.197       0.280
==============================================================================
Omnibus:                      126.608   Durbin-Watson:                   0.381
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              327.007
Skew:                           1.547   Prob(JB):                     9.80e-72
Kurtosis:                       6.170   Cond. No.                         856.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [3]:
# Estimate coefficients with a mixed linear model (mixedlm)
import statsmodels.formula.api as smf
for i in train_df.keys():
    md = smf.mixedlm("survey_data ~ wifi_data", train_df[i], groups=train_df[i]["point"], re_formula="~wifi_data")
    mdf = md.fit()
    # print(mdf.fe_params)
    
In [4]:
# Compute the per-point mixedlm coefficients for each (test day, point) pair
# coef[test-day ID][point ID]; the training data are the two days excluding the test day
import statsmodels.formula.api as smf
coef = {}
for i in df.keys():
    md = smf.mixedlm("survey_data ~ wifi_data", train_df[i], groups=train_df[i]["point"], re_formula="~wifi_data")
    mdf = md.fit()
    Intercept_common = mdf.fe_params.Intercept # common (fixed) intercept
    coef_common = mdf.fe_params.wifi_data      # common slope
    random_coef = mdf.random_effects           # random effects
    coef[i] = {}
    for j in random_coef.keys(): # sum the common and random components
        coef[i][j] = [random_coef[j].Group + Intercept_common, random_coef[j].wifi_data + coef_common]
# import pprint
# pprint.pprint(coef) # dump the coefficients
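As a sanity check on this combination step, the fixed effects plus each point's random effects should reproduce the model's own fitted values. A minimal sketch with synthetic data (the notebook's CSVs are not assumed; all values are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "point": np.repeat(np.arange(4), 30),
    "wifi_data": rng.uniform(0, 300, 120),
})
df["survey_data"] = 60 + 0.25 * df["wifi_data"] + rng.normal(0, 8, 120)

mdf = smf.mixedlm("survey_data ~ wifi_data", df, groups=df["point"],
                  re_formula="~wifi_data").fit()

# Per-point coefficient = fixed effect + that point's random effect
coef = {g: (mdf.fe_params["Intercept"] + re["Group"],
            mdf.fe_params["wifi_data"] + re["wifi_data"])
        for g, re in mdf.random_effects.items()}

# Predictions from the combined coefficients should match mdf.fittedvalues,
# since fittedvalues are conditional on the estimated random effects
manual = np.array([coef[p][0] + coef[p][1] * w
                   for p, w in zip(df["point"], df["wifi_data"])])
print(np.allclose(manual, mdf.fittedvalues))
```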
In [5]:
# Use the coefficients from the training days to predict pedestrian counts from the test day's Wi-Fi data
dfa = {} # DataFrames to hold the results
for i in df.keys():
    predict_dict = {}
    for j in df[i].index:
        item = df[i].loc[j]
        this_coef = coef[i][item.point] # per-point coefficients
        predict_dict[j] = {"predict_val": item.wifi_data*this_coef[1] + this_coef[0]}
    #print(predict_dict)
    predict_fr = pd.DataFrame(predict_dict).T
    dfa[i] = pd.concat([df[i],predict_fr], axis=1)

# Display and plot the results
print(dfa[1][['predict_val','survey_data']].corr())
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(17, 5))
dfa[1]['color'] = dfa[1]['point'].astype(int) # specify color as an integer
dfa[1].plot(kind="scatter", ax=axes[0], x="predict_val", y="survey_data",
            title="Train: 12/1, 12/2  Test: 11/30",
            c="color", colormap='Accent', colorbar=False)

print(dfa[2][['predict_val','survey_data']].corr())
dfa[2]['color'] = dfa[2]['point'].astype(int) # specify color as an integer
dfa[2].plot(kind="scatter", ax=axes[1], x="predict_val", y="survey_data",
            title="Train: 11/30, 12/2  Test: 12/1",
            c="color", colormap='Accent', colorbar=False)

print(dfa[3][['predict_val','survey_data']].corr())
dfa[3]['color'] = dfa[3]['point'].astype(int) # specify color as an integer
dfa[3].plot(kind="scatter", ax=axes[2], x="predict_val", y="survey_data",
            title="Train: 11/30, 12/1  Test: 12/2",
            c="color", colormap='Accent', colorbar=False)
plt.savefig("comp_kofu_2018.png")
             predict_val  survey_data
predict_val       1.0000       0.9315
survey_data       0.9315       1.0000
             predict_val  survey_data
predict_val     1.000000     0.901211
survey_data     0.901211     1.000000
             predict_val  survey_data
predict_val     1.000000     0.930131
survey_data     0.930131     1.000000
In [6]:
# Sort points by the (relative) difference between predicted and surveyed counts, largest first

# Add point names to each day's DataFrame
name_data= pd.read_csv('/home/tamada/kofupointname.csv', encoding='cp932',names = ('point','name'))
point_name = {}
for i,n in name_data.iterrows():
    point_name[n['point']] = n['name']
name_ser = [point_name[val['point']] for i,val in dfa[1].iterrows()]

for i in dfa:
    # add the point names
    dfa[i]['point_name'] = name_ser
    # difference between prediction and survey
    dfa[i]['diff'] = dfa[i]['predict_val'] - dfa[i]['survey_data']
    # relative difference
    dfa[i]['diff_ratio'] = abs(dfa[i]['predict_val'] - dfa[i]['survey_data']) / dfa[i]['predict_val']
    # sort (sort_values returns a new frame, so assign it back)
    dfa[i] = dfa[i].sort_values('diff_ratio', ascending=False)

# dfa[1] # display
In [7]:
# Total error per point
# using groupby; see http://sinhrks.hatenablog.com/entry/2014/10/13/005327
daily_sum = {}
for i in dfa:
    daily_sum[i] = dfa[i].groupby('point_name')[['survey_data','predict_val','diff']].sum().sort_values('diff', ascending=False)

# Rename the columns
daily_sum[1].columns = ["survey1130", "predict1130","diff1130"]
daily_sum[2].columns = ["survey1201", "predict1201","diff1201"]
daily_sum[3].columns = ["survey1202", "predict1202","diff1202"]

# Merge the three DataFrames
three_day_sum = pd.merge(daily_sum[1], daily_sum[2], on=['point_name'])
three_day_sum = pd.merge(three_day_sum, daily_sum[3], on=['point_name'])

# Per-day error rates (not computed here)

# Three-day totals
three_day_sum['歩行合計'] = three_day_sum['survey1130'] + three_day_sum['survey1201'] + three_day_sum['survey1202']
three_day_sum['推計合計'] = three_day_sum['predict1130'] + three_day_sum['predict1201'] + three_day_sum['predict1202']
three_day_sum['差合計'] = three_day_sum['diff1130'] + three_day_sum['diff1201'] + three_day_sum['diff1202']
three_day_sum['誤差率'] = three_day_sum['差合計'] / three_day_sum['歩行合計']
three_day_sum
Out[7]:
survey1130 predict1130 diff1130 survey1201 predict1201 diff1201 survey1202 predict1202 diff1202 歩行合計 推計合計 差合計 誤差率
point_name
松木呉服店前 1464 1978.942241 514.942241 2019 1445.842848 -573.157152 1452 1481.848963 29.848963 4935 4906.634052 -28.365948 -0.005748
セブンイレブン前 1342 1619.749097 277.749097 1498 1017.183412 -480.816588 905 1231.387727 326.387727 3745 3868.320235 123.320235 0.032929
風月堂前 816 967.443190 151.443190 864 818.266584 -45.733416 610 496.290588 -113.709412 2290 2282.000363 -7.999637 -0.003493
桜通り北交差点西 2506 2641.576872 135.576872 3049 3202.095384 153.095384 2751 2629.806101 -121.193899 8306 8473.478356 167.478356 0.020164
河野スポーツ前 1101 1146.713240 45.713240 1302 1644.886304 342.886304 1122 1038.135624 -83.864376 3525 3829.735168 304.735168 0.086450
永田楽器 442 433.854975 -8.145025 395 383.547349 -11.452651 294 311.778808 17.778808 1131 1129.181132 -1.818868 -0.001608
防災新館南 2125 2101.621324 -23.378676 2415 2615.389612 200.389612 1907 1888.542919 -18.457081 6447 6605.553855 158.553855 0.024593
三枝豆店前 701 664.522336 -36.477664 818 689.334943 -128.665057 569 780.247297 211.247297 2088 2134.104576 46.104576 0.022081
きぬや前 1911 1853.242973 -57.757027 2227 2082.413474 -144.586526 1856 2040.544644 184.544644 5994 5976.201090 -17.798910 -0.002969
内藤セイビドー眼鏡店 1021 933.311735 -87.688265 1103 900.006671 -202.993329 682 998.536166 316.536166 2806 2831.854571 25.854571 0.009214
玉屋前 701 567.783000 -133.217000 690 673.976865 -16.023135 459 653.969506 194.969506 1850 1895.729372 45.729372 0.024719
オスカー前 1189 1006.132225 -182.867775 1141 1231.692098 90.692098 641 719.537873 78.537873 2971 2957.362196 -13.637804 -0.004590
ブラザー前 1448 1190.609349 -257.390651 1350 1473.547767 123.547767 1122 1334.084175 212.084175 3920 3998.241292 78.241292 0.019960
小林動物病院前 1828 1405.107591 -422.892409 1374 1389.251069 15.251069 991 1263.433776 272.433776 4193 4057.792435 -135.207565 -0.032246
KoKoriオリオン通り入り口南 4392 3936.482679 -455.517321 4254 4352.796448 98.796448 4435 4778.849975 343.849975 13081 13068.129102 -12.870898 -0.000984
ファミリーマート前 2533 2052.623926 -480.376074 2006 2414.372516 408.372516 1479 1199.292732 -279.707268 6018 5666.289175 -351.710825 -0.058443
奥藤本店前 7015 6449.701619 -565.298381 6537 6005.086730 -531.913270 5558 6543.671242 985.671242 19110 18998.459591 -111.540409 -0.005837
KoKori紅梅南入り口西 2318 1672.666962 -645.333038 1824 2585.178982 761.178982 1469 1989.541570 520.541570 5611 6247.387514 636.387514 0.113418
古名屋ホテル前 2455 1807.587651 -647.412349 1585 1649.882399 64.882399 1139 1391.008958 252.008958 5179 4848.479008 -330.520992 -0.063819
ライフテクトナカゴミ前 3423 2293.173574 -1129.826426 2348 3379.488366 1031.488366 1617 2108.911517 491.911517 7388 7781.573457 393.573457 0.053272
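The per-point aggregation above relies on selecting several columns from a groupby with a list (double brackets); a toy sketch with hypothetical values:

```python
import pandas as pd

d = pd.DataFrame({"point_name": ["A", "B", "A", "B"],
                  "survey_data": [10, 20, 30, 40],
                  "predict_val": [12, 18, 28, 44]})
d["diff"] = d["predict_val"] - d["survey_data"]

# Select the columns as a list (double brackets), then sum per point
daily = d.groupby("point_name")[["survey_data", "predict_val", "diff"]].sum()
print(daily.loc["A", "diff"], daily.loc["B", "diff"])  # 0 2
```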
In [8]:
three_day_sum[['歩行合計', '推計合計', '差合計', '誤差率']]
Out[8]:
歩行合計 推計合計 差合計 誤差率
point_name
松木呉服店前 4935 4906.634052 -28.365948 -0.005748
セブンイレブン前 3745 3868.320235 123.320235 0.032929
風月堂前 2290 2282.000363 -7.999637 -0.003493
桜通り北交差点西 8306 8473.478356 167.478356 0.020164
河野スポーツ前 3525 3829.735168 304.735168 0.086450
永田楽器 1131 1129.181132 -1.818868 -0.001608
防災新館南 6447 6605.553855 158.553855 0.024593
三枝豆店前 2088 2134.104576 46.104576 0.022081
きぬや前 5994 5976.201090 -17.798910 -0.002969
内藤セイビドー眼鏡店 2806 2831.854571 25.854571 0.009214
玉屋前 1850 1895.729372 45.729372 0.024719
オスカー前 2971 2957.362196 -13.637804 -0.004590
ブラザー前 3920 3998.241292 78.241292 0.019960
小林動物病院前 4193 4057.792435 -135.207565 -0.032246
KoKoriオリオン通り入り口南 13081 13068.129102 -12.870898 -0.000984
ファミリーマート前 6018 5666.289175 -351.710825 -0.058443
奥藤本店前 19110 18998.459591 -111.540409 -0.005837
KoKori紅梅南入り口西 5611 6247.387514 636.387514 0.113418
古名屋ホテル前 5179 4848.479008 -330.520992 -0.063819
ライフテクトナカゴミ前 7388 7781.573457 393.573457 0.053272

Because the training and test sets are rotated cyclically, differences with opposite signs end up being combined, so the errors may be partially cancelling each other out.
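This cancellation can be made concrete with rounded values from the 松木呉服店前 row above: the signed per-day differences nearly cancel, while their absolute values reveal the actual error magnitude.

```python
# Signed per-day differences for one point (rounded values from the table above)
errors = [514.9, -573.2, 29.8]

signed_sum = sum(errors)               # opposite signs cancel
abs_sum = sum(abs(e) for e in errors)  # no cancellation

print(round(signed_sum, 1))  # -28.5
print(round(abs_sum, 1))     # 1117.9
```

Summing absolute (or squared) differences per point would give a cancellation-free view of the fit quality.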

The cells below are left as in the original.

In [102]:
mdf.fittedvalues[:5]
Out[102]:
0    63.844070
1    54.110198
2    61.062964
3    59.672411
4    59.116189
dtype: float64
In [103]:
df_all['fitted_lmer'] = mdf.fittedvalues
In [104]:
df_all['handcount'] = d2['survey_data']
In [105]:
df_all.head()
Out[105]:
point date time wifi_data survey_data fitted_olm fitted_lmer handcount
0 2 20181130 10 59 51 97.011055 63.844070 60
1 2 20181130 11 24 49 87.069267 54.110198 87
2 2 20181130 12 49 77 94.170544 61.062964 82
3 2 20181130 13 44 67 92.750289 59.672411 78
4 2 20181130 14 42 58 92.182187 59.116189 110
In [141]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
df_all['color'] = df_all['point'].astype(int) # 色は整数値で指定
df_all.plot(kind="scatter", ax=axes[0], x="fitted_olm", y="handcount", c="color", colormap='Accent', colorbar=False)
df_all.plot(kind="scatter", ax=axes[1], x="fitted_lmer", y="handcount", c="color",colormap='tab20', colorbar=False)
print(df_all[['fitted_olm','survey_data']].corr())
print(df_all[['fitted_lmer','handcount']].corr())
             fitted_olm  survey_data
fitted_olm     1.000000     0.557101
survey_data    0.557101     1.000000
             fitted_lmer  handcount
fitted_lmer     1.000000   0.900756
handcount       0.900756   1.000000
In [151]:
name_data= pd.read_csv('/home/tamada/kofupointname.csv', encoding='cp932',names = ('point','name'))
name_data
Out[151]:
point name
0 2 三枝豆店前
1 3 風月堂前
2 4 永田楽器
3 5 桜通り北交差点西
4 7 ライフテクトナカゴミ前
5 8 オスカー前
6 9 防災新館南
7 10 河野スポーツ前
8 11 内藤セイビドー眼鏡店
9 12 ファミリーマート前
10 13 ブラザー前
11 16 KoKoriオリオン通り入り口南
12 17 古名屋ホテル前
13 18 KoKori紅梅南入り口西
14 19 きぬや前
15 20 セブンイレブン前
16 21 玉屋前
17 22 奥藤本店前
18 23 小林動物病院前
19 24 松木呉服店前
In [152]:
pd.merge(df_all,name_data, on='point')
Out[152]:
point date time wifi_data survey_data fitted_olm fitted_lmer handcount color name
0 2 20181130 10 59 51 97.011055 63.844070 60 2 三枝豆店前
1 2 20181130 11 24 49 87.069267 54.110198 87 2 三枝豆店前
2 2 20181130 12 49 77 94.170544 61.062964 82 2 三枝豆店前
3 2 20181130 13 44 67 92.750289 59.672411 78 2 三枝豆店前
4 2 20181130 14 42 58 92.182187 59.116189 110 2 三枝豆店前
5 2 20181130 15 40 53 91.614084 58.559968 57 2 三枝豆店前
6 2 20181130 16 56 46 96.158901 63.009738 59 2 三枝豆店前
7 2 20181130 17 57 99 96.442953 63.287849 66 2 三枝豆店前
8 2 20181130 18 54 91 95.590799 62.453517 102 2 三枝豆店前
9 2 20181130 19 79 110 102.692076 69.406282 117 2 三枝豆店前
10 2 20181202 10 20 41 85.933063 52.997756 60 2 三枝豆店前
11 2 20181202 11 52 49 95.022697 61.897296 87 2 三枝豆店前
12 2 20181202 12 57 58 96.442953 63.287849 82 2 三枝豆店前
13 2 20181202 13 65 50 98.715361 65.512734 78 2 三枝豆店前
14 2 20181202 14 96 67 107.520944 74.134163 110 2 三枝豆店前
15 2 20181202 15 63 65 98.147259 64.956512 57 2 三枝豆店前
16 2 20181202 16 50 65 94.454595 61.341074 59 2 三枝豆店前
17 2 20181202 17 143 71 120.871344 87.205362 66 2 三枝豆店前
18 2 20181202 18 75 44 101.555872 68.293840 102 2 三枝豆店前
19 2 20181202 19 68 59 99.567514 66.347066 117 2 三枝豆店前
20 3 20181130 10 658 40 267.157644 66.419328 18 3 風月堂前
21 3 20181130 11 696 55 277.951585 70.446003 44 3 風月堂前
22 3 20181130 12 663 68 268.577900 66.949154 63 3 風月堂前
23 3 20181130 13 660 46 267.725747 66.631258 63 3 風月堂前
24 3 20181130 14 595 59 249.262427 59.743525 58 3 風月堂前
25 3 20181130 15 659 54 267.441696 66.525293 35 3 風月堂前
26 3 20181130 16 776 77 300.675671 78.923214 78 3 風月堂前
27 3 20181130 17 1009 100 366.859569 103.613090 120 3 風月堂前
28 3 20181130 18 1313 157 453.211094 135.826491 201 3 風月堂前
29 3 20181130 19 1628 160 542.687181 169.205508 184 3 風月堂前
... ... ... ... ... ... ... ... ... ... ...
370 23 20181202 10 273 86 157.797983 75.375742 100 23 小林動物病院前
371 23 20181202 11 323 108 172.000537 87.578001 100 23 小林動物病院前
372 23 20181202 12 657 106 266.873593 169.089091 136 23 小林動物病院前
373 23 20181202 13 463 79 211.767686 121.744326 131 23 小林動物病院前
374 23 20181202 14 319 92 170.864332 86.601820 139 23 小林動物病院前
375 23 20181202 15 413 101 197.565133 109.542067 92 23 小林動物病院前
376 23 20181202 16 499 81 221.993525 130.529953 125 23 小林動物病院前
377 23 20181202 17 524 117 229.094801 136.631082 151 23 小林動物病院前
378 23 20181202 18 555 125 237.900384 144.196483 218 23 小林動物病院前
379 23 20181202 19 568 96 241.593048 147.369070 182 23 小林動物病院前
380 24 20181130 10 286 93 161.490647 126.321404 124 24 松木呉服店前
381 24 20181130 11 327 85 173.136741 137.581225 161 24 松木呉服店前
382 24 20181130 12 393 188 191.884111 155.706791 223 24 松木呉服店前
383 24 20181130 13 337 110 175.977252 140.327523 202 24 松木呉服店前
384 24 20181130 14 348 104 179.101813 143.348451 169 24 松木呉服店前
385 24 20181130 15 384 101 189.327652 153.235123 169 24 松木呉服店前
386 24 20181130 16 310 110 168.307873 132.912519 163 24 松木呉服店前
387 24 20181130 17 468 162 213.187942 176.304025 225 24 松木呉服店前
388 24 20181130 18 569 238 241.877099 204.041632 330 24 松木呉服店前
389 24 20181130 19 726 273 286.473117 247.158508 253 24 松木呉服店前
390 24 20181202 10 206 97 138.766562 104.351022 124 24 松木呉服店前
391 24 20181202 11 265 134 155.525575 120.554179 161 24 松木呉服店前
392 24 20181202 12 325 130 172.568639 137.031966 223 24 松木呉服店前
393 24 20181202 13 297 152 164.615209 129.342332 202 24 松木呉服店前
394 24 20181202 14 195 198 135.642000 101.330094 169 24 松木呉服店前
395 24 20181202 15 353 159 180.522069 144.721600 169 24 松木呉服店前
396 24 20181202 16 312 125 168.875975 133.461779 163 24 松木呉服店前
397 24 20181202 17 305 180 166.887617 131.539370 225 24 松木呉服店前
398 24 20181202 18 384 116 189.327652 153.235123 330 24 松木呉服店前
399 24 20181202 19 382 161 188.759550 152.685863 253 24 松木呉服店前

400 rows × 10 columns

In [156]:
point_num = df_all["point"].unique()
fig, axes = plt.subplots(nrows=25, ncols=1, figsize=(5, 100))
for point in point_num:
    df_kofu = df_all[df_all["point"] == point].reset_index(drop=True)
    df_kofu.plot(kind="scatter", ax=axes[point], x="fitted_lmer", y="handcount", c='color', colormap='tab20', colorbar=False, label=point)
In [107]:
df_all.plot(kind = "scatter" , y = "fitted_lmer", x ="wifi_data")
Out[107]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f54d26c97b8>