Introduction to Stock Price Prediction with Machine Learning (LightGBM), with Beginner-Friendly Source Code

My dream for the future is

to profit from stocks for the rest of my life and live in comfort!

(C) Kazuo Umezu / The Drifting Classroom

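Just imagine.

Do you think finance professionals actually trade like this?

Buy (or sell) when two moving averages do XX
Buy (or sell) when Stochastics and MACD show XX
Buy (or sell) when the price rises XX yen or more from the open and then moves another XX yen
Buy (or sell) when the NY Dow is XX and the Nikkei is XX
Buy (or sell) when the moving average deviation rate is XX% or more

A GC (golden cross) appears because the price has already risen; it does not mean the price is going to rise from here.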

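Getting started with stock price prediction using machine learning
Recap of the strategy used
1. What are the explanatory variables (the indicators used)?
Verifying the expected value of stock price prediction with machine learning
Summary
1. Source code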

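I spent years taking on a problem that is said to be a random walk and therefore unsolvable, and now I am middle-aged...

A handful of individual investors do succeed, but surely there were plenty of other businesses with better odds that I could have looked into...

In the first place, given that "when someone wins, someone else loses,"

investing is the mixed martial arts of the intellect

after all...

I am the kind of gentle soul who, when a mosquito bites my right arm,

asks "Are you sure you're not still hungry?"

and offers my left arm as well, so I have never once won a fight.

Living the life of an underdog, I knew from the start that I never stood a chance...

At the very least, I'd like to put my system trading knowledge to use in "machine learning" and get to the point where it helps in my day job.

Incidentally, the reason I haven't touched machine learning lately is

that I forgot how to run the code I wrote myself.

A truly crappy reason.

This time I'll go back to basics and at least get it running.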

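Getting started with stock price prediction using machine learning

There was a time when my AUC (the evaluation metric) exceeded 0.7, but I no longer know how to run the source code I wrote myself.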
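For readers new to the metric: AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so 0.5 is coin-flipping and 1.0 is perfect ranking. Below is a minimal pure-Python sketch of that definition. The function name `auc_score` and the sample values are made up for illustration; the script at the end of this post uses scikit-learn's `roc_auc_score` instead.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # ties count as half
    return correct / (len(pos) * len(neg))

# Hypothetical labels (1 = price rose) and predicted probabilities
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2]
auc = auc_score(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```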

Still, after commenting out a few parts of the implementation, I got it to run.

I also rewrote the source so that it is easier to run, so here are the steps.

Preparation

A Windows PC with Python installed
Protra (OSS), with the latest stock prices already downloaded

First, use Protra to export all stock prices as CSV.
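For reference, the script at the end of this post reads each file as `tosho/<stock code>.csv` and assigns six column names (date, open, high, low, close, volume). Here is a minimal sketch of that layout, assuming a comma-separated file with no header row. The sample rows and the date format are made up for illustration; check them against what your Protra export actually produces.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for one exported file
sample = io.StringIO(
    "2021/08/05,9900,9950,9800,9850,1200000\n"
    "2021/08/06,9860,9990,9850,9980,1500000\n"
)

# Same column names the script passes to pd.read_csv
columns = ["date", "open", "high", "low", "close", "volume"]
df = pd.read_csv(sample, names=columns)
print(df)
```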

No GPU is needed. It runs even on a somewhat old machine.

That completes the preparation.

Running the stock price prediction

Install the required libraries with pip.

pip install lightgbm
pip install optuna
pip install seaborn
pip install TA_Lib-0.4.19-cp38-cp38-win_amd64.whl
pip install pandas-profiling

Then run the following command.

python new2021.py

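Note: there is one line where the path to the Protra folder has to be rewritten.

This outputs a script for backtesting in Protra.

The processing time depends on the number of stocks and the depth of the model, but for this run it takes about 5 seconds.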

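Running the backtest

Run the backtest with Protra.

Restricting the selected stocks to only those that were predicted makes it faster.

For this run, the results come out in about 2 minutes.

Yep. Easy.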

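Recap of the strategy used

[Money management conditions]

Stock selection (top 20 by market capitalization)
Funds per purchase (500,000 yen)
Total investment (3,000,000 yen)
Simple interest (no compounding)

[Buy rule]

Buy at the next day's open when the model judges, with a probability of 50% or more, that the open 3 days later will be up 3% or more

[Exit rule]

After 3 days have passed, exit at the next day's open

[Machine learning data]

[Target variable] Cases where the open 3 days later is up 3% or more from the next day's open
[Learning model] Gradient boosting (LightGBM)
[Model evaluation] KFold (K-fold cross-validation)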
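The target variable can be illustrated on a toy price series. This is a sketch modeled on `get_target_value` in the source code at the end of the post, which compares `open.shift(-1)` with `open.shift(-3)`. One deliberate difference: rows that moved less than 3% in either direction are marked as NaN here, whereas the original keeps the raw change and filters those rows out later. The function name and the prices are made up for illustration.

```python
import pandas as pd

def label_three_day_move(opens: pd.Series, threshold: float = 0.03) -> pd.Series:
    """Label each row by the move from the next row's open to the open three rows ahead."""
    entry = opens.shift(-1)    # next day's open (the buy price)
    later = opens.shift(-3)    # open three rows ahead, as in the original code
    change = (later - entry) / entry
    labels = pd.Series(float("nan"), index=opens.index)
    labels[change > threshold] = 1.0    # rose by more than 3%
    labels[change < -threshold] = 0.0   # fell by more than 3%
    return labels                       # NaN = small move or not enough future data

# Hypothetical opening prices
opens = pd.Series([100.0, 100.0, 101.0, 104.0, 100.0, 96.0, 96.0])
labels = label_three_day_move(opens)
print(labels.tolist())
```

Only the rows labeled 1 or 0 would be used for training; the last rows have no label because the future prices are not known yet.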

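What are the explanatory variables (the indicators used)?

I used to include things like government bonds and the advance-decline ratio, but those are disabled now.

I used 32 in total.

[General data]

7 in total.

Date, open, high, low, close, volume, day of week

[Technical indicators]

25 in total.

Moving averages (3-day, 15-day, 50-day, 75-day, 100-day)
Bollinger Bands (1σ, 2σ, 3σ)
MACD (signal, histogram)
RSI (9-day, 14-day)
ADX (Average Directional Index)
CCI (Commodity Channel Index)
ROC (Rate of Change)
ADOSC (Chaikin Oscillator: MACD of A/D)
ATR (Average True Range)
Moving average deviation rate (5-day, 15-day, 25-day)
Day-over-day change (1-day, 2-day, 3-day)
VR (Volume Ratio)

I've worked with technical indicators for years, and I think I can add more.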
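In the actual script these features come from `tilib.add_new_features`, whose implementation is not shown in this post. To give a feel for what such features look like, here is a rough sketch of two of the listed indicators (moving average deviation rate and day-over-day change) in plain pandas. The function name, the column names, and the choice of close prices are my assumptions, not the author's implementation.

```python
import pandas as pd

def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add deviation-from-moving-average and N-day change columns (both in %)."""
    out = df.copy()
    for n in (5, 15, 25):
        ma = out["close"].rolling(n).mean()
        # How far the close sits above or below its n-day moving average
        out[f"ma_dev_{n}"] = (out["close"] - ma) / ma * 100.0
    for n in (1, 2, 3):
        # Change of the close versus n rows earlier
        out[f"chg_{n}"] = (out["close"] / out["close"].shift(n) - 1.0) * 100.0
    return out

# Hypothetical close prices rising by 1 each day
prices = pd.DataFrame({"close": [100.0 + i for i in range(30)]})
features = add_basic_features(prices)
print(features.tail(3))
```

The first rows of each rolling feature are NaN until enough history is available, which is why the script forward-fills or drops missing values before training.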

Of these indicators, the ones LightGBM judged to be important during training are as follows.

Verifying the expected value of stock price prediction with machine learning

Computation time, including the machine learning, is about 3 minutes as mentioned above.

It's fast and it tunes the parameters automatically, so if this makes money, machine learning is definitely the better way to go.

The results are as follows.

Price data: daily
Stock list: top 20 by market capitalization
Results for 2000/01/05 to 2021/08/06.
----------------------------------------
Total trades	2635
Winning trades (win rate)	1854 (70.36%)
Losing trades (loss rate)	781 (29.64%)

Average return, all trades	2.67%
Average return, winning trades	4.83%
Average loss, losing trades	-2.46%

Largest return, winning trade	37.42%
Largest loss, losing trade	-28.42%

Average holding period, all trades	4.43
Average holding period, winning trades	4.41
Average holding period, losing trades	4.46
----------------------------------------
Required capital	¥4,478,800
Maximum position (book value)	¥9,998,900
Maximum position (market value)	¥12,495,700

Net profit	¥60,930,700
Gross profit, winning trades	¥77,375,170
Gross loss, losing trades	-¥16,444,470

Average profit, all trades	¥23,124
Average profit, winning trades	¥41,734
Average loss, losing trades	-¥21,056

Largest profit, winning trade	¥341,700
Largest loss, losing trade	-¥216,000

Profit factor	4.71
Maximum drawdown (book value)	-¥837,700
Maximum drawdown (market value)	-¥1,137,100
----------------------------------------
Trades currently open	0
----------------------------------------
Average annual return	61.84%
Average annual return (last 5 years)	11.38%
Longest winning streak	15
Longest losing streak	7
----------------------------------------
[Yearly report]
Year	Trades	Profit/loss	Annual return	Win rate	PF	Max DD
2021	26	¥27,900	0.62%	57.69%	1.15x	-3.86%
2020	86	¥1,199,200	26.78%	60.47%	1.89x	-15.31%
2019	57	¥395,700	8.83%	63.16%	2.03x	-6.42%
2018	63	¥676,200	15.10%	66.67%	2.16x	-12.69%
2017	56	¥249,400	5.57%	50.00%	1.84x	-4.59%
2016	123	¥3,044,600	67.98%	74.80%	6.90x	-6.59%
2015	97	¥2,476,000	55.28%	74.23%	7.41x	-5.25%
2014	83	¥1,472,200	32.87%	78.31%	9.18x	-3.24%
2013	156	¥4,118,800	91.96%	74.36%	8.23x	-4.28%
2012	111	¥2,365,100	52.81%	69.37%	6.57x	-4.46%
2011	98	¥1,172,000	26.17%	62.24%	2.87x	-8.12%
2010	100	¥971,500	21.69%	67.00%	2.44x	-7.41%
2009	182	¥6,124,800	136.75%	74.73%	7.80x	-8.61%
2008	213	¥7,014,700	156.62%	75.59%	4.70x	-16.41%
2007	78	¥1,410,600	31.50%	78.21%	4.26x	-8.06%
2006	101	¥1,498,700	33.46%	72.28%	4.03x	-8.98%
2005	78	¥1,523,300	34.01%	71.79%	5.78x	-4.53%
2004	110	¥2,537,300	56.65%	75.45%	5.44x	-10.45%
2003	235	¥5,787,900	129.23%	75.74%	4.96x	-28.42%
2002	197	¥5,459,800	121.90%	73.10%	5.76x	-7.80%
2001	210	¥6,167,700	137.71%	67.62%	4.61x	-19.75%
2000	175	¥5,237,300	116.94%	72.00%	4.92x	-15.02%
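As a quick sanity check on the report, the headline figures can be recomputed from the raw counts and totals it prints. This is just arithmetic on the numbers above, using the usual definitions (profit factor = gross profit / gross loss, expected value per trade = win rate × average win − loss rate × average loss); the variable names are mine.

```python
# Figures copied from the backtest report above
trades = 2635
wins = 1854
losses = 781
gross_profit = 77_375_170   # yen, winning trades
gross_loss = 16_444_470     # yen, losing trades (absolute value)
avg_win = 41_734            # yen
avg_loss = 21_056           # yen (absolute value)

win_rate = wins / trades
net_profit = gross_profit - gross_loss
profit_factor = gross_profit / gross_loss
# Expected value per trade from the (rounded) averages
expectancy = win_rate * avg_win - (1 - win_rate) * avg_loss

print(f"win rate      : {win_rate:.2%}")
print(f"net profit    : {net_profit:,} yen")
print(f"profit factor : {profit_factor:.2f}")
print(f"expectancy    : {expectancy:,.0f} yen per trade")
```

The recomputed values line up with the report (70.36% win rate, ¥60,930,700 net profit, profit factor 4.71, roughly ¥23,124 per trade), so the summary is at least internally consistent.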

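The profit curve is as follows.

The rate of return has dropped in recent years, but it still seems to be making a profit.

Back when I was first looking into machine learning, curve fitting (overfitting) drove me to tears and I gave up.

With this result, maybe there's still a chance?

Summary

I managed to get it running, and the execution time is extremely short.

If this makes money, this is the way to go.

However, I haven't done any forward testing at all, so I'd like to run it automatically every day and watch the results for a while.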
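Until real forward-test results accumulate, one interim check would be to validate in time order instead of with shuffled folds, so that every validation row is newer than the rows the model was trained on. The sketch below is not what the script in this post does (it uses shuffled KFold / StratifiedKFold); it is only one possible way to generate walk-forward splits, assuming the rows are already sorted by date. The function name and parameters are hypothetical.

```python
from typing import Iterator, List, Tuple

def walk_forward_splits(
    n_rows: int, n_folds: int, min_train: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, valid_idx) pairs where validation always follows training."""
    fold_size = (n_rows - min_train) // n_folds
    if fold_size <= 0:
        raise ValueError("not enough rows for the requested number of folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size   # training window grows each fold
        valid_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, valid_end))

# Toy example: 10 rows sorted by date, 3 folds, at least 4 training rows
splits = list(walk_forward_splits(n_rows=10, n_folds=3, min_train=4))
for train_idx, valid_idx in splits:
    print("train", train_idx, "-> valid", valid_idx)
```

Each pair of index lists could be used with `.iloc` in place of the indices that `folds.split(...)` returns in the script.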

Now, what to do about the CSV reading part...

Source code

For backtesting I used "Protra", a free OSS tool.

All the files are on GitHub so that anyone can run it and check for themselves.

import os
import gc
import time
import pickle
import optuna
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score, roc_curve
from sklearn.model_selection import KFold, StratifiedKFold
from contextlib import contextmanager
from nehori import tilib

@contextmanager
def timer(title):
    t0 = time.time()
    yield
    print("{} - done in {:.0f}s".format(title, time.time() - t0))

# Read the CSV for one stock
def read_csv(stock_id, skiprows, skipfooter):
    file = "tosho/" + str(stock_id) + ".csv"
    if not os.path.exists(file):
        print("[Error] " + file + " does not exist.")
        return None, False
    else:
        return pd.read_csv(file, skiprows=skiprows,
                           skipfooter=skipfooter, engine="python",
                           names=("date", "open", "high", "low", "close", "volume")), True

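# Print an overview
def display_overview(df):
    # Check the size of the data
    print("The size of df is : "+str(df.shape))
    # Show the column names
    print(df.columns)
    # Show the first and last rows (DataFrame.append was removed in pandas 2.0)
    print(pd.concat([df.head(), df.tail()]))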

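# Target value (rise in the open price N days ahead)
def get_target_value(df):
    # Relative change from the next row's open to the open three rows ahead
    df['target'] = (df['open'].shift(-3) - df['open'].shift(-1)) / df['open'].shift(-1)
    # 1 = rose by more than 3%, 0 = fell by more than 3%;
    # rows in between keep the raw change and are filtered out in main()
    df.loc[(df['target'] > 0.03), 'target'] = 1
    df.loc[(-0.03 > df['target']), 'target'] = 0
    return df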

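# Data preprocessing
def pre_processing(df):
    # Target variable (rise in the open price N days ahead)
    df = get_target_value(df)
    # Add day of week
    #df['day'] = pd.to_datetime(df['date']).dt.dayofweek
    # Add new features
    df = tilib.add_new_features(df)
    # Fill missing values with the previous row's value in each column
    # (same behavior as the deprecated fillna(method='ffill');
    #  note that this also fills the trailing NaN targets)
    df = df.ffill()
    return df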

# Plot feature importance
def display_importances(feature_importance_df_):
    cols = feature_importance_df_[["feature", "importance"]].groupby("feature").mean().sort_values(by = "importance", ascending = False)[:40].index
    best_features = feature_importance_df_.loc[feature_importance_df_.feature.isin(cols)]
    plt.figure(figsize = (8, 10))
    sns.barplot(x = "importance", y = "feature", data = best_features.sort_values(by = "importance", ascending=False))
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances01.png')

# Plot the ROC curve
def display_roc(list_label, list_score):
    fpr, tpr, thresholds = roc_curve(list_label, list_score)
    auc = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, label='ROC curve (area = %.2f)'%auc)
    plt.legend()
    plt.title('ROC curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.grid(True)

# Optuna (automatic hyperparameter optimization)
class Objective:
    def __init__(self, x, y, excluded_feats, num_folds = 4, stratified = False):
        self.x = x
        self.y = y
        self.excluded_feats = excluded_feats
        self.stratified = stratified
        self.num_folds = num_folds

    def __call__(self, trial):
        df_train = self.x
        y = self.y
        excluded_feats = self.excluded_feats
        stratified = self.stratified
        num_folds = self.num_folds
        # Cross validation model
        if stratified:
            folds = StratifiedKFold(n_splits = num_folds, shuffle = True, random_state = 1001)
        else:
            folds = KFold(n_splits = num_folds, shuffle = True, random_state = 1001)
        oof_preds = np.zeros(df_train.shape[0])
        feats = [f for f in df_train.columns if f not in excluded_feats]
        for n_fold, (train_idx, valid_idx) in enumerate(folds.split(df_train[feats], y)):
            X_train, y_train = df_train[feats].iloc[train_idx], y.iloc[train_idx]
            X_valid, y_valid = df_train[feats].iloc[valid_idx], y.iloc[valid_idx]
            # suggest_float(..., log = True) replaces the deprecated suggest_loguniform;
            # verbose = -1 replaces silent = True, which recent LightGBM no longer accepts
            clf = LGBMClassifier(objective = 'binary',
                                 reg_alpha = trial.suggest_float('reg_alpha', 1e-4, 100.0, log = True),
                                 reg_lambda = trial.suggest_float('reg_lambda', 1e-4, 100.0, log = True),
                                 num_leaves = trial.suggest_int('num_leaves', 10, 40),
                                 verbose = -1)
            # Train with explicit train and valid sets; early stopping is passed as a
            # callback (lgb is imported right after this class) because LightGBM 4
            # removed verbose / early_stopping_rounds from fit()
            clf.fit(X_train, y_train, eval_set = [(X_train, y_train), (X_valid, y_valid)],
                    eval_metric = 'auc',
                    callbacks = [lgb.early_stopping(stopping_rounds = 200, verbose = False)])
            oof_preds[valid_idx] = clf.predict_proba(X_valid, num_iteration = clf.best_iteration_)[:, 1]
        # Optuna minimizes by default, so return 1 - AUC
        auc = roc_auc_score(y, oof_preds)
        return 1.0 - auc

import lightgbm as lgb

# Visualize a decision tree
def display_tree(clf):
    #ax = lgb.plot_tree(clf, tree_index=0, figsize=(20, 20), show_info=['split_gain'])
    #plt.show()
    print('Plotting tree with graphviz...')
    graph = lgb.create_tree_digraph(clf, tree_index=1, format='png', name='Tree',
                                    show_info=['split_gain','internal_weight','leaf_weight','internal_value','leaf_count'])
    graph.render(view=True)
    
def load_model(num):
    clf = None
    file = "model" + str(num) + ".pickle"
    if os.path.exists(file):
        with open(file, mode='rb') as fp:
            clf = pickle.load(fp)
    return clf

def save_model(num, clf):
    with open("model" + str(num) + ".pickle", mode='wb') as fp:
        pickle.dump(clf, fp, protocol=2)

# Cross validation with KFold
def cross_validation(df_train, y, df_test, excluded_feats, num_folds = 4, stratified = False, debug = False):
    print("Starting cross_validation. Train shape: {}, test shape: {}".format(df_train.shape, df_test.shape))
    # Cross validation model
    if stratified:
        folds = StratifiedKFold(n_splits = num_folds, shuffle = True, random_state = 1001)
    else:
        folds = KFold(n_splits = num_folds, shuffle = True, random_state = 1001)
    # Create arrays and dataframes to store results
    oof_preds = np.zeros(df_train.shape[0])
    sub_preds = np.zeros(df_test.shape[0])
    df_feature_importance = pd.DataFrame()
    feats = [f for f in df_train.columns if f not in excluded_feats]
    for n_fold, (train_idx, valid_idx) in enumerate(folds.split(df_train[feats], y)):
        X_train, y_train = df_train[feats].iloc[train_idx], y.iloc[train_idx]
        X_valid, y_valid = df_train[feats].iloc[valid_idx], y.iloc[valid_idx]
        # LightGBM with default parameters
        # (previously tuned values: reg_alpha = 0.44004414216369864,
        #  reg_lambda = 0.07343092808809583, num_leaves = 29)
        clf = LGBMClassifier()

        # Train with explicit train and valid sets; early stopping is passed as a
        # callback because LightGBM 4 removed verbose / early_stopping_rounds from fit()
        clf.fit(X_train, y_train, eval_set = [(X_train, y_train), (X_valid, y_valid)],
                eval_metric = "auc",
                callbacks = [lgb.early_stopping(stopping_rounds = 200, verbose = False)])

        oof_preds[valid_idx] = clf.predict_proba(X_valid, num_iteration = clf.best_iteration_)[:, 1]
        # Average the test predictions over the folds
        # (the original assignment overwrote them, keeping only the last fold)
        sub_preds += clf.predict_proba(df_test[feats], num_iteration = clf.best_iteration_)[:, 1] / folds.n_splits
        df_fold_importance = pd.DataFrame()
        df_fold_importance["feature"] = feats
        df_fold_importance["importance"] = clf.feature_importances_
        df_fold_importance["fold"] = n_fold + 1
        df_feature_importance = pd.concat([df_feature_importance, df_fold_importance], axis=0)
        #print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(y_valid, oof_preds[valid_idx])))
        save_model(n_fold, clf)
        del clf, X_train, y_train, X_valid, y_valid
        gc.collect()
    print('Full AUC score %.6f' % roc_auc_score(y, oof_preds))
    #display_roc(y, oof_preds)
    display_importances(df_feature_importance)
    return sub_preds

def pred_load_model(clfs, df, stock_id, excluded_feats):
    n_splits = len(clfs)
    sub_preds = np.zeros(df.shape[0])
    feats = [f for f in df.columns if f not in excluded_feats]
    for clf in clfs:
        sub_preds += clf.predict_proba(df[feats], num_iteration = clf.best_iteration_)[:, 1] / n_splits
    s = tilib.create_protra_dataset(stock_id, df["date"], sub_preds, 0.6)
    return s

# Top 20 stocks by market capitalization
stock_names = [
"7203", "6861", "6758", "9432", "9984", "6098", "8306", "9983", "9433", "6367",
"6594", "4063", "8035", "9434", "7974", "4519", "7267", "7741", "6902", "6501"
             ]

def main(df_train, df_test, stock_id):
    # Print an overview
    display_overview(df_train)
    # Build the learning model
    df_test = df_test.drop("target", axis=1)
    df_train = df_train.dropna(subset=["target"])
    # Use only the clearly labeled rows (1 = rose, 0 = fell)
    df_train = df_train[(df_train['target'] == 1) | (df_train['target'] == 0)]
    excluded_feats = ['target', 'date']
    s = ""
    # Only when training data exists
    if (len(df_train)):
        if True:
            # Cross validation
            y_pred = cross_validation(df_train, df_train['target'], df_test, excluded_feats, 2, True, True)
            print(y_pred)
            s = tilib.create_protra_dataset(stock_id, df_test["date"], y_pred, 0.8)
        else:
            # Hyperparameter search
            objective = Objective(x=df_train, y=df_train['target'],
                                  excluded_feats=excluded_feats, num_folds = 5, stratified = True)
            study = optuna.create_study(sampler = optuna.samplers.RandomSampler(seed = 0))
            study.optimize(objective, n_trials = 50)
    return s

DIR = "(path where Protra is installed)\\Protra\\lib\\"

# Combined version (one model trained on all stocks)
if __name__ == '__main__':
    with timer("Cross validation"):
        df_train = pd.DataFrame()
        df_test = pd.DataFrame()
        for stock_id in stock_names:
            print(str(stock_id))
            # Read the CSV
            df, val = read_csv(stock_id, 0, 0)
            #display_overview(df)
            #df, val = read_csv(stock_id, 1000, 0)
            if not val:
                continue
            df = pre_processing(df)
            df_train = pd.concat([df_train, df])
            # Since CV is used, set aside a fixed number of rows as unseen data for testing
            # (skipfooter = 500 drops the most recent 500 rows; df_test is overwritten
            #  on every pass, so only the last stock read is kept)
            df_test, val = read_csv(stock_id, 0, 500)
            #df_test = pd.concat([df_test, df])
        # Preprocess the test data
        df_test = pre_processing(df_test)
        #display_overview(df_train)
        # Drop rows where close is missing
        df_train = df_train.dropna(subset=["close"])
        main(df_train, df_test, stock_id)
    s = ""
    with timer("start back test"):
        # Load the models saved by cross_validation (one per fold, 2 folds)
        clf = []
        for i in range(2):
            clf.append(load_model(i))
        excluded_feats = ['target', 'date']
        for stock_id in stock_names:
            df_test, val = read_csv(stock_id, 0, 0)
            df_test = pre_processing(df_test)
            s += pred_load_model(clf, df_test, stock_id, excluded_feats)
    # Write the Protra backtest script
    with open(DIR + "LightGBM.pt", mode='w') as f:
        f.write(tilib.merge_protra_dataset(s))
