Stock Market Data/Price Prediction Script - Python
Posted: Wed Feb 15, 2023 11:34 pm

SiDev
• Resident Elite
Status: Offline
Joined: Dec 13, 2020 (3 Year Member)
Posts: 287
Reputation Power: 567
https://gyazo.com/21fd330e78c30c521278bdaa54e82e93.png

Project Background Information

This script was a project I worked on for my Python class during college. I have not gotten it to work 100% correctly yet, and there are still some errors with the price prediction: it accurately displays the current day's opening price, but it does not accurately predict the next day's opening price as intended.

The purpose of this script is to scrape Yahoo Finance and other finance sources and output financial data for a ticker that you give it. The output includes charts for the open/close/volume/high/low points and a graph of the rolling mean / standard deviation. It also displays results from the Dickey-Fuller test and ARIMA models.

*This tutorial will NOT show you how to install libraries or set up your Python environment, but it will walk you through the code and the thought process behind the script. If you need help installing libraries or a Python work environment, there are tons of tutorials on YouTube.

*Note: All data/graphs shown use data from the 'TSLA' ticker. This data is not up to date because I retrieved it months ago. The script should still work to retrieve new data for most of the features.

*Full script at bottom of post*

What is the Dickey-Fuller test?

In statistics, the Dickey-Fuller test tests the null hypothesis that a unit root is present in an autoregressive time series model. The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or trend-stationarity. The test is named after the statisticians David Dickey and Wayne Fuller, who developed it in 1979.[1]

What are ARIMA models?

ARIMA models provide another approach to time series forecasting. Exponential smoothing and ARIMA models are the two most widely used approaches to time series forecasting, and provide complementary approaches to the problem.
While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data. ARIMA is a classical statistical forecasting technique that is often used alongside machine learning methods, and there are tons of different ways it can be utilized. In our case, ARIMA models help with price prediction.

Now For the Code

Importing Libraries

```
# import libraries
import random
import tensorflow as tf
from yahoo_finance import Share
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import yfinance as yf
from pandas_datareader import data, wb
import datetime
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima_model import ARIMA  # older statsmodels path; newer releases use statsmodels.tsa.arima.model
from keras import metrics
from sklearn.metrics import mean_squared_error
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import datetime as dt
from pandas_datareader import data as pdr
```

This section imports the libraries used for data scraping and data visualization, and gives us access to financial data to pull from.

Allowing Ticker Input / Data Fetching / Price Display

```
# Ticker Input / Pricing Display / Data Fetching
ticker = input('Enter stock ticker: ')
start = pd.to_datetime('2020-02-04')
end = pd.to_datetime('today')

# Download the full price history and cache it as <ticker>.csv
stock0 = yf.Ticker(ticker)
hist = stock0.history(period="max")
hist.to_csv(ticker + '.csv')

# Reload the cached data and parse the Date column
ticker0 = pd.read_csv(ticker + '.csv')
ticker0['Date'] = pd.to_datetime(ticker0['Date'])

# Fetch the same ticker through pandas_datareader for display
stock = data.DataReader(ticker, 'yahoo', start, end)
stock
```

This section asks for a ticker, downloads its price history, caches it to a CSV file, and reloads it into a DataFrame for the rest of the script.
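As a side note on the unit-root and autocorrelation ideas from the definitions above, here is a minimal sketch that is not part of the script. It uses only the standard library and a simulated random walk (made-up data, not real prices). The helper names `lag1_autocorrelation` and `first_difference` are hypothetical. The point is that a random walk is strongly correlated with its own previous value, while its day-to-day changes are not, which is why the script later differences the series before modelling it.

```python
import random
import statistics


def lag1_autocorrelation(values):
    """Sample correlation between each value and the one before it."""
    mean = statistics.fmean(values)
    num = sum((a - mean) * (b - mean) for a, b in zip(values[:-1], values[1:]))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


def first_difference(values):
    """Change from one observation to the next."""
    return [b - a for a, b in zip(values[:-1], values[1:])]


# Simulate a random walk: each value is the previous value plus random noise.
rng = random.Random(42)
level = 100.0
random_walk = []
for _ in range(2000):
    level += rng.gauss(0, 1)
    random_walk.append(level)

walk_acf = lag1_autocorrelation(random_walk)
diff_acf = lag1_autocorrelation(first_difference(random_walk))

print("lag-1 autocorrelation of the walk:       ", round(walk_acf, 3))
print("lag-1 autocorrelation of the differences:", round(diff_acf, 3))
```

The first number should come out close to 1 and the second close to 0. The Dickey-Fuller test formalizes this kind of check with a proper test statistic and p-value.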
Cleaning / Sorting the Data

```
# Data Cleaning / Sorting
# Set target series
series = ticker0['Close']

# Create train data set
train_split_date = '2020-12-31'
train_split_index = np.where(ticker0.Date == train_split_date)[0][0]
x_train = ticker0.loc[ticker0['Date'] <= train_split_date]['Close']

# Create test data set
test_split_date = '2021-06-15'
test_split_index = np.where(ticker0.Date == test_split_date)[0][0]
x_test = ticker0.loc[ticker0['Date'] >= test_split_date]['Close']

# Create valid data set
valid_split_index = (train_split_index.max(), test_split_index.min())
x_valid = ticker0.loc[(ticker0['Date'] < test_split_date) & (ticker0['Date'] > train_split_date)]['Close']

# printed index values are:
# 0-5521(train), 5522-6527(valid), 6528-6947(test)
```

This section splits the closing prices by date into training, validation, and test sets, so the model is fit on one part of the data and checked against data it has not seen.

Stationarity Test

```
# Stationarity Test
def test_stationarity(timeseries, window = 12, cutoff = 0.01):
    # Determine rolling statistics
    rolmean = timeseries.rolling(window).mean()
    rolstd = timeseries.rolling(window).std()

    # Plot rolling statistics:
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()

    # Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC', maxlag = 20)
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value

    # Report the verdict once, after all critical values are collected
    pvalue = dftest[1]
    if pvalue < cutoff:
        print('p-value = %.4f. The series is likely stationary.' % pvalue)
    else:
        print('p-value = %.4f. The series is likely non-stationary.' % pvalue)
    print(dfoutput)
```

Call Stationarity Test

`test_stationarity(series) `

This section defines our stationarity test, plots the rolling mean and standard deviation, and prints the Dickey-Fuller results.

Stationarity Test Output

Stationarity Test w/ Differenced Close

```
# Get the difference between each Close point and the previous one
ticker0_close_diff_1 = series.diff()
# Drop the first row as it will have a null value in this column
ticker0_close_diff_1.dropna(inplace=True)
```

Call Stationarity Test w/ Differenced Close

`test_stationarity(ticker0_close_diff_1) `

Stationarity Test w/ Differenced Close Output Graph

Graphing (Partial) Autocorrelation

```
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()

# Break these into two separate cells
plot_pacf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()
```

(Partial) Autocorrelation Graph Output

ARIMA Models

```
# Use this block to fit the model
# Newer statsmodels releases provide ARIMA here, and fit() no longer takes disp
from statsmodels.tsa.arima.model import ARIMA

ticker0_arima = ARIMA(x_train, order=(1,1,1))
ticker0_arima_fit = ticker0_arima.fit()
print(ticker0_arima_fit.summary())
```

ARIMA Output (with warnings)

Create List for Predictions / data points

```
# Create list of x train values
history = [x for x in x_train]
# Establish list for predictions
model_predictions = []
# Count number of test data points
N_test_observations = len(x_test)

# Loop through every test data point, refitting on everything seen so far
for time_point in list(x_test.index):
    model = ARIMA(history, order=(1,1,1))
    model_fit = model.fit()
    output = model_fit.forecast()
    yhat = output[0]
    model_predictions.append(yhat)
    true_test_value = x_test[time_point]
    history.append(true_test_value)

# Mean absolute error of the predictions against the test set
MAE_error = metrics.mean_absolute_error(x_test, model_predictions).numpy()
print('Testing Mean Absolute Error is {}'.format(MAE_error))

# Store the last fitted model
model_fit.save(ticker + '.pkl')
```

This section predicts each test point in turn, measures the mean absolute error of those predictions, and saves the fitted model.
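The script mixes two error measures (a mean absolute error call next to a "Mean Squared Error" label), so here is a minimal sketch of how the two differ. It uses made-up numbers and two hand-written helpers, `hand_mae` and `hand_mse`, which are hypothetical names and not the Keras or scikit-learn functions the script calls.

```python
def hand_mae(actual, predicted):
    """Mean absolute error: average size of the misses, in price units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def hand_mse(actual, predicted):
    """Mean squared error: average squared miss, so big misses count extra."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up prices for illustration only
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 103.0, 100.0]

mae = hand_mae(actual, predicted)
mse = hand_mse(actual, predicted)

print("MAE:", mae)  # misses of 1, 1, 2, 5 average to 2.25
print("MSE:", mse)  # squares 1, 1, 4, 25 average to 7.75
```

The single miss of 5 dominates the squared error, which is why the two numbers should not be reported under the same label.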
Output Image

Check Model

```
# Check to see if it reloaded
model_predictions[:5]

# Load model
from statsmodels.tsa.arima.model import ARIMAResults
loaded = ARIMAResults.load(ticker + '.pkl')

# Mean squared error of the predictions
arima_mse = mean_squared_error(x_test, model_predictions)
arima_mse

# Plot the last 100 predictions against the actual prices
plt.rcParams['figure.figsize'] = [10, 10]
plt.figure(figsize=(10, 6))
plt.plot(x_test.index[-100:], model_predictions[-100:], color='blue', label='Predicted Price')
plt.plot(x_test.index[-100:], x_test[-100:], color='red', label='Actual Price')
plt.title(ticker + ' Price Prediction')
plt.xlabel('Date')
plt.ylabel('Prices')
# plt.xticks(np.arange(881,1259,50), df.Date[881:1259:50])
plt.legend()
plt.show()
```

Next Day Price Predictions

`print("next day predicted value: ",model_predictions[-1]) `

Output Image

Green Line Setup / Get Monthly Data / Green Line Indicator Definition

Green Line

```
# Green Line
yf.pdr_override()  # <== that's all it takes :-)
start = dt.datetime(1980, 12, 1)
now = dt.datetime.now()
stockline = ticker
```

Get Monthly Data

```
# Get Monthly Data
def get_monthly_data(stockline, start, end):
    # Use the ticker string passed in (stockline), not the `stock` DataFrame
    df = pdr.get_data_yahoo(stockline, start, end)
    df.to_csv(stockline + '.csv', index=False)
    # Ignore days with very low trading volume
    df.drop(df[df["Volume"] < 1000].index, inplace=True)
    # Highest high of each calendar month
    dfmonth = df.groupby(pd.Grouper(freq="M"))["High"].max()
    return dfmonth
```

Green Line Indicator Definition

```
# Green Line Indicator Definition
def calculate_GreenLine(dfmonth):
    glDate = 0
    lastGLV = 0  # last green line value
    currentDate = ""
    curentGLV = 0  # current green line value
    counter = 0  # months since the current high was set
    for index, value in dfmonth.items():
        if value > curentGLV:  # new monthly high
            curentGLV = value  # update
            currentDate = index  # update
            counter = 0  # reset the counter
        if value < curentGLV:
            counter = counter + 1  # update the counter for the three months
            if counter == 3 and ((index.month != now.month) or (index.year != now.year)):
                # if curentGLV != lastGLV:
                #     print(curentGLV)
                glDate = currentDate
                lastGLV = curentGLV
                counter = 0
    if lastGLV == 0:
        message = stockline + " has not formed a green line yet"
    else:
        message = ("Last Green Line: " + str(lastGLV) + " on " + str(glDate))
    print(message)
```

Call Green Line Calculation

`dfmonth = get_monthly_data(stockline, start, now)`
`calculate_GreenLine(dfmonth) `

Errors I am getting / Help Wanted / Additional Information

Additional Information

This was a project I worked on for my Python class. It was made in Jupyter Notebook, which is why the code format looks kind of odd. I can share the .ipynb file with anyone who may be interested in helping get this project where it is intended to be.

Help Wanted

I need help resolving the warnings/errors you have seen in these images. I also want to get the next day price prediction to work properly. It currently gets the same day's opening price, not the next day's as intended. I would also like to eventually add some sort of weighted calculation to give meaning to the data it receives/outputs. This way, it can provide a recommendation of good/bad investment or something of the sort. Any suggestions / feedback for improvement are always welcome!
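On the next day prediction: one possible cause, sketched below under simplifying assumptions, is that every entry in `model_predictions` is produced before the matching test value is appended to `history`. That makes `model_predictions[-1]` a forecast for the last day already in the data, not for the day after it. The sketch swaps the ARIMA fit/forecast step for a hypothetical stand-in, `drift_forecast`, and uses made-up prices so it runs with the standard library alone.

```python
def drift_forecast(history):
    """Stand-in for the ARIMA fit/forecast step: last value plus the mean change."""
    changes = [b - a for a, b in zip(history[:-1], history[1:])]
    return history[-1] + sum(changes) / len(changes)


# Made-up prices for illustration only
train = [10.0, 11.0, 12.0]
test = [13.0, 14.0]

history = list(train)
predictions = []
for actual in test:
    # The forecast is made BEFORE the true value for that day is appended
    predictions.append(drift_forecast(history))
    history.append(actual)

# predictions[-1] targets the last test day, which is already known (14.0)
print("forecast for last known day:", predictions[-1])

# One more forecast after the loop, with every observation in history,
# is the value for the day that has not happened yet
next_day = drift_forecast(history)
print("forecast for the next day:", next_day)
```

In the real script, the equivalent step would be to refit the model on the full `history` after the loop and call `forecast()` once more. I have not run that against the original data, so treat it as a suggestion to try.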
Error Images

Full Script

```
# import libraries
import random
import tensorflow as tf
from yahoo_finance import Share
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
import yfinance as yf
from pandas_datareader import data, wb
import datetime
from statsmodels.tsa.stattools import adfuller
# Newer statsmodels releases provide ARIMA here, and fit() no longer takes disp
from statsmodels.tsa.arima.model import ARIMA
from keras import metrics
from sklearn.metrics import mean_squared_error
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import datetime as dt
from pandas_datareader import data as pdr

# Ticker Input / Pricing Display / Data Fetching
ticker = input('Enter stock ticker: ')
start = pd.to_datetime('2020-02-04')
end = pd.to_datetime('today')
stock0 = yf.Ticker(ticker)
hist = stock0.history(period="max")
hist.to_csv(ticker + '.csv')
ticker0 = pd.read_csv(ticker + '.csv')
ticker0['Date'] = pd.to_datetime(ticker0['Date'])
stock = data.DataReader(ticker, 'yahoo', start, end)
stock

# Data Cleaning / Sorting
# Set target series
series = ticker0['Close']

# Create train data set
train_split_date = '2020-12-31'
train_split_index = np.where(ticker0.Date == train_split_date)[0][0]
x_train = ticker0.loc[ticker0['Date'] <= train_split_date]['Close']

# Create test data set
test_split_date = '2021-06-15'
test_split_index = np.where(ticker0.Date == test_split_date)[0][0]
x_test = ticker0.loc[ticker0['Date'] >= test_split_date]['Close']

# Create valid data set
valid_split_index = (train_split_index.max(), test_split_index.min())
x_valid = ticker0.loc[(ticker0['Date'] < test_split_date) & (ticker0['Date'] > train_split_date)]['Close']

# printed index values are:
# 0-5521(train), 5522-6527(valid), 6528-6947(test)

# Stationarity Test
def test_stationarity(timeseries, window = 12, cutoff = 0.01):
    # Determine rolling statistics
    rolmean = timeseries.rolling(window).mean()
    rolstd = timeseries.rolling(window).std()

    # Plot rolling statistics:
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()

    # Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC', maxlag = 20)
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value

    # Report the verdict once, after all critical values are collected
    pvalue = dftest[1]
    if pvalue < cutoff:
        print('p-value = %.4f. The series is likely stationary.' % pvalue)
    else:
        print('p-value = %.4f. The series is likely non-stationary.' % pvalue)
    print(dfoutput)

test_stationarity(series)

# Get the difference between each Close point and the previous one
ticker0_close_diff_1 = series.diff()
# Drop the first row as it will have a null value in this column
ticker0_close_diff_1.dropna(inplace=True)

test_stationarity(ticker0_close_diff_1)

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()

# Break these into two separate cells
plot_pacf(ticker0_close_diff_1)
plt.xlabel('Lags (Days)')
plt.show()

# Use this block to fit the model
ticker0_arima = ARIMA(x_train, order=(1,1,1))
ticker0_arima_fit = ticker0_arima.fit()
print(ticker0_arima_fit.summary())

# Create list of x train values
history = [x for x in x_train]
# Establish list for predictions
model_predictions = []
# Count number of test data points
N_test_observations = len(x_test)

# Loop through every test data point, refitting on everything seen so far
for time_point in list(x_test.index):
    model = ARIMA(history, order=(1,1,1))
    model_fit = model.fit()
    output = model_fit.forecast()
    yhat = output[0]
    model_predictions.append(yhat)
    true_test_value = x_test[time_point]
    history.append(true_test_value)

# Mean absolute error of the predictions against the test set
MAE_error = metrics.mean_absolute_error(x_test, model_predictions).numpy()
print('Testing Mean Absolute Error is {}'.format(MAE_error))

# Store the last fitted model
model_fit.save(ticker + '.pkl')

# Check to see if it reloaded
model_predictions[:5]

# Load model
from statsmodels.tsa.arima.model import ARIMAResults
loaded = ARIMAResults.load(ticker + '.pkl')

# Mean squared error of the predictions
arima_mse = mean_squared_error(x_test, model_predictions)
arima_mse

# Plot the last 100 predictions against the actual prices
plt.rcParams['figure.figsize'] = [10, 10]
plt.figure(figsize=(10, 6))
plt.plot(x_test.index[-100:], model_predictions[-100:], color='blue', label='Predicted Price')
plt.plot(x_test.index[-100:], x_test[-100:], color='red', label='Actual Price')
plt.title(ticker + ' Price Prediction')
plt.xlabel('Date')
plt.ylabel('Prices')
# plt.xticks(np.arange(881,1259,50), df.Date[881:1259:50])
plt.legend()
plt.show()

print("next day predicted value: ", model_predictions[-1])

# Green Line
yf.pdr_override()  # <== that's all it takes :-)
start = dt.datetime(1980, 12, 1)
now = dt.datetime.now()
stockline = ticker

# Get Monthly Data
def get_monthly_data(stockline, start, end):
    # Use the ticker string passed in (stockline), not the `stock` DataFrame
    df = pdr.get_data_yahoo(stockline, start, end)
    df.to_csv(stockline + '.csv', index=False)
    # Ignore days with very low trading volume
    df.drop(df[df["Volume"] < 1000].index, inplace=True)
    # Highest high of each calendar month
    dfmonth = df.groupby(pd.Grouper(freq="M"))["High"].max()
    return dfmonth

# Green Line Indicator Definition
def calculate_GreenLine(dfmonth):
    glDate = 0
    lastGLV = 0  # last green line value
    currentDate = ""
    curentGLV = 0  # current green line value
    counter = 0  # months since the current high was set
    for index, value in dfmonth.items():
        if value > curentGLV:  # new monthly high
            curentGLV = value  # update
            currentDate = index  # update
            counter = 0  # reset the counter
        if value < curentGLV:
            counter = counter + 1  # update the counter for the three months
            if counter == 3 and ((index.month != now.month) or (index.year != now.year)):
                # if curentGLV != lastGLV:
                #     print(curentGLV)
                glDate = currentDate
                lastGLV = curentGLV
                counter = 0
    if lastGLV == 0:
        message = stockline + " has not formed a green line yet"
    else:
        message = ("Last Green Line: " + str(lastGLV) + " on " + str(glDate))
    print(message)

# Build the monthly highs before calling the indicator
dfmonth = get_monthly_data(stockline, start, now)
calculate_GreenLine(dfmonth)
```

Closing Remarks

If you have made it this far, I'd like to thank you for your time and interest in this script. The stock market and programming have always been of interest to me. More tutorials / programming posts to come. Stay tuned.

Last edited by SiDev ; edited 3 times in total

The following 2 users thanked SiDev for this useful post:

CriticaI (08-19-2023), Scizor (02-15-2023)
#2. Posted:
SiDev
• Summer 2023
Status: Offline
Joined: Dec 13, 2020 (3 Year Member)
Posts: 287
Reputation Power: 567
 This project also slightly utilizes TensorFlow. I did not mention this in the post. TensorFlow deserves its own post dedicated to just that. It has tons of capabilities and I would recommend anyone to look into it if interested in AI/Machine Learning.
#3. Posted:
SiDev
• Summer 2023
Status: Offline
Joined: Dec 13, 2020 (3 Year Member)
Posts: 287
Reputation Power: 567
#4. Posted:
TK
• Game Night
Status: Offline
Joined: May 24, 2014 (10 Year Member)
Posts: 1,100
Reputation Power: 2689
 I know nothing about this, but if you could get it to accurately predict next-day numbers, or come close enough to make a very educated guess, this would be very useful and a game changer. I would expect this to be hard due to so many variables, so I am, to say the least, extremely impressed!
#5. Posted:
CriticaI
• Summer 2018
Status: Offline
Joined: Nov 05, 2013 (10 Year Member)
Posts: 2,747
Reputation Power: 451
 Cool project! For getting around errors, my advice is to read the error messages multiple times and really think about what you want your code to do. It says a variable is not defined, meaning you did not create the variable yet, or it is not available in the scope you think it is. Split is a method available to strings. It turns a string into an array (list) of smaller strings. If your variable is not a string but instead something like a DataFrame, number, or list, that method is unavailable because Python does not know how to split something like a DataFrame. Also, if you haven't already, definitely check out ChatGPT. It is great for getting things explained in layman's terms.
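A minimal sketch of those two error types, using made-up names: `monthly_highs` is a hypothetical variable that is never created, and a plain dict stands in for a DataFrame so the example needs nothing beyond the standard library.

```python
errors = []

# 1) Using a name before it has been created raises NameError
try:
    print(monthly_highs)  # hypothetical name that was never assigned
except NameError as exc:
    errors.append(type(exc).__name__)

# 2) split() works on strings and returns a list of smaller strings
ticker_text = "TSLA,AAPL"
tickers = ticker_text.split(",")

# The same call on a non-string object fails because it has no split method.
# A dict stands in for a DataFrame here.
not_a_string = {"Close": [1.0, 2.0]}
try:
    not_a_string.split(",")
except AttributeError as exc:
    errors.append(type(exc).__name__)

print(tickers)
print(errors)
```

If the traceback mentions `split`, it is worth checking whether a function that expects a ticker string is being handed some other object instead.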
#6. Posted:
TCAR
• Gold Member
Status: Offline
Joined: Jun 15, 2014 (10 Year Member)
Posts: 1,018
Reputation Power: 20463
Motto: There's magic on the other side of fear.
 Not even gonna read all of it, but it looks like a lot.