Analysis of currency exchange EUR/USD rate by processing natural language of top news

Timo Valeri Junolainen

December 2020 – August 2021

Table of Contents

Part I Abstract
Part II Introduction
Part III Preparing the data
    1 Datasets
    2 Words to tokens
    3 Sentence to matrix padding
    4 Price to learn
Part IV Learning the data
    1 About neural network – forward propagation
        1.1 Sigmoid
        1.2 Tanh
        1.3 ReLU
        1.4 Softmax
    2 About neural network – backward propagation
        2.1 Cost functions
        2.2 Loss functions
        2.3 Gradient of cost function w.r.t. weights and biases
        2.4 Optimizers
            2.4.1 Gradient descent
            2.4.2 Momentum
            2.4.3 Adagrad
            2.4.4 Adadelta
            2.4.5 Adam
Part V Building Neural Network
    1 Scaffold
    2 Fitting Neural Network
    3 Validating Neural Networks
Part VI Conclusions
Part VII Bibliography

Part I Abstract

NLP – natural language processing – is an area of research that explores how computers can understand and manipulate natural language for useful purposes. [1] In this work we explore the application of NLP to forex currency exchange, on the premise that news influence society, and society then makes the decision – to buy or to sell.
By processing top news we try to predict how the news will influence currency exchange rates. The technique used in this paper is fairly primitive and was not tested thoroughly enough to state its precise usability; for a more elaborate and precise model the reader is referred to my next paper on the subject. Still, this model does predict some jumps of the currency rate, which makes it somewhat more useful than random guessing. In this paper we use the simplest NLP technique: assigning a number to each distinct word, thus producing a tokenized version of a text, and then fitting a dense neural network over the tokenized texts with the goal of predicting noticeable price changes.

Part II Introduction

NLP is a marvelously interesting subset of the machine learning discipline, which concentrates on reading, understanding and producing text written by humans. Here are some examples of successful NLP usage:

• Google Translate [2], a service used for translating texts from one language to another.
• Predictive text, used by the Google search engine itself: while a user types a query, the predictive technology tries to guess what is going to be typed next, making the task easier for the user.

In this work we process an archive of news headlines in text format from Kaggle [3] using the simplest natural language technique, tokenization of words, that is, assigning a different integer to each distinct word, which makes the text readable for a neural network. For the model we use a dense neural network, which is covered later in this paper.

Part III Preparing the data

1 Datasets

First, in order to process information we naturally need information, or in other words – datasets. We are going to use the following datasets from Kaggle [3]:

• EUR USD Forex Pair Historical Data [4]
• Daily News for Stock Market Prediction [5]

Naturally, all work is done in Python3 [6] / Jupyter, enhanced with the following packages:

• Natural Language Toolkit (NLTK [7]), a language processing package, required to remove stopwords from the text.
• Pandas [8], a Python data processing package.
• Keras [9], a deep learning Python package, required for tokenization of text to integers and used to build the artificial neural network.
• TensorFlow [10], a large ecosystem of different machine learning procedures.

In order to read and process text, operate on the datasets, and construct and train the neural network, we need the following libraries and packages imported into the model. The block below shows this part of the work:

import pandas as pd
import re
import numpy
from keras.preprocessing.text import text_to_word_sequence
import nltk
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense
import tensorflow as tf
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from datetime import timedelta  # used later when shifting news dates by delta days
nltk.download('stopwords')

In this section we are not only importing the necessary packages, but also downloading the list of stopwords belonging to the NLTK [7] package.

2 Words to tokens

In order to process a text, it must first be converted into a form suitable for a neural network. We start by reading the news dataset [5]:

news_data = pd.read_csv("./Combined_News_DJIA.csv")

Now we have the news with dates in a pandas [8] dataframe.
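To verify what was loaded, a quick inspection can be done as follows (a minimal sketch; the shown column names follow the Kaggle Daily News file and the printed values are illustrative only):

# Quick sanity check of the loaded dataframe (outputs are illustrative):
print(news_data.shape)        # (number of trading days, number of columns)
print(news_data.columns[:4])  # e.g. Index(['Date', 'Label', 'Top1', 'Top2'], ...) for this Kaggle file
print(news_data['Date'].min(), news_data['Date'].max())  # covered date range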
The text is good and readable, formatted in the following way:

• Date – the date of the news
• News – the top news headlines of that day, in separate columns

It is a neat, human-readable text, but TensorFlow [10], sadly, does not know how to read it. First we join all news columns of the dataframe into one column, which can be done, while removing stopwords, in the following way:

def concatenate_news(row):
    result = ''
    pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
    for i in row[2:]:
        if pd.notnull(i):
            i = re.sub('[^A-Za-z0-9]+', ' ', i)
            i = re.sub(r"\b[a-zA-Z0-9]\b", "", i)
            result += i.lower() + ' '
    result = pattern.sub('', result)
    return result

news_data['News'] = news_data.apply(lambda row: concatenate_news(row), axis=1)
news_data = news_data[["Date", "News"]]

This code snippet removes stopwords from the texts, since words like it, to, and etc. do not give any useful information to the neural network and should be removed as unnecessary. [11]

After dropping unnecessary empty rows, removing the stopwords and lowercasing all letters, we are left with a dataframe which contains only two columns, like:

Date        News
2008-08-08  georgia downs two russian warplanes countries...
2008-08-11  wont america nato help us wont help us help...
...         ...

This text is better suited, but it is still a text for humans, not suitable input for an artificial neural network. To make the data in the dataframe suitable, we have to translate words into sequences of numbers.

Suppose we have the sentence "Cow jumped over the moon.". Previously we would have translated it to "cow jumped over moon"; notice that the word the belongs to the stopwords [11] and so is removed from the sentence, and dots and other punctuation marks are removed as well. Fitting a tokenizer creates a table of words with corresponding integers (tokens):

Word     Token (integer)
cow      1
jumped   2
over     3
moon     4

After this we can turn our sentence into the sequence of integers 1, 2, 3, 4. Conversely, the sequence 4, 2, 3, 1 translates to the sentence "moon jumped over cow". In other words, we tokenized the text, mapping each word to a separate integer.

Let us take our prepared text and turn it into the tokenized version with the following code:

t = Tokenizer()
t.fit_on_texts(news_data['News'])
news_data['Tokenized'] = news_data.apply(lambda row: t.texts_to_sequences(row)[1], axis=1)  # index 1 is the News column

Meanwhile, it is also wise to convert the date into the more easily handled pandas [8] datetime format, with the following code:

news_data['Date'] = pd.to_datetime(news_data['Date'], format='%Y-%m-%d')

After those operations we have a dataframe with three columns; the column News is not really necessary anymore, but let us leave it there for now. We translated the sentence "georgia downs two russian warplanes countries..." into [823, 10179, 27, 30, 2442, 114, 420, 2016, 12, ...] and "wont america nato help us wont help us help..." into [2280, 314, 296, 160, 1, 2280, 160, 1, 160, 75, ...]. It is easy to observe that the integer 2280 is the token for the word wont.

The text is (almost) ready to be fed to a neural network. It should also be padded to a matrix for keras [9], but we will do that later, when we construct the artificial neural network. [12]
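To make the mapping concrete, the toy sentence above can be run through the Keras Tokenizer directly (a self-contained sketch; the printed indices are illustrative, since Keras assigns tokens by word frequency):

# Toy demonstration of tokenization on the cow/moon example (illustrative only):
from keras.preprocessing.text import Tokenizer

toy = Tokenizer()
toy.fit_on_texts(["cow jumped over moon"])
print(toy.word_index)                                    # e.g. {'cow': 1, 'jumped': 2, 'over': 3, 'moon': 4}
print(toy.texts_to_sequences(["cow jumped over moon"]))  # e.g. [[1, 2, 3, 4]]
# On a real corpus, frequent words receive the smallest integers.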
3 Sentence to matrix padding

In order to be completely readable by the artificial neural network, the tokenized text should be padded to form a proper matrix. The problem is that sentences have different numbers of words; to be fed into the neural network, the sequences must all have the same length, so shorter sentences are padded with zeroes. Here x denotes the list of tokenized sequences (the Tokenized column) and y the corresponding list of labels built in the next section. In order to pad, let us invoke the next code:

max_len = max(len(sequence) for sequence in x)   # length of the longest tokenized sentence
max_word = len(t.word_index) + 1                 # number of distinct words, plus one for the padding index
x_matrix = numpy.zeros((len(y), max_len))

max_len is the length of the longest sequence of words and max_word is the number of different words in the whole news_data dataframe, plus one for the padding index (the tokenizer provides this count).

Now let us see how x_matrix is formed. Suppose we have just three rows from three different days:

• this is good paper
• paper is good
• this is good day

After tokenization, we receive the following dictionary:

(1) this (2) is (3) good (4) paper (5) day

We have only 5 distinct words in our list of sequences, so our collection of sequences will look like:

• 1, 2, 3, 4
• 4, 2, 3
• 1, 2, 3, 5

An elegant way to make a padded matrix is to create a 0-filled matrix whose width equals the maximum sequence length and whose number of rows equals the number of sequences (sentences, in essence), thus:

0 0 0 0
0 0 0 0
0 0 0 0

and then to fill it with the corresponding tokens:

for i in range(len(x)):
    x[i] = numpy.asarray(x[i]).astype('int')
    y[i] = numpy.asarray(y[i]).astype('int')
    for j in range(len(x[i])):
        x_matrix[i, j] = x[i][j]

which turns our example collection of three sentences into the following matrix:

1 2 3 4
4 2 3 0
1 2 3 5

Our neural network is going to read the sentences row by row, correlating them to price changes over the δ-day range.

4 Price to learn

For this part we need the second dataset [4] with historical forex exchange rates, and we again convert the Date column to datetime [8] format:

eurusd_data = pd.read_csv("./eurusd_hour.csv")
eurusd_data['Date'] = pd.to_datetime(eurusd_data['Date'], format='%Y-%m-%d')

Now we have a second pandas [8] dataframe with hourly changes of the exchange rate. We are particularly interested in the Ask/Bid [13] change values over the whole day in question, and in the scope of this paper we build the neural network only for the Bid value. (Bid is the price at which the currency is sold, Ask is the price at which it is bought; these are interconnected but slightly different concepts. To keep things simple, we examine only the Bid change.)

Let us add two additional columns to our news_data dataframe:

news_data['Bch'] = 0
news_data['Bsig'] = 0

Bch is the column where we store the bid change δ days after the news, and Bsig is the column for the so-called signal or flag: it is 1 if the price change over that day is larger than the threshold, and 0 otherwise. Naturally we also need δ and the threshold, so let us add them:

days_delta = 3
threshold = 0.01

The parameters above can and should be changed for the particular task. In the scope of this paper we use the price change three days after the news, and a price jump bigger than 0.01 (the δ-threshold).

The sequence of events is as follows: the news come out, δ days pass (+3 in this scope), and we take the accumulated bid change over that day:

for index, row in news_data.iterrows():
    current_eurusd = eurusd_data.loc[eurusd_data["Date"] == row["Date"] + timedelta(days=days_delta)]
    Bch = current_eurusd['BCh'].sum(skipna=True)
    news_data.at[index, 'Bch'] = Bch
    news_data.at[index, 'Bsig'] = int(abs(Bch) > threshold)  # flag a jump larger than the threshold, in either direction

news_data = news_data[news_data.Bch != 0]

Notice that the loop stores the accumulated change and the signal flag back into news_data, and that afterwards we drop the rows for which there is no bid change in the eurusd_data dataframe.
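For completeness, here is one possible way to assemble the inputs x and labels y that the padding code of section 3 expects from the prepared news_data dataframe (a minimal sketch; the actual model construction and fitting are described in Part V):

# One possible assembly of the padding inputs used in section 3 (illustrative):
x = news_data['Tokenized'].tolist()   # one tokenized news sequence per remaining day
y = news_data['Bsig'].tolist()        # 1 if the bid moved more than the threshold delta days later

print(len(x), len(y))                 # both equal the number of remaining news days
print(sum(y) / len(y))                # fraction of "jump" days, useful to judge class balance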
Part IV Learning the data

1 About neural network – forward propagation

An artificial neural network, or simply NN, is in essence a collection of neurons connected by an analogue of biological synapses, called inputs, where each neuron receives real numbers as input and propagates/transmits the result forward to the neurons of the next level or to the output of the NN. [14] In this work we use the simplest version of an NN, the dense artificial neural network, which basically means that each neuron of the network receives input from all neurons of the previous layer. [15]

Let us consider the next example of a dense NN. [16] It is a toy neural network consisting of three layers:

• an input layer with three neurons
• a hidden layer with two neurons
• an output layer with one neuron

(Figure: toy dense network with an input layer, a hidden layer and an output layer.)

The input layer is called the "input layer" because it is where information is fed into the NN, hence the name. The "output" layer is, respectively, the layer where the output of the NN is produced. The "hidden" layer is rarely accessed directly, thus it is hidden from direct influence, and is generally managed by the neural network itself, automatically.

Our toy NN has three input neurons, so it takes three inputs; let us call them $x_1, x_2, x_3$. Since it is a dense NN, all neurons from the previous layer are connected to each neuron of the next layer. So, if we call the neurons in the hidden layer $y_1, y_2$, the connections between the input and hidden layer are $x_1 \to y_1$, $x_2 \to y_1$, $x_3 \to y_1$ and $x_1 \to y_2$, $x_2 \to y_2$, $x_3 \to y_2$; thus our example NN has 6 connections between the input and the hidden layer.

The neurons $y_1, y_2$ also have biases. A bias is just a number added to the neuron which helps adjust how easily the neuron excites. Let us call the biases $b_{y_1}, b_{y_2}$ for now. For the connection between the input and hidden layers to work, we also need connection weights, and since we have 6 connections between the layers, we have 6 weights: $w_{xy_{11}}, w_{xy_{21}}, w_{xy_{31}}, w_{xy_{12}}, w_{xy_{22}}, w_{xy_{32}}$.

Now we can find the values of the neurons $y_1, y_2$ in the following way:

$y_1 = x_1 w_{xy_{11}} + x_2 w_{xy_{21}} + x_3 w_{xy_{31}} + b_{y_1}$

$y_2 = x_1 w_{xy_{12}} + x_2 w_{xy_{22}} + x_3 w_{xy_{32}} + b_{y_2}$

One more thing is required to make the NN work, namely an activation function, because while $x_1, x_2, x_3$ are both the input and the output of the input neurons, $y_1, y_2$ are only the values on the incoming side of the hidden-layer neurons. The next layer's neuron will receive $f_{act}(y_1), f_{act}(y_2)$ instead of $y_1, y_2$. What is this $f_{act}$? It is called the activation function, also called the transfer function, and it decides whether a neuron fires or not, or better said, whether it activates or not. [17] Here are some common activation functions:

1.1 Sigmoid

Calculated as $f(x) = \frac{1}{1 + e^{-x}}$, its result ranges from 0 to 1. Since the center of the sigmoid is 0.5, not 0, it is probably not a good choice for layers close to the input side, and it is most suitable for an output layer predicting a probability or a binary classification. [18], [19]

1.2 Tanh

The formula is $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, and it differs from the sigmoid in its output range: tanh is centered at 0 and ranges from -1 to 1.
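As a concrete illustration of the forward pass just described, here is a minimal numpy sketch of the toy 3-2-1 network, using tanh in the hidden layer and a sigmoid at the output; all weights, biases and inputs are made-up numbers for demonstration:

import numpy as np

# Toy 3-2-1 dense network from the example above; all numbers are illustrative.
x = np.array([0.5, -1.0, 2.0])             # inputs x1, x2, x3

W_hidden = np.array([[0.10, 0.40],         # w_xy11, w_xy12
                     [-0.20, 0.30],        # w_xy21, w_xy22
                     [0.05, -0.10]])       # w_xy31, w_xy32
b_hidden = np.array([0.01, -0.02])         # biases b_y1, b_y2

W_out = np.array([0.7, -0.5])              # weights from hidden neurons to the output neuron
b_out = 0.1                                # bias of the output neuron

y = x @ W_hidden + b_hidden                # y1, y2 before activation
h = np.tanh(y)                             # hidden activations f_act(y1), f_act(y2)
z = h @ W_out + b_out                      # output neuron before activation
out = 1.0 / (1.0 + np.exp(-z))             # sigmoid output, a value between 0 and 1
print(out)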
1.3 ReLU

The classical ReLU is defined as

$f(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases}$

and it is a very often used activation function for several reasons: it is very fast to compute and it avoids the vanishing-gradient problem that the sigmoid suffers from during backpropagation. ReLU is typically used in hidden layers rather than everywhere in the network. [18]

1.4 Softmax

Softmax is well suited for multiclass classification models, but with k = 2 it can be used for binary classification as well. For an input vector $z$, with $j$ indexing the outputs $1, 2, \ldots, K$, softmax is defined as:

$f(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$

Those are just some of the popular activation functions; there are more of them, but within the scope of this paper it is unnecessary to list them all.

2 About neural network – backward propagation

2.1 Cost functions

"A cost function is a measure of 'how good' a neural network did with respect to its given training sample and the expected output. It also may depend on variables such as weights and biases." [20]

Cost and loss functions are related: the cost function is the more general concept, obtained by averaging and perhaps including some model-complexity term. In general, the loss function (described in Part IV 2.2) is the atomic part of the cost function: the loss function is the difference between predicted and expected output for one training sample, while the cost is some form of average over multiple loss functions.

Requirements for a cost function [20]:

• The cost function must be writable as an average over the loss functions of individual training examples. [21]
• In order to backpropagate, the cost function must be writable as a function of the neural network's output values.

The cost function used in this paper is MSE (Mean Squared Error), which in simple terms is calculated in the following way:

$MSE = \frac{1}{N} \sum_{n=1}^{N} \left( y_n^{actual} - y_n^{predicted} \right)^2$, where $N$ is the number of samples.

2.2 Loss functions

As mentioned in Part IV 2.1, the loss function is an element of the cost function, and thus the elementary part of the MSE is the quadratic loss function; let us call it $CLF = (y^{actual} - y^{predicted})^2$, the squared difference for one sample.

2.3 Gradient of cost function w.r.t. weights and biases

In order to have a working NN we need to find the optimal weights and biases that minimize the cost function. Let us simplify our NN: instead of the mess of neurons and layers connected to everything in every which way, let us first consider a simple 1-1-1 network.

In this trivial, one-neuron-per-layer network, the input to a neuron in layer $L$ is the previous layer's output $a^{(L-1)}$; since the input comes from one neuron, we have only one weight $w^{(L-1)}$, and we also have a bias $b^{(L-1)}$. For a one-neuron network, the value of the neuron in layer $L$ depends only on the output of the single neuron of the previous layer (in the general case it is a bit more complex, as described in Part IV 1), so for this example:

$z^{(L)} = w^{(L-1)} a^{(L-1)} + b^{(L-1)}$, and $a^{(L)} = f_{act}(z^{(L)})$

After all this forward propagation we are left with the value of the cost function, which indicates how well the NN did.
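To keep the notation tangible, the 1-1-1 forward pass and its quadratic loss can be written out in a few lines of plain Python (a minimal sketch; the numeric values and the choice of sigmoid as f_act are illustrative):

import math

# 1-1-1 network: a single weight, a single bias, squared loss for one sample.
a_prev = 0.8                          # a(L-1): output coming from the previous layer
w = 0.3                               # w(L-1): the single connection weight
b = -0.1                              # b(L-1): the bias
y = 1.0                               # desired output

z = w * a_prev + b                    # z(L) = w(L-1) * a(L-1) + b(L-1)
a = 1.0 / (1.0 + math.exp(-z))        # a(L) = f_act(z(L)), sigmoid here
cost = (a - y) ** 2                   # quadratic loss for this single sample
print(z, a, cost)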
In order to make the NN effective we are interested in $\frac{\partial C}{\partial w^{(L-1)}}$, that is, how much the cost changes if we change $w^{(L-1)}$. This change can be found with the derivative chain rule:

$\frac{\partial C}{\partial w^{(L-1)}} = \frac{\partial C}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial w^{(L-1)}} = \frac{\partial C}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial w^{(L-1)}}$

(Figure: the computation chain $w^{(L-1)}, a^{(L-1)}, b^{(L-1)} \to z^{(L)} \to a^{(L)} \to C$.)

The components of this chain look like:

$\frac{\partial C}{\partial a^{(L)}} = 2(a^{(L)} - y)$, where $y$ is the desired output,

$\frac{\partial a^{(L)}}{\partial z^{(L)}} = f'_{act}(z^{(L)})$, where $f_{act}$ is the corresponding activation function,

$\frac{\partial z^{(L)}}{\partial w^{(L-1)}} = a^{(L-1)}$,

and thus the whole $\frac{\partial C}{\partial w^{(L-1)}}$ becomes:

$\frac{\partial C}{\partial w^{(L-1)}} = 2(a^{(L)} - y) \, f'_{act}(z^{(L)}) \, a^{(L-1)}$

2.4 Optimizers

2.4.1 Gradient descent

Gradient descent, also called batch gradient descent, simply computes the gradient of the cost function (Part IV 2.1) with respect to all parameters of the system $\theta$ (weights and biases) over the whole dataset:

$\theta_{new} = \theta_{old} - \eta \cdot \nabla_{\theta} J(\theta)$

2.4.2 Momentum

Gradient descent from Part IV 2.4.1 has problems navigating surfaces that curve much more steeply in one direction than in another. [22] Adding momentum helps solve this problem, and it is done by adding a fraction $\gamma$ of the update vector from the previous step:

$v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta)$

$\theta_{new} = \theta_{old} - v_t$

The result of this addition is faster convergence and reduced oscillation.

2.4.3 Adagrad

The main idea of Adagrad is to keep in memory the squared gradients up to some point. In momentum and gradient descent we updated all parameters $\theta$ at once, using the same learning rate $\eta$; Adagrad takes a slightly different approach, keeping a learning rate for every parameter $\theta_i$.

Let us use $g_t$ for the gradient at some time step $t$, so that $g_{t,i}$ is the partial derivative of the cost function with respect to the parameter $\theta_i$ at step $t$:

$g_{t,i} = \nabla_{\theta} J(\theta_{t,i})$

The update of every parameter $\theta_i$ at each time step $t$ is:

$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \varepsilon}} \, g_{t,i}$

where $\varepsilon$ is a smoothing element to avoid division by zero (small, about $10^{-8}$), and $G_t \in \mathbb{R}^{d \times d}$ is a diagonal matrix where each element $i,i$ is the sum of squares of the gradients with respect to $\theta_i$ up to time step $t$. [22]

2.4.4 Adadelta

Adadelta [23] is an enhancement of Adagrad that seeks to mitigate its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size $w$. Instead of inefficiently storing $w$ previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E[g^2]_t$ at time step $t$ then depends (through a fraction $\gamma$, similarly to the momentum term) only on the previous average and the current gradient:

$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$

Gradient descent can then be rewritten in terms of the parameter update term $\Delta\theta_t$: