What’s the frequency Kenneth

This entry is part 5 of 6 in the series Machine Learning

People have long tried to predict the value of physical quantities, and the applications of time series forecasting are too many to list here. With reference to the stock market, for example, the basic idea is this: the state of the market, modeled as a function of the Open, High, Low, Close and Volume values of the last minute candle closed at time t, x(t)=(O(t),H(t),L(t),C(t),V(t)), varies with time, but representing it in the frequency domain allows us to work in a way that is independent of time.

By chance, I bumped into the interesting paper “Sequences with Minimal Time-Frequency Uncertainty” https://www.sciencedirect.com/science/article/pii/S1063520314000906. It builds on Heisenberg’s uncertainty principle, which states that a given function cannot be arbitrarily compact in both time and frequency: taking the variance as the measure of localization in time and frequency, the paper defines an “uncertainty” lower bound, shows that Gaussian functions reach this bound for continuous-time signals, and shows that this is no longer true for discrete-time sequences. In addition, most signals seen in nature are not distributed as a Gaussian function, which makes the challenge even harder. The Hamiltonian is an operator that represents the total energy of a quantum system, including both kinetic and potential energy, and this represents, in my opinion, a fundamental exogenous variable to take into consideration when building a model: even a strategy based on statistical indexes may be severely impacted by the “energy” of the market.
What made the reading especially interesting to me is something I had noticed over time: the better the prediction I managed to obtain, the more the waveform of the residual (the prediction error) resembled that of a noise signal. This suggested there is a limit to how precisely the prediction can match the true values, so I wanted to take a closer look at the shape and behaviour of this “noise”.
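One simple way to check whether residuals behave like noise is to look at their sample autocorrelation: for white noise, every lag beyond zero should stay close to zero, within roughly ±2/√N. A minimal sketch with NumPy, using synthetic residuals for illustration (the actual model residuals are not reproduced here):

```python
import numpy as np

def autocorr(x, max_lag):
    """Sample autocorrelation of a 1-D series for lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

# Synthetic residuals: pure white noise, standing in for real prediction errors
rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 2000)

ac = autocorr(residuals, 20)
# For white noise, autocorrelations should fall inside ~±2/sqrt(N)
bound = 2 / np.sqrt(len(residuals))
print("max |autocorrelation|:", np.max(np.abs(ac)), "bound:", bound)
```

If the residuals of a real model show autocorrelations well outside the bound, there is still predictable structure left to extract; if they stay inside, the error is essentially noise.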

Interestingly, in his book “The (Mis)Behaviour of Markets: A Fractal View of Risk, Ruin and Reward”, Mandelbrot makes the point that the traditional normal distribution does not match “real world” distributions, which follow power-law curves with “fat tails”: the Cauchy distribution, for example, better fits daily and weekly stock price movements. So it looks like both the stock price and the residuals of the prediction behave in a similar way.
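As an illustration of the fat-tail point, the sketch below uses SciPy to compare how well a fitted normal and a fitted Cauchy explain a heavy-tailed sample; the returns are synthetic (drawn from a Student's t distribution), since the actual market data is not reproduced here:

```python
from scipy import stats

# Hypothetical daily returns with fat tails, simulated from Student's t (df=3)
returns = stats.t.rvs(df=3, size=5000, random_state=42)

# Fit both candidate distributions by maximum likelihood
mu, sigma = stats.norm.fit(returns)
loc, scale = stats.cauchy.fit(returns)

# Compare total log-likelihoods: the higher, the better the fit
ll_norm = stats.norm.logpdf(returns, mu, sigma).sum()
ll_cauchy = stats.cauchy.logpdf(returns, loc, scale).sum()
print("normal:", ll_norm, "cauchy:", ll_cauchy)
```

On a sample like this, the outliers inflate the fitted normal's standard deviation and are still assigned vanishing density, while the Cauchy's heavy tails accommodate them, so the Cauchy achieves the higher likelihood.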
Since a reduction in error naturally implies an improvement in forecasting, it would be interesting to understand whether the two are correlated and to what extent, so that applying an inverse noise signal to the predicted values would improve the prediction or, at least, reduce the spikes in the residuals time series. Two approaches have been followed:

  1. Identify the distribution that best matches the residuals time series
  2. Filter the residuals time series and subtract only those frequencies that are most “disturbing”, in this case high-frequency variations

While the optimized regression obtained with AutoML scores a MAPE of 1.25%, approach 1 leads to a MAPE of 1.79%, whereas approach 2 (i.e. removing the high-power frequencies) leads to 1.16%.
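Approach 2 can be sketched as a simple FFT low-pass filter over the residuals. The cutoff fraction below is an arbitrary illustrative choice, not the value used in the actual experiments, and the data is synthetic:

```python
import numpy as np

def remove_high_freq(residuals, keep_fraction=0.1):
    """Zero out all but the lowest `keep_fraction` of frequencies (illustrative)."""
    spec = np.fft.rfft(residuals)
    cutoff = max(1, int(len(spec) * keep_fraction))
    spec[cutoff:] = 0.0          # discard the "disturbing" high frequencies
    return np.fft.irfft(spec, n=len(residuals))

# Synthetic example: a slow drift buried in high-frequency noise
rng = np.random.default_rng(1)
t = np.arange(1024)
drift = np.sin(2 * np.pi * t / 512)      # low-frequency component
noise = rng.normal(0, 0.5, t.size)       # high-frequency "disturbance"
residuals = drift + noise

smoothed = remove_high_freq(residuals, keep_fraction=0.05)

# The low-pass version should track the drift far better than the raw series
err_raw = np.mean((residuals - drift) ** 2)
err_smooth = np.mean((smoothed - drift) ** 2)
print("raw MSE:", err_raw, "smoothed MSE:", err_smooth)
```

The smoothed series can then be subtracted from the predictions as a correction; whether that helps depends on how much structure actually lives in the low frequencies of the residuals.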

The final goal is, in some cases, quite simple: predict whether the market will go up or down. The issue is that the change percentage is usually lower than the prediction error percentage (e.g. RMSE or MAPE), so Machine Learning alone is usually not sufficient to produce a good prediction, at least not good enough to drive investment decisions with a high success rate.

Ultimately, a good trading system, in my opinion, is composed of a good forecast model and a trading model that is as simple as possible. In particular, here follows an out-of-sample test started at the end of August and lasting about 9 months, based on a long-short strategy.

The reference markets are the futures on the Nasdaq, S&P and Russell 2000, and the strategy outperformed the markets by 34%.

Machine Learning – AutoML vs Hyperparameter Tuning

This entry is part 2 of 6 in the series Machine Learning

Picking up from where we left off: majority voting (for classification) or averaging (for regression) combines the independent predictions into a single prediction. In most cases, combining the single independent predictions does not lead to a better prediction, unless all classifiers are binary and have equal error probability.

The test procedure of the Majority Voting methodology is aimed at defining a selection algorithm through the following steps:

  • Random sampling of the dataset and use of known algorithms for a first selection of the classifier
  • Validation of the classifier n times, with n = number of cross-validation folds
  • Extraction of the best result
  • Use of the n classifiers to produce predictions, selecting the result expressed by the majority of the predictions
  • Comparison of the output with the real classes
  • Comparison of the result with each of the scores of the individual classifiers
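The steps above can be sketched with scikit-learn; the dataset, the candidate algorithms and the fold count below are illustrative placeholders, not the ones used in the actual tests:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder dataset standing in for the sampled real data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 1-3: cross-validate each candidate classifier and note the best score
candidates = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "lr": LogisticRegression(max_iter=1000),
    "dt": DecisionTreeClassifier(random_state=0),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_tr, y_tr, cv=5)
    print(name, "mean CV accuracy:", scores.mean())

# Steps 4-6: hard majority voting across the classifiers,
# compared against the real classes of a held-out set
voter = VotingClassifier(estimators=list(candidates.items()), voting="hard")
voter.fit(X_tr, y_tr)
print("voting accuracy:", voter.score(X_te, y_te))
```

Comparing the voting score with the individual cross-validation scores then tells you whether the ensemble is worth its extra cost on that dataset.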

Once the test has been carried out on a representative sample of datasets (representative by nature rather than by number), the algorithm selection will be implemented so that it takes place in advance, at the time the new dataset to be analysed is acquired.

First of all, when there is a significant discrepancy between the accuracies measured on the data used, selecting the single most accurate algorithm certainly leads to a better result than using a majority voting algorithm, which picks the prediction expressed by the majority vote (or the average, in the case of regression).

If, on the other hand, the accuracies of the algorithms used are comparable, or the same algorithm is used in parallel with different data samplings, then one can proceed as previously described and briefly reported below:

  • Random sampling of the dataset and use of known algorithms for a first selection of the classifier
  • Validation of the classifier n times, with n = number of cross-validation folds
  • Extraction of the best result
  • Use of the n classifiers to produce predictions, selecting the result expressed by the majority of the predictions
  • Comparison of the output with the real classes
  • Comparison of the result with each of the scores of the individual classifiers

Below is the test log.

Target 5 bins

Accuracy for Random Forest is: 0.9447705370061973
— 29.69015884399414 seconds —
Accuracy for Voting is: 0.9451527478227634
— 46.41354298591614 seconds —

Target 10 bins

Accuracy for Random Forest is: 0.8798219989625706
— 31.42287015914917 seconds —
Accuracy for Voting is: 0.8820879630893554
— 58.07574200630188 seconds —

As can be seen, the improvement over the models obtained with the pure algorithms is marginal, less than 0.25 percentage points in both cases, while the computation time increases by more than 50%.

AutoML vs Hyperparameter Tuning

We used a Python library called “Lazy Classifier”, which simply takes the dataset and tests it against all the defined algorithms. The result for a dataset of 95k rows by 44 columns (the normalized log of backdoor malware vs normal web traffic) is shown in the figure below.

Lazy classifier for Auto-ML algorithm test
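In essence, this kind of AutoML pass just fits a list of candidate classifiers and reports their scores and timings. A minimal hand-rolled equivalent with scikit-learn, on a synthetic dataset since the malware traffic log is not reproduced here, might look like:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Placeholder dataset standing in for the 95k x 44 traffic log
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit each candidate, recording held-out accuracy and wall-clock time
results = {}
for clf in (LogisticRegression(max_iter=1000), GaussianNB(),
            RandomForestClassifier(random_state=0)):
    start = time.time()
    clf.fit(X_tr, y_tr)
    results[type(clf).__name__] = (clf.score(X_te, y_te), time.time() - start)

for name, (acc, secs) in results.items():
    print(f"{name}: accuracy={acc:.3f}, time={secs:.2f}s")
```

Ranking the table by accuracy (or by time, if energy matters more) gives the same kind of overview as the figure above.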

It can be noted that picking the best algorithm would, in some cases, yield an improvement of several percentage points (e.g. Logistic Regression vs Random Forest) and, in most cases, savings in computation time and thus in energy consumed. For example, if we take the Random Forest algorithm and compare the run with standard parameters against the optimizations obtained with Grid Search and Random Search, we can see that the improvement is, again, not worth the effort of repeating the computation n times.
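The comparison can be reproduced in outline with scikit-learn's GridSearchCV and RandomizedSearchCV; the parameter grid and the dataset below are illustrative, not the ones from the original test:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)

# Placeholder dataset standing in for the real traffic log
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline: Random Forest with standard parameters
base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("default accuracy:", base.score(X_te, y_te))

param_grid = {"n_estimators": [20, 50], "max_depth": [5, 10, None]}

# Grid Search: exhaustive over the grid (6 combinations x cv folds fits)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=3).fit(X_tr, y_tr)
print("grid best:", grid.best_params_, grid.best_score_)

# Random Search: samples a fixed number of combinations from the same space
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=3, cv=3, random_state=0).fit(X_tr, y_tr)
print("random best:", rand.best_params_, rand.best_score_)
```

The number of fits is the cost to watch: the grid performs every combination times the fold count, while random search caps the fits at n_iter times the fold count, which is why it scales better as the grid grows.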

**** Random Forest Algorithm standard parameters ****
Accuracy: 0.95
Precision: 1.00
Recall: 0.91
F1 Score: 0.95

**** Grid Search ****
Best parameters: {'max_depth': 5, 'n_estimators': 100}
Best score: 0.93
**** Random Search ****
Best parameters: {'n_estimators': 90, 'max_depth': 5}
Best score: 0.93