Chaos in the Markets: Mandelbrot’s Fractal Vision of Finance

Some years back I received a gift from a friend (Antonio): the game-changing book “The (Mis)Behavior of Markets: A Fractal View of Risk, Ruin, and Reward”. In 2004 Mandelbrot introduced a new theory of market movements, suggesting that they are far more chaotic than classical theory describes. He showed that extreme price movements (crashes, booms) occur much more often than expected: the tails of the distribution of price changes are fatter than assumed, and markets are therefore riskier than the models in use can predict (including the most recent ones based on AI and ML). I have identified two clear implications. The first is that the Black-Scholes model is not well suited for options pricing, since it assumes that asset returns follow a normal distribution, while a fat-tailed distribution is closer to a Cauchy distribution. New fractal models have since been introduced, based on the assumption of self-similarity of price variations across scales, driven by volatility or, as I like to say, the energy of the system. It is curious to me that, to some extent, the contents of the book predicted the crash of 2008. The second implication is that more attention should be paid to extreme cases, as they occur more frequently than expected.

As for the first point, I wanted to verify the distribution of the daily price variations of three real underlyings, AAPL, BRK-A and IBM, over the last 42 years. You may notice that they follow a distribution closer to a Cauchy than to a Gaussian.
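As a quick sanity check, here is a minimal sketch of that comparison under stated assumptions: it fits both a Gaussian and a Cauchy distribution to daily log returns by maximum likelihood and compares the two fits. The file name aapl_daily.csv and its Close column are hypothetical placeholders for whatever price history is available.

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical input: a CSV of daily closes with a "Close" column.
prices = pd.read_csv("aapl_daily.csv")["Close"].dropna()
returns = np.log(prices / prices.shift(1)).dropna()

# Fit both distributions by maximum likelihood.
mu, sigma = stats.norm.fit(returns)
loc_c, scale_c = stats.cauchy.fit(returns)

# Compare log-likelihoods: on equity returns the fatter-tailed Cauchy usually wins.
ll_norm = stats.norm.logpdf(returns, mu, sigma).sum()
ll_cauchy = stats.cauchy.logpdf(returns, loc_c, scale_c).sum()
print(f"log-likelihood Normal: {ll_norm:.1f}  Cauchy: {ll_cauchy:.1f}")

# A Kolmogorov-Smirnov test gives a rough goodness-of-fit check for each.
print(stats.kstest(returns, "norm", args=(mu, sigma)))
print(stats.kstest(returns, "cauchy", args=(loc_c, scale_c)))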

One of the techniques most commonly suggested by professional traders is to hedge by buying one put option for every 100 shares owned. Using AAPL again as a reference: a private investor owns 100 shares of Apple (AAPL) purchased at $218.27 each (the close on last Friday) and, to protect the investment against potential declines, decides to buy a put option with a strike price of $210 and an expiration in one month, priced at $3.65. The cost of this option, known as the premium, varies based on factors like volatility and time until expiration; the investor has one month to exercise the option and can therefore profit both from an excessive drop and from a rise in price. If the price stays unchanged for one month, the loss is the option premium, i.e. $365. Of course, if three put options with a $200 strike are purchased, or if that price level is reached in an extremely volatile market, the option value is not linear and the gain might be more substantial, covering the loss or even making the overall trade profitable. One last obvious consideration: when exercising the option at a profit, the long position on the underlying should be closed as well, to avoid a loss.
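A minimal sketch of the resulting profit and loss at expiration, using the figures quoted above (100 shares bought at $218.27, one $210 put bought for $3.65); commissions, dividends and early exercise are ignored.

def protective_put_pnl(spot_at_expiry: float,
                       entry_price: float = 218.27,
                       strike: float = 210.0,
                       premium: float = 3.65,
                       shares: int = 100) -> float:
    """P/L of 100 shares hedged with one put, at the option's expiration."""
    stock_pnl = (spot_at_expiry - entry_price) * shares
    put_payoff = max(strike - spot_at_expiry, 0.0) * shares   # intrinsic value only
    return stock_pnl + put_payoff - premium * shares

# Flat price -> loss equals the $365 premium; the put caps the downside below $210.
for spot in (180, 210, 218.27, 240):
    print(f"spot {spot:>7.2f} -> P/L {protective_put_pnl(spot):>9.2f} $")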

And now we come to the backtest simulation anticipated in the last article, which is reported in the following video. I have hypothesized two thresholds from the open, +3σ and -3σ (under the assumption that the distribution of the previous days is normal, even though we have understood it is not), with a buy when the price crosses +3σ and a sell when it crosses -3σ. If the price rebounds, the trade is closed at the entry price to avoid losses; otherwise it is closed at the end of the day. Over two years this would have implied 389 operations and a 13.4% gain on $100K, assuming a fixed trading commission and perfectly executed stop-limit orders (for lower capital the impact of the bid/ask spread is too high). You also need a fully automated trading system to do that. As usual, it is a matter of an acceptable or unacceptable risk/reward ratio.
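For reference, a minimal sketch of that rule under stated assumptions: daily OHLC bars in a DataFrame, σ estimated from the previous days' open-to-close returns, entry exactly at the ±3σ threshold, exit at breakeven if the price rebounds through the entry level, otherwise at the close. Data loading, commissions and slippage are left out, and the 20-day lookback is an illustrative choice.

import pandas as pd

def breakout_backtest(df: pd.DataFrame, lookback: int = 20, k: float = 3.0) -> pd.Series:
    """df needs 'Open', 'High', 'Low', 'Close' columns; returns one fractional P/L per trade."""
    oc_ret = (df["Close"] - df["Open"]) / df["Open"]
    sigma = oc_ret.rolling(lookback).std().shift(1)       # estimated on past days only
    upper = df["Open"] * (1 + k * sigma)
    lower = df["Open"] * (1 - k * sigma)
    trades = []
    for day in df.index[lookback:]:
        if df.loc[day, "High"] >= upper[day]:              # long breakout above +3 sigma
            entry = upper[day]
            # rebound back below the entry price -> closed flat, otherwise exit at the close
            exit_price = entry if df.loc[day, "Close"] < entry else df.loc[day, "Close"]
            trades.append((exit_price - entry) / entry)
        elif df.loc[day, "Low"] <= lower[day]:             # short breakout below -3 sigma
            entry = lower[day]
            exit_price = entry if df.loc[day, "Close"] > entry else df.loc[day, "Close"]
            trades.append((entry - exit_price) / entry)
    return pd.Series(trades, dtype=float)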

Considering that it is not easy to gather historical option prices, since the price depends on the strike, the current underlying price and also on volatility, and assuming instead that there is no overpricing due to volatility (and the current market is extremely volatile), the dynamic P/L is shown in the following figure.

To make a long story short, looking at the graphs, in both cases (buying 1 PUT or 3 PUTs), with a risk lower than 5% at any time, you could easily gain more than 20% in two weeks, not considering the volatility component, which would raise that value considerably.

Given that an acceptable risk/reward ratio is absolutely subjective – I have never managed to get an answer to the question “what is an acceptable risk/reward ratio?” – it is up to the reader to evaluate whether this is a good and safe trading strategy.

Fostering ethical machine learning

If there is a sport that, in my opinion, can serve well to explain how machine learning works, it’s tennis. Training requires thousands of balls, and it’s estimated that over ten years of practice, more than a million shots are played.

Why, then, do even professional players sometimes miss seemingly easy shots during matches when their error rate in training is often much lower? This is a typical case of overfitting, where the model has been generated from countless balls played, mostly of the same type, hit by the coach or trainer, while during a tournament, players encounter opponents and game situations that are very different—some never seen before.

Styles of play, speed, ball spin, and trajectories can be entirely different from those seen in training. Personally, I think I’ve learned more from matches I lost miserably than from months of training with similar drills. No offense to coaches and trainers—they know it well themselves, having built their experience largely through hundreds of tournaments and diverse opponents.

The similarity with machine learning is quite obvious. Machine learning, like tennis, requires preparation based on experience and the ability to adapt to unforeseen contexts. A model trained on overly homogeneous data may seem very effective during training but fail to recognize new situations—a limitation that only diverse exposure can overcome. Just as a tennis player grows stronger by facing opponents with different styles, a machine learning model improves with data that reflects the variety and complexity of the real world.

In both cases, improvement doesn’t come solely from mechanical repetition but from iterative learning, analyzing errors, and refining strategies. Each mistake, each failure, is a step toward a more resilient and capable system—or player. This is the key to overcoming the limits of overfitting and building skills that go beyond mere memorization, allowing excellence in unexpected conditions.

Now let’s move on to the part that interests us the most: bias. By its nature, an explainable machine learning algorithm (for example, a decision tree) generates models that must “split” on attributes. At some point within the tree (unless the tree’s depth is reduced to avoid this situation), a decision will have to be made based on the value of an attribute, which could lead to discrimination based on gender, age, or other factors.

Less explainable algorithms produce results that evaluate all variables simultaneously, thus avoiding decisions based on a single variable. However, there are methodologies (like Shapley values or insights such as those from Antonio Ballarin https://doi.org/10.1063/5.0238654) that allow verification of the impact of a single variable’s variation on a particular target value.
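As an illustration, here is a minimal sketch with the shap package, assuming an already-trained tree-based model (e.g. a random forest) named model and its feature matrix X in a pandas DataFrame; both names are placeholders for whatever model and data are being audited.

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Note: for classifiers shap_values may be a list with one array per class;
# in that case pick the class of interest, e.g. shap_values[1].

# Global view: mean absolute contribution of each feature to the predictions,
# which makes it easy to spot whether a sensitive attribute dominates the model.
shap.summary_plot(shap_values, X, plot_type="bar")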

In short, no matter how balanced the dataset is and how low the impact of the observed variable on the target is, there will always be slight bias in the generated model. A temporary solution, keeping the tennis example in mind, is to eliminate the variables that could cause the model to behave in ways deemed unethical (e.g., age, gender, nationality) and construct an initial model that is certainly less accurate than one using all variables, but usable from day one.

As the model learns, increasingly de-biased data will be provided (data must be filtered at the source, balancing the number of cases, for instance, between genders). Meanwhile, the algorithm (which at this point won’t know the value of the excluded attribute because it doesn’t exist) will update the model, enabling it to generalize more and more—like an athlete participating in a large number of tournaments.
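A minimal sketch of this “exclude the sensitive attributes, retrain as cleaner batches arrive” approach; the sensitive column names, the DataFrame df and the label column are hypothetical placeholders.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SENSITIVE = ["Gender", "Age", "Nationality"]    # attributes excluded from day one

def train_debiased(df, target="label"):
    X = df.drop(columns=SENSITIVE + [target], errors="ignore")
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X_train, y_train)
    print(f"holdout accuracy without sensitive attributes: {model.score(X_test, y_test):.3f}")
    return model

# As new, re-balanced batches arrive, simply call train_debiased(updated_df) again.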

The Philosopher’s Stone

This entry is part 6 of 6 in the series Machine Learning

Well, here we are. Two years of study and one year of backtesting on ten stocks selected from the blue chips of the Dow Jones, Nasdaq, Russell 2000, and S&P 500, which achieved a 61.48% return in 2024.

Now, we’ll test in quasi-real conditions with the ten selected stocks for 2025.

Funds allocation will be 1/N, for example, $100K divided into ten roughly equal slots (subject to slight rounding down, depending on the stock price as of January 2, 2025).
Update scheduled for end of June, 2025.

Happy 2025!


June 29th, 2025

Here we are, six months in on the ten selected stocks, after huge earthquakes caused by several macroeconomic shocks. Below is the equity line, showing a 15.9% gain and a maximum drawdown of 5.9%. Let’s see the final update at the end of 2025.

Ethics of algorithms or data? Or how they are used?

By now, we are all aware of the potential of AI and, to some extent, the risks associated with its unethical use.

However, I would like to bring attention to a use case that might change the perception of what is ethical and what the definition of ethics entails.

A well-known open dataset from UCI includes the characteristics of employees in a company, and among the attributes, there is a variable that can be used as a target, representing the status of the employee (attrition: yes or no).

The objective could be to pay more attention to employees who, according to the model, appear to be at higher risk of attrition, and this goal might alter the concept of ethics (which, by the way, is not uniform across communities, cultures, or contexts). For example, dataset bias related to attributes like gender or age could in this case help focus more on the disadvantaged groups (here identified by those attributes). This is just a different point of view, and it doesn’t necessarily mean it is an ethical approach (e.g. somebody may object that a model built on this data would allow retaining only employees with high scores in performance reviews).

Below is an analysis of the dataset that highlights some interesting aspects, such as the importance of certain attributes that may not be intuitively significant, or vice versa. For instance, after removing the employee number, which represents an identity, monthly salary ranks only fifth in importance, while gender is among the least important, thus having minimal influence on the target variable.

By splitting on the attribute that most reduces the Gini impurity 2p(1-p), a binary tree is constructed in this way and shown in the figure.
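A minimal sketch that reproduces this kind of analysis, assuming the widely used IBM HR employee-attrition dataset (the Attrition and EmployeeNumber column names follow that dataset; the CSV file name is hypothetical): a Gini-based decision tree is fitted and the resulting attribute importances are ranked.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("hr_attrition.csv")                     # hypothetical file name
y = (df["Attrition"] == "Yes").astype(int)
X = (df.drop(columns=["Attrition", "EmployeeNumber"])    # drop the identity column
       .pipe(pd.get_dummies))                            # one-hot encode categoricals

# CART with the Gini criterion 2p(1-p), as in the tree shown in the figure.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0).fit(X, y)

importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))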

The second variable to observe is precisely OverTime, which also represents the dependency of attrition on the overtime value recorded for the employee. In this case we use a CNN and Shapley values to determine the dependence of the target on the independent variables.

Finally, we must note that age has a strong impact on the decision, but its effect is quite fragmented: it is selected close to the leaves, where it separates the classes very well. Below are two examples of clear separation between the two classes.

Edited by G.Fruscio

Mean variance and 1/N heuristic portfolio

As a next step, being no expert in portfolio management, I found interesting books and papers making different uses of the Nobel-prize-winning Markowitz strategy, AKA the mean-variance portfolio, https://en.wikipedia.org/wiki/Markowitz_model (MPT, Modern Portfolio Theory), applied to 10 common stocks from the Nasdaq-100.

The applied algorithm doesn’t allow trades at every timeframe (15m, hour, day), so some approaches cannot be applied (e.g. fixed allocation, long-only strategy, etc.); however, the lowest-risk portfolio and weights found brought a CAGR = 0.5402, Sharpe ratio = 9.5759, max drawdown = 2.6948%.
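For comparison, a minimal sketch of the classic closed-form minimum-variance allocation, assuming a DataFrame prices of daily closes for the selected tickers (placeholder name) and ignoring constraints such as no short selling.

import numpy as np
import pandas as pd

def min_variance_weights(prices: pd.DataFrame) -> pd.Series:
    returns = prices.pct_change().dropna()
    cov = returns.cov().values
    ones = np.ones(cov.shape[0])
    inv = np.linalg.pinv(cov)                  # pseudo-inverse for numerical safety
    w = inv @ ones / (ones @ inv @ ones)       # w = Σ⁻¹·1 / (1ᵀ·Σ⁻¹·1)
    return pd.Series(w, index=prices.columns)

# weights = min_variance_weights(prices)       # `prices` is assumed to be loaded elsewhere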

Mean Variance Portfolio

A different approach would be to apply the 1/N heuristic, a simple investment strategy where an investor allocates an equal proportion of their total capital across N assets. It offers some advantages: it is simple, and it may reduce the unsystematic risk associated with individual assets (though it is not guaranteed that past behavior will match future behavior). However, it is not optimal in terms of risk-adjusted return, since it treats all assets equally regardless of their risk/return characteristics.

The underlying selection follows a fast-and-frugal approach: first keep only the underlyings with acceptable risk (e.g. max drawdown), then pick the 10 best performing. Resulting CAGR = 1.0249, Sharpe ratio = 10.0465, max drawdown = 3.2248%.
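A minimal sketch of this fast-and-frugal selection plus 1/N weighting, again assuming a DataFrame prices of daily closes (tickers as columns); the drawdown threshold and the selection window are illustrative parameters.

import pandas as pd

def max_drawdown(series: pd.Series) -> float:
    cummax = series.cummax()
    return ((series - cummax) / cummax).min()            # most negative value, e.g. -0.25

def one_over_n_selection(prices: pd.DataFrame, dd_limit: float = -0.30, n: int = 10) -> pd.Series:
    # 1) keep only underlyings whose historical drawdown is acceptable
    acceptable = [c for c in prices.columns if max_drawdown(prices[c]) >= dd_limit]
    # 2) among those, pick the n best performers over the whole window
    perf = prices[acceptable].iloc[-1] / prices[acceptable].iloc[0] - 1
    selected = perf.sort_values(ascending=False).head(n).index
    # 3) allocate equally (1/N)
    return pd.Series(1.0 / len(selected), index=selected)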

1/N Heuristic Approach

Multislot Performance Example

In algorithmic trading platforms, “multislot” refers to the system’s ability to manage multiple trading algorithms or strategies simultaneously. Each “slot” represents a distinct strategy or trading approach, allowing the system to execute several strategies in parallel. This capability enhances both optimization and diversification, as the system can apply these strategies within the same market or across different markets, assets, or trading techniques. Essentially, “multislot” allows the system to handle multiple orders at the same time. For example, a trader could place various orders to buy or sell different assets in varying quantities, each with specific execution criteria (e.g., market orders, limit orders). All of these are managed concurrently by the trading system.

In this example, we’ll make some assumptions:

  • The underlying assets are high-volume stocks, selected blue-chip companies (e.g., AAPL, ADBE, AMD, AMZN, CSCO, GOOG, INTC, MRVL, MSFT, NVDA, TSLA);
  • The algorithms demonstrate an average accuracy of 57%. For this simulation, the trading signals are selected randomly;
  • The trading strategy is straightforward: positions are either long or short, with positions opened at the market open and closed at the market close;
  • Each slot has a fixed capital allocation of $10,000;
  • Leverage is equal to 1.
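A minimal Monte Carlo sketch of the setup just described: one $10,000 slot per ticker, a long/short signal each day that is right 57% of the time, positions opened at the open and closed at the close, leverage 1. The DataFrame open_close_returns (daily open-to-close returns per ticker) is a placeholder for real data.

import numpy as np
import pandas as pd

def simulate_multislot(open_close_returns: pd.DataFrame,
                       accuracy: float = 0.57,
                       slot_capital: float = 10_000,
                       seed: int = 0) -> pd.Series:
    rng = np.random.default_rng(seed)
    # A "correct" signal captures the day's open-to-close move, a wrong one suffers it.
    correct = rng.random(open_close_returns.shape) < accuracy
    signed = np.where(correct, open_close_returns.abs(), -open_close_returns.abs())
    daily_pnl = pd.DataFrame(signed, index=open_close_returns.index,
                             columns=open_close_returns.columns) * slot_capital
    return daily_pnl.sum(axis=1).cumsum()                # combined equity curve in dollars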

The results of the simulation are shown in the previous figure.

Deductions from gross gains should be applied and include:

  • Trading fees: Assume $2 per transaction. With 250 trading days and two transactions per day (open and close), this results in an annual trading fee of $1,000 per underlying;
  • Taxes: Tax liabilities can vary significantly depending on the country;
  • Market volatility: Since this is a long/short strategy, we assume that the volatility at the open and close of the market will often cancel out, meaning any movement in one direction will likely be balanced by the opposite movement, resulting in a net sum of zero.

(Image: Photo of Coinstash from Pixabay)

Timeseries Forecast

This entry is part 4 of 6 in the series Machine Learning

Ancient civilizations, such as the Romans and the Greeks, used various methods for predicting the future. These methods were often based on superstition and religious beliefs. Some of the techniques used by the Romans included divination through chickens, human sacrifice, urine, thunder, eggs, and mirrors. The Ancient Greeks also had a variety of methods, such as water divination, smoke interpretation, and the examination of birthmarks and birth membranes. These practices were deeply rooted in their cultural and religious traditions.

A scientific approach (one of many possible) involves the use of Machine Learning, where time series prediction brings a number of benefits: it identifies patterns in data, such as seasonal trends, cyclical patterns, and other regularities that may not be apparent from individual data points. Obviously, the main goal is to predict future data points based on historical patterns, allowing businesses to anticipate trends, plan for demand, and make informed decisions. Finally, it helps in better understanding a data set and in cleaning the data by filtering out the noise, removing outliers, and gaining an overall perspective of the data. These methods are widely used in various industries, including finance, economics, and business, to make data-driven decisions and predictions.

The time-series prediction work was partially founded on the “Laocoonte” project and was based on the fast training of a large number of ML models, tested in parallel, applying at each step the most accurate one. The prediction module was tested in two specific areas, chosen based on the availability of real data and future business opportunities: time series prediction (temperature and stock value prediction) and tabular data prediction (churn prediction).
In order to obtain a benchmark against traditional models based on statistical algorithms (ARIMA type), the time series of the average daily temperature of Vancouver over the last twenty years was taken as a reference. The sample was then split as follows:

  • 66% of the samples to train the model;
  • 33% of samples to test the model.
The result, applied to 365 samples, is shown in the following figure; the measured MAPE (Mean Absolute Percentage Error) is 1.206.
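For reference, a minimal sketch of the 66/33 split and the MAPE metric, using a seasonal-naive forecaster as a stand-in for whichever model is being benchmarked; temps is a placeholder pandas Series of daily average temperatures.

import numpy as np
import pandas as pd

def mape(actual, predicted) -> float:
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)))   # multiply by 100 for %

def benchmark(temps: pd.Series, horizon: int = 365) -> float:
    split = int(len(temps) * 0.66)                    # 66% train / 33% test
    test = temps.iloc[split:split + horizon]
    # Seasonal-naive baseline: predict the value observed 365 days earlier.
    predicted = temps.shift(365).iloc[split:split + horizon]
    return mape(test, predicted)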

For the same sample, this time drawn from a different year (September 2004 – September 2005), the Auto-ML predictive model showed lower performance, with a MAPE score of 1.647.

Finally, the SMOreg regression algorithm with a lag of 60 was applied to the same sample, information on seasonality was introduced, and the prediction was carried out with the same train/test ratio. The measured MAPE was 1.2053.

One of the potential applications of time series prediction is forecasting future values of an underlying, building the model from historical data. Below is a chart representing a backtest on the S&P 500 E-mini future (equity line in dollars, maintenance margin $11,200).

Precise information filtering deriving from laws present in nature

This entry is part 3 of 6 in the series Machine Learning

When the probability of measuring a particular value of some quantity varies inversely as a power of that value, the quantity is said to follow a power law, also known variously as Zipf’s law or the Pareto distribution (1). Looking at the figures, it is really astonishing how most of the information is carried by such a low percentage of words in a text.
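A minimal sketch of a rank-frequency check for this Zipf-like behavior in a text; the variable text is a placeholder for the raw document string.

import re
from collections import Counter

import numpy as np

words = re.findall(r"[a-z]+", text.lower())
counts = np.array([c for _, c in Counter(words).most_common()], dtype=float)
ranks = np.arange(1, len(counts) + 1)

# Slope of log(frequency) vs log(rank); a value near -1 is the classic Zipf law.
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"estimated power-law exponent: {slope:.2f}")

# Share of all word occurrences covered by the top 1% of distinct words.
top = int(max(1, 0.01 * len(counts)))
print(f"top 1% of words cover {counts[:top].sum() / counts.sum():.1%} of the text")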

With other objectives in mind, C. Shannon dedicated most of his life and research to work that became fundamental for Information Theory; its results make it possible to understand what loss of information is acceptable in order to transfer a sequence of signals efficiently.

We have transposed the two approaches in order to perform a deterministic search in text: the total words in a number of texts of the same kind are filtered by identifying the keywords of an ontological map built for the purpose.

We allowed a human to perform reinforcement learning on the submitted text, and after a very small number of iterations we obtained incredible results.

In about 88% of cases it was possible to identify exactly the piece of information we were looking for, and in all cases we were able to identify the two or three sentences containing it, out of documents exceeding 130 pages in length.

This approach, being a “weak AI” achievement, offers the following advantages:

  • it uses certain, traceable and validated data sources;
  • it produces deterministic rather than probabilistic results, without using generative AI;
  • it relies on the ability of the domain expert for reinforcement learning;
  • the knowledge base can be kept in-house and will represent a corporate asset.

(to be continued)

By A. Ballarin, G.Fruscio

The Zipf-Mandelbrot-Pareto law is a combination of the Zipf, Pareto, and Mandelbrot distributions. It is a power-law distribution on ranked data, used to model phenomena where a few events or entities are much more common than others; its probability mass function is a power law over ranks, similar to Zipf’s law. The Zipf-Mandelbrot-Pareto model is often used to model co-authorship popularity, insurance frequency, vocabulary growth, and other phenomena (2,3,4,5).
Applications of the Zipf-Mandelbrot-Pareto law include modeling the frequency of words in a corpus of text data, modeling income or wealth distributions, and modeling insurance frequency. It is also used in the study of vocabulary growth and of the relationship between the Heaps and Zipf laws (2,3,4,5).
Overall, the Zipf-Mandelbrot-Pareto law is a useful tool for modeling phenomena where a few events or entities are much more common than others, with applications in linguistics, economics, insurance, and other fields (2,3,4,5).
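For concreteness, a minimal sketch of the standard Zipf-Mandelbrot probability mass function, f(k; N, q, s) = (1 / H(N, q, s)) · 1 / (k + q)^s for ranks k = 1..N, where H is the normalizing sum; the parameter values below are purely illustrative, not fitted to any dataset.

import numpy as np

def zipf_mandelbrot_pmf(N: int, q: float, s: float) -> np.ndarray:
    ranks = np.arange(1, N + 1)
    weights = 1.0 / (ranks + q) ** s           # power law shifted by the Mandelbrot offset q
    return weights / weights.sum()             # normalize so the probabilities sum to 1

pmf = zipf_mandelbrot_pmf(N=1000, q=2.7, s=1.1)
print(pmf[:5], pmf.sum())                      # first few rank probabilities; sums to 1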

  1. http://www.cs.cornell.edu/courses/cs6241/2019sp/readings/Newman-2005-distributions.pdf
  2. https://eforum.casact.org/article/38501-modeling-insurance-frequency-with-the-zipf-mandelbrot-distribution
  3. https://www.r-bloggers.com/2011/10/the-zipf-and-zipf-mandelbrot-distributions/
  4. https://journalofinequalitiesandapplications.springeropen.com/articles/10.1186/s13660-018-1625-y
  5. https://www.sciencedirect.com/science/article/pii/S0378437122008172

Machine Learning – AutoML vs Hyperparameter Tuning

This entry is part 2 of 6 in the series Machine Learning

Starting from where we left off, majority voting (ranking) or averaging (regression) combines the independent predictions into a single prediction. In most cases the combination of the single independent predictions doesn’t lead to a better prediction, unless all classifiers are binary and have equal error probability.

The test procedure of the Majority Voting methodology is aimed at defining a selection algorithm based on the following steps (a minimal code sketch follows the list):

  • Random sampling of the dataset and use of known algorithms for a first selection of the classifier
  • Validation of the classifier n times, with n = number of cross-validation folds
  • Extraction of the best result
  • Using the n classifiers to evaluate the predictions, apply the selector of the result expressed by the majority of the predictions
  • Comparison of the output with the real classes
  • Comparison of the result with each of the scores of the individual classifiers
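A minimal sketch of these steps, assuming a feature matrix X and integer-encoded labels y are already loaded (placeholder names) and using three off-the-shelf classifiers as candidates.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

classifiers = {
    "tree": DecisionTreeClassifier(random_state=42),
    "forest": RandomForestClassifier(random_state=42),
    "logreg": LogisticRegression(max_iter=1000),
}

# Cross-validate each candidate and keep its score (first three steps).
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"{name}: cross-validated accuracy {scores.mean():.3f}")
    clf.fit(X_train, y_train)

# Majority vote of the individual predictions, compared with the real classes (last three steps).
preds = np.array([clf.predict(X_test) for clf in classifiers.values()])
majority = np.apply_along_axis(lambda column: np.bincount(column).argmax(), 0, preds)
print("majority-voting accuracy:", float((majority == y_test).mean()))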

Once the test has been carried out on a representative sample of datasets (representative by nature rather than by number), the selection of the algorithm will be implemented and will take place in advance, at the time of acquisition of the new dataset to be analysed.

First of all, when there is a significant discrepancy between the accuracies measured on the data used, selecting the single most accurate algorithm certainly leads to the better result compared to a majority-voting algorithm, which selects the prediction expressed by the majority of the classifiers (or the average, in the case of regression).

If, on the other hand, the accuracy of the algorithms used is comparable, or the same algorithm is used in parallel with different data samplings, then one can proceed as previously described and briefly summarized below:

  • Random sampling of the dataset and use of known algorithms for a first selection of the classifier
  • Validation of the classifier n times, with n = number of cross-validation folds
  • Extraction of the best result
  • Using the n classifiers to evaluate the predictions, apply the selector of the result expressed by the majority of the predictions
  • Comparison of the output with the real classes
  • Comparison of the result with each of the scores of the individual classifiers

Below is the test log.

Target 5 bins

Accuracy for Random Forest is: 0.9447705370061973
— 29.69015884399414 seconds —
Accuracy for Voting is: 0.9451527478227634
— 46.41354298591614 seconds —

Target 10 bins

Accuracy for Random Forest is: 0.8798219989625706
— 31.42287015914917 seconds —
Accuracy for Voting is: 0.8820879630893554
— 58.07574200630188 seconds —

As can be seen, the improvement obtained over the models built with the individual algorithms is between 0.1% and 0.5%.

Auto-ML vs Hyperparameters Tuning

We have used a Python library called “Lazy Classifier” which simply takes the dataset and tests it against all the defined algorithms. The result for a dataset of 95k rows by 44 columns (a normalized log of backdoor-malware traffic vs normal web traffic) is shown in the figure below.

Lazy classifier for Auto-ML algorithm test
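A minimal sketch of that test using the lazypredict package (the library behind “Lazy Classifier”); X and y are placeholders for the normalized traffic dataset.

from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)

# `models` is a DataFrame with accuracy, F1 score and fit time for every tested algorithm.
print(models.head(10))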

It can be noted that picking the best algorithm would, in some cases, result in an improvement of several percentage points (e.g. Logistic Regression vs Random Forest) and, in most cases, in savings in computation time and therefore in energy consumed. For example, if we take the Random Forest algorithm and compare the run with standard parameters against the optimization obtained with Grid Search and Random Search, you can see that the improvement is, again, not worth the effort of repeating the computation n times.

**** Random Forest Algorithm standard parameters ****
Accuracy: 0.95
Precision: 1.00
Recall: 0.91
F1 Score: 0.95

**** Grid Search ****
Best parameters: {'max_depth': 5, 'n_estimators': 100}
Best score: 0.93
**** Random Search ****
Best parameters: {'n_estimators': 90, 'max_depth': 5}
Best score: 0.93
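For reference, a minimal sketch of the comparison above: Random Forest with default parameters versus Grid Search and Random Search over a small grid. X and y are placeholders for the traffic dataset, and the grid values simply mirror the log above.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("default-parameter accuracy:", baseline.score(X_test, y_test))

param_grid = {"n_estimators": [50, 90, 100, 150], "max_depth": [3, 5, 10, None]}

grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)
print("Grid Search best:", grid.best_params_, grid.best_score_)

rand = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_grid,
                          n_iter=8, cv=3, random_state=42)
rand.fit(X_train, y_train)
print("Random Search best:", rand.best_params_, rand.best_score_)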