When deep learning models are applied to financial scenarios, data risk mainly comes from three sources: bias in the training data, erroneous data or erroneous data preprocessing, and the legal compliance of the data. Setting legal compliance aside, this article focuses on the first two issues.
Bias in the training data causes the model to learn incorrect relationships and patterns during training. This risk takes several forms:
Because the model is trained on historical data, it may learn patterns or relationships that no longer hold. For example, an option model trained on data from a period of low volatility, or under an outdated regulatory regime, is likely to introduce new risks in today's environment. Financial markets are not static; on the contrary, lured by profit, new instruments and methods emerge constantly, and a market dominated by high-frequency trading is plainly different from the market of ten years ago. Yet if only recent data are used for training, the sample size becomes small.
Unreasonable relationships are hard to identify in a black-box model. As the dimensionality of the model grows, the number of potential relationships it infers also grows rapidly, and many of these relationships are not necessarily sound, or may hold only during a particular period.
Bias in the sampling process may mean the data do not truly reflect the state of the market. For example, a firm may want to run the same algorithm across different exchanges, each with its own specifications and data; how to sample them reasonably is an important problem.
The loss function in deep learning optimizes the model's overall performance, so it may neglect samples that are rare but important. Even if the training data include a high-volatility period, those samples make up a small share of the data, so the model may still perform well only in low-volatility markets.
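As a minimal sketch of this point, assuming a plain mean-squared-error loss and hand-picked sample weights (all numbers here are illustrative):

```python
def weighted_mse(preds, targets, weights):
    """MSE where each sample's squared error is scaled by a weight.
    Upweighting rare regimes (e.g. crisis days) keeps them from being
    drowned out by the majority of calm observations."""
    num = sum(w * (p - t) ** 2 for p, t, w in zip(preds, targets, weights))
    return num / sum(weights)

# 95 calm samples the model fits well, 5 crisis samples it fits badly.
preds = [0.0] * 100
targets = [0.01] * 95 + [0.5] * 5

uniform = weighted_mse(preds, targets, [1.0] * 100)
# Upweight the 5 crisis samples ~19x so each regime contributes equally.
balanced = weighted_mse(preds, targets, [1.0] * 95 + [19.0] * 5)
print(uniform, balanced)  # the crisis error dominates only when weighted
```

With uniform weights the large crisis errors barely move the average loss, which is exactly why a model can look good overall while failing in the regime that matters most.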
Training deep models requires large amounts of data. Although a great deal of structured data has accumulated in the financial field, data are often scarce in the direction we need. And because history runs only one way, the data we obtain show only one realized possibility, which aggravates this scarcity to some extent.
Erroneous data and erroneous data preprocessing
Erroneous data and erroneous preprocessing corrupt the model at the source. This risk appears in:
Some data may not satisfy basic economic constraints. As some studies have pointed out, historical option price data, whether for listed or OTC contracts, may violate no-arbitrage bounds, a problem that is especially severe in emerging-market data.
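Such constraints can be checked mechanically. A minimal sketch, assuming European calls and the standard model-free bounds max(S − K·e^(−rT), 0) ≤ C ≤ S (function and parameter names are my own):

```python
import math

def violates_call_bounds(call, spot, strike, r, t):
    """Model-free no-arbitrage band for a European call price:
    max(S - K*exp(-r*T), 0) <= C <= S.  A quote outside this band
    admits a static arbitrage and should be flagged as bad data."""
    lower = max(spot - strike * math.exp(-r * t), 0.0)
    return not (lower - 1e-12 <= call <= spot + 1e-12)

# spot=100, strike=90, r=2%, T=1y: lower bound is about 11.78
print(violates_call_bounds(5.0, 100, 90, 0.02, 1.0))   # True: below the bound
print(violates_call_bounds(15.0, 100, 90, 0.02, 1.0))  # False: inside the band
```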
Data leakage in time-series data. During dataset splitting and feature engineering, the "cause" may actually be the "effect": the model obtains in advance the information it is supposed to predict. For example, this happens when preprocessing, such as fitting an imputer for missing values, is run before calling train_test_split.
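A minimal sketch of that imputation example, using a hand-rolled mean filler rather than any particular library so the leak is visible (names and numbers are illustrative):

```python
def fit_mean(values):
    """Learn a fill value (the mean of the observed entries)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

def impute(values, fill):
    """Replace missing entries with a previously fitted fill value."""
    return [fill if v is None else v for v in values]

series = [1.0, 2.0, None, 3.0, 100.0, None]  # last two points are "future"
train, test = series[:4], series[4:]

# Leaky: the fill value is fitted on the full series, so it sees the
# extreme future observation (100.0) before the split.
leaky_fill = fit_mean(series)
# Correct: statistics are computed on the training window only.
clean_fill = fit_mean(train)

print(leaky_fill, clean_fill)        # 26.5 vs 2.0
print(impute(test, clean_fill))      # [100.0, 2.0]
```

The same discipline applies to scaling, encoding, and outlier clipping: fit every preprocessing step on the training split only, then apply it to the test split.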
Financial applications usually demand timely and accurate data, but timeliness is not reflected in historical datasets. For high-frequency data, latency must be taken into account; for widely used non-market data, such as online news, the time at which the data actually became available also needs attention.
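One common guard is a point-in-time filter that selects on the availability timestamp rather than the event timestamp. A sketch with hypothetical field names:

```python
def as_of(events, t):
    """Return only events whose availability timestamp is <= t.
    Backtests must filter on when data became usable (publish or
    receipt time), not on the time the event refers to; otherwise
    the model looks ahead."""
    return [e for e in events if e["available_at"] <= t]

news = [
    {"headline": "earnings beat", "event_time": 9, "available_at": 11},
    {"headline": "guidance cut",  "event_time": 10, "available_at": 10},
]
# At t=10, only the second item has actually been published.
print(as_of(news, 10))
```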
Financial data are usually long-tailed and non-stationary, which makes detecting and eliminating erroneous data very difficult.
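Robust statistics help here: a median/MAD screen is far less distorted by fat tails than a mean/std screen, so a genuine crash day does not mask an obviously bad tick. A sketch (the threshold k is an illustrative choice):

```python
import statistics

def mad_outliers(xs, k=5.0):
    """Flag points more than k robust deviations from the median,
    using the median absolute deviation (MAD) as the scale.  The
    1.4826 factor makes the MAD consistent with the standard
    deviation under normality."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    scale = 1.4826 * mad
    return [x for x in xs if abs(x - med) > k * scale]

returns = [0.01, -0.02, 0.015, -0.01, 0.005, -0.015, 9.99]  # 9.99: bad tick
print(mad_outliers(returns))  # only the erroneous point is flagged
```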
To address the problems above, this article proposes two solutions, or mitigation schemes, briefly summarized here: one is synthetic data generation, the other is a market simulator engine.
Synthetic data can help address data bias and, through data augmentation, improve both the quality and the quantity of historical data. The commonly used models are generative adversarial networks (GANs) and variational autoencoders (VAEs). These models are more expressive for high-dimensional data than traditional approaches, can generate data that incorporates expert knowledge and known facts, and provide a more accurate and robust way to handle sample imbalance, missing-value imputation, and outlier processing.
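A GAN or VAE is too large for a short example, but the augmentation idea can be illustrated with a much simpler generator: a block bootstrap that stitches contiguous slices of the historical series into new synthetic paths, preserving short-range dependence that i.i.d. resampling would destroy (this is a crude stand-in for the neural generators named above, not a substitute for them; names are my own):

```python
import random

def block_bootstrap(returns, block, n, seed=0):
    """Build a synthetic return path of length n by concatenating
    randomly chosen contiguous blocks of the historical series."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        start = rng.randrange(len(returns) - block + 1)
        out.extend(returns[start:start + block])
    return out[:n]

history = [0.01, -0.02, 0.005, 0.03, -0.01, 0.002, -0.004, 0.015]
synthetic = block_bootstrap(history, block=3, n=20)
print(len(synthetic), synthetic[:5])
```

Even this simple scheme multiplies the effective sample size; GANs and VAEs go further by generating values never observed in history while matching its distribution.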
In addition, synthetic data generation can be combined with other methods to address data errors. For example, it can be paired with federated learning, improving the quality of any single site's data by drawing on data from different sources.
Of course, generating synthetic data itself requires choosing evaluation metrics, a loss function, and a training algorithm, and these choices carry risks of their own. Current research in this area lacks standardized benchmarks and theoretical guarantees. How to ensure that generated data satisfies basic economic assumptions (such as no-arbitrage) is also a question worth studying.
The second idea for addressing data risk is very direct: if the source, processing, and distribution of historical data are all causes for concern, then dispense with them entirely; build a simulated market and train the model on data from the simulator.
The key to making this work is that the simulated environment must be realistic enough. For example, a model trained and tested on historical data may still perform poorly in the live pipeline; training it for a period in a simulated environment can alleviate this problem. A simulated environment is also crucial for deploying reinforcement learning.
Many quantitative funds are currently looking for such a system: one that can develop and test trading algorithms in different environments and evaluate causal effects in the market. Finally, a good simulator can also be used to generate synthetic data, especially the fine-grained data that are hard to obtain.
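As an illustration only, a toy price engine in this spirit might look like the following geometric random walk (the drift and volatility parameters are made up for the sketch; a production simulator would model order books, agents, and market impact):

```python
import random

def simulate_market(steps, mu=0.0002, sigma=0.01, s0=100.0, seed=42):
    """Minimal price engine: a multiplicative random walk with
    Gaussian per-step returns.  A trading agent can be trained
    against paths like these instead of scarce historical data."""
    rng = random.Random(seed)
    prices = [s0]
    for _ in range(steps):
        prices.append(prices[-1] * (1 + rng.gauss(mu, sigma)))
    return prices

path = simulate_market(250)  # one synthetic "trading year"
print(len(path), round(path[-1], 2))
```

Because the seed and parameters are explicit, the simulator can generate unlimited, perfectly labeled paths, including stress regimes (raise sigma) that history provides too rarely.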
Standardized data preprocessing
Beyond the two methods above, a standardized data preprocessing pipeline is indispensable for eliminating data risk.
First, we should formulate constraints or specifications for data preprocessing and use appropriate checks to verify that the data conform to them. For example, no-arbitrage requirements greatly restrict the range of admissible option prices, so imposing these requirements can catch many other errors.
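For instance, for equally spaced strikes, no-arbitrage requires call prices to be convex in strike (a butterfly spread cannot have negative cost), which makes a simple automated screen. A sketch with made-up quotes:

```python
def convexity_violations(strikes, calls):
    """For equally spaced strikes K1 < K2 < K3, no-arbitrage requires
    C(K1) - 2*C(K2) + C(K3) >= 0 (a long butterfly costs at least
    nothing).  Returns the middle strikes of any failing triple,
    which point at suspect quotes."""
    bad = []
    for i in range(1, len(strikes) - 1):
        fly = calls[i - 1] - 2 * calls[i] + calls[i + 1]
        if fly < -1e-9:
            bad.append(strikes[i])
    return bad

strikes = [90, 95, 100, 105, 110]
calls = [12.0, 8.5, 5.9, 4.5, 3.0]  # the 105 quote breaks convexity
print(convexity_violations(strikes, calls))
```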
Another method is to sample the data at different frequencies and over different periods; by comparing results across these datasets, the stability of the model can be evaluated and calibrated.
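A sketch of the idea, compounding daily returns into weekly ones and comparing annualized volatility estimates across the two frequencies (the return series here is simulated for illustration; the 252/52 annualization factors are the usual trading-calendar conventions):

```python
import math
import random
import statistics

def downsample(returns, k):
    """Compound k consecutive simple returns into one lower-frequency
    return, e.g. 5 daily returns into one weekly return."""
    out = []
    for i in range(0, len(returns) - k + 1, k):
        gross = 1.0
        for r in returns[i:i + k]:
            gross *= 1 + r
        out.append(gross - 1)
    return out

rng = random.Random(0)
daily = [rng.gauss(0.0005, 0.01) for _ in range(260)]  # simulated daily returns
weekly = downsample(daily, 5)

# For a stable estimate, the two frequencies should give roughly
# consistent annualized volatility; a large gap is a warning sign.
vol_daily = statistics.stdev(daily) * math.sqrt(252)
vol_weekly = statistics.stdev(weekly) * math.sqrt(52)
print(round(vol_daily, 3), round(vol_weekly, 3))
```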
Finally, following the example of open-source communities in other fields, we should develop standardized examples, codebases, and resources to improve the ability to identify and handle data errors.