Once the features are preprocessed, you need to find a machine learning algorithm to train on those features and predict the target value of new observations. Unlike feature engineering, model selection offers a wealth of choices and options: clustering models, classification and regression models, neural-network-based models, association-rule-based models, and so on.
Each algorithm is suited to a certain class of problems, and automated model selection can filter the space of models appropriate for a specific task and select the one that produces the best score on some criterion (such as the lowest AIC) or the lowest error (such as RMSE). It is well understood that no single machine learning algorithm performs best on all datasets (the "no free lunch" theorem), and some algorithms require hyperparameter tuning. In practice, when choosing a model we tend to try different variables, different coefficients, and different hyperparameters. In regression problems, there is a method called stepwise regression that automatically selects the predictor variables used in the final model using techniques such as the F-test, t-test, and adjusted R-squared; this method, however, is error-prone. Frameworks that automate model selection include:
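The core idea — fit several candidate models and keep the one with the lowest error — can be sketched in a few lines. This is a minimal illustration using hypothetical toy data and two hand-rolled candidate models (a mean predictor and a least-squares line), compared by RMSE; real model selection tools search far larger spaces and use cross-validation rather than training error.

```python
import math

# Hypothetical toy dataset: y is roughly linear in x with a little noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

# Candidate model 1: always predict the mean of y.
mean_y = sum(ys) / len(ys)
mean_preds = [mean_y] * len(ys)

# Candidate model 2: simple least-squares line y = a + b*x.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
line_preds = [a + b * x for x in xs]

# Model selection: keep whichever candidate has the lowest RMSE.
candidates = {"mean": mean_preds, "line": line_preds}
best = min(candidates, key=lambda name: rmse(candidates[name], ys))
print(best)
```

On this data the linear model wins, since the mean predictor ignores the obvious trend.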
Auto-sklearn is a Python library created by Matthias Feurer, Aaron Klein, Katharina Eggensperger, and others. It addresses two core processes in machine learning: selecting an algorithm from a wide list of classification and regression algorithms, and hyperparameter optimization. The library does not perform feature engineering in the sense of creating dataset features by combining mathematical primitives, the way Featuretools does.
Auto-sklearn is similar to Auto-WEKA and hyperopt-sklearn. Classifiers it can select from include decision trees, Gaussian naive Bayes, gradient boosting, KNN, LDA, SVM, random forests, and linear classifiers (SGD). Among preprocessing steps, it supports kernel PCA, select percentile, select rates, one-hot encoding, imputation, balancing, scaling, feature agglomeration, and so on. Again, these should not be understood as feature engineering steps in the sense of enriching the dataset by combining existing features.
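What auto-sklearn automates is a joint search over (algorithm, hyperparameter) combinations, picking the pair with the best validation score. Here is a deliberately tiny pure-Python sketch of that idea on hypothetical 1-D data, with two hand-written "algorithms" (a threshold rule and 1-nearest-neighbor) standing in for the dozens of scikit-learn estimators auto-sklearn actually searches over.

```python
# Hypothetical 1-D toy data: the true rule is "class 1 when x > 5".
X = [1, 2, 3, 4, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def threshold_clf(t):
    """A 'classifier' with one hyperparameter: the decision threshold t."""
    return lambda x: 1 if x > t else 0

def nearest_neighbor_clf(X, y):
    """1-NN: predict the label of the closest training point."""
    def predict(x):
        i = min(range(len(X)), key=lambda j: abs(X[j] - x))
        return y[i]
    return predict

def accuracy(model, X, y):
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

# Joint search over (algorithm, hyperparameter) pairs -- the space
# auto-sklearn explores is vastly larger, but the principle is the same.
search_space = [("threshold(t=3)", threshold_clf(3)),
                ("threshold(t=5)", threshold_clf(5)),
                ("1-NN", nearest_neighbor_clf(X, y))]
best_name, best_model = max(search_space, key=lambda s: accuracy(s[1], X, y))
print(best_name)
```

A real run would score candidates on held-out data rather than the training set, but the selection mechanism is the same.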
Some algorithms automatically optimize a metric over a series of different variable configurations. This is similar to looking for variable importance. Usually, people can do this well by understanding the context and domain of the variables, for example: "sales increase in summer" or "the most expensive goods are bought by residents of west London". Such variables come naturally to human experts.
However, there is another way to understand the importance of a variable: to measure how important it is statistically. This is done automatically by algorithms such as decision trees (using the so-called Gini index or information gain). Random forests do the same, but unlike a single decision tree, a random forest runs multiple decision trees, creating multiple models with randomness introduced.
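The Gini index mentioned above is straightforward to compute by hand. This sketch shows the Gini impurity of a set of labels and the impurity decrease ("Gini gain") of a candidate split — the quantity a decision tree maximizes when it chooses which variable to split on; the split shown is a hypothetical perfect one.

```python
def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_gain(parent, left, right):
    """Impurity decrease achieved by splitting `parent` into `left`/`right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical split that separates the two classes perfectly.
parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                             # 0.5 for a balanced binary set
print(gini_gain(parent, [0, 0, 0], [1, 1, 1]))  # 0.5: all impurity removed
```

A variable whose splits consistently yield high Gini gain across the trees of a forest is reported as statistically important.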
For time series data, we tend to talk about auto.arima. The auto.arima function in R's forecast package uses AIC as its optimization metric to generate a model automatically. Under the hood, auto.arima implements the Hyndman-Khandakar algorithm, which is explained in detail in the OTexts book Forecasting: Principles and Practice.
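The essence of AIC-based order selection is: fit each candidate model order, compute AIC = n·ln(RSS/n) + 2k (up to a constant, under Gaussian errors), and keep the order with the lowest AIC. This is a pure-Python sketch of that idea on a hypothetical series, comparing only AR(0) (mean) against AR(1) fitted by least squares — auto.arima searches a much richer ARIMA space, but the criterion is the same.

```python
import math

# Hypothetical series with strong lag-1 dependence.
series = [1.0, 1.8, 2.5, 3.1, 3.5, 4.2, 4.6, 5.1, 5.5, 6.0]

def aic(rss, n, k):
    # Gaussian-likelihood AIC up to an additive constant.
    return n * math.log(rss / n) + 2 * k

def ar0_rss(y):
    """Residual sum of squares of the constant-mean model AR(0)."""
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y)

def ar1_rss(y):
    """Fit y_t = a + b*y_{t-1} by least squares; return the RSS."""
    xs, ys = y[:-1], y[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (t - my) for x, t in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return sum((t - (a + b * x)) ** 2 for x, t in zip(xs, ys))

n = len(series) - 1  # compare both models on the same n observations
scores = {"AR(0)": aic(ar0_rss(series[1:]), n, 1),
          "AR(1)": aic(ar1_rss(series), n, 2)}
best = min(scores, key=scores.get)
print(best)
```

Note how the extra parameter of AR(1) costs +2 in AIC, so it only wins if the RSS reduction is worth it — which it clearly is for this trending series.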
As mentioned earlier, H2O Driverless AI can be used for automated feature engineering. H2O can also automatically train multiple algorithms at the same time through its H2O AutoML package, which trains on your data using a variety of algorithms with different parameters, such as GLM, XGBoost, random forests, deep learning, ensemble models, and so on.
DataRobot can also be used to automatically train multiple algorithms at the same time. It does so using models tuned by DataRobot's data scientists, so dozens of models can be run with preset hyperparameters. It ultimately chooses the algorithm with the highest accuracy, while still allowing data scientists to intervene manually and adjust models to improve accuracy.
Microsoft announced its automated machine learning toolkit in September. The product itself is called Automated ML and is part of Azure Machine Learning. Microsoft's Automated ML uses collaborative filtering and Bayesian optimization to search the space of machine learning pipelines, by which Microsoft means combinations of data preprocessing steps, learning algorithms, and hyperparameter configurations. Among the model selection techniques discussed above, the part of the ML process most typically automated is hyperparameter tuning. Microsoft researchers found that tuning hyperparameters alone is sometimes only comparable to random search, so ideally the whole end-to-end process should be automated.
Google has also innovated in this field with Google Cloud AutoML. With Cloud AutoML, data scientists can train models for computer vision, natural language processing, and translation by supplying only labeled data; the algorithms are constructed and trained automatically.
TPOT is a Python library for automated machine learning that uses genetic programming to optimize machine learning pipelines. An ML pipeline includes data cleaning, feature selection, feature preprocessing, feature construction, model selection, and parameter optimization. The TPOT library builds on the machine learning components available in scikit-learn.
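To make the genetic-programming idea concrete, here is a toy sketch in pure Python: a "pipeline" is reduced to three hypothetical knobs (scaling on/off, number of selected features, model depth), a hypothetical fitness function stands in for cross-validated accuracy, and a tiny evolutionary loop keeps the fittest half of the population and refills it with mutants. TPOT evolves real scikit-learn pipeline trees this way, with crossover as well as mutation.

```python
import random

random.seed(0)

# A "pipeline" here is just (scaling_on, n_features, depth) -- a stand-in
# for the preprocessing/selection/model choices TPOT actually evolves.
def fitness(pipe):
    scaling_on, n_features, depth = pipe
    # Hypothetical fitness surface: best at scaling on, 4 features, depth 3.
    return (1 if scaling_on else 0) - abs(n_features - 4) * 0.1 - abs(depth - 3) * 0.1

def random_pipe():
    return (random.choice([True, False]), random.randint(1, 8), random.randint(1, 8))

def mutate(pipe):
    scaling_on, n_features, depth = pipe
    which = random.randrange(3)  # change exactly one gene
    if which == 0:
        return (not scaling_on, n_features, depth)
    if which == 1:
        return (scaling_on, random.randint(1, 8), depth)
    return (scaling_on, n_features, random.randint(1, 8))

# Tiny genetic search: keep the best half, refill with mutants of survivors.
population = [random_pipe() for _ in range(10)]
for generation in range(20):
    population.sort(key=fitness, reverse=True)
    survivors = population[:5]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(5)]

best = max(population, key=fitness)
print(best, fitness(best))
```

Because survivors are carried over unchanged (elitism), the best fitness never decreases across generations.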
TPOT Machine Learning Pipeline
Amazon SageMaker provides modeling, training, and deployment capabilities. It can automatically tune algorithms; to do this, it uses a technique called Bayesian optimization.
HyperDrive is a Microsoft product built for comprehensive hyperparameter exploration. The hyperparameter search space can be covered by random search, grid search, or Bayesian optimization. It implements a list of schedulers you can choose from to terminate the exploration phase early, jointly optimizing quality and cost.
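An early-termination scheduler of the kind described above can be sketched with a median stopping rule: at each checkpoint, any run scoring below the median of its peers is cancelled, so compute is not wasted on clearly losing hyperparameter configurations. The learning curves below are hypothetical, and this is a simplification of what products like HyperDrive implement.

```python
# Hypothetical learning curves (validation score per epoch) for five runs,
# each corresponding to a different hyperparameter configuration.
curves = {
    "run-a": [0.50, 0.55, 0.58, 0.60],
    "run-b": [0.60, 0.70, 0.78, 0.82],
    "run-c": [0.40, 0.42, 0.43, 0.44],
    "run-d": [0.65, 0.72, 0.80, 0.85],
    "run-e": [0.55, 0.57, 0.59, 0.60],
}

def median(values):
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

active = set(curves)
for epoch in range(4):
    scores = {run: curves[run][epoch] for run in active}
    cutoff = median(list(scores.values()))
    # Terminate under-performing runs early to save cost.
    active = {run for run in active if scores[run] >= cutoff}

print(sorted(active))
```

Only the strongest configuration survives all checkpoints; the weak runs were stopped after one or two epochs instead of training to completion.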
Neural network architecture selection
In the world of machine learning, one of the most tedious tasks is designing and building neural network architectures. People typically spend hours or days iterating over different architectures with different hyperparameters to optimize the objective function of the task at hand. This is time-consuming and error-prone. Google introduced the idea of neural architecture search, using evolutionary algorithms and reinforcement learning to design and find optimal neural network structures. In essence, this means learning layers and then stacking them to create a deep neural network architecture. Research in this area has attracted extensive attention in recent years, and many papers have been published. Notable research papers include:
NASNet – Learning Transferable Architectures for Scalable Image Recognition
The NASNet algorithm
AmoebaNet – Regularized Evolution for Image Classifier Architecture Search
ENAS – Efficient Neural Architecture Search via Parameter Sharing
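The evolutionary flavor of architecture search (the AmoebaNet line of work) can be caricatured in a few lines: an architecture is a list of layer widths, mutation adds, removes, or resizes a layer, and a fitness function ranks candidates. The fitness function below is a hypothetical proxy; in real NAS it is validation accuracy from actually training each candidate, which is what makes the problem so expensive.

```python
import random

random.seed(1)

# An "architecture" is a list of layer widths, e.g. [32, 64] = two layers.
def fitness(arch):
    # Hypothetical proxy score: reward depth 3 and widths near 64 --
    # a cheap stand-in for validation accuracy after training.
    return -abs(len(arch) - 3) - sum(abs(w - 64) for w in arch) / 64.0

def mutate(arch):
    arch = list(arch)
    op = random.choice(["add", "remove", "widen"])
    if op == "add" or not arch:
        arch.insert(random.randint(0, len(arch)), random.choice([16, 32, 64, 128]))
    elif op == "remove" and len(arch) > 1:
        arch.pop(random.randrange(len(arch)))
    else:
        arch[random.randrange(len(arch))] = random.choice([16, 32, 64, 128])
    return arch

# Regularized-evolution-style loop: mutate the current best, drop the worst.
population = [[32], [64, 64], [128, 16, 32]]
for _ in range(50):
    parent = max(population, key=fitness)
    population.append(mutate(parent))
    population.remove(min(population, key=fitness))

best = max(population, key=fitness)
print(best)
```

Even this toy loop steadily climbs toward a three-layer network of width-64 layers, the optimum of the proxy fitness.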
Much of the machine learning community's attention focuses on developing learning algorithms rather than on the most important part of the end-to-end machine learning process: deploying ML models to production. There are many inherent challenges in deploying machine learning models into production environments. Several companies and open source projects are trying to automate this process and minimize the pain for data scientists, who do not necessarily have DevOps skills. The following is a list of frameworks and companies working in this field:
Seldon – provides ways to wrap models built in R, Python, Java, and NodeJS and deploy them to a Kubernetes cluster. It provides integrations with Kubeflow, IBM Fabric for Deep Learning, NVIDIA TensorRT, the DL Inference Server, TensorFlow Serving, and more.
Redis-ML – a module in Redis (an in-memory distributed key-value database) that allows models to be deployed to production. It currently supports only the following algorithms: random forests (classification and regression), linear regression, and logistic regression.
The model server for Apache MXNet serves deep learning models exported from MXNet or the Open Neural Network Exchange (ONNX) format.
Microsoft Machine Learning Services let you deploy a model as a web service on a scalable Kubernetes cluster and invoke the model as a web service.
You can use Amazon SageMaker to deploy a model to an HTTPS endpoint that applications use to run inference/prediction on new data observations. Google Cloud ML also supports model deployment and inference through HTTP calls to a web service hosting the model; by default, it limits model size to 250 MB.
H2O supports model deployment through Java MOJOs (Model Object, Optimized). MOJOs support AutoML, deep learning, DRF, GBM, GLM, GLRM, K-means, stacked ensembles, SVM, Word2Vec, and XGBoost models, and are highly integrated with Java-type environments. For non-Java models, such as those built in R or Python, you can save the model as a serialized object and load it at inference time.
TensorFlow Serving is used to deploy TensorFlow models to production environments. In a few lines of code, you can serve a TensorFlow model as an API for prediction.
If your models have been trained and exported to PMML format, Openscoring can help you serve those PMML models as a REST inference API. GraphPipe was created to decouple ML model deployment from framework-specific model implementations such as TensorFlow, Caffe2, and ONNX.
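Despite their differences, the serving tools above share one core pattern: a loaded model behind a handler that turns a JSON request into a JSON prediction response. Here is a minimal framework-free sketch of that pattern, using a hypothetical linear model loaded at service start-up; wiring the handler into HTTP (with http.server or any web framework) is then just plumbing.

```python
import json

# Hypothetical trained "model": a linear scorer loaded at service start-up.
WEIGHTS = {"bias": 0.5, "x1": 2.0, "x2": -1.0}

def predict(features):
    """Score one observation with the loaded linear model."""
    return WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in features.items())

def handle_request(body: str) -> str:
    """Turn a JSON request body into a JSON prediction response --
    the core of what every model-serving endpoint does."""
    features = json.loads(body)["features"]
    return json.dumps({"prediction": predict(features)})

# One request/response round-trip, as an HTTP POST handler would run it.
print(handle_request('{"features": {"x1": 1.0, "x2": 2.0}}'))
```

Keeping the handler a pure function of the request body also makes the endpoint trivial to unit-test before it is containerized and deployed.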