sklearn.linear_model.TheilSenRegressor

From GM-RKB

A sklearn.linear_model.TheilSenRegressor is a Theil-Sen Regression System within the sklearn.linear_model module.

  • Context:
    • Usage:
1) Import the TheilSenRegressor model from scikit-learn: from sklearn.linear_model import TheilSenRegressor
2) Create the design matrix X and the response vector y
3) Create a TheilSenRegressor object: model = TheilSenRegressor([fit_intercept=True, copy_X=True, max_subpopulation=10000.0, n_subsamples=None, ...])
4) Choose method(s):
  • fit(X, y), fits linear model.
  • get_params([deep]), gets parameters for this estimator.
  • predict(X), predicts using the linear model.
  • score(X, y[, sample_weight]), returns the coefficient of determination R^2 of the prediction.
  • set_params(**params), sets the parameters of this estimator.
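
The usage steps above can be put together as a minimal sketch. The synthetic data, the coefficient values, and the variable names are assumptions made for illustration only; they are not taken from the scikit-learn documentation.

```python
import numpy as np
from sklearn.linear_model import TheilSenRegressor

# Synthetic data (illustrative assumption): y = 3*x0 - 2*x1 + 1 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=50)

# Step 3: create the estimator (random_state fixes the random subset choice)
model = TheilSenRegressor(fit_intercept=True, max_subpopulation=10000, random_state=0)

# Step 4: call the methods listed above
model.fit(X, y)                           # fit the linear model
y_pred = model.predict(X)                 # predictions, one per sample
r2 = model.score(X, y)                    # coefficient of determination R^2
params = model.get_params()               # dict of constructor parameters
model.set_params(max_subpopulation=5000)  # change a parameter in place

print(model.coef_, model.intercept_, r2)
```

After fitting, the estimated slopes are available as model.coef_ and the intercept as model.intercept_. Note that set_params only changes the stored parameter; the model has to be refitted for the change to take effect.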


References

2017A

Theil-Sen Estimator: robust multivariate regression model.
The algorithm calculates least squares solutions on subsets of size n_subsamples of the samples in X. Any value of n_subsamples between the number of features and the number of samples leads to an estimator that is a compromise between robustness and efficiency. Since the number of least squares solutions is “n_samples choose n_subsamples”, it can be extremely large and can therefore be limited with max_subpopulation. If this limit is reached, the subsets are chosen randomly. In a final step, the spatial median (or L1 median) of all least squares solutions is calculated.
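
The procedure described above can be sketched in plain NumPy. This is an illustrative sketch, not scikit-learn's implementation: the function names spatial_median and theil_sen_sketch are hypothetical, the spatial median is approximated with a basic Weiszfeld iteration (an assumption; the library may use a different variant), and every subset is enumerated exhaustively, so there is no max_subpopulation cap.

```python
from itertools import combinations

import numpy as np


def spatial_median(points, n_iter=200, tol=1e-9):
    """Approximate the spatial (L1) median with a basic Weiszfeld iteration."""
    median = points.mean(axis=0)
    for _ in range(n_iter):
        dist = np.linalg.norm(points - median, axis=1)
        dist = np.maximum(dist, 1e-12)  # guard against division by zero
        weights = 1.0 / dist
        new_median = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_median - median) < tol:
            return new_median
        median = new_median
    return median


def theil_sen_sketch(X, y, n_subsamples):
    """Least squares on every subset, then the spatial median of the solutions."""
    n_samples = X.shape[0]
    A = np.column_stack([np.ones(n_samples), X])  # first column models the intercept
    solutions = []
    for idx in combinations(range(n_samples), n_subsamples):
        rows = list(idx)
        coef, *_ = np.linalg.lstsq(A[rows], y[rows], rcond=None)
        solutions.append(coef)
    return spatial_median(np.array(solutions))


# Ten points on the line y = 1 + 2x, with one grossly corrupted response
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 1.0 + 2.0 * X[:, 0]
y[9] = 100.0
result = theil_sen_sketch(X, y, n_subsamples=2)
print(result)  # [intercept, slope], close to 1 and 2 despite the outlier
```

In this toy example 36 of the 45 pairs contain only clean points and all yield the same solution, so the spatial median stays at that solution even though 9 pairs involve the outlier.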

2017B

(...)
TheilSenRegressor is comparable to Ordinary Least Squares (OLS) in terms of asymptotic efficiency and as an unbiased estimator. In contrast to OLS, Theil-Sen is a non-parametric method, which means it makes no assumption about the underlying distribution of the data. Since Theil-Sen is a median-based estimator, it is more robust against corrupted data (outliers). In the univariate setting, i.e. simple linear regression, Theil-Sen has a breakdown point of about 29.3% (1 - 1/√2), which means that it can tolerate arbitrarily corrupted data making up to 29.3% of the samples.
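
The robustness claim can be illustrated by fitting both estimators to the same contaminated data. The sketch below is an assumption-laden toy setup, not a benchmark: the true line y = 2x + 1, the noise level, and the choice to corrupt the 10 samples with the largest x are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, TheilSenRegressor

# Toy data (illustrative assumption): true line y = 2x + 1 with mild noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Corrupt 10% of the samples: the 10 largest x values get a wildly wrong response
y[-10:] = -50.0
X = x.reshape(-1, 1)

ols = LinearRegression().fit(X, y)
theil_sen = TheilSenRegressor(random_state=0).fit(X, y)

true_slope = 2.0
ols_error = abs(ols.coef_[0] - true_slope)
theil_sen_error = abs(theil_sen.coef_[0] - true_slope)
print(f"OLS slope error:       {ols_error:.3f}")
print(f"Theil-Sen slope error: {theil_sen_error:.3f}")
```

With 10% contamination, well below the breakdown point quoted above, the Theil-Sen slope should remain much closer to the true slope than the OLS slope, which is pulled toward the corrupted high-leverage points.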

The implementation of TheilSenRegressor in scikit-learn follows a generalization to a multivariate linear regression model [8] using the spatial median which is a generalization of the median to multiple dimensions [9]. In terms of time and space complexity, Theil-Sen scales according to

[math]\displaystyle{ \binom{n_{samples}}{n_{subsamples}} }[/math]
which makes it infeasible to be applied exhaustively to problems with a large number of samples and features. Therefore, the magnitude of a subpopulation can be chosen to limit the time and space complexity by considering only a random subset of all possible combinations.
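
To get a feel for how quickly the number of subsets grows, the binomial coefficient can be evaluated directly. The helper name capped_subset_count is hypothetical, and the cap is modelled as a simple minimum, which is an assumption about how a limit such as max_subpopulation would apply rather than a description of the library's internals.

```python
import math


def capped_subset_count(n_samples, n_subsamples, max_subpopulation=10_000):
    """Number of subsets to evaluate: all combinations, or the cap if smaller."""
    all_combinations = math.comb(n_samples, n_subsamples)
    return min(all_combinations, max_subpopulation)


for n_samples, n_subsamples in [(100, 2), (1000, 3), (10_000, 5)]:
    total = math.comb(n_samples, n_subsamples)
    used = capped_subset_count(n_samples, n_subsamples)
    print(f"n_samples={n_samples}, n_subsamples={n_subsamples}: "
          f"{total} possible subsets, {used} evaluated")
```

For 100 samples and subsets of size 2 there are 4950 subsets, which is below the cap, so all of them are evaluated; for 1000 samples and subsets of size 3 there are already 166167000, and only a random subset of that population is used.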