AutoTrain documentation

Tabular Classification / Regression

You are viewing the main version, which requires installation from source. If you'd like a regular pip install, check out the latest stable version (v0.8.24).

With AutoTrain, you can train a classification or regression model on tabular data. Select a model from the list below and upload your dataset; hyperparameter tuning is done automatically.

Models

The following models are available for tabular classification / regression.

  • xgboost
  • random_forest
  • ridge
  • logistic_regression
  • svm
  • extra_trees
  • gradient_boosting
  • adaboost
  • decision_tree
  • knn

Data Format

id,category1,category2,feature1,target
1,A,X,0.3373961604172684,1
2,B,Z,0.6481718720511972,0
3,A,Y,0.36824153984054797,1
4,B,Z,0.9571551589530464,1
5,B,Z,0.14035078041264515,1
6,C,X,0.8700872583584364,1
7,A,Y,0.4736080452737105,0
8,C,Y,0.8009107519796442,1
9,A,Y,0.5204774795512048,0
10,A,Y,0.6788795301189603,0
.
.
.
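For local experiments, a toy file with the same column layout as the sample above can be generated with the Python standard library. This is only a sketch: the file name `train.csv`, the helper `write_sample_csv`, and the random values are illustrative assumptions, not part of AutoTrain.

```python
import csv
import random


def write_sample_csv(path, n_rows=10, seed=42):
    """Write a toy dataset with the same columns as the sample above."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "category1", "category2", "feature1", "target"])
        for i in range(1, n_rows + 1):
            writer.writerow([
                i,                    # unique row id
                rng.choice("ABC"),    # categorical feature
                rng.choice("XYZ"),    # categorical feature
                rng.random(),         # numerical feature in [0, 1)
                rng.randint(0, 1),    # binary target
            ])


# Hypothetical output path, chosen for illustration only.
write_sample_csv("train.csv")
```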

Columns

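Your CSV dataset must contain at least two columns: an ID column (id by default) and a target column (target by default). The remaining columns, such as category1, category2, and feature1 in the sample above, are the features. The names of the ID and target columns can be changed with the id_column and target_columns parameters.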
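Before uploading, it can help to confirm that the required columns are present in the header. The following is a minimal sketch using only the standard library; `check_columns` is a hypothetical helper written for this example, not an AutoTrain function, and its defaults simply mirror the default column names.

```python
import csv
import io


def check_columns(csv_file, id_column="id", target_columns=("target",)):
    """Return the required column names that are missing from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])
    required = [id_column, *target_columns]
    return [name for name in required if name not in header]


# In-memory sample using the same header as the data format shown earlier.
sample = io.StringIO("id,category1,category2,feature1,target\n1,A,X,0.33,1\n")
missing = check_columns(sample)
print(missing)  # an empty list means both required columns are present
```

With a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("train.csv", newline="")`.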

Parameters

class autotrain.trainers.tabular.params.TabularParams

(
    data_path: str = None,
    model: str = 'xgboost',
    username: Optional[str] = None,
    seed: int = 42,
    train_split: str = 'train',
    valid_split: Optional[str] = None,
    project_name: str = 'project-name',
    token: Optional[str] = None,
    push_to_hub: bool = False,
    id_column: str = 'id',
    target_columns: Union[List[str], str] = ['target'],
    categorical_columns: Optional[List[str]] = None,
    numerical_columns: Optional[List[str]] = None,
    task: str = 'classification',
    num_trials: int = 10,
    time_limit: int = 600,
    categorical_imputer: Optional[str] = None,
    numerical_imputer: Optional[str] = None,
    numeric_scaler: Optional[str] = None
)

Parameters

  • data_path (str) — Path to the dataset.
  • model (str) — Name of the model to use. Default is “xgboost”.
  • username (Optional[str]) — Hugging Face Username.
  • seed (int) — Random seed for reproducibility. Default is 42.
  • train_split (str) — Name of the training data split. Default is “train”.
  • valid_split (Optional[str]) — Name of the validation data split.
  • project_name (str) — Name of the output directory. Default is “project-name”.
  • token (Optional[str]) — Hub Token for authentication.
  • push_to_hub (bool) — Whether to push the model to the hub. Default is False.
  • id_column (str) — Name of the ID column. Default is “id”.
  • target_columns (Union[List[str], str]) — Target column(s) in the dataset. Default is [“target”].
  • categorical_columns (Optional[List[str]]) — List of categorical columns.
  • numerical_columns (Optional[List[str]]) — List of numerical columns.
  • task (str) — Type of task (e.g., “classification”). Default is “classification”.
  • num_trials (int) — Number of trials for hyperparameter optimization. Default is 10.
  • time_limit (int) — Time limit for training in seconds. Default is 600.
  • categorical_imputer (Optional[str]) — Imputer strategy for categorical columns.
  • numerical_imputer (Optional[str]) — Imputer strategy for numerical columns.
  • numeric_scaler (Optional[str]) — Scaler strategy for numerical columns.

TabularParams is a configuration class for tabular data training parameters.
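As a rough illustration, the parameters listed above can be collected into keyword arguments and passed to the class. This sketch assumes TabularParams accepts them as keyword arguments, as the signature suggests. The values for data_path and project_name are placeholders, and the column names come from the sample dataset. `build_params` is a hypothetical wrapper that imports AutoTrain lazily, so the dictionary can be inspected without the package installed.

```python
# Keyword arguments named after the TabularParams signature documented above.
# data_path and project_name are placeholder values for illustration only.
params_kwargs = {
    "data_path": "data/",                   # hypothetical dataset location
    "model": "xgboost",
    "seed": 42,
    "train_split": "train",
    "valid_split": None,
    "project_name": "my-tabular-project",   # hypothetical output directory
    "push_to_hub": False,
    "id_column": "id",
    "target_columns": ["target"],
    "categorical_columns": ["category1", "category2"],
    "numerical_columns": ["feature1"],
    "task": "classification",
    "num_trials": 10,
    "time_limit": 600,
}


def build_params(kwargs):
    """Create a TabularParams instance; requires AutoTrain to be installed."""
    from autotrain.trainers.tabular.params import TabularParams

    return TabularParams(**kwargs)
```

Calling `build_params(params_kwargs)` should then return a configuration object, provided the installed AutoTrain version matches the signature shown here.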
