
Assumptions, and the Pros & Cons of Data Models

  • K-Nearest Neighbour (KNN)
  • Logistic Regression
  • Linear Regression
  • Support Vector Machine
  • Decision Trees
  • Naive Bayes
  • Random Forest
  • XGBoost

KNN (K-Nearest Neighbour)

Assumptions:

  • Every training data point is represented as a feature vector, and every training data point must have an associated class label.
  • If the dataset contains only two classes, the value of K should be an odd number to avoid tied votes.
  • Being non-parametric, the algorithm does not require strict distributional assumptions about the data.

Pros:

  • No explicit training step is required: training data can be added at runtime and predictions made immediately, which keeps the procedure fast.
  • Because there is no training step, adding new data points is easy.

Cons:

  • Sensitive to outliers in the data.
  • Cannot work with missing or null values.
  • Requires feature scaling and normalization in addition to the distance calculations.
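The points above can be sketched in a few lines of scikit-learn, using assumed toy data: scaling comes before the distance-based classifier, K is odd for a two-class problem, and "fitting" amounts to storing the training points.

```python
# Minimal KNN sketch (toy data assumed for illustration).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two-class toy data; the second feature has a much larger scale,
# which is why scaling is needed before computing distances.
X = np.array([[1.0, 200.0], [1.2, 210.0], [8.0, 900.0], [8.5, 950.0]])
y = np.array([0, 0, 1, 1])

# K = 3 is odd, so a two-class vote cannot tie.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)  # no real training step: the scaled points are simply stored
print(model.predict([[1.1, 205.0]]))  # query close to the class-0 cluster
```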

Logistic Regression

Assumptions:

  • Little or no multicollinearity among the independent variables.
  • Observations should be independent of each other.

Pros:

  • Performs well on large datasets.
  • Requires fewer assumptions about the class distribution.
  • Classifying an unknown data point requires little computation.
  • Highly efficient when the features are linearly separable.

Cons:

  • Its probabilistic approach can cause overfitting in high-dimensional feature spaces.
  • Weak at capturing complex, non-linear relationships.
  • Requires a large amount of data from every category for training.
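A minimal sketch of the linearly separable case, on assumed toy data: once fitted, classifying a new point reduces to one dot product plus a sigmoid, which is why prediction is cheap.

```python
# Minimal logistic regression sketch (toy data assumed for illustration).
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature, two well-separated (linearly separable) classes.
X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Prediction is just w·x + b pushed through a sigmoid.
print(clf.predict([[1.5]]))            # point in the class-0 region
print(clf.predict_proba([[9.5]])[0, 1])  # probability of class 1
```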

Linear Regression

Assumptions:

  • As with logistic regression, little or no multicollinearity among the independent variables.
  • Homoscedasticity: the variance of the error terms (residuals) should be the same for any value of the target variable.

Pros:

  • Regularization techniques can be applied when the model overfits.

Cons:

  • Model performance weakens when there are outliers in the data.
  • Independence of the observations is difficult to guarantee.
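The regularization remedy mentioned above can be sketched as follows, on assumed synthetic data: a Ridge penalty shrinks the coefficients of an ordinary least-squares fit, which is the standard way to rein in an overfitted linear model.

```python
# Minimal sketch of plain vs. regularized linear regression
# (synthetic data assumed for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=50)  # true slope = 3.0

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha shrinks coefficients toward zero

# The Ridge coefficient is pulled slightly below the OLS coefficient.
print(ols.coef_[0], ridge.coef_[0])
```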

Support Vector Machine

Pros:

  • Memory efficient, since the model keeps only the support vectors.
  • Works well with high-dimensional data.

Cons:

  • Low interpretability.
  • Low efficiency when the dataset is noisy.
  • Becomes inefficient as the sample size grows large.
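The memory-efficiency point can be made concrete with a small sketch on assumed toy data: after fitting, the model retains only the support vectors, typically a subset of the training set.

```python
# Minimal SVM sketch (toy data assumed for illustration).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

# Only the points near the decision boundary are kept as support vectors.
print(len(clf.support_vectors_), "of", len(X), "training points retained")
print(clf.predict([[0.5, 0.5]]))
```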

Decision Trees

Assumptions:

  • The data is split recursively based on attribute values.

Pros:

  • No data normalization or scaling is required.
  • Highly interpretable: the model can be read as a set of if-else conditions.
  • A small number of missing values does not affect the results.

Cons:

  • Small changes in the data can produce considerable differences in the tree structure.
  • Less effective for regression tasks.
  • The cost of training is higher.
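The interpretability point is easy to see in code: a fitted tree can be printed as plain if-else rules. A minimal sketch on assumed toy data:

```python
# Minimal decision tree sketch (toy data assumed for illustration).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[1.0], [2.0], [7.0], [8.0]])  # unscaled on purpose: trees don't care
y = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# export_text renders the learned splits as readable if-else conditions.
print(export_text(tree, feature_names=["x"]))
```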

Naive Bayes

Assumptions:

  • Features are conditionally independent of each other given the class.

Pros:

  • Works well with sequential and high-dimensional data such as text and images.
  • Easy to implement, since only probability calculations are required.

Cons:

  • Performance degrades when the conditional-independence assumption does not hold.
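The text-data strength above can be sketched with a tiny assumed corpus: bag-of-words counts are high-dimensional, and fitting Naive Bayes amounts to counting word frequencies per class.

```python
# Minimal Naive Bayes text-classification sketch
# (toy corpus and labels assumed for illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["free prize money", "win money now", "meeting at noon", "lunch at noon"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# CountVectorizer yields a high-dimensional sparse word-count matrix;
# MultinomialNB fits per-class word probabilities from it.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["free money now"]))
```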

Random Forest

Pros:

  • Handles outliers easily.
  • Performs well on non-linear data.
  • Generally does not overfit the data.

Cons:

  • Becomes biased towards the majority class on imbalanced data.
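A short sketch of the non-linear point, on assumed toy data: XOR-labelled points are not linearly separable, yet an ensemble of trees fits them without trouble.

```python
# Minimal random forest sketch on non-linear (XOR) data
# (toy data assumed for illustration).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# XOR pattern: no single straight line separates the classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10, dtype=float)
y = np.array([0, 1, 1, 0] * 10)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[0.0, 1.0], [1.0, 1.0]]))
```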

XGBoost

Pros:

  • Fast and easy to run.
  • No extra preprocessing such as scaling or normalization is required.
  • Handles missing values natively.

Cons:

  • Hyperparameter tuning is computationally expensive.

Final words

In this article, we looked at the assumptions, pros, and cons to keep in mind when modelling with some of the most popular machine learning models. In real-world problems, it is essential to choose the most suitable model for the task and its requirements, and the assumptions, pros, and cons above can guide that choice.

About DSW

Data Science Wizards (DSW) is an Artificial Intelligence and Data Science start-up that offers platforms, solutions, and consulting services to help enterprises use data as a strategy and make data-driven decisions through AI and data analytics.