In every sector of life, before applying anything big or small, we may need to consider its assumptions and know its pros and cons. Similarly, in data science and data modelling we have a variety of options that can help resolve data-related problems and support data-driven decisions. The main problem is choosing among those options: where a well-trained model can give fruitful results, a badly fitted model can spoil the whole analysis. In this article, we look at the key assumptions, pros, and cons of data models that are widely used in real-life scenarios. We will cover the following models:
- KNN (K-Nearest Neighbour)
- Logistic Regression
- Linear Regression
- Support Vector Machine
- Decision Trees
- Naive Bayes
- Random Forest
- XGBoost
KNN (K-Nearest Neighbour)
Assumptions:
- Distance metrics such as Manhattan and Euclidean distance can be used to measure how far apart data points are in feature space.
- Every training data point consists of a feature vector, and a class label is associated with each of them.
- If there are only two classes in the dataset, the value of K should be an odd number so that ties are avoided.
Pros:
- It is a white-box algorithm, meaning the mechanism is easy to implement and interpret.
- Being non-parametric, the algorithm does not require strict assumptions about the data distribution.
- Training data can be supplied at runtime and predictions made immediately, so no dedicated training program or step is required; this makes the procedure faster to set up than other algorithms.
- Since no training step is required, adding new data points is easy.
Cons:
- With large and sparse datasets, the algorithm becomes inefficient and slow because of the cost of computing distances between data points.
- Sensitive to outliers in the data.
- It cannot work when the data contains missing or null values.
- Preprocessing such as feature scaling and normalization is required in addition to the distance calculations.
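To make this concrete, below is a minimal sketch using scikit-learn's KNeighborsClassifier on the Iris dataset. The scaling step, the choice of k = 5, and the Euclidean metric are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale first, because KNN's distance calculation is sensitive to feature ranges.
# k=5 and the Euclidean metric are illustrative defaults, not tuned values.
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, metric="euclidean"))
knn.fit(X_train, y_train)  # "training" just stores the data (lazy learning)
print("Test accuracy:", knn.score(X_test, y_test))
```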
Logistic Regression
Assumptions:
- There should be little or no multicollinearity among the independent variables.
- The observations should be independent of each other.
- The algorithm performs much better with large datasets.
Pros:
- It is a white-box algorithm that needs little computational power.
- It makes few assumptions about the class distribution.
- Little calculation is required to classify unknown data points.
- Highly efficient when the classes are linearly separable.
Cons:
- The algorithm uses a linear decision surface to classify data, so it struggles with non-linear problems.
- Its probabilistic approach makes it prone to overfitting in high-dimensional feature spaces.
- Weak at capturing complex relationships.
- A large amount of data from every class is required for training.
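As an illustration, here is a minimal sketch using scikit-learn's LogisticRegression on the built-in breast cancer dataset; the scaling step and the max_iter value are illustrative choices, not tuned settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_iter raised so the solver converges on this dataset; an illustrative choice.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```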
Linear Regression
Assumptions:
- There should be a linear relationship between the independent and dependent variables.
- Similar to logistic regression, there should be little or no multicollinearity among the independent variables.
- The variance of the error terms (residuals) should be the same for any value of the target variable (homoscedasticity).
Pros:
- Highly efficient when the independent and dependent variables are linearly related.
- Regularization techniques (such as ridge or lasso) can be applied when the model overfits.
Cons:
- It only works well when the relationship in the data is linear.
- Model performance weakens when there are outliers in the data.
- Independence of the observations is difficult to guarantee in practice.
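The sketch below fits scikit-learn's LinearRegression to synthetic data with a known linear relationship (the data is fabricated purely for illustration) and includes a rough check that the residual variance stays constant, as the homoscedasticity assumption requires.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# A quick homoscedasticity sanity check: residuals should have similar
# spread across the range of X (compare variance in the two halves).
residuals = y - model.predict(X)
low, high = X[:, 0] < 5, X[:, 0] >= 5
print("residual variance (low X): ", residuals[low].var())
print("residual variance (high X):", residuals[high].var())
```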
Support Vector Machine
Assumption:
- The data are assumed to be independent and identically distributed (i.i.d.).
Pros:
- Highly efficient with high-dimensional data even when the number of samples is lower than the number of dimensions.
- Memory efficient, since only the support vectors are needed to define the decision boundary.
Cons:
- The heavy computation involved makes the algorithm slow.
- Low interpretability.
- Low efficiency when the dataset is noisy.
- Good with high-dimensional data, but a large sample size makes it inefficient.
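Here is a minimal sketch using scikit-learn's SVC on the digits dataset (64 features), chosen as an illustrative, moderately high-dimensional example; the RBF kernel and the default C and gamma are assumptions that would normally be tuned.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# digits: 64 features, a reasonably high-dimensional example for an SVM.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An RBF kernel with default C/gamma; these would normally be tuned.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))

# Only the support vectors are kept, which is where the memory efficiency comes from.
print("Support vectors stored:", svm.named_steps["svc"].n_support_.sum())
```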
Decision Trees
Assumptions:
- At the start of training, the whole dataset is considered as the root of the tree.
- Records are then distributed recursively on the basis of attribute values.
Pros:
- Less data preparation is required.
- Calculations like data normalization and scaling are not required.
- Highly interpretable, since a tree can be explained as a set of if-else conditions.
- A small number of missing values doesn’t affect the results.
Cons:
- The heavy computation involved means model training takes a lot of time.
- Small changes in the data can produce considerable differences in the tree structure.
- Less effective for regression tasks.
- The cost of training is relatively high.
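The sketch below trains scikit-learn's DecisionTreeClassifier on the Iris dataset and prints the learned tree as if-else rules, illustrating the interpretability point above; max_depth=3 is an illustrative cap chosen only to keep the printout readable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_depth=3 keeps the tree small and readable; an illustrative choice.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)  # no scaling or normalization needed
print("Test accuracy:", tree.score(X_test, y_test))

# The whole model prints as nested if-else rules, hence the interpretability.
print(export_text(tree, feature_names=load_iris().feature_names))
```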
Naive Bayes
Assumptions:
- The features are assumed to be conditionally independent given the class; this is the only assumption that is strictly required.
Pros:
- A high-performing algorithm when the conditional independence assumption is satisfied.
- Works well with sequential and high-dimensional data like text and image data.
- Only probability calculations are required, which makes it easy to implement.
Cons:
- Multiplying many small probabilities can cause numerical underflow, so implementations typically work with log-probabilities instead.
- Performance becomes poor when the conditional independence assumption does not hold.
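As an illustration, here is a minimal sketch of scikit-learn's MultinomialNB on a tiny spam/ham text corpus; the texts and labels are made up purely for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A tiny, made-up corpus purely for illustration.
texts = ["free prize money now", "meeting at noon today",
         "win money free entry", "project meeting rescheduled"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(texts)
nb = MultinomialNB().fit(X, labels)

# scikit-learn computes with log-probabilities internally, avoiding the
# numerical underflow that comes from multiplying many small probabilities.
new = vec.transform(["free money today"])
print("predicted class:", nb.predict(new)[0])
print("log-probabilities:", nb.predict_log_proba(new))
```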
Random Forest
Assumption:
- No formal assumptions about the distribution of the data are required.
Pros:
- It’s a non-parametric model that can perform well with skewed or multi-modal data.
- Handles outliers very easily.
- It can perform well with non-linear data.
- It generally doesn’t overfit, since predictions are averaged over many trees.
Cons:
- The heavy computation involved makes training slow.
- It can become biased towards the majority class when the data is imbalanced.
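Below is a minimal sketch using scikit-learn's RandomForestClassifier; n_estimators=200 and class_weight="balanced" (a common mitigation for the class-imbalance weakness noted above) are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 200 trees is an illustrative choice; averaging over many trees is what
# keeps the ensemble from overfitting the way a single tree would.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=42)
rf.fit(X_train, y_train)  # no scaling needed; trees split on raw values
print("Test accuracy:", rf.score(X_test, y_test))
```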
XGBoost
Assumptions:
- The only assumption is that the encoded integer values used for categorical variables have an ordinal relationship.
Pros:
- Highly interpretable.
- Fast to train and easy to execute.
- No extra preprocessing such as scaling or normalization is required.
- It can easily handle missing values.
Cons:
- Careful hyperparameter tuning is required to avoid overfitting.
- Tuning is computationally expensive.
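Assuming the xgboost package is installed, here is a minimal sketch of its scikit-learn-style XGBClassifier; the NaN values injected into the data demonstrate the native missing-value handling, and the hyperparameters shown are illustrative rather than tuned.

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Inject some missing values; XGBoost handles NaNs natively during training.
X = X.copy()
X[::50, 0] = np.nan
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Modest, illustrative hyperparameters; in practice these need careful tuning.
model = xgb.XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```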
Final words
In this article, we have seen the assumptions, pros, and cons we need to keep in mind when modelling with some well-known machine learning models. In real-world problems, it is essential to choose the most suitable model based on the problem and its requirements. This article should help in selecting a suitable model for different situations based on these pros, cons, and assumptions.
About DSW
Data Science Wizards (DSW) is an Artificial Intelligence and Data Science start-up that offers platforms, solutions, and consulting services to help enterprises use data strategically through AI and data analytics for data-driven decision-making.
DSW’s flagship platform UnifyAI is an end-to-end AI-enabled platform for enterprise customers to build, deploy, manage, and publish their AI models. UnifyAI helps you to build your business use case by leveraging AI capabilities and improving analytics outcomes.
Connect with us at contact@datasciencewizards.ai and visit us at www.datasciencewizards.ai