Data Science Formulae with Examples


Data Science Formulae

September 29, 2024

Meet the Author:

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 含羞草社区. An IIT and ISB alumnus with more than 18 years of experience, Bharani Kumar has held prominent positions at leading IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 含羞草社区, where he has more than ten years of experience and has been making the IT transition journey easy for his students. 含羞草社区 is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

Measures of Central Tendency

Measures of Dispersion


Graphical Representation

Box Plot calculations

Upper limit = Q3 + 1.5(IQR)
IQR: Q3 – Q1
Lower limit = Q1 – 1.5(IQR)
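
As a rough illustration of the limits above, the following sketch computes Q1, Q3, the IQR, and both limits with Python's standard library. The function name `box_plot_limits`, the sample data, and the choice of the "inclusive" quartile method are assumptions made for this example; other quartile conventions give slightly different limits.

```python
import statistics

def box_plot_limits(values):
    """Return (lower_limit, upper_limit) using the 1.5 * IQR rule."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    # The "inclusive" method is an assumption; the formulas do not fix one.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # hypothetical sample
lower, upper = box_plot_limits(data)
# Points outside the limits are usually drawn as outliers.
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)
```

Values outside the two limits are the ones a box plot would typically flag as outliers.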

8) Histogram calculations

Number of Bins = √n

Where n: number of records
Bin width = Range / Number of bins
Where Range: Max – Min value
Number of bins: √number of records
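
A minimal sketch of the square-root rule above, assuming the bin count is rounded up to a whole number (the rounding choice, the function name `histogram_bins`, and the sample data are assumptions for illustration):

```python
import math

def histogram_bins(values):
    """Return (number_of_bins, bin_width) using the square-root rule."""
    n = len(values)
    # Rounding up is an assumption; the rule itself only says sqrt(n).
    num_bins = math.ceil(math.sqrt(n))
    data_range = max(values) - min(values)  # Range = Max - Min
    return num_bins, data_range / num_bins

values = list(range(1, 17))  # 16 hypothetical records
print(histogram_bins(values))
```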

Normalization

Standardization

Robust Scaling

12) Theoretical quantiles in Q-Q plot = (X - µ) / σ


Where X: the observations
µ: mean of the observations
σ: standard deviation
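
The standardization step defined by X, µ, and σ can be sketched as follows. This is only an illustration: the function name `standardize` and the sample data are hypothetical, and using the population standard deviation for σ is an assumption.

```python
import statistics

def standardize(observations):
    """Return (x - mu) / sigma for every observation."""
    mu = statistics.mean(observations)
    # Population standard deviation; using the sample version instead
    # (statistics.stdev) is an equally plausible reading of sigma.
    sigma = statistics.pstdev(observations)
    return [(x - mu) / sigma for x in observations]

z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical data
print(z_scores)
```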

13) Correlation (X, Y)

r = Σ((Xᵢ - X̄) * (Yᵢ - Ȳ)) / √(Σ(Xᵢ - X̄)² * Σ(Yᵢ - Ȳ)²)

Where:
Xᵢ and Yᵢ are the individual data points for the respective variables.
X̄ (X-bar) and Ȳ (Y-bar) are the sample means of variables X and Y, respectively.
Σ represents the sum across all data points.

14) Covariance (X, Y)

Cov(X, Y) = Σ((Xᵢ - X̄) * (Yᵢ - Ȳ)) / (n - 1)
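
The correlation and covariance formulas share the same sum of cross-products, which a short sketch makes visible. The function names and the sample data below are assumptions for illustration only.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the (n - 1) denominator."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 6, 8, 10]  # exactly 2 * xs, so r should be 1
print(covariance(xs, ys), correlation(xs, ys))
```

Covariance depends on the units of X and Y, while r is scaled to lie between -1 and 1.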


Box-Cox Transformation

Yeo- Johnson Transformation

Unsupervised Techniques

Clustering

Distance formulae(Numeric)

Distance formulae (Non- Numeric)

Dimension Reduction


Singular Value Decomposition (SVD)

Association Rule

Support (s):

Confidence (c)

Lift (l)

Recommendation Engine

Cosine Similarity

Network Analytics

Closeness Centrality

Betweenness Centrality

Google Page Rank Algorithm

Text mining

Term Frequency (TF)

Inverse Document Frequency (IDF)

TF-IDF (Term Frequency-Inverse Document Frequency)

Supervised Techniques

Bayes' Theorem

K-Nearest Neighbor (KNN)

Euclidean distance is specified by the following formula:

d(p, q) = √(Σ(pᵢ - qᵢ)²)

Where pᵢ and qᵢ are the i-th coordinates of points p and q.
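
A minimal sketch of how Euclidean distance might be used to pick the nearest neighbors of a query point. The helper names `euclidean_distance` and `k_nearest` and the sample points are hypothetical; a full KNN classifier would also need a voting step, which is omitted here.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_nearest(query, points, k):
    """Return the k points closest to the query (illustrative helper)."""
    return sorted(points, key=lambda pt: euclidean_distance(query, pt))[:k]

print(euclidean_distance((1, 2), (4, 6)))
print(k_nearest((0, 0), [(3, 4), (1, 1), (6, 8)], k=2))
```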

Decision Tree:

Information Gain = Entropy before – Entropy after
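
The "entropy before minus entropy after" idea can be sketched as below. The code assumes the standard Shannon entropy in bits and weights each child node by its share of the records; both choices, the function names, and the labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, children):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(parent)
    entropy_after = sum(len(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - entropy_after

parent = ["yes"] * 5 + ["no"] * 5          # hypothetical node
left = ["yes"] * 4 + ["no"]                # hypothetical split
right = ["yes"] + ["no"] * 4
gain = information_gain(parent, [left, right])
print(round(gain, 3))
```

A split that separates the classes perfectly would recover the full parent entropy as gain.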

Entropy

Confidence Interval

Regression

Simple linear Regression

Equation of a Straight Line

A regression model is the equation that represents how a dependent variable is related to an independent variable and an error term:

y = β0 + β1x + ε

Where, β0 and β1 are called parameters of the model,

ε is a random variable called error term.
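
As an illustration, the parameters β0 and β1 can be estimated by ordinary least squares. The sketch below assumes that estimation method; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

xs = [1, 2, 3, 4]  # hypothetical data lying exactly on y = 1 + 2x
ys = [3, 5, 7, 9]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

With noisy data the fitted line no longer passes through every point, and the leftover differences play the role of the error term ε.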

Regression Analysis

R-squared, also known as the coefficient of determination, is the percentage of variation in the output (dependent) variable that is explained by its relationship with one or more input (predictor) variables.

The higher the R^2, the better the model fits your data
For a least-squares fit with an intercept, R^2 is always between 0 and 100%
R^2 between 0.65 and 0.8 => Moderate correlation
R^2 greater than 0.8 => Strong correlation
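
A small sketch of how R^2 might be computed from actual and predicted values, assuming the usual definition 1 - (residual sum of squares / total sum of squares). The function name and the numbers are hypothetical.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [3, 5, 7, 9]             # hypothetical observations
predicted = [2.8, 5.1, 7.2, 8.9]  # hypothetical model output
print(round(r_squared(actual, predicted), 3))
```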

Multilinear Regression

Logistic Regression

Lasso and Ridge Regression

Lasso = Residual Sum of Squares + λ * (Sum of the absolute value of the magnitude of coefficients)

Where, λ: the amount of shrinkage.

λ = 0 implies all features are considered; it is equivalent to linear regression, where only the residual sum of squares is used to build the predictive model.

λ = ∞ implies no feature is considered, i.e., as λ approaches infinity, more and more coefficients shrink to zero and their features are eliminated.

Ridge = Residual Sum of Squares + λ * (Sum of the squared value of the magnitude of coefficients)

Where, λ: the amount of shrinkage
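
The two penalized objectives differ only in the penalty term, as the following sketch shows. It evaluates the objective for given coefficients rather than fitting a model; the function names and values are assumptions for illustration.

```python
def lasso_objective(rss, coefficients, lam):
    """Residual sum of squares plus lambda * sum of |coefficient|."""
    return rss + lam * sum(abs(b) for b in coefficients)

def ridge_objective(rss, coefficients, lam):
    """Residual sum of squares plus lambda * sum of coefficient squared."""
    return rss + lam * sum(b ** 2 for b in coefficients)

rss = 10.0                      # hypothetical residual sum of squares
coefficients = [0.5, -2.0, 0.0] # hypothetical fitted coefficients
print(lasso_objective(rss, coefficients, lam=0.1))
print(ridge_objective(rss, coefficients, lam=0.1))
```

Setting `lam=0` returns the plain residual sum of squares in both cases, matching ordinary linear regression.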

Advanced Regression for Count data

Negative Binomial Distribution

Poisson Distribution


Time Series:

Moving Average (MA)

The moving average at time "t" is calculated by taking the average of the "n" most recent observations:

MAₜ = (yₜ + yₜ₋₁ + yₜ₋₂ + ... + yₜ₋ₙ₊₁) / n
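
A minimal sketch of a simple moving average over a window of n observations ending at each time step. The function name `moving_average` and the series are hypothetical, and the first n - 1 time steps are skipped because no full window exists for them.

```python
def moving_average(series, n):
    """Average of the n most recent observations at each time step."""
    if n <= 0 or n > len(series):
        raise ValueError("window size must be between 1 and len(series)")
    return [
        sum(series[t - n + 1 : t + 1]) / n  # window ends at index t
        for t in range(n - 1, len(series))
    ]

print(moving_average([10, 12, 14, 16, 18], n=3))
```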

Exponential Smoothing

Exponential smoothing gives more weight to recent observations. The smoothed value at time "t" is calculated using a weighted average:

Sₜ = α * yₜ + (1 - α) * Sₜ₋₁

Where "α" is the smoothing factor.
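
The recursion can be sketched in a few lines. Starting the smoothed series at the first observation is an assumption, since the recursion itself does not say how to initialize it; the function name and data are also hypothetical.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing with smoothing factor alpha."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    smoothed = [series[0]]  # assumption: first smoothed value = first observation
    for y in series[1:]:
        # New value = alpha * observation + (1 - alpha) * previous smoothed value
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

print(exponential_smoothing([10, 20, 30], alpha=0.5))
```

A larger α makes the smoothed series follow recent observations more closely; a smaller α produces a smoother, slower-moving line.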

Autocorrelation Function (ACF)

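Correlation between a variable and its lagged version (one time-step or more):

rₖ = Σ((Yₜ - Ȳ) * (Yₜ₋ₖ - Ȳ)) / Σ(Yₜ - Ȳ)²

The numerator sums over t = k+1, ..., n and the denominator over t = 1, ..., n.

Yₜ = Observation in time period t
Yₜ₋ₖ = Observation in time period t – k
Ȳ = Mean of the values of the series
rₖ = Autocorrelation coefficient for k-step lag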
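
A minimal sketch of the lag-k autocorrelation coefficient built from the symbols defined above, assuming the common sample estimator in which the denominator sums squared deviations over the whole series. The function name and series are hypothetical.

```python
def autocorrelation(series, k):
    """Sample autocorrelation coefficient for a k-step lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [y - mean for y in series]
    # Numerator pairs each observation with the one k steps earlier.
    numerator = sum(deviations[t] * deviations[t - k] for t in range(k, n))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator

series = [1, 2, 3, 4, 5]  # hypothetical series
print(autocorrelation(series, k=1))
```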

Partial Autocorrelation Function (PACF):

The partial autocorrelation function measures the correlation between observations at different lags while accounting for intermediate lags. The PACF at lag "k" is calculated as the coefficient of the lag "k" term in the autoregressive model of order "k". For a stationary series this equals:
PACFₖ = Cov(yₜ, yₜ₋ₖ | yₜ₋₁, yₜ₋₂, ..., yₜ₋ₖ₊₁) / Var(yₜ | yₜ₋₁, yₜ₋₂, ..., yₜ₋ₖ₊₁)

Confusion Matrix

True Positive (TP) = Patient with disease is told that he/she has disease
True Negative (TN) = Patient with no disease is told that he/she does not have disease
False Negative (FN) = Patient with disease is told that he/she does not have disease
False Positive (FP) = Patient with no disease is told that he/she has disease

Overall error rate = (FN+FP) / (TP+FN+FP+TN)

Accuracy = 1 – Overall error rate OR (TP+TN) / (TP+FN+FP+TN); Accuracy should be > % of majority class

Precision = TP/(TP+FP) = TP/Predicted Positive = Proportion of patients predicted to have disease who actually have disease

Sensitivity (Recall or Hit Rate or True Positive Rate) = TP/(TP+FN) = TP/Actual Positive = Proportion of people with disease who are correctly identified as having disease

Specificity (True negative rate) = TN/(TN+FP) = Proportion of people with no disease being characterized as not having disease

FP rate (Alpha or type I error) = 1 – Specificity
FN rate (Beta or type II error) = 1 – Sensitivity
F1 = 2 * ((Precision * Recall) / (Precision + Recall))
F1 ranges from 0 to 1 and balances precision and recall
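
The confusion-matrix measures above can be computed together from the four counts. This is a sketch only: the function name `classification_metrics` and the counts are hypothetical, and no guard against zero denominators is included.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return the main confusion-matrix measures as a dictionary."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "error_rate": (fn + fp) / total,
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "fp_rate": 1 - specificity,
        "fn_rate": 1 - recall,
        "f1": 2 * (precision * recall) / (precision + recall),
    }

metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)  # hypothetical counts
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```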

Forecasting Error Measures

MSE = (1/n) * Σ(Actual – Forecast)²
MAE = (1/n) * Σ |Actual – Forecast|
MAPE = (1/n) * Σ(|Actual – Forecast| / |Actual|) * 100%
RMSE = √((1/n) * Σ(Actual – Forecast)²)
MAD = (1/n) * Σ |Actual – µ|
SMAPE = (1/n) * Σ(|Fᵢ - Aᵢ| / (|Fᵢ| + |Aᵢ|)) * 100%

Where:
n: sample size
Actual: the actual data value
Forecast: the predicted data value
µ: mean of the given set of data
Fᵢ represents the forecasted value.
Aᵢ represents the actual value.
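
A minimal sketch of four of the error measures, using only the standard library. The function names and the sample values are assumptions; MAPE is returned as a percentage here, which is a presentation choice, and it is undefined when any actual value is zero.

```python
import math

def mse(actual, forecast):
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def mae(actual, forecast):
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    # Returned as a percentage; fails if any actual value is zero.
    ratios = (abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return sum(ratios) / len(actual) * 100

def rmse(actual, forecast):
    # The square root covers the whole mean squared error.
    return math.sqrt(mse(actual, forecast))

actual = [100, 200, 300]    # hypothetical actual values
forecast = [110, 190, 300]  # hypothetical forecasts
print(mse(actual, forecast), mae(actual, forecast))
print(mape(actual, forecast), rmse(actual, forecast))
```

RMSE is in the same units as the data, which often makes it easier to interpret than MSE.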
