ML Engineering Starter Pack #2: Data structures, Prob & Stats, ML Algorithms
The second set of hands-on code demos in my 11-topic ML engineering series. This installment introduces 3 core topics that I think all developers should know about: data structures, probability and statistics, and ML algorithms. As usual, I did my best to ensure these videos are loaded with practical information.
Check out the video for Part 2 on YouTube
Topics
Data Structures:
Overview of Big O notation for algorithm efficiency (see the sketch after this list).
Arrays, Linked Lists, Stacks, Queues, Hash Tables, Trees, Graphs
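To make the Big O discussion concrete, here's a minimal sketch (not the exact code from the videos) comparing a linear scan through a Python list with a lookup in a set, which is backed by a hash table. The sizes and timing setup are my own illustration:

```python
import timeit

n = 100_000
data_list = list(range(n))
data_set = set(data_list)  # hash table: O(1) average-case lookup

# Membership test on the last element: the list scans all n items,
# while the hash table computes one hash and probes a bucket.
target = n - 1
list_time = timeit.timeit(lambda: target in data_list, number=100)
set_time = timeit.timeit(lambda: target in data_set, number=100)

print(f"list (O(n) scan):       {list_time:.4f}s")
print(f"set  (O(1) hash table): {set_time:.4f}s")
```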
Probability and Statistics:
Introduction to basic probability concepts.
Conditional Probability and Bayes' Theorem (a worked example follows this list).
Standard Deviation: Calculating the average and variance; understanding the spread of data and its units.
Correlation Coefficient: Assessing the relationship between variables; useful for feature selection and identifying multicollinearity.
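For Bayes' Theorem, a tiny worked example helps. The numbers below (prevalence, sensitivity, false-positive rate) are hypothetical, not taken from the video:

```python
# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical numbers: a test for a condition with 1% prevalence,
# 95% sensitivity, and a 5% false-positive rate.
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.95      # P(B|A): test is positive given the condition
p_b_given_not_a = 0.05  # P(B|not A): false-positive rate

# Total probability of a positive test, P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior: probability of the condition given a positive test
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")  # ~0.161, despite the "95% accurate" test
```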
Feature Selection and Correlation:
Correlation analysis helps identify multicollinearity among features.
Highly correlated features carry redundant information and can destabilize models such as linear regression.
Example calculation of the correlation coefficient and covariance, plus normalization to standardize data (sketched below).
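Here's a rough sketch of that calculation with NumPy. The synthetic features and the near-duplicate column are my own illustration of multicollinearity, not the video's data:

```python
import numpy as np

# Hypothetical feature columns; x2 is deliberately almost a copy of x1
# to mimic multicollinearity.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                   # independent feature

# Covariance by hand, then its normalized version, the Pearson correlation:
# corr(x, y) = cov(x, y) / (std(x) * std(y))
cov_12 = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))
corr_12 = cov_12 / (x1.std() * x2.std())
print(f"corr(x1, x2) = {corr_12:.3f}")  # close to 1.0 -> redundant feature

# Full correlation matrix; |r| near 1 off the diagonal flags multicollinearity
print(np.corrcoef(np.vstack([x1, x2, x3])))
```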
Normal Distribution:
Importance in modeling assumptions for algorithms like linear regression.
Process of normalization to scale data appropriately.
Demonstration of normalization in plain Python, checked against SciPy and visualized with Matplotlib (a sketch follows below).
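A minimal version of that demonstration might look like the following. The sample data is made up, and I'm assuming the standard z-score approach (subtract the mean, divide by the standard deviation):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical sample to standardize
rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=1_000)

# Manual z-score normalization: subtract the mean, divide by the std
normalized = (data - data.mean()) / data.std()

# SciPy's built-in gives the same result
assert np.allclose(normalized, stats.zscore(data))

# Visual check: the normalized data should center on 0 with unit spread
plt.hist(normalized, bins=30, density=True)
plt.title("Z-score normalized data")
plt.show()
```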
Machine Learning Algorithms:
Overview of supervised and unsupervised learning.
Demonstration of:
Linear Regression: Fitting and visualizing a linear model using scikit-learn.
Logistic Regression: Classification example using the iris dataset, highlighting the train-test split and k-fold cross-validation (see the sketch after this list).
Decision Trees: Visualizing decision trees, their splits, and the importance of using ensemble methods to avoid overfitting.
K-Means Clustering: Example of clustering data without labels, demonstrating how to choose the number of clusters and evaluate the results (also sketched after this list).
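As a taste of the logistic regression demo, here's a sketch using scikit-learn's iris dataset with a train-test split and 5-fold cross-validation. The 80/20 split and the max_iter setting are my own illustrative choices, not necessarily what the video uses:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Hold out a test set, then sanity-check with 5-fold cross-validation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"held-out test accuracy: {model.score(X_test, y_test):.3f}")

# k-fold cross-validation gives a less noisy estimate than a single split
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```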
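And for K-Means, one common way to choose the number of clusters is to watch the inertia (within-cluster sum of squares) as k grows and look for an "elbow." This sketch uses synthetic blobs rather than the video's data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled data with a known-but-hidden structure (3 blobs)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Try several values of k; inertia drops sharply until k matches the
# true structure, then flattens out -- that bend suggests k=3 here.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: inertia={km.inertia_:.1f}")
```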
Data Preprocessing:
Importance of preprocessing for machine learning.
Example of handling missing values, identifying numerical and categorical features, and using scikit-learn pipelines.
Steps include imputing missing values, scaling numerical features, and one-hot encoding categorical features.
Integration of the preprocessing steps with a random forest classifier in a single pipeline (a sketch follows below).
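Putting those steps together, a pipeline along these lines would do it. The toy DataFrame and column names are hypothetical, but the structure (impute, scale, one-hot encode, then a random forest) follows the outline above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical dataset with missing values in both feature types
df = pd.DataFrame({
    "age": [25, np.nan, 47, 33, 52, np.nan],
    "income": [40_000, 52_000, np.nan, 61_000, 58_000, 45_000],
    "city": ["NY", "SF", np.nan, "NY", "SF", "LA"],
    "label": [0, 1, 1, 0, 1, 0],
})
X, y = df.drop(columns="label"), df["label"]

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Impute + scale numeric columns; impute + one-hot encode categorical ones
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

# Preprocessing and the classifier live in one pipeline, so the same
# transformations are applied consistently at fit and predict time.
clf = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(random_state=0)),
])
clf.fit(X, y)
print(clf.predict(X))
```

A nice side effect of bundling everything into one pipeline is that the imputers and scaler are fit only on training data, which helps avoid leaking test-set statistics into preprocessing.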
Feature Engineering Bookcamp
I’ve uploaded a 30-minute presentation of my slides on the book “Feature Engineering Bookcamp”.
This book is my style (super hands-on), and it changed the way I think about training ML models with scikit-learn.
If that sounds interesting, go ahead and check out the preview of my video on Patreon. Membership is only a few bucks.
Namaste,
Alex
I block out a clear space of time and clean up all distractions. I read with deep focus, not to snack on content. Because building a “content-snacking brain” will yield a snacking sort of life.
Robin Sharma