ML Engineering Starter Pack #2: Data structures, Prob & Stats, ML Algorithms
The second set of hands-on code demos in my 11-topic ML engineering series. This installment introduces 3 core topics that I think all developers should know about: data structures, probability and statistics, and ML algorithms. As usual, I did my best to ensure these videos are loaded with practical information.
Check out the video for Part 2 on YouTube
Topics
Data Structures:
Overview of Big O notation for algorithm efficiency (see the sketch after this list).
Arrays, Linked Lists, Stacks, Queues, Hash Tables, Trees, Graphs
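To make the Big O discussion concrete, here's a minimal sketch (not the exact code from the videos) comparing a linear scan through a Python list with a lookup in a set, which is backed by a hash table. The sizes and timing setup are my own illustration:

```python
import timeit

n = 100_000
data_list = list(range(n))
data_set = set(data_list)  # hash table: O(1) average-case lookup

# Membership test on the last element: the list scans all n items,
# while the hash table computes one hash and probes a bucket.
target = n - 1
list_time = timeit.timeit(lambda: target in data_list, number=100)
set_time = timeit.timeit(lambda: target in data_set, number=100)

print(f"list (O(n) scan):       {list_time:.4f}s")
print(f"set  (O(1) hash table): {set_time:.4f}s")
```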
Probability and Statistics:
Introduction to basic probability concepts.
Conditional Probability and Bayes' Theorem (a worked example follows this list).
Standard Deviation: Calculating the average and variance; understanding the spread of data and its units.
Correlation Coefficient: Assessing the relationship between variables; useful for feature selection and identifying multicollinearity.
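For Bayes' Theorem, a tiny worked example helps. The numbers below (prevalence, sensitivity, false-positive rate) are hypothetical, not taken from the video:

```python
# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical numbers: a test for a condition with 1% prevalence,
# 95% sensitivity, and a 5% false-positive rate.
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.95      # P(B|A): test is positive given the condition
p_b_given_not_a = 0.05  # P(B|not A): false-positive rate

# Total probability of a positive test, P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior: probability of the condition given a positive test
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")  # ~0.161, despite the "95% accurate" test
```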
Feature Selection and Correlation:
Correlation analysis helps identify multicollinearity among features.
Highly correlated features carry redundant information and can destabilize models such as linear regression.
Example calculation of the correlation coefficient and covariance, plus normalization to standardize data (sketched below).
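Here's a rough sketch of that calculation with NumPy. The synthetic features and the near-duplicate column are my own illustration of multicollinearity, not the video's data:

```python
import numpy as np

# Hypothetical feature columns; x2 is deliberately almost a copy of x1
# to mimic multicollinearity.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                   # independent feature

# Covariance by hand, then its normalized version, the Pearson correlation:
# corr(x, y) = cov(x, y) / (std(x) * std(y))
cov_12 = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))
corr_12 = cov_12 / (x1.std() * x2.std())
print(f"corr(x1, x2) = {corr_12:.3f}")  # close to 1.0 -> redundant feature

# Full correlation matrix; |r| near 1 off the diagonal flags multicollinearity
print(np.corrcoef(np.vstack([x1, x2, x3])))
```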
Normal Distribution:
Importance in modeling assumptions for algorithms like linear regression.
Process of normalization to scale data appropriately.
Demonstration of normalization in plain Python, checked against SciPy and visualized with Matplotlib (a sketch follows below).
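A minimal version of that demonstration might look like the following. The sample data is made up, and I'm assuming the standard z-score approach (subtract the mean, divide by the standard deviation):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical sample to standardize
rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=1_000)

# Manual z-score normalization: subtract the mean, divide by the std
normalized = (data - data.mean()) / data.std()

# SciPy's built-in gives the same result
assert np.allclose(normalized, stats.zscore(data))

# Visual check: the normalized data should center on 0 with unit spread
plt.hist(normalized, bins=30, density=True)
plt.title("Z-score normalized data")
plt.show()
```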
Machine Learning Algorithms:
Overview of supervised and unsupervised learning.
Demonstration of:
Linear Regression: Fitting and visualizing a linear model using scikit-learn.
Logistic Regression: Classification example using the iris dataset, highlighting the train-test split and k-fold cross-validation (see the sketch after this list).
Decision Trees: Visualizing decision trees, their splits, and the importance of using ensemble methods to avoid overfitting.
K-Means Clustering: Example of clustering data without labels, demonstrating how to choose the number of clusters and evaluate the results (also sketched after this list).
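As a taste of the logistic regression demo, here's a sketch using scikit-learn's iris dataset with a train-test split and 5-fold cross-validation. The 80/20 split and the max_iter setting are my own illustrative choices, not necessarily what the video uses:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Hold out a test set, then sanity-check with 5-fold cross-validation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"held-out test accuracy: {model.score(X_test, y_test):.3f}")

# k-fold cross-validation gives a less noisy estimate than a single split
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```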
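And for K-Means, one common way to choose the number of clusters is to watch the inertia (within-cluster sum of squares) as k grows and look for an "elbow." This sketch uses synthetic blobs rather than the video's data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled data with a known-but-hidden structure (3 blobs)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Try several values of k; inertia drops sharply until k matches the
# true structure, then flattens out -- that bend suggests k=3 here.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: inertia={km.inertia_:.1f}")
```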
Data Preprocessing:
Importance of preprocessing for machine learning.
Example of handling missing values, identifying numerical and categorical features, and using scikit-learn pipelines.
Steps include imputing missing values, scaling numerical features, and one-hot encoding categorical features.
Integration of the preprocessing steps with a random forest classifier in a single pipeline (a sketch follows below).
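Putting those steps together, a pipeline along these lines would do it. The toy DataFrame and column names are hypothetical, but the structure (impute, scale, one-hot encode, then a random forest) follows the outline above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical dataset with missing values in both feature types
df = pd.DataFrame({
    "age": [25, np.nan, 47, 33, 52, np.nan],
    "income": [40_000, 52_000, np.nan, 61_000, 58_000, 45_000],
    "city": ["NY", "SF", np.nan, "NY", "SF", "LA"],
    "label": [0, 1, 1, 0, 1, 0],
})
X, y = df.drop(columns="label"), df["label"]

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Impute + scale numeric columns; impute + one-hot encode categorical ones
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

# Preprocessing and the classifier live in one pipeline, so the same
# transformations are applied consistently at fit and predict time.
clf = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(random_state=0)),
])
clf.fit(X, y)
print(clf.predict(X))
```

A nice side effect of bundling everything into one pipeline is that the imputers and scaler are fit only on training data, which helps avoid leaking test-set statistics into preprocessing.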
Feature Engineering Bookcamp
I’ve uploaded a 30-minute presentation of my slides on the book “Feature Engineering Bookcamp”.
This book is my style (super hands-on), and it changed the way I think about training ML models with scikit-learn.
If that sounds interesting, go ahead and check out the preview of my video on Patreon. Membership is only a few bucks.
Namaste,
Alex
I block out a clear space of time and clean up all distractions. I read with deep focus, not to snack on content. Because building a “content-snacking brain” will yield a snacking sort of life.
Robin Sharma