Kickstart Your Data Science Journey — A Guide for Aspiring Data Scientists
Key Technical Skills You Need to Kick-start Your Career in Data Science
Are you curious about data science? Do math and artificial intelligence excite you? Do you want to explore data science and pursue it as a career? Whether you’re unsure where to begin or just taking your first steps, you’ve come to the right place. Trust me, this guide will help you take those first steps with confidence!
Data science is one of the most exciting fields in which to work. It’s a multidisciplinary field that combines various techniques and tools to analyze complex datasets, build predictive models, and guide decision-making in businesses, research, and technology.
Data science is applied in various industries such as finance, healthcare, social media, travel, e-commerce, robotics, military, and espionage.
Myths and Truths about Data Science
The Internet has abundant information about how to start with data science, and with it come plenty of myths and misconceptions. The two most important are —
- Do I need to learn the maths? — Many online courses and boot camps advertise that you can become a data scientist in 50 days! These courses are often misleading. They focus on advanced machine learning (ML) topics, offer a few quick coding tutorials using ML frameworks, and tell you not to get into the nitty-gritty mathematical details. That advice is wrong: maths is important. Importing libraries, treating models like black boxes, and relying on high-level APIs isn’t really data science, especially in product-based companies.
- Is Data Science equal to Large Language Models/Generative AI?— No. Data science is not synonymous with large language models (LLMs)/Generative AI. Data science spans far beyond LLMs and encompasses a variety of tools and algorithms. LLMs are groundbreaking but are not suited for solving every academic, research, or business problem. LLMs are one tool among many and shouldn’t define an entire skill set.
Data scientists require a strong grasp of mathematics. If you’re starting your data science journey, focus on mathematics and fundamentals before diving into fancy stuff like LLMs. I’ve stressed the importance of fundamentals throughout this article. Knowing the basic concepts will help you stand out from the crowd of data science aspirants, excel in this career, and stay current with developments in this rapidly growing field. Think of it as laying a building’s foundation: it takes the most time and effort, and it’s essential to support everything that follows. Once the base is solid, you can build upwards, floor by floor, expanding your knowledge and skills.
What is expected from you?
- Patience — Becoming a data scientist is a long, challenging, and tedious journey. Patience is key. Be ready to deal with a few struggles.
- Passion — Passion drives success. Your curiosity and enthusiasm for data and problem-solving will fuel your progress.
- Growth Mindset — Data science is a vast and rapidly evolving field. Embrace the mindset of continuous learning. Always seek to improve and stay updated.
- Think from first principles — Thinking from first principles is the rule of thumb in any profession. It helps you solve problems by breaking them down to the basics and building solutions up from there.
- Consistency — Consistent efforts compound into grand success. Take small steps constantly.
Knowing where to start might seem overwhelming if you’re a beginner. With so many tools, concepts, and techniques to learn, it’s easy to feel lost. But don’t worry!
In this article —
- I will explore the role of a data scientist within an organization and highlight their key responsibilities and contributions.
- I’ll discuss the most fundamental technical skills you need to kick-start your career in data science.
- I’ll explain why these skills are important.
- I’ll share valuable resources to help you learn and develop these skills.
Let’s get started!
Job Description of a Data Scientist
- Define the Problem Statement — A data scientist’s role starts by identifying and solving business challenges using data-driven methods and predictive modeling. The first step involves collaborating with product managers and subject matter experts to define a clear and precise problem statement.
- Exploratory Data Analysis and Training Models — Once the problem is defined, the next step is to gather and explore the data required to train an ML model. Data scientists perform exploratory analysis to identify underlying issues in the data, then apply their core data science skills and judgment to build a robust model.
- Model Evaluation — Data scientists play a crucial role in developing and tracking evaluation metrics to quantify the success of ML models. For instance, in an e-commerce recommendation system, these metrics could measure the model’s impact on sales, user engagement, or revenue growth. Defining the right metrics ensures that the model aligns with business objectives and can deliver meaningful value to the business.
- Model Deployment and A/B Testing — Once the model is ready, data scientists work closely with engineers to deploy it into production. They conduct A/B testing to validate the model’s effectiveness, scale the model for wider use, and monitor its performance over time.
- Research and Experiment— Data scientists continuously experiment with innovative ideas to improve their models. Staying up to date with the latest research is essential. Reading research papers provides insights into new methodologies, algorithms, and breakthroughs.
To do all of this well, the following technical skills are necessary —
- Mathematics — Linear Algebra, Probability, Statistics, and Calculus
- Machine Learning Fundamentals
- Coding — Python and SQL
1. Mathematics
Mathematics is everywhere. No doubt, it’s the backbone and core of data science. A good data scientist must have a deep and precise understanding of mathematics. Mastering it will help you
- Correctly explore, analyze, and interpret large noisy industrial datasets.
- Extract meaningful conclusions from data.
- Grasp the foundational principle behind any ML model you want to use.
- Tweak the model (model hyperparameters, neural network architecture, loss function) based on your requirements.
- Choose the appropriate ML and business metrics to evaluate the models you built.
- Generate a feedback loop to detect possible scenarios where the model might fail.
- Perform error/root cause analysis to understand model flaws.
Without mathematical understanding, you’ll have difficulty opening the black box. The following topics are super important.
1.1. Linear Algebra
Linear Algebra is a beautiful and elegant branch of mathematics that deals with vectors, matrices, and linear transformations. Linear Algebra concepts are fundamental for solving systems of linear equations and manipulating high-dimensional data.
Why is it required?
- In industry, data at scale is inherently high-dimensional. Linear algebra provides the mathematical foundation to represent, store, and efficiently manipulate this data using vectors and matrices. Data transformations, projections, and optimizations can be performed easily by leveraging linear algebra concepts like linear transformations, determinants, orthogonality, and rank.
- For instance, dimensionality reduction techniques like principal component analysis (PCA) rely on concepts like singular value decomposition to extract meaningful, lower-dimensional representations of large datasets.
- Linear algebra is deeply embedded in the core of many ML algorithms. Neural networks and LLMs depend on efficient matrix operations, like matrix multiplications, to handle the massive computational demands of training and inference.
Nvidia folks are getting richer daily because they produce and sell the hardware (GPUs) and write highly optimized software (CUDA) to perform efficient matrix operations!
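To make the PCA point concrete, here’s a minimal NumPy sketch (the data values are made up for illustration): center the data, take its singular value decomposition, and project onto the top singular vector to get a lower-dimensional representation.

```python
import numpy as np

# Toy dataset: 6 samples in 3 dimensions (illustrative values).
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.1],
    [2.2, 2.9, 0.4],
    [1.9, 2.2, 0.6],
    [3.1, 3.0, 0.3],
    [2.3, 2.7, 0.5],
])

# Center the data, then use the singular value decomposition (SVD)
# to find the directions of maximum variance, as PCA does.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the top principal component: 3-D data -> 1-D representation.
X_reduced = Xc @ Vt[0]
print(X_reduced.shape)  # (6,)
```

The rows of `Vt` are the principal directions, and the singular values in `S` (sorted in decreasing order) measure how much variance each direction captures.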
Where to learn Linear Algebra?
- Professor Gilbert Strang’s MIT lectures — Here. He’s one of the best Linear algebra teachers in the world. Professor Strang is a legend. His explanations and way of teaching make the subject even more interesting.
- Sheldon Axler’s Book — Here. You can use Sheldon Axler’s book as a reference book and practice exercises.
- 3Blue1Brown YouTube Channel — Here. Follow this YouTube channel for eye-catching visualizations of different concepts in Linear algebra.
1.2. Probability and Statistics
Probability and statistics are essential for understanding uncertainty in data-driven fields. Probability theory provides a mathematical framework to quantify the likelihood of events. Statistics involves collecting, organizing, analyzing, and interpreting data to make informed decisions.
Why are they required?
- Before diving into ML models, it’s crucial to analyze and understand the basic properties of data. High school concepts like mean, median, mode, variance, quantiles, and standard deviation are foundational for exploring data distributions and trends.
- Statistical concepts such as variance, covariance, and correlation are key for identifying relationships between features.
- Probability is the core principle that drives predictive modeling. An in-depth understanding of probability axioms, probability density functions, cumulative distribution functions, random variables (continuous and discrete), Bayes’ theorem, expectation, variance, joint distributions, and conditional probability is essential.
- ML algorithms often assume that the input data and output follow a certain probability distribution. Familiarity with distributions like Gaussian (Normal), Geometric, Bernoulli, Binomial, Poisson, and Beta enables better assumptions about data and models.
- In product data science, A/B testing is a common practice to compare variations and make decisions. Knowledge about hypothesis testing using statistical tests like z-test and chi-squared test is useful.
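As an illustration of the A/B-testing point, here is a small, self-contained sketch of a two-proportion z-test using only the standard library. The conversion counts are invented for the example:

```python
from math import erf, sqrt

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical A/B test: variant A converts 200/2000, variant B 260/2000.
z, p = two_proportion_ztest(200, 2000, 260, 2000)
print(round(z, 2), round(p, 4))
```

If the p-value falls below the chosen significance level (commonly 0.05), we reject the null hypothesis that the two variants convert at the same rate.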
Where to learn Probability and Statistics?
- Professor John Tsitsiklis’s MIT Introduction to Probability lectures — Here.
- Stanford University’s Probability for Computer Scientists lectures — Here
- Josh Starmer’s YouTube Playlist for Statistics— Here. His videos are very engaging. You can follow his YouTube channel to learn other data science concepts. This channel is useful, especially for learning/revising statistics concepts.
- Sheldon Ross’s Book — Here. You can use Sheldon Ross’ book as a reference book. Practice exercises from this book.
1.3. Calculus
Calculus is about finding the rate of change of a function. Calculus, especially differential calculus, plays an integral role in ML. It calculates the slope or gradient of curves, which tells us how a quantity changes in response to changes in another.
Why is it required?
- An ML algorithm aims to find the set of parameters with the least prediction error (the loss function). Optimization algorithms like gradient descent are used extensively to minimize this error and update the model parameters.
- In deep learning, the chain rule of differentiation is critical for the backpropagation algorithm. Backpropagation computes gradients efficiently through deep neural networks. It’s fundamental to understanding how neural networks work and how gradients are used to obtain the best model parameters.
The 2024 Nobel Laureate Geoffrey Hinton co-authored the backpropagation algorithm paper in 1986!
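Here is a tiny illustration of the gradient-descent idea from the bullets above: minimize a one-dimensional loss by repeatedly stepping against the gradient. The loss function and learning rate are chosen arbitrarily for the sketch.

```python
# Minimize the loss L(w) = (w - 3)^2 with vanilla gradient descent.
# dL/dw = 2 * (w - 3), so each step moves w toward the minimum at w = 3.

def grad(w):
    return 2 * (w - 3)

w = 0.0    # initial parameter
lr = 0.1   # learning rate
for _ in range(100):
    w -= lr * grad(w)

print(round(w, 4))  # converges to 3.0
```

Backpropagation is the same idea scaled up: the chain rule computes this gradient for every weight in a deep network, and each weight takes a small step downhill.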
Where to learn Calculus?
Wait! You’ll find out soon!
2. Machine Learning Fundamentals
Machine learning is built upon the core principles of Linear algebra, probability, statistics, and calculus. At its essence, ML is applied mathematics, and once you grasp the underlying mathematics, understanding fundamental ML concepts becomes much easier. These fundamentals are essential for building robust and accurate ML models.
Most comprehensive ML courses begin by introducing the various types of algorithms. There are supervised, unsupervised, self-supervised, and reinforcement learning methods, each designed for specific problems. ML algorithms are further categorized into classification, regression, and clustering, depending on whether the task predicts labels, predicts continuous values, or identifies patterns.
Nearly all ML workflows follow a structured process, which includes the following key steps:
- Feature Engineering and Data Preprocessing — Although it might not be the most glamorous part of data science, feature engineering and data preprocessing play a pivotal role in determining how well your machine learning models will perform. This involves splitting your data into train, validation, and test sets. Other key activities include dimensionality reduction, feature selection, normalization, and handling outliers. Properly addressing missing values and class imbalance (in classification tasks) is crucial to prevent biased or inaccurate models. These steps ensure your data is clean and properly structured, allowing the model to focus on learning from meaningful patterns rather than noise.
- Training and Optimization — Probability and statistics play a pivotal role in defining the loss function of an ML algorithm. A key concept, maximum likelihood estimation (MLE), is often used to derive the loss function based on our assumptions about the data’s distribution. During training, the model’s parameters (weights) are updated iteratively by optimizing the loss function. As you might have guessed, this is done using gradient descent algorithms. Mathematics everywhere!
- Overfitting and Underfitting — These are two of the many challenges we face while training ML models. Overfitting occurs when a model learns noise in the training data and performs poorly on unseen data. Underfitting happens when a model is too simple to capture the underlying patterns, leading to poor training and test data performance. Bias-variance tradeoff is the balance between model complexity and generalization. High bias leads to underfitting, and high variance leads to overfitting. The ability to manage this tradeoff by varying hyperparameters, applying regularization, and observing validation set performance is one of the important skills of a data scientist.
- Evaluation Metrics — As a data scientist, it’s crucial to pick the most suitable metric to evaluate your model. Evaluation is done on the test set. There are a plethora of ML metrics suitable to different problem scenarios.
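A quick way to see overfitting in action is to fit models of different complexity to the same noisy data and compare train and test errors. The sketch below (synthetic data, arbitrarily chosen polynomial degrees) uses NumPy’s polynomial fitting; the high-degree model drives its training error down but typically generalizes worse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a linear trend plus noise.
x_train = np.linspace(0, 1, 15)
y_train = 2 * x_train + rng.normal(0, 0.2, size=15)
x_test = np.linspace(0.02, 0.98, 15)
y_test = 2 * x_test + rng.normal(0, 0.2, size=15)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, 1)     # degree 1: matches the true trend
complex_ = np.polyfit(x_train, y_train, 10)  # degree 10: enough capacity to fit noise

print("simple :", mse(simple, x_train, y_train), mse(simple, x_test, y_test))
print("complex:", mse(complex_, x_train, y_train), mse(complex_, x_test, y_test))
```

The degree-10 fit always achieves a training error at least as low as the degree-1 fit (it has strictly more capacity), which is exactly why training error alone is a misleading metric; the held-out test error is what reveals the tradeoff.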
Where to learn ML?
- Andrew Ng’s Stanford University’s ML Specialization — Here. I undertook this course in 2019. It remains the best course for understanding basic ML. You can audit this course for free! This specialization doesn’t cover the math deeply but gives you an intuitive understanding of ML.
- Cornell Tech’s Applied Machine Learning Lectures — Here. This course is super important. It starts with a primer on calculus and optimization before diving deep into the nitty-gritty details of various ML algorithms. You’ll witness the amalgamation of linear algebra, probability, and calculus concepts here. Lecture notes are available in the link shared above.
These courses will cover ML algorithms such as linear regression, Bayes classifier, logistic regression, k-means clustering, Gaussian mixture models, support vector machines, neural networks, decision trees, random forests, and boosting algorithms.
A clear understanding of mathematics and ML fundamentals opens the avenues for exploring advanced concepts like deep learning, natural language processing, computer vision, recommendation systems, generative AI, and large language models (LLMs).
You might have noticed a pattern. I have provided you with resources involving lectures from top universities like MIT, Stanford University, and Cornell Tech. From now on, look for course lectures from such universities whenever you want to upskill. They offer the best explanations and content. For instance, Stanford University has courses on Deep Learning for NLP, Graph ML, and Reinforcement Learning on its YouTube channel.
3. Coding
Coding skills are just as essential as mathematics for thriving as a data scientist. Coding skills help develop your problem-solving and critical-thinking abilities. Python and SQL are the most important coding skills you must possess.
3.1 Python
Python is the most widely used programming language in data science due to its simplicity, versatility, and powerful libraries.
What will you have to do?
- Your first target must be learning basic data structures like strings, lists/arrays, dictionaries, and core Object-Oriented Programming (OOP) concepts like classes and objects. Become an expert in these two areas.
- Knowledge of advanced data structures like trees, graphs, and traversal algorithms is a plus point.
- You must be proficient in time and space complexity analysis. It’ll help you write efficient code in practice. Learning the basic sorting and searching algorithms can help you gain a sufficient understanding of time and space complexity.
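For example, binary search is a classic illustration of complexity analysis: on a sorted list it runs in O(log n) time because each comparison halves the search interval. A minimal implementation:

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent.

    Runs in O(log n) time and O(1) extra space: each comparison
    halves the remaining search interval.
    """
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([1, 3, 5, 7, 9, 11], 7))   # -> 3
print(binary_search([1, 3, 5, 7, 9, 11], 4))   # -> -1
```

Contrast this with a linear scan, which is O(n): on a billion-element sorted array, binary search needs about 30 comparisons instead of up to a billion.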
Python has the best data science library collection. Two of the most essential libraries are —
- NumPy — This library supports efficient operations on vectors and matrices.
- Pandas/PySpark — Pandas is a powerful data frame library for data manipulation and analysis. It can handle structured data formats like .csv, .parquet, and .xlsx. Pandas dataframes support operations that simplify tasks like filtering, sorting, and aggregating data. Pandas is well suited to datasets that fit in memory; for big data, the PySpark library is used instead. PySpark supports a variety of SQL operations (discussed later in the article), making it ideal for working with large datasets in distributed environments.
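As a small taste of the Pandas workflow described above (the table and column names are made up for illustration), filtering and aggregation each take only a line:

```python
import pandas as pd

# Hypothetical orders data; the columns are invented for this example.
df = pd.DataFrame({
    "category": ["books", "books", "toys", "toys", "toys"],
    "price": [12.0, 8.0, 25.0, 15.0, 20.0],
})

# Filtering: keep only rows with price above 10.
expensive = df[df["price"] > 10]

# Aggregation: average price per category.
avg_price = df.groupby("category")["price"].mean()
print(avg_price)
```

PySpark dataframes expose a very similar API (`filter`, `groupBy`, `agg`), so these habits transfer directly to distributed datasets.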
Beyond these, there are several other libraries you’ll encounter and use regularly —
- Scikit-learn — A go-to library for implementing machine learning algorithms, data preprocessing, and model evaluation.
- PyTorch — A deep learning framework widely used for building and training neural networks.
- Matplotlib and Seaborn — Libraries for data visualization, allowing you to create plots, charts, and graphs to visualize and understand data.
As a beginner, mastering every library isn’t a requirement. There are countless domain-specific libraries, like OpenCV, statsmodels, and Transformers, that you’ll pick up naturally through hands-on practice. Learning to use libraries is one of the easiest parts of data science and becomes second nature as you work on more projects. There’s no need to memorize functions — honestly, I still google various Pandas and PySpark functions all the time! I’ve seen many aspirants focus solely on libraries. While libraries are important, they’re just a small part of your toolkit.
3.2 SQL
SQL (Structured Query Language) is a fundamental tool for data scientists, especially when working with large datasets stored in relational databases, which is where much of the industry’s data lives. SQL is one of the most important skills to hone when starting your data science journey. It allows you to query, manipulate, and retrieve data efficiently, which is often the first step in any data science workflow. Whether you’re extracting data for exploratory analysis, joining multiple tables, or performing aggregate operations like counting, averaging, and filtering, SQL is the go-to language.
I had only a basic understanding of SQL queries when I started my career. That changed when I joined my current company, where I began using SQL professionally. I worked with industry-level big data, ran SQL queries to fetch data, and gained hands-on experience.
The following SQL statements and operations are important —
Basic —
- Extraction —The select statement is the most basic statement in SQL querying.
- Filtering —The where keyword is used to filter data as per conditions.
- Sorting — The order by keyword is used to sort the data in either ascending (asc) or descending (desc) order.
- Joins — As the name suggests, SQL Joins help you join multiple tables in your SQL database. SQL has different types of joins — left, right, inner, outer, etc.
- Aggregation Functions— SQL supports various aggregation functions such as count(), avg(), sum(), min(), max().
- Grouping — The group by keyword is often used with an aggregation function.
Advanced —
- Window Functions — Window functions are a powerful feature in SQL that allows you to perform calculations across a set of table rows related to the current row. Once you are proficient with the basic SQL queries mentioned above, familiarize yourself with window functions such as row_number(), rank(), dense_rank(), lead(), lag(). Aggregation functions can also be used as window functions. The partition by keyword is used to partition the set of rows (called the window) and then perform the window operations.
- Common Table Expressions (CTEs) — CTEs make SQL queries more readable and modular, especially when working with complex subqueries or recursive queries. They are defined using the with keyword. This is an advanced concept.
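Here is a sketch combining a CTE with a window function, again via sqlite3 (this assumes an SQLite build with window-function support, version 3.25 or later; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', 100), ('north', 300), ('north', 200),
        ('south', 250), ('south', 150);
""")

# A CTE (WITH ...) feeding a window function: rank each sale
# within its region by amount, highest first.
rows = conn.execute("""
    WITH regional AS (
        SELECT region, amount FROM sales
    )
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM regional
""").fetchall()

for row in rows:
    print(row)
```

Note that the aggregate rows are not collapsed as they would be with group by: every input row survives, annotated with its rank inside its partition.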
You’ll often use Python’s PySpark library in conjunction with SQL. PySpark has APIs for all SQL operations and helps integrate SQL and Python. You can perform various SQL operations on PySpark dataframes in Python seamlessly!
3.3 Practice, Practice, Practice
- Rigorous practice is key to mastering coding skills, and platforms like LeetCode and GeeksForGeeks offer great tutorials and exercises to improve your Python skills.
- SQLZOO and w3schools are great platforms to start learning SQL.
- Kaggle is the best place to combine your ML and coding skills to solve ML problems. It’s important to get hands-on experience. Pick up any contest. Play with the dataset and apply the skills you learn from the lectures.
- Implementing ML algorithms without using special ML libraries like scikit-learn or PyTorch is a great self-learning exercise. Writing code from scratch for basic algorithms like PCA, gradient descent, and linear/logistic regression can help you enhance your understanding and coding skills.
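As one concrete version of that exercise, here is linear regression trained from scratch with batch gradient descent, using nothing but NumPy and the MSE gradients derived by hand (the synthetic data and hyperparameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y = 3x + 2 + noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=200)

# Linear regression trained with batch gradient descent (no ML libraries).
w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    y_pred = w * X[:, 0] + b
    error = y_pred - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * np.mean(error * X[:, 0])
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # should land near the true values 3 and 2
```

Once this feels natural, rewriting it for logistic regression (swap the loss for cross-entropy) or PCA (replace the loop with an SVD) is a short step, and you’ll understand exactly what scikit-learn is doing under the hood.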
During my Master’s in AI course at the Indian Institute of Science, Bengaluru, we had coding assignments where we implemented algorithms in C! Yes C! One of these assignments was about training a deep neural network for MNIST digits classification.
I built a deep neural network from scratch in C. I created a custom data structure for storing weights and wrote algorithms for gradient descent and backpropagation. I felt immense satisfaction when the C code ran successfully on my laptop’s CPU. My friend mocked me for doing this “impractical” exercise and argued that we have highly efficient libraries for such a task. Although my code was inefficient, writing the code from scratch deepened my understanding of the internal mechanics of deep neural networks.
You’ll eventually use libraries for your projects in academia and industry. However, as a beginner, jumping straight into libraries can prevent you from fully understanding the fundamentals.
Final Notes
Congratulations on making it this far in the article! We’ve covered the core skills necessary to become a data scientist. By now, I hope you have a solid understanding of why the basics are so important.
A Master’s degree from a reputed institution can provide structured learning on mathematics and ML concepts. It also offers opportunities to work on projects and gain practical experience. However, if pursuing a formal degree isn’t an option, don’t worry. You can follow the YouTube playlists and reference books mentioned earlier to self-learn.
Every expert was once a beginner. The key is to start small. Take it one step at a time, and gradually build your knowledge. Make sure not to skip any steps — start by mastering the math before moving on to applying it. Don’t rush the process. Focus on truly understanding each concept. Developing a strong foundation and thinking from first principles should always be your mantra. Over time, everything will begin to fall into place. With the right mindset, you’ll excel in this journey.
I highly recommend becoming a Medium member if you haven’t done so. You’ll unlock unlimited access to invaluable resources. Trust me, it’s a goldmine of knowledge! You’ll find insightful articles written by data science professionals and experts.
I hope you find my article interesting. Thank you for reading, and good luck in your data science journey!
Kickstart Your Data Science Journey — A Guide for Aspiring Data Scientists was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.