Getting Started with Python for Data Analysis

Jonathan Barrios • April 10, 2022

data-analytics data-science machine-learning

So you want to learn Python to work with data, and you want to get started right away. First, congrats on deciding to learn one of the most popular programming languages out there and, best of all, the most popular language when working with data!

In this post, we'll cover the landscape of tools and how you can get started right away, with little to no setup and no coding experience required. Let's do this!

By the end of this post, you'll have a clear idea of what Python is, how it differs from other languages, what an interactive notebook is, and what it takes to break into the field. Here are some of the questions you will find answers to in this post:

What is Python?

Python is a general-purpose programming language popular with data practitioners such as data engineers, data analysts, data scientists, and machine learning engineers. Python's design philosophy emphasizes code readability, and it uses significant indentation (whitespace) to mark blocks of code. Python also supports object-oriented programming, which helps web developers, programmers, and even game developers write clear, logical code for applications small and large.
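To see what significant indentation looks like in practice, here is a minimal sketch. The temperature values are made up for illustration; the point is that the indented lines, not braces or keywords, show which code belongs to the loop and to each branch.

```python
# Indentation marks which lines belong to the loop and to the if/else branches.
temperatures = [18, 25, 31]  # made-up sample values

labels = []
for temp in temperatures:
    if temp >= 30:
        labels.append("hot")
    else:
        labels.append("mild")

print(labels)  # ['mild', 'mild', 'hot']
```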

What is Google Colab?

From Google, "Colaboratory, or 'Colab' for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education."

Colab is essentially a hosted, online version of Jupyter Notebook, and both are what I like to call interactive notebooks for iterative development. A code editor such as Visual Studio Code is not the same as an interactive notebook, even though both can serve as Integrated Development Environments (IDEs). Before we dive into IDEs and how interactive notebooks are used for iterative development, let's learn more about Jupyter Notebook.

What is Jupyter notebook?

From Project Jupyter, "Project Jupyter is a project and community whose goal is to develop open-source software, open standards, and services for interactive computing across dozens of programming languages."

Jupyter Notebook is essentially a client-server app that you use through a web browser. Google Colab is hosted in the cloud, whereas Jupyter Notebook runs on your own machine using a local server. In both cases, you work in the browser.

In short, Colab and Jupyter Notebook are essentially the same thing, except that the former runs in the cloud and the latter runs on your local machine. Both use a web browser as the user interface.

What's the difference between code editors and interactive notebooks?

Programmers use a code editor like Visual Studio Code to write code with built-in tools that make development more straightforward, and to store that code in a file they can execute whenever they want to run a script or program. An interactive notebook, on the other hand, runs code iteratively, one code cell after another, displaying output as you write. Interactive notebooks also allow text cells, images, visualizations, and even videos, which makes them popular for data analysis and visualization.
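To make the contrast concrete, here is a rough sketch of a tiny analysis split up the way it might be across notebook cells. The `# %%` markers are only a convention that some editors use to mark cell boundaries in a plain file; in Colab or Jupyter, each chunk would simply be its own cell. The sales figures are invented.

```python
# %% Cell 1: define some data (in a notebook, this stays in memory after it runs)
monthly_sales = {"Jan": 120, "Feb": 95, "Mar": 143}

# %% Cell 2: build on the earlier cell and see the result right away
total = sum(monthly_sales.values())
print("Total sales:", total)

# %% Cell 3: keep iterating without re-running everything above
best_month = max(monthly_sales, key=monthly_sales.get)
print("Best month:", best_month)
```

Saved as a file and run from a code editor, the same lines would execute from top to bottom in one go instead of one cell at a time.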

How do I get started as a data analyst?

Traditionally, data analysis was performed using spreadsheet programs such as VisiCalc, or with SQL (Structured Query Language), the standard language for communicating with databases. Later tools such as Excel made working with spreadsheets much more accessible than their predecessors.

Today, Python is probably the most popular language for working with data, and Python libraries such as Pandas offer data structures similar to spreadsheets. The difference is that Python lets you automate repetitive tasks and work with datasets that would be too large to handle comfortably in Excel.
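Here is a minimal sketch of what that spreadsheet-like structure looks like in Pandas. The `orders` table and its values are invented for illustration, and the example assumes the `pandas` package is installed (it already is in Colab).

```python
import pandas as pd

# A DataFrame is a table with named columns, much like a sheet in Excel.
# The values below are invented for illustration.
orders = pd.DataFrame(
    {
        "region": ["East", "West", "East", "West"],
        "units": [10, 4, 7, 12],
        "unit_price": [2.5, 3.0, 2.5, 3.0],
    }
)

# Like filling a formula down a column: compute revenue for every row at once.
orders["revenue"] = orders["units"] * orders["unit_price"]

# Like a pivot table: total revenue per region.
revenue_by_region = orders.groupby("region")["revenue"].sum()
print(revenue_by_region)
```

The same few lines work whether the table has four rows or four million, which is where the automation pays off.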

Choosing a good development environment is an important first step for the aspiring data practitioner, and the many different ways to get started can make that step more complicated than it has to be. My advice is simple: start with Python and SQL, and learn to manipulate and visualize data using Python libraries such as Pandas and Matplotlib. If you already know how to use Excel, you will be able to transfer your knowledge to the Pandas library, which uses DataFrames, the Pandas equivalent of spreadsheets. If you don't know anything about spreadsheets, learning the basics helps but is not required.
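As a small taste of using Python and SQL together, the sketch below runs a SQL query from Python with the built-in `sqlite3` module and a temporary in-memory database. The `sales` table and its rows are made up; a real project would connect to an existing database instead.

```python
import sqlite3

# An in-memory database that disappears when the connection is closed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales (product, quantity) VALUES (?, ?)",
    [("apples", 3), ("pears", 5), ("apples", 4)],
)

# SQL does the aggregation; Python receives the result as a list of tuples.
rows = conn.execute(
    "SELECT product, SUM(quantity) FROM sales GROUP BY product ORDER BY product"
).fetchall()
conn.close()

print(rows)  # [('apples', 7), ('pears', 5)]
```

From here, a result like `rows` could be loaded into a Pandas DataFrame for further analysis or plotting.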

To get started, check out my course on Python Foundations for Data Analysis. If you have any questions, reach out on Twitter @_jonathan_codes, and until then, happy analyzing! 🐍 📊