Data Collection & Importing Datasets

Data Collection is the first step in the Data Science workflow.
Before analyzing or building machine learning models, you need data.
This article explains Data Collection and how to import datasets in Python using Pandas, in simple, teacher-style language.

What Is Data Collection?

Data Collection is the process of gathering information from various sources so you can analyze it.

In Data Science, data can come from:

  • Websites

  • Databases

  • Sensors

  • Mobile apps

  • Surveys

  • Social media

  • APIs

  • Cloud storage

  • CSV, Excel, JSON files

Without data, no analysis or prediction is possible.

Why Is Data Collection Important in Data Science?

Data Collection allows you to:

  • Understand business problems

  • Find patterns

  • Build predictive models

  • Improve decision-making

  • Create reports & dashboards

The quality of your data decides the quality of your model.
This is why companies invest heavily in data engineering, data pipelines, and ETL.

Types of Data

Data can be collected in three main forms:

Structured Data (tables with rows and columns)

Examples:

  • Excel files

  • SQL databases

  • CSV files

Used in: Analytics, BI, ML models.

Unstructured Data

Examples:

  • Text (emails, reviews)

  • Images

  • Videos

  • Audio

Used in: NLP, Computer Vision, Deep Learning.

Semi-Structured Data

Examples:

  • JSON

  • XML

  • Web server logs

Useful for APIs, web apps, cloud data.
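As a minimal sketch of working with semi-structured data (the records and field names below are made up for illustration), Pandas can flatten nested JSON into a table with `json_normalize`:

```python
import pandas as pd

# Semi-structured records, as an API or log file might return them
records = [
    {"id": 1, "name": "Asha", "address": {"city": "Delhi", "pin": "110001"}},
    {"id": 2, "name": "Ravi", "address": {"city": "Mumbai", "pin": "400001"}},
]

# json_normalize flattens nested fields into columns like "address.city"
df = pd.json_normalize(records)
print(df.head())
```

Each nested field becomes its own column, so the semi-structured input ends up as an ordinary structured DataFrame.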

Sources of Data Collection

There are many ways to collect data. Let’s learn them one by one.

Manual Collection

Collecting data by hand:

  • Surveys

  • Google Forms

  • Interviews

Suitable for small datasets.

Automated Collection

Using tools to collect data at scale:

  • Web scraping

  • Sensors / IoT

  • API calls

  • Automated logs

  • Data from mobile apps

Used in big data analytics and machine learning systems.

Web Scraping

Extracting data from websites.

Tools:

  • BeautifulSoup

  • Scrapy

  • Selenium

Example: Collecting product prices from Amazon (for analysis).
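As a minimal, self-contained sketch of the idea (parsing a hard-coded HTML snippet with made-up product names and prices, rather than requesting a live site):

```python
from bs4 import BeautifulSoup

# A small hard-coded HTML snippet standing in for a downloaded page
html = """
<html><body>
  <div class="product"><span class="name">Keyboard</span><span class="price">1499</span></div>
  <div class="product"><span class="name">Mouse</span><span class="price">799</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract each product's name and price into a list of dictionaries
products = [
    {
        "name": div.find("span", class_="name").text,
        "price": int(div.find("span", class_="price").text),
    }
    for div in soup.find_all("div", class_="product")
]
print(products)
```

In a real project, the HTML would come from an HTTP request instead of a string, and you should always check the site's terms of service and robots.txt before scraping.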

APIs (Application Programming Interfaces)

APIs provide data in real time.

Example API sources:

  • Twitter API

  • Google Maps API

  • Weather API

  • Stock Market API

Format: Mostly JSON

Databases

Companies store large data in:

  • MySQL

  • PostgreSQL

  • MongoDB

  • SQL Server

  • Oracle

Data Scientists import this data into Python for processing.

Cloud Data Platforms

Modern companies store data on:

  • AWS S3

  • Google BigQuery

  • Azure Blob Storage

  • Snowflake

  • Databricks

These are used in data engineering, ETL, and big data pipelines.

Public Datasets

Beginner-friendly sources:

  • Kaggle

  • UCI Machine Learning Repository

  • Google Dataset Search

  • GitHub

  • Government portals (Open Data India)

Importing Datasets in Python

We use Pandas because it is the standard data analysis library in Python.

import pandas as pd

Now let’s learn how to import different dataset types.

Importing CSV File

CSV (Comma Separated Values) is the most common format.

df = pd.read_csv("data.csv")
print(df.head())
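The same function also accepts file-like objects, which lets us show a fully self-contained sketch (the names and values below are made up):

```python
import io
import pandas as pd

# Inline CSV text standing in for a data.csv file on disk
csv_text = """name,age,city
Asha,25,Delhi
Ravi,31,Mumbai
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
print(df.shape)  # 2 rows, 3 columns
```

Replace the `StringIO` object with a filename like `"data.csv"` and the rest of the code stays the same.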

Importing Excel File

df = pd.read_excel("file.xlsx")

Note: reading .xlsx files requires the openpyxl package to be installed.

Importing JSON File

df = pd.read_json("data.json")
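Here is a self-contained sketch with an inline JSON string standing in for the file (the records are made up for illustration):

```python
import io
import pandas as pd

# Inline JSON text standing in for a data.json file
json_text = '[{"name": "Asha", "score": 88}, {"name": "Ravi", "score": 92}]'

# Wrapping the string in StringIO mimics reading from a file
df = pd.read_json(io.StringIO(json_text))
print(df)
```

A JSON array of objects maps naturally onto a DataFrame: each object becomes a row, each key a column.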

Importing from a URL

If your dataset is online:

df = pd.read_csv("https://example.com/data.csv")

Importing SQL Database Data

import pandas as pd
import sqlite3

conn = sqlite3.connect("mydatabase.db")
df = pd.read_sql_query("SELECT * FROM employees", conn)
conn.close()  # always close the connection when you are done

Works similarly for MySQL & PostgreSQL using proper connectors.
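To try this without an existing database file, here is a self-contained sketch that builds a tiny in-memory SQLite table (with made-up employee data) and reads it back:

```python
import sqlite3
import pandas as pd

# In-memory database, so the sketch runs without a .db file on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
conn.commit()

# Any SQL query result can be loaded straight into a DataFrame
df = pd.read_sql_query("SELECT * FROM employees", conn)
conn.close()
print(df)
```

Swapping `sqlite3.connect(":memory:")` for a MySQL or PostgreSQL connection object is the only change needed for those databases.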

Importing from APIs (Very Important for Real Projects)

Example: JSON API

import requests
import pandas as pd

response = requests.get("https://api.example.com/data")
response.raise_for_status()  # stop early if the request failed
data = response.json()
df = pd.DataFrame(data)

Preparing Data After Import

Once data is imported, the next steps include:

  • Cleaning

  • Handling missing values

  • Removing duplicates

  • Formatting data types

  • Feature engineering

  • Exploratory analysis

These steps help prepare data for:

  • Machine learning

  • Dashboards

  • Business reporting
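A minimal sketch of the first few cleaning steps above, using a tiny made-up DataFrame that contains a duplicate row and a missing value:

```python
import numpy as np
import pandas as pd

# A small DataFrame with the usual problems: a duplicate row and a missing age
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meera"],
    "age": [25, 31, 31, np.nan],
})

df = df.drop_duplicates()          # remove repeated rows
df = df.dropna(subset=["age"])     # drop rows with a missing age
df["age"] = df["age"].astype(int)  # format the data type
print(df)
```

After these steps, the two clean rows are ready for analysis or modeling.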

Example Data Collection Scenario

A company wants to analyze customer behavior.

Data is collected from:

  • Website logs → browsing history

  • Mobile app → user actions

  • Database → customer details

  • Survey → customer feedback

  • API → location information

Then, data is imported into Python using Pandas and cleaned before analysis.

This mirrors how most real-world Data Science projects begin.

Why Data Collection Matters for High-Level Work

Good data enables:

  • Accurate models

  • Reliable predictions

  • Business insights

  • Detecting patterns

  • Data-driven decision-making