Data Collection & Importing Datasets
Data Collection is the first step in the Data Science workflow.
Before analyzing or building machine learning models, you need data.
This article explains Data Collection and shows how to import datasets in Python using Pandas, in simple, teacher-style language.
What Is Data Collection?
Data Collection is the process of gathering information from various sources so you can analyze it.
In Data Science, data can come from:
Websites
Databases
Sensors
Mobile apps
Surveys
Social media
APIs
Cloud storage
CSV, Excel, JSON files
Without data, no analysis or prediction is possible.
Why Is Data Collection Important in Data Science?
Data Collection allows you to:
Understand business problems
Find patterns
Build predictive models
Improve decision-making
Create reports & dashboards
The quality of your data decides the quality of your model.
This is why companies invest heavily in data engineering, data pipelines & ETL.
Types of Data
Data can be collected in three main forms:
Structured Data (Tables, rows & columns)
Examples:
Excel files
SQL databases
CSV files
Used in: Analytics, BI, ML models.
Unstructured Data
Examples:
Text (emails, reviews)
Images
Videos
Audio
Used in: NLP, Computer Vision, Deep Learning.
Semi-Structured Data
Examples:
JSON
XML
Web server logs
Useful for APIs, web apps, cloud data.
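To make "semi-structured" concrete, here is a minimal sketch (the record layout is invented for illustration) showing how one JSON record mixes flat fields with nested and list-valued parts:

```python
import json

# A hypothetical API record: some fields are flat, others nested or list-valued
raw = '{"user": "asha", "age": 29, "tags": ["ml", "python"], "address": {"city": "Pune"}}'

record = json.loads(raw)          # parse JSON text into Python objects
print(record["user"])             # flat field -> asha
print(record["address"]["city"])  # nested field -> Pune
```

Unlike a table, two such records need not have the same fields, which is why this data is called semi-structured.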
Sources of Data Collection
There are many ways to collect data. Let’s learn them one by one.
Manual Collection
Collecting data by hand:
Surveys
Google Forms
Interviews
Suitable for small datasets.
Automated Collection
Using tools to collect data at scale:
Web scraping
Sensors / IoT
API calls
Automated logs
Data from mobile apps
Used in big data analytics and machine learning systems.
Web Scraping
Extracting data from websites.
Tools:
BeautifulSoup
Scrapy
Selenium
Example: Collecting product prices from Amazon (for analysis).
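As a rough illustration of the idea (the HTML snippet and the "price" class name are made up; real projects normally use BeautifulSoup, which gives a much friendlier API than the built-in parser used here):

```python
from html.parser import HTMLParser

# Hypothetical product-page markup; in practice you would fetch this with requests
HTML = '<div><span class="price">$19.99</span><span class="price">$5.49</span></div>'

class PriceParser(HTMLParser):
    """Collect the text inside <span class="price"> tags."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data)
            self.in_price = False

parser = PriceParser()
parser.feed(HTML)
print(parser.prices)  # -> ['$19.99', '$5.49']
```

Always check a site's terms of service and robots.txt before scraping it.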
APIs (Application Programming Interfaces)
APIs provide data in real time.
Example API sources:
Twitter API
Google Maps API
Weather API
Stock Market API
Format: Mostly JSON
Databases
Companies store large volumes of data in:
MySQL
PostgreSQL
MongoDB
SQL Server
Oracle
Data Scientists import this data into Python for processing.
Cloud Data Platforms
Modern companies store data on:
AWS S3
Google BigQuery
Azure Blob Storage
Snowflake
Databricks
These are used in data engineering, ETL, and big data pipelines.
Public Datasets
Beginner-friendly sources:
Kaggle
UCI Machine Learning Repository
Google Dataset Search
GitHub
Government portals (Open Data India)
Importing Datasets in Python
We use Pandas because it is the most widely used data analysis library in Python.
import pandas as pd
Now let’s learn how to import different dataset types.
Importing CSV File
CSV (Comma Separated Values) is the most common format.
df = pd.read_csv("data.csv")
print(df.head())
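If your file uses a different delimiter or encoding, read_csv accepts optional parameters. A self-contained sketch (using an in-memory file in place of data.csv, with invented sample rows):

```python
import io
import pandas as pd

# Simulate a semicolon-separated file in memory
csv_text = "name;age\nAsha;29\nRavi;34\n"

# sep handles non-comma delimiters; encoding="utf-8" is another common option
df = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df.shape)  # -> (2, 2)
```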
Importing Excel File
df = pd.read_excel("file.xlsx")
Importing JSON File
df = pd.read_json("data.json")
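Here is the same call made self-contained with an in-memory JSON array (the records are invented); a list of objects becomes one row per record:

```python
import io
import pandas as pd

# A small JSON array of records, as an API or file might supply it
json_text = '[{"name": "Asha", "age": 29}, {"name": "Ravi", "age": 34}]'

df = pd.read_json(io.StringIO(json_text))  # each object becomes a row
print(df.shape)  # -> (2, 2)
```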
Importing from a URL
If your dataset is online:
df = pd.read_csv("https://example.com/data.csv")
Importing SQL Database Data
import pandas as pd
import sqlite3
conn = sqlite3.connect("mydatabase.db")
df = pd.read_sql_query("SELECT * FROM employees", conn)
Works similarly for MySQL and PostgreSQL using the appropriate connector (for example, mysql-connector-python or psycopg2).
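The snippet above assumes mydatabase.db already exists. For a runnable end-to-end sketch, you can build a throwaway in-memory database first (the table name and rows here are invented):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # temporary database living in RAM
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Asha", 50000.0), ("Ravi", 62000.0)])

df = pd.read_sql_query("SELECT * FROM employees", conn)
conn.close()  # always release the connection when done
print(len(df))  # -> 2
```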
Importing from APIs (Very Important for Real Projects)
Example: JSON API
import requests
import pandas as pd
response = requests.get("https://api.example.com/data")
data = response.json()
df = pd.DataFrame(data)
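Because the URL above is only a placeholder, here is the same pattern with the network call replaced by a payload shaped like a typical response.json() result (the field names are invented for a hypothetical weather API):

```python
import pandas as pd

# What response.json() might return for a hypothetical weather API
payload = [
    {"city": "Delhi", "temp_c": 31},
    {"city": "Mumbai", "temp_c": 29},
]

df = pd.DataFrame(payload)  # list of dicts -> one row per record
print(df.columns.tolist())  # -> ['city', 'temp_c']
```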
Preparing Data After Import
Once data is imported, the next steps include:
Cleaning
Handling missing values
Removing duplicates
Formatting data types
Feature engineering
Exploratory analysis
These steps help prepare data for:
Machine learning
Dashboards
Business reporting
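The cleaning steps above can be sketched in a few lines of Pandas (the tiny dataset here is invented, with one missing value and one duplicate row):

```python
import pandas as pd

# A tiny invented dataset with a missing value and a duplicate row
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", None],
    "age":  ["29", "34", "34", "41"],
})

df = df.drop_duplicates()          # remove exact duplicate rows
df = df.dropna(subset=["name"])    # drop rows with a missing name
df["age"] = df["age"].astype(int)  # fix the data type (string -> integer)
print(len(df))  # -> 2
```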
Example Data Collection Scenario
A company wants to analyze customer behavior.
Data is collected from:
Website logs → browsing history
Mobile app → user actions
Database → customer details
Survey → customer feedback
API → location information
Then, data is imported into Python using Pandas and cleaned before analysis.
This is exactly how real-world Data Science projects work.
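To illustrate that last step, here is a small sketch of combining two of those sources on a shared customer id (all column names and values are invented):

```python
import pandas as pd

# Invented samples standing in for the database and survey sources
customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
feedback = pd.DataFrame({"cust_id": [1, 2], "rating": [5, 4]})

# Combine the sources into one table keyed on cust_id
combined = customers.merge(feedback, on="cust_id")
print(combined.shape)  # -> (2, 3)
```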
Why Data Collection Matters for High-Level Work
Good data enables:
Accurate models
Reliable predictions
Business insights
Detecting patterns
Data-driven decision-making