Data collection is a crucial stage in data science; it is the foundation for building any data science project. Three types of data are available: structured (tabular datasets), unstructured (text, images, and audio), and semi-structured (a mix of both). There are various ways in which you can collect data.

Publicly available Sources
There are many publicly available sources from which you can download datasets for free. Some of the most popular are listed below.
Google Dataset Search
This is Google's search engine for finding datasets of any type. Link to the website.

Kaggle
Kaggle is a data science platform from which you can download datasets. There are more than 450k datasets available on the platform that you can use in your projects. Link to the website.

Country-Specific Datasets
You can also get country-specific datasets from official government websites. A few of them are listed below.
- USA government data (https://data.gov/)
- UK government data (https://www.data.gov.uk/)
- Indian government data (https://www.data.gov.in/)
Hugging Face
You can also download various types of data, including images, text, and tabular data, from the Hugging Face website. There are 350k+ datasets available on this website.
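If you prefer to pull a dataset programmatically instead of downloading it from the website, the datasets library provides a one-line loader. Below is a minimal sketch; it assumes the library is installed (pip install datasets) and uses the public "imdb" dataset as an example.
from datasets import load_dataset
# Load the IMDB movie-review dataset from the Hugging Face Hub
dataset = load_dataset("imdb")
# Inspect the available splits and one sample record
print(dataset)
print(dataset["train"][0])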

World Health Organization
You can get datasets from the World Health Organization. Link to the website.

World Bank Dataset
You can also use World Bank datasets. The World Bank provides various types of global development data, encompassing indicators on agriculture, climate change, the economy, education, energy, environment, external debt, financial sectors, gender, health, infrastructure, poverty, the private sector, the public sector, science and technology, social development, social protection, trade, and urban development. Link to the website.

Statcounter
This website provides real-time web analytics data, including website traffic statistics, visitor behavior, device types, browser usage, operating systems, screen resolutions, and geographic locations. It helps businesses understand user engagement, track marketing effectiveness, and monitor trends in global internet usage. Link to the website.

Our World in Data
This website provides data related to numerous global issues, including health, education, environment, energy, technology, and demographics. Link to the website.

Food and Agriculture Organization
This website provides information related to food and agriculture. Link to the website.

DermNet
You can get datasets related to skin conditions, covering more than 600 skin diseases from patients across 10+ countries. Link to the website. However, this is paid; you can get a free dataset here.

AlphaFold
On this website you get data related to drug discovery and disease understanding. It has over 214 million predicted 3D protein structures, generated by DeepMind's AlphaFold AI system. Link to the website.

GeoSpatial Datasets
There are many websites from which you can download geospatial data. Some of them are listed below.
- Free GIS Data ( Link to the website)
- Google Earth Engine (Link to the website)
- NASA Earth Data (Link to the website)
- Earth Explorer (Link to the website)
AI4Bharat
This website contains datasets related to Indian languages. Link to the website.

International Energy Agency
This website contains datasets related to energy. Link to the website.

API
The second way to collect data is through an API. An API can be either paid or free (often based on the number of requests per day). These are some of the most useful APIs you can use to download data.
Yahoo Finance
With the help of this API, you can download historical stock market data for free for any company in the world. The Python code below shows how to download the dataset.
The first step is to install the Yahoo Finance library.
!pip install -q yfinance
The second step is to import the required library and download the historical dataset for a 5-year period. Don't forget to specify the ticker symbol.
import yfinance as yf
ticker = 'ADANIPOWER.NS' # .NS is for NSE India
adani_power_data = yf.download(ticker, period="5y")
Now, with the help of the head function, you can see the top 5 rows of the dataset.
adani_power_data.head()
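If you want to keep the data or take a quick look at the trend, you can save it to a CSV file and plot the closing price. This is a minimal sketch; 'Close' is the column yfinance normally returns, but recent versions may return multi-level columns, so check adani_power_data.columns if the selection fails.
import matplotlib.pyplot as plt
# Save the downloaded prices for later use
adani_power_data.to_csv("adani_power_5y.csv")
# Quick visual check of the closing price over the 5-year period
adani_power_data["Close"].plot(title=f"{ticker} closing price (5y)")
plt.show()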

OpenWeatherMap
You can download weather-related information with the help of this API. It provides various measurements such as ground-level pressure, sea-level pressure, humidity, temperature, and many more.
The first step is to get an API key. Go to https://home.openweathermap.org/api_keys to get an OpenWeatherMap API key.
The second step is to import all the required libraries.
import requests
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
- The requests library is used to call the URL.
- The pandas library helps us store the information in a dataframe.
- The geopy library is used to look up the coordinates of addresses, cities, countries, and landmarks across the globe.
The third step is to define a list of cities and countries and put it in a dataframe.
cities = [
    ["Kyiv", "Ukraine"],
    ["Warsaw", "Poland"],
    ["Berlin", "Germany"],
    ["London", "UK"]
]
df = pd.DataFrame(cities, columns=["city", "country"])
The fourth step is to get the coordinates (latitude, longitude) for each city in the dataframe. Before that, you have to initialize the locator and the geocoder.
locator = Nominatim(user_agent="myGeocoder")
geocode = RateLimiter(locator.geocode, min_delay_seconds=.1)
def get_coordinates(city, country):
    response = geocode(query={"city": city, "country": country})
    return {
        "latitude": response.latitude,
        "longitude": response.longitude
    }
df_coordinates = df.apply(lambda x: get_coordinates(x.city, x.country), axis=1)
df = pd.concat([df, pd.json_normalize(df_coordinates)], axis=1)
df

The fifth step is to pass the OpenWeatherMap API key securely with the help of the getpass module.
from getpass import getpass
openweathermap_api_key = getpass('Enter Openweathermap API key: ')
The sixth step is to create a get_weather function that fetches the forecast metrics from the URL and parses the response as JSON. If there is any error, it prints a message on the screen.
import datetime
def get_weather(row):
    url = f"https://api.openweathermap.org/data/2.5/forecast?lat={row.latitude}&lon={row.longitude}&appid={openweathermap_api_key}&units=metric"
    res = requests.get(url)
    data = res.json()
    if res.status_code != 200:
        print(f"⚠️ Error for {row.city}: {data.get('message')}")
        return None
    return data["list"][0]
Now apply the above function to the dataframe and normalize the JSON column using the json_normalize method.
df_weather = df.apply(lambda x: get_weather(x), axis=1)
df = pd.concat([df, pd.json_normalize(df_weather)], axis=1)
df
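After normalization the dataframe contains many nested columns, and you will usually keep only a handful. The sketch below assumes the usual OpenWeatherMap forecast field names (main.temp, main.humidity, wind.speed); check df.columns first, since the selection only keeps columns that actually exist.
# Keep only a few readable columns (names depend on the API response)
cols = ["city", "country", "main.temp", "main.humidity", "wind.speed"]
df[[c for c in cols if c in df.columns]]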

Binance API
You can download cryptocurrency data with the help of the Binance API, such as historical prices or the real-time price of a coin like Bitcoin. The example below illustrates the polling pattern using Coinbase's public spot-price endpoint, which needs no API key; a Binance-specific variant is sketched after the example.
The first step is to import all the required libraries.
import requests
import json
import time
import threading
- requests is used to fetch the data from the URL.
- json is used to work with the JSON response.
- time is used to pause between requests.
- threading is used to run the input listener concurrently with the main loop.
The second step is to define a flag variable that will control the loop. After that, a function is created to exit the loop when the user presses Enter.
# Flag to control the loop
stop_loop = False
# Function to wait for Enter key
def wait_for_enter():
    global stop_loop
    input("Press Enter to quit...\n")
    stop_loop = True
The third step is to start the input listener in a separate thread.
# Start the input listener in a separate thread
thread = threading.Thread(target=wait_for_enter)
thread.start()
The fourth step is to run the main loop, which fetches the data from the URL, parses the JSON response, and displays the currency name and the price.
# Main loop
while not stop_loop:
    response = requests.get("https://api.coinbase.com/v2/prices/BTC-USD/spot")
    data = response.json()
    currency = data["data"]["base"]
    price = data["data"]["amount"]
    print(f"Currency : {currency} Price: {price}")
    time.sleep(5)
print("Exited successfully.")

Kaggle API
With the help of this API, you can download datasets from the Kaggle platform. The first step is to get an API key from Kaggle.
Go to https://www.kaggle.com/settings and create an API key; this downloads a JSON file. Next, upload that JSON file to Google Colab.
The second step is to install the Kaggle library.
!pip install kaggle
The third step is to import all the required libraries.
import os
import getpass
from kaggle.api.kaggle_api_extended import KaggleApi
import zipfile
- os is used to set the Kaggle credentials as environment variables.
- getpass is used to enter the API key securely.
- KaggleApi is used to download datasets from Kaggle.
- zipfile is used to work with ZIP archives, for example to extract their contents.
The fourth step is to enter the Kaggle username and the API key from the JSON file and set them as the Kaggle credentials. After that, authenticate using the Kaggle API.
# 🔐 Get secure user input
username = getpass.getpass("Enter your Kaggle username: ")
key = getpass.getpass("Enter your Kaggle API key: ")
# 🌐 Set Kaggle credentials
os.environ['KAGGLE_USERNAME'] = username
os.environ['KAGGLE_KEY'] = key
# 🔗 Authenticate
api = KaggleApi()
api.authenticate()
The fifth step is to specify the competition name and download all of its files. After that, extract the data using the zipfile library.
competition_slug = 'titanic'
# 📦 Download files
api.competition_download_files(competition_slug, path='titanic_data')
# 📂 Extract ZIP
with zipfile.ZipFile('titanic_data/titanic.zip', 'r') as zip_ref:
zip_ref.extractall('titanic_data')
print("✅ Titanic data downloaded and extracted successfully!")

News API
You can create an API key at https://newsapi.org/. First of all, you have to import two libraries: requests and getpass.
import requests
import getpass
Now you have to securely pass the News API Key.
# Securely input API key
api_key = getpass.getpass("Enter your News API key: ")
Next, you have to specify the URL and the parameters. The parameters contain three items: the query, the country code, and the API key.
# Endpoint and parameters
url = "https://newsapi.org/v2/top-headlines"
params = {
    'q': 'AI',
    'country': 'us',
    'apiKey': api_key
}
Now make the request to the above URL with the parameters given above, parse the response as JSON, and then print the top 5 headlines for the specified query.
# Make request
response = requests.get(url, params=params)
data = response.json()
# Print top 5 headlines
for i, article in enumerate(data['articles'][:5], start=1):
    print(f"{i}. {article['title']}\n {article['url']}\n")

World Bank API
This API provides data related to global development, such as economic indicators, population statistics, education, health, environment, and more.
wbgapi is a Python library used to fetch data from the World Bank API.
pip install wbgapi
After installing this library, the next step is to import the required libraries. Then, with the help of the source.info function, you can retrieve metadata about the available data sources, such as source IDs, descriptions, and last-updated dates.
import wbgapi as wb
import pandas as pd
wb.source.info()
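If you don't know the code of the indicator you need, you can search the catalogue by keyword. A minimal sketch using wbgapi's series lookup; note the indicator code from the output and use it in the next step.
# Search the indicator catalogue for GDP-related series
wb.series.info(q='gdp')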

After that, you can specify an indicator such as GDP. Next, specify the country code(s) and the time range for which you want to fetch the data. Once retrieved, reset the index and display the data.
# Define the indicator for GDP (constant 2015 US$)
indicator = 'NY.GDP.MKTP.KD'
# Specify the country code (India: 'IND') and time range
data = wb.data.DataFrame(indicator, economy='IND', time=range(2000, 2022))
# Reset index for better readability
data.reset_index(inplace=True)
data

You can also specify multiple indicators (GDP, population) and multiple countries to retrieve data from the World Bank API.
# Define the indicators
indicators = {
    'NY.GDP.MKTP.CD': 'GDP (current US$)',
    'SP.POP.TOTL': 'Population'
}
# Specify the country codes and time range
countries = ['USA', 'CHN', 'IND', 'BRA']
data = wb.data.DataFrame(indicators, economy=countries, time=range(2010, 2021))
# Reset index for better readability
data.reset_index(inplace=True)
# Display the data
data

NASA API
With the help of the NASA API, you can access space and Earth science data. It includes satellite imagery, astronomy pictures, Mars rover photos, asteroid information, climate data, and more.
First of all, get a NASA API key from https://api.nasa.gov/.
Second, import all the required libraries needed to retrieve data from NASA.
import requests
import json
import getpass
- getpass is used to enter the API key securely.
- requests is used to fetch the data from the URL.
- json is used to work with the JSON data.
Next, use the getpass module to securely get the API key from the user.
api_key = getpass.getpass("Enter your NASA API key: ")
After that, specify the NASA APOD (Astronomy Picture of the Day) URL and make a request to it to retrieve the data.
# API URL with key
url = f"https://api.nasa.gov/planetary/apod?api_key={api_key}"
# Make the API request
response = requests.get(url)
If the request is successful, parse the response as JSON and display information such as the title, date, explanation, and image URL.
# Check if the request was successful
if response.status_code == 200:
    data = response.json()
    print("\nTitle:", data['title'])
    print("Date:", data['date'])
    print("Explanation:", data['explanation'])
    print("Image URL:", data['url'])
else:
    print("Failed to retrieve data. Status code:", response.status_code)

Web Scraping
Another important way to collect data is through web scraping. FireCrawl is one of the important Python libraries you can use to perform web scraping in just 7 lines of code. You can check our blog written on FireCrawl.
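As a generic illustration of the idea (using requests and BeautifulSoup rather than FireCrawl), the sketch below fetches a page and prints its headings; the URL is just a placeholder.
import requests
from bs4 import BeautifulSoup
# Placeholder URL: replace with the page you actually want to scrape
url = "https://example.com"
html = requests.get(url, timeout=10).text
# Parse the HTML and extract the top-level headings
soup = BeautifulSoup(html, "html.parser")
for heading in soup.find_all(["h1", "h2"]):
    print(heading.get_text(strip=True))
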
Conclusion
In conclusion, data collection is an essential step in data science that lays the foundation for building any data science project.