Data Collection
For analyzing trending data, there is considerable flexibility in the tools and methods available. Pytrends, an unofficial Python API for Google Trends, offers insights into trending searches within specific regions, as well as the ability to track the popularity of particular keywords over time. Similarly, the YouTube Data API, which will be explored in detail below, provides endpoints to access trending videos and topics worldwide.
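For example, a minimal Pytrends sketch for pulling interest-over-time data and regional trending searches might look like the following; the keyword, timeframe, and region are purely illustrative:

from pytrends.request import TrendReq

# Connect to Google Trends via the unofficial Pytrends client
pytrends = TrendReq(hl="en-US", tz=360)

# Track the popularity of an example keyword over the last three months
pytrends.build_payload(["air fryer"], timeframe="today 3-m", geo="US")
interest = pytrends.interest_over_time()
print(interest.tail())

# Current trending searches for a specific region
trending = pytrends.trending_searches(pn="united_states")
print(trending.head())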
For Amazon, trending products were scraped using the Scrapy framework, which is a powerful and efficient tool for web scraping. Additional information and resources about using Scrapy for such tasks are provided in the respective section below.
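As a rough illustration, a Scrapy spider for a product listing page could be structured as shown below; the start URL and CSS selectors are placeholders and would need to match the live page's actual markup:

import scrapy

class TrendingProductsSpider(scrapy.Spider):
    name = "trending_products"
    # Placeholder URL -- replace with the listing page to crawl
    start_urls = ["https://example.com/trending-products"]

    def parse(self, response):
        # Placeholder selectors -- adjust to the page's actual structure
        for product in response.css("div.product"):
            yield {
                "title": product.css("span.title::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Follow pagination if a "next" link exists
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

A spider like this can be run with scrapy runspider spider.py -o products.json to save the scraped items to a file.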
External data can be challenging to obtain, especially with limited financial resources. For personal social media data, platforms like Instagram, Facebook, and YouTube offer APIs that allow access to this data and provide a way to gather valuable insights for analysis. Below are the steps for obtaining data or accessing the respective APIs for Instagram, Facebook, and YouTube:
Instagram Graph API:
- Step 1: Create a developer account on Meta for Developers (the Instagram Graph API is accessed through Meta's developer platform).
- Step 2: Register a new application to obtain an access token.
- Step 3: Generate the access token and set the necessary permissions.
- Step 4: Use the API to fetch data such as user information, posts, and comments.
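As a sketch of Step 4, profile-level data can be fetched with plain HTTP requests once a token is available; the Graph API version, account ID, token, and field list below are placeholders:

import requests

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder access token
IG_USER_ID = "YOUR_IG_USER_ID"      # placeholder Instagram professional account ID

# Fetch basic profile information for the connected Instagram account
url = f"https://graph.facebook.com/v19.0/{IG_USER_ID}"
params = {
    "fields": "username,followers_count,media_count",
    "access_token": ACCESS_TOKEN,
}
response = requests.get(url, params=params)
print(response.json())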
Facebook Graph API:
- Step 1: Create a Facebook Developer account.
- Step 2: Register a new application and set up the necessary configurations.
- Step 3: Obtain an access token for the desired user or page.
- Step 4: Use the API to access user data, pages, groups, posts, and more.
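Similarly, here is a minimal sketch for pulling recent posts from a page, assuming a valid page access token (the API version, page ID, and token are placeholders):

import requests

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder page access token
PAGE_ID = "YOUR_PAGE_ID"            # placeholder page ID

# Fetch recent posts published by the page
url = f"https://graph.facebook.com/v19.0/{PAGE_ID}/posts"
params = {
    "fields": "id,message,created_time",
    "access_token": ACCESS_TOKEN,
}
response = requests.get(url, params=params)
for post in response.json().get("data", []):
    print(post.get("created_time"), post.get("message"))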
YouTube Data API:
- Step 1: Create a Google Cloud project and enable the YouTube Data API.
- Step 2: Set up OAuth 2.0 and create credentials to obtain an API key or access token.
- Step 3: Use the API to retrieve data such as video details, comments, channel information, and more.
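For instance, the trending ("most popular") videos for a region can be retrieved with the google-api-python-client library; the API key and region code below are placeholders:

from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"  # placeholder API key

# Build a client for the YouTube Data API v3
youtube = build("youtube", "v3", developerKey=API_KEY)

# Fetch the current most-popular (trending) videos for a region
request = youtube.videos().list(
    part="snippet,statistics",
    chart="mostPopular",
    regionCode="US",
    maxResults=10,
)
response = request.execute()

for item in response["items"]:
    title = item["snippet"]["title"]
    views = item["statistics"].get("viewCount")
    print(f"{title}: {views} views")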
For external data, we utilized a survey published by Whatsgoodly, a millennial-focused social polling company (accessible on Kaggle). This survey gathered responses from 9,491 U.S. Millennials, asking which social media platform they care about the most. The data was then analyzed to identify the distribution of key segments, such as gender and education level, across four major platforms: LinkedIn, Facebook, Instagram, and Snapchat. The analysis was performed using Python for detailed insights.
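As a rough sketch of the kind of aggregation involved, the platform distribution within each demographic segment can be computed with pandas; the file name and column names here are assumptions, not the dataset's actual schema:

import pandas as pd

# File name and column names are illustrative placeholders
df = pd.read_csv("whatsgoodly_survey.csv")

# Share of respondents preferring each platform, per demographic segment
distribution = (
    df.groupby("segment")["platform"]
      .value_counts(normalize=True)
      .rename("share")
)
print(distribution)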
Moving on to the Yelp dashboard, we used the Yelp Open Dataset (accessible at Yelp Dataset). This dataset is a subset of Yelp’s business, review, and user data, specifically provided for academic research purposes. Available in JSON format, it includes information on 6,990,280 reviews from 150,346 businesses. The dataset is structured into several JSON files:
- business.json: Contains business information, including location, attributes, and categories.
- review.json: Includes full review texts, user IDs, and business IDs.
- user.json: Provides user metadata, including friend mappings.
- checkin.json: Contains check-in data for businesses.
- tip.json: Features short suggestions written by users for businesses.
- photo.json: Includes photo data with captions and classifications (e.g., “food,” “menu,” “inside”).
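Since each of these files is newline-delimited JSON (one record per line), a minimal loading sketch with pandas might look like the following, reading in chunks to keep memory usage manageable; the column list assumes the standard business.json schema:

import pandas as pd

# Read the business file in chunks, since it is too large to load comfortably at once
chunks = pd.read_json("business.json", lines=True, chunksize=100_000)

frames = []
for chunk in chunks:
    # Keep only the columns needed for the dashboard to save memory
    frames.append(chunk[["business_id", "name", "city", "stars", "review_count", "categories"]])

business = pd.concat(frames, ignore_index=True)
print(business.shape)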
These resources, combined with robust data analysis techniques, enabled us to derive meaningful insights into consumer behavior and business dynamics.
Python Libraries for Data Collection
Python provides a rich ecosystem of libraries that streamline the data collection process, particularly for social media, web scraping, and API interaction. Here are some essential libraries used for various aspects of data collection:
- BeautifulSoup: A popular library for parsing HTML and XML documents. It simplifies extracting structured data from web pages.
- Scrapy: An advanced web scraping framework designed for large-scale data extraction. It is particularly useful for crawling multiple pages and managing complex scraping workflows.
- Selenium: A library that automates browser actions, enabling scraping of dynamic websites that rely heavily on JavaScript.
- Requests: A user-friendly library for making HTTP requests to APIs and retrieving JSON responses.
Challenges Faced During Data Collection
1. Access Restrictions for Social Media APIs
- Facebook Graph API: Accessing Facebook and Instagram data through the Graph API requires verified business documents. This poses a significant hurdle for individual users or projects without formal business registration.
- TikTok API: Some platforms, like TikTok, require a minimum follower count to access advanced analytics, further limiting access for smaller-scale users or teams.
- Workaround: A common solution is to use prebuilt datasets, such as those available on Kaggle or other public repositories. These datasets provide a starting point for analysis when direct access to data is restricted.
2. Scraping Challenges
- HTTP 429 (Too Many Requests) Error: This error is frequently encountered when scraping websites. Without proper error handling, such errors can cause the application to hang or crash.
- Solutions:
Implementing proper error-catching mechanisms prevents crashes during scraping.
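For instance, a simple retry helper with exponential backoff can catch HTTP 429 responses instead of letting the job crash; the URL below is a placeholder:

import time
import requests

def fetch_with_retries(url, max_retries=5):
    # Retry on HTTP 429 with exponential backoff instead of crashing
    delay = 1
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code == 429:
            # Honor the server's Retry-After header when present
            wait = int(response.headers.get("Retry-After", delay))
            time.sleep(wait)
            delay *= 2
            continue
        response.raise_for_status()
        return response
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")

page = fetch_with_retries("https://example.com")
print(page.status_code)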
Using appropriate user-agent headers to mimic legitimate traffic. Here’s an example:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
}

url = "https://example.com"
response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.title.text)
else:
    print(f"Error: {response.status_code}")
Scheduling scraping tasks at regular intervals using job schedulers like schedule or cron jobs to avoid overwhelming the server. Here's an example using Python's schedule library:
import schedule
import time

def scrape_task():
    print("Scraping started...")
    # Add your scraping code here
    print("Scraping completed!")

schedule.every(1).hour.do(scrape_task)

while True:
    schedule.run_pending()
    time.sleep(1)
3. Processing Large Datasets
- Issue: When working with datasets containing millions of rows, such as the Yelp Business dataset, the application often experiences performance bottlenecks, with processing times extending to 15 minutes or more.
- Solutions:
Using Dask for Large Dataframes: Dask allows efficient handling of larger-than-memory datasets by parallelizing operations. Here’s an example:
import dask.dataframe as dd
# Read a large CSV file
df = dd.read_csv("large_dataset.csv")
# Perform computations
grouped = df.groupby("category")["rating"].mean().compute()
print(grouped)
Multiprocessing: Leveraging multiple CPU cores can speed up data processing. Here's an example using Python's multiprocessing module:
import multiprocessing

def process_data(chunk):
    # Simulate data processing
    return sum(chunk)

if __name__ == "__main__":
    data = [i for i in range(1_000_000)]
    chunk_size = len(data) // multiprocessing.cpu_count()
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with multiprocessing.Pool() as pool:
        results = pool.map(process_data, chunks)

    total = sum(results)
    print(f"Total sum: {total}")