Social Media Analytics
Social media analytics is a vital process for understanding and leveraging the power of online platforms. By analyzing user behavior, engagement patterns, and trends, businesses and individuals can make informed decisions to improve strategies, enhance brand visibility, and connect with their target audience effectively. The creation of a social media analytics framework involves a series of structured steps designed to ensure the collection, preparation, and meaningful interpretation of data. Below, we outline these steps to help you build a comprehensive and efficient analytics process.
Table of Contents
- Defining Objectives
- Data Collection
- Data Cleaning and Preprocessing
- Data Storage
- Data Analysis & Visualization
Defining Objectives
First and foremost, it is crucial to define the business objectives and goals you aim to achieve through the analysis. Clearly outlining these objectives will shape the entire trajectory of the analysis. For our platform, we established several key questions that our analysis aims to answer. You can adopt these goals as they are or customize them to suit your specific needs:
- Which social media platform is preferred by Millennials?
- What are the key engagement metrics for each platform?
- How do user demographics influence platform preference and engagement?
- What sentiment do users express about each platform?
- What is the geographic distribution of businesses, and how can clustering identify business hotspots?
- What sentiments do customers express in their reviews, and can sentiment analysis predict future ratings?
- What are the common words, bigrams, and trigrams in reviews, and how do they relate to ratings?
- Which business hours correlate with the highest ratings, and can we predict optimal hours for new businesses?
- What attributes contribute to higher business ratings, and can we create a predictive model for rating improvement?
- How do different categories of businesses compare, and what can clustering reveal about category similarities and differences?
- What are the behaviors and patterns of elite Yelp users, and how do they differ from regular users?
Data Collection
For analyzing trending data, there is considerable flexibility in the tools and methods available. The Pytrends library, an unofficial Python API for Google Trends, offers insights into trending searches within specific regions, as well as the ability to track the popularity of particular keywords over time. Similarly, the YouTube Data API, which is explored in detail below, provides endpoints to access trending videos and topics worldwide.
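As a quick illustration, here is a minimal sketch using pytrends (the keyword and region are placeholders you would swap for your own):
# pip install pytrends
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)  # connect to Google Trends (unofficial)

# Popularity of a placeholder keyword over the last 12 months
pytrends.build_payload(["instagram"], timeframe="today 12-m", geo="US")
interest = pytrends.interest_over_time()
print(interest.head())

# Today's trending searches for a specific region
trending = pytrends.trending_searches(pn="united_states")
print(trending.head())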
For Amazon, trending products were manually scraped using the Scrapy framework, which is a powerful and efficient tool for web scraping. Additional information and resources about using Scrapy for such tasks are provided in the respective section below.
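As a rough sketch of what such a spider looks like (the start URL and CSS selectors below are hypothetical placeholders, not Amazon's actual markup):
import scrapy

class TrendingProductsSpider(scrapy.Spider):
    name = "trending_products"
    start_urls = ["https://www.example.com/trending"]  # placeholder URL

    def parse(self, response):
        # Placeholder selectors; adjust to the target page's real structure
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
Running scrapy runspider spider.py -o products.json writes the scraped items straight to a JSON file.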
When it comes to data, external data can be challenging to obtain, especially with limited financial resources. For personal social media data, platforms like Instagram, Facebook, and YouTube offer APIs that allow access to this data. These APIs provide a way to gather valuable insights for analysis. Below are the steps for obtaining data or accessing the respective APIs for Instagram, Facebook, and YouTube:
Instagram Graph API:
- Step 1: Create an Instagram Developer account.
- Step 2: Register a new application to obtain an access token.
- Step 3: Generate the access token and set the necessary permissions.
- Step 4: Use the API to fetch data such as user information, posts, and comments.
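As a minimal sketch of Step 4 using the Requests library (the user ID, access token, API version, and fields are placeholders and depend on your app's permissions):
import requests

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder
IG_USER_ID = "YOUR_IG_USER_ID"      # placeholder

# Fetch recent media for an Instagram Business/Creator account
url = f"https://graph.facebook.com/v19.0/{IG_USER_ID}/media"
params = {
    "fields": "id,caption,media_type,timestamp,like_count,comments_count",
    "access_token": ACCESS_TOKEN,
}
posts = requests.get(url, params=params).json().get("data", [])
for post in posts:
    print(post.get("timestamp"), post.get("caption"))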
Facebook API:
- Step 1: Create a Facebook Developer account.
- Step 2: Register a new application and set up the necessary configurations.
- Step 3: Obtain an access token for the desired user or page.
- Step 4: Use the API to access user data, pages, groups, posts, and more.
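A similar sketch for Step 4, fetching posts from a Facebook Page with Requests (page ID, token, API version, and fields are placeholders):
import requests

ACCESS_TOKEN = "YOUR_PAGE_ACCESS_TOKEN"  # placeholder
PAGE_ID = "YOUR_PAGE_ID"                 # placeholder

url = f"https://graph.facebook.com/v19.0/{PAGE_ID}/posts"
params = {
    "fields": "id,message,created_time,shares",
    "access_token": ACCESS_TOKEN,
}
posts = requests.get(url, params=params).json().get("data", [])
for post in posts:
    print(post.get("created_time"), post.get("message"))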
YouTube Data API:
- Step 1: Create a Google Cloud project and enable the YouTube Data API.
- Step 2: Set up OAuth 2.0 and create credentials to obtain an API key or access token.
- Step 3: Use the API to retrieve data such as video details, comments, channel information, and more.
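For Step 3, a minimal sketch using the google-api-python-client package with an API key (the key, region code, and result count are placeholders):
# pip install google-api-python-client
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"  # placeholder

youtube = build("youtube", "v3", developerKey=API_KEY)

# Current most popular videos in a given region
request = youtube.videos().list(
    part="snippet,statistics",
    chart="mostPopular",
    regionCode="US",
    maxResults=10,
)
response = request.execute()
for item in response["items"]:
    print(item["snippet"]["title"], item["statistics"].get("viewCount"))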
For external data, we utilized a survey published by Whatsgoodly, a millennial-focused social polling company. This survey gathered responses from 9,491 U.S. Millennials, asking which social media platform they care about the most. The data was then analyzed to identify the distribution of key segments, such as gender and education level, across four major platforms: LinkedIn, Facebook, Instagram, and Snapchat. The analysis was performed using Python for detailed insights.
Moving on to the Yelp dashboard, we used the Yelp Open Dataset (accessible at Yelp Dataset). This dataset is a subset of Yelp’s business, review, and user data, specifically provided for academic research purposes. Available in JSON format, it includes information on 6,990,280 reviews from 150,346 businesses. The dataset is structured into several JSON files:
- business.json: Contains business information, including location, attributes, and categories.
- review.json: Includes full review texts, user IDs, and business IDs.
- user.json: Provides user metadata, including friend mappings.
- checkin.json: Contains check-in data for businesses.
- tip.json: Features short suggestions written by users for businesses.
- photo.json: Includes photo data with captions and classifications (e.g., “food,” “menu,” “inside”).
These resources, combined with robust data analysis techniques, enabled us to derive meaningful insights into consumer behavior and business dynamics.
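One way to load these files is with Pandas: each file stores one JSON object per line, so it can be read as newline-delimited JSON. A sketch (the path and selected columns are placeholders based on the dataset's documented fields):
import pandas as pd

# review.json is large, so stream it in chunks of newline-delimited JSON
chunks = pd.read_json("path/to/review.json", lines=True, chunksize=100_000)
reviews = pd.concat(chunk[["business_id", "stars", "text", "date"]] for chunk in chunks)
print(reviews.shape)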
Python Libraries for Data Collection
Python provides a rich ecosystem of libraries that streamline the data collection process, particularly for social media, web scraping, and API interaction. Here are some essential libraries used for various aspects of data collection:
- BeautifulSoup: A popular library for parsing HTML and XML documents. It simplifies extracting structured data from web pages.
- Scrapy: An advanced web scraping framework designed for large-scale data extraction. It is particularly useful for crawling multiple pages and managing complex scraping workflows.
- Selenium: A library that automates browser actions, enabling scraping of dynamic websites that rely heavily on JavaScript.
- Requests: A user-friendly library for making HTTP requests to APIs and retrieving JSON responses.
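As a brief illustration of how Requests and BeautifulSoup work together, here is a minimal sketch (the URL and tags are placeholders):
import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace with a page you are permitted to scrape
response = requests.get("https://www.example.com/articles")
soup = BeautifulSoup(response.text, "html.parser")

# Collect all headline texts (placeholder tag choice)
headlines = [h.get_text(strip=True) for h in soup.find_all("h2")]
print(headlines)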
Data Cleaning and Preprocessing
Python offers a wide array of libraries and built-in tools specifically designed to streamline the process of data cleaning and preprocessing. Among these, Pandas stands out as an indispensable library for handling and transforming data efficiently. With its robust functionality, Pandas simplifies tasks like removing duplicates, parsing strings, handling missing values, and more. Below are some common data cleaning operations using Pandas:
1. Removing Duplicates
Duplicate rows can inflate data or skew analysis. Use the following command to remove them:
df = df.drop_duplicates()
2. Parsing and Transforming Strings
String operations, such as splitting, concatenating, or formatting text columns, are effortless with Pandas:
df['Name'] = df['Name'].str.upper() # Convert all text to uppercase
df['Domain'] = df['Email'].str.split('@').str[1] # Extract domain from email addresses
3. Handling Missing Data
Managing missing values is crucial for maintaining data quality. Pandas provides multiple options for dealing with NaN values:
df = df.fillna('Unknown') # Replace missing values with a placeholder
df = df.dropna(subset=['ColumnName']) # Remove rows where a specific column has NaN
4. Filtering and Subsetting Data
Extracting relevant portions of the dataset is straightforward:
df = df[df['Age'] > 18] # Filter rows where age is greater than 18
Data Storage
Since our application is hosted on a remote server accessed via SSH, much of the data also resides on that server in a MongoDB database. MongoDB is an ideal choice due to its flexibility and scalability, particularly for handling unstructured or semi-structured data. Below are the steps for setting up a MongoDB database on an SSH server, creating a database and collections, and transferring files into the database.
1. Setting Up MongoDB on an SSH Server
To set up MongoDB on a remote SSH server:
sudo apt update
sudo apt install mongodb
sudo systemctl start mongodb
sudo systemctl enable mongodb
2. Creating a Database and Collections
After installing MongoDB, log in to the MongoDB shell:
mongo
use your_database_name
db.createCollection("your_collection_name")
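The same steps can also be performed from Python with the pymongo driver, as a sketch (connection string, names, and the sample document are placeholders):
# pip install pymongo
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["your_database_name"]
collection = db["your_collection_name"]

# Collections are created lazily on the first insert
collection.insert_one({"platform": "Instagram", "followers": 1200})
print(collection.count_documents({}))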
3. Transferring Files into the Database
1. Ensure the files are in JSON or CSV format, which MongoDB supports for import.
2. Use scp to transfer files to the server:
scp your_file.json username@your-server-ip:/target/directory
3. Use the mongoimport tool to load files into a collection:
mongoimport --db your_database_name --collection your_collection_name --file /target/directory/your_file.json --jsonArray
If importing CSV files, include the --type csv and --headerline options.
Data Analysis & Visualization
The final design of your dashboard, both logical and physical, is entirely up to your specific objectives and preferences. You can decide which metrics to display and the visualization styles that best communicate the insights. Python provides a wide range of visualization libraries to support this flexibility, including:
- Matplotlib: Offers foundational tools for creating static, interactive, and animated plots.
- Seaborn: Built on top of Matplotlib, it simplifies the creation of aesthetically pleasing and informative statistical graphics.
- Plotly: Provides interactive visualizations, perfect for dashboards requiring user interaction like zooming or filtering.
All these libraries integrate seamlessly with Streamlit, a Python-based framework for building web apps.
Additionally, Streamlit includes several built-in visualization tools and features, such as:
- st.bar_chart(): For creating bar charts with minimal code.
- st.line_chart(): For quick line charts.
- st.map(): For geographic data visualizations.
These tools, combined with Streamlit’s interactivity and flexibility, allow you to design dashboards that are both functional and visually appealing, ensuring the metrics are displayed in a way that aligns with your project’s goals.
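To tie these pieces together, here is a minimal sketch of a Streamlit dashboard (the CSV file and its columns are placeholders):
# pip install streamlit pandas plotly
import pandas as pd
import plotly.express as px
import streamlit as st

st.title("Social Media Analytics Dashboard")

# Placeholder data source with assumed columns: date, platform, likes
df = pd.read_csv("engagement_metrics.csv")

platform = st.selectbox("Platform", sorted(df["platform"].unique()))
filtered = df[df["platform"] == platform]

# Built-in Streamlit chart for a quick overview
st.line_chart(filtered.set_index("date")["likes"])

# Plotly for richer interactivity (hover, zoom)
fig = px.bar(filtered, x="date", y="likes", title=f"Daily likes on {platform}")
st.plotly_chart(fig)
Saving this as app.py and running streamlit run app.py serves the dashboard locally.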