Data Cleaning and Preprocessing
External data can be challenging to obtain, especially with limited financial resources; for personal social media data, platforms like Instagram, Facebook, and YouTube offer APIs that make this information accessible for analysis. Once collected, however, raw data rarely arrives ready for analysis, so every dataset was cleaned and standardized with Pandas. Below are the core cleaning steps that were applied:
1. Removing Duplicates
Duplicate rows can inflate counts and skew analysis results. Use the following command to remove them:
df = df.drop_duplicates()
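Sometimes duplicates should be judged on a key column rather than on the full row. As a sketch (the Email column here is hypothetical), drop_duplicates accepts a subset argument:
df = df.drop_duplicates(subset=['Email'], keep='first') # Keep only the first row per email address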
2. Parsing and Transforming Strings
String operations, such as splitting, concatenating, or formatting text columns, are effortless with Pandas:
df['Name'] = df['Name'].str.upper() # Convert all text to uppercase
df['Domain'] = df['Email'].str.split('@').str[1] # Extract domain from email addresses
3. Handling Missing Data
Managing missing values is crucial for maintaining data quality. Pandas provides multiple options for dealing with NaN values:
df = df.fillna('Unknown') # Replace missing values with a placeholder
df = df.dropna(subset=['ColumnName']) # Remove rows where a specific column has NaN
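For numeric columns, simple imputation is another option; as a sketch (the Age column is hypothetical):
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Replace missing ages with the column mean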
4. Filtering and Subsetting Data
Extracting relevant portions of the dataset is straightforward:
df = df[df['Age'] > 18] # Filter rows where age is greater than 18
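Conditions can also be combined; the following sketch (with hypothetical columns) filters on two criteria at once:
df = df[(df['Age'] > 18) & (df['Country'].isin(['US', 'CA']))] # Keep adults located in the US or Canada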
General data cleaning was performed across all datasets, including those collected directly and those generated from web scraping. This included tasks such as removing duplicates, handling missing values, and ensuring proper formatting of columns. The cleaning process was applied uniformly across all data sources to ensure consistency and accuracy for further analysis.
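As an illustration only, the shared steps above can be wrapped into one reusable function so that every dataset passes through the same pipeline (the column handling here is a hypothetical sketch, not the exact code used):
import pandas as pd

def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the shared cleaning steps: deduplicate, fill missing values, tidy text columns."""
    df = df.drop_duplicates() # Remove duplicate rows
    text_cols = df.select_dtypes(include='object').columns
    df[text_cols] = df[text_cols].fillna('Unknown') # Placeholder for missing text values
    df[text_cols] = df[text_cols].apply(lambda s: s.str.strip()) # Trim stray whitespace
    return df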
Depending on the machine learning models used, additional preprocessing steps were carried out:
- Natural Language Processing (NLP): For Yelp reviews, NLP techniques were applied to process the textual data. The text was filtered to include only English-language content (a sketch of this filtering step follows the code below), and special characters were removed. Tokenization and lemmatization were performed to prepare the text for sentiment analysis. These steps ensured that the text data was clean, standardized, and ready for use in sentiment models.
import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer
# Sample data
reviews = ["The food was amazing! Will come again.", "Terrible service. Not recommended."]
# Load English NLP model
nlp = spacy.load("en_core_web_sm")
# Preprocess reviews
def preprocess_text(text):
    doc = nlp(text.lower()) # Lowercase, then tokenize with spaCy
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha] # Lemmatize, drop stop words and non-alphabetic tokens
    return " ".join(tokens)
# Apply preprocessing
cleaned_reviews = [preprocess_text(review) for review in reviews]
print("Cleaned Reviews:", cleaned_reviews)
# Convert text into a bag-of-words format
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(cleaned_reviews)
print("Feature Names:", vectorizer.get_feature_names_out())
- Linear Regression: For the linear regression model, one-hot encoding was applied to convert non-numeric categorical columns into numeric variables. This step was essential to ensure that the machine learning model could interpret categorical data appropriately.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Sample data
data = pd.DataFrame({
    "category": ["Restaurant", "Salon", "Restaurant", "Grocery"],
    "rating": [4.5, 3.0, 4.0, 2.5]
})
# One-hot encode categorical variable
data_encoded = pd.get_dummies(data, columns=["category"], drop_first=True)
# Split data into features and target
X = data_encoded.drop(columns=["rating"])
y = data_encoded["rating"]
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print("Predictions:", predictions)
- Association Analysis: For association rule mining, the data was transformed into a transactional format. This preprocessing step was necessary to apply algorithms like Apriori, which requires data to be structured as transactions (i.e., a list of items or events that occur together).
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd
# Sample transactional data
transactions = [
    ["Milk", "Bread", "Butter"],
    ["Bread", "Butter"],
    ["Milk", "Bread"],
    ["Milk", "Butter"],
    ["Bread", "Butter", "Eggs"]
]
# Convert data into a one-hot transactional format (boolean item columns)
itemset = pd.DataFrame(transactions)
itemset = pd.get_dummies(itemset.stack()).groupby(level=0).max().astype(bool)
# Apply Apriori algorithm
frequent_itemsets = apriori(itemset, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
print("Frequent Itemsets:")
print(frequent_itemsets)
print("\nAssociation Rules:")
print(rules)
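As a side note, mlxtend also provides a TransactionEncoder that builds the same boolean one-hot table more directly; a minimal sketch using the transactions above:
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
itemset = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
# itemset now holds one boolean column per item, ready to pass to apriori()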