Facebook Data Mining

Facebook has incredible amounts of data that can be mined for analysis and insights on user or customer behavior, social trends, market analysis, and so on. The Graph API is the interface for extracting this data. It is the primary way to get data into and out of the Facebook platform. It’s a low-level HTTP-based API that apps can use to programmatically query data, post new stories, manage ads, upload photos, and perform a wide variety of other tasks.

I created a project called Facebook Miner for extracting this data using Python. This project extracts Facebook page posts along with their comments and sub-comments.

The project is located on GitHub. To run it, execute the Python script. Some preliminary setup is required first, such as creating the file that lists the Facebook pages to extract. These preliminary steps are detailed in the GitHub README file.

Facebook Graph API for Data Mining

Facebook provides an API, called the Graph API, for extracting company and customer comments[1].

Before the Graph API can be accessed, an access token is required[2]. A token can be acquired from Facebook’s developer tools located at https://developers.facebook.com/tools/

Once the token is acquired, an API endpoint can be accessed as follows:

curl -i -X GET \
"https://graph.facebook.com/facebook/picture?redirect=false&access_token={valid-access-token-goes-here}"

The Graph API is named after the idea of a “social graph” — a representation of the information on Facebook. It’s composed of:

  • nodes — basically individual objects, such as a User, a Photo, a Page, or a Comment
  • edges — connections between a collection of objects and a single object, such as Photos on a Page or Comments on a Photo
  • fields — data about an object, such as a User’s birthday, or a Page’s name

Typically, you use nodes to get data about a specific object, use edges to get collections of objects on a single object, and use fields to get data about a single object or each object in a collection.
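To make the three concepts concrete, here is a minimal sketch of how a node, an edge, and a field list map onto a request URL. It is an illustration rather than code from Facebook Miner: the helper name `graph_url` and the placeholder token are assumptions, and the URL form follows the curl example above.

```python
from urllib.parse import urlencode

GRAPH_BASE = "https://graph.facebook.com"


def graph_url(node_id, edge=None, fields=None, access_token="PLACEHOLDER_TOKEN"):
    """Compose a Graph API URL from a node, an optional edge, and optional fields."""
    path = f"{GRAPH_BASE}/{node_id}"           # node: a single object
    if edge:
        path += f"/{edge}"                     # edge: a collection attached to the node
    params = {}
    if fields:
        params["fields"] = ",".join(fields)    # fields: data to return for each object
    params["access_token"] = access_token
    # safe="," keeps the comma-separated field list readable in the URL
    return f"{path}?{urlencode(params, safe=',')}"


node_url = graph_url("facebook", fields=["id", "name"])
edge_url = graph_url("facebook", edge="posts")
print(node_url)
print(edge_url)
```

The first URL asks for two fields of a single Page node, while the second asks for the collection of posts attached to that Page.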

User Page Data Structure

Each post by a user on their Facebook page is published on the main page, which is accessible using a URL of the form http://www.facebook.com/<page_name>. There are variations on how content gets published on a company page. Some companies only allow authorized personnel to post content, while others are open to all users on Facebook.

Below is the structure of posts for a company page:

- User A Page
  -> Page Post by Friends (or enemies)
     -> Page Post Comments
        -> Page Post Comment Sub-Comments
        ...
  -> Page Post by Friends (or enemies)
     -> Page Post Comments
        -> Page Post Comment Sub-Comments
        ...
- User B Page
  ...
- User C Page
  ...

See screenshot below for illustration:

For each post, customers can add comments as well as sub-comments. To retrieve comments, a GET request must include the post's node identifier in the request URL. Doing so will return all the associated comments for a given post. Sub-comments are retrieved the same way, using the comment's identifier in place of the post's.
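The post, comment, and sub-comment hierarchy can be walked with two levels of requests. The sketch below is an assumption-laden illustration, not the project's code: `comments_url` and `collect_thread` are hypothetical names, the access token is left out of the URL for brevity, only the first page of each response is read, and `fetch` is an injected callable so the example can run offline against made-up IDs.

```python
def comments_url(object_id):
    """Comments edge of a post; given a comment ID, it addresses that comment's replies."""
    return f"https://graph.facebook.com/{object_id}/comments"


def collect_thread(post_id, fetch):
    """Return {comment_id: [sub_comment_id, ...]} for one post.

    `fetch` takes a URL and returns the decoded JSON response as a dict.
    """
    thread = {}
    for comment in fetch(comments_url(post_id)).get("data", []):
        replies = fetch(comments_url(comment["id"])).get("data", [])
        thread[comment["id"]] = [reply["id"] for reply in replies]
    return thread


# Offline stand-in for the API, using made-up IDs
fake_responses = {
    comments_url("post1"): {"data": [{"id": "c1"}, {"id": "c2"}]},
    comments_url("c1"): {"data": [{"id": "c1_r1"}]},
    comments_url("c2"): {"data": []},
}
thread = collect_thread("post1", fake_responses.__getitem__)
print(thread)
```

In real use, `fetch` would issue the HTTP GET request and decode the JSON body, and the token would be appended to each URL.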

Python Code for Extracting Facebook Posts

This GitHub Python project (https://github.com/rayitis/Facebook-Miner) uses the pandas DataFrame[3] to load and manipulate data for processing. A DataFrame is a two-dimensional data structure, i.e. data is aligned in a tabular fashion in rows and columns. The following is an outline of the methods in the Python program for extracting Facebook data.


import pandas as pd
import json
from pprint import pprint
import os
import requests
 
# Load company list
# companyList = pd.read_csv('DataSet/Fortune 500 2017 - Fortune 500-small.csv')
companyList = pd.read_csv('DataSet/Fortune 500 2017 - Fortune 500.csv')
 
def constructUrl(contentType, variableStr):
    # contentType is either post, comment, or subcomment
    # variableStr is either the company name, post id (for comments), or comment id (for sub comments)
    if contentType == 'post':
        <URL txt here>
 
    return url
 
def graphApiRequest(url):
    requestResult = requests.get(url)
    return requestResult
 
def writeToFile(folderName, fileName, extractedData):
    if not os.path.exists(folderName):
    ...
 
def getNextPageUrl(folderName, fileName):
  
    try:
        nextPageUrl = jsonData["paging"]["next"]
    except KeyError:
        nextPageUrl = 'null'
    return nextPageUrl
    ...
 
companyFbName = companyList['FBPage'][0]
folderName = 'DataSet/'+companyFbName+'/1-CompanyPost/'
url = constructUrl('post', companyFbName)
 
 
for i in range(companyList['FBPage'].size):
    pageNum = 0
    companyFbName = companyList['FBPage'][i]
    #companyNum += 1
    folderName = 'DataSet/' + str(companyList['Rank'][i]) + '-' + companyFbName + '/1-CompanyPost/'
    url = constructUrl('post', companyFbName)
 
    while url != 'null':
        #print "We're on time %d" % (x)
        pageNum += 1
        fileName = companyFbName+"_posts_page"+str(pageNum)+".json"
        extractedData = graphApiRequest(url)
        writeToFile(folderName, fileName, extractedData)
        nextPageUrl = getNextPageUrl(folderName, fileName)
        url = nextPageUrl
        print(companyFbName + ' page '+ str(pageNum) + ': '+ url)
        if url == 'null' and pageNum == 1:
            with open('DataSet/Exceptions.txt', "a") as exceptionFile:
                exceptionFile.write(',' + companyFbName)
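
The paging logic in the `while` loop above can also be written as a small generator. This is a sketch of the same idea under stated assumptions rather than the project's implementation: `iter_pages` is a hypothetical name, `fetch` is an injected callable that returns a decoded JSON dict, and `None` is used as the stop signal instead of the `'null'` string.

```python
def iter_pages(first_url, fetch):
    """Yield each page of results, following paging.next until it is absent."""
    url = first_url
    while url is not None:
        page = fetch(url)
        yield page
        # The last page has no "next" link, which ends the loop
        url = page.get("paging", {}).get("next")


# Offline stand-in for two pages of results, using made-up URLs and IDs
fake_api = {
    "url-1": {"data": [{"id": "a"}], "paging": {"next": "url-2"}},
    "url-2": {"data": [{"id": "b"}]},
}
pages = list(iter_pages("url-1", fake_api.__getitem__))
ids = [item["id"] for page in pages for item in page["data"]]
print(ids)
```

Keeping the request function separate from the loop makes it possible to test the paging behavior without calling the API.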

To extract user comments from Facebook, the code above is reused, with additional processing logic suited to the structure of the user comment data returned by Facebook’s Graph API.

The code process and architecture are as follows:

  1. Read each extracted company post JSON file for each company. There are several JSON files per company, one for each page of results. This means that a new nested loop is added that will loop through each post in the JSON file.
  2. Extract the company post ID element in the JSON file. This post ID is the unique identifier for a given company post that is read by the “while” code snippet above.
  3. Invoke the Graph API with this post ID, which is represented as the variable “variableStr.” The user comments Graph API URL is slightly different from the company post URL, and therefore a new URL string is generated.
  4. Write the output of the JSON request to a file. Each file contains the user comments for a given company post. Similar to the company post files, this output is also broken down into multiple pages, with each company post file generating an average of 5,000 user comment files.
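The four steps above can be sketched as two small functions. This is an illustrative outline, not the Facebook Miner code. It assumes each saved post file holds the raw Graph API response with a top-level "data" list whose items carry an "id". The function names, the `_comments_page1.json` file naming, and the injected `build_url` and `fetch` callables are hypothetical, and paging of comments is left out.

```python
import glob
import json
import os


def extract_post_ids(posts_folder):
    """Steps 1-2: read every saved post page and collect the post IDs."""
    post_ids = []
    for path in sorted(glob.glob(os.path.join(posts_folder, "*.json"))):
        with open(path, encoding="utf-8") as handle:
            page = json.load(handle)
        post_ids.extend(post["id"] for post in page.get("data", []))
    return post_ids


def save_comments(post_ids, comments_folder, build_url, fetch):
    """Steps 3-4: request the comments for each post and write them to disk."""
    os.makedirs(comments_folder, exist_ok=True)
    written = []
    for post_id in post_ids:
        comments = fetch(build_url(post_id))  # first page of comments only
        out_path = os.path.join(comments_folder, f"{post_id}_comments_page1.json")
        with open(out_path, "w", encoding="utf-8") as handle:
            json.dump(comments, handle)
        written.append(out_path)
    return written
```

Here `build_url` plays the role of `constructUrl('comment', postId)` and `fetch` the role of `graphApiRequest`, so either can be swapped for a stub during testing.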

References

[1] Facebook Graph API for Developers. URL: https://developers.facebook.com/docs/graph-api/overview/

[2] Facebook Access Tokens. URL: https://developers.facebook.com/docs/facebook-login/access-tokens

[3] Pandas DataFrame. URL: https://pandas.pydata.org/pandas-docs/stable/dsintro.html

