
Twitter Data Mining: Collecting tweets and replies with Python

Although Twitter already provides an efficient and friendly API to developers, it has several limitations.

  • First, the Twitter API can only grab 3,200 tweets at maximum. According to a Twitter developer, 3,200 is a hard limit:

"For users with less than 3,200 tweets, you can get all of them with a few calls. For users with more, there's a hard limit at 3,200. There is no reliable way to try capturing an archive of a specific user's tweets between a span of dates." Link

  • Second, the Twitter API does not provide any method that allows developers to retrieve all of a tweet's replies.

I have tried using the Search method to find related tweets and then decide whether each result is a reply to the tweet. Since the Search method can only fetch at most 100 recent tweets at a time, this approach does not work out either.
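
For reference, that attempt looked roughly like the sketch below. I use the tweepy library here (in tweepy 4+ the call is api.search_tweets); the credentials are placeholders, and the "to:" search plus the in_reply_to_status_id field is the heuristic for spotting replies.

import tweepy

# Hypothetical credentials -- replace with your own app's keys.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)

# Search for tweets directed at the account, then keep only those
# that are actual replies to the tweet we care about. The search
# index only covers roughly the last week of tweets, hence the dead end.
target_id = 489726436907819008
for status in api.search(q='to:ChaseSupport', count=100):
    if status.in_reply_to_status_id == target_id:
        print(status.text)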

After searching Google for help for quite a long time, I decided to write my own 'API' to implement this functionality.

The method I use here is really old-fashioned and works like a web crawler: I am going to mimic the browsing behavior. My program will continuously send requests to Twitter, analyze the retrieved HTML code, and filter out any valuable data. The biggest advantage is that whatever you can see in the browser, the crawler can record: date, tweets, replies, user name, etc.

You can use this method for many other programs. For example, I have used this approach to implement an automatic course-registration app.

How do we design the program to mimic the browsing behavior?

First, ask yourself: after opening a browser, what is the first step to browsing the Twitter website?

The answer is: type the Twitter address into the browser.

Then, here is the code:

import requests

# Request the Twitter home page and print the raw HTML
r = requests.get('https://twitter.com')
print(r.text)

I use the requests library, the so-called "HTTP for Humans", which makes our code more understandable. 🙂 Link

Next question: after the browser loads the Twitter home page, what is your next step?

Answer: type the username and password, and click the Sign-In button.

What the browser does is send your username and password to the Twitter server. But it is not that simple, because we need to know exactly what the browser sends to Twitter and which address it sends it to. In order to fully understand the whole process, we need the help of a packet-sniffing tool. I highly recommend using Firefox and its plugin HttpFox.

So, let us see what happens when we click the Sign-In button.

[Screenshot: HttpFox capture of the requests sent when clicking Sign-In]

Notice the second activity in the record. It is a POST, which means the browser is sending something! And the URL on the right is twitter.com/sessions. Furthermore, when we check the POST data below, we can clearly see the post format. The parameters we need to care about are session[username_or_email], session[password], and authenticity_token.

The authenticity_token can be found in the HTML code:

[Screenshot: the hidden authenticity_token input field in the login form's HTML]

So what we need to do is extract the authenticity_token from the HTML code.

import requests
from bs4 import BeautifulSoup

# Initial request to https://twitter.com
session = requests.Session()
resp = session.get('https://twitter.com', verify=True)

# Parse the HTML code and extract the authenticity_token
html = BeautifulSoup(resp.text, 'html.parser')
auth_token = html.find("input", {"name": "authenticity_token"})['value']

# Construct the POST data
payload = {
    'session[username_or_email]': 'your-user-name',
    'session[password]': 'your-password',
    'remember_me': 1,
    'return_to_ssl': True,
    'scribe_log': '',
    'redirect_after_login': '/',
    'authenticity_token': auth_token
}

# Send the POST to https://twitter.com/sessions
session.post("https://twitter.com/sessions", data=payload, allow_redirects=True)
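
A quick sanity check of my own (not part of the original flow): after the POST, re-fetch the home page with the same session. The marker string below is an assumption; if the login worked, the response should no longer contain the login form.

# Verify the login by re-fetching the home page with the same session.
# An authenticated session gets the timeline, not the login form.
home = session.get('https://twitter.com')
if 'session[username_or_email]' in home.text:
    print('Login appears to have failed -- check your credentials')
else:
    print('Logged in')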

So far, we have used Python to mimic the browser's behavior and log into the Twitter website.

It is time to collect data

Our goal is to retrieve all of a tweet's replies. First, let us take a look at a tweet.

[Screenshot: a tweet from @ChaseSupport with its replies]

Examining the HTML code, we can find that each reply sits under a <div> whose class attribute looks like "simple-tweet ... descendant".

[Screenshot: the HTML of a reply, showing the div with class "simple-tweet ... descendant"]

Code for retrieving the replies:

# After the login process, access the tweet's URL
respond = session.get("https://twitter.com/ChaseSupport/status/489726436907819008")
parsed_html = BeautifulSoup(respond.text, 'html.parser')

# Retrieve the replies: each one is a <div class="... descendant">
# containing a <p class="tweet-text"> with the reply text
for div in parsed_html.findAll('div', {"class": "descendant"}):
    for p in div.findAll('p', {"class": "tweet-text"}):
        reply_content = p.text.replace('\n', ' ').replace('\r', '')
        print(reply_content)
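
If you want to keep the replies for later analysis, here is a minimal sketch of my own (standard library only, not from the original program) that appends each reply to a CSV file:

import csv

# Append each scraped reply to replies.csv, one row per reply
with open('replies.csv', 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for div in parsed_html.findAll('div', {"class": "descendant"}):
        for p in div.findAll('p', {"class": "tweet-text"}):
            reply = p.text.replace('\n', ' ').replace('\r', '')
            writer.writerow([reply])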

Besides, we need to understand the URL format and know how to construct URLs, so that our program can do batch processing.

By observation, a tweet's URL on Twitter has the following form:

https://twitter.com/twitter-account/status/tweet-id

Twitter-account is the account you want to crawl. The tweet-id can be obtained from the official Twitter API (remember the limitation: you can only get the IDs of the latest 3,200 tweets and replies).
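
A sketch of how that batch step might look, again using tweepy with placeholder credentials (the timeline call may differ in newer tweepy versions): pull recent tweet IDs from the official API and build the status URLs for the crawler.

import tweepy

# Hypothetical credentials -- replace with your own app's keys.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)

# Pull recent tweet IDs (capped at 3,200 by the API) and build the
# status URLs that the reply crawler above can fetch one by one.
account = 'ChaseSupport'
for status in tweepy.Cursor(api.user_timeline, screen_name=account).items(200):
    url = 'https://twitter.com/%s/status/%d' % (account, status.id)
    print(url)  # feed each URL to session.get(...) as shown earlier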

Wrap up

In my finished program, I can extract tweets' replies, tweets' ancestors, the number of retweets, and the number of favorites.
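
The counts can be scraped the same way as the replies: find their markup with HttpFox and pull the matching elements. Below is a sketch with purely hypothetical class names; inspect the live HTML for the real ones.

# Hypothetical class names -- placeholders only; use HttpFox or the
# browser inspector to find the actual ones in the page source.
retweets = parsed_html.find('span', {'class': 'retweet-count'})
favorites = parsed_html.find('span', {'class': 'favorite-count'})
if retweets is not None:
    print('Retweets:', retweets.text.strip())
if favorites is not None:
    print('Favorites:', favorites.text.strip())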

A snippet of my output, crawled from @GEICO_Service:

[Screenshots: sample output crawled from @GEICO_Service]