May 2013 Archives


This blog entry shows a simple method for retrieving the ten most frequently occurring hashtags in a given set of tweets. First, we need a function that retrieves the hashtags attached to a given tweet. That function takes a tweet JSON object and returns a list of the hashtags (i.e., strings) that appear in that tweet. An example of what this function looks like is as follows:


def tweethashtags(tweet):
    # Collect the text of every hashtag listed under tweet['entities'].
    hashtags = []
    if 'entities' in tweet:
        tweetentities = tweet['entities']
        if 'hashtags' in tweetentities:
            hashtagstext = tweetentities['hashtags']
            for hasht in hashtagstext:
                if 'text' in hasht:
                    hashtags.append(hasht['text'])
    return hashtags
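As a quick sanity check, the function can be run on a hand-built dictionary shaped like the entities portion of a tweet. The sample tweet below is invented for illustration, and the function is repeated in a condensed form of the same logic only so that the snippet runs on its own.

```python
def tweethashtags(tweet):
    """Return the hashtag strings attached to one tweet dictionary."""
    hashtags = []
    for hasht in tweet.get('entities', {}).get('hashtags', []):
        if 'text' in hasht:
            hashtags.append(hasht['text'])
    return hashtags


# Hypothetical tweet, trimmed to the fields the function reads.
sample_tweet = {
    'text': 'Learning #python with #json data',
    'entities': {
        'hashtags': [
            {'text': 'python', 'indices': [9, 16]},
            {'text': 'json', 'indices': [22, 27]},
        ]
    },
}

found = tweethashtags(sample_tweet)       # a tweet with two hashtags
missing = tweethashtags({'delete': {}})   # an object with no 'entities' key
print(found)
print(missing)
```

An object without an 'entities' key simply yields an empty list, so the caller does not need a special case for it.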

Next, parse the input tweets into JSON objects and invoke the tweethashtags function to retrieve the hashtags of each tweet. To calculate the top 10 hashtags, we keep a count of the tweets that contain each hashtag. Finally, we retrieve the ten hashtags with the highest tweet counts. We can do that by sorting the hashtags in descending order of their tweet counts; that is an O(n log n) operation, where n is the number of distinct hashtags. Alternatively, we can simply loop over the hashtags 10 times, each time fetching the hashtag with the maximum tweet count and excluding it from further iterations. This operation is O(10n), which is linear in n. The code that does this is as follows:


import json
import sys

def main():
    # Count how often each hashtag occurs across all tweets in the file.
    hashtagcount = {}
    with open(sys.argv[1]) as tweet_file:
        for lyne in tweet_file:
            if not lyne.strip():
                continue  # skip blank lines
            tweet = json.loads(lyne)
            for tweeth in tweethashtags(tweet):
                hashtagcount[tweeth] = hashtagcount.get(tweeth, 0) + 1

    # Pick the maximum up to 10 times, skipping hashtags already printed.
    maxscores = {}
    for _ in range(min(10, len(hashtagcount))):
        maxcount = 0
        maxhash = ""
        for hashtag in hashtagcount:
            if hashtagcount[hashtag] > maxcount and hashtag not in maxscores:
                maxcount = hashtagcount[hashtag]
                maxhash = hashtag

        print("%s %d" % (maxhash, maxcount))
        maxscores[maxhash] = maxcount

if __name__ == '__main__':
    main()
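For comparison, the sorting alternative mentioned above can be sketched with the standard library. This is not the method used in the script: it sorts all counts in descending order and keeps the first k, and it also shows `heapq.nlargest` as a way to select the k largest counts directly. The helper name `top_hashtags` and the sample data are made up for illustration.

```python
import heapq
from collections import Counter


def top_hashtags(hashtag_lists, k=10):
    """Count hashtags across tweets and return the k most frequent.

    hashtag_lists is an iterable of lists, one list of hashtag
    strings per tweet (the shape tweethashtags returns).
    """
    counts = Counter()
    for tags in hashtag_lists:
        counts.update(tags)
    # Full sort by descending count: O(n log n) in the number of hashtags.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up input: the hashtags of four tweets.
sample = [['python', 'json'], ['python'], ['twitter', 'python'], ['json']]

by_sort = top_hashtags(sample, k=2)

# heapq.nlargest returns the k largest items, largest first.
all_counts = Counter(tag for tags in sample for tag in tags)
by_heap = heapq.nlargest(2, all_counts.items(), key=lambda item: item[1])

print(by_sort)
print(by_heap)
```

With only ten results wanted, the difference between the approaches matters little for small inputs; the sort is simply the shortest to write.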


Recently, I have been playing with Twitter data. Below is a basic Python script that fetches the 1% sample of live-stream tweets published by Twitter. To access the live stream, you will need to have the oauth2 library installed for authentication purposes.

To be able to access the 1% live stream, you need to set up your Twitter account using the following steps:


  1. Go to https://dev.twitter.com/apps and log in using your Twitter credentials.

  2. Create a new application using your Twitter account.

  3. Create an access token for the application you created.

  4. Fill in the missing information in the "fetchtweets.py" script below, as follows:

    access_token_key = ""
    access_token_secret = ""
    consumer_key = ""
    consumer_secret = ""


  5. Save the "fetchtweets.py" script.

  6. Finally, run the script as follows:

    $ python fetchtweets.py

    You can keep the script running until you get the data size you want.

  7. You may also redirect the fetched tweets to a file, as follows:

    $ python fetchtweets.py > tweets.txt
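Once the output has been redirected to a file, it can be read back one JSON object per line. The sketch below assumes the file may contain blank lines or lines that are not complete JSON (for example, a final line cut off when the script was interrupted); the helper name `iter_tweets` and the sample lines are hypothetical.

```python
import json


def iter_tweets(lines):
    """Yield one parsed JSON object per non-empty, well-formed line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            yield json.loads(line)
        except ValueError:
            # Skip lines that are not valid JSON, such as a truncated
            # final line left by an interrupted run.
            continue


# Made-up stand-in for the contents of tweets.txt.
sample_lines = [
    '{"text": "hello", "entities": {"hashtags": []}}\n',
    '\n',
    '{"text": "cut off mid-wri',
]

tweets = list(iter_tweets(sample_lines))
print(len(tweets))
```

With a real file, an open file object such as `open('tweets.txt')` would be passed in place of the list.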



fetchtweets.py script:


import oauth2 as oauth
import urllib2 as urllib

# Fill in the credentials of your Twitter application.
access_token_key = "..."
access_token_secret = "..."

consumer_key = "..."
consumer_secret = "..."

_debug = 0

oauth_token = oauth.Token(key=access_token_key,
                          secret=access_token_secret)
oauth_consumer = oauth.Consumer(key=consumer_key,
                                secret=consumer_secret)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()

http_method = "GET"


http_handler = urllib.HTTPHandler(debuglevel=_debug)
https_handler = urllib.HTTPSHandler(debuglevel=_debug)

def twitterreq(url, method, parameters):
    '''
    Construct, sign, and open a twitter request
    using the hard-coded credentials above.
    '''
    req = oauth.Request.from_consumer_and_token(oauth_consumer,
                                                token=oauth_token,
                                                http_method=method,
                                                http_url=url,
                                                parameters=parameters)

    req.sign_request(signature_method_hmac_sha1, oauth_consumer, oauth_token)

    headers = req.to_header()

    # POST requests carry the signed parameters in the body;
    # GET requests carry them in the URL.
    if method == "POST":
        encoded_post_data = req.to_postdata()
    else:
        encoded_post_data = None
        url = req.to_url()

    opener = urllib.OpenerDirector()
    opener.add_handler(http_handler)
    opener.add_handler(https_handler)

    response = opener.open(url, encoded_post_data)

    return response

def fetchsamples():
    url = "https://stream.twitter.com/1/statuses/sample.json"
    parameters = []
    response = twitterreq(url, http_method, parameters)
    # Print each line of the stream as it arrives.
    for line in response:
        print(line.strip())

if __name__ == '__main__':
    fetchsamples()
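Instead of interrupting the script by hand, the loop can stop after a fixed number of lines. This is a minimal sketch, assuming the response can be iterated line by line as in fetchsamples; the helper `take_lines` and the stand-in list are hypothetical, and no network request is made here.

```python
from itertools import islice


def take_lines(response, limit):
    """Return at most `limit` stripped lines from an iterable of lines."""
    return [line.strip() for line in islice(response, limit)]


# Stand-in for the HTTP response: any iterable of lines works.
fake_response = ['{"id": 1}\r\n', '{"id": 2}\r\n', '{"id": 3}\r\n']

first_two = take_lines(fake_response, 2)
print(first_two)
```

Because `islice` stops pulling from the iterable once the limit is reached, the same pattern should apply to a long-running stream, which never ends on its own.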



About this Archive

This page is an archive of entries from May 2013 listed from newest to oldest.
