Playing with Twitter Data (Retrieve Top 10 hashtags)


This blog entry shows a simple way to retrieve the ten most frequently occurring hashtags in a given set of tweets. First, we need a function that retrieves the hashtags attached to a given tweet. The function takes a tweet as a parsed JSON object and returns a list of the hashtags (i.e., strings) that appear in that tweet. An example of what this function looks like is as follows:


def tweethashtags(tweet):
    hashtags = []
    if 'entities' in tweet:
        tweetentities = tweet['entities']
        if 'hashtags' in tweetentities:
            hashtagstext = tweetentities['hashtags']
            for hasht in hashtagstext:
                if 'text' in hasht:
                    hashtags.append(hasht['text'])
    return hashtags
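
To make the expected input concrete, here is a minimal sketch that runs the function on a hand-made tweet. The sample tweet and the second, entity-free message are made up for illustration; only the entities/hashtags/text nesting comes from the function above, which is repeated so the snippet runs on its own.

```python
def tweethashtags(tweet):
    # Repeated from above so that this snippet is self-contained.
    hashtags = []
    if 'entities' in tweet:
        tweetentities = tweet['entities']
        if 'hashtags' in tweetentities:
            for hasht in tweetentities['hashtags']:
                if 'text' in hasht:
                    hashtags.append(hasht['text'])
    return hashtags

# Hypothetical tweet, trimmed to the fields the function actually reads.
sample_tweet = {
    'text': 'Trying out #python and #bigdata',
    'entities': {
        'hashtags': [
            {'text': 'python'},
            {'text': 'bigdata'},
        ]
    },
}

# Hypothetical message with no 'entities' key at all.
no_entities = {'text': 'no tags here'}

found = tweethashtags(sample_tweet)
empty = tweethashtags(no_entities)
print(found)
print(empty)
```

A message without an 'entities' key simply yields an empty list, so the caller does not need a special case for it.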

Next, we parse each input tweet into a JSON object and invoke the tweethashtags function to retrieve its hashtags. To calculate the top-10 hashtags, we keep, for each hashtag, a count of its occurrences across the tweets. Finally, we retrieve the ten hashtags with the highest counts. We can do that by sorting the hashtags in descending order of their counts; that is an O(n log n) operation, where n is the number of distinct hashtags. Alternatively, we can simply loop over the hashtags 10 times, each time fetching the hashtag with the maximum count among those not yet selected. This takes about 10n steps, i.e., O(n) for a fixed number of results. The code that does this is as follows:


import json
import sys

def main():
    hashtagcount = {}
    # The input file is expected to hold one JSON-encoded tweet per line.
    with open(sys.argv[1]) as tweet_file:
        for lyne in tweet_file:
            tweet = json.loads(lyne)
            for tweeth in tweethashtags(tweet):
                if tweeth in hashtagcount:
                    hashtagcount[tweeth] += 1.0
                else:
                    hashtagcount[tweeth] = 1.0

    # Select the hashtag with the highest count ten times,
    # skipping the hashtags that were already selected.
    totalcount = 0
    maxscores = {}
    while totalcount < 10:
        maxcount = 0.0
        maxhash = None
        for hashtag in hashtagcount:
            if hashtagcount[hashtag] > maxcount and hashtag not in maxscores:
                maxcount = hashtagcount[hashtag]
                maxhash = hashtag
        if maxhash is None:
            break  # fewer than ten distinct hashtags in the input
        print("%s %s" % (maxhash, maxcount))
        maxscores[maxhash] = maxcount
        totalcount += 1

if __name__ == '__main__':
    main()
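
For comparison, the sorting alternative mentioned above can be sketched with the standard library. This is only an illustration under stated assumptions: the per-tweet hashtag lists below are made up and stand in for the output of tweethashtags, and the use of collections.Counter and heapq.nlargest is a substitution for the hand-written loops, not part of the original program.

```python
import heapq
from collections import Counter

# Hypothetical per-tweet hashtag lists (stand-ins for tweethashtags output).
tweets_hashtags = [
    ['python', 'bigdata'],
    ['python'],
    [],
    ['giraph', 'bigdata', 'python'],
]

# Count every hashtag occurrence.
counts = Counter()
for tags in tweets_hashtags:
    counts.update(tags)

# Option 1: full sort in descending order of count, O(n log n).
top_sorted = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]

# Option 2: heap-based selection of the 10 largest, without sorting everything.
top_heap = heapq.nlargest(10, counts.items(), key=lambda item: item[1])

for hashtag, count in top_sorted:
    print("%s %d" % (hashtag, count))
```

Counter also offers most_common(10), which returns the same kind of (hashtag, count) pairs. For only ten results out of many hashtags, the selection approaches avoid sorting the whole dictionary.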

About this Entry

This page contains a single entry by M. Sarwat published on May 24, 2013 7:15 PM.
