"What users (customers) really want?" a question asked by almost every business and online retail store. The answer to this question helps users find interesting items (products, movies, books) to buy, watch, read, etc... In a pursuit to such an answer, personalized recommender systems analyze users preferences/tastes and purchasing history in order to predict how much a specific user would like an unseen item. Actually, Netflix, an online streaming video service, reported that $75\%$ of movies users watch on Netflix are from recommendation.

netflix.png

uiratings.png

Technically speaking, a recommender system takes as input a user-item ratings matrix (a movies example is given below) in which rows represent users, columns represent items, and each matrix entry represents the rating (e.g., on a scale from 1 to 10) that a user assigned to an item. To recommend a movie to Alice, a recommender system first predicts how much Alice would like the movies she has not seen (i.e., Inception and Casino) and then returns the movie with the maximum predicted rating.
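To make this concrete, here is a minimal sketch (my own illustration, not RecDB code) of how an item-item collaborative filtering approach could predict a rating for an unseen movie from such a matrix; the toy ratings, users, and function names are made up for illustration.

import math

# Toy user/item ratings (1-10 scale); users, movies, and values are made up.
ratings = {
    'Alice': {'Titanic': 8, 'The Godfather': 9},
    'Bob':   {'Titanic': 7, 'The Godfather': 8, 'Inception': 9},
    'Carol': {'Titanic': 3, 'Inception': 4},
}

def item_vector(item):
    # Ratings given to an item, keyed by user.
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine(item_a, item_b):
    # Cosine similarity between the two items' rating vectors.
    va, vb = item_vector(item_a), item_vector(item_b)
    common = set(va) & set(vb)
    if not common:
        return 0.0
    dot = sum(va[u] * vb[u] for u in common)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b)

def predict(user, unseen_item):
    # Similarity-weighted average of the user's own ratings.
    sims = [(cosine(unseen_item, item), score)
            for item, score in ratings[user].items()]
    denom = sum(s for s, _ in sims)
    return sum(s * score for s, score in sims) / denom if denom else 0.0

# Predicted rating for a movie Alice has not rated yet.
print(predict('Alice', 'Inception'))

The prediction is a similarity-weighted average of the ratings the user has already given, which is the usual intuition behind item-item collaborative filtering with cosine similarity.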

Existing recommender systems are offline in nature: they pre-compute a set of recommended items for every user, store it on disk, and return the pre-computed recommendations when the user logs on to the system. One class of offline systems consists of software libraries that perform the full recommendation process in memory, e.g., LensKit and MyMediaLite. Such libraries require loading the whole user/item ratings data from persistent storage into memory, which may become a performance bottleneck with large-scale data. Moreover, they assume the input data and the generated recommendation model fit in memory, and hence they do not scale to large datasets.


Offline recommenders also include large-scale Hadoop-based systems like Apache Mahout, which are built on top of the Hadoop ecosystem and run the recommendation generation functionality as a batch MapReduce job. Although Hadoop-based implementations are scalable, they suffer from the following: (1) the tremendous overhead of transferring the user/item ratings data from the transactional database system (where the data reside) to HDFS and vice versa; (2) the inability to handle arbitrary online recommendation scenarios, since the recommended items are pre-computed offline; and (3) the requirement that users set up a Hadoop cluster and write a MapReduce program rather than use a declarative query, which is not appealing for novice users.

mahoutrec.png

At the University of Minnesota Data Management Lab, we developed RecDB, an open source recommendation engine built entirely inside PostgreSQL 9.2. RecDB allows application developers to build recommendation applications in a heartbeat through a variety of built-in recommendation algorithms, such as user-user collaborative filtering, item-item collaborative filtering, and singular value decomposition (SVD). Applications powered by RecDB can produce online, flexible, personalized recommendations to end users. RecDB has the following main features:

recdb.png

  • Usability: RecDB is an out-of-the-box tool for web and mobile developers to implement a myriad of recommendation applications. The system is easy to use and configure, so that a novice developer can define a variety of recommenders that fit the application's needs in a few lines of SQL.
  • Seamless Database Integration: Crafted inside the PostgreSQL database engine, RecDB seamlessly integrates the recommendation functionality with traditional database operations, e.g., SELECT, PROJECT, and JOIN, in the query pipeline to execute ad-hoc recommendation queries.
  • Scalability and Performance: The system optimizes incoming recommendation queries (written in SQL) and hence provides near real-time personalized recommendations to a large number of end users who have expressed their opinions over a large pool of items.

Creating a Recommender: Users may create recommenders a priori so that recommendation queries can be answered with lower latency. The user specifies the ratings table in the ON clause, indicates which columns of that table hold the user, item, and rating values, and designates the recommendation algorithm used to predict item ratings in the USING clause. An example of creating an item-item collaborative filtering recommender on the user/item ratings table MovieRatings is as follows:


CREATE RECOMMENDER MovieRec ON MovieRatings
USERS FROM userid
ITEMS FROM itemid
EVENTS FROM ratingval
USING ItemCosCF


Generating Recommendations: To generate recommendations, RecDB allows users to write their recommendation query in SQL. In the recommendation query, the user specifies the ratings table, indicates which columns hold the user, item, and rating values, and designates the recommendation algorithm used to predict item ratings. For example, if MovieRatings(userid,itemid,ratingval) is the ratings table in a movie recommendation application, then to recommend the top-10 movies to user 1 based on ratings predicted with the item-item collaborative filtering algorithm (using the cosine similarity measure), the user writes the following SQL:


SELECT * FROM MovieRatings R
RECOMMEND R.itemid TO R.userid ON R.ratingval USING ItemCosCF
WHERE R.userid = 1
ORDER BY R.ratingval DESC
LIMIT 10

The main benefit of implementing the recommendation functionality inside a database engine (PostgreSQL) is the ability to integrate it with traditional database operations, e.g., selection, projection, join. For example, the following query recommends the top 10 comedy movies to user 1. To do that, the query joins the recommendation with the Movies table and applies a filter on the movie genre column (genre LIKE '%Comedy%').


SELECT * FROM MovieRatings R, Movies M
RECOMMEND R.itemid TO R.userid ON R.ratingval USING ItemCosCF
WHERE R.userid = 1 AND M.movieid = R.itemid AND M.genre LIKE '%Comedy%'
ORDER BY R.ratingval DESC
LIMIT 10


More Applications. For demonstration purposes, we developed a restaurant recommendation application, using RecDB, that generates restaurant recommendations for users based on their spatial locations. The application analyzes the history of user check-ins (visits) stored in a CheckIns table, where each entry indicates whether a user has visited a restaurant before: the checkin field is set to one if the user visited the restaurant, and zero otherwise.

restrec.png

We create RestaurantRec, a recommender that generates personalized recommendations using the singular value decomposition (SVD) algorithm, with the CheckIns table serving as the user/item events matrix.

CREATE RECOMMENDER RestaurantRec ON CheckIns
USERS FROM userid
ITEMS FROM itemid
EVENTS FROM checkin
USING SVD

A user visiting 'New York City' asks for the top five (location-aware) restaurant recommendations by issuing the following query to RecDB. The query constrains the restaurant location using traditional SQL predicates (A.itemid = B.itemid AND B.location = 'New York City'), and RecDB returns a set of five restaurants using the ORDER BY / LIMIT clauses.

SELECT A.itemid FROM CheckIns A, Restaurant B
RECOMMEND A.itemid TO A.userid ON A.checkin USING SVD
WHERE A.userid = 1 AND A.itemid = B.itemid AND B.location = 'New York City'
ORDER BY A.checkin DESC LIMIT 5

RecDB's source code is available on GitHub. Since its release on October 3rd, 2013, RecDB has been downloaded more than 4,000 times.

Geographical Changes Matter


Ever since its formation, our planet Earth has undergone many geographical changes, e.g., climate change, deforestation, population migration, and changes in sea levels. Such changes may have positive and/or negative effects on our lives and the lives of other species. For instance, global climate change might lead to loss of sea ice, accelerated sea-level rise, and longer, more intense heat waves. Hence, it is essential for governments to carefully study geographical changes as well as their impact on our society.

Needless to say, computer scientists have a big role to play here: providing effective tools for geographers to manage and analyze geospatial data. Recently, ESRI released a set of interesting tools for geographers to study geographical changes. For instance, they provide a web-based tool, called "ChangeMatters", that helps geographers visualize a designated geographical region and reveal how that area has changed between different time periods. In other words, ChangeMatters allows users to visualize geographical changes on the map over roughly a 30-year period. The figure below shows how ChangeMatters presents data for the Aral Sea area.

aralsea.png

The Aral Sea was a lake lying between Kazakhstan and Karakalpakstan. Formerly one of the four largest lakes in the world, with an area of 26,300 square miles, the Aral Sea has been steadily shrinking since the 1960s, after the rivers that fed it were diverted by Soviet irrigation projects. As the figure above shows, a huge portion of the Aral Sea's water disappeared between 1975 and 2000, which makes it one of the planet's environmental disasters.
ChangeMatters also helps geographers analyze changes in vegetation index, urbanization, desertification, and many other geographical phenomena. Such tools help geographers better assess the benefits and/or threats of changes in the Earth's geography, and hence prevent environmental disasters, or at least mitigate their severe effects on our lives. Given the importance of such tools, I envision more computer science research related to geospatial analytics to come.

Recently, I had the chance to install, modify, compile, and run Apache Giraph, a system that processes batch jobs (analytics tasks) over large-scale graphs (e.g., shortest paths, PageRank). Giraph is essentially an open source implementation of Pregel, which was originally introduced by Google. To achieve scalability, Giraph is built as a layer on top of Hadoop to leverage its parallelism and cluster management capabilities. Giraph loads graph data stored in HDFS into the memory of a cluster of servers, and the graph processing algorithm then executes as one giant map phase, with no reduce phase involved, to avoid unnecessary data shuffling across the network. During the map phase, each worker communicates asynchronously with other workers using Netty to exchange messages.

A salient feature of Giraph is its simple, yet effective, programming model. A Giraph programmer has to think like a "Vertex". In other words, the programmer implements a method, called compute(), that is invoked for each visited (active) vertex. An example of the compute() method for the single source shortest path algorithm is as follows:


public void compute(Iterable<DoubleWritable> messages) {
  // Minimum distance received from any neighbor in this superstep.
  double minDist = Double.MAX_VALUE;
  for (DoubleWritable message : messages) {
    minDist = Math.min(minDist, message.get());
  }
  // A shorter path was found: update the vertex value and propagate
  // the new distance to all outgoing neighbors.
  if (minDist < getValue().get()) {
    setValue(new DoubleWritable(minDist));
    for (Edge<LongWritable, FloatWritable> edge : getEdges()) {
      double distance = minDist + edge.getValue().get();
      sendMessage(edge.getTargetVertexId(),
                  new DoubleWritable(distance));
    }
  }
  // Vote to halt; the vertex is re-activated if it receives new messages.
  voteToHalt();
}

In Giraph, a graph processing job consists of a set of iterations, called supersteps. Initially, all graph vertices are active; in subsequent supersteps, only vertices that received messages or have not yet voted to halt remain active. Giraph employs a bulk synchronous parallel (BSP) computation model that places a barrier between consecutive supersteps in order to synchronize computation among all graph workers. Within the compute() method, each vertex performs local computation and/or sends messages to other vertices, and eventually invokes voteToHalt() to announce that it is done with its computation. The graph processing job terminates when no active vertices remain and no messages are in flight.
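To make the superstep semantics concrete, here is a tiny framework-free sketch in Python (my own illustration, not Giraph code). It assumes each vertex object exposes a hypothetical compute(superstep, messages) method that returns whether the vertex voted to halt together with the messages it wants to send.

def run_bsp(vertices, initial_messages):
    """Toy superstep loop: messages produced in superstep S are delivered
    at the barrier and consumed in superstep S + 1."""
    inbox = dict(initial_messages)   # vertex id -> list of incoming messages
    active = set(vertices)           # initially, every vertex is active
    superstep = 0
    while active or inbox:
        outbox = {}
        # A halted vertex is re-activated when it has pending messages.
        for v in active | set(inbox):
            halted, sent = vertices[v].compute(superstep, inbox.get(v, []))
            for target, message in sent:
                outbox.setdefault(target, []).append(message)
            if halted:
                active.discard(v)
            else:
                active.add(v)
        inbox = outbox               # the barrier: swap message buffers
        superstep += 1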


This blog entry exhibits a simple method to retrieve the ten most frequently occurring hashtags in a given set of tweets. First, we need a function that retrieves the set of hashtags attached to a given tweet. That function takes a tweet JSON object and returns the list of hashtags (i.e., strings) attached to that tweet. An example of what this function looks like is as follows:


def tweethashtags(tweet):
    hashtags = []
    if 'entities' in tweet:
        tweetentities = tweet['entities']
        if 'hashtags' in tweetentities:
            hashtagstext = tweetentities['hashtags']
            for hasht in hashtagstext:
                if 'text' in hasht:
                    hashtags.append(hasht['text'])
    return hashtags

Then, you will have to parse the input tweets into JSON objects and invoke the tweethashtags function to retrieve the set of hashtags for each tweet. To compute the top-10 hashtags, we keep a count of tweets for each hashtag. Finally, we retrieve the ten hashtags with the highest counts. We can do that by sorting the hashtags in descending order of their tweet counts, which is an O(n log n) operation where n is the number of distinct hashtags. Alternatively, we can simply loop over the hashtags 10 times, each time fetching the hashtag with the maximum count and discarding it in further iterations; this takes O(10n) time, i.e., linear in n. The code that implements this is as follows:


import json
import sys

def main():
    tweet_file = open(sys.argv[1])
    hashtagcount = {}
    for lyne in tweet_file:
        tweet = json.loads(lyne)
        tweethash = tweethashtags(tweet)
        if tweethash == None or tweethash == []:
            continue
        for tweeth in tweethash:
            if tweeth in hashtagcount:
                hashtagcount[tweeth] += 1.0
            else:
                hashtagcount[tweeth] = 1.0

    totalcount = 0.0
    maxscores = {}
    while totalcount < 10.0:
        maxcount = 0.0
        maxhash = ""
        for hashtag in hashtagcount.keys():
            if hashtagcount[hashtag] > maxcount and not (hashtag in maxscores):
                maxcount = hashtagcount[hashtag]
                maxhash = hashtag

        print maxhash, " ", maxcount
        maxscores[maxhash] = maxcount
        totalcount += 1.0

if __name__ == '__main__':
    main()
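As a point of comparison (not part of the original script), the same top-10 computation can be written more compactly with the standard library's Counter. This sketch assumes the tweethashtags() helper defined above and a file containing one JSON tweet per line.

import json
import sys
from collections import Counter

def top_hashtags(path, k=10):
    counts = Counter()
    with open(path) as tweet_file:
        for line in tweet_file:
            line = line.strip()
            if not line:
                continue  # skip blank keep-alive lines
            counts.update(tweethashtags(json.loads(line)))
    # most_common(k) avoids sorting the full hashtag list when only the
    # top k entries are required.
    return counts.most_common(k)

if __name__ == '__main__':
    for hashtag, count in top_hashtags(sys.argv[1]):
        print hashtag, count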


Recently, I have been playing with Twitter data. Below is a basic Python script that fetches the 1% live stream of tweets published by Twitter. To access the live stream, you will need the oauth2 library installed for authentication purposes.

To be able to access the 1% live stream, you need to set up your Twitter account using the following steps:


1. Go to https://dev.twitter.com/apps and log in using your Twitter credentials.

2. Create a new application using your Twitter account.

  3. Create an access token for your created application.

4. Fill in the missing information in the "fetchtweets.py" script below, as follows:

    access_token_key = ""
    access_token_secret = ""
    consumer_key = ""
    consumer_secret = ""


  5. Save the "fetchtweets.py" script

  6. Finally, run the script as follows:

    $ python fetchtweets.py

    You can keep the script running until you get the data size you want.

  7. You may also pipe the fetched tweets and dump them to a file, as follows:

    $ python fetchtweets.py > tweets.txt



fetchtweets.py script:


import oauth2 as oauth
import urllib2 as urllib

access_token_key = "..."
access_token_secret = "..."

consumer_key = "..."
consumer_secret = "..."

_debug = 0

oauth_token = oauth.Token(key=access_token_key,
secret=access_token_secret)
oauth_consumer = oauth.Consumer(key=consumer_key,
secret=consumer_secret)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()

http_method = "GET"


http_handler = urllib.HTTPHandler(debuglevel=_debug)
https_handler = urllib.HTTPSHandler(debuglevel=_debug)

'''
Construct, sign, and open a twitter request
using the hard-coded credentials above.
'''
def twitterreq(url, method, parameters):
    req = oauth.Request.from_consumer_and_token(oauth_consumer,
                                                token=oauth_token,
                                                http_method=http_method,
                                                http_url=url,
                                                parameters=parameters)

    req.sign_request(signature_method_hmac_sha1, oauth_consumer, oauth_token)

    headers = req.to_header()

    if http_method == "POST":
        encoded_post_data = req.to_postdata()
    else:
        encoded_post_data = None
        url = req.to_url()

    opener = urllib.OpenerDirector()
    opener.add_handler(http_handler)
    opener.add_handler(https_handler)

    response = opener.open(url, encoded_post_data)

    return response

def fetchsamples():
    url = "https://stream.twitter.com/1/statuses/sample.json"
    parameters = []
    response = twitterreq(url, "GET", parameters)
    for line in response:
        print line.strip()

if __name__ == '__main__':
fetchsamples()



A quick look inside Sindbad


Recently, members of the Data Management Lab at the University of Minnesota demonstrated Sindbad, a location-aware social networking system. From a user's perspective, Sindbad can be thought of as a location-aware Facebook (or Google+): it has all the functionality of a typical social networking service, except that geo-location is treated as a first-class citizen. Below is a video that shows how users interact with Sindbad through a web application:

Technically speaking, Sindbad incorporates geospatial location into all aspects of the user's social networking experience. For instance, Sindbad delivers location-aware messages (i.e., a location-aware news feed) to end users. The importance of a location-aware news feed appears in the following scenario:
Assume Bob visited Paris and had dinner at an amazing French restaurant. He posted a Facebook message saying: "Restaurant XYZ is awesome". Alice, a friend of Bob, saw that message in her news feed at the time it was posted. Three months later, Alice traveled to Paris and wondered where to have dinner. If she could just see the message Bob posted when he was in Paris, it would save her day. Alice needs to see the messages that are relevant to her current geospatial location (Paris).
Sindbad takes the spatial locations of the user, their friends, and the posted messages into account in order to provide location-aware news to the system's users.

Moreover, Sindbad provides a location-aware recommendation service to its users. This service takes into account both the user's location and the item's location when generating recommendations; the intuition is that Sindbad users living in the same geographic region tend to share similar preferences over restaurants, books, and even movies. Sindbad recommends restaurants based on (1) the history of user visits to different restaurants, and (2) the distance between the user's current location and the restaurant, as sketched below. Sindbad also allows its users to rate and review items, which helps in generating collaborative recommendations for other system users.
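To illustrate the general idea (this is a toy sketch, not Sindbad's actual implementation), a location-aware recommender can combine a predicted preference score with a travel penalty that grows with the distance between the user and the restaurant; the weighting factor, distance function, and sample data below are made up.

import math

def distance(user_loc, item_loc):
    # Placeholder Euclidean distance over (latitude, longitude) pairs;
    # a real system would use geodesic or road-network distance.
    return math.hypot(user_loc[0] - item_loc[0], user_loc[1] - item_loc[1])

def location_aware_score(predicted_rating, user_loc, item_loc, alpha=0.1):
    # Higher predicted preference is better; farther items are penalized.
    # alpha is a made-up knob controlling how fast distance erodes the score.
    return predicted_rating - alpha * distance(user_loc, item_loc)

# Rank candidate restaurants (name, predicted rating, location) for a user.
candidates = [('Restaurant A', 4.5, (44.97, -93.26)),
              ('Restaurant B', 4.8, (44.90, -93.50))]
user_loc = (44.98, -93.27)
ranked = sorted(candidates,
                key=lambda c: location_aware_score(c[1], user_loc, c[2]),
                reverse=True)
print(ranked[0][0])

The LARS work mentioned later on this page formalizes a related idea as a travel penalty that favors candidates closer in travel distance to the querying user.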

Sindbad was demonstrated at the ACM International Conference on Management of Data (SIGMOD 2012). Here is the abstract:

This demo presents Sindbad; a location-based social networking system. Sindbad supports three new services beyond traditional social networking services, namely, location-aware news feed, location-aware recommender, and location-aware ranking. These new services not only consider social relevance for its users, but they also consider spatial relevance. Since location-aware social networking systems have to deal with large number of users, large number of messages, and user mobility, efficiency and scalability are important issues. To this end, Sindbad encapsulates its three main services inside the query processing engine of PostgreSQL. Usage and internal functionality of Sindbad, implemented with PostgreSQL and Google Maps API, are demonstrated through user (i.e., web/phone) and system analyzer GUI interfaces, respectively.


Recently, I read the paper "Finding Popular Places", published at the 18th International Symposium on Algorithms and Computation (ISAAC 2007). I presented the paper with Abhigna Adusumilli and Rahul Saladi in the advanced computational geometry seminar. The paper is quite interesting, as the ubiquity of mobile devices (e.g., smartphones, GPS receivers) has made it easy to track the movement of humans (objects). Storing users' (objects') visits to different places can be leveraged to gain new insights into how users move from one place to another. In the paper, the authors propose techniques to find popular places by analyzing a given set of trajectories for n different entities over \tau time steps. A popular place is one that is visited by at least k different entities. To this end, the authors define two models for the problem, (1) a discrete model and (2) a continuous model, as follows:

Discrete Model: A popular place is defined as a unit square that is visited by at least k different entities. An entity i visits a unit square \sigma if, at some time step t (1 ≤ t ≤ \tau), i is within the spatial extent of \sigma.

Continuous Model: A popular place is defined as a unit square that is intersected by the trajectories of at least k different entities.

The problem of finding the most popular place in the discrete model is defined as follows:
Input: A set of n entities moving in the 2D plane for \tau time steps, forming a set of \tau n points in space.
Output: The axis-aligned unit square \sigma that is visited by the highest number of entities.

Finding the most popular place in the discrete model can be approximated by employing a colored range counting algorithm: exhaustively issue a colored range counting query for the unit square around each of the \tau n points. This yields a 2-approximation algorithm, which requires O(\tau^2 n^2 \log^2 \tau n) time and space.
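As a sanity check on what this baseline computes, here is a brute-force sketch of my own (not the paper's data-structure-based algorithm): center a unit square at every input point and count the distinct entities inside it.

def most_popular_unit_square(points):
    # points: list of (x, y, entity_id) tuples, one per entity per time step.
    # For every input point, count the distinct entities whose points fall
    # inside the axis-aligned unit square centered at it, and keep the best.
    # Brute force, O((tau * n)^2) time; the best count found this way is
    # within the approximation factor stated above.
    best_center, best_count = None, 0
    for (cx, cy, _) in points:
        entities = {e for (x, y, e) in points
                    if abs(x - cx) <= 0.5 and abs(y - cy) <= 0.5}
        if len(entities) > best_count:
            best_center, best_count = (cx, cy), len(entities)
    return best_center, best_count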

The authors first propose a baseline algorithm that sweeps a vertical strip over the given \tau n points. A start event happens when a point p, belonging to entity \Lambda, enters the sweep strip, and an end event happens when a point p, belonging to entity \Lambda, is about to leave the sweep strip. For each event, we examine the rectangle R_p of width 1 and height 2 in which point p is the midpoint of the right side of R_p. The algorithm sweeps a unit square over R_p while keeping track of the number of entities in the sweep square in order to find the most popular unit square. The algorithm requires O(\tau n \log \tau n) time per event, as the number of points in R_p is O(\tau n). That means the naive algorithm requires O(\tau^2 n^2 \log \tau n) time to find the most popular unit square.

The authors then present a more efficient technique for the discrete model. They build a set of y-intervals I_\Lambda of length one for each entity \Lambda. Each interval I corresponds to a point p = (x_p, y_p) and is equal to [y_p - 1/2, y_p + 1/2]. The most popular unit square is the one whose center is contained in as many intervals as possible, where the counted intervals belong to different entities.

To this end, the proposed technique relies on two data structures to maintain the set of intervals for each event (point) during the sweep, (1) B_\Lambda and (2) T_\Lambda, described as follows:

B_\Lambda: A balanced binary search tree storing all points of entity \Lambda ordered by their y-coordinates. B_\Lambda can be queried and maintained in O(\log \tau) time (e.g., using red-black trees), where \tau is the number of points of entity \Lambda.

T_\Lambda: A tree that stores all intervals in I_\Lambda; to this end, it stores the endpoints of all intervals in I_\Lambda. The authors prove that maintaining T_\Lambda requires O(\log \tau) time per update and O(\tau) space.

Throughout the sweep, the set of all B_\Lambda trees \cal{B}^{ent} and the set of all T_\Lambda trees \cal{T}^{ent} can be maintained in O(\tau n \log \tau) time and O(\tau n) space. The authors then explain how a single data structure T_{int} can be maintained for all entities during the sweep. T_{int} is a tree that stores all the intervals belonging to all entities; it stores the start and end points of the intervals in \cal{I} = \cup_\Lambda I_\Lambda.
The main algorithm finds the leaf node of T_{int} that stabs the highest number of intervals. Since the intervals of a single entity are disjoint, the intervals stabbed by a leaf node are guaranteed to belong to different entities. The authors prove that the most popular place can be reported using T_{int} in O(\tau n \log \tau n) time and O(\tau n) space.

Recently, I presented a paper at ICDE 2012 titled "LARS: A Location-Aware Recommender System". The paper is joint work with Justin Levandoski, Ahmed Eldawy, and Mohamed Mokbel. The main idea of the paper is to promote the spatial locations of both users and items to first-class citizens in recommender systems. Here is the abstract:

This paper proposes LARS, a location-aware recommender system that uses location based ratings to produce recommendations. Traditional recommender systems do not consider spatial properties of users nor items; LARS, on the other hand, supports a taxonomy of three novel classes of location-based ratings, namely, spatial ratings for non-spatial items, non-spatial ratings for spatial items, and spatial ratings for spatial items. LARS exploits user rating locations through user partitioning, a technique that influences recommendations with ratings spatially close to querying users in a manner that maximizes system scalability while not sacrificing recommendation quality. LARS exploits item locations using travel penalty, a technique that favors recommendation candidates closer in travel distance to querying users in a way that avoids exhaustive access to all spatial items. LARS can apply these techniques separately, or together, depending on the type of location-based rating available. Experimental evidence using large-scale real-world data from both the Foursquare location-based social network and the MovieLens movie recommendation system reveals that LARS is efficient, scalable, and capable of producing recommendations twice as accurate compared to existing recommendation approaches.

I am really proud of the work in this paper, and I hope its findings prove useful in practice. We will be demonstrating the research ideas presented in the paper at SIGMOD 2012, using a location-based social networking system (called Sindbad) developed at the University of Minnesota.

My paper "RecStore: An Extensible and Adaptive Framework for Online Recommender Queries inside the Database Engine" has been recently accepted to the International Conference on Extending Database Technology. The paper is a joint work with Justin J. Levandoski, Mohamed F. Mokbel, and Micheal D. Ekstrand. Here is the paper abstract:

Most non-trivial recommendation methods (e.g., collaborative filtering) consist of (1) a computationally intense offline phase that computes a recommender model based on users' opinions of items, and (2) an online phase consisting of SQL-based queries that use the model (generated offline) to derive user preferences and provide recommendations for interesting items. Current application usage trends require a completely online recommender process, meaning the recommender model must update in real time as new opinions enter the system. To tackle this problem, we propose RecStore, a DBMS storage engine module capable of efficient online model maintenance. Externally, models managed by RecStore behave as relational tables, thus existing SQL-based recommender queries remain unchanged while gaining online model support. RecStore maintains internal statistics and data structures aimed at providing efficient incremental updates to the recommender model, while employing an adaptive strategy for internal maintenance and load shedding to realize a tuneable tradeoff for more efficient updates or query processing based on system workloads. RecStore is also extensible, supporting a declarative syntax for defining recommender models. The efficacy of RecStore is demonstrated by providing the implementation details of five state-of-the-art collaborative filtering models. Performance benefits are demonstrated through extensive experimental evaluation of a prototype of RecStore, built inside the storage engine of PostgreSQL, using a real-life recommender system workload.

I will be presenting the paper at EDBT 2012 in Berlin, Germany, on March 27th, 2012. I will be glad to receive your comments.

Here, I present the results of simulating the check-in process in a location-based social network. To this end, I have written a Python script using SimPy, a Python-based simulation framework; the script is given below:


#!/usr/bin/env python
""" Simulate the Users visits to spatial locations in a social network
"""
from __future__ import generators # not needed for Python 2.3+
from SimPy.Simulation import *
from random import gauss, shuffle, randint, choice, seed, random, Random, expovariate

class Generator(Process):
    """ generates a set of users """
    def __init__(self, maxNumberOfUsers):
        Process.__init__(self)
        self.name = "generator"
        self.maxNumberOfUsers = maxNumberOfUsers

    def execute(self):
        users = []
        userscopy = []
        for i in range(self.maxNumberOfUsers):
            j = User(i)
            users.append(j)
            userscopy.append(j)

        # Assign each user a Gaussian-distributed number of friends.
        for user in users:
            shuffle(userscopy)
            numoffriends = gauss(numoffriends_mean, numoffriends_sd)
            for k in range(int(numoffriends)):
                if userscopy[k] != user:
                    user.addfriend(userscopy[k])

        for user in users:
            activate(user, user.execute())
        yield hold, self, 1.0
        self.trace("WARNING generator finished -- ran out of Users")

    def trace(self, message):
        if GTRACING: print "%7.4f \t%s" % (now(), message)


class User(Process):
    def __init__(self, i):
        Process.__init__(self)
        self.i = i
        self.name = "User" + `i`
        self.friends = []
        self.visits = []

    def getID(self):
        return self.i

    def addfriend(self, friend):
        self.friends.append(friend)

    def getvisits(self):
        return self.visits

    def getfriendsvisits(self):
        allvisits = []
        for f in self.friends:
            allvisits.extend(f.getvisits())
        return allvisits

    def getrandomfriendsvisit(self, allvisits):
        randomvisit = ""
        if len(allvisits) == 0:
            randomvisit = randint(0, maxnumvisits)
        else:
            self.trace("list is not empty")
            randomvisit = choice(allvisits)
        return randomvisit

    def execute(self):
        global busyStartTime, totalBusyVisits, totalBusyTime
        global Nfree, busyEndTime, Jrv
        self.trace("User " + `self.i` + " started")
        while now() <= simulationtime:
            visitedlocation = ""
            Y = random()
            if Y <= correlation:
                # With probability 'correlation', repeat a friend's visit.
                allvisits = self.getfriendsvisits()
                visitedlocation = self.getrandomfriendsvisit(allvisits)
                self.trace("A friend visit " + `visitedlocation`)
            else:
                # Otherwise, check in at a random spatial location.
                visitedlocation = randint(0, maxnumvisits)
                self.trace("A random visit " + `visitedlocation`)

            visitshist[visitedlocation] += 1
            self.visits.append(visitedlocation)
            # visitsrate = gauss(visitsrate_mean, visitsrate_sd)
            visitsrate = visitsrate_mean
            yield hold, self, 1 / visitsrate

    def trace(self, message):
        if TRACING: print "%7.4f \t%s %s " % (now(), message, self.name)


""" Begin Main"""
simulationtime = 1000
correlation = 0.5
maxnumvisits = 1023
visitshist = []
for i in range(maxnumvisits+1):
    visitshist.append(0)
numoffriends_mean = 100
numoffriends_sd = 30
maxNumberOfUsers = 10000
visitsrate_mean = 8.0
visitsrate_sd = 0.2
TRACING = 0
GTRACING = 0

g = Generator(maxNumberOfUsers)
activate(g,g.execute())
simulate(until=simulationtime)

visitshist.sort()
print visitshist
print "Simulation ended"

In the simulation, there are six main independent variables (inputs), explained as follows: (1) Number of Spatial Locations (NSL): the total number of spatial locations that any user can visit at any time. (2) Average User Popularity (AUP): the average number of friends per user (following a Gaussian distribution) over the population of all users in the social network. (3) Number of Users (NU): the total number of users in the social network. (4) Average Visit Rate (AVR): the average rate (following a Gaussian distribution) at which each user visits (checks in at) a random spatial location, measured in visits per day. (5) Mobility-Friendship Correlation ratio (MFC): a value from 0 to 1 that determines the amount of correlation between being friends and visiting the same places. (6) Simulation Time (ST): the total time (i.e., number of days) over which we run the simulation.
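For reference, this is how I read the independent variables onto the globals set at the end of the script above (a mapping for orientation, not new functionality):

# My reading of how the independent variables map onto the script's globals.
independent_variables = {
    'NSL (number of spatial locations)':      'maxnumvisits + 1 (locations 0..1023)',
    'AUP (average user popularity)':          'numoffriends_mean (std dev numoffriends_sd)',
    'NU (number of users)':                   'maxNumberOfUsers',
    'AVR (average visit rate, visits/day)':   'visitsrate_mean',
    'MFC (mobility-friendship correlation)':  'correlation',
    'ST (simulation time, days)':             'simulationtime',
}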

Given the aforementioned independent variables, four dependent variables are measured: (1) Number of Visits (NV[i]): the number of visits to a single spatial location; each spatial location i (where i ranges from 1 to the Number of Spatial Locations (NSL)) receives a number of user check-ins NV[i]. (2) Minimum Location Visits (Min): the number of visits (check-ins) at the spatial location with the lowest number of user visits. (3) Maximum Location Visits (Max): the number of visits (check-ins) at the spatial location with the highest number of visits. (4) Location Visits Range (Range): the absolute value of the difference between the Maximum Location Visits (Max) and the Minimum Location Visits (Min).

In the experiments, three main independent variables are considered: (1) Average User Popularity (AUP), (2) Average Visit Rate (AVR), and (3) Mobility-Friendship Correlation ratio (MFC). Each of the three is varied in turn while keeping the other variables constant. We ran three main simulation experiments, measuring the effect of each independent variable as follows:

Experiment 1 (Effect of varying Average User Popularity (AUP)): In this experiment, we assign the constant values of 0.5 and 0.6667 to the Mobility-Friendship Correlation ratio (MFC) and the Average Visit Rate (AVR), respectively. We then measure the effect of the Average User Popularity (AUP), taking the values 40, 80, 160, 320, 640, 1280, and 2560, on the Number of Visits (NV[i]) for each spatial location i such that 1 ≤ i ≤ NSL. I also measured the effect of AUP on both the Minimum Location Visits (Min) and the Maximum Location Visits (Max). The following figure shows the values of both Min and Max when varying AUP. From the figure, we can see that Min decreases slightly while Max increases substantially. This trend is explained by the fact that the more popular the users who visit a certain place are, the higher the probability that their friends (and the friends of their friends, and so on through the social graph) will visit the same place.



popularity.png

Experiment 2 (Effect of varying Average Visit Rate (AVR)): In this experiment, we assign the values of 0.5 and 100 to the Mobility-Friendship Correlation ratio (MFC) and the Average User Popularity (AUP), respectively. We then measure the effect of the Average Visit Rate (AVR), taking the values 0.5, 1.0, 2.0, 4.0, and 8.0, on the same dependent variables measured in Experiment 1 (i.e., NV[i], Min, and Max). The following figure shows the values of both Min and Max when varying AVR. It turns out that Min and Max increase in approximately the same proportion, which is intuitive, as the total number of visits to each place increases within the same simulation time period.

visitrate.png




Experiment 3 (Effect of varying Mobility-Friendship Correlation ratio (MFC)): In this experiment, we assign the values of 0.5 and 100 to the Average Visit Rate (AVR) and the Average User Popularity (AUP), respectively. We then measure the effect of varying the Mobility-Friendship Correlation ratio (MFC), taking the values 0.0, 0.2, 0.4, 0.8, and 1.0, on the same dependent variables measured in Experiments 1 and 2 (i.e., NV[i], Min, and Max). The following figure shows both Min and Max when varying MFC. The impact is quite intuitive: Max increases substantially as MFC increases, since some places attract more and more attention (visits), while Min decreases slightly, since places that were not visited by many users early on attract even fewer visitors later, as users' friends have not visited them before. The results of this experiment show how the friendship-to-mobility correlation could be exploited to influence users to visit specific places.

correlation.png