This blog post was requested by the module coordinators. It will cover:

  • Rate limit & a possible solution
  • Approaches to data prioritisation
  • Conclusion

Rate limit & a possible solution

As mentioned in the blog post of week 3, the Twitter REST API has a rate limit, which means we may not be able to acquire enough data for the Recommender System. We also did a bit of research on the Twitter Streaming API, but found that it can only capture live data; to get the historical data of users, we still have to use the REST API.

In order to overcome the rate limit, we thought about using multiple keys. Basically, we predefined a list of keys, as shown below:

key01 = {'consumer_key': ' ',
         'consumer_secret': ' ',
         'access_token': ' ',
         'access_secret': ' '}
...

keys = [key01,key02,key03,key04,key05...]

Then we defined a method that handles the authentication.

import tweepy

def auth(token):
    auth = tweepy.OAuthHandler(token['consumer_key'], token['consumer_secret'])
    auth.set_access_token(token['access_token'], token['access_secret'])
    return tweepy.API(auth)

Basically, the idea of this approach is that when the program catches a RateLimitError exception, it automatically switches to the next key. Here is an example:

As we can see, @JOEdotie has 265.7K followers. If we want to get all the ids of the followers, that is far beyond the rate limit.

  • @JOEdotie has 265,700 followers.

  • Only 15 requests are allowed per 15-minute window, and each request can fetch up to 5,000 ids, so a single key can get 15 * 5,000 = 75,000 ids in 15 minutes.

  • 265,700 / 75,000 ≈ 3.5

  • So to fetch all 265,700 ids within one 15-minute window, we will need 4 keys.
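
As a quick sanity check, here is the same arithmetic in code (the numbers are taken from the bullets above):

import math

FOLLOWERS = 265700           # approximate follower count of @JOEdotie
IDS_PER_REQUEST = 5000       # max ids per GET followers/ids request
REQUESTS_PER_WINDOW = 15     # requests allowed per 15-minute window per key

ids_per_key = IDS_PER_REQUEST * REQUESTS_PER_WINDOW    # 75,000
keys_needed = math.ceil(FOLLOWERS / float(ids_per_key))  # ceil(3.54...) = 4

With 4 keys prepared, the full fetch loop with key switching looks like this: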

total = []

# cursor for marking which follower ids we have already fetched
next_cursor = -1

# i indicates which key is currently in use
i = 0

api = auth(keys[i])

# when all the follower ids have been fetched, the cursor will be 0
while next_cursor != 0:
    try:
        followers_ids = api.followers_ids(screen_name='JOEdotie',count=5000,cursor=next_cursor)
        total = total + followers_ids[0]
        next_cursor = followers_ids[1][1]
    except tweepy.RateLimitError:
        # switch to the next key
        i = i + 1
        api = auth(keys[i])

len(total)

# Result: 265746

So we have fetched all the ids of the users who are following @JOEdotie.

However, this approach has several issues to consider.

  • If we have many users, we would probably need a lot of keys, and we cannot create all of them manually; it would be too time-consuming.

  • The rate limit varies between different methods in the REST API, so handling the key switching correctly for every endpoint may be difficult (the sketch after this list shows how the per-endpoint limits can be inspected).

  • Unfortunately, after we came up with this idea, we found that it is against the Twitter Developer Policy.
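
On the second point, the REST API does let us inspect the per-endpoint limits, so in principle the program could check how many calls a key has left before using it. Here is a minimal sketch using tweepy's rate_limit_status; it assumes api is an authenticated tweepy.API instance as returned by auth() above:

# check the remaining calls for GET followers/ids on the current key
status = api.rate_limit_status()

# the response groups endpoints by resource family, e.g.
# status['resources']['followers']['/followers/ids'] ->
#   {'limit': 15, 'remaining': 7, 'reset': 1464706800}
followers_ids_limit = status['resources']['followers']['/followers/ids']
print("Remaining calls: %s" % followers_ids_limit['remaining'])
print("Window resets at (unix time): %s" % followers_ids_limit['reset'])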

So we are going to set the rate limit issue aside until we have decided exactly how much data we need.

Approaches to data prioritisation

Next, we considered whether we could find any approach to prioritising the data.

For example, if an account does not tweet anymore, we will not have to fetch data from that account.

So here are some approaches:

  • 1. Find the tweets that a user liked

If a user has a habit of liking tweets, then this approach will work. Basically, the Twitter REST API has a method (GET favorites/list) that fetches the most recent tweets a user has liked. If we can get these tweets, we will be able to tell which tweets or which accounts the user is more interested in.

from collections import Counter

# get the tweets that the user liked
favorites = [ status._json for status in tweepy.Cursor(api.favorites).items()]

# get the screen names of the authors of the liked tweets
favorite_screen_names = [favorite['user']['screen_name'] for favorite in favorites]

# count how many liked tweets come from each screen name
favorite_count = Counter(favorite_screen_names)
favorites_counts = favorite_count.most_common()
favorites_counts[:3]

# Result:
[('officialgaa', 27),
 ('uachtaranclg', 3),
 ('ArianaGrande', 2)]

So, as the result above shows, the user seems to be most interested in the account @officialgaa, so tweets from @officialgaa should be fetched more often.
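
As a rough illustration of how these counts could feed into prioritisation (the simple normalised-weight scheme below is our own placeholder idea, not a fixed design), we could convert them into per-account fetch weights:

# turn the (screen_name, count) pairs computed above into normalised weights
total_likes = sum(count for _, count in favorites_counts)

fetch_weights = {name: count / float(total_likes) for name, count in favorites_counts}

# accounts the user likes more often get a larger share of the request budget
fetch_weights['officialgaa']   # e.g. 27 / total_likes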

  • 2. Filter out the accounts that are not active

If an account has not tweeted for years, we may not need to take it into consideration, so we don't have to fetch data from that account.

The REST API has a method, GET friends/list, which returns a list of user objects for the accounts that a user is following.

We will take this as an example:

Each user object contains the user's most recent tweet (if there is one), which has a created_at attribute, so we can tell when the user last tweeted.

import datetime
from dateutil import parser

# return a list of user objects for the accounts that @xinqili123 is following
friends = [ friend._json for friend in tweepy.Cursor(api.friends,count=200,screen_name='xinqili123').items()]

# create a cutoff date, e.g. 7 days ago
date = datetime.datetime.now() - datetime.timedelta(days=7)

# filter the user objects:
# if an account's most recent tweet was created before the cutoff, print its screen_name
# and the date of the last tweet (accounts that have never tweeted are skipped)
for friend in friends:
    if 'status' not in friend:
        continue
    dt = parser.parse(friend['status']['created_at'])
    if dt.date() < date.date():
        print("Screen Name: %s, Date of Recent tweets: %s" % (friend['screen_name'],dt.date()))
        
# Result: 
Screen Name: RadiantGames_is, Date of Recent tweets: 2016-05-21
Screen Name: hjaltalinband, Date of Recent tweets: 2016-05-22
Screen Name: jsconfis, Date of Recent tweets: 2016-05-25
Screen Name: arnif, Date of Recent tweets: 2016-05-26
Screen Name: WaterfordCityCt, Date of Recent tweets: 2016-05-10
...

Conclusion

So far we have confirmed that we can technically use multiple keys to fetch data beyond the rate limit. However, this method is against the Developer Policy, which means we risk being blocked by Twitter.

We have also started thinking about prioritising the data in order to filter out useless data. However, each approach has its pros and cons. For example, if a user does not have the habit of liking tweets, the first approach, which fetches the tweets a user has liked, will not work. Therefore, we should combine different prioritising approaches in order to find an optimised prioritisation solution.
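
To make that last point concrete, here is a rough sketch of what a combined step could look like; the function name and the simple rule of "rank by liked count, but only keep recently active accounts" are our own placeholder assumptions:

import datetime
from dateutil import parser

def prioritise(friends, favorite_count, max_inactive_days=7):
    # friends        -- list of user objects (dicts) from GET friends/list
    # favorite_count -- Counter mapping screen_name to number of liked tweets
    cutoff = datetime.datetime.now() - datetime.timedelta(days=max_inactive_days)
    ranked = []
    for friend in friends:
        status = friend.get('status')
        if status is None:
            continue  # the account has never tweeted, skip it
        if parser.parse(status['created_at']).date() < cutoff.date():
            continue  # the account is inactive, skip it
        score = favorite_count.get(friend['screen_name'], 0)
        ranked.append((friend['screen_name'], score))
    # among the recently active accounts, the ones the user likes most come first
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)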

For the data source, our next steps will be:

  • Keep trying to find ways to overcome the rate limit.
  • Find more prioritising approaches.
  • Establish a basic Recommender System to make it clearer exactly how much data we need.

- Team λ Lovelace