This blog post was requested by the module coordinators; it will cover:

  • What are you building?
  • What are the use cases?
  • What data sources will you use?
  • What technologies will you use?


The Product

We are building a collaborative recommender system for tweets: a personalised tweet stream. The name of both the product and our team is λ Lovelace.

A Twitter user installs our iOS mobile client and grants us API access. Tweets from followed accounts that are not of interest (e.g. political rants) are filtered out (or deferred until later), while interesting tweets are prioritised in the timeline. Tweets from accounts the user does not follow may be suggested as well. Essentially we hope to create a better, more personalised timeline than what Twitter provides by default. Our iOS app observes the user's engagements (opening, liking, time in focus, etc.) and sends the information to the recommender back-end to refine future recommendations.

The mobile app is required in order to collect additional user preference information that refines the recommendations: for example, whether a user clicks a link in a tweet, likes a tweet, retweets, or engages in a conversation. Another potential passive observation mechanism would be to have the client measure how long a tweet is visible; the thinking is that a tweet in focus for longer might be of more interest than one scrolled past quickly.
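As a rough illustration, an engagement observation sent from the client to the back-end might look something like the record below. The field names and values are placeholders; we have not settled on a schema yet.

import json
import time

# Hypothetical engagement event the iOS client could send to our back-end.
# All field names are placeholders; the real schema is still undecided.
engagement_event = {
    "user_id": 783214,             # the Twitter user our client acts on behalf of
    "tweet_id": 20,                # the tweet that was engaged with
    "action": "link_click",        # e.g. "like", "retweet", "reply", "link_click"
    "seconds_in_focus": 4.2,       # passive observation: how long the tweet was visible
    "timestamp": int(time.time()),
}

print(json.dumps(engagement_event, indent=2))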

Our contributions, or our novelty if you will, are as follows:

  1. Filter out uninteresting tweets (or defer to later)
  2. Collect additional user preferences in a mobile app
  3. Show interesting tweets from non-followers

These are ordered by priority: we will first strive to implement tweet filtering, then data collection in the mobile app, and if things go well we will try to introduce tweets from accounts the user does not follow that might be of interest.
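To make the first priority concrete, the sketch below partitions a timeline into tweets shown now and tweets deferred to later. The interest_score function and the threshold are stand-ins; the real scores will come from the recommender back-end.

INTEREST_THRESHOLD = 0.5  # placeholder cut-off, to be tuned later

def interest_score(tweet):
    """Stand-in for the recommender's predicted interest in a tweet."""
    return 0.9 if "python" in tweet["text"].lower() else 0.1

def split_timeline(tweets):
    """Split tweets into those shown now and those deferred to later."""
    shown, deferred = [], []
    for tweet in tweets:
        (shown if interest_score(tweet) >= INTEREST_THRESHOLD else deferred).append(tweet)
    return shown, deferred

timeline = [
    {"text": "A new Python 3 data science library was released today"},
    {"text": "Yet another rant about the general election"},
]
shown, deferred = split_timeline(timeline)
print(len(shown), len(deferred))  # 1 1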

Use Cases

A typical use case is a Twitter power user. The user follows a lot of accounts, so the home timeline is overwhelming and there is not enough time to read every tweet.

Take for example power user Susan. She is a software engineer and most of the accounts she follows on Twitter are programming or tech related. For the most part the tweets are relevant, but tweets about the general election in the USA are of no concern to Susan and therefore a waste of her time. The same applies to other off-topic subjects.

There are Twitter clients out there that allow users to mute or block certain words, hashtags, or accounts. However, over the long haul this is a lot of manual effort.
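For comparison, a word-based mute boils down to something like the sketch below. The mute list is made up and has to be curated by hand, which is exactly the ongoing effort we want to avoid.

# Hand-maintained mute list, as existing clients require
muted_terms = {"election", "#debate"}

def is_muted(tweet_text):
    """Return True if the tweet contains any manually muted term."""
    text = tweet_text.lower()
    return any(term in text for term in muted_terms)

print(is_muted("Live coverage of the election tonight"))        # True
print(is_muted("Release notes for the new compiler are out"))   # False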

Data Source

Our only data source so far is the Twitter API. Other data sources may come later but as of now we suspect it will remain the only data source for the remainder of the project.

So far we have been interfacing with the API via the third-party library Tweepy. It has been fairly good, but it is missing some features. For example, Tweepy does not take muted users into account in the home timeline, which means we have to make extra REST API requests and filter the home timeline ourselves. At the time of writing we are considering switching to the REST API altogether.
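For reference, the snippets in this post assume an authenticated api object roughly like the one below (the keys are placeholders). The manual mute filtering then costs one extra request for the muted IDs; we believe Tweepy exposes this as mutes_ids, otherwise the GET mutes/users/ids endpoint can be called directly.

import tweepy

# Placeholder credentials from our Twitter app registration
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

# Extra request for muted user IDs, then filter the home timeline ourselves
muted_ids = set(api.mutes_ids())
home = [status for status in api.home_timeline(count=200)
        if status.user.id not in muted_ids]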

In our work we have observed that Twitter's API rate limits will affect us. For example, here are some of the key limits we face:

  • Request rate limit: 15 or 180 requests per 15-minute window, depending on the endpoint
  • Home timeline: at most the 800 most recent tweets
  • User tweets & retweets: at most 3,200 tweets, fetched 200 per request

The rate limits are summarized further here. We have thought of a few different ways to account for them, for example by batching requests over a prolonged period. But generally we see a need to cache tweets ourselves so that the recommender system has unrestricted access to tweets.
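One simple mitigation we have tried is letting Tweepy sleep through the 15-minute windows for us instead of batching by hand, combined with a naive in-memory cache. A minimal sketch, reusing the auth handler from the earlier snippet:

import tweepy

# wait_on_rate_limit makes Tweepy sleep until the window resets
# instead of raising an error when a limit is hit.
api = tweepy.API(auth, wait_on_rate_limit=True)

# Naive in-memory cache keyed by tweet ID so nothing is fetched twice
tweet_cache = {}
for status in tweepy.Cursor(api.home_timeline, count=200).items(800):
    tweet_cache[status.id] = status

print(len(tweet_cache))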

Here are some workarounds we experimented with for the rate limits. The code below creates a list of the user IDs that a particular account follows:

import tweepy

# api is an authenticated tweepy.API instance (see the set-up sketch above)
friends = [friend for friend in tweepy.Cursor(api.friends_ids).items()]
len(friends)

# Result: 16

The code below fetches 200 tweets from a friend per request until the 2000 newest tweets have been collected. It does this for all the friends, so we end up with a list of lists of tweets from our friends.

tweets_results = []
for friend in friends:
    # Up to the 2000 newest tweets per friend, fetched 200 per request
    results = [status for status in
               tweepy.Cursor(api.user_timeline, id=friend, count=200).items(2000)]
    tweets_results.append(results)

for friend_tweets in tweets_results:
    print(len(friend_tweets))

""" 
Results: 
2000
2000
2000
2000
2000
2000
2000
2000
848
2000
2000
2000
2000
2000
2000
"""

The printed results show that 15 of the 16 friends had at least 2000 tweets that could be fetched; one friend had only made 848 tweets.

Technology Stack

Below are some of the technical decisions we've made so far. Please note that we do not consider them binding; we are fully prepared to switch languages or stacks mid-project if we believe it will suit us better.

For the back-end we'll strive to use Python 3 as much as we can, though for some parts it may be necessary to fall back to Python 2.7. For the recommender system we aim to rely on Python's data science libraries. For interacting with the Twitter API we have been using the library Tweepy.

Python is not the fastest language on the block, so we've pondered the possibility of dipping into Rust for performance-critical parts if there are gains to be had. We'll see.

As for the database, we have not entirely made up our minds. PostgreSQL comes to mind for general storage. The Twitter API has pretty restrictive rate limits, so it looks like we will need to store tweets ourselves; for that, key-value or document stores like Redis or RethinkDB come to mind.
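If we do go with PostgreSQL, the tweet cache could start out as a single table keyed by tweet ID. A minimal sketch with psycopg2; the connection settings and table layout are placeholders, since nothing has been provisioned yet.

import psycopg2

# Placeholder connection settings; we have not provisioned a database yet
conn = psycopg2.connect(dbname="lovelace", user="lovelace")
cur = conn.cursor()

# One row per cached tweet, keyed by the tweet's ID
cur.execute("""
    CREATE TABLE IF NOT EXISTS tweets (
        id         BIGINT PRIMARY KEY,
        author_id  BIGINT NOT NULL,
        text       TEXT   NOT NULL,
        created_at TIMESTAMPTZ NOT NULL
    )
""")
conn.commit()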

Project Management

In a previous post we covered our approach to project management. In summary, we looked at a lot of solutions and settled on ZenHub. ZenHub is a Chrome extension that augments our GitHub repository's web page with a Kanban board, story points on issues, epics, burndown charts, and more.

On behalf of λ Lovelace
- Jón Rúnar Helgason