Blog

Jul 13, 2016

Presentation of the M’interessa project

Jul 12, 2016

Model assessment

The M’interessa project is, in essence, an information retrieval solution built on a binary classifier. Information retrieval metrics are therefore the natural way to assess the project's model.
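As an illustration of the kind of metrics meant here, the following minimal sketch computes precision, recall and F1 for a binary classifier. It assumes that label 1 means "interesting" and 0 means "not interesting"; the function name and the sample labels are hypothetical and are not taken from the project.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant and retrieved
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # retrieved but not relevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant but missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical ground truth and predictions for six tweets.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Precision measures how many of the tweets shown to the user were actually interesting, while recall measures how many of the interesting tweets were shown at all; F1 balances the two.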

Jul 7, 2016

Feature extraction

The features extracted from each tweet and used in the model are based on the API reference page for Tweet objects (https://dev.twitter.com/overview/api/tweets). Each Tweet has the following attributes:

Jul 7, 2016

Data pipeline

The data follows the process shown in the diagram:

Jul 4, 2016

Kafkian: or how we deploy and set up our Kafka broker

What is Apache Kafka?

Apache Kafka is an industry-standard solution for creating real-time data pipelines that involve several subsystems. It is a queuing system based on a publish-subscribe (PubSub) model, where producers publish messages to one or more topics and consumers subscribe to topics. Each topic is similar to a queue, although reading a message does not remove it: each consumer keeps track of its own position in the topic, and messages are retained according to the topic's retention policy. Kafka comes with built-in replication and offers high scalability, so in general we talk about interacting with a Kafka cluster.
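The publish-subscribe model can be pictured with a toy, in-memory sketch. This is not the Kafka client API: `ToyBroker`, the consumer names and the `tweets` topic are hypothetical, and partitions, consumer groups and replication are deliberately left out. The sketch assumes, as in Kafka, that a topic is an append-only log and that each consumer tracks its own read position (offset).

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: each topic is an append-only list."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> list of messages
        self.offsets = defaultdict(int)   # (consumer, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the end of the topic's log."""
        self.topics[topic].append(message)

    def poll(self, consumer, topic):
        """Return the messages this consumer has not seen yet and advance its offset."""
        log = self.topics[topic]
        start = self.offsets[(consumer, topic)]
        self.offsets[(consumer, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("tweets", "first tweet")
broker.publish("tweets", "second tweet")

# Two independent consumers each receive every message.
etl_batch = broker.poll("etl", "tweets")
scraper_batch = broker.poll("scraper", "tweets")

# A later poll only returns what was published since the previous one.
broker.publish("tweets", "third tweet")
etl_next = broker.poll("etl", "tweets")
```

Because the log is never modified by a read, adding a new subsystem to the pipeline only means adding a new consumer with its own offset.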

Jul 1, 2016

ETL from Kafka logs

Objective

The main objective is to fetch the data from Kafka and apply some (more or less sophisticated) filters to the gathered tweets in order to:

Jun 29, 2016

Scraping the web from Twitter

Objectives

The main objective is to analyze the web pages linked from a tweet to determine whether they are interesting for the user.

Jun 20, 2016

Near duplicate detection

Twitter is full of duplicated or near-duplicated content (ND for short). Flooding a user’s candidate tweet list with a bunch of items that have (almost) the same content can only lead to a bad user experience.
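One common way to detect near duplicates, shown here only as a sketch and not necessarily the technique used in the project, is to compare the sets of word shingles of two tweets with the Jaccard similarity. The function names, the shingle size `k=3`, the `0.5` threshold and the sample tweets are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text, lowercased and with URLs stripped."""
    text = re.sub(r"https?://\S+", "", text.lower())
    words = re.findall(r"\w+", text)
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as a single one.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(t1, t2, threshold=0.5):
    """True when the shingle sets of both tweets overlap at least `threshold`."""
    return jaccard(shingles(t1), shingles(t2)) >= threshold


original = "New post about near duplicate detection http://t.co/abc"
retweet = "RT New post about near duplicate detection http://t.co/xyz"
unrelated = "Lovely weather in Barcelona this morning"

similarity = jaccard(shingles(original), shingles(retweet))
print(similarity, is_near_duplicate(original, retweet))
print(is_near_duplicate(original, unrelated))
```

Stripping URLs before shingling matters on Twitter, since shortened links differ even when the surrounding text is identical.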

M'interessa Learning by choosing
