M’interessa: Learning by choosing

Presentation

Presentation of the M’interessa project

Model assessment

The M’interessa project is, in essence, an information retrieval solution built on a binary classifier. Its model must therefore be assessed with information retrieval metrics.
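As an illustration, the sketch below computes three common information retrieval metrics for a binary classifier: precision, recall, and F1. The excerpt does not say which metrics the project uses, so this choice is an assumption, and the function name and sample labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, where 1 = interesting."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: 1 = the user found the tweet interesting.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Precision answers "how many of the tweets we recommended were actually interesting?", while recall answers "how many of the interesting tweets did we recommend?". F1 is their harmonic mean.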

Feature extraction

The features extracted from each tweet and used in the model are based on the API reference page for Tweet objects (https://dev.twitter.com/overview/api/tweets). Each Tweet carries the following attributes:

Data pipeline

The process that the data follows is the one shown in the diagram:

Kafkian: or how we deploy and set up our Kafka broker

What is Apache Kafka?

Apache Kafka is an industry-standard solution for building real-time data pipelines that span several subsystems. It is a queuing system based on a publish-subscribe (PubSub) model, where producers publish messages to one or more topics and consumers subscribe to topics. Each topic is similar to a queue, although reading a message does not remove it: Kafka retains messages for a configurable period, and each consumer keeps track of its own position in the topic. Kafka comes with built-in replication and offers high scalability, which is why we generally talk about interacting with a Kafka cluster.
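The publish-subscribe idea can be sketched with a toy in-memory model. This is not the Kafka client API and involves no real broker: the `Broker` and `Consumer` classes and the `tweets` topic name are hypothetical, made up for illustration. In this sketch each consumer keeps its own read position per topic, which mirrors how Kafka consumers track their progress with offsets.

```python
from collections import defaultdict


class Broker:
    """Toy stand-in for a broker: one append-only message list per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        # Producers append to the end of the topic.
        self._topics[topic].append(message)

    def read(self, topic, offset):
        # Return every message at or after the given position.
        return self._topics[topic][offset:]


class Consumer:
    """Toy consumer that remembers how far it has read in each topic."""

    def __init__(self, broker, topics):
        self._broker = broker
        self._offsets = {topic: 0 for topic in topics}

    def poll(self):
        received = []
        for topic, offset in self._offsets.items():
            new_messages = self._broker.read(topic, offset)
            self._offsets[topic] = offset + len(new_messages)
            received.extend((topic, m) for m in new_messages)
        return received


broker = Broker()
first = Consumer(broker, ["tweets"])
second = Consumer(broker, ["tweets"])

broker.publish("tweets", "tweet-1")
broker.publish("tweets", "tweet-2")

first_batch = first.poll()    # both messages
second_batch = second.poll()  # the same two messages, read independently
empty_batch = first.poll()    # nothing new since the last poll
```

Because positions are tracked per consumer, several subsystems can read the same topic without interfering with one another.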

ETL from Kafka logs

Objective

The main objective is to fetch the data from Kafka and apply some (more or less sophisticated) filters to the gathered tweets in order to:

Scraping the web from Twitter

Objectives

The main objective is to analyze the web pages linked from each tweet to determine whether they are interesting to the user.
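A first step toward that goal is finding which pages a tweet links to. The sketch below assumes the tweet is already available as a parsed JSON dictionary in the shape of the Twitter Tweet object, where links appear under `entities.urls` with `expanded_url` and `url` fields. The helper name `linked_urls` and the sample tweet are hypothetical, and the excerpt does not say how the project itself collects the links.

```python
def linked_urls(tweet):
    """Return the distinct URLs mentioned in a tweet, in order of appearance."""
    urls = []
    for entry in tweet.get("entities", {}).get("urls", []):
        # Prefer the expanded form; fall back to the shortened link.
        url = entry.get("expanded_url") or entry.get("url")
        if url and url not in urls:
            urls.append(url)
    return urls


# Hypothetical tweet containing a single link.
sample_tweet = {
    "text": "Worth reading https://t.co/abc123",
    "entities": {
        "urls": [
            {"url": "https://t.co/abc123",
             "expanded_url": "https://example.com/article"},
        ]
    },
}
pages = linked_urls(sample_tweet)
```

Each URL returned this way can then be fetched and its content analyzed.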

Near duplicate detection

Twitter is full of duplicate or near-duplicate content (ND for short from here on). Flooding a user’s candidate tweet list with a bunch of items that have (almost) the same content can only lead to a bad user experience.
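One common way to spot ND content is to compare word shingles with Jaccard similarity. The sketch below uses that technique purely as an illustration. The excerpt does not say which method the project uses, and the function names, the shingle size `k=3`, and the `0.5` threshold are all assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in a text (lowercased)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        # Texts shorter than k words become a single shingle (or none).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.5, k=3):
    """Flag two texts as ND when their shingle sets overlap enough."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


original = "big news about the new Kafka release today"
retweet = "RT @user: big news about the new Kafka release today"
unrelated = "the weather is lovely in Barcelona this morning"

similarity = jaccard(shingles(original), shingles(retweet))
```

Comparing every pair of tweets this way grows quadratically with the number of tweets, so a larger candidate list would likely need an approximate scheme on top of it.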
