Model assessment
The M’interessa project is, at its core, an information retrieval solution implemented as a binary classifier. Its model must therefore be assessed with information retrieval metrics.
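As a minimal sketch of what such an assessment can look like, the snippet below computes precision, recall and F1 for a binary classifier. The excerpt does not say which metrics the project actually uses, so the choice of these three, the function name `precision_recall_f1` and the sample labels are assumptions made for illustration only.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = interesting)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # interesting, predicted interesting
    fp = sum(1 for t, p in pairs if not t and p)    # not interesting, predicted interesting
    fn = sum(1 for t, p in pairs if t and not p)    # interesting, predicted not interesting

    # Guard against empty denominators (no positive predictions or no positive labels).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical labels: 1 means the tweet is interesting for the user.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Precision answers "how many of the recommended tweets were actually interesting", while recall answers "how many of the interesting tweets were recommended"; F1 is their harmonic mean.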
Feature extraction
The features extracted from each tweet and used in the model are based on the API reference page for Tweet objects (https://dev.twitter.com/overview/api/tweets). Each Tweet carries the following attributes:
Kafkian: or how we deploy and set up our Kafka broker
What is Apache Kafka?
Apache Kafka is an industry-standard solution for building real-time data pipelines that connect several subsystems. It is a queuing system based on a publish-subscribe (PubSub) model, where producers publish messages to one or more topics and consumers subscribe to topics. Each topic is similar to a queue that consumers read from in order; unlike a classic queue, Kafka keeps messages for a configurable retention period instead of removing them once they are consumed. It comes with built-in replication and offers high scalability, so in practice we talk about interacting with a Kafka cluster.
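The publish-subscribe model can be illustrated with a toy, in-memory broker. This is not Kafka's API and not the project's code: the class `ToyBroker`, its methods and the topic and group names are hypothetical. It only mimics two ideas, an append-only log per topic and a read position (offset) kept per consumer group, which is how Kafka lets several groups read the same messages independently.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker; a teaching sketch, not a Kafka client."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> append-only list of messages
        self._offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic log and return its offset."""
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        """Return the next unread messages for this group and advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


broker = ToyBroker()
broker.publish("tweets", "first tweet")
broker.publish("tweets", "second tweet")

etl_batch = broker.poll("etl", "tweets")        # both messages
etl_again = broker.poll("etl", "tweets")        # nothing new for this group
scraper_batch = broker.poll("scraper", "tweets")  # another group still sees both
```

Because reading only moves a group's offset, the messages stay in the log and remain available to any other group that subscribes to the same topic.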
ETL from Kafka logs
Objective
The main objective is to fetch the data from Kafka and apply some (more or less sophisticated) filters to the gathered tweets in order to:
Scraping the web from Twitter
Objectives
The main objective is to analyze the web pages linked from each tweet to determine whether they are interesting for the user.
Read more...Near duplicate detection
Twitter is full of duplicated or near-duplicated content (ND for short). Flooding a user’s candidate tweet list with items that have (almost) the same content can only lead to a bad user experience.
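One common way to flag near duplicates is to compare the sets of word shingles of two texts with the Jaccard similarity. The excerpt does not state which technique the project uses, so the sketch below is only an illustration; the function names, the shingle size `k` and the `threshold` value are assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        # Short texts: fall back to a single shingle holding all tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.5, k=3):
    """Treat two texts as near duplicates when their shingle sets overlap enough."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


# Hypothetical tweets: the second only adds punctuation to the first.
same = is_near_duplicate(
    "Breaking news: Kafka released today",
    "Breaking news: Kafka released today!!",
)
different = is_near_duplicate(
    "Breaking news: Kafka released today",
    "Completely different text about cats and dogs",
)
```

Pairwise comparison like this grows quadratically with the number of tweets, so at scale it is usually combined with a hashing scheme such as MinHash to find candidate pairs first.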