Scrapping the web from twitter
Objectives
The main objective is to analyze the web pages mentioned by the tweet to determine if they are interesting for the user.
Secondary objectives:
- Identify type of pages could be more interesting
- Identify type of conten could be more interesting
- Have a preview of the page to easily identify it.
Features
Each tweet has 1 or n url referenced. That url takes to a web page wich have several features. From each page we have extracted the following features:
From the content:
- All the images and their attributes
-
All the links and their attributies. Also categorized by:
-
Social networks:
- Meneame
-
Internal / External
-
Metadata:
- Whis name
- Whois organization
- Whois City
- Hostname
- Url Params
- Content Type
- Description
- Content language
- Abstract
- Topic
- Subject
- Generator
- copyright
- Title
- Author
- Schema Present?
Troubleshooting
Extract the metadata from whe web pages is relatively easy. The most difficult is to access to the content due the actual dynamic web pages and the responsive design. Old pages rere a simple html content wich is easy to parse. But actual webs are difficult to process “the content” There has been tests different packages for it custom regexp processing , BeautifulSoup, requests but none provide a reliable clean result. as far as there are a lot of code embedded in the text.
We wanted a thumbnail of the web sites. For that we used phantomJs wich provides us to make a snapshoot of the web pages.
Procedure
- The tweets come forom a Kafka pipe.
- The process looks form referenced urls
- The referenceds URL are scrapped as described
- The information is stored in a mongodb
- We make a photo of the web page
- We rerturn the tweet to a kafka pipe