M'interessa   Learning by choosing

The following is a non-exhaustive list of the resources that have been reviewed along


  1. Bird, Steven; Klein, Ewan; Loper, Edward. Natural Language Processing with Python. O'Reilly, 2009.
  2. Dasadia, Cyrus; Nayak, Amol. MongoDB Cookbook (Second Edition). Packt Publishing, 2016.
  3. Garg, Nishant. Learning Apache Kafka (Second Edition). Packt Publishing, 2015.
  4. Goasguen, Sébastien. Docker Cookbook. O'Reilly, 2016.
  5. Goasguen, Sébastien. Docker in the Cloud: recipes for AWS, Azure, Google, and More. O'Reilly, 2016.
  6. Grus, Joel. Data Science from Scratch. O'Reilly, 2015.
  7. Haloi, Saurav. Apache ZooKeeper Essentials: a fast-paced guide to using Apache ZooKeeper to coordinate services in distributed systems. Packt Publishing, 2015.
  8. Hardeniya, Nitin. NLTK Essentials: build cool NLP and machine learning applications using NLTK and other Python libraries. Packt Publishing, 2015.
  9. Junqueira, Flavio; Reed, Benjamin. ZooKeeper. O'Reilly, 2014.
  10. Karau, Holden. Learning Spark. O'Reilly, 2015.
  11. Kleppmann, Martin. Making Sense of Stream Processing. O'Reilly, 2016.
  12. Lawson, Richard. Web Scraping with Python. Packt Publishing, 2015.
  13. Leskovec, Jure; Rajaraman, Anand; Ullman, Jeffrey D. Mining of Massive Datasets. Stanford University, 2014.
  14. Lin, Jimmy; Dyer, Chris. Data-Intensive Text Processing with MapReduce. University of Maryland, College Park, 2010.
  15. López, Félix; Romero, Víctor. Mastering Python Regular Expressions. Packt Publishing, 2014.
  16. Makice, Kevin. Twitter API: up and running. O'Reilly, 2009.
  17. Manivannan, Arun. Scala Data Analysis Cookbook. Packt Publishing, 2015.
  18. Matthias, Karl; Kane, Sean P. Docker: up and running. O'Reilly, 2015.
  19. McKendrick, Russ. Extending Docker. Packt Publishing, 2016.
  20. Narębski, Jakub. Mastering Git. Packt Publishing, 2016.
  21. Narkhede, Neha; Shapira, Gwen; Palino, Todd. Kafka: the definitive guide. O'Reilly, 2016.
  22. Nicolas, Patrick R. Scala for Machine Learning. Packt Publishing, 2014.
  23. Parsian, Mahmoud. Data Algorithms. O'Reilly, 2015.
  24. Perkins, Jacob. Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, 2010.
  25. Russell, Matthew A. 21 Recipes for Mining Twitter. O'Reilly, 2011.
  26. Settles, Burr. Active Learning Literature Survey. University of Wisconsin-Madison, 2010.
  27. Yadav, Rishi. Spark Cookbook. Packt Publishing, 2015.

Online articles and resources

  1. A hands-on introduction to Apache Kafka
  2. Apache Spark and Apache Kafka integration example
  3. Build an AI Artist - Machine Learning for Hackers #5
  4. Capture screenshots using phantomjs
  5. Code examples that show to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format
  6. dc.js - Dimensional Charting Javascript Library
  7. Dealing with Unbalanced Classes, SVM, Random Forests and Decision Trees in Python
  8. Distributed representations of sentences and documents
  9. Docker-Spark
  10. Dockerfile for Apache Kafka
  11. Error reading field 'topic_metadata' in Kafka
  12. Example integration of Kafka, Avro & Spark-Streaming on live Twitter feed
  13. Example JSON response from Twitter streaming API
  14. ExifRead 2.1.2
  15. ExifTool by Phil Harvey
  16. FeatureHasher and DictVectorizer Comparison
  17. Google advanced search: A comprehensive list of Google search operators
  18. High Performance Kafka Consumer for Spark Streaming. Now Support Spark 1.6 and Kafka 0.9
  19. How useful are Topic Models in practice?
  20. HubInfo | A GitHub Repo Widget
  21. Implementation of Q.V. Le, and T. Mikolov, Distributed Representations of Sentences and Documents ICML, 2014
  22. Indexing Tweets with NiFi and Solr
  23. Installing XGBoost on Ubuntu
  24. Introducing DeepText: Facebook's text understanding engine
  25. Introducing FBLearner Flow: Facebook's AI backbone
  26. Introducing our Hybrid lda2vec Algorithm
  27. Introductory sample scala app using Apache Spark Streaming to accept data from Kafka and write a summary to Cassandra
  28. Kafka (and Zookeeper) in Docker
  29. kafka-python 1.2.2
  30. lda: Topic modeling with latent Dirichlet Allocation
  31. Link via an ambassador container
  32. Low level integration of Spark and Kafka
  33. Machine Learning Library (MLlib) Guide
  34. Networking in Compose
  35. Node.js Passport Twitter login and storing user with MongoDB (mongoose)
  36. Opinionated stacks of ready-to-run Jupyter applications in Docker
  37. Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Flink and DataFlow
  38. Screenshot top 20 websites from top 20 categories using python
  39. Sign-Up with Facebook, Twitter and Google using Node.js and MongoDB
  40. sklearn.feature_extraction.text.TfidfVectorizer
  41. Spark Streaming + Kafka Integration Guide
  42. Technologies for Reusing Text from the Web - Talk by Martin Potthast
  43. The amazing power of word vectors
  44. Topic modeling with LDA: MLlib meets GraphX
  45. Topic Modeling with Latent Dirichlet Allocation
  46. Tweets | Twitter Developer
  47. UI Bootstrap
  48. Understand Docker container networks
  49. US presidential election via Twitter using Apache NiFi, Spark, Hive and Zeppelin
  50. Versatile Spark – Streaming
  51. Word2vec Tutorial
  52. XGBoost4J: Portable Distributed XGBoost in Spark, Flink and Dataflow
  53. Zookeeper & Kafka Install : A single node and a multiple broker cluster - 2016

Scholarly articles

  1. Andor, Daniel [et al.]. Globally Normalized Transition-Based Neural Networks. Google Inc., 2016.
  2. Beygelzimer, Alina; Kale, Satyen; Luo, Haipeng. Optimal and Adaptive Algorithms for Online Boosting. arxiv.org, 2015.
  3. Blei, David M. Probabilistic Topic Models: surveying a suite of algorithms that offer a solution to managing large document archives. Communications of the ACM, 2012.
  4. Breiman, Leo. Bagging Predictors. Department of Statistics, University of California Berkeley, 1994.
  5. Chen, Tianqi; Guestrin, Carlos. XGBoost: A Scalable Tree Boosting System. arxiv.org, 2016.
  6. Gu, Xiaodong [et al.]. Deep API Learning. arxiv.org, 2016.
  7. Liu, Bing [et al.]. Partially Supervised Classification of Text Documents. Singapore-MIT Alliance, 2002.
  8. Mihalcea, Rada; Tarau, Paul. TextRank: Bringing Order Into Texts. Conference on Empirical Methods in Natural Language Processing, 2004.
  9. Morstatter, Fred [et al.]. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media, 2013.
  10. Niu, Gang. Theoretical Comparisons of Positive-Unlabeled Learning against Positive-Negative Learning. arxiv.org, 2016.
  11. Qin, Xiangju [et al.]. Learning from data streams with only positive and unlabeled data. Journal of Intelligent Information Systems, 2013.
  12. Sill, Joseph [et al.]. Feature-Weighted Linear Stacking. arxiv.org, 2009.
  13. Tao, Ke [et al.]. Groundhog Day: Near-Duplicate Detection on Twitter. Proceedings of the 22nd international conference on World Wide Web, 2013.
  14. Teh, Yee Whye [et al.]. Hierarchical Dirichlet Processes. University of California Berkeley, 2005.