The following is a non-exhaustive list of the resources that have been reviewed along
Books
- Bird, Steven; Klein, Ewan; Loper, Edward. Natural Language Processing with Python. O'Reilly, 2009.
- Dasadia, Cyrus; Nayak, Amol. MongoDB Cookbook (Second Edition). Packt Publishing, 2016.
- Garg, Nishant. Learning Apache Kafka (Second Edition). Packt Publishing, 2015.
- Goasguen, Sébastien. Docker Cookbook. O'Reilly, 2016.
- Goasguen, Sébastien. Docker in the Cloud: recipes for AWS, Azure, Google, and More. O'Reilly, 2016.
- Grus, Joel. Data Science from Scratch. O'Reilly, 2015.
- Haloi, Saurav. Apache ZooKeeper Essentials: a fast-paced guide to using Apache ZooKeeper to coordinate services in distributed systems. Packt Publishing, 2015.
- Hardeniya, Nitin. NLTK Essentials: build cool NLP and machine learning applications using NLTK and other Python libraries. Packt Publishing, 2015.
- Junqueira, Flavio; Reed, Benjamin. ZooKeeper. O'Reilly, 2014.
- Karau, Holden. Learning Spark. O'Reilly, 2015.
- Kleppmann, Martin. Making Sense of Stream Processing. O'Reilly, 2016.
- Lawson, Richard. Web Scraping with Python. Packt Publishing, 2015.
- Leskovec, Jure; Rajaraman, Anand; Ullman, Jeffrey D. Mining of Massive Datasets. Stanford University, 2014.
- Lin, Jimmy; Dyer, Chris. Data-intensive text processing with MapReduce. University of Maryland, College Park, 2010.
- López, Félix; Romero, Víctor. Mastering Python Regular Expressions. Packt Publishing, 2014.
- Makice, Kevin. Twitter API: up and running. O'Reilly, 2009.
- Manivannan, Arun. Scala Data Analysis Cookbook. Packt Publishing, 2015.
- Matthias, Karl; Kane, Sean P. Docker: up and running. O'Reilly, 2015.
- McKendrick, Russ. Extending Docker. Packt Publishing, 2016.
- Narębski, Jakub. Mastering Git. Packt Publishing, 2016.
- Narkhede, Neha; Shapira, Gwen; Palino, Todd. Kafka: the definitive guide. O'Reilly, 2016.
- Nicolas, Patrick R. Scala for Machine Learning. Packt Publishing, 2014.
- Parsian, Mahmoud. Data Algorithms. O'Reilly, 2015.
- Perkins, Jacob. Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, 2010.
- Russell, Matthew A. 21 Recipes for Mining Twitter. O'Reilly, 2011.
- Settles, Burr. Active Learning Literature Survey. University of Wisconsin-Madison, 2010.
- Yadav, Rishi. Spark Cookbook. Packt Publishing, 2015.
Online articles and resources
- A hands-on introduction to Apache Kafka
- Apache Spark and Apache Kafka integration example
- Build an AI Artist - Machine Learning for Hackers #5
- Capture screenshots using phantomjs
- Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format
- dc.js - Dimensional Charting Javascript Library
- Dealing with Unbalanced Classes, SVM, Random Forests and Decision Trees in Python
- Distributed representations of sentences and documents
- Docker-Spark
- Dockerfile for Apache Kafka
- Error reading field 'topic_metadata' in Kafka
- Example integration of Kafka, Avro & Spark-Streaming on live Twitter feed
- Example JSON response from Twitter streaming API
- ExifRead 2.1.2
- ExifTool by Phil Harvey
- FeatureHasher and DictVectorizer Comparison
- Google advanced search: A comprehensive list of Google search operators
- High Performance Kafka Consumer for Spark Streaming. Now Support Spark 1.6 and Kafka 0.9
- How useful are Topic Models in practice?
- HubInfo | A GitHub Repo Widget
- Implementation of Q.V. Le, and T. Mikolov, Distributed Representations of Sentences and Documents ICML, 2014
- Indexing Tweets with NiFi and Solr
- Installing XGBoost on Ubuntu
- Introducing DeepText: Facebook's text understanding engine
- Introducing FBLearner Flow: Facebook's AI backbone
- Introducing our Hybrid lda2vec Algorithm
- Introductory sample scala app using Apache Spark Streaming to accept data from Kafka and write a summary to Cassandra
- Kafka (and Zookeeper) in Docker
- kafka-python 1.2.2
- lda: Topic modeling with latent Dirichlet Allocation
- Link via an ambassador container
- Low level integration of Spark and Kafka
- Machine Learning Library (MLlib) Guide
- Networking in Compose
- Node.js Passport Twitter login and storing user with MongoDB (mongoose)
- Opinionated stacks of ready-to-run Jupyter applications in Docker
- Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Flink and DataFlow
- Screenshot top 20 websites from top 20 categories using python
- Sign-Up with Facebook, Twitter and Google using Node.js and MongoDB
- sklearn.feature_extraction.text.TfidfVectorizer
- Spark Streaming + Kafka Integration Guide
- Technologies for Reusing Text from the Web - Talk by Martin Potthast
- The amazing power of word vectors
- Topic modeling with LDA: MLlib meets GraphX
- Topic Modeling with Latent Dirichlet Allocation
- Tweets | Twitter Developer
- UI Bootstrap
- Understand Docker container networks
- US presidential election via Twitter using Apache NiFi, Spark, Hive and Zeppelin
- Versatile Spark – Streaming
- Word2vec Tutorial
- XGBoost4J: Portable Distributed XGBoost in Spark, Flink and Dataflow
- Zookeeper & Kafka Install : A single node and a multiple broker cluster - 2016
Scholarly articles
- Andor, Daniel [et al.]. Globally Normalized Transition-Based Neural Networks. Google Inc., 2016.
- Beygelzimer, Alina; Kale, Satyen; Luo, Haipeng. Optimal and Adaptive Algorithms for Online Boosting. arxiv.org, 2015.
- Breiman, Leo. Bagging Predictors. Department of Statistics, University of California Berkeley, 1994.
- Blei, David M. Probabilistic Topic Models: surveying a suite of algorithms that offer a solution to managing large document archives. Communications of the ACM, 2012.
- Chen, Tianqi; Guestrin, Carlos. XGBoost: A Scalable Tree Boosting System. arxiv.org, 2016.
- Gu, Xiaodong [et al.]. Deep API Learning. arxiv.org, 2016.
- Liu, Bing [et al.]. Partially Supervised Classification of Text Documents. Singapore-MIT Alliance, 2002.
- Mihalcea, Rada; Tarau, Paul. TextRank: Bringing Order Into Texts. Conference on Empirical Methods in Natural Language Processing, 2004.
- Morstatter, Fred [et al.]. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media, 2013.
- Niu, Gang. Theoretical Comparisons of Positive-Unlabeled Learning against Positive-Negative Learning. arxiv.org, 2016.
- Qin, Xiangju [et al.]. Learning from data streams with only positive and unlabeled data. Journal of Intelligent Information Systems, 2013.
- Sill, Joseph [et al.]. Feature-Weighted Linear Stacking. arxiv.org, 2009.
- Tao, Ke [et al.]. Groundhog Day: Near-Duplicate Detection on Twitter. Proceedings of the 22nd international conference on World Wide Web, 2013.
- Teh, Yee Whye [et al.]. Hierarchical Dirichlet Processes. University of California Berkeley, 2005.