Designing Data-Intensive Applications
=====================================

Chapter 10 References
---------------------

1. Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters,” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004.
1. Joel Spolsky: “The Perils of JavaSchools,” December 29, 2005.
1. Shivnath Babu and Herodotos Herodotou: “Massively Parallel Databases and MapReduce Systems,” *Foundations and Trends in Databases*, volume 5, number 1, pages 1–104, November 2013. doi:10.1561/1900000036
1. David J. DeWitt and Michael Stonebraker: “MapReduce: A Major Step Backwards,” originally published January 17, 2008.
1. Henry Robinson: “The Elephant Was a Trojan Horse: On the Death of Map-Reduce at Google,” June 25, 2014.
1. “The Hollerith Machine,” United States Census Bureau.
1. “IBM 82, 83, and 84 Sorters Reference Manual,” Edition A24-1034-1, International Business Machines Corporation, July 1962.
1. Adam Drake: “Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster,” January 25, 2014.
1. “GNU Coreutils 8.23 Documentation,” Free Software Foundation, Inc., 2014.
1. Martin Kleppmann: “Kafka, Samza, and the Unix Philosophy of Distributed Data,” August 5, 2015.
1. Doug McIlroy: Internal Bell Labs memo, October 1964. Cited in: Dennis M. Ritchie: “Advice from Doug McIlroy.”
1. M. D. McIlroy, E. N. Pinson, and B. A. Tague: “UNIX Time-Sharing System: Foreword,” *The Bell System Technical Journal*, volume 57, number 6, pages 1899–1904, July 1978.
1. Eric S. Raymond: *The Art of UNIX Programming*. Addison-Wesley, 2003. ISBN: 978-0-13-142901-7
1. Ronald Duncan: “Text File Formats – ASCII Delimited Text – Not CSV or TAB Delimited Text,” October 31, 2009.
1. Alan Kay: “Is 'Software Engineering' an Oxymoron?”
1. Martin Fowler: “InversionOfControl,” June 26, 2005.
1. Daniel J. Bernstein: “Two File Descriptors for Sockets.”
1. Rob Pike and Dennis M. Ritchie: “The Styx Architecture for Distributed Systems,” *Bell Labs Technical Journal*, volume 4, number 2, pages 146–152, April 1999.
1. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: “The Google File System,” at *19th ACM Symposium on Operating Systems Principles* (SOSP), October 2003. doi:10.1145/945445.945450
1. Michael Ovsiannikov, Silvius Rus, Damian Reeves, et al.: “The Quantcast File System,” *Proceedings of the VLDB Endowment*, volume 6, number 11, pages 1092–1101, August 2013. doi:10.14778/2536222.2536234
1. “OpenStack Swift 2.6.1 Developer Documentation,” OpenStack Foundation, March 2016.
1. Zhe Zhang, Andrew Wang, Kai Zheng, et al.: “Introduction to HDFS Erasure Coding in Apache Hadoop,” September 23, 2015.
1. Peter Cnudde: “Hadoop Turns 10,” February 5, 2016.
1. Eric Baldeschwieler: “Thinking About the HDFS vs. Other Storage Technologies,” July 25, 2012.
1. Brendan Gregg: “Manta: Unix Meets Map Reduce,” June 25, 2013.
1. Tom White: *Hadoop: The Definitive Guide*, 4th edition. O'Reilly Media, 2015. ISBN: 978-1-491-90163-2
1. Jim N. Gray: “Distributed Computing Economics,” Microsoft Research Tech Report MSR-TR-2003-24, March 2003.
1. Márton Trencséni: “Luigi vs Airflow vs Pinball,” February 6, 2016.
1. Roshan Sumbaly, Jay Kreps, and Sam Shah: “The 'Big Data' Ecosystem at LinkedIn,” at *ACM International Conference on Management of Data* (SIGMOD), July 2013. doi:10.1145/2463676.2463707
1. Alan F. Gates, Olga Natkovich, Shubham Chopra, et al.: “Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience,” at *35th International Conference on Very Large Data Bases* (VLDB), August 2009.
1. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al.: “Hive – A Petabyte Scale Data Warehouse Using Hadoop,” at *26th IEEE International Conference on Data Engineering* (ICDE), March 2010. doi:10.1109/ICDE.2010.5447738
1. “Cascading 3.0 User Guide,” Concurrent, Inc., January 2016.
1. “Apache Crunch User Guide,” Apache Software Foundation.
1. Craig Chambers, Ashish Raniwala, Frances Perry, et al.: “FlumeJava: Easy, Efficient Data-Parallel Pipelines,” at *31st ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2010. doi:10.1145/1806596.1806638
1. Jay Kreps: “Why Local State is a Fundamental Primitive in Stream Processing,” July 31, 2014.
1. Martin Kleppmann: “Rethinking Caching in Web Apps,” October 1, 2012.
1. Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira: *Hadoop Application Architectures*. O'Reilly Media, 2015. ISBN: 978-1-491-90004-8
1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “Challenges to Adopting Stronger Consistency at Scale,” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
1. “Performance and Efficiency,” Apache Pig Documentation, 2017.
1. Sriranjan Manjunath: “Skewed Join,” 2009.
1. David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri: “Practical Skew Handling in Parallel Joins,” at *18th International Conference on Very Large Data Bases* (VLDB), August 1992.
1. Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “Impala: A Modern, Open-Source SQL Engine for Hadoop,” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
1. Matthieu Monsch: “Open-Sourcing PalDB, a Lightweight Companion for Storing Side Data,” October 26, 2015.
1. Daniel Peng and Frank Dabek: “Large-Scale Incremental Processing Using Distributed Transactions and Notifications,” at *9th USENIX conference on Operating Systems Design and Implementation* (OSDI), October 2010.
1. “Cloudera Search User Guide,” Cloudera, Inc., September 2015.
1. Lili Wu, Sam Shah, Sean Choi, et al.: “The Browsemaps: Collaborative Filtering at LinkedIn,” at *6th Workshop on Recommender Systems and the Social Web* (RSWeb), October 2014.
1. Roshan Sumbaly, Jay Kreps, Lei Gao, et al.: “Serving Large-Scale Batch Computed Data with Project Voldemort,” at *10th USENIX Conference on File and Storage Technologies* (FAST), February 2012.
1. Varun Sharma: “Open-Sourcing Terrapin: A Serving System for Batch Generated Data,” September 14, 2015.
1. Nathan Marz: “ElephantDB,” May 30, 2011.
1. Jean-Daniel (JD) Cryans: “How-to: Use HBase Bulk Loading, and Why,” September 27, 2013.
1. Nathan Marz: “How to Beat the CAP Theorem,” October 13, 2011.
1. Molly Bartlett Dishman and Martin Fowler: “Agile Architecture,” at *O'Reilly Software Architecture Conference*, March 2015.
1. David J. DeWitt and Jim N. Gray: “Parallel Database Systems: The Future of High Performance Database Systems,” *Communications of the ACM*, volume 35, number 6, pages 85–98, June 1992. doi:10.1145/129888.129894
1. Jay Kreps: “But the multi-tenancy thing is actually really really hard,” tweetstorm, October 31, 2014.
1. Jeffrey Cohen, Brian Dolan, Mark Dunlap, et al.: “MAD Skills: New Analysis Practices for Big Data,” *Proceedings of the VLDB Endowment*, volume 2, number 2, pages 1481–1492, August 2009. doi:10.14778/1687553.1687576
1. Ignacio Terrizzano, Peter Schwarz, Mary Roth, and John E. Colino: “Data Wrangling: The Challenging Journey from the Wild to the Lake,” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
1. Paige Roberts: “To Schema on Read or to Schema on Write, That Is the Hadoop Data Lake Question,” July 2, 2015.
1. Bobby Johnson and Joseph Adler: “The Sushi Principle: Raw Data Is Better,” at *Strata+Hadoop World*, February 2015.
1. Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, et al.: “Apache Hadoop YARN: Yet Another Resource Negotiator,” at *4th ACM Symposium on Cloud Computing* (SoCC), October 2013. doi:10.1145/2523616.2523633
1. Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, et al.: “Large-Scale Cluster Management at Google with Borg,” at *10th European Conference on Computer Systems* (EuroSys), April 2015. doi:10.1145/2741948.2741964
1. Malte Schwarzkopf: “The Evolution of Cluster Scheduler Architectures,” March 9, 2016.
1. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al.: “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” at *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012.
1. Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia: *Learning Spark*. O'Reilly Media, 2015. ISBN: 978-1-449-35904-1
1. Bikas Saha and Hitesh Shah: “Apache Tez: Accelerating Hadoop Query Processing,” at *Hadoop Summit*, June 2014.
1. Bikas Saha, Hitesh Shah, Siddharth Seth, et al.: “Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications,” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. doi:10.1145/2723372.2742790
1. Kostas Tzoumas: “Apache Flink: API, Runtime, and Project Roadmap,” January 14, 2015.
1. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al.: “The Stratosphere Platform for Big Data Analytics,” *The VLDB Journal*, volume 23, number 6, pages 939–964, May 2014. doi:10.1007/s00778-014-0357-y
1. Michael Isard, Mihai Budiu, Yuan Yu, et al.: “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” at *European Conference on Computer Systems* (EuroSys), March 2007. doi:10.1145/1272996.1273005
1. Daniel Warneke and Odej Kao: “Nephele: Efficient Parallel Data Processing in the Cloud,” at *2nd Workshop on Many-Task Computing on Grids and Supercomputers* (MTAGS), November 2009. doi:10.1145/1646468.1646476
1. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd: “The PageRank Citation Ranking: Bringing Order to the Web,” Stanford InfoLab Technical Report 422, 1999.
1. Leslie G. Valiant: “A Bridging Model for Parallel Computation,” *Communications of the ACM*, volume 33, number 8, pages 103–111, August 1990. doi:10.1145/79173.79181
1. Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl: “Spinning Fast Iterative Data Flows,” *Proceedings of the VLDB Endowment*, volume 5, number 11, pages 1268–1279, July 2012. doi:10.14778/2350229.2350245
1. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, et al.: “Pregel: A System for Large-Scale Graph Processing,” at *ACM International Conference on Management of Data* (SIGMOD), June 2010. doi:10.1145/1807167.1807184
1. Frank McSherry, Michael Isard, and Derek G. Murray: “Scalability! But at What COST?,” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
1. Ionel Gog, Malte Schwarzkopf, Natacha Crooks, et al.: “Musketeer: All for One, One for All in Data Processing Systems,” at *10th European Conference on Computer Systems* (EuroSys), April 2015. doi:10.1145/2741948.2741968
1. Aapo Kyrola, Guy Blelloch, and Carlos Guestrin: “GraphChi: Large-Scale Graph Computation on Just a PC,” at *10th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2012.
1. Andrew Lenharth, Donald Nguyen, and Keshav Pingali: “Parallel Graph Analytics,” *Communications of the ACM*, volume 59, number 5, pages 78–87, May 2016. doi:10.1145/2901919
1. Fabian Hüske: “Peeking into Apache Flink's Engine Room,” March 13, 2015.
1. Mostafa Mokhtar: “Hive 0.14 Cost Based Optimizer (CBO) Technical Overview,” March 2, 2015.
1. Michael Armbrust, Reynold S. Xin, Cheng Lian, et al.: “Spark SQL: Relational Data Processing in Spark,” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. doi:10.1145/2723372.2742797
1. Daniel Blazevski: “Planting Quadtrees for Apache Flink,” March 25, 2016.
1. Tom White: “Genome Analysis Toolkit: Now Using Apache Spark for Data Processing,” April 6, 2016.