SELECT TRUE STATEMENTS

Hardware and DBMS Design

1. A B+-tree index could be used to index attribute values in a single-server document store. [TRUE]
2. Consistent hashing is useful for workload balancing in a large distributed system. [TRUE]
3. In a main-memory-based database system, there is no need for persistent storage of any kind. [FALSE]
4. The default isolation mode of PostgreSQL is full serializability. [FALSE]
5. If only old values are logged during updates, durability can still be implemented. [FALSE]
6. Key-value stores excel at joining large relations. [FALSE]
7. SSDs are faster than HDDs, partly because they have no moving parts. [TRUE]
8. Support for ACID transaction properties is particularly important for workloads with many small update transactions. [TRUE]
9. In a main-memory relational database system, it remains necessary to log the “old” values during updates, in case the transaction aborts. [TRUE]
10. NoSQL literally means that SQL-like query languages must not be supported by NoSQL systems. [FALSE]
11. Document stores cannot store business data that is typically stored in relational systems. [FALSE]
12. While locking whole tables guarantees serializability in a relational system, it leads to terrible performance, which is why it is turned off by default in most systems. [TRUE]
13. Writing the log is usually slower than writing the actual updates to relations. [FALSE]
14. SSD storage is volatile. [FALSE]
15. The FORCE policy is difficult to implement in reality, since committing large transactions can cause many disk writes, and the server might crash in the middle of the writing process. [TRUE]
16. Query optimisation is often improved by gathering statistics over relations. [TRUE]
17. What is the query optimizer component responsible for? Select all statements that apply:
    - Generating a query execution plan. [TRUE]
    - Parsing queries. [FALSE]
    - Comparing access methods. [TRUE]
    - Maintaining statistics over existing relations. [FALSE]
18. A lock manager is necessary to implement ACID-style isolation. [TRUE]
19. When a transaction is committed, all the pages it has changed must be written to disk. [FALSE]
20. A distributed system with multiple replicas of data items can provide both strict consistency and availability at the same time. [FALSE]
21. A nested-loops join implementation is necessary to evaluate joins with complicated conditions. [TRUE]
22. Database backups are needed in case secondary storage (HDDs or SSDs) fails. [TRUE]
23. Many NoSQL systems operate over distributed storage. [TRUE]
24. CAP-style consistency is actually very similar to ACID-style consistency. [FALSE]
25. An RDBMS that implements clustered indexes can have multiple such clustered indexes per table. [FALSE]
26. Main memory cannot break or fail, so database backups are not needed for main-memory database systems. [FALSE]
27. CAP-style consistency is actually very similar to ACID-style isolation. [TRUE]
28. Defragmentation (reorganizing files together on disk) does not make much sense on SSDs. [TRUE]
29. By definition, a relational database can only use B+-tree indexes. [FALSE]
30. Magnetic tapes are useful for archival storage (e.g., database backups). [TRUE]
31. Utilisation of the L2 cache is not important in a disk-based database management system. [FALSE]
32. Traditional relational systems can scale up infinitely. [FALSE]
33. By definition, a NoSQL system cannot implement the SQL query language. [FALSE]
34. Storing multiple database recovery logs on a single hard disk drive (HDD) gives excellent logging performance. [FALSE]
35. Data replication in a distributed system reduces the risk of losing data. [TRUE]
36. The notion of consistency in the CAP theorem is the same as the notion of consistency in the ACID properties. [FALSE]
37. Compared to hard-disk drive (HDD) technology, state-of-the-art solid state disks (SSDs) generally improve the performance of disk operations. [TRUE]
38. Transaction isolation is easier to manage with very short transactions than with very long transactions. [TRUE]
39. Data replication in a distributed system eliminates the risk of losing data. [FALSE]
40. Compared to older persistent storage technology, solid state disks (SSDs) are particularly effective for small random reads. [TRUE]
41. The CAP theorem applies to normal operation of large-scale distributed systems. [FALSE]

Data Systems for Analytics

1. At its core, cloud storage uses the same storage media as local storage: HDDs and/or SSDs. [TRUE]
2. Deep learning is often used to train classifiers that can translate unstructured multimedia data (images and video) to more structured text data. [TRUE]
3. Data Volume alone is not sufficient to consider a collection of data to be a big data collection. [TRUE]
4. Data is never written to storage in a big data collection. [FALSE]
5. A well-designed relational database can never have incorrect data. [FALSE]
6. Social value can be a significant reason for working with big data. [TRUE]
7. Clustering data is generally achieved with one scan of the data collection. [FALSE]
8. Support for ACID transaction properties is generally useless for big data analysis applications. [TRUE]
9. All big data applications are evil in nature. [FALSE]
10. Spark is designed for supporting user-facing and interactive applications. [FALSE]
11. Many big data applications require so much computation that a distributed processing framework is necessary. [TRUE]
12. A machine learning model that is based on real data can never have any bias, because it accurately models real life. [FALSE]
13. Without the ability to join relations, there would be no big data analytics applications. [FALSE]
14. Value in big data analytics refers to more than monetary value. [TRUE]
15. In big data applications, there is no need to worry about inserting new data. [FALSE]
16. A major reason why Spark is more efficient than Hadoop is that it uses memory more effectively. [TRUE]
17. Hadoop is significantly more efficient than Spark for complex processing pipelines. [FALSE]
18. The main data access pattern of analytics systems consists of full sequential scans of large data collections. [TRUE]
19. Using machine learning on existing data is likely to produce models that have built-in bias. [TRUE]
20. In big data, Volume refers to the massive quantities of data that must be stored and processed. [TRUE]
21. Data in big data collections is never correct. [FALSE]
22. Clustering algorithms can be used to identify groups of related records in, for example, banking applications. [TRUE]
23. Key-value stores are better for web-caching than analytics applications. [TRUE]
24. Spark is significantly more flexible than Hadoop for complex processing pipelines. [TRUE]
25. Big data is always correct data. [FALSE]
26. Most machine learning algorithms will learn the biases present in the training data. [TRUE]
27. Document stores are the best tool for big data analytics applications. [FALSE]
28. Although Hadoop is a fairly recent technology, the MapReduce concept is very old. [TRUE]
29. Spark is more efficient than Hadoop for most Big Data applications. [TRUE]
30. Videos are generally considered structured data. [FALSE]
31. Spark has very low latency for small tasks. [FALSE]
32. Distributed processing is important for most Big Data applications. [TRUE]
33. Sequential disk reads are the most important disk access pattern in big data analytics. [TRUE]
34. In Big Data applications, “velocity” has two potential meanings: a) that data is added very rapidly, and b) that one must react rapidly to the added data in many cases. [TRUE]
35. The novelty of Hadoop MapReduce was primarily the invention of the Map and Reduce operations. [FALSE]
36. In Big Data applications, it is important to verify that the data is clean and applicable to the analysis that is to be undertaken. [TRUE]
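
Study aid: two statements in the Data Systems for Analytics list say that the MapReduce concept is old and that Hadoop's novelty was not the invention of the Map and Reduce operations. A minimal sketch of why, assuming a toy word-count task: map and reduce are ordinary functional operations that run on a single machine with no framework at all. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and are not Hadoop or Spark APIs.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework runs between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Fold all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values, 0)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

What a distributed framework adds on top of this is the part the sketch leaves out: partitioning the input across machines, moving the grouped pairs over the network, and recovering from node failures.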