
Big data: what do you need to learn? A small guide for beginners 
Big data: how to get started?
Everyone says Big Data is the future and that we need to get into it as soon as possible. Fine. But where do you start? What needs to be learned first? What should you train in to master Big Data?
Rather than list typical training programs here, with titles full of trendy keywords, we preferred to tell you the story of Big Data. How and why did we get here?
Where does the term come from, and what technological developments made the world of Big Data explode?
With a clearer view of what Big Data is, you can work out what you need to learn to become an expert on the subject, and especially what you, specifically, need to learn.
Big data: what is the difference compared to data analysis?
The first question is the one that lets us understand what is new here. What is the difference between Big Data and good old data analysis, or the statistics courses we took in high school and sometimes beyond?
In fact, the novelty comes from the Internet and from the fact that data is now produced continuously, from all sides.
According to IBM, humankind produces 2.5 quintillion (2.5 billion billion) bytes of data every day while eating, sleeping, playing, working, and so on. The image often used is a stack of DVDs reaching from the Earth to the moon, although that picture fits the data accumulated over a year or more rather than a single day. Those DVDs would contain everything from the texts we send, to the photos we take, to the data of all the machines that communicate with each other, which explains why data has become, in common language, so “big”.
And so, when people talk about Big Data, they generally have in mind that we take a large part of this data, analyze it, and deduce something interesting.
In fact, Big Data is much more than that.
Big Data also consists of:
- taking large amounts of data from different sources,
- using data of very different types, produced at different rates, without necessarily having to “translate” it into specific formats,
- storing this data so that it can be used simultaneously for many different analyses with different purposes,
- and doing all this very quickly, and sometimes even… in real-time.
At the very beginning, people talked about the 3Vs, or VVV: volume (large quantities), variety (different types of data), and velocity (speed of processing).
Big Data vs. data warehouse
But what the VVV acronym did not bring out is the central innovation: data no longer needs to be systematically “transformed” to be analyzed.
This “non-destructive” processing meant that organizations could now analyze the same batches of data for different purposes, and draw that data from sources that had themselves collected it for yet other purposes.
A new approach thus emerged, very different from what we knew at the time around data mining and data warehouses.
The classic data warehouse was designed for a specific use. Data was structured and converted into specific formats, and the original data was necessarily destroyed in the process.
This was the ETL process: “extract, transform, and load”.
The ETL approach is therefore limited to specific analyses of specific data. It was perfect as long as all your data lived in your transaction system. But it no longer looks so great once you consider all the data that exists here and there in our hyper-connected world, which is where the quintillions of bytes mentioned above come from.
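To make the contrast concrete, here is a minimal Python sketch. It is only an illustration: the events, field names, and function names are hypothetical and do not come from any particular product. The ETL-style load keeps only what one planned report needs, while the non-destructive style leaves the raw events intact so that a question nobody planned for can still be answered later.

```python
import json

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"type": "order", "customer": "alice", "amount": 120.0, "channel": "web"}',
    '{"type": "order", "customer": "bob", "amount": 80.0, "channel": "store"}',
    '{"type": "email", "customer": "alice", "subject": "Where is my parcel?"}',
]


def etl_load(raw_events):
    """ETL style: keep only what the planned report needs, in a fixed format."""
    table = []
    for line in raw_events:
        event = json.loads(line)
        if event["type"] == "order":
            table.append((event["customer"], event["amount"]))
    # Emails and the "channel" field are not carried over into the table.
    return table


def revenue_per_customer(table):
    """The analysis the warehouse table was designed for."""
    totals = {}
    for customer, amount in table:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def customers_who_wrote(raw_events):
    """Non-destructive style: a new question answered from the raw events."""
    events = [json.loads(line) for line in raw_events]
    return sorted({e["customer"] for e in events if e["type"] == "email"})


print(revenue_per_customer(etl_load(RAW_EVENTS)))
print(customers_who_wrote(RAW_EVENTS))
```

The second question cannot be answered from the warehouse table alone, because the emails were dropped during the transform step. That is the limitation the non-destructive approach removes.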
This is therefore one of the first things to learn in the world of Big Data:
- look outside, and ask what data is available beyond your own system,
- look at the data available to everyone, the famous open data,
- ask how this data could be connected and cross-referenced with the data you already have, and what you would learn from doing so.
The end of data warehouses?
Don’t believe that Big Data makes data warehouses obsolete. Big Data systems let you work with unstructured data, but the query results you get are far from the sophistication of data warehouses.
The data warehouse is designed for in-depth data analysis, which is possible precisely because the data has been transformed and laid out in a specific format.
Data warehouse providers have been working for years to optimize their query engines to meet the typical expectations of a specific business.
If you or your organization specialize in this area, do not feel left behind or overwhelmed. We are not exactly talking about the same thing.
Big Data allows you to analyze much more data from many more sources, but at a coarser resolution.
Big Data is an impressionist world that will always need the hyperrealism of data warehouses.
That is why we are bound to live with both traditional data warehouses and the new style of processing called Big Data.
Big Data is therefore not about unlearning what you have learned about data warehouses. The Data Scientist does not necessarily have to replace the Business Intelligence engineer.
Instead, the company will need to ask how to ensure that the two enrich each other. And that too must be learned and studied.
Technological breakthroughs behind Big Data
So much for the philosophical and organizational side.
But there is also the technological dimension. Here again, Big Data rests on a revolution that needs to be understood, and to which you must adapt through targeted training and learning.
Beyond the 3Vs, Big Data is really 3V + 1U: volume, variety, velocity, and on top of that the non-destructive use of data.
To achieve this, technological challenges had to be successfully met.
First of all, it required the development of distributed computing systems: this is the realm of Hadoop, for example.
It also required a method for processing disparate data together: this is the territory of Google’s MapReduce or, more recently, Apache Spark.
Finally, it required a cloud/internet infrastructure to access data and use it as desired.
Until about twelve years ago, it was impossible to manipulate data volumes like those we handle today. At the time, we certainly felt that our data volumes were huge and that our data warehouses were capable of massive processing.
But the limits of the systems of the time (location and storage, computing power, and the inability to process different types of data) did not allow us to cope with a new context: a world where, thanks to the Internet, data is produced and interconnected everywhere, in all directions and at all times.
MapReduce
Around 2003, Google researchers developed MapReduce. This programming technique simplifies the processing of large data sets by first mapping the data to sets of key/value pairs, then performing calculations on pairs with the same key to reduce each group to a single value. Each “big chunk” of data could be processed in parallel on hundreds or even thousands of inexpensive machines. This large-scale parallel processing technique allowed Google to generate search results from far larger volumes of data than before, and to do so faster.
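The idea is easier to see on a toy example. The sketch below is a single-machine imitation in plain Python, not Google's or Hadoop's actual API: the function names are hypothetical, and the loop over chunks stands in for what a real cluster would run in parallel. It counts words by mapping each chunk of text to key/value pairs, grouping the pairs by key, and reducing each group to one value.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the values of all pairs that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single value."""
    return key, sum(values)


def word_count(chunks):
    pairs = []
    # On a real cluster, each chunk would be mapped on a different machine.
    for chunk in chunks:
        pairs.extend(map_phase(chunk))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


chunks = ["big data is big", "data is everywhere"]
print(word_count(chunks))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each chunk is mapped independently and each key is reduced independently, both steps can be spread over as many machines as are available.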
Google’s work paved the way for the technological breakthroughs that made Big Data possible.
The first was Hadoop, an open-source project modeled on Google’s published designs, which consists of two key services:
- a data storage system, using HDFS (Hadoop Distributed File System)
- a parallel data processing system, using a technique called MapReduce.
Hadoop runs on a set of servers in a shared-nothing architecture.
Servers can be added or removed at will in a Hadoop cluster. The system detects and resolves problems on each server. Hadoop, in other words, “self-heals”. As a result, it can continue to provide data and operate at scale while performing tasks requiring high performance, despite changes or failures.
Its true added value lies elsewhere: it comes from the add-ons and customizable extensions built around it. The Hadoop ecosystem includes related projects that add features to the platform:
- Hadoop Common: These are the core tools needed for other Hadoop projects
- Chukwa: a data collection system for managing large distributed systems.
- HBase: a distributed and scalable database that supports structured data storage for large tables.
- HDFS: a distributed file system that provides high-throughput access to data from applications
- Hive: data warehouse infrastructure that provides data summarization and an ad hoc query system.
- MapReduce: a software infrastructure for processing distributed data sets
- Pig: a high-level data-flow language and framework for executing parallel processes.
- ZooKeeper: a high-performance coordination service for distributed applications.
Implementing a Hadoop platform necessarily involves some of these sub-projects.
For example, many organizations choose HDFS as the primary distributed file system and HBase as a database that can store billions of rows of data. Using MapReduce (or Spark) then follows almost naturally, as it brings speed and agility to the Hadoop platform.
With MapReduce, developers can create programs that process massive volumes of unstructured data in parallel across a cluster of distributed processors. The MapReduce framework is built around two functions:
- Map, a function applied in parallel to the pieces of work distributed to the different nodes of the cluster.
- Reduce, a function that gathers the intermediate results and summarizes them, in the simplest case into a single value.
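The division of labor between the two functions can be sketched in a few lines of Python. This is an assumption-laden stand-in rather than Hadoop code: threads play the role of cluster nodes, and the names `map_partition`, `combine`, and `total` are hypothetical. Each worker summarizes its own partition, and the reduce step folds the partial results into one value.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(partition):
    """Map step: each worker summarizes its own slice of the data."""
    return sum(partition)


def combine(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def total(partitions, workers=3):
    # Threads stand in for cluster nodes in this single-machine sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    # Fold all partial results into a single value.
    return reduce(combine, partials, 0)


partitions = [[12, 7, 3], [5, 5], [20, 1, 1, 1]]
print(total(partitions))  # 55
```

The reduce function here is associative, which is what allows partial results to be merged in any grouping without changing the final value.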
One of the main advantages of MapReduce is that it is fault-tolerant, or resilient. How does it do that? It “monitors” each node in the cluster regularly. Each node is expected to periodically report completed work along with “status” updates. If a node remains silent longer than expected, a master node flags it and reassigns its work to other nodes in the cluster.
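The monitoring logic described above can be sketched as follows. This is a simplified model, not Hadoop's implementation: the timeout value, the node and task names, and the "give the task to the least loaded node" rule are all assumptions chosen for illustration.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds of silence tolerated (hypothetical value)


def find_silent_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the nodes whose last status update is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout)


def reassign(assignments, silent_nodes):
    """Move the tasks of silent nodes onto the remaining healthy nodes."""
    healthy = [node for node in assignments if node not in silent_nodes]
    if not healthy:
        raise RuntimeError("no healthy node left to take over the work")
    new_assignments = {node: list(assignments[node]) for node in healthy}
    orphaned = [task for node in silent_nodes for task in assignments.get(node, [])]
    for task in orphaned:
        # Assumed policy: hand each orphaned task to the least loaded node.
        target = min(healthy, key=lambda node: len(new_assignments[node]))
        new_assignments[target].append(task)
    return new_assignments


last_heartbeat = {"node-a": 100.0, "node-b": 98.0, "node-c": 75.0}
assignments = {
    "node-a": ["split-1"],
    "node-b": ["split-2", "split-3"],
    "node-c": ["split-4", "split-5"],
}

silent = find_silent_nodes(last_heartbeat, now=102.0)
print(silent)
print(reassign(assignments, silent))
```

No work is lost when a node goes quiet: its tasks are simply run again elsewhere, which is possible because each task reads its input from shared, replicated storage rather than from the failed machine alone.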
Originally built to index the Nutch search engine, Hadoop is now used in all major industries for various Big Data tasks.
In other words, thanks to the distributed file system HDFS and the resource manager YARN (Yet Another Resource Negotiator), the software lets the user process gigantic volumes of data spread across thousands of computers as if they were one huge machine.
How do you adapt this to your case? The story of these technological leaps inevitably has consequences for you and your organization.
Don’t know distributed computing? It is urgent to understand it better and to train yourself or your teams.
As for the tools, the story of Hadoop shows how its logic has imposed itself on the world of Big Data. That world should not remain foreign to you. Think about it.
But the story is not over.
In 2009, researchers at the University of California, Berkeley developed Apache Spark, an alternative to MapReduce. Because Spark performs its calculations in parallel using in-memory storage, it can be up to 100 times faster than MapReduce on some workloads. Spark can be used alone or within Hadoop.
Even with Hadoop, you still need a way to store and access data.
This is generally what NoSQL databases such as MongoDB, CouchDB, or Cassandra are used for: they handle unstructured or semi-structured data distributed across multiple machines.
Unlike data warehouses, where massive amounts of data are converted to a unified format and stored in a single data store, these tools do not alter the nature or location of the original data (emails remain emails, sensor data remains sensor data), and the data can be stored practically anywhere.
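A rough sketch of that idea: documents of different shapes are stored unchanged, and a hash of each document's key decides which machine holds it. The node names, keys, and simple modulo placement below are assumptions for illustration only. Real NoSQL databases use more elaborate partitioning and replication schemes.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster

# Documents keep their own shape: no common schema is imposed on them.
DOCUMENTS = {
    "email:42": {"from": "alice@example.com", "subject": "Where is my parcel?"},
    "sensor:7": {"temperature_c": 21.5, "recorded_at": "2024-01-01T12:00:00Z"},
    "order:1001": {"customer": "bob", "items": ["book", "pen"], "amount": 23.9},
}


def node_for(key, nodes=NODES):
    """Pick a node from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def store(documents, nodes=NODES):
    """Place every document, unchanged, on the node its key hashes to."""
    cluster = {node: {} for node in nodes}
    for key, doc in documents.items():
        cluster[node_for(key, nodes)][key] = doc
    return cluster


def fetch(cluster, key, nodes=NODES):
    """Look a document up on the only node that can hold it."""
    return cluster[node_for(key, nodes)].get(key)


cluster = store(DOCUMENTS)
print(fetch(cluster, "sensor:7"))
```

Because the placement depends only on the key, any client can find a document without asking a central index, and the document comes back exactly as it was written.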
One problem remains: having massive amounts of data in NoSQL databases spread over clusters of machines is not very useful unless you do something with it. This is where Big Data analysis comes in.
Tools such as Tableau, Splunk, and Jasper BI allow you to analyze this data to identify patterns, extract meaning, and reveal new insights. What you do with it will depend on your needs.
Again, skills in this kind of analysis are particularly in demand. They are part of the skills to acquire.
Note also that mastering NoSQL databases seems important, if not essential, in the world of Big Data. Although… SQL is making a comeback in the Big Data universe.
To be continued…
Need a more precise diagnosis?
What do you need to learn? Who needs to be trained in what in your organization? Contact us for a precise diagnosis.
Source: InfoWorld. Article written by Galen Gruman (editor-in-chief), Steve Nunez (editor-in-chief), Frank Ohlhorst, and Dan Tynan.