Guide to Big Data Architecture for Small Businesses & Organizations
No business is too small for big data. In fact, any business that doesn’t take advantage of advanced data collection and analytics capabilities risks finding itself out of business faster than you can say “Kubernetes.”
A report issued in late 2018 by Oxford Economics and NTT Data (registration required) highlights the importance of big data to future business success:
- 90% of executives surveyed agreed or strongly agreed that big data will be “critical to overall financial performance” in the next three years.
- 92% believe it will be “critical to improving the customer experience.”
- 80% state it will be “critical to growth.”
- 83% believe it will be “a competitive differentiator for our company.”
The survey quotes one executive who compared data to horses: “Beautiful and powerful, but useless to humans unless tamed and harnessed.”
A clear pattern can be discerned in the way businesses of all sizes approach big data. In one camp are the organizations that are “following a defined path toward their data goals.” These companies understand the regulations that apply to the data they hold (98% agree or strongly agree vs. 74% of companies without a clear data plan). They also understand the types of data they share (93% vs. 74%) and how they intend to use the data they collect (91% vs. 76%).
Small businesses are rising to the challenge of staying competitive with their larger counterparts by quickly moving to plan and implement their big data architectures. A 2018 survey conducted by Dresner Advisory Services and reported by Forbes found that organizations with 100 employees or fewer had the highest adoption rate of business intelligence (BI) tools, including data models driven by advanced analytics. They also had the highest rate of hiring employees with analytics and BI skills.
Crafting a big data architecture requires the same planning and understanding of a company’s data goals as is needed to construct a building, as Datamation points out. Step one is meeting with stakeholders to understand their needs and objectives. Then, big data architects compare the pros and cons of various frameworks and analytics tools. They must also consider the many different data sources, data types, and data formats needed to accommodate the big data plan, as well as the storage options and how the results of data analyses will be used by different stakeholders.
This guide serves as an introduction to the many ways that big data architectures are transforming small businesses and enabling them to compete successfully with their much larger counterparts.
How Small Businesses Use Big Data, and Why They Need It
A survey of data professionals conducted by BI-Survey.com identified the three most important trends of 2019. They are: data quality/mastering data management, data discovery/visualization, and self-service BI. All three of these trends are key components of developing a big data architecture, yet there is uncertainty about the best approaches businesses can take to ensure their data operations provide strategic advantage.
At the same time, the number of small businesses that will benefit from implementing a big data architecture continues to expand as the technology becomes more accessible and less expensive. Datamation lists six criteria for determining whether a big data approach is suitable for a small business:
- You have extensive data from your network and web logs that you need to extract and analyze.
- The data sets you process are larger than 100GB and require more than 8 hours to run.
- You have sufficient resources to invest in a big data project that may require third-party products to optimize.
- Your data stores are large and include a great deal of unstructured data that needs to be converted to structured formats for analysis.
- You collect and analyze large volumes of structured and unstructured data from multiple sources.
- Your business requires proactive analysis of massive data stores, including seasonal sales, the impact of advertising, sentiment analysis of social media posts, email pattern analysis, and other categories.
What Is a ‘Big Data Architecture’?
Traditional databases have driven business for more than half a century. Yet the standard database management system design has clear physical limitations. Most cannot undertake the analysis of today’s massive data stores and the complex relationships that exist between structured data (text and numbers, primarily) and unstructured data (audio, video, and other non-traditional data formats).
Big data architectures are designed to overcome these limitations by supporting the collection, processing, and analysis of data sets that are measured in terabytes and that need to model unique combinations of data from diverse sources. Microsoft describes the workloads associated with typical big data solutions:
- Batch processing of various big data sources when the data is stored, or “at rest”
- Real-time processing of big data as it is gathered, or “in motion”
- Exploring big data interactively
- Forecasting events by applying predictive analytics and machine learning
Among the components of big data architecture are the following: data sources, the storage media, stream processing to prepare the data for analysis, the analytical data store that is queried to extract BI, reporting to present the BI in usable formats, and orchestration to automate the workflows through which the data travels.
Big Data Architectural Layers
Big data architecture comprises four logical layers, as Datamation explains:
- Data sources—include the company’s own databases and internal documents, as well as data from mobile devices, social media, email, sensors, and third-party providers.
- Data massaging and storage—collects the data from these diverse sources, converts unstructured data to a form that analysis tools can ingest, and stores structured data in a relational database management system (RDBMS) and unstructured data in a NoSQL database or the Hadoop Distributed File System (HDFS).
- Analytics—is the layer that transforms the stored data into business intelligence using technologies such as sampling for structured data and specialized tools to analyze unstructured data.
- Consumption—delivers the results of the analyses in an output format that managers, applications, and business processes can incorporate to support decision making.
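The four layers above can be sketched as a minimal pipeline. This is an illustrative toy, not a real architecture: the function names, the sample records, and the in-memory structures standing in for storage and analytics engines are all hypothetical.

```python
# Minimal sketch of the four logical layers, using in-memory Python
# structures in place of real storage and analytics engines.

def ingest():
    """Data sources layer: raw records from hypothetical sources."""
    return [
        {"source": "web_log", "text": "page=/pricing visits=42"},
        {"source": "social", "text": "Great product! visits=7"},
    ]

def massage(raw_records):
    """Massaging/storage layer: convert unstructured text to structured rows."""
    rows = []
    for rec in raw_records:
        visits = int(rec["text"].split("visits=")[1])
        rows.append({"source": rec["source"], "visits": visits})
    return rows

def analyze(rows):
    """Analytics layer: aggregate the structured rows."""
    return sum(r["visits"] for r in rows)

def consume(total):
    """Consumption layer: format results for decision makers."""
    return f"Total visits across sources: {total}"

report = consume(analyze(massage(ingest())))
print(report)
```

In a production architecture, each function would be replaced by a dedicated subsystem (ingestion connectors, an RDBMS or HDFS store, an analytics engine, and a reporting dashboard), but the flow between layers follows the same shape.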
Four major processes manage the transfer of data across these logical layers:
- Links between data sources ensure the data is delivered where it needs to be in a timely manner.
- Governance provisions guarantee compliance with regulations relating to privacy and data security.
- Central management consoles allow managers to monitor their big data architecture to anticipate performance problems and address them quickly.
- A quality of service (QoS) framework defines data quality parameters, compliance policies, and performance standards for data ingestion and analysis operations.
Benefits of Big Data Architectures
The ultimate goal of big data architectures for businesses large and small is better business decision making. To achieve this goal, the big data setup must deliver information to the right people at the right time. Additionally, that information must be relevant to the decision, complete, accurate, and in a usable format. When all the pieces of the big data puzzle are assembled correctly, the result is a system that enables a company to choose among various technology options to find the best data analytics solution.
Microsoft describes three other benefits of big data architectures:
- The parallelism that is the foundation of big data allows companies to implement high-performance applications that can scale to accommodate large volumes of data.
- Scale-out provisioning introduces elasticity, allowing big data applications to scale up for large workloads and down for small ones so companies avoid overspending on resources.
- Because the components of big data architectures are also native to the Internet of Things (IoT) and BI applications, they support interoperability without having to change data formats or otherwise transform the data.
Resources for Designing a Big Data Architecture
Realizing the benefits of big data architectures for small businesses requires overcoming several challenges, especially cutting through the complexity that such architectures entail. It can also be difficult to find employees with the skills to design, implement, and manage a big data setup. Because the technologies at the heart of big data are so new, they continue to evolve; new managed services for big data are announced regularly. Lastly, the many platforms and data sources in a typical big data architecture complicate the process of securing all that data.
Here’s a quick look at available resources to support a small business big data architecture plan.
Data Connection and Ingestion
Deciding whether to build your own big data architecture or buy a solution off the shelf often comes down to how your company will connect to and collect data from various sources. TechRepublic explains that products from vendors such as Teradata, SAP, SAS, and Splunk can be implemented quickly and are simple to use, but they are also expensive.
If your big data architecture will rely primarily on batch processing of static data, solutions from Oracle, Hadoop MapReduce, and Apache Spark may work well. They accommodate large volumes of data, support scheduling, and serve as test beds for building out prototypes. On the other hand, products such as Apache Kafka, Splunk, and Apache Flink support streaming data, which facilitates the creation of predictive models. Stream processing also scales massively and accommodates diverse data sources, particularly in DevOps environments.
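The distinction between batch and stream processing can be illustrated with a toy example. This is plain Python for illustration only; real systems such as Spark, Kafka, or Flink replace these loops with distributed engines.

```python
# Toy contrast between batch and stream processing.

def batch_total(records):
    """Batch: all data is at rest before processing starts."""
    return sum(records)

def stream_totals(record_iter):
    """Stream: emit a running total as each record arrives."""
    total = 0
    for value in record_iter:
        total += value
        yield total

sales = [120, 80, 200]
print(batch_total(sales))                # one answer after the whole batch
print(list(stream_totals(iter(sales)))) # an answer after every event
```

The batch version produces a single result once all the data is available; the streaming version yields an updated result after every event, which is what makes live dashboards and predictive models on fresh data possible.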
Data Protection and Governance
A primary concern for companies choosing between running their big data architecture on public or private cloud services is their ability to comply with data security regulations. The site Transforming Data with Intelligence describes how governance in the era of big data differs from traditional approaches. A big data architecture stitches together diverse subsystems, each with its own data types, business processes, and purposes. Closing the gaps that naturally occur between these separate components is the key to securing a big data architecture.
In particular, when data streams through the various subsystems, it can be published and consumed on-demand rather than being packaged and vetted before delivery. To make the streaming process governable, the streams must be isolated and immutable (no changes allowed), permissions must be set in advance for all publishers and subscribers, audit logs must be maintained carefully, and the streams must be replicated to ensure availability in the event of a system failure.
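The governance requirements above can be sketched in a few lines of Python. The class and its names are hypothetical, not a real messaging API; the point is to show immutable events, permissions set in advance, and an audit log working together.

```python
# Sketch of a governable stream: immutable events, pre-set permissions
# for publishers and subscribers, and an audit log of every action.

from types import MappingProxyType

class GovernedStream:
    def __init__(self, allowed_publishers, allowed_subscribers):
        self.allowed_publishers = set(allowed_publishers)
        self.allowed_subscribers = set(allowed_subscribers)
        self.events = []       # append-only event list
        self.audit_log = []

    def publish(self, who, event):
        if who not in self.allowed_publishers:
            self.audit_log.append(("denied-publish", who))
            raise PermissionError(who)
        # MappingProxyType gives consumers a read-only view: no changes allowed
        self.events.append(MappingProxyType(dict(event)))
        self.audit_log.append(("publish", who))

    def consume(self, who):
        if who not in self.allowed_subscribers:
            self.audit_log.append(("denied-consume", who))
            raise PermissionError(who)
        self.audit_log.append(("consume", who))
        return list(self.events)

stream = GovernedStream(allowed_publishers={"pos-terminal"},
                        allowed_subscribers={"analytics"})
stream.publish("pos-terminal", {"sale": 19.99})
events = stream.consume("analytics")
```

Replication for availability, the remaining requirement, is handled at the infrastructure level in real systems (for example, Kafka's replicated partitions) rather than in application code.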
Data Processing and Management
The advent of big data architectures moved some data processing operations outside the database itself to avoid the overhead and other limitations endemic to traditional database management systems (DBMSs). Towards Data Science highlights the approach taken by Hadoop and other big data management tools: break data into pieces that can be processed in parallel to create scaling that allows the volume of processes to increase while maintaining the amount of data each process handles.
Spark is able to accommodate high volumes of processes in memory by leveraging the total amount of memory available in the distributed environment. Spark’s biggest advantage, when compared to Hadoop, is its ability to iteratively process the same piece of data multiple times. Iterative processing is key to big data analytics and machine learning in particular. However, it is difficult to scale processing when hundreds of analytic processes run simultaneously, which makes Hadoop’s distributed storage an important complement to Spark in a big data architecture.
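The split-and-process-in-parallel idea behind Hadoop's MapReduce can be sketched in plain Python. This is a minimal illustration, assuming thread-based workers stand in for cluster nodes; a real deployment distributes the chunks across machines.

```python
# Toy MapReduce: split data into chunks, count words in each chunk in
# parallel, then merge the partial counts into one result.
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

def map_chunk(lines):
    """Map step: count words within one chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

log = ["error disk full", "ok", "error timeout", "ok ok"]
chunks = [log[:2], log[2:]]  # split the data into independent pieces

with ThreadPoolExecutor() as pool:
    partials = pool.map(map_chunk, chunks)
word_counts = reduce_counts(partials)
```

Because each chunk is processed independently, adding more workers (or nodes) increases throughput without increasing the amount of data each process handles, which is exactly the scaling property described above.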
Data Transformation and Quality
At present, businesses of all sizes must run their data systems in two worlds: the traditional realm of relational databases, and the new era of big data architectures. As Transforming Data with Intelligence explains, the quality of traditional data must be preserved as it is transformed to be used in big data environments. The data quality tools and techniques businesses are familiar with can be applied to big data architectures once they’ve been adjusted and optimized.
Data quality standards for big data settings must account for the fact that people will use self-service consoles to extract intelligence from big data stores, so the standards must accommodate ad hoc browsing, visualizing, and querying. Other data-quality challenges include deduplication to prevent skewed analytics; automatic identification of links between data sets; profiling data during development and monitoring it during production; and qualifying data captured from customers via smartphone apps, social media, third-party data providers, and other sources.
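Deduplication is one of the more tractable of these challenges to illustrate. A minimal sketch, assuming customer records keyed on a normalized email address (the field names and sample data are hypothetical):

```python
# Sketch of deduplication before analysis: normalize records and drop
# duplicates so repeated rows don't skew averages or counts.

def normalize(record):
    """Key on lowercased, stripped email so trivially varied rows match."""
    return record["email"].strip().lower()

def deduplicate(records):
    seen = set()
    unique = []
    for rec in records:
        key = normalize(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

customers = [
    {"email": "ann@example.com", "spend": 120},
    {"email": "Ann@Example.com ", "spend": 120},  # duplicate in disguise
    {"email": "bob@example.com", "spend": 80},
]
clean = deduplicate(customers)
avg_spend = sum(c["spend"] for c in clean) / len(clean)
```

Without the normalization step, the duplicate row would count Ann twice and pull the average spend up from 100 to roughly 106.7, which is precisely the kind of skew the text warns about.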
Tools and Tips for Implementing Big Data Architectures
Few small businesses have the resources to hire a data scientist or business data analytics specialist. Fortunately, many of the tools that data experts use are available to organizations of all sizes. In addition, many of the tools provide users with dashboard-style controls and at-a-glance views of their big data operations.
Tips for Maintaining a Big Data Architecture
The starting point for a big data management plan is to identify the problems that data analytics can help solve. The Python Guru writes that focusing on specific problem areas keeps big data plans from sprawling out of control. Doing so also reduces the amount of irrelevant information that the analytics operation has to process. It is also important to devise solutions that use the technologies available to a company, rather than approaches that are beyond its capabilities.
Tools for Managing Big Data Architectures
- Hadoop—is an open-source framework from Apache for distributed storage and processing, designed to scale from single servers to clusters of thousands of distributed processing nodes, each working on large datasets in unison. Hadoop is noted for its extensive software library, data processing capacity, and stability due to Apache’s product support.
- Cassandra—is another open-source database from Apache intended for applications requiring scalability, high availability, and high performance. Cassandra automatically replicates data to multiple nodes in distributed networks in the cloud, on premises, or a mix of both, allowing failed nodes to be replaced with zero downtime.
- Bokeh—is an open-source visualization project of the nonprofit NumFOCUS. It is noted for its interactivity, including a hover tool for displaying data on mouse-overs and easy sharing of plots and other data with colleagues.
Big Data Services Geared to Small Businesses
As the amount of data that is available to businesses expands, the importance of maximizing the value of the data increases. As Business News Daily points out, a range of services for tapping these data assets are affordable to many small businesses. In some cases, a small business can’t afford not to take advantage of these services. For example, SAS has been a leader in business data analytics for decades, and now offers easy-to-use dashboards paired with automated forecasting and data mining in a way that requires few resources and little setup.
The introduction of big data architectures to small businesses points to the way organizations of all types are being transformed into data-centric operations. While many small firms are taking their first tentative steps into the realm of data analytics, the changes that big data and other technologies will make on business processes are just beginning.