We live in a world of cloud computing, best-of-breed applications and BYOX (bring your own everything). Companies are opening up to the idea of giving people freedom and choice in technology and tools. The freedom to use the tools and applications of one's choice shortens the learning curve and promotes focus on innovation and efficiency. But this freedom comes at a cost: enterprises need strong technology infrastructure and processes to support a variety of applications, tools and platforms while ensuring security, privacy and compliance.
Our experience working with the publishing industry has let us observe this bittersweet truth first hand. In the publishing world, content is generated by many internal and external contributors, and in most cases it is impossible to enforce a single content management system or a single ideation-to-publish process. Companies therefore end up with large amounts of content generated by discrete systems in various formats. The content that accumulates is typically voluminous, unstructured, inconsistent, and arrives in waves.
To publish quality content efficiently and consistently, it is very important to have a set of common formats in place for content and digital asset management. Common formats promote efficiency, modularity, standardization, and reuse of content and other digital assets. Big data platforms like Hadoop can come in handy for publishing firms, applying a layer of common formats and processes on top of the large amounts of unstructured content they accumulate from discrete systems and individuals. The Hadoop ecosystem provides the technology platform required to handle large volumes of unstructured content in support of an enterprise-scale publishing process.
At Jade Global, we have created a reference architecture, based on the Hadoop ecosystem, to support and enhance the publishing process. It draws on our experience working with companies that deal with large amounts of unstructured content from discrete systems. The architecture covers the most commonly sought-after functions of the publishing process: aggregation, filtering, curation, classification, indexing, standardization, modularization and workflow. Many more Hadoop ecosystem components have potential uses in content management and publishing, but the reference architecture covers the most commonly used functions. It is also possible to slice out individual ecosystem components and implement each function separately on top of Hadoop Core.
Core Functions of the Reference Architecture:
Flume agents and sinks are very efficient at collecting unstructured data from discrete systems. In a typical configuration, each source system is assigned a dedicated Flume agent, configured to collect data in whatever format that source system is capable of providing. The beauty of Flume is that it supports various formats, so no changes are needed in the source systems. At Jade Global, our team can also create custom Flume connectors to collect data from unsupported proprietary systems. The function of a Flume sink is to filter incoming data and store it in the Hadoop Distributed File System. A sink can be used to filter out data that is not needed further along in the publishing process, or to perform simple transformations before storing content.
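As an illustration, a minimal Flume agent along these lines might be configured as follows. The agent name, directory paths and channel sizing are assumptions made for the example:

```properties
# Illustrative Flume agent: picks up files dropped by one source system
# and delivers the events to HDFS. All names and paths are assumptions.
agent1.sources = cmsSource
agent1.channels = memChannel
agent1.sinks = hdfsSink

# Spooling-directory source: reads files the source system exports
agent1.sources.cmsSource.type = spooldir
agent1.sources.cmsSource.spoolDir = /var/content/cms-exports
agent1.sources.cmsSource.channels = memChannel

# In-memory channel buffering events between source and sink
agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 10000

# HDFS sink: lands raw content, partitioned by date
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /data/raw/cms/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.channel = memChannel
```

In a real deployment, filtering and light transformation before the sink would typically be attached as Flume interceptors on the source, or handled by a custom sink.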
The Hadoop Distributed File System provides reliable, high-performance storage for structured and unstructured data. Because of its high-performance access and support for unstructured data, HDFS is perfectly suited to storing unstructured content from various source systems. Jade's Hadoop team specializes in installing, administering, configuring and maintaining Hadoop Core components like HDFS.
MapReduce is Hadoop's data analysis, manipulation and programming engine. It delivers high-performance data transformation with almost effortless programming. With its ability to read, analyze and transform large volumes of unstructured data at lightning speed, MapReduce becomes the powerhouse for standardizing content into the format the enterprise publishing process requires. Jade's specialists have experience developing MapReduce-based standardization processes, including removing unnecessary content (such as CSS styling and HTML tags), converting content from proprietary formats to industry-standard open formats, consolidating content files by content type, modularizing content for future reuse, and identifying and cleaning up duplicates. Our passion and drive to explore better ways to transform unstructured data continue to deliver new ways to optimize MapReduce for our clients.
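A standardization pass like the one described could be written as a Hadoop Streaming job. The following is a minimal sketch of a Python mapper that strips style blocks and HTML tags from incoming content; the cleanup rules are assumptions, and a production job would also handle character entities and malformed markup:

```python
"""Sketch of a Hadoop Streaming mapper that standardizes content by
stripping <style> blocks and HTML tags before downstream publishing steps."""
import re
import sys

STYLE_RE = re.compile(r"<style\b.*?</style>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher, adequate for a sketch

def standardize(line):
    """Drop <style> blocks, strip remaining tags, collapse whitespace."""
    text = STYLE_RE.sub(" ", line)
    text = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def run(stream_in=sys.stdin, stream_out=sys.stdout):
    """Mapper loop: emit one cleaned line per input line with real text."""
    for raw in stream_in:
        cleaned = standardize(raw)
        if cleaned:  # drop lines that were pure markup
            print(cleaned, file=stream_out)
```

Wired up with a `run()` call under an `if __name__ == "__main__":` guard, a script like this would be passed to Hadoop Streaming via its `-mapper` option so the cluster fans the cleanup out across the content set.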
Mahout is a high-speed, highly scalable machine learning platform that runs on top of Hadoop. The most common uses of Mahout in the publishing process include automatic classification of content segments, identification of search tags for content segments, and automatic generation of metadata for content. Automatic classification of large amounts of unstructured content using Mahout can bring huge efficiency and standardization benefits to enterprises.
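Mahout itself is driven through its own command-line tools and Java APIs. To illustrate the underlying idea of automatic content classification, here is a toy multinomial naive Bayes classifier, one of the algorithms Mahout implements at scale, written in plain Python. The categories and training text are invented for the example:

```python
"""Toy multinomial naive Bayes text classifier, illustrating the kind of
automatic content classification Mahout performs at Hadoop scale."""
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()            # label -> document count
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label, doc_count in self.label_counts.items():
            # log prior + log likelihoods with add-one smoothing
            score = math.log(doc_count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.train("quarterly earnings revenue stock market", "finance")
nb.train("novel chapter plot character fiction", "books")
```

After training, `nb.classify("stock market revenue report")` lands in the "finance" category; in a publishing pipeline the predicted label would feed the content's metadata and search tags.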
Search and Metadata:
Like standardization, search indexing and metadata creation can run as high-speed MapReduce jobs over huge amounts of data. At Jade Global, we have devised highly efficient MapReduce-based processes to generate search indexes from various types of open and proprietary sources. We also specialize in automatically identifying custom metadata, based on a company's requirements, from unstructured discrete content sources. In addition, we assist our clients in installing, administering, configuring and maintaining HBase to store content metadata and other transactional information. HBase is a Hadoop-based, column-oriented NoSQL database that delivers the convenience of a relational database with high scalability and lightning-fast performance.
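Conceptually, a MapReduce indexing job emits (term, document) pairs in the map phase and merges them per term in the reduce phase. The sketch below runs both phases in a single Python process to show the shape of the job; on Hadoop each phase would be distributed across the cluster, and the document names and contents here are invented:

```python
"""Minimal map/reduce-style inverted index builder: the map phase emits
(term, doc_id) pairs, the reduce phase groups document ids by term."""
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs, as a MapReduce mapper would."""
    for term in set(text.lower().split()):
        yield term, doc_id

def reduce_phase(pairs):
    """Group document ids by term, as a MapReduce reducer would."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    "doc1": "publishing content on Hadoop",
    "doc2": "content classification with Mahout",
}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
# index["content"] -> ["doc1", "doc2"]
```

The resulting term-to-documents map is exactly the structure a search index or an HBase metadata table would be loaded with.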
Advantages of Reference Architecture for Publishing Process
- Freedom and Productivity: Implementing this reference architecture allows authors and contributors to use the platform of their choice for ideation, authoring and packaging content. Because the reference architecture includes a standardization process, the organization does not need to compromise on security, privacy or standards compliance while allowing discrete systems to generate content.
- Common Formats and Processes: The reference architecture is designed to support common publishing processes and formats. With high-speed standardization support, the reference architecture and the Hadoop ecosystem allow organizations to define and enforce best practices and processes for publishing. They also allow continuous optimization of the publishing process and formats to keep up with changing business and technology needs.
- Automation: The reference architecture and the Hadoop ecosystem enable organizations to automate large portions of the content publishing process, with room for human intervention as needed. Everything from content aggregation, standardization, classification, indexing and search optimization to open-standard publishing can be automated using the Oozie workflow engine.
- Open Format Publishing: The architecture promotes publication of content in open, industry-standard formats, giving the flexibility to publish to multiple platforms such as web, print, mobile and social media, or even to content resellers. This allows publishing businesses to explore non-traditional revenue streams and innovative ways to deliver content.
- Time to Market: Automation, standardization, process focus and high-speed processing of large amounts of data enable businesses to publish content at a fast pace. In today's competitive world of content publishing, every second between ideation and publication is critical to the success of the content, its popularity and the revenue it generates. The reference architecture and the Hadoop ecosystem enable enterprises to achieve best-in-class efficiency in the publishing process.
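The Oozie-driven automation described under Automation above could be expressed as a workflow definition along these lines; the action names, paths and job properties are purely illustrative:

```xml
<!-- Illustrative Oozie workflow: run a MapReduce standardization pass
     over raw content. Names, paths and properties are assumptions. -->
<workflow-app name="publishing-pipeline" xmlns="uri:oozie:workflow:0.5">
    <start to="standardize"/>
    <action name="standardize">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/data/raw/content</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/data/standardized</value>
                </property>
            </configuration>
        </map-reduce>
        <!-- a fuller pipeline would chain indexing, classification
             and publishing actions here instead of ending -->
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Standardization step failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each stage of the pipeline becomes an Oozie action with explicit success and failure transitions, which is what makes human intervention possible exactly where it is wanted.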
Read the blog: Is MapReduce dead?