Thursday, 14 September 2017

Spark Tutorials

Welcome to Spark Tutorials. The objective of these tutorials is to provide an in-depth understanding of Spark. We will introduce the basic concepts of Apache Spark and the first few steps necessary to get started with Spark.
In addition to the free Spark tutorials, we will cover common interview questions, issues, and how-tos of Spark.

Introduction

Big data and data science are enabled by scalable, distributed processing frameworks that allow organizations to analyze petabytes of data on large commodity clusters. MapReduce (especially the Hadoop open-source implementation) is the first, and perhaps most famous, of these frameworks.
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets (see Spark API Documentation for more info). Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN enabled workloads in the enterprise.
Apache Spark is a general-purpose distributed computing engine for processing and analyzing large amounts of data. Though not as mature as the traditional Hadoop MapReduce framework, Spark offers performance improvements over MapReduce, especially when Spark’s in-memory computing capabilities can be leveraged.
Spark programs operate on Resilient Distributed Datasets, which the official Spark documentation defines as “a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.”
MLlib is Spark’s machine learning library, which we will employ for this tutorial. MLlib includes several useful algorithms and tools for classification, regression, feature extraction, statistical computing, and more.
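To give a concrete flavour of MLlib, here is a minimal sketch in Scala (not part of the original tutorial) that trains a logistic regression classifier on a tiny hand-made dataset. The object name, application name, and data points are purely illustrative, and a local Spark 1.x installation is assumed.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object MLlibSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative local context; on a real cluster the master would come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("mllib-sketch").setMaster("local[*]"))

    // A toy training set: label 1.0 for points near (1, 1), label 0.0 for points near the origin.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.1, 0.2)),
      LabeledPoint(0.0, Vectors.dense(0.2, 0.1)),
      LabeledPoint(1.0, Vectors.dense(0.9, 0.8)),
      LabeledPoint(1.0, Vectors.dense(0.8, 0.9))
    ))

    // Train a binary logistic regression model and classify an unseen point.
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
    println(model.predict(Vectors.dense(0.85, 0.75)))

    sc.stop()
  }
}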

Concepts

At the core of Spark is the notion of a Resilient Distributed Dataset (RDD), which is an immutable collection of objects that is partitioned and distributed across multiple physical nodes of a YARN cluster and that can be operated on in parallel.
Typically, RDDs are instantiated by loading data from a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat on a YARN cluster.
Once an RDD is instantiated, you can apply a series of operations to it. All operations fall into one of two types: transformations or actions. Transformation operations, as the name suggests, create new datasets from an existing RDD and build out the processing Directed Acyclic Graph (DAG) that can then be applied to the partitioned dataset across the YARN cluster. An action operation, on the other hand, executes the DAG and returns a value.
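To make the distinction concrete, here is a small illustrative example you could type into the spark-shell set up later in this tutorial, where sc is the predefined SparkContext; the HDFS path shown is hypothetical.

scala> val lines = sc.textFile("hdfs:///path/to/input.txt")   // hypothetical path: an RDD loaded from HDFS
scala> val numbers = sc.parallelize(1 to 10)                  // an RDD built from a local collection
scala> val squares = numbers.map(n => n * n)                  // transformation: only extends the DAG, nothing runs yet
scala> val evens = squares.filter(n => n % 2 == 0)            // another transformation
scala> evens.reduce(_ + _)                                    // action: the DAG executes and a value is returned
res0: Int = 220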

Installation of Spark

Spark builds on components of the Hadoop ecosystem, so it is best installed on a Linux-based system. The following steps show how to install Apache Spark.
Java is a prerequisite for installing Spark. Try the following command to verify the Java version.
$java -version 
If Java is already installed on your system, you will see a response similar to the following −
java version "1.7.0_71" 
Java(TM) SE Runtime Environment (build 1.7.0_71-b13) 
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If you do not have Java installed on your system, install Java before proceeding to the next step.

Verifying Scala Installation

Spark requires the Scala language, so let us verify the Scala installation using the following command.
$scala -version
If Scala is already installed on your system, you will see a response similar to the following −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
If you don’t have Scala installed on your system, proceed to the next step to install Scala.

Downloading Scala

Download the latest version of Scala by visiting the following link: Download Scala. For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.

Installing Scala

Follow the steps given below to install Scala.

Extract the Scala tar file

Type the following command to extract the Scala tar file.
$ tar xvf scala-2.11.6.tgz

Move Scala software files

Use the following commands to move the Scala software files to their directory (/usr/local/scala).
$ su - 
Password: 
# cd /home/Hadoop/Downloads/ 
# mv scala-2.11.6 /usr/local/scala 
# exit 

Set PATH for Scala

Use the following command to set the PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin

Verifying Scala Installation

After installation, it is better to verify it. Use the following command to verify the Scala installation.
$scala -version
If Scala has been installed correctly, you will see a response similar to the following −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Installing Spark

Follow the steps given below to install Spark. This tutorial assumes you have already downloaded a pre-built Spark release (spark-1.3.1-bin-hadoop2.6.tgz) from the Apache Spark downloads page to your download folder.

Extracting Spark tar

Use the following command to extract the Spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz 

Moving Spark software files

Use the following commands to move the Spark software files to their directory (/usr/local/spark).
$ su - 
Password:  

# cd /home/Hadoop/Downloads/ 
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark 
# exit 

Setting up the environment for Spark

Add the following line to the ~/.bashrc file. This adds the location of the Spark binaries to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command to source the ~/.bashrc file.
$ source ~/.bashrc

Verifying the Spark Installation

Run the following command to open the Spark shell.
$spark-shell
If Spark is installed successfully, you will see output similar to the following (the exact version details will depend on the release you installed).
Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop 
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
   ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server 
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292. 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0 
      /_/  
  
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71) 
Type in expressions to have them evaluated. 
Spark context available as sc  
scala>
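As an optional sanity check (this line is an illustrative example, not part of the original output), you can run a trivial job directly at the prompt:

scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0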
Start a Spark Script

To run your Spark script on Mortar, you’ll need to place the script in the sparkscripts directory of a Mortar project.
The finished, ready-to-run version of the Spark script is available for your reference in the example project sparkscripts directory: sparkscripts/text-classifier-complete.py.

Create a New Spark Script

Now that you have a project to work with, you’re ready to start writing your own Spark script.
ACTION: In your favorite code editor, create a new blank file called text-classifier.py in the sparkscripts directory of your project.

Features of Apache Spark

-Speed − Spark can run applications on a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk. It achieves this by reducing the number of read/write operations to disk and storing intermediate processing data in memory.
-Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
-Advanced Analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms (a short example follows below).
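To illustrate the Advanced Analytics point above, here is a brief, illustrative Spark SQL snippet for the spark-shell (assuming a Spark 1.x build where sc is predefined; the table name, column names, and rows are made up for the example):

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlContext.implicits._
scala> val people = sc.parallelize(Seq(("alice", 34), ("bob", 29))).toDF("name", "age")
scala> people.registerTempTable("people")
scala> sqlContext.sql("SELECT name FROM people WHERE age > 30").show()

The same data could just as easily be handed to MLlib or processed as a stream, which is what makes Spark a single general-purpose engine rather than a collection of separate tools.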

Wednesday, 13 September 2017

How Yahoo’s Internal Hadoop Cluster Does Double-Duty on Deep Learning

Five years ago, many bleeding edge IT shops had either implemented a Hadoop cluster for production use or at least had a cluster set aside to explore the mysteries of MapReduce and the HDFS storage system.
While it is not clear all these years later how many ultra-scale production Hadoop deployments there truly are (something we are analyzing for a later in-depth piece), those same shops are likely on top of trying to exploit the next big thing in the datacenter: machine learning, or, for the more intrepid, deep learning.
For those that were able to get large-scale Hadoop clusters into production and who now enjoy a level of productivity on those systems, integrating deep learning and machine learning presents a challenge—at least if that workload is not being moved to an entirely different cluster for the deep learning workload. How can Caffe and TensorFlow integrate with existing data in HDFS on the same cluster? It turns out, it is quite a bit easier, even with the addition of beefy GPU-enabled nodes to handle some of the training part of the deep learning workflow.
Work on this integration of deep learning and Hadoop comes from the least surprising of quarters: Yahoo, the home of Hadoop and MapReduce over a decade ago. Yahoo’s main internal cluster for research, user data, production workloads across its many brands and services (search, ad delivery, Flickr, email), and now deep learning is based on a mature Hadoop-centered stack. Over time, teams at Yahoo have integrated the many cousins of Hadoop: Spark, Tez, and more. They are now looking to capture trends in open source that float away from the Apache base for large-scale analytics they’ve cultivated over the years.

Saturday, 9 September 2017

QlikView Training Online With Live Projects And Job Assistance

What is QlikView

QlikView is an extensive business intelligence platform that presents powerful business analytics to a broad range of users throughout the organisation. The QlikView suite provides data extraction and application development, capable information distribution, and powerful analysis for either the end user or the power user. QlikView can be deployed both online and offline and extended with a range of industry-leading technologies.
QlikView Client is a stand-alone Windows client that provides the full developer ETL and end-user application to develop, deploy, and analyse QlikView applications. QlikView Publisher ensures that the right information reaches the right user at the right time. Developed for larger user environments, QlikView Publisher provides further centralized administration and management. QlikView Publisher allows complete control of the distribution of a company’s QlikView applications through QlikView AccessPoint, while also automating the data refresh process.

Why attend Tekslate Online Training?
Classes are conducted by certified QlikView working professionals with 100% quality assurance.
An experienced certified practitioner will teach you the essentials you need to know to kick-start your career in QlikView. Our training makes you more productive with QlikView. Our training style is entirely hands-on: we will provide access to our desktop screen and actively conduct hands-on labs with real-time projects.

Why should you learn QlikView?

-75% of companies are investing or planning to invest in Big Data by 2017
-Big Data and analytics sales will reach $187 billion by 2019
-Qlik is among the Leaders in the Gartner Magic Quadrant for the sixth year
QlikView is one of the most effective business intelligence tools for analysis, data visualization, and data discovery available in the market today. It is being adopted by some of the biggest organizations in the world. Professionals with QlikView skills and certifications are therefore in high demand, and this QlikView training and its courses provide you an opportunity to enter the hottest data visualization domain.

For more information on online QlikView training, visit: Tekslate

Friday, 1 September 2017

Pentaho Training

About Pentaho


Pentaho was founded in 2004 in Orlando, U.S.A.
Pentaho manages, facilitates, supports, and takes the lead development role in the Pentaho BI Project, a pioneering initiative by the open source community to provide organizations with a comprehensive set of BI capabilities that enable them to radically improve business performance, effectiveness, and efficiency.

Pentaho is an open source platform that provides developers with tools to analyze and integrate data. End business users benefit from ready-to-use analytics in a very simple format. Pentaho delivers business intelligence solutions for accessing data across different platforms. The Pentaho environment supports different data sources and the big data life cycle, presenting data as comprehensive insights.


Pentaho Certification
Pentaho certification is designed to ensure that you learn & master the concepts of Pentaho and pass the certification exams on the first go. The practical hands-on learning approach followed at TekSlate will ensure you get job ready by the end of it.
  • Having a Pentaho certification distinguishes you as an expert.
  • For Pentaho certification, you need not go to a test center, as the exams are available online.
  • You need to prove your technical skills by passing a qualifying associate exam in order to become Pentaho certified.

Salary Trends

The average Pentaho salary in the USA is increasing and compares favorably with other products (source: Indeed.com).
Benefits to our Global Learners
  • Tekslate offers student-centered learning.
  • Quality, cost-effective learning at your own pace.
  • Learn from any part of the world.
For more information: https://tekslate.com
