Thursday 14 September 2017

Spark Tutorials

Welcome to Spark Tutorials. The objective of these tutorials is to provide an in-depth understanding of Spark. We will introduce the basic concepts of Apache Spark and the first few steps necessary to get started with Spark.
In addition to the free Spark tutorials, we will cover common Spark interview questions, issues, and how-tos.

Introduction

Big data and data science are enabled by scalable, distributed processing frameworks that allow organizations to analyze petabytes of data on large commodity clusters. MapReduce (especially the Hadoop open-source implementation) is the first, and perhaps most famous, of these frameworks.
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets (see Spark API Documentation for more info). Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN enabled workloads in the enterprise.
Apache Spark is a general-purpose distributed computing engine for processing and analyzing large amounts of data. Though not as mature as the traditional Hadoop MapReduce framework, Spark offers performance improvements over MapReduce, especially when Spark’s in-memory computing capabilities can be leveraged.
Spark programs operate on Resilient Distributed Datasets, which the official Spark documentation defines as “a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.”
MLlib is Spark’s machine learning library, which we will employ for this tutorial. MLlib includes several useful algorithms and tools for classification, regression, feature extraction, statistical computing, and more.

Concepts

At the core of Spark is the notion of a Resilient Distributed Dataset (RDD), which is an immutable collection of objects that is partitioned and distributed across multiple physical nodes of a YARN cluster and that can be operated on in parallel.
Typically, RDDs are instantiated by loading data from a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat on a YARN cluster.
Once an RDD is instantiated, you can apply a series of operations. All operations fall into one of two types: transformations or actions. Transformation operations, as the name suggests, create new datasets from an existing RDD and build out the processing Directed Acyclic Graph (DAG) that can then be applied to the partitioned dataset across the YARN cluster. An action operation, on the other hand, executes the DAG and returns a value.
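To make the distinction concrete, here is a minimal PySpark sketch (the application name and input path are illustrative only, not part of this tutorial). The flatMap, map, and reduceByKey calls are transformations that merely extend the DAG; takeOrdered is an action that executes the DAG and returns a value to the driver.

from pyspark import SparkContext

sc = SparkContext(appName="RDDBasics")

# Instantiate an RDD by loading a file from HDFS (path is hypothetical).
lines = sc.textFile("hdfs:///data/sample.txt")

# Transformations: nothing executes yet, they only build out the DAG.
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Action: the DAG is executed across the partitions and a value is returned to the driver.
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])
print(top10)

sc.stop()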

Installation of Spark

Spark is a Hadoop sub-project, so it is best to install Spark on a Linux-based system. The following steps show how to install Apache Spark.
Java is a mandatory prerequisite for installing Spark. Try the following command to verify the Java version.
$java -version 
If Java is already installed on your system, you will see the following response −
java version "1.7.0_71" 
Java(TM) SE Runtime Environment (build 1.7.0_71-b13) 
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If you do not have Java installed on your system, install Java before proceeding to the next step.

Verifying Scala Installation

Spark requires the Scala language, so let us verify the Scala installation using the following command.
$scala -version
If Scala is already installed on your system, you will see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
If you don’t have Scala installed on your system, proceed to the next step for the Scala installation.

Downloading Scala

Download the latest version of Scala by visiting the Download Scala link. For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in your download folder.

Installing Scala

Follow the steps given below to install Scala.

Extract the Scala tar file

Type the following command to extract the Scala tar file.
$ tar xvf scala-2.11.6.tgz

Move Scala software files

Use the following commands to move the Scala software files to their respective directory (/usr/local/scala).
$ su – 
Password: 
# cd /home/Hadoop/Downloads/ 
# mv scala-2.11.6 /usr/local/scala 
# exit 

Set PATH for Scala

Use the following command to set the PATH for Scala. (Note that there must be no spaces around the = sign.)
$ export PATH=$PATH:/usr/local/scala/bin

Verifying Scala Installation

After installation, it is better to verify it. Use the following command to verify the Scala installation.
$scala -version
If Scala has been installed successfully, you will see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Installing Spark

Follow the steps given below to install Spark. This tutorial uses the spark-1.3.1-bin-hadoop2.6 build, which you should download from the Apache Spark downloads page before proceeding.

Extracting Spark tar

Use the following command to extract the Spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz 

Moving Spark software files

Use the following commands to move the Spark software files to their respective directory (/usr/local/spark).
$ su – 
Password:  

# cd /home/Hadoop/Downloads/ 
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark 
# exit 

Setting up the environment for Spark

Add the following line to the ~/.bashrc file. This adds the location of the Spark binaries to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command to source the ~/.bashrc file.
$ source ~/.bashrc

Verifying the Spark Installation

Type the following command to open the Spark shell.
$spark-shell
If Spark has been installed successfully, you will see the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop 
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
   ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop) 
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server 
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292. 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0 
      /_/  
  
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71) 
Type in expressions to have them evaluated. 
Spark context available as sc  
scala>
Start a Spark Script

To run your Spark script on Mortar, you’ll need to place the script in the sparkscripts directory of a Mortar project.
The finished, ready-to-run version of the Spark script is available for your reference in the example project sparkscripts directory: sparkscripts/text-classifier-complete.py.

Create a New Spark Script

Now that you have a project to work with, you’re ready to start writing your own Spark script.
ACTION: In your favorite code editor, create a new blank file called text-classifier.py in the sparkscripts directory of your project.
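The finished script in sparkscripts/text-classifier-complete.py remains the authoritative reference. Purely as a hypothetical starting point (not the Mortar example code), a text classifier built on PySpark and the Spark 1.x MLlib API might begin like the sketch below; the input path, the tab-separated "label<TAB>text" record format, and the feature size are assumptions made only for illustration.

# Hypothetical skeleton for text-classifier.py -- not the Mortar-provided code.
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="TextClassifier")

# Hash each document's terms into a fixed-size term-frequency vector.
tf = HashingTF(numFeatures=10000)

def to_labeled_point(line):
    # Assumed record format: "label<TAB>text", one document per line.
    label, text = line.split("\t", 1)
    return LabeledPoint(float(label), tf.transform(text.lower().split()))

training = sc.textFile("hdfs:///data/labeled_documents.tsv").map(to_labeled_point).cache()

# Train a simple linear classifier on the hashed features.
model = LogisticRegressionWithSGD.train(training, iterations=100)

# Quick sanity check: training accuracy.
correct = training.filter(lambda p: model.predict(p.features) == p.label).count()
print("Training accuracy: %f" % (float(correct) / training.count()))

sc.stop()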

Features of Apache Spark

-Speed − Spark helps run applications in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. It achieves this by reducing the number of read/write operations to disk and storing intermediate processing data in memory (see the sketch after this list).
-Supports multiple languages − Spark provides built-in APIs in Java, Scala, Python, and R, so you can write applications in different languages. Spark also offers around 80 high-level operators for interactive querying.
-Advanced analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
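As referenced in the Speed bullet above, much of the gain comes from caching: once an RDD is marked with cache(), Spark keeps its partitions in executor memory after the first action, so later passes avoid re-reading and re-parsing the input. A minimal PySpark sketch of the idea (the input path is hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="CachingDemo")

# Parse the input once and keep the parsed RDD in memory.
parsed = (sc.textFile("hdfs:///data/events.log")
            .map(lambda line: line.split(","))
            .cache())

print(parsed.count())                                      # first action: reads from disk and caches
print(parsed.filter(lambda f: f[0] == "ERROR").count())    # later actions: served from memory

sc.stop()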

Wednesday 13 September 2017

How Yahoo’s Internal Hadoop Cluster Does Double-Duty on Deep Learning

Five years ago, many bleeding edge IT shops had either implemented a Hadoop cluster for production use or at least had a cluster set aside to explore the mysteries of MapReduce and the HDFS storage system.
While it is not clear, all these years later, how many ultra-scale production Hadoop deployments there are in earnest (something we are analyzing for a later in-depth piece), those same shops are likely now trying to exploit the next big thing in the datacenter—machine learning or, for the more intrepid, deep learning.
For those that were able to get large-scale Hadoop clusters into production and who now enjoy a level of productivity on those systems, integrating deep learning and machine learning presents a challenge—at least if that workload is not to be moved to an entirely different cluster. How can Caffe and TensorFlow integrate with existing data in HDFS on the same cluster? It turns out to be easier than one might expect, even with the addition of beefy GPU-enabled nodes to handle some of the training part of the deep learning workflow.
Work on this integration of deep learning and Hadoop comes from the least surprising quarters—Yahoo, the home of Hadoop and MapReduce over a decade ago. Yahoo’s main internal cluster for research, user data, production workloads across its many brands and services (search, ad delivery, Flickr, email), and now deep learning is all based on a mature Hadoop-centered stack. Over time, teams at Yahoo have integrated the many cousins of Hadoop: Spark, Tez, and more, but they are now looking to capture trends in open source that float away from the Apache base for large-scale analytics they’ve cultivated over the years.

Saturday 9 September 2017

QlikView Training Online With Live Projects And Job Assistance

What is QlikView

QlikView is an extensive business intelligence platform that presents powerful business analytics to a broad range of users throughout the organisation. The QlikView suite covers data extraction and application development, capable information distribution, and powerful analysis for either the end user or the power user. QlikView can be deployed both online and offline and extended with a range of industry-leading technologies.
QlikView Client is a stand-alone Windows client that provides the full developer ETL and end-user application to develop, deploy, and analyse QlikView applications. QlikView Publisher ensures that the right information reaches the right user at the right time. Developed for larger user environments, QlikView Publisher provides further centralized administration and management. QlikView Publisher allows complete control over the distribution of a company’s QlikView applications through QlikView AccessPoint, while also automating the data refresh process.

Why attend Tekslate Online Training?
Classes are conducted by certified QlikView working professionals with 100% quality assurance.
An experienced certified practitioner will teach you the essentials you need to know to kick-start your career in QlikView. Our QlikView online training makes you more productive, and our training style is entirely hands-on. We will provide access to our desktop screen and will be actively conducting hands-on labs with real-time projects.

Why you should learn QlikView?

-75% of Companies Are Investing or Planning to Invest in Big Data by 2017
-Big Data and Analytics Sales Will Reach $187 Billion by 2019
-Qlik Is Among the Leaders in the Gartner Magic Quadrant for the Sixth Year
QlikView is one of the most effective business intelligence software products for analysis, data visualization, and discovery available in the market today. It is being adopted by some of the biggest organizations in the world. As a result, professionals with QlikView skills and certifications are in high demand, and this QlikView training gives you an opportunity to enter the hottest data visualization domain.

For more information on Online Qlikview Training, Visit : Tekslate

Friday 1 September 2017

Pentaho Training

About Pentaho


Pentaho was founded in 2004 in Orlando, U.S.A.
Pentaho manages, facilitates, supports, and takes the lead development role in the Pentaho BI Project, a pioneering initiative by the open source community to provide organizations with a comprehensive set of BI capabilities that enable them to radically improve business performance, effectiveness, and efficiency.

Pentaho is an open source platform that provides developers with tools to analyze and integrate data. Business end users benefit from ready-to-use analytics in a very simple format. Pentaho delivers business intelligence solutions that make data accessible across different platforms. The Pentaho environment supports different data sources and the big data life cycle, presenting data in a comprehensive, insightful format.


Pentaho Certification
Pentaho certification is designed to ensure that you learn & master the concepts of Pentaho and pass the certification exams on the first go. The practical hands-on learning approach followed at TekSlate will ensure you get job ready by the end of it.
  • Having a Pentaho certification distinguishes you as an expert.
  • For Pentaho certification, you need not go to a test center, as the exams are available online.
  • You need to prove your technical skills with a qualified associate exam in order to become Pentaho certified.

Salary Trends

Average Pentaho salaries in the USA are increasing and compare well with those for other products.
Ref: Indeed.com
Benefits to our Global Learners
  • Tekslate services are built around student-centered learning.
  • Quality, cost-effective learning at your own pace.
  • Geographical access to learn from any part of the world.
For more information, visit: https://tekslate.com

Friday 25 August 2017

Learn Oracle Exadata Training Online With Examples

What is Exadata?
Exadata is a pre-configured combination of hardware and software that provides a platform to run Oracle Database.
What is flash cache and how does it work?
The flash cache is a hardware component configured in the Exadata storage cell server that delivers high performance for read and write operations.
The primary task of the smart flash cache is to hold frequently accessed data so that, the next time the same data is required, a physical read can be avoided by reading the data from the flash cache.
What are the types of EHCC?
  •  Query Low
  •  Query High
  •  Archive High
  •  Archive Low
What is the purpose of the spine switch?
The spine switch is used to connect or add more Exadata machines to the cluster.
What is ASR?
ASR (Auto Service Request) is a tool to manage Oracle hardware.
Whenever a hardware fault occurs, ASR automatically raises an SR with Oracle Support and sends a notification to the respective customer.
What is the difference between cellcli and dcli?
CellCLI can be used on the respective storage cell only.
DCLI (Distributed Command Line Utility) can be used to run a command across multiple storage cells as well as DB servers.
What is the difference between write-through and write-back flash cache mode?
Write-through –> the flash cache is used only for reads.
Write-back –> the flash cache is used for both reads and writes.
What are the Exadata sizing configurations?
  • Full Rack
  • Half Rack
  • Quarter Rack
  • 1/8th Rack
What are the steps to create DBFS?
  • Create Directory
  • Create Tablespace on database which you are going to use for DBFS
  • Create user for DBFS
  • Grant required privileges to created user
  • Now connect to database with created user
  • Create dbfs filesystem by invoking dbfs_create_filesystem_advanced
  • Mount file system by starting dbfs_client
What is the difference between DBRM and IORM?
DBRM is a feature of the database, while IORM is a feature of the storage server software.
What is smart flash cache?
Flash cache is a PCIe (Peripheral Component Interconnect Express) card that is plugged into the back of the storage cell.
How does smart scan work?
When a query executes on the database server, the database server sends the extents and metadata to the storage cells.
Smart scan then scans the data blocks to identify the relevant rows and columns.
Once the data is identified, smart scan returns only the appropriate rows and columns to the database server.
Once the DB server receives the data, it assembles the returned data into the result set.
This saves bandwidth as well as CPU and memory cost on the database server, because the bulk of the SQL processing happens on the storage server.
Clear the interview bar with these selected interview questions for Oracle Exadata enthusiasts.
What are the pre-requisites to configure ASR?
  • Access to My Oracle Support
  • Internet connectivity using HTTPS
  • Network connectivity from ASR server to Exadata components
Which MOS ID should I refer to for the latest patch updates?
MOS 888828.1
Which tool is used to generate initial configuration files based on customer’s data?
OEDA (Oracle Exadata Deployment Assistant)
What are the unique features of Exadata?
  • Smart Scan (Cell Offload)
  • Flash cache
  • EHCC (Exadata Hybrid Columnar Compression)
  • IORM (IO Resource Manager)
  • Storage Index
Which networks are available in Exadata?
  • Infiniband Network
  • ILOM and Management Network
  • Client/Public Network
What are the Exadata Health check tools available?
  • Exacheck
  • sundiagtest
  • oswatcher
  • OEM 12c
What is the client or public network in Exadata?
The client or public network is used to establish connectivity between the database and applications.
What are the steps involved in the initial Exadata configuration?
  • Initial network preparation
  • Configure Exadata servers
  • Configure Exadata software
  • Configure database hosts to use Exadata
  • Configure ASM and database instances
  • Configure ASM disk group for Exadata
What is the iDB protocol?
iDB stands for Intelligent Database protocol. It is a network-based protocol that is responsible for communication between the storage cells and the database servers.
What is LIBCELL?
Libcell stands for Library Cell, which is linked with the Oracle kernel. It allows the Oracle kernel to talk to the storage server over the network instead of performing operating system reads and writes.
Which package is used by the compression advisor utility?
DBMS_COMPRESSION package
What is the primary goal of storage index?
Storage indexes are a feature unique to the Exadata Database Machine whose primary goal is to reduce the amount of I/O required to service I/O requests for Exadata Smart Scan.
What is smart scan offloading?
Offloading and Smart Scan are two terms that are used somewhat interchangeably. Exadata Smart
Scan offloads processing of queries from the database server to the storage server.
Processors on the Exadata Storage Server process the data on behalf of the database SQL query. Only the data requested in the query is returned to the database server.
What is checkip and what is it used for?
Checkip is an OS-level script that contains the IP addresses and hostnames that will be used by Exadata during the configuration phase. It checks network readiness, such as proper DNS configuration, and also checks that there is no IP duplication in the network by pinging addresses that are not supposed to respond yet.
Which script is used to reclaim the disk space of an unused operating system?
For Linux: reclaimdisks.sh
For Solaris: reclaimdisks.pl
What should the ASM space allocation be if backups are performed internally?
40% storage space allocation for DATA disk group
60% storage space allocation for RECO disk group
How does the database server communicate with the storage cell?
The database server communicates with the storage cell through the InfiniBand network.
Can I have multiple cell disks for one grid disk?
No. A cell disk can contain multiple grid disks, but a grid disk cannot span multiple cell disks.
How many FMods available on each flash card?
Four FMods (Flash Modules) are available on each flash card.
Which processes are used by storage software?
  • Cellsrv – Cell Server
  • MS – Management Server
  • RS – Restart Server
List the steps for replacing the damaged physical flash disk.
  • Identify damaged flash disk
  • Power off the cell
  • Replace flash card
  • Power on the cell
  • Verify and confirm new flash card
What is smart flash log?
Smart flash log is a temporary storage area on the Exadata smart flash cache used to store redo log data.
Which parameter is used to enable and disable the smart scan?
cell_offload_processing
How do you check the InfiniBand topology?
We can verify the InfiniBand switch topology by executing the verify-topology script from one of the database servers.
Can we use HCC in a non-Exadata environment?
No, HCC is only available for data stored on Exadata storage servers.
What is a resource plan?
It is a collection of plan directives that determines how database resources are to be allocated.
What is DBFS?
DBFS stands for Database File System, which can be built on an ASM disk group using a database tablespace.
What are the major steps involved in cell server patching?
  • Check and note down existing configuration of cell
  • Clean up any previous patchmgr utility
  • Verify that the cells meet prerequisite checks
  • Patch cell server using patchmgr
  • Validate the updated cell
What is the purpose of infiniband spine switch?
The spine switch is used to connect multiple Exadata database machines.
What is OEM?
OEM (Oracle Enterprise Manager) is a centralized tool to monitor and administer systems as well as software.
What is offload block filtering?
Exadata storage server filters out the blocks that are not required for the incremental backup in progress so only the blocks that are required for the backup are sent to the database.
Which command is used to monitor BCT?
SQL>select filename,status, bytes from v$block_change_tracking;
How do you add memory to a database server?
  • Power off database server
  • Add memory expansion into server
  • Power on the server
