Posts

How to set up a Spark multi-node cluster on AWS EC2

Context: Our aim (Timothy Humphrey's and mine) was to benchmark the performance of Apache Spark against HPCC. Tim had already set up HPCC on AWS and had placed all the worker nodes in the same placement group so that network latency stays low. The easiest way to start a Spark multi-node cluster in AWS is to use AWS EMR, and there are various resources, like this, which explain how to set up a Spark multi-node cluster that way. But the EMR service does not support placement groups, and it only supports a small number of EC2 instance types compared to what EC2 itself offers. Hence, we decided to set up Spark on EC2 rather than using AWS EMR. The rest of the article documents the process. Disclaimer: Most of the content of this document is from here, with some changes; for example, that tutorial covers installing Spark on an Ubuntu system, while for our setup we wanted to use CentOS (for legacy reasons). Steps: 1. Spin up N nodes on EC2. For this article, we assume N=5. Please make sure that the ports (7077, 8080, etc.
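Once the nodes are up and the standalone master and workers are started, a quick way to check the cluster is to run a small PySpark job against the master URL. The sketch below assumes Spark's standalone defaults (master on port 7077, web UI on 8080); the hostname is a placeholder, not the address from our setup.

    # smoke_test.py -- minimal sketch to verify the standalone cluster is reachable.
    # Assumes the master was started on the node whose address is given below
    # (the hostname here is hypothetical; replace it with your master's address).
    from pyspark import SparkConf, SparkContext

    MASTER_URL = "spark://ip-10-0-0-1.ec2.internal:7077"

    conf = SparkConf().setAppName("cluster-smoke-test").setMaster(MASTER_URL)
    sc = SparkContext(conf=conf)

    # Sum 1..1000 across the workers; 500500 means the cluster is wired up.
    total = sc.parallelize(range(1, 1001), numSlices=10).sum()
    print("sum(1..1000) = %d" % total)

    sc.stop()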

How to access the VM from the host machine

Context: I wanted to connect to Spark using the EMBED feature of ECL along with Livy. Since I am running Spark in standalone mode inside an Ubuntu 16.10 VM and I still wanted to use the ECL IDE, I needed to be able to connect to the VM from the host (Windows) machine. This video clearly describes how to set this up.

Performance comparison between FUSE plugin-based download and browser-based download - 2.0

Updated version of this post. To measure the performance of HPCCFuseJ-based download against browser-based download, we measure the time taken to copy the mounted folder (using HPCCFuseJ) to a local folder and compare it with the download time using the browser (for details, refer to this). The graph below shows the ratio. In the figure, it can be seen that the time required to download using HPCCFuseJ increases as the file size increases. The exponential trend of the graph is due to the multiple-fetch nature of HPCCFuseJ. HPCCFuseJ works in two phases, namely data fetching and data processing. The data fetching phase uses web service calls to fetch data. The data processing phase takes the data fetched during the data fetching phase and converts it to JSON format; this data (in JSON format) is then consumed by the application. The figure below shows the ratio of the total time required (by both phases) to the time required by the data fetch phase. In the figure, it can b
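For reference, a minimal sketch of how such a timing can be taken: copy the mounted folder to a local folder with the copy timed, and compare against a browser download time measured separately. The mount point, destination folder, and browser timing below are placeholders, not the values behind the graph.

    # measure_ratio.py -- sketch: time a copy out of the HPCCFuseJ mount and
    # compare it with a browser download time measured by hand. Paths and the
    # browser timing are hypothetical placeholders.
    import shutil
    import time

    MOUNT_DIR = "/mnt/hpccfusej/somefile_dir"   # folder exposed by HPCCFuseJ (hypothetical)
    LOCAL_DIR = "/tmp/hpcc_copy"                # local destination (must not exist yet)
    BROWSER_DOWNLOAD_SECS = 12.4                # measured separately via the browser

    start = time.time()
    shutil.copytree(MOUNT_DIR, LOCAL_DIR)       # forces HPCCFuseJ to fetch every block
    fuse_secs = time.time() - start

    print("HPCCFuseJ copy time  : %.1f s" % fuse_secs)
    print("Browser download     : %.1f s" % BROWSER_DOWNLOAD_SECS)
    print("Ratio (FuseJ/browser): %.2f" % (fuse_secs / BROWSER_DOWNLOAD_SECS))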

How to test the performance of the FUSE plugin? 2.0

Context: My code couldn't handle nested datasets, and this was primarily because of how I was parsing the results of the WS calls. So, I modified the parsing code by replacing Python's split functionality with regex. Changes: I was testing the performance of the code by copying a file from the mounted HPCC drive (using HPCCFuseJ) to the local drive. The problem with this approach is that as the size of the file increases, the copy operation takes longer and eventually ends up in a hung state. To fix this, I wrote a program which performs the copy operation but, rather than keeping the copied data in memory, flushes it to a local file periodically. Code
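The post's actual code is not included in this excerpt; the following is a rough sketch of the approach described above (reading the file in fixed-size chunks and flushing each chunk to disk instead of holding everything in memory), with hypothetical paths and chunk size.

    # chunked_copy.py -- sketch: copy a file from the HPCCFuseJ mount in
    # fixed-size chunks, flushing each chunk to the local file as it arrives.
    # Paths and chunk size are placeholders.
    import os

    SRC = "/mnt/hpccfusej/somefile.csv"   # file on the mounted HPCC drive (hypothetical)
    DST = "/tmp/somefile.csv"             # local copy
    CHUNK_SIZE = 4 * 1024 * 1024          # 4 MB per read

    with open(SRC, "rb") as src, open(DST, "wb") as dst:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            dst.write(chunk)
            dst.flush()                    # push the chunk out of the Python buffer
            os.fsync(dst.fileno())         # ask the OS to write it to disk now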

How to use REST-based calls to submit Spark jobs

How to use Livy: Livy (alpha) enables programmatic, fault-tolerant, multi-tenant submission of Spark jobs from web/mobile apps, using REST calls to communicate with the Spark cluster. Note: at the time of writing this blog, I have only tried Livy on a standalone PySpark setup, so I don't know the challenges involved in setting up Livy on a PySpark cluster. Since we will be using local files, make sure to add the folder containing the PySpark scripts to the 'livy.file.local-dir-whitelist' parameter of the livy.conf file. Failing to do so results in the following error:                 requirement failed: Local path pi.py cannot be added to user sessions. The command used to submit a batch script is as follows:  curl -X POST --data '{"file": "file:/home/osboxes/spark-1.6.0/examples/src/main/python/pi.py"}' -H "Content-Type: application/json" localhost:8998/batches | python -m json.tool  This command runs the spark
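The same submission can also be done programmatically, which is the point of using Livy in the first place. Below is a sketch of the Python-requests equivalent of the curl command above, with a simple polling loop added; the host, port, and pi.py path are taken from the example, while the polling loop is an illustrative addition rather than part of the original post.

    # submit_pi.py -- sketch: requests-based equivalent of the curl command above,
    # followed by polling the batch state until the job finishes.
    import json
    import time
    import requests

    LIVY_URL = "http://localhost:8998"
    payload = {"file": "file:/home/osboxes/spark-1.6.0/examples/src/main/python/pi.py"}
    headers = {"Content-Type": "application/json"}

    resp = requests.post(LIVY_URL + "/batches", data=json.dumps(payload), headers=headers)
    batch = resp.json()
    print("Submitted batch %d, state=%s" % (batch["id"], batch["state"]))

    # Poll until the batch reaches a terminal state (success, dead, error).
    while batch["state"] not in ("success", "dead", "error"):
        time.sleep(5)
        batch = requests.get("%s/batches/%d" % (LIVY_URL, batch["id"])).json()
        print("state=%s" % batch["state"])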

Performance comparison between FUSE plugin-based download and browser-based download

To measure the performance of HPCCFuseJ-based download against browser-based download, we measure the time taken to copy the mounted folder (using HPCCFuseJ) to a local folder and compare it with the download time using the browser. The graph below shows the ratio. In the figure, it can be seen that the time required to download using HPCCFuseJ increases as the file size increases. The exponential trend of the graph is due to the multiple-fetch nature of HPCCFuseJ. HPCCFuseJ works in two phases, namely data fetching and data processing. The data fetching phase uses web service calls to fetch data. The data processing phase takes the data fetched during the data fetching phase and converts it to JSON format; this data (in JSON format) is then consumed by the application. The figure below shows the ratio of the total time required (by both phases) to the time required by the data fetch phase. In the figure, it can be seen that as the file size grows the time required by t
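As a rough illustration of how the second ratio can be measured, the sketch below times a web-service fetch (the data fetching phase) and the parsing of the response into JSON (the data processing phase) separately, then compares total time with fetch time. The endpoint URL is a placeholder, not HPCCFuseJ's actual web service call.

    # phase_ratio.py -- sketch: time the fetch and processing phases separately
    # and report the ratio of total time to fetch time. The URL is hypothetical.
    import json
    import time
    import requests

    WS_URL = "http://hpcc-esp:8010/some/ws/endpoint"  # placeholder web service call

    t0 = time.time()
    raw = requests.get(WS_URL).text      # data fetching phase
    t1 = time.time()
    data = json.loads(raw)               # data processing phase (payload -> JSON objects)
    t2 = time.time()

    fetch_secs = t1 - t0
    total_secs = t2 - t0
    print("fetch=%.2fs total=%.2fs ratio(total/fetch)=%.2f"
          % (fetch_secs, total_secs, total_secs / fetch_secs))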