Installing Zeppelin notebook on Windows 10

Currently, Apache Zeppelin is the preferred notebook for Spark. It also offers a wide variety of other interpreters, markdown support, and good visualization, which makes it a great tool for your Spark or general Big Data project. However, from my personal experience I have found that installing Zeppelin on a Windows machine can be a bit complicated and somewhat frustrating. The official documentation says that you only need Java installed and JAVA_HOME set for Zeppelin to work, even on Windows. However, that looks like a white lie to me. Mostly, I have noticed that with those steps alone you will see the following error message with the Spark interpreter.

java.lang.NullPointerException
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
NullPointerException, thou mighty bitch

Here is the set of steps that worked for me and my friend on Windows 10. I am not claiming that it will definitely work for you, or that skipping any of the steps will definitely not work. But it was exactly these steps that worked for both of us, nothing less.

1. Install Java and set the JAVA_HOME environment variable. This is a pretty standard procedure and people have been doing it since forever. (A recently found hieroglyph describes how people used to do it in ancient Egypt.)
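If you prefer the command line over the System Properties dialog, you can set it from a command prompt; the JDK path below is just an assumption, so substitute your actual install directory:

    setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_144"

Note that ‘setx’ saves the variable permanently, but only command prompts opened afterwards will see the new value.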

2. Download the Zeppelin notebook from here and extract it (using 7-Zip or similar) to your local machine. Avoid ‘C:\Program Files’ as the directory if you don’t want to mess with admin privileges. I simply chose plain C:\, but ‘C:\users\<username>’ is also fine.

3. Download Hadoop 2.7.x from here. Download the binary, not the source. Extract it to your machine; again, avoid ‘Program Files’.

4. Download ‘winutils.exe’ from here. Paste it in ‘<your-hadoop-directory>\bin’.

5. Set the environment variable ‘HADOOP_HOME’ and point it to your Hadoop directory. Your final environment variable settings should look like this.

Environment variables settings
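As with JAVA_HOME, you can set this from a command prompt instead of the GUI; the Hadoop folder below is an assumption based on where you extracted it:

    setx HADOOP_HOME "C:\hadoop-2.7.3"

In a fresh prompt, running ‘%HADOOP_HOME%\bin\winutils.exe’ with no arguments should print a usage message; if Windows complains that it can’t find the file, step 4 went wrong.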

6. Go to ‘<your-zeppelin-directory>\bin’ and start Zeppelin by launching ‘zeppelin.cmd’.
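For example, from a command prompt (the folder name is an assumption and depends on the Zeppelin version you downloaded):

    cd C:\zeppelin-0.7.3-bin-all\bin
    zeppelin.cmd

Give it a minute to boot, then open http://localhost:8080 in your browser; 8080 is Zeppelin’s default port.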

7. a. Once inside Zeppelin, click on the menu in the top right corner and select ‘Interpreter’, as shown in the image.

Select Interpreter

b. Scroll down to the Spark section and click on edit (the pencil icon). Set the last variable, ‘zeppelin.spark.useHiveContext’, to ‘false’.

Select edit

Set useHiveContext to false

c. Click on the save button.

8. Now create a new notebook (or use the Spark tutorial notebook). Save the interpreter settings, keeping Spark as the default. Run the code ‘sc.version’. Voilà, if you have not sinned too much in the past, you will see the following output.

It should show your Spark Version
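In a notebook paragraph it looks something like this (the exact version string is an assumption; yours will match the Spark build bundled with your Zeppelin):

    sc.version
    // res0: String = 1.6.2

Here ‘sc’ is the SparkContext that Zeppelin creates for you, so getting a version string back instead of a NullPointerException means the interpreter is wired up correctly.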

Denial and Ostrich Algorithm

A few days ago, my friend and I were discussing denial. What is denial? Denial, if you don’t know, is completely avoiding a situation or ignoring the inevitable, believing that it is not there. Now, what do you think about denial? Obviously, you will think that it is not a trait worth cultivating. But what if I told you that in many situations utter and unreasonable denial is by far the best strategy: constantly ignoring the inevitable as if it will never happen. Surprised? Well, you should be, as from the very beginning we are told to tackle problems upfront and never to avoid them, and that is the right decision in most cases, but unfortunately not all.

Can you think of one case where denial is the best strategy? In fact, the best one is essential for our functioning and embedded so well in our psyche that it is hard to identify. Any guess? Well, it is the denial of death. Without denial, every man (and woman) on earth would be thinking about death, which is inevitable, and what to do about it. The reason we don’t, and are able to engage in everyday activities, is that we are constantly ignoring that fact. We know we are going to die, but most of the time, in our subconscious, we like to believe that we will live forever. This denial lets us function, discover, invent and succeed.

Many of you may find the example of death too extreme, as death is inevitable and it does not seem that we can ever solve it (don’t confuse the denial of death here with the desire for a longer life; to the fear of death it doesn’t matter whether you die in a year, a hundred years, a thousand years or at the doom of mankind). But what about problems which are very much evitable and can be avoided with careful planning? Is it ever better to ignore them? As a matter of fact, it is in some cases, and that brings us to the Ostrich Algorithm.

Ostrich burying its head in the sand

This is a good strategy. Seriously!

The Ostrich Algorithm is more of an approach than an actual algorithm, as it suggests doing nothing. It is the most popular approach for tackling deadlock* in many computer systems. The name is derived from the ostrich, which is believed (wrongly, however) to bury its head in the sand when a threat approaches. (Actually, the poor bird is checking on its eggs, which it lays under the ground, when it ‘buries’ its head as you might have seen in many pictures.) In the ostrich approach to deadlock handling, the computer assumes that deadlock will never occur, although it may very well occur and can crash the system. However, in most computer systems the chance of such a deadlock is negligible, while constantly checking for deadlocks and preventing them can slow the system down. People don’t care if their computer crashes once a year and they have to press the reset button; for most of us, the system crashes multiple times a year, and mostly not due to deadlock. However, the same cannot be said for a ‘Nuclear Reactor Controller’: there, reliability takes precedence over performance, and ingenious deadlock prevention techniques must be used.

After all this discussion, if you still don’t want to believe that in many cases denial is good, then you are in a state of denial; but if that denial encourages you to tackle every problem upfront and keeps you enthusiastic, remain in it, for that denial is good.

*In computer science, deadlock refers to a situation where two processes are each waiting for a resource held by the other. E.g. process A is holding resource x and waiting for resource y, while process B is holding y and waiting for x. Both processes are stuck waiting forever. The same can happen with more than two processes.
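For the curious, here is a minimal Scala (2.12+) sketch of exactly that footnote; the sleeps just make the fatal interleaving likely, they are not part of the problem:

    object DeadlockDemo extends App {
      val x = new Object // resource x
      val y = new Object // resource y

      val a = new Thread(() => {
        x.synchronized {        // A holds x...
          Thread.sleep(100)
          y.synchronized { println("A finished") } // ...and waits for y
        }
      })
      val b = new Thread(() => {
        y.synchronized {        // B holds y...
          Thread.sleep(100)
          x.synchronized { println("B finished") } // ...and waits for x
        }
      })
      a.start(); b.start()
      // Neither thread ever finishes. The ostrich approach: don't detect
      // this at all; if it ever happens, the user presses reset.
    }

The JVM itself is an ostrich here: it will happily let both threads wait forever rather than pay the cost of checking every lock acquisition for cycles.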

OverFlow or Flaw?

Have you ever run out of hard disk space? No? Then you are lying. Every single human being (of our age) has run out of hard disk space. And when it happens, either you delete some unnecessary things (you often find a lot of them) to free up space or, if you’ve got the bucks, buy a new hard disk. But what happens when carefully designed protocols and standards run out of space, or, we could say, reach their limit? Billions are invested in them and millions of people are using them, so you cannot replace them overnight (and in some cases ever). Let’s look at four such cases.

1. FAT32

Have you ever had that annoying error when you try to copy a large file to your USB stick or memory card? It says ‘The file ‘xxxxxx’ is too large for the destination file system’. And you are like, ‘What the heck? The pen drive is 16 gigs in size and the file is just like 4.5 or 5 GB.’ And what you do next is, as it has always been, get super annoyed and ask an invisible computer overlord, as Joey* did in ‘Friends’, “Why? Computer, why? Why are you doing this to me?” And yes, I agree that it is the right thing to do. Now, what if I told you that you had to go through this mental anguish because the top minds at Microsoft couldn’t see ten years into the future? Okay, no one can see the future, but when one is designing a file system, one must estimate and accommodate future needs. The file system most USB sticks use is called FAT32 (File Allocation Table, 32-bit) and it was introduced by Microsoft in 1996 for Windows 95. With a 32-bit size field, the largest file it can accommodate is 4 GB, and the largest partition size is 2 TB. Maybe 4 GB was a large file in 1996, but only 9 years later a game called GTA San Andreas was larger than 4 GB. So, their file system couldn’t accommodate something that would be in common use just 10 years later. Although this file system has been replaced by NTFS on most Windows-based computers, it is still widely used on memory sticks and memory cards. You can format your USB drive with NTFS, but then there can be compatibility issues with other devices, especially smartphone OTG connections, as I have noticed. There is also exFAT, which solves this problem and will mount on smartphones, but it won’t work with Windows XP if it is not updated. Therefore, if you own a 16 GB memory stick which you want to store large files and work with your phone, you format it with exFAT, and then at college you plug it into a Windows XP machine which has not been updated since, like, the beginning of time and boom… you are screwed.
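The 4 GB ceiling falls straight out of the arithmetic; here is the quick check in Scala:

    // FAT32 stores each file's size in an unsigned 32-bit field,
    // so the largest representable size is 2^32 - 1 bytes.
    val maxFileSize = (1L << 32) - 1                      // 4294967295 bytes
    println(f"${maxFileSize / math.pow(2, 30)}%.2f GiB")  // ~4.00 GiB

A 4.5 GB movie simply has no representable size on a FAT32 volume, no matter how much free space the 16 GB stick has.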

2. IPv4

You haven’t seen this problem on your computer, and if you are not a techie you probably haven’t heard about it. But this problem is much bigger than the one previously described. As you may have guessed, it is related to the Internet. IPv4 stands for Internet Protocol version 4, and an IP address is the address assigned to every device and website connected to the Internet. Routers use this address to guide your data to the correct server, and in the same way guide the data the server sends back to your device. It looks like a fairly efficient system, so what is the problem then? The problem is that we have run out of IPv4 addresses. As ‘Computerphile’ puts it, ‘the Internet is full’; there is no room for anything new. You will think, “Well, I see a new site popping up every day.” This is done with many little tricks, such as recycling IPs and using NAT. Actually, your device’s IP is not a dedicated IP; a new IP is assigned to you by your ISP every time you connect to the Internet. A static IP is a luxury these days. All these problems are occurring because IPv4 uses a 32-bit address, which can accommodate about 4 billion addresses. And with millions of websites, lots of online services, defense departments running their servers and a smartphone in every hand, these addresses have run out. Who is to blame for this serious shortage of IP addresses, which has provided a topic of debate to techies for years? (The IPv4 vs. IPv6 debate is a decade old and still hot.) IPv4 was designed in 1981 and first deployed on ARPANET (the predecessor of the Internet) in 1983. Maybe 4 billion was a large number then, with just thousands of computers attached at the time. But that the protocol reached its limit after 30 years is, as I see it, a failure on the part of the protocol’s design and deployment. In fact, the world’s population was about 4.5 billion at that time and was growing rapidly. The protocol designers certainly didn’t have a vision of a computer in every hand, or any idea that it would last this long, but still, a small mistake at that time is giving a hard time to many organizations and people. The solution to this problem is IPv6, which uses a 128-bit address. That means 2^128 unique addresses, which is gazillions of addresses (3.4 × 10^38 to be exact; enough even if each person is assigned a billion billion of them) and unlikely to ever run out. The problem is that there is so much infrastructure that works on IPv4, and so much investment made in it, that IPv4 is unlikely to be phased out any time soon (or ever, as some experts believe). The future of the Internet is as a hybrid network.
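The numbers are easy to verify; the 8 billion world population in the last line is my assumption for the per-person figure:

    // 32-bit vs. 128-bit address space
    val ipv4 = BigInt(2).pow(32)           // 4294967296, about 4.3 billion
    val ipv6 = BigInt(2).pow(128)          // about 3.4 * 10^38
    println(ipv6 / BigInt("8000000000"))   // ~4.3 * 10^28 addresses per person

Four billion sounded infinite in 1981; one BigInt call shows how finite it really is.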

3. Unix Time

One can defend the FAT32 and IPv4 situations by saying that it is very hard to predict how the computing world will grow and what will come next, and it was even more difficult in those years when the computing world was growing exponentially in its capabilities (Moore’s Law). But that excuse will not work here, as this situation did not involve any more future prediction than ‘I will probably still be alive in 2038’, or ‘many people alive today will still be alive in 2038’, or ‘many computers (servers, mostly) we are using today will still be in use in 2038’. Have you ever wondered how, even when you completely switch off your computer and turn it on again, it still shows the right time, even if it is not connected to the Internet? This is managed by a counter powered by a tiny electric cell on your motherboard. This counter is incremented by one every second, and it has been ticking that way since 1970 (which is 0; negative numbers can be used to denote dates as far back as 1901). The problem is that the 32-bit (again 32? What’s wrong with 32 bits?) data structure Unix operating systems use to store this time runs out in 2038; when that happens it will wrap around to a negative number, which systems will read as a date back in 1901, and it will be chaos. ‘But what is Unix? I have not used it. I have not seen it. I have never heard of it. Why are we worrying about such an operating system?’ many of you will ask. The bad news is that Unix is the mother of all modern operating systems, spawning a whole class called ‘Unix-like OS’. So not only Unix, but Windows (not Unix-like, but it still uses the 32-bit timestamp), Mac OS X (which is basically Unix), Linux (the free alternative to Unix), Android and many other operating systems use this 32-bit timestamp. So we are talking about most of the computers in the world. Second, 2038 seems far away; why are we worrying about it now? We will figure something out by then, right? Wrong. This timestamp is ubiquitous and used extensively in embedded systems (which cannot be reprogrammed). Furthermore, it is used so widely that changing it to 64 bits would create massive code incompatibility. Lastly, not all programs can wait until 2038: programs that use future dates need it fixed earlier. A program which uses dates 20 years in the future needs it fixed by 2018. The good thing is that experts are worried and already working on it, but as some of them say, there will still be some systems using 32-bit time even in 2038. The solution to this problem is a 64-bit timestamp, which will cover us through the year 292,277,026,596, far longer than the lifetime of the Earth. Think: if they had used a 64-bit timestamp (or a 48-bit one, though that would be odd) from the beginning, there would be no such problem. But at that time, with 8- and 16-bit computing, 32 bits seemed good enough and 64 was something too large.
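You can watch the wraparound with a few lines of Scala; the only assumption is treating the counter as a signed 32-bit integer, which is what the classic Unix time_t is:

    import java.time.Instant

    // The largest second-count a signed 32-bit integer can hold:
    println(Instant.ofEpochSecond(Int.MaxValue))             // 2038-01-19T03:14:07Z
    // One tick later, a 32-bit counter wraps to its most negative value:
    println(Instant.ofEpochSecond(Int.MinValue))             // 1901-12-13T20:45:52Z
    // A 64-bit counter takes the same moment in stride:
    println(Instant.ofEpochSecond(Int.MaxValue.toLong + 1))  // 2038-01-19T03:14:08Z

Those two dates are exactly the 1901 and 2038 limits mentioned above: a signed 32-bit counter covers about 68 years on either side of 1970.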

4. Y2K Problem

This problem is even more ridiculous than the previous one. However, it has already passed, and it didn’t cause much trouble, but it is worth remembering. Many computers from the 1980s used only two digits to store the year part of the date. This meant that they would reach the limit ‘99’ in the year 1999 and consider 2000 as ‘00’. You may think, what’s the big deal? I always use only two digits for dates and never confuse 2015 with 1915 or 1815 or 0015. And I wouldn’t be bothered if the computer showed me 00 as the year instead of 2000. But unfortunately, a computer can’t make that distinction; it will surely confuse 2000 with 1900. Actually, when computing the elapsed time between dates, a computer subtracts the dates in a predefined way. If a bank computer is programmed to subtract the date of deposit from the current date to calculate interest, then with the year set to 00 it will subtract 99 (the year of deposit) from 00 (what it thinks is the current year). That is a negative number. As per the formula for interest, you would be paying negative interest, and that too for 99 years! The next thing you know, you are bankrupt. It is ironic that the best minds in the world couldn’t see what was inevitable within a decade. What would they have said to the CEO of a multibillion-dollar bank? “Mr. CEO, this is a machine we are selling you for tens of thousands of dollars. It is reliable and will manage accounts worth billions of dollars. And yes, one last thing: it will fail after 15 years, tearing your entire system apart.”
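Here is the bank scenario as a sketch; the principal and rate are made-up numbers:

    val principal   = 1000.0
    val rate        = 0.05   // 5% simple interest per year
    val depositYear = 99     // 1999, stored as two digits
    val currentYear = 0      // 2000, stored as two digits

    val yearsElapsed = currentYear - depositYear   // -99 instead of 1
    println(principal * rate * yearsElapsed)       // -4950.0, negative interest

One off-by-a-century bug and the account ‘earns’ interest for minus 99 years.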

 

*Joey is a beloved character from the all-time favorite TV series ‘Friends’.