2/5 (Tuesday) - Decide on project
Met with Prof Jabeen and team members (Sahir, Rohan). Outlined current assignment: set up Hadoop node on a new VM as a precursor to converting an unused computer into a storage node in tier 3 of the WLCG.
2/6 (Wednesday) - Begin VM setup
Met with Sahir and Rohan. Started on this tutorial for downloading Hadoop. Decided to use Scientific Linux 7. Downloaded disk (took 2hrs) and created new VM.
2/7 (Thursday) - Failed VM setup; researched Hadoop.
Failed VM setup: Installed Scientific Linux 7 (SL7), using the "Everything" version of the install disk (the ISO comes in several variants). To do this, set up a Red Hat-type VM, then used the SL7 disk to boot it. Except it would re-run the installer every time instead of actually booting into the installed system.
Notes on video explaining Hadoop:
Node - separate machines within the cluster are referred to as "nodes"; all are used to store parts of the same data.
Storage speeds become significantly faster because each machine (= node) is storing a different part of the data at the same time (instead of one machine storing all the parts one after another).
Data processing speeds become faster for the same reason
3 copies of each part of the data are created on 3 different nodes, both for fault tolerance and to improve access speed.
2 copies in the same rack, 1 copy in a different rack
Rack - a group of nodes that are physically close together and connected through the same network switch
Can more easily communicate bc there's higher bandwidth inside a rack than between racks
Usually ~30-40 nodes
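For later, once a cluster is actually running: the replication behavior described above can be inspected with the stock HDFS tools. These are standard Hadoop commands; the /user/test.txt path is just a made-up example.

```shell
# Show how one file's blocks are replicated across nodes and racks
hdfs fsck /user/test.txt -files -blocks -racks

# Change the replication factor for a file (default is 3); -w waits
# until the blocks actually reach the new replication level
hdfs dfs -setrep -w 2 /user/test.txt

# Cluster-wide view: live DataNodes, capacity, block counts
hdfs dfsadmin -report
```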
2/9 (Sat) - Successful VM install; Started enabling YUM repositories
Started YUM repositories: Finished 2 of 3 steps. Spent a long time on errors that were super easy to fix (a 1 vs. an l in a URL, for example). YUM is the package manager that pulls software from whatever repos (see vocab) are configured; each repo is just one source of packages yum can install from. In order to use them, you must first enable the OS, EPEL, and OSG repos. All of this setup is explained here.
More vocab:
Repository (repo) - somewhere software files are stored and are available for download
RPM - RPM Package Manager (originally Red Hat Package Manager); the package format used to distribute software on Red Hat-type systems
yum - command for installing RPM packages (can use rpm directly too, but it's more complicated because rpm doesn't resolve dependencies)
yum install - installs packages
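The rpm-vs-yum distinction above, in practice (package names here are placeholders):

```shell
# List the repos yum is currently allowed to pull from
yum repolist enabled

# Install straight from an .rpm file -- fails if dependencies are missing
rpm -ivh some-package-1.0.rpm

# Install by name from the enabled repos -- dependencies handled automatically
yum install some-package

# See which repo a package would come from before installing it
yum info some-package
```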
2/11 (Mon) - Finished YUM setup; started Hadoop tutorial
Finished YUM setup: Completed last step in tutorial linked above. I have to power off my machine every time I quit (otherwise, my mouse disappears), which means I have to re-login as root and manually re-connect to wifi each time (these commands are below under "other vocab").
More Hadoop background:
HDFS - Hadoop Distributed File System; this is the storage layer of Hadoop (not a different name for Hadoop itself). It is separate from OSG (it was developed by the Apache Software Foundation). There is an OSG-specific distribution, but Hadoop is not primarily an OSG product.
- Installing HDFS will make new users: hdfs, zookeeper for all; hadoop, mapred for NameNodes
NameNode - the directory server; houses cluster metadata
Secondary NameNode (optional) - improves restart times. "Periodically merges updates to the HDFS file system back into the fsimage" = it folds the edit log (recent filesystem changes) into the fsimage (the on-disk snapshot of the namespace), so the NameNode doesn't have to replay a huge edit log at startup.
DataNode - the rest of the cluster machines, all of which store data.
Client - refers to machine that can run Hadoop client commands
- Hadoop installation includes Hadoop, supporting files, and RPMs for simplification of installation process
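From the notes above, each machine's role maps to its own OSG meta-package. The datanode package name is the one that shows up later in this log; the namenode and client names are my assumption from the same naming pattern.

```shell
# On the machine acting as the NameNode (directory server)
yum install osg-se-hadoop-namenode

# On each DataNode (the machines that actually store blocks)
yum install osg-se-hadoop-datanode

# On any machine that only needs to run Hadoop client commands
yum install osg-se-hadoop-client
```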
Other vocab:
nmcli c - lists network connections
nmcli d connect/disconnect <deviceName> - connects/disconnects the given network device
su - - lets you log in as root
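The re-login/reconnect dance from 2/11, strung together as one sequence (the wifi device name here is a placeholder; nmcli c shows the real one):

```shell
# After every forced power-off: become root, then bring wifi back up
su -                       # log in as root (prompts for the root password)
nmcli c                    # list known connections and their devices
nmcli d connect wlp2s0     # reconnect the wifi device (placeholder name)
```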
2/12 (Tues) - Continued Hadoop tutorial
Notes on tutorial: Downloaded the DataNode package. Was having issues because I forgot to connect to wifi, but connecting fixed it. Still, got a weird warning from the yum install osg-se-hadoop-datanode command:
I hope that's ok...
Meeting w Prof Jabeen - Dell R540 "Power edge". Has many 10TB disks.
- Pay attention: RAID (Redundant Array of Independent Disks) - levels sometimes up to RAID 6. But not foolproof--apparently one machine just got lost bc both of its disks failed.
- Will need to install CentOS on the Dell, which is a free rebuild of Red Hat Enterprise Linux
- Want an older version of Hadoop for when we're actually doing it, but right now it doesn't matter
- Not using RAID on this machine bc Hadoop already keeps multiple copies of all data (RAID levels above 0 provide redundant copies; RAID 0 doesn't)
What we're doing:
1) Set up our machines (one namenode, two datanodes)
2) Research how to set everything up on Dell
3) Figure out how to make the network work w/ Centos7!!!!! (Prof says this is hardest part)
2/15 (Fri) - met with team for NETWORKING!
Static IPs and bridge connection: Although we worked for 1.5hrs, we never got our computers connected. We did figure out how to access our machines' IPs and change them manually to static in settings, which is necessary to consistently ssh between machines using one IP.

I'm a bit confused because all of our IPs and networks were the same when we used route -n, but this might be because we were in the VM and not our actual computers, and the VM's IP is based off the host machine's connections because the VM doesn't directly connect to wifi.

For next time, we need to set up a bridge connection between our VM and host machine and establish a static IP for the VM (I don't think that changing the IP through settings actually did what we need, since the new values don't show up in any terminal commands). This is all to prepare the VMs to communicate with each other in the Hadoop framework.

Today we tried to change from NAT (the default) to "bridged" in the VirtualBox settings, but this made the internet not work at all, and we couldn't connect to either of the available networks (though the list of available networks didn't change from when we had the NAT setting).
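For next time: the settings change probably didn't stick because NetworkManager keeps per-connection config, so setting it from the command line may work better. A sketch, assuming the connection is named enp0s3 and the 192.168.1.x addresses stand in for our actual network:

```shell
# Give the VM a fixed address instead of DHCP
nmcli c modify enp0s3 ipv4.method manual \
    ipv4.addresses 192.168.1.50/24 \
    ipv4.gateway 192.168.1.1 \
    ipv4.dns 8.8.8.8

# Bounce the connection so the new settings take effect
nmcli c down enp0s3 && nmcli c up enp0s3

# Verify -- the static address should now show up here
ip addr show
route -n
```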
Notes on IPs: An IPv4 address is usually written in the form x.x.x.x, where each x is a number from 0 to 255. The address's mask tells you which part is the network address and which part is the host address. Addresses and masks are both 32-bit numbers, and for every bit of the mask that is a 1, the corresponding address bit is part of the network address; the 0 bits of the mask mark the host address.
Network address - identifies the network the machine is connected to; usually the first 3/4 of the IP address
Host address - identifies the machine within the network. Usually the last quarter of the IP address, but this changes based on the mask.
Mask - says which part of IP address is network and which is host. 255.255.255.0 means the first 3/4 of the address are network and the last 1/4 is host.
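The mask arithmetic above, worked through in bash (192.168.1.37 and 255.255.255.0 are made-up example values):

```shell
# Worked example: split an IP into network and host parts using the mask
ip="192.168.1.37"
mask="255.255.255.0"

IFS=. read -r i1 i2 i3 i4 <<< "$ip"
IFS=. read -r m1 m2 m3 m4 <<< "$mask"

# AND each octet with the mask octet -> the network address
network="$((i1 & m1)).$((i2 & m2)).$((i3 & m3)).$((i4 & m4))"
# AND with the inverted mask (255 - m) -> the host part
host="$((i1 & 255 - m1)).$((i2 & 255 - m2)).$((i3 & 255 - m3)).$((i4 & 255 - m4))"

echo "network: $network"   # 192.168.1.0
echo "host: $host"         # 0.0.0.37
```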
Other commands:
route -n - shows IP info for networks
***I'm not sure what destination and gateway are here.*** INVESTIGATE!!!
Iface in the image is the name of the connection. The first two are the virtual ethernet ones (connected from the host machine); the last one is the "bridge connection," which doesn't work...
Flags - U means that the network is up; G means that it's a gateway network (i.e. this is the "head" port that sends info out to all the machines in the network)
2/21 (Thurs) - networking background research
Port forwarding - makes "hole" in firewall to allow access to a specific device on a private network from outside
General IP address setup:
Internet -> modem -> router -> individual device
Indiv. devices have internal IPs based off the router's IP
Modem has its own, unrelated IP to connect to internet
Firewall blocks others from accessing modem's IP and using it to send uninvited information
With port forwarding:
Allow port forwarding on router
"Point" an existing port number to desired device (the device you want to access from outside the network)
-> outside traffic is "forwarded" to this port number
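What "pointing a port at a device" looks like on a Linux box acting as the router (the addresses and ports are placeholder examples, and this assumes you have root and that forwarding is allowed):

```shell
# Let the kernel pass packets between interfaces at all
sysctl -w net.ipv4.ip_forward=1

# Forward incoming TCP port 2222 on the router to port 22 (ssh)
# on one specific device inside the private network
iptables -t nat -A PREROUTING -p tcp --dport 2222 \
    -j DNAT --to-destination 192.168.1.50:22

# From outside, "ssh -p 2222 user@<router's public IP>" now lands
# on the internal machine 192.168.1.50
```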
Bridged network - the VM shares the host's physical network adapter and gets its own IP on the same network as the host, so other machines can reach it directly (unlike NAT, where the VM hides behind the host's IP)