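Reposted from: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
This is the guide recommended by the official Hadoop project, and it is very detailed.
Thanks to the author for the hard work.
----- Main text begins -----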
In this tutorial I will describe the required steps for setting up a _pseudo-distributed, single-node_ Hadoop cluster backed by the Hadoop Distributed File System, running on Ubuntu Linux.
Are you looking for the [multi-node cluster tutorial](http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/)? Just [head over there](http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/).
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the [Google File System (GFS)](http://en.wikipedia.org/wiki/Google_File_System) and of the [MapReduce](http://en.wikipedia.org/wiki/MapReduce) computing paradigm. Hadoop’s [HDFS](http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html) is a highly fault-tolerant distributed file system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.
The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with the software and learn more about it.
This tutorial has been tested with the following software versions:
* [Ubuntu Linux](http://www.ubuntu.com/) 10.04 LTS (deprecated: 8.10, 8.04 LTS, 7.10, 7.04)
* [Hadoop](http://hadoop.apache.org/) 1.0.3, released May 2012
![](http://www.michael-noll.com/blog/uploads/Yahoo-hadoop-cluster_OSCON_2007.jpeg "Cluster of machines running Hadoop at Yahoo!")
Figure 1: Cluster of machines running Hadoop at Yahoo! (Source: Yahoo!)
# Prerequisites
## Sun Java 6
Hadoop requires a working Java 1.5+ (aka Java 5) installation. However, using Java 1.6 (aka Java 6) is recommended for running Hadoop. For the sake of this tutorial, I will therefore describe the installation of Java 1.6.
Important Note: The apt instructions below are taken from [this SuperUser.com thread](http://superuser.com/questions/353983/how-do-i-install-the-sun-java-sdk-in-ubuntu-11-10-oneric). I got notified that the previous instructions that I provided no longer work. Please be aware that adding a third-party repository to your Ubuntu configuration is considered a security risk. If you do not want to proceed with the apt instructions below, feel free to install Sun JDK 6 via alternative means (e.g. by [downloading the binary package from Oracle](http://www.oracle.com/technetwork/java/javase/downloads/)) and then continue with the next section in the tutorial.
The full JDK will be placed in `/usr/lib/jvm/java-6-sun` (this directory is actually a symlink on Ubuntu).
After installation, make a quick check whether Sun’s JDK is correctly set up:
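The check itself is a single `java -version` call. The following minimal sketch (the `java_report` variable is only there for illustration) also covers the case where no `java` binary is on the `PATH`. With Sun's JDK 6 active, the reported version should mention `1.6.0`.

```shell
# java -version prints to stderr, so redirect it to capture the text.
if command -v java >/dev/null 2>&1; then
  java_report="$(java -version 2>&1)"
else
  java_report="java not found on PATH"
fi
echo "$java_report"
```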
## Adding a dedicated Hadoop system user
We will use a dedicated Hadoop user account for running Hadoop. While that’s not required it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).
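A minimal sketch of the account setup, assuming Ubuntu's `addgroup` and `adduser` tools. The `run` helper is not part of the tutorial: it is an illustrative addition that only prints each command unless `APPLY=1` is set, so the snippet can be tried without touching the system.

```shell
# Dry-run helper (illustrative): print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

run sudo addgroup hadoop                   # create the group first
run sudo adduser --ingroup hadoop hduser   # then the user, with hadoop as its primary group
```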
This will add the user `hduser` and the group `hadoop` to your local machine.
## Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to `localhost` for the `hduser` user we created in the previous section.
I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are [several online guides](http://ubuntuguide.org/) available.
First, we have to generate an SSH key for the `hduser` user.
The key is generated as an RSA key pair with an empty passphrase. Using an empty passphrase is generally not recommended, but in this case it is needed so that the key can be unlocked without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).
Second, you have to enable SSH access to your local machine with this newly created key.
The final step is to test the SSH setup by connecting to your local machine with the `hduser` user. The step is also needed to save your local machine’s host key fingerprint to the `hduser` user’s `known_hosts` file. If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in `$HOME/.ssh/config` (see `man ssh_config` for more information).
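The three steps above (generate a key, authorize it, test the login) can be sketched as one helper. The function name `setup_localhost_ssh` and its optional directory argument are illustrative assumptions; `-N ""` asks `ssh-keygen` for an empty passphrase.

```shell
# Create a passphrase-less RSA key (if none exists) and authorize it for logins.
# Optional argument: the SSH directory to use (defaults to $HOME/.ssh).
setup_localhost_ssh() {
  ssh_dir="${1:-$HOME/.ssh}"
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1

  # -N "" sets an empty passphrase; -f avoids the interactive file name prompt.
  if [ ! -f "$ssh_dir/id_rsa" ]; then
    ssh-keygen -q -t rsa -N "" -f "$ssh_dir/id_rsa" || return 1
  fi

  # Append the public key only once so that repeated runs stay idempotent.
  touch "$ssh_dir/authorized_keys"
  grep -qxF "$(cat "$ssh_dir/id_rsa.pub")" "$ssh_dir/authorized_keys" ||
    cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
  chmod 600 "$ssh_dir/authorized_keys"
}
```

Run as `hduser`, a typical call would be `setup_localhost_ssh && ssh localhost`. The first connection asks you to confirm the host key fingerprint, which is what stores it in `known_hosts`.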
If the SSH connection fails, these general tips might help:
* Enable debugging with `ssh -vvv localhost` and investigate the error in detail.
* Check the SSH server configuration in `/etc/ssh/sshd_config`, in particular the options `PubkeyAuthentication` (which should be set to `yes`) and `AllowUsers` (if this option is active, add the `hduser` user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with `sudo /etc/init.d/ssh reload`.
## Disabling IPv6
One problem with IPv6 on Ubuntu is that using `0.0.0.0` for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box. In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.
To disable IPv6 on Ubuntu 10.04 LTS, open `/etc/sysctl.conf` in the editor of your choice and add the following lines to the end of the file:
You have to reboot your machine in order to make the changes take effect.
You can check whether IPv6 is enabled on your machine with the following command:
A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).
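For reference, the `sysctl` settings and the check described above might look as follows. This is a sketch: the scratch file path is an assumption, and the snippet only stages the lines for review instead of editing `/etc/sysctl.conf` directly.

```shell
# Stage the settings in a scratch file (path is illustrative).
SNIPPET="${TMPDIR:-/tmp}/disable-ipv6.conf"
cat > "$SNIPPET" <<'EOF'
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
EOF
# After review, append as root, e.g.:
#   sudo sh -c "cat '$SNIPPET' >> /etc/sysctl.conf"

# After the reboot: 0 means IPv6 is enabled, 1 means disabled.
FLAG=/proc/sys/net/ipv6/conf/all/disable_ipv6
if [ -r "$FLAG" ]; then
  cat "$FLAG"
else
  echo "no IPv6 sysctl entry found at $FLAG"
fi
```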
### Alternative
You can also disable IPv6 only for Hadoop as documented in [HADOOP-3437](https://issues.apache.org/jira/browse/HADOOP-3437). You can do so by adding the following line to `conf/hadoop-env.sh`:
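A sketch of such a line, assuming the standard JVM system property `java.net.preferIPv4Stack`:

```shell
# Make the Hadoop JVMs prefer the IPv4 stack, leaving the system-wide IPv6 setup untouched.
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
```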
# Hadoop
## Installation
[Download Hadoop](http://www.apache.org/dyn/closer.cgi/hadoop/core) from the [Apache Download Mirrors](http://www.apache.org/dyn/closer.cgi/hadoop/core) and extract the contents of the Hadoop package to a location of your choice. I picked `/usr/local/hadoop`. Make sure to change the owner of all the files to the `hduser` user and `hadoop` group, for example:
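A sketch of these steps, assuming the downloaded archive is named `hadoop-1.0.3.tar.gz` and has been placed in `/usr/local`. The `run` helper is an illustrative addition that prints each command unless `APPLY=1` is set.

```shell
# Dry-run helper (illustrative): print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Unpack the release next to the archive.
run sudo tar xzf /usr/local/hadoop-1.0.3.tar.gz -C /usr/local
# Give the installation a version-independent name (a symlink works as well).
run sudo mv /usr/local/hadoop-1.0.3 /usr/local/hadoop
# Hand all files over to the dedicated user and group.
run sudo chown -R hduser:hadoop /usr/local/hadoop
```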
(Just to give you the idea, YMMV – personally, I create a symlink from `hadoop-1.0.3` to `hadoop`.)
## Update $HOME/.bashrc
Add the following lines to the end of the `$HOME/.bashrc` file of user `hduser`. If you use a shell other than bash, you should of course update its appropriate configuration files instead of `.bashrc`.
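A sketch of the kind of lines meant here, assuming the installation paths used in this tutorial. The two aliases are optional conveniences, and their names are illustrative.

```shell
# Set Hadoop-related environment variables.
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (it is configured again for Hadoop itself in conf/hadoop-env.sh).
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Optional shortcuts for frequently used HDFS commands.
alias fs="hadoop fs"
alias hls="hadoop fs -ls"

# Add the Hadoop bin/ directory to PATH.
export PATH="$PATH:$HADOOP_HOME/bin"
```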
You can repeat this exercise also for other users who want to use Hadoop.
## Excursus: Hadoop Distributed File System (HDFS)
Before we continue let us briefly learn a bit more about Hadoop’s distributed file system.
> The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which is part of the Apache Lucene project.
>
> **The Hadoop Distributed File System: Architecture and Design**[hadoop.apache.org/hdfs/docs/…](http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html)
The following picture gives an overview of the most important HDFS components.
![](http://www.michael-noll.com/blog/uploads/HDFS-Architecture.gif "HDFS Architecture (source: Hadoop documentation)")
## Configuration
Our goal in this tutorial is a single-node setup of Hadoop. More information about what we do in this section is available on the [Hadoop Wiki](http://wiki.apache.org/hadoop/GettingStartedWithHadoop).
### hadoop-env.sh
The only required environment variable we have to configure for Hadoop in this tutorial is `JAVA_HOME`. Open `conf/hadoop-env.sh` in the editor of your choice (if you used the installation path in this tutorial, the full path is `/usr/local/hadoop/conf/hadoop-env.sh`) and set the `JAVA_HOME` environment variable to the Sun JDK/JRE 6 directory.
Change the `JAVA_HOME` line in `conf/hadoop-env.sh` so that it points to the JDK directory, which is `/usr/lib/jvm/java-6-sun` in this tutorial.
Note: If you are on a Mac with OS X 10.7 you can use the following line to set up `JAVA_HOME` in `conf/hadoop-env.sh`.
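As a sketch, the resulting setting in `conf/hadoop-env.sh` would be the following, assuming the JDK path from the Java section. The commented-out OS X alternative relies on Apple's `/usr/libexec/java_home` helper and is an assumption rather than the author's exact line.

```shell
# The Java implementation Hadoop should use (Ubuntu path from this tutorial).
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# On OS X, one option is to ask the system for the active JDK instead:
# export JAVA_HOME="$(/usr/libexec/java_home)"
```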
### conf/*-site.xml
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, [HDFS](http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html), even though our little “cluster” only contains our single local machine.
You can leave the settings below “as is” with the exception of the `hadoop.tmp.dir` parameter – this parameter you must change to a directory of your choice. We will use the directory `/app/hadoop/tmp` in this tutorial. Hadoop’s default configurations use `hadoop.tmp.dir` as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.
Now we create the directory and set the required ownerships and permissions:
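A sketch of that step. The mode `750` is an assumption (full access for `hduser`, read access for the `hadoop` group), and the `run` helper is an illustrative addition that prints each command unless `APPLY=1` is set.

```shell
# Dry-run helper (illustrative): print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

run sudo mkdir -p /app/hadoop/tmp
run sudo chown hduser:hadoop /app/hadoop/tmp
run sudo chmod 750 /app/hadoop/tmp   # owner rwx, group r-x, others none
```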
If you forget to set the required ownerships and permissions, you will see a `java.io.IOException` when you try to format the name node in the next section.
Add the following snippets between the `<configuration> ... </configuration>` tags in the respective configuration XML file.
In file `conf/core-site.xml`:
In file `conf/mapred-site.xml`:
In file `conf/hdfs-site.xml`:
See [Getting Started with Hadoop](http://wiki.apache.org/hadoop/GettingStartedWithHadoop) and the documentation in [Hadoop’s API Overview](http://hadoop.apache.org/core/docs/current/api/overview-summary.html) if you have any questions about Hadoop’s configuration options.
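As a rough sketch of what the three site files in this section could contain, the script below generates them in a scratch directory. Only `hadoop.tmp.dir` and its value come from the text above; the port numbers `54310` and `54311`, the replication factor of `1`, and the `write_site` helper are assumptions made for illustration. Review the generated `<property>` elements before copying them into the files under `conf/`.

```shell
# Directory for the generated files; defaults to a fresh scratch directory.
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"

# write_site FILE NAME VALUE [NAME VALUE ...]
# Writes a complete Hadoop site file with one <property> per NAME/VALUE pair.
write_site() {
  site_file="$1"; shift
  {
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    while [ "$#" -ge 2 ]; do
      printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
      shift 2
    done
    echo '</configuration>'
  } > "$site_file"
}

# hadoop.tmp.dir: base for temporary directories.
# fs.default.name: URI of the default file system (port is an assumption).
write_site "$OUT_DIR/core-site.xml" \
  hadoop.tmp.dir /app/hadoop/tmp \
  fs.default.name hdfs://localhost:54310

# mapred.job.tracker: host and port of the JobTracker (port is an assumption).
write_site "$OUT_DIR/mapred-site.xml" mapred.job.tracker localhost:54311

# dfs.replication: a single DataNode can hold only one copy of each block.
write_site "$OUT_DIR/hdfs-site.xml" dfs.replication 1

echo "Wrote site files to $OUT_DIR"
```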
## Formatting the HDFS filesystem via the NameNode
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.
Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!
To format the filesystem (which simply initializes the directory specified by the `dfs.name.dir` variable), run the command
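A sketch of the format command, assuming the installation path `/usr/local/hadoop` and a shell running as `hduser`. The `run` helper is an illustrative addition that prints the command unless `APPLY=1` is set, which is a sensible default for a destructive step.

```shell
# Dry-run helper (illustrative): print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"

# Initializes the NameNode storage directory. Only do this on a fresh setup,
# because it discards the metadata of any existing HDFS.
run "$HADOOP_HOME/bin/hadoop" namenode -format
```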
## Starting your single-node cluster
Run the command:
This will start up a NameNode, DataNode, JobTracker and TaskTracker on your machine.
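A sketch of the start command, assuming the `start-all.sh` script that ships in the `bin/` directory of Hadoop 1.x and the installation path `/usr/local/hadoop`. The `run` helper is an illustrative addition that prints the command unless `APPLY=1` is set.

```shell
# Dry-run helper (illustrative): print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"

# Starts the HDFS daemons first, then the MapReduce daemons (run as hduser).
run "$HADOOP_HOME/bin/start-all.sh"
```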
A nifty tool for checking whether the expected Hadoop processes are running is `jps` (part of Sun’s Java since v1.5.0). See also [How to debug MapReduce programs](http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms).
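To turn the `jps` check into something scriptable, a small filter can report which daemons are absent from the listing. This is a sketch: the helper name `missing_daemons` is made up for illustration, and `SecondaryNameNode` is included on the assumption that `start-all.sh` launches it as well, as stock Hadoop 1.x does.

```shell
# Read a jps listing on stdin and print every expected daemon that is not in it.
missing_daemons() {
  listing="$(cat)"
  for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # -w matches whole words only, so "SecondaryNameNode" does not count as "NameNode".
    echo "$listing" | grep -qw "$daemon" || echo "$daemon"
  done
}
```

With the cluster running, `jps | missing_daemons` should print nothing; any name it does print is a daemon worth looking up in the log files.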
You can also check with `netstat` if Hadoop is listening on the configured ports.
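If there are any errors, examine the log files in the `logs/` directory of your Hadoop installation (`/usr/local/hadoop/logs/` if you used the installation path in this tutorial).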
## Stopping your single-node cluster
Run the command `bin/stop-all.sh` (from the Hadoop installation directory) to stop all the daemons running on your machine.
## Running a MapReduce job
We will now run your first Hadoop MapReduce job. We will use the [WordCount example job](http://wiki.apache.org/hadoop/WordCount) which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information about [what happens behind the scenes](http://wiki.apache.org/hadoop/WordCount) is available at the [Hadoop Wiki](http://wiki.apache.org/hadoop/WordCount).
### Download example input data
We will use three ebooks from Project Gutenberg for this example:
* [The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson](http://www.gutenberg.org/etext/20417)
* [The Notebooks of Leonardo Da Vinci](http://www.gutenberg.org/etext/5000)
* [Ulysses by James Joyce](http://www.gutenberg.org/etext/4300)
Download each ebook as text files in `Plain Text UTF-8` encoding and store the files in a local temporary directory of choice, for example `/tmp/gutenberg`.
### Restart the Hadoop cluster
Restart your Hadoop cluster if it’s not running already.
### Copy local example data to HDFS
Before we run the actual MapReduce job, we first [have to copy](http://wiki.apache.org/hadoop/ImportantConcepts) the files from our local file system to Hadoop’s [HDFS](http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html).
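A sketch of the copy step, assuming the ebooks were saved to `/tmp/gutenberg` and that the HDFS target is `/user/hduser/gutenberg`, the input directory used by the job later on. The `run` helper is an illustrative addition that prints each command unless `APPLY=1` is set.

```shell
# Dry-run helper (illustrative): print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"

# Copy the local directory, including its files, into HDFS.
run "$HADOOP_HOME/bin/hadoop" dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
# List the copied files to confirm that they arrived.
run "$HADOOP_HOME/bin/hadoop" dfs -ls /user/hduser/gutenberg
```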
### Run the MapReduce job
Now, we actually run the WordCount example job.
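A sketch of the job submission, assuming Hadoop 1.0.3 installed in `/usr/local/hadoop`, where the examples JAR is named `hadoop-examples-1.0.3.jar`. The `run` helper is an illustrative addition that prints the command unless `APPLY=1` is set.

```shell
# Dry-run helper (illustrative): print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
# Full JAR name for Hadoop 1.0.3; adjust it to the version you installed.
EXAMPLES_JAR="$HADOOP_HOME/hadoop-examples-1.0.3.jar"

# Arguments: example name, HDFS input directory, HDFS output directory.
# The output directory must not exist yet, otherwise the job refuses to start.
run "$HADOOP_HOME/bin/hadoop" jar "$EXAMPLES_JAR" wordcount \
  /user/hduser/gutenberg /user/hduser/gutenberg-output
```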
This command will read all the files in the HDFS directory `/user/hduser/gutenberg`, process them, and store the result in the HDFS directory `/user/hduser/gutenberg-output`.
Note: Some people run the command above and get the following error message:
In this case, re-run the command with the full name of the Hadoop Examples JAR file, for example:
Check if the result is successfully stored in HDFS directory `/user/hduser/gutenberg-output`:
If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the `"-D"` option:
An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint. But it accepts the user specified mapred.reduce.tasks and doesn’t manipulate that. You cannot force mapred.map.tasks but you can specify mapred.reduce.tasks.
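A sketch of the `-D` usage, assuming the same JAR and directories as the WordCount run. The value `16` is only an example, and the `run` helper is an illustrative addition that prints the command unless `APPLY=1` is set.

```shell
# Dry-run helper (illustrative): print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
EXAMPLES_JAR="$HADOOP_HOME/hadoop-examples-1.0.3.jar"

# Generic options such as -D go after the example name and before its own arguments.
run "$HADOOP_HOME/bin/hadoop" jar "$EXAMPLES_JAR" wordcount \
  -D mapred.reduce.tasks=16 \
  /user/hduser/gutenberg /user/hduser/gutenberg-output
```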
### Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the `hadoop dfs -cat` command to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.
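A sketch of the copy-and-inspect step using `fs -getmerge`. The local directory `/tmp/gutenberg-output` and the file name `merged.txt` are assumptions chosen for illustration, and the `run` helper prints each command unless `APPLY=1` is set.

```shell
# Dry-run helper (illustrative): print the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
LOCAL_OUT=/tmp/gutenberg-output

run mkdir -p "$LOCAL_OUT"
# Concatenate all files of the HDFS output directory into one local file.
run "$HADOOP_HOME/bin/hadoop" dfs -getmerge /user/hduser/gutenberg-output "$LOCAL_OUT/merged.txt"
# Look at the first lines of the merged result.
run head "$LOCAL_OUT/merged.txt"
```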
Note that in this specific output the quote signs (“) enclosing the words in the `head` output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the `part-00000` file further to see it for yourself.
> The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) **not be sorted**.
## Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see `conf/hadoop-default.xml`) available at these locations:
* [http://localhost:50070/](http://localhost:50070/) – web UI of the NameNode daemon
* [http://localhost:50030/](http://localhost:50030/) – web UI of the JobTracker daemon
* [http://localhost:50060/](http://localhost:50060/) – web UI of the TaskTracker daemon
These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.
### NameNode Web Interface (HDFS layer)
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.
By default, it’s available at [http://localhost:50070/](http://localhost:50070/).
![](http://www.michael-noll.com/blog/uploads/Hadoop-namenode-screenshot.png "A screenshot of Hadoop's NameNode web interface")
### JobTracker Web Interface (MapReduce layer)
The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the _local machine’s_ Hadoop log files (the machine on which the web UI is running).
By default, it’s available at [http://localhost:50030/](http://localhost:50030/).
![](http://www.michael-noll.com/blog/uploads/Hadoop-jobtracker-screenshot.png "A screenshot of Hadoop's Job Tracker web interface")
### TaskTracker Web Interface (MapReduce layer)
The task tracker web UI shows you running and non-running tasks. It also gives access to the _local machine’s_ Hadoop log files.
By default, it’s available at [http://localhost:50060/](http://localhost:50060/).
![](http://www.michael-noll.com/blog/uploads/Hadoop-tasktracker-screenshot.png "A screenshot of Hadoop's Task Tracker web interface")
# What’s next?
If you’re feeling comfortable, you can continue your Hadoop experience with my follow-up tutorial [Running Hadoop On Ubuntu Linux (Multi-Node Cluster)](http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/) where I describe how to build a Hadoop _multi-node_ cluster with two Ubuntu boxes (this will increase your current cluster size by 100%, heh).
In addition, I wrote [a tutorial](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/) on [how to code a simple MapReduce job](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/) in the Python programming language which can serve as the basis for writing your own MapReduce programs.
# Related Links
From yours truly:
* [Running Hadoop On Ubuntu Linux (Multi-Node Cluster)](http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/)
* [Writing An Hadoop MapReduce Program In Python](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/)
From other people:
* [How to debug MapReduce programs](http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms)
* [Hadoop API Overview](http://hadoop.apache.org/core/docs/current/api/overview-summary.html) (for Hadoop 2.x)