How to be a winner – advice for students starting into research work

Don't get hung up trying to understand everything at the outset

The biggest challenge you face at the outset of any new project is that there is a huge (seemingly overwhelming) amount of stuff you need to know to tackle your problem properly. While this is true in the small for the beginning researcher, it is also true in the large for any research project. So learning how to cope with this challenge is an important skill to master to become a good researcher. In contrast, blocking your action and progress while waiting for complete knowledge is the road to failure.

Coping mechanisms employed by winners include:

• prioritizing (what do I need to know most?)
• reading (everything made available to you, and seek out more; but don't put months of reading between you and getting started doing things)
• multithreading (when blocked on one item or path, is there another I can productively pursue?)
• pursuing multiple possible solution techniques (maybe some have easier or less blocked paths than others)
• wishful thinking (ok, let's assume this subproblem is solved; does that allow me to go on and solve other problems?)
• pestering people who might have some of the information you need (you might think they should know what you need to know, but often they don't have a clear idea of what you do and don't know; start by getting them to give you pointers to things you can use to help yourself. Show respect for their time and always follow up on the resources you've been given before asking for a personal explanation.)
• proposing working models (maybe they are wrong or different from others, but they give you something to work with and something concrete to discuss and compare with others; you will refine your models continually, but it's good to have something concrete in mind to work with)

Losers will stop the first time they run into something they don't know, cannot solve a problem, or encounter trouble slightly outside what they consider "their part" of the problem, and then offer excuses for why they cannot make any progress. Winners consider the whole problem theirs and look for paths around every hangup. Losers make sure there is someone or something to blame for their lack of progress. Winners find ways to make progress despite complications. Losers know all the reasons it cannot be done. Winners find a way to do it.

Communicate and Synchronize Often

Of course, when you do have to build your own models, solve unexpected problems, make assumptions, etc., do make sure to communicate and synchronize with your fellow researchers. Do they have different models from yours? What can you learn from each other's models and assumptions? Let them know what you're thinking, where you're stuck, and how you're trying to get around your problems.

Decompose

The whole problem often seems overwhelming. Decompose it into manageable pieces (preferably, with each piece a stable intermediate). Tackle the pieces one at a time. Divide and conquer.

This may sound obvious, but it works. I've turned numerous problems which appeared "frightening" in scope into many 1-day or 2-day tasks, and then tackled each nice, contained 1–2 day task as I came to it. As I understood more, new problems and tasks arose, but they could all be broken back down to bite-sized pieces which would be tackled one at a time.

Be Organized

Realize that your supervisor is busy

Your professor or graduate student supervisor is busy. He hired you to help him get more accomplished than he could have on his own. Your biggest benefit to him is when you can be self-moving and self-motivating.

Do not expect your supervisor to solve all your problems. Find out what he has thought about and suggests for a starting point and work from there. But realize there may come a time when you have put more quality thought into something than he has (and this will happen more and more often as you get into your work). So, when you think you see or know a better way to solve a problem, bring it up. In an ideal scenario this is exactly what should happen. Your supervisor gives you the seed and some directions, then goes off to think about other problems. You put in concentrated time on your problem and ultimately come back with more knowledge and insight into your subproblem than your supervisor.

As a supervisor, I work in two modes:

1. Until a student has demonstrated that he has thought more deeply about the problem than I have, I strongly advocate that he start things my way.
2. Once a student has examined a problem in depth, then we can discuss it as peers, and generally the student becomes the expert on this subproblem, and I can offer general advice from my experience and breadth.

Deliver

Once you've signed up, you have to deliver. But you do not have to deliver the final solution to everything at once. This, in fact, is a fallacy of many people and research projects.

Losers keep promising a great thing in the future but have nothing to show now. Winners can show workable/usable results along the way to the solution. These pieces can include:

• solutions to simplified models
• pieces of a flow
• intermediate output/data
• measurements of problem characteristics
• stable intermediates (see below)

Demonstrate progress

This allows your supervisor to offer early feedback and to help you prioritize your attention; this will often help you both make mid-course corrections, increasing the likelihood that you will end up with interesting results. Requirements and understanding invariably evolve (remember, the key challenge at the beginning is incomplete knowledge). Change and redirection are normal, expected, and healthy (since they are usually a result of greater knowledge and understanding). The incremental model is robust and prepared for this adaptation, while the monolithic (all-at-once) model is brittle and often leads to great solutions which don't address the real problem.


Incrementally grow your solutions (especially software).

In the new chapters which appear in the 20th Anniversary edition of The Mythical Man-Month, Brooks identifies incremental development and progressive refinement towards the goal as one of the best new techniques which he's come to appreciate since the original writing of MMM. From my own experience, I wholeheartedly agree with this, and it does have a very positive impact on morale (yours, your team's, your supervisor's).

Target stable intermediates

Look for stable intermediate points on your incremental path to solving some problem.

• points where some clear piece of the problem has been solved (has a nice interface to this subproblem, produces results at this stage)
• things you can build upon
• things you can spin-out
• things you can share with team members (allow them to help)
• points of accomplishment

Don't turn problems (subtasks) into research problems unnecessarily.

Often you'll run into a subtask with no single, obviously right solution. If solving this piece right is key to the overall goals, maybe it will be necessary to devote time to studying and solving this subproblem better than it has ever been solved before. However, for most subproblems, this is not the case. You want to keep focused on the overall goals of the project and come up with an "adequate" solution for this problem. In general, try to do the obvious or simple thing which can be done expediently. Make notes on the possible weaknesses and the alternatives you could explore should these weaknesses prove limiting. Then, if this does become a bottleneck or weak link in the solution chain, you can revisit it and your alternatives and invest more effort exploring them.

Learn to solve your own problems

In general, in life, there won't always be someone to turn to who has all the answers. It is vitally important that you learn how to tackle all the kinds of problems you may encounter. Use your supervisors as a crutch or scaffolding only to get yourself started. Watch them and learn not just the answers they help you find, but how they find the answers you were unable to obtain on your own. Strive for independence. Learn techniques and gain confidence in your own ability to solve problems now.


The full JDK will be placed in /usr/lib/jvm/java-6-sun (well, this directory is actually a symlink on Ubuntu). After installation, make a quick check whether Sun’s JDK is correctly set up:


We will use a dedicated Hadoop user account for running Hadoop. While that’s not required it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).


This will add the user hduser and the group hadoop to your local machine.

## Configuring SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.

I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are [several online guides](http://ubuntuguide.org/) available.

First, we have to generate an SSH key for the hduser user.

The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes). Second, you have to enable SSH access to your local machine with this newly created key.


The final step is to test the SSH setup by connecting to your local machine with the hduser user. The step is also needed to save your local machine’s host key fingerprint to the hduser user’s known_hosts file. If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).

If the SSH connection fails, these general tips might help:

* Enable debugging with ssh -vvv localhost and investigate the error in detail.
* Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hduser user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.

## Disabling IPv6

One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box. In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.

To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

You have to reboot your machine in order to make the changes take effect.

You can check whether IPv6 is enabled on your machine with the following command:

A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).

### Alternative

You can also disable IPv6 only for Hadoop as documented in [HADOOP-3437](https://issues.apache.org/jira/browse/HADOOP-3437). You can do so by adding the following line to conf/hadoop-env.sh:

# Hadoop

## Installation

[Download Hadoop](http://www.apache.org/dyn/closer.cgi/hadoop/core) from the [Apache Download Mirrors](http://www.apache.org/dyn/closer.cgi/hadoop/core) and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:

(Just to give you the idea, YMMV – personally, I create a symlink from hadoop-1.0.3 to hadoop.)

## Update $HOME/.bashrc

Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.

You can repeat this exercise also for other users who want to use Hadoop.

## Excursus: Hadoop Distributed File System (HDFS)

Before we continue let us briefly learn a bit more about Hadoop’s distributed file system.

> The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which is part of the Apache Lucene project.
>
> **The Hadoop Distributed File System: Architecture and Design** [hadoop.apache.org/hdfs/docs/…](http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html)

The following picture gives an overview of the most important HDFS components.

![](http://www.michael-noll.com/blog/uploads/HDFS-Architecture.gif "HDFS Architecture (source: Hadoop documentation)")

## Configuration

Our goal in this tutorial is a single-node setup of Hadoop. More information about what we do in this section is available on the [Hadoop Wiki](http://wiki.apache.org/hadoop/GettingStartedWithHadoop).

### hadoop-env.sh

The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory (in this tutorial, /usr/lib/jvm/java-6-sun).

Note: If you are on a Mac with OS X 10.7 you can use the following line to set up JAVA_HOME in conf/hadoop-env.sh.

### conf/*-site.xml

In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, [HDFS](http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html), even though our little “cluster” only contains our single local machine. You can leave the settings below “as is” with the exception of the hadoop.tmp.dir parameter – this parameter you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point. Now we create the directory and set the required ownerships and permissions:


If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.

Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

In file conf/core-site.xml:

In file conf/mapred-site.xml:

In file conf/hdfs-site.xml:
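Entries in these files are Hadoop property elements, each with a name and a value. As a rough illustration only (this is not the tutorial's original listing), the sketch below builds the entry for hadoop.tmp.dir with the /app/hadoop/tmp value used in this tutorial, using Python's standard xml.etree.ElementTree module. The helper name make_property is hypothetical, and any other properties have to be taken from the Hadoop documentation.

```python
import xml.etree.ElementTree as ET


def make_property(name, value, description=None):
    """Build one <property> element in Hadoop's configuration format.

    make_property is a hypothetical helper written for this sketch.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    if description is not None:
        ET.SubElement(prop, "description").text = description
    return prop


# hadoop.tmp.dir and /app/hadoop/tmp are the name and value used in this tutorial.
configuration = ET.Element("configuration")
configuration.append(make_property("hadoop.tmp.dir", "/app/hadoop/tmp"))

# encoding="unicode" returns a str without an XML declaration.
xml_text = ET.tostring(configuration, encoding="unicode")
print(xml_text)
```

The printed element can be pasted between the configuration tags of conf/core-site.xml; generating it this way mainly helps when the same value has to be kept consistent across several machines.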

See [Getting Started with Hadoop](http://wiki.apache.org/hadoop/GettingStartedWithHadoop) and the documentation in [Hadoop’s API Overview](http://hadoop.apache.org/core/docs/current/api/overview-summary.html) if you have any questions about Hadoop’s configuration options.

## Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.

Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!

To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command

The output will look like this:


Run the command:


This will start up a NameNode, DataNode, JobTracker and a TaskTracker on your machine. The output will look like this:

A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0). See also [How to debug MapReduce programs](http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms).


You can also check with netstat if Hadoop is listening on the configured ports.
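As a complement to netstat, a similar check can be sketched in Python with the standard socket module. The function name is_port_open is hypothetical; the three ports are the default web UI ports this tutorial lists for the NameNode, JobTracker and TaskTracker. A successful TCP connection only shows that something is listening on the port, not that it is a healthy Hadoop daemon.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
    finally:
        sock.close()


# Default web UI ports named in this tutorial.
WEB_UI_PORTS = {50070: "NameNode", 50030: "JobTracker", 50060: "TaskTracker"}

if __name__ == "__main__":
    for port, daemon in sorted(WEB_UI_PORTS.items()):
        state = "listening" if is_port_open("127.0.0.1", port) else "not reachable"
        print("%-12s port %d: %s" % (daemon, port, state))
```

The sketch uses IPv4 only (AF_INET), which matches the IPv6-disabled setup described earlier.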


If there are any errors, examine the log files in the /logs/ directory.

## Stopping your single-node cluster

Stop all the daemons running on your machine with the stop script that ships with Hadoop.

## Running a MapReduce job

We will now run your first Hadoop MapReduce job. We will use the [WordCount example job](http://wiki.apache.org/hadoop/WordCount), which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information about [what happens behind the scenes](http://wiki.apache.org/hadoop/WordCount) is available at the [Hadoop Wiki](http://wiki.apache.org/hadoop/WordCount).

### Download example input data

We will use three ebooks from Project Gutenberg for this example:

* [The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson](http://www.gutenberg.org/etext/20417)
* [The Notebooks of Leonardo Da Vinci](http://www.gutenberg.org/etext/5000)
* [Ulysses by James Joyce](http://www.gutenberg.org/etext/4300)

Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a local temporary directory of choice, for example /tmp/gutenberg.
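To get a feel for what the WordCount job computes before running it on Hadoop, here is a minimal local sketch in Python. It is not the Hadoop example itself: it runs in a single process, the function names are hypothetical, and splitting on whitespace is an assumption about how the example tokenizes its input. The output follows the format described above, one word and its count per line, separated by a tab.

```python
from collections import Counter


def word_count(lines):
    """Count whitespace-separated tokens (a simplified stand-in for WordCount)."""
    counts = Counter()
    for line in lines:
        # str.split() with no argument splits on runs of whitespace only,
        # so punctuation stays attached to the neighbouring word.
        counts.update(line.split())
    return counts


def format_output(counts):
    """Render one 'word<TAB>count' line per word, sorted by word."""
    return ["%s\t%d" % (word, n) for word, n in sorted(counts.items())]


sample = ['"Hello world," she said.', "hello world"]
for row in format_output(word_count(sample)):
    print(row)
```

Because tokens are split on whitespace only, a word such as "Hello with a leading quote mark is counted separately from hello. The rows are sorted here only to make the sketch's output deterministic.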

### Copy local example data to HDFS

Before we run the actual MapReduce job, we first [have to copy](http://wiki.apache.org/hadoop/ImportantConcepts) the files from our local file system to Hadoop’s [HDFS](http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html).


### Run the MapReduce job

Now, we actually run the WordCount example job.


This command will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and store the result in the HDFS directory /user/hduser/gutenberg-output.

Note: Some people run the command above and get an error message. In this case, re-run the command with the full name of the Hadoop Examples JAR file.

Example output of the previous command in the console:

Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output:


If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the "-D" option:


### Retrieve the job result from HDFS

To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can read the file directly from HDFS without copying it to the local file system. In this tutorial, though, we will copy the results to the local file system.

Note that in this specific output the quote signs (“) enclosing the words in the head output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the part-00000 file further to see it for yourself.

> The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) **not be sorted**.

## Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:

* [http://localhost:50070/](http://localhost:50070/) – web UI of the NameNode daemon
* [http://localhost:50030/](http://localhost:50030/) – web UI of the JobTracker daemon
* [http://localhost:50060/](http://localhost:50060/) – web UI of the TaskTracker daemon

These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.

### NameNode Web Interface (HDFS layer)

The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.

By default, it’s available at [http://localhost:50070/](http://localhost:50070/).

![](http://www.michael-noll.com/blog/uploads/Hadoop-namenode-screenshot.png "A screenshot of Hadoop's NameNode web interface")

### JobTracker Web Interface (MapReduce layer)

The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine’s Hadoop log files (the machine on which the web UI is running).

By default, it’s available at [http://localhost:50030/](http://localhost:50030/).

![](http://www.michael-noll.com/blog/uploads/Hadoop-jobtracker-screenshot.png "A screenshot of Hadoop's Job Tracker web interface")

### TaskTracker Web Interface (MapReduce layer)

The task tracker web UI shows you running and non-running tasks. It also gives access to the local machine’s Hadoop log files.

By default, it’s available at [http://localhost:50060/](http://localhost:50060/).

![](http://www.michael-noll.com/blog/uploads/Hadoop-tasktracker-screenshot.png "A screenshot of Hadoop's Task Tracker web interface")

# What’s next?

If you’re feeling comfortable, you can continue your Hadoop experience with my follow-up tutorial [Running Hadoop On Ubuntu Linux (Multi-Node Cluster)](http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/) where I describe how to build a Hadoop multi-node cluster with two Ubuntu boxes (this will increase your current cluster size by 100%, heh). In addition, I wrote [a tutorial](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/) on [how to code a simple MapReduce job](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/) in the Python programming language which can serve as the basis for writing your own MapReduce programs.

# Related Links

From yours truly:

* [Running Hadoop On Ubuntu Linux (Multi-Node Cluster)](http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/)
* [Writing An Hadoop MapReduce Program In Python](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/)

From other people:

* [How to debug MapReduce programs](http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms)
* [Hadoop API Overview](http://hadoop.apache.org/core/docs/current/api/overview-summary.html) (for Hadoop 2.x)

# Batch-renaming PDF files with Python

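    # Join the directory obtained earlier, the new title and the extension
    # to form the new file name.
    title = title[:32]  # the line I added; it avoids the "file name too long" error
    new = string + title + ".pdf"
    print "old = %s" % old
    print "new = %s" % new
    # The new file name must be re-encoded, and it has to be GBK,
    # because GBK is the default encoding on Chinese-language Windows.
    new = new.encode('GBK')
    # Close the file stream, otherwise the file cannot be renamed.
    stream.close()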

    # -*- coding: utf-8 -*-
    # Python 2 script; requires the pyPdf library.
    import os
    from pyPdf import PdfFileReader

    # Characters to strip from a title before using it as a file name
    ILLEGAL_CHARS = ('?', '\\', '*', '/', ',', '"', '<', '>', '|', '，', "'", ':')


    # Clean up the extracted title by replacing illegal characters with spaces
    def format_filename(filename):
        if isinstance(filename, str):
            for char in ILLEGAL_CHARS:
                filename = filename.replace(char, " ")
            return filename
        else:
            return 'None'


    # Recursively walk the folder and process every file ending in .pdf
    def VisitDir(path):
        for p in os.listdir(path):
            pathname = os.path.join(path, p)
            if not os.path.isfile(pathname):
                VisitDir(pathname)
            elif os.path.splitext(pathname)[1] == '.pdf':
                print pathname
                rename(pathname)


    # Rename one file according to its title attribute
    def rename(pathname):
        stream = open(pathname, "rb")
        input1 = PdfFileReader(stream)
        if input1.isEncrypted:
            print "This file is encrypted!"
            stream.close()
            return
        # pathname holds directory and file name; keep the directory part
        old = pathname
        string = os.path.dirname(pathname) + os.sep
        print "string = %s" % string
        title = str(input1.getDocumentInfo().title)
        print "title = %s" % title
        title = format_filename(title)
        # Directory + new title + extension gives the new file name
        new = string + title + ".pdf"
        print "old = %s" % old
        print "new = %s" % new
        # The new name must be encoded as GBK, the default encoding
        # on Chinese-language Windows
        new = new.encode('GBK')
        # Close the file stream, otherwise the file cannot be renamed
        stream.close()
        if title != "None":
            try:
                os.rename(old, new)
            except WindowsError as e:
                print e
        else:
            print "The file contains no title attribute!"


    if __name__ == "__main__":
        path = r"F:\Papers\ICDE'09\Demos"
        VisitDir(path)

However, this code still has problems:

1. Some PDF files have an empty title attribute, or one that is not the real title of the document, so renaming by it is meaningless.
2. Some PDF files have non-Latin characters in the title attribute, which raises an error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)

Later, an expert found an even better library, pdfminer: http://www.unixuser.org/~euske/python/pdfminer/

The main idea is to use pdfminer's pdf2txt module to convert the first page of the PDF file to text, and then read the first line of that text as the file name. Thanks to pdfminer's powerful features, naming files this way is very accurate. As long as your PDF file is not a scanned copy (image format), the title can be retrieved correctly.

For the code below to run, it must be placed in the same directory as pdf2txt.py from pdfminer.

    #encoding:utf-8
    '''Purpose: rename each PDF file according to the title of the article.'''

    # Python 2 script; requires pyPdf, plus pdf2txt.py in the same directory.
    import os
    from pyPdf import PdfFileReader

    # Characters to strip from a title before using it as a file name
    ILLEGAL_CHARS = ('?', '\\', '*', '/', ',', '"', '<', '>', '|', '，', "'", ':')

    # Replacements for garbled characters produced by the PDF conversion
    # (these bytes are the UTF-8 encoding of the "fi" ligature)
    key_value = {'\xef\xac\x81': 'fi'}


    # Clean up the extracted title by replacing illegal characters with spaces
    def format_filename(filename):
        if isinstance(filename, str):
            for char in ILLEGAL_CHARS:
                filename = filename.replace(char, " ")
            return filename
        else:
            return 'None'


    # Recursively walk the folder and process every file ending in .pdf
    def VisitDir(path):
        for p in os.listdir(path):
            pathname = os.path.join(path, p)
            if not os.path.isfile(pathname):
                VisitDir(pathname)
            elif os.path.splitext(pathname)[1] == '.pdf':
                print pathname
                rename(pathname)


    # Rename one file according to the first text block of its first page
    def rename(pathname):
        stream = open(pathname, "rb")
        input1 = PdfFileReader(stream)
        if input1.isEncrypted:
            print "This file is encrypted!"
            stream.close()
            return
        # pathname holds directory and file name; keep the directory part
        old = pathname
        string = os.path.dirname(pathname) + os.sep
        print "string = %s" % string
        # Option 1 (no longer used): the title attribute
        # title = str(input1.getDocumentInfo().title)
        # Option 2: convert the first page to text with pdf2txt.py
        os.system('python pdf2txt.py -p 1 "' + old + '" >c.txt')
        f = open("c.txt", "rb")
        title = ""
        a = f.readline()
        # Read up to the first blank line; "" means end of file
        while a not in ("", "\r\n", "\n"):
            title += a
            a = f.readline()
        f.close()
        title = title.replace("\r\n", " ").replace("\n", " ").strip()
        title = format_filename(title)
        for key, value in key_value.iteritems():
            title = title.replace(key, value)
        # Directory + new title + extension gives the new file name
        new = string + title + ".pdf"
        print "old = %s" % old
        print "new = %s" % new
        # The new name must be encoded as GBK, the default encoding
        # on Chinese-language Windows
        new = new.encode('GBK')
        # Close the file stream, otherwise the file cannot be renamed
        stream.close()
        if title != "None":
            try:
                os.rename(old, new)
            except WindowsError as e:
                print str(e)
        else:
            # python pdf2txt.py -p 1 p43-nazir.pdf >c.txt
            print "The file contains no title attribute!"


    if __name__ == "__main__":
        path = r"."  # the current directory
        VisitDir(path)
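The title-handling steps (take the first text block of the page, repair ligatures, replace illegal characters, truncate to 32 characters) can also be tried in isolation, without pyPdf or pdfminer. The following is a minimal Python 3 sketch under stated assumptions: the function names, the sample text and the second ligature entry are hypothetical, the illegal-character set is the usual list of characters reserved in Windows file names rather than the exact tuple used above, and leading blank lines are skipped, which the script above does not do.

```python
# Characters that are reserved in Windows file names (assumed set).
ILLEGAL_CHARS = '\\/:*?"<>|'
# Ligatures that PDF text extraction can leave behind.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl"}


def first_block(text):
    """Join the lines before the first blank line into one title string."""
    lines = []
    for line in text.splitlines():
        if not line.strip():
            if lines:
                break  # a blank line ends the title block
            continue  # skip blank lines before the title
        lines.append(line.strip())
    return " ".join(lines)


def safe_filename(title, max_len=32):
    """Replace ligatures and illegal characters, then truncate."""
    for ligature, plain in LIGATURES.items():
        title = title.replace(ligature, plain)
    cleaned = "".join(" " if ch in ILLEGAL_CHARS else ch for ch in title)
    cleaned = " ".join(cleaned.split())  # collapse repeated spaces
    return cleaned[:max_len].rstrip()


# Hypothetical first-page text, as a text extractor might return it.
page_text = "Ef\ufb01cient Joins:\nA Survey\n\nSome Author\n"
print(safe_filename(first_block(page_text)) + ".pdf")
```

Working on Unicode strings, as Python 3 does by default, avoids the byte-level replacement table and the ASCII encoding errors described above.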