Need English tokenization and word extraction for a paper's experiments. Any good Python libraries to recommend? - V2EX

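What I mainly need is to extract all the keywords from an article (every English word that appears at least once, ignoring phrases) and then build an index with my own algorithm. Writing this directly in Python would be fairly easy,

but I'd still like to ask: is there a library built for this kind of problem? Ideally one that can fetch a web page directly and strip out the HTML. The experiment involves statistics over many documents from different fields, so I'd rather not copy each one into a txt file...

Hehe, this is my first topic. Thanks, everyone.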
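
For the "strip the HTML and collect every distinct English word" part of the question, here is a minimal sketch using only the standard library. The names `TextExtractor` and `html_to_words` are made up for illustration, and it assumes the page has already been downloaded into a string (for example with `urllib.request.urlopen`). Real-world pages may need a more forgiving parser.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_words(html):
    """Return the sorted, lowercased set of English words in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = " ".join(parser.chunks)
    return sorted({w.lower() for w in re.findall(r"[A-Za-z]+", text)})


page = "<html><body><p>Hello <b>World</b>, hello!</p></body></html>"
print(html_to_words(page))
```

The regular expression keeps only runs of ASCII letters, so numbers and hyphenated or apostrophized forms get split or dropped. Whether that is acceptable depends on how the index is meant to treat them.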

4 replies

1

eric

Jul 23, 2012

1

NLTK's word_tokenize makes this very easy.
http://nltk.org/
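
A rough sketch of what that could look like. It assumes NLTK is installed and its tokenizer data has been downloaded via `nltk.download` (the resource is named `punkt` or `punkt_tab`, depending on the NLTK version). The helper name `extract_words` and the regex fallback are illustrative additions, not part of NLTK.

```python
import re


def extract_words(text):
    """Tokenize English text and keep only purely alphabetic tokens, lowercased."""
    try:
        from nltk.tokenize import word_tokenize
        tokens = word_tokenize(text)
    except (ImportError, LookupError):
        # Assumption: fall back to a plain regex if NLTK or its
        # tokenizer data is not available on this machine.
        tokens = re.findall(r"[A-Za-z]+", text)
    return [t.lower() for t in tokens if t.isalpha()]


print(extract_words("NLTK makes tokenizing English text easy."))
```

Filtering with `isalpha()` drops punctuation tokens as well as pieces such as `n't` that `word_tokenize` splits off from contractions, so results for contractions can differ between the NLTK path and the fallback.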

2

stackpop

OP

Jul 23, 2012

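@eric It really is powerful, exactly what I was looking for. I had planned to write this in C++, but a friend suggested Python, and the code is much more concise. Python really is powerful~ No wonder many universities abroad have switched their first CS programming course to Python, haha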

3

fanzheng

Jul 24, 2012

If all you need is occurrence counts, just use split and then Counter(), the Counter() in the official module documentation.
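
A minimal sketch of the split-then-Counter approach, assuming `Counter` here means `collections.Counter` from the standard library. The helper name `word_counts` and the punctuation stripping are illustrative additions. A bare `split()` would leave tokens like `cat.` and `cat` as separate entries.

```python
import string
from collections import Counter


def word_counts(text):
    """Count lowercased words after splitting on whitespace."""
    # Strip leading/trailing punctuation so "cat." and "cat" count together.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return Counter(w for w in words if w)


counts = word_counts("The cat saw the other cat.")
print(counts.most_common(2))
```

The keys of the resulting `Counter` are exactly the distinct words, which covers the "every word that appears at least once" requirement without any third-party library.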

4

from0tohero

Jul 26, 2012

1

NLTK is the best, bar none~
