
Calling Google Search from Java


search is hosted on GitHub

 

This post shows how to call Google Search from Java. For more details, see the search project on GitHub.

 

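No search engine of your own, but you need a large-scale data source? Make good use of Google Search: do a lot with a little by standing on the shoulders of giants. Many applications can be built neatly on top of Google Search, such as news collection for a website, tracking news about a technology or a brand, gathering material for a knowledge base, or question-answering systems. Part of the data source for a question-answering system I built earlier, which reached an accuracy above 90%, came from exactly this use of Google Search.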
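The listing that follows sends a single GET request to a URL carrying four query parameters (start, rsz, v, q). As a minimal sketch of how such URLs could be assembled for several result pages, the class below reuses the endpoint and parameter values from that listing. The class name SearchUrlBuilder, its method names, and the pageSize argument are hypothetical and not part of the original project; the number of results that rsz=large returns per page is left as an assumption supplied by the caller.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project.
public class SearchUrlBuilder {
    // Endpoint and fixed parameters are copied from the listing in this post.
    private static final String ENDPOINT =
            "http://ajax.googleapis.com/ajax/services/search/web";

    /** Builds the URL for one page; start is the zero-based offset of the first result. */
    public static String buildUrl(String query, int start) {
        try {
            // URL-encode the query so spaces and non-ASCII characters are safe in the query string.
            return ENDPOINT + "?start=" + start + "&rsz=large&v=1.0&q="
                    + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Builds URLs for the first `pages` pages.
     * pageSize is an assumption: the number of results one request returns.
     */
    public static List<String> buildUrls(String query, int pages, int pageSize) {
        List<String> urls = new ArrayList<>();
        for (int page = 0; page < pages; page++) {
            urls.add(buildUrl(query, page * pageSize));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumes 8 results per page purely for illustration.
        for (String url : buildUrls("java search", 3, 8)) {
            System.out.println(url);
        }
    }
}
```

Each URL produced this way can be passed to a Searcher's search method in turn, collecting the returned pages into one list.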
 

package org.apdplat.search;

import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class GoogleSearcher implements Searcher{
    private static final Logger LOG = LoggerFactory.getLogger(GoogleSearcher.class);

    @Override
    public List<Webpage> search(String url) {
        List<Webpage> webpages = new ArrayList<>();
        GetMethod getMethod = new GetMethod(url);
        try {
            HttpClient httpClient = new HttpClient();
            // Configure the retry handler before the request is sent.
            getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                    new DefaultHttpMethodRetryHandler());

            // Execute the request exactly once.
            int statusCode = httpClient.executeMethod(getMethod);
            if (statusCode != HttpStatus.SC_OK) {
                LOG.error("Search failed: " + getMethod.getStatusLine());
                // null signals a failed request; main() checks for it.
                return null;
            }
            InputStream in = getMethod.getResponseBodyAsStream();
            byte[] responseBody = Tools.readAll(in);
            String response = new String(responseBody, "UTF-8");
            LOG.debug("Search response: " + response);
            JSONObject json = new JSONObject(response);
            String totalResult = json.getJSONObject("responseData").getJSONObject("cursor").getString("estimatedResultCount");
            int totalResultCount = Integer.parseInt(totalResult);
            LOG.info("Estimated number of results: " + totalResultCount);

            JSONArray results = json.getJSONObject("responseData").getJSONArray("results");

            LOG.debug("Search results:");
            for (int i = 0; i < results.length(); i++) {
                Webpage webpage = new Webpage();
                JSONObject result = results.getJSONObject(i);
                // Extract the title.
                String title = result.getString("titleNoFormatting");
                LOG.debug("Title: " + title);
                webpage.setTitle(title);
                // Extract the summary and strip the highlighting tags and ellipses.
                String summary = result.get("content").toString();
                summary = summary.replace("<b>", "");
                summary = summary.replace("</b>", "");
                summary = summary.replace("...", "");
                LOG.debug("Summary: " + summary);
                webpage.setSummary(summary);
                // Fetch the page behind the URL and extract its body text.
                String _url = result.get("url").toString();
                webpage.setUrl(_url);
                String content = Tools.getHTMLContent(_url);
                LOG.debug("Content: " + content);
                webpage.setContent(content);
                webpages.add(webpage);
            }
        } catch (IOException | JSONException | NumberFormatException e) {
            LOG.error("Search failed: ", e);
        } finally {
            // Always release the connection, including on the early return above.
            getMethod.releaseConnection();
        }
        return webpages;
    }

    public static void main(String[] args) {
        String query = "杨尚川";
        try {
            // URL-encode the query so non-ASCII characters are safe in the query string.
            query = URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            LOG.error("Failed to build the URL", e);
            return;
        }
        String url = "http://ajax.googleapis.com/ajax/services/search/web?start=0&rsz=large&v=1.0&q=" + query;

        Searcher searcher = new GoogleSearcher();
        List<Webpage> webpages = searcher.search(url);
        if (webpages != null) {
            int i = 1;
            for (Webpage webpage : webpages) {
                LOG.info("Search result " + (i++) + ":");
                LOG.info("Title: " + webpage.getTitle());
                LOG.info("URL: " + webpage.getUrl());
                LOG.info("Summary: " + webpage.getSummary());
                LOG.info("Content: " + webpage.getContent());
                LOG.info("");
            }
        } else {
            LOG.error("No search results found");
        }
    }
}
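The listing above depends on three types that this post does not show: Searcher, Webpage, and Tools. The real definitions live in the GitHub project. As a rough sketch only, inferred from how the listing calls them, they might look like the stand-ins below. Everything here is an assumption rather than the project's actual code, and Tools.getHTMLContent is left out because the post does not describe how it extracts the page text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical stand-in: the listing only calls search(String).
interface Searcher {
    List<Webpage> search(String url);
}

// Hypothetical stand-in: a plain bean with the four fields the listing sets and reads.
class Webpage {
    private String title;
    private String url;
    private String summary;
    private String content;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public String getSummary() { return summary; }
    public void setSummary(String summary) { this.summary = summary; }
    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }
}

// Hypothetical stand-in for the helper used to read the response body.
class Tools {
    /** Reads the stream until end-of-stream and returns all bytes. */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}

public class SupportingTypesDemo {
    public static void main(String[] args) throws IOException {
        // Round-trip a small payload through readAll and the bean.
        byte[] data = Tools.readAll(new ByteArrayInputStream("hello".getBytes("UTF-8")));
        Webpage page = new Webpage();
        page.setTitle(new String(data, "UTF-8"));
        System.out.println(page.getTitle());
    }
}
```

Keeping Searcher as a single-method interface means another search backend could be swapped in without touching the code that consumes the List<Webpage>.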

 

 

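Comments
#9 qbuer 2017-04-11
The Google Web Search API is no longer available.
#8 redhobor 2016-04-03
It looks like the Google Search API has been blocked. How can it be called now?
#7 cy06xt 2013-12-24
The import works on Windows 7, but on XP it always reports a pom error. Strange.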
#6 yangshangchuan 2013-11-04
dongtianlaile wrote:

Yang, after importing the project it runs fine, but pom.xml reports errors:

Multiple markers at this line
- Document is invalid: no grammar found.
- Document root element "project", must match DOCTYPE root "null".
- 84 changed lines



Are you using the code from GitHub? The latest code is here: https://github.com/ysc/search-demo
#5 dongtianlaile 2013-11-04

Yang, after importing the project it runs fine, but pom.xml reports errors:

Multiple markers at this line
- Document is invalid: no grammar found.
- Document root element "project", must match DOCTYPE root "null".
- 84 changed lines

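#4 yangshangchuan 2013-10-24
houzhanshanlinzhou wrote:
This is such a hassle. There is no jar, and Maven is too much trouble.


Once you learn how to use Maven, you will see it is no hassle. Maven + NetBeans is very convenient, and mvn eclipse:eclipse + Eclipse works fine too.
#3 houzhanshanlinzhou 2013-10-24
This is such a hassle. There is no jar, and Maven is too much trouble.
#2 yangshangchuan 2013-10-21
快乐的boy wrote:
Could you provide the jar? Thanks.

See http://github.com/ysc/search-demo for the latest code. You do not need to download the jars yourself; Maven downloads them automatically.
#1 快乐的boy 2013-10-19
Could you provide the jar? Thanks.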
