Search Engine Technology Tutorial Series (4) - Lucene - Importing 140,000 Product Records into Lucene


Downloads
File name	File size
140k_products.rar	4 MB
lucene.rar	14 MB
If the rar archive fails to extract, use WinRAR 5.21 or later.

Step 1 : 140,000 records
Step 2 : About the database
Step 3 : Run it first, see the result, then study
Step 4 : Imitate and debug
Step 5 : 140k_products.txt
Step 6 : Product.java
Step 7 : ProductUtil.java
Step 8 : TestLucene.java

Step 1 :

140,000 records

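The earlier introduction tested with just 10 records, but a real system will never have only 10. To approximate a realistic environment, the author spent considerable effort collecting 140,000 Tmall product records. Next we add all 140,000 records to Lucene and observe how search behaves.
The records are in 140k_products.rar in the download area; how to parse the file is explained in later steps.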

Step 2 :

About the database

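Ideally these 140,000 records would first be saved to a database and then read back out. However, not every reader has a JDBC background, loading the data into a database is tedious, and reading 140,000 rows back takes time. So the data is read directly from a file and converted into a List<Product>, which is equivalent to having read it from a database, only much faster.
Interested readers can load the data into a database themselves and see how a like query performs.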
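To get a feel for what a `like '%keyword%'` query has to do, here is a minimal in-memory sketch. It is an analogy, not a database: the class `LikeScanSketch` and its method `like` are hypothetical names, and the product names in `main` are made up for illustration. It assumes a pattern with a leading wildcard, which an ordinary database index typically cannot help with, so every row ends up being examined.

```java
import java.util.ArrayList;
import java.util.List;

public class LikeScanSketch {

    // Mimics: select name from product where name like '%keyword%'
    // by checking every name in turn (a full scan).
    public static List<String> like(List<String> names, String keyword) {
        List<String> matches = new ArrayList<>();
        for (String name : names) {
            if (name.contains(keyword)) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample names, not taken from 140k_products.txt
        List<String> names = new ArrayList<>();
        names.add("red cotton shirt");
        names.add("blue denim jacket");
        names.add("cotton socks");
        System.out.println(like(names, "cotton"));
    }
}
```

The cost of this scan grows with the number of rows for every single query, which is the behavior worth comparing against an index that is built once and then queried many times.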

Step 3 :

Run it first, see the result, then study

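As usual, first download the runnable project from the download area, configure it, and get it running. Once you have confirmed it works, study the steps that produce this result.
Running TestLucene takes roughly 20 seconds to build the index for the 140,000 records; after that you can enter different keywords and get different results.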

Step 4 :

Imitate and debug

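Once you are sure the runnable project runs correctly, follow the tutorial step by step and reproduce the code yourself.
While doing so, your code will inevitably differ somewhere and fail to produce the expected result. When that happens, compare your code with the reference answer (the runnable project) to locate the problem.
Learning this way is effective and debugging is efficient; it noticeably speeds up learning and helps you get past the hurdles along the way.

We recommend DiffMerge for comparing folders: compare your own project folder against the author's runnable project folder.
The tool is very handy: it shows which files in the two folders differ and marks the differences clearly.
A portable install and usage guide is provided here: DiffMerge download and usage tutorial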

Step 5 :

140k_products.txt

First download 140k_products.rar, extract it to 140k_products.txt, and place that file in the project directory. The file contains 140,000 product records in total.
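Before indexing, it can help to confirm that the file is where the program will look for it and that it has roughly the expected number of lines. The following is a small sketch using only the JDK; `DataFileCheck` and `countLines` are hypothetical names, and it assumes the program is started with the project directory as its working directory (the usual default when running from an IDE) and that the file is UTF-8, as the later ProductUtil code reads it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class DataFileCheck {

    // Counts the lines of a UTF-8 text file without loading it all into memory.
    public static long countLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Relative path, resolved against the working directory
        Path file = Paths.get("140k_products.txt");
        if (!Files.exists(file)) {
            System.out.println("Not found: " + file.toAbsolutePath());
            return;
        }
        System.out.println("Lines: " + countLines(file));
    }
}
```

If the file is reported as not found, the printed absolute path shows where the program actually looked, which is usually enough to spot a misplaced file.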

Step 6 :

Product.java

Prepare an entity class to hold the product information.

package com.how2java;

public class Product {

	int id;
	String name;
	String category;
	float price;
	String place;

	String code;
	public int getId() {
		return id;
	}
	public void setId(int id) {
		this.id = id;
	}
	public String getName() {
		return name;
	}
	public void setName(String name) {
		this.name = name;
	}
	public String getCategory() {
		return category;
	}
	public void setCategory(String category) {
		this.category = category;
	}
	public float getPrice() {
		return price;
	}
	public void setPrice(float price) {
		this.price = price;
	}
	public String getPlace() {
		return place;
	}
	public void setPlace(String place) {
		this.place = place;
	}

	public String getCode() {
		return code;
	}
	public void setCode(String code) {
		this.code = code;
	}
	@Override
	public String toString() {
		return "Product [id=" + id + ", name=" + name + ", category=" + category + ", price=" + price + ", place="
				+ place + ", code=" + code + "]";
	}

}

Step 7 :

ProductUtil.java

Prepare a utility class that converts the 140k_products.txt text file into a List<Product>.

package com.how2java;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.FileUtils;

public class ProductUtil {

	public static void main(String[] args) throws IOException {

		String fileName = "140k_products.txt";
		
		List<Product> products = file2list(fileName);
		
		System.out.println(products.size());
			
	}

	public static List<Product> file2list(String fileName) throws IOException {
		File f = new File(fileName);
		List<String> lines = FileUtils.readLines(f,"UTF-8");
		List<Product> products = new ArrayList<>();
		for (String line : lines) {
			Product p = line2product(line);
			products.add(p);
		}
		return products;
	}
	
	private static Product line2product(String line) {
		Product p = new Product();
		String[] fields = line.split(",");
		p.setId(Integer.parseInt(fields[0]));
		p.setName(fields[1]);
		p.setCategory(fields[2]);
		p.setPrice(Float.parseFloat(fields[3]));
		p.setPlace(fields[4]);
		p.setCode(fields[5]);
		return p;
	}

}
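`line2product` above trusts every line to have six comma-separated fields with a numeric id and price; a line that does not will raise an `ArrayIndexOutOfBoundsException` or `NumberFormatException` and stop the whole import. As a hedged sketch of one way to guard against that, the hypothetical class `SafeLineParser` below validates a line before it is converted and skips the ones that fail. The sample lines in `main` are made up for illustration and are not taken from the data file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SafeLineParser {

    // Returns the six fields of a line, or null if the line is malformed.
    public static String[] parseFields(String line) {
        // limit -1 keeps trailing empty fields, so "a,b,c,d,e," still yields six
        String[] fields = line.split(",", -1);
        if (fields.length != 6) {
            return null;
        }
        try {
            Integer.parseInt(fields[0]);   // id must be an integer
            Float.parseFloat(fields[3]);   // price must be a number
        } catch (NumberFormatException e) {
            return null;
        }
        return fields;
    }

    // Keeps only well-formed lines and reports how many were skipped.
    public static List<String[]> parseAll(List<String> lines) {
        List<String[]> good = new ArrayList<>();
        int skipped = 0;
        for (String line : lines) {
            String[] fields = parseFields(line);
            if (fields == null) {
                skipped++;
            } else {
                good.add(fields);
            }
        }
        System.out.println("parsed=" + good.size() + ", skipped=" + skipped);
        return good;
    }

    public static void main(String[] args) {
        // Made-up sample lines: one well-formed, one malformed
        parseAll(Arrays.asList(
                "1,Sample shirt,Clothing,19.9,Hangzhou,A001",
                "not a product line"));
    }
}
```

One trade-off to be aware of: a product name that itself contains a comma produces more than six fields and is skipped here, whereas the original code would silently assign the wrong values to the later fields.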

Step 8 :

TestLucene.java

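This builds on TestLucene.java from the introduction, with three main changes:
1. More data is indexed: previously 10 records, now 140,000.
Note: because the data set is fairly large, adding it to the index takes a while, so please be patient.
2. Previously a Document had only the name field; now it has 6 fields.
3. The query keyword is read from the console, so a different keyword can be entered each time. Because building the index is slow, this lets you build the index once and run many queries; otherwise every new keyword would mean rebuilding the index, which makes testing inefficient.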


package com.how2java;

import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.Scanner;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class TestLucene {

	public static void main(String[] args) throws Exception {
		// 1. Prepare the Chinese analyzer
		IKAnalyzer analyzer = new IKAnalyzer();
		// 2. Build the index
		Directory index = createIndex(analyzer);

		// 3. Read query keywords from the console
		Scanner s = new Scanner(System.in);

		while (true) {
			System.out.print("Enter a search keyword: ");
			String keyword = s.nextLine();
			System.out.println("Current keyword: " + keyword);
			Query query = new QueryParser("name", analyzer).parse(keyword);

			// 4. Search
			IndexReader reader = DirectoryReader.open(index);
			IndexSearcher searcher = new IndexSearcher(reader);
			int numberPerPage = 10;
			ScoreDoc[] hits = searcher.search(query, numberPerPage).scoreDocs;

			// 5. Show the results
			showSearchResults(searcher, hits, query, analyzer);
			// 6. Close the reader
			reader.close();
		}

	}

	private static void showSearchResults(IndexSearcher searcher, ScoreDoc[] hits, Query query, IKAnalyzer analyzer) throws Exception {
		// Wrap matched terms in a red <span> so they stand out
		SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<span style='color:red'>", "</span>");
		Highlighter highlighter = new Highlighter(simpleHTMLFormatter, new QueryScorer(query));

		// hits holds at most numberPerPage results, not the total match count
		System.out.println("Found " + hits.length + " hits.");
		System.out.println("No.\tScore\tResult");
		for (int i = 0; i < hits.length; ++i) {
			ScoreDoc scoreDoc = hits[i];
			int docId = scoreDoc.doc;
			Document d = searcher.doc(docId);
			List<IndexableField> fields = d.getFields();
			System.out.print((i + 1));
			System.out.print("\t" + scoreDoc.score);
			for (IndexableField f : fields) {
				if ("name".equals(f.name())) {
					// Only the name field is searched, so only it is highlighted
					TokenStream tokenStream = analyzer.tokenStream(f.name(), new StringReader(d.get(f.name())));
					String fieldContent = highlighter.getBestFragment(tokenStream, d.get(f.name()));
					System.out.print("\t" + fieldContent);
				} else {
					System.out.print("\t" + d.get(f.name()));
				}
			}
			System.out.println("<br>");
		}
	}

	private static Directory createIndex(IKAnalyzer analyzer) throws IOException {
		Directory index = new RAMDirectory();
		IndexWriterConfig config = new IndexWriterConfig(analyzer);
		IndexWriter writer = new IndexWriter(index, config);
		String fileName = "140k_products.txt";
		List<Product> products = ProductUtil.file2list(fileName);
		int total = products.size();
		int count = 0;
		int per = 0;
		int oldPer =0;
		for (Product p : products) {
			addDoc(writer, p);
			count++;
			per = count*100/total;
			if (per != oldPer) {
				// Print only when the whole-number percentage changes
				oldPer = per;
				System.out.printf("Indexing: %d records in total, current progress: %d%% %n", total, per);
			}
			
		}
		writer.close();
		return index;
	}

	private static void addDoc(IndexWriter w, Product p) throws IOException {
		Document doc = new Document();
		doc.add(new TextField("id", String.valueOf(p.getId()), Field.Store.YES));
		doc.add(new TextField("name", p.getName(), Field.Store.YES));
		doc.add(new TextField("category", p.getCategory(), Field.Store.YES));
		doc.add(new TextField("price", String.valueOf(p.getPrice()), Field.Store.YES));
		doc.add(new TextField("place", p.getPlace(), Field.Store.YES));
		doc.add(new TextField("code", p.getCode(), Field.Store.YES));
		w.addDocument(doc);
	}
}
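The console loop in `main` never ends on its own, and `Scanner.nextLine()` throws `NoSuchElementException` once the input is exhausted; a blank keyword is also likely to be rejected by the query parser. The sketch below shows one possible way to tighten the loop while keeping the "build the index once, query many times" idea. `QueryLoopSketch`, `run`, and the `exit` command are assumptions made for illustration, and the `Consumer<String>` stands in for the Lucene search call so the example runs without any libraries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Consumer;

public class QueryLoopSketch {

    // Reads keywords until "exit" or end of input; blank lines are skipped.
    // Returns the keywords that were actually passed to the search callback.
    public static List<String> run(Scanner in, Consumer<String> search) {
        List<String> searched = new ArrayList<>();
        while (in.hasNextLine()) {
            String keyword = in.nextLine().trim();
            if (keyword.isEmpty()) {
                continue; // nothing to search for
            }
            if ("exit".equalsIgnoreCase(keyword)) {
                break; // hypothetical command to leave the loop
            }
            search.accept(keyword);
            searched.add(keyword);
        }
        return searched;
    }

    public static void main(String[] args) {
        // A canned input string stands in for System.in
        Scanner in = new Scanner("shirt\n\nexit\nignored\n");
        run(in, keyword -> System.out.println("searching for: " + keyword));
    }
}
```

In TestLucene the callback would wrap steps 3 to 6 of `main`, so the index built by `createIndex` is still created only once before the loop starts.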


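Q&A

2021-11-25 [SpringBoot + JPA] Importing the data into MySQL

萌森

Regarding: Tools and Middleware - Search Engine Technology - 140,000 records

It took 17 minutes 25 seconds, which feels slow. Be sure to turn off SQL statement logging in the configuration file.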

import com.ms.springboot.pojo.DataES;
import com.ms.springboot.repository.DataESRepository;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

@SpringBootTest
class Data140K {
    @Autowired
    DataESRepository dataESRepository;

    @Test
    void readTxtFileByFileUtils() {
        File file = new File("140k_products.txt");
        try {
            LineIterator lineIterator = FileUtils.lineIterator(file, "UTF-8");
            List<DataES> esAll = new ArrayList<>();
            // Number of records in the current batch
            Long count = 0L;
            while (lineIterator.hasNext()) {
                String line = lineIterator.nextLine();
                // Split the line into an array of fields
                String[] custArray = line.split(",");
                DataES es = new DataES(custArray);
                esAll.add(es);
                count++;
                if (count == 10000){
                    // A full batch of 10,000 records: save it and start a new one
                    System.out.println("Current batch read: 100%");
                    dataESRepository.saveAll(esAll);
                    esAll = new ArrayList<>();
                    count = 0L;

                    try {
                        System.out.println("Batch saved ******");
                        Thread.sleep(3000);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                        System.out.println("Sleep interrupted");
                    }
                }
                if (count == 5000){
                    System.out.println("Current batch read: 50%");
                }

            }
            // Save whatever is left in the last, partial batch
            if (esAll.size() > 0){
                dataESRepository.saveAll(esAll);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            System.out.println("Import finished");
        }
    }

}


2021-11-24 I'm curious

萌森	Regarding: Tools and Middleware - Search Engine Technology - 140,000 records. Site owner, is it a 0?

2018-05-22 I'd like to know how the site owner scraped the 140,000 records
