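I wrote this series analyzing the Solr data import source code mainly to answer one question: when importing database data programmatically, how do we avoid running out of memory while indexing a large data set?
If the table is small, the usual approach does not cause out-of-memory errors. Once the table reaches tens of millions of rows, it is worth looking at how Solr's built-in data import handles it.
When I first added documents to the Solr index programmatically, I ran into out-of-memory errors very easily, so I wanted to study how Solr's built-in data import handles indexing for large data sets.
Most of the approaches suggested online use something like data paging. I find that rather clumsy: paging syntax differs between databases (the SQL dialect), so each one needs separate handling, and the approach is also prone to performance problems.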
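To make the dialect problem concrete, here is a small sketch of the same "fetch one page" query written for three databases. The class `PagingSql`, its method names, and the `article` table are hypothetical and exist only for illustration; the SQL forms are the ones commonly used with each product, not something taken from Solr.

```java
// Hypothetical helper, for illustration only: the same page request in three SQL dialects.
final class PagingSql {

    private PagingSql() {
    }

    // MySQL style: LIMIT / OFFSET appended to the query.
    static String mysql(String baseSql, int offset, int pageSize) {
        return baseSql + " LIMIT " + pageSize + " OFFSET " + offset;
    }

    // Oracle before 12c: no LIMIT clause, so the query is wrapped twice and filtered on ROWNUM.
    static String oracleLegacy(String baseSql, int offset, int pageSize) {
        return "SELECT * FROM (SELECT t.*, ROWNUM rn FROM (" + baseSql + ") t"
                + " WHERE ROWNUM <= " + (offset + pageSize) + ") WHERE rn > " + offset;
    }

    // SQL Server 2012 and later: OFFSET ... FETCH, which requires an ORDER BY in the query.
    static String sqlServer2012(String orderedSql, int offset, int pageSize) {
        return orderedSql + " OFFSET " + offset + " ROWS FETCH NEXT " + pageSize + " ROWS ONLY";
    }

    public static void main(String[] args) {
        String base = "SELECT id, title FROM article ORDER BY id";
        System.out.println(mysql(base, 2000, 1000));
        System.out.println(oracleLegacy(base, 2000, 1000));
        System.out.println(sqlServer2012(base, 2000, 1000));
    }
}
```

Besides the three different code paths, each page is a separate query, and with a large offset the database typically still has to walk past all the skipped rows, which is one likely source of the performance problems mentioned above.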
My approach is as follows.
DatabaseResource.java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import javax.sql.DataSource;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DatabaseResource {

    private static final Logger logger = LoggerFactory.getLogger(DatabaseResource.class);

    private DataSource dataSource;

    /**
     * Set the JDBC DataSource to obtain connections from.
     */
    public void setDataSource(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /**
     * Return the DataSource used by this template.
     */
    public DataSource getDataSource() {
        return this.dataSource;
    }

    public DatabaseResource() {
    }

    public DatabaseResource(DataSource dataSource) {
        setDataSource(dataSource);
    }

    /**
     * Execute a SQL query and hand the open ResultSet to the callback.
     */
    public void executesql(String sql, JDBCCallback callback) {
        // JDBCContext is a connection holder that is not shown in this section
        Connection conn = JDBCContext.getJdbcContext(getDataSource()).getConnection();
        Statement stmt = null;
        ResultSet rst = null;
        try {
            stmt = conn.createStatement();
            // Hint to the driver to fetch rows in small chunks rather than all at once;
            // whether the hint is honored depends on the JDBC driver
            stmt.setFetchSize(50);
            rst = stmt.executeQuery(sql);
            callback.processRow(rst);
        } catch (Exception e) {
            logger.error("Failed to execute SQL: " + sql, e);
        } finally {
            // Close the ResultSet before the Statement that produced it
            closeResultSet(rst);
            closeStatement(stmt);
            JDBCContext.getJdbcContext(getDataSource()).releaseConnection();
        }
    }

    public void closeConnection(Connection con) {
        if (con != null) {
            try {
                con.close();
            } catch (SQLException ex) {
                logger.debug("Could not close JDBC Connection", ex);
            } catch (Throwable ex) {
                // We don't trust the JDBC driver: It might throw RuntimeException or Error.
                logger.debug("Unexpected exception on closing JDBC Connection", ex);
            }
        }
    }

    public void closeStatement(Statement stmt) {
        if (stmt != null) {
            try {
                stmt.close();
            } catch (SQLException ex) {
                logger.trace("Could not close JDBC Statement", ex);
            } catch (Throwable ex) {
                // We don't trust the JDBC driver: It might throw RuntimeException or Error.
                logger.trace("Unexpected exception on closing JDBC Statement", ex);
            }
        }
    }

    public void closeResultSet(ResultSet rs) {
        if (rs != null) {
            try {
                rs.close();
            } catch (SQLException ex) {
                logger.trace("Could not close JDBC ResultSet", ex);
            } catch (Throwable ex) {
                // We don't trust the JDBC driver: It might throw RuntimeException or Error.
                logger.trace("Unexpected exception on closing JDBC ResultSet", ex);
            }
        }
    }
}
Here JDBCCallback is the interface through which the data is read:
import java.sql.ResultSet;

public interface JDBCCallback {
    void processRow(ResultSet rst);
}
We can then fetch the data and add it to the Solr index like this:
private static int fetchSize = 1000;

BasicDataSource datasource = new BasicDataSource();
// Set the datasource parameters here
DatabaseResource resource = new DatabaseResource(datasource);
String sql = "";
resource.executesql(sql, new JDBCCallback() {
    public void processRow(final ResultSet rst) {
        try {
            Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            SolrInputDocument document = null;
            int innerCount = 0;
            while (rst.next()) {
                innerCount++;
                document = new SolrInputDocument();
                // Handle the different column types here; see the switch/case statement later in this article
                document.addField("fieldname", rst.getString("column name"));
                // Add the other fields
                docs.add(document);
                // Send a full batch to Solr and clear the buffer so memory use stays bounded
                if (innerCount == fetchSize) {
                    DigContext.getSolrserver().add(docs);
                    docs.clear();
                    innerCount = 0;
                }
            }
            // Send whatever is left in the last, partial batch
            if (innerCount != 0) {
                DigContext.getSolrserver().add(docs);
            }
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (SolrServerException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            // Only relevant once the field conversion code added above can throw it (for example date parsing)
            e.printStackTrace();
        }
    }
});

This is a stripped-down version. If you also need transformer-style data conversion and formatting, you can wrap it further.
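As a rough idea of what that further wrapping could look like, here is a minimal sketch of a transformer chain. Everything in it is an assumption made for illustration: the names `RowTransformer`, `TrimTransformer`, `RenameTransformer` and `TransformerChain` are hypothetical, rows are represented as plain maps rather than a `ResultSet`, and none of this is Solr's own transformer API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: one conversion step applied to a row before it is indexed.
interface RowTransformer {
    Map<String, Object> transform(Map<String, Object> row);
}

// Trims leading and trailing whitespace from every String value.
final class TrimTransformer implements RowTransformer {
    public Map<String, Object> transform(Map<String, Object> row) {
        Map<String, Object> out = new LinkedHashMap<String, Object>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            Object value = e.getValue();
            out.put(e.getKey(), value instanceof String ? ((String) value).trim() : value);
        }
        return out;
    }
}

// Renames one column, for example from a database column name to an index field name.
final class RenameTransformer implements RowTransformer {
    private final String from;
    private final String to;

    RenameTransformer(String from, String to) {
        this.from = from;
        this.to = to;
    }

    public Map<String, Object> transform(Map<String, Object> row) {
        Map<String, Object> out = new LinkedHashMap<String, Object>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            out.put(e.getKey().equals(from) ? to : e.getKey(), e.getValue());
        }
        return out;
    }
}

// Applies the configured steps in order, feeding each step's output into the next.
final class TransformerChain implements RowTransformer {
    private final List<RowTransformer> steps;

    TransformerChain(RowTransformer... steps) {
        this.steps = new ArrayList<RowTransformer>(Arrays.asList(steps));
    }

    public Map<String, Object> transform(Map<String, Object> row) {
        Map<String, Object> current = row;
        for (RowTransformer step : steps) {
            current = step.transform(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<String, Object>();
        row.put("TITLE", "  hello solr  ");
        row.put("ID", 42);
        RowTransformer chain = new TransformerChain(
                new TrimTransformer(), new RenameTransformer("TITLE", "title"));
        System.out.println(chain.transform(row));
    }
}
```

Inside `processRow`, one way to use this would be to copy the current row's columns into a map, run it through the chain, and then call `document.addField` for each entry of the result, so the batching loop itself stays unchanged.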