Archive

Posts Tagged ‘hbase’

When Hadoop and HBase are a good fit

May 23rd, 2008 No comments

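I have been studying Hadoop for a while now. After many days of testing and analysis, I still do not understand its internals very well; the source code is not easy to read.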

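Today a Hadoop expert from Yahoo came to talk about Yahoo's experience with it. Most of the talk was about MapReduce, with little on HDFS. The Hadoop architecture document also states that Hadoop is not suited to storing small files: it was split off from the Nutch project, and its purpose is to store large files. The expert did mention that small-file storage has been considered and is being implemented. In my own tests, reading small files is reasonably efficient, but my data set is small, and I do not know how it would behave with terabytes of data and 10 million files. The expert said there is a limit on the number of files and blocks (apparently a limit on the file map the namenode keeps in memory, that is, the size of the JVM heap). So the current advice is to pack large numbers of small files together before storing them, although that makes accessing a single small file a bit more cumbersome. All in all, HDFS does not currently seem suitable for storing large numbers of images. Still, you can test it against your own situation, and if the performance is acceptable it is worth considering; after all, it is much cheaper than dedicated storage hardware.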
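To make the "pack small files together" advice concrete, here is a minimal sketch of the idea. It is not an HDFS API: the class name `SmallFilePack` and its methods are made up for illustration, and an in-memory buffer stands in for the single large file that would live in HDFS. It also shows the extra index lookup that makes single-file access more cumbersome.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration: many small files packed into one blob plus an index. */
public class SmallFilePack {
    // file name -> {offset, length} inside the packed blob
    private final Map<String, int[]> index = new LinkedHashMap<String, int[]>();
    // stand-in for the one large file that would be stored in HDFS
    private final ByteArrayOutputStream blob = new ByteArrayOutputStream();

    /** Appends one small file to the pack and records where it landed. */
    public void add(String name, byte[] data) {
        index.put(name, new int[] { blob.size(), data.length });
        blob.write(data, 0, data.length);
    }

    /** Reads one small file back: an index lookup first, then a ranged read. */
    public byte[] read(String name) {
        int[] entry = index.get(name);
        if (entry == null) {
            return null;
        }
        byte[] all = blob.toByteArray(); // a real store would seek instead of copying
        return Arrays.copyOfRange(all, entry[0], entry[0] + entry[1]);
    }

    public int fileCount() {
        return index.size();
    }

    public static void main(String[] args) {
        SmallFilePack pack = new SmallFilePack();
        pack.add("a.jpg", "first".getBytes(StandardCharsets.UTF_8));
        pack.add("b.jpg", "second".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(pack.read("b.jpg"), StandardCharsets.UTF_8));
    }
}
```

With this layout the namenode would track one file instead of many, at the cost of maintaining the index somewhere.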

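HBase was originally used as a crawler's database, storing part of each web page's data. A web page is generally no larger than 100 KB, so at that level data access is reasonably fast. Above that limit, however, it becomes very slow. Its internals are probably optimized for small values, so it is not suitable for something like storing files, even small ones.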

As things stand, the system appears to be designed for MapReduce rather than as a file store.

Categories: technic Tags: , ,

Testing file operation performance of HBase and Hadoop

May 23rd, 2008 2 comments

1: Single-threaded file write to HBase

        String parentPath = "F:/pic/2003-zhujiajian";
        File[] files = getAllFilePath(parentPath);
 
        HBaseConfiguration config = new HBaseConfiguration();
        HTable table = new HTable(config, new Text("offer"));
        long start = System.currentTimeMillis();
        for (File file :files) {
            if(file.isFile()) {
                byte[] data = getData(file);
                createRecore(table,file.getName(),"image_big",data);
            }
        }
        long end = System.currentTimeMillis();
        System.out.println("time cost=" + (end-start));

Output:
108037206 bytes in 303 files written from a local Windows machine to remote HBase; cost 23328 or 21001 milliseconds
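The listing above calls a `getAllFilePath` helper that the post does not show. One plausible shape for it, purely as an assumption, is a thin wrapper around `File.listFiles()`; the class name `FileLister` is invented here.

```java
import java.io.File;

public class FileLister {
    /** Hypothetical stand-in for the getAllFilePath helper used in the listing above. */
    public static File[] getAllFilePath(String parentPath) {
        File[] files = new File(parentPath).listFiles();
        // listFiles() returns null when the path is missing or is not a directory
        return files == null ? new File[0] : files;
    }

    public static void main(String[] args) {
        String dir = args.length > 0 ? args[0] : ".";
        for (File f : getAllFilePath(dir)) {
            if (f.isFile()) {
                System.out.println(f.getName() + " " + f.length() + " bytes");
            }
        }
    }
}
```

Returning an empty array instead of null keeps the for-each loop in the test code safe when the directory does not exist.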
2: Single-threaded file write to Hadoop (HDFS)

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("F:/pic/2003-zhujiajian");
        Path dst = new Path("/user/zxf/image");
        long start = System.currentTimeMillis();
        fs.copyFromLocalFile(src, dst);
        long end = System.currentTimeMillis();
        System.out.println("time cost=" + (end-start));

Output:
108037206 bytes in 303 files written from a local Windows machine to remote HDFS; cost 26531 or 32407 milliseconds

3: Single-threaded file read from HBase

The time taken was unbelievably long:
108037206 bytes in 303 files read from HBase to local disk; cost 479350 milliseconds

4: Single-threaded file read from Hadoop (HDFS)
108037206 bytes in 303 files read from HDFS to local disk; cost 14188 milliseconds
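The four timings are easier to compare as throughput. The sketch below converts them to MiB per second, using the first timing of each write test; the class and method names are illustrative only. It works out to roughly 4.4 MiB/s for the HBase write, 3.9 MiB/s for the HDFS write, 0.2 MiB/s for the HBase read, and 7.3 MiB/s for the HDFS read.

```java
public class Throughput {
    static final long TOTAL_BYTES = 108037206L; // size of the 303-file test set

    /** Converts a byte count and an elapsed time in milliseconds to MiB per second. */
    public static double mibPerSecond(long bytes, long millis) {
        return (bytes / (1024.0 * 1024.0)) / (millis / 1000.0);
    }

    public static void main(String[] args) {
        String[] labels = { "hbase write", "hdfs write", "hbase read", "hdfs read" };
        long[] millis = { 23328L, 26531L, 479350L, 14188L }; // timings reported above
        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-12s %8.2f MiB/s%n", labels[i], mibPerSecond(TOTAL_BYTES, millis[i]));
        }
    }
}
```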

5: Digging deeper
Comparing a few individual files:

 fileSize(byte)  hdfs time(ms) hbase time(ms)
 12341140        1313          14688
 708474          63            4359
 82535           15            3907
 55296           16            125

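6 Thoughts
During testing a "region offline" error occurred, and it persisted after restarting the services. I then reformatted the namenode and deleted the data on the datanodes; after restarting, one datanode still had not come up, and when I logged in over ssh I found that its Java process had died.
That wasted more than an hour. Thinking it over: an HTable is split into sub-tables spread across the HRegionServers. When one datanode goes down, requests for its data cannot connect, hence the region offline error.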

Why is HBase read performance so poor? Reading a single 11 MB file takes about 14000 milliseconds, while HDFS reads the whole directory in only 14188 milliseconds.
The article at http://blog.rapleaf.com/dev/?p=26 says:
Finally, another thing you shouldn’t do with HBase (or an RDBMS, for that matter), is store large amounts of binary data. When I say large amounts, I mean tens to hundreds of megabytes. Certainly both RDBMSs and HBase have the capabilities to store large amounts of binary data. However, again, we have an impedance mismatch. RDBMSs are built to be fast metadata stores; HBase is designed to have lots of rows and cells, but functions best when the rows are (relatively) small. HBase splits the virtual table space into regions that can be spread out across many servers. The default size of individual files in a region is 256MB. The closer to the region limit you make each row, the more overhead you are paying to host those rows. If you have to store a lot of big files, then you’re best off storing in the local filesystem, or if you have LOTS of data, HDFS. You can still keep the metadata in an RDBMS or HBase – but do us all a favor and just keep the path in the metadata.
So HBase is not suited to storing binary files; for an application like image storage, HDFS is the better fit.
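The quoted advice, keep the bytes in a filesystem and only the path in the metadata, can be sketched as follows. Everything here is a stand-in chosen for illustration: a `HashMap` plays the role of HBase or an RDBMS, a local temp directory plays the role of HDFS, and the class name `PathInMetadata` is invented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class PathInMetadata {
    // Stand-in for the metadata store (HBase or an RDBMS): key -> file path only
    private final Map<String, String> metadata = new HashMap<String, String>();
    // Stand-in for the file store (local filesystem or HDFS)
    private final Path storageDir;

    public PathInMetadata(Path storageDir) {
        this.storageDir = storageDir;
    }

    /** Convenience factory that stores files in a fresh temporary directory. */
    public static PathInMetadata inTempDir() {
        try {
            return new PathInMetadata(Files.createTempDirectory("images"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the bytes to the file store and records only the path as metadata. */
    public void put(String key, byte[] image) {
        try {
            Path target = storageDir.resolve(key + ".bin");
            Files.write(target, image);
            metadata.put(key, target.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks up the path in the metadata, then reads the bytes from the file store. */
    public byte[] get(String key) {
        String path = metadata.get(key);
        if (path == null) {
            return null;
        }
        try {
            return Files.readAllBytes(Paths.get(path));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PathInMetadata store = inTempDir();
        store.put("chua", new byte[] { 1, 2, 3 });
        System.out.println(store.get("chua").length + " bytes read back");
    }
}
```

The metadata rows stay small regardless of image size, which is the property the quoted article argues for.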

alter table offer change image_big IN_MEMORY;
a: I reran the tests several times, including restarting HBase and HDFS; HBase read speed was not much different from before.

b: After deleting the existing data and writing it again, the read test showed that small-file reads had become much faster:

  fileSize(byte)  1(ms)   2(ms)  3(ms)
  12341140        11750   11109  11718
  708474          625     610    672
  82535           78      78     78
  55296           47      62     47

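So the read cache gives a large performance gain. When the amount of data is not very large, the bottleneck is read speed; reads of data under 100 KB are reasonably efficient, and the time taken is roughly proportional to the length of the data read.
But why did the speed not change earlier? Can data read from disk not be cached?
I then tried reading a few files from the first batch I had stored, and they were still slow; after repeating the steps in b, they became fast.
The likely reason is that the system sets the cache flag when a row's column data is created, so the cache is tied to the column data itself, and on startup HBase loads the column data carrying the cache flag into memory.
So none of the data created before I ran alter table offer change image_big IN_MEMORY has the cache flag. Unlike other caches, which load nothing at startup and cache on first access, this one loads at startup. The more data is cached, the slower startup inevitably becomes, which I noticed here too. It is of course good for the user experience, since the first access is not especially slow.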
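The claim that read time is roughly proportional to data length can be checked directly from the table in b. The sketch below divides each file size by its first-run time; all four rows come out between about 1,050 and 1,180 bytes per millisecond, which supports the claim. The class name `ReadRate` is illustrative.

```java
public class ReadRate {
    /** Bytes read per millisecond for one measurement. */
    public static double bytesPerMs(long sizeBytes, long elapsedMs) {
        return (double) sizeBytes / elapsedMs;
    }

    public static void main(String[] args) {
        long[] sizes = { 12341140L, 708474L, 82535L, 55296L };
        long[] firstRunMs = { 11750L, 625L, 78L, 47L }; // column 1(ms) of the table in b
        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%10d bytes -> %7.1f bytes/ms%n",
                    sizes[i], bytesPerMs(sizes[i], firstRunMs[i]));
        }
    }
}
```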

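c: So why is reading from HBase slower than from HDFS, and so much slower for large files? Too many network round trips?
The debug log suggests this is indeed the case: the larger the file, the more responses the region server sends to the client. The details need a closer look at the source code.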


Setting up HBase

May 6th, 2008 No comments

URL:http://hadoop.apache.org/hbase/docs/r0.1.1/api/overview-summary.html

Set up on top of an existing HDFS installation
1: Edit hadoop/contrib/hbase/conf/hbase-env.sh
and add the JAVA_HOME path

2: Edit hadoop/contrib/hbase/conf/hbase-site.xml and add the following

  <property>
    <name>hbase.master</name>
    <value>10.0.4.121:11100</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://10.0.4.121:10100/hbase</value>
    <description>The directory shared by region servers.</description>
  </property>
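Before starting HBase it can help to confirm that the edited file is well-formed and carries the intended values. The following is a small sketch using only the JDK's DOM parser, not an HBase API. It assumes the properties sit inside a `<configuration>` root element, as Hadoop-style site files do; the class name `SiteXmlCheck` is invented.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SiteXmlCheck {
    /** Parses Hadoop-style <property> entries into a name -> value map. */
    public static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<String, String>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            // malformed XML or a property without name/value ends up here
            throw new IllegalArgumentException("not a well-formed site file", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>hbase.master</name><value>10.0.4.121:11100</value></property>"
                + "<property><name>hbase.rootdir</name><value>hdfs://10.0.4.121:10100/hbase</value></property>"
                + "</configuration>";
        System.out.println(parse(xml));
    }
}
```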

3: Start HBase

hadoop/contrib/hbase/bin/start-hbase.sh

4: See http://wiki.apache.org/hadoop/Hbase/HbaseShell for shell operations

4.1 First, enter the shell

 hadoop/contrib/hbase/bin/hbase shell

4.2 Create a table

 CREATE TABLE offer(image_big,image_small);

4.3 Insert, query, and delete data
For example:

  INSERT INTO offer(image_big:,image_small:) VALUES ('abcdefg','abc') WHERE row = 'testinsert';
  INSERT INTO offer(image_big:,image_small:) VALUES ('hijklmn','hij') WHERE row = 'testinsert';
  INSERT INTO offer(image_big:content,image_big:path,image_small:content,image_small:path) VALUES ('abcdefg','path_big','abc','path_small') WHERE row = 'testinsert';
  INSERT INTO offer(image_big:content,image_big:path,image_small:content,image_small:path) VALUES ('hijklmn','path_big','hij','path_small') WHERE row = 'testinsert';
 
  SELECT * FROM offer WHERE row = 'testinsert';

The result returned is:

 +------------------------+-------------------------+
 | Column                 | Cell                    |
 +------------------------+-------------------------+
 | image_big:             | hijklmn                 |
 +------------------------+-------------------------+
 | image_big:content      | hijklmn                 |
 +------------------------+-------------------------+
 | image_big:path         | path_big                |
 +------------------------+-------------------------+
 | image_small:           | hij                     |
 +------------------------+-------------------------+
 | image_small:content    | hij                     |
 +------------------------+-------------------------+
 | image_small:path       | path_small              |
 +------------------------+-------------------------+
 SELECT count(*) FROM offer WHERE row = 'testinsert';

Returns:

 1 row(s) in set. (0.02 sec)

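As shown above, although we ran 4 inserts, the count is 1: HBase overwrote the data in the same cells. Insert 2 overwrote insert 1 and insert 4 overwrote insert 3, which amounts to an update. The shell documentation also shows that HQL provides no UPDATE.
The data at this point should look like this: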

 +----------+--------------------------+---------------------------+
 |          |  Column   image_big      |      Column image_small   |
 |   key    +--------------------------+---------------------------+
 |          |   :   |:content | :path  |  :  |:content|  :path     |
 +----------+--------------------------+---------------------------+
 |testinsert|hijklmn|hijklmn  |path_big| hij |  hij   |  path_small|
 +----------+--------------------------+---------------------------+
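The overwrite behaviour follows naturally if the table is pictured as a sorted map from row key to a sorted map of column name to value. The sketch below is only a toy model of that picture built on `TreeMap`, not HBase code; the class name `RowModel` is made up. Replaying the four inserts leaves one row with six columns.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class RowModel {
    // row key -> (column name -> latest value)
    private final SortedMap<String, SortedMap<String, String>> rows =
            new TreeMap<String, SortedMap<String, String>>();

    /** Writes columns into a row; an existing column with the same name is overwritten. */
    public void insert(String row, String[] columns, String[] values) {
        SortedMap<String, String> cells = rows.get(row);
        if (cells == null) {
            cells = new TreeMap<String, String>();
            rows.put(row, cells);
        }
        for (int i = 0; i < columns.length; i++) {
            cells.put(columns[i], values[i]);
        }
    }

    public int rowCount() {
        return rows.size();
    }

    public SortedMap<String, String> select(String row) {
        return rows.get(row);
    }

    public static void main(String[] args) {
        RowModel offer = new RowModel();
        String[] plain = { "image_big:", "image_small:" };
        String[] detailed = { "image_big:content", "image_big:path",
                "image_small:content", "image_small:path" };
        offer.insert("testinsert", plain, new String[] { "abcdefg", "abc" });
        offer.insert("testinsert", plain, new String[] { "hijklmn", "hij" });
        offer.insert("testinsert", detailed, new String[] { "abcdefg", "path_big", "abc", "path_small" });
        offer.insert("testinsert", detailed, new String[] { "hijklmn", "path_big", "hij", "path_small" });
        // prints "1 row, 6 columns"
        System.out.println(offer.rowCount() + " row, " + offer.select("testinsert").size() + " columns");
    }
}
```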

What happens if the inserts include a TIMESTAMP?

  DELETE * FROM offer WHERE row = 'testinsert';
 
  INSERT INTO offer(image_big:,image_small:) VALUES ('abcdefg','abc') WHERE row = 'testinsert' timestamp '1209982310285';
  INSERT INTO offer(image_big:,image_small:) VALUES ('hijklmn','hij') WHERE row = 'testinsert' timestamp '1209982311285';
  INSERT INTO offer(image_big:content,image_big:path,image_small:content,image_small:path) VALUES ('abcdefg','path_big','abc','path_small') WHERE row = 'testinsert' timestamp '1209982312285';
  INSERT INTO offer(image_big:content,image_big:path,image_small:content,image_small:path) VALUES ('hijklmn','path_big','hij','path_small') WHERE row = 'testinsert' timestamp '1209982313285';
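The timestamp literals in these inserts read naturally as milliseconds since the Unix epoch, spaced one second apart. That interpretation is an assumption on my part, but it fits the post date: decoded that way, the first one is 2008-05-05T10:11:50.285Z. A quick check with `java.time` (the class name `TimestampCheck` is illustrative):

```java
import java.time.Instant;

public class TimestampCheck {
    /** Interprets a timestamp literal as milliseconds since the Unix epoch (an assumption). */
    public static Instant toInstant(String millisLiteral) {
        return Instant.ofEpochMilli(Long.parseLong(millisLiteral));
    }

    public static void main(String[] args) {
        String[] stamps = { "1209982310285", "1209982311285", "1209982312285", "1209982313285" };
        for (String s : stamps) {
            System.out.println(s + " -> " + toInstant(s));
        }
    }
}
```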

Whether I run

  SELECT * FROM offer WHERE row = 'testinsert'

or

  SELECT * FROM offer WHERE row = 'testinsert' timestamp '1209982310285';

only the following is returned:

  +-------------------------+----------------------+
  | Column                  | Cell                 |
  +-------------------------+----------------------+
  | image_big:              | hijklmn              |
  +-------------------------+----------------------+
  | image_big:content       | hijklmn              |
  +-------------------------+----------------------+
  | image_big:path          | path_big             |
  +-------------------------+----------------------+
  | image_small:            | hij                  |
  +-------------------------+----------------------+
  | image_small:content     | hij                  |
  +-------------------------+----------------------+
  | image_small:path        | path_small           |
  +-------------------------+----------------------+

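This confused me. The HBase architecture document says there are timestamps and that data is versioned by time, so how should I interpret this?
The page at http://www.mail-archive.com/core-user@hadoop.apache.org/msg00222.html suggests this is not supported yet, but my inserts succeeded. My own understanding is that row and timestamp are both index-level in the data model and sit outside the data itself, so not displaying them is fine. But the data seems to have been overwritten. Is it really not supported yet?
First delete: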

  DELETE * FROM offer WHERE row = 'testinsert';

then select:

  SELECT * FROM offer WHERE row = 'testinsert';
  +-------------------------+----------------------+
  | Column                  | Cell                 |
  +-------------------------+----------------------+
  | image_big:              | abcdefg              |
  +-------------------------+----------------------+
  | image_big:content       | abcdefg              |
  +-------------------------+----------------------+
  | image_big:path          | path_big             |
  +-------------------------+----------------------+
  | image_small:            | abc                  |
  +-------------------------+----------------------+
  | image_small:content     | abc                  |
  +-------------------------+----------------------+
  | image_small:path        | path_small           |
  +-------------------------+----------------------+

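This accidental discovery shows that the data is versioned; the history just was not found by the query. The timestamp condition in the select seems to have no effect, and every query returns the newest data. The architecture document says that if an insert has no timestamp, the system adds the current time by default.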
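One way to reason about these observations is to model a single cell as a map from timestamp to value. The sketch below is a simplified model consistent with the delete-then-select result above. It is not how HBase implements versions, and the class `VersionedCell` and its methods are invented for illustration. Note that `asOf` shows what a timestamp-aware select would be expected to return, which is not what the shell did in this test.

```java
import java.util.Map;
import java.util.TreeMap;

public class VersionedCell {
    // timestamp -> value; TreeMap keeps the versions sorted by timestamp
    private final TreeMap<Long, String> versions = new TreeMap<Long, String>();

    public void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    /** Returns the newest version, or null if the cell is empty. */
    public String latest() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    /** Returns the newest version written at or before the given timestamp. */
    public String asOf(long timestamp) {
        Map.Entry<Long, String> entry = versions.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    /** Removes only the newest version, which exposes the previous one. */
    public void deleteLatest() {
        if (!versions.isEmpty()) {
            versions.remove(versions.lastKey());
        }
    }

    public static void main(String[] args) {
        VersionedCell content = new VersionedCell(); // models image_big:content
        content.put(1209982312285L, "abcdefg");
        content.put(1209982313285L, "hijklmn");
        System.out.println(content.latest());              // hijklmn
        System.out.println(content.asOf(1209982312285L));  // abcdefg
        content.deleteLatest();
        System.out.println(content.latest());              // abcdefg
    }
}
```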

5 Accessing HBase from a client
As with the earlier HDFS access, bring in hbase-site.xml and the lib jars. The code is as follows:

  package com.chua.hadoop.client;
 
  import java.io.BufferedInputStream;
  import java.io.BufferedOutputStream;
  import java.io.DataInputStream;
  import java.io.File;
  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.util.Iterator;
  import java.util.SortedMap;
 
  import org.apache.commons.httpclient.HttpClient;
  import org.apache.commons.httpclient.methods.GetMethod;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HTable;
  import org.apache.hadoop.io.Text;
 
  /**
   * Description of class HBase.java: TODO describe the class
   * @author chua 2008-5-4 05:03:33 PM
   */
  public class HBase {
 
      /**
       * @param args
       */
      public static void main(String[] args) throws Exception {
          String domain = "www.dlog.cn";
          String path_s = "/uploads/m/me/meichua/meichua_100.jpg";
          String path_b = "/uploads/m/me/meichua/200804/22094433_tLuyw.jpg";
          byte[] data_s = getData(domain, path_s);
          byte[] data_b = getData(domain,path_b);
 
          HBaseConfiguration config = new HBaseConfiguration();
          HTable table = new HTable(config, new Text("offer"));
          createRecore(table,"chua","image_big",data_b,path_b);
          createRecore(table,"chua","image_small",data_s,path_s);
 
          // Get all the data of one row and iterate over the keySet
          SortedMap map = table.getRow(new Text("chua"));
          if(!map.isEmpty()) {
              Iterator it = map.keySet().iterator();
              while(it.hasNext()){
                  System.out.println(it.next());
              }
          }
          // Get the data for one column name of a row
          byte[] data = table.get(new Text("chua"), new Text("image_big:content"));
          saveAsFile(data,"c:/chua_big.jpg");
      }
 
      public static void createRecore(HTable table,String row, String colunm,byte[] data, String path) throws IOException {
          long lockId = table.startUpdate(new Text(row));
          table.put(lockId, new Text(colunm+":content"), data);
          table.put(lockId, new Text(colunm+":path"), path.getBytes());
          table.commit(lockId);
      }
 
      /**
       * Read an image from the web
       * @param domain
       * @param path
       * @return
       */
      public static byte[] getData(String domain,String path){
          byte[] dataResource = null;
          try {
              HttpClient client = new HttpClient();
              client.getHostConfiguration().setHost(domain,80,"http");
              GetMethod getMethod = new GetMethod(path);
              int status = client.executeMethod(getMethod);
              if(status == 200) {
                  dataResource = getMethod.getResponseBody();
              }
              getMethod.releaseConnection();
          } catch(Exception e) {  
              System.out.println("Download error"+e);
          }
          return dataResource;
      }
 
      /**
       * Read from a local file
       * @param path
       * @return
       */
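      public static byte[] getData(String path) {
          File file = new File(path);
          DataInputStream dis = null;
          try {
              dis = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
              // available() is only an estimate, so size the buffer from the file length
              byte[] data = new byte[(int) file.length()];
              // readFully() keeps reading until the buffer is full; read() may return early
              dis.readFully(data);
              return data;
          } catch (Exception e) {
              e.printStackTrace();
              return null;
          } finally {
              // close the stream so the file handle is not leaked
              if (dis != null) {
                  try {
                      dis.close();
                  } catch (IOException ignored) {
                      // nothing useful to do if close fails
                  }
              }
          }
      }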
 
      /**
       * Save to a file
       * @param data
       * @param path
       */
      public static void saveAsFile(byte[] data,String path) {
          if(data != null) {
              try {
                  BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(path));
                  for(byte tmp : data) {
                      out.write(tmp);
                  }
                  out.close();
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
      }
  }

Output:
image_big:content
image_big:path
image_small:content
image_small:path
The above is a fairly simple example of a client accessing HBase.

6 HBase architecture overview

http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
