Thread: Created: (HADOOP-2161) getRow() is orders of magnitudes slower than get(), even on rows with




Created: (HADOOP-2161) getRow() is orders of magnitudes slower than get(), even on rows with
2007-11-06 11:34:51
getRow() is orders of magnitudes slower than get(), even on rows with one column
--------------------------------------------------------------------------------

                 Key: HADOOP-2161
                 URL: https://issues.apache.org/jira/browse/HADOOP-2161
             Project: Hadoop
          Issue Type: Bug
          Components: contrib/hbase
    Affects Versions: 0.16.0
         Environment: latest from trunk
            Reporter: Clint Morgan


HTable.getRow(Text) is several orders of magnitude slower than
HTable.get(Text, Text), even on rows with a single column.

This problem can be observed with the attached patch to
PerformanceEvaluation.java, which changes SequentialRead to use
getRow and prints out the time for each read.

The test can then be run with:

bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialRead 1

On my laptop, the original test (using get()) produces reads on
the order of 5-20 milliseconds. Using getRow(), the reads take
50-2000 ms.
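
For readers without the patch at hand, the per-read timing can be
sketched as below. This is only an illustration, not the contents of
the attached patch: ReadTimer, Timed and time() are hypothetical
names, and the lambda stands in for a call such as
HTable.get(Text, Text) or HTable.getRow(Text).

```java
import java.util.function.Supplier;

// Hypothetical helper: times a single read and reports it in milliseconds.
final class ReadTimer {

    /** The value a read returned, plus how long the read took. */
    static final class Timed<T> {
        final T value;
        final long elapsedNanos;

        Timed(T value, long elapsedNanos) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
        }

        double elapsedMillis() {
            return elapsedNanos / 1_000_000.0;
        }
    }

    /** Runs the read once and measures elapsed time with System.nanoTime(). */
    static <T> Timed<T> time(Supplier<T> read) {
        long start = System.nanoTime();
        T value = read.get();
        return new Timed<>(value, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in for a real table read; replace with the call under test.
        Timed<String> t = time(() -> "value-for-row-0");
        System.out.printf("read took %.3f ms%n", t.elapsedMillis());
    }
}
```

Timing each read separately, rather than the whole run, is what makes
the wide 50-2000 ms spread visible.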
 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Updated: (HADOOP-2161) getRow() is orders of magnitudes slower than get(), even on rows with
2007-11-06 11:36:55
     [ https://issues.apache.org/jira/browse/HADOOP-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clint Morgan updated HADOOP-2161:
---------------------------------

    Attachment: PerformanceEvaluation-patch.txt

Modifies SequentialReadTest to use getRow, and print the
read time to standard out.



Commented: (HADOOP-2161) getRow() is orders of magnitudes slower than get(), even on rows wit
2007-11-06 18:05:50
    [ https://issues.apache.org/jira/browse/HADOOP-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540631 ]

stack commented on HADOOP-2161:
-------------------------------

Thanks Clint.  Good find.  HRegion.get(...) quits after it's
accumulated sufficient versions -- usually one.
HRegion.getFull, which is eventually called by HTable.getFull, is
pig-headed, insisting on running through memcache and all store
files for every possible version regardless of version setting.



Commented: (HADOOP-2161) getRow() is orders of magnitudes slower than get(), even on rows wit
2007-11-06 21:45:51
    [ https://issues.apache.org/jira/browse/HADOOP-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540661 ]

stack commented on HADOOP-2161:
-------------------------------

Actually, I misspoke.  getFull scanning memory and all
on-disk files is not 'wrong' -- though it is slow.  Here's
why. 

Columns can be added willy-nilly.  There is no need of an
ALTER TABLE-like statement adding a column as there is in a
traditional RDBMS -- as long as the column belongs to an
existing column family (has an extant column family for a
prefix). 

And there is no accounting anywhere in hbase of all the columns
made in any particular family.  Since there is no list of all
columns to consult, the only way hbase can be sure it's found all
column mentions is if it scans all data.  This is the main
difference between get and getFull.  Because you provide a list
of columns to fetch to get, it can know when it's done.  Not so
with getFull.
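
To make the difference concrete, here is a toy model of a single row.
It is a sketch under assumptions, not hbase code: ToyRow and its
storesVisited counter are made up, and each "store" is just a map from
column name to value, ordered newest first (index 0 standing in for
memcache, the rest for on-disk store files).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of one row spread over several stores, newest store first.
final class ToyRow {
    private final List<Map<String, String>> stores;
    int storesVisited; // reset on every call, for illustration only

    ToyRow(List<Map<String, String>> stores) {
        this.stores = stores;
    }

    /** Column is named up front: stop at the newest store that has it. */
    String get(String column) {
        storesVisited = 0;
        for (Map<String, String> store : stores) {
            storesVisited++;
            String value = store.get(column);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    /** No column list: any store may hold an unseen column, so visit all. */
    Map<String, String> getFull() {
        storesVisited = 0;
        Map<String, String> result = new TreeMap<>();
        for (Map<String, String> store : stores) {
            storesVisited++;
            for (Map.Entry<String, String> e : store.entrySet()) {
                result.putIfAbsent(e.getKey(), e.getValue()); // newest wins
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ToyRow row = new ToyRow(Arrays.asList(
            Collections.singletonMap("info:a", "new"),
            Collections.singletonMap("info:a", "old"),
            Collections.<String, String>emptyMap()));
        // prints "new after 1 store(s)"
        System.out.println(row.get("info:a") + " after " + row.storesVisited + " store(s)");
        // prints "{info:a=new} after 3 store(s)"
        System.out.println(row.getFull() + " after " + row.storesVisited + " store(s)");
    }
}
```

Even with one column in the row, the full read touches every store,
which matches the behaviour reported in this issue.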

Is it important to you that this run faster, Clint?  If so, there
may be some things we can do, like keep an integer count of
unique column names.  getFull would know that when it had hit the
count of all column names, it could return.  (Keeping a list of
all column names would probably not be viable, since in some
schemas it might grow without bound.)
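
A rough sketch of the counting idea, again on a toy single-row model
rather than real hbase code. CountedRow, distinctColumns and
storesVisited are made-up names. One loud assumption: the toy derives
the count once in its constructor, whereas the suggestion above is to
maintain it as columns are written, and how to do that cheaply is not
worked out here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy row whose full read can stop early once every known column is found.
final class CountedRow {
    private final List<Map<String, String>> stores; // newest store first
    private final int distinctColumns;              // the "integer count"
    int storesVisited;                              // for illustration only

    CountedRow(List<Map<String, String>> stores) {
        this.stores = stores;
        // Assumption: computed up front here; only the count is kept,
        // the temporary set of names is thrown away.
        Set<String> names = new HashSet<>();
        for (Map<String, String> store : stores) {
            names.addAll(store.keySet());
        }
        this.distinctColumns = names.size();
    }

    /** Full read that returns as soon as it has seen distinctColumns columns. */
    Map<String, String> getFull() {
        storesVisited = 0;
        Map<String, String> result = new TreeMap<>();
        for (Map<String, String> store : stores) {
            if (result.size() == distinctColumns) {
                break; // every column accounted for; older stores add nothing
            }
            storesVisited++;
            for (Map.Entry<String, String> e : store.entrySet()) {
                result.putIfAbsent(e.getKey(), e.getValue()); // newest wins
            }
        }
        return result;
    }

    public static void main(String[] args) {
        CountedRow row = new CountedRow(Arrays.asList(
            Collections.singletonMap("info:a", "new"),
            Collections.singletonMap("info:b", "x"),
            Collections.singletonMap("info:a", "old")));
        // prints "{info:a=new, info:b=x} after 2 store(s)"
        System.out.println(row.getFull() + " after " + row.storesVisited + " store(s)");
    }
}
```

The early exit only helps when the newer stores already cover every
column; a column that lives solely in the oldest store still forces a
full scan.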


