Jean-Daniel Cryans | 1 Jun 2011 20:43
Picon
Favicon

Re: A sudden msg of "java.io.IOException: Server not running, aborting"

Oh I'm so dumb, I didn't see that this is coming from _another_ region
server. What happens when splitting is that we try to update the
.META. table which requires talking to another region server, except
that in this case the RS was aborting meaning that the split
transaction had to be rolled back.

Looking at the logs of the RS that carried .META. at that time would
give you more clues into what really happened.

J-D

On Tue, May 31, 2011 at 5:24 PM, bijieshan <bijieshan@...> wrote:
> I paste the logs as following. To short of the logs, some similar and closer logs I deleted:
>
>> 2011-05-25 09:15:44,232 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread:
Running rollback of failed split of
ufdr,0065286138106876#4228000,1306260358978.37875b35a870957da534ad29fd2944d5.;
java.io.IOException: Server not running, aborting
>>        at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2352)
>>        at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1653)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>>
>>>>[description]
>> Before that time, everything seemed ok(it received so many open-region requests, and handled them),
except several Exception happened during the time.
>>
>>>>[logs]
>> 2011-05-25 09:15:36,214 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received
request to open region: ufdr,110018,1305617626846.f22eafaf06c3be7f7e5a447cec40c7fa.
>> 2011-05-25 09:15:36,214 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received
request to open region: ufdr,111336,1305617636659.b47f1cf09b10aecaee6805a985646b30.
>> 2011-05-25 09:15:36,215 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received
request to open region: ufdr,280884,1305617952747.7cac8ccc9c9770ef0e48807e07bc8bec.
>> 2011-05-25 09:15:36,215 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received
request to open region: ufdr,280446,1305617949439.b332cfff565d77b2e2d4f1a2c52494b0.
>> 2011-05-25 09:15:36,216 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received
request to open region: ufdr,101974,1305617623149.b60c71565f2c73b81477fe25ba80378b.
>> 2011-05-25 09:15:36,413 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:20020-0x32ffbc7e6e00321 Attempting to transition node
11dc72d94c7a5a3d19b0c0c3c49624a5 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:36,423 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:20020-0x32ffbc7e6e00321 Successfully transitioned node
11dc72d94c7a5a3d19b0c0c3c49624a5 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:37,645 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:20020-0x32ffbc7e6e00321 Attempting to transition node
c4c657deda864fa2018b38388ee928e6 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:37,653 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:20020-0x32ffbc7e6e00321 Attempting to transition node
d2546cde1d6546133af74f25252b652c from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:37,660 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:20020-0x32ffbc7e6e00321 Successfully transitioned node
c4c657deda864fa2018b38388ee928e6 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:37,662 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:20020-0x32ffbc7e6e00321 Attempting to transition node
bd07db88de68c1022c5bba9e5471c9f9 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:37,668 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:20020-0x32ffbc7e6e00321 Successfully transitioned node
d2546cde1d6546133af74f25252b652c from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>>
>>>>[description]
>> From the code, I saw that , only the Server has aborted, the msg of "Server not running, aborting" could be
shown. It determined by the param of HRegionServer#abortRequested, if it was true, and if somewhere
called the method of HRegionServer#checkOpen, this msg would been printed.
>>
>>  /**
>>   * Called to verify that this server is up and running.
>>   *
>>   *  <at> throws IOException
>>   */
>>  protected void checkOpen() throws IOException {
>>    if (this.stopped || this.abortRequested) {
>>      throw new IOException("Server not running"
>>          + (this.abortRequested ? ", aborting" : ""));
>>    }
>>    if (!fsOk) {
>>      throw new IOException("File system not available");
>>    }
>>  }
>>
>> For I said there's no aborted info been found in Logs:
>>
>> public void abort(String reason, Throwable cause) {
>>    if (cause != null) {
>>      LOG.fatal("ABORTING region server " + this + ": " + reason, cause);
>>    } else {
>>      LOG.fatal("ABORTING region server " + this + ": " + reason);
>>    }
>>    this.abortRequested = true;
>>    this.reservedSpace.clear();
>>    if (this.metrics != null) {
>>      LOG.info("Dump of metrics: " + this.metrics);
>>    }
>>    stop(reason);
>>  }
>>
>> For there's no prev-aborting found in the logs, how did the sudden msg of " java.io.IOException: Server
not running, aborting " occurred?
>> Thanks!
>>
>> Jieshan Bean.
>>
>>
>>
>>
>>
>


Gmane