Thread: Created: (HADOOP-2030) Some changes to Record I/O interfaces
|
|
| Created: (HADOOP-2030) Some changes to
Record I/O interfaces |

|
2007-10-11 04:32:50 |
SOME CHANGES TO RECORD I/O INTERFACES
-------------------------------------
KEY: HADOOP-2030
URL: HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/HADOOP-2030
PROJECT: HADOOP
ISSUE TYPE: IMPROVEMENT
REPORTER: VIVEK RATAN
I WANTED TO SUGGEST SOME CHANGES TO THE RECORD I/O
INTERFACES.
UNDER ORG.APACHE.HADOOP.RECORD, _RECORDINPUT_ AND
_RECORDOUTPUT_ ARE THE INTERFACES TO SERIALIZE AND
DESERIALIZE BASIC TYPES FOR JAVA-GENERATED STUBS. ALL THE
METHODS IN _RECORDINPUT_ AND _RECORDOUTPUT_ TAKE A
PARAMETER, A STRING, CALLED 'TAG'. AS FAR AS I CAN SEE, THIS
TAG IS USED ONLY FOR XML-BASED SERIALIZATION, TO WRITE OUT
THE NAME OF THE FIELD THAT IS BEING SERIALIZED. A LOT OF THE
METHODS IGNORE IT. MY PROPOSAL IS TO ELIMINATE THIS
PARAMETER, FOR A NUMBER OF REASONS:
- WE DON'T NEED TO WRITE THE NAME OF A FIELD WHEN
SERIALIZING IN XML. NONE OF THE OTHER SERIALIZERS (FOR
BINARY OR CSV) WRITE OUT THE NAME OF A FIELD - WE ONLY WRITE
THE FIELD VALUE. THE GENERATED STUBS KNOW WHICH FIELD IS
ASSOCIATED WITH WHICH VALUE (AND NOW, WITH TYPE INFORMATION
SUPPORT, THE FIELD NAME IS PART OF THE TYPE INFORMATION AND
IS NOT REQUIRED TO BE SERIALIZED ALONG WITH THE FIELD DATA).
IN FACT, EVEN IN XML, I DON'T SEE THE FIELD NAME BEING READ
BACK IN, SO IT SERVES NO PURPOSE WHATSOEVER.
- THE TAG IS USED OCCASIONALLY IN THE ERROR MESSAGE, BUT
AGAIN THIS CAN BE HANDLED BETTER BY THE CALLER OF
_RECORDINPUT_ AND _RECORDOUTPUT_.
- THE TAG IS ALSO USED TO DETECT WHETHER A RECORD IS NESTED
OR NOT. IN CSV, WE WRAP NESTED RECORDS WITH "S{}".
WE ALSO WANT TO KNOW WHETHER A RECORD IS NESTED OR THE
TOP-MOST, SO THAT WE ADD A NEWLINE AT THE END OF A TOP-MOST
RECORD. IF A TAG IS EMPTY, IT IS ASSUMED THAT THE RECORD IS
THE TOP-MOST. THIS IS USING THE TAG PARAMETER TO MEAN
SOMETHING ELSE. IT'S FAR MORE READABLE TO JUST PASS IN A
BOOLEAN TO _STARTRECORD()_ AND _ENDRECORD()_ WHICH DIRECTLY
INDICATES WHETHER THE RECORD IS NESTED OR NOT. OR, ADD TWO
ADDITIONAL METHODS TO _RECORDOUTPUT_ AND _RECORDINPUT_:
_START()_ AND _STOP()_, WHICH ARE CALLED AT THE BEGINNING
AND END OF EVERY TOP-MOST RECORD WHILE _STARTRECORD()_ AND
_ENDRECORD()_ ARE USED ONLY FOR NESTED RECORDS. THE FORMER'S
SLIGHTLY BETTER, IMO, BUT EACH METHOD IS MUCH BETTER THAN
USING AN EMPTY TAG TO INDICATE A TOP-LEVEL RECORD.
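The boolean-flag variant described above could look roughly like the following Java sketch. It is only an illustration under stated assumptions: `TagFreeOutput`, `CsvSketchOutput` and `NestedFlagDemo` are hypothetical names, not the actual `org.apache.hadoop.record` classes, and the CSV handling is simplified (no escaping, no type prefixes on values).

```java
/** Hypothetical tag-free output interface; not the real Hadoop RecordOutput. */
interface TagFreeOutput {
    void writeInt(int value);
    void writeString(String value);
    void startRecord(boolean nested);
    void endRecord(boolean nested);
}

/**
 * Simplified CSV-style writer: nested records are wrapped in s{...},
 * and a top-level record is terminated by a newline.
 */
final class CsvSketchOutput implements TagFreeOutput {
    private final StringBuilder out = new StringBuilder();
    private boolean needComma = false;

    /** Emits a comma before every value except the first one in a record. */
    private void separator() {
        if (needComma) {
            out.append(',');
        }
        needComma = true;
    }

    public void writeInt(int value) {
        separator();
        out.append(value);
    }

    public void writeString(String value) {
        separator();
        out.append(value); // assumption: no escaping in this sketch
    }

    public void startRecord(boolean nested) {
        if (nested) {
            separator();
            out.append("s{");
            needComma = false; // first field inside the braces gets no comma
        }
    }

    public void endRecord(boolean nested) {
        if (nested) {
            out.append('}');
            needComma = true;
        } else {
            out.append('\n'); // only the top-most record ends the line
            needComma = false;
        }
    }

    String result() {
        return out.toString();
    }
}

final class NestedFlagDemo {
    /** Serializes a top-level record holding an int and one nested record. */
    static String serializeExample() {
        CsvSketchOutput out = new CsvSketchOutput();
        out.startRecord(false);
        out.writeInt(1);
        out.startRecord(true);
        out.writeString("a");
        out.writeInt(2);
        out.endRecord(true);
        out.endRecord(false);
        return out.result();
    }

    public static void main(String[] args) {
        System.out.print(serializeExample()); // prints 1,s{a,2} followed by a newline
    }
}
```

With the flag, the writer no longer has to infer nesting from an empty tag; the caller states it directly.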
THE ISSUE WITH TAGS BRINGS UP A RELATED ISSUE. SOMETIMES, WE
MAY NEED TO PASS IN ADDITIONAL INFORMATION TO _RECORDINPUT_
OR _RECORDOUTPUT_. FOR EXAMPLE, SUPPOSE WE DO NEED TO WRITE
THE FIELD NAME ALONG WITH THE FIELD VALUE. WE CAN THINK OF
SUCH A REQUIREMENT IN TWO WAYS.
A) SUCH DECISIONS OF WHAT TO SERIALIZE/DESERIALIZE ARE
INDEPENDENT OF THE FORMAT/PROTOCOL THAT THE DATA IS
SERIALIZED IN. IF WE WANT TO WRITE SOMETHING ELSE, THAT
SHOULD BE WRITTEN SEPARATELY BY THE STUB. SO, IF WE WANT TO
SERIALIZE THE FIELD NAME BEFORE A FIELD VALUE, A STUB SHOULD
CALL _RECORDOUTPUT.WRITESTRING(<FIELD NAME>)_ FIRST,
FOLLOWED BY _RECORDOUTPUT.WRITEINT(<FIELD VALUE>)_.
THE METHODS IN _RECORDINPUT_ AND _RECORDOUTPUT_ ARE THE
LOWEST LEVEL METHODS AND THEY SHOULD JUST BE CONCERNED WITH
WRITING INDIVIDUAL TYPES.
B) WHAT IF A PROTOCOL WANTS TO WRITE THINGS DIFFERENTLY? FOR
EXAMPLE, WE MAY WANT TO WRITE THE FIELD NAME BEFORE THE
FIELD VALUE FOR XML ONLY (FOR DEBUGGING SAKE, OR FOR
WHATEVER ELSE). OR IT MAY BE THAT THE FIELD NAME AND FIELD
VALUE NEED TO BE ENCLOSED IN CERTAIN TAGS THAT CAN'T HAPPEN
IF YOU WRITE THEM SEPARATELY. IN THESE CASES, METHODS IN
_RECORDINPUT_ AND _RECORDOUTPUT_ NEED TO BE PASSED
ADDITIONAL INFORMATION. THIS CAN BE DONE BY PROVIDING AN
OPTIONAL PARAMETER FOR THESE METHODS. MAYBE A
STRUCTURE/CLASS CONTAINING FIELD INFORMATION, OR A REFERENCE
TO THE FIELD ITSELF (THE TAG PARAMETER WAS MEANT TO SERVE A
SIMILAR PURPOSE, BUT JUST PASSING IN A STRING MAY BE
INADEQUATE).
FOR NOW, THERE IS NO REAL NEED FOR EITHER OF THESE
SITUATIONS, SO WE SHOULD BE OK WITH GETTING RID OF THE TAG
PARAMETER.
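To make the two options concrete, here is a rough Java sketch. Everything in it is an assumption made for illustration: `FieldInfo`, `FieldAwareOutput`, `CommaOutput` and `XmlLikeOutput` are hypothetical names, the issue does not define what the "structure/class containing field information" would hold, and the XML-like output does no escaping.

```java
/** Hypothetical field descriptor; here it carries only a name. */
final class FieldInfo {
    final String name;

    FieldInfo(String name) {
        this.name = name;
    }
}

/** Hypothetical output interface whose FieldInfo argument is optional (may be null). */
interface FieldAwareOutput {
    void writeInt(int value, FieldInfo info);
    void writeString(String value, FieldInfo info);
}

/** Format that never needs field information and simply ignores it. */
final class CommaOutput implements FieldAwareOutput {
    private final StringBuilder out = new StringBuilder();

    private void append(String text) {
        if (out.length() > 0) {
            out.append(',');
        }
        out.append(text);
    }

    public void writeInt(int value, FieldInfo info) {
        append(Integer.toString(value));
    }

    public void writeString(String value, FieldInfo info) {
        append(value);
    }

    String result() {
        return out.toString();
    }
}

/** Format that encloses name and value in one element when FieldInfo is supplied. */
final class XmlLikeOutput implements FieldAwareOutput {
    private final StringBuilder out = new StringBuilder();

    private void element(String type, String text, FieldInfo info) {
        out.append('<').append(type);
        if (info != null) {
            out.append(" name=\"").append(info.name).append('"');
        }
        out.append('>').append(text).append("</").append(type).append('>');
    }

    public void writeInt(int value, FieldInfo info) {
        element("int", Integer.toString(value), info);
    }

    public void writeString(String value, FieldInfo info) {
        element("string", value, info);
    }

    String result() {
        return out.toString();
    }
}

final class FieldInfoDemo {
    /** Option (a): the stub writes the field name itself, as a separate value. */
    static String optionA() {
        CommaOutput out = new CommaOutput();
        out.writeString("count", null);
        out.writeInt(7, null);
        return out.result();
    }

    /** Option (b): the format decides how to combine name and value. */
    static String optionB() {
        XmlLikeOutput out = new XmlLikeOutput();
        out.writeInt(7, new FieldInfo("count"));
        return out.result();
    }

    public static void main(String[] args) {
        System.out.println(optionA());
        System.out.println(optionB());
    }
}
```

In option (a) the name is just another serialized value; in option (b) only a format that cares about field information uses it, which a bare string tag could not express once more than a name is needed.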
SIMILAR CHANGES NEED TO BE DONE TO THE C++ SIDE, WHERE WE
HAVE _OARCHIVE_ AND _IARCHIVE_:
- THE TAG PARAMETER NEEDS TO BE REMOVED
- _STARTRECORD()_ AND _ENDRECORD()_ IN _OARCHIVE_ AND
_IARCHIVE_ NEED TO TAKE A BOOLEAN PARAMETER THAT INDICATES
WHETHER THE RECORD IS NESTED OR NOT
- CURRENTLY, BOTH _STARTRECORD()_ AND _ENDRECORD()_ IN
_IARCHIVE_ TAKE AN ADDITIONAL PARAMETER, A REFERENCE TO A
HADOOP RECORD. THIS IS NEVER USED ANYWHERE AND IS NOT REQUIRED (THE
CORRESPONDING METHODS IN _RECORDINPUT_ AND _RECORDOUTPUT_
DON'T TAKE ANY PARAMETERS, WHICH IS THE RIGHT THING TO DO),
AND SHOULD BE REMOVED.
--
THIS MESSAGE IS AUTOMATICALLY GENERATED BY JIRA.
-
YOU CAN REPLY TO THIS EMAIL TO ADD A COMMENT TO THE ISSUE
ONLINE.
|
|
| Commented: (HADOOP-2030) Some changes
to Record I/O interfaces |

|
2007-11-06 14:26:50 |
[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/HADOOP-2030?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:COMMENT-TABPANEL#ACTION_12540548 ]
MILIND BHANDARKAR COMMENTED ON HADOOP-2030:
-------------------------------------------
THE TAG WAS ADDED FOR SUPPORTING XML AND OTHER FORMATS LIKE
JSON, WHICH HAVE THE ABILITY TO CREATE A CLASS DYNAMICALLY
TO REFER TO FIELDS NATIVELY BY THEIR NAMES. THE DDL (OR
TYPEINFO) WAS NOT FED TO THE RECORDOUTPUT AND RECORDINPUT
INTERFACES. IF THE TYPEINFO IS FED TO THE CONSTRUCTION OF
RECORDINPUT/RECORDOUTPUT, THEN THE NEED FOR TAG IS LESSENED.
(IT PROVIDES AN OPPORTUNITY FOR BETTER ERROR CHECKING FOR
XML SERIALIZED RECORDS TO HAVE A FIELDNAME IN
SERIALIZATION.)
ALSO, THE SERIALIZE AND DESERIALIZE METHODS GENERATED FOR
EACH CLASS USED TO CALL STARTRECORD AND ENDRECORD. THIS
MEANT THAT THE RECORD WHICH IS BEING SERIALIZED DID NOT NEED
TO KNOW WHETHER IT WAS A TOP-LEVEL RECORD OR EMBEDDED
RECORD. WITH YOUR PROPOSAL, EITHER THE SERIALIZE/DESERIALIZE
WOULD HAVE TO KNOW IT, OR THE USER WILL HAVE TO CALL METHODS
ON RECORDOUTPUT/RECORDINPUT TO START/END TOP-LEVEL RECORD.
I AGREE WITH YOU THAT HAVING A STRING THAT EITHER CONTAINS A
NAME OR IS EMPTY IS A BAD INDICATOR OF A TOP-LEVEL RECORD,
BUT IT DID SIMPLIFY
SERIALIZATION INTERFACES. THE USER OF THE GENERATED CLASS
DID NOT HAVE TO KNOW RECORDINPUT OR RECORDOUTPUT METHODS.
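One way to read the suggestion about feeding type information to the constructor is sketched below in Java. This is an assumption-laden illustration, not the Hadoop API: `TypeInfoAwareXmlOutput` is a hypothetical class, the type information is reduced to an ordered list of field names, and only flat records are handled.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical XML-like writer that receives the field names once, at
 * construction time, so individual write calls need no tag argument.
 */
final class TypeInfoAwareXmlOutput {
    private final List<String> fieldNames;
    private final StringBuilder out = new StringBuilder();
    private int next = 0;

    TypeInfoAwareXmlOutput(List<String> fieldNames) {
        this.fieldNames = fieldNames;
    }

    /** Returns the name of the field being written, in declaration order. */
    private String nextName() {
        if (next >= fieldNames.size()) {
            throw new IllegalStateException("more values written than fields declared");
        }
        return fieldNames.get(next++);
    }

    private void field(String text) {
        String name = nextName();
        out.append('<').append(name).append('>')
           .append(text) // assumption: no XML escaping in this sketch
           .append("</").append(name).append('>');
    }

    void writeInt(int value) {
        field(Integer.toString(value));
    }

    void writeString(String value) {
        field(value);
    }

    String result() {
        return out.toString();
    }
}

final class TypeInfoDemo {
    static String serializeExample() {
        TypeInfoAwareXmlOutput out =
            new TypeInfoAwareXmlOutput(Arrays.asList("id", "name"));
        out.writeInt(42);
        out.writeString("x");
        return out.result();
    }

    public static void main(String[] args) {
        System.out.println(serializeExample()); // prints <id>42</id><name>x</name>
    }
}
```

The field names still reach the serialized XML, which keeps the error-checking opportunity mentioned above, but the per-call tag is gone.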
|
|
| Commented: (HADOOP-2030) Some changes
to Record I/O interfaces |

|
2007-11-07 05:19:50 |
[ HTTPS://ISSUES.APACHE.ORG/JIRA/BROWSE/HADOOP-2030?PAGE=COM.ATLASSIAN.JIRA.PLUGIN.SYSTEM.ISSUETABPANELS:COMMENT-TABPANEL#ACTION_12540718 ]
VIVEK RATAN COMMENTED ON HADOOP-2030:
-------------------------------------
THERE ARE A COUPLE OF DIFFERENT ISSUES HERE, SO LET ME
ADDRESS THEM SEPARATELY.
WHILE WE CURRENTLY DO NOT NEED TO PASS IN ANY INFORMATION,
SUCH AS FIELD NAME, TO RECORDOUTPUT OR RECORDINPUT, IT'S
POSSIBLE THAT WE'LL NEED SOMETHING IN THE FUTURE. IN THAT
CASE, INSTEAD OF A STRING, I SUGGEST PASSING IN SOME CLASS
THAT CAN CONTAIN WHATEVER INFORMATION YOU NEED TO PASS. SEE
MY WRITEUP IN THE ORIGINAL COMMENT. A STRING TAG IS TOO
SPECIFIC.
REGARDING TOP-LEVEL RECORDS, THERE ARE TWO ISSUES WITH OUR
CURRENT SETUP:
1. THERE ARE TWO PUBLIC SERIALIZE() METHODS IN RECORD, ONE
THAT TAKES A TAG AND ONE THAT DOESN'T. THE USER SHOULD
REALLY CALL THE ONE THAT DOESN'T TAKE A TAG. THE OTHER ONE
IS CALLED BY THE SERIALIZE METHOD OF THE TOP-LEVEL RECORD.
2. ABSENCE OR PRESENCE OF A TAG IS A BAD WAY TO INDICATE
WHETHER WE'RE DE/SERIALIZING A TOP LEVEL RECORD OR NOT.
THERE SHOULD REALLY ONLY BE ONE SERIALIZE() METHOD AVAILABLE
TO THE USER, AND THIS IS THE ONE THAT THE USER CALLS FOR A
TOP-LEVEL RECORD. THIS METHOD SHOULD TAKE IN A
RECORDOUTPUT. WHEN SERIALIZING NESTED RECORDS, OUR GENERATED
CODE SHOULD CALL SOME OTHER METHOD, WHICH WOULD INDICATE
THAT THE CALL IS TO A NESTED RECORD. THIS METHOD SHOULD NOT
BE ACCESSIBLE TO THE USER (USING PROTECTED/PRIVATE OR SOME
SUCH THING, OR PERHAPS USING A DIFFERENT METHOD NAME).
I'LL PLAY WITH THE CODE AND SEE IF THIS CAN BE DONE IN JAVA
AND C++.
REGARDLESS, THE 'TAG' PARAMETER SHOULD BE DISPENSED WITH, AS
BOTH THESE ISSUES CAN BE HANDLED IN A BETTER WAY WITHOUT IT.
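The single-entry-point idea can be sketched in Java as follows. All names here (`SketchOutput`, `TraceOutput`, `SketchRecord`, `Inner`, `Outer`) are hypothetical and stand in for generated code; the sketch only shows that the shape is expressible, not how Hadoop would implement it. Note that Java's `protected` also grants access within the same package, so the nested method is hidden from users in other packages only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tag-free output interface; not the real Hadoop RecordOutput. */
interface SketchOutput {
    void writeInt(int value);
    void writeString(String value);
    void startRecord(boolean nested);
    void endRecord(boolean nested);
}

/** Records every call so the order of operations can be inspected. */
final class TraceOutput implements SketchOutput {
    private final List<String> events = new ArrayList<String>();

    public void writeInt(int value) { events.add("int:" + value); }
    public void writeString(String value) { events.add("string:" + value); }
    public void startRecord(boolean nested) { events.add(nested ? "start(nested)" : "start(top)"); }
    public void endRecord(boolean nested) { events.add(nested ? "end(nested)" : "end(top)"); }

    String trace() {
        return String.join(" ", events);
    }
}

/** Hypothetical base class for generated records. */
abstract class SketchRecord {
    /** The only method users call; it always treats the record as top-level. */
    public final void serialize(SketchOutput out) {
        writeRecord(out, false);
    }

    /** Called by generated code when this record is a field of another record. */
    protected final void serializeNested(SketchOutput out) {
        writeRecord(out, true);
    }

    private void writeRecord(SketchOutput out, boolean nested) {
        out.startRecord(nested);
        writeFields(out);
        out.endRecord(nested);
    }

    /** Generated per record type: writes each field in order. */
    protected abstract void writeFields(SketchOutput out);
}

final class Inner extends SketchRecord {
    private final String label;

    Inner(String label) { this.label = label; }

    protected void writeFields(SketchOutput out) {
        out.writeString(label);
    }
}

final class Outer extends SketchRecord {
    private final int id;
    private final Inner inner;

    Outer(int id, Inner inner) {
        this.id = id;
        this.inner = inner;
    }

    protected void writeFields(SketchOutput out) {
        out.writeInt(id);
        inner.serializeNested(out); // generated code, not the user, picks the nested path
    }
}

final class SingleEntryDemo {
    static String serializeExample() {
        TraceOutput out = new TraceOutput();
        new Outer(1, new Inner("a")).serialize(out);
        return out.trace();
    }

    public static void main(String[] args) {
        System.out.println(serializeExample());
    }
}
```

The record itself never needs to know whether it is top-level; that is decided by which method was called on it.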
|
|
|
|