sushil ronghe | 3 Sep 16:49 2008

empty paragraph

Hi,

While doing sentence alignment for English and Spanish (en-es),
I got several (error?) messages like this:

ep-99-10-06.txt (speaker 78) different number of paragraphs 9 != 13
ep-99-10-06.txt (speaker 87) different number of paragraphs 8 != 9
ep-99-10-06.txt (speaker 113) different number of paragraphs 8 != 9
ep-99-10-06.txt (speaker 170) different number of paragraphs 8 != 7
ep-99-10-06.txt (speaker 171) different number of paragraphs 14 != 16
ep-99-10-06.txt (speaker 181) different number of paragraphs 4 != 3
ep-99-10-06.txt (speaker 219) different number of paragraphs 8 != 7
Warning: No known abbreviations for this language
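
The numbers in these messages can be reproduced by hand. The following is a minimal sketch, not the aligner's actual code. It assumes that a paragraph count is the number of <P> lines in a speaker turn plus one; the function names are hypothetical. Counted this way, the speaker 78 excerpts quoted below give 9 for English and 13 for Spanish, which matches the first message.

```python
import re

# Matches tags such as <SPEAKER ID=78 NAME="President">, with or without quotes around the ID.
SPEAKER_RE = re.compile(r'<SPEAKER ID="?(\d+)"?')


def paragraphs_per_speaker(lines):
    """Map each speaker ID to its paragraph count (number of <P> lines + 1).

    Assumption: this is how the aligner counts paragraphs; the real script may differ.
    """
    counts = {}
    current = None
    for line in lines:
        text = line.strip()
        match = SPEAKER_RE.match(text)
        if match:
            current = int(match.group(1))
            counts[current] = 1  # the block right after the SPEAKER tag
        elif text == "<P>" and current is not None:
            counts[current] += 1  # each <P> starts one more paragraph
    return counts


def report_mismatches(src_lines, tgt_lines):
    """List (speaker, source count, target count) where the two sides disagree."""
    src = paragraphs_per_speaker(src_lines)
    tgt = paragraphs_per_speaker(tgt_lines)
    return [
        (spk, src[spk], tgt[spk])
        for spk in sorted(src)
        if spk in tgt and src[spk] != tgt[spk]
    ]
```

Calling report_mismatches on the lines of the English and Spanish source files would return one tuple per mismatching speaker.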

Then I compared the text in file ep-99-10-06.txt for both languages.

English

<SPEAKER ID=78 NAME="President">
Ladies and gentlemen, as you can well imagine, this is neither the time nor the place to start a debate. In fact, the vote is under way.
<P>
(Parliament adopted the decision)
<P>
Report (A5-0017/1999) by Mr H.-P. Martin, on behalf of the Committee on Industry, External Trade, Research and Energy, on the proposal for a Council Decision providing further macro-financial assistance to Bulgaria (COM(1999)403 - C5-0098/1999 - 1999/0165(CNS))
<P>
(Parliament adopted the legislative resolution)
<P>
Report (A5-0018/1999) by Mr H.-P. Martin, on behalf of the Committee on Industry, External Trade, Research and Energy, on the proposal for a Council Decision providing supplementary macro-financial assistance to the former Yugoslav Republic of Macedonia (COM(1999)404 - C5-0099/1999 - 1999/0166(CNS))
<P>
(Parliament adopted the legislative resolution)
<P>
Report (A5-0019/1999) by Mr H.-P. Martin, on behalf of the Committee on Industry, External Trade, Research and Energy, on the proposal for a Council Decision providing supplementary macro-financial assistance to Romania (COM(1999)405 - C5-0097/1999 - 1999/0167(CNS))
<P>
(Parliament adopted the legislative resolution)
<P>
Joint motion for resolution on the International AIDS Conference in Zambia


Spanish

<SPEAKER ID=78 NAME="La Presidenta">
Señorías, como pueden suponer, no es ni el lugar ni el momento de iniciar un debate. Estamos procediendo a la votación.
<P>
(El Parlamento aprueba la decisión)
<P>

<P>
Informe (A5-0017/1999) del Sr. H.-P. Martin, en nombre de la Comisión de Industria, Comercio Exterior, Investigación y Energía, sobre la propuesta de decisión del Consejo por la que se concede una ayuda macrofinanciera suplementaria a Bulgaria (COM(1999)403 - C5-0098/1999 - 1999/0165(CNS))
<P>
(El Parlamento aprueba la resolución legislativa)
<P>

<P>
Informe (A5-0018/1999) del Sr. H.-P. Martin, en nombre de la Comisión de Industria, Comercio Exterior, Investigación y Energía, sobre la propuesta de decisión del Consejo por la que se concede una ayuda macrofinanciera suplementaria a la Antigua República Yugoslava de Macedonia (COM(1999)404 - C5-0099/1999 - 1999/0166(CNS))
<P>
(El Parlamento aprueba la resolución legislativa)
<P>

<P>
Informe (A5-0019/1999) del Sr. H.-P. Martin, en nombre de la Comisión de Industria, Comercio Exterior, Investigación y Energía, sobre la propuesta de decisión del Consejo por la que se concede una ayuda macrofinanciera suplementaria a Rumania (COM(1999)405 - C5-0097/1999 - 1999/0167(CNS))
<P>
(El Parlamento aprueba la resolución legislativa)
<P>

<P>
Propuesta de resolución común sobre la Conferencia Internacional sobre el sida en Lusaka

 
We can see the cause of the error: the Spanish content has extra <P> tokens, but the paragraphs they delimit are empty.
After the alignment I inspected the output files and found that, although the error log was shown, the content is
still present in the aligned files. Here is the same portion in the aligned files:

English:

<SPEAKER ID=78 NAME="President">
Ladies and gentlemen , as you can well imagine , this is neither the time nor the place to start a debate .
In fact , the vote is under way .
<P>
( Parliament adopted the decision )
<P>
Report ( A5-0017 / 1999 ) by Mr H.-P. Martin , on behalf of the Committee on Industry , External Trade , Research and Energy , on the proposal for a Council Decision providing further macro-financial assistance to Bulgaria ( COM ( 1999 ) 403 - C5-0098 / 1999 - 1999 / 0165 ( CNS ) )
<P>
( Parliament adopted the legislative resolution )
<P>
Report ( A5-0018 / 1999 ) by Mr H.-P. Martin , on behalf of the Committee on Industry , External Trade , Research and Energy , on the proposal for a Council Decision providing supplementary macro-financial assistance to the former Yugoslav Republic of Macedonia ( COM ( 1999 ) 404 - C5-0099 / 1999 - 1999 / 0166 ( CNS ) )
<P>
( Parliament adopted the legislative resolution )
<P>
Report ( A5-0019 / 1999 ) by Mr H.-P. Martin , on behalf of the Committee on Industry , External Trade , Research and Energy , on the proposal for a Council Decision providing supplementary macro-financial assistance to Romania ( COM ( 1999 ) 405 - C5-0097 / 1999 - 1999 / 0167 ( CNS ) )
<P>
( Parliament adopted the legislative resolution )
<P>
Joint motion for resolution on the International AIDS Conference in Zambia


Spanish:

<SPEAKER ID=78 NAME="La Presidenta">
Señorías , como pueden suponer , no es ni el lugar ni el momento de iniciar un debate .
Estamos procediendo a la votación .
<P>
( El Parlamento aprueba la decisión )
<P>

<P>
Informe ( A5-0017 / 1999 ) del Sr . H.-P. Martin , en nombre de la Comisión de Industria , Comercio Exterior , Investigación y Energía , sobre la propuesta de decisión del Consejo por la que se concede una ayuda macrofinanciera suplementaria a Bulgaria ( COM ( 1999 ) 403 - C5-0098 / 1999 - 1999 / 0165 ( CNS ) )
<P>
( El Parlamento aprueba la resolución legislativa )
<P>

<P>
Informe ( A5-0018 / 1999 ) del Sr . H.-P. Martin , en nombre de la Comisión de Industria , Comercio Exterior , Investigación y Energía , sobre la propuesta de decisión del Consejo por la que se concede una ayuda macrofinanciera suplementaria a la Antigua República Yugoslava de Macedonia ( COM ( 1999 ) 404 - C5-0099 / 1999 - 1999 / 0166 ( CNS ) )
<P>
( El Parlamento aprueba la resolución legislativa )
<P>


Questions:
-> Does this mean that the aligned files I have generated are not suitable for training the model?
-> Can we modify the pre-processing script to remove the empty paragraphs?
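
To make the second question concrete, here is a minimal sketch of the kind of filter I have in mind. It is not taken from the existing pre-processing scripts, and the function name is hypothetical. It assumes that an empty paragraph always shows up as a blank line between two <P> lines, as in the Spanish excerpt above, and that dropping it is harmless.

```python
def drop_empty_paragraphs(lines):
    """Remove blank lines and collapse consecutive <P> markers into one.

    Assumption: an empty paragraph is a blank line sitting between two <P>
    markers. Lines are returned stripped of surrounding whitespace.
    """
    cleaned = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank line: the body of an empty paragraph
        if text == "<P>" and cleaned and cleaned[-1] == "<P>":
            continue  # second <P> in a row: nothing was between them
        cleaned.append(text)
    return cleaned
```

The filter would be applied to both language sides before alignment, so that the paragraph counts per speaker have a chance to match.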


Thanks

--
********************************
sushil ronghe
*********************************
_______________________________________________
Moses-support mailing list
Moses-support@...
http://mailman.mit.edu/mailman/listinfo/moses-support
