file encodings handling


file encodings handling

stepharo
Hi

For the MOOC, I'm working on an SRT to VTT converter. It should turn

1
00:00:07,040 --> 00:00:10,440
Hello. This week,
we'll get to the heart of the matter,

2
00:00:10 600 --> 00:00:12,160
about syntax especially.

into

WEBVTT

00:00:07.040 --> 00:00:10.440 align:middle
Hello. This week,
we'll get to the heart of the matter,

00:00:10.600 --> 00:00:12.160 align:middle
about syntax especially.


It works more or less. Now I face the problem that the files people
provided me have different encodings (I guess), because when I do not
treat the input (for example with withLinuxLineEndings) I get some CRs
after the conversion, even though I only copy some of the file content
and all the line endings I output are LF (or can be customized).

I cannot apply "garbage in, garbage out" because the files should work.

So I thought that I should first convert the string I read using
withLinuxLineEndings, so that all CR and CRLF are converted into LF. But
since the files have different encodings, I end up with another issue:
too many LFs.

Does any of you have an idea how to handle this?

I did not find a way to determine the encoding of a file (not the BOM,
just the line endings).

Stef



Re: file encodings handling

Sven Van Caekenberghe-2
This is easy enough (IIUC your problem): when using #nextLine while reading from a stream, all 3 EOL conventions are handled transparently, you just get the line's contents back until you are done. Then you write the lines back out with your preferred EOL convention.
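
A minimal sketch of that approach, assuming a current Pharo image; it is not taken from the converter discussed in the thread. The sample string and the variable names are made up for illustration, and an in-memory stream stands in for the file so the snippet can be evaluated in a playground as is.

```smalltalk
| input in lines output |

"Hypothetical sample text mixing all three EOL conventions: CRLF, CR and LF."
input := 'one' , String crlf , 'two' , String cr , 'three' , String lf , 'four'.

"Read line by line: #nextLine answers each line without its terminator,
whichever convention was used."
in := input readStream.
lines := OrderedCollection new.
[ in atEnd ] whileFalse: [ lines add: in nextLine ].

"Write the lines back out, each terminated by a single LF."
output := String streamContents: [ :out |
	lines do: [ :each |
		out
			nextPutAll: each;
			nextPut: Character lf ] ].
```

For a real file, the same loop should work on the stream handed to the block by `readStreamDo:` on a file reference (for example `'talk.srt' asFileReference`, a hypothetical file name), provided that stream decodes the file's character encoding correctly. Line-ending handling and character encoding are separate concerns, so this only addresses the first.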

> On 18 Aug 2016, at 20:41, stepharo <[hidden email]> wrote:
> [...]