Thread: "ocaml_beginners"::[] Writing to many files




"ocaml_beginners"::[] Writing to many files
South Africa
2008-03-28 03:45:09

I am testing short programs in Ocaml and Python to do the following:

Extract data from a file with content like this:

245371;31May2007;23:53:39;61.22.142.25;87.248.211.195;tcp;80;N;N
260069;31May2007;23:54:09;61.22.18.12;N;tcp;950;15037177;Access denied - wrong user name or password
263257;31May2007;23:54:15;61.22.10.31;21.56.230.135;tcp;445;N;N
263279;31May2007;23:54:15;61.22.20.178;125.132.218.30;tcp;445;N;N

to files which contain the second field in the filename
eg. 'auth_31May2007.csv'

At the moment I am doing it like in the code below. My question is,
how can I do the same without having to close the out_channel after
every line was written but only when a different date comes up? I am
parsing files with many millions of lines.

===============================
open Str

let outputfile filename date =
  open_out_gen
    [Open_creat;Open_append;Open_wronly] 0o644 (date ^ filename ^ ".csv" )

let writeline filename date line =
  let lu = outputfile date filename
  in
  Printf.fprintf lu "%s\n" line;
  (* Printf.printf "%s -> %s\n" date line;*)
  close_out lu

let extract_date line =
  let nd = Str.bounded_split (Str.regexp ";") line 3
  in match nd with
    _::new_date::_ -> new_date
  | _ -> ""

let readfile filename =
  if Sys.file_exists filename
  then
    let short = List.hd (Str.bounded_split (Str.regexp "_") filename 2 )
    in
    let l = open_in filename in
    try
      while true
      do
        let line = input_line l in
        let date = extract_date line in
        writeline short date line
      done;
      assert false
    with End_of_file -> close_in l
  else
    ()

let _ = readfile "auth_fwlognasql.csv";
  readfile "accepted_fwlognasql.csv"
==============================================

Regards
Johann
--
Johann Spies Telefoon: 021-808 4036
Informasietegnologie, Universiteit van Stellenbosch

"Thou wilt keep him in perfect peace, whose mind is
stayed on thee: because he trusteth in thee. Trust
ye in the LORD for ever: for in the LORD JEHOVAH is
everlasting strength:" Isaiah 26:3,4

Re: "ocaml_beginners"::[] Writing to many files
United States
2008-03-28 08:24:52

On Fri, 28 Mar 2008 10:45:09 +0200, Johann Spies wrote
> I am testing short programs in Ocaml and Python to do the following:
>
> Extract data from a file with content like this:
>
> 245371;31May2007;23:53:39;61.22.142.25;87.248.211.195;tcp;80;N;N
> 260069;31May2007;23:54:09;61.22.18.12;N;tcp;950;15037177;Access denied - wrong user name or password
> 263257;31May2007;23:54:15;61.22.10.31;21.56.230.135;tcp;445;N;N
> 263279;31May2007;23:54:15;61.22.20.178;125.132.218.30;tcp;445;N;N
>
> to files which contain the second field in the filename
> eg. 'auth_31May2007.csv'
>
> At the moment I am doing it like in the code below. My question is,
> how can I do the same without having to close the out_channel after
> every line was written but only when a different date comes up? I am
> parsing files with many millions of lines.

Two potential solutions spring immediately to mind, depending on your
situation:

1: If you don't have too many dates (i.e. few enough to keep all of their
channels open simultaneously), despite having millions of lines to parse,
use a Map or Hashtbl to map dates to output channels. You could do this in
outputfile like:

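let outputfile map fname date =
  (* Reuse the channel already opened for this date, if any. *)
  try Hashtbl.find map date
  with Not_found ->
    (* First time this date is seen: open its file and remember the channel. *)
    let oc = open_out_gen
        [Open_creat;Open_append;Open_wronly]
        0o644
        (date ^ fname ^ ".csv" )
    in
    Hashtbl.add map date oc;
    oc
;;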

Then you'd just need to add a function to close all of the open channels
when you're done parsing.

let close_all map = Hashtbl.iter (fun _ oc -> close_out oc) map;;
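
Putting the two pieces together, a minimal end-to-end sketch might look like
the following. The helper names (date_of_line, channel_for, split_file) and the
"<prefix>_<date>.csv" naming are assumptions made for illustration, not part of
the code above, and it needs OCaml 4.08 or later for Fun.protect.

```ocaml
(* Sketch: keep one output channel per date in a Hashtbl.
   All names and the "<prefix>_<date>.csv" naming are illustrative assumptions. *)

(* Second ';'-separated field of a line, or None if the line has no such field. *)
let date_of_line line =
  match String.split_on_char ';' line with
  | _ :: date :: _ -> Some date
  | _ -> None

(* Return the channel cached for [date], opening its file on first use. *)
let channel_for table prefix date =
  match Hashtbl.find_opt table date with
  | Some oc -> oc
  | None ->
    let oc =
      open_out_gen [Open_creat; Open_append; Open_wronly] 0o644
        (prefix ^ "_" ^ date ^ ".csv")
    in
    Hashtbl.add table date oc;
    oc

(* Copy every line of [ic] to the file for its date.
   All channels are closed at the end, even if an exception escapes. *)
let split_file prefix ic =
  let table = Hashtbl.create 32 in
  let close_all () = Hashtbl.iter (fun _ oc -> close_out oc) table in
  Fun.protect ~finally:close_all (fun () ->
    try
      while true do
        let line = input_line ic in
        match date_of_line line with
        | Some date ->
          let oc = channel_for table prefix date in
          output_string oc line;
          output_char oc '\n'
        | None -> ()  (* skip lines without a date field *)
      done
    with End_of_file -> ())
```

Each file is opened at most once per run, so the cost per line is one hash
lookup plus a buffered write.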

2: If you have too many files to keep open, or if you have really long runs
of the same date in your data, you can just cache the last date and channel
in outputfile.

let outputfile =
  let last_date = ref None
  and last_oc = ref None in
  fun fname date ->
    match !last_date,!last_oc with
    | Some d,Some oc when d = date -> oc
    | Some d,None when d = date ->
      failwith "Yipes! Cached date but no cached channel!"
    | Some _,Some oc ->
      (* The date changed: close the old channel and open the new file. *)
      close_out oc;
      let new_oc = open_out_gen
          [Open_creat;Open_append;Open_wronly]
          0o644
          (date ^ fname ^ ".csv" )
      in
      last_date := Some date; last_oc := Some new_oc;
      new_oc
    | None,None ->
      (* First call: nothing cached yet. *)
      let new_oc = open_out_gen
          [Open_creat;Open_append;Open_wronly]
          0o644
          (date ^ fname ^ ".csv" )
      in
      last_date := Some date; last_oc := Some new_oc;
      new_oc
    | _ ->
      (* None for date, some open channel -- should never happen *)
      assert false
;;

Here, of course, you'll have to manually close the last out_channel you
receive when done processing.

Of course, if you have a ton of dates, and they appear somewhat randomly in
the list you're processing, then you're kind of in a bad place, and would
have to implement something like an LRU cache and bump old date/channel
mappings when you hit your open channel limit.
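
Such a bounded cache could be sketched roughly as below. This is only one
possible shape, not a tuned implementation: the record type, the function
names, the "<prefix>_<date>.csv" naming and the linear scan for the oldest
entry are all assumptions. It relies on the files being opened in append mode,
so a date that was evicted and comes back simply continues its file. The
capacity is assumed to be at least 1.

```ocaml
(* Sketch of a small LRU cache of output channels (illustrative names only). *)
type cache = {
  capacity : int;                                     (* max channels open at once *)
  table : (string, out_channel * int ref) Hashtbl.t;  (* date -> channel, last use *)
  mutable clock : int;                                (* logical time, bumped per lookup *)
}

let create capacity =
  { capacity; table = Hashtbl.create capacity; clock = 0 }

(* Close and forget the least recently used channel, if there is one. *)
let evict_oldest cache =
  let oldest =
    Hashtbl.fold
      (fun date (_, used) acc ->
         match acc with
         | Some (_, t) when t <= !used -> acc
         | _ -> Some (date, !used))
      cache.table None
  in
  match oldest with
  | Some (date, _) ->
    let oc, _ = Hashtbl.find cache.table date in
    close_out oc;
    Hashtbl.remove cache.table date
  | None -> ()

(* Channel for [date]; may close another channel to stay within capacity. *)
let channel_for cache prefix date =
  cache.clock <- cache.clock + 1;
  match Hashtbl.find_opt cache.table date with
  | Some (oc, used) -> used := cache.clock; oc
  | None ->
    if Hashtbl.length cache.table >= cache.capacity then evict_oldest cache;
    let oc =
      open_out_gen [Open_creat; Open_append; Open_wronly] 0o644
        (prefix ^ "_" ^ date ^ ".csv")
    in
    Hashtbl.replace cache.table date (oc, ref cache.clock);
    oc

(* Close every channel still open; call this once parsing is done. *)
let close_all cache =
  Hashtbl.iter (fun _ (oc, _) -> close_out oc) cache.table;
  Hashtbl.reset cache.table
```

Callers should ask channel_for for the channel on every line rather than
holding on to one, since any earlier channel may have been closed by an
eviction in the meantime.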

Hope that helps. Also, Scanf may suit your needs here better than Str.
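
For instance, the date field could probably be pulled out with Scanf along
these lines. Returning an option instead of the empty string on failure is a
choice made for this sketch, not something taken from the original
extract_date.

```ocaml
(* Sketch: read the first two ';'-separated fields and keep the second.
   "%[^;]" reads a run of characters up to (not including) the next ';'. *)
let extract_date line =
  try Scanf.sscanf line "%[^;];%[^;]" (fun _first date -> Some date)
  with Scanf.Scan_failure _ | End_of_file -> None
```

Unlike the Str version, this does not build a list for every line; it stops
reading as soon as the second field is complete.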

DISCLAIMER: None of the functions I typed in here were tested, and were
typed pre-coffee. They may contain bugs or may be cleaned up. That bit's
on you.

--

William D. Neumann

Re: "ocaml_beginners"::[] Writing to many files
South Africa
2008-03-28 08:56:31

On Fri, Mar 28, 2008 at 07:24:52AM -0600, William D. Neumann wrote:
> Two potential solutions spring immediately to mind, depending on your
> situation:
>
> 1: If you don't have too many dates (i.e. few enough to keep all of their
> channels open simultaneously), despite having millions of lines to parse,
> use a Map or Hashtbl to map dates to output channels. You could do this in

There should not be more than 32 dates in a single file, so this
method can be a solution. Thanks.

> Hope that helps. Also, Scanf may suit your needs here better than Str.

Thanks. I forgot about Scanf. Did not use it often in the past.

Regards
Johann

--
Johann Spies Telefoon: 021-808 4036
Informasietegnologie, Universiteit van Stellenbosch

"Thou wilt keep him in perfect peace, whose mind is
stayed on thee: because he trusteth in thee. Trust
ye in the LORD for ever: for in the LORD JEHOVAH is
everlasting strength:" Isaiah 26:3,4

Re: "ocaml_beginners"::[] Writing to many files
United Kingdom
2008-03-29 06:13:44

On Fri, Mar 28, 2008 at 10:45:09AM +0200, Johann Spies wrote:
> I am testing short programs in Ocaml and Python to do the following:
>
> Extract data from a file with content like this:
>
> 245371;31May2007;23:53:39;61.22.142.25;87.248.211.195;tcp;80;N;N
> 260069;31May2007;23:54:09;61.22.18.12;N;tcp;950;15037177;Access denied - wrong user name or password
> 263257;31May2007;23:54:15;61.22.10.31;21.56.230.135;tcp;445;N;N
> 263279;31May2007;23:54:15;61.22.20.178;125.132.218.30;tcp;445;N;N
>
> to files which contain the second field in the filename
> eg. 'auth_31May2007.csv'

First of all I suggest using the ocaml-csv library. You can change
the separator character to ';'. CSV files cannot reliably be parsed
simply by splitting them as you did in your code.
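
To see why, consider a field that contains the separator inside double
quotes. The sketch below contrasts a plain split with a small hand-written
quote-aware splitter. split_quoted is a hypothetical helper written only for
illustration; it is not part of ocaml-csv, and it ignores cases such as quoted
fields that span several lines.

```ocaml
(* A plain split treats the ';' inside the quoted field as a separator,
   so this three-field row comes back as four pieces. *)
let naive = String.split_on_char ';' {|1;"a;b";c|}

(* Sketch of a quote-aware split: separators inside "..." are kept,
   and "" inside a quoted field stands for one literal quote. *)
let split_quoted sep line =
  let n = String.length line in
  let buf = Buffer.create 32 in
  let fields = ref [] in
  let in_quotes = ref false in
  let i = ref 0 in
  while !i < n do
    let c = line.[!i] in
    if !in_quotes then begin
      if c <> '"' then Buffer.add_char buf c
      else if !i + 1 < n && line.[!i + 1] = '"' then begin
        Buffer.add_char buf '"';   (* escaped quote *)
        incr i
      end
      else in_quotes := false      (* closing quote *)
    end
    else if c = '"' then in_quotes := true
    else if c = sep then begin
      fields := Buffer.contents buf :: !fields;
      Buffer.clear buf
    end
    else Buffer.add_char buf c;
    incr i
  done;
  List.rev (Buffer.contents buf :: !fields)
```

On the sample log lines in this thread both approaches agree, since none of
the fields are quoted; the difference only shows up once a quoted field
contains a ';'.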

> At the moment I am doing it like in the code below. My question is,
> how can I do the same without having to close the out_channel after
> every line was written but only when a different date comes up? I am
> parsing files with many millions of lines.

The rows are sorted by date? Just store the previous date in a
reference and compare it with the current date field. The following
[untested] code can deal with unlimited length input files:

let () =
  let separator = ';' in

  let prev_date = ref None in

  let f = function
    | _id :: date :: _rest ->  (* id; date; time; ip1; ip2; proto; port; etc... *)
      (match !prev_date with
       | None ->
         (* open new output file ... *)
         ()
       | Some d when d <> date ->
         (* close old output file, open new output file ... *)
         ()
       | _ -> ()
      );
      prev_date := Some date;
      (* continue processing this row ... *)
      ()

    | row -> failwith ("unexpected row: " ^
                       String.concat (String.make 1 separator) row)
  in

  let chan = open_in "auth_31May2007.csv" in
  Csv.load_rows ~separator f chan;

  close_in chan
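
For comparison, here is a self-contained sketch of the same idea with the
placeholders filled in. It deliberately swaps ocaml-csv for a plain
String.split_on_char so that it runs with the standard library alone, which
means it inherits the quoting limitation mentioned above. The function name
split_by_date and the "<prefix>_<date>.csv" naming are assumptions, and
Fun.protect needs OCaml 4.08 or later.

```ocaml
(* Sketch: copy each line of [ic] to "<prefix>_<date>.csv", switching output
   files only when the date field changes. Assumes rows are grouped by date. *)
let split_by_date prefix ic =
  (* (date, channel) of the file currently open, if any *)
  let current = ref None in
  let close_current () =
    (match !current with Some (_, oc) -> close_out oc | None -> ());
    current := None
  in
  let channel_for date =
    match !current with
    | Some (d, oc) when d = date -> oc   (* same date: keep writing *)
    | _ ->
      close_current ();                  (* date changed, or first row *)
      let oc =
        open_out_gen [Open_creat; Open_append; Open_wronly] 0o644
          (prefix ^ "_" ^ date ^ ".csv")
      in
      current := Some (date, oc);
      oc
  in
  (* The last open channel is closed on normal exit and on failure. *)
  Fun.protect ~finally:close_current (fun () ->
    try
      while true do
        let line = input_line ic in
        match String.split_on_char ';' line with
        | _ :: date :: _ ->
          let oc = channel_for date in
          output_string oc line;
          output_char oc '\n'
        | _ -> failwith ("unexpected row: " ^ line)
      done
    with End_of_file -> ())
```

Only one output channel is open at any time, so the number of distinct dates
in the input does not matter. If the rows are not grouped by date it still
produces correct files thanks to append mode, but it reopens a file on every
date change.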

Rich.

--
Richard Jones
Red Hat

Re: "ocaml_beginners"::[] Writing to many files
United Kingdom
2008-03-29 08:16:28

On Sat, Mar 29, 2008 at 11:13:44AM +0000, Richard Jones wrote:
> let chan = open_in "auth_31May2007.csv" in

Ooops - I misunderstood that you want to output to this filename, so
change this to the name of your input file.

Rich.

--
Richard Jones
Red Hat
