I'm having issues with filenames in Windows XP SP2 that contain
international characters. I'm trying to parse the iTunes music database
(from an XML file) to extract the filename of each track in the database
and (among other things) list the files that are missing in a text file.
In the XML, filenames are represented by a file:// URL. Here's an
example:
    file://localhost/H:/mp3/Gabriel%20O%20Pensador/Quebra%20Cabe%C3%A7a/Gabriel%20O%20Pensador%20-%20Quebra-Cabe%C3%A7a%20-%2006.%20En%20La%20Casa.mp3
So part of my program converts this into the actual filepath using this
code:
    # This adds an extra property containing the decoded filename from the
    # Location (which is a URL with percent encoding and decimal numeric
    # character references).
    $tracks{$TrackID}->{'filename'} = $value;

    # The following decodes the URL to a filename.
    # Remove the "file://localhost/" prefix from the start of the URL:
    $tracks{$TrackID}->{'filename'} =~ s/^file:\/\/localhost\///;
    # Replace the forward slashes with backslashes:
    $tracks{$TrackID}->{'filename'} =~ s/\//\\/g;
    # This should decode the percent encoding of the URL
    # (convert %dd to the character with that hex value):
    $tracks{$TrackID}->{'filename'} =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;
    # This should decode the decimal numeric character references
    # (e.g. '&#38;' = '&'):
    $tracks{$TrackID}->{'filename'} =~ s/&#(\d*);/chr $1/eg;
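To see what these substitutions actually produce, here is a minimal standalone sketch that runs the same four steps on the example URL and dumps the code points of the folder name. The names `decode_location`, `$url`, `$path` and `$folder` are made up for this sketch and are not from my program; it only assumes the Location value is exactly the URL shown above.

```perl
use strict;
use warnings;

# Example Location value from the XML (the URL above, joined onto one line)
my $url = 'file://localhost/H:/mp3/Gabriel%20O%20Pensador/Quebra%20Cabe%C3%A7a/'
        . 'Gabriel%20O%20Pensador%20-%20Quebra-Cabe%C3%A7a%20-%2006.%20En%20La%20Casa.mp3';

# Hypothetical helper: the same four substitutions as in my program
sub decode_location {
    my ($path) = @_;
    $path =~ s/^file:\/\/localhost\///;             # strip the URL prefix
    $path =~ s/\//\\/g;                             # forward slashes to backslashes
    $path =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;      # each %XX becomes ONE character
    $path =~ s/&#(\d*);/chr $1/eg;                  # decimal character references
    return $path;
}

my $path = decode_location($url);

# Pull out the first "Cabe..." path component and show each character as hex
my ($folder) = $path =~ /(Cabe[^\\]*)/;
print join(' ', map { sprintf('%02X', ord($_)) } split(//, $folder)), "\n";
# prints: 43 61 62 65 C3 A7 61
```

So after decoding, the name holds two separate characters C3 and A7 (the UTF-8 bytes of the c-cedilla) rather than the single character E7.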
Later on the program checks if this file exists, and if not, outputs it
to a text file:
    if (exists($tracks{$_}->{'filename'})) {
        # This will output another file for tracks that are missing
        unless (-e $tracks{$_}->{'filename'}) {
            print MS $tracks{$_}->{'filename'}, "\n";
        }
    }
Now if I load that text file into Microsoft Word, it detects it as a
UTF-8 encoded text file, and if I say OK to that, the filenames in that
file look like they do in the filesystem. So the example file I gave
above does appear in this file as:
    H:\mp3\Gabriel O Pensador\Quebra Cabeça\Gabriel O Pensador - Quebra-Cabeça - 06. En La Casa.mp3
Only it shouldn't be there, since this file exists.
As a test I wrote this little program:
    my $filename = 'H:\mp3\Gabriel O Pensador\Quebra Cabeça\Gabriel O Pensador - Quebra-Cabeça - 06. En La Casa.mp3';
    unless (-e $filename) {
        print "$filename DOES NOT EXIST\n";
    } else {
        print "$filename DOES EXIST\n";
    }
and it DOES indicate the file exists.
So back to my main program, I modified it to read like so:
    if (exists($tracks{$_}->{'filename'})) {
        if ($tracks{$_}->{'filename'} eq 'H:\mp3\Gabriel O Pensador\Quebra Cabeça\Gabriel O Pensador - Quebra-Cabeça - 06. En La Casa.mp3') {
            print 'I found H:\mp3\Gabriel O Pensador\Quebra Cabeça\Gabriel O Pensador - Quebra-Cabeça - 06. En La Casa.mp3!', "\n";
        }
        # This will output another file for tracks that are missing
        unless (-e $tracks{$_}->{'filename'}) {
            print MS $tracks{$_}->{'filename'}, "\n";
        }
    }
and I never get an indication that the filename was matched...
So I try opening the text file in Microsoft Word using Standard Windows
Encoding and the file shows up as:
    H:\mp3\Gabriel O Pensador\Quebra CabeÃ§a\Gabriel O Pensador - Quebra-CabeÃ§a - 06. En La Casa.mp3
which is not what it looks like in the filesystem.
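My guess at what the two views in Word correspond to, as a minimal sketch: it assumes the name in my hash holds the two raw bytes C3 A7 where the c-cedilla should be, and uses the core Encode module to interpret those same bytes once as UTF-8 and once as cp1252 (the Western "Standard Windows Encoding"). The variable names and the `code_points` helper are made up for the sketch.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the folder name as raw bytes, with %C3%A7 decoded byte by byte
my $bytes = "Cabe\xC3\xA7a";

# Interpret the same bytes under the two encodings Word offered
my $as_utf8   = decode('UTF-8',  $bytes);
my $as_cp1252 = decode('cp1252', $bytes);

# Hypothetical helper: list the code points of a string
sub code_points { join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $_[0]) }

print "UTF-8:  ", length($as_utf8),   " chars: ", code_points($as_utf8),   "\n";
print "cp1252: ", length($as_cp1252), " chars: ", code_points($as_cp1252), "\n";
# UTF-8:  6 chars: U+0043 U+0061 U+0062 U+0065 U+00E7 U+0061
# cp1252: 7 chars: U+0043 U+0061 U+0062 U+0065 U+00C3 U+00A7 U+0061
```

Read as UTF-8 the bytes give the single character U+00E7 that the filesystem shows; read as cp1252 they give the two characters U+00C3 U+00A7, which is the garbled form above.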
Nevertheless, if I modify the code to look like:
    if (exists($tracks{$_}->{'filename'})) {
        if ($tracks{$_}->{'filename'} eq 'H:\mp3\Gabriel O Pensador\Quebra CabeÃ§a\Gabriel O Pensador - Quebra-CabeÃ§a - 06. En La Casa.mp3') {
            print 'I found H:\mp3\Gabriel O Pensador\Quebra CabeÃ§a\Gabriel O Pensador - Quebra-CabeÃ§a - 06. En La Casa.mp3!', "\n";
        }
        # This will output another file for tracks that are missing
        unless (-e $tracks{$_}->{'filename'}) {
            print MS $tracks{$_}->{'filename'}, "\n";
        }
    }
now I get an output line indicating it matched the filename to the hash
value even though it can't find the file on the system.
So obviously there is some issue with how the Unicode characters are
being represented in different places, but I'm not sure how to resolve
this...
I just downloaded and installed the latest ActivePerl, which is supposed
to fix some Unicode filename issues, but it didn't resolve this problem
for me.
Can anyone help?
Thanks!