PowerShell - How to speed up a script and not clog RAM (directory with 2+ million files)
I have written some novice code which:
runs through a master directory (2+ million files, many sub-folders), filters the .tmx files (around 12k files in the master directory), extracts specific strings, and saves them to a .log file. After that is done, further procedures clean the log files and merge them into one file. The problem: I left the script running overnight and it got stuck.
I reckon it's due to the directory size. I have the directory's files listed in a .txt file. Is it possible for the script to read that .txt and process one file at a time? Maybe that way it won't take 99% of RAM after a while.
Also, maybe you have other insights on speeding it up or merging these procedures into one.
    Get-ChildItem "masterdirpath\*.tmx" -Recurse | ForEach-Object {
        $content = Get-Content $_.FullName
        # filter and save content to the original file
        #$content | Where-Object {$_ -match '<tu '} | Set-Content $_.FullName
        # filter and save content to a new file
        $content | Where-Object {!($_ -match '(?:creationid|changeid)="([^"]+)"' -or $_ -match '(<tuv.+?lang="[a-zA-Z\-]+">)')} | %{$matches[1]} | Get-Unique | Set-Content ($_.BaseName + '_out.log')
    }

    Get-ChildItem "dir\tologs" -Filter *.log | ForEach-Object {
        $content = Get-Content -Raw $_.FullName
        # make one line out of the extracted matches
        $content -replace "`r`n<", "`t<" | Set-Content $_.FullName
    }

    Get-ChildItem "dir\tologs" -Filter *.log | ForEach-Object {
        $content = Get-Content $_.FullName
        # filter and save content to the original log file
        $content | Where-Object {$_ -match '^.+$ '} | Sort | Get-Unique | Set-Content $_.FullName
    }

    $path = "dir\tologs"
    $out = "dir\tologs\output.txt"
    Get-ChildItem $path -Filter *.log | % {
        $file = $_.Name
        Get-Content $_.FullName | % {
            "${file}: $_" | Out-File -Append $out
        }
    }

Update: sample input
These .tmx files vary in size from 1 MB to 2 GB, and the directory size is around 1 TB. The files there can be anything from a few MB up to a few GB. The script runs fine on a small directory of 50 .tmx files of 1-100 MB each.
    <?xml version="1.0" encoding="utf-16"?>
    <!doctype tmx system "tmx14.dtd">
    <tmx version="1.4">
      <header creationtool="memoq" creationtoolversion="7.0.68" segtype="sentence" adminlang="en-us" creationid="lsmall" srclang="en-us" o-tmf="memoqtm" datatype="unknown">
        <prop type="defclient"> </prop>
        <prop type="defproject"> </prop>
        <prop type="defdomain"> </prop>
        <prop type="defsubject"> </prop>
        <prop type="description"> </prop>
        <prop type="targetlang">hu</prop>
        <prop type="name">28807_project_hu</prop>
      </header>
      <body>
        <tu changedate="20151104t174128z" creationdate="20150929t180844z" creationid="pmccrory" changeid="lsmall">
          <prop type="client"> </prop>
          <prop type="project"> </prop>
          <prop type="domain"> </prop>
          <prop type="subject"> </prop>
          <prop type="corrected">no</prop>
          <prop type="aligned">no</prop>
          <prop type="x-document">node_data_en_final.xml</prop>
          <tuv xml:lang="en-us">
            <prop type="x-context-pre"><seg>biomarkers , integrated solutions</seg></prop>
            <prop type="x-context-post"><seg>novel therapeutic agents have fast onset of action, safety , tolerability profiles , address common co-morbidities (for example, anxiety , substance abuse) <ph type='fmt'>{}</ph><it pos='begin'>&lt;ul&gt;</it></seg></prop>
            <seg><it pos='begin'><ul class="inline"></it></seg>
          </tuv>
          <tuv xml:lang="hu">
            <seg><it pos='begin'><ul class="inline"></it></seg>
          </tuv>
        </tu>
      </body>
    </tmx>

(The strings to extract are the creationid and changeid values on the tu element and the tuv xml:lang tags.)

Output after the procedure:
    aabb-cor-09_master_de_out.log: 6293 syb <tuv xml:lang="en-us"> <tuv xml:lang="de-de">
    abb-cor-09_master_de_out.log: ad <tuv xml:lang="en-us"> <tuv xml:lang="de-de">
    abb-cor-09_master_de_out.log: agentile <tuv xml:lang="en-us"> <tuv xml:lang="de-de">
    abb-cor-09_master_de_out.log: align! <tuv xml:lang="en-us"> <tuv xml:lang="de-de">
    abb-cor-09_master_de_out.log: angelika <tuv xml:lang="en-us"> <tuv xml:lang="de-de">
    abb-cor-09_master_de_out.log: asedr <tuv xml:lang="en-us"> <tuv xml:lang="de-de">
- Use one pipeline for the entire processing instead of 4 separate passes.
- Use the string operators -join and -split instead of writing and re-reading the same file.
- Use the [regex] class and its Matches method to extract the tokens you want.
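The last two points can be tried on a small in-memory string before touching any files. The following is only a minimal sketch: `$sample` is a hypothetical stand-in for the contents of one .tmx file, and the variable names `$rx`, `$tokens` and `$lines` are made up for illustration.

```powershell
# Hypothetical sample text standing in for the contents of one .tmx file.
$sample = @'
<tu creationid="pmccrory" changeid="lsmall">
<tuv xml:lang="en-us">
<tuv xml:lang="hu">
</tu>
'@

# The look-behind and look-ahead keep the attribute name and the quotes out of
# the match, so each match value is just the ID. The second alternative
# matches a whole <tuv ...> opening tag.
$rx = [regex]'(?<=(creationid|changeid)=")[^"]+(?=")|<tuv.+?lang="[a-zA-Z\-]+">'

# Collect the matched text of every hit, in document order.
$tokens = @($rx.Matches($sample) | ForEach-Object { $_.Value })

# Join, rewrite and split in memory instead of writing a .log file and reading
# it back: every "<tuv" token is glued to the preceding token with a tab.
$lines = @(($tokens -join "`n") -replace '\n<', "`t<" -split "`n")

$lines
```

With this sample, `$tokens` should hold the two IDs followed by the two tuv tags, and `$lines` should hold two entries: the first ID on its own, then the second ID with both tuv tags appended after tabs.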
    $rx_extract = [regex](
        '(?<=(creationid|changeid)=")[^"]+(?=")|' +
        '<tuv.+?lang="[a-zA-Z\-]+">'
    )
    # unwanted parts are suppressed from the output via look-behind and look-ahead

    Get-ChildItem (Join-Path $tmx_dir *.tmx) -Recurse | ForEach-Object {
        $_.FullName + ': ' + (
            (
                $rx_extract.Matches((Get-Content $_.FullName -Raw)).Value | Get-Unique
            ) -join "`n" -replace '\n<', "`t<" -split "`n" -ne '' | Sort -Unique
        ) -join "`t"
    } | Out-File "dir\tologs\output.txt"

Not tested extensively. Use it as an example.
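Since the files can reach 2 GB, reading each one whole with -Raw may itself be a source of memory pressure. One possible alternative, not part of the answer above, is to read line by line and keep only the distinct tokens. The sketch below assumes that no token spans a line break. `Get-TmxTokens` is a hypothetical helper name, and the temporary sample file exists only to make the sketch runnable.

```powershell
# Get-TmxTokens is a hypothetical helper, not part of the original answer.
function Get-TmxTokens {
    param(
        [Parameter(Mandatory)] [string] $Path,    # pass a full path
        [Parameter(Mandatory)] [regex]  $Pattern
    )
    # The set keeps one copy of each token, so memory grows with the number of
    # distinct tokens rather than with the size of the file.
    $seen = New-Object 'System.Collections.Generic.HashSet[string]'
    # $true asks the reader to detect the encoding from a byte order mark.
    $reader = New-Object System.IO.StreamReader($Path, $true)
    try {
        # Only one line of the file is held in memory at a time.
        while ($null -ne ($line = $reader.ReadLine())) {
            foreach ($m in $Pattern.Matches($line)) {
                [void] $seen.Add($m.Value)
            }
        }
    }
    finally {
        $reader.Dispose()
    }
    $seen | Sort-Object
}

# Small demonstration on a temporary UTF-16 file with a repeated <tu> line.
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) 'tmx_tokens_sketch.tmx'
Set-Content -Path $tmp -Encoding Unicode -Value @(
    '<tu creationid="pmccrory" changeid="lsmall">',
    '<tuv xml:lang="en-us">',
    '<tu creationid="pmccrory" changeid="lsmall">'
)
$rx = [regex]'(?<=(creationid|changeid)=")[^"]+(?=")|<tuv.+?lang="[a-zA-Z\-]+">'
$result = @(Get-TmxTokens -Path $tmp -Pattern $rx)
Remove-Item $tmp
$result
```

Inside the pipeline, the helper would be called as `Get-TmxTokens -Path $_.FullName -Pattern $rx_extract` in place of the -Raw read. Note the trade-off: this version returns a sorted list of distinct tokens per file, so it does not reproduce the grouping of each ID with the tuv tags that follow it.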